Structured Data and Search Engines: Schema.org

It’s been a long time coming, but the big three (Google, Bing, Yahoo) are finally getting their structured data story together. The three have collaborated on a standard for encoding structured data within web pages and published their results at schema.org.

The standard is based upon the RDF data model and includes a decent base ontology. The ontology is considerably more comprehensive than the one used by Facebook in their Open Graph protocol. Hardcore RDFers may be bothered by the interpretation of domain and range used, since it differs from that of RDF, but that’s a relatively minor quibble.

It’ll be interesting to see where this leads….

Posted in Semantic Web

Node Buffers

I’ve been doing some more work on my json-in-mysql module for node.js. One of the areas I’ve been looking at is performance. As is often the case with these things, I’ve found that when things take longer than I’d like, it’s not for the reasons I first expect.

I wrote a simple performance test that stored 1000 auto-generated JSON documents and then queried against them. The document writes took a reasonable amount of time, a little over 1ms per document. The queries, on the other hand, weren’t so blazing and took longer the more results were returned. At first I blamed MySQL, but found that the underlying SQL queries were rarely taking longer than 1ms. I then used the node.js/v8 profiling capabilities to discover that most of my query time wasn’t spent in my code, or in the node-mysql driver I’m using, or in node, but in v8 internals. After some further digging I discovered that node Buffers, used extensively by node-mysql, have some heavier than expected costs.

I ended up creating a test case like this while investigating the performance issues:

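var dt = new Date;

// Allocate 10000 buffers of 1000 bytes and read 100 ten-byte strings from each.
for (var i=0; i <10000; i++){
    var buf = new Buffer(1000);
    for (var j=0; j<100; j++){
        // Fast pattern: decode the byte range directly.
//        var s2 = buf.toString('utf-8', j*10, j*10 + 10);
        // Slow pattern: slice first, then decode the slice.
        var s2 = buf.slice(j*10, j*10 + 10).toString('utf-8');
    }
}
// Elapsed time in milliseconds.
console.log(+new Date - dt);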

I found that the pattern buffer.slice(start, end).toString(encoding) was more than 10 times slower than the pattern buffer.toString(encoding, start, end). In theory a slice is cheap since it doesn’t allocate any new buffer memory — it just creates a new Buffer object that refers to the memory in the parent Buffer. In practice, there seems to be some considerable overhead somewhere.
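For a side-by-side comparison, the two patterns can be timed in one script. This is a minimal sketch, not the original test: the helper names timeIt, viaSlice and viaToString are made up for illustration, and it assumes a node release that has Buffer.alloc, used here instead of new Buffer so the contents are zero-filled and both patterns decode identical bytes. Timings will vary by node version, so no numbers are claimed here.

```javascript
// Hypothetical helpers for illustration; not part of node or node-mysql.
function viaSlice(buf, start, end) {
    // Creates a new Buffer object sharing the parent's memory, then decodes it.
    return buf.slice(start, end).toString('utf-8');
}

function viaToString(buf, start, end) {
    // Decodes the byte range directly, with no intermediate Buffer object.
    return buf.toString('utf-8', start, end);
}

// Runs the same loop shape as the test above and returns elapsed milliseconds.
function timeIt(fn) {
    var started = Date.now();
    for (var i = 0; i < 10000; i++) {
        var buf = Buffer.alloc(1000);
        for (var j = 0; j < 100; j++) {
            fn(buf, j * 10, j * 10 + 10);
        }
    }
    return Date.now() - started;
}

console.log('slice().toString(): ' + timeIt(viaSlice) + 'ms');
console.log('toString(start, end): ' + timeIt(viaToString) + 'ms');
```

Both helpers return the same string for the same byte range, so any difference in the two timings comes from the call pattern rather than from the work being decoded.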

I’m not too familiar yet with node or v8 internals, but I did take a look around to see if I could figure out what’s going on. I found that the node Buffer class was calling a method in the v8 API called SetIndexedPropertiesToExternalArrayData. I gather this is an optimization that tells v8 that an object’s memory buffer will be managed externally, but still allows it fast indexed access. I’m guessing (and this really is all a guess at this point given my limited knowledge) that this has a cost in that it forces v8 to modify the generated class for the object. For big buffers with lots of data access, the cost of incrementally compiling the object is far outweighed by the reduced access costs. For small buffers, though, the cost becomes a significant overhead.

If that’s all true, I’m not too sure what the solution is. Possibly direct support for binary buffers within v8?

Posted in node.js

First experiences with node.js

I’ve been hearing about node.js for a while and have been wanting to play with it. Node is all about server-side javascript and async I/O. It’s built around Google’s V8 javascript engine and is gaining quite a following.

I found it pretty easy to get started with node: the docs are decent, there’s lots of sample code, and there are a good number of getting-started blog posts. I initially got going with some pre-built cygwin binaries on Windows. They worked, but didn’t really provide the full experience. So, I picked up a cheap Rackspace instance running Ubuntu and moved my experiments there (great deal, btw, $10.95/month for a 256MB instance).

I’ve always found working on something real is the best way to learn a new technology. I tried to think of a first project that was non-trivial but not too huge, made appropriate use of node, and was something that I might actually use sometime. In the end, I decided to build a node module that turns mysql into a json store with a custom query language. I called the project myjsdb.

I was already pretty familiar with javascript, and the core libraries/modules in node are pretty straightforward, so I found the biggest challenge was getting comfortable with a 100% async style of coding. There’s really no cheating in node (there’s just no way to do blocking i/o), so program flow where i/o is involved makes extensive use of callbacks. The hardest part was coming up with patterns for dealing with those callbacks that led to manageable/readable code.

In simple cases, just using anonymous callbacks works ok. E.g:

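    // Look up a document's id by name; calls fn(err, id), with id null if not found.
    this.getDocumentId = function(name, fn){
        this.client.query('select id from ' + this.name
                 + '_json_doc where name=?',
            [name],
            function(err, res){
                // Guard against an empty result set as well as a missing one.
                fn(err, (res && res.length) ? res[0].id : null);
            }
        );
    }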

In more complex cases, though, that can lead to highly nested code which I find pretty unreadable. In those cases, I found creating something like a state machine was a better model. E.g:

    this.putDocument = function(name, obj, fn){
        var store = this;
        var docid = 0;

        var s1 = function(){
            store.client.query('insert into ' + store.name + '_json_doc (name, last_modified)' +
                " values(?, now())" +
                " on duplicate key update last_modified = now()", [name], function(err, info){
                    if (err)
                        return fn(err);
                    if ((docid = info.insertId) == 0)
                        s2();
                    else
                        s3();
                }
            );
        };

        var s2 = function() {
            store.getDocumentId(name, function(err, id){
                if (err)
                    return fn(err);
                docid = id;
                s3();
            });
        };

        var s3 = function() {
            store.clearDocument(docid, function(err){
                if (err)
                    return fn(err);
                s4();
            });
        };

        var s4 = function() {
            var stmts = new Json2SqlHelper(store, docid, obj).getStatements();
            store.client.query('insert into ' + store.name + '_json values ' + stmts.join(', '), function(err){
                fn(err);
            });
        };

        s1();
    }

I imagine an event handler model would also work well (it just seemed a little heavyweight in my case to use an emitter for purely local flow).
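As a rough sketch of that event-driven alternative, the same kind of step-by-step flow can be wired up with node’s EventEmitter. The store calls are replaced by a stand-in fakeQuery function, and the event names ('start', 'inserted', 'cleared') and putDocumentWithEvents are invented for this example; none of them come from myjsdb.

```javascript
var EventEmitter = require('events').EventEmitter;

// Stand-in for an async database call (an assumption for this sketch).
// It records the step name and calls back on a later tick.
function fakeQuery(log, step, cb) {
    setImmediate(function () {
        log.push(step);
        cb(null);
    });
}

// Hypothetical event-driven variant of a multi-step put.
function putDocumentWithEvents(name, log, fn) {
    var flow = new EventEmitter();

    // Any step can abort the flow by emitting 'error'.
    flow.on('error', fn);

    flow.on('start', function () {
        fakeQuery(log, 'insert ' + name, function (err) {
            if (err) return flow.emit('error', err);
            flow.emit('inserted');
        });
    });

    flow.on('inserted', function () {
        fakeQuery(log, 'clear ' + name, function (err) {
            if (err) return flow.emit('error', err);
            flow.emit('cleared');
        });
    });

    flow.on('cleared', function () {
        fakeQuery(log, 'write ' + name, function (err) {
            if (err) return flow.emit('error', err);
            fn(null);
        });
    });

    flow.emit('start');
}

var steps = [];
putDocumentWithEvents('doc1', steps, function (err) {
    console.log(err || steps.join(' -> '));
});
```

Each step only knows which event to emit next, so error handling lives in one place, at the cost of an extra object and some indirection compared with the plain s1..s4 functions.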

I also found that when I had some underlying object doing some queuing, I could use a model like this:

var Store = require('./myjsdb').Store;

var store = new Store('test', {user:'root', password:'xxx', database:'testdb'}),
    doc = store.getDocument(),
    person = doc.getObject({age:Number, name:'Geoff', knows:Object}),
    p2 = doc.getObject({});

person.age.gt(25);
person.knows.eq(p2);

store.open();
store.create();
store.putDocument('doc1', {name:'geoff', age:44, knows:{name:'derrish'}});
store.query({age:person.age, name:person.name}, function(err, res){
    console.log(res);
});

store.remove();
store.close();

In that case, I don’t provide callbacks to many of the methods called on the store object (though they accept them). This works because the underlying mysql driver I’m using is queuing up operations. The code reads like the various operations are synchronous, though in fact they’re not. What really happens is all of the code shown gets executed, queuing up operations against the mysql store (well, the first mysql call will be executed but the rest will be queued). Only once the code shown has completed executing will the thread be available to process the i/o returned from mysql from the first call. Once that i/o is processed, the next queued command will be issued, etc. I guess this is a pretty special case since it relies on having a single i/o processor that queues its operations.
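The ordering described above can be sketched with a toy queueing client. This is only an illustration of the idea, not how node-mysql is implemented: QueueingClient, its query method, and the use of setImmediate to simulate i/o completion are all assumptions made for the example.

```javascript
// Toy client that runs one operation at a time and queues the rest.
function QueueingClient(log) {
    this.log = log;
    this.queue = [];
    this.busy = false;
}

// Queue an operation; start it right away only if nothing is in flight.
QueueingClient.prototype.query = function (sql, cb) {
    this.queue.push({ sql: sql, cb: cb });
    if (!this.busy) this._next();
};

QueueingClient.prototype._next = function () {
    var self = this;
    var op = this.queue.shift();
    if (!op) {
        this.busy = false;
        return;
    }
    this.busy = true;
    this.log.push('issued: ' + op.sql);
    // Simulated i/o: the result arrives only after the current code has finished.
    setImmediate(function () {
        self.log.push('completed: ' + op.sql);
        if (op.cb) op.cb(null);
        self._next();
    });
};

var log = [];
var client = new QueueingClient(log);
client.query('create');
client.query('put');
client.query('query', function () {
    console.log(log.join('\n'));
});
log.push('caller finished');
```

In the printed log, the first operation is issued immediately, 'caller finished' appears before any operation completes, and each later operation is issued only after the previous one has completed.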

All in all, I’ve found node pretty nice to work with. I do wonder whether a job queueing/thread pooling model wouldn’t be better than the single threaded model used, but perhaps there are technical reasons related to V8 that make that impractical.

Posted in javascript, node.js

Cube on Canvas

I first wrote a cube program in C when I was learning Windows programming many years ago. Since then, whenever I’ve wanted to learn about a new platform (e.g. Win CE, Silverlight, Java, Android, ..) I’ve ported it as a learning exercise.

I wanted to play with the HTML5 canvas, so once again I worked on a port, starting with an Android version I’d done last year. It was surprisingly easy to get everything up and running in pure javascript/HTML5. The image below is a screenshot of the final result. Click on it to view the real thing (I was too lazy to figure out how to embed an HTML5 canvas in WordPress).

Let me know if you have any problems on your platform. I’ve only tested on desktop Chrome and Firefox and on Android mobile. (Later: I added the excanvas.js shim for IE, but at least on IE8 it’s so slow it’s not really usable.)

Posted in html5, javascript