Structured Data and Search Engines: Schema.org

It’s been a long time coming, but the big three (Google, Bing, Yahoo) are finally getting their structured data story together. The three have collaborated on a standard for encoding structured data within web pages and published the results at schema.org.

The standard is based on the RDF data model and includes a decent base ontology, one considerably more comprehensive than the vocabulary used by Facebook’s Open Graph protocol. Hard-core RDFers may be bothered by the interpretation of domain and range, since it differs from RDF’s, but that’s a relatively minor quibble.
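To give a flavor of it, here’s a quick illustrative snippet (my own, not taken from the spec) marking up a person with the schema.org vocabulary using microdata:

    <div itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">Geoff</span> works as a
      <span itemprop="jobTitle">developer</span>.
    </div>

A search engine that understands schema.org can pull the name and jobTitle properties straight out of the page.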

It’ll be interesting to see where this leads….

Posted in Semantic Web

Node Buffers

I’ve been doing some more work on my json-in-mysql module for node.js. One of the areas I’ve been looking at is performance. As is often the case with these things, I’ve found that when things take longer than I’d like, it’s not for the reasons I first expect.

I wrote a simple performance test that stored 1000 auto-generated json documents and then queried against them. The document writes took a reasonable amount of time — a little over 1ms per document. The queries, on the other hand, weren’t so blazing, and took longer the more results were returned. At first I blamed MySQL, but found that the underlying SQL queries were rarely taking longer than 1ms. I then used the node.js/v8 profiling capabilities to discover that most of my query time wasn’t spent in my code, or in the node-mysql driver I’m using, or in node, but in v8 internals. After some further digging I discovered that node Buffers, used extensively by node-mysql, have some heavier-than-expected costs.

I ended up creating a test case like this while investigating the performance issues:

var dt = new Date();

// Compare two ways of decoding a 10-byte window of a Buffer:
// slice-then-toString vs. toString with a start/end range.
for (var i = 0; i < 10000; i++){
    var buf = new Buffer(1000);
    for (var j = 0; j < 100; j++){
//      var s2 = buf.toString('utf-8', j*10, j*10 + 10);        // fast
        var s2 = buf.slice(j*10, j*10 + 10).toString('utf-8');  // slow
    }
}
console.log(+new Date() - dt); // elapsed ms

I found that the pattern buffer.slice(start, end).toString(encoding) was more than 10 times slower than the pattern buffer.toString(encoding, start, end). In theory a slice is cheap since it doesn’t allocate any new buffer memory — it just creates a new Buffer object that refers to the memory in the parent Buffer. In practice, there seems to be some considerable overhead somewhere.
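A quick way to convince yourself that a slice really is just a view onto the parent’s memory:

    var buf = new Buffer('hello world');
    var view = buf.slice(0, 5);
    view[0] = 72; // ASCII 'H'
    console.log(buf.toString()); // prints "Hello world" -- the parent sees the change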

I’m not too familiar yet with node or v8 internals, but I did take a look around to see if I could figure out what’s going on. I found that the node Buffer class was calling a method in the v8 API called SetIndexedPropertiesToExternalArrayData. I gather this is an optimization that tells v8 that an object’s memory buffer will be managed externally, but still allows fast indexed access to it. I’m guessing (and this really is all a guess at this point given my limited knowledge) that this has a cost in that it forces v8 to modify the generated class for the object. For big buffers with lots of data access, the cost of incrementally compiling the object is far outweighed by the reduced access costs. For small buffers, though, the cost becomes a significant overhead.

If that’s all true, I’m not too sure what the solution is… possibly direct support for binary buffers within v8?

Posted in node.js

Man, I’m sick of Windows… (updated)

I’ve had nothing but problems with Windows 7 since I installed it. It periodically goes into a mode where it pegs both cpu cores and I can’t get anything done. All I can do is go read a book or something and wait it out — rebooting doesn’t help, killing processes doesn’t help…nothing does. I’ve researched and researched and tried every possible solution I’ve read about — getting rid of the antivirus software, trying different antivirus software, disabling search indexing, reinstalling the OS, etc. — all with no lasting effect. Every time it happens I waste more time searching for yet more solutions that don’t actually solve anything.

On top of that, I can’t seem to get Windows Update to work reliably, even after multiple attempts with Microsoft support to find out what’s going on. It baffles me that for something as important as Windows Update I can’t just run some utility that will pinpoint the problem and either fix it or tell me how to fix it. Oh, and let’s not forget boot time — sure, the welcome/login screen comes up reasonably quickly, but it’s minutes before I can actually do something.

So, not too long ago I installed an Ubuntu dual boot. So far, so good — it just seems to work. It’s faster, and seems to have just about all I need (especially now that I’ve discovered rdesktop, which lets me remote into Windows servers just like mstsc). I find myself spending more and more time in Ubuntu and avoiding Windows as much as possible. I’m no Microsoft hater and have pretty much always used Windows on my primary machine, but I’m really beginning to question why I bother. I lose too much time/productivity fighting with things that I shouldn’t have to (and lose even more time with rants like this :) ). I have to use Windows for some development I do, but I mostly do that on remote machines anyway. So, why not switch?

EDIT: (several weeks later)
I got to the bottom of most of these issues, so figured I should update in case anyone comes across this post while searching.

In the first case, I was able to get my machine to go into its cpu-pegged, get-nothing-done mode while using Ubuntu also. This led to the realization that I had a machine issue, not an OS issue. It turns out that my laptop (a Dell E6400) has some cpu throttling issues when it feels overheated. That explains why the problem seemed to occur most frequently when watching streaming videos (since flash manages to put a heavy load on both cores). I downloaded a copy of a utility called ThrottleStop, which allows you to control this behavior and monitor the cpu temperature. I may be risking a meltdown by doing this, but frankly I don’t care — the machine is due for an upgrade anyway and is so frustrating when it’s misbehaving that I wouldn’t mind seeing it burn.

In the case of the other major issue — Windows Update failing 95% of the time — it was also my use of Ubuntu that led indirectly to what seems like a solution. When I first set up the dual boot on my laptop, I was having issues with the wireless driver under Ubuntu, so I plugged in a cable as a temporary fix. Over the next few days I noticed that Windows Update seemed to be working reliably for the first time in — well, forever. So, it seems Windows Update had some problem with my wireless connection. Perhaps it established some affinity for the wired connection early on and was caching something… or my wireless driver is screwy in some way… or…. I really don’t know, but for now I’m just glad that it works.

Posted in Uncategorized

First experiences with node.js

I’ve been hearing about node.js for a while and have been wanting to play with it. Node is all about server-side javascript and async I/O. It’s built around Google’s V8 javascript engine and is gaining quite a following.
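The canonical hello-world example gives a feel for the model: a complete HTTP server in a few lines, with the request handling done in a callback.

    var http = require('http');

    // the callback runs once per request; nothing here blocks
    http.createServer(function(req, res){
        res.writeHead(200, {'Content-Type': 'text/plain'});
        res.end('hello from node\n');
    }).listen(8080);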

I found it pretty easy to get started with node — the docs are decent, there’s lots of sample code, and a good number of getting-started blog posts. I initially got going with some pre-built Cygwin binaries on Windows — they worked, but didn’t really provide the full experience. So, I picked up a cheap Rackspace instance running Ubuntu and moved my experiments there (a great deal, btw: $10.95/month for a 256MB instance).

I’ve always found working on something real is the best way to learn a new technology. I tried to think of a first project that was non-trivial but not too huge, made appropriate use of node, and was something that I might actually use sometime. In the end, I decided to build a node module that turns mysql into a json store with a custom query language. I called the project myjsdb.

I was already pretty familiar with javascript, and the core libraries/modules in node are pretty straightforward, so I found the biggest challenge was getting comfortable with a 100% async style of coding. There’s really no cheating in node — there’s just no way to do blocking i/o — so program flow where i/o is involved makes extensive use of callbacks. The hardest part was coming up with patterns for dealing with those callbacks that led to manageable/readable code.

In simple cases, just using anonymous callbacks works fine. For example:

    this.getDocumentId = function(name, fn){
        // Look up a document's id by name, passing null to the callback
        // when there's no matching row.
        this.client.query('select id from ' + this.name
                 + '_json_doc where name=?',
            [name],
            function(err, res){
                fn(err, res && res.length ? res[0].id : null);
            }
        );
    }

In more complex cases, though, that can lead to highly nested code, which I find pretty unreadable. In those cases, I found that something like a state machine was a better model. For example:

this.putDocument = function(name, obj, fn){
    var store = this;
    var docid = 0;

    // s1: insert the document row, or just touch last_modified if a
    // document with this name already exists.
    var s1 = function(){
        store.client.query('insert into ' + store.name + '_json_doc (name, last_modified)' +
            " values(?, now())" +
            " on duplicate key update last_modified = now()", [name], function(err, info){
                if (err)
                    return fn(err);
                // insertId is 0 when the row already existed, in which
                // case we have to look its id up.
                if ((docid = info.insertId) == 0)
                    s2();
                else
                    s3();
            }
        );
    };

    // s2: fetch the id of the pre-existing document.
    var s2 = function(){
        store.getDocumentId(name, function(err, id){
            if (err)
                return fn(err);
            docid = id;
            s3();
        });
    };

    // s3: clear out any old content rows for the document.
    var s3 = function(){
        store.clearDocument(docid, function(err){
            if (err)
                return fn(err);
            s4();
        });
    };

    // s4: insert the new content rows and finish.
    var s4 = function(){
        var stmts = new Json2SqlHelper(store, docid, obj).getStatements();
        store.client.query('insert into ' + store.name + '_json values ' + stmts.join(', '), function(err){
            fn(err);
        });
    };

    s1();
}

I imagine an event handler model would work well also (it just seemed a little heavyweight in my cases to use an emitter for purely local flow).
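For comparison, here’s a rough sketch of what that emitter-based flow might look like (hypothetical, not code from myjsdb; putFlow and the event names are made up):

    var EventEmitter = require('events').EventEmitter;

    function putFlow(store, name, obj, fn){
        var flow = new EventEmitter();

        // each step emits an event that triggers the next step
        flow.on('haveId', function(docid){
            store.clearDocument(docid, function(err){
                if (err) return fn(err);
                flow.emit('cleared', docid);
            });
        });

        flow.on('cleared', function(docid){
            // insert the new rows here, then report completion
            fn(null);
        });

        store.getDocumentId(name, function(err, id){
            if (err) return fn(err);
            flow.emit('haveId', id);
        });
    }

It reads a little flatter than deep nesting, but you pay for it with emitter plumbing that a couple of named functions give you anyway.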

I also found that when I had some underlying object doing some queuing, I could use a model like this:

var Store = require('./myjsdb').Store;

var store = new Store('test', {user:'root', password:'xxx', database:'testdb'}),
    doc = store.getDocument(),
    person = doc.getObject({age:Number, name:'Geoff', knows:Object}),
    p2 = doc.getObject({});

// build up query constraints against the document pattern
person.age.gt(25);
person.knows.eq(p2);

// none of these calls block; each queues an operation against mysql
store.open();
store.create();
store.putDocument('doc1', {name:'geoff', age:44, knows:{name:'derrish'}});
store.query({age:person.age, name:person.name}, function(err, res){
    console.log(res);
});

store.remove();
store.close();

In that case, I don’t provide callbacks to many of the methods called on the store object (though they accept them). This works because the underlying mysql driver I’m using queues up operations. The code reads as though the various operations are synchronous, though in fact they’re not. What really happens is that all of the code shown gets executed, queuing up operations against the mysql store (well, the first mysql call will be executed, but the rest will be queued). Only once the code shown has finished executing will the thread be available to process the i/o returned from mysql for the first call. Once that i/o is processed, the next queued command will be issued, and so on. I guess this is a pretty special case, since it relies on having a single i/o processor that queues its operations.

All in all, I’ve found node pretty nice to work with. I do wonder whether a job-queuing/thread-pooling model wouldn’t be better than the single-threaded model used, but perhaps there are technical reasons related to V8 that make that impractical.

Posted in javascript, node.js

Cube on Canvas

I first wrote a cube program in C when I was learning Windows programming many years ago. Since then, whenever I’ve wanted to learn about a new platform (e.g. Win CE, Silverlight, Java, Android, …) I’ve ported it as a learning exercise.

I wanted to play with the HTML5 canvas, so once again I worked on a port, starting with an Android version I’d done last year. It was surprisingly easy to get everything up and running in pure javascript/html5. The image below is a screenshot of the final result – click on it to view the real thing (I was too lazy to figure out how to embed an html5 canvas in wordpress).
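For anyone curious about the approach, here’s a minimal sketch of the technique (not the ported code itself; it assumes a canvas element with id 'cube'): rotate the cube’s 8 vertices, project them with a simple perspective divide, and draw the 12 edges.

    var canvas = document.getElementById('cube'),
        ctx = canvas.getContext('2d');

    // 8 corners of a unit cube and the 12 edges connecting them
    var verts = [[-1,-1,-1],[1,-1,-1],[1,1,-1],[-1,1,-1],
                 [-1,-1,1],[1,-1,1],[1,1,1],[-1,1,1]];
    var edges = [[0,1],[1,2],[2,3],[3,0],[4,5],[5,6],[6,7],[7,4],
                 [0,4],[1,5],[2,6],[3,7]];
    var angle = 0;

    function draw(){
        ctx.clearRect(0, 0, canvas.width, canvas.height);
        var cos = Math.cos(angle), sin = Math.sin(angle);
        // rotate about the y axis, then project with a perspective divide
        var pts = verts.map(function(v){
            var x = v[0]*cos - v[2]*sin,
                z = v[0]*sin + v[2]*cos,
                scale = 150 / (z + 4);
            return [canvas.width/2 + x*scale, canvas.height/2 + v[1]*scale];
        });
        ctx.beginPath();
        edges.forEach(function(e){
            ctx.moveTo(pts[e[0]][0], pts[e[0]][1]);
            ctx.lineTo(pts[e[1]][0], pts[e[1]][1]);
        });
        ctx.stroke();
        angle += 0.02;
    }

    setInterval(draw, 30);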

Let me know if you have any problems on your platform — I’ve only tested on desktop Chrome and Firefox and on mobile Android. (Later: I added the excanvas.js shim for IE, but at least on IE8 it’s so slow it’s not really usable.)

Posted in html5, javascript