Distributed computation in elliptics
Over the last days (or months) I have thought a lot about what people really want from a distributed system.
Usually it is enough just to store data while being able to scale horizontally when needed. But as more and more people start using the system, they begin to outgrow the supported feature set.
In particular, people do not want to just store data – they want to run some processing on top of it.
We added server-side scripting as a simple answer to that demand – server-side scripting is a way to run arbitrary code on the data you provide to the system. In particular, it can be thought of as a trigger on data writes: we get the data, process it and store the result.
Server-side scripts can be quite complex objects – for example, in Pohmelfs we update directory contents and hardlink counts using server-side scripts. They may read data, change object structure, serialize received objects, store combined entities back to the low-level elliptics backend and so forth.
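To make the idea concrete, here is a minimal sketch of what such a write trigger could look like. This is not the real elliptics server-side API – the `on_write` hook and the `storage.read`/`storage.write` calls are hypothetical names for illustration – but it shows the shape of the Pohmelfs-style use case: intercept a write, update the parent directory listing and link count, then store the object itself.

```python
# Hypothetical sketch of a server-side write trigger.
# `on_write`, `storage.read` and `storage.write` are illustrative
# names, NOT the real elliptics server-side scripting API.
import json

def on_write(storage, key, data):
    """Triggered on a file write: update the parent directory
    listing (stored under its own key), then store the object."""
    dir_key = key.rsplit("/", 1)[0] or "/"
    raw = storage.read(dir_key)
    listing = json.loads(raw) if raw else {"entries": [], "nlink": 1}
    name = key.rsplit("/", 1)[-1]
    if name not in listing["entries"]:
        listing["entries"].append(name)
        listing["nlink"] += 1   # link count grows with each new entry
    storage.write(dir_key, json.dumps(listing))  # serialized combined entity
    storage.write(key, data)                     # finally store the object
```

The point is that the trigger sees the raw write, can read and rewrite other objects, and the client never has to orchestrate the directory update itself.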
Server-side scripts can be treated as low-level processing units which work with a single object or key.
Using those basic ‘bricks’ we can create rather complex systems – it is possible to build streams and graph-processing-like engines, but it requires an enormous amount of low-level code. Something like ‘get-key, process-key, write-key-to-leafX, copy-to-leafY, wait-for-replyZ’ and so on. Changing the computational topology is a real pain.
What we really want is a way to describe how our processing engine should look – something high-level like ‘compute-function-1(input-key) -> compute-function-2() and compute-function-3()’ and so on. Then we would submit that topology to the cluster and start emitting keys.
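A tiny sketch of what such a topology description could look like as code. The `Topology` class, its methods and the node names are all hypothetical – the idea is just that you declare which function feeds which, submit the graph, and then emit keys into it without writing any of the get-key/write-key plumbing yourself.

```python
# Hypothetical topology-description sketch; the Topology class and
# node names are illustrative, not part of any real system.
from collections import defaultdict

class Topology:
    def __init__(self):
        self.funcs = {}
        self.edges = defaultdict(list)

    def node(self, name, func, downstream=()):
        """Register a processing function and the nodes it feeds."""
        self.funcs[name] = func
        self.edges[name] = list(downstream)
        return self

    def emit(self, name, value):
        """Push a value through the graph starting at `name`,
        returning (node, output) pairs in processing order."""
        out = self.funcs[name](value)
        results = [(name, out)]
        for nxt in self.edges[name]:
            results.extend(self.emit(nxt, out))
        return results

# 'compute-function-1(input-key) -> compute-function-2 and compute-function-3'
topo = (Topology()
        .node("f2", lambda v: v + 1)
        .node("f3", lambda v: v * 2)
        .node("f1", lambda v: v * 10, downstream=("f2", "f3")))
```

Changing the computational topology then becomes a one-line change in the `downstream` lists instead of rewriting low-level message plumbing.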
I got this whole idea from a Storm presentation. Storm is a distributed real-time processing system recently acquired by Twitter and used there for real-time analytics.
I certainly do not like some of its concepts – in particular the single master node (say hi to Hadoop) and the Clojure/Java implementation (it is a real pain in the ass to support multiple technology stacks in one project, and we already have a bunch of them, so we are likely not ready to bring in Java).
But Storm shows really exceptional parallelism in computation and a very convenient high-level abstraction. A network topology built on top of low-level objects like streams, spouts and bolts – this is exactly how we want to create a processing engine: by describing the topology, not by thinking about how to receive a message and how to send a reply.
My search/AI ideas definitely need something similar for high-level computation modeling.
So I’m starting to think about how this can be built without being bound to elliptics as its data storage – generic enough to work elsewhere.
This may sound like reinventing the wheel – and in some ways it definitely is. I wish Storm had been created as an embeddable library, so I could put it into our datastore easily… But it wasn’t, so we have to create something ourselves.
I would also add true graph processing to this engine – the way Google’s Pregel works with its supersteps – this is what is missing in Storm. Storm is good for real-time ‘stateless’ processing: once your tuple has been processed by the whole topology, it is forgotten. What I want is a way to synchronize parallel topologies, so that when they are completed, the next topology starts with the data previously generated by its predecessors.
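The superstep idea above can be sketched in a few lines. This is a single-process toy, not Pregel itself: each superstep is a list of functions that all see the same input snapshot, and the gathering of their outputs before the next superstep begins plays the role of the synchronization barrier.

```python
# Toy sketch of Pregel-style supersteps with a barrier between them.
# Everything here runs sequentially in one process for illustration.
def run_supersteps(steps, initial):
    """`steps` is a list of supersteps; each superstep is a list of
    functions applied to every item of the current data set.  All
    outputs are gathered before the next superstep starts - that
    gathering is the synchronization barrier."""
    data = initial
    for stage_funcs in steps:
        # every function in this superstep sees the same input snapshot
        outputs = [f(item) for f in stage_funcs for item in data]
        data = outputs  # barrier: successors only see completed results
    return data

result = run_supersteps(
    [[lambda x: x + 1],                 # superstep 1
     [lambda x: x * 2, lambda x: -x]],  # superstep 2: two parallel stages
    [1, 2, 3])
```

In a real distributed engine the barrier would of course be a cluster-wide synchronization point rather than a list comprehension, but the contract is the same: a topology never observes partial output from its predecessors.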
Stay tuned. I promised to keep this blog running, and failed to find time to update it regularly.
But I will do more than my best to improve this situation :)