Recently we released a from-scratch rewrite of Grape – our realtime processing pipeline.
It is completely different from the previous version: instead of following in Storm's footsteps, we dramatically changed its logic.
The main goals were data availability and data persistence. We created Grape for those who cannot afford to lose data.
So we split Grape into two parts: a persistent queue and workers.
The persistent queue exposes a simple push/pop/ack API to store, retrieve, and acknowledge chunks of data kept in Elliptics. An object may live in the queue forever – until it has been processed by the workers.
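To make the semantics concrete, here is a minimal in-memory sketch of such a push/pop/ack queue. All names are hypothetical – the real queue stores its chunks in Elliptics, while here a plain dict stands in for the distributed storage so the API shape is visible:

```python
import itertools
from collections import OrderedDict

class PersistentQueue:
    """Illustrative sketch of push/pop/ack semantics (not Grape's real API).

    A chunk is deleted only after a worker explicitly acks it, so a
    popped-but-unacked chunk is never lost.
    """
    def __init__(self):
        self._ids = itertools.count()
        self._pending = OrderedDict()   # id -> chunk, waiting to be popped
        self._in_flight = {}            # id -> chunk, popped but not yet acked

    def push(self, chunk):
        """Store a chunk; it stays until some worker pops *and* acks it."""
        qid = next(self._ids)
        self._pending[qid] = chunk
        return qid

    def pop(self):
        """Hand the oldest chunk to a worker without deleting it."""
        if not self._pending:
            return None
        qid, chunk = self._pending.popitem(last=False)
        self._in_flight[qid] = chunk
        return qid, chunk

    def ack(self, qid):
        """Delete a chunk only after the worker confirms processing."""
        self._in_flight.pop(qid, None)

    def requeue_unacked(self):
        """If a worker dies, its un-acked chunks go back into the queue."""
        for qid, chunk in self._in_flight.items():
            self._pending[qid] = chunk
        self._in_flight.clear()
```

The key design point is the separation of pop and ack: a crash between the two leaves the chunk recoverable instead of silently dropped.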
Unlike Kafka, we do not lose your data when a 'data file' has not been read for a long time or when its size overflows under constant write load.
Our queue can grow in the distributed storage for as long as there is free space (which is usually considered unlimited), and processing workers can be run not in a push manner, but with a pull design.
Push messaging systems imply that the whole processing pipeline has to run at the same speed as the pushing process. If a load spike arrives that the processing workers cannot handle, data will likely be lost. Any pipeline modification (such as resharding Kafka topics) means stopping processing.
Pull systems do not suffer from this 'must-have-the-same-performance' issue – one may start new worker nodes on demand, and even if they cannot handle the current write spike, the data will be stored in the distributed persistent queue and caught up with later.
We are starting three projects on Grape: a realtime antifraud and human-detection system, a video-processing engine that detects road signs, and a Wikipedia-to-RDF generation pipeline.