
Grape 2.0

We recently released a version of Grape, our realtime processing pipeline, rewritten from scratch.

It is completely different from what it was before. Instead of following in Storm's footsteps, we dramatically changed its logic.

The main goals were data availability and data persistence. We created Grape for those who cannot afford to lose data.

So we introduced two parts in Grape: a persistent queue and workers.

The persistent queue uses a simple push/pop/ack API to store, retrieve, and acknowledge chunks of data stored in Elliptics. An object may live in the queue forever, until it is processed by the workers.
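To make the push/pop/ack contract concrete, here is a minimal in-memory sketch. It is not Grape's actual API: the class name `PersistentQueueSketch`, its method signatures, and the `requeue_unacked` helper are all hypothetical, and a plain Python dictionary stands in for Elliptics storage. The point it illustrates is that a popped entry is not discarded until it is acknowledged.

```python
from collections import OrderedDict
from itertools import count


class PersistentQueueSketch:
    """Hypothetical in-memory stand-in for a push/pop/ack queue.

    A real implementation would keep both maps in distributed storage;
    here they are ordinary dictionaries.
    """

    def __init__(self):
        self._ids = count()
        self._pending = OrderedDict()  # pushed, not yet handed to a worker
        self._in_flight = {}           # popped, still waiting for an ack

    def push(self, data):
        """Store a chunk of data and return its entry id."""
        entry_id = next(self._ids)
        self._pending[entry_id] = data
        return entry_id

    def pop(self):
        """Hand out the oldest pending entry, or None if there is none."""
        if not self._pending:
            return None
        entry_id, data = self._pending.popitem(last=False)
        self._in_flight[entry_id] = data  # kept until acknowledged
        return entry_id, data

    def ack(self, entry_id):
        """Confirm processing; only now is the entry really removed."""
        self._in_flight.pop(entry_id)

    def requeue_unacked(self):
        """Put unacknowledged entries back at the front, oldest first."""
        for entry_id in sorted(self._in_flight, reverse=True):
            self._pending[entry_id] = self._in_flight.pop(entry_id)
            self._pending.move_to_end(entry_id, last=False)

    def __len__(self):
        return len(self._pending) + len(self._in_flight)
```

A worker would call `pop()`, process the chunk, then call `ack()`. If the worker dies in between, the entry is still in the in-flight map and can be handed out again with `requeue_unacked()`.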

Unlike Kafka, we do not lose your data if a ‘data-file’ has not been read for a long time or if its size overflows under constant write load.

Our queue may grow in distributed storage for as long as there is space (which is usually considered unlimited), and processing workers can be run in a pull design rather than a push one.

Push messaging systems imply that the whole processing pipeline has to work at the same speed as the pushing process. If there are load spikes that the processing workers cannot handle, data will likely be lost. Any pipeline modification (like resharding Kafka topics) ends up stopping processing.

Pull systems do not suffer from this ‘must-have-the-same-performance’ issue: one may start new worker nodes on demand, and even if they cannot handle the current write spike, the data will be stored in the distributed persistent queue and caught up on later.
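The difference can be illustrated with a toy simulation. This is a deliberately simplified model, not a measurement of Grape or Kafka: the function names are hypothetical, the push side is assumed to have no buffer at all, and the pull side is assumed to have an unbounded queue.

```python
def simulate_pull(arrivals, worker_capacity):
    """Workers pull from an unbounded backlog at their own pace.

    arrivals: number of events written during each tick.
    Returns (processed, remaining_backlog, backlog_after_each_tick).
    """
    backlog = 0
    processed = 0
    history = []
    for incoming in arrivals:
        backlog += incoming                  # the queue absorbs the spike
        done = min(backlog, worker_capacity)
        backlog -= done
        processed += done
        history.append(backlog)
    return processed, backlog, history


def simulate_push(arrivals, worker_capacity):
    """Events are pushed straight at workers; the excess is dropped.

    Assumes no buffering at all, which is the worst case.
    Returns (processed, dropped).
    """
    processed = 0
    dropped = 0
    for incoming in arrivals:
        done = min(incoming, worker_capacity)
        processed += done
        dropped += incoming - done
    return processed, dropped


if __name__ == "__main__":
    spike = [2, 2, 10, 2, 0, 0]      # one tick with a write spike
    print(simulate_pull(spike, 4))   # (16, 0, [0, 0, 6, 4, 0, 0])
    print(simulate_push(spike, 4))   # (10, 6)
```

In the pull model the spike shows up as a temporary backlog that drains over the following ticks; in the unbuffered push model the same spike shows up as dropped events.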

We are starting three projects on Grape: a realtime antifraud and human detection system, a video-processing engine that detects road signs, and a Wikipedia->RDF generation pipeline.

Twitter realtime search engine

Twitter uses humans (a pool of in-house ‘turks’ in Mechanical Turk) each time a new trending topic is propagated to search results: http://engineering.twitter.com/2013/01/improving-twitter-search-with-real-time.html

Every time.

And Storm is only used to gather statistics and detect trending topics. It uses Thrift to upload each new active search term to Amazon’s Mechanical Turk.

Instead, we want Grape, our realtime processing engine, to be able to perform much more complicated tasks. In particular, we are implementing secondary indexes and realtime search in Elliptics on top of Grape.

Grape’s ultimate goal is to be a platform for every kind of realtime processing task. To that end, we are developing technology for guaranteed event processing, pipeline restart, event order preservation, and so on. In realtime search, this means something like emitting a new event when a new document is uploaded to the Elliptics distributed storage; that event then triggers the whole search indexing process (stemming, language detection, inverted index updates, and so on).
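As a rough illustration of that flow, the sketch below shows a single "document uploaded" event driving language detection, stemming, and an inverted index update. Everything here is hypothetical: the event shape, the `IndexingPipeline` class, and the toy `detect_language` and `stem` helpers are placeholders, not Grape or Elliptics code, and real stemming and language detection are far more involved.

```python
from collections import defaultdict


def detect_language(text):
    # Toy placeholder: treat ASCII-only text as English.
    return "en" if text.isascii() else "unknown"


def stem(word):
    # Toy suffix stripping; a real pipeline would use a proper stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class IndexingPipeline:
    """Hypothetical handler that reacts to 'document uploaded' events."""

    def __init__(self):
        self.inverted_index = defaultdict(set)  # stem -> set of doc ids
        self.languages = {}                     # doc id -> language code

    def on_document_uploaded(self, event):
        """One event triggers every indexing step for the document."""
        doc_id, text = event["doc_id"], event["text"]
        self.languages[doc_id] = detect_language(text)
        for token in text.lower().split():
            self.inverted_index[stem(token)].add(doc_id)

    def search(self, word):
        """Return the sorted ids of documents containing the word's stem."""
        return sorted(self.inverted_index.get(stem(word.lower()), set()))


if __name__ == "__main__":
    pipeline = IndexingPipeline()
    pipeline.on_document_uploaded(
        {"doc_id": "d1", "text": "Workers processing queues"}
    )
    pipeline.on_document_uploaded({"doc_id": "d2", "text": "queue worker"})
    print(pipeline.search("queue"))      # ['d1', 'd2']
    print(pipeline.search("processed"))  # ['d1']
```

In a guaranteed-processing setup, the event would be acknowledged only after `on_document_uploaded` returns, so a crash mid-indexing leads to the event being replayed rather than lost.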