
Reversed real-time search

To find documents matching a given query, a search engine iterates over inverted indexes and intersects the results. An inverted index is usually built over the terms found in a document; a term can be a word, a transformation of a word, its normal form, a whole sentence and so on.
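To make that scheme concrete, here is a minimal sketch (not from any real engine) of an inverted index with AND-query intersection; plain lowercasing stands in for real term normalization:

#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using posting_list = std::set<int>; // sorted ids of documents containing a term

// Build term -> posting list over a document set.
std::map<std::string, posting_list> build_index(const std::vector<std::string> &docs)
{
    std::map<std::string, posting_list> index;
    for (size_t id = 0; id < docs.size(); ++id) {
        std::istringstream ss(docs[id]);
        std::string term;
        while (ss >> term) {
            std::transform(term.begin(), term.end(), term.begin(), ::tolower);
            index[term].insert((int)id);
        }
    }
    return index;
}

// AND-query: intersect the posting lists of all query terms.
posting_list match(const std::map<std::string, posting_list> &index,
        const std::vector<std::string> &terms)
{
    posting_list result;
    bool first = true;
    for (const auto &t : terms) {
        auto it = index.find(t);
        if (it == index.end())
            return {};
        if (first) {
            result = it->second;
            first = false;
        } else {
            posting_list tmp;
            std::set_intersection(result.begin(), result.end(),
                    it->second.begin(), it->second.end(),
                    std::inserter(tmp, tmp.begin()));
            result.swap(tmp);
        }
    }
    return result;
}

int main()
{
    auto index = build_index({"realtime search engine", "batch search indexing"});
    for (int id : match(index, {"search", "engine"}))
        std::cout << "doc " << id << std::endl;
}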

This indexing process is rather costly and slow, but usually that is not a problem – documents in the set do not change frequently. Indexing is performed in batches – for example, by iterating over all documents downloaded since the last indexing batch.
Realtime search in this context is basically a reduction of the batch size – down to one document per batch in the extreme.

But there is a completely different search pattern – when a realtime flow of documents has to be matched against a set of queries. The most common example is live Twitter search.

Confluent performs realtime data processing and shows a way to do realtime search given a huge number of SQL queries and a realtime feed of documents.

They use Kafka to store the stream of documents and queries, plus a trick that avoids running every query against every new document: they build a query index which selects only those queries that could potentially match a given document, and then run only those.
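The idea can be sketched roughly like this (the names and structures below are our illustration, not Confluent's code): index the registered queries by their terms, and for each incoming document evaluate only the candidate queries that share at least one term with it:

#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct query {
    int id;
    std::set<std::string> terms; // AND-semantics: all terms must be present
};

std::set<std::string> tokenize(const std::string &text)
{
    std::set<std::string> terms;
    std::istringstream ss(text);
    std::string t;
    while (ss >> t)
        terms.insert(t);
    return terms;
}

int main()
{
    std::vector<query> queries = {
        {0, {"realtime", "search"}},
        {1, {"kafka"}},
        {2, {"road", "signs"}},
    };

    // Query index: term -> ids of queries containing that term.
    std::map<std::string, std::vector<int>> query_index;
    for (const auto &q : queries)
        for (const auto &t : q.terms)
            query_index[t].push_back(q.id);

    // An incoming document only touches queries which share a term with it,
    // instead of being checked against every registered query.
    std::string doc = "realtime search over a stream of documents";
    auto doc_terms = tokenize(doc);

    std::set<int> candidates;
    for (const auto &t : doc_terms) {
        auto it = query_index.find(t);
        if (it != query_index.end())
            candidates.insert(it->second.begin(), it->second.end());
    }

    // Run only the candidate queries against the document.
    for (int id : candidates) {
        bool all = true;
        for (const auto &t : queries[id].terms)
            if (!doc_terms.count(t)) { all = false; break; }
        if (all)
            std::cout << "query " << id << " matches" << std::endl;
    }
}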

Grape 2.0

We have recently released a from-scratch rewrite of Grape – our realtime processing pipeline.

It completely differs from what it was before. Instead of following in Storm's footsteps we dramatically changed the logic.

The main goals were data availability and data persistency. We created Grape for those who cannot afford to lose data.

So we introduced two parts in Grape: a persistent queue and workers.

The persistent queue uses a simple push/pop/ack API to store, retrieve and acknowledge chunks of data kept in Elliptics. An object may live in the queue forever – until it is processed by the workers.
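As an illustration, the interface could look roughly like this (the names below are a sketch, not the actual Grape API):

#include <cstdint>
#include <vector>

struct queue_entry {
    uint64_t id;            // entry id used later to acknowledge it
    std::vector<char> data; // opaque chunk of user data
};

class persistent_queue {
public:
    // Append a chunk of data; it stays in distributed storage until acked.
    virtual void push(const std::vector<char> &data) = 0;

    // Hand the oldest unprocessed entry to a worker; the entry is not
    // removed yet, so a crashed worker does not lose it.
    virtual queue_entry pop() = 0;

    // The worker confirms the entry was fully processed; only now is it dropped.
    virtual void ack(uint64_t id) = 0;

    virtual ~persistent_queue() = default;
};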

Unlike Kafka, we cannot lose your data because a data file has not been read for a long time or has grown past its size limit under constant write load.

Our queue may grow in distributed storage for as long as there is space (which is usually considered unlimited), and one may run processing workers not in a push manner, but with a pull design.

Push messaging systems imply that the whole processing pipeline has to run at the same speed as the pushing process. If there are load spikes the processing workers cannot handle, data will likely be lost. And any pipeline modification (like resharding Kafka topics) ends up stopping processing.

Pull systems do not suffer from this ‘must-have-the-same-performance’ issue – one may start new worker nodes on demand, and even if they cannot handle the current write spike, the data will sit in the distributed persistent queue and be caught up with later.
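A pull worker against such a queue is then just a loop – sketched below on top of the hypothetical interface above (process() is a stand-in for user-supplied logic):

#include <vector>

inline void process(const std::vector<char> &data)
{
    // hypothetical user-supplied processing of one chunk
    (void)data;
}

// Queue is the persistent_queue interface from the previous sketch.
template <typename Queue>
void worker_loop(Queue &q)
{
    for (;;) {
        auto entry = q.pop(); // blocks/polls until an entry is available
        process(entry.data);  // a crash here leaves the entry unacked
        q.ack(entry.id);      // only now is the entry dropped from storage
    }
}

Because entries are acknowledged only after successful processing, a crashed worker simply leaves its entry in the queue to be handed out again – nothing is lost.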

We are starting three projects on Grape: a realtime antifraud and human-detection system, a video-processing engine which detects road signs, and a Wikipedia->RDF generation pipeline.

Realtime processing engine

We created a small framework called grape to fill the gap in realtime processing.

We use elliptics as the base messaging protocol, but the whole logic of routing and connection processing is hidden from users of the system.

Basically, the grape framework hides all messaging, routing, connection, locking and other low-level details from the higher layer.
One may create a program (a shared library linked with libgrape.so which exports a function named initialize taking the json config sent from the config of the application you are about to run) which processes events and their associated data and emits other events, routed to different nodes in the elliptics distributed network.
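A skeleton of such a program could look roughly like this – the event type and handler are stand-ins we invented for the sketch, only the initialize() entry point follows the contract described above:

#include <cstdio>
#include <string>

// Stand-in for a real libgrape event (an assumption for this sketch).
struct event {
    std::string name;
    std::string data;
};

class echo_app {
public:
    explicit echo_app(const std::string &config) : config_(config) {}

    // Process an incoming event; a real handler would emit new events
    // that grape routes to other elliptics nodes.
    void handle(const event &ev)
    {
        printf("event %s: %zu bytes\n", ev.name.c_str(), ev.data.size());
    }

private:
    std::string config_;
};

// Entry point the framework looks for in the shared library: it receives
// the json config taken from the application config you are about to run.
extern "C" void *initialize(const char *json_config)
{
    return new echo_app(json_config ? json_config : "");
}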

Grape was built in the spirit of the Storm processing engine, and while Twitter announced examples which count words in your tweets (the US election site was likely created using this technology), we created a realtime search engine.

It is a rather simple engine – we use the Snowball stemmer and the Chromium language detector, and build namespaced reverse indexes (uncompressed msgpack’ed maps) with word positions.
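For one namespace, the shape of such an index can be sketched like this (plain lowercasing stands in for the Snowball stemmer and language detection; the exact map layout is our assumption):

#include <msgpack.hpp>

#include <algorithm>
#include <cctype>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    std::string text = "Realtime search over a realtime stream";

    // token -> positions of the token within the document
    std::map<std::string, std::vector<uint32_t>> index;
    std::istringstream ss(text);
    std::string token;
    uint32_t pos = 0;
    while (ss >> token) {
        std::transform(token.begin(), token.end(), token.begin(), ::tolower);
        index[token].push_back(pos++);
    }

    // Uncompressed msgpack'ed map, matching the stored form described above.
    msgpack::sbuffer buf;
    msgpack::pack(buf, index);
    // buf.data()/buf.size() would then be written into elliptics.
}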

Grape supports blocking operations, i.e. the client waits until all events it emitted are finished and one of them emits a completion event back to the caller.
We can generate secondary indexes using the same application; it requires even fewer steps.

The client sends a document msgpacked into a structure our engine understands (a simple document-id/namespace-id/text/timestamp/flags tuple); the engine updates the index for that document-id/namespace-id and parses the text, extracting tokens with the stemmer and language detector. Then the reverse index of every token/namespace-id is updated.
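The input structure can be sketched as follows – the field set is the tuple described above, while the field types and serialization helper are assumptions for the sketch:

#include <msgpack.hpp>

#include <cstdint>
#include <string>

struct document {
    uint64_t doc_id;
    uint64_t namespace_id;
    std::string text;
    uint64_t timestamp;
    uint64_t flags;

    MSGPACK_DEFINE(doc_id, namespace_id, text, timestamp, flags);
};

int main()
{
    document d{1, 42, "realtime search example", 1368000000, 0};

    // This msgpacked blob is what the client sends to the engine.
    msgpack::sbuffer buf;
    msgpack::pack(buf, d);
}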

We do not aim at Lucene or Sphinx – this is a simpler technology, just to show what our realtime processing engine can do besides Twitter-like word counts :)

Unfortunately, the documentation is rather scarce, but we are writing it right now and plan to show something interesting soon.