Tag Archives: secondary-index

Facebook’s in-memory computation engine for social analytics

Facebook’s scale is far beyond what the vast majority of us will ever face, yet it is an interesting place to look for new ideas.

Audience Insights is a way to show us pages we are likely to like, based on previously liked pages, gender, location and many other features.

At Facebook’s scale this means about 35 TB of raw data, and a query like ‘give me several pages which should be shown to user X’ must complete within hundreds of milliseconds. That requires processing hundreds of billions of likes and billions of pages – numbers beyond reasonable – in a fraction of a second.

It turns out that 168 nodes can handle it with some magic: columnar storage, in-memory data, bitmap indexes, GPUs and caches. Very interesting reading!
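The bitmap index trick is worth a tiny illustration. Below is a toy sketch in Python (nothing to do with Facebook’s actual code): each predicate is a bitset over row ids, so even a multi-predicate filter over huge sets collapses into a chain of bitwise ANDs over packed bits.

# Toy bitmap index: one bit per row; bit i set => row i matches the predicate.
likes_page_a = 0b10110110
is_female    = 0b11010011
in_london    = 0b10011110

# "liked page A AND female AND in London" is just a chain of bitwise ANDs.
matches = likes_page_a & is_female & in_london

# Recover the matching row ids from the combined bitset.
rows = [i for i in range(8) if matches >> i & 1]
print(rows)  # [1, 4, 7]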

Large storage and search index installations: elasticsearch + github.com

Here is an excellent interview with the github.com search team about their search installation.

The basic points are:

  • 30 terabytes of data on SSD disks
  • 2 billion documents in the search index
  • 44 servers on EC2 (8 frontends and 36 storage nodes)
  • self-healing index – when a user pushes new data, the touched files are reindexed, which heals the index and catches up on missed updates
  • even such a great setup crashed under load, until elasticsearch was substantially reconfigured, bugs were fixed and so on – 2 days offline, with elasticsearch’s own gurus involved
  • elasticsearch has a split-brain problem which can corrupt data
  • the 44 EC2 nodes were replaced with 8 physical servers of their own, each with 32 cores and 14 TB of SSD
  • elasticsearch 0.90 has rack-aware load balancing
  • elasticsearch has built-in index optimization – objects are not really deleted from the indexes when the user deletes data; the cleanup can be postponed and batched for maximum performance
  • github.com uses ‘everything-over-http+json’, since Thrift, albeit faster, is harder to debug and operate
  • each repository lives in a single shard (they had 500 shards on EC2, but say that was overkill) – this alone makes queries about 2 times faster; see the routing sketch after this list
  • split indexes as much as possible (similar to sharding by time, but not exactly) and archive old ones (compress them, glue them together and so on)
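The single-shard-per-repository point maps onto elasticsearch’s routing parameter. Here is a minimal Python sketch of the idea against a local test node – the index and type names are made up, the URL layout follows the elasticsearch of that era, and none of this is GitHub’s actual setup. Passing the same routing value at index and search time pins a repository’s documents and queries to one shard instead of fanning out to all of them.

import requests

ES = "http://localhost:9200"   # assumed local test node
repo = "torvalds/linux"        # the routing key: one repository, one shard

# Index a document, routed by repository name.
requests.put(ES + "/code/file/1", params={"routing": repo},
             json={"repo": repo, "path": "kernel/fork.c", "body": "..."})

# Search with the same routing value: only that repository's shard is queried.
r = requests.get(ES + "/code/file/_search", params={"routing": repo},
                 json={"query": {"match": {"body": "fork"}}})
print(r.json())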

That’s it.

I wrote this because Elliptics supports secondary indexes, which we would like to try on a similar setup – several billion records, many terabytes of data. Our friends use MongoDB for that, and it really sucks at it.
Our solution is not quite ready (we easily handle that amount of data, but we have to think about how elliptics indexes should properly handle that number of objects), but we are getting close – I do not know whether there will be production use in the near future, but there will be major testing for sure.

In the meantime my pet project Wookie – search engine infrastructure on top of elliptics secondary indexes – got quite a bit of support: we have an HTTP+JSON interface, and/or/quotes operators, various helper tools and so on. We are working on better lemmatization (Snowball stemming quite sucks for Russian, for example), spelling correction and so on.

Not that I want to reinvent any modern search engine; instead I have a completely different idea in mind for implementing what the search engine world calls ‘relevance’.
This idea has to be checked in the real world.

Extensive tutorial on elliptics, thevoid, HTTP access and secondary indexes

I’ve written a rather big tutorial about our distributed storage elliptics: installing it from the Ubuntu PPA, setting up a test server and providing REST HTTP access to it.

HTTP methods include reading and writing data as well as updating and searching secondary indexes. Secondary indexes use the REST API too – the client posts JSON data describing the indexes and the objects one wants to update or find.

For example, the tutorial describes how to find all objects which are present in all (or any) of the provided indexes. The corresponding JSON documents (update and find) are rather simple:

$ cat example/update-example.json
{
    "id": "elliptics",
    "indexes": {
        "fast": "this is some 'private' data only for key 'elliptics' and index/tag 'fast'",
        "distributed": "data for tag 'distributed'",
        "fault-tolerant": "and 'fault-tolerant'"
    }
}

$ cat example/find-example.json 
{
    "type": "and",
    "indexes": [
        "fast",
        "distributed"
    ]
}
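
Driving these two requests amounts to POSTing the files above to the HTTP proxy. Here is a small Python sketch – the /update and /find endpoint paths and the proxy address are illustrative, take the real ones from the tutorial linked below:

import requests

PROXY = "http://localhost:8080"   # assumed address of the HTTP proxy

# Attach key 'elliptics' to the 'fast', 'distributed' and 'fault-tolerant' indexes.
with open("example/update-example.json") as f:
    requests.post(PROXY + "/update", data=f.read())

# Find all keys present in both 'fast' AND 'distributed'.
with open("example/find-example.json") as f:
    reply = requests.post(PROXY + "/find", data=f.read())
print(reply.json())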

Whole tutorial: http://doc.reverbrain.com/elliptics:client-tutorial

Secondary indexes and write-cas

It has been a while since I last wrote, and an enormous number of changes have been made.

Let’s start with write-cas and secondary indexes.
Write-cas stands for write-compare-and-swap, which means the server performs your write only if your cookie matches what it has. We use the data checksum as the cookie, but that can be changed if needed.

CAS semantics were introduced to solve the problem of multiple clients updating the same key. In some cases you want to update a complex structure without losing changes made by others.

Since we do not have distributed locks, a client performs its read-modify-write loop without holding any lock, which ends up in a race against other clients. CAS allows us to detect that the data has changed since we last read it and that our modification would overwrite those changes.

In this case the client gets an error and performs the read-modify-write loop again. There are nice functions for using write-cas: a high-level one taking a functor which just modifies the data (the read and write are hidden), and a low-level method which requires the checksum and the new data – if it fails, it is up to the client to read the data again, apply its changes and write it once more.
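
The low-level flavour boils down to a simple optimistic loop. This is a Python sketch with a hypothetical client exposing read() and write_cas() – the real elliptics API is C++, so treat the names as illustration only:

class CasError(Exception):
    """Server-side checksum no longer matches the one we sent."""

def update(client, key, modify):
    # Optimistic read-modify-write: no lock is held, so another client may
    # write the key between our read and our write; the checksum detects it.
    while True:
        data, checksum = client.read(key)            # checksum is our 'cookie'
        try:
            client.write_cas(key, modify(data), checksum)
            return                                   # nobody raced with us
        except CasError:
            continue                                 # lost the race: retry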

The first write-cas client is elliptics’ secondary indexes. You may attach ‘tags’ or ‘indexes’, with data, to any key you want. This means that after the secondary index update has completed, you will find your key under that ‘tag’ or ‘index’.

For example, a tag can be a daily timestamp – in this case you will get the list of all keys ‘changed’ on a given day.

Of course you can find all keys which match multiple indexes – a kind of logical ‘AND’ operation. Our API returns all keys found under all of the requested tags/indexes.

All indexes (as well as any keys) may live in different namespaces.

So far secondary indexes are implemented on pure write-cas semantics, and our rather simple realtime search engine Wookie uses them to update its inverted indexes.
And that doesn’t scale :)

We found that multiple clients competing to update the same popular index end up in so many read-modify-write cycles that the whole system becomes network bound. And we want to be CPU bound instead.
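
A back-of-envelope illustration of why: with N clients hammering one hot key, each CAS round lets exactly one writer through, so in the worst case the group performs about N²/2 read-modify-write round trips – traffic grows quadratically with the number of writers.

def cas_round_trips(n_clients):
    # Worst case: every remaining client reads and attempts write-cas each
    # round, but only one of them succeeds and drops out.
    trips = 0
    while n_clients:
        trips += n_clients
        n_clients -= 1
    return trips

print(cas_round_trips(10))   # 55 round trips for 10 writers
print(cas_round_trips(100))  # 5050 - quadratic growth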

To fix this we will change secondary indexes to be updated by the server-side engine. All server-side operations are atomic by default, so this will eliminate the races between multiple clients.

Stay tuned – I will write more about Wookie soon.
After all, we want to implement a baby (really-really baby) Watson with it :)