
Greylock tutorial – a distributed search engine based on Elliptics

We have heavily updated the Reverbrain documentation pages at doc.reverbrain.com, and I'm pleased to present our distributed search engine Greylock. The documentation includes general information about the search engine and a tutorial covering the installation process, configs and two types of clients: a plain HTTP API (similar to what you would expect from a search engine like Greylock or ElasticSearch) and a Python client (it also works via HTTP, but additionally uses Consul to acquire mailbox locks). If you need a C++ tutorial, check the Greylock test suite, which covers insertion/selection/removal as well as various iterators over data, self-recovery tests, statistics and other interesting bits.

I get a fair number of questions about how Greylock differs from, say, ElasticSearch, Solr or Amazon CloudSearch. They all have an enormous number of features and work with huge amounts of data, so what is the point of yet another engine?

And the answer is just two words: scalability and automation.
If you have worked with Elastic, you know what it takes to reshard a cluster when the current sharding scheme becomes a bottleneck (how long it takes, what the performance penalty is and how risky the process is). In an environment where new documents keep arriving and space consumption grows over time, this resharding has to be repeated again and again as new servers are added. At some point this becomes a serious issue.

With Greylock this is not needed at all. Thanks to the Elliptics bucket system, there is virtually no data or index movement when new servers are added. This design has proven to work really well in Elliptics storage installations, where upload rates reach tens of terabytes daily, and that is only our clients' data; there are other, substantially larger installations, for example at Yandex.
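
To show why adding servers does not force resharding, here is a conceptual sketch of bucket-based placement in Python. It only illustrates the general idea, not Elliptics' actual bucket implementation; the class and method names are made up.

    # Conceptual sketch of bucket-based placement, not Elliptics' actual code:
    # data is addressed by bucket, buckets are pinned to server groups, and new
    # servers simply contribute new writable buckets, so nothing that was
    # already written has to move.

    import random

    class Cluster:
        def __init__(self):
            self.buckets = {}               # bucket name -> list of server groups

        def add_bucket(self, name, groups):
            # New capacity shows up as a new bucket; existing buckets stay untouched.
            self.buckets[name] = groups

        def write(self, key, data):
            # A write picks any currently writable bucket (e.g. the least full one);
            # the chosen bucket name is stored with the key for later reads.
            bucket = random.choice(list(self.buckets))
            return bucket, key, len(data)   # real code would send to that bucket's groups

    cluster = Cluster()
    cluster.add_bucket('b1', groups=[1, 2])    # initial servers
    print(cluster.write('doc-1', b'...'))

    cluster.add_bucket('b2', groups=[3, 4])    # new servers added later
    print(cluster.write('doc-2', b'...'))      # may land in b2; data in b1 never moves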

We concentrated on the scalability problem and solved it. We do have a set of features as well, though it is of course not comparable to Elastic's, even without counting the NLP tasks we will release later (for example, language models and spelling correction for any language for which you can find a reasonably large corpus). Greylock supports a basic relevance model based on the word distance between the client request and the words in the document.
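
To make the idea concrete, here is a toy sketch of a distance-based relevance score: documents where the query words appear close together rank higher. It illustrates the general approach only; it is not Greylock's actual scoring formula.

    # Toy distance-based relevance: a document ranks higher when the query
    # words appear close to each other. Not Greylock's actual formula.

    def distance_score(query, document):
        query_words = query.lower().split()
        doc_words = document.lower().split()

        # Positions of every query word inside the document.
        positions = []
        for word in query_words:
            hits = [i for i, w in enumerate(doc_words) if w == word]
            if not hits:
                return 0.0                  # a missing query word means zero relevance here
            positions.append(hits)

        # Brute-force search for the smallest window covering one occurrence
        # of each query word (fine for a toy, exponential in general).
        best = None
        def walk(idx, chosen):
            nonlocal best
            if idx == len(positions):
                span = max(chosen) - min(chosen) + 1
                best = span if best is None else min(best, span)
                return
            for p in positions[idx]:
                walk(idx + 1, chosen + [p])

        walk(0, [])
        return 1.0 / best                   # tighter window -> higher score

    docs = ['distributed search engine based on elliptics',
            'search for a distributed lock in a consul based engine']
    for d in docs:
        print(distance_score('distributed search', d), d)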

Probably the two worst issues are the absence of numeric indexes and the need for client locks. Both decisions were made deliberately. Numeric indexes break pagination: if you want 100 documents out of a million, you have to read them all into RAM, re-sort them either into numeric order or into lexical order (which is how document IDs are stored in the inverted indexes), intersect the whole million keys and return the first 100. Every subsequent request has to repeat all of this. Without numerics, pagination works only through iterators over the inverted indexes: the whole index (and its millions of entries) is never read, only a few pages are accessed sequentially.
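
Here is a minimal sketch of that iterator-style pagination, assuming posting lists sorted by document ID: the query advances iterators and stops as soon as the requested page is full, so the full lists are never materialized. It illustrates the approach, not Greylock's code.

    # Paginating an AND query by walking sorted posting lists with iterators
    # instead of materializing and intersecting the full lists.

    def intersect_paged(postings, offset, limit):
        """postings: iterables of doc IDs, each sorted in ascending order."""
        iters = [iter(p) for p in postings]
        current = [next(it, None) for it in iters]
        skipped, page = 0, []

        while all(c is not None for c in current):
            lo, hi = min(current), max(current)
            if lo == hi:                    # doc ID present in every posting list
                if skipped < offset:
                    skipped += 1
                else:
                    page.append(lo)
                    if len(page) == limit:  # page is full: stop, nothing else is read
                        break
                current = [next(it, None) for it in iters]
            else:
                # advance only the iterators that are behind the largest doc ID
                for i, c in enumerate(current):
                    if c < hi:
                        current[i] = next(iters[i], None)
        return page

    foo = range(0, 1000000, 2)              # posting list for one query word
    bar = range(0, 1000000, 3)              # posting list for another query word
    print(intersect_paged([foo, bar], offset=0, limit=5))   # [0, 6, 12, 18, 24]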

To help with numerics, Greylock supports document timestamps: a single 64-bit number per document ID that is used as the sort order within the inverted indexes. This is of course not a replacement for a proper numeric index, but it covers almost all of our use cases.
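
As a small illustration (a toy layout, not Greylock's on-disk format): if each posting entry carries the document timestamp and the lists are kept sorted by it, a "newest matches first" query needs no separate numeric index.

    # If each posting entry carries the document timestamp and lists are kept
    # sorted by it (newest first), the most recent matches come out first
    # without any separate numeric index. Toy layout, not Greylock's format.

    from collections import namedtuple

    Posting = namedtuple('Posting', ['timestamp', 'doc_id'])

    foo = [Posting(1700000300, 'doc-7'), Posting(1700000200, 'doc-3'), Posting(1700000100, 'doc-1')]
    bar = [Posting(1700000300, 'doc-7'), Posting(1700000100, 'doc-1')]

    matches = [p for p in foo if p in bar]  # toy intersection, already newest first
    print(matches[:1])                      # first page: the most recent match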

The second major issue is consistency and client locking. Neither Greylock nor Elliptics is a strictly consistent storage. With Greylock things are even worse: the amount of data overwritten by a single client insert can be large, and the index structure (which originally started as a distributed B+/*-tree) does not tolerate broken pages. Elastic and others implement a consensus protocol (such as Raft, Paxos or ZAB) internally; Greylock does not. That is why we require clients to acquire locks in an external system such as Consul, etcd or ZooKeeper. Our tutorial shows a basic locking scheme implemented on top of Consul's strictly consistent key-value store.
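
As a rough sketch of what such client-side locking can look like, here is a minimal example using the python-consul package; the key name and TTL are made up, and the tutorial's actual scheme may differ in its details.

    # A minimal client-side lock built on Consul sessions and the KV store.
    # The key name and TTL are made up; the tutorial's scheme may differ.

    import time
    import consul                            # pip install python-consul

    c = consul.Consul()                      # assumes a local Consul agent

    # A session ties the lock to this client; if the client dies, the TTL
    # expires and Consul releases the lock automatically.
    session_id = c.session.create(ttl=30, behavior='release')

    lock_key = 'greylock/index-lock/mailbox-1'   # hypothetical key name

    # Spin until the lock is acquired, update the index, then release it.
    while not c.kv.put(lock_key, 'owned', acquire=session_id):
        time.sleep(1)

    try:
        pass                                 # ... perform the Greylock insert here ...
    finally:
        c.kv.put(lock_key, 'released', release=session_id)
        c.session.destroy(session_id)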

We have big plans for the Greylock distributed search engine, so expect new features, and give it a try: http://doc.reverbrain.com/greylock:greylock
If you have any questions, you are welcome to ask:
Google group: https://groups.google.com/forum/?fromgroups=#!forum/reverbrain,
or here on this site in the comments.

Large storage and search index installations: elasticsearch + github.com

Here is an excellent interview with the github.com/search guys about their search installation.

Basic points are:

  • 30 terabytes of data on SSD disks
  • 2 billion documents in the search index
  • 44 servers on EC2 (8 frontends and 36 storage nodes)
  • self-healing index – when a user pushes new data, the touched files are reindexed, which heals the index and catches up with missed updates
  • even such a setup crashed under load until elasticsearch was substantially reconfigured, bugs were fixed and so on – 2 days offline, and elasticsearch's own gurus got involved
  • elasticsearch has a split-brain problem, which can corrupt data
  • the 44 EC2 nodes were replaced with 8 of their own physical servers, each with 32 cores and 14 TB of SSD
  • elasticsearch 0.90 has rack-aware load balancing
  • elasticsearch has built-in index optimization which doesn't immediately delete objects from the indexes when a user deletes data – the actual cleanup can be postponed and batched for maximum performance
  • github.com uses 'everything-over-HTTP+JSON', since Thrift is harder to debug and operate, albeit faster
  • each repository lives in a single shard (they had 500 shards on EC2, but say that was overkill) – this roughly doubles query speed (see the routing sketch after this list)
  • indexes are split as much as possible (roughly sharded by time, but not exactly) and old ones are archived (compressed, glued together and so on)
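
As a sketch of how per-repository sharding is usually achieved in elasticsearch, the routing parameter pins all of a repository's documents to one shard, so a per-repository query hits a single shard. The interview does not describe GitHub's exact schema, so the index, type and field names below are made up.

    # Pinning all of a repository's documents to one shard with the
    # elasticsearch 'routing' parameter, so a per-repository query hits a
    # single shard. Index, type and field names are made up, not GitHub's.

    import requests                          # pip install requests

    ES = 'http://localhost:9200'
    repo = 'reverbrain/elliptics'

    # Index a source file; routing by repository name decides the target shard.
    requests.put(f'{ES}/code/file/1',
                 params={'routing': repo},
                 json={'repo': repo, 'path': 'README.md',
                       'content': 'distributed storage and search'})

    # Search with the same routing value: elasticsearch queries only the shard
    # that holds this repository instead of fanning out to all shards.
    resp = requests.post(f'{ES}/code/file/_search',
                         params={'routing': repo},
                         json={'query': {'match': {'content': 'distributed'}}})
    print(resp.json()['hits']['total'])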

That’s it.

I wrote this because Elliptics supports secondary indexes, which we would like to try on a similar setup – several billion records, many terabytes of data. Our friends use MongoDB for that, and it really sucks at it.
Our solution is not quite ready (we easily handle that amount of data, but we still have to think about how elliptics indexes should properly handle that number of objects), but we are getting close – I do not know whether it will see production use in the near future, but there will be major testing for sure.

In the meantime my pet project Wookie – search engine infrastructure on top of elliptics secondary indexes – has received quite a bit of support: we have an HTTP+JSON interface, and/or/quote operators, various helper tools and so on. We are working on better lemmatization (Snowball stemming is quite poor for Russian, for example), spelling correction and so on.
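
To illustrate the Snowball point, here is a tiny example using the snowballstemmer package: irregular and suppletive forms of the same word often get unrelated stems, which is exactly where plain stemming falls short of lemmatization. It is just an illustration of the limitation, not Wookie's pipeline.

    # Irregular and suppletive forms of the same word get unrelated stems,
    # which is exactly where plain stemming falls short of lemmatization.

    import snowballstemmer                   # pip install snowballstemmer

    stemmer = snowballstemmer.stemmer('russian')

    # 'человек' (person) and its suppletive plural 'люди' (people);
    # 'идти' (to go) and its past form 'шла' - a stemmer cannot relate them.
    for word in ['человек', 'люди', 'идти', 'шла']:
        print(word, '->', stemmer.stemWord(word))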

It is not that I want to reinvent a modern search engine; rather, I have a completely different idea in mind for how to implement what the search engine world calls 'relevance'.
This idea has to be tested in the real world.