Here is an excellent interview with the github.com/search guys about their search installation
Basic points are:
- 30 terabytes of data on SSD disks
- 2 billion documents in the search index
- 44 servers on EC2 (8 frontends and 36 storage nodes)
- self-healing index – when a user pushes new data, the touched files are reindexed, which heals the index and catches up on missed updates
- even such a great setup crashed under load until Elasticsearch was heavily reconfigured, bugs were fixed and so on – 2 days offline, with Elasticsearch's own gurus getting involved
- Elasticsearch has a split-brain problem, which can corrupt data
- the 44 EC2 nodes were replaced with 8 of their own physical servers, each with 32 cores and 14 TB of SSD
- Elasticsearch 0.90 has rack-aware load balancing
- Elasticsearch has built-in index optimization; it doesn't really delete objects from the indexes when a user deletes data – the actual deletion can be postponed and batched for maximum performance
- github.com uses 'everything-over-HTTP+JSON', since Thrift is harder to debug and operate, albeit faster
- each repository lives in a single shard (they had 500 shards on EC2, but say that was overkill) – this speeds up queries by about 2 times
- split indexes as much as possible (a kind of sharding by time, but not exactly), archive old ones (compress them, merge them together and so on)
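The "self-healing" point above is worth spelling out: because every push unconditionally reindexes the files it touched, any update the indexer missed earlier gets repaired as a side effect. A minimal sketch of that idea, using a hypothetical in-memory index (not GitHub's actual code):

```python
# Sketch of a "self-healing" index: every push reindexes the touched files
# from their current on-disk content, so a stale or missed earlier update
# is silently overwritten and healed. All names here are hypothetical.

class SelfHealingIndex:
    def __init__(self):
        self.docs = {}  # path -> indexed content

    def on_push(self, touched_files):
        """touched_files: {path: current content of the file on disk}."""
        for path, content in touched_files.items():
            # Reindex even if we believe we are up to date: if a previous
            # update was lost, this overwrite catches up with it.
            self.docs[path] = content

    def search(self, term):
        return sorted(p for p, c in self.docs.items() if term in c)

idx = SelfHealingIndex()
idx.on_push({"a.py": "def foo(): pass"})
idx.docs["a.py"] = "stale junk"                 # simulate a missed update
idx.on_push({"a.py": "def foo(): return 1"})    # next push heals the entry
```

The key design point is that healing costs nothing extra: the reindex work is exactly the work a normal push would do anyway.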
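The one-shard-per-repository point can be sketched as routing: hash the repository id to pick a shard, so a per-repo query touches exactly one shard instead of fanning out to all of them. The hash function and shard count below are my assumptions, not GitHub's actual values:

```python
# Sketch of routing all of one repository's documents to a single shard.
# NUM_SHARDS and the md5-based routing are assumptions for illustration.
import hashlib

NUM_SHARDS = 8  # assumption; they ran ~500 shards on EC2 and called it overkill

def shard_for(repo_id: str) -> int:
    # Stable hash: every document of a repo always lands on the same shard.
    return int(hashlib.md5(repo_id.encode()).hexdigest(), 16) % NUM_SHARDS

shards = [[] for _ in range(NUM_SHARDS)]

def index_doc(repo_id, doc):
    shards[shard_for(repo_id)].append((repo_id, doc))

def search_repo(repo_id, term):
    # Only one shard is consulted for a per-repo query – this is where the
    # roughly 2x query speedup comes from.
    return [d for r, d in shards[shard_for(repo_id)] if r == repo_id and term in d]

for path in ("README.md", "main.c", "Makefile"):
    index_doc("linux", path)
```

A query like `search_repo("linux", "Make")` reads a single shard regardless of how many shards or repositories exist.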
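The delayed-deletion point is essentially tombstoning: a delete only marks the document, searches skip marked documents, and a batched "optimize" pass later rewrites storage without them. A toy sketch of that scheme (hypothetical names, not the Elasticsearch implementation):

```python
# Sketch of lazy deletion with batched purging: delete() is a cheap tombstone,
# optimize() is the expensive rewrite done off the hot path.

class LazyDeleteIndex:
    def __init__(self):
        self.docs = {}        # doc_id -> content
        self.deleted = set()  # tombstones: hidden from search, not yet purged

    def delete(self, doc_id):
        self.deleted.add(doc_id)  # no index rewrite here

    def search(self, term):
        return sorted(i for i, c in self.docs.items()
                      if i not in self.deleted and term in c)

    def optimize(self):
        # Batched purge: drop all tombstoned documents in one pass.
        for doc_id in self.deleted:
            self.docs.pop(doc_id, None)
        self.deleted.clear()

idx = LazyDeleteIndex()
idx.docs = {1: "hello world", 2: "hello there"}
idx.delete(2)       # searches stop seeing doc 2, but storage still holds it
idx.optimize()      # now storage actually shrinks
```

Postponing the purge is what makes deletes fast: the expensive rewrite runs once per batch instead of once per delete.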
I wrote this because Elliptics supports secondary indexes, which we would like to try on a similar setup – several billion records, many terabytes of data. Our friends use MongoDB for that, and it really sucks at it.
Our solution is not quite ready (we easily handle that amount of data, but we have to think about how elliptics indexes should properly handle that number of objects), but we are getting close – I do not know whether there will be production use in the near future, but there will at least be major testing for sure.
In the meantime my pet project Wookie – a search engine infrastructure on top of elliptics secondary indexes – got quite a bit of support: we have an HTTP+JSON interface, and/or/quote operators, various helper tools and so on. We are working on better lemmatization (Snowball stemming quite sucks for Russian, for example), spelling correction and so on.
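For readers unfamiliar with and/or operators in a search engine: they are just set operations over posting lists of document ids. A minimal sketch of the idea (this is not Wookie's actual code; the inverted index below is made up):

```python
# Sketch of and/or query operators over an inverted index: each term maps to
# the set of document ids containing it, and operators combine those sets.

postings = {              # hypothetical inverted index: term -> doc id set
    "free":  {1, 2, 3},
    "text":  {2, 3, 4},
    "index": {3, 5},
}

def AND(*terms):
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def OR(*terms):
    return set().union(*(postings.get(t, set()) for t in terms))
```

So `AND("free", "text")` intersects the two posting lists, while `OR("free", "index")` unions them; quoted phrases additionally need word positions, which this sketch omits.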
Not that I want to reinvent any modern search engine; instead, I have a completely different idea in mind of how to implement what is called 'relevance' in the search engine world.
This idea has to be checked in a real world.