Tag Archives: search

Comparing various compression algorithms for a particular project

We are building a hobby search engine for a blog platform, and one challenging problem is fitting the whole search index and the archival dataset into a single server node.

There are 1.3 billion posts and comments, and the whole uncompressed dataset takes about 2 TB.
We generate *many* indexes for very tricky and fast searches. We cannot afford to wait 20+ seconds to find all comments Alice made in Bob’s blog, which is roughly how long Sphinx takes to solve this task by enumerating all comments and filtering them by author/blog fields; we have to answer in a fraction of a second.

Our algorithms produce about 3x as much uncompressed data as the input, and generally it should be compressed. 1x of those 3x is the original text content (slightly reduced by keeping only supported languages, dropping HTML markup and so on); the rest is a set of binary indexes whose content is essentially timestamp-like ids. Although these timestamp-like ids are monotonically increasing, they are not fixed-interval timestamps, so we cannot use Facebook’s Gorilla without serious modifications.
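To illustrate the kind of packing such monotonic ids do allow, here is a minimal sketch of plain delta plus varint encoding. This is not our actual index format, just an example of the idea:

```python
# Minimal sketch: delta + varint packing of monotonically increasing ids.
# Not the real index format, just an illustration of why monotonic ids
# compress well even without fixed-interval assumptions.

def encode_ids(ids):
    """Encode a sorted list of ids as varint-packed deltas."""
    out = bytearray()
    prev = 0
    for value in ids:
        delta = value - prev      # small, since ids only grow
        prev = value
        while delta >= 0x80:      # LEB128-style varint
            out.append((delta & 0x7f) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)

def decode_ids(blob):
    """Inverse of encode_ids."""
    ids, prev, shift, delta = [], 0, 0, 0
    for byte in blob:
        delta |= (byte & 0x7f) << shift
        if byte & 0x80:           # continuation bit set, keep reading
            shift += 7
        else:
            prev += delta
            ids.append(prev)
            shift, delta = 0, 0
    return ids

print(decode_ids(encode_ids([1444000000, 1444000007, 1444000123])))
```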

As a test we decided to check how common compression algorithms handle this data. We tried zlib, snappy, lz4, lz4hc and zstd. The test dataset was about 220 MB uncompressed; we measured the time our tool needed to index this text with exactly the same settings, it produced about 660 MB of data which was compressed at the storage level during the test, and the size of the final database was reported. We did not test bz2, since its compression/decompression time is much higher, which is unacceptable for realtime use.
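For reference, a rough standalone harness along these lines might look as follows. It is not the actual test (that one compressed data at the storage level while indexing); the python-snappy, lz4 and zstandard packages, the compression levels and the dataset path are assumptions made for the sketch.

```python
import time
import zlib
import snappy      # pip install python-snappy
import lz4.frame   # pip install lz4
import zstandard   # pip install zstandard

# Hypothetical dump of the uncompressed index data.
data = open('dataset.bin', 'rb').read()

candidates = {
    'zlib':   lambda d: zlib.compress(d, 6),
    'snappy': snappy.compress,
    'lz4':    lambda d: lz4.frame.compress(d, compression_level=0),
    'lz4hc':  lambda d: lz4.frame.compress(d, compression_level=9),
    'zstd':   zstandard.ZstdCompressor(level=3).compress,
}

for name, compress in candidates.items():
    start = time.monotonic()
    compressed = compress(data)
    elapsed = time.monotonic() - start
    # Print compression time and the resulting size ratio.
    print(f'{name:7s} {elapsed:8.2f}s {len(compressed) / len(data):7.2%}')
```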

Here are the results:
[Figure: index compaction time and final database size for each compression algorithm]

It is a bit surprising that zstd is comparable in speed with zlib, yet the final database takes about 10% more space.

Greylock tutorial – distributed base search engine based on Elliptics

We’ve heavily updated the Reverbrain documentation pages at doc.reverbrain.com, and I’m pleased to present our distributed base search engine Greylock. The documentation includes generic information about the search engine and a tutorial covering the installation process, configs and two types of clients: a plain HTTP API (similar to what is expected from a base search engine like Greylock or ElasticSearch) and a Python client (it works via HTTP too, but also uses Consul to acquire mailbox locks). If you need a C++ tutorial, you can check the greylock test suite, which includes insertion/selection/removal as well as various iterators over data, self-recovery tests, statistics and other interesting bits.

I get a fair number of questions about how Greylock differs from ElasticSearch, Solr or Amazon Cloud Search, for instance. They all have an enormous number of features and work with large amounts of data, so what is the purpose?

And the answer is just two words: scalability and automation.
If you have worked with Elastic, you know which operations have to be performed to reshard a cluster when the current sharding scheme becomes a bottleneck (how long it takes, what the performance penalty is and how dangerous the process is). In an environment where new documents always arrive and space consumption grows over time, this resharding process has to be started again and again as new servers are added. At some point this becomes a serious issue.

With Greylock this is not needed at all. There is virtually no data or index movement when new servers are added, thanks to the Elliptics bucket system. This design has proved to work really well in Elliptics storage installations, where upload rates reach tens of terabytes daily, and that is only our clients’ data; there are other, considerably larger installations, for example at Yandex.

We concentrated on the scalability problem and solved it. And yet we do have a set of features. It is not comparable with Elastic, of course, even without counting the NLP tasks which we will release later (language models and spelling correction, for instance, for any language where you can find a reasonably large corpus). Greylock supports a basic relevance model based on the word distance between the client request and the words in the document.

Probably the two worst issues are the absence of numerical indexes and of client locks. Both were omitted deliberately. Numerical indexes break pagination: if you want 100 documents out of a million, you have to read them all into RAM, resort them either into numeric order or into lexical order (that is how document ids are stored in the inverted indexes), intersect the whole million of keys and return the first 100. For every subsequent request this has to be done again and again. Without numerics, pagination works with iterators pointing into the inverted indexes only; the whole index (with its millions of entries) is never read, only some pages are accessed sequentially.
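To illustrate the difference, here is a toy sketch (not Greylock’s actual data layout) of a streaming intersection over posting lists that are already stored in the same sorted order; the first page of results is produced without materializing the full lists:

```python
def intersect(postings_a, postings_b, limit):
    """Yield up to `limit` ids present in both sorted posting iterators."""
    it_a, it_b = iter(postings_a), iter(postings_b)
    a, b = next(it_a, None), next(it_b, None)
    produced = 0
    while a is not None and b is not None and produced < limit:
        if a == b:
            yield a
            produced += 1
            a, b = next(it_a, None), next(it_b, None)
        elif a < b:
            a = next(it_a, None)    # advance only the lagging iterator
        else:
            b = next(it_b, None)

# First "page" of documents containing both terms; the lists are never
# fully read once the page is filled.
alice = [3, 8, 15, 21, 42, 77, 100]
bob   = [8, 9, 21, 42, 55, 100, 101]
print(list(intersect(alice, bob, limit=3)))   # [8, 21, 42]
```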

To help with numerics, Greylock supports document timestamps, i.e. a single 64-bit number per document ID which defines the sorting order of the inverted indexes. Of course this is not a replacement for a proper numeric index, but it does cover almost all of our use cases.

The second major issue is consistency and client locking. Greylock, like Elliptics, is not a strictly consistent storage. With Greylock things are even worse: the amount of data overwritten by a single client insert can be large, and the index structure (which originally started as a distributed B+/*-tree) does not tolerate broken pages. Elastic and others implement a consistency model (like Raft, Paxos or ZAB) internally; Greylock doesn’t. That is why we require clients to acquire locks in some other system, such as Consul, Etcd or ZooKeeper, to work properly. Our tutorial shows a basic locking scheme implemented with the strictly consistent Consul key-value storage.
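The Python tutorial client follows the usual Consul session/acquire pattern. A minimal sketch of the idea, assuming the python-consul package and a made-up lock key (the tutorial’s actual scheme may differ):

```python
import consul

c = consul.Consul()                                    # local Consul agent
session = c.session.create(ttl=30, behavior='delete')  # lock dies with session

lock_key = 'greylock/locks/index-shard-0'              # hypothetical key name
if c.kv.put(lock_key, 'writer-1', acquire=session):
    try:
        pass  # safe to run the read-modify-write index update here
    finally:
        c.kv.put(lock_key, 'writer-1', release=session)
else:
    pass      # someone else holds the lock: back off and retry

c.session.destroy(session)
```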

We have major plans for the Greylock distributed search engine, so expect new features, and give it a try: http://doc.reverbrain.com/greylock:greylock
If you have any questions, you are welcome:
Google group: https://groups.google.com/forum/?fromgroups=#!forum/reverbrain,
this site and comments.

Secondary indexes and write-cas

It has been a while since I last wrote, and an enormous number of changes have been made.

Let’s start with write-cas and secondary indexes.
Write-cas stands for write-compare-and-swap, which means the server performs your write only if your cookie matches what it already has. We use the data checksum as the cookie, but that can be changed if needed.

CAS semantics were introduced to solve the problem of multiple clients updating the same key. In some cases you may want to update a complex structure without losing changes made by others.

Since we do not have distributed locks, a client performs the read-modify-write loop without holding any lock, which results in a race against other clients. CAS allows us to detect that the data was changed after we last read it and that our modification would overwrite those changes.

In this case the client gets an error and performs the read-modify-write loop again. There are convenient functions for using write-cas: one takes a functor which just modifies the data (the read and write are hidden), and a low-level method requires the checksum and the new data; if it fails, it is up to the client to read the data again, apply its changes and write it once more.
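A minimal sketch of the low-level variant; the storage.read and storage.write_cas names below are placeholders for the real elliptics calls:

```python
import hashlib

def cas_update(storage, key, modify, retries=10):
    """Read-modify-write with a checksum cookie, retried on CAS conflicts."""
    for _ in range(retries):
        old = storage.read(key)
        cookie = hashlib.sha512(old).digest()  # checksum of what we read
        new = modify(old)
        # The server applies the write only if its current checksum
        # still matches the cookie; otherwise someone got there first.
        if storage.write_cas(key, new, cookie):
            return new
    raise RuntimeError('too many CAS conflicts for key {!r}'.format(key))
```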

The first write-cas client is elliptics’ secondary indexes. You may attach ‘tags’ or ‘indexes’, along with data, to any key you want. This means that after the secondary index update has completed, you will find your key under that ‘tag’ or ‘index’.

For example, a tag can be a daily timestamp; in this case you will get a list of all keys ‘changed’ on a given day.

Of course you can find all keys which match multiple indexes; this is a kind of logical ‘AND’ operation. Our API returns all keys found under all requested tags/indexes.
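As a toy in-memory model of the idea (not the elliptics API itself):

```python
from collections import defaultdict

indexes = defaultdict(set)          # tag/index name -> set of keys

def update_indexes(key, tags):
    """Attach the given tags to a key."""
    for tag in tags:
        indexes[tag].add(key)

def find_all_indexes(tags):
    """Return keys present under every requested tag (logical AND)."""
    sets = [indexes[tag] for tag in tags]
    return set.intersection(*sets) if sets else set()

update_indexes('post:1', ['2013-07-01', 'author:alice'])
update_indexes('post:2', ['2013-07-01', 'author:bob'])
print(find_all_indexes(['2013-07-01', 'author:alice']))   # {'post:1'}
```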

All indexes (as well as any keys) may live in different namespaces.

So far secondary indexes are implemented on pure write-cas semantics, and our rather simple realtime search engine Wookie uses them to update its reverse indexes.
And that doesn’t scale :)

We found that having multiple clients compete to update the same popular index results in so many read-modify-write cycles that the whole system becomes network bound. And we want to be CPU bound instead.

To fix this issue we will change secondary indexes to be invoked from the server-side engine. All server-side operations are atomic by default, so this will eliminate the races between multiple clients.

Stay tuned – I will write more about Wookie soon.
After all, we want to implement a baby (really-really baby) Watson with it :)