We are building a hobby search engine for one blog platform, and one challenging problem is fitting the whole search index and archival dataset onto a single server node.
There are 1.3 billion posts and comments, and the uncompressed data takes about 2 TB.
We generate *many* indexes to support searches that are both tricky and fast. We cannot afford to wait 20+ seconds to find all the comments Alice made in Bob’s blog, which is how long Sphinx takes when it solves this task by enumerating all comments and filtering them by the author/blog fields. We have to answer in a fraction of a second.
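To make the difference between the two approaches concrete, here is a minimal sketch that assumes an in-memory list of comments. The `Comment` type and the `scan` and `build_index` helpers are hypothetical names used only for illustration; they are not our real index format and not how Sphinx is implemented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Comment(NamedTuple):
    # Hypothetical record layout, for illustration only.
    comment_id: int
    author: str
    blog: str


def scan(comments: Iterable[Comment], author: str, blog: str) -> List[int]:
    """Enumerate every comment and filter by the author/blog fields."""
    return [c.comment_id for c in comments if c.author == author and c.blog == blog]


def build_index(comments: Iterable[Comment]) -> Dict[Tuple[str, str], List[int]]:
    """Precompute one list of comment ids per (author, blog) pair."""
    index: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for c in comments:
        index[(c.author, c.blog)].append(c.comment_id)
    return dict(index)


if __name__ == "__main__":
    data = [
        Comment(1, "alice", "bob"),
        Comment(2, "carol", "bob"),
        Comment(3, "alice", "bob"),
        Comment(4, "alice", "dave"),
    ]
    index = build_index(data)
    # A single dictionary lookup replaces the full scan.
    print(index.get(("alice", "bob"), []))
```

The scan touches every comment on each query, while the lookup cost does not depend on the total number of comments. The price is the extra index data that has to be stored.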
Our algorithms produce about 3x as much uncompressed data as the input, and this generally should be compressed. Of those 3x, 1x is the original text content (slightly reduced by restricting it to supported languages, dropping HTML markup and so on); the rest is a set of binary indexes whose content is actually timestamp-like ids. Although these timestamp-like ids are monotonically increasing, they are not fixed-interval timestamps, so we cannot use Facebook’s Gorilla without serious modifications.
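As an illustration of why monotonically increasing ids are still a good compression target even with irregular gaps, here is a minimal sketch of delta encoding combined with a base-128 varint. This is neither Gorilla nor the encoding our indexes use; the function names and the sample ids are assumptions made up for the example.

```python
from typing import Iterable, List


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a little-endian base-128 varint."""
    if value < 0:
        raise ValueError("varint encoding needs a non-negative integer")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_ids(ids: Iterable[int]) -> bytes:
    """Store the first id in full, then only the gap to the previous id."""
    out = bytearray()
    previous = 0
    for current in ids:
        if current < previous:
            raise ValueError("ids must be non-decreasing")
        out += encode_varint(current - previous)
        previous = current
    return bytes(out)


def decode_ids(data: bytes) -> List[int]:
    """Reverse encode_ids: read varints and accumulate the gaps."""
    ids: List[int] = []
    previous = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += value
            ids.append(previous)
            value = 0
            shift = 0
    return ids


if __name__ == "__main__":
    # Made-up timestamp-like ids with irregular gaps.
    sample = [1_500_000_000_000, 1_500_000_000_017, 1_500_000_000_950]
    packed = encode_ids(sample)
    print(len(packed), "bytes instead of", 8 * len(sample))
    print(decode_ids(packed) == sample)
```

Small gaps fit in one or two bytes regardless of how large the ids themselves are, and the gaps do not need to be regular for this to work.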
As a test, we decided to check how common compression algorithms handle this data. We tried zlib, snappy, lz4, lz4hc and zstd. The test dataset was about 220 MB uncompressed. We measured the time our tool needed to index this text with exactly the same settings for each algorithm; indexing produced about 660 MB of data, which was compressed at the storage level during the test, and we recorded the size of the final database. We did not test bz2, since its compression and decompression times are much higher, which is unacceptable for realtime use.
It is somewhat surprising that zstd is comparable in speed to zlib, yet the final database takes 10% more space.
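For readers who want to try a similar comparison on their own data, here is a minimal standalone sketch of a timing harness. It is not the tool used for the test above, which compressed at the storage level during indexing. It uses only zlib from the Python standard library at a few levels; snappy, lz4 and zstd need third-party bindings, which could be added to the `candidates` dictionary in the same way. The `measure` and `compare` helpers and the sample payload are hypothetical.

```python
import time
import zlib
from typing import Callable, Dict, Tuple


def measure(compress: Callable[[bytes], bytes], payload: bytes) -> Tuple[float, float]:
    """Return (seconds spent, compressed size / original size).

    The payload must be non-empty, otherwise the ratio is undefined.
    """
    started = time.perf_counter()
    compressed = compress(payload)
    elapsed = time.perf_counter() - started
    return elapsed, len(compressed) / len(payload)


def compare(payload: bytes) -> Dict[str, Tuple[float, float]]:
    """Run every candidate compressor over the same payload."""
    candidates: Dict[str, Callable[[bytes], bytes]] = {
        "zlib level 1": lambda data: zlib.compress(data, 1),
        "zlib level 6": lambda data: zlib.compress(data, 6),
        "zlib level 9": lambda data: zlib.compress(data, 9),
    }
    return {name: measure(fn, payload) for name, fn in candidates.items()}


if __name__ == "__main__":
    # Made-up, highly repetitive sample; real index data will behave differently.
    sample = b"Alice commented in Bob's blog. " * 50_000
    for name, (seconds, ratio) in compare(sample).items():
        print(f"{name}: {seconds:.4f} s, ratio {ratio:.3f}")
```

Timings from a harness like this depend heavily on the payload and the machine, so they are only meaningful when every candidate sees exactly the same input.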