Comparing various compression algorithms for particular project

We build a hobby search engine for one blog platform and there is a challenging problem of fitting the whole search index and archival dataset into one server node.

There are 1.3 billion of posts and comments and the whole uncompressed data takes about 2Tb.
We generate *many* indexes for very tricky and fast searches, we can not afford waiting for 20+ seconds to find all comments Alice made in Bob’s blog, which is a number of seconds Sphinx takes to solve this task by enumerating all comments and filtering them according to author/blog fields, instead we have to answer in a fraction of second.

Our algorithms produce about 3x of uncompressed data compared to input, and generally this should be compressed. 1x among those 3x is original text content (slightly reduced by supported language, dropped html markup and so on), the rest is a set of binary indexes where index content is actually timestamp-like ids. Although these timestamp-like ids are monotonically increasing, they are not fixed-interval timestamps and we can not use facebook’s Gorilla without serious modifications.

As a test we decided to check how common compression algorithms will work this out. We tried zlib, snappy, lz4, lz4hc and zstd. Test dataset was about 220MB uncompressed, we measured time our tool needed to index this text with exactly the same settings, it produced about 660MB of data which was compressed during the test at the storage level, size of the final database was reported. We did not test bz2, since its compression/decompression time is much higher and that’s unacceptable for realtime.

Here are the results
index_compaction_time_database_size

It is a bit controversial that zstd is comparable in speed with zlib but final database takes 10% more space.

6 thoughts on “Comparing various compression algorithms for particular project

  1. Yixi

    It’s really strange.
    According to your graph, lz4hc is faster than lz4.
    This is surprising. lz4 is supposed to be 20x faster.

    How sure are you of your speed results ?

    1. zbr Post author

      I tested it several times, these are the numbers.
      lz4 is never faster that much, number are about the same, but space saving is noticeably worse

  2. guest

    Which compression level of ZSTD have you used? It uses -1 (default) to -22, all have similar decoding time, but greatly differ in compression time and ratio. See some benchmarks, like https://github.com/inikep/lzbench
    Also, ZSTD has tools to build and use dictionary, which gives a big gain for small files.

  3. zbr Post author

    That was indeed -1 compression level for zstd and zlib, since setting 22 level for zstd ends up with *hours* of compaction and it hasn’t completed yet.
    Zlib -1/9 differs just a little bit: about 2% less database size.

    We use rocksdb and there is no simple option like ‘compress this file’, there are many operations involved in the whole procedure. That’s why it is ‘compression algorithms for particular project’ and now synthetic compression/decompression test.

  4. Unknown

    Zstd 22 is for precompressed data, where one wants to get absolutely better ratio within given data format, no matter what the cost is. That’s where Zstd tends to beat Zlib to dust in terms of ratio, yet decompression speed stays quite great.

    1. zbr Post author

      I checked zstd with -1 (default), 9, 21 and 22 levels. All the time it is either slower than zlib or slower and having worse compression ratio?

      The fasted zstd I checked was at level 5 – it was 20% faster than zlib9 (level9), but compression ratio was 5% worse.
      zstd4 was slower, probably it is within standard deviation, I only ran it 2-3 times.

      So far it looks like zstd is probably better for text and other perfectly compressable data, and noticeably worse for mostly binary.

      My data looks like this:
      157e0b148001b4ab.0
      158070938002ff7d.0
      1580a597400191ac.0
      1580f6b1c0003a6b.0
      15820c8ac002c69a.0
      1582b8d14002feb1.0
      1582bb28000142eb.0
      1582e5e300029852.0
      1582fe8dc0022d0d.0
      1584b6664002ff46.0
      1584eeb980003b5e.0
      1585fd1b80003b5d.0

      these are timestamps (34 bits) and increasing counter (another 30 bits).
      Timestamps are increasing, but with different offsets between any two.

Comments are closed.