Arithmetic coding is a type of lossless compression, where more frequently used bytes are replaced with fewer bits than less frequent.

Arithmetic coding is commonly associated with Huffman coding, but it is not correct – Huffman coding splits input string into smaller and smaller chunks until chunk can be replaced with smaller entity. Usually Huffman encoding can not reach Shannon limit (theoretical compression limit for given data) for reasonable number of iterations, while arithmetic coding – which transfers the whole string into packed binary stream at once – approaches that limit much faster.

I work with Greylock search engine where there is a major task to implement full-text search index with as small disk footprint as possible. In our tests 200k of uncompressed input events occupy about 220Mb of disk space and whole index takes close to 450Mb. Index includes original content, common index and per-author index (this basically doubles the original index). Elastic search with morphology takes about 539Mb, without morphology – 317Mb, Sphinx (+ mysql to store original content) is about 250Mb, but this index is heavily truncated, there is still a notion of stop-words and morphology is broken.

We compress our binary index entries as well as original content using lz4hc or gzip, but I wondered whether arithmetic coding can do better than that. I skipped Huffman encoder because of its theoretical problems approaching Shannon limit (not talking about practical issues), and decided to check rANS entropy coder – this is an Asymmetric Numeral entropy Systems coder which combines speed of Huffman coding with compression rate of arithmetic coding. Well, at least that’s what its author says in the paper.

Here is an image from the paper showing Huffman coder approaching Shannon limit

It happend to be worse than any common compression both for binary and text data.

Text data from original library: 768771 bytes

bz2: 232598

gzip: 312281

lz4: 363236

rans: 435113

Binary data from our index: 9678

bz2: 3652

gzip: 3665

lz4: 5262

rans: 7795

I do not drop idea of using arithmetic coding to pack binary data since Dropbox showed with its Lepton project that jpeg images can be losslessly compressed by 20% using tricky DCT indexes sorting and prediction (about 3-4% among 20%) and VP8 arithmetic coding for these indexes.

Naive comparison of Lepton and rANS yields dramatic difference:

Original jpeg file sile: 71517

Lepton encoded size: 55700

gzip: 71171

bz2: 71586

lz4: 71414

rANS encoded size: 71336

This is of course unfair comparison, since Lepton encodes not binary steam of the original image, but specially crafted DCT indexes.

Since our indexes are basically vectors of document IDs which are in fact partially increasing timestamp and partially sequence number, it might be a good idea to use timestamp compression algorithm described in Facebook’s Gorilla timeseries database – I wrote about it previously, but that’s another story.

guestEntropy coding is just a part of data compressor – using it alone is usually not a good idea.

For general purpose data like Huffman it is usually used with some Lempel-Ziv, see for example ZSTD

https://github.com/Cyan4973/zstd

zbrPost authorBy ‘using entropy coding alone’ do you mean without “context” generation? Neither rANS library nor its paper describe in details steps to generate context, so I just copied what example code does, it contains words like ‘heavy brute force’ and the like – my results are quite the same with the test data as what author posted on github, so I believe I made the notion of the context correctly.

Maybe there are more tricks to generate probability distribution tables, which I haven’t used, so using arithmetic coder ended up with worse compression ratio than common data compressors, but your link shows similar notes like “dictionary generation” and “training time”, I believe this is the step named ‘context generation’ and it is present in the article.

guestPure (order 0) entropy coder just assumes uncorrelated sequence of symbols.

By not alone I’ve meant it is usually used with other transformation (like Lempel-Ziv) or prediction phase (like delta coding or intraprediction in image compression).

For DNA data they use pure order 1 rANS (probability distribution depends on context being the previous byte): https://github.com/jkbonfield/rans_static

For general data try ZSTD, maybe with blosc filters: http://blosc.org/blog/zstd-has-just-landed-in-blosc.html

For image you need a specialized compressor to get essential gain – which is able to use spatial correlations like Lepton.

zbrPost authorOrder1 rANS is noticeably better for text than o0, but yet worse than gzip and bzip. Binary data is even worse.

I also added zstd test, with the highest compression ratio it is quite better than gzip and even better than bz2 for binary index data.

Original size text file: 768771

rANS o0: 435577

rANS o1: 350798

gzip: 312281

bz2: 232598

lz4: 363236

zstd: 265660

Original binary index size: 9678

rANS o0: 6497

rANS o1: 7586 (worse!)

gzip: 3665

bz2: 3652

lz4: 5262

zstd: 3078

Probably binary index file is not large enough to get more data for context.

As of using Lempel-Ziv with rANS – doesn’t rans coder already pack bitstream, why would additional dictionary lookup make it smaller?

guesto1 rANS was worse in your case because of tiny file size – it stores 256 probability distributions in header.

Lempel-Ziv searches for repeated strings, like LZ4. Gzip, ZSTD are LZ + entropy coding. BZ2 additionally uses Burrows-Wheeler transform.

ANS is just faster way to do arithmetic coding. It is also considered by author of Lepton: https://github.com/dropbox/lepton/issues/24

zbrPost authorDo you mean, that o1 coder packs additional to o0 conditional probabilities for symbol->symbol transformations?

So, basically good generic compression would be ANS coder which ‘compacts’ subsequent bytes + second round where algorithm like LZ searches for the same ‘strings’ of bytes among these generated by ANS?

guestIndeed static entropy coders need to store the used probability distribution in the header (e.g. the one used in gzip, ZSTD) . For o1 rANS it is 256 distribution for 256 size alphabet…

This cost is avoided by adaptive entropy coders – updating/learning probability distribution on the go … but such updating is relatively costly.

Choosing/designing compressor is very data-dependent, the standard general purpose are usually LZ + entropy coder. Some use Burrows-Wheeler compressor, some add filters like delta decoding, byte/bit shuffling …

Generally see for example http://mattmahoney.net/dc/text.html

zbrPost authorWow, incredible link, thank you!

There is both zstd and rans+LZ, looks like zstd is closer to the state of the art in terms of space, and there is no other compressor who decompresses data faster.

James BonfieldZstd is pretty bullet proof with the author (Yann Collet, now working at Facebook) really making sure it’s industry strength with fuzzing and rigorous testing. It’s also pretty competitive and a decent general all-rounder. It’s built on top of the FSE library, but I think it mostly uses the FSE huffman implementation as this is faster than tANS (and rANS).

Regarding rANS o1; it is used in the CRAM format for lots of things where correlations are found without reptitions. It’s main win is in DNA quality values and not the DNA itself. (That’s delta encoded vs a known reference). For this order-1 does a super job. Order-2 is better, but the lookup table is generally too large for static encoding and it’s too slow to do a dynamic encoder. (We did it – eg see fqzcomp – but the size vs speed tradeoff isn’t worth it for most people).

I wouldn’t recommend using order-1 entropy encoders unless your data happens to fit well in that model. I think Brotli may have order-1 encoding of literals (ie the non-LZ matches), but I don’t know how much extra it gains after the LZ dups have been taken away.