Facebook time series storage

Time series databases are quite a niche product, but they are extremely useful in monitoring. I already wrote about the basic design of a time series database, but things are trickier than that.

Something like HBase is a good choice for a TS database – it has little overhead, it is distributed, all data is sorted, it is designed for small records, it packs its sorted tables, and it supports Hadoop (or vice versa, if that matters) – there are many reasons why it makes a great TS database. Until you are Facebook.

At their scale HBase is not capable of handling the monitoring and statistics read and write load, so Facebook created Gorilla – a fast, scalable, in-memory time series database.

Basically, it is a tricky cache on top of HBase, but it is not just a cache: Gorilla uses a very intelligent algorithm to pack 64-bit monitoring samples by a factor of 12.

This allows Facebook to store Gorilla's data in memory, reducing query latency by 73x and improving query throughput by 14x compared to a traditional HBase-backed time series store. This performance improvement has unlocked new monitoring and debugging tools, such as time series correlation search and denser visualization tools. Gorilla also gracefully handles failures, from a single node to entire regions, with little to no operational overhead.

The design of the fault-tolerance part is rather straightforward: Gorilla doesn't care about consistency, and, more than that, there is no way to recover missing monitoring data. But after all, Gorilla is a cache in front of a persistent TS database like HBase.

Gorilla uses sharding to deal with the write load. Shards are stored in memory and on disk in GlusterFS. Facebook uses its own Paxos-based ShardManager software to store the shard-to-host mapping. If some shards have failed, a read may return partial results – the client knows how to deal with that; in particular, it will automatically try to read the missing data from other replicas.
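
To illustrate the read path, here is a minimal Python sketch of masking failed shards with replica failover; the shard count, hash placement, and replica API are my assumptions for illustration, not Facebook's actual ShardManager protocol:

```python
def shard_of(key: str, num_shards: int = 64) -> int:
    """Map a time series key to a shard (toy hash placement)."""
    return hash(key) % num_shards

def read_points(key, replicas):
    """Try each replica in turn; flag the result as partial if all fail."""
    shard = shard_of(key)
    for replica in replicas:               # e.g. [region_a, region_b]
        points = replica.read(shard, key)  # assumed API: returns None on failure
        if points is not None:
            return points, False           # complete result
    return [], True                        # partial result, caller decides
```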

I personally love Gorilla for its XOR compression algorithm.

[Figure: Gorilla's XOR compression scheme]

It is based on the observation that successive monitoring values generally differ only in a few bits – CPU usage, for example, doesn't jump from zero to 100% in an instant – so XORing two adjacent values yields mostly zeroes, which can be replaced with a short meta tag. Impressive.
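
To make the trick concrete, here is a minimal Python sketch of the XOR delta computation. It is a simplification – the real Gorilla encoder emits a compact bit stream with control bits, not tuples – but it shows where the compression comes from:

```python
import struct

def float_to_bits(value: float) -> int:
    # Reinterpret a float64 as a raw 64-bit integer.
    return struct.unpack('<Q', struct.pack('<d', value))[0]

def xor_deltas(values):
    """Yield (leading_zeros, meaningful_bits) for each consecutive XOR."""
    prev = float_to_bits(values[0])
    for v in values[1:]:
        bits = float_to_bits(v)
        delta = prev ^ bits
        prev = bits
        if delta == 0:
            yield (64, 0)  # identical value: a single '0' bit in Gorilla
            continue
        leading = 64 - delta.bit_length()             # leading zeroes
        trailing = (delta & -delta).bit_length() - 1  # trailing zeroes
        yield (leading, delta >> trailing)            # meaningful bits only

# Successive samples of a slowly changing metric differ in few bits,
# so most deltas shrink to a handful of meaningful bits.
print(list(xor_deltas([42.0, 42.0, 42.5, 43.0])))
```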

The Gorilla article is a must-read for monitoring developers: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf

Designing a time series database

Akumuli explains how they designed their time series database.

Basically, it is a set of write-ahead-log files preallocated at the first write. The header contains time indexes, the footer hosts the actual data, and they both grow towards each other. Data chunks and/or indexes can be compressed.
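
Here is a toy Python sketch of that layout; the index record format and sizes are my assumptions, not Akumuli's actual on-disk format:

```python
import struct

class Volume:
    """A preallocated file where the time index grows from the head
    and data chunks grow from the tail, toward each other."""

    def __init__(self, path: str, size: int = 1 << 20):
        self.f = open(path, 'w+b')
        self.f.truncate(size)   # preallocate at the initial write
        self.head = 0           # next index entry (grows forward)
        self.tail = size        # start of last data chunk (grows backward)

    def append(self, timestamp: int, payload: bytes) -> bool:
        # Index record: (timestamp, offset of the data chunk).
        entry = struct.pack('<QQ', timestamp, self.tail - len(payload))
        if self.head + len(entry) > self.tail - len(payload):
            return False        # volume is full: head met tail
        self.tail -= len(payload)
        self.f.seek(self.tail)
        self.f.write(payload)
        self.f.seek(self.head)
        self.f.write(entry)
        self.head += len(entry)
        return True
```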

This is a nice design, but it only works for time series, or for keys which by their nature increase with each new record. This trick makes it possible to binary-search the keys stored in increasing order at the beginning of the WAL file, and to select the needed file with O(1) disk lookups (or even a single disk lookup, if a startup process reads the key ranges of every file into memory).
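
A small Python sketch of that in-memory variant: read the first key of every file once at startup, then bisect the list to pick the file that covers a query timestamp (names and the volume attributes are illustrative):

```python
import bisect

class VolumeSet:
    """Pick the right WAL file without touching disk."""

    def __init__(self, volumes):
        # Each volume exposes start_ts, read once at startup (assumption).
        self.volumes = sorted(volumes, key=lambda v: v.start_ts)
        self.starts = [v.start_ts for v in self.volumes]

    def volume_for(self, ts: int):
        """Return the single volume whose key range can contain ts."""
        i = bisect.bisect_right(self.starts, ts) - 1
        return self.volumes[i] if i >= 0 else None
```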

If a new key suddenly turns out to be less than a previously written key, the system would have to rearrange the already written keys, which requires copying data. Alternatively, it could keep a key->data-offset mapping in memory, which doesn't scale for time series databases that usually contain billions of records. There is a middle ground somewhere where a production system would work, but it is *much* simpler to just forbid writing data with random keys. If one is afraid of time skew in the cluster, it is possible to buffer keys (and sort them in memory) before storing them into the database, as sketched below.
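
A minimal sketch of such a buffer, assuming a fixed skew window: points are held in a min-heap and flushed in timestamp order once they are older than the window, so the WAL always sees increasing keys. The window size is an illustrative parameter.

```python
import heapq

class SortingBuffer:
    """Absorb small clock skew before an append-only writer."""

    def __init__(self, write, window: float = 60.0):
        self.heap = []        # (timestamp, value) pairs, min-heap
        self.write = write    # downstream append-only writer
        self.window = window  # seconds of tolerated skew (assumption)

    def push(self, ts: float, value, now: float):
        heapq.heappush(self.heap, (ts, value))
        # Flush everything old enough that no earlier point can arrive.
        while self.heap and self.heap[0][0] <= now - self.window:
            self.write(*heapq.heappop(self.heap))
```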