
How harmful is the eventually consistent model at large scale? (spoiler: it isn’t)

Eventual consistency is considered a bad design choice, but sometimes it is impossible to avoid updates: one has to overwrite data and cannot always write under new keys.

How harmful could this be at large scale? At Facebook’s scale.
The paper “Measuring and Understanding Consistency at Facebook” measures all consistency anomalies in the Facebook TAO storage system, i.e. cases when results returned by the eventually consistent TAO differ from what stronger consistency models would allow.

Facebook TAO has quite sophisticated layers of caches and storage; a common update consists of 7 steps, each of which may end up in a temporary inconsistency.

And it turns out that an eventually consistent system, even at Facebook’s scale, is quite harmless – anomalies are somewhere at the noise level, like 5 out of a million requests violating linearizability, i.e. you overwrite data with new content but read back the older value.

You may also check the shorter gist describing how Facebook TAO works, how they measured consistency errors (by building an update graph for each request from a random selection of billions of Facebook updates and proving there are no loops) and the final results.
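The loop check above can be illustrated with a small sketch. This is my own toy model, not Facebook’s checker: each traced operation becomes a node, observed orderings become directed edges (“A must precede B”), and a cycle means no linearizable history can explain the trace.

```python
# Toy anomaly check: a cycle in the "must precede" graph of operations
# means the trace cannot be explained by any linearizable history.

def has_cycle(edges):
    """Detect a cycle in a directed graph given as {node: [successors]}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in edges.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY:          # back edge -> cycle found
                return True
            if c == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in edges)

# Consistent trace: write v1, then a read that saw v1, then write v2.
ok = {"w1": ["r1"], "r1": ["w2"], "w2": []}
# Anomalous trace: a stale read forces w2 before r1 AND r1 before w2.
bad = {"w1": ["r1"], "r1": ["w2"], "w2": ["r1"]}
print(has_cycle(ok), has_cycle(bad))   # False True
```

The real study derives the edges from versioned writes and the values reads actually returned; the graph test itself is this simple.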


Facebook time series storage

Time series databases are quite a niche product, but they are extremely useful in monitoring. I already wrote about the basic design of a time series database, but things are trickier than that.

Having something like HBase as a TS database is a good choice – it has small overhead, it is distributed, all data is sorted, HBase is designed for small records, it packs its sorted tables, it supports Hadoop (or vice versa if that matters) – there are many reasons why it is a great TS database. Until you are Facebook.

At their scale HBase is not capable of handling the monitoring and statistics write and read load. So Facebook created Gorilla – a fast, scalable, in-memory time series database.

Basically, it is a tricky cache on top of HBase, but it is not just a cache. Gorilla uses a very intelligent algorithm to compress 64-bit monitoring samples by a factor of 12.

This allows Facebook to store Gorilla’s data in memory, reducing query latency by 73x and improving query throughput by 14x when compared to a traditional HBase-backed time series database. This performance improvement has unlocked new monitoring and debugging tools, such as time series correlation search and denser visualization tools. Gorilla also gracefully handles failures from single nodes to entire regions with little to no operational overhead.

The design of the fault-tolerant part is rather straightforward, and Gorilla doesn’t care about consistency; moreover, there is no recovery of missing monitoring data. But after all, Gorilla is a cache in front of a persistent TS database like HBase.

Gorilla uses a sharding mechanism to deal with the write load. Shards are stored in memory and on disk in GlusterFS. Facebook uses its own Paxos-based ShardManager software to store the shard-to-host mapping. If some shards have failed, a read may return partial results – the client knows how to deal with this; in particular, it will automatically try to read the missing data from other replicas.
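The read path could look roughly like the sketch below. All names here are invented for illustration; the point is only the shape of it: a shard-to-host mapping decides where a metric lives, and a read that hits a failed host falls back to a replica instead of failing outright.

```python
import zlib

def shard_of(metric, nshards):
    # Stable hash (unlike Python's salted hash()) so the same metric
    # always lands on the same shard across processes.
    return zlib.crc32(metric.encode()) % nshards

def read_series(metric, mapping, replicas, down):
    """Hypothetical read path.

    mapping:  shard -> primary host
    replicas: shard -> list of fallback hosts
    down:     set of hosts currently considered failed
    """
    shard = shard_of(metric, len(mapping))
    for host in [mapping[shard]] + replicas.get(shard, []):
        if host not in down:
            return ("ok", host)
    # No live host for this shard: the caller tolerates partial results.
    return ("partial", None)

mapping, replicas = {0: "hostA"}, {0: ["hostB"]}
print(read_series("cpu.user", mapping, replicas, {"hostA"}))  # ('ok', 'hostB')
```

In the real system the mapping is maintained by ShardManager via Paxos; here it is just a dictionary.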

I personally love Gorilla for its XOR compression algorithm.


It is based on the idea that subsequent monitoring events generally do not differ in most of their bits – for example, CPU usage doesn’t jump from zero to 100% in a moment – so XORing two consecutive samples yields a lot of zeroes, which can be replaced with a small meta tag. Impressive.
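The core of the trick fits in a few lines. This is a minimal sketch of just the XOR step – the real encoder additionally bit-packs each delta as (leading-zero count, meaningful bits):

```python
import struct

def to_bits(value):
    """Reinterpret a float64 as its raw 64-bit integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]

def xor_deltas(samples):
    """XOR each sample's bit pattern with the previous one (Gorilla-style)."""
    bits = [to_bits(v) for v in samples]
    return [bits[0]] + [prev ^ cur for prev, cur in zip(bits, bits[1:])]

# Slowly changing samples XOR into mostly-zero words.
deltas = xor_deltas([42.0, 42.0, 42.5])
print(hex(deltas[1]))   # 0x0 -- identical samples need only a single bit
print(hex(deltas[2]))   # 0x400000000000 -- one meaningful bit left
```

An unchanged value XORs to zero and costs a single bit in the output stream; a small change leaves only a short run of meaningful bits surrounded by zeroes, which is exactly what gets that 12x compression on real monitoring data.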

The Gorilla paper is a must-read for monitoring developers: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf

Facebook’s in-memory computation engine for social analytics

Facebook’s scale is off the radar for the vast majority of us, but it is still very interesting to look for new ideas there.

Audience Insights is a way to show us the pages we are most likely to like. It is based on previously liked pages, gender, location and many other features.

At Facebook’s scale it is about 35 TB of raw data, and a query like ‘give me several pages which should be shown to user X’ must complete within hundreds of milliseconds. That requires processing hundreds of billions of likes and billions of pages – numbers beyond reasonable – in a fraction of a second.

It turns out that 168 nodes can handle it with some magic like columnar storage, in-memory data, bitmap indexes, GPUs and caches. Very interesting reading!
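To give a feel for why bitmap indexes make such queries fast, here is a toy sketch (all names invented, nothing from the actual system): one bitmap per page, with bit i set if user i liked the page; an audience filter then becomes a cheap bitwise AND.

```python
# Toy bitmap index: Python ints as arbitrary-length bitmaps.

def build_bitmaps(likes, npages):
    """likes: list of (user_id, page_id) pairs -> per-page bitmap ints."""
    bitmaps = [0] * npages
    for user, page in likes:
        bitmaps[page] |= 1 << user   # set bit `user` in page's bitmap
    return bitmaps

def audience(bitmaps, pages):
    """Users who liked ALL the given pages: intersect bitmaps with AND."""
    acc = ~0
    for p in pages:
        acc &= bitmaps[p]
    return [i for i in range(acc.bit_length()) if acc >> i & 1]

# Users 0 and 1 liked page 0; users 1 and 2 liked page 1.
bm = build_bitmaps([(0, 0), (1, 0), (1, 1), (2, 1)], npages=2)
print(audience(bm, [0, 1]))   # [1] -- only user 1 liked both pages
```

At real scale the bitmaps are compressed and the ANDs run over columnar in-memory data (or on GPUs), but the intersection-as-bitwise-AND idea is the same.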

Facebook’s hot, warm and graph data

Facebook is a maze of data storage and access algorithms. Their scale allows, and actually requires, finding new ways of storing information.

For example, social graph data: it is stored in MySQL and used to sit behind an aggressive memcached cache. Now Facebook uses TAO – a graph-aware storage for the social graph with its own graph cache.

Facebook uses 3 levels of persistent storage. Hot data is stored in the Haystack storage – its lowest level is the prototype I used for Eblob. I would place Elliptics storage with DHT, i.e. multiple servers storing one data replica, at this level of abstraction.

Second comes so-called WARM data – data which is accessed 2 orders of magnitude less frequently than data at the HOT level. Warm data is stored in the F4 storage; Facebook describes this split and its motivation in this article.
I saw an Elliptics installation with one disk per group, i.e. without DHT, as a storage for this access level of data.

And the last is COLD storage, something weird and scientifically crazy like a robotic hand with a stack of Blu-Ray disks.

Facebook’s “Cold storage” or saving data on Blu-Ray disks

Data has a natural life cycle – the newer the data, the more actively it is accessed (read or viewed). This is especially the case for digital content like pictures, movies, letters and messages. Facebook has found that the most recent 8% of user-generated content accounts for more than 80% of data access.

But old data still has to be kept for rare access. The servers which store the old data are virtually idle and kind of useless – the only thing they do most of the time is heat the air and eat electricity, which is quite pricey.

Facebook experiments with moving this old data (actually only a backup copy of the data) into a special “cold storage” which has two particular features: it can shut down already written disks, and data can be copied from those backup disks into a Blu-Ray rack, with a special robot handling the loading of Blu-Ray drives.

Besides the fact that a Blu-Ray disk is much more robust than a spinning disk (it is not afraid of water or dust, for example, and can store data for 50 years, until something new appears), the rack of disks doesn’t eat pricey electricity.


250 billion photos on Facebook

Quite an impressive number. I’m curious how many servers they use.
Facebook used to use a photo storage named Haystack; I based the Eblob design solely on the whitepaper of the proprietary Haystack design.

I removed the highest indirection level though – the one where key indexes can live on separate servers – and only left the in-memory and on-disk indexes.

That’s probably why the largest Elliptics installation by number of keys hosts only 50+ billion objects (counting all 3 copies though). And that is actually fewer than a hundred nodes (including all 3 replicas).