Reverbrain is a company we started to provide support and solutions for distributed storage and realtime computation problems.
We created a whole stack of technologies, ranging from the lowest level of data storage up to a high-level processing pipeline.
Time moves on, and we see how complex it is to deal with massively increasing amounts of data. We provide solutions for true horizontal scaling at reasonable system complexity. Our storage appliances cover everything from hosting billions of small-to-medium objects up to huge data streaming systems.
Your data should not just lie archived in a distributed system: realtime pipeline processing is our vision of how data has to be handled. We use these technologies ourselves and want to provide the best experience for our customers.
Facebook’s scale is off the radar for the vast majority of use cases, but it is still very interesting to look for new ideas there.
Audience Insights is a way to show us pages which we are likely to like. It is based on previously liked pages, gender, location and many other features.
At Facebook’s scale this means about 35 TB of raw data, and the query ‘give me several pages which should be shown to user X’ must complete within hundreds of milliseconds. That requires processing hundreds of billions of likes and billions of pages, numbers beyond reasonable, in a fraction of a second.
It turns out that 168 nodes can handle it with some magic like columnar storage, in-memory data, bitmap indexes, GPUs and caches. Very interesting reading!
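To get a feel for why bitmap indexes make this kind of query cheap, here is a toy Go sketch (entirely my own illustration under an assumed data layout, not how Facebook actually stores likes): each page keeps a bitmap over the ids of users who liked it, similar bitmaps exist for demographic features, and candidate filtering becomes a bitwise AND over those bitmaps, which scans memory linearly.

```go
package main

import "fmt"

// Bitmap is a toy uncompressed bitmap over user ids. A real system would
// use compressed bitmaps (roaring, EWAH, ...); this only shows the idea.
type Bitmap []uint64

func NewBitmap(maxUsers int) Bitmap { return make(Bitmap, (maxUsers+63)/64) }

func (b Bitmap) Set(userID int)       { b[userID/64] |= 1 << uint(userID%64) }
func (b Bitmap) Test(userID int) bool { return b[userID/64]&(1<<uint(userID%64)) != 0 }

// And intersects two bitmaps, e.g. "liked page A" AND "lives in city B".
func And(x, y Bitmap) Bitmap {
	out := make(Bitmap, len(x))
	for i := range x {
		out[i] = x[i] & y[i]
	}
	return out
}

func main() {
	likedPage := NewBitmap(1024)  // users who liked some page
	inLocation := NewBitmap(1024) // users from a given location

	likedPage.Set(3)
	likedPage.Set(42)
	inLocation.Set(42)

	candidates := And(likedPage, inLocation)
	fmt.Println(candidates.Test(42), candidates.Test(3)) // true false
}
```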
Facebook is a playground for developing data storage and access algorithms. Their scale allows, and actually requires, finding new ways of storing information.
Facebook uses three levels of persistent storage. Hot data is stored in Haystack; its lowest level is the prototype I used for Eblob. I would place Elliptics with DHT, i.e. multiple servers storing one data replica, at this level of abstraction.
Second comes so-called WARM data, which is accessed two orders of magnitude less frequently than data at the HOT level. Warm data is stored in F4 storage; Facebook describes this split and its motivation in this article.
I see an Elliptics installation with one disk per group, i.e. without DHT, as storage for this access level of the data.
And the last is COLD storage, something weird and scientifically crazy like a robotic arm handling a stack of Blu-ray discs.
Data shows that Facebook’s Flashcache performs best. Swrve decided to trade safety for performance and started to use a RAID-0 stripe of EBS volumes plus an SSD cache in front of them.
Here is the result:
Y axis is events processed per second, higher is better; X axis is time.
- single 200GB EBS volume (blue)
- 4 x 50GB EBS volumes in RAID-0 stripes (red)
- 4 x 50GB EBS volumes in RAID-0 stripes, with an enhanceio cache device (green)
EnhanceIO is a fork of Facebook’s Flashcache with nicer tools. On their workload it increases performance by about 25% on its own, while splitting the EBS volume into a RAID-0 stripe added another 25%.
The Swrve link above contains more details on the benchmark and other results.
Amazon has announced that DynamoDB (the grandmother of all NoSQL databases) extended its free tier (up to 25 GB) and added native JSON document support.
Native JSON document support is basically the ability to get/set not the whole document, but only its parts addressed by an internal key. Nothing is said about how such an update is performed in a distributed fashion, in particular when multiple copies are being updated in parallel, but DynamoDB doesn’t support realtime replication across multiple regions, and internally the database likely uses a master-slave replication scheme.
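Conceptually, ‘get/set a part of the document by an internal key’ boils down to something like the following Go sketch (a hand-rolled illustration of path-addressed updates; it is not DynamoDB’s API and says nothing about how such an update would be replicated):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// setPath updates a single nested field of a JSON document,
// addressed by a dotted path like "address.city", leaving the
// rest of the document untouched.
func setPath(doc map[string]interface{}, path string, value interface{}) {
	keys := strings.Split(path, ".")
	m := doc
	for _, k := range keys[:len(keys)-1] {
		child, ok := m[k].(map[string]interface{})
		if !ok {
			child = map[string]interface{}{}
			m[k] = child
		}
		m = child
	}
	m[keys[len(keys)-1]] = value
}

func main() {
	raw := []byte(`{"id": "user-1", "address": {"city": "Moscow", "zip": "101000"}}`)

	var doc map[string]interface{}
	if err := json.Unmarshal(raw, &doc); err != nil {
		panic(err)
	}

	// Only the nested field changes.
	setPath(doc, "address.city", "Berlin")

	out, _ := json.Marshal(doc)
	fmt.Println(string(out))
}
```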
I wrote this here to ask whether such document-oriented databases are really in demand. I know about CouchDB (and derivatives), MongoDB and others, but I’m quite interested in whether and why DynamoDB wants to compete in this arena. I could add native JSON support to Elliptics, but I’m not sure it is really needed.
After all, there are quite a few document-oriented databases for rather small keys, and elliptics could compete there too, but it works better with medium-to-large objects, with streaming, distributed replication and parallel upload.
Data has a natural life cycle: the newer the data, the more actively it is accessed (read or viewed). This is especially the case for digital content like pictures, movies, letters and messages. Facebook has found that the most recent 8% of user-generated content accounts for more than 80% of data accesses.
But old data still has to be kept for rare access. The servers which store it are virtually idle and kind of useless: most of the time the only thing they do is heat the air and consume electricity, which is quite pricey.
Facebook is experimenting with moving this old data (actually only a backup copy of it) into a special “Cold storage” which has two particular features: it can shut down disks that have already been written, and data can be copied from those backup disks onto a rack of Blu-ray discs, with a special robot handling the loading of the Blu-ray drives.
Besides the fact that a Blu-ray disc is much more robust than a spinning disk (it is not afraid of water or dust, for example, and can store data for 50 years, until something newer appears), the rack of discs doesn’t consume pricey electricity.
I want to use containers to run multiple elliptics nodes of different versions on the same server, mainly to be able to quickly switch between them.
I decided to run Vagrant on top of VirtualBox, since I need on-demand compilation of new package versions and cannot just wrap some application into a container.
Running Vagrant on top of Fedora 19 fully succeeded; now I have several environments to experiment with.
But the main goal was to be able to create ‘a box’ and push it to remote nodes to build a network of different elliptics versions, and that part is rather hard with Vagrant. First, I didn’t find a way to export a Vagrant box to a local file instead of its cloud. I could do that with VirtualBox images, though.
Creating my own box, when I actually need a default one with several packages installed, is rather overkill I think.
But here comes the second problem: the latest Vagrant and the latest VirtualBox do not work on various Debians. Installing the latest VirtualBox on a default wheezy machine ended up with Vagrant timing out. Running it on top of a backported 3.14 kernel doesn’t work at all.
It worked so well on my development server, and failed miserably on the testing machines.
Given that I could not easily find out how to export updated boxes via files, I decided to switch to something else. In particular, I’m thinking about Docker, although it is quite different: for example, I do not know how to update an image when there is a new version in the repository, or even how to build it inside the image and see the errors I could fix in place… Docker is more about packing an already cooked-up application with all its configs and tunings, while I need this for development.
Here is a human-readable changelog of the 2.26 major version of the elliptics distributed storage: http://doc.reverbrain.com/elliptics:major-version-updates#v226
The main feature is multiple backends in a single server. One can turn them on and off and change their state, and each backend has its own IO execution pool. Basically it allows replacing the old scheme of having many elliptics servers, one per disk/directory/mountpoint, with just one server with multiple backends.
This greatly simplifies node setup and heavily decreases route table updates.
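A rough way to picture ‘one server, many backends, each with its own IO pool’ is the Go sketch below (a purely conceptual illustration, not elliptics code or its configuration format): the server routes each request to its backend, and every backend drains its own queue with a dedicated pool of workers, so a slow disk only stalls its own backend.

```go
package main

import (
	"fmt"
	"sync"
)

// Request is a toy IO request routed to a single backend.
type Request struct {
	BackendID int
	Key       string
}

// Backend owns its queue and its own pool of IO workers,
// so one slow backend does not block the others.
type Backend struct {
	id    int
	queue chan Request
}

func NewBackend(id, workers int, wg *sync.WaitGroup) *Backend {
	b := &Backend{id: id, queue: make(chan Request, 128)}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range b.queue {
				// A real backend would hit its own disk/eblob here.
				fmt.Printf("backend %d handled key %q\n", b.id, req.Key)
			}
		}()
	}
	return b
}

func main() {
	var wg sync.WaitGroup

	// One server process hosting several backends instead of
	// one process per disk/mountpoint.
	backends := map[int]*Backend{
		1: NewBackend(1, 4, &wg),
		2: NewBackend(2, 4, &wg),
	}

	for i := 0; i < 8; i++ {
		id := 1 + i%2
		backends[id].queue <- Request{BackendID: id, Key: fmt.Sprintf("key-%d", i)}
	}

	for _, b := range backends {
		close(b.queue)
	}
	wg.Wait()
}
```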
We also added Blackhole, a new structured C++ logger. One can send logs to Elasticsearch, syslog, or use old-school files.
We also cleaned up the code and client logic, introduced new kinds of errors, simplified the protocol and fixed a bunch of bugs.
Enjoy and stay tuned!
It turns out that Twitter not only forked and extended a year-old Redis version, but apparently has no plans to upgrade. Redis and its latencies are much, much faster than Twitter’s infrastructure written in Java, because of GC in the JVM. This allows putting a bunch of proxies on top of the Redis caching cluster to do cluster management, the thing Redis has been missing for a while.
Also, Twitter uses Redis only to cache data: it doesn’t care about consistency issues and doesn’t use persistent caching; at least the article says data is thrown away when a server goes offline.
It is the client’s responsibility to read data from disk storage if there is no data in the cache.
The article describes Twitter’s timeline architecture, and that’s quite weird to me: instead of having a list of semi-fixed (or size-limited) chunks of the timeline which are loaded on demand, they created a bunch of realtime-updated structures in Redis, ran into non-trivial consistency issues and eventually ended up with the same simple approach of storing ‘chunks’ of the timeline in cache.
I started to compare Twitter’s cache management with Redis to what we have at Reverbrain for caching: our Elliptics SLRU cache. It is a persistent caching system (persistent caching was also described a bit in the article, in comparison with memcache): persistent storage backs the cache, and while the cache itself is a segmented LRU, its backing store can be arbitrarily large compared to Redis.
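For reference, segmented LRU roughly works as follows. Below is a minimal Go sketch of the idea only (not the actual elliptics cache code, which runs on the server and is backed by disk): new keys land in a probationary segment, a second hit promotes them to a protected segment, and eviction starts from the probationary tail, so frequently used items survive one-off scans.

```go
package main

import (
	"container/list"
	"fmt"
)

// slru is a toy segmented LRU: keys start in the probationary segment
// and are promoted to the protected segment on a second hit.
type slru struct {
	capProb, capProt int
	prob, prot       *list.List               // MRU at front
	items            map[string]*list.Element // key -> element
	inProt           map[string]bool
}

func newSLRU(capProb, capProt int) *slru {
	return &slru{
		capProb: capProb, capProt: capProt,
		prob: list.New(), prot: list.New(),
		items:  map[string]*list.Element{},
		inProt: map[string]bool{},
	}
}

func (c *slru) Get(key string) bool {
	el, ok := c.items[key]
	if !ok {
		return false
	}
	if c.inProt[key] {
		c.prot.MoveToFront(el) // already protected, just refresh
		return true
	}
	// Second hit: promote from probationary to protected.
	c.prob.Remove(el)
	c.items[key] = c.prot.PushFront(key)
	c.inProt[key] = true
	if c.prot.Len() > c.capProt {
		// Demote the protected tail back to probationary.
		tail := c.prot.Back()
		k := tail.Value.(string)
		c.prot.Remove(tail)
		c.items[k] = c.prob.PushFront(k)
		c.inProt[k] = false
	}
	return true
}

func (c *slru) Put(key string) {
	if c.Get(key) {
		return // existing key: Get already refreshed/promoted it
	}
	c.items[key] = c.prob.PushFront(key)
	c.inProt[key] = false
	if c.prob.Len() > c.capProb {
		// Evict from the probationary tail first.
		tail := c.prob.Back()
		k := tail.Value.(string)
		c.prob.Remove(tail)
		delete(c.items, k)
		delete(c.inProt, k)
	}
}

func main() {
	cache := newSLRU(2, 2)
	cache.Put("a")
	cache.Get("a") // second hit promotes "a" to the protected segment
	cache.Put("b")
	cache.Put("c")
	cache.Put("d") // one-off keys push "b" out of probationary, "a" stays
	fmt.Println(cache.Get("a"), cache.Get("b")) // true false
}
```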
Although the article is written as a ‘set of facts’ somewhat cut out of context (it was an interview with a Twitter employee), it is a good read to get you thinking about caching, the JVM, Redis and cache cluster architecture.
There is also the elliptics HTTP proxy Rift.
I like golang because of its static type system, garbage collection and built-in lightweight threading model. Let’s test its HTTP proxying capabilities against an Elliptics node. I already tested the Elliptics cache with the native C++ client alone; it showed an impressive 2 million requests per second from 10 nodes, or about 200-220 krps per node using the native API (very small requests, up to 100 bytes). What would the HTTP proxying numbers be?
First, I ran a single-client, single-Rift-proxy, single-elliptics-node test. After some tuning I got 23 krps for random writes of 1k-5k bytes per request (a very realistic load). I tested two cases: the elliptics node and the Rift server on the same machine, and on different physical servers. Maximum latencies at the 98th percentile were about 25 ms at the end of the test (around 23 krps) and 0-3 ms at 18 krps, not counting the rare spikes on the graph below.
Second, I tested a simple golang HTTP proxy with the same setup: a single elliptics node, a single proxy node and the Yandex Tank benchmark tool.
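The proxy under test was essentially a skeleton of the following shape (a minimal sketch with a placeholder upstream address; the real code talked to elliptics and handled its own routes, which is not shown here):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Placeholder upstream; in the benchmark this role was played by the
	// storage-facing side of the proxy rather than a plain HTTP target.
	upstream, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		log.Fatal(err)
	}

	// httputil.ReverseProxy reuses upstream connections and serves every
	// request in its own goroutine, which is why such a proxy is so short
	// to write compared to a callback-based C++ version.
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	http.Handle("/", proxy)
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```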
I ran tests using the following setups: golang 1.2 with GC=100 and GC=off, and golang 1.3 with the same garbage collection settings. The results are impressive: without garbage collection (GC=off) the golang 1.3 test ran with the same RPS and latencies as the native C++ client, although the proxy ate 90+ GB of RAM. Golang 1.2 showed 20% worse numbers.
Turning garbage collection on with the GC=100 setting led to much worse results than the native C++ client, but it is still quite impressive. I got the same RPS numbers of about 23 krps for this test, but latencies at 20 krps were close to 80-100 ms, and about 20-40 ms in the middle of the test. Golang 1.2 showed 30-50% worse results here.
The numbers are not bad at all for a single-node setup. Writing asynchronous parallel code in Golang is incredibly simpler than doing it in C++ with its forest of callbacks, so I will stick with Golang for async network code for now. I will wait for Rust to stabilize, though.
Google is automatically building its next-generation knowledge graph, named Knowledge Vault.
Although the article is very pop-science (not science at all, actually) and doesn’t contain any technical detail, it is clear about Google’s idea and the direction information retrieval systems are heading. Automatic knowledge gathering and fact extraction is also what I originally aimed at with Reverbrain, although my idea was much simpler: I wanted to automatically build a language model and fact relations between words to understand natural language questions.
On Aug 25 there will be a presentation of Google’s Knowledge Vault; I’m very tempted to see it and try to gather and understand bits of information on how it is implemented inside.
Update: a paper on Knowledge Vault: Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion