
reverbrain.com is up and running

Reverbrain is a company we started to provide support and solutions for distributed storage and realtime computation problems.

We created the whole stack of technologies, ranging from the lowest level of data storage up to high-level processing pipelines.

Time moves on, and we see how complex it is to deal with massively increasing amounts of data. We provide solutions for true horizontal scaling at fair system complexity. Our storage appliances cover everything from hosting billions of small-to-medium objects up to huge data streaming systems.

Your data should not just lie archived in a distributed system — realtime pipeline processing is our vision of how data has to be handled. We use these technologies ourselves and want to provide the best experience for our customers.

Backrunner – next-generation HTTP proxy for Elliptics distributed storage

Elliptics is a powerful distributed storage system for medium and large data, but it is rather low-level. It doesn’t know about ACLs or a REST API, for example; I would compare it to the block level in the Linux filesystem hierarchy. In particular, Elliptics only provides C/C++, Python and Golang API bindings.

For the vast majority of users an HTTP REST API is a must, so we created Backrunner – a new Swiss-army-knife HTTP proxy for Elliptics distributed storage. It supports ACLs, automatic bucket selection based on disk and network speed, error rates and the amount of free space, automatic defragmentation, header extensions and local static file handling, and provides a simple REST API for clients.
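
As a rough illustration, here is what talking to such a proxy can look like from the client side. This is only a sketch: the /upload and /get handler names, the bucket query parameter and the port are assumptions made for the example, not Backrunner's documented API (see the documentation link below for that).

package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    proxy := "http://localhost:9090" // assumed Backrunner endpoint

    // Upload an object; the handler name and bucket parameter are hypothetical.
    resp, err := http.Post(proxy+"/upload/example.txt?bucket=testns",
        "application/octet-stream", bytes.NewBufferString("hello elliptics"))
    if err != nil {
        panic(err)
    }
    resp.Body.Close()

    // Read the object back through the proxy.
    resp, err = http.Get(proxy + "/get/example.txt?bucket=testns")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Printf("status: %s, body: %q\n", resp.Status, body)
}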

We call Backrunner the entry point to Elliptics distributed storage. It not only provides the externally visible interfaces, but also takes on many administrative tasks, like running defragmentation, exposing properly crafted monitoring stats, and so on.

Backrunner’s load balancing operates in real time: for example, it gathers upload metrics (speed, latency, errors) on every request to tune the algorithm that places data across the cluster. It also takes into account the amount of free space, disk activity, internal errors, timings observed by other clients, network speed and many other metrics.
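
Here is a minimal sketch of that idea, assuming a simplified per-bucket statistics record; the scoring formula is invented for the example, and the real Backrunner combines many more metrics:

package main

import "fmt"

// Bucket is a simplified stand-in for per-bucket upload statistics.
type Bucket struct {
    Name      string
    FreeSpace float64 // fraction of free space, 0..1
    AvgLatMs  float64 // average upload latency, milliseconds
    ErrorRate float64 // fraction of failed requests, 0..1
}

// weight scores a bucket: more free space is better, higher latency and
// error rate are worse. The formula is an assumption for the sketch.
func weight(b Bucket) float64 {
    return b.FreeSpace / ((1 + b.AvgLatMs) * (1 + 10*b.ErrorRate))
}

// pickBucket selects the highest-scoring bucket for the next upload.
func pickBucket(buckets []Bucket) Bucket {
    best := buckets[0]
    for _, b := range buckets[1:] {
        if weight(b) > weight(best) {
            best = b
        }
    }
    return best
}

func main() {
    buckets := []Bucket{
        {"b1", 0.9, 12, 0.00},
        {"b2", 0.4, 5, 0.01},
        {"b3", 0.7, 40, 0.10},
    }
    fmt.Println("selected:", pickBucket(buckets).Name)
}

Since the metrics are refreshed on every request, the scores — and thus the placement decisions — follow the cluster state in real time.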

We will extend it to run basic recovery operations. Right now Backrunner detects that replicas are out of sync, but does not run recovery, because that would likely affect timings heavily, which is generally a bad idea. That’s why Elliptics is an eventually consistent system – the price we pay for the highest possible levels of scalability.

Backrunner is also distributed as Docker images: https://registry.hub.docker.com/u/reverbrain/backrunner/
Documentation: http://doc.reverbrain.com/backrunner:backrunner

A tutorial is coming, stay tuned!

Facebook’s in-memory computation engine for social analytics

Facebook’s scale is off the radar for the vast majority of us, but it is still very interesting to look for new ideas there.

Audience Insights is a way to show us the pages we are most likely to like. It is based on previously liked pages, gender, location and many other features.

In particular, at Facebook’s scale this means about 35 TB of raw data, and a query like ‘give me several pages which should be shown to user X’ must complete within hundreds of milliseconds. That requires processing hundreds of billions of likes and billions of pages – numbers beyond reasonable – in a fraction of a second.

It turns out that 168 nodes can handle it, with some magic like columnar storage, in-memory data, bitmap indexes, GPUs and caches. Very interesting reading!

Facebook’s hot, warm and graph data

Facebook is a proving ground for data storage and access algorithms. Their scale allows – and actually requires – finding new ways of storing information.

Take social graph data, for example: it is stored in MySQL and used to rely on an aggressive memcached caching layer. Now Facebook uses TAO – a graph-aware store for the social graph with its own graph cache.

Facebook uses three levels of persistent storage. Hot data is stored in Haystack – its lowest level is the prototype I used for Eblob. I would place Elliptics with DHT, i.e. multiple servers storing one data replica, at this level of abstraction.

Second comes so-called warm data – data which is accessed two orders of magnitude less frequently than at the hot level. Warm data is stored in the F4 storage; Facebook describes this split and its motivation in this article.
I see an Elliptics installation with one disk per group, i.e. without DHT, as a storage for this access level.

And last comes cold storage – something weird and scientifically crazy, like a robotic arm with a stack of Blu-ray disks.

Various flash caches benchmarked

The guys from Swrve have benchmarked various flash caches in front of an Amazon EBS volume and shared the results.

The data shows that Facebook’s Flashcache performs best. Swrve decided to trade safety for performance and started to use a RAID-0 stripe over EBS volumes plus an SSD cache in front of them.

Here is the result: the Y axis is events processed per second (higher is better); the X axis is time.

  • single 200GB EBS volume (blue)
  • 4 x 50GB EBS volumes in RAID-0 stripes (red)
  • 4 x 50GB EBS volumes in RAID-0 stripes, with an enhanceio cache device (green)

EnhanceIO is a fork of Facebook’s Flashcache with nicer tools. By itself it increases performance by about 25% in their workload, while splitting the EBS volume into RAID-0 added another 25%.

The Swrve link above contains more details on the benchmark and other results.

Document-oriented databases

Amazon has announced that DynamoDB (the grandmother of all NoSQL databases) extended its free tier (up to 25 GB) and added native JSON document support.

Native JSON document support is basically the ability to get/set not the whole document, but only parts of it, addressed by an internal key. Nothing is said about how such updates are performed in a distributed fashion, in particular when multiple copies are updated in parallel; but DynamoDB doesn’t support realtime replication across multiple regions, and internally the database likely uses a master-slave replication scheme.
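
To make the ‘parts addressed by an internal key’ idea concrete, here is a toy sketch of updating a sub-document by key path. It only illustrates the client-visible semantics; it says nothing about how DynamoDB implements this internally.

package main

import (
    "encoding/json"
    "fmt"
)

// setPath walks a decoded JSON document and replaces the value at the
// given key path, creating intermediate objects as needed.
func setPath(doc map[string]any, path []string, value any) {
    for _, key := range path[:len(path)-1] {
        child, ok := doc[key].(map[string]any)
        if !ok {
            child = map[string]any{}
            doc[key] = child
        }
        doc = child
    }
    doc[path[len(path)-1]] = value
}

func main() {
    var doc map[string]any
    if err := json.Unmarshal([]byte(`{"user":{"name":"alice","likes":10}}`), &doc); err != nil {
        panic(err)
    }

    // Update only user.likes instead of rewriting the whole document.
    setPath(doc, []string{"user", "likes"}, 11)

    out, _ := json.Marshal(doc)
    fmt.Println(string(out)) // {"user":{"likes":11,"name":"alice"}}
}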

I wrote this to ask whether such document-oriented databases are in demand. I know about CouchDB (and derivatives), MongoDB and others, but I’m quite interested in whether and why DynamoDB wants to compete in this arena. I could add native JSON support to Elliptics, but I’m not sure it is really needed.

After all, there are quite a few document-oriented databases for rather small keys. Elliptics could compete there too, but it works better with medium-to-large objects, with streaming, distributed replication and parallel upload.

Facebook’s “Cold storage” or saving data on Blu-Ray disks

Data has a natural life cycle – the newer the data, the more actively it is accessed (read or viewed). This is especially the case for digital content like pictures, movies, letters and messages. Facebook has found that the most recent 8% of user-generated content accounts for more than 80% of data access.

But the old data still has to be kept for rare access. The servers which store it are virtually idle and kind of useless – the only thing they do most of the time is heat the air and consume electricity, which is quite pricey.

Facebook is experimenting with moving this old data (actually only a backup copy of it) into a special “cold storage” which has two particular features: it can shut down disks that have already been written, and data can be copied from those backup disks into a rack of Blu-ray disks, with a special robot handling the loading of the Blu-ray drives.

Besides the fact that a Blu-ray disk is much more robust than spinning disks (it isn’t afraid of water or dust, for example, and can store data for 50 years, until something new appears), a rack of disks doesn’t consume pricey electricity.

http://money.cnn.com/2014/08/21/technology/facebook-blu-ray/index.html

A short story about trying to use Vagrant to put an elliptics node into a container

I want to use containers to run multiple elliptics nodes of different versions on the same server, mainly to be able to switch between them quickly.

I decided to run Vagrant on top of VirtualBox, since I need on-demand compilation of new package versions and cannot just wrap some application into a container.
Running Vagrant on top of Fedora 19 fully succeeded; I now have several environments to experiment with.

But the main goal was to be able to create ‘a box’ and push it to remote nodes to build a network of different elliptics versions, and that part is rather hard with Vagrant. First, I didn’t find a way to export a Vagrant box into a local file instead of its cloud, although I could do that with VirtualBox images.
And creating my own box, when I actually need the default one with several packages installed, is rather overkill, I think.

But then comes the second problem: the latest Vagrant and the latest VirtualBox do not work on various Debians. Installing the latest VirtualBox on a default Wheezy machine ended up with Vagrant timing out. Running it on top of a backported 3.14 kernel doesn’t work at all.

It worked so well on my development server, and failed miserably on the testing machines.
Given that I also could not easily figure out how to export updated boxes as files, I decided to switch to something else. In particular, I’m thinking about Docker, although it is quite different: for example, I do not know how to update an image when a new version appears in the repository, or how to build it inside the image and see the errors so I can fix them in place. Docker is more about packing an already cooked-up application with all its configs and tunings, while I need this for development.

Elliptics 2.26 changelog

Here is a human-readable changelog of 2.26 major version of elliptics distributed storage: http://doc.reverbrain.com/elliptics:major-version-updates#v226

The main feature is multiple backends in a single server. Backends can be turned on and off and have their state changed, and each backend has its own IO execution pool. Basically, this replaces the old scheme of having many elliptics servers, one per disk/directory/mountpoint, with just one server with multiple backends.
This greatly simplifies node setup and heavily reduces route table updates.

We also added Blackhole, a new structured C++ logger. One can send logs into Elasticsearch or syslog, or use old-school files.

We also cleaned up the code and the client logic, introduced new kinds of errors, simplified the protocol and fixed a bunch of bugs.

Enjoy and stay tuned!

How Redis is used at Twitter

Everyone knows Redis, the high-performance persistent caching system. Here is an article on how Redis is used at Twitter.

It turns out that Twitter not only forked and extended a year-old Redis version, but apparently has no plans to upgrade. Redis and its latencies are much, much faster than Twitter’s infrastructure written in Java, because of GC in the JVM. This allows them to put a bunch of proxies on top of the Redis caching cluster to do cluster management – the thing Redis has been missing for a while.

Also, Twitter uses Redis only to cache data: it doesn’t care about consistency issues and doesn’t use persistent caching; at least the article says data is thrown away when a server goes offline.
It is the client’s responsibility to read data from disk storage if it is not in the cache.

The article describes Twitter’s timeline architecture, and it looks quite strange to me: instead of having a list of semi-fixed (or size-limited) chunks of the timeline loaded on demand, they created a bunch of realtime-updated structures in Redis, ran into non-trivial consistency issues, and eventually ended up with the same simple approach of storing ‘chunks’ of the timeline in the cache.

I started comparing Twitter’s Redis-based cache management with what we have at Reverbrain: the Elliptics SLRU cache. It is a persistent caching system (something the article also touches on in comparison with memcached) that uses persistent storage to back the cache, and while the cache itself is a segmented LRU, its backing store can be arbitrarily large, unlike Redis.
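
For reference, the core of a segmented LRU is quite small: new keys enter a probationary segment, and a repeated hit promotes them to a protected segment, so one-shot reads cannot flush hot entries. Below is a minimal single-threaded sketch of that policy only, assuming new keys on every Put; the real Elliptics cache is more elaborate and spills evicted data to the persistent backing store rather than dropping it.

package main

import (
    "container/list"
    "fmt"
)

// slru is a two-segment LRU: probation holds newly inserted keys,
// protected holds keys that were hit at least once more.
type slru struct {
    capacity             int
    probation, protected *list.List
    items                map[string]*list.Element
}

type entry struct {
    key, val  string
    protected bool
}

func newSLRU(capacity int) *slru {
    return &slru{
        capacity:  capacity,
        probation: list.New(),
        protected: list.New(),
        items:     map[string]*list.Element{},
    }
}

func (c *slru) Get(key string) (string, bool) {
    el, ok := c.items[key]
    if !ok {
        return "", false
    }
    e := el.Value.(*entry)
    if e.protected {
        c.protected.MoveToFront(el)
        return e.val, true
    }
    // Promote from probation to protected on a repeated access.
    c.probation.Remove(el)
    e.protected = true
    c.items[key] = c.protected.PushFront(e)
    return e.val, true
}

func (c *slru) Put(key, val string) {
    if c.probation.Len()+c.protected.Len() >= c.capacity {
        // Evict from probation first; a persistent backing store
        // would receive the evicted entry here instead.
        victim := c.probation.Back()
        if victim == nil {
            victim = c.protected.Back()
        }
        e := victim.Value.(*entry)
        if e.protected {
            c.protected.Remove(victim)
        } else {
            c.probation.Remove(victim)
        }
        delete(c.items, e.key)
    }
    c.items[key] = c.probation.PushFront(&entry{key: key, val: val})
}

func main() {
    c := newSLRU(2)
    c.Put("a", "1")
    c.Get("a")      // promotes "a" to the protected segment
    c.Put("b", "2")
    c.Put("c", "3") // evicts "b" from probation; "a" survives
    _, okA := c.Get("a")
    _, okB := c.Get("b")
    fmt.Println("a cached:", okA, "b cached:", okB) // a cached: true, b cached: false
}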

Although the article is written as a ‘set of facts’ somewhat cut out of context (it was an interview with a Twitter employee), it is a good read for thinking about caching, the JVM, Redis and cache cluster architecture.

Elliptics, golang, GC and performance

Elliptics distributed storage has a native C/C++ client API, as well as Python (bundled with the elliptics sources) and Golang bindings.

There is also Rift, the elliptics HTTP proxy.

I like Golang because of its static type system, garbage collection and built-in lightweight threading model. Let’s test its HTTP proxying capabilities against an Elliptics node. I have already tested the Elliptics cache with the native C++ client: it showed an impressive 2 million requests per second from 10 nodes, or about 200-220 krps per node, using the native API (very small requests, up to 100 bytes). What would the HTTP proxying numbers be?

First, I ran a single-client, single-Rift-proxy, single-elliptics-node test. After some tuning I got 23 krps for random writes of 1k-5k bytes per request (a very realistic load). I tested two cases: the elliptics node and the Rift server on the same machine, and on different physical servers. 98th-percentile latencies were about 25 ms at the end of the test (around 23 krps) and 0-3 ms at 18 krps, not counting the rare spikes on the graph below.

Graph: Rift HTTP proxy writing data into elliptics cache, 1k-5k bytes per request

Second, I tested a simple golang HTTP proxy with the same setup – single elliptics node, single proxy node and Yandex Tank benchmark tool.
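
The proxy itself is only a few dozen lines. Here is a minimal sketch of its shape, with the elliptics Go binding replaced by an in-memory map, since the point here is the HTTP side (the real test wrote into the elliptics cache instead):

package main

import (
    "io"
    "log"
    "net/http"
)

// store stands in for the elliptics Go binding; the real proxy would
// call the binding's write/read session methods here.
var store = map[string][]byte{}

func handler(w http.ResponseWriter, r *http.Request) {
    key := r.URL.Path[1:]
    switch r.Method {
    case http.MethodPost, http.MethodPut:
        data, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        store[key] = data // not goroutine-safe; a sketch only
    case http.MethodGet:
        data, ok := store[key]
        if !ok {
            http.NotFound(w, r)
            return
        }
        w.Write(data)
    }
}

func main() {
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

The garbage collection settings below correspond to Go’s GOGC environment variable: GOGC=100 is the default, and GOGC=off disables collection entirely.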

I ran tests with the following setups: golang 1.2 with GC=100 and GC=off, and golang 1.3 with the same garbage collection settings. The results are impressive: with garbage collection turned off (GC=off) the golang 1.3 test ran with the same RPS and latencies as the native C++ client, although the proxy ate 90+ GB of RAM. Golang 1.2 showed 20% worse numbers.

Graph: Golang HTTP proxy (garbage collection turned off) writing data into elliptics cache, 1k-5k bytes per request

Turning garbage collection on with the GC=100 setting led to much worse results than the native C++ client, but they are still quite impressive. I got the same RPS numbers of about 23 krps, but latencies at 20 krps were close to 80-100 ms, and about 20-40 ms in the middle of the test. Golang 1.2 showed 30-50% worse results here.

Graph: Golang HTTP proxy (GC=100 garbage collection setting) writing data into elliptics cache, 1k-5k bytes per request

The numbers are not that bad for a single-node setup. Writing asynchronous parallel code in Golang is incomparably simpler than in C++ with its forest of callbacks, so I will stick to Golang for async network code for now. I will wait for Rust to stabilize, though.

Crafting a knowledge base the right way

Google is automatically building its next-generation knowledge graph, named Knowledge Vault.

Although the article is very pop-science (not science at all, actually) and doesn’t contain any technical details, it is clear about Google’s idea and the direction information retrieval systems are heading. Automatic knowledge gathering and fact extraction is also what I originally aimed at with Reverbrain, although my idea was much simpler – I wanted to automatically build a language model and fact relations between words in order to understand natural-language questions.

On Aug 25 there will be a presentation of Google’s Knowledge Vault. I’m sorely tempted to see it and to try to gather and understand bits of information on how it is implemented inside.

Update: a paper on Knowledge Vault: Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion

Elliptics 2.25 changelog

We are proud to announce the changelog of our current production major version 2.25, in human-readable form (not thousands of commits or several hundred lines of packaging changelogs): http://doc.reverbrain.com/elliptics:major-version-updates#v225

Elliptics 2.25 is now in ‘support’ mode, i.e. we do not add new features which can break things, but do fix bugs and perform various cleanups.

Multi Data Center Replication in Cassandra

A good entry-level article on how replication (including multi-datacenter) is implemented in Cassandra.

The basic idea is random remote node selection (IIRC there is a weighted round-robin algorithm too): the selected node behaves like local storage (commit log + memory table) and acts as a ‘proxy’ which asynchronously sends data to the remote replicas. Depending on the strategy, those remote replicas can be updated synchronously and can live in a different datacenter.
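
A toy sketch of the coordinator idea: send the update to all replicas in parallel, acknowledge after the requested number of them confirm, and let the rest complete asynchronously. The names are invented; this shows the general shape, not Cassandra’s actual code.

package main

import (
    "fmt"
    "sync"
    "time"
)

// replica stands in for a Cassandra node, possibly in another datacenter.
type replica struct {
    name string
    mu   sync.Mutex
    data map[string]string
}

func (r *replica) write(key, val string) {
    time.Sleep(10 * time.Millisecond) // simulated network latency
    r.mu.Lock()
    defer r.mu.Unlock()
    r.data[key] = val
}

// coordinatorWrite sends the update to every replica in parallel and
// returns once `consistency` of them have acknowledged; the remaining
// writes finish in the background, i.e. asynchronously.
func coordinatorWrite(replicas []*replica, key, val string, consistency int) {
    acks := make(chan string, len(replicas))
    for _, r := range replicas {
        go func(r *replica) {
            r.write(key, val)
            acks <- r.name
        }(r)
    }
    for i := 0; i < consistency; i++ {
        fmt.Println("ack from", <-acks)
    }
}

func main() {
    replicas := []*replica{
        {name: "dc1-a", data: map[string]string{}},
        {name: "dc1-b", data: map[string]string{}},
        {name: "dc2-a", data: map[string]string{}}, // remote datacenter
    }
    coordinatorWrite(replicas, "user:1", "alice", 2) // quorum-like: wait for 2 of 3
    time.Sleep(50 * time.Millisecond)                // let the async write land
}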

It seems good, except that rack awareness and the complexity of the setup are not covered. Synchronous update latencies – writing data to one node and then propagating the update to the others – as well as local queue management are not covered at all.
But it’s a good start.

Machine learning, optimization and event stream management

This is a bit of a different post – it is not about storage per se, although in the end it is.

Let me start a bit from the other side – I’ve read the Netflix article about how they created their excellent recommendation service ( http://techblog.netflix.com/2012/06/netflix-recommendations-beyond-5-stars.html ), and it raised a question in my head: is there a way for a system to heal itself if it can learn?

Well, machine learning is not quite ‘learning’ – it is an explicitly defined mathematical problem – but I wonder whether it can be applied to optimization problems.
And to simplify things, let’s move away from storage to robots.

Consider a simple scenario: a robot moves along a road to its final destination point. It has a number of sensors which tell its control system about speed, the road edge, current coordinates, wheel state and so on. And it has two driven wheels.

At some point the robot’s right wheel drops into a hole. The robot starts tipping over, and in a few moments it will fall.
How do we fix this problem?

There are numerous robot control algorithms which very roughly say ‘if sensor X says Y, do Z’, and their parameters may vary and be tuned by the management optimization.
But what if we step back to a more generic solution – what if we have no such algorithms, only the ability to change wheel rotation and perform some other actions whose effect on the current situation we do not know in advance?

A solution to this problem would solve a storage problem too: if a storage control system can write to multiple disks, which one should it select so that read and write performance is maximized, space is used efficiently, and so on?

A bad solution uses heuristics: if disk A is slow at the moment, use disk B; if the robot falls to the right, rotate the right wheel; and so on. It doesn’t really work in practice.

Another solution – a naive one – is to ‘try every possible action’: rotate the right wheel; if things get worse, stop it. Rotate the left wheel and check the sensors; if the situation changed, react accordingly. Combine all possible controls (at random or in some order) and adjust each control’s ‘weight’; hopefully this will fix things (see the sketch below). But when the feature space is very large, i.e. we have many controls, this solution doesn’t scale.
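
To make the naive scheme concrete, here is a sketch: perturb a random control, keep the change if the satisfaction function improved, roll it back otherwise. The satisfaction function here is a toy stand-in; with many controls and a noisy, drifting objective, this kind of random search is exactly what stops scaling.

package main

import (
    "fmt"
    "math/rand"
)

// satisfaction is the function we want to maximize; a toy stand-in
// with its optimum at controls (0.3, -0.5).
func satisfaction(c []float64) float64 {
    return -(c[0]-0.3)*(c[0]-0.3) - (c[1]+0.5)*(c[1]+0.5)
}

func main() {
    controls := []float64{0, 0} // e.g. the two wheel rotation speeds
    best := satisfaction(controls)

    for step := 0; step < 1000; step++ {
        i := rand.Intn(len(controls))
        delta := (rand.Float64() - 0.5) * 0.2

        controls[i] += delta // try a random control adjustment
        if s := satisfaction(controls); s > best {
            best = s // improvement: keep the change
        } else {
            controls[i] -= delta // worse: roll it back
        }
    }
    fmt.Printf("controls=%.2f satisfaction=%.3f\n", controls, best)
}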

So, what is a generic way to solve such a problem?
Or, to put it a bit more mathematically: we have a ‘satisfaction’ function of several variables (we move in the given direction, all sensors are green), and suddenly something outside our control happens which degrades the satisfaction function badly (the sensors turn red). How do we use our variables (wheels and other controls) to return things to normal?

Mesa – Google’s analytic data warehousing system

I will bring back analyses of various tech articles in this blog.

Let’s start with Mesa, Google’s AdWords storage and query system. Although the article is rather self-promoting and a little pompous, it still shows some very interesting details on how to design systems that must scale to trillions of operations a day.

In particular, data is versioned and never overwritten; old versions are immutable. Queries (aka transactions) operate on a given version, and versions are asynchronously propagated across Google’s datacenters worldwide. The only synchronous part is the versioning database: the committer updates/pins a new version once the update transaction criteria have been met, e.g. a given number of datacenters have a copy. As usual, it is Paxos – it looks like everything at Google uses Paxos, not Raft. Recovery, data movement and garbage collection tasks are also performed asynchronously. Since the only synchronous operation is the version update, the committer (the updating client) can write heavy data into the storage in parallel and only then ‘attach’ a version to those chunks.

Data is stored in (table, key, value) format, and keys are split into chunks; within each chunk, data is stored in a compressed columnar format. Well, Mesa uses LevelDB, which explains and simplifies such things.

Mesa uses very interesting versioning: besides so-called singletons (single-version updates), there are also ranges like [0, v0], [v1, v5], [v6, vN] named ‘deltas’, which are basically batches of key updates with the appropriate versions. Asynchronous processes update the version ranges: compaction into [0, vX] and creation of new batches. For example, to run a query at version 7 in the example above, one could ‘join’ the deltas [0, v0] and [v1, v5] and apply the singletons v6 and v7. Async processes also precompute various ‘common’ version ranges to speed up querying.
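
Here is a toy sketch of how such a delta scheme can answer queries; in Mesa the values are additive aggregates, so a read at version N merges every delta and singleton contained in [0, N]. The layout below is invented for the example and much simpler than the real system.

package main

import "fmt"

// delta is a batch of key updates covering an inclusive version range,
// as in Mesa's [v1, v5]-style deltas; values are additive aggregates.
type delta struct {
    from, to int
    rows     map[string]int64
}

// queryAt answers a read at the given version by merging every delta
// (a singleton has from == to) fully contained in [0, version].
func queryAt(deltas []delta, version int, key string) int64 {
    var sum int64
    for _, d := range deltas {
        if d.to <= version {
            sum += d.rows[key]
        }
    }
    return sum
}

func main() {
    deltas := []delta{
        {0, 5, map[string]int64{"clicks:ad1": 100}}, // compacted delta [0, v5]
        {6, 6, map[string]int64{"clicks:ad1": 7}},   // singleton v6
        {7, 7, map[string]int64{"clicks:ad1": 3}},   // singleton v7
    }
    fmt.Println("clicks at v7:", queryAt(deltas, 7, "clicks:ad1")) // 110
    fmt.Println("clicks at v6:", queryAt(deltas, 6, "clicks:ad1")) // 107
}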

Because of versioning Mesa doesn’t need locks between query and update processing – both can be run heavily in parallel.

Here is an article: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42851.pdf

Data recovery in Elliptics

As you know, Elliptics includes a built-in utility for data synchronization, both within one replica and between a number of replicas. In this article I describe the different modes in which synchronization can be run, how it works, and the situations in which it can be used.

dnet_recovery is the main utility for data synchronization in Elliptics. It is distributed with the elliptics-client package. dnet_recovery supports two modes: `merge` and `dc`. `merge` is intended for data synchronization within one replica, and `dc` for synchronization between a number of replicas. Each mode can automatically search for keys that need synchronization or take keys from a dump file.

Most of what is described below can be found in the recovery doxygen documentation. All doxygen documentation will be published at doc.reverbrain.com soon.


React monitoring tool released

During the last few months we’ve been actively developing our monitoring tools, and now we are ready to present React!

React (Real-time Call Tree) is a library for measuring the time consumption of different parts of your program. You can think of it as a real-time callgrind with very small overhead and user-defined call branches. A simple, minimalistic API allows you to collect per-thread traces of your system’s workflow.

A trace is represented as JSON that contains the call tree description (actions) and arbitrary user annotations:

{
    "id": "271c32e9c21d156eb9f1bea57f6ae4f1b1de3b7fd9cee2d9cca7b4c242d26c31",
    "complete": true,
    "actions": [
        {
            "name": "READ",
            "start_time": 0,
            "stop_time": 1241,
            "actions": [
                {
                    "name": "FIND",
                    "start_time": 3,
                    "stop_time": 68
                },
                {
                    "name": "LOAD FROM DISK",
                    "start_time": 69,
                    "stop_time": 1241,
                    "actions": [
                        {
                            "name": "READ FROM DISK",
                            "start_time": 70,
                            "stop_time": 1130
                        },
                        {
                            "name": "PUT INTO CACHE",
                            "start_time": 1132,
                            "stop_time": 1240
                        }
                    ]
                }
            ]
        }
    ]
}

This kind of trace can be very informative. On top of it you can build complex analysis tools that can trace a specific user request on the one hand, and measure the overall performance of a specific action across all requests on the other.
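
Since a trace is plain JSON, such tools are straightforward to start on. Here is a small sketch that decodes a trace like the one above and prints the duration of every action, indented by call depth (durations are in whatever units the trace was recorded in):

package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// action mirrors the relevant part of a React trace.
type action struct {
    Name      string   `json:"name"`
    StartTime int64    `json:"start_time"`
    StopTime  int64    `json:"stop_time"`
    Actions   []action `json:"actions"`
}

type trace struct {
    Actions []action `json:"actions"`
}

// dump prints each action's total duration, indented by call depth.
func dump(actions []action, depth int) {
    for _, a := range actions {
        fmt.Printf("%s%-16s %6d\n",
            strings.Repeat("  ", depth), a.Name, a.StopTime-a.StartTime)
        dump(a.Actions, depth+1)
    }
}

func main() {
    raw := `{"actions": [{"name": "READ", "start_time": 0, "stop_time": 1241,
        "actions": [{"name": "FIND", "start_time": 3, "stop_time": 68}]}]}`

    var t trace
    if err := json.Unmarshal([]byte(raw), &t); err != nil {
        panic(err)
    }
    dump(t.Actions, 0)
}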

Also, for a human-readable representation of React traces, we’ve built a simple web-based visualization instrument. React has already proved itself in practice: under high load, our caching system’s performance degraded dramatically after cache overflow. Call tree statistics showed that during cache operations most of the time was consumed in a function that resized cache pages. After careful examination, we optimized that function, and performance immediately increased significantly.

Now React is used in Elliptics and Eblob, and we intend to use it in other Reverbrain products.

For more details, check out the documentation and the official repository.

RIFT is now fully API-backed: buckets, ACLs, bucket directories, listing and so on

The day has come: we have made all the features of RIFT, the elliptics HTTP frontend (those in the title and more), accessible via REST APIs.

It required URL format changes, and URLs are now much more like S3 and REST in general:
http://reverbrain.com/get/example.txt?bucket=testns
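
Fetching an object then looks like an ordinary HTTP GET; for example, a minimal Go client for the URL format above (error handling trimmed to the essentials):

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // The bucket is passed as a query parameter, as in the URL format above.
    resp, err := http.Get("http://reverbrain.com/get/example.txt?bucket=testns")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%s: %d bytes\n", resp.Status, len(body))
}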

The main cornerstone of RIFT is the bucket: a metadata entity which describes where your data lives (the group list) and how to access it (ACLs), and which also hosts a secondary index of the keys uploaded into that bucket (if configured to do so).
Now we also have the bucket directory, an entity which lists your buckets.

Buckets, directories, files and indexes – everything can be created, processed and deleted via REST API calls.
Basically, RIFT + elliptics allow you to create your own private cloud storage and put your data replicas into the safe locations you like.

It is like having your own Amazon S3 in your pocket :)

Soon we will set up a test cloud at reverbrain.com where everyone can check out our technologies before digging deeper: you will be able to create (limited) buckets and upload/download data, which will be stored in Germany and Russia for a limited period of time.

For more details about RIFT please check our documentation page: http://doc.reverbrain.com/rift:rift

Stay tuned!