Storage and beyond is up and running

Reverbrain is a company we started to provide support and solutions for distributed storage and realtime computation problems.

We created the whole stack of technologies ranging from the lowest level of data storage upto high-level processing pipeline.

Time moves on and we see how complex is to deal with massively increasing amounts of data. We provide solutions for true-way horizontal scaling with fair system complexity. Our storage appliances lay in the area where one has to host billions of small-to-medium objects upto huge data streaming systems.

Your data should nost just lay in archive in the distributed system — realtime pipeline processing is our vision of how data has to be handled. We use them by ourself and want to provide the best experience for our customers.

Crafting knowledge base the right way

Google is automatically building its next generation knowledge graph named Knowledge Vault

Although article is very pop-science (not science at all actually) and doesn’t contain any technical detail, it is clear on google’s idea and the way information retrieval systems head. Automatic knowledge gathering and fact extraction is also what I originally aimed at Reverbrain company, although my idea was much simpler – I wanted to automatically build a language model and fact relations between words to understand native language questions.

Aug 25 there will be a presentation of Google’s Knowledge Vault, I’m too much tempting to see it and try to gather and understand bits of information on how it is implemented inside.

Upfate: a paper on knowledge vault: Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion

, ,

Elliptics 2.25 changelog

We are proud to announce our current 2.25 production major version changelog in a human readable (not thousands of commits or several hundreds lines of packaging changelogs):

Elliptics 2.25 is now in ‘support’ mode, i.e. we do not add new features which can break things, but do fix bugs and perform various cleanups.


Multi Data Center Replication in Cassandra

A good entry-level article on how replication (including multidatacenter) is implemented in Cassandra.

Basic idea is random (iirc there is weighted RR algorithm too) remote node selection which behaves like local storage (commit log + memory table) and ‘proxy’ which asynchronously sends data to remote replicas. Depending on strategies those remote replicas can be updated synchronously and can be in a different datacenter.

Seems good except that rack-awareness and complexity of the setup is not covered. Also synchronous update latencies like writing data to one node and then propagate update to others as well as local queue management are not covered at all.
But that’s a good start.


Machine learning, optimization and event stream management

This is a bit different post – it is not about storage per se, but actually it is.

Let me start a bit from the other side – I’ve read Netflix article about how they created their excellent recommendation service ( ), and it raised a question in my head – is there a way for system to heal itself if it can learn?

Well, machine learning is not quite a learning – it is an explicitly defined mathematical problem, but I wonder whether it can be applied to optimization problem.
And to simplify things, lets move away from storage to robots.

Consider a simple scenario – robot moves over the road to its final destination point. It has a number of sensors which tell its control system about speed, edge, current coordination, wheel state and so on. And there are 2 rotated wheels.

At some point robot’s right wheel moves into the hole. Robot starts turning over and in a few moments it will fall.
How to fix this problem?

There are numerous robot control algorithms which very roughly say ‘if sensor X says Y do Z’ and those parameters may vary and be controlled by the management optimization.
But what if we step away to more generic solution – what if we do not have such algorithms. But we can change wheel rotations and perform some other tasks which we do not know in advance how they affect current situation.

Solution to this problem allows to solve a storage problem too – if storage control system can write into multiple disks, which one to select so that read and write performance would be maximized, space is efficiently used and so on.

A bad solution uses heuristics – if disk A is slow at the moment, use disk B. If robot falls to the right, rotate the right wheel. And so on. It doesn’t really work in practice.

Another solution – a naive one – is to ‘try every possible solution’ – rotate the right wheel, if things are worse, stop it. Rotate the left wheel – check sensors, if situation changed – react accordingly. Combine all possible controls (at random or at some order), adjust each control’s ‘weight’, hopefully this will fix the things. But when feature space is very large, i.e. we can use many controls – this solution doesn’t scale.

So, what is a generic way to solve such a problem?
Or speaking a little bit more mathematician way – if we have a ‘satisfaction’ function on several variables (we move to given direction, all sensors are green), and suddenly something out of our control started happening, which changed satisfaction function badly (sensors became red), how to use our variables (wheels and other controls) to return things back to normal?

Mesa – Google’s analytic data warehousing system

I will bring back various tech article analysis in this blog

Let’s start with Mesa – Google’s adwords storage and query system. Albeit the article is very self-advertising and a little bit pathos, it still shows some very interesting details on how to design systems which should scale to trillions operations a day.

In particular, data is versioned and never overwritten. Old versions are immutable. Queries aka transactions operate on given version, which are asynchronously propagated over the google’s world datacenters. The only sync part is versioning database – committer updates/pins new version after update transaction criteria has been met, like number of datacenters have a copy and so on. As usual – it is Paxos. Looks like everything in Google uses Paxos, not Raft. Recovery, data movement and garbage collection tasks are also performed asynchronously. Since the only sync operation is version update, committer (updating client) can write heavy data in parallel into the storage and only then ‘attach’ version to those chunks.

Data is stored in (table, key, value) format, and keys are split in chunks. Data is stored in compressed columnar format within each chunk. Well, Mesa uses LevelDB, it explains and simplifies such things.

Mesa uses very interesting versioning – besides so called singletons – single version update, there are also ranges like [0, v0], [v1, v5], [v6, vN] named ‘deltas’ which are basically batches of key updates with appropriate versions. There are async processes which updates version ranges: compaction [0, vX] and new batch creation processes. For example to operate a query on version 7 in example above one could ‘join’ versions [0, v0], [v1, v5] and apply singletons v6 and v7. Async processes calculate various ‘common’ ranges of versions to speed up querying.

Because of versioning Mesa doesn’t need locks between query and update processing – both can be run heavily in parallel.

Here is an article:

, ,

RIFT documentation update

We’ve updated RIFT documentation at

It includes new Authorization header (only riftv1 method is described, S3 methods are coming), handlers ACL updates

Enjoy and stay tuned!

, ,

Data recovery in Elliptics

As you know, Elliptics includes built-in utility for data synchronization both within one replica and between number of replicas. In this article I describe different modes in which synchronization could be run, how it works and give details on situations when it could be used.

dnet_recovery is the main utility for data synchronization in Elliptics. It is distributed with the package elliptics-client. dnet_recovery supports 2 modes: `merge` and `dc`. `merge` is intended for data synchronization within one replica and `dc` – for data synchronization between number of replicas. Each of the modes can automatically search keys for synchronization or use keys from dump file.

Most of described bellow can be found at recovery doxygen documentation. All doxygen documentation will be published on soon.


, , , , , , , ,

React monitoring tool released

During last months we’ve been actively developing our monitoring tools and now we are ready to present you React!

React(Real-time Call Tree) is a library for measuring time consumption of different parts of your program. You can think of it as a real-time callgrind with very small overhead and user-defined call branches. Simple and minimalistic API allows you to collect per-thread traces of your system’s workflow.

Trace is represented as JSON that contains call tree description(actions) and arbitrary user annotations:

    "id": "271c32e9c21d156eb9f1bea57f6ae4f1b1de3b7fd9cee2d9cca7b4c242d26c31",
    "complete": true,
    "actions": [
            "name": "READ",
            "start_time": 0,
            "stop_time": 1241,
            "actions": [
                    "name": "FIND",
                    "start_time": 3,
                    "stop_time": 68
                    "name": "LOAD FROM DISK",
                    "start_time": 69,
                    "stop_time": 1241,
                    "actions": [
                            "name": "READ FROM DISK",
                            "start_time": 70,
                            "stop_time": 1130
                            "name": "PUT INTO CACHE",
                            "start_time": 1132,
                            "stop_time": 1240

This kind of trace can be very informative, on top of it you can build complex analysis tools that on the one hand can trace specific user request and on the other hand can measure performance of some specific action in overall across all requests.

Also, for human readable representation of react traces we’ve build simple web-based visualization instrument. React trace React has already proved himself in practice. Under high load our caching system performance had degraded dramatically after cache overflow. Call tree statistics showed that during cache operations most of the time was consumed in function that was resizing cache pages. After careful examination, we optimized that function and immediately, performance significantly increased.

Now React is used in Elliptics and Eblob and we intent to use it in other Reverbrain products.

For more details check out documentation and official repository.

, , ,

Elliptics 2.24 reached end-of-life

2.24 was LTS release and instead of half a year we supported it more than 1 year.
It reaches its end of life now.

We will answer questions about 2.24 and help with old version for a little while, but I strongly recommend you upgrading to 2.25


RIFT is now fully API backed: buckets, acl, bucket directories, listing and so on

This day has come – we made all RIFT – elliptics HTTP frontend – features (written in the title and more others) accessible via REST APIs.

It required URL format changes, and now URLs are much more like S3 and REST in general:

Main base stone of the RIFT – bucket – a metadata entity which shows where your data lives (group list) and how to access it (ACLs) also hosts a secondary index of the keys uploaded into that bucket (if configured to do so).
Now we have bucket directory – entity which lists your buckets.

Buckets, directories, files and indexes – everything can be created, processed and deleted via REST API calls.
Basically, RIFT + elliptics allow you to create your own private cloud storage and put your data replicas into safe locations you like.

It is like having your own Amazon S3 in the pocket :)

Soon we will set up a test cloud at where everyone can check our technologies before digging deeper you will be able to create (limited) buckets and upload/download data, which will be stored in Germany and Russia for limited period of time.

For more details about RIFT please check our documentation page:

Stay tuned!

, ,

Previous Posts