Featured post

reverbrain.com is up and running

Reverbrain is a company we started to provide support and solutions for distributed storage and realtime computation problems.

We created the whole stack of technologies ranging from the lowest level of data storage upto high-level processing pipeline.

Time moves on and we see how complex is to deal with massively increasing amounts of data. We provide solutions for true-way horizontal scaling with fair system complexity. Our storage appliances lay in the area where one has to host billions of small-to-medium objects upto huge data streaming systems.

Your data should nost just lay in archive in the distributed system — realtime pipeline processing is our vision of how data has to be handled. We use them by ourself and want to provide the best experience for our customers.

Distributed cron across the planet

Cron is used on every local machine out there – all systems have periodic tasks which have to be started at some time.

But some tasks cover more than a single server, for example check another server and perform some action. If we have 10 servers it is ok that every server will check others, but if number of machines is larger – thousands of servers – it is unfeasible to force them all to know about each other.

And yet there is a strong demand to run periodic tasks which cover multiple servers (or whole datacenters). Setting up a dedicated server which will run ‘group’ tasks is not always the right solution: server may fail, it can be loaded, it is possible that task is big enough to be started on a single server only – we might want to set a group of servers to perform group periodic tasks.

That dedicated group has to be managed – which server runs which task, what if task has failed, what to do, when multiple servers want to run the same task and so on.

Guys from Google present a paper “Distributed cron across the planet” which tells how they solved this problem. Its worth reading even if you do not have thousands of servers (well, normal people do not) – paper describes a nice ‘cell’ design – dedicated set of server, Paxos for management, retry policies, master/slave, recovery and so on

Reversed real-time search

To find out documents which match a given query search engine iterates over inverted indexes and intersect the results. Inverted index is usually being built over terms found in a document, term can be a word, its transformation, normal form, a sentence and so on.

This indexing process is rather costly and slow, but usually it is not a problem – documents in a set are not frequently changed. Indexing is being performed in batches – for example iterate over all documents downloaded since the last indexing batch.
Realtime search in this context is basically a batch reduction – to 1 document in a batch in its extreme.

But there is a completely different search pattern – when a realtime flow of documents has to match a set of queries. The most common example is live twitter search.

Confluent performs realtime data processing and shows a way to perform realtime search given a huge amount of SQL queries and realtime feed of documents.

They use Kafka to store stream of documents and queries and a tricky way to optimize queries that way it would not require to run all queries against all new documents. It is achieved by building a query index which matches only those queries which potentially might match given document and then only run those queries.

Coordination Avoidance in Database Systems

A paper whose name speaks for itself – theoretical analysis of the distributed systems in order to find invariants which might loose coordination constraints.

In particular they analyze industry standard TCP-C benchmark and find out that if properly designed, linearly scalable 200-servers systems can outperform current leader at 25x rate.

A much shorter and brief analysis of the paper available at the morning paper.

The main conclusion is (and I will make it bold)

coordination can only be avoided if all local commit decisions are globally valid

This basically means that updates are always require external coordination, and depending on versioning reads might be the only operations that do not require coordination in systems which allow updates.

Article then goes into details on how to prove coordination avoidance and that part contains a fair number of holes. In particular they discuss merge operation which fixes transaction confluence, but it doesn’t say how to coordinate this merge on different servers and how and what to tell client when his write transaction will not appear in read read results. There is also a note about ‘last writer wins’ strategy, which is more complex than that – there is always a problem with clock synchronization and notion of the ‘last’ write is very vague when applied simultaneously to multiple servers.

I very like short highlighted above conclusion articles makes, that coordination can only be avoided if local changes are globally valid. In the very basic case this is equivalent to cases when there are no updates, but only new-key writes.

Elliptics does not have distributed coordinator (although is has local one), and my point is to design systems which never require updates – such systems are always consistent and correct in respect to data replicas. Practice shows that coordination avoidance (or no updates for existing keys) can be achieved, but it always has some price.

Tachyon – “memory-centric” distributed storage

Tachyon is distributed filesystem on top of HDFS with aggressive caching. By ‘filesystem’ one has to accept Java File objects and related methods and Hadoop compatibility.

Tachyon can run on top of HDFS, S3 and GlusterFS, in fact, its fault tolerance is what HDFS provides – single name node. Tachyon requires zookeeper to maintain master server which performs all operations and checks for consistency.

By their own benchmark it scales lineary with new nodes added

Because of aggressive caching Tachyon’s performance is way ahead of in-memory HDFS. Transparent in-memory caching is a great way to speedup Hadoop and particularly Spark, which was designed for immutable datasets with multiple access patterns.

I can also recommend Andreessen Horowitz article as an example of high-profile investment and management language, it looks like it even introduced new ‘memory-centric’ kind of technical term.

As a side note, I can not slip away comparing this caching system with what Elliptics distributed cache is. Not highlighting and providing Java File API instead we concentrated on HTTP access through Backrunner – Elliptics HTTP load balancer and proxy.

Building transparent subsystem which do not require semi-manual caching policies and additional configuration layers is a way forward for distributed systems, which already require quite a configuration to run and maintain. With a16z investment support I’m pretty sure we will hear more about Tachyon soon.

Sophia embedded key-value storage

Sophia is a write append-only embedded key-value storage used by Mail.ru in their mail indexes.

Sophia supports common get/set operations and optimized for range queries. It is stated that common log-based storage design operates with set of sorted files and range query may end up with as many low-level storage requests as many log files it contains. Bloom filters and similar techniques help with single requests but not range query and Sophia aims at this problem.

Sophia supports wide range of features including ACID (probably with versions created on top of epoch numbers), cursors, versions (looks like older version can not be accessed, it is probably used for consistent read/scan/range access) and fair number of various language bindings.

Initial design (1.1) was quite simple – there was in-memory index of all keys written into log file. There are some optimizations like range indexes and epoch numbers (it is used to mark ranges for merge and GC tasks). This design ends up with eating memory for all keys and there is no way to maintain memory usage.

1.2 is a bit more complex, it introduced branch – set of sorted ranges. Unfortunately there is no much info about internal details, probably sorted ranges do not have their keys in in-memory key index, and since ranges do not overlap and are sorted, range queries can be fast.
Having branches also looks like helping with multithreading merge/GC tasks.

Keys are combined with data on disk – Sophia is designed for small-to-medium key sizes.

Performance is rather spectacular – but there is a one caveat, there is no information on the key structure. It looks like keys are mostly increasing – there are no sorting gaps in performance. This could be explained by the fact that new mails likely have higher IDs than older ones. Also there is no information on page cache usage and amount of ram servers have as well as how ‘random’ the keys are, this would clarify performance graphs.

But even if keys are mostly increasing performance is excellent

Designing time series database

Akumuli tells on how they designed time series database.

Basically it is a set of write-ahead-log files preallocated at the initial write. Header contains time indexes, footer hosts actual data, and they both grow towards each other. Data chunks and/or indexes can be compressed.

This is a nice design, but it only works for time series, or for keys which are by its nature increase with each new record. This trick allows to perform binary search on keys stored in increasing order at the beginning of the WAL file, and select needed file with O(1) disk lookups (or even single disk lookup, if startup processes reads key ranges for every file into memory).

If suddenly new key is less than previously written key, system would have to rearrange already written keys, which requires data copy. Or this would require hosting key->data offset mapping in memory, which doesn’t scale for time series databases, which usually contain billions of records. There is a middle point somewhere where production system would work, but it is *much* more simple just to forbid writing data with random keys. If one is afraid of time skew in the cluster, it is possible to cache keys (and sort them in memory) before storing into the database.

docker and IPv6: eth0: IPv6 duplicate address detected!

This is a very known error

IPv6: eth0: IPv6 duplicate address fe80::42:acff:fe11:7e detected!

its main problem is that given interface (eth0) is being reset after DAD mechanism has found error.

In docker case this happens (maybe among other cases) when container was started with ‘on-failure’ restart policy. This means when application in the container fails, docker detects that and stops/starts the same container. Subsequent start ends up with the DAD issue and host interface reset.

golang: shine and depression

Just leave it here

$ export GODEBUG="gcdead=1"
$ go
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x6d4618]

goroutine 1 [running]:
regexp/syntax.(*parser).collapse(0xc20805e000, 0xc208030088, 0x6868686868686868, 0x6868686868686868, 0x13, 0x6868686868686868)
	/usr/local/go/src/regexp/syntax/parse.go:376 +0x2b8
regexp/syntax.(*parser).factor(0xc20805e000, 0xc208030080, 0x6868686868686868, 0x6868686868686868, 0x0, 0x6868686868686868, 0x6868686868686868, 0x6868686868686868)
...

The latest stable go 1.4.1.

gcdead=1 should tell garbage collector to get rid (‘clobber’) of stack frames that it thinks are dead. Apparently either Golang GC thinks that stack being used is dead, or there is stack overflow (like pinning pointers or something like that)

Weaver – new transactional graph database

Weaver is the first distributed transactional graph database, strong consistency is being achieved by using Hyperdex distributed storage.

There is not much information on how it is implemented, performance benchmarks are also a bit vague, but it states that Weaver performs several times faster than Titan and GraphLab. For pagerank-like benchmark (what does that mean?) Weaver performs slower.

It could be very intesting to try except that setting up Hyperdex is a pain. One has to start 3 different daemons with different configs, there are a bunch of scripts to support it and I do not know what will happen if one of them fails.

Weaver is in alpha stage so far, but looks interesting.

Backrunner – next generation HTTP proxy for Elliptics distributed storage

Elliptics is a powerful distributed storage for medium and large data, but it is rather low-level. It doesn’t know about ACL or REST API for example, I would compare it to block level in Linux filesystem hierarchy. In particular, Elliptics only provides C/C++, Python and Golang API bindings.

For the vast majority of the users HTTP REST API is a must, thus we created Backrunner – a new swiss-knife HTTP proxy for Elliptics distributed storage. It supports ACL, automatic bucket selection based on disk and network speed, errors, amount of free space, automatic defragmentation, header extension, local static files handling and provides simple REST API for clients.

We call Backrunner an entry point to Elliptics distributed storage. It not only provides externally visible interfaces, but also takes many administrative tasks like running defragmentation, showing properly crafted monitoring stats and so on.

Backrunner’s load balancing operates in real-time, for example it gathers upload metrics (speed, latency, errors) on every request to properly tune algorithm placing data around the cluster. It also takes into account amount of free space, disk activity, internal errors, timings from other clients, network speed and many other metrics.

We will extend it to run basic recovery operations, right now Backrunner detects that replicas are out of sync, but do not run recovery because this will likely heavily affect timings, which is generally a bad idea. That’s why Elliptics is an eventually consistent system – we pay this price for the highest possible scalability levels.

Backrunner is also distributed in docker images: https://registry.hub.docker.com/u/reverbrain/backrunner/
Documentation: http://doc.reverbrain.com/backrunner:backrunner

Tutorial is coming, stay tuned!

Facebook’s in-memory computation engine for social analytics

Facebook’s scale is out of the radar for the vast majority of cases, but yet it is very interesting to lookup new ideas there.

Audience insight is a way to show us pages which were are about to like with higher probability. It is based on previously liked pages, gender, location and many other features.

In particular at Facebook’s scale it is about 35 Tb of raw data and query ‘give me several pages which should be shown to user X’ must be completed within hundreds of milliseconds. That requires to process hundreds of billions of likes and billions of pages – numbers beyond reasonable – and in fraction of a second.

It happens that 168 nodes can handle it with some magic like columnar storage, in-memory data, bitmap indexes, GPU and caches. Very interesting reading!

Facebook’s hot, warm and graph data

Facebook is a maze for developing data store and access algorithms. Their scale allows and actually requires to find out new ways of storing information.

For example social graph data – it is stored in MySQL and used to use aggressive memcached cache. Now Facebook uses Tao – graph-aware storage for social graph with its own graph cache.

Facebook uses 3 levels of persistent storage: hot data is stored in Haystack storage – it’s lowest level is the prototype I used for Eblob. I would describe Elliptics storage with DHT, i.e. multiple servers to store one data replica, as this level of abstraction.

Second comes so called WARM data – data which is accessed by 2 orders of magnitude less frequently than that at HOT level. Warm data is stored in F4 storage, Facebook describes this split and motivation in this article.
I saw Elliptics installation with one disk per group, i.e. without DHT, as a storage for this access level of the data.

And the last is a COLD storage, something weird and scientifically crazy like robotic hand with a stack of Blu-Ray disks.

Various flash caches benchmarked

Guys from Swrve have benchmarked various flash caches in front of Amazon EBS volume and shared the results.

Data shows that Facebook’s Flashcache performs the best. Swrve decided to replace safety with performance and started to use raid-0 stipe for EBS volumes plus SSD cache in front of them.

Here is the result: Y axis is events processed per second, higher is better; X axis is time.
Y axis is events processed per second, higher is better; X axis is time.

  • single 200GB EBS volume (blue)
  • 4 x 50GB EBS volumes in RAID-0 stripes (red)
  • 4 x 50GB EBS volumes in RAID-0 stripes, with an enhanceio cache device (green)

EnhanceIO is a Facebook’s Flashcache fork with nicer tools. It itself increases performance by about 25% in their workload, while splitting EBS volume into raid0 added another 25%.

Above Swrve link contains more details on benchmark and other results.

Document-oriented databases

Amazon has announced that DynamoDB (the grandmother of all nosql databases) extended its free tier (upto 25 Gb) and added native JSON document support.

Native JSON document support is basically ability to get/set not the whole document, but only its parts indexed by internal key. Nothing says about how such update will be done in distributed fashion, in particular when multiple copies are being updated in parallel, but DynamoDB doesn’t support realtime replication into multiple regions, and internally it is likely database uses master-slave replication scheme.

I wrote it here to ask a question whether such document oriented databases are on demand? I know about CouchDB (and derivatives), MongoDB and others, but I’m quite interested whether and why DynamoDB wants to beat on this arena? I could add native JSON support to Elliptics but not sure it is really needed.

Afterall there quite a few document-oriented databases for rather small keys, while elliptics could beat there too, but instead it better works with medium-to-large objects with streaming and distributed replication and parallel upload.

Facebook’s “Cold storage” or saving data on Blu-Ray disks

Data has a natural live cycle – the newer the data the most actively it is accessed (read or viewed). This is exceptionally the case for digital content like pictures, movies, letters and messages. Facebook has found that the most recent 8% of the user generated content accounts for more than 80% of data access.

But the old data still has to be saved for rare access. Those servers which store the old data are virtually idle and kind of useless – the only thing they do most of the time is heating the air and eating an electricity, which is quite pricey.

Facebook experiments with moving this old data (actually only a backup copy of data) into special “Cold storage” which has two particular features: it can shutdown already written disks and data can be copied from those backup disks into Blu-Ray rack of disks with special robot handling blu-ray drives loading.

Besides the fact that blu-ray disk is much more robust than spinning disks (it doesn’t afraid of water or dust for example, it can store data for 50 years until something new appears) the rack of disks doesn’t eat pricey electricity.

http://money.cnn.com/2014/08/21/technology/facebook-blu-ray/index.html

A short story on trying to use Vagrant to put elliptics node into container

I want to use containers to run multiple elliptics nodes with different versions on the same server mainly to be able to quickly switch between them.

I decided to run Vagrant on top of VirtualBox, since I do need for on-demand compilation of the new package versions and can not just wrap some application into container.
Running vagrant on top of Fedora19 fully succeeded, now I have several environments to experiment.

But main goal was to be able to create ‘a box’ and put it to remote nodes to create a network of different elliptics versions, and that part is rather hard with vagrant. First, I didn’t find how to export vagrant box not to its cloud but into the local file. But I could do that with virtualbox images.
Creating own box while I actually need a default one with several packages installed is rather overkill I think.

But here comes the second problem, the latest vagrant and the latest virtualbox do not work on various debians. Installed the latest virtualbox on wheezy default machine ended up with timed out vagrant. Running it on top of backported 3.14 kernel doesn’t work at all.

It was soo good on my development server, and miserably failed on testing machines.
Adding that I could not easily find out how to export updated boxes via files I decided to switch to something else. In particular, I’m thinking about Docker, although it is quite different, for example I do not know how to update image when I have new version in repository, or even how to build it inside the image and show me errors I could fix in place… Docker is more on how to pack already cooked up application with all its configs and tunings, while I need this for development.

Elliptics 2.26 changelog

Here is a human-readable changelog of 2.26 major version of elliptics distributed storage: http://doc.reverbrain.com/elliptics:major-version-updates#v226

The main feature is multiple backends in the single server. One can turn on/off them, change state, each backend has own IO execution pool. Basically it allows to change old scheme of having many elliptics servers, one per disk/directory/mountpoint, to just one server with multiple backends.
This greatly simplifies node setup and heavily decreases route table updates.

Also added new C++ structured logger Blackhole. One can send logs into ElasticSearch, syslog or use oldschool files.

We also cleaned up code and client logic, introduced new kinds of errors, simplified protocol and fixed bunch of bugs.

Enjoy and stay tuned!

How is Redis used in twitter

Everyone knows Redis – high-performance persistent cache system. Here is an article on how Redis is being used in Twitter.

It happens that Twitter not only forked and extended 1 year old Redis version, but looks like it doesn’t have plans to upgrade. Redis and its latencies are much-much-much faster than Twitter infrastructure written in Java because of GC in JVM. This allows to put a bunch of proxies on top of Redis caching cluster to do cluster management, the thing Redis misses for a while.

Also Twitter uses Redis only to cache data, doesn’t care about consistency issues, doesn’t use persistent caching, at least article says data is being thrown away when server goes offline.
It is client responsibility to read data from disk storage if there is no data in the cache.

Article desribes Twitter timeline architecture, and that’s quite weird to me: instead of having list of semifixed (or limited by size) chunks of timeline which are loaded on demand, they created a bunch of realtime updated structures in Redis, found non-trivial consistency issues and eventually ended up with the same simple approach of having ‘chunks’ of timeline stored in cache.

I started to compare cache management in Twitter using Redis with what we have in Reverbrain for caching: our Elliptics SLRU cache. It uses persistent caching system (which was also described a bit in article in comparison with memcache), but also uses persistent storage to backup cache, and while cache is actually segmented LRU, its backing store can be arbitrary large at size compared to Redis.

Although article is written as ‘set of facts’ somehow cut out of context (it was interview with the twitter employee), it is a good reading to think about caching, JVM, Redis and cache cluster architecture.

Elliptics, golang, GC and performance

Elliptics distributed storage has native C/C++ client API as well as Python (comes with elliptics sources) and Golang bindings.

There is also elliptics http proxy Rift.

I like golang because of its static type system, garbage collection and built in lightweight threading model. Let’s test HTTP proxying capabilities against Elliptics node. I already tested Elliptics cache purely against native C++ client, it showed impressive 2 millions requests per second from 10 nodes, or about 200-220 krps per node using native API (very small upto 100 bytes requests), what would be HTTP proxying numbers?

First, I ran single client, single Rift proxy, single elliptics node test. After some tuning I got 23 krps for random writes of 1k-5k bytes (very real load) per request. I tested 2 cases when elliptics node and rift server were on the same machine and on different physical servers. Maximum latencies with 98% percentile were about 25ms at the end of the test (about 23 krps) and 0-3 ms at 18 krps not counting rare spikes at graph below.

elliptics-cache-rift-23krpsRift HTTP proxy writing data into elliptics cache, 1k-5k bytes per request

Second, I tested a simple golang HTTP proxy with the same setup – single elliptics node, single proxy node and Yandex Tank benchmark tool.

I ran tests using the following setups: golang 1.2 with GC=100 and GC=off and golang 1.3 with the same garbage collection settings. Results are impressive: without garbage collection (GC=ff) golang 1.3 test ran with the same RPS and latencies as native C++ client. Although proxy ate 90+ Gb of RAm. Golang 1.2 showed 20% worse numbers.

elliptics-cache-golang-1.3-gc-offGolang HTTP proxy (turned off garbage collection) writing data into elliptics cache, 1k-5k bytes per request

Turning garbage collection on with GC=100 setting lead to much worse results than native C++ client but yet it is quite impressive. I got the same RPS numbers for this test of about 23 krps, but latencies at the 20 krps were close to 80-100 msecs, and about 20-40 msecs at the middle of the test. Golang 1.2 showed 30-50% worse results here.

elliptics-cache-golang-1.3-gc-100Golang HTTP proxy (GC=100 garbage collection setting) writing data into elliptics cache, 1k-5k bytes per request

Numbers are not that bad for single-node setup. Writing asynchronous parallel code in Golang is incredibly simpler than that in C++ with its forest of callbacks. So I will stick to Golang for the network async code for now. Will wait for Rust to stabilize though.

Crafting knowledge base the right way

Google is automatically building its next generation knowledge graph named Knowledge Vault

Although article is very pop-science (not science at all actually) and doesn’t contain any technical detail, it is clear on google’s idea and the way information retrieval systems head. Automatic knowledge gathering and fact extraction is also what I originally aimed at Reverbrain company, although my idea was much simpler – I wanted to automatically build a language model and fact relations between words to understand native language questions.

Aug 25 there will be a presentation of Google’s Knowledge Vault, I’m too much tempting to see it and try to gather and understand bits of information on how it is implemented inside.

Upfate: a paper on knowledge vault: Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion