backrunner: HTTPS support

We at develop highly scalable distributed storage Elliptics for medium and large objects. And in the web era the most frequently used API is HTTP. We had developed HTTP proxy for elliptics named Backrunner, it supports a wide range of options like ACL, streaming, redirect, partial upload/download, static file downloading and many others.

But if you build a system hidden behind HTTPS you likely want to secure your work with storage, in porticular your CDN will likely require you to work via HTTPS in this case.

So we have updated Backrunner to support HTTPS. It can listen for unencrypted and secured connections simultaneously on different ports/addresses, and you have to provide certificate/private key files.

This rather small change allows to deploy fully secured storage access to your frontend.

MySQL application level HA/loadbalancer

Master-master scheme doesn’t really work for relation databases, master-slave replication is used almost always and there is a serious problem to balance this load.

In particular, most of the time the most demanded setup is to direct all ‘writes’ to master and balance all ‘reads’ among multiple slaves. This can be done at application itself, after all only client knows whether he updates data or reads, but sometimes this is not possible.

Plain and simple regexp analyzer of the query most of the time is not enough, so MariaDB guys wrote MaxScale tool for this.

Percona used it to solve client’s scalability problem and here is their report:

Quite informative post about MaxScale tool.

Distributed database for small records on top of LevelDB/RocksDB

LevelDB (and actually RocksDB) is a very well known local storage for small records. It is implemented over log-structured merge trees and optimized for write performance.

LevelDB is quite horrible for records larger than 1Kb because of its merge operation – it quickly reaches level where it has to always merge trees and it takes seconds to complete.
But for small keys it is very useful and fast.

RebornDB is a proxy storage, which operates on top of LevelDB/RocksDB and provides Redis API to clients. Basically, it is a Redis on top of on-disk leveldb storage.

There is an interesting sharding scheme – there are 1024 slots each of which uses its own replica set, which can be configured by admin. When client writes a key, it is hashed and one of the 1024 slots is being selected (using modulo (% 1024) operation). When admin decides that some slots should be moved to new/different machine, it uses command line tool to reconfigure the proxies. During migration slots in question are available for IO, although it may span multiple servers since proxy doesn’t yet know whether required key has been or hasn’t been yet moved to the new destination.
Having many slots is a bit more flexible than old-school sharding which uses number of servers, although quite far from automatic ID range generation – manual resharding doesn’t scale for admins.

RebornDB uses zookeeper/etcd to store information about slot/server matching and per-slot replication policies. This doesn’t force every operation to contact zookeeper (this actually kills this service), instead every proxy has abovementioned info locally and every reconfiguration also updates info on every proxy.

There is not that much information about data recovery (and migration) except that it is implemented on key-by-key basis. Given that leveldb databases usually contain tens-to-hundreds millions of keys recovery may take a real while, snapshot migration is on todo list.

Reverbrain packages repository

Reverbrain package repository now hosts packages for the following distributives: RHEL6, RHEL7 (CentOS supported), Ubuntu Precise, Ubuntu Trusty, Debian Wheezy, Debian Jessie.

Repository includes all packages needed to install Elliptics distributed storage and Eblob low-level storage.

Here is a small tutorial on how to automatically turn on repository in your setup:

Backrunner HTTP elliptics proxy can be found in Docker repo:

LSM and fractal trees and how to really work with large files

LSM tree (stands for log-structured merge tree) is a rather simple structure which can be hardly called a tree.

This is an append-only log which is sorted when written to disk. LSM tree is intended for write-heavy workloads, since reading requires at least O(number of on-disk log files) disk-seek operations.

There is a fair number of read optimizations for LSM trees, in particular bloom filter which can reduce number of disk seek operations to minimum albeit with some probability (it can be arbitrary small though).

LSM tree behaves much better for write workloads compared to Btree and friends (B+, B* and so on), since there is only one write of the sorted tree and it is always sequential. Btree potentially has to update multiple nodes (some log of total number of keys) when performing single write. Nodes are likely located at random locations which ends up with random writes. These are slow.

Quite contrary Btree reading is usually faster than that of LSM trees – logarithm of number of keys is less than number of sorted logs in LSM tree. But this does not count bloom filters in. Which in turn doesn’t count node caching in btrees.
Multiple operations needed to perform single request – like multiple page reads to fetch single key in btree case – is called multiplication. Fractal tree is aimed at write multiplication – it is yet B+tree, but it stores data in intermediate nodes (not in leafs) for some time until page split is required. This allows to reduce number of writes needed to insert or update a key.

Anyway, btrees are considered to be faster than LSM trees for reading and slower for writing. The latter is a fact – LSM trees are designed for that, the former is questionable.

Elliptics distributed storage can use many backends, and the most popular one is Eblob – a low-level storage built with LSM trees design in mind.

LSM trees do not support data rewrite – key update is appended to new log and older version is either marked as removed or special lookup sequence is used to find out newer keys first. Eventually LSM tree merges and removes old versions of the duplicate keys.

In Eblob this process is called defragmentation, and it is a bit different than LSM tree process. LSM tree stores already sorted data to disk, it sorts it in RAM. But if your storage is intended to store large files like Elliptics – we store objects which are sometimes quite larger than amount of RAM in the system – plain LSM tree approach (sort in mem and sync to disk) doesn’t work.

Instead Eblob stores unsorted log to disk (optionally overwriting data in place) and adds in-memory index of the keys. This simple scheme could be very naive since number of keys multiplied by key size – amount of RAM needed to store key index in memory – can be much larger than amount of RAM on any given server. So we have additional on-disk index of stored keys, it can be sorted – binary search allows to find needed key rather quickly.
But not quickly enough – there will be log2 of number of keys random seek operations – we have to split sorted keys into ranges of smaller size called index blocks and store start/stop pairs for each index block in RAM. This allows to find out index block quickly without on-disk operations, and then perform single read to get the whole index block (tens-to-hundreds of keys) and perform in-memory binary search.

And even this is not enough. Iterators and for example recovery works with sorted keys – recovery merges index lists from different nodes and produces sorted list of keys which have to be recovered – since our data is not sorted yet, reads of the to be recovered keys will be actually random reads. Instead we can turn that purely random read into subsequent read plus some times head positioning. So we sort data which is performed when defragmentation process is being started the first time.

This allows Elliptics+Eblob be the ultimate solution for medium-to-large files distributed storage. For smaller files pure LSM tree is usually enough.

Apache Bookkeeper or how NOT to design replication consistency scheme

Consistency in a distributed system is hard, it is even a limitation – everyone knows that one has to sacrifice a letter in CAP theorem to be have consistent distributed system: either it will not be 100% available or it will not tolerate partition split.

PAXOS algorithm is known to be a complex solution for consistency problem – it allows to have a consensus on multiple opinions, but algorithm is quite complex to implement. And it is very slow. Cassandra for instance caught bugs in its realization for several years (and probably still have it) and it used 6 round trips to converge for a single write.

But there are several simple optimization for PAXOS, the most interesting (and the fastest) is FAST PAXOS. Basically, this is a single master – multiple learners solution. It is very similar to Apache Zookeeper Atomic Broadcast (ZAB) protocol. Single master is a point of failure of course, but there are always leader selection algorithms which might or might not converge, and usually they do (this is a tricky place in plain PAXOS – it is easily possible that it will not converge on a single write round).

ZAB and FAST PAXOS work with the simple master server which handles all updates and serialize them according to own rules. Since there is no distribution consensus problem anymore, it can use 2-phase commit to force all learners to get update. It works, and it is the fastest consistency forcing algorithm, but yet there are 2 round trips.

Apache Bookkeeper decided to work that around. They decided, that 2-phase commit can be avoided and instead data is being written directly to multiple ‘bookies’ or replicas. And only by single writer (more on this below). Since there is no commit phase replica do not know whether its data was written to quorum of nodes or not, thus reader has to read data from multiple replicas to find out whether data it has received was actually written to majority of the nodes.

This ‘write once – read multiple’ case is a horrible performance killer in real storage devices. I do now know any of the real-life workloads where reader tolerates N-fold number of reads to get data. Maybe only log collection, but even they are being read *at least* once, which eats up any possible profit from this scheme.

I created such scheme in Elliptics distributed storage – writer puts a timestamp for every record and if there are consistency issues (for example 2 writers updated the same key in different replica order), reader may use special API call which reads multiple replicas and compare timestamps – the newest data is being returned to the client.
And when I tell you this doesn’t scale – it is true. This design forces distributed storage into a centralized entity for every read request. It can be acceptible for writes, but not reads.

And here comes another issue – single writer. Apache Bookkeeper operates with single writer at a time. I do not know how they achieve that, maybe there is a master server, or there are unique node-specific IDs, which allow to serialize requests coming through different nodes, but yet it is a single writer solution.

Apache Bookkeeper traded single round trip (2-phase commit vs plain replica update) for N-fold increase in number of reads. I call this a very bad deal. Zookeeper has its issues (it has tons of them – Pinterest calls its Zookeeper clusters ‘single cluster of failure), but at least it has strong guarantees on its performance.

Distributed cron across the planet

Cron is used on every local machine out there – all systems have periodic tasks which have to be started at some time.

But some tasks cover more than a single server, for example check another server and perform some action. If we have 10 servers it is ok that every server will check others, but if number of machines is larger – thousands of servers – it is unfeasible to force them all to know about each other.

And yet there is a strong demand to run periodic tasks which cover multiple servers (or whole datacenters). Setting up a dedicated server which will run ‘group’ tasks is not always the right solution: server may fail, it can be loaded, it is possible that task is big enough to be started on a single server only – we might want to set a group of servers to perform group periodic tasks.

That dedicated group has to be managed – which server runs which task, what if task has failed, what to do, when multiple servers want to run the same task and so on.

Guys from Google present a paper “Distributed cron across the planet” which tells how they solved this problem. Its worth reading even if you do not have thousands of servers (well, normal people do not) – paper describes a nice ‘cell’ design – dedicated set of server, Paxos for management, retry policies, master/slave, recovery and so on

Reversed real-time search

To find out documents which match a given query search engine iterates over inverted indexes and intersect the results. Inverted index is usually being built over terms found in a document, term can be a word, its transformation, normal form, a sentence and so on.

This indexing process is rather costly and slow, but usually it is not a problem – documents in a set are not frequently changed. Indexing is being performed in batches – for example iterate over all documents downloaded since the last indexing batch.
Realtime search in this context is basically a batch reduction – to 1 document in a batch in its extreme.

But there is a completely different search pattern – when a realtime flow of documents has to match a set of queries. The most common example is live twitter search.

Confluent performs realtime data processing and shows a way to perform realtime search given a huge amount of SQL queries and realtime feed of documents.

They use Kafka to store stream of documents and queries and a tricky way to optimize queries that way it would not require to run all queries against all new documents. It is achieved by building a query index which matches only those queries which potentially might match given document and then only run those queries.

Coordination Avoidance in Database Systems

A paper whose name speaks for itself – theoretical analysis of the distributed systems in order to find invariants which might loose coordination constraints.

In particular they analyze industry standard TCP-C benchmark and find out that if properly designed, linearly scalable 200-servers systems can outperform current leader at 25x rate.

A much shorter and brief analysis of the paper available at the morning paper.

The main conclusion is (and I will make it bold)

coordination can only be avoided if all local commit decisions are globally valid

This basically means that updates are always require external coordination, and depending on versioning reads might be the only operations that do not require coordination in systems which allow updates.

Article then goes into details on how to prove coordination avoidance and that part contains a fair number of holes. In particular they discuss merge operation which fixes transaction confluence, but it doesn’t say how to coordinate this merge on different servers and how and what to tell client when his write transaction will not appear in read read results. There is also a note about ‘last writer wins’ strategy, which is more complex than that – there is always a problem with clock synchronization and notion of the ‘last’ write is very vague when applied simultaneously to multiple servers.

I very like short highlighted above conclusion articles makes, that coordination can only be avoided if local changes are globally valid. In the very basic case this is equivalent to cases when there are no updates, but only new-key writes.

Elliptics does not have distributed coordinator (although is has local one), and my point is to design systems which never require updates – such systems are always consistent and correct in respect to data replicas. Practice shows that coordination avoidance (or no updates for existing keys) can be achieved, but it always has some price.

Tachyon – “memory-centric” distributed storage

Tachyon is distributed filesystem on top of HDFS with aggressive caching. By ‘filesystem’ one has to accept Java File objects and related methods and Hadoop compatibility.

Tachyon can run on top of HDFS, S3 and GlusterFS, in fact, its fault tolerance is what HDFS provides – single name node. Tachyon requires zookeeper to maintain master server which performs all operations and checks for consistency.

By their own benchmark it scales lineary with new nodes added

Because of aggressive caching Tachyon’s performance is way ahead of in-memory HDFS. Transparent in-memory caching is a great way to speedup Hadoop and particularly Spark, which was designed for immutable datasets with multiple access patterns.

I can also recommend Andreessen Horowitz article as an example of high-profile investment and management language, it looks like it even introduced new ‘memory-centric’ kind of technical term.

As a side note, I can not slip away comparing this caching system with what Elliptics distributed cache is. Not highlighting and providing Java File API instead we concentrated on HTTP access through Backrunner – Elliptics HTTP load balancer and proxy.

Building transparent subsystem which do not require semi-manual caching policies and additional configuration layers is a way forward for distributed systems, which already require quite a configuration to run and maintain. With a16z investment support I’m pretty sure we will hear more about Tachyon soon.

Sophia embedded key-value storage

Sophia is a write append-only embedded key-value storage used by in their mail indexes.

Sophia supports common get/set operations and optimized for range queries. It is stated that common log-based storage design operates with set of sorted files and range query may end up with as many low-level storage requests as many log files it contains. Bloom filters and similar techniques help with single requests but not range query and Sophia aims at this problem.

Sophia supports wide range of features including ACID (probably with versions created on top of epoch numbers), cursors, versions (looks like older version can not be accessed, it is probably used for consistent read/scan/range access) and fair number of various language bindings.

Initial design (1.1) was quite simple – there was in-memory index of all keys written into log file. There are some optimizations like range indexes and epoch numbers (it is used to mark ranges for merge and GC tasks). This design ends up with eating memory for all keys and there is no way to maintain memory usage.

1.2 is a bit more complex, it introduced branch – set of sorted ranges. Unfortunately there is no much info about internal details, probably sorted ranges do not have their keys in in-memory key index, and since ranges do not overlap and are sorted, range queries can be fast.
Having branches also looks like helping with multithreading merge/GC tasks.

Keys are combined with data on disk – Sophia is designed for small-to-medium key sizes.

Performance is rather spectacular – but there is a one caveat, there is no information on the key structure. It looks like keys are mostly increasing – there are no sorting gaps in performance. This could be explained by the fact that new mails likely have higher IDs than older ones. Also there is no information on page cache usage and amount of ram servers have as well as how ‘random’ the keys are, this would clarify performance graphs.

But even if keys are mostly increasing performance is excellent

Designing time series database

Akumuli tells on how they designed time series database.

Basically it is a set of write-ahead-log files preallocated at the initial write. Header contains time indexes, footer hosts actual data, and they both grow towards each other. Data chunks and/or indexes can be compressed.

This is a nice design, but it only works for time series, or for keys which are by its nature increase with each new record. This trick allows to perform binary search on keys stored in increasing order at the beginning of the WAL file, and select needed file with O(1) disk lookups (or even single disk lookup, if startup processes reads key ranges for every file into memory).

If suddenly new key is less than previously written key, system would have to rearrange already written keys, which requires data copy. Or this would require hosting key->data offset mapping in memory, which doesn’t scale for time series databases, which usually contain billions of records. There is a middle point somewhere where production system would work, but it is *much* more simple just to forbid writing data with random keys. If one is afraid of time skew in the cluster, it is possible to cache keys (and sort them in memory) before storing into the database.

docker and IPv6: eth0: IPv6 duplicate address detected!

This is a very known error

IPv6: eth0: IPv6 duplicate address fe80::42:acff:fe11:7e detected!

its main problem is that given interface (eth0) is being reset after DAD mechanism has found error.

In docker case this happens (maybe among other cases) when container was started with ‘on-failure’ restart policy. This means when application in the container fails, docker detects that and stops/starts the same container. Subsequent start ends up with the DAD issue and host interface reset.

golang: shine and depression

Just leave it here

$ export GODEBUG="gcdead=1"
$ go
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x6d4618]

goroutine 1 [running]:
regexp/syntax.(*parser).collapse(0xc20805e000, 0xc208030088, 0x6868686868686868, 0x6868686868686868, 0x13, 0x6868686868686868)
	/usr/local/go/src/regexp/syntax/parse.go:376 +0x2b8
regexp/syntax.(*parser).factor(0xc20805e000, 0xc208030080, 0x6868686868686868, 0x6868686868686868, 0x0, 0x6868686868686868, 0x6868686868686868, 0x6868686868686868)

The latest stable go 1.4.1.

gcdead=1 should tell garbage collector to get rid (‘clobber’) of stack frames that it thinks are dead. Apparently either Golang GC thinks that stack being used is dead, or there is stack overflow (like pinning pointers or something like that)

Weaver – new transactional graph database

Weaver is the first distributed transactional graph database, strong consistency is being achieved by using Hyperdex distributed storage.

There is not much information on how it is implemented, performance benchmarks are also a bit vague, but it states that Weaver performs several times faster than Titan and GraphLab. For pagerank-like benchmark (what does that mean?) Weaver performs slower.

It could be very intesting to try except that setting up Hyperdex is a pain. One has to start 3 different daemons with different configs, there are a bunch of scripts to support it and I do not know what will happen if one of them fails.

Weaver is in alpha stage so far, but looks interesting.

Backrunner – next generation HTTP proxy for Elliptics distributed storage

Elliptics is a powerful distributed storage for medium and large data, but it is rather low-level. It doesn’t know about ACL or REST API for example, I would compare it to block level in Linux filesystem hierarchy. In particular, Elliptics only provides C/C++, Python and Golang API bindings.

For the vast majority of the users HTTP REST API is a must, thus we created Backrunner – a new swiss-knife HTTP proxy for Elliptics distributed storage. It supports ACL, automatic bucket selection based on disk and network speed, errors, amount of free space, automatic defragmentation, header extension, local static files handling and provides simple REST API for clients.

We call Backrunner an entry point to Elliptics distributed storage. It not only provides externally visible interfaces, but also takes many administrative tasks like running defragmentation, showing properly crafted monitoring stats and so on.

Backrunner’s load balancing operates in real-time, for example it gathers upload metrics (speed, latency, errors) on every request to properly tune algorithm placing data around the cluster. It also takes into account amount of free space, disk activity, internal errors, timings from other clients, network speed and many other metrics.

We will extend it to run basic recovery operations, right now Backrunner detects that replicas are out of sync, but do not run recovery because this will likely heavily affect timings, which is generally a bad idea. That’s why Elliptics is an eventually consistent system – we pay this price for the highest possible scalability levels.

Backrunner is also distributed in docker images:

Tutorial is coming, stay tuned!

Facebook’s in-memory computation engine for social analytics

Facebook’s scale is out of the radar for the vast majority of cases, but yet it is very interesting to lookup new ideas there.

Audience insight is a way to show us pages which were are about to like with higher probability. It is based on previously liked pages, gender, location and many other features.

In particular at Facebook’s scale it is about 35 Tb of raw data and query ‘give me several pages which should be shown to user X’ must be completed within hundreds of milliseconds. That requires to process hundreds of billions of likes and billions of pages – numbers beyond reasonable – and in fraction of a second.

It happens that 168 nodes can handle it with some magic like columnar storage, in-memory data, bitmap indexes, GPU and caches. Very interesting reading!

Facebook’s hot, warm and graph data

Facebook is a maze for developing data store and access algorithms. Their scale allows and actually requires to find out new ways of storing information.

For example social graph data – it is stored in MySQL and used to use aggressive memcached cache. Now Facebook uses Tao – graph-aware storage for social graph with its own graph cache.

Facebook uses 3 levels of persistent storage: hot data is stored in Haystack storage – it’s lowest level is the prototype I used for Eblob. I would describe Elliptics storage with DHT, i.e. multiple servers to store one data replica, as this level of abstraction.

Second comes so called WARM data – data which is accessed by 2 orders of magnitude less frequently than that at HOT level. Warm data is stored in F4 storage, Facebook describes this split and motivation in this article.
I saw Elliptics installation with one disk per group, i.e. without DHT, as a storage for this access level of the data.

And the last is a COLD storage, something weird and scientifically crazy like robotic hand with a stack of Blu-Ray disks.

Various flash caches benchmarked

Guys from Swrve have benchmarked various flash caches in front of Amazon EBS volume and shared the results.

Data shows that Facebook’s Flashcache performs the best. Swrve decided to replace safety with performance and started to use raid-0 stipe for EBS volumes plus SSD cache in front of them.

Here is the result: Y axis is events processed per second, higher is better; X axis is time.
Y axis is events processed per second, higher is better; X axis is time.

  • single 200GB EBS volume (blue)
  • 4 x 50GB EBS volumes in RAID-0 stripes (red)
  • 4 x 50GB EBS volumes in RAID-0 stripes, with an enhanceio cache device (green)

EnhanceIO is a Facebook’s Flashcache fork with nicer tools. It itself increases performance by about 25% in their workload, while splitting EBS volume into raid0 added another 25%.

Above Swrve link contains more details on benchmark and other results.

Document-oriented databases

Amazon has announced that DynamoDB (the grandmother of all nosql databases) extended its free tier (upto 25 Gb) and added native JSON document support.

Native JSON document support is basically ability to get/set not the whole document, but only its parts indexed by internal key. Nothing says about how such update will be done in distributed fashion, in particular when multiple copies are being updated in parallel, but DynamoDB doesn’t support realtime replication into multiple regions, and internally it is likely database uses master-slave replication scheme.

I wrote it here to ask a question whether such document oriented databases are on demand? I know about CouchDB (and derivatives), MongoDB and others, but I’m quite interested whether and why DynamoDB wants to beat on this arena? I could add native JSON support to Elliptics but not sure it is really needed.

Afterall there quite a few document-oriented databases for rather small keys, while elliptics could beat there too, but instead it better works with medium-to-large objects with streaming and distributed replication and parallel upload.