Tag Archives: PAXOS

Apache Bookkeeper or how NOT to design replication consistency scheme

Consistency in a distributed system is hard – it is even a fundamental limitation: everyone knows that one has to sacrifice a letter in the CAP theorem to have a consistent distributed system: either it will not be 100% available or it will not tolerate network partitions.

The PAXOS algorithm is a well-known solution to the consistency problem – it allows a set of nodes to reach consensus on competing proposals, but the algorithm is quite complex to implement. And it is very slow. Cassandra, for instance, has been catching bugs in its implementation for several years (and probably still has some), and it used 6 round trips to converge on a single write.

But there are several simple optimizations for PAXOS, the most interesting (and the fastest) being FAST PAXOS. Basically, it is a single-master, multiple-learners solution. It is very similar to the Apache Zookeeper Atomic Broadcast (ZAB) protocol. A single master is a point of failure of course, but there are always leader election algorithms, which may or may not converge, and usually they do (this is a tricky place in plain PAXOS – it may easily fail to converge in a single write round).

ZAB and FAST PAXOS work with a single master server which handles all updates and serializes them according to its own rules. Since there is no distributed consensus problem anymore, it can use a 2-phase commit to force all learners to apply an update. It works, and it is the fastest way to enforce consistency, but there are still 2 round trips.

Apache Bookkeeper decided to work around that. They decided that the 2-phase commit can be avoided; instead, data is written directly to multiple ‘bookies’, or replicas – and only by a single writer (more on this below). Since there is no commit phase, a replica does not know whether its data was written to a quorum of nodes or not, so the reader has to read from multiple replicas to find out whether the data it received was actually written to a majority of the nodes.

This ‘write once – read multiple’ scheme is a horrible performance killer on real storage devices. I do not know any real-life workload where the reader tolerates an N-fold number of reads to get its data. Maybe only log collection, but even logs are read *at least* once, which eats up any possible profit from this scheme.

I created such a scheme in the Elliptics distributed storage – the writer puts a timestamp on every record, and if there are consistency issues (for example, 2 writers updated the same key in a different replica order), the reader may use a special API call which reads multiple replicas and compares timestamps – the newest data is returned to the client.
And when I tell you this does not scale – it is true. This design turns every read request in a distributed storage into a centralized operation. It can be acceptable for writes, but not for reads.

And here comes another issue – the single writer. Apache Bookkeeper operates with a single writer at a time. I do not know how they achieve that – maybe there is a master server, or there are unique node-specific IDs which allow serializing requests coming through different nodes – but it is a single-writer solution nonetheless.

Apache Bookkeeper traded a single round trip (2-phase commit vs plain replica update) for an N-fold increase in the number of reads. I call this a very bad deal. Zookeeper has its issues (it has tons of them – Pinterest calls its Zookeeper clusters a ‘single cluster of failure’), but at least it has strong guarantees on its performance.

Distributed cron across the planet

Cron is used on every local machine out there – all systems have periodic tasks which have to be started at a certain time.

But some tasks cover more than a single server – for example, checking another server and performing some action. If we have 10 servers, it is fine for every server to check the others, but if the number of machines is larger – thousands of servers – it is unfeasible to force them all to know about each other.

And yet there is a strong demand to run periodic tasks which cover multiple servers (or whole datacenters). Setting up a dedicated server to run ‘group’ tasks is not always the right solution: the server may fail, it can be overloaded, and a task may be too big to be started on a single server – we might want to set up a group of servers to perform group periodic tasks.

That dedicated group has to be managed – which server runs which task, what to do if a task has failed, what happens when multiple servers want to run the same task, and so on.

Guys from Google presented a paper, “Distributed cron across the planet”, which tells how they solved this problem. It is worth reading even if you do not have thousands of servers (well, normal people do not) – the paper describes a nice ‘cell’ design: a dedicated set of servers, Paxos for management, retry policies, master/slave roles, recovery and so on.

Zab or Paxos

I really did not find a major difference between Fast Paxos and Zab – it is the same message broadcast and quorum detection, as long as there is only a single master (leader) server.

In elliptics (actually the synchronization service is a separate project, which can be used with elliptics to guarantee storage consistency) I decided not to use leader election in its generic form – instead the client library only allows posting messages through a single server (a kind of selected leader; I use simple IDs generated on server startup). It is not a fair leader election, but still only a single server will broadcast updates.

I will also relax ordering – only updates to the same key must be ordered; updates to different keys may complete out of order. This is essential if I use not only a simple set of commands, but complex processing (like what I implemented in the default backend – a libmysqlclient connection to a mysql server, which runs the client’s commands as SQL). In this case some operations may take much longer, and they should not block operations on different keys.

Besides that, Zab has a terrific advantage – it is documented far better and more simply than Paxos, although the algorithm itself is comparable in complexity.

Elliptics: datacenters, open functionality, changes, future

Our team has grown a little and we have started to make things faster.

External projects

First, our friends at Yandex open-sourced their FCGI management daemon: fastcgi-daemon.
It allows a rapid deployment of C++ FCGI applications and has flexible configuration and management tools.
Kudos to Ilya Golubtsov and Vasily Kiritsev from Yandex team.

The first external elliptics application is an HTTP proxy written on top of fastcgi-daemon by Leonid Movsesjan from the Yandex team.
The source code contains example configs which allow reading/writing/deleting data and getting a direct link to an object on a server in the cluster, which can be streamed or downloaded directly by the client.
It easily handles 1000 random-read RPS while connected to 2 elliptics nodes, with each request handled within tens of milliseconds. And although the data set is not very large (just hundreds of thousands of small files, 10 to 50 kbytes each), it is still a spectacular number.


In the meantime our production clusters grew. We will get our first datacenter abroad this year, which will host content for international users. This cluster currently hosts about 150 million rather small objects (10-50 kbytes each) with 3 copies in physically distributed datacenters.
Our second production cluster comes close to the half-petabyte scale (4 datacenters, about 200 elliptics nodes); its network load reaches 3.5 gbit/s during the daytime, although this scale came at a fair technical price – we had to rewrite the IO model again.

IO thread models

We recently switched the IO model in the elliptics core from thread-per-client to a single network IO thread, which implements a non-blocking reactor, plus a pool of disk IO threads. Elliptics maintains O(1) lookup times, which implies that every node knows about every other node; this in turn requires N^2 network connections in the cluster, and thus the same number of IO threads in the thread-per-client model.
For 200 nodes this is 4k threads/connections; each machine with 24 disks starts 24 elliptics nodes, ending up with almost 100,000 sockets and threads at peak. Linux simply cannot start that many threads by default (we used 2.6.32 and .38 kernels) on 64-bit servers with 24 Gb of RAM.
And even if we tune some things, it would hit a limit at 500 nodes and so forth. Eventually this is a dead end.

Now we use an IO thread pool and a network processing thread which handles sockets via epoll(). We dropped libevent and friends, since it is still extremely inconvenient and ugly to add events from one thread into another’s processing loop.

Future plans

We have started to think about secondary indexes in the elliptics network. We can easily build slow-enough map-reduce-like processing on top of our data, but it will not, for example, quickly return prefix-search results, although we can run a brute-force regex-like search.

I will continue the PAXOS implementation – we do want to build a Chubby-like service for small synchronized data storage as well as locks and counters. Combined with elliptics it should provide a full operational stack for data storage.

POHMELFS is not dead, IT IS NOT DEAD, please remember this :)
I’m just a little bit busy with things.

I am thinking about dropping transaction support in elliptics and always performing rewrite-in-place. Ext* filesystems live without this feature, compared to Btrfs, and everyone is happy. Well, it is a cool feature, but hardly needed by everyone.

Hmm, I bought myself a small Yamaha KX-49 MIDI keyboard, and it even works in Linux with Rosegarden (modulo non-working controls), and I continue to play the trumpet, but that is another story.
I will try to post here more regularly, although most of the time twitter wins :)

Multi-PAXOS consensus system implementation

We used to rely on management servers to handle our data. When it comes to a huge number of simple requests we can use NoSQL solutions; when we need more relations, it is time for the more classic SQL databases. Sometimes we need a lock server which serializes data access among multiple users.

But no matter what, we want our data to be safe, which implies multiple copies. For NoSQL it is quite simple – for example, the elliptics network has supported it from day one – and SQL solutions kind of have master-master replication. But what about a more generic system? A system which works with a private protocol and requires some storage for its internal information. To make such a system fault tolerant, one has to implement a distributed consensus algorithm.

Such an algorithm has to be resistant to crashes and lost messages and yet allow a limited set of nodes to participate in reaching consensus. The PAXOS algorithm is such a monster.

I created an open source GPL PAXOS implementation, originally as a lock cluster for POHMELFS, but early in the development cycle I decided that it could be more than just a locking service.

Instead I implemented a multi-PAXOS core as a log of commands accepted by a consensus of distributed nodes. Each command is the result of a dedicated PAXOS round. When accepted, the command is stored in a persistent database (I use Kyoto Cabinet) and ‘sent’ to the backend system which actually performs it.

For the locking service this can be a system which stores information about a locked value, for example.
But I moved a bit further – I implemented an SQL backend which, combined with the multi-PAXOS algorithm, forms true master-master replication.

Actually the backend is just a shared library loaded by the PAXOS core module at startup; it has to provide processing and initialization functions, which are used to actually store data on the required media. SQL was the most convenient way to solve our current needs.

Using the SQL backend one can easily implement a locking service – it is just a matter of a proper table record.
One can read data of course – reads are implemented not through PAXOS (which is a rather costly procedure), but directly from the accepting nodes. In contrast, all writes go through the multi-PAXOS state machines, which guarantee that only a single value in each DSA index will be accepted by all active nodes.

The code is actually in a beta stage, but there are already things worth mentioning:

  • multi-PAXOS protocol implementation
  • single proposer selection by client for optimization
  • modular IO backend system, each backend being a shared library loaded at startup
  • MySQL IO backend which forms master-master SQL replication being combined with this multi-PAXOS algorithm
  • simple access protocol to PAXOS cluster, test client included
  • client’s shared library with majority of needed calls implemented

And yet there are some caveats scheduled for the next release:

  • automatic reconnection to acceptors
  • real log recovery – currently if an acceptor fails and then recovers (or a new one is used instead), it will continue to participate in PAXOS update operations, but it will also allow its data to be read directly, which will not be consistent with the rest of the system; it does not yet recover its log from its neighbours
  • no monitoring tools
  • no other language bindings, only C library

And in the meantime, enjoy lurking through the git tree or the example file.

Fly, baby, fly: initial PAXOS consensus daemon implementation commit

As promised (long, long, freaking long ago) I have started the next project: a distributed consensus algorithm written on top of PAXOS.
I believe I have started to understand its main features, and thus progress will be noticeable. This will likely become a locking system for POHMELFS, implemented as a network service.

This is not a full PAXOS implementation yet, just the first commit (written today):

git commit --stat -a -m "Initial PAXOS consensus daemon import."
[master (root-commit) 40c0301] Initial PAXOS consensus daemon import.
 18 files changed, 3949 insertions(+), 0 deletions(-)

The most interesting part actually is quite small:

$ wc -l src/pxs.c src/lnet.c include/lnet/lnet.h include/lnet/atomic.h
  155 src/pxs.c
  371 src/lnet.c
  168 include/lnet/lnet.h
  143 include/lnet/atomic.h
  837 total

To start with, we will not support multi-PAXOS (currently we actually support nothing of real PAXOS, only an initial stub); instead we will work on some obscure current branch, which should be enough for atomic operations and maybe locking. Also there will be no master-proposer selection algorithm – we will use a hash of the address or something like that, which is not a 100% bullet-proof idea though.

Later the state machines will be extended to support real distributed multi-PAXOS.