PAXOS

Zab or Paxos

Tagged:  

I really did not find a major difference between fast Paxos and Zab - it is the same message broadcast and quorum detection as long as there is only single master (leader) server.

In elliptics (actually synchronization service is a separate project, which can be used with elliptics to guarantee storage consistency) I do not to use leader election in a generic form - instead client library only allows to post messages through single server (kinda selected leader, I use simple IDs generated on server startup). It is not fair leader election, but yet only single server will broadcast updates.

I will also relax ordering - only updates with the same key must be ordered, updates with different keys may be completed out of order. This is essental, if I use not only single set of commands, but complex processing (like what I implemented in default backend - libmysqlclient client to mysql server, which runs client's commands as sql). In this case some operations may take much longer which should not block operations on different keys.

Besides that, Zab has a terrific advantage - it is documented way better and simpler than Paxos, although algorithm itself is comparable in complexity

Elliptics: datacenters, open functionality, changes, future

Tagged:  

Our team has grown up a little and we started to make things faster.

External projects

First, our friends in Yandex open-sourced FCGI management daemon: fastcgi-daemon
It allows a rapid deployment of C++ FCGI applications and has flexible configuration and management tools.
Kudos to Ilya Golubtsov and Vasily Kiritsev from Yandex team.

The first external elliptics application is HTTP proxy written on top of fastcgi-daemon by Leonid Movsesjan from Yandex team.
Source code contains example configs, which allow to read/write/delete data, get direct link to the object on the server in cluster, which can be streamlined or downloaded directly by client.
It easily handles 1000 random read RPS being connected to 2 elliptics nodes, where each request is handled just within tens of milliseconds. And although data set is not very large (just hundreds of thousands of small files from 10 to 50 kbytes each), it is still a spectacular number.

Datacenters

in a meantime our production clusters grew up. We will get first abroad datacenter this year, which will host content for international users. This cluster currently hosts about 150 millions of rather small objects (10-50 kbytes each) with 3 copies in physically distributed datacenters.
Our second production cluster comes close to half-petabyte scale (4 datacenters, about 200 elliptics nodes), its network load reaches 3.5 gbit/s daytime, although this scale took a fair technical price - we had to rewrite IO model again.

IO thread models

We recently switched IO model in elliptics core from thread-per-client to single network IO thread, which implements non-blocking reactor, and pool of disk IO threads. Elliptics maintains O(1) lookup times which implies that every node knows about every other, which in turn requires N^2 network connections in cluster and thus the same number of IO threads in thread-per-client model.
For 200 nodes this is 4k threads/connections, each machine with 24 disks starts 24 elliptics nodes, ending up with almost 100.000 sockets and threads in peaks. Linux just can not start that much threads by default (we used 2.6.32 and .38 kernels) on 64-bit servers with 24 Gb of RAM.
And even if we tune some things, it will stop at 500 nodes limit and so forth. Eventually this is a dead end.

Now we use IO thread pool and network processing thread, which handles sockets via epoll(). We dropped libevent and friends, since it is still extremely unconvenient and ugly to add events from one thread into other's processing loops.

Future plans

We started to think about secondary indexes in elliptics network. We can easily build slow-enough map-reduce-like processing on top of our data, but for example it will not quickly return prefix-search data, although we can run brute-force regex-like search.

I will continue PAXOS implementation - we do want to build Chubby-like service for small synchronized data storages as well as locks and counters. Combined with elliptics it should provide a full operation stack for data storage.

POHMELFS is not dead, IT IS NOT DEAD, please remember this :)
I'm just a little bit busy with things.

I think about dropping transactions support in elliptics and always perform rewrite-in-place. Ext* filesystems live without this feature compared to Btrfs and everyone is happy. Well, this is a cool feature, but hardly needed for everyone.

Hmm, I bought myself a small Yamaha KX-49 MIDI-keyboard and it even works in Linux with Rosegarden (modulo non-working controls) and continue to play trumpet, but this is another story.
I will try to post more regulary here, although most of the time twitter wins :)

Multi-PAXOS consensus system implementation

Tagged:  

We used to rely on management servers to handle our data. When it comes to huge amount of simple requests we can use NoSQL solutions, when we need more relations its time for more classic SQL databases. Sometimes we need a lock server, which will serialize data access among multiple users.

But no matter what, we want our data to be safe, which implies multiple copies. For NoSQL it is quite simple, for example elliptics network supports it from day one, SQL solutions kind of have master-master replication. But what about more generic system? A system which works with private protocol and requires some storage for its internal information. To make this system fault tolerant one has to implement a distributed consensus algorithm.

Such algorithm has to be crash and message-lost resistant and yet allow limited set of nodes to participate in consensus creation. PAXOS algorithm is such a monster.

I created open source GPL PAXOS implementation originally for lock cluster for POHMELFS, but in early development cycle decided that it could be more than just a locking service.

Instead I implemented multi-PAXOS core as a log of commands accepted by consensus of distributed nodes. Each command is a result of dedicated PAXOS algorithm roundup. When accepted, this command is stored in the persistent database (I use Kyoto Cabinet) and 'sent' to backend system which actually performs it.

For the locking service it can be a system, which stores information about locked value, for example.
But I moved a little bit further - I implemented SQL backend, which if being combined with multi-PAXOS algorithm, forms a true master-master replication.

Actually backend is just a shared library loaded by PAXOS core module at startup, it has to provide processing and initialization functions, which are used to actually store data to the required media. SQL was the most convenient way to solve our current needs.

Using SQL backed one can easily implement locking service - its just a matter of proper table record.
One can read data of course - it is implemented not through PAXOS (which is rather costly procedure), but directly from accepting nodes. Contrary all writes go through multi-PAXOS state machines, which guarantee that only single value in each DSA index will be accepted by all active nodes.

Code is actually in a beta stage, but there are things to mention already:

  • multi-PAXOS protocol implementation
  • single proposer selection by client for optimization
  • modular IO backend system, each backend is a shared library loaded at astartup
  • MySQL IO backend which forms master-master SQL replication being combined with this multi-PAXOS algorithm
  • simple access protocol to PAXOS cluster, test client included
  • client's shared library with majority of needed calls implemented

And yet there are some caveats scheduled for the next release:

  • automatic reconnection to acceptors
  • real log recovery - currently if acceptor fails and then restores (or new one is used instead), it will continue to participate in update PAXOS operations, but will also allow to read its data directly, which will not be consistent with the rest of the system. It does not recover its log from neightbours yet
  • no monitoring tools
  • no other language bindings, only C library

And in a meantime, enjoy lurking through git tree or example file.

Fly, baby, fly: initial PAXOS consensus daemon implementation commit

Tagged:  

As promised (long-long-freaking-long ago) I started next project, and this is a distributed consensus algorithm written on top of PAXOS.
I believe I started to understand its main feature, and thus progress will be noticebly. This will likely be a locking system for POHMELFS and implemented as a network service.

This is not particulary PAXOS implementation, but the first commit only (written today):

git commit --stat -a -m "Initial PAXOS consensus daemon import."
[master (root-commit) 40c0301] Initial PAXOS consensus daemon import.
 18 files changed, 3949 insertions(+), 0 deletions(-)

The most interesting part actually is quite small:

$ wc -l src/pxs.c src/lnet.c include/lnet/lnet.h include/lnet/atomic.h
  155 src/pxs.c
  371 src/lnet.c
  168 include/lnet/lnet.h
  143 include/lnet/atomic.h
  837 total

For the start, we will not support multi-PAXOS (currently we actually support nothing out of real PAXOS, only initial stub), instead we will work on some obscure current branch, which should be enough for atomic operations and maybe locking. Also there will be no master producer selection algorithm - we will use hash of the address or something like that, which is not a 100% bullet-proof idea though.

Later state machines will be extended to support real distributed multi-PAXOS.

Syndicate content