ioremap.net

Storage and beyond

Multi-PAXOS consensus system implementation

We used to rely on management servers to handle our data. When it comes to huge amount of simple requests we can use NoSQL solutions, when we need more relations its time for more classic SQL databases. Sometimes we need a lock server, which will serialize data access among multiple users.

But no matter what, we want our data to be safe, which implies multiple copies. For NoSQL it is quite simple, for example elliptics network supports it from day one, SQL solutions kind of have master-master replication. But what about more generic system? A system which works with private protocol and requires some storage for its internal information. To make this system fault tolerant one has to implement a distributed consensus algorithm.

Such algorithm has to be crash and message-lost resistant and yet allow limited set of nodes to participate in consensus creation. PAXOS algorithm is such a monster.

I created open source GPL PAXOS implementation originally for lock cluster for POHMELFS, but in early development cycle decided that it could be more than just a locking service.

Instead I implemented multi-PAXOS core as a log of commands accepted by consensus of distributed nodes. Each command is a result of dedicated PAXOS algorithm roundup. When accepted, this command is stored in the persistent database (I use Kyoto Cabinet) and ‘sent’ to backend system which actually performs it.

For the locking service it can be a system, which stores information about locked value, for example.
But I moved a little bit further – I implemented SQL backend, which if being combined with multi-PAXOS algorithm, forms a true master-master replication.

Actually backend is just a shared library loaded by PAXOS core module at startup, it has to provide processing and initialization functions, which are used to actually store data to the required media. SQL was the most convenient way to solve our current needs.

Using SQL backed one can easily implement locking service – its just a matter of proper table record.
One can read data of course – it is implemented not through PAXOS (which is rather costly procedure), but directly from accepting nodes. Contrary all writes go through multi-PAXOS state machines, which guarantee that only single value in each DSA index will be accepted by all active nodes.

Code is actually in a beta stage, but there are things to mention already:

  • multi-PAXOS protocol implementation
  • single proposer selection by client for optimization
  • modular IO backend system, each backend is a shared library loaded at astartup
  • MySQL IO backend which forms master-master SQL replication being combined with this multi-PAXOS algorithm
  • simple access protocol to PAXOS cluster, test client included
  • client’s shared library with majority of needed calls implemented

And yet there are some caveats scheduled for the next release:

  • automatic reconnection to acceptors
  • real log recovery – currently if acceptor fails and then restores (or new one is used instead), it will continue to participate in update PAXOS operations, but will also allow to read its data directly, which will not be consistent with the rest of the system. It does not recover its log from neightbours yet
  • no monitoring tools
  • no other language bindings, only C library

And in a meantime, enjoy lurking through git tree or example file.

Comments are currently closed.