Consistency in a distributed system is hard; it is even a fundamental limitation – everyone knows that one has to sacrifice a letter of the CAP theorem to have a consistent distributed system: either it will not be 100% available or it will not tolerate network partitions.
The PAXOS algorithm is the known solution to the consistency problem – it allows a set of nodes to reach consensus on competing proposals, but it is quite complex to implement. And it is very slow. Cassandra, for instance, chased bugs in its implementation for several years (and probably still has some), and it used 6 round trips to converge on a single write.
But there are several simple optimizations for PAXOS; the most interesting (and the fastest) is FAST PAXOS. Basically, it is a single-master, multiple-learners solution, very similar to the Apache Zookeeper Atomic Broadcast (ZAB) protocol. A single master is a point of failure of course, but there are always leader election algorithms, which may or may not converge, and usually they do (this is a tricky spot in plain PAXOS – it can easily fail to converge in a single write round).
ZAB and FAST PAXOS work with a single master server which handles all updates and serializes them according to its own rules. Since there is no distributed consensus problem anymore, the master can use a 2-phase commit to push every update to all learners. It works, and it is the fastest consistency-enforcing scheme, but it still costs 2 round trips.
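To make those 2 round trips concrete, here is a minimal in-process sketch of that master-driven flow. The Learner/Master classes and their propose()/commit() methods are made up for illustration – ZAB's real protocol adds persistent logs, epochs and crash recovery on top of this shape:

```python
class Learner:
    def __init__(self):
        self.pending = {}   # txid -> (key, value), proposed but not committed
        self.store = {}     # committed state

    def propose(self, txid, key, value):
        # Round trip #1: persist the proposal and ack the master.
        self.pending[txid] = (key, value)
        return True

    def commit(self, txid):
        # Round trip #2: apply the proposal the master confirmed.
        key, value = self.pending.pop(txid)
        self.store[key] = value

class Master:
    def __init__(self, learners):
        self.learners = learners
        self.txid = 0

    def update(self, key, value):
        # The master alone orders updates, so no consensus is needed,
        # but every update still pays for two round trips.
        self.txid += 1
        acks = sum(l.propose(self.txid, key, value) for l in self.learners)
        if acks > len(self.learners) // 2:   # quorum of acks
            for l in self.learners:
                l.commit(self.txid)
            return True
        return False

# master = Master([Learner() for _ in range(3)])
# master.update('key', 'value')
```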
Apache BookKeeper decided to work around that. Its authors decided that the 2-phase commit can be avoided: instead, data is written directly to multiple ‘bookies’, i.e. replicas, and only by a single writer (more on this below). Since there is no commit phase, a replica does not know whether its data was written to a quorum of nodes or not, so the reader has to read from multiple replicas to find out whether the data it received actually reached a majority of the nodes.
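Roughly, the trade looks like this – a sketch of the idea only, not BookKeeper's actual API. The write saves a round trip, and the read pays for it:

```python
from collections import Counter

def write(replicas, key, value):
    # Single round trip: no prepare/commit, just store on every replica.
    for r in replicas:
        r[key] = value

def read(replicas, key):
    # The reader pays for the missing commit phase: it must fetch the key
    # from multiple replicas to learn whether a value reached a quorum.
    votes = Counter(r[key] for r in replicas if key in r)
    if not votes:
        raise KeyError(key)
    value, count = votes.most_common(1)[0]
    if count > len(replicas) // 2:
        return value
    raise RuntimeError('no value reached a majority of replicas')

# replicas = [dict() for _ in range(3)]
# write(replicas, 'key', 'value'); read(replicas, 'key')
```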
This ‘write once – read multiple’ pattern is a horrible performance killer on real storage devices. I do not know of any real-life workload where the reader tolerates an N-fold number of reads to get its data. Maybe only log collection, but even logs are read *at least* once, which eats up any possible profit from this scheme.
I created such a scheme in the Elliptics distributed storage – the writer puts a timestamp on every record, and if there are consistency issues (for example, 2 writers updated the same key in a different replica order), the reader may use a special API call which reads multiple replicas and compares timestamps – the newest data is returned to the client.
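A schematic version of that call, with hypothetical names rather than the real Elliptics API:

```python
import time

def write_stamped(replicas, key, data):
    # Every record carries the writer's timestamp.
    record = (time.time(), data)
    for r in replicas:
        r[key] = record

def read_latest(replicas, key):
    # The recovery read: fetch the key from every replica and
    # return the copy with the newest timestamp.
    records = [r[key] for r in replicas if key in r]
    if not records:
        raise KeyError(key)
    return max(records, key=lambda rec: rec[0])[1]
```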
And when I tell you this doesn’t scale – it is true. This design turns a distributed storage into a centralized entity for every read request. That can be acceptable for writes, but not for reads.
And here comes another issue – the single writer. Apache BookKeeper operates with a single writer at a time. I do not know how they achieve that – maybe there is a master server, or unique node-specific IDs which allow requests coming through different nodes to be serialized – but it is a single-writer solution nonetheless.
Apache BookKeeper traded a single round trip (2-phase commit vs a plain replica update) for an N-fold increase in the number of reads. I call this a very bad deal. Zookeeper has its issues (it has tons of them – Pinterest calls its Zookeeper clusters a ‘single cluster of failure’), but at least it has strong guarantees on its performance.