Recent elliptics changes
It has been a while since I last wrote here, but a fair number of things have happened.
We fixed a number of bugs in the server-side scripting implementation, so it is very stable now and makes it easy to create scripts which not only perform basic tasks, but also rather complex transactions, like reading, processing and updating multiple records.
We added transaction locks, which allow operations on a given key to be performed atomically on a single replica. With these locks a server-side script which reads data, updates its structure and uploads it back is executed atomically on that replica.
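A minimal sketch of the idea in Python (this is an in-memory model, not the elliptics API; the Replica class and atomic_update() helper are made up for illustration): a per-key lock makes the whole read/process/write cycle atomic on one replica.

    import threading
    from collections import defaultdict

    class Replica:
        def __init__(self):
            self.data = {}
            # One lock per key: a hypothetical stand-in for transaction locks.
            self.locks = defaultdict(threading.Lock)

        def atomic_update(self, key, process):
            # The whole read/process/write cycle runs under the key's lock,
            # so concurrent updates to the same key cannot interleave.
            with self.locks[key]:
                self.data[key] = process(self.data.get(key))

    replica = Replica()
    replica.atomic_update("counter", lambda value: (value or 0) + 1)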
There are many other distributed storage systems which allow atomic key operations, but they do not support splitting replicas across datacenters. Usually atomicity is implemented (as in Oracle NoSQL or MongoDB) with a single master node which copies data to multiple replicas. That copy may even be synchronous and safe.
And it is possible to put those replicas into different datacenters. But what happens when you want to add more datacenters without increasing the number of copies? In this scheme one has to rebalance the replica sets.
In elliptics you just specify which datacenters you want to use for a given IO transaction. This forces replica updates to be non-atomic; they happen in parallel. The result is faster writes and a more flexible setup, but we cannot guarantee that replicas are updated synchronously.
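A rough Python sketch of how such a write behaves (the datacenter names and the write_one() helper are invented for illustration, not the elliptics API): every replica named in the call is updated in parallel, with no ordering or atomicity between them.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical per-datacenter stores.
    stores = {"dc1": {}, "dc2": {}, "dc3": {}}

    def write_one(dc, key, value):
        stores[dc][key] = value

    def write(key, value, dcs):
        # Replicas are updated in parallel and independently: writes are
        # fast and the setup flexible, but nothing keeps them in lockstep.
        with ThreadPoolExecutor(max_workers=len(dcs)) as pool:
            for f in [pool.submit(write_one, dc, key, value) for dc in dcs]:
                f.result()

    # Only the datacenters named in the call receive this transaction.
    write("key", b"value", dcs=["dc1", "dc3"])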
And actually, even with the MongoDB scheme replicas may end up out of sync: a failed node may have lost some data, so reads from that node will not return consistent results.
The real fix for this problem lies outside the low-level storage subsystem. It would be like forcing a block device to lock against multiple processes writing to the same file; instead, that protection lives in higher layers and generally requires external locks which are known to and used by all concurrent users.
In the distributed world such locking is based on Paxos with optimizations (like Fast Paxos or ZooKeeper Atomic Broadcast and friends). We do not yet have such a subsystem.
Another change is embedded checksums. Previously we stored them in blobs after each data record, but never used them; instead there was a separate command to calculate a checksum and store it in metadata (a separate column). This could result in a stale checksum sitting in metadata, so a parallel read could fail with an inconsistent-data error. Right now metadata is just storage for replica information, update/check status dates and other meta information (namespace and so on).
Each write now updates the checksum in the blob atomically (again, this is not synchronized between replicas), so any read always gets a correct checksum check if the data was not corrupted.
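Here is a sketch of the record layout idea in Python; the actual eblob on-disk format differs, and pack_record()/unpack_record() are invented names, but the principle is the same: the checksum is written in the same append as the data, so it can never go stale.

    import hashlib
    import struct

    def pack_record(data):
        # Size header, data, then the checksum of that exact data.
        csum = hashlib.sha512(data).digest()
        return struct.pack("<Q", len(data)) + data + csum

    def unpack_record(blob, offset):
        (size,) = struct.unpack_from("<Q", blob, offset)
        data = blob[offset + 8 : offset + 8 + size]
        csum = blob[offset + 8 + size : offset + 8 + size + 64]
        # A mismatch can only mean corruption, never a stale checksum.
        if hashlib.sha512(data).digest() != csum:
            raise IOError("record checksum mismatch: data corrupted")
        return data

    blob = pack_record(b"hello")
    assert unpack_record(blob, 0) == b"hello"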
We also added bulk IO operations, namely bulk read and bulk write. Bulk write issues many parallel writes and waits for all of them to complete, so its latency is essentially that of the slowest node, while issuing the same writes sequentially takes as long as the sum of all the writes in question.
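A toy Python model of that latency difference (the sleep-based fake_write() simulates per-node write times and is purely illustrative):

    import time
    from concurrent.futures import ThreadPoolExecutor

    def fake_write(delay):
        time.sleep(delay)

    delays = [0.05, 0.10, 0.30]  # simulated per-node write times

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        list(pool.map(fake_write, delays))
    parallel = time.monotonic() - start    # ~max(delays) = 0.30s

    start = time.monotonic()
    for d in delays:
        fake_write(d)
    sequential = time.monotonic() - start  # ~sum(delays) = 0.45s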
Bulk read does a similar thing, but also splits and optimizes the reads using offset sorting (i.e. we read data from disk in sequential order).
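A sketch of that optimization in Python (the (offset, size) request list and the bulk_read() helper are illustrative, not the real API): sorting requests by offset turns scattered reads into a single forward pass over the file.

    import os

    def bulk_read(fd, requests):
        # Sort by offset so the disk is touched in sequential order.
        results = {}
        for offset, size in sorted(requests):
            results[offset] = os.pread(fd, size, offset)
        return results

    # Usage: read three records scattered through a blob file.
    # fd = os.open("data.blob", os.O_RDONLY)
    # chunks = bulk_read(fd, [(4096, 512), (0, 512), (8192, 512)])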