Initial BLOB IO backend implementation in elliptics network

Elliptics network is a modular distributed hash table that lets one implement and plug in different IO backends, selected via config. An IO backend is a fairly simple entity: it stores data to media and reads it back using the provided transaction ID.

BLOB IO backend is so far a trivial append-only array of variable-sized elements, stored one after another on disk. Each entry’s offset is kept in a hash table in RAM, indexed by transaction ID. Currently there is not even an on-disk ID index: to create this hash table in memory, the initialization code runs over the whole file, jumping from entry to entry. With 10 million entries stored on a single node (about 44 GB of data) initialization takes about 8-9 minutes, so it is likely a good idea to implement an external index.
The hash table is never swapped to disk if it is configured to be locked in RAM.
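
To make the startup scan concrete, here is a minimal sketch of an append-only blob and the index rebuild that walks it from entry to entry. The record header used here (an 8-byte ID followed by an 8-byte payload size) and the function names are assumptions made for illustration, not the actual elliptics on-disk format.

```python
import os
import struct
import tempfile

# Hypothetical record header: 8-byte ID followed by an 8-byte payload size.
HEADER = struct.Struct("<8sQ")

def append_entry(f, entry_id, payload):
    """Append one record at the end of the blob and return its offset."""
    f.seek(0, os.SEEK_END)
    offset = f.tell()
    f.write(HEADER.pack(entry_id, len(payload)))
    f.write(payload)
    return offset

def rebuild_index(f):
    """Walk the whole blob, jumping from header to header."""
    index = {}
    f.seek(0, os.SEEK_END)
    end = f.tell()
    offset = 0
    while offset + HEADER.size <= end:
        f.seek(offset)
        entry_id, size = HEADER.unpack(f.read(HEADER.size))
        # Remember where the payload starts and how long it is.
        index[entry_id] = (offset + HEADER.size, size)
        offset += HEADER.size + size
    return index

def demo_index():
    with tempfile.TemporaryFile() as f:
        append_entry(f, b"id-00001", b"hello")
        append_entry(f, b"id-00002", b"blob backend")
        return rebuild_index(f)
```

Every header has to be read to find the next one, so the scan time grows with the number of entries; an external index would store the same (ID, offset, size) tuples separately and avoid touching the data file at startup.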

Thus, to get an object, we ask the hash table for the object’s offset and read it directly from the storage (actually, send it via sendfile()). In theory this should be noticeably faster than the filesystem IO backend, where each object is stored in a separate file.
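
The read path can be sketched roughly as follows: one in-memory lookup, then a single ranged send from the blob file to the client socket. This uses Python's socket.sendfile(), which relies on os.sendfile() where the platform provides it; the index layout (ID mapped to payload offset and size) and the names send_object and demo_send are assumptions for illustration only.

```python
import socket
import tempfile

def send_object(sock, blob, index, entry_id):
    """Look up the payload location in RAM and send it straight from the file."""
    offset, size = index[entry_id]
    return sock.sendfile(blob, offset=offset, count=size)

def demo_send():
    with tempfile.TemporaryFile() as blob:
        # Three payloads packed back to back; only the middle one is requested.
        blob.write(b"AAAAhelloBBBB")
        blob.flush()
        index = {b"obj": (4, 5)}  # hypothetical index: ID -> (offset, size)
        a, b = socket.socketpair()
        with a, b:
            sent = send_object(a, blob, index, b"obj")
            return sent, b.recv(16)
```

No directory lookup or open() per object is needed here, which is where the expected advantage over the file-per-object backend comes from.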

Let’s see raw results for 2 SAS storages (16 disks each) holding about 10 million data objects (about 30 million in total, since there are 2 additional history objects for each data object). To handle a single request we have to read two objects from disk: a history object (parse it and get the ID of the selected version) and the data object (using the ID just read). It is of course possible to disable versioning and get data with a single disk read.

Filesystem is default ext4 on a 2.6.34 kernel. The machine has about 24 GB of RAM. Requests are random.

600 rps within 200 ms, 1000 rps within 300 ms

Now compare the BLOB and filesystem IO backends.

Clearly blob is about 2 times faster than filesystem at the beginning (green is blob, violet is file-per-object, aka the filesystem backend), but over time they become equal, likely because the hardware queues fill up.

Although I play some simple tricks with read-ahead in the blob backend, I still want to test with data stored on a raw block device, thus eliminating potential FS overhead: although ext4 has extents, it may still require multiple seeks to read different blocks of a single file. The direct-IO case may be useful too.

The main problem with blobs is object removal support. In common web scenarios we can just mark an object as removed and drop it from the index without even trying to ‘squeeze’ the blob file, but in real life some external application (or the IO backend itself, triggered by a timeout or whatever else) should be able to compact blobs.
Back to the drawing board…
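
One possible shape for removal and compaction, purely as a sketch: removal flips a flag in the record header, and compaction copies only live records into a fresh blob while building a new index. The header layout with a one-byte "removed" flag and all function names are hypothetical; nothing here is the design elliptics settled on.

```python
import io
import struct

# Hypothetical header: 8-byte ID, 8-byte payload size, 1-byte "removed" flag.
HEADER = struct.Struct("<8sQB")

def append(blob, entry_id, payload):
    """Append a live record and return the offset of its header."""
    blob.seek(0, io.SEEK_END)
    offset = blob.tell()
    blob.write(HEADER.pack(entry_id, len(payload), 0))
    blob.write(payload)
    return offset

def mark_removed(blob, offset):
    """Flip the flag in place; the payload stays on disk until compaction."""
    blob.seek(offset)
    entry_id, size, _ = HEADER.unpack(blob.read(HEADER.size))
    blob.seek(offset)
    blob.write(HEADER.pack(entry_id, size, 1))

def compact(src, dst):
    """Copy live records from src to dst and return the new ID -> offset index."""
    index = {}
    end = src.seek(0, io.SEEK_END)
    offset = 0
    while offset + HEADER.size <= end:
        src.seek(offset)
        entry_id, size, removed = HEADER.unpack(src.read(HEADER.size))
        payload = src.read(size)
        if not removed:
            index[entry_id] = append(dst, entry_id, payload)
        offset += HEADER.size + size
    return index

def demo_compact():
    src, dst = io.BytesIO(), io.BytesIO()
    append(src, b"object-a", b"alpha")
    off_b = append(src, b"object-b", b"beta")
    append(src, b"object-c", b"gamma")
    mark_removed(src, off_b)
    return compact(src, dst), len(dst.getvalue())
```

Keeping the flag on disk means a removed record does not reappear when the index is rebuilt by scanning, at the cost of one small in-place write per removal.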