
Why DHTs suck, or do they?

A DHT is essentially the same hash table, just spread over multiple nodes. Thus it shares its advantages, like O(1) access when properly configured, and its weak parts, like the complete lack of scalability.

Yes, a hash table does not scale. To change its size one has to perform lots of tricks, which generally end up in a full rehash of the table contents. With millions of entries on disk this may take a while…

Thus we cannot easily extend hash-table-based storage: each node addition requires a table rebuild. Depending on the DHT configuration and the routing protocol used, that means either full or partial content copying.
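To get a feel for how bad the naive case is, here is a minimal Python sketch (key and node names are made up, and the modulo placement is deliberately naive, not how a real DHT routes) that counts how many keys change owners when a single node is added to a hash(key) % N layout:

    import hashlib

    def owner(key, nodes):
        # Naive placement: hash the key, take it modulo the number of nodes.
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return nodes[h % len(nodes)]

    keys = ["key-%d" % i for i in range(100000)]
    old_nodes = ["node-%d" % i for i in range(4)]
    new_nodes = old_nodes + ["node-4"]   # one node added

    moved = sum(owner(k, old_nodes) != owner(k, new_nodes) for k in keys)
    print("%.1f%% of keys changed owner" % (100.0 * moved / len(keys)))
    # With modulo placement roughly N/(N+1) of all keys (~80% here) have to
    # be copied just because the node count went from 4 to 5.

Smarter routing (hash ranges, consistent hashing) reduces this to the new node's share of the keyspace, but some copying remains unavoidable.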

Unlike a hash table, classical distributed storage with a dedicated master server can scale without copying data each time a new server is added. The master server simply returns the new node's address for each new write, then the next node's, and so on.
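A toy sketch of that allocation policy (hypothetical node names, plain round-robin only, not any particular real master implementation):

    class Master(object):
        """Toy master server: every new write goes to the next node in turn,
        so a freshly added node starts receiving writes immediately."""

        def __init__(self, nodes):
            self.nodes = list(nodes)
            self.counter = 0
            self.location = {}            # key -> node that stores it

        def add_node(self, node):
            # No data is copied: the new node simply joins the rotation.
            self.nodes.append(node)

        def write(self, key):
            node = self.nodes[self.counter % len(self.nodes)]
            self.counter += 1
            self.location[key] = node
            return node

        def read(self, key):
            # The master remembers where every key lives, so reads need no hashing.
            return self.location[key]

    m = Master(["node-0", "node-1"])
    m.write("photo-1")
    m.add_node("node-2")                  # scales out without moving anything
    m.write("photo-2")
    m.write("photo-3")
    print(m.read("photo-2"), m.read("photo-3"))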

This looks all shiny and nice in theory. Let's face the practice.

Data tends to wear out with time: we do not look at old photos and do not read old mail as frequently as we access recently written information. Thus, in the master-server example described above, the servers which host old data will be barely loaded compared to the new nodes.
To fix this issue the master server has to start copying data: old servers move their old, rarely accessed data to some new nodes, so that new writes can go to the old nodes too. This task is full of non-trivial heuristics about which data to move and which server to use. Over time it ends up copying data for each new write (or, more usually, for each sufficiently large chunk written).

In a distributed hash table this copying is not needed, since by design writes are balanced across the whole storage (provided a cryptographically strong hash function and an appropriate storage size are configured, of course).
In a DHT there is no data-aging problem, since each write goes to a (more or less) random node, so all nodes contain roughly the same amount of old and new data.
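A quick way to convince yourself of this, as a sketch with made-up key and node names, where "old" and "new" keys are hashed onto the same set of nodes:

    import hashlib
    from collections import Counter

    def owner(key, nodes):
        # The hash of the key, not its age, decides where it is stored.
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return nodes[h % len(nodes)]

    nodes = ["node-%d" % i for i in range(8)]
    # Pretend the first half of the keys is "old" data and the second half is "new".
    old_keys = ["photo-%d" % i for i in range(50000)]
    new_keys = ["photo-%d" % i for i in range(50000, 100000)]

    print(Counter(owner(k, nodes) for k in old_keys))
    print(Counter(owner(k, nodes) for k in new_keys))
    # Both counters come out roughly equal per node: every node holds a similar
    # mix of old and new data, so the read load does not drift as data ages.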

To draw the line: in a DHT we have to copy data when a new node is added, to take some load off its neighbours, while in the master-server scenario we have to copy each time new data is added (this can be limited to large chunks, of course). In effect the master-server scenario keeps copying data towards the distribution a DHT provides out of the box, and it still has to copy again when new nodes are added.

In some cases this is not an issue: we may decide, for whatever reason, not to start data redistribution in master-server storage, or we may simply not foresee that the demand will ever appear. In a DHT, however, copying data when a new node is added is a must; otherwise the data will not be accessible, since the hash distribution over the nodes has changed.
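For completeness, here is a sketch of the ring-style placement many DHTs use (node names are made up, and this is not necessarily how any particular DHT routes); it shows that the keys which must be copied to a joining node are exactly those that now hash into its range, and nothing else moves:

    import hashlib
    from bisect import bisect_right

    def hval(s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def build_ring(nodes):
        # Each node owns the arc of the hash ring that ends at its own hash.
        return sorted((hval(n), n) for n in nodes)

    def owner(key, ring):
        # A key belongs to the first node clockwise from its hash,
        # wrapping around to the start of the ring.
        points = [p for p, _ in ring]
        i = bisect_right(points, hval(key)) % len(ring)
        return ring[i][1]

    keys = ["key-%d" % i for i in range(100000)]
    ring_old = build_ring(["node-%d" % i for i in range(4)])
    ring_new = build_ring(["node-%d" % i for i in range(4)] + ["node-4"])

    # Only the keys whose owner changed must be copied to node-4; if they are
    # not copied, lookups for them will be routed to node-4 and find nothing.
    moved = [k for k in keys if owner(k, ring_old) != owner(k, ring_new)]
    print("%d keys (%.1f%%) moved, all of them to node-4"
          % (len(moved), 100.0 * len(moved) / len(keys)))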

These two cases frequently (if not all the time :) turn out to be the most significant corner cases when a DHT is not selected for distributed storage.

Elliptics BLOB IO backend testing: single request – single read

The previous elliptics test showed how good (or bad) the append-only BLOB backend is when the HTTP proxy issues two reads to handle a single client request.
Namely, it fetched the transaction history log to find out which stored transaction has the same version the client asked for.

Now let's see how well we behave when a single client request results in a single data read from the blob.


700 rps within 100 ms, 900 rps within 300 ms

Surprisingly, the absolute numbers did not change: we still fit 1000 rps within 300 ms, which is rather unacceptable for a single client. But at the beginning we are about 2 times faster than the described 2-reads case: we handle 700 rps within the 100 ms range.

The testing setup is the same as in the previous test: 2 SAS storages attached, each with 16 disks. Ext4 over software RAID10. 2.6.34 kernel. Random requests. 10 million records (about 87 GB total, 44 GB on each SAS storage).

We also ran the same test with the storage blob moved to a single SAS storage, and then placed on the block device directly instead of a regular file on an ext4 filesystem. Results were about 2 times worse, as expected: roughly 600 rps within 200 ms.

There was no difference whether a raw block device or ext4 was used as the low-level storage.

In the meantime the blob IO backend got index loading support to speed up its startup. The only missing feature is index truncation, aka the ability to compact and remove deleted entries. When this is done, I will start on POHMELFS, a POSIX frontend to the elliptics network.
Its initial implementation will be neither performance-centered nor feature-rich; instead I will create a rather simple client which allows a trivial deployment procedure.

And I have to start writing the elliptics network paper for Linux Kongress. This may take a while though…
Stay tuned!