Metadata server or distributed hash? LRU or weighted caching?

Tagged:  

The former is actually quite religious question, but still it has a fair amount of technical background to talk about.

Main pros of the dedicated metadata server is its incredible flexibility and control. To determine where given object lives we ask special server which can perform whatever we want to make the answer: system can check permissions, locate the least loaded server, update centralized statistics or contact oracle and notify external entities.
Thus metadata server becomes a complex database which is too hard to replicate and generally maintain in consistent state with its copies. And we do need another copies, since every server fails and dedicated metadata one fails too. In some practical scenarious they fail even more frequently than storage ones, although quite contrary I have example where they never fail during several years of maintenance and everyday access.

But no matter what, generally we want to replicate metadata server and preferably to implement master-master operation mode to unload single entry of failure. To date I do not know production-quality master-master replication solution neither in free nor in proprietary world. And by production I mean hundreds of millions of records with millions of records updated/created per day with physically separated datacenters with flaky link between.

A very elegant solution for metadata servers is ... full absence of them. Distributed hash storage is one of them. But it only solves access problem - client can determine needed server itself, but we still have to implement control access, statistics and notifications somewhere. If we put more complex logic into the storage, unflexibility of the central control point absence becomes even more visible.

One such problem is caching. When some object is popular we want to put it to faster media to satisfy increased access rate. For example we can create multiple copies and distribute clients between them.
With metadata server this is a trivial case - we just update appropriate database record, so that connected client could get random (or preferable, doesn't matter) path to the data object. The more popular content is the more copies we put into the storage and update metadata database.

With the distributed storage without central metadata server we have to perform a full lookup to determine whether given object is present in the storage or not. Which means we have to contact remote server and try to read some data from its media. This slows things down noticebly, especially for non-popular content which does not have hundred of cached copies.

A simple solution is to move a control entity from the storage to higher levels. In case of cache it means some external storage, which will contact low-level one only when requested objects are not in the cache. Depending on the cache implementation this can be very cheap price for the access problems.

Thus we build a layerd system where DHT is a low-level storage.

Actually this solution will also work for metadata server too, except that for some workloads classical LRU caches do not work. Thus squid or page cache should not be used for them, and metadata server solution wins again.

But what wins and especially what is needed not only for distributed metadata-free storages is cache with content weights - the more frequently given object is requested the more time it will live in cache even if currently it is not requested. Last access time does not work in this case.

To my shame I do not know such cache systems - very popular memcached and squid do not support this iirc. And they do not allow to distribute cache content among multiple nodes. Memcached actually has quite nice frontends, which can form DHT, but still pure memcached lacks some features.

Plan has been plotted by itself...

http://code.google.com/p/hazelcast/

as I read in its documentation - this is DHT with eventual consistency model, where each joining node receives its content during operaiton, and that all nodes will update their IDs (partition ownership) and thus request to not-yet-synced data will fail (client will be asked to re-do operation).

There is no clear explaination what happens when the oldest member fails in regard of partition ownership (node ids distribution) - it is only said, that the oldest member is always right. I'm not sure there is proper master selection algorithm if there are multiple 'oldest' nodes, but I do not think this should be a problem in a real life. I would be more concerned about partition redistribution if new 'oldest' node has different map than previous 'oldest' node, but if I understood correctly, this should be maintained to be the same on all nodes.

There are only two IO thread, which if could be done with classical IO primitives is absolutely not enough, but I suppose it uses internal java IO, which is properly asynchronous...

hazelcast does not allow to control how backups are distributed, thus it is not posible to implement datacenter-aware data distribution (when datacenter connection fails we want another one to have the same data as failed one). As I understood hazelcast does not allow to manually specify node's position in the partition list (aka route table).

hazelcast uses LRU and LFU (Least Frequently Used) which is just great, but there is no LFU description, namely how counters are maintained and processed by eviction handler.

And the main problem is absence of recovery description. What happens when node failed, we wrote some data and then node returned back with the same keys as we wrote, but different data?
Elliptics allows to merge transactions. Cassandra gets the latest update (modulo timer jumps I suppose). How hazelcast behaves is a question.

Also there are no benchmarks.

And I personally do not like java because of its non-deterministic behavious under load - namely gc and memory consumption issuses. But overall sounds like an interesting project.