Elliptics network background fsck

Tagged:  

Elliptics network is a distributed hash table storage, which, among other features, allows to redundantly store data. Number of copies of written object can be specified on various levels namely on client and server, and it depends on how and which transformation functions were set. Transformation function is a method of generating object ID from provided data, for example it can be object name or content hash, or it can be a database index or whetever else.
Multiple transformation functions will provide multiple IDs object should be stored with, thus implying multiple object copies.

Generally we want that different copies of the same object are stored on different nodes, in some setups we can guarantee that (for more information please take a look at per-datacenter object distribution discussion on the project homepage). But when node with some object fails and returns with empty storage, the same object will not be copied from the storage even if there is another copy somewhere else, since there is no central information server which maintains object<->id relations (eliminating this server is one of the main advantages of DHT storages actually, since it is likely main point of failure in the comon existing distributed storages), since only client itself can generate that relation.

In some cases we can eliminate this problem, namely when we use classical redundancy scheme, when object being written to some node is also copied to its neighbour. When node fails, neighbour copies appropriate objects to its new neighbour instead of failed one, so number of object copies remains constant. When node comes back online again (potentially with empty storage) it will fetch 'its' object from the neighbour node.

This mode is extremely unfriendly from administrative point of view. Admins already hate me since they need to generate IDs for the storage nodes manually to have a good load distibution (when we have small number of nodes random IDs will not evenly spread whole ID range between nodes), and while we can generally implement ID distribuition the way that object copies will be spread among different machines and even datacenters, this becomes very hard when copies go not to multiple specified nodes, but to single specified node and its neighbour (which ID we do not really know on the client).
Thus admins will have to maintain information about how neighbour nodes are spread over datacenters, which becomes a real pain in the ass, when single machine contains shitload of disks where each one has its own elliptics storage node attached, and we do want to put a copy of the object to another physical machine (and even in different datacenter).

So, redundancy scheme, where we put object to some node and its neighbour maintains number of copies when nodes go offline and online, but it does not (easily) allow to implement copies distribution among physical machines and datacenters, which is rather trivial in the case, when different transformation functions (say, sha1 and sha256 with some changes described on the homepage) are used.

To fix problem with multiple transformation functions and precise number of copies maintenance I decided to implement (read: to put into TODO list :) a background fsck daemon, which will run on every elliptics node and check whether each object stored on given node has its copies presented somewhere in the storage according to object metadata.
In a nutshell, each update transaction will have some metadata attached to it on the storage, namely original object name it was generated from and transformation functions used (with time we will want to put there IO permissions also and maybe something else, so it should have extensible on-disk attributes format). Fsck daemon will read that information and check whether copies with IDs created using all listed transformation functions are presented in the storage, otherwise it will write local object into the storage with missing ID. It can also check whether locally stored object is corrupted and get it from the storage if needed.

Having enough redundant copies we can reduce probability of the simultaneous failure of all nodes which store copies of the object to the accepted level, so background fsck will have enough time to scan local object tree and upload missing copies. This task by no means is supposed to be fast, that's why it is a background fsck, which should not affect storage node performance.

That's the plan, although not very immediate, but it has a rather high priority.

How are you going to globally purge objects that aren't wanted anymore? You need to make sure that the background fsck doesn't recreate them while you are trying to purge them.

After new nodes are added the multiple hash functions will change where objects are stored. How is a stale copy on a node no longer accessed by the hash functions going to get deleted? Once solution is to periodically check all objects against the hash. If you hash the object and the hash says the node shouldn't have a copy then you delete it.

Another background fsck is to compress objects on a given node. This is a local operation and doesn't need to talk to the other nodes. Multiple compression strategies are needed.

How are duplicates being detected? If I back up 100 Windows machines do you detect 100 copies of the OS and merge them?

Objects which are not wanted should be deleted by the client otherwise they are still needed. I should also implement a tool to 'cut' transaction logs and drop transactions which are already 'covered' by some later ones, this can also be processed by the background fsck or started manually by admin.
To date there is no such a tool, although it should be quite simple - everyting is already in the code, just has to be combined into separate tool.

As of new nodes addition, they will copy data to own storage via existing elliptics network protocol, which is exactly what is supposed to be used by background fsck to check the data, so it will look exactly like some obscure client reads some data from the strorage and then writes (if needed) something there, so IO requests will be properly forwarded according to route tables on the storage nodes.

Background fsck can compress objects and update appropriate metadata flag so that reading process should uncompress it prior using, it can also perform timed checks and remove objects with too far away access time if needed (for various kinds of distributed caches).

Duplication is handled rather simple - in the default case there are two objects being written into storage when client uploads some data. The first one has direct relation to object name or ID client provides for his data (like hash of its name or some other transformation for example), only history for this ID updated - system puts another transaction ID into its log. This second transaction is actual data to be written, and its ID is generated from actual content and not object name (like hash of the uploaded file). If there is transaction with the same ID, only its history will be updated and there will be no unneded copy created.

So efectively if you upload 100 files with different names and the same content, there will be only single object being the same as original file and 100 history logs for different names (about 60 bytes per file).
If even names are the same, then there will be only two objects in the storage: the one with file content and history log, which will contain 100 records, which means that object is referenced by 100 clients, so it will not be unlinked until transaction logs becomes empty.

Things are a little bit more complex in a real life because of additional IO flags (like we can upload file without history attached, or can update history log only), but overall deduplication works as described.

The system I worked with handled node additions differently. Each node would slowly walk its object pool. For each object the node computes all of the hash functions and makes sure the other copies are there. One of these hash functions should match to the current node. If none of the functions match then the object is in the wrong place and it simply gets deleted.

This implies that node deletion and node addition don't actually do anything except pass around info to alter the way the hash functions behave. All of the other nodes will then do their background object checks which slowly rebuild the appropriate redundancy. The background algorithm can change the speed of checking depending on how many object copies are missing.

Part of the metadata for each object is a list of which hash functions should be used. That list controls how many copies get created. You can also specify the check interval to cause the object to be verified more quickly.

The hash functions control how the objects get placed in the network. To build a cache make a function that simply returns the address of the cache node (or a pool of nodes) and then attach it to all objects you want in the cache. Other functions might spread the objects uniformly or concentrate them on supernodes.

It can do the same task, and essentially that's exactly what will happen if joining node will not perform its protocol of fetching data.

But if node will not copy data to its storage when connected, then reading will not find objects on given node (which already advertised itself) until background fsck daemon fetch them, and while it should not be a problem if there are multiple copies - client can ask for another copy using different hash function, in some cases this can hurt the things.