ioremap.net

Storage and beyond

Advanced merge conflict resolver in the elliptics network

I’ve almost recovered from being a weekend mosquito dinner so returned back to the interesting problem in the elliptics network: merge conflict resolve.

In the existing systems with the global index (like git) this problem is rather simple – index is always different, so system can find a common parent transaction and merge subsequent updates. There may be conflicts of course, but those conflicts can not be resolved generally without human intervention.

In the elliptics network there is no global index, instead each object has own history, which only contained applied transaction IDs, so it was possible that log will contain the same entries (like when updating the same block with the same data multiple times). Because of performance reasons each update does not end up generating a checksum of the whole object (which is not the case for git, where files are on the local machine and should not be fetched from the bunch of remote nodes, when some of them may be unaccessible). Since elliptics network does not have a global index with the always unique entries it is not possible to find a common parent for the two logs or this can be done with errors.

For example we can find two possible merge solutions for the diagram below:

1 2 3 4 5 1 2 3
          1 2 3 4 5
and
1 2 3 4 5 1 2 3
1 2 3 4 5
-----------------------------------> t

Where horizontal axis is a time and numbers are transaction IDs. To solve this merge conflict we need to know exactly how to differentiate transactions when their IDs are the same, and thus to find which path to take on the above diagram.
To solve this problem, I introduced a transction timestamp, which says when given data was sent into the network. It does not require a precise clocks on all machines in the network, since timestamp is only needed to be a unique number in pair with the given transaction ID, system does not use it to merge transactions based on its generation time.

I selected time since it is quite monotonically increasing number, while it could be a random cookie. Another useful usage of the timestamp is log truncation. Well, truncation is not a good definition though, since we cut off the log from the beginning dropping some old transactions. This is not implemented yet, while 5 different merge algorithms are:

  • prefer network log and drop all local changes in favor of the version stored on the remote node
  • prefer local log and overwrite the one stored in the network discarding all changes there
  • apply local changes after common ancestor on top of remote log
  • apply remote changes after common ancestor on top of local log
  • do nothing and return error when remote and local transaction logs do not match

The first two are rather simple cases, while the third and the fourth (although being quite the same) are a bit more complex. They can destroy data consistency since transactions are applied on top of existing local or remote logs potentially overwriting data written there by another client. Although it is always possible to recreate object some time in its past, the latest version may be changed the way users do not expect.
Since it is potentially not a text file, usuall offset/size check (like in diff) is not enough (consider some binary tree updates). So there is the last case when nothing is done and it is up to administrator to decide which version is correct.

There should be a fair amount of testing of this functionality before new release will see the light, as well as other tasty features implemented, but core functionality is already there.

Comments are currently closed.

4 Responses to “Advanced merge conflict resolver in the elliptics network”

  • MartinFick says:

    This sounds like a really bad idea, (to use time to differentiate the transactions). You are just setting yourself (or, worse, your users) up for the worst kind of possible errors: the really hard to reproduce uncommon error. I think that you really should use a globally unique ID for your transactions. However, it does not have to be a globally unique sequence. In other words, there are ways of breaking up the ID space so that it will still guarantee uniqueness without requiring a global sequence that needs to be coordinated on every transaction. For example, one way would be to simply use a global sequence for the creation version of each file instead. I assume that this is where your problems really occur anyway right, the fact that you have several 1′s in your version # sequence is because a file was deleted and recreated right? So, each time a new file (inode?) is created, it could get a globally unique ID. And then each transaction ID can be the combination of the file creation ID and a sequence version # for that file appended to it. This version # can then start over at 0 (or 1) again every time this file is recreated. This should ensure uniqueness while avoiding a global sequence that is constantly being incremented for each transaction.

    A similar problem did come up on the glusterfs mailing list a year or two ago, it might be worth investigating some of the potential gotchas mentioned. I do believe that they eventually chose a global sequence, but I am not sure if it was for every file version or just for creates.

  • zbr says:

    it has nothing with introduced timestamps. Transaction ID is a unique data representation and it does not depend on time, moon phase or anything else except data itself and inner nature of the transformation function which produced given set of bytes as ID from provided data array.

    Transaction is unique in the network but with the same set of transformation functions the same data results in the same ID thus we heavily drop duplicated chunks of the same data.

    What I talked about is a history entry which besides transaction ID, its size, offset and flags, also contains timestamp now. History is essentially another object in the system with own ID (corresponding to the ID of the object we were written into), but its content is a log of all update transactions. Several 1′s in the example is several writes of the same data into the object. Since data is the same, it has the same ID and we store only single copy, but it can be placed in some objects multiple times or at different offsets (not shown in the example, but history contains this metadata).

    Timestamp in the history log is only needed to differentiate those same writes. It is not required to be a timestamp – random value would also suffice, but timestamp allows to store more information for free – when object was created and modified, as well as still to differentiate the same transactions placed at different time.

    Hope this clarifies things a bit. Transaction generation process was not replaced, I just extended history log and added there timestamp to differentiate the same transactions written at different time.

  • MartinFick says:

    I think I understand what you mean now, and it is not what I was talking about, sorry. :)

    This does leave me to another question then: “how do you differentiate in your network between different versions of files with the same name?” I believe that I remember that you have something like an object Id (inode) to represent a file in your network, and I thought that this was based on the path say: /usr/lib/library.a. If this file (object) is deleted from the network and another object is re-added at the same path, will this new object get the same object id as the previous file with that name? If so, will that somehow confuse your history logging?

    Do you normally delete a file’s history log when the file is deleted from the network (I assume so, or you would have a space leakage)? Will a log for the old object potentially get “merged” with a log for the new object during conflict resolution (if the old log remains on a node which was down when the deletion occurred)? This is the problem that I thought you were attempting to resolve in your original post.

  • zbr says:

    There are two options to delete an object – to mark a special tag in the history or to delete history object itself. In the former case subsequent creation of the object with the same name (presumably transformed into the same ID) will end up in the same history, but there is no problem – we will get only last transactions needed to create new object (by default we do not dig into the history and get the latest updates).

    In the latter case new history will be created and during merge with another node (which has old object) depending on the merge strategy one of the history objects or weird mix of them will be stored on both nodes, since there will be no common ancestor so either whole remote log will be replicated on top of the local one, or vice versa. And it is valid, since effectively we try to merge two completely independent objects into one, while we should really want to get only one of them.