Advanced merge conflict resolver has been merged

Elliptics network is now able to work in a network split setup and system will merge different history logs when node rejoins.

There are 5 different strategies (described from joining node's point of view):

  • DNET_MERGE_PREFER_NETWORK (0) - discard local changes and prefer version which exists in the network
  • DNET_MERGE_PREFER_LOCAL (1) - send local transaction history into the network pretending it to be valid one, all changes in the history log, which is stored in the network will be discarded
  • DNET_MERGE_REMOTE_PLUS_LOCAL_UPDATES(2) - apply all local changes made after the common ancestor commit after full remote log
  • DNET_MERGE_LOCAL_PLUS_REMOTE_UPDATES (3) - apply all remote changes made after the common ancestor commit after full local log
  • DNET_MERGE_FAIL (4) - fail if transaction logs do not match

The first two are obvious - we discard either remote or local history in favour of the selected version. All uptadates stored in the discarded log will be lost. The next two versions are also quite simple - we find a common ancestor and then apply the rest of the selected history log on top of the base one. In this case the first part of the merged log will contain either full remote or local history (depending on the selected strategy) and the second part will contain the rest of the other history.

Here is an example. That's how original history looks on both servers:

$ ./example/dnet_hparser -f /tmp/elliptics-test/root1/ff/ff00000000000000000000000000000000000000.history 
/tmp/elliptics-test/root1/ff/ff00000000000000000000000000000000000000.history: 
  objects: 5, range: 0-0, counting from the most recent (nanoseconds resolution).
2009-06-19 17:57:43.470096000: 32ea6116: flags: 00000000, offset:     2048, size:     2048: -
2009-06-19 17:57:40.814129000: af24ab16: flags: 00000000, offset:     8192, size:     4096: -
2009-06-19 17:57:40.656900000: 0054b454: flags: 00000000, offset:     4096, size:     4096: -
2009-06-19 17:57:40.503311000: 6cdd3cb9: flags: 00000000, offset:        0, size:     4096: -
2009-06-19 17:57:40.540129000: 6cdd3cb9: flags: 00000000, offset:        0, size:    12288: -
$ ./example/dnet_hparser -f /tmp/elliptics-test/root0/ff/ff00000000000000000000000000000000000000.history 
/tmp/elliptics-test/root0/ff/ff00000000000000000000000000000000000000.history: 
  objects: 5, range: 0-0, counting from the most recent (nanoseconds resolution).
2009-06-19 17:57:42.163033000: 51b89998: flags: 00000000, offset:     1024, size:     3072: -
2009-06-19 17:57:40.814129000: af24ab16: flags: 00000000, offset:     8192, size:     4096: -
2009-06-19 17:57:40.656900000: 0054b454: flags: 00000000, offset:     4096, size:     4096: -
2009-06-19 17:57:40.503311000: 6cdd3cb9: flags: 00000000, offset:        0, size:     4096: -
2009-06-19 17:57:40.540129000: 6cdd3cb9: flags: 00000000, offset:        0, size:    12288: -

Notice that coloured lines (also italic) are different and should be merged. That's how this will look after one of the merge strategies applied:

$ ./example/dnet_hparser -f /tmp/elliptics-test/root1/ff/ff00000000000000000000000000000000000000.history 
/tmp/elliptics-test/root1/ff/ff00000000000000000000000000000000000000.history:
  objects: 6, range: 0-0, counting from the most recent (nanoseconds resolution).
2009-06-19 17:57:43.470096000: 32ea6116: flags: 00000000, offset:     2048, size:     2048: -
2009-06-19 17:57:42.163033000: 51b89998: flags: 00000000, offset:     1024, size:     3072: -
2009-06-19 17:57:40.814129000: af24ab16: flags: 00000000, offset:     8192, size:     4096: -
2009-06-19 17:57:40.656900000: 0054b454: flags: 00000000, offset:     4096, size:     4096: -
2009-06-19 17:57:40.503311000: 6cdd3cb9: flags: 00000000, offset:        0, size:     4096: -
2009-06-19 17:57:40.540129000: 6cdd3cb9: flags: 00000000, offset:        0, size:    12288: -

or

$ ./example/dnet_hparser -f /tmp/elliptics-test/root1/ff/ff00000000000000000000000000000000000000.history 
/tmp/elliptics-test/root1/ff/ff00000000000000000000000000000000000000.history:
  objects: 6, range: 0-0, counting from the most recent (nanoseconds resolution).
2009-06-19 17:57:42.163033000: 51b89998: flags: 00000000, offset:     1024, size:     3072: -
2009-06-19 17:57:43.470096000: 32ea6116: flags: 00000000, offset:     2048, size:     2048: -
2009-06-19 17:57:40.814129000: af24ab16: flags: 00000000, offset:     8192, size:     4096: -
2009-06-19 17:57:40.656900000: 0054b454: flags: 00000000, offset:     4096, size:     4096: -
2009-06-19 17:57:40.503311000: 6cdd3cb9: flags: 00000000, offset:        0, size:     4096: -
2009-06-19 17:57:40.540129000: 6cdd3cb9: flags: 00000000, offset:        0, size:    12288: -

The same history will be placed both on local and remote nodes.

The last merge strategy - DNET_MERGE_FAIL (4) will fail with the following lines in the log:

2009-06-19 18:03:27.354246 2: ff000000: histories do not match and fail strategy was selected.
2009-06-19 18:03:27.354485 8: ff000000: failed to merge histories, err: -22.

Automatic test for this functionality (written in bash) is comparable in size with the feature itself.

There are only two issues left to implement in the elliptics network to be considered complete.
One of them is a known problem when joining node does not advertise objects it stores which are outside of the specified range, while it should merge them into the network. Although solution exists, it was not yet tested in this particular case - it is exactly the same as syncing failed range to the neighbour node.
Second problem is unlink support - objects have to maintain reference counter for the finer-grained deletion. Since transaction with the same content ends up in the same object thus performing automatic data deduplication, we can not simply delete it if it is referenced by two or more objects outside. Reference counting allows to remove object only when it is not referenced by any other object. Given that each low-level transaction (i.e. that one which contains some data and not a history update) has a history of the objects it is referenced from, deletion should not be a major problem.

After those tasks are finished I consider project as completed and it will be moved into bug fixing mode. To date I do not see any other features needed to be implemented in the core library (but I do remember about PAXOS-based locking for the backed up histories).

So plan to have a small rest after it and work on regext state machine implementation, LR grammatics and knowledge extraction. A small and rather simple AI bot should be developed for some mail lists this year, and I expect a lot of fun working with it.

In a week or so I will start porting POHMELFS to the elliptics network.

How does the Elliptics Network (or POHMELFS) for that matter deal with writes which do not complete? In other words, if all the servers go down, will writes fail or simply block until enough servers come back up? NFS will block (glusterfs unfortunately does not), and many people consider this a very important essential feature since it prevents applications from having to be restarted simply because a server reboots or a client becomes disconnected from the network for a while.

Elliptics network by its nature is very asynchronous and each low-level call may get a callback invoked when given transaction is committed by the server, high-level API has a specially allocated waiting structure, but callback is still there.

This callback is invoked each time transaction is completed (either acked by the remote server or when some error occured), so if one or multiple servers failed, transaction callback will show it, and caller can make appropriate actions. In particular high-level blocking API will return success if at least one server we are writing to acked transaction, if all sends returned error, it is returned to the caller.

I'm not sure blocking is a way forward - how can system process other tasks in this condition, how to signal it to stop waiting, how to determine that there is no way we can send given data to the server?

Network can be changed dramatically when we tried to send the data and failed, so I believe returning appriate error to the caller who can resend data is the right solution. Wrapper is rather trivial for this functionality though.

"I'm not sure blocking is a way forward - how can system process other tasks in this condition, how to signal it to stop waiting, how to determine that there is no way we can send given data to the server?"

As a user of a POSIX networked filesystem, I would want the posix API to block for my apps. This does not mean that the entire system has to block, but simply that whatever presents the API to the apps (could be a wrapper) has to block.

Imagine a mail server writing to an Elliptics backend, surely if the network goes down, we don't want the still running mail server to need to be restarted just because the network or servers went down? Now picture how NFS is used in most organizations, many shares including home directories are often mounted on remote NFS shares. The NFS server goes down and everyone in the organization screams, but since the apps simply block, when the server comes back up, everything continues as normal. Without this blocking, every single application on the entire network that relies on the shared storage would have to be restarted. Where is the gain in getting a failure (a waring or alarm might be good though)??? A failure would cause a major interruption as everything is restarted, and many services would likely even be missed and not restarted until way later once someone realized that they are still in failure mode

This really is an essential feature if you ever hope to make the Elliptics network usable as a general purpose network filesystem (and I hope you do)! Without it, you are relegating your network to be an oddball special purpose filesystem (as glusterfs is). I would venture to say that this is WAAAAY more important /essential in normal operation than locking even is! I hope that you eventually agree. :)

POSIX requirements implemented in POHMELFS will essentially contain this, but this has really nothing with the elliptics library itself - it is a separate client which will talk to some obscure servers using its protocol. Elliptics API works the way I described above, but clients are free to create own wrappers or protocol implementation, that's what I will do in POHMELFS.

In the existing POHMELFS implementation user does not even know that there are some transaction sending errors - they will be resent on behalf of POHMELFS core threads and only if specified timeouts and number of resends fired (where both can be infinite) appropriate pages are marked as invalid.

Great, I am happy to hear that. Sorry that I was blurring the lines between the two systems!

One area that I am curious about is: "how you plan on dealing with split brain prevention (not recovery) then, is this something that POHMELFS will have to deal with instead of the Elliptics network?" If a write is supposed to occur on two servers and it only occurs on one, I suppose this will mean a transaction failure in the elliptics network? Will this then get propagated up to the POHMELFS library which will decide whether to retry it or not? Will there be new additional POHMELFS parameters (I assume they do not yet exists since POHMELFS did not duplicate writes correct?) that specify what to do in this case (a failure because one server is down, but no both, if there are only two). Will POHMELFS be able to make (or query an external program to make) a decision to continue with only one server? I believe you have talked about having external applications that can detect this (a server failure) and change the way POHMELFS is configured, but could this transition to single server operation be seamless to applications using POHMELFS?

In other words, to follow on the blocking idea, if my requirements are for POHMELFS to operate unimpeded when it can talk to 2 servers, but I want it to have to consult an external app as soon as it can only reach one server to decide whether: a) it should allow a write to succeed even though it is only on one server (under the assumption that this server is now authoritative), or b) to block until the second server returns even though it could write to the first server (because it believes that there may be network segregation and that the other segment of the network is treating its server as authoritative), will POHMELFS (or Elliptics) provide an interface for this decision to be made?

This problem is properly handled by the elliptics addressing model and merge algorithms.

When network splits, its parts will start accepting objects with broader ID range to cover unaccessible part. Different clients will write to different servers with the same ID (on different partitions), when servers reconnect and rejoin, data will be merged into single history on the node which will handle given ID range in the new configuration.
This new history can be non-consistent though, since it just applies one updates on top of another, which can break inner logical structure. Since history saves all transactions in the log, it is always possible to rollback to some other version.

As of writes to multiple servers - elliptics does not resend data if one of them is unnaccesible, instead it returns error via provided callback invokation. It is up to the higher layer to decide whether to resend data or perform other recovery actions. In POHMELFS transaction is marked as undelivered until all servers return ack (POHMELFS supports writes to multiple servers and reading balancing quite for a while already) or predefined number of resends happend (transaction will be resent after specified timeout number of times).

What you refer as external application detecting network split was DST - this is a block level network device and it does not have any knowledge on the data it works with, so there should be an external application which will say that given mounted partition is no longer valid to write to, since it can be written by other node.

In POHMELFS (after it is ported to the elliptics network writes to multiple servers will always succeed because of dynamic addressing model - if one server is unaccesible its object range will be handled by the neighbour, so client will write to some other server, and when failed one returns online, it will fetch updated objects and merge with the local ones.

"In POHMELFS (after it is ported to the elliptics network writes to multiple servers will always succeed because of dynamic addressing model - if one server is unaccesible its object range will be handled by the neighbour, so client will write to some other server, and when failed one returns online, it will fetch updated objects and merge with the local ones."

This model seems to assume an infinite amount of servers. My question pertains to the time when the amount of servers is less than the amount of requested copies of each transaction. I know that the elliptics network focuses on scaling so it is easy to think big, but in the HA world it is important to be able to continue operating when few services are available, this hopefully means down to one server only!

I have my hopes (of course, I do not necessarily expect you to share this vision), that the elliptics network would be a solution that truly scales, in other words, a solution that can start small and grow, not just a solution that requires committing to building a cluster before even getting started. Personally, that is what prevents me from migrating to all the other cluster solutions, I would not be surprised if that is why we do not see wide scale adoption of any of them either. It is much easier to commit to a system if I can start small and add resources as I need them. This probably means starting with one server, and by adding another one to immediately expect a performance and availability gain. Most of the systems out there can scale performance wise from 1 to n, but I have yet to see any that scale availability wise
that way. Most either require 2 normally (and cannot go above 2) and can degrade to 1, or they require 3 or more to start (not sure what they can degrade to).

The solution you mention above seems like it requires at least two servers (for single redundancy, i.e. 2 copies), but it does not sound like it can degrade to one only server, can it?

As long as you have at least one server it will be able to handle all requests clients send to the elliptics network.

System does not care about how many servers really process the requests (be it single write or multiple transactions) - automatic addressing reconfiguration takes care about failed nodes and their ID ranges so that when there is only one server it will be able to handle the whole ID range and will sync all updates with newly joining nodes according to their configuration, i.e. joininig nodes will requiest (and merge according to own merge strategies) all objects corresponding to their ID ranges.

OK, but I am still missing how you will then prevent (not recover from) split brain? It the network continues to operate with just one node, it could be two halves each operating with just one node. Sure, fancy conflict resolutions are nice, but I suspect that most people (I do) would prefer one half of a split network to cease operating so that there will never need to be any conflict resolution.

Perhaps I am being dense, or I am just missing something that you have repeated (more likely we both have assumptions that aren't expressed), but I do not understand how your network will prevent (or provide an API to prevent) split brain as soon as a potential spit brain is detected? Can split brain be detected/prevented before a half write occurs, not after? Without this API or a built in detection mechanism, this statement seems like a recipe for split brain: "As long as you have at least one server it will be able to handle all requests clients send to the elliptics network."

Both parts of the split network will work with own nodes, but you can select a merge algorithm which discards changes made on joining or remote nodes (depending on who was the first in the setup process), so effectively there will be changes made only one of the split sides.

This technique moves away from classical detect-and-switch method of dealing with the node outage, instead we always work with what we have and then recover (potentially dropping all changes made by the other side, which you would prefer).

Oh, too bad... I question how useful this is? This is the same approach that glusterfs uses. I would probably not risk any of my data to such a redundancy system (I would just use the scalability features).

Imagine something as simple as a mail server and an IMAP server which use a shared FS for the mail spool which then becomes split. The mail server gets split and still gets incoming mail to add to its spool, but the IMAP server sees a different copy of the spool which its clients delete and email from after reading them during the split. Once the network split is over this could be a disaster since two different copies of the spool have had completely different changes done to them.

It is a shame since it seems like DRBD is the only project I know which protects against this, since if setup correctly, only one copy (the master) will get updates, I hope you will reconsider your design. I suspect that many others would hope so too. :)

In the descirbed case transaction merge will take care of the different object histories, that's for the start.

Next, neither DRBD nor any other project provides a protection against this wrong system design - it is not storage problem that there are two different users who starts writing into different servers while you expect them to wite into the same one and there is no way servers can connect and determine who is the master. I believe it is clear from the end of the previous sentence how this bad design should be fixed.

You still can use heartbeat or any other similar tool like CARP to shut down some servers on the nodes believed to be in a slave mode. It is not a storage task to determine when to allow IMAP server to accept new messages or not.

But please think a bit out of the box described by the 'brain split' label, since it has really nothing with how you should operate the system - you can (and likely should) implement HA system not tied to the case when you have two servers which want to mirror data into each other's disk - consider them as a part of the unified network where each transaction is a separate object mirrored to two nodes and you do not really care which one is active, since there is no such meaning at all - it is just a storage which accepts data from one or another client and if your Apache runs on the unaccesible node then it is up to you to decide wheater to start new instance or not - and not the storage one.

So, effectively 'split brain' is an artificial problem (of the storage and badly designed system) - it is not that level which should care, so with the right design you will find that this problem is solved on the different area by the special software, which does not even know about underlying storage.

"Next, neither DRBD nor any other project provides a protection against this wrong system design - it is not storage problem that there are two different users who starts writing into different servers while you expect them to wite into the same one and there is no way servers can connect and determine who is the master."

DRBD uses a different model to ensure that either a) both servers have the same data written to them, or b) only one server is written to if the two cannot speak to each other. Specifically, it attempts to write to both servers (local and remote). When the remote fails, DRBD then needs to make a decision whether to continue to write to the local one or block (and eventually shutdown). It can be configured on remote failure to attempt to contact the secondary server via an external mechanism (typically using heartbeat's serial port protocol) and to invalidate the remote server. This ensures that the remote server does not believe that the currently active one went down and attempt to take over for it. If this (serial) contact fails, it is possible to make the local server simply block since if it cannot contact the remote through the secondary (serial) mechanism, there is a chance that the secondary got promoted to primary. So, by using an external program, it is always possible to continue serving only on half of the network and to know the difference between a failed server (the point of the redundancy) versus a failed network. This decision to fail over or not is made before every write is returned to the upper (FS typically) layer preventing any split brain divergence from occurring in the case of network segregation while still allowing fail over on server failures.

Since I may not have done justice to DRBD describing things this way, I suggest that you investigate dopd yourself:
http://www.drbd.org/users-guide-emb/s-heartbeat-dopd.html

I personally know of several telco systems which use DRBD as the basis of their filesystem HA for their server platforms (and hope that someday your network will be used that way). This model works and does prevent split brain at the storage layer (below FS api). While I am not trying to architect your system for you and tell you at which layer (POHMELFS or elliptics network) to do this. I am trying to point out that what most people want is split brain prevention, not conflict resolution, and that it is possible and expected somewhere below the FS API, not above it as you seem to suggest here:

"You still can use heartbeat or any other similar tool like CARP to shut down some servers on the nodes believed to be in a slave mode. It is not a storage task to determine when to allow IMAP server to accept new messages or not."

and

"So, effectively 'split brain' is an artificial problem (of the storage and badly designed system) - it is not that level which should care, so with the right design you will find that this problem is solved on the different area by the special software, which does not even know about underlying storage."

No, you can't because this will not happen below the FS layer. Sure, my IMAP server can shutdown once the external tool detects the problem, but at this point the IMAP server could have already written to the slave and received a success from the FS layer meaning that the IMAP server, or apps using it, may make decisions based on that success (perhaps sending an email to someone else). So, sure, you can simply ignore the divergent write to the slave after network repair, but the problem is that you then no longer are providing POSIX semantics to the apps that believed they made that write successfully, and so there is no way for these apps to resolve their conflicts which occur because of this split brain (no way to recall that previously sent email!) You see, it's not just a matter of resolving the data, but the decisions that were made by applications (or people) that used the data before you prevented the divergence, way before you can resolve it.

Bottom line, in order to provide HA posix semantics, the FS layer needs to behave differently depending on which side of a segregated network it is on,

You apache example will work if network segregation occurs between the clients and the servers, ie:

   (elliptics & apache cloud)   <-- bad --> clients

Or, in greater detail:

  (elliptics server A) <------> apache <---------> client A
                  |<------------->| ^-----------------v
  (elliptics server B) <------> apache <- bad ->  client B

but I cannot see how it will work if network segregation occurs within the elliptics network:

   (elliptics server A) <------> apache <----> client A
                  |<---- bad ---->|
   (elliptics server B) <------> apache <----> client B

Since, now client A and B may reserve the same seat to moscow on the same flight. I am curious how you think an external tool could prevent this and instantly (not mili seconds later) prevent successful writes between one of the apache servers and its elliptics server?

Again, if you simply say I don't care to solve this problem with my storage cloud, I have no choice but to plead with you to try and support it. Or, if I am flat out wrong and it is solvable with an external tool, than great. But, I just don't see it and hope that you are not making a bad assumption when you think that it is easily achievable, an assumption that will make it hard to fix this in retrospect.

I will try to make it short.

First, you can not reliably detect wheather remote server crashed or this is a network problem.
Second, you mentioned external tools used to check server statuses - you can use them everywhere.
Third, POSIX has nothing with what you describe. Neither DRBD nor elliptics work on that layer.

Returning to DRBD and split brain problem - it uses serial connection to determine if server is alive and network was down or server crashed. When server receives a request from the network immediately network can be partitioned and another client will connect to the different server with the same request and within heartbeat interval they both will write the same data since servers. Even mentiaoned page contains information about the race when link was dropped.

This is wrong design to allow separate media for each server and rely on timed hearbeat to detect if it broke. Instead you can use the same media and if server fails it is clearly visible, and when network breaks another connection shared between two nodes is started. In this case server receives request and checks wheather another one is active and updates the data. By the same/shared meadia I do not imply hub ethernet, but facility which assures that if link to the server was broken, then server can not be accessible via anything else, so effectively server link is the only media and servers can talk to each other via this connection, thus 'sharing' it.

Second, using transaction mechanism you can work in essentially any environment, but then addiitional logic in the application has to be implemented to recover from the cases when transactions should be merged into the same object.

And the last - real HA systems must be built with the distributed locking facilities. Locking daemon can be replicated on top of multiple machines to allow outages, and much more complex protocol called PAXOS should be implemented there.
When lock is granted to the system, it can update multiple nodes without need to think that anyone will do the same in parallel on separate server. This is not supported neither in DRBD nor in elliptics, but the latter will eventually have distributed locks.

To date you may consider elliptics network as not suitable for your needs, but please think twice wheather other designs really provide protection you want to have without validated distributed locks.


Returning to DRBD and split brain problem - it uses serial connection to determine if server is alive and network was down or server crashed. When server receives a request from the network immediately network can be partitioned and another client will connect to the different server with the same request and within heartbeat interval they both will write the same data since servers. Even mentiaoned page contains information about the race when link was dropped.

http://www.drbd.org/users-guide-emb/s-heartbeat-dopd.html

I reread that page twice, I do not see any mention of a race condition, could you be more explicit? Since DRBD is usually used in a single primary system (at least, that is what I have been talking about), what makes you think that two different clients could connect to two different servers (since there should only ever be one server running at a time)? Yes, this is not scalable, but it is reliable. That is my gripe, it seems like we have two options: scalability / HA, pick any one (distributed FS/DRBD)

http://www.drbd.org/users-guide-emb/s-heartbeat-dopd.html

at the very end:
When re-instituting network connectivity (either by plugging the physical link or by removing the temporary iptables rule you inserted previously), the connection state will change to Connected, and then promptly to SyncTarget (assuming changes occurred on the primary node during the network interruption). Then you will be able to observe a brief synchronization period, and finally, the previously outdated resource will be marked as UpToDate again.

When network splits it will take a while to detect it, so it is possible that two clients will work separately with the system nodes.


When network splits it will take a while to detect it, so it is possible that two clients will work separately with the system nodes.

No, I don't think you understand DRBD. Have you worked with it? If not, maybe you should setup a system with it to better understand what they are providing as a solution.

When DRBD is supporting a normal filesystem such as ext3 (not a clustered FS such as GFS or OCSF2) there is only one server in primary mode at a time. Only the server in primary mode can mount the FS directly. If you mounted an EXT3 FS directly on two different hosts, you are likely going to cause kernel crashes and for sure FS corruption. So, when configured for a local FS, DRBD prevents "promoting" both servers to primary at the same time as long as either the two DRBD servers can talk to each other directly or through heartbeat There is no race condition.

If the secondary wants to write to the primary FS, it needs to be exported from the primary (via NFS or another networked FS) and mounted on the secondary. This means that when the secondary writes to the FS, it actually sends its writes to the peer via NFS, then the peer, which is primary, writes it to EXT3 (or other local FS) and then DRBD mirrors it back to the secondary. No clients, including the secondary can write to the FS without going through the primary. This is not a distributed FS, but it does prevent split brain at all costs under normal "as designed" operation, (except for potential bugs, obviously). I do not believe there are any known race conditions. Feel free to join the DRBD mailing list if you think there are, I am sure they would be happy to fix them if you point them out. :) I would be happy too since I want my data to be immune from race conditions!

To clarify the some of the meaning of your quoted text:

...the connection state will change to Connected, and then promptly to SyncTarget (assuming changes occurred on the primary node during the network interruption).

This refers to the internal state of the DRBD device, it indicates how it believes it is talking (or not) to its peer.


Then you will be able to observe a brief synchronization period, and finally, the previously outdated resource will be marked as UpToDate again.

Once the peer it is uptodate, it does not start serving clients again. It stays secondary until a failover occurs.

I don't think you understand DRBD. Have you worked with it? If not, maybe you should setup a system with it to better understand what they are providing as a solution.

I did, although quite long time ago, maybe 4-5 years.

Race is between the time when network breaks and when heartbeat shows that there are problems with the server itself and not a network connection. System does not check at every write/update/whatever command wheather it is master or slave, so it can write data while being switched into slave mode (not particulary into FS, but accept client and serve its request for example).

And if you like to use this setup it is possible to implement it using elliptics network library.
But this is not the right way to solve the problem IMO. As was said the only mathematically proven algorithm (I know about) is PAXOS and distributed locks based on that.

It is very hard for me to follow much of what you are responding to if you do not cite the lines. I can't tell if you have answered my questions/concerns even though you probably have. :(


This is wrong design to allow separate media for each server and rely on timed hearbeat to detect if it broke. Instead you can use the same media and if server fails it is clearly visible, and when network breaks another connection shared between two nodes is started. In this case server receives request and checks wheather another one is active and updates the data. By the same/shared meadia I do not imply hub ethernet, but facility which assures that if link to the server was broken, then server can not be accessible via anything else, so effectively server link is the only media and servers can talk to each other via this connection, thus 'sharing' it.

This seems to assume that client applications cannot live on servers? That would be a crying shame since a distributed filesystem like yours could finely make use of all those empty disk sittings on everyone's desktop. It pains me to see organizations investing hugs sums of money on improving their backup and shared storage solutions, only to offer less shared/backed up storage per user than is sitting in each PC spread throughout the organization. Ideally, the only storage needed would be the distributed PCs around the office, why bother with a single point of failure / bottleneck in the data center?

Of course, that means that servers and clients would share machines. This means that some connections from some clients to some servers can never be severed while still being able to sever the connections to other servers. It seems you may not care to support such a model? It does seem like it should be the logical long term objective of any newly created distributed filesystem: a peer to peer filesystem?

This seems to assume that client applications cannot live on servers?

There are no limitations on how many clients or servers run on the given server. In particular automatic tests run on the same node over localhost (there are multiple clients and multiple servers tests).

I talked about the case when network connection if dropped from one server to another implies no clients can connect either, so each application clearly knows that data can not be damaged from the outside. You still can use different link (like heartbeat over serial line) to determine wheater another server is alive and which one should be a master, but this is pretty useless for the most cases, since if network connection was broken no clients can connect and local clients can not get new data to work with.

It can be useful for some local application which performs data calculation/generation and saves it at some intervals.

I think the issue he's trying to raise is about having an out-of-band channel for detection of nodes being alive or not which can create race conditions. In-band communication in distributed systems tends to be more reliable.

You can have clients and servers on the same machine if you wish but one of the benefits of the system is partitioning of data across servers the larger environment which brings with it fault tolerance and scalability.

The system they've been developing scales up and out - allowing you to start small and expand out as and when you need. However starting with one server won't give you redundancy (obviously) - but as you scale out and partition your ID domains - the system will naturally scale out to support your infrastructure.

With the right design - the examples of SMTP and IMAP that you've used will work well with this type of network.


I think the issue he's trying to raise is about having an out-of-band channel for detection of nodes being alive or not which can create race conditions. In-band communication in distributed systems tends to be more reliable.

I agree, I am the one suggesting in-band detection for server/network failure detection. ZBR is saying that using an external tool, out-of-band, is good enough. He says: use the external tool to cause your apps to failover based on the external tool's detection, and that the FS does not need to block based on an in-band mechanism! This means that the apps will only get triggered to fail over based on the external tool, thus a race condition! He says it's OK if the FS continues to chug along merrily for you, but if you do not detect the network problem yourself (out-of-band) quickly enough, it will cause split brain!

The out-of-band serial cable that drbd uses is not to detect failure (it does that in-band), the out-of-band mechanism is used to detect the difference between network versus server failure, so that in the server failure case, fail over can happen, it is safe for the secondary to take over. And in the network segregation case, no failover happens, the primary likely continues serving (preventing split brain), or the primary is shutdown and the secondary takes over! In-band cannot tell the difference and know whether the secondary should take over or not, and leads to either split brain conditions (if automated), or forced manual failover.

So, I am not advocating out-of-band or in-band, you need both for safe automated failover without race conditions!

Cool - so we're kinda on the same page.
From what I understand ellipsis is the network carrier and pohmelfs is the higher level interface that will implement locking which should handle in-band prevention of split-brain issues.

I think the general intention is to have ellipsis as a fairly generic abstraction for cloud storage - in essence it's a network namespace layer. While some of the examples use a file metaphor (as a vehicle to describe it's functionality) it's true power comes from it's ability to work in many scenarios.

I'm using it in AI as a system that allows distributed knowledge processing to generate abstract representations when trying to deduce the form, function and meaning of language. In my situation I actually want split-brain problems to happen :)

Cool - so we're kinda on the same page.

I think you and I are, but I do not believe ZBR agrees. :(

From what I understand ellipsis is the network carrier and pohmelfs is the higher level interface that will implement locking which should handle in-band prevention of split-brain issues.

Unfortunately, if I understood ZBR correctly, he does not seem to want pohmelfs to prevent split brain either. :( Maybe it is something he envisions much later, after distributed locking is implemented?

Don't write him off too quickly - he's a bright chap and I'm sure he's mentioned it in a post previously.

The system they've been developing scales up and out..

I am curious, which systems are you talking about?

umm - the Elliptics network

"There are only two issues left to implement in the elliptics network to be considered complete..."

What about distributed locking? Has this been implemented and I just missed it, or is that what you meant by "(but I do remember about PAXOS-based locking for the backed up histories)."? Do you not consider locking a core feature?

It will be implemented as a separate system (started by the server node though) and as a separate project.

It is an essential part, I just want to reuse it for different projects, so decided to move it outside elliptics network.

I expect POHMELFS port will not take much time, so I will start implementing those locking system after it is ready.