New distributed storage release.


Essentially it is a resend of the old release with trivial changes:

  • whitespace cleanups
  • new name, this time about Armenian mountains (prompted by talking about the Koran with Leontin)
  • email changes

As usual, get it from the GIT tree or the archive.

Hi!

Have you tried using DST with mdadm (for example, with RAID 1)? If I haven't misunderstood everything, this is a valid use case. But I don't really see how you should set up the array and so on.

I mean, for example, say I have 3 machines, each exporting one block device. On machine A I create /dev/md0 (RAID 1) from these 3 block devices (2 remote and one local). Then I mount /dev/md0 and write to it, so the data gets written to the remote hosts too. But let's say machine A goes down and I want to continue writing on machines B and C. What should I do? Create a RAID 1 array from the B and C block devices plus a missing one (for machine A)? Would this cause any trouble with the superblock? And what happens when machine A is up again? Wouldn't machine A reconnect to the remote nodes and also use /dev/md0 with those? Would it be a problem if machine B also has /dev/md0 created locally, and machine A too, both with the same block devices in the array? (Not mounted on both, of course: if both mounted it and the fs were, for example, ext3, there would be a problem. Only one of them mounts it.)

Sorry if I completely misunderstood everything and my questions are just crap, you can just ignore them =)

Thanks a lot!

Yes, I used DST in a RAID1 created with mdadm. I do not remember the exact command right now, but it does not differ from the usual mdadm commands used to create RAID storage with local disks.
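As a rough illustration of "the usual mdadm commands", here is a minimal Python sketch that only builds (and does not run) a RAID1 creation command line. The device paths are hypothetical placeholders, not names DST actually assigns; substitute whatever block devices your local disk and DST nodes appear as.

```python
import shlex


def mdadm_create_raid1(md_device, members):
    """Build (but do not execute) an mdadm command that creates a RAID1 array."""
    if len(members) < 2:
        raise ValueError("RAID1 needs at least two member devices")
    return [
        "mdadm", "--create", md_device,
        "--level=1",
        f"--raid-devices={len(members)}",
        *members,
    ]


# Hypothetical names: one local partition and two DST-backed remote devices.
cmd = mdadm_create_raid1("/dev/md0", ["/dev/sda1", "/dev/dst-b", "/dev/dst-c"])
print(shlex.join(cmd))
```

Printing the command instead of running it keeps the sketch safe to try on a machine without the devices present.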

When one node in DST fails, the administrator has two options:

  • remove failed node from the array via usual mdadm commands just like with local disks
  • rely on internal DST transaction support to complete pending writes when node goes online again

The former should be used when you know that the node will not be brought back online quickly, for example when the corresponding remote node is taken offline for long maintenance. The latter is perfect when you work with a faulty network that blinks, or when the offline period is short enough.
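For the first option, a sketch of the mdadm manage-mode commands involved might look like the following. It only builds the command lines; the member path /dev/dst-b is a hypothetical stand-in for the device backed by the failed node.

```python
def mdadm_drop_member(md_device, member):
    """Command that marks a member as failed and removes it from the array."""
    return ["mdadm", md_device, "--fail", member, "--remove", member]


def mdadm_readd_member(md_device, member):
    """Command that adds the member back once the node is reachable again."""
    return ["mdadm", md_device, "--add", member]


# Hypothetical device name for the member whose remote node went offline.
drop_cmd = mdadm_drop_member("/dev/md0", "/dev/dst-b")
readd_cmd = mdadm_readd_member("/dev/md0", "/dev/dst-b")
print(" ".join(drop_cmd))
print(" ".join(readd_cmd))
```

Adding the member back this way normally triggers a resync of that member, which is the cost the second option avoids for short outages.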

Transactions are not freed in DST until a specified number of resend iterations has passed; at that point the transaction is committed with an error, which is propagated to the higher layer (RAID in this case). Until then, every write will be stuck waiting for the remote node to confirm that it has completed. So in a RAID1 configuration with multiple remote nodes, no write completes until all remote nodes have committed it. With a long enough resend timeout in DST (or a large enough resend threshold) this may take a while but, on the other hand, it will not force a full array check after a short offline period.

By default RAID will mark the array as degraded about 25 seconds after a node goes down (5 failed resends, each after a 5-second timeout). During this time all writes will be stuck waiting for the node to come back online, and DST returns no errors to the RAID level.
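The timing above can be checked with a small toy model. This is a simplified sketch, not DST's actual code: it assumes each resend fires exactly one timeout after the previous attempt and succeeds if the node is back by then. The function and parameter names are made up for illustration; only the defaults (5 resends, 5 seconds) come from the text.

```python
def seconds_until_error(resend_timeout_s=5, max_resends=5):
    """Worst-case time a write stays pending before an error reaches RAID."""
    return resend_timeout_s * max_resends


def simulate_write(node_back_after_s, resend_timeout_s=5, max_resends=5):
    """Return (outcome, seconds_elapsed) for one pending write.

    node_back_after_s is how long after the failure the node comes back.
    """
    elapsed = 0
    for _ in range(max_resends):
        elapsed += resend_timeout_s  # wait one timeout, then resend
        if node_back_after_s <= elapsed:
            return ("completed", elapsed)
    # All resends failed: the transaction is committed with an error.
    return ("error", elapsed)


print(seconds_until_error())   # 25
print(simulate_write(12))      # ('completed', 15)
print(simulate_write(60))      # ('error', 25)
```

In this model a 12-second outage only delays the write, while a 60-second outage surfaces an error after 25 seconds, which is when RAID would mark the array as degraded.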

(heh, that subject is way better :)

Thanks a lot for the answer. Sorry for my delay, I've been out of town for a while and busy at university.

(I think :) I understand what you said. But I do not fully understand what I should do in this situation: I only create /dev/md0 on machine A. Machine A goes down for a long time. What should I do? What is reasonable to do in this situation if I want to continue writing to the array with machines B and C in the meantime? Or must I always create an array on machine A, an identical array on machine B and another identical array on machine C, and mount only one of them?

I mean, should I create another array (when machine A is down) on machine B or C and then start writing to it? If this is reasonable, what would happen when machine A comes back up (because it also has /dev/md0; apart from the problem of it being mounted on different machines at the same time, let's assume that did not happen)? Should I make those md arrays not start automatically, then export machine A's block device and, on machine B (or C, or whichever one I created the array on :), add it to the array?

What I really should do is install some UMLs and start playing with it; I think in December I will have time for this :-). Of course you can just ignore my disturbing questions without any problem ;)

Thanks a lot,

What you described, multiple machines that are supposed to operate on the same storage in parallel, is not a task for DST but for a distributed filesystem. DST is a block-level device which connects a remote node to the local storage, so it is up to the higher layer to guard access to the storage from different machines.

In the simplest case, when you need to create a highly available cluster where there are a number of machines and only one of them is allowed to operate on a given storage at a time, you need to create an md storage on each of them and turn it on on some node after the 'master' node is detected to have failed and a new node has been selected as the new master. One can use heartbeat or the CARP protocol for that.

Parallel access to the same storage with data coherency support is a non-trivial task which requires special software. POHMELFS is a distributed parallel filesystem which was designed for this task. There are others, of course.

I knew I needed some higher layer like heartbeat or CARP or whatever. But I wanted to know what was good to do once I know what I want to do (i.e. "node X becomes the master"). Thanks a lot again for the answer, it was really clear :)

Hope it gets into the mainline kernel :-D