An interesting thread started recently on the BTRFS mailing list about the features a filesystem should contain for some users to actively use it. Besides that, a good question was raised about how to handle partial write errors.
Consider the case where a sequence of writes is finished with a barrier call. In theory the whole action completes perfectly, but in real life any write in that sequence may fail. The error is returned to the system, which either returns it to the user or just marks the page as bad, but every subsequent write succeeds, and so does the barrier call, so the filesystem may believe that everything is OK except for the failed writes. Now, if we have a power loss or disk removal, the system is not in a consistent state, since the successful subsequent writes might depend on the failed one (like a directory metadata update that depends on the failed file metadata write).
This problem can be handled by a filesystem that splits major updates and does not allow subsequent writes if a previous one failed. But any filesystem that does batched writes (afaics even the ext* filesystems do not write journal entries one after another, but flush them as a whole) has the described problem, since a failed write may be detected too late.
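To make the difference concrete, here is a toy sketch of the two strategies. Nothing in it is real filesystem code: `FlakyDisk`, `batched_flush` and `ordered_flush` are hypothetical names invented for illustration, and the "disk" is just a dictionary that refuses writes to selected blocks.

```python
class FlakyDisk:
    """Toy block device (hypothetical): writes to blocks in `bad` fail."""

    def __init__(self, bad):
        self.bad = set(bad)
        self.blocks = {}

    def write(self, block, data):
        if block in self.bad:
            return False          # the error is reported, nothing is stored
        self.blocks[block] = data
        return True


def batched_flush(disk, writes):
    """Submit every write, and only afterwards look at which ones failed."""
    return [block for block, data in writes if not disk.write(block, data)]


def ordered_flush(disk, writes):
    """Stop at the first failure so dependent writes never reach the disk."""
    for block, data in writes:
        if not disk.write(block, data):
            return [block]
    return []


# The directory update (block 2) depends on the file metadata (block 1).
writes = [(1, "inode for file"), (2, "dir entry -> inode 1")]

batched = FlakyDisk(bad={1})
batched_failed = batched_flush(batched, writes)
# batched.blocks holds the directory entry but not the inode it points to.

ordered = FlakyDisk(bad={1})
ordered_failed = ordered_flush(ordered, writes)
# ordered.blocks is empty: the dependent write was never issued.
```

Both variants report the same failed block, but only the batched one leaves a dangling directory entry behind if the power goes out right after the flush.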
DST and POHMELFS use a different approach, since the network media they work on is far too unstable in that regard: we have to deal not only with power outages or disk swaps, but also with temporary network outages, which are part of usual life even in high-end networks. Both DST as a network block storage and POHMELFS as a network filesystem use a transaction approach, where a number of meaningful operations may be combined into a single entity, which is fully repeated in case of errors. In this case the server will not reply with successful completion if an intermediate write fails, and the given transaction (including previous and subsequent writes, barriers and whatever else) will be resent. In case of reading from POHMELFS, the retry will be done from a different server (if one exists in the config).
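The resend logic can be sketched roughly like this. It is only an illustration of the idea, not the actual DST or POHMELFS code: the `Server` class, `send_transaction` and the key names are all made up, and the sketch assumes that replaying a write is harmless (each write simply overwrites the same key).

```python
class Server:
    """Toy server (hypothetical): a write to a key in `fail_once_on` fails once."""

    def __init__(self, fail_once_on):
        self.fail_once_on = set(fail_once_on)
        self.store = {}

    def apply(self, transaction):
        for key, value in transaction:
            if key in self.fail_once_on:
                self.fail_once_on.discard(key)   # transient error, gone on retry
                return False                     # no successful completion reply
            self.store[key] = value
        return True


def send_transaction(server, transaction, max_attempts=3):
    """Resend the whole transaction until the server acknowledges it."""
    for attempt in range(1, max_attempts + 1):
        if server.apply(transaction):
            return attempt
    raise OSError("transaction was not acknowledged")


transaction = [
    ("inode", "file metadata"),
    ("data", "file contents"),
    ("dirent", "dir entry -> inode"),
]

server = Server(fail_once_on={"data"})
attempts = send_transaction(server, transaction)
# The first attempt stops at the failed "data" write; the second attempt
# replays every operation, including the ones that had already succeeded.
```

The point is that the client never tries to work out which individual write was lost: the unit of retry is the whole transaction.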
This is not some kind of new feature of DST or POHMELFS; different kinds of transactions exist even in local filesystems. Iirc a journal update can be considered as such in ext4, but not the data write and journal write as a whole, i.e. multiple dependent metadata updates may not be properly guarded by journal transactions, but I may be wrong. BTRFS likely uses transactions as a COW update, i.e. allocation of the new node and the appropriate btree update. But for network filesystems this is an exceptionally useful feature.