Is overwrite a bad decision? Distributed transactional filesystem
strugling Enjoying the muscle pain switches brain into the thinking mode compared to the usual slacking one. This brought me a nice idea of combining POSIX filesystem with the distributed transactional approach used in the elliptics network.
Every POSIX filesystem as long as usual write applications are supposed to overwrite the data placed in the middle of the object. Transactional storage actually does the same – the elliptics network overwrites the local object, but it also creates a new object which stores update transaction itself. It is potentially placed on the different nodes in the cloud. With the simple extension it is possible not to overwrite the original object and redirect all reads to fetch different transactions (or their parts) instead.
What if the POSIX filesystem will not actually overwrite the data, since it requires either a complex cache-coherecy protocol to be involved between multiple clients, working with the same object, and server, which complexity quickly grows when we want to have multiple servers; or use write-through cache (still with races though), which kills the performace compared to the write-back one for the local operations.
Basic idea is to never lock the object itself, it is never updated, only its history log, which is rather small and its updates can be serialized. Every transaction is placed in the different place (potentially – it depends on the network configuration), so when we want to read some data from the object, we check the history log instead, which contains sizes, offsets and the IDs of the data written, and fetch needed transactions (or their parts) from the network and not from the file itseld.
First, this allows to read data in parallel even if object itself was never mirrored to the different nodes.
Second, updates will lock the history index for the very short time, writes itself will not lock anything and will be done in parallel to the multiple nodes, since each transaction will move to the unique location.
Third, history lock may be done distributed, since overhead over its short aciquire time should still be small enough compared to the time needed to write huge transaction into the object and lock over this operation.
Moreover we can eliminate history update locking completely by using versioning of the object state, i.e. all clients who previously read that object still have a valid copy, but with the different version, and thier states are consistent, but not up-to-date. This may rise some concerns from the POSIX side, but overall idea looks very appealing.
As of negative sides, this will force POHMELFS server not to work with the local storage as we know it today – it will become part of the distributed network and thus will store all the data (when used in single node mode, i.e. as a network and not distributed filesystem) in a strange format used currently in the elliptics network – a directories full of files named as 40 chars instead of common names.
POSIX issues introduce potentially serious limitations, but idea looks very promising so far and I will definitely think about its implementation in the POHMELFS.