ioremap.net

Storage and beyond

Is overwrite a bad decision? Distributed transactional filesystem

strugling Enjoying the muscle pain switches brain into the thinking mode compared to the usual slacking one. This brought me a nice idea of combining POSIX filesystem with the distributed transactional approach used in the elliptics network.

Every POSIX filesystem as long as usual write applications are supposed to overwrite the data placed in the middle of the object. Transactional storage actually does the same – the elliptics network overwrites the local object, but it also creates a new object which stores update transaction itself. It is potentially placed on the different nodes in the cloud. With the simple extension it is possible not to overwrite the original object and redirect all reads to fetch different transactions (or their parts) instead.

What if the POSIX filesystem will not actually overwrite the data, since it requires either a complex cache-coherecy protocol to be involved between multiple clients, working with the same object, and server, which complexity quickly grows when we want to have multiple servers; or use write-through cache (still with races though), which kills the performace compared to the write-back one for the local operations.

Basic idea is to never lock the object itself, it is never updated, only its history log, which is rather small and its updates can be serialized. Every transaction is placed in the different place (potentially – it depends on the network configuration), so when we want to read some data from the object, we check the history log instead, which contains sizes, offsets and the IDs of the data written, and fetch needed transactions (or their parts) from the network and not from the file itseld.

First, this allows to read data in parallel even if object itself was never mirrored to the different nodes.
Second, updates will lock the history index for the very short time, writes itself will not lock anything and will be done in parallel to the multiple nodes, since each transaction will move to the unique location.
Third, history lock may be done distributed, since overhead over its short aciquire time should still be small enough compared to the time needed to write huge transaction into the object and lock over this operation.

Moreover we can eliminate history update locking completely by using versioning of the object state, i.e. all clients who previously read that object still have a valid copy, but with the different version, and thier states are consistent, but not up-to-date. This may rise some concerns from the POSIX side, but overall idea looks very appealing.

As of negative sides, this will force POHMELFS server not to work with the local storage as we know it today – it will become part of the distributed network and thus will store all the data (when used in single node mode, i.e. as a network and not distributed filesystem) in a strange format used currently in the elliptics network – a directories full of files named as 40 chars instead of common names.

POSIX issues introduce potentially serious limitations, but idea looks very promising so far and I will definitely think about its implementation in the POHMELFS.

Comments are currently closed.

2 Responses to “Is overwrite a bad decision? Distributed transactional filesystem”

  • jonsmirl says:

    You are describing record versioning, a feature found on Oracle DBMS. When you run transactions you can control the trade off of staleness versus performance. I don’t think there is ever a need for a consistent view with multiple writers on a WAN based file system. If you absolutely need a consistent view you should be using a database engine.

    Delta chains are valuable here fpr minimizing replication traffic. When a node notices it is missing an object that it needs for replication, it could send a message to the node that has the object. The reply could indicate a list of delta candidates. Then if the original server has any of those objects, request a delta to build the missing object instead of cloning the entire object.

    Also think about log based replication. Proactively send the logs to other servers, replay them to quickly find missing objects that need to be created.

    There’s no need for the 40 chars names to be visible to the user.

  • zbr says:

    Logs should be replicated over the network, that’s for sure, the fact that they are only stored with the object itself is a limitation even for the recovery process not talking about manual construction of the previous state.

    I implement this with the server-side replication – in this case server will decide how many copies of the original object and thus its history log should exist in the network. Client may still send multiple copies himself.

    Names of the objects are not visible to the user, but the owner of the storage node can see them in the local filesystem. When we connect existing POHMELFS or NFS client to the appropriate server we export existing directory where objects are placed with the known names in the common directory based tree, With the new approach every write will not be really a object with the common name, but some non-obvious blob.
    And this will prevent to work with the storage locally on the server as with local filesystem, which may be rather unconvenient. Although it could be possible to mount exported storage locally and get access to the whole data via common path-based approach.

    Moreover, when I add stackable backend storage for the nodes, it will be possible to implement database storage for example, and in this case admin will not see data objects in the local filesystem at all, but this is quite a huge step away from the known network filesystems and how they may be accessed locally on the server itself.

    I want to allow POHMELFS to be a network filesystem the same way we work with NFS or CIFS today, when exported data is accessible on the server also, but it goes in the different direction with the object-based storage I want to use in the distributed filesystem.