POHMELFS and DST usage cases, design and roadmap.

Tagged:  

POHMELFS kernel client at its current state and by design is a parallel network client to the distributed filesystem (called elliptics network in the design notes) itself. So you can consider it like parallel NFS (but with fast protocol), where parallel means read balancing and redundancy writing to multiple nodes. POHMELFS heavily utilizes local coherent data and metadata cache, so it is also very high-performance, but that's it: a simple client, which is able (with server's help of course) to maintain a coherent cache of data and metadata on multiple clients, which work with the same servers.

So, right now it is effectively a way to create data servers, where client sees only set of the same nodes, and balance operations (when appropriate) between them. Server itself works with local filesystem, which can be built on top of whtever raid combinations you like of the disks or DST nodes. So effectively in this case POHMELFS mounted node can be extended by increasing local filesystem on the server.

That's what the current state of the POHMELFS is. Next step in this direction is to extend server only (modulo some new commands to the client if needed and proper statistics) and add distributed facilities. By design there will be a cloud of servers, where each of which has own ID and data is distributed over this cloud, which has elliptics network working title. This name somewhat reflects the main algorithm of the ID distribution. Cloud will not have any dedicated metadata servers and every node will have exactly the same priority and usage case. POHMELFS kenel client (and thus usual user, which works with mounted POHMEL filesystem node) connects to the arbitrary server in the cloud and asks for needed data, this request is transformed into the elliptics network format and is forwarded to the server, which hosts it, data is returned to the client and it does not even know that it was stored elsewhere. In this case it is possible to infinitely extend the server space by adding new nodes, which will automatically join the network and will not require any work from client or administrator. Redundancy is maintained by having multiple nodes with the same id, so client will balance reading between them and write to them simultaneously. Node join protocol will maintain coherency of the updated and old data.

Proof-of-concept implementation is scheduled for the next month or so, this should be working (but simple enough for the start) library which can be used by other applications. Then I will integrate it with existing POHMELFS server. This is optimistic timings though, it depends on how many bugs will be found in all projects I maintain :)

DST is a network block device. It has a dumb protocol, which allows to connect two machines and use read/write commands between them (each command is effectively a pointer to where data should be stored and its size). So, there is always one machine, which exports some storage, and one which connects to it. There are no locks, no protection against parallel usage, nothing. Just plain IO commands. System can start several connections to the remote nodes and thus will have multiple block devices, which appear like local disks. Administrator can combine those block devices into single one via device mapper/lvm or mount btrfs on top of multiple nodes. And then export it via POHMELFS to some clients or work with it locally.
I consider DST as a completed project.

I did not write more detailed feature description of both POHMELFS and DST and how they are used in the failover cases or data integrity, it is always possible to grab those design notes from the appropriate homepages.

This seems potentially problematic. It means that:

- A new node replica requires a large amount of work before becoming completely useful, i.e. it needs to mirror an existing replica. You could allow IO to be serviced by the node while it is being built, but then you have a situation where there is unpredictable network load and performance, e.g. some reads from a particular node ID occur in 20ms, others in 50ms, while the new node, and its copy source are flooded by a 1gb/sec copy.

- A node going offline results in a loss of redundancy for only the data existing in that node ID, i.e. the negative effect of a failure is concentrated on a subset of your data, rather than being more evenly spread across the storage pool. I can't decide if this is a good or bad thing.

- Extending your storage pool and initially planning it becomes more involved, since total nodes needs to be a factor of the number of replicas you desire for your data, which itself depends on what type of data is involved and how it will be used.

There is also more restricted possibility for dynamically balancing load in the pool in the future, say, by measuring read count for particular objects. The only possibility is allowing the system to commision new machines itself, or page an operator, whereas with object-level replication, the system could more easily schedule more nodes to serve the same objects (e.g. the way BigTable works).

Likely I will not allow IO to the joining replica, since things already worked without that node, so new node will join after it mirrors all the data.

There is a possibility to always assign several replicas to each node, and since each node prior joining does not know what data it will host, and all its data will be spread over the appropriate nodes, there should be no no problems with node's offline.

Object-level replication means replication for the same id, which in turn means replication of the given node. In the approach I want to imlement node, which exports some data when joined, will not host particulary that set of objects, but likely only small part of it if any, they will be copied into the cloud to some other nodes, which ID correspond to ID of the added objects.