POHMELFS, data integrity and versioning
As you might know, the new POHMELFS will be a fully versioned filesystem, since it will work on top of the transactional, log-structured distributed hash table called the elliptics network.
Object versioning implies that each update is a separate version, so each write should be a separate transaction. There are two ways to do this: put the data into a separate fixed-size storage and sleep when it is full, or allocate new storage for the new data in each write call.
The former is what the network stack does: there is a limited socket buffer, which we can fill either by copying data into it or by pinning external data pages. Either way it has a limited size, and when the socket buffer is full no new writes are possible, so we either sleep or return an error.
The other way is a bit different: for a subsequent write into the same area we allocate new storage and copy the data there. Effectively both methods are the same, but in the first one we kind of ‘allocate’ from a fixed-size area, while in the second one we allocate from the main system memory allocator, where we block either on reaching some limit or when there is no more free memory.
Pages or data blocks can then be attached to a transaction, which will commit them to disk or to a remote node. I decided to use the first method, so that each write allocates a transaction and the data is sent to the remote nodes. If the socket buffer is full, the write blocks.
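To make the first method concrete, here is a minimal userspace sketch in Python. It is not POHMELFS code: the class and method names are hypothetical, and a condition variable stands in for the kernel's socket buffer accounting. Each write reserves space in a fixed-size area and gets its own transaction number; when the area is full the writer sleeps, or gets an error if a timeout is given.

```python
import threading


class BoundedSendBuffer:
    """Toy model of a fixed-size socket buffer (hypothetical, not kernel code)."""

    def __init__(self, capacity):
        self.capacity = capacity      # total bytes the buffer can hold
        self.used = 0                 # bytes currently queued for sending
        self.next_tid = 0             # last transaction number handed out
        self.cond = threading.Condition()

    def write(self, data, timeout=None):
        """Reserve space for one write, i.e. one transaction.

        Blocks while the buffer is full; raises TimeoutError if `timeout`
        seconds pass without enough space becoming free.
        """
        size = len(data)
        if size > self.capacity:
            raise ValueError("write is larger than the whole buffer")
        with self.cond:
            has_room = self.cond.wait_for(
                lambda: self.used + size <= self.capacity, timeout
            )
            if not has_room:
                raise TimeoutError("send buffer is full")
            self.used += size
            self.next_tid += 1
            return self.next_tid

    def complete(self, size):
        """Release `size` bytes, e.g. after the remote node acknowledged them."""
        with self.cond:
            self.used -= size
            self.cond.notify_all()    # wake up writers sleeping in write()
```

The second method would look the same from the caller's side; only the source of the space changes, from a fixed area to the general memory allocator with its own limit.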
This has a fair number of cons, namely the need to copy data twice: from userspace into the page cache and then from the page cache into the socket buffer. It is possible to use sendpage() and friends, but this would force us to have a per-inode write lock to order writes, and even that may not be enough, since the way sendpage() works clearly allows a page to be written to while it is being transmitted.
Since I decided to send data at write time, the system has to hash the data at the same time to create the transaction ID (which becomes the data checksum). The hashes used to generate IDs (they are called transformation functions, since they ‘transform’ data into fixed-size IDs) are provided as a mount option (HMAC is not supported for now, only plain hashes).
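As an illustration of what a transformation function does, here is a small Python sketch. It assumes the hash is chosen by name, the way a mount option would name it; the helper name `make_transform` is hypothetical and the real mount option syntax is not shown here.

```python
import hashlib


def make_transform(hash_name):
    """Return a transformation function for the named plain hash.

    The returned function maps arbitrary data to a fixed-size ID, which
    serves both as the transaction ID and as the data checksum.
    """
    def transform(data):
        return hashlib.new(hash_name, data).digest()
    return transform


# The hash name would come from the mount options; "sha256" is just an example.
transform = make_transform("sha256")
transaction_id = transform(b"data passed to write()")
```

Because the ID is derived from the data itself, writing different data produces a different ID, which is what makes each write a separate version.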
The Linux crypto API requires a preallocated crypto structure to work with, and allocating as well as freeing it is a rather costly process (there are global locks and potentially very long waits).
So I decided to preallocate and initialize a number of those crypto control structures at superblock allocation time, i.e. during mount option parsing.
A write operation will block waiting for a free crypto worker, which will process the data when ready. It does not use a pool of threads; instead the work is done on behalf of the writing process, thus scaling with the number of writers, but limited by the number of crypto control structures allocated at mount time. It will be a remount-configurable option of course, but only in the future: there is no remount hook for now :)
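The pool described above can be sketched in userspace Python as follows. This is only an analogy under stated assumptions: `CryptoPool` is a hypothetical name, Python hash objects stand in for the kernel crypto control structures, and a blocking queue stands in for whatever wait mechanism the kernel code uses. The point it shows is that contexts are allocated once, and the hashing runs in the caller's own thread rather than in a worker thread.

```python
import hashlib
import queue


class CryptoPool:
    """Fixed pool of preallocated hash contexts (hypothetical sketch)."""

    def __init__(self, hash_name, count):
        # Allocate every context up front, as would happen at mount time.
        self._free = queue.Queue()
        for _ in range(count):
            self._free.put(hashlib.new(hash_name))

    def digest(self, data, timeout=None):
        """Hash `data` in the calling thread using a pooled context.

        Blocks while all contexts are busy; raises queue.Empty on timeout.
        """
        ctx = self._free.get(timeout=timeout)
        try:
            work = ctx.copy()         # fresh state derived from the pooled context
            work.update(data)
            return work.digest()
        finally:
            self._free.put(ctx)       # hand the context back to waiting writers
```

With `count` contexts at most `count` writers hash at the same time; the rest sleep in `digest()` until a context is returned.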
This decision has a number of cons too, but its pros look very promising.
The architecture looks interesting, but only practice will tell. So, stay tuned!

I am curious if you have investigated data locality in pohmelfs?
When building a network filesystem with distributed data storage, it can be demonstrated that efficiencies arise from increased data locality. For example, if all files in directory /a/b/c are often accessed together, then latency of the overall system is decreased when all data for directory /a/b/c is concentrated onto a limited number of data storage nodes.
In particular, in my experiments with http://www.nicemice.net/amc/research/tangle/ I have seen that certain use patterns with a distributed hash table often result in data being spread too widely, hurting locality. Filesystem clients wind up contacting 100+ servers simply to read all files in a single directory!
Observation over time shows us that most filesystem clients will focus on a tiny subset of a large filesystem, e.g. /home/jgarzik rather than /home/*, or /data/oracle rather than /data/*. Data should be clustered such that overall filesystem client work is minimized.
There is such a thing as being too parallel, with large systems based on distributed hash tables.
Try some experiments with 100 KVM virtual machines, each with 2GB of storage, as your networked data storage “cloud”. Examples like that may help with tuning pohmelfs for deployments greater than 5-10 storage nodes.
-jgarzik
I studied this problem in the elliptics network and introduced tunable transformation functions, which can be used to implement whatever ID generation principles are needed, so that objects close in the UNIX namespace are also close in the generic elliptics network namespace.
The simplest case is described on the homepage: it forces the transformation function to assign the highest ID byte(s) to the number of a datacenter or a set of computers. For example, it can use 0 for /home/* objects, 1 for /home/jgarzik/*, 2 for /home/zbr/* and so on, while the rest of the ID bytes are generated by the usual hash functions or some external index. Server nodes can then select IDs whose highest byte equals the locality group we want to enlarge.
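A rough Python sketch of such a transformation function is shown below. The group table follows the example in the text, but everything else is an assumption made for illustration: the function names, hashing the path with SHA-256 for the remaining bytes, using a single group byte, and the fallback group for paths outside the table.

```python
import hashlib

# Prefix -> locality group, taken from the example in the text.
LOCALITY_GROUPS = {
    "/home/": 0,
    "/home/jgarzik/": 1,
    "/home/zbr/": 2,
}
DEFAULT_GROUP = 255  # hypothetical group for paths matching no prefix


def locality_group(path):
    """Pick the group of the longest matching prefix."""
    matches = [prefix for prefix in LOCALITY_GROUPS if path.startswith(prefix)]
    best = max(matches, key=len, default=None)
    return DEFAULT_GROUP if best is None else LOCALITY_GROUPS[best]


def locality_id(path):
    """Build an ID whose highest byte is the locality group.

    The remaining bytes come from an ordinary hash (here of the path,
    which is an assumption; an external index would work as well).
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return bytes([locality_group(path)]) + digest[1:]
```

All objects under /home/jgarzik/ then share the same highest ID byte, so a node that takes over that part of the ID space ends up holding them together.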
This is the simplest example of course, but it shows that flexible ID generation makes it possible to implement locality groups within the existing design. The initial release will have nothing that complex: POHMELFS will use plain Linux crypto functions first. For the elliptics network, though, one can implement such functions in one's own project and provide those callbacks during initialization.