POHMELFS, data integrity and versioning
Object versioning implies that each update is supposed to be a separate version, so each write should be a separate transaction. There are two ways to achieve this: put data into a fixed-size storage and sleep when it is full, or allocate new storage for the data in every write call.
The former is what the network does – there is a limited socket buffer, which we can fill either by copying data into it or by pinning external data pages. Either way it has a limited size, and when the socket buffer is full, no new writes are possible, so we either sleep or return an error.
The other way is a bit different – for a subsequent write into the same area we allocate new storage and copy the data there. Effectively both methods are the same, but in the first one we ‘allocate’ from a fixed-size area, while in the second one we allocate from the main system memory allocator, where we block either on reaching some limit or when there is no more free memory.
Pages or data blocks can then be attached to a transaction, which will commit them to disk or to a remote node. I decided to use the first method: each write allocates a transaction, and its data is sent to the remote nodes. If the socket buffer is full, the write blocks.
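To make this concrete, here is a minimal sketch of what such a per-write transaction could look like; the structure and field names are made up for illustration and are not taken from the actual POHMELFS code.

#include <linux/list.h>
#include <linux/types.h>

struct page;

/* Hypothetical per-write transaction: every write allocates one of these,
 * attaches the pages it dirtied and queues it for transfer to the remote
 * nodes.  Names and layout are illustrative only. */
struct pohmelfs_trans {
	u64			id;		/* transaction ID, i.e. the data checksum */
	struct list_head	trans_entry;	/* entry in the send queue */
	unsigned int		page_num;	/* number of attached pages */
	struct page		*pages[];	/* the data to be committed */
};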
This has a fair number of cons, notably the need to copy data twice – from userspace into the page cache and then from the page cache into the socket buffer. It is possible to use sendpage() and friends, but that would force us to have a per-inode write lock to order writes, and even that may not be enough, since the way sendpage() works clearly allows writing into a page while it is being transmitted.
Since I decided to send data at write time, the system has to hash the data at the same time to create the transaction ID (which thus becomes the data checksum). The hashes used to generate IDs (they are called transformation functions, since they ‘transform’ data into fixed-size IDs) are provided as a mount option (HMAC is not supported for now, only a plain hash).
The Linux crypto API requires a preallocated crypto structure to work with, and allocating and freeing it is a rather costly process (there are global locks and potentially very long waits).
So I decided to preallocate and initialize a number of those crypto control structures at superblock allocation time, i.e. during mount option parsing.
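Roughly, the preallocation could look like the sketch below. It uses the kernel's current shash API (which postdates the original code) and made-up names (pohmelfs_crypto_pool, pohmelfs_crypto_pool_init); treat it as an illustration of the idea, not the actual POHMELFS code.

#include <crypto/hash.h>
#include <linux/bitmap.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Hypothetical pool of hash transforms (‘transformation functions’),
 * preallocated while the superblock is being set up so the write path
 * never hits the costly crypto allocation/freeing paths. */
struct pohmelfs_crypto_pool {
	struct crypto_shash	**tfm;	/* preallocated transforms */
	unsigned long		*busy;	/* which of them are currently in use */
	unsigned int		num;	/* pool size, taken from a mount option */
	spinlock_t		lock;
	wait_queue_head_t	wait;	/* writers sleeping until one is free */
};

static int pohmelfs_crypto_pool_init(struct pohmelfs_crypto_pool *pool,
		const char *hash_name, unsigned int num)
{
	unsigned int i;

	pool->tfm = kcalloc(num, sizeof(*pool->tfm), GFP_KERNEL);
	pool->busy = bitmap_zalloc(num, GFP_KERNEL);
	if (!pool->tfm || !pool->busy)
		return -ENOMEM;	/* error-path cleanup omitted for brevity */

	for (i = 0; i < num; ++i) {
		pool->tfm[i] = crypto_alloc_shash(hash_name, 0, 0);
		if (IS_ERR(pool->tfm[i]))
			return PTR_ERR(pool->tfm[i]);
	}

	pool->num = num;
	spin_lock_init(&pool->lock);
	init_waitqueue_head(&pool->wait);
	return 0;
}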
A write operation will block waiting for a free crypto worker, which will process the data when ready. There is no pool of threads; instead the work is done on behalf of the writing process, thus scaling with the number of writes, but limited by the number of crypto control structures allocated at mount time. This will of course become a remount-configurable option, but only in the future – there is no remount hook for now :)
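Continuing the same hypothetical sketch, the write path could then block for a free transform and hash the data itself, along these lines (locking and error handling are simplified):

/* Hypothetical write-path helper: grab a free transform from the pool above,
 * sleeping if all of them are busy, hash the data on behalf of the writing
 * process to produce the transaction ID, then release the transform and wake
 * up the next waiter. */
static int pohmelfs_trans_id(struct pohmelfs_crypto_pool *pool,
		const void *data, unsigned int size, u8 *id)
{
	unsigned int idx;
	int err;

	spin_lock(&pool->lock);
	while ((idx = find_first_zero_bit(pool->busy, pool->num)) == pool->num) {
		spin_unlock(&pool->lock);
		/* All transforms are busy: the writer blocks here. */
		err = wait_event_interruptible(pool->wait,
				!bitmap_full(pool->busy, pool->num));
		if (err)
			return err;
		spin_lock(&pool->lock);
	}
	set_bit(idx, pool->busy);
	spin_unlock(&pool->lock);

	{
		SHASH_DESC_ON_STACK(desc, pool->tfm[idx]);

		desc->tfm = pool->tfm[idx];
		err = crypto_shash_digest(desc, data, size, id);
	}

	clear_bit(idx, pool->busy);
	wake_up(&pool->wait);
	return err;
}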
This decision also has a number of cons, but its pros look very promising.
The architecture looks interesting, but only practice will draw the conclusion line. So, stay tuned!