Tag Archives: Filesystems

Filesystems development

POHMELFS and DST use cases, design and roadmap.

The POHMELFS kernel client, in its current state and by design, is a parallel network client for the distributed filesystem itself (called the elliptics network in the design notes). You can think of it as a parallel NFS (but with a faster protocol), where parallel means read balancing across multiple nodes and redundant writes to them. POHMELFS heavily utilizes a local coherent data and metadata cache, so it is also very high-performance, but that’s all it is: a simple client which is able (with the server’s help, of course) to maintain a coherent cache of data and metadata across multiple clients working with the same servers.
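To picture what the read balancing and redundant writes could look like, here is a minimal sketch of a client spreading such operations over the same set of nodes; the helper names (server_read(), server_write() and so on) are hypothetical and are not taken from the actual POHMELFS code.

#include <stddef.h>

struct server;

/* Assumed per-node IO helpers; the real client talks its own protocol. */
int server_read(struct server *srv, void *buf, size_t size, long long off);
int server_write(struct server *srv, const void *buf, size_t size, long long off);

/* Read balancing: pick the next node in a simple round-robin fashion. */
int balanced_read(struct server **srv, size_t nr, void *buf, size_t size,
		  long long off)
{
	static size_t next;

	return server_read(srv[next++ % nr], buf, size, off);
}

/* Redundant write: the same data goes to every configured node. */
int redundant_write(struct server **srv, size_t nr, const void *buf,
		    size_t size, long long off)
{
	size_t i;
	int err = 0;

	for (i = 0; i < nr; ++i) {
		int ret = server_write(srv[i], buf, size, off);
		if (ret)
			err = ret;	/* remember the last failure */
	}

	return err;
}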

So, right now it is effectively a way to create data servers, where the client sees only a set of identical nodes and balances operations (when appropriate) between them. The server itself works with a local filesystem, which can be built on top of whatever RAID combination of disks or DST nodes you like. So effectively in this case a mounted POHMELFS node can be extended by growing the local filesystem on the server.

That’s the current state of POHMELFS. The next step in this direction is to extend the server only (modulo some new commands for the client, if needed, and proper statistics) and add distributed facilities. By design there will be a cloud of servers, each of which has its own ID, and data will be distributed over this cloud, which has the working title elliptics network. The name somewhat reflects the main algorithm of the ID distribution. The cloud will not have any dedicated metadata servers, and every node will have exactly the same priority and usage pattern. The POHMELFS kernel client (and thus the ordinary user working with a mounted POHMEL filesystem node) connects to an arbitrary server in the cloud and asks for the needed data; the request is transformed into the elliptics network format and forwarded to the server which hosts the data, the data is returned to the client, and the client does not even know that it was stored elsewhere. In this case it is possible to extend the server space indefinitely by adding new nodes, which will automatically join the network and will not require any work from the client or the administrator. Redundancy is maintained by having multiple nodes with the same ID, so the client will balance reads between them and write to them simultaneously. The node join protocol will maintain coherency of the updated and old data.
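The post does not spell out the exact ID distribution algorithm, so here is only a rough sketch of the usual ring-style lookup such a cloud could use: every node owns a point in an ID space, and an object is served by the first node whose ID is not smaller than the object ID. All names here are hypothetical.

#include <stdint.h>
#include <stddef.h>

struct node {
	uint64_t id;	/* position of the node in the ID space */
	int sock;	/* connection to that node */
};

/*
 * Pick the node responsible for an object ID on a ring of nodes sorted
 * by ID, wrapping around to the first node when nothing is larger.
 */
static struct node *lookup_node(struct node *ring, size_t nr, uint64_t obj_id)
{
	size_t i;

	for (i = 0; i < nr; ++i)
		if (ring[i].id >= obj_id)
			return &ring[i];

	return &ring[0];	/* wrap around the ring */
}

Any server that receives a client request can run such a lookup and forward the request to the owning node, which is why the client never needs to know where the data actually lives.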

A proof-of-concept implementation is scheduled for the next month or so; it should be a working (but simple enough for a start) library which can be used by other applications. Then I will integrate it with the existing POHMELFS server. These are optimistic timings though, it all depends on how many bugs are found in the projects I maintain :)

DST is a network block device. It has a dumb protocol which allows two machines to be connected and read/write commands to be exchanged between them (each command is effectively a pointer to where the data should be stored plus its size). So there is always one machine which exports some storage and one which connects to it. There are no locks, no protection against parallel usage, nothing: just plain IO commands. A system can start several connections to remote nodes and thus get multiple block devices which appear as local disks. The administrator can combine those block devices into a single one via device mapper/LVM or mount btrfs on top of multiple nodes, and then export the result via POHMELFS to some clients or work with it locally.
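Just to picture how dumb such a command is, here is an illustrative layout with nothing but a direction, an offset and a size; this is not the actual DST wire format, and the names are made up for the example.

#include <stdint.h>

enum dst_cmd_dir {
	DST_CMD_READ,
	DST_CMD_WRITE,
};

/*
 * One plain IO command: where the data lives on the exported storage and
 * how much of it there is; write commands carry the payload after the
 * header.
 */
struct dst_cmd {
	uint32_t dir;		/* DST_CMD_READ or DST_CMD_WRITE */
	uint64_t offset;	/* byte offset on the exported storage */
	uint64_t size;		/* number of bytes to read or write */
} __attribute__((packed));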
I consider DST a completed project.

I did not write a more detailed feature description of POHMELFS and DST, or of how they are used in failover and data integrity cases; it is always possible to grab the design notes from the appropriate homepages.

IT development roadmap.

This will be a short post about which projects, and their status, are included in the nearest roadmap. I wrote IT because I will describe programming projects only, and not electronics for example.

So, let’s start with the two existing ones: DST and POHMELFS.
The former is essentially ready (with a fun experiment I ran with the latest version, kernel and public releases) and has been pushed upstream for some time now. The next version will be released in a week or so and will only have a new name. That will be the 10th distributed storage release, and the 4th resend of the same code.
The recently released POHMELFS got a bug right after the release, and it is supposed to be fixed in the current version (pull from both the kernel and userspace POHMELFS git repos). So far I do not see any major new changes needed in the POHMELFS client code, so essentially the kernel side will only be extended when distributed server changes require it. I will not push it upstream until the server side is also close to being finished.

Obviously both projects will be maintained and bugs will be fixed with the highest priority.

Here we come to the new project I have been thinking about for some time already. This is a network storage server built on top of a distributed hash table design, without a centralized architecture or the need for metadata servers. So far there is not that much code, only trivial bits, and I am designing the node join/leave part of the protocol. Some results are expected sooner rather than later, but not immediately.

And of course the brain needs something else to play with. Here comes the language parser (LISP XML parser) I mentioned previously, and some computer language application based on that idea. That’s what I’m about to work on for the next few days.

As another programming exercise I frequently think about buying myself a PlayStation 3 and playing with its SPU processors and parallel applications, for example graphics algorithms (the first one which comes to mind is the wavelet transformation and approximate image search I made several years ago; I even have the sources hidden in some old arch repo). Or playing with video card engines.
This does not have a very high priority though.

Those are the programming plans. Let’s see if anything gets completed anytime soon :)

New POHMELFS release.

POHMELFS is a high-performance distributed parallel filesystem.

This release contains the following changes:

  • combine locking with inode attribute changes
  • add get/set attribute commands to the cache coherency protocol (attributes are cached and flushed to the server at writeback time or on demand by the cache coherency protocol)
  • crypto thread pool cleanup
  • 2.6.27 rebase
  • optimize read/write locking and the number of messages needed for cache coherency management
  • debug cleanups
  • bug fixes

You know where to get the latest version, don’t you: the git tree and the archive are always open.

POHMELFS distributed locks solution.

The solution I found looks very clean, quite simple and, most importantly, I like it: now every get or set callback (generally speaking: directory reading is a get callback, for example, while data writing is a set one) is guarded by the appropriate distributed lock type. So when one system updates directory content or writes to an object (writing data to a file or changing some attributes), the update is not sent to the server and thus is not broadcast to the other clients. Instead it lives in the local filesystem cache until the client receives a message from the server to flush its data as part of the cache coherency protocol. When another client requests the same object, it is first synced to the server, and its new attributes are sent to the requesting client.
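A minimal sketch of that get/set guarding pattern is below, assuming hypothetical helpers (dlock_acquire(), flush_to_server() and so on); it only illustrates the flow described above, not the actual POHMELFS implementation.

enum dlock_type {
	DLOCK_READ,	/* shared lock for get callbacks (readdir, getattr) */
	DLOCK_WRITE,	/* exclusive lock for set callbacks (write, setattr) */
};

struct object;

/* Assumed helpers: take/drop the distributed lock and sync dirty state. */
int dlock_acquire(struct object *obj, enum dlock_type type);
void dlock_release(struct object *obj);
int flush_to_server(struct object *obj);

/* Set callback: the change stays in the local cache under the lock. */
int set_callback(struct object *obj)
{
	int err = dlock_acquire(obj, DLOCK_WRITE);

	if (err)
		return err;

	/* Update only the local page cache / cached attributes here;
	 * nothing is sent to the server and nothing is broadcast. */
	return 0;	/* the lock is kept until the server asks us to flush */
}

/* Called when the server tells this client to flush because another
 * client requested the same object. */
int coherency_callback(struct object *obj)
{
	int err = flush_to_server(obj);	/* sync data and new attributes */

	dlock_release(obj);
	return err;
}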

So far this implementation is under testing, and it has revealed a problem case with cryptography turned on, so after this issue is fixed, POHMELFS will get a new release.

Inotify PIDs and security.

Looks like my inotify patch was rejected because of a security violation. It may indeed sound like a security flaw to allow any process to watch what IO is performed by other processes. Well, it is a valid observation, so I created a new version, which only puts the PID into the inotify message when the UID of the inotify backend is either 0 or equal to the UID of the process doing the IO. So far it has received no comments though.
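For illustration only, the visibility rule described above boils down to a check like the following; the function and parameter names are made up and are not taken from the patch.

#include <stdbool.h>
#include <sys/types.h>

/* Expose the PID only to root or to the owner of the task doing the IO. */
static bool inotify_pid_visible(uid_t watcher_uid, uid_t io_task_uid)
{
	return watcher_uid == 0 || watcher_uid == io_task_uid;
}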

Since I do not quite see how this is a security flaw, and arguing about it would be endless, I accept that this limitation is valid. In the same line of ‘security’ flaws one could place the following small information leak I found in inotify.

In the classical UNIX permission model you are not allowed to get a directory listing if the directory has no read permission, or to change into it if it has no execute bit. Even if you created a directory with some permissions/owner bits, switched the current dir to this newly created one, and then changed its permissions/owner bits, you will be able to see neither its content nor any newly created objects. Here is an example:

$ mkdir /tmp/test
$ chmod 700 /tmp/test
$ cd /tmp/test
/tmp/test$ ls -lai
total 24
9486756 drwx------  2 zbr  zbr   4096 2008-11-10 15:40 .
 123969 drwxrwxrwt 34 root root 20480 2008-11-10 15:40 ..
/tmp/test$ sudo chown 0.0 .
[sudo] password for zbr:
/tmp/test$ ls -lai
ls: .: Permission denied
/tmp/test$

With inotify you are able to watch what is being done in a directory (or actually in any object) if you were able to attach a watch to its inode. So, if the object had read permission, we are able to attach a watch to it, and if it later changes its permissions, the watch will not be removed and we will still be able to watch its content. Like this:

libionotify-1.1$ LD_PRELOAD=./libionotify.so ./inotify -r /tmp/
CREATE: /tmp/test
CREATE: /tmp/test/test1
WRITE : /tmp/test/test1
CREATE: /tmp/test/test2
WRITE : /tmp/test/test2
READ  : /tmp/test/test2

while in parallel we do:

$ mkdir /tmp/test
$ chmod 700 /tmp/test
$ cd /tmp/test
/tmp/test$ sudo chown 0.0 .
/tmp/test$ sudo dd if=/dev/zero of=./test1 bs=4k count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000326018 seconds, 12.6 MB/s
/tmp/test$ sudo dd if=/dev/zero of=./test2 bs=4k count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000321779 seconds, 12.7 MB/s
/tmp/test$ sudo cat ./test2

A fix would be to check the inotify watch list for the given inode when its permissions are changed.
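The leak can also be reproduced without the libionotify preload, with a small userspace sketch like the one below: it uses the standard inotify API, attaches a watch to /tmp/test while the directory is still accessible, and keeps receiving events after the chown to root from the other terminal.

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void)
{
	/* Buffer aligned for struct inotify_event parsing. */
	char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
	ssize_t len;
	int fd;

	fd = inotify_init();
	if (fd < 0) {
		perror("inotify_init");
		return 1;
	}

	/* This succeeds only while the directory is still readable by us. */
	if (inotify_add_watch(fd, "/tmp/test",
			      IN_CREATE | IN_MODIFY | IN_ACCESS) < 0) {
		perror("inotify_add_watch");
		return 1;
	}

	/* The watch survives the later chown to root, after which we could
	 * no longer even list the directory. */
	while ((len = read(fd, buf, sizeof(buf))) > 0) {
		ssize_t off = 0;

		while (off < len) {
			struct inotify_event *ev = (void *)(buf + off);

			printf("mask 0x%x name %s\n", ev->mask,
			       ev->len ? ev->name : "");
			off += sizeof(*ev) + ev->len;
		}
	}

	return 0;
}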

Inotify is watching you!

Partial write errors and recovery.

An interesting thread was started on the BTRFS mailing list recently about the features a filesystem should contain to be actively used by some users. Besides that, a good question was raised about how to handle partial write errors.

Consider the case when we have a sequence of writes finished with a barrier call, which in theory should end up as a perfectly performed action. In real life, though, any write in that sequence may fail: the error is returned to the system, which returns it to the user or just marks the page as bad, but any subsequent write succeeds, as does the barrier call, so the filesystem may believe that everything is ok except for the given failed writes. Now, if we have a power loss or disk removal, the system is not in a consistent state, since the succeeded subsequent writes might depend on the failed one (like a directory metadata update depending on the failed file metadata).

That’s the problem, which could be handled by a filesystem which splits major updates and does not allow subsequent writes if a previous one failed. But any filesystem doing batched writes (afaics even the ext* filesystems do not write journal entries one after another, but flush them as a whole) has the described problem, since a failed write may be detected too late.

DST and POHMELFS use a different approach, since the network media they work over is way too unstable in that regard: we have to deal not only with power outages or disk swaps, but also with temporary network outages, which are part of usual life even in high-end networks. Both DST as network block storage and POHMELFS as a network filesystem use a transaction approach, where a number of meaningful operations may be combined into a single entity which will be fully repeated in case of errors. In this case the server will not reply with successful completion if an intermediate write fails, and the given transaction (including previous and subsequent writes, barriers and whatever else) will be resent. In the case of reading from POHMELFS, the retry will be done from a different server (if one exists in the config).
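As a rough sketch only (the names below are made up, this is not the actual DST or POHMELFS code): a batch of operations is submitted as one transaction, the server acknowledges it only when every operation succeeded, and on any error or timeout the whole batch is resent, possibly to another server from the configuration.

#include <stddef.h>
#include <stdint.h>

struct op {
	uint64_t offset;	/* where the data goes */
	uint64_t size;		/* how much of it */
	const void *data;
};

struct transaction {
	struct op *ops;		/* writes, barriers and so on, grouped together */
	size_t nr_ops;
	unsigned int retries;	/* how many times we are willing to resend */
};

/* Assumed helpers: send all ops and wait for a single completion reply. */
int send_ops(int server, const struct op *ops, size_t nr);
int wait_completion(int server);

int submit_transaction(int *servers, size_t nr_servers, struct transaction *t)
{
	for (unsigned int try = 0; try <= t->retries; ++try) {
		int srv = servers[try % nr_servers];

		if (send_ops(srv, t->ops, t->nr_ops))
			continue;	/* network error: try again */
		if (wait_completion(srv) == 0)
			return 0;	/* everything in the batch made it */
		/* partial failure: repeat the whole transaction */
	}

	return -1;
}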

This is not some kind of new feature of DST or POHMELFS; different kinds of transactions exist even in local filesystems. IIRC a journal update can be considered as such in ext4, but not the data write and journal write as a whole, i.e. multiple dependent metadata updates may not be properly guarded by journal transactions, but I may be wrong; BTRFS likely uses transactions for COW updates, i.e. allocation of a new node and the appropriate btree update. But for network filesystems this is an exceptionally useful feature.