POHMELFS

POHMELFS development

Data de-duplication in ZFS and elliptics network (POHMELFS)

Jon Smirl sent me a link describing a new ZFS feature: data deduplication.

This technique stores multiple data objects in the same place when their content is identical, which saves space. There are three levels of data deduplication: files (objects, actually), blocks and bytes. At each level a single entity is stored for multiple identical objects, for example a single block for several equal data blocks, or a single byte range, and so on. ZFS supports block deduplication.

This feature has effectively existed from the beginning in the elliptics network distributed hash table storage, but there it has two levels: object and transaction. Well, actually we have the transaction level only, but the maximum transaction size can be limited to some large enough block (megabytes or more, or unlimited if needed), so if an object is smaller than that, it will be deduplicated automatically.

This basically means that if multiple users write the same content into the storage and use the same ID, no new storage space is used; instead, the transaction log for the selected object is updated to show that two external objects refer to the given transaction.
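
A minimal Python sketch of this idea, not the elliptics implementation: the class name, the in-memory dictionaries and the choice of SHA-1 are all assumptions made for illustration. It shows that a second write of identical content only adds a log entry.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blobs = {}  # transaction ID -> stored data
        self.log = {}    # transaction ID -> names of objects referring to it

    def write(self, name, data):
        # The ID is derived from the content, so equal content gives an equal ID.
        tid = hashlib.sha1(data).hexdigest()
        if tid not in self.blobs:
            self.blobs[tid] = data  # only the first writer consumes storage
        # Every writer, first or not, is recorded in the transaction log.
        self.log.setdefault(tid, []).append(name)
        return tid


store = DedupStore()
first = store.write("alice/report", b"same bytes")
second = store.write("bob/copy", b"same bytes")
```

After both writes the store holds one blob, and its log lists both external objects.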

Depending on the transaction size this may have a negative impact: in particular, when a transaction is smaller than a log entry, it actually wastes space. But transactions are required for the log-structured filesystem and to implement things like snapshots and update history. By default the log entry size is 56 bytes, so it should not be a problem in the common case.

POHMELFS, as an elliptics network frontend, will support this feature out of the box without any additional steps.

POHMELFS transactions

It turned out that my previous idea of using the socket buffer and VFS pages is very wrong, mainly because of the nature of POHMELFS transactions. A transaction must stay in memory until the remote server acknowledges its data.
But what will happen when a second write is about to update the same area? We can not overwrite the data, since then we would lose the previous transaction and there would be no way to resend it and store it elsewhere on timeout or other error. Instead we should allocate a new buffer and copy the data there. But this is not that simple, since we have to update the VFS page cache, and thus evict the previous page first. Also all pages have to be linked somehow, so that when a transaction is committed, the appropriate pages can be freed.

Other filesystems, namely btrfs, wait until writeback is over on the page about to be overwritten, which may or may not be a good idea for the overwrite workload. I actually expect it to be a bad idea, especially for high-latency storage, but it is noticeably simpler to implement. Buffer heads used to track partial page updates are quite heavy and not really needed for my case, so I will implement trivial tags attached to pages, and when an overwrite is about to happen, the system will wait for the pages in question to be flushed to the remote server and then overwrite them in place, creating a new transaction.

The above tags are needed for the usual writeback: we will not really write data at writeback time, instead we will find the transactions which refer to the given page and resend them. In the perfect case, which I expect to happen most of the time, there should be no such stalled transactions at all, since they will be acked soon after write time, when we send the data to the server. But it is still possible that there are no quick acks, so writeback can fire for the inode.
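
The wait-then-overwrite rule can be sketched in userspace Python as follows. This is only an illustration under assumptions: `TaggedPage` and its methods are hypothetical names, the "tag" is modelled as a `threading.Event`, and nothing here reflects real kernel page handling.

```python
import threading


class TaggedPage:
    """Hypothetical page with a single 'flushed' tag."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.flushed = threading.Event()
        self.flushed.set()  # nothing in flight initially

    def begin_transaction(self):
        # Mark the page as referenced by an unacknowledged transaction.
        self.flushed.clear()
        return bytes(self.data)  # snapshot that would be sent to the server

    def ack(self):
        # The remote server acknowledged the data; overwrites may proceed.
        self.flushed.set()

    def overwrite(self, offset, data, timeout=None):
        # Wait until the previous transaction is acked, then modify in place
        # and start a new transaction for the updated content.
        if not self.flushed.wait(timeout):
            raise TimeoutError("previous transaction is still in flight")
        self.data[offset:offset + len(data)] = data
        return self.begin_transaction()
```

A caller that overwrites before `ack()` arrives blocks (or times out), which is the stall the text above expects to be rare.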

That's the plan, now back to the drawing board to actually find out how pages should be attached to transactions... Stay tuned!

POHMELFS, data integrity and versioning

As you might know, the new POHMELFS will be a fully versioning filesystem, since it will work with the transactional log-structured distributed hash table called elliptics network.

Object versioning implies that each update is supposed to be a separate version, thus each write should be a separate transaction. We can do this in two ways: put data into a separate fixed-size storage and sleep when it is full, or allocate new storage for the new data in each write call.

The former is what the network does: there is a limited socket buffer, which we can fill either by copying data into it or by pinning external data pages. Either way it has a limited size, and when the socket buffer is full no new writes are possible, so we will either sleep or return an error.

The other way is a bit different: for a subsequent write into the same area we allocate new storage and copy the data there. Effectively both methods are the same, but in the first one we kind of 'allocate' from a fixed-size area, while in the second one we allocate from the main system memory allocator, where we will block either on reaching some limit or when there is no more free memory.

Then pages or data blocks can be attached to a transaction, which will commit them to disk or to a remote node. I decided to use the first method, so each write will allocate a transaction and the data will be sent to the remote nodes. If the socket buffer is full, the write will block.

This has a fair number of cons, namely the need to copy data twice: from userspace into the page cache and then from the page cache into the socket buffer. It is possible to use sendpage() and friends, but this would force us to have a per-inode write lock to order writes, and even that may not be enough, since the way sendpage() works clearly allows writing into a page that is being transmitted.

Since I decided to send data at write time, the system has to hash the data at the same time to create the transaction ID (which becomes the data checksum). The hashes used to generate IDs (they are called transformation functions, since they 'transform' data into fixed-size IDs) are provided as a mount option (HMAC is not supported for now, only plain hashes).
The Linux crypto API requires a preallocated crypto structure to work with, and its allocation as well as freeing is a rather costly process (there are global locks and potentially very long waits).
So I decided to preallocate and initialize a number of those crypto control structures at superblock allocation time, i.e. during mount option parsing.
A write operation will block waiting for a free crypto worker, which will process the data when ready. It does not use a pool of threads; instead the work is done on behalf of the writing process, thus scaling with the number of writers but limited by the number of crypto control structures allocated at mount time. It will be a remount-configurable option of course, but in the future: there is no remount hook for now :)
This decision has a number of cons too, but its pros look very promising.
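
As a rough userspace analogy (not the kernel crypto API), the preallocated pool could look like the Python sketch below. `HashPool` and `transaction_id` are hypothetical names; `hashlib` objects stand in for the crypto control structures, and a blocking queue stands in for waiting on a free one.

```python
import hashlib
import queue


class HashPool:
    """Fixed pool of hash contexts created once, at 'mount' time."""

    def __init__(self, algorithm, size):
        self._free = queue.Queue()
        for _ in range(size):
            # Expensive setup happens here, not on the write path.
            self._free.put(hashlib.new(algorithm))

    def transaction_id(self, data):
        # Blocks while all contexts are busy; hashing then runs in the
        # caller's own thread, i.e. on behalf of the writing process.
        template = self._free.get()
        try:
            ctx = template.copy()  # keep the pooled template pristine
            ctx.update(data)
            return ctx.hexdigest()
        finally:
            self._free.put(template)


pool = HashPool("sha1", 2)
tid = pool.transaction_id(b"abc")
```

The number of concurrent hashing writers is bounded by the pool size, which mirrors the limit described above.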

Architecture looks interesting, but only practice will draw the conclusion line. So, stay tuned!

Per mount-point inode and transaction caches

happened to be a bad idea. Well, it is not that bad, but it does not save anything: I want transformation functions to be replaceable at remount time, and having to rebuild the inode cache for this is not a good idea.

So I will have a list of destination IDs attached to the inode (and thus subsequently to transactions), and maybe eventually there will be the ability to attach filters, to store some objects with one set of transformation functions (and thus number of copies) and other objects with another one. Like having all '*.txt' files stored with sha1(name) and sha256(name) IDs, while '*.sql' files get only one copy. But I will leave this for the future.

We are getting closer to having write support in the new POHMELFS connected to the elliptics network, but there is still a fair amount of work even for this rather simple task.
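
A small Python sketch of how such name filters might map objects to destination IDs. The rule table, the default and the function name are assumptions for illustration; the text only says that '*.sql' gets one copy, so using sha1 for it is a guess.

```python
import fnmatch
import hashlib

# Hypothetical filter table: glob pattern -> transformation functions.
RULES = [
    ("*.txt", ["sha1", "sha256"]),  # two IDs, hence two copies
    ("*.sql", ["sha1"]),            # one ID, one copy (hash choice assumed)
]
DEFAULT = ["sha1"]  # assumed fallback for names matching no filter


def destination_ids(name):
    """Return the list of IDs under which the named object would be stored."""
    algorithms = DEFAULT
    for pattern, candidate in RULES:
        if fnmatch.fnmatchcase(name, pattern):
            algorithms = candidate
            break
    return [hashlib.new(a, name.encode()).hexdigest() for a in algorithms]
```

Here `destination_ids("notes.txt")` yields two IDs and `destination_ids("dump.sql")` yields one.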

Stay tuned!

POHMELFS transactions and inode caches

I decided to change the common practice of having a single inode cache per filesystem and instead implemented per-mountpoint caches (and memory pools for that matter). This is because a mount point can specify a different size of the inode ID, which is closely tied to the transformation mechanism used for the given mount.

So, for example, one can mount remote storage and specify only one transformation function, while another mount will have multiple functions, thus forcing each inode update to be stored redundantly with multiple IDs. Each inode has its IDs cached in the inode structure, so with a different number of transformation functions the inode size also becomes different. The same applies when various mounts use different transformation functions, which for the kernel POHMELFS client are plain crypto hashes.

Given that the elliptics network topology always allows data to be stored, no matter which ID it has, even when network connections to some servers are broken (the ability to store the given ID range is delegated to the neighbours of the failed nodes), each transaction now has to be acked by all remote nodes before it is considered completed. Thus a network transaction has the same issue with mounts as inodes do, and may have a different size depending on the mount point.

So a transaction actually becomes a list of per-ID metadata objects used to create the network command for the given data. When all per-ID objects are acked by the remote servers, the transaction is completed.
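
The completion rule can be sketched in a few lines of Python. The class and callback names are hypothetical, and per-ID metadata is reduced to the bare ID.

```python
class Transaction:
    """Hypothetical transaction that completes once every per-ID part is acked."""

    def __init__(self, ids, on_complete):
        self.pending = set(ids)  # IDs still waiting for a server ack
        self.on_complete = on_complete
        self.completed = False

    def ack(self, dest_id):
        self.pending.discard(dest_id)
        if not self.pending and not self.completed:
            self.completed = True
            self.on_complete(self)  # fires exactly once


done = []
trans = Transaction(["id-a", "id-b"], done.append)
trans.ack("id-a")  # still waiting for id-b
trans.ack("id-b")  # now completed
```

A duplicate ack after completion is ignored, so a resent reply does not invoke the callback twice.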

Linux Kernel Summit

It somehow dropped off my radar this year. Given that it starts in two weeks and I did not receive an invitation, Tokyo will not see me :) But there should be other Russians there, although they do not bring vodka with them.

For the next year I plan to complete POHMELFS; actually this will be done this year for sure, at least to the state where it can easily talk to the elliptics network distributed hash table. And that is likely all for the storage subsystem.

For April first I plan to implement dmesg tetris, which will show up during boot. I have been thinking about this idea for quite a while already, but never found enough time. Let's see what we will eventually have; in the meantime the new POHMELFS has started to compile successfully. Although it can not do anything useful yet, there is already a low-level base to stand on.

POHMELFS low-level IO machinery

I decided to use a single thread per connection, which maintains the sending and receiving state machines, where each IO unit has a completion callback invoked when all requested data has been processed. In theory this can be extended to support multiple network connections per thread, but I will postpone this for the future, in case practice shows that having a single thread per connection is a major overhead.

When the receiving state machine has nothing to do, it schedules itself to receive a command header, which has a fixed size. This unit's completion searches for the transaction which corresponds to the given ID and invokes its completion, which in turn can schedule new receive processing with its own completion.

The same applies to sending, although data sending has an empty completion callback (except for a debug dump about the finished operation).

Transactions have a fixed-size header with optional (unlimited) data, since there is no longer a variable-sized path in the header. Instead the system uses an ID generated by hash transformations over the provided data: either the local path to the object (from the root, which corresponds to the global filesystem root) or the data content.
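
The receive path described above can be sketched in Python as follows. The header layout (a 64-bit transaction ID followed by a 32-bit payload size) is purely an assumption for illustration and is not the real POHMELFS or elliptics wire format; `Receiver` is a hypothetical name.

```python
import struct

# Assumed fixed-size header: 64-bit transaction ID, 32-bit payload size.
HEADER = struct.Struct("!QI")


class Receiver:
    """Toy receiving state machine driven by incoming byte chunks."""

    def __init__(self):
        self.transactions = {}  # transaction ID -> completion callback
        self.buffer = b""

    def feed(self, chunk):
        self.buffer += chunk
        # Step 1: wait for a complete fixed-size header.
        while len(self.buffer) >= HEADER.size:
            tid, size = HEADER.unpack_from(self.buffer)
            # Step 2: wait for the payload announced by the header.
            if len(self.buffer) < HEADER.size + size:
                break
            payload = self.buffer[HEADER.size:HEADER.size + size]
            self.buffer = self.buffer[HEADER.size + size:]
            # Step 3: find the transaction by ID and invoke its completion.
            completion = self.transactions.pop(tid, None)
            if completion is not None:
                completion(payload)
```

Replies may arrive split across several reads; the state machine simply keeps buffering until a full unit is available.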

Ok, let's move forward

I finally finished the most complex part of the process, which took most of my time over the last couple of weeks, so I feel much better about IT tasks now. With this additional level of freedom I can afford to fully concentrate on the interesting tasks without looking at the clock or location.
And although I still need to find another 3 days for this task, I essentially consider myself returned to the old state.

Oh, well, and thus I quit drinking. At least it will now require special steps to be made. Not that strong a barrier actually, but it's something.

Now, about current tasks. Most of the time is devoted to the POHMELFS rework, of course. I started a new project and decided to rewrite POHMELFS from scratch, since the approach to be used in the new distributed filesystem is completely different from the old-school network filesystem one. Given how the elliptics network (a distributed hash table storage) is organized, it may be challenging to implement proper POSIX compliance, but I decided to solve problems in order of their appearance and for now implement simple bits like object reading/writing and creation/deletion.

So the only common parts are the low-level network IO processing (i.e. socket reading/writing, connection and the like) and filesystem registration. That's what I currently have. I also implemented a new mount option parser. It is now possible to specify all needed parameters without an external configuration application. Admins will be happy, since there will be no need to run mount scripts with some obscure programs; instead a simple /etc/fstab edit will work.

Network IO processing will be rather simple. Since the elliptics network protocol is organized into stacked attributes stored in transactions, POHMELFS will not use plain commands like it could before, so all commands will be wrapped into transactions which can be resent or finished with a timeout error. Thus the receiving path becomes rather simple: receive the header, find the corresponding transaction and invoke its completion callback, which will receive the rest of the data if needed.

I also decided to drop the writeback cache, since MOSI and friends cache-coherency protocols do not scale with the distributed network topology. No node can snoop on a bus and invalidate its local cache on given events; instead the remote side has to maintain a list of nodes which have the given cache 'line' in use, so that dirtying it would invalidate them. With an increasing number of nodes, clients and data objects it becomes unfeasible for each node to handle coherency requests, which grow as the product of the above numbers.
Thus POHMELFS will use an old-school write-through cache like NFS and others do. The system will allocate a transaction at write time, populate it with the data pages and queue it to the IO thread pool. Thus before the transaction is sent, it is possible to rewrite the data or append/truncate if needed. Time will show whether this is a good idea or not; it is always possible to forbid this 'optimization', which I suppose should help with random IO, at least the part which extends/shrinks transactions with some additional data.

Given that it is 1 A.M. in Moscow it's time to sleep, but I plan to start updating the blog frequently with development bits again.
Stay tuned!

POHMELFS and elliptics network design issues

Currently POHMELFS has a coherent writeback cache on each client. This feature requires a special protocol which manages data consistency between clients. Namely, it maintains per-object-per-client data on the server, which is checked each time a new client accesses the data.

There are at least three problems with this:

  • the server needs to store state about each client, which means that a server crash can not be easily handled
  • the number of messages to be sent when a new client accesses some object grows linearly with the number of clients; now imagine a huge server with lots of clients and a large number of data objects
  • a lock request has to be sent, and its ack waited for, for each object that is new or was flushed from the local cache

Plus code complexity: POHMELFS implements a MOSI-like cache coherency protocol, and while modern processors can observe the bus and flush cache lines, a distributed cloud system has to maintain a message protocol for each data access.

Another huge problem is server redundancy, since effectively all state has to be replicated on multiple servers.

To date there are no open network systems with such cache support. Originally it was presented in Oracle CRFS developed by Zach Brown, but currently its development is rather dead.

Experiments with POHMELFS and latency led to the decision to drop this implementation while maintaining high performance for metadata-intensive operations, which especially benefit from it.

The main idea is to continue splitting IO operations from network processing, the same way it was done in the elliptics network, where network IO is handled by a properly implemented state machine on top of a pool of threads. POHMELFS currently supports this only for data receiving. Data writing blocks in cache writeback, which sends data to the servers.

Now I plan to introduce a transaction queue, where each new data IO request will find and concatenate transactions before they are sent by the dedicated threads. Reading up-to-date pages will not force any kind of network IO, while writes and fresh reads will just allocate and queue a transaction. The write then returns, while the read sleeps until the transaction completes.
A subsequent write may append data to the just-created transaction if it was not yet sent.
As before, a transaction will live in the retransmit queue until an ack is received, and will be resent if needed.
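
A toy Python model of such a queue, under stated assumptions: `WriteQueue` and its methods are hypothetical names, a transaction is just an `[offset, data]` pair, and only appends that are exactly contiguous with the most recently queued, unsent transaction are merged.

```python
from collections import deque


class WriteQueue:
    """Hypothetical transaction queue with append-merging and a retransmit list."""

    def __init__(self):
        self.pending = deque()  # not yet sent: [offset, bytearray]
        self.in_flight = []     # sent, kept until acked (for resending)

    def write(self, offset, data):
        if self.pending:
            last = self.pending[-1]
            if last[0] + len(last[1]) == offset:
                last[1].extend(data)  # append into the unsent transaction
                return
        self.pending.append([offset, bytearray(data)])

    def send_next(self):
        # Called by the IO thread: the transaction can no longer be extended.
        trans = self.pending.popleft()
        self.in_flight.append(trans)
        return trans

    def ack(self, trans):
        self.in_flight.remove(trans)


q = WriteQueue()
q.write(0, b"abc")
q.write(3, b"def")   # merged into the first transaction
q.write(100, b"x")   # not contiguous, gets its own transaction
```

Once `send_next()` has handed a transaction to the network, later writes to the adjacent range start a new transaction instead of modifying the one in flight.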

Local cache invalidation will be handled in a way similar to how it currently works in POHMELFS and its server, which I will think about some more. Likely I will use the update notification mechanism built into the elliptics network.

Transaction queue will be processed by the pool of threads, or maybe even by the single thread per network connection, separately from userspace-visible IO.

That's the original plan I have ended up with to date, but it may happen that this is just the start of things breaking. While I'm suffering from climbing aches and apartment development I may think up something additional.
Stay tuned!

POHMELFS update and thoughts on moving POHMELFS and DST outside of the staging tree

Synced the local tree with the changes made in the staging tree of the vanilla kernel.

The patch is rather small, but includes several bugfixes and a command extension made by Pierpaolo Giacomin (yrz_anche.no), which allows dumping and deleting all configured indexes.
It is already included in the staging tree and will be pushed upstream when the merge window opens.

This brings us to the question of whether DST and POHMELFS should be pushed out of the staging tree into the main branch.

DST is a block-level network device. It has a fair number of interesting features like reconnection, large IO support and no need to copy data from userspace, but overall it is still a simple point-to-point network block device. My opinion is that it is not really needed in the modern environment.

POHMELFS is a distributed parallel filesystem. Its current state is closer to parallel NFS than to a real distributed filesystem like Lustre. But I am starting integration with the elliptics network, which is a real distributed network hash table storage, and this will put POHMELFS on a completely new storage level not actually accessible to existing distributed filesystems. Such storages were only made for extremely huge amounts of web 2.0 data, which does not require POSIX or the ability to work with the storage as a convenient filesystem. In contrast, existing distributed filesystems are mostly made for a non-faulty environment, i.e. one where the network does not disappear frequently, where dedicated servers do not break frequently and so on. 'Frequently' is a rather subjective measure: for example, I work with people who deliberately break network connectivity between major parts of their infrastructure once every several days, to be sure that the system continues to work as expected. How do you expect cool-named vendors and bought solutions to work in that environment?

I designed the elliptics network without any assumptions or requirements about the stability of the subsystems. The same should be done for POHMELFS, which basically means that its whole network protocol and existing usage model will be completely changed.

So I'm rather hesitant about pushing both projects out of the staging tree. DST is likely not needed in the vanilla tree, while POHMELFS will change dramatically over the coming days and weeks (but I will probably not complete it before the 2.6.31 kernel release and the subsequent merge window).

It is possible, though, to move POHMELFS into fs/ but add a huge warning during module load, which will scream that POHMELFS will be changed completely in the next kernel version and will not be compatible with the existing use case.

Opinions?

Going production with POHMELFS

Into a small one so far, and a really trivial one: read-only, single server, but with the tasty reconnection things GPFS lacks and an inotify-based local server-side flusher. A similar setup was tested for quite a while before, but without any serious load.

And Friday evening just before going climbing is exactly the best time for a rollout. But we are so optimistically naive that it can not stop us. Stay tuned!

EDITED TO ADD: that it is called pre-production here :)

Git madness

Some people just can not relax and enjoy the situation, when everything works and nothing should be changed. So they want to use some new shiny perfectly stable enterprise quality stuff.

So I started to rebase POHMELFS against the 2.6.30 kernel, which is a really non-trivial task, since 2.6.30 already contains POHMELFS in the staging directory, but its documentation lives in a directory common to my tree and the vanilla one: Documentation/filesystems/pohmelfs.

So effectively every patch which touches documentation will break. And you might not know, but POHMELFS has quite a lot of documentation. So I kept resolving merge conflicts until I got stuck in this situation:

Applying: Special note about endianes and async events.
error: patch failed: Documentation/filesystems/pohmelfs/design_notes.txt:10
error: Documentation/filesystems/pohmelfs/design_notes.txt: patch does not apply
error: patch failed: Documentation/filesystems/pohmelfs/network_protocol.txt:20
error: Documentation/filesystems/pohmelfs/network_protocol.txt: patch does not apply
Using index info to reconstruct a base tree...
Falling back to patching base and 3-way merge...
Auto-merged Documentation/filesystems/pohmelfs/design_notes.txt
CONFLICT (content): Merge conflict in Documentation/filesystems/pohmelfs/design_notes.txt
Auto-merged Documentation/filesystems/pohmelfs/network_protocol.txt
CONFLICT (content): Merge conflict in Documentation/filesystems/pohmelfs/network_protocol.txt
Failed to merge in the changes.
Patch failed at 0069.

When you have resolved this problem run "git rebase --continue".
If you would prefer to skip this patch, instead run "git rebase --skip".
To restore the original branch and stop rebasing run "git rebase --abort".

# edit files

zbr@gavana:~/aWork/git/linux-2.6/linux-2.6.pohmelfs$ git add \
Documentation/filesystems/pohmelfs/design_notes.txt \
Documentation/filesystems/pohmelfs/network_protocol.txt
zbr@gavana:~/aWork/git/linux-2.6/linux-2.6.pohmelfs$ git rebase --continue
Applying: Special note about endianes and async events.
No changes - did you forget to use 'git add'?

When you have resolved this problem run "git rebase --continue".
If you would prefer to skip this patch, instead run "git rebase --skip".
To restore the original branch and stop rebasing run "git rebase --abort".

No matter what, it refuses to go any further other than by skipping the patch.

And while I managed to implement a similar transaction-based distributed hash table in the elliptics network, I failed to fix this problem. So I decided to add another stand-alone POHMELFS git tree which contains only the module itself. It will only work with 2.6.30+ kernels.

Compilation is rather trivial:

make KDIR=~/aWork/git/linux-2.6/linux-2.6.mainline/

where KDIR is the path to the kernel sources.

Now I'm back to testing it with the 2.6.30 kernel; in particular, the Debian Etch backport of the 2.6.30 kernel fails to unload the module from the staging tree :)

EDITED TO ADD: that above standalone module works without problems on the same system. Whoops :)

Linux-mag on POHMELFS: From Russia with love

There is a new file distributed file system in the staging area of the 2.6.30 kernel called POHMELFS. Sporting better performance than classic NFS, it’s definitely worth a look.
...
Evgeniy Polyakov, a long time Linux hacker, has recently contributed a new distributed file system, called POHMELFS (Parallel Optimized Host Message Exchange Layered File System). It has appeared in the “chock-full-of-filesystems” kernel version 2.6.30 in the staging area. It is ready for testing and can give you a boost in performance (remember - it’s parallel!). This article will discuss POHMELFS and where it is headed.

An interesting article about POHMELFS, its state and future, namely elliptics network integration (with details on how it works and what to expect), NFS and its limitations.

Many thanks to Jon Smirl for the link.

POHMELFS in the 2.6.30 kernel

2.6.30 is out and contains both POHMELFS and DST in the staging directory.

While DST is a rather mature project and I do not expect it to change, POHMELFS will evolve into a quite different entity.
To date it is a simple NFS-like filesystem with several features, the most interesting being the local data and metadata writeback cache and the ability to write to multiple servers and balance reads among them (you can find others on the homepage). This is not really a distributed filesystem, although it is a parallel one to some degree.

And reality says that no one really wants to change their existing NFS network to something new, even with higher performance, since the old systems already work and all their hidden pitfalls have been found.

Instead POHMELFS will evolve into the real distributed area with the elliptics network, which will become its distributed hash table data storage. This is the main reason I will not ask to move POHMELFS out of the staging tree for some time (at least not in the upcoming merge window) and will change it there.

The set of features the elliptics network provides is not really comparable with the existing old-school distributed systems; it is just a completely new and very different solution.

There are a number of potentially complex parts, but they are all solvable, so stay tuned for new results!

EDITED TO ADD: Someone created a Wikipedia article with a spartan description :)

A development roadmap experimental poll - you can make the difference

Documentation, automatic tests and 3.0 elliptics release
26% (38 votes)
SMTP/IMAP elliptics storage backend
7% (11 votes)
POHMELFS elliptics frontend
30% (45 votes)
SSL support and more advanced merge config resolve in elliptics network
3% (5 votes)
LISP (and eventually C) regexp and subsequent LR grammatics analyzer
3% (4 votes)
HTML validator (based on above code though)
1% (1 vote)
PAXOS library and distributed locking
18% (27 votes)
Write your own - vote or lose!
1% (2 votes)
More tight electronics projects (w1 sniffer, digital electronics, robotics)
10% (15 votes)
Total votes: 148

POHMELFS server update

Moved the whole build process to the autoconf tools.

One can now build it easily like this:

$ ./autogen.sh
$ ./configure --with-kdir-path=/some/path/linux-2.6.pohmelfs --prefix=/usr/local/pohmelfs
$ make
$ make install

It will properly detect different paths to the needed libraries, will not build the configuration tool if the netlink connector is not accessible (for storage nodes), and so on.

Effectively one only needs to have the fs/pohmelfs/netfs.h header file in the above kernel directory for a successful build, so one can create this directory somewhere, put the header there and build the server, the configuration tool and the remote cache flushing utility.

I also spent a lot of time porting the server to FreeBSD, but then found that 7.0 does not have fstatat() and some other calls, so I dropped this task. Rediscovered, actually: the same problem existed when I ported the elliptics network, but those calls were not really needed there, so elliptics nodes now run successfully on FreeBSD.

Anyway, this will not be needed when I port POHMELFS to elliptics.
Maybe I will even eliminate the userspace POHMELFS server and move the whole client side into the kernel, but so far I plan to change the userspace server first, so that it becomes a part of the elliptics cloud, and drop directory export support.

This requires implementing support for directory content within a file (first, on the userspace server): each directory entry in POHMELFS will become a record in the directory object, stored in the elliptics cloud. Likely I will implement some kind of a lazy tree of direntries; it will be indexed by the object name and will contain the inode information needed by the filesystem.
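
One way to picture a directory stored as an object of name-indexed records is the Python sketch below. It is only an illustration: `DirectoryObject`, the record fields (`mode`, `size`) and the sorted-list index are assumptions, and it does not model the lazy tree or the on-disk format.

```python
import bisect


class DirectoryObject:
    """Hypothetical directory object: records indexed by entry name."""

    def __init__(self):
        self._names = []   # entry names, kept sorted for ordered listing
        self._inodes = {}  # name -> inode information for that entry

    def add(self, name, mode, size):
        if name not in self._inodes:
            bisect.insort(self._names, name)
        self._inodes[name] = {"mode": mode, "size": size}

    def remove(self, name):
        if self._inodes.pop(name, None) is not None:
            self._names.remove(name)

    def lookup(self, name):
        # Returns the cached inode information, or None if there is no entry.
        return self._inodes.get(name)

    def listing(self):
        return list(self._names)


d = DirectoryObject()
d.add("b.txt", 0o644, 10)
d.add("a.txt", 0o600, 5)
```

A lookup then needs only the directory object itself, without a separate request per entry.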

That's the plan for the next days, stay tuned!

Slowly moving POHMELFS into production

Spent the day compiling, installing and doing some testing of POHMELFS in a pseudo-production environment. So far it is a read-only setup (just on a single node) which gets data from multiple nodes using the POHMELFS IO priority and read balancing features instead of IBM's GPFS. And even this simple case revealed a couple of bugs, which could make an object lookup return an error even though the object actually exists.

Now everything is fixed in the git tree. I will sync it with Greg's staging queue, which contains a number of cleanups, and send patches soon.

As of writing, I have made the same mistake all of the master-master replication projects make: I did not think in advance. So effectively it is not a good idea to use POHMELFS for the multiple-servers and multiple-clients case, since it may end up with a wrong locking order for different clients, so they will update different replicas in a different order.

To fix this issue I will implement the PAXOS algorithm in a separate library. Likely there are public projects for this already, but to fully get an idea I always implement it myself. This reveals the tricky places and allows me to dig deeply into the design and understand the final result.

The basic idea is rather simple, but the devil is in the details (and optimizations).

POHMELFS IO permissions and priority features

Let's see in detail how to implement read balancing between multiple nodes, create IO groups and write-only backup solutions, and manage node IO priorities.

As described yesterday, IO priority is a feature which manages the read order and IO importance of the appropriate node. By 'read' I mean all operations which do not modify the content of the remote object, like reading itself, directory listing, (extended) attribute fetching and so on.

So, assigning a network state a higher IO priority means we prefer to fetch data from the given server. Usually, having only a single server to read data from will end up with a huge load on that machine, while the others slack. So we can add multiple servers with the same IO priority, and the system will try to balance requests between the servers with the same highest priority. POHMELFS uses a round-robin algorithm among the machines with the selected highest IO priority. If there is only one node with the highest priority, it will be used for all read requests.

Adding and modifying IO permissions is a rather simple task:

# cfg -A add -a devfs1 -p 1026 -i 0 -P 250 -I 3
# cfg -A add -a devfs2 -p 1026 -i 0 -P 250 -I 3
# cfg -A add -a devfs3 -p 1026 -i 0 -P 250 -I 3
...
# cfg -A modify -a devfs1 -p 1026 -i 0 -P 500 -I 3
# cfg -A modify -a devfs2 -p 1026 -i 0 -P 500 -I 3

The above commands first add 3 nodes with devfs{1,2,3} addresses and priority 250 (-P option), and then raise the priority of devfs1 and devfs2 to 500.
The -I switch provides the IO permission mask (1 - read, 2 - write, can be ORed).

So, after the above steps are completed, we have two IO 'groups': the first one, with an IO priority of 500, contains the devfs1 and devfs2 servers, and the second one, with an IO priority of 250, contains the devfs3 machine.

In this case every read request will be sent to either the devfs1 or the devfs2 machine. We can monitor the state of the connections via the /proc/$PID/mountstats file (addresses were replaced with names for easier reading).

# cat /proc/1/mountstats
...
device none mounted on /mnt with fstype pohmel 
idx addr(:port) socket_type protocol active priority permissions
0 devfs1:1026 1 6 1 500 3
0 devfs2:1026 1 6 0 500 3
0 devfs3:1026 1 6 1 250 3

The 'active' column (the one right before the priority, 500 or 250) shows whether the given connection is active (1) or broken (0). When a connection breaks, we can count the active servers with the highest priority, and if that number drops below some threshold, we can lower their priority and thus 'move' the group of machines with the second highest priority to the first place and start reading data from there.
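A monitoring script for this could look roughly like the sketch below. It parses the seven-column state lines in the format shown above and reports whether the top priority group has too few active servers. The function names and the min_active threshold are hypothetical, and the actual demotion (a cfg -A modify call) is left out.

```python
SAMPLE = """\
device none mounted on /mnt with fstype pohmel
idx addr(:port) socket_type protocol active priority permissions
0 devfs1:1026 1 6 1 500 3
0 devfs2:1026 1 6 0 500 3
0 devfs3:1026 1 6 1 250 3
"""


def parse_states(text):
    """Extract connection states from mountstats-like text."""
    states = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and any line that is not a seven-column state row.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        idx, addr, _sock_type, _proto, active, prio, perm = fields
        states.append({
            "idx": int(idx),
            "addr": addr,
            "active": active == "1",
            "priority": int(prio),
            "permissions": int(perm),
        })
    return states


def top_group_status(states, min_active):
    """Return (needs_demotion, addresses of the top priority group)."""
    top = max(s["priority"] for s in states)
    group = [s for s in states if s["priority"] == top]
    alive = sum(1 for s in group if s["active"])
    return alive < min_active, [s["addr"] for s in group]


states = parse_states(SAMPLE)
needs_demotion, group = top_group_status(states, min_active=2)
```

With the sample above only one of the two priority 500 servers is active, so a threshold of 2 flags the group for demotion.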

Let's look at IO permission masks.
Only two bits are used, read and write, so it is possible to create read-write, read-only and write-only connections. While the first two are obvious, write-only may look somewhat questionable, but it is quite useful as a backup solution: the given node is never used for read requests, which minimizes its load.
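As a small illustration, the mask values from the -I option (1 for read, 2 for write) map to connection modes as follows. The constant and function names are made up for this sketch.

```python
PERM_READ = 1   # value documented for the -I switch
PERM_WRITE = 2  # value documented for the -I switch


def describe(mask):
    """Name the connection mode for an IO permission mask."""
    readable = bool(mask & PERM_READ)
    writable = bool(mask & PERM_WRITE)
    if readable and writable:
        return "read-write"
    if readable:
        return "read-only"
    if writable:
        return "write-only"
    return "no access"


modes = {mask: describe(mask) for mask in (1, 2, 3)}
```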

By default the configuration utility uses priority 0 and the read-write mask.

Writes are always sent to all nodes which have the write permission bit set, and priority here only determines the order in which packets are sent to the nodes (servers may of course receive data in a different order if multiple network paths are used). A write transaction is completed only when all nodes have acked the given request, or when it has finished with an error after the timeout and the configured number of resends.
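The write path can be approximated by the following sketch. It is a sequential simplification (the real client is asynchronous), and several details are assumptions: that higher priority nodes are sent to first, the max_resends default, and the send callback which returns True on an ack.

```python
PERM_WRITE = 2  # write bit of the IO permission mask


def write_transaction(nodes, send, max_resends=2):
    """Send to every writable node; succeed only if all of them ack."""
    targets = sorted(
        (n for n in nodes if n["permissions"] & PERM_WRITE),
        key=lambda n: n["priority"],
        reverse=True,  # assumption: higher priority nodes are sent to first
    )
    for node in targets:
        for _attempt in range(1 + max_resends):
            if send(node["addr"]):
                break  # acked, move on to the next node
        else:
            return False  # no ack after all resends: the transaction fails
    return True


nodes = [
    {"addr": "devfs1", "priority": 500, "permissions": 3},
    {"addr": "devfs2", "priority": 500, "permissions": 1},  # read-only, skipped
    {"addr": "devfs3", "priority": 250, "permissions": 2},  # write-only backup
]

sent = []


def fake_send(addr):
    sent.append(addr)
    return True


ok = write_transaction(nodes, fake_send)
```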

POHMELFS and real-time statistics

Now it is possible to monitor attached servers and their status, and to watch connection details and each node's IO priority. This is shown in /proc/$PID/mountstats.

IO priority is an interesting feature which allows nodes to be prioritized according to their IO capabilities. There is also a corresponding IO mask. Reads are first tried from the node with the highest IO priority, while writes are sent to all nodes which are connected in read-write or write-only mode.
Watching the connection status makes it possible to implement server groups with the same IO priority, and then switch them to a lower or higher priority when the number of active (connected) servers drops below or rises above some limit.

It is possible to modify IO priorities and the permission mask in real time, but all requests already issued against the selected servers will be completed unchanged.

Meanwhile the elliptics network got slightly extended documentation and multiple completion callback invocations (a read reply is now split into a number of packets instead of being sent as a single huge frame). Transformation functions were converted from the single transform() to the init/update/final format, where update() can be invoked multiple times between init() and final().
The main user-visible header, interface.h, now contains extended API documentation for all exported helpers.
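The difference between a single transform() and the init/update/final split is the same as between one-shot and streaming hashing. The sketch below uses Python's hashlib as a stand-in; the real elliptics signatures are not shown in this post, so the function names here are only illustrative.

```python
import hashlib


def transform_oneshot(data):
    """Old style: the whole buffer must be available at once."""
    return hashlib.sha256(data).digest()


def transform_streaming(chunks):
    """New style: init, then any number of updates, then final."""
    ctx = hashlib.sha256()   # init
    for chunk in chunks:
        ctx.update(chunk)    # update, once per received packet
    return ctx.digest()      # final


packets = [b"reply ", b"split into ", b"several packets"]
same = transform_streaming(packets) == transform_oneshot(b"".join(packets))
```

The streaming form fits replies which arrive as several packets, since no packet has to be buffered until the whole object is received.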

POHMELFS directory reading

POHMELFS directory reading is implemented similarly to NFS readdirplus: the client only asks the server to send it a list of the objects, and the received information is enough to create the whole inode, i.e. it contains all the metadata needed for this operation.

Now we can do more: the system is able to populate the directory entry cache with new objects directly while receiving readdir responses, and thus eliminates a subsequent ->lookup() call.
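The effect can be shown with a toy userspace model (this is not the kernel code, and FakeServer and Client are hypothetical names): a readdirplus-style listing fills the client cache, so a later lookup of a listed name never reaches the server.

```python
class FakeServer:
    """Stand-in for the remote server; counts lookup round trips."""

    def __init__(self, tree):
        self.tree = tree
        self.lookups = 0

    def readdirplus(self, directory):
        # Names together with full metadata, as in NFS readdirplus.
        return dict(self.tree[directory])

    def lookup(self, directory, name):
        self.lookups += 1
        return self.tree[directory][name]


class Client:
    def __init__(self, server):
        self.server = server
        self.dcache = {}

    def readdir(self, directory):
        entries = self.server.readdirplus(directory)
        for name, meta in entries.items():
            # Populate the cache directly from the listing.
            self.dcache[(directory, name)] = meta
        return sorted(entries)

    def lookup(self, directory, name):
        key = (directory, name)
        if key not in self.dcache:
            self.dcache[key] = self.server.lookup(directory, name)
        return self.dcache[key]


server = FakeServer({"/": {"a": {"size": 1}, "b": {"size": 2}}})
client = Client(server)
names = client.readdir("/")
meta = client.lookup("/", "a")
```

A lookup without a prior listing still goes to the server, which matches the cold cache case where the round trip cannot be avoided.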

Another resolved must-have issue concerns the dentry cache and the umount process. Generally umount itself does not involve anything related to filesystem syncing or superblock processing; it just unregisters the filesystem from the mount cache and drops the reference. It may happen (and actually happens most of the time, when no one else has grabbed a mount point reference) that this reference is the last one, so the mount point has to be destroyed and the superblock freed, and dirty data has to be flushed to disk prior to this.

It is a tricky place. There is an issue related to read-only mounts, for example: some filesystems (like ext*) will actually write journal updates and sync the corresponding block device. This is not an issue for POHMELFS though. What is more important here is the order of the freeing operations.

Eventually superblock freeing ends up in generic_shutdown_super(), which performs all the fs-independent work during shutdown. For a filesystem with a root directory, its first operation is to shrink the related dentry cache, and only then does it flush inodes to disk. Thus dirty inodes are not able to access their dentry aliases from the writeback path, and POHMELFS is not able to save the data. One solution is to invoke syncing earlier, but how?

I tested multiple techniques of abusing the shutdown path, including the ->sync_fs() callback, which is invoked too late and thus without the dentry cache entries in place, and ->umount_begin(), which is also used by NFS and CIFS to perform some tasks, but... it is only invoked for forced umounts and never for the usual ones. ->write_super() is also invoked from the 'wrong' place.

Eventually I ended up replacing the generic kill_anon_super() helper, which served as the POHMELFS ->kill_sb() callback, with my own function. ->kill_sb() is invoked just in time, before all of the above shutdown processing starts (actually it is started from there), so the new function syncs the cache and then calls kill_anon_super() directly.

Now the world is safe again - no data will be lost because of the shrunk dentry cache during the superblock shutdown.

As for other features, I fixed a number of minor bugs in the code and cleaned up the debugging prints. I also improved the remount process: it is now possible to change not only the read-only/read-write state but also all the timeouts and other non-critical options. The number of crypto threads and the maximum IO size are still fixed and are specified at mount time. I will work on this later.
What is missing now is statistics (I will use sysfs to collect data about each mount point and related data like the servers it is connected to, connection statuses and so on). I will also extend the read selection algorithm by assigning priorities to all the servers the client is connected to. Then I will start thinking deeply about the locking problem described yesterday.

I also want to update the elliptics network documentation, at least to provide howto examples, since without them it is not obvious how to set up and start playing with this distributed hash table storage.

Those are fairly short-term tasks. For the somewhat longer term I have another very interesting problem to work on, which is related to compiler theory.

Stay tuned, there will be something to look at for sure...

stat() performance and POHMELFS cold cache

I've found a very interesting feature of the Linux stat(2) call (although I expect it behaves the same everywhere), and its magnitude was somewhat surprising to me. Stat takes an enormous amount of time on every filesystem including /dev/shm: about 2 milliseconds on both /dev/shm and XFS for a file which is fully populated into the cache.
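A measurement of this kind can be repeated with a short script. This is only a sketch: the helper name and the call count are arbitrary, it times Python's os.stat() wrapper (which adds interpreter overhead on top of the syscall), and the result depends entirely on the machine and filesystem.

```python
import os
import tempfile
import time


def average_stat_seconds(path, calls=10000):
    """Average wall-clock time of one os.stat() call on a cached file."""
    start = time.perf_counter()
    for _ in range(calls):
        os.stat(path)
    return (time.perf_counter() - start) / calls


with tempfile.NamedTemporaryFile() as tmp:
    per_call = average_stat_seconds(tmp.name, calls=1000)

print(f"stat: {per_call * 1e6:.2f} microseconds per call")
```

Pointing the path at a file under /dev/shm or on the filesystem of interest gives numbers comparable between filesystems on the same machine.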

This perfectly explains why the POHMELFS cold cache find /mnt benchmark takes so long to complete. The POHMELFS server stats every object it accesses, so for the Linux kernel tree it adds 30+ seconds of server-side stat overhead.

Moreover, if we run the same cold cache benchmark with an additional stat, i.e. find /mnt/linux-2.6.28 -exec stat {} \;, it shows the same 30+ seconds of overhead for NFS (and about 1m8sec for POHMELFS, which includes stat calls on both the server and the client). The hot cache has the same issue, namely 30+ seconds for both NFS and POHMELFS, which looks to be related to the buffer copy between the kernel and userspace, so I do not consider this a POHMELFS bug :)

What's worse: not knowing or knowing that things are bad?

Sometimes this is a rather complex question for me. Of course, not knowing something allows one to make assumptions, which are usually very narrow and shiny. This heats up the feelings and things look even better. Until suddenly reality hits the head with its hammer. And being in a neutral position seems to provide an easier recovery path than believing that everything is really good.

So I prefer to know everything, even when it is bad. In the long run this allows better decisions to be made.

After my POHMELFS presentation I rethought some of the questions and answers and found that in at least one major case I showed how things are supposed to be and not how they are implemented (or actually not implemented) in real life. I'm talking about distributed locks with multiple servers.

In the classical case (I'm not sure here, but would not be surprised if OCFS/GFS and similar systems fit) there is a single piece of lock management code which is accessed by all clients during their IO operations. It can be replicated to one or more systems, so that when it crashes clients can still work with the other instances. IIRC a similar scheme is used in Lustre and in database replication.

POHMELFS does not have this replication; it operates well with a single lock management instance only. I just stupidly forgot about this side of the problem when I implemented multiple server support, so right now, when multiple clients are connected to multiple servers, the locking protocol will be messed up and it is unlikely that things will work flawlessly, so this setup can not be recommended for use.

That was the bad news, and there is no good news today. Back to the drawing board...

POHMELFS cold cache bench

Just in case you believed that it is extremely great:

$ time (find /mnt/testdir | wc -l )
3963
 
POHMELFS:   1m1.493s
     NFS:   0m2.521s

You have been warned :)

I will work on this issue. It happens because POHMELFS follows the VFS call to look up the object on the server, even though it already has enough information from the initial listing. Thus in some cold cache cases this lookup can be avoided, while it is still needed when we access a new object without a prior directory listing.

POHMELFS, distributed facilities, the elliptics network and where is all my porn

[1] What's the architecture of the POHMELFS, would you give me the figure of structure.
[2] Is there a MDS? How to get the Metadata?

POHMELFS as it exists today (and it likely will not be heavily extended in the future) is a network filesystem. Effectively it can be considered as NFS with lots of additional features which do not really affect the overall view of the filesystem: it is a client connected to multiple hosts, so that each update IO command is stored on all servers while reading is balanced.

POHMELFS does not really know how data is stored on the server: it can be a database or a usual filesystem (as in today's server), or it can be a distributed hash table storage (the elliptics network, which is under development). The connection protocol is open and documented in the POHMELFS sources.

Here we come to the second part of the system - data storage.
Right now POHMELFS connects to a simple server (called fserver), which stores data locally on the provided filesystem. It runs as a simple userspace application which works with a number of commonly known syscalls like open(), read() and write().
It is simple enough not to incorporate any kind of distributed facilities.

That is the place where the elliptics network enters the scene.
This storage is being implemented as a distributed hash table over a cloud of nodes connected over the local network or the internet. Each node in the cloud has exactly the same role as the others, so there is no metadata server or any other dedicated node with a special purpose. All nodes store data and operate as forwarding proxies for requests. Node IDs form a circle (or finite field in math terms) and each stored object has an ID within that ring. The node join/leave protocol takes care of data integrity.

IDs are uniform and their underlying structure is not known to the system. An ID can be a hashed absolute path, a database index, a hashed URL or anything else indexed by 128 bits (this will likely be extended to 512 bits). So effectively the elliptics network becomes a distributed object filesystem, where objects are indexed by the above IDs.

When this system is ready for broader usage (I expect to release the first version quite soon though), I will start porting the POHMELFS server (as mentioned above, it is called fserver) to use this library, so that each POSIX write becomes an appropriate transaction within the elliptics network, stored on some node. A POSIX read will thus involve searching for the appropriate node within the elliptics network, reading data from that node and sending it back to the POHMELFS client, which will thus be able to access all the data stored in the cloud via the single mounted server.
It is possible to connect to the elliptics network via the protocol exported by the API of the corresponding library; POHMELFS will be a POSIX client for this storage.
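To make the ring idea concrete, here is a minimal sketch of mapping a path to a 128-bit ID and finding the node which would store it. Both the hash (SHA-256 truncated to 128 bits) and the ownership rule (the node with the largest ID not exceeding the object ID, wrapping around the ring) are assumptions made for illustration; this post does not specify either.

```python
import bisect
import hashlib

ID_BYTES = 16  # 128-bit IDs


def object_id(key):
    """Hash an arbitrary key (path, URL, ...) into a 128-bit integer ID."""
    return int.from_bytes(hashlib.sha256(key).digest()[:ID_BYTES], "big")


def responsible_node(sorted_node_ids, oid):
    """Pick the owner of oid among node IDs sorted in ascending order.

    Assumed rule: the node with the largest ID <= oid. If oid is below
    every node ID, index -1 wraps around to the last node on the ring.
    """
    i = bisect.bisect_right(sorted_node_ids, oid) - 1
    return sorted_node_ids[i]


node_ids = sorted(object_id(name) for name in (b"node-a", b"node-b", b"node-c"))
oid = object_id(b"/home/user/file.txt")
owner = responsible_node(node_ids, oid)
```

Since every node can compute the same mapping, any of them can forward a request towards the owner without consulting a metadata server.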

[3] How to use the POHMELFS?

Right now it is quite simple: one has to create a configuration group with the specified index and put some data into it (address(es) of the server(s), crypto keys and other information if needed), then mount the remote system with the specified index (i.e. connect to the specified servers, autonegotiate the selected crypto algorithms and so on). Effectively the same will be done when the actual server POHMELFS connects to becomes a node in the distributed cloud, except that its IO transactions will not be stored locally on disk but sent to the appropriate nodes in the network.

Hope this clarifies things a bit.

POHMELFS is ...

Looks great, I've now added this to the -staging tree and it will show up in the next -next releases and is queued up to go to Linus for 2.6.30

as Greg KH wrote.

Thanks a lot!

Staging tree now contains about 12k lines of my code :)

The release time!

Yes, POHMELFS got a new release.

This version brings us following changes:

  • drop own path cache in favour of dcache
  • bug fixes
  • ask for drivers/staging merge
  • probably something else

That's it. POHMELFS is ready. So far I do not have any entries in its TODO list except bug fixes (it is possible to optimize object removal though :), and I will fully concentrate on the elliptics network. When it is ready I will use its library in the POHMELFS server.
It is possible that I will add some features though, if there are such requests.

Stay tuned!

POHMELFS metadata benchmarks: tar and large dbench

POHMELFS: userspace server and the latest (pushed into git) kernel client.
NFS: in-kernel async server and client (2.6.28 tree).
Both client and server have 8 GB of RAM and 32-bit CPUs.

The first bench: untarring the linux kernel sources (smaller is better):

POHMELFS:   5 seconds
     NFS:  44 seconds

Sync called after the untar takes less than a second, so it does not affect the test.

The second one: dbench with 30 threads. Servers exported /dev/shm to the clients.

The updated POHMELFS version fixes a fair number of bugs, both from its move from the own path cache to the dentry cache and from old cache coherency issues. There is a nasty issue in the code: object removal. It is not optimized at all, and, hmm, it does not work for directories :) This happens because a path lookup for an unhashed dentry in the dcache returns not only the path to the dentry, but also appends a (deleted) string to it, so obviously an object with such a pathname can not be removed.
I will work on this before pushing upstream (scheduled for tomorrow).

Stay tuned!

POHMELFS vs NFS: dbench and the power of the local metadata cache

POHMELFS vs NFS dbench performance

Client and server machines have 8 GB of RAM and I ran a single-threaded test to show the power of the local metadata cache, so take this data with a fair grain of salt.

POHMELFS served close to 160k operations over the network, while async in-kernel NFS ended up with this dump:

   1    643930    23.31 MB/sec  execute 149 sec   
   1    649695    23.37 MB/sec  cleanup 150 sec   
/bin/rm: cannot remove directory `/mnt//clients/client0/~dmtmp/ACCESS': Directory not empty
/bin/rm: cannot remove directory `/mnt//clients/client0/~dmtmp/SEED': Directory not empty
   1    649695    23.36 MB/sec  cleanup 150 sec   

More precise performance values:

POHMELFS   652.481 MB/sec
NFS         23.366 MB/sec

I've pushed a POHMELFS update to git, which includes the long-awaited switch from the own path cache to the system's dcache (kudos to whoever exported dcache_lock to modules :)

It will likely be pushed into drivers/staging in a day or so.

New POHMELFS vs NFS vs DST benchmarks

Here we go, the latest POHMELFS against drivers/staging DST and in-kernel async NFS.

Server hardware: 4-way Xeon (2 physical CPU + 2 HT CPU) server with 1 Gb of RAM (actually it has 8, but high-mem was disabled), scsi disk, default xfs 300 gb partition.
Client hardware: 4-way Core2 Xeon with 4 Gb of RAM (again no appropriate high-mem option).
Gigabit ethernet, in-kernel async NFS. 2.6.29-rc1 kernel.

Iozone tests for POHMELFS, NFS and XFS.

Graphs: Read, Reread, Write, Rewrite, Random read, Random write.

As we can see, read and write performance is way ahead of NFS, but random read is noticeably slower.

Bonnie++ benchmark for POHMELFS, NFS and DST.

Bonnie IO data

Bonnie was not able to calculate the object creation/removal time for POHMELFS, since with the local writeback data cache this is very fast compared to the write-through NFS case.

So, POHMELFS operates fast, even in its basic network filesystem mode. The exception is random read performance, which is not something we can be proud of :)
But I will work on this, and will likely start with read-ahead games on the server.

In contrast, dbench will not run very well on POHMELFS currently, since its rename operation is synchronous and rather slow (it forces an inode sync to the server). After I switched to the system's dcache, there are still untested areas which I am working on, so it is not yet pushed to drivers/staging, but it will be there quite soon.

POHMELFS and DST changes

While preparing POHMELFS for drivers/staging I decided to run more benchmarks, and suddenly found a nasty bug in the way the filesystem processes object names.

Effectively POHMELFS uses the path name as an ID, so it does not suffer from the NFS -ESTALE problem, when the object related to the received ID was (re)moved. It also greatly helps the writeback cache implementation, since we do not need to sync name/ID pairs with the server, which operates with names only.
For this purpose I implemented a simple trie-like name cache in POHMELFS, which hosted names in descending order starting from the root. There was a bug in the rename part of the algorithm, but while looking at the implementation I thought: what the hell, Linux already has a very scalable dentry cache, why do I need to naively reinvent it here?
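For readers unfamiliar with the idea, a trie-like name cache can be sketched as below. This is not the removed POHMELFS code, only an illustration with hypothetical names: every node stores one path component and a link to its parent, so the full path is rebuilt by walking up to the root, and a rename only has to relink one node.

```python
class NameNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}

    def path(self):
        """Rebuild the full path by walking up to the root."""
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


class NameCache:
    def __init__(self):
        self.root = NameNode("")

    @staticmethod
    def _parts(path):
        return [p for p in path.split("/") if p]

    def insert(self, path):
        node = self.root
        for part in self._parts(path):
            if part not in node.children:
                node.children[part] = NameNode(part, node)
            node = node.children[part]
        return node

    def find(self, path):
        node = self.root
        for part in self._parts(path):
            node = node.children[part]  # raises KeyError if not cached
        return node

    def rename(self, old, new):
        """Move a node (and implicitly its whole subtree) to a new path."""
        node = self.find(old)
        del node.parent.children[node.name]
        *dir_parts, new_name = self._parts(new)
        parent = self.insert("/".join(dir_parts))
        node.name, node.parent = new_name, parent
        parent.children[new_name] = node


cache = NameCache()
leaf = cache.insert("/a/b/c")
cache.rename("/a/b", "/x")
new_path = leaf.path()
```

Renaming a directory changes the paths of all objects below it without touching them, which is the property that makes path names usable as IDs.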

So, I dropped my own implementation and started to use the dcache. It is not quite optimal though: for example there is no helper to determine the path length, so the buffer has to be preallocated long enough. So far I use a hardcoded length of 256 bytes. I agree that this is not a particularly good part of the code, but that's what I'm playing with right now. And you know, the results are quite interesting (besides the fact that I fixed a nasty bug which was triggered by dbench in particular), but POHMELFS still has slower random read performance compared to NFS (for some patterns NFS random read performance is higher than sequential read, so I think there are some tricky cheats besides request compounds).

In the meantime I rebased DST onto the latest git tree, where it is now possible to allocate a bio (block IO request) with several embedded bio_vecs (pages) and with the ability to prepend some data without an additional allocation, as suggested by Jens Axboe. This works by creating a memory pool of objects large enough to contain the bio itself and space for the requested object. Now one additional allocation in the DST export node is eliminated. But I can not test this, since the machines are occupied by POHMELFS testing, so DST is not in drivers/staging yet.

Discussion revealed interesting moments, like:

Hm, then why can't this whole thing just go into fs/dst/ right now? It's self-contained, so there shouldn't be any special "must live in staging" rule for filesystems before adding them.

Well... the case for merging drivers is usually pretty simple - the hardware exists, so we need the driver.
Whereas the "do we need this" case for new filesystems isn't this simple.

FWIW we definitely want pohmelfs in staging...

So, let's see how things will flow in a few moments...
