Tag Archives: Filesystems

Filesystems development

POSIX filesystem interface for Elliptics distributed storage

We in Reverbrain create storages. Starting from single-node search engines to multi-petabyte clouds spawning hundreds of servers. All of them are accessed mostly via HTTP API and sometimes using programming interfaces for Python, Go and C/C++

But there is a huge number of clients who do not need complexity of the HTTP API, instead they want to have locally connected storage directory with virtually unlimited free space. They will store and load data using existing applications which operate with files. It is very convenient for user to mount remote storage into local directory and run for example backup application which will copy local files into this directory and this will end up in several physically distributed copies spread around the world. Without any additional steps from the user.

The first incarnation of the POSIX filesystem interface for some simpler distributed storage we created was a network block device which contained a connection layer and depending on the block offset it selected a server to work with. That was a bit ugly way to work, since block device doesn’t know which object is being stored or read, and what locking should be performed. Locking was always either too coarse or too fine, it ended up performing a lot of operations for simple block transfer, it became obvious that locking has to be performed on the higher layer, namely in the filesystem. This distributed block device was named DST and it lived in linux kernel for couple of years.

The second approach was to implement a filesystem. We created POHMELFS – Parallel Optimized Host Message Exchange Layered FileSystem. Its first commits were imported at the very beginning of January 2008. Actually POHMELFS was not very active and it clearly became visible that existing Linux VM stack doesn’t really scale to billions of objects, we do not have enough resources to change that – that’s a huge task both from technical and political sides. We implemented several features which were then found in Parallel NFS and Ceph. POHMELFS lived in linux kernel for several years.

We removed both project from the linux kernel back then and concentrated on Elliptics distributed storage. And now its time to resurrect POSIX filesystem interface.

But we decided to move another way. Native filesystem interface is fast, but you have to implement it for every OS, this requires a lot of resources which will be wasted supporting different versions for different OSes. Do you know inode allocation differences between Windows 8 and 10?

We found that our clients do not need this, instead they want network attached directory which works pretty well using WebDAV protocol. Well, not exactly, since Windows clients do not support authenticated webdav, and some applications like NetDrive has to be installed, but it happened to be almost standard application for NAS/SAN surprisingly.

We implemented WebDAV server which supports HTTP authentication and connects to Elliptics storage. There are limitations both in WebDAV protocol and in our server, in particular we do not allow locking to be transferred among servers, i.e. if client connected to storage via one gateway and the reconnected using the other, interlocking will not see each other. But that should not be a problem, since webdav prohibits parallel update of any object.

We can create many private folders for every user, it is even possible to add features on top of user files like indexing for search, version control and so on, but that’s a different story.

The main advantage is that this distributed storage is cheap per gigabyte. You can add many commodity servers into Elliptics cluster this will just increase the size of the storage without interruption – system scales linearly to ~4Tb/day writes in our setups and 200+Tb/day in Yandex for example. You can also put replicas of your data into different datacenters – this is inherent feature of Elliptics, and if connection to one datacenter drops down, client will just work with the other replicas.

And all those features are now accessible via usual filesystem interface. It is possible to access data via HTTP or other APIs though.

If you are interested, feel free to contact us info@reverbrain.com

Elliptics: server-side scripting

I added binary data support into server-side scripting (currently we only support Python) as well as extended HTTP fastcgi proxy to support server-side script execution.

Previously we only were able to use C/C++/Python to force some server to execute script on our data – API is rather simple and can be found in examples in binding dirs.
Now we can do that through HTTP.

Here is an example of how to setup python server-side scripting and run scripts on posted through HTTP data.
First of all, you have to put python.init script into ‘history’ directory (specified in server config).
It may look like this:

import sys
from libelliptics_python import *

log = elliptics_log_file('/tmp/data/history.2/python.log', 40)
n = elliptics_node_python(log)
n.add_remote('elisto19f.dev', 1030)
__return_data = 'unused'

This creates global context, which is copied for every input request.
Second, you may add some scripts into the same dir, let’s add test script named test_script.py:

__return_data = "its nothing I can do for you without cookies: "

if len(__cookie_string) > 0:
       # here we can add arbitrary complex cookie check, which can go into elliptics to get some data for example
       # aforementioned global context ('n' in above python.init script) is available here
       # so you can do something like data = n.read_data(...)
        __return_data = "cookie: " + __cookie_string + ": "

__return_data = __return_data + __input_binary_data_tuple[0].decode('utf-8')

HTTP proxy places cookies specified in its config into variable __cookie_string, and it is accessible from your script. Anything you put into __return_data is returned to the caller (and eventually to the client who started request).

__input_binary_data_tuple[0] is a global tuple which contains binary data from your request.
When using high-level API you may find, that we send 3 parameters to the server:

  • script name to invoke (can be NULL or empty)
  • script content – it is code which is invoked on the server before named script (if present), its context is available for called script
  • binary data

That binary data is placed into global tuple without modification. In Python 2.6 it holds byte array, I do not yet know what it will be in older versions though (basically, we limited support to python 2.6+ for now).

HTTP proxy puts your POST request data into aforementioned binary data. It also puts simple string like __cookie_string = 'here is your cookie' into ‘script content’ part of the request.
One can put there whatever data you like if using C/C++/Python API of course.

Here is example HTTP request and reply:

$ wget -q -O- -S --post-data="this is some data in POST request" "http://elisto19f.dev:9000/elliptics?exec&name=test_script.py&id=0123456789"
  HTTP/1.0 200 OK
  Content-Length: 79
  Connection: keep-alive
  Date: Wed, 02 Nov 2011 23:24:10 GMT
  Server: lighttpd/1.4.26
its nothing I can do for you without cookies: this is some data in POST request

As you see, we do not have cookies, so our script just concatenated some string with binary data and returned that data to the client. Name used in POST parameters is actually a name of the server-side script to invoke. ID is used to select server to run this code, otherwise it will run on every server in every group.

It is now possible to run our performance testing tools against server-side scripting implementation (it is not straightforward – there is separate SRW library which implements pool of processes each of which has own initialized global context, which is copied for every incoming request and so on

We believe that numbers will be good out of the box, and of course I expect that performance will suffer compared to plain data read or write, but it should not be that slow – we still expect that every request completes within milliseconds even when it uses python to work with data.
Our microbenchmark (every command executed on server is written into log with time it took to complete) shows that timings are essentially the same – hundreds of microseconds to complete simple scripts.
So I believe with IO intensive tasks we will not be limited by python server scripts and/or various reschedulings.

I’m quite excited with idea of adding full-text (rather simple though) search for all uploaded content into elliptics.
And I’m working hard on this – expect some results really soon!

P.S. We have almost finished directory support for POHMELFS via server-side scripts, expect it quite soon also!

splice() syscall

I found that splice() syscall does not transfer data, when in and out file descriptors are the same or refer to the same file. Instead destination ‘space’ is filled with zeroes while supposed to contain input buffer content.

Here is my code for the reference (with some debug added):

static int eblob_splice_data_one(int *fds, int fd_in, loff_t *off_in,
		int fd_out, loff_t *off_out, size_t len)
	int err;
	size_t to_write = len;

	while (to_write > 0) {
		err = splice(fd_in, off_in, fds[1], NULL, to_write, 0);
		printf("splice  in: %zu bytes from fd: %d, off: %llu: %d\n",
				to_write, fd_in, *off_in, err);
		if (err == 0) {
			err = -ENOSPC;
			goto err_out_exit;
		if (err  0) {
		err = splice(fds[0], NULL, fd_out, off_out, to_write, 0);
		printf("splice out: %zu bytes into fd: %d, off: %llu: %d\n",
				to_write, fd_out, *off_out, err);
		if (err == 0) {
			err = -ENOSPC;
			goto err_out_exit;
		if (err < 0) {
			err = -errno;
			goto err_out_exit;
		to_write -= err;

	err = 0;

	return err;

Unfortunately it does not even return error, but silently corrupts data.

I would be happy to be wrong of course.

Initial POHMELFS bits

I’m a bit budy with other boring tasks around, so only get chance to hack on POHMELFS today. It is almost 1000 lines of code already, but yet its functionality is virtually near zero.
But I believe all interface for VFS are implemented and I can concentrate on actual interaction with remote elliptics storage.

$ wc -l fs/pohmelfs/{*.[ch],Makefile,Kconfig}
   41 fs/pohmelfs/dir.c
   27 fs/pohmelfs/file.c
  185 fs/pohmelfs/inode.c
   85 fs/pohmelfs/net.c
  104 fs/pohmelfs/pohmelfs.h
   96 fs/pohmelfs/pohmelfs.mod.c
  437 fs/pohmelfs/super.c
    7 fs/pohmelfs/Makefile
   11 fs/pohmelfs/Kconfig
  993 total

# mount -t pohmelfs -o "server=ipv4_addr1:1025:2;server=ipv6_addr2:1025:6" none /mnt/
# ls -laui /mnt/
total 4
1 drwxr-xr-x   2 root root    0 Aug 26 08:52 .
2 dr-xr-xr-x. 23 root root 4096 Aug 26 07:47 ..

File creation does not work as well as directory content reading. It will be rather trivial for start.
Maybe I will even implement server-side scripting instead and will use it for directory updates, so that I do not create leases (in the first release) needed for read-modify-write loop of directory update.

Or (what is more probable) I will just create read-modify-write loop for directory update without server locks, which is rather bad idea from concurrent point of view, but it is the simplest case which can be used as a base for future improvements.

Stay tuned, I plan to create kind of alpha working version very soon!

Pohmelfs in da haus

Linux kernel for the last 3 years didn’t change at all – I wrote 600 lines of code, and it does not yet even connect normally to remote elliptics node.

$ wc -l fs/pohmelfs/{*.[ch],Makefile,Kconfig}
   37 fs/pohmelfs/inode.c
   88 fs/pohmelfs/net.c
   58 fs/pohmelfs/pohmelfs.h
   62 fs/pohmelfs/pohmelfs.mod.c
  353 fs/pohmelfs/super.c
    7 fs/pohmelfs/Makefile
   11 fs/pohmelfs/Kconfig
  616 total

Data de-duplication in ZFS and elliptics network (POHMELFS)

Jon Smirl sent me a link describing new ZFS feature – data deduplication.

This is a technique which allows to store multiple data objects in the same place when their content is the same, thus effectively saving the space. There are three levels of data deduplication – files (objects actually), blocks and bytes. Every level allows to store single entity for the multiple identical objects, like single block for several equal data blocks or byte range and so on. ZFS supports block deduplication.

This feature existed effectively from the beginning in the elliptics network distributed hash table storage, but it has two levels of data deduplication: object and transaction. Well, actually we have transaction only, but maximum transaction size can be limited to some large enough block (like megabytes or more, or can be unlimited if needed), so if object is smaller than that, it will be deduplicated automatically.

Which basically means that if multiple users write the same content into the storage and use the same ID, no new storage space will be used, instead transaction log for the selected object will be updated to show that two external objects refer to given transaction.

Depending on transaction size it may have a negative impact, in particular when transaction size is smaller than log entry, it will be actually a waste of space, but transactions are required for the log-strucutred filesystem and to implement things like snapshots and update history. By default log entry size equals to 56 bytes, so it should not be a problem in the common case.

POHMELFS as elliptics network frontend will support this feature without actually any steps out of the box.

NTT Cyber Space Labs presents Sheepdog – distributed storage system for KVM

MORITA Kazutaka wrote:

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.

The following list describes the features of Sheepdog.

* Linear scalability in performance and capacity
* No single point of failure
* Redundant architecture (data is written to multiple nodes)
– Tolerance against network failure
* Zero configuration (newly added machines will join the cluster automatically)
– Autonomous load balancing
* Snapshot
– Online snapshot from qemu-monitor
* Clone from a snapshot volume
* Thin provisioning
– Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, – = on our todo list)

More details and download links are here:

Note that the code is still in an early stage.
There are some critical TODO items:

– VM image deletion support
– Support architectures other than X86_64
– Data recoverys
– Free space management
– Guarantee reliability and availability under heavy load
– Performance improvement
– Reclaim unused blocks
– More documentation

IMHO, block level distrubuted systems are dead overall, although it has its niche.

POHMELFS and elliptics network design issues

Currently POHMELFS has a coherent writeback cache on each client. This feature requires special protocol to be developed which manages data consistency between clients. Namely it maintains per-object-per-client data on the server, which is checked each time new client access the data.

There are at least three problems with this:

  • need to store a state on the server about client, which means that server crash can not be easily handled
  • increasing number of communication messages needed to be sent when new client access some object, it grows lineary with number of clients, now suppose a huge server with lots of clients and large amount of data objects
  • need to send lock request and wait for ack for each new or flushed from the local cache object

Plus code complexity, POHMELFS implements MOSI-like cache coherency protocol, and while modern processors can observe the bus and flush cache lines, distributed cloud system has to maintain message protocol for each data access.

Another huge problem is server redundancy, since effectively all states have to be replicated on multiple servers.

To date there are no open network systems with such cache support. Originally it was presented in Oracle CRFS developed by Zach Brown, but currently its development is rather dead.

Experiments with POHMELFS and latency led to decision to drop this implementation while maintaining high performance for metadata-intensive operations, which especially benefit from it.

Main idea is to continue to split IO operations from network processing, the same way it was done in elliptics network, where network IO is handled by the properly implemented state machine on top of pool of threads. POHMELFS currently supports this only for data receiving. Data writing blocks in cache writeback, which sends data to the servers.

Now I plan to introduce transaction queue where each new data IO request will find and concatenate transactions before they are sent by the dedicated threads. Reading for up-to-date pages will not force any kind of network IO, while writes and fresh reads will just allocate and queue a transaction. Write then stops while reading sleeps for transaction completion.
Subsequent write may append data into just created transaction if it was noet yet sent.
As previously transaction will live in retransmit queue until ack is received and resend if needed.

Local cache invalidation will be handled similar way it exists currently in POHMELFS and its server, which I will think some more about. Likely I will use built into elliptics network update notification mechanism.

Transaction queue will be processed by the pool of threads, or maybe even by the single thread per network connection, separately from userspace-visible IO.

That’s the original plan I ended up with to date, but it may happen that this will be just a start of the things breaking. While I’m suffering from climbing aching and appartment development I may think out something additional.
Stay tuned!

New filesystems in drivers/staging for 2.6.30

I believe drivers/staging should be renamed into fs/staging, since this tree will likely contain CEPH and NILFS2, as long as POHMELFS and DST.

Ceph is a distributed file system designed for reliability, scalability, and performance. The storage system consists of some (potentially large) number of storage servers (bricks), a smaller set of metadata server daemons, and a few monitor daemons for managing cluster membership and state.
It relies on BTRFS to store data and closely works with its internal features like transactions and cloning.

NILFS2 is a log-structured file system supporting versioning of the entire file system and continuous snapshotting which allows users to even restore files mistakenly overwritten or destroyed just a few seconds ago.
NILFS2 lives in -mm tree for a while already, so this actually may be a call for the mainstream inclusion directly into fs/.

More filesystems – good and different!

Is overwrite a bad decision? Distributed transactional filesystem

strugling Enjoying the muscle pain switches brain into the thinking mode compared to the usual slacking one. This brought me a nice idea of combining POSIX filesystem with the distributed transactional approach used in the elliptics network.

Every POSIX filesystem as long as usual write applications are supposed to overwrite the data placed in the middle of the object. Transactional storage actually does the same – the elliptics network overwrites the local object, but it also creates a new object which stores update transaction itself. It is potentially placed on the different nodes in the cloud. With the simple extension it is possible not to overwrite the original object and redirect all reads to fetch different transactions (or their parts) instead.

What if the POSIX filesystem will not actually overwrite the data, since it requires either a complex cache-coherecy protocol to be involved between multiple clients, working with the same object, and server, which complexity quickly grows when we want to have multiple servers; or use write-through cache (still with races though), which kills the performace compared to the write-back one for the local operations.

Basic idea is to never lock the object itself, it is never updated, only its history log, which is rather small and its updates can be serialized. Every transaction is placed in the different place (potentially – it depends on the network configuration), so when we want to read some data from the object, we check the history log instead, which contains sizes, offsets and the IDs of the data written, and fetch needed transactions (or their parts) from the network and not from the file itseld.

First, this allows to read data in parallel even if object itself was never mirrored to the different nodes.
Second, updates will lock the history index for the very short time, writes itself will not lock anything and will be done in parallel to the multiple nodes, since each transaction will move to the unique location.
Third, history lock may be done distributed, since overhead over its short aciquire time should still be small enough compared to the time needed to write huge transaction into the object and lock over this operation.

Moreover we can eliminate history update locking completely by using versioning of the object state, i.e. all clients who previously read that object still have a valid copy, but with the different version, and thier states are consistent, but not up-to-date. This may rise some concerns from the POSIX side, but overall idea looks very appealing.

As of negative sides, this will force POHMELFS server not to work with the local storage as we know it today – it will become part of the distributed network and thus will store all the data (when used in single node mode, i.e. as a network and not distributed filesystem) in a strange format used currently in the elliptics network – a directories full of files named as 40 chars instead of common names.

POSIX issues introduce potentially serious limitations, but idea looks very promising so far and I will definitely think about its implementation in the POHMELFS.

POHMELFS vs NFS: dbench and the power of the local metadata cache

POHMELFS vs NFS dbench performance

Client and server machines have 8Gb of RAM and I ran single-threaded test to show the power of the local metadata cache, so get this data with a fair grain of salt.

POHMELFS served close to 160k operations over the network, while async in-kernel NFS ended up with this dump:

   1    643930    23.31 MB/sec  execute 149 sec   
   1    649695    23.37 MB/sec  cleanup 150 sec   
/bin/rm: cannot remove directory `/mnt//clients/client0/~dmtmp/ACCESS': Directory not empty
/bin/rm: cannot remove directory `/mnt//clients/client0/~dmtmp/SEED': Directory not empty
   1    649695    23.36 MB/sec  cleanup 150 sec   

More precise performance values:

POHMELFS   652.481 MB/sec
NFS         23.366 MB/sec

I’ve pushed POHMELFS update to the GIT, which includes a long-waiting switch from the own path cache to the system’s dcache (kudos to the one who exported dcache_lock to the modules :)

It will be likely pushed into the drivers/staging in a day or so.

NFS/credentials leak in 2.6.29-rc1 and thoughts on NFS performance

I decided to find out how NFS managed to have that fast random read performance (with so slow sequential read), and started a 8gb random IO test in IOzone. And machine started to die. I already killed it three times for this day, and reason is likely in the NFS server. That’s what slabtop shows on the server:

 Active / Total Objects (% used)    : 4741969 / 4755356 (99.7%)
 Active / Total Slabs (% used)      : 201029 / 201049 (100.0%)
 Active / Total Caches (% used)     : 91 / 162 (56.2%)
 Active / Total Size (% used)       : 750871.15K / 753121.28K (99.7%)
 Minimum / Average / Maximum Object : 0.01K / 0.16K / 4096.00K

1798890 1798672  99%    0.12K  59963       30    239852K cred_jar
1798320 1798307  99%    0.25K 119888       15    479552K size-256
1091430 1091401  99%    0.05K  16290       67     65160K buffer_head
  18824   17997  95%    0.28K   1448       13      5792K radix_tree_node

Both cred_jar and size-256 slabs constantly grew during the test, so I suppose there is a leak in the current kernel (iirc there were no leaks in .28), while I’m waiting for Trond Myklebust for comments, I thought on how NFS is capable to have higher random read than sequential one.

The main theory is its request combining on the client. I.e. when system joins two random but close enough requests, server will send not only requested data, but also additional region between them. Or some similar logic. I.e. essentially increased readahead by both client and server.
If this theory is correct, then simple way to solve it is to increase readahead in POHMELFS, or actually not to shrink it in some conditions. I will try this idea…

DST has been asked for drivers/staging merge

DST is fully self-contained and really is not expected to get any changes in the future :)

POHMELFS is a bit more complex project, and it requires two exports from the Linux VFS, which are safe as is, but I’m waiting for Linus/Andrew to confirm that (we already talked about them with Andrew some time ago though).

In parallel I’m testing POHMELFS, and while it still shows superior perfromance compared to async in-kernel NFS, one of my systems refuse to mount it. It just says that it does not know ‘pohmel’ filesystem type, not even entering the kernel. Do not yet know what is the problem, but it worked ok with the previous kernel (it was some -rc tree). Will investigate further and prepare the patches.

Also I would like to know what benchmark could be used for the multi-user parallel testing. I use iozone for the single-user load.

Btrfs is in mainline now.

Linus pulled this excellent filesystem into the vanilla tree. So if you were curious about how to extend your storage, this is a good way.

But beware, that it is a development tree and very likely contains fair number of errors you may encounter and thus lost the data.

Contrary to this we can check how Andrew Morton opposes to merge SquashFS (compressed read-only filesystem) because of not enough of testing (it is a default FS to build live CDs and installators for several years in all distros), not enough review (it passed three in the fsdevel maillist, although my trivials were not commented :), something else and (!) because it is called after a vegetable. I do hope Andrew made a joke about the last argument, even Linus entered the discussion, and now SquashFS is effectively merged, since Andrew agreed to pull it into the tree.

I’m curious what will happen when I will push POHMELFS upstream (and I will do it soon enough). Actually while I was writing this I decided to push it just tomorrow after documentation update. At least to ask Linus and Andrew, and you know, I can bet you there will be zero responses (related to the process), although POHMELFS is very competive already (although a bit younger than btrfs).

And although I’m pretty sure about the result, let’s do this for fun :)

POHMELFS: The Great Southern Trendkill release.

POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.

POHMELFS is a kernel client for the developed distributed parallel internet filesystem. As it exists today, it is a high-performance parallel network filesystem with ability to balance reading from multiple hosts and simultaneously write data to multiple hosts.

Main design goal of this filesystem is to implement very fast and scalable network filesystem with local writeback cache of data and metadata, which greatly speeds up every IO operation compared to traditional writethrough based network filesystems.

Read balancing and writing to multiple hosts features can be used to improve parallel multithreaded read-mostly data processing workload and organize fault-tolerant systems. POHMELFS as a network client does not support data synchrnonization between the nodes, so this task should be implemented in servers. POHMELFS and multiple-server-write can be used as backup solution for the physically distributed network servers.

Currently development is concentrated on the distributed object-based server development implemeneted with distributed hash table design approach in mind, which main goals it completely transparent from client point of view node management, full absence of any controlling central servers (points of failure), transaction/history based object storage.

POHMELFS utilizes writeback cache, which is built on top of MO(E)SI-like coherency protocol. It uses scalable cached read/write locking. No additional requests are performed if lock is granted to the filesystem. The same protocol is used by the server to on-demand flushing of the client’s cache (for example when server wants to update local data or send some new content into the clients caches).

POHMELFS is able to encrypt data channel or perform strong data checksumming. Algorithms used by the filesystems are autoconfigured during startup and mount may fail (depending on options) if server does not support requested algorithms.

Autoconfiguration also involve sending information about size of the exported directory specified by the server, permission, statistics about amount of inodes, used space and so on.

POHMELFS utilizes transaction model for all its operations. Each transction is an object, which may embed multiple commands completed atomically. When server fails the whole transaction will be replied against it (or different server) later. This approach allows to maintain high data integrity and do not desynchronize filesystem state in case of network or server failures.

More details can be found at the homepage.

The first elliptics distributed network draft.

POHMELFS was created as a client (and besides that it is already a very high-performance network filesystem) for this kind of distributed network. It was called elliptics network.

Each node has ID in flat space with modulo operation in it.

Each object has ID from the same finite space. Object can only be identified by its id, there are no dirs, files or whatever else.

Server with ID N hosts objects from (N-1; N] where N-1 is ID of the neighbour node.

Any object can have multiple IDs, in this case it will be copied to different nodes.
Each additional ID can be obtained from the first one via known mathematical operations (like sum in finite field). IDs (including additional) should not cross for different objects.

Information about object and its clones will be stored in object itself and in objects, which correspond to the parent directory in classical path-based lookup, thus to get the object client will create ID from its path, fetch the data, and if it is not available, it can created ID from parent’s directory path and request information about clones from that object.

Each write operation is transaction based, thus each transaction is a separate object in the filesystem, which will be stored on different servers according to client’s settings (how many clones). Information about how to apply transactions on top of some object will be stored in the parent’s object the same way as described above.

Transaction is committed (and can be remoevd) after all clones have it applied, otherwise it should live in the network until explicitely removed.

Node join protocol.
Node looks up a pair of node, whose IDs surround ID of the joining node (called next and prev nodes here according to how IDs correlate). It sends a request to the next node to return list of objects (and its attributes) which correspond to the joining node ID. Let’s assume next node has ID N, prev node – P and joining node has id J.
1. Joining node J sends request to next node N to grab list of objects it hosts in (P; J] and to get routing table next node has.
1a. Next node forwards all write requests to the joining node J, which will block util step 3 is completed.
2. Joining node J runs through received list and selects objects which are newer than those which are presented on joining node itself.
3. Those objects are fetched from the next node.
3a. All write requests forwarded from the next node are applied.
4. Joining node connects to previous node P and announce that it is now in the network, so that previous node updated its routing table. This announce can be sent to all nodes in the routing table received from the next node in step 1.

Each node has a routing table, which corresponds to the tree indexed by node IDs. Each object in the tree hosts an address of the remote node. Even large enough routing table will not take lots of RAM, but this redundacy (and not only addresses of the immediate neighbours) allows to greatly reduce amount of lookup messages neeeded to find an appropriate node. When node receives some request which does not correspond to IDs it hosts, node should forward this request to another node according to its routing table. If routing table does not have a node, which corresponds to given ID, request should be forwarded to the nearest to that ID node.
Some heueristics should be introduced to determine when to stop a lookup and return no-entry error, for example when next node in the routing table has ID less than requested ID, then error should be returned. This relies on correct routing table, which should be updated when next node leaves, or nodes next to the next node join.

When node joins and fetches list of updated objects from the next node, it may notice that some objects were changed and request transaction list from another nodes to apply. Transaction list can be obtained from the object, which corresponds to parent directory in the path-based object name.


Nested attributes in inotify.

Essentially nothing major changed since yesterday’s patch: cleanups and minor fixes. Also created test application:

$ gcc ./iotest.c -W -Wall -I/path/to/kernel/include/ -o iotest
$ ./iotest -f /tmp/ -t -n -p -i
2008-11-25 22:29:47.8477 pid: 1850, tid: 1850, name: /tmp/, wd: 1, mask: 303,
attributes: pid: 1, tid: 1, io: 1, name: 1.
event: 2, wd: 1, cookie: 0, len: 61.
	tid: 1672.
	pid: 1672.
	io details: start: 0, size: 0.
	name: test.
event: 2, wd: 1, cookie: 0, len: 61.
	tid: 1672.
	pid: 1672.
	io details: start: 0, size: 6.
	name: test.

Example with only tid and io details:

$ ./iotest -f /tmp/ -t -i
2008-11-25 22:40:30.201286 pid: 1928, tid: 1928, name: /tmp/, wd: 1, mask: 303,
attributes: pid: 0, tid: 1, io: 1, name: 0.
event: 2, wd: 1, cookie: 0, len: 36.
	tid: 1672.
	io details: start: 0, size: 0.

All above events are for this command:

$ echo qwe11 > /tmp/test

One can get it from archive.

Nested attributes in inotify.

Linux inotify does not have anyhow extensible architecture: it is tied to small fixed structure, which carries information about generated event, watcher’s ID (which is aunique number identifying added IO watch for given object) and magic cookie unused by all but rename events where it is used to bind different move events (move-from and move-to events have the same cookie). It is also possible to attach name of the object in question, for example when some file is created or written into, directory watch which contains given file will fire events with appropriate file name. That’s it.

So, basically, you cann not attach PID/TID of the calling process, IO start/size or anything else you would like to transfer to userspace linked to given file or directory.

So, after my initial extension was massively rejected because of ugliness of the design (I remind, that I reused unused by anything bug rename events cookie field to carry PID information, which indeed is ugly, but does not break anything), I decided to implement things using The Right Way.

So, I spent several hours today to cook up following idea of nested attributes in the inotify. Patch uses


to show that new event format should be returned when reading from inotify file descriptor.

Attributes are attached to given inotify instance via TIOCSETD ioctl, where ID of the attribute is transferred. If entry with given ID exists, it is removed and -EEXIST is returned.

There is a set of callbacks indexed by ID (currently 4 attributes are supported: pid, tid, io details (start/size of the written block) and name), which are attached to given inotify device. When new event is created, each callback is executed and may queue some data into event structure, based on its internal details. When event is read from the userspace, first its header is transferred (old inotify_event structure) and then all attributes data in the order of its registration via above ioctl. Nested attributes have following structure:

struct inotify_attribute
	unsigned int		id;
	unsigned int		size;
	unsigned char		data[0];

where size is nuber of attached bytes for given attribute. inotify_event.len contains number of bytes used to store all attached attributes and data.

Attribute callback can check if mask contains its bits and do some steps based on that information, for example io details attribute does not add own data (and header) if event mask does not contain IN_MODIFY bit.

It is possible to infinitely extend number of attributes processed, so we will be no longer limited by inotify API and ABI.

I will cook up userspace example tomorrow and fix possible bugs if find them during testing and would like to get a feedback on idea and implementation, let’s see how many comments I will get on this proposal. Patch contains just 897 lines.