Filesystems

Filesystems development

Data de-duplication in ZFS and elliptics network (POHMELFS)

Jon Smirl sent me a link describing new ZFS feature - data deduplication.

This is a technique which allows to store multiple data objects in the same place when their content is the same, thus effectively saving the space. There are three levels of data deduplication - files (objects actually), blocks and bytes. Every level allows to store single entity for the multiple identical objects, like single block for several equal data blocks or byte range and so on. ZFS supports block deduplication.

This feature existed effectively from the beginning in the elliptics network distributed hash table storage, but it has two levels of data deduplication: object and transaction. Well, actually we have transaction only, but maximum transaction size can be limited to some large enough block (like megabytes or more, or can be unlimited if needed), so if object is smaller than that, it will be deduplicated automatically.

Which basically means that if multiple users write the same content into the storage and use the same ID, no new storage space will be used, instead transaction log for the selected object will be updated to show that two external objects refer to given transaction.

Depending on transaction size it may have a negative impact, in particular when transaction size is smaller than log entry, it will be actually a waste of space, but transactions are required for the log-strucutred filesystem and to implement things like snapshots and update history. By default log entry size equals to 56 bytes, so it should not be a problem in the common case.

POHMELFS as elliptics network frontend will support this feature without actually any steps out of the box.

NTT Cyber Space Labs presents Sheepdog - distributed storage system for KVM

Tagged:  

MORITA Kazutaka wrote:

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.

The following list describes the features of Sheepdog.

* Linear scalability in performance and capacity
* No single point of failure
* Redundant architecture (data is written to multiple nodes)
- Tolerance against network failure
* Zero configuration (newly added machines will join the cluster automatically)
- Autonomous load balancing
* Snapshot
- Online snapshot from qemu-monitor
* Clone from a snapshot volume
* Thin provisioning
- Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, - = on our todo list)

More details and download links are here:
http://www.osrg.net/sheepdog/

Note that the code is still in an early stage.
There are some critical TODO items:

- VM image deletion support
- Support architectures other than X86_64
- Data recoverys
- Free space management
- Guarantee reliability and availability under heavy load
- Performance improvement
- Reclaim unused blocks
- More documentation

IMHO, block level distrubuted systems are dead overall, although it has its niche.

POHMELFS and elliptics network design issues

Tagged:  

Currently POHMELFS has a coherent writeback cache on each client. This feature requires special protocol to be developed which manages data consistency between clients. Namely it maintains per-object-per-client data on the server, which is checked each time new client access the data.

There are at least three problems with this:

  • need to store a state on the server about client, which means that server crash can not be easily handled
  • increasing number of communication messages needed to be sent when new client access some object, it grows lineary with number of clients, now suppose a huge server with lots of clients and large amount of data objects
  • need to send lock request and wait for ack for each new or flushed from the local cache object

Plus code complexity, POHMELFS implements MOSI-like cache coherency protocol, and while modern processors can observe the bus and flush cache lines, distributed cloud system has to maintain message protocol for each data access.

Another huge problem is server redundancy, since effectively all states have to be replicated on multiple servers.

To date there are no open network systems with such cache support. Originally it was presented in Oracle CRFS developed by Zach Brown, but currently its development is rather dead.

Experiments with POHMELFS and latency led to decision to drop this implementation while maintaining high performance for metadata-intensive operations, which especially benefit from it.

Main idea is to continue to split IO operations from network processing, the same way it was done in elliptics network, where network IO is handled by the properly implemented state machine on top of pool of threads. POHMELFS currently supports this only for data receiving. Data writing blocks in cache writeback, which sends data to the servers.

Now I plan to introduce transaction queue where each new data IO request will find and concatenate transactions before they are sent by the dedicated threads. Reading for up-to-date pages will not force any kind of network IO, while writes and fresh reads will just allocate and queue a transaction. Write then stops while reading sleeps for transaction completion.
Subsequent write may append data into just created transaction if it was noet yet sent.
As previously transaction will live in retransmit queue until ack is received and resend if needed.

Local cache invalidation will be handled similar way it exists currently in POHMELFS and its server, which I will think some more about. Likely I will use built into elliptics network update notification mechanism.

Transaction queue will be processed by the pool of threads, or maybe even by the single thread per network connection, separately from userspace-visible IO.

That's the original plan I ended up with to date, but it may happen that this will be just a start of the things breaking. While I'm suffering from climbing aching and appartment development I may think out something additional.
Stay tuned!

A development roadmap experimental poll - you can make the difference

Documentation, automatic tests and 3.0 elliptics release
26% (38 votes)
SMTP/IMAP elliptics storage backend
7% (11 votes)
POHMELFS elliptics frontend
30% (45 votes)
SSL support and more advanced merge config resolve in elliptics network
3% (5 votes)
LISP (and eventually C) regexp and subsequent LR grammatics analyzer
3% (4 votes)
HTML validator (based on above code though)
1% (1 vote)
PAXOS library and distributed locking
18% (27 votes)
Write your own - vote or lose!
1% (2 votes)
More tight electronics projects (w1 sniffer, digital electronics, robotics)
10% (15 votes)
Total votes: 148

New filesystems in drivers/staging for 2.6.30

Tagged:  

I believe drivers/staging should be renamed into fs/staging, since this tree will likely contain CEPH and NILFS2, as long as POHMELFS and DST.

Ceph is a distributed file system designed for reliability, scalability, and performance. The storage system consists of some (potentially large) number of storage servers (bricks), a smaller set of metadata server daemons, and a few monitor daemons for managing cluster membership and state.
It relies on BTRFS to store data and closely works with its internal features like transactions and cloning.

NILFS2 is a log-structured file system supporting versioning of the entire file system and continuous snapshotting which allows users to even restore files mistakenly overwritten or destroyed just a few seconds ago.
NILFS2 lives in -mm tree for a while already, so this actually may be a call for the mainstream inclusion directly into fs/.

More filesystems - good and different!

Is overwrite a bad decision? Distributed transactional filesystem

Tagged:  

strugling Enjoying the muscle pain switches brain into the thinking mode compared to the usual slacking one. This brought me a nice idea of combining POSIX filesystem with the distributed transactional approach used in the elliptics network.

Every POSIX filesystem as long as usual write applications are supposed to overwrite the data placed in the middle of the object. Transactional storage actually does the same - the elliptics network overwrites the local object, but it also creates a new object which stores update transaction itself. It is potentially placed on the different nodes in the cloud. With the simple extension it is possible not to overwrite the original object and redirect all reads to fetch different transactions (or their parts) instead.

What if the POSIX filesystem will not actually overwrite the data, since it requires either a complex cache-coherecy protocol to be involved between multiple clients, working with the same object, and server, which complexity quickly grows when we want to have multiple servers; or use write-through cache (still with races though), which kills the performace compared to the write-back one for the local operations.

Basic idea is to never lock the object itself, it is never updated, only its history log, which is rather small and its updates can be serialized. Every transaction is placed in the different place (potentially - it depends on the network configuration), so when we want to read some data from the object, we check the history log instead, which contains sizes, offsets and the IDs of the data written, and fetch needed transactions (or their parts) from the network and not from the file itseld.

First, this allows to read data in parallel even if object itself was never mirrored to the different nodes.
Second, updates will lock the history index for the very short time, writes itself will not lock anything and will be done in parallel to the multiple nodes, since each transaction will move to the unique location.
Third, history lock may be done distributed, since overhead over its short aciquire time should still be small enough compared to the time needed to write huge transaction into the object and lock over this operation.

Moreover we can eliminate history update locking completely by using versioning of the object state, i.e. all clients who previously read that object still have a valid copy, but with the different version, and thier states are consistent, but not up-to-date. This may rise some concerns from the POSIX side, but overall idea looks very appealing.

As of negative sides, this will force POHMELFS server not to work with the local storage as we know it today - it will become part of the distributed network and thus will store all the data (when used in single node mode, i.e. as a network and not distributed filesystem) in a strange format used currently in the elliptics network - a directories full of files named as 40 chars instead of common names.

POSIX issues introduce potentially serious limitations, but idea looks very promising so far and I will definitely think about its implementation in the POHMELFS.

POHMELFS vs NFS: dbench and the power of the local metadata cache

Tagged:  


POHMELFS vs NFS dbench performance

Client and server machines have 8Gb of RAM and I ran single-threaded test to show the power of the local metadata cache, so get this data with a fair grain of salt.

POHMELFS served close to 160k operations over the network, while async in-kernel NFS ended up with this dump:

   1    643930    23.31 MB/sec  execute 149 sec   
   1    649695    23.37 MB/sec  cleanup 150 sec   
/bin/rm: cannot remove directory `/mnt//clients/client0/~dmtmp/ACCESS': Directory not empty
/bin/rm: cannot remove directory `/mnt//clients/client0/~dmtmp/SEED': Directory not empty
   1    649695    23.36 MB/sec  cleanup 150 sec   

More precise performance values:

POHMELFS   652.481 MB/sec
NFS         23.366 MB/sec

I've pushed POHMELFS update to the GIT, which includes a long-waiting switch from the own path cache to the system's dcache (kudos to the one who exported dcache_lock to the modules :)

It will be likely pushed into the drivers/staging in a day or so.

NFS/credentials leak in 2.6.29-rc1 and thoughts on NFS performance

Tagged:  

I decided to find out how NFS managed to have that fast random read performance (with so slow sequential read), and started a 8gb random IO test in IOzone. And machine started to die. I already killed it three times for this day, and reason is likely in the NFS server. That's what slabtop shows on the server:

 Active / Total Objects (% used)    : 4741969 / 4755356 (99.7%)
 Active / Total Slabs (% used)      : 201029 / 201049 (100.0%)
 Active / Total Caches (% used)     : 91 / 162 (56.2%)
 Active / Total Size (% used)       : 750871.15K / 753121.28K (99.7%)
 Minimum / Average / Maximum Object : 0.01K / 0.16K / 4096.00K
 
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
1798890 1798672  99%    0.12K  59963       30    239852K cred_jar
1798320 1798307  99%    0.25K 119888       15    479552K size-256
1091430 1091401  99%    0.05K  16290       67     65160K buffer_head
  18824   17997  95%    0.28K   1448       13      5792K radix_tree_node

Both cred_jar and size-256 slabs constantly grew during the test, so I suppose there is a leak in the current kernel (iirc there were no leaks in .28), while I'm waiting for Trond Myklebust for comments, I thought on how NFS is capable to have higher random read than sequential one.

The main theory is its request combining on the client. I.e. when system joins two random but close enough requests, server will send not only requested data, but also additional region between them. Or some similar logic. I.e. essentially increased readahead by both client and server.
If this theory is correct, then simple way to solve it is to increase readahead in POHMELFS, or actually not to shrink it in some conditions. I will try this idea...

DST has been asked for drivers/staging merge

Tagged:  

DST is fully self-contained and really is not expected to get any changes in the future :)

POHMELFS is a bit more complex project, and it requires two exports from the Linux VFS, which are safe as is, but I'm waiting for Linus/Andrew to confirm that (we already talked about them with Andrew some time ago though).

In parallel I'm testing POHMELFS, and while it still shows superior perfromance compared to async in-kernel NFS, one of my systems refuse to mount it. It just says that it does not know 'pohmel' filesystem type, not even entering the kernel. Do not yet know what is the problem, but it worked ok with the previous kernel (it was some -rc tree). Will investigate further and prepare the patches.

Also I would like to know what benchmark could be used for the multi-user parallel testing. I use iozone for the single-user load.

Filesystems for the drivers/staging

Tagged:  

Greg Kroah-Hartman wrote:

So, if anyone wants to send me filesystems, I'll be glad to take them into drivers/staging, as long as they are self-contained

P.S. Both btrfs and squashfs are in the mainline tree now. And I'm setting up the test machines for the performance tests for the recent POHMELFS tests.

Btrfs is in mainline now.

Tagged:  

Linus pulled this excellent filesystem into the vanilla tree. So if you were curious about how to extend your storage, this is a good way.

But beware, that it is a development tree and very likely contains fair number of errors you may encounter and thus lost the data.

Contrary to this we can check how Andrew Morton opposes to merge SquashFS (compressed read-only filesystem) because of not enough of testing (it is a default FS to build live CDs and installators for several years in all distros), not enough review (it passed three in the fsdevel maillist, although my trivials were not commented :), something else and (!) because it is called after a vegetable. I do hope Andrew made a joke about the last argument, even Linus entered the discussion, and now SquashFS is effectively merged, since Andrew agreed to pull it into the tree.

I'm curious what will happen when I will push POHMELFS upstream (and I will do it soon enough). Actually while I was writing this I decided to push it just tomorrow after documentation update. At least to ask Linus and Andrew, and you know, I can bet you there will be zero responses (related to the process), although POHMELFS is very competive already (although a bit younger than btrfs).

And although I'm pretty sure about the result, let's do this for fun :)

POHMELFS: The Great Southern Trendkill release.

Tagged:  

POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.

POHMELFS is a kernel client for the developed distributed parallel internet filesystem. As it exists today, it is a high-performance parallel network filesystem with ability to balance reading from multiple hosts and simultaneously write data to multiple hosts.

Main design goal of this filesystem is to implement very fast and scalable network filesystem with local writeback cache of data and metadata, which greatly speeds up every IO operation compared to traditional writethrough based network filesystems.

Read balancing and writing to multiple hosts features can be used to improve parallel multithreaded read-mostly data processing workload and organize fault-tolerant systems. POHMELFS as a network client does not support data synchrnonization between the nodes, so this task should be implemented in servers. POHMELFS and multiple-server-write can be used as backup solution for the physically distributed network servers.

Currently development is concentrated on the distributed object-based server development implemeneted with distributed hash table design approach in mind, which main goals it completely transparent from client point of view node management, full absence of any controlling central servers (points of failure), transaction/history based object storage.

POHMELFS utilizes writeback cache, which is built on top of MO(E)SI-like coherency protocol. It uses scalable cached read/write locking. No additional requests are performed if lock is granted to the filesystem. The same protocol is used by the server to on-demand flushing of the client's cache (for example when server wants to update local data or send some new content into the clients caches).

POHMELFS is able to encrypt data channel or perform strong data checksumming. Algorithms used by the filesystems are autoconfigured during startup and mount may fail (depending on options) if server does not support requested algorithms.

Autoconfiguration also involve sending information about size of the exported directory specified by the server, permission, statistics about amount of inodes, used space and so on.

POHMELFS utilizes transaction model for all its operations. Each transction is an object, which may embed multiple commands completed atomically. When server fails the whole transaction will be replied against it (or different server) later. This approach allows to maintain high data integrity and do not desynchronize filesystem state in case of network or server failures.

More details can be found at the homepage.

POHMELFS and DST mail lists.

Tagged:  

There are two new mail lists created to discuss and develop POHMELFS (and more generic filesystem questions) and DST: pohmelfs@ioremap.net and dst@ioremap.net

Lists are archived, not moderated and allow to post for non-subscribers.

Enjoy!

The first elliptics distributed network draft.

POHMELFS was created as a client (and besides that it is already a very high-performance network filesystem) for this kind of distributed network. It was called elliptics network.

Each node has ID in flat space with modulo operation in it.

Each object has ID from the same finite space. Object can only be identified by its id, there are no dirs, files or whatever else.

Server with ID N hosts objects from (N-1; N] where N-1 is ID of the neighbour node.

Any object can have multiple IDs, in this case it will be copied to different nodes.
Each additional ID can be obtained from the first one via known mathematical operations (like sum in finite field). IDs (including additional) should not cross for different objects.

Information about object and its clones will be stored in object itself and in objects, which correspond to the parent directory in classical path-based lookup, thus to get the object client will create ID from its path, fetch the data, and if it is not available, it can created ID from parent's directory path and request information about clones from that object.

Each write operation is transaction based, thus each transaction is a separate object in the filesystem, which will be stored on different servers according to client's settings (how many clones). Information about how to apply transactions on top of some object will be stored in the parent's object the same way as described above.

Transaction is committed (and can be remoevd) after all clones have it applied, otherwise it should live in the network until explicitely removed.

Node join protocol.
Node looks up a pair of node, whose IDs surround ID of the joining node (called next and prev nodes here according to how IDs correlate). It sends a request to the next node to return list of objects (and its attributes) which correspond to the joining node ID. Let's assume next node has ID N, prev node - P and joining node has id J.
1. Joining node J sends request to next node N to grab list of objects it hosts in (P; J] and to get routing table next node has.
1a. Next node forwards all write requests to the joining node J, which will block util step 3 is completed.
2. Joining node J runs through received list and selects objects which are newer than those which are presented on joining node itself.
3. Those objects are fetched from the next node.
3a. All write requests forwarded from the next node are applied.
4. Joining node connects to previous node P and announce that it is now in the network, so that previous node updated its routing table. This announce can be sent to all nodes in the routing table received from the next node in step 1.

Each node has a routing table, which corresponds to the tree indexed by node IDs. Each object in the tree hosts an address of the remote node. Even large enough routing table will not take lots of RAM, but this redundacy (and not only addresses of the immediate neighbours) allows to greatly reduce amount of lookup messages neeeded to find an appropriate node. When node receives some request which does not correspond to IDs it hosts, node should forward this request to another node according to its routing table. If routing table does not have a node, which corresponds to given ID, request should be forwarded to the nearest to that ID node.
Some heueristics should be introduced to determine when to stop a lookup and return no-entry error, for example when next node in the routing table has ID less than requested ID, then error should be returned. This relies on correct routing table, which should be updated when next node leaves, or nodes next to the next node join.

When node joins and fetches list of updated objects from the next node, it may notice that some objects were changed and request transaction list from another nodes to apply. Transaction list can be obtained from the object, which corresponds to parent directory in the path-based object name.

Comments?

Nested attributes in inotify.

Tagged:  

Essentially nothing major changed since yesterday's patch: cleanups and minor fixes. Also created test application:

$ gcc ./iotest.c -W -Wall -I/path/to/kernel/include/ -o iotest
$ ./iotest -f /tmp/ -t -n -p -i
2008-11-25 22:29:47.8477 pid: 1850, tid: 1850, name: /tmp/, wd: 1, mask: 303,
attributes: pid: 1, tid: 1, io: 1, name: 1.
event: 2, wd: 1, cookie: 0, len: 61.
	tid: 1672.
	pid: 1672.
	io details: start: 0, size: 0.
	name: test.
event: 2, wd: 1, cookie: 0, len: 61.
	tid: 1672.
	pid: 1672.
	io details: start: 0, size: 6.
	name: test.
 
Example with only tid and io details:
 
$ ./iotest -f /tmp/ -t -i
2008-11-25 22:40:30.201286 pid: 1928, tid: 1928, name: /tmp/, wd: 1, mask: 303,
attributes: pid: 0, tid: 1, io: 1, name: 0.
event: 2, wd: 1, cookie: 0, len: 36.
	tid: 1672.
	io details: start: 0, size: 0.
 
All above events are for this command:
 
$ echo qwe11 > /tmp/test

One can get it from archive.

Nested attributes in inotify.

Tagged:  

Linux inotify does not have anyhow extensible architecture: it is tied to small fixed structure, which carries information about generated event, watcher's ID (which is aunique number identifying added IO watch for given object) and magic cookie unused by all but rename events where it is used to bind different move events (move-from and move-to events have the same cookie). It is also possible to attach name of the object in question, for example when some file is created or written into, directory watch which contains given file will fire events with appropriate file name. That's it.

So, basically, you cann not attach PID/TID of the calling process, IO start/size or anything else you would like to transfer to userspace linked to given file or directory.

So, after my initial extension was massively rejected because of ugliness of the design (I remind, that I reused unused by anything bug rename events cookie field to carry PID information, which indeed is ugly, but does not break anything), I decided to implement things using The Right Way.

So, I spent several hours today to cook up following idea of nested attributes in the inotify. Patch uses inotify_init1() to show that new event format should be returned when reading from inotify file descriptor.

Attributes are attached to given inotify instance via TIOCSETD ioctl, where ID of the attribute is transferred. If entry with given ID exists, it is removed and -EEXIST is returned.

There is a set of callbacks indexed by ID (currently 4 attributes are supported: pid, tid, io details (start/size of the written block) and name), which are attached to given inotify device. When new event is created, each callback is executed and may queue some data into event structure, based on its internal details. When event is read from the userspace, first its header is transferred (old inotify_event structure) and then all attributes data in the order of its registration via above ioctl. Nested attributes have following structure:

struct inotify_attribute
{
	unsigned int		id;
	unsigned int		size;
	unsigned char		data[0];
};

where size is nuber of attached bytes for given attribute. inotify_event.len contains number of bytes used to store all attached attributes and data.

Attribute callback can check if mask contains its bits and do some steps based on that information, for example io details attribute does not add own data (and header) if event mask does not contain IN_MODIFY bit.

It is possible to infinitely extend number of attributes processed, so we will be no longer limited by inotify API and ABI.

I will cook up userspace example tomorrow and fix possible bugs if find them during testing and would like to get a feedback on idea and implementation, let's see how many comments I will get on this proposal. Patch contains just 897 lines.

POHMELFS and DST usage cases, design and roadmap.

Tagged:  

POHMELFS kernel client at its current state and by design is a parallel network client to the distributed filesystem (called elliptics network in the design notes) itself. So you can consider it like parallel NFS (but with fast protocol), where parallel means read balancing and redundancy writing to multiple nodes. POHMELFS heavily utilizes local coherent data and metadata cache, so it is also very high-performance, but that's it: a simple client, which is able (with server's help of course) to maintain a coherent cache of data and metadata on multiple clients, which work with the same servers.

So, right now it is effectively a way to create data servers, where client sees only set of the same nodes, and balance operations (when appropriate) between them. Server itself works with local filesystem, which can be built on top of whtever raid combinations you like of the disks or DST nodes. So effectively in this case POHMELFS mounted node can be extended by increasing local filesystem on the server.

That's what the current state of the POHMELFS is. Next step in this direction is to extend server only (modulo some new commands to the client if needed and proper statistics) and add distributed facilities. By design there will be a cloud of servers, where each of which has own ID and data is distributed over this cloud, which has elliptics network working title. This name somewhat reflects the main algorithm of the ID distribution. Cloud will not have any dedicated metadata servers and every node will have exactly the same priority and usage case. POHMELFS kenel client (and thus usual user, which works with mounted POHMEL filesystem node) connects to the arbitrary server in the cloud and asks for needed data, this request is transformed into the elliptics network format and is forwarded to the server, which hosts it, data is returned to the client and it does not even know that it was stored elsewhere. In this case it is possible to infinitely extend the server space by adding new nodes, which will automatically join the network and will not require any work from client or administrator. Redundancy is maintained by having multiple nodes with the same id, so client will balance reading between them and write to them simultaneously. Node join protocol will maintain coherency of the updated and old data.

Proof-of-concept implementation is scheduled for the next month or so, this should be working (but simple enough for the start) library which can be used by other applications. Then I will integrate it with existing POHMELFS server. This is optimistic timings though, it depends on how many bugs will be found in all projects I maintain :)

DST is a network block device. It has a dumb protocol, which allows to connect two machines and use read/write commands between them (each command is effectively a pointer to where data should be stored and its size). So, there is always one machine, which exports some storage, and one which connects to it. There are no locks, no protection against parallel usage, nothing. Just plain IO commands. System can start several connections to the remote nodes and thus will have multiple block devices, which appear like local disks. Administrator can combine those block devices into single one via device mapper/lvm or mount btrfs on top of multiple nodes. And then export it via POHMELFS to some clients or work with it locally.
I consider DST as a completed project.

I did not write more detailed feature description of both POHMELFS and DST and how they are used in the failover cases or data integrity, it is always possible to grab those design notes from the appropriate homepages.

IT development roadmap.

This will be a short enough post about what projects and theirs status are included into the nearest roadmap. I wrote IT because I will describe programming projects only, and not electronics for example.

So, let's start with existing two: DST and POHMELFS.
The former is essentially ready (with fun experiment I run with the latest version, kernel and public releases) and will be pushed upstream for some time. Next version will be released in a week or so and will only have new name. That will be 10'th distributed storge release, and 4'th resend of the same code.
Recently released POHMELFS got a bug just after release, and it is supposed to be fixed in the current version (pull from both kernel and userspace POHMELFS git repos). So far I do not see any new major changes in the POHMELFS client code, so essentially kernel side will only be extended when this is required by distributed server changes. I will not push it upstream until server side is also close to be finished.

Obviously both projects will be maintained and bugs will be fixed with the highest priority.

Here we comes to the new project I'm thinking about for some time already. This is a network storage server built on top of distributed hash table design without centralized architecture and need to have metadata servers. So far there is not that much of a code, only trivial bits, and I'm designing node join/leave part of the protocol. Some results are expected to be sooner than later, but not immediately.

And of course brain needs something to play with for the rest. Here comes language parser (LISP XML parser) I mentioned previously, and some computer language application based on this idea. That's wjat I'm about to work on for the next days.

As another programming exercise I frequently think about buying myself a Play Station 3 and playing with its SPU processors and parallel applications, for example graphic algorithms (the first one which comes in mind is wavelet transformation and non-precise searcing of the images, which I made severaly years ago, I even have sources hidden in some old arch repo). Or playing with video card engines.
This does not have a very high priority though.

That's the programming plans. Let's see if anything will be completed anytime soon :)

New POHMELFS release.

Tagged:  

POHMELFS is a high-performance distributed parallel filesystem.

This release contains following changes:

  • combine locking with inode attribute changes
  • add get/set attrbibute commands into cache coherency protocol (attributes are cached and flushed to the server at a writeback time or on demand because of cache coherency protocol)
  • crypto threads pool's cleanup
  • 2.6.27 rebase
  • optimize read/write locking and number of messages needed for cache coherency management
  • debug cleanups
  • bug fixes

You know where to get the latest version, don't you: git tree and archive are always opened.

POHMELFS distributed locks solution.

Tagged:  

Solution I found looks very clean, quite simple and, the main part, I like it: now every get or set callback (generally speaking, for example directory reading is a get callback, while data writing is a set one) is guarded by appropriate distributed lock type. So when one system updates directory content or write to the object (writing data to the file or changing some attributes), it is not sent to the server and thus is not broadcasted to the other clients. Instead it lives in the local filesystem cache until client receives a message from the server to flush its data because of cache coherency protocol. When another client requests the same object, it is first synced to the server, and its new attributes are sent to the requested client.

So far this implemenation is under the testing, and revealed problem case with turned on cryptography, so after this issue is fixed, POHMELFS will get a new release.

Inotify PIDs and security.

Tagged:  

Looks like my inotify patch was rejected because of security violation. This may sound as somewhat security flaw to allow any process to watch what IO is performed by other processes. Well, it is a valid observation, so I created new version, which only puts PID into the inotify message when either UID of the inotify backend is 0 or equals to UID of the process doing IO. So far without any comments though.

Since I can not get that this it a security flaw, and arguing about that will be endless, I accept that this limitation is valid. In the same line of 'security' flaws could be placed following small information leak I found in inotify.

In classical UNIX permission model it is not allowed to get directory listing if it does not have read permission and change directory to given one if it does not have execute bit. Even if you have created directory with some permissions/owner bits, switched current dir to this newly creaated, and then changed its permissions/owner bit, you will not be able to see neither its content nor newly created objects . Here is an example:

$ mkdir /tmp/test
$ chmod 700 /tmp/test
$ cd /tmp/test
/tmp/test$ ls -lai
total 24
9486756 drwx------  2 zbr  zbr   4096 2008-11-10 15:40 .
 123969 drwxrwxrwt 34 root root 20480 2008-11-10 15:40 ..
/tmp/test$ sudo chown 0.0 .
[sudo] password for zbr:
/tmp/test$ ls -lai
ls: .: Permission denied
/tmp/test$

With inotify you are able to watch what is being done in directory (or actually in any object) if you were able to attach a watch to its inode. So, if object had read permission, we are able to attach a watch to it, so if later it will change its permissions, watches will not be removed and we will be able to watch its content. Like this:

libionotify-1.1$ LD_PRELOAD=./libionotify.so ./inotify -r /tmp/
CREATE: /tmp/test
CREATE: /tmp/test/test1
WRITE : /tmp/test/test1
CREATE: /tmp/test/test2
WRITE : /tmp/test/test2
READ  : /tmp/test/test2

while in parallel we do:

$ mkdir /tmp/test
$ chmod 700 /tmp/test
$ cd /tmp/test
/tmp/test$ sudo chown 0.0 .
/tmp/test$ sudo dd if=/dev/zero of=./test1 bs=4k count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000326018 seconds, 12.6 MB/s
/tmp/test$ sudo dd if=/dev/zero of=./test2 bs=4k count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000321779 seconds, 12.7 MB/s
/tmp/test$ sudo cat ./test2

Fix would be to check inotify watch list for given inode when its permissions are changed.

Inotify is watching you!

libionotify

Tagged:  

libionotify is a IO notification library based on inotify. It allows application not to care about low-level inotify interface, but just provide a set of callbacks for existing events. Currently read, write, create and remove events are used. When multiple events form a mask, callback is invoked if at least one of them matches.

Example:

#include "notify.h"
 
static int notify_create_callback(struct notify_root *r, struct callback *c,
		struct notify_event *e)
{
	return notify_add_object(r, e->path, e->len, 0, 0);
}
 
static int notify_delete_callback(struct notify_root *r, struct callback *c,
		struct notify_event *e)
{
	return notify_remove_object(r, e->path, e->len);
}
 
static struct callback notify_callbacks[] = {
	{ .mask = CREATE_EVENT, .callback = notify_create_callback, .data = NULL, },
	{ .mask = DELETE_EVENT, .callback = notify_delete_callback, .data = NULL, },
};
 
int main()
{
	r = notify_init(num, notify_callbacks, sizeof(notify_callbacks)/sizeof(struct callback));
	if (!r)
		return err;
 
	err = notify_add_object(r, root_dir, strlen(root_dir), 0, 0);
	if (err)
		return err;
 
	while (1)
		sleep(1);
}

This code sets a watch on given root directory and then add or remove watches for all objects created or removed in the root directory itself and its subdirs.
struct notify_event used in callbacks is defined in notify.h header:

struct notify_event
{
        char                    *path;
        unsigned int            len;
        unsigned int            hash;
        unsigned int            event;
        unsigned int            cookie;
        __u64                   id;
        __u64                   ino;
};

path is a string containing full path based on provided root directory
len its length
hash - jenkins hash of the originated path (i.e. if new object is created or removed in dir, this contains hash of the directory path, and if read or write event happend, it contains hash of the filename itself)
event - what event has happend: read, write, create or remove. Definitions can be found in notify.h header. This can be a mask of multiple events.
cookie is used to determine move event, but it is unused now.
id and ino are private parameters provided to notify_add_object()

One could store private data in each callback structure and then use it in provided function.

Function declarations.

int notify_remove_object(struct notify_root *r, char *path, unsigned int len);
int notify_add_object(struct notify_root *r, char *path, unsigned int len, __u64 id, __u64 ino);
 
int notify_thread_add(struct notify_root *r);
void notify_thread_remove(struct notify_root *r);
 
struct notify_root *notify_init(int thread_num, struct callback *cbs, int callnum);
void notify_exit(struct notify_root *r);

First two are used to add and remove new objects from the watched list with some additional private data (id and ino). The last two are used to initialize and cleanup notification control structure. notify_thread_add() and notify_thread_remove() are used to add remove notification reading events on behalf of which callbacks are executed. Initial number of such threads is provided via notify_init() parameter.

Source code is always available from archive.

Partial write errors and recovery.

Tagged:  

An interesting thread was started in BTRFS maillist recently about features filessytem should conain to be actively used by some users. Besides that there was a good question rised about how to handle partial write errors.

Consider the case, when we have a sequence of writes finished with a barrier call, which in a theory would end up with perfectly performed action, but in a real life any write in that sequence may fail, it will be returned to the system, it will return it to the user or just mark page as bad, but any subsequent write succeeded as long as barrier call, so actually filesystem may belive that everything is ok except given failed writes. Now, if we have a power loss or disk removal, system is not in a consistent state, since suceeded subsequent writes might depend on the failed on (like directory metadata update with the failed file metadata).

That's the problem, which may be handled by the filesystem, which will split major updates and do not allow subsequent writes if previous one failed. But whatever filesystem is doing batched writes (afaics event ext* filesystems write journal entries not one after another, but flush it as a whole), it has a described problem, since failed write may be detected too late.

DST and POHMELFS use different approach, since network media they are working on is a way too unstable in that regard, we have to deal not only with power outages or disk swaps, but also with even temporal network outages, which are part of the usual life even in high-end networks. Both DST as block network storage and POHMELFS as a network filesystem utilize transaction approach, when number of meaningful operations may be combined into single entity, which will be fully repeated in case of some errors. In this case server will not reply with successful completion if intermediate write fails, and given transaction (including previous and subsequent writes, barriers and whatever else) will be resent. In case of reading from POHMELFS, this will be done from the different server (if it exists in the config).

This is not some kind of new feature of DST or POHMELFS, different kinds of transactions exist even in local filesystems, iirc journal update can be considered as such in ext4, but not data write and journal write as a whole, i.e. multiple dependant metadata updates may be not properly guarded by journal transactions, but I may be wrong; BTRFS likely uses transactions as a COW update, i.e. allocation of the new node and appropriate btree update, but for network filesystems this is an exceptionally useful feature.

Filesystem benchmarks

Tagged:  

Ext234, reiser34, xfs, jfs, btrfs in dbench, iozone, postmark, maildir pop/smtp macro-benchmark and simple file create/write/sync micro-benchmark.

Round 1
Round 2

Syndicate content