Tag Archives: DST

Distributed network storage


Hey, hackers, have you ever committed a patch like this:

  11 files changed, 4550 deletions(-)

I’m pretty sure this is a rather rare event. And the next Linux kernel will include one such brilliant patch: Greg “driver killer” Kroah-Hartman dropped the DST driver from the tree, so my cool network block device with tons of tasty features will rest in peace.

And I fully acknowledge this decision. Actually it was me who pushed that idea in the first place, since block devices are dead. In my opinion, only a few niches are left for this low-level stuff, and most if not all of them do not need something new, even if it has something the others do not have.

Let’s move forward and look for real distributed storage systems, which do not require additional metadata servers to maintain object info, which scale horizontally, which do not dedicate special nodes for network routing and so on. Let’s move to peer-to-peer solutions, which I believe are the future, at least the closest one, of the parallel distributed solutions. To the solutions like elliptics network.

There can be multiple interfaces to such storages, like filesystem one (future POHMELFS version) or block level access (like in NTT’s SheepDog), but no matter what, it should be much more than a simple point-to-point connection.

So, DST is dead. And this is good. Let’s not stay in the same place.

 drivers/staging/Kconfig           |    2
 drivers/staging/Makefile          |    1
 drivers/staging/dst/Kconfig       |   67 --
 drivers/staging/dst/Makefile      |    3
 drivers/staging/dst/crypto.c      |  733 ----------------------------
 drivers/staging/dst/dcore.c       |  968 --------------------------------------
 drivers/staging/dst/export.c      |  660 -------------------------
 drivers/staging/dst/state.c       |  844 ---------------------------------
 drivers/staging/dst/thread_pool.c |  348 ------------
 drivers/staging/dst/trans.c       |  337 -------------
 include/linux/dst.h               |  587 -----------------------
 11 files changed, 4550 deletions(-)

P.S. Found myself reading ‘Engineering a Compiler’ by Keith Cooper and Linda Torczon with much more interest and understanding than the Dragon compiler book by Alfred Aho.

DST update

I updated DST and pushed it ‘upstream’ (into drivers/staging). Besides the bio allocation optimization suggested by Jens Axboe, the update fixes an interesting bug I recently started to observe between DST and the block layer:

kernel BUG at drivers/scsi/scsi_lib.c:1310!
EIP is at scsi_setup_fs_cmnd+0x37/0x6c [scsi_mod]
Call Trace:
  [] sd_prep_fn+0x61/0x792 [sd_mod]
  [] __cfq_slice_expired+0x57/0x62
  [] elv_next_request+0xa8/0x14c
  [] scsi_request_fn+0x6c/0x3f5 [scsi_mod]

This is a BUG_ON() in scsi_setup_fs_cmnd(), which fires when there are no data segments attached to the block IO. Such a bio may be an empty barrier or the recently added data discard request, though.

At first I thought it was somehow related to the fact that I recently switched DST request queue initialization from plain and dumb allocation to the more advanced blk_init_queue(), although I do not use its features. Something like the request queue issuing a final barrier. Reverting the change did not fix the problem.

So I lazily googled and found commit 51fd77bd9f512ab6, which disables empty barriers (by completing the bio with an -EOPNOTSUPP error). This code was changed to only disable bios which have a data discard request and no flush prepare function. I used the same logic in DST and still got the same bug. After I enabled the log, I found that XFS issued a final barrier without attached data at umount time. I then used the same logic as in commit 51fd77bd9f512ab6, i.e. completing the bio with an -EOPNOTSUPP error, but XFS froze the system. Oops, I really do not know XFS at all, but successfully completing the bio (with 0 error) fixed the issue. Weird.

New POHMELFS vs NFS vs DST benchmarks

Here we go, the latest POHMELFS against drivers/staging DST and in-kernel async NFS.

Server hardware: 4-way Xeon (2 physical CPU + 2 HT CPU) server with 1 GB of RAM (actually it has 8, but high-mem was disabled), SCSI disk, default XFS 300 GB partition.
Client hardware: 4-way Core2 Xeon with 4 GB of RAM (again no appropriate high-mem option).
Gigabit Ethernet, in-kernel async NFS. 2.6.29-rc1 kernel.

Iozone tests for POHMELFS, NFS and XFS.

[Figures: Random read, Random write]

As we can see, read and write performance is way ahead of NFS, but random read is noticeably slower.

Bonnie++ benchmark for POHMELFS, NFS and DST.

[Figure: Bonnie IO data]

Bonnie was not able to calculate object creation/removal time for POHMELFS, since with local data writeback cache this is very fast compared to write-through NFS case.

So, POHMELFS operates fast. Even in its basic network filesystem mode. But I refer to the random read performance, which is not something we can be proud of :)
But I will work on this, and likely will start with read-ahead games on the server.

In contrast, dbench will not run very well on POHMELFS currently, since its rename operation is synchronous and rather slow (it forces inode sync to the server). After I switched to the system’s dcache, there are still untested areas which I am working on, so it is not yet pushed to drivers/staging, but it will be there quite soon.

DST is in drivers/staging

Greg pulled it into the tree today. And to fully satisfy drivers/staging policy, DST has the following changes pending:

  • documentation update pointed out by Randy Dunlap
  • bioset_create() optimization update pointed out by Jens Axboe

Which I will send after the weekend.


POHMELFS and DST changes

While preparing POHMELFS for drivers/staging I decided to run more benchmarks, and suddenly found a nasty bug in the way the filesystem processes the names of the objects.

Effectively POHMELFS uses path name as ID, so it does not suffer from the NFS -ESTALE problem, when object related to the received ID was (re)moved. It also greatly helps for the writeback cache implementation, when we do not need to sync name/id pairs on the server, which operates with names only.
For this purpose I implemented simple trie-like name cache in POHMELFS, which hosted names in the descending order starting from the root. There was a bug in the rename part of the algorithm, but while looking at the implementation I thought, what the hell, Linux already has a very scalable dentry cache, why do I need to naively reinvent it here.

So, I dropped my own implementation and started to use the dcache. It is not optimal though: for example, there is no helper to determine the path length, so the buffer has to be preallocated long enough. So far I use a hardcoded length of 256 bytes. I agree that this is not a particularly good part of the code, but that’s what I’m playing with right now. And you know, the results are quite interesting (besides the fact that I fixed a nasty bug which was triggered by dbench in particular), but POHMELFS still has slower random read performance compared to NFS (for some patterns NFS random read performance is higher than sequential read, so I think there are some tricky cheats besides request compounds).
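
To make the fixed-buffer approach concrete, here is a rough userspace sketch, not the POHMELFS code. It assumes a hypothetical `Dentry` object with only `name` and `parent` fields and a hypothetical `build_path()` helper. The path is assembled by walking from the object up to the root and prepending each name into a preallocated buffer, which is why the buffer size has to be guessed up front.

```python
import errno


class Dentry:
    """Toy stand-in for a dcache entry: a name and a link to the parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def build_path(dentry, buflen=256):
    """Walk up to the root, prepending names into a fixed-size buffer."""
    buf = bytearray(buflen)
    pos = buflen                      # the buffer is filled from its end
    while dentry.parent is not None:  # the root has no parent
        name = dentry.name.encode()
        if len(name) + 1 > pos:       # name plus leading '/' must fit
            raise OSError(errno.ENAMETOOLONG, "path does not fit the buffer")
        pos -= len(name)
        buf[pos:pos + len(name)] = name
        pos -= 1
        buf[pos] = ord("/")
        dentry = dentry.parent
    if pos == buflen:                 # asked for the root itself
        pos -= 1
        buf[pos] = ord("/")
    return buf[pos:].decode()


root = Dentry("")
print(build_path(Dentry("b.txt", Dentry("a", root))))  # /a/b.txt
```

With a 256-byte buffer any deeper or longer path simply fails with ENAMETOOLONG in this sketch, which is the weak spot of a hardcoded length.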

In the meantime I rebased DST to the latest git tree, where it is now possible to allocate a bio (block IO request) with several embedded bio_vecs (pages) and the ability to prepend some data without additional allocation, as suggested by Jens Axboe. This works by creating a memory pool of objects large enough to contain the bio itself and space for the requested object. Now one additional allocation in the DST export node is eliminated. But I cannot test this, since the machines are occupied by POHMELFS testing, so DST is not in drivers/staging yet.

Discussion revealed interesting moments, like:

Hm, then why can’t this whole thing just go into fs/dst/ right now? It’s self-contained, so there shouldn’t be any special “must live in staging” rule for filesystems before adding them.

Well… the case for merging drivers is usually pretty simple – the hardware exists, so we need the driver.
Whereas the “do we need this” case for new filesystems isn’t this simple.

FWIW we definitely want pohmelfs in staging…

So, let’s see how things will flow in a few moments…

DST has been asked for drivers/staging merge

DST is fully self-contained and really is not expected to get any changes in the future :)

POHMELFS is a bit more complex project, and it requires two exports from the Linux VFS, which are safe as is, but I’m waiting for Linus/Andrew to confirm that (we already talked about them with Andrew some time ago though).

In parallel I’m testing POHMELFS, and while it still shows superior performance compared to async in-kernel NFS, one of my systems refuses to mount it. It just says that it does not know the ‘pohmel’ filesystem type, not even entering the kernel. I do not yet know what the problem is, but it worked OK with the previous kernel (it was some -rc tree). Will investigate further and prepare the patches.

Also I would like to know what benchmark could be used for the multi-user parallel testing. I use iozone for the single-user load.

New distributed storage release.

DST is a network block device storage, which can be used to organize exported storages on the remote nodes into the local block device.

The main goal of the project is to allow creation of the block devices on top of different network media and connect physically distributed devices into single storage using existing network infrastructure and not introducing new limitations into the protocol and network usage model.

Tree was rebased against 2.6.28 kernel release.

For those who naively believe

in all that sweet talk about ‘tell a story’ and extended descriptions of the patches…

DST was released more than a week ago with two-page extended description of the ideas, implementations, features and use cases for the distributed storage. Each file was separately introduced with description of the content and rough usage cases in the project.

Guess the result? We talked a little with Arnd Bergmann and Benjamin Herrenschmidt about thread pools, mainly that it could be good idea to push it separately, and that likely David Howells’ slow_work patches will be pushed into the kernel as a thread pool implementation.

I will rebase against 2.6.28 and resend DST and POHMELFS today. Interested people are invited to the appropriate mailing lists to ask questions and discuss the needed features.

Dementianting goldfish in the new DST release.

DST is a network block device storage, which can be used to organize exported storages on the remote nodes into the local block device.

DST works on top of any network media and protocol; it is just a matter of the configuration utility understanding the correct addresses. The most common example is TCP over IP, which allows passing through firewalls and creating remote backup storage in a different datacenter. DST requires a single port to be enabled on the exporting node and outgoing connections on the local node.

DST works with an in-kernel client and server, which improves performance by eliminating unneeded data copies and removes the dependency on the version of external IO components. It requires a userspace configuration utility though.

DST uses a transaction model, where each store has to be explicitly acked by the remote node to be considered successfully written. There may be lots of in-flight transactions. When the remote host does not ack a transaction, it will be resent a predefined number of times with specified timeouts between them. All those parameters are configurable. Transactions are marked as failed after all resends have completed unsuccessfully. Having a long enough resend timeout and/or a large number of resends allows DST not to return an error to the higher (usually FS) layer in case of short network problems or remote node outages. In case of a network RAID setup this means that the storage will not degrade until transactions are marked as failed, and thus will not force checksum recalculation and data rebuild. In case of connection failure DST will try to reconnect to the remote node automatically. DST sends ping commands at idle time to detect whether the remote node is alive.
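
The resend logic described above can be sketched roughly as follows. This is a userspace toy and not the DST implementation: `Transaction`, `scan()`, `ack()`, `resend_timeout` and `max_resends` are hypothetical names chosen for illustration, and the real code tracks far more state.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """One in-flight store that must be acked by the remote node."""
    tid: int
    sent_at: float            # time of the last (re)send
    resends: int = 0          # how many times it has been resent so far
    state: str = "inflight"   # inflight -> acked | failed


def scan(transactions, now, resend_timeout, max_resends, send):
    """Periodic scan: resend timed-out transactions, fail exhausted ones."""
    for t in transactions:
        if t.state != "inflight" or now - t.sent_at < resend_timeout:
            continue
        if t.resends >= max_resends:
            # Only at this point would an error reach the upper (FS) layer.
            t.state = "failed"
        else:
            t.resends += 1
            t.sent_at = now
            send(t)


def ack(transactions, tid):
    """The remote node acknowledged the store: the write is considered done."""
    for t in transactions:
        if t.tid == tid and t.state == "inflight":
            t.state = "acked"
```

In this model a short outage costs only a few resends: as long as the ack arrives before `max_resends` is exhausted, the upper layer never sees an error.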

Because of transactional model it is possible to use zero-copy sending without worry of data corruption (which in turn could be detected by the strong checksums though).

DST may fully encrypt the data channel in case of an untrusted channel and implement strong checksums of the transferred data. It is possible to configure algorithms and crypto keys; they must match on both sides of the network channel. Crypto processing does not introduce noticeable performance overhead, since DST uses a configurable pool of threads to perform crypto processing.

DST utilizes memory pool model of all its transaction allocations (it is the only additional allocation on the client) and server allocations (bio pools, while pages are allocated from the slab).

At startup DST performs a simple negotiation with the export node to determine access permissions and size of the exported storage. It can be extended if new parameters should be autonegotiated.

DST carries block IO flags in the protocol, which allows it to transparently implement barriers and sync/flush operations. Those flags are used on the export node, where IO against the local storage is performed, which means that a sync write will be sync on the remote node too. This in turn improves data integrity and resistance to errors and data corruption during power outages or storage damage.

This release is a very minor update to the project, namely I extended userspace configuration utility to add command line options for the maximum IO size (in pages) and transaction scan timeout parameter (in milliseconds). Also updated documentation and Kconfig help text. And changed the name of course.


DST benchmark. IO elevators and TCP issues.

Bonnie++ benchmark of the latest DST version.

Previously released versions used blk_alloc_queue() for block IO queue allocation. This is a rather simple function which does not do anything special. It happened that with recent kernels it does not attach an IO elevator (also known as IO scheduler, the subsystem which may reorder and combine multiple requests into a single one to improve performance), so all DST requests went to the network as single-segment blocks. This kills performance, but I did not notice that, since this behaviour of blk_alloc_queue() was quite surprising for me (maybe I just imagined that, but previous DST releases, which worked with older kernels (before the DST rewrite), had an elevator attached). So the current DST tree uses blk_init_queue(), which is actually very different (there are two methods to handle IO requests), but includes IO scheduler initialization, so I use it with the request_fn() callback set to NULL and my own make_request_fn() to handle block IOs.
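
As a toy illustration of what an attached elevator buys here, the sketch below merges contiguous requests into bigger ones. It is only a model of the idea: real IO schedulers are far more involved, and `merge_requests()` with its `(start_sector, nr_sectors)` tuples is a hypothetical helper, not a kernel interface.

```python
def merge_requests(requests, max_sectors):
    """Merge contiguous (start_sector, nr_sectors) requests, capped at max_sectors."""
    merged = []
    for start, count in sorted(requests):
        if merged:
            last_start, last_count = merged[-1]
            contiguous = last_start + last_count == start
            if contiguous and last_count + count <= max_sectors:
                # Grow the previous request instead of queueing a new one.
                merged[-1] = (last_start, last_count + count)
                continue
        merged.append((start, count))
    return merged


# Three adjacent 8-sector requests become one, the distant one stays separate.
print(merge_requests([(8, 8), (0, 8), (16, 8), (100, 8)], max_sectors=256))
# [(0, 24), (100, 8)]
```

Without such merging every small block goes to the network as its own request, which matches the single-segment behaviour described above.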

This change allows the attached IO scheduler to combine multiple IO blocks, so now the client sends big requests to the server. Which brings us to the next problem: TCP ack storm.
When the client sends big enough read requests (by default up to 32 pages), for example when reading a big file from a filesystem created on top of DST, the server will reply with MTU sized frames, which are in turn transformed into MSS sized TCP blocks. With 1500 MTU we will get about 1460 MSS (depending on enabled options this may be a little bit smaller). The Linux TCP stack replies with a single ACK message to each two received messages, which is about 45 ACKs per above max request, which in turn messes with existing TCP states and introduces additional overhead for the server (amount of interrupts and TCP ack processing). This in turn kills read performance (about 1.5-2 MB/s bulk read performance when reading from a filesystem mounted over DST). This is also very surprising to me, since large read requests should not kill performance just because of lots of ACKs. I do not observe this behaviour in POHMELFS, for example. I also tested the same workload with SACK and FACK turned off, with essentially the same results.
DST has a maximum block IO size (in pages) parameter, and the best results are obtained when the client sends read requests of at most a single page (or two). Reading performance is then close to local filesystem performance. Writing performance in turn drops (from being equal to local filesystem performance).
Here is an example of what tcpdump shows:

20:40:08.836975 IP server.1025 > client.46340: . 1772313:1776507(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836978 IP server.1025 > client.46340: . 1776507:1780701(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836980 IP server.1025 > client.46340: . 1780701:1783497(2796) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836984 IP server.1025 > client.46340: . 1783497:1787691(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836987 IP server.1025 > client.46340: . 1787691:1791885(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836990 IP server.1025 > client.46340: . 1791885:1796079(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836993 IP server.1025 > client.46340: . 1796079:1800273(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836995 IP server.1025 > client.46340: . 1800273:1804467(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836998 IP server.1025 > client.46340: . 1804467:1808661(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837001 IP server.1025 > client.46340: . 1808661:1812855(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837003 IP server.1025 > client.46340: . 1812855:1817049(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837019 IP server.1025 > client.46340: . 1838019:1842213(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837095 IP client.46340 > server.1025: . ack 1713597 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837099 IP server.1025 > client.46340: . 1842213:1842401(188) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184289>
20:40:08.837100 IP client.46340 > server.1025: . ack 1716393 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837102 IP client.46340 > server.1025: . ack 1719189 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837346 IP client.46340 > server.1025: . ack 1721985 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837348 IP client.46340 > server.1025: . ack 1724781 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837350 IP client.46340 > server.1025: . ack 1727577 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837352 IP client.46340 > server.1025: . ack 1730373 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837353 IP client.46340 > server.1025: . ack 1733169 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837354 IP client.46340 > server.1025: . ack 1735965 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837356 IP client.46340 > server.1025: . ack 1738761 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837357 IP client.46340 > server.1025: . ack 1741557 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837359 IP client.46340 > server.1025: . ack 1744353 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837360 IP client.46340 > server.1025: . ack 1747149 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837362 IP client.46340 > server.1025: . ack 1749945 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837597 IP client.46340 > server.1025: . ack 1752741 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837600 IP client.46340 > server.1025: . ack 1755537 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837601 IP client.46340 > server.1025: . ack 1758333 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837602 IP client.46340 > server.1025: . ack 1761129 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837604 IP client.46340 > server.1025: . ack 1763925 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837605 IP client.46340 > server.1025: . ack 1766721 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837607 IP client.46340 > server.1025: . ack 1769517 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837608 IP client.46340 > server.1025: . ack 1772313 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837609 IP client.46340 > server.1025: . ack 1775109 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837611 IP client.46340 > server.1025: . ack 1777905 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837612 IP client.46340 > server.1025: . ack 1780701 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837846 IP client.46340 > server.1025: . ack 1783497 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837850 IP client.46340 > server.1025: . ack 1786293 win 13493 <nop,nop,timestamp 1184289 1316925>
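
The figure of about 45 ACKs per maximum request can be checked with simple arithmetic. The sketch below assumes 4096-byte pages and one ACK per two received segments; `acks_per_request()` is a hypothetical helper. In the capture above the ACK numbers advance by 2796 bytes, which would be two 1398-byte segments, so the segment size on that link appears to be somewhat below 1460 and the second call uses that value.

```python
PAGE_SIZE = 4096      # assumed page size
MAX_IO_PAGES = 32     # default maximum request size quoted in the post
MSS = 1460            # 1500 MTU minus 40 bytes of IP and TCP headers, no options


def acks_per_request(pages, page_size=PAGE_SIZE, mss=MSS, segments_per_ack=2):
    """Return (segments, acks) needed to carry one read reply of `pages` pages."""
    payload = pages * page_size
    segments = -(-payload // mss)             # ceiling division
    acks = -(-segments // segments_per_ack)   # one ACK per two segments
    return segments, acks


print(acks_per_request(MAX_IO_PAGES))            # (90, 45)
print(acks_per_request(MAX_IO_PAGES, mss=1398))  # (94, 47)
print(acks_per_request(1))                       # (3, 2)
```

A single-page request needs only a couple of ACKs, which fits the observation that small maximum IO sizes avoid the storm.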

What happened with TCP is so strange that I definitely want to dig deeper into the problem, but so far the only idea I have is a difference between POHMELFS and DST: the former sends the read reply as soon as the first page has been read, while the DST server sends the whole reply only when all requested data has been read, so DST produces huge bursts of IO and POHMELFS does not. I expected that DST would be a bit slower than AoE, but in some cases (writing performance and lots of small requests) the difference is noticeable because of the above tweak. In case of writing and small requests (especially random) NFS starts to behave comparably because of its ability to combine requests. AoE is a clear winner in the small requests case, since it is not limited by TCP ping-pong issues.

MTU 1500 data (dst with 1 and 2 pages as maximum IO size) :

Version        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP

local           16G 44100  99 68515  16 31931  12 46348  98 75432   7 288.8   0
nfs             16G 42498  98 68708   9 25402   6 30040  66 50496   5 350.6   0
aoe             16G 39055  99 72872  18 27076   9 38785  98 75510  11 201.6   0
dst             16G 39498  98 67943  19 25614  10 37676  91 72807  16  17.8   0
dst-nosack-nofa 16G 39439  98 67041  20 25789  12 37556  92 72802  17  90.7   0
dst-2           16G 38658  92 61499  20 30195  11 43373  93 72451  18  17.6   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
local            16   486   2 +++++ +++   418   1   486   2 +++++ +++   384   1
nfs              16  1619   5  8709  14  2708   7  2578   8 11918  14  1275   3
aoe              16  8581  49 +++++ +++  6626  34  8495  48 +++++ +++  5876  35
dst              16  2031  12 +++++ +++  2795  16  2196  11 +++++ +++   781   4
dst-nosack-nofac 16  1922  13 +++++ +++  2954  20  1830  12 +++++ +++  1079   7
dst-2            16  1949  12 +++++ +++  2503  16  1908  12 +++++ +++   846   5

Distributed storage release: sik tee or kunem kez.

I asked a friend to translate a very common phrase into a language that is not very common, in the kernel community at least, so that I could use it as a release name without insulting overly sensitive people. Because this is not just a phrase, but words which clearly describe my feelings about the Linux kernel review and feedback process. And maybe somewhat about the kernel itself.

This is the 10th renewed DST release and the third resend of the same code; I added comments, changed the name and turned some debug messages on and off for the previous releases. The last public DST comment was received more than a month ago, and it concerned whitespace placement in the patch. Since then the patch was sent 4 times, including changed whitespace and added documentation. And still no feedback. I ask for inclusion, but first of all I want to implement a good idea. This can be proved in discussion, and since it does not happen with people directly added to the To: list, I take this as there being no objections to the idea and implementation: either because no one cares or the patch is perfect.

To understand the roots of this issue, I made a simple experiment with the previous DST release. I added the following lines to the patch to catch reviewers’ eyes:

+ass licker
+static char dst_name[] = "Successful erackliss screwing into";

As you may expect, this does not compile and thus was never read by the people who are subscribed to the appropriate mailing lists. I got one private mail about this fact during the whole week. The same DST code (without the above lines) was first sent publicly more than a month ago and was resent 3 times after that.

That’s why I do not care about DST inclusion anymore. I do not care about its linux-kernel@ feedback.
I care about the project, so if you use it, send mail directly to me and to the soon-to-be-opened DST mailing list (there will be mailing lists for every project in active state), and the problem will be fixed with the highest priority.

POHMELFS and DST usage cases, design and roadmap.

POHMELFS kernel client at its current state and by design is a parallel network client to the distributed filesystem (called elliptics network in the design notes) itself. So you can consider it like parallel NFS (but with fast protocol), where parallel means read balancing and redundancy writing to multiple nodes. POHMELFS heavily utilizes local coherent data and metadata cache, so it is also very high-performance, but that’s it: a simple client, which is able (with server’s help of course) to maintain a coherent cache of data and metadata on multiple clients, which work with the same servers.

So, right now it is effectively a way to create data servers, where the client sees only a set of identical nodes and balances operations (when appropriate) between them. The server itself works with a local filesystem, which can be built on top of whatever RAID combinations of disks or DST nodes you like. So effectively in this case a POHMELFS mounted node can be extended by growing the local filesystem on the server.

That’s what the current state of POHMELFS is. The next step in this direction is to extend the server only (modulo some new commands to the client if needed and proper statistics) and add distributed facilities. By design there will be a cloud of servers, each of which has its own ID, and data is distributed over this cloud, which has the working title elliptics network. This name somewhat reflects the main algorithm of the ID distribution. The cloud will not have any dedicated metadata servers and every node will have exactly the same priority and usage case. The POHMELFS kernel client (and thus the usual user, who works with a mounted POHMEL filesystem node) connects to an arbitrary server in the cloud and asks for the needed data; this request is transformed into the elliptics network format and forwarded to the server which hosts it, the data is returned to the client, and the client does not even know that it was stored elsewhere. In this case it is possible to infinitely extend the server space by adding new nodes, which will automatically join the network and will not require any work from the client or administrator. Redundancy is maintained by having multiple nodes with the same ID, so the client will balance reading between them and write to them simultaneously. The node join protocol will maintain coherency of the updated and old data.
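
The post does not spell out the actual ID distribution algorithm, so the following is only a guess at the general shape: a generic ring lookup in which an object name is hashed into the ID space and served by the first node ID at or after it, wrapping around. The `Ring` class, `object_id()`, the 32-bit ID space and the use of SHA-1 are all assumptions made for illustration. Nodes that join with the same ID act as replicas, as described above.

```python
import bisect
import hashlib


def object_id(name, bits=32):
    """Hash an object name into the ID space (SHA-1 is an arbitrary choice)."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class Ring:
    def __init__(self):
        self.ids = []      # sorted unique node IDs
        self.nodes = {}    # node ID -> list of server addresses (replicas)

    def join(self, node_id, address):
        """Add a server; servers sharing an ID hold the same data."""
        if node_id not in self.nodes:
            bisect.insort(self.ids, node_id)
            self.nodes[node_id] = []
        self.nodes[node_id].append(address)

    def lookup_id(self, oid):
        """Replicas for an object ID: first node ID >= oid, wrapping around."""
        if not self.ids:
            raise LookupError("no nodes in the ring")
        idx = bisect.bisect_left(self.ids, oid) % len(self.ids)
        return self.nodes[self.ids[idx]]

    def lookup(self, name):
        return self.lookup_id(object_id(name))
```

In such a scheme any server can compute where an object lives from the name alone, so no dedicated metadata server is needed to answer the lookup.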

A proof-of-concept implementation is scheduled for the next month or so; this should be a working (but simple enough for a start) library which can be used by other applications. Then I will integrate it with the existing POHMELFS server. These are optimistic timings though, it depends on how many bugs will be found in all the projects I maintain :)

DST is a network block device. It has a dumb protocol, which allows to connect two machines and use read/write commands between them (each command is effectively a pointer to where data should be stored and its size). So, there is always one machine, which exports some storage, and one which connects to it. There are no locks, no protection against parallel usage, nothing. Just plain IO commands. System can start several connections to the remote nodes and thus will have multiple block devices, which appear like local disks. Administrator can combine those block devices into single one via device mapper/lvm or mount btrfs on top of multiple nodes. And then export it via POHMELFS to some clients or work with it locally.
I consider DST as a completed project.

I did not write more detailed feature description of both POHMELFS and DST and how they are used in the failover cases or data integrity, it is always possible to grab those design notes from the appropriate homepages.

IT development roadmap.

This will be a fairly short post about which projects are included in the nearest roadmap and what their status is. I wrote IT because I will describe programming projects only, and not electronics for example.

So, let’s start with existing two: DST and POHMELFS.
The former is essentially ready (with the fun experiment I ran with the latest version, kernel and public releases) and will be pushed upstream for some time. The next version will be released in a week or so and will only have a new name. That will be the 10th distributed storage release, and the 4th resend of the same code.
Recently released POHMELFS got a bug just after release, and it is supposed to be fixed in the current version (pull from both kernel and userspace POHMELFS git repos). So far I do not see any new major changes in the POHMELFS client code, so essentially the kernel side will only be extended when this is required by distributed server changes. I will not push it upstream until the server side is also close to being finished.

Obviously both projects will be maintained and bugs will be fixed with the highest priority.

Here we come to the new project I have been thinking about for some time already. This is a network storage server built on top of a distributed hash table design, without centralized architecture and without the need for metadata servers. So far there is not much code, only trivial bits, and I’m designing the node join/leave part of the protocol. Some results are expected sooner rather than later, but not immediately.

And of course the brain needs something to play with for the rest. Here comes the language parser (LISP XML parser) I mentioned previously, and some computer language application based on this idea. That’s what I’m about to work on for the next days.

As another programming exercise I frequently think about buying myself a PlayStation 3 and playing with its SPU processors and parallel applications, for example graphics algorithms (the first one which comes to mind is wavelet transformation and non-precise searching of images, which I did several years ago; I even have the sources hidden in some old arch repo). Or playing with video card engines.
This does not have a very high priority though.

Those are the programming plans. Let’s see if anything will be completed anytime soon :)

New distributed storage release.

DST is a very feature-rich network block device capable of being a base block in high-performance and highly available local/network RAIDs.

This release contains some code movement from structure to structure and small comment cleanups.
It is the 9th official release, the second resend of essentially the same code and at least the third request for inclusion.

 drivers/block/Kconfig           |    2 +
 drivers/block/Makefile          |    2 +
 drivers/block/dst/Kconfig       |   14 +
 drivers/block/dst/Makefile      |    3 +
 drivers/block/dst/crypto.c      |  731 +++++++++++++++++++++++++++++
 drivers/block/dst/dcore.c       |  973 +++++++++++++++++++++++++++++++++++++++
 drivers/block/dst/export.c      |  664 ++++++++++++++++++++++++++
 drivers/block/dst/state.c       |  839 +++++++++++++++++++++++++++++++++
 drivers/block/dst/thread_pool.c |  345 ++++++++++++++
 drivers/block/dst/trans.c       |  335 ++++++++++++++
 include/linux/connector.h       |    4 +-
 include/linux/dst.h             |  587 +++++++++++++++++++++++
 12 files changed, 4498 insertions(+), 1 deletions(-)

I added Andrew Morton to the recipient list; let’s see how many comments it will get this time.
Not sure if any. There was a single review (by Andrew only), after which only some whitespace issues were left unresolved. The updated version never received any comments.

As usual, the patch is available in the archive or via the git tree.

New distributed storage release.

Essentially it is a resend of the old release with trivial changes:

  • whitespace cleanups
  • new name, this time about Armenian mountains (prompted by talking about the Koran with Leontin)
  • email changes

As usual, get it from GIT tree or archive.

Partial write errors and recovery.

An interesting thread was recently started on the BTRFS mailing list about the features a filesystem should contain to be actively used by some users. Besides that, a good question was raised about how to handle partial write errors.

Consider the case where we have a sequence of writes finished with a barrier call. In theory this ends up as a perfectly performed action, but in real life any write in that sequence may fail. The error is returned to the system, which either returns it to the user or just marks the page as bad, but every subsequent write, as well as the barrier call, succeeds, so the filesystem may believe that everything is OK except for the given failed writes. Now, if we have a power loss or disk removal, the system is not in a consistent state, since the succeeded subsequent writes might depend on the failed one (like a directory metadata update that depends on the failed file metadata write).
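
The failure mode described above can be shown with a toy model. This is only a sketch, not kernel or filesystem code: struct write_op, submit_batch() and count_dangling() are hypothetical names invented for the illustration, and a "disk" here is nothing more than a flag per write. It assumes a batch is pushed out as a whole, so a failed write is noticed only after later writes have already landed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one write in a batch that ends with a barrier. */
struct write_op {
    int depends_on;  /* index of an earlier write this one relies on, or -1 */
    bool will_fail;  /* simulate a media error for this write */
    bool on_disk;    /* set once the write has reached the disk */
};

/*
 * Submit the whole batch in order. A failed write is only counted;
 * it does not stop the writes that follow it.
 * Returns the number of failed writes.
 */
int submit_batch(struct write_op *ops, size_t n)
{
    int failed = 0;

    for (size_t i = 0; i < n; ++i) {
        /* a write may only depend on a write submitted before it */
        assert(ops[i].depends_on < (int)i);

        ops[i].on_disk = !ops[i].will_fail;
        if (ops[i].will_fail)
            ++failed;
    }
    return failed;
}

/*
 * Count writes that are on disk although the write they depend on
 * is not. Any such write means the state is inconsistent after a crash.
 */
int count_dangling(const struct write_op *ops, size_t n)
{
    int dangling = 0;

    for (size_t i = 0; i < n; ++i) {
        int dep = ops[i].depends_on;

        if (ops[i].on_disk && dep >= 0 && (size_t)dep < n &&
            !ops[dep].on_disk)
            ++dangling;
    }
    return dangling;
}
```

With a batch of three writes where the second one (file metadata) fails and the third one (directory update) depends on it, submit_batch() reports one failure, yet count_dangling() still finds one write on disk whose dependency is missing.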

That problem can be handled by a filesystem that splits major updates and does not allow subsequent writes if a previous one failed. But any filesystem that does batched writes (afaics even ext* filesystems do not write journal entries one after another, but flush them as a whole) has the described problem, since a failed write may be detected too late.

DST and POHMELFS use a different approach, since the network media they work on is way too unstable in that regard: we have to deal not only with power outages or disk swaps, but also with temporary network outages, which are part of usual life even in high-end networks. Both DST as a network block storage and POHMELFS as a network filesystem use a transaction approach, where a number of meaningful operations may be combined into a single entity, which will be fully repeated in case of errors. In this case the server will not reply with successful completion if an intermediate write fails, and the given transaction (including previous and subsequent writes, barriers and whatever else) will be resent. In the case of reading from POHMELFS, this will be done from a different server (if one exists in the config).
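
The resend logic can be sketched in a few lines. This is not the actual DST or POHMELFS code: struct txn_op, struct server, server_apply() and send_transaction() are hypothetical names, the "server" is an in-memory array, and the failure injection (the middle write of an attempt fails while failures_left is positive) is an assumption made purely to have something to retry against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 8

/* One write inside a transaction (hypothetical layout). */
struct txn_op {
    int block;
    int value;
};

/* Toy server: a few blocks plus a counter of failures to inject. */
struct server {
    int blocks[NBLOCKS];
    int failures_left;
};

/*
 * Apply all writes of a transaction in order. If an intermediate write
 * fails, stop and do not report successful completion; writes applied
 * so far stay in place and are simply overwritten by the resend.
 */
bool server_apply(struct server *s, const struct txn_op *ops, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        assert(ops[i].block >= 0 && ops[i].block < NBLOCKS);

        if (s->failures_left > 0 && i == n / 2) {
            s->failures_left--;
            return false;
        }
        s->blocks[ops[i].block] = ops[i].value;
    }
    return true;
}

/*
 * Client side: resend the whole transaction until the server confirms it.
 * Returns the number of attempts used, or -1 if max_attempts was exhausted.
 */
int send_transaction(struct server *s, const struct txn_op *ops, size_t n,
                     int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (server_apply(s, ops, n))
            return attempt;
    }
    return -1;
}
```

The point of the design is that the client never has to work out which individual write failed: the unit of retry is the whole transaction, and because the writes are plain overwrites, repeating the ones that already succeeded is harmless.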

This is not some kind of new feature of DST or POHMELFS; different kinds of transactions exist even in local filesystems. Iirc a journal update can be considered as such in ext4, but not the data write and journal write as a whole, i.e. multiple dependent metadata updates may not be properly guarded by journal transactions, but I may be wrong. BTRFS likely uses transactions in the form of a COW update, i.e. allocation of the new node and the appropriate btree update. But for network filesystems this is an exceptionally useful feature.

DST mainline status.

Likely I messed up again :)

I agree with Andrew Morton that it had a small amount of documentation, so I extended it. I also fixed long lines and did other cleanups. But I did not change some spaces/braces issues.


for (i=0; i<10; ++i)


for (i = 0; i < 10; ++i)


struct abcde
{
    int a, b, c;


struct abcde {
    int a, b, c;

I have to admit that I actually behaved very stupidly, and referring to other places in the code was not too smart either, but it is not the code that people comment on, but whitespace and braces. In other places spacing issues (as well as in DST) are not even detected without using the checkpatch.pl script.
And this is code which I will maintain, since no one (ever) asked about how things are implemented except Andrew, and he did that only because there were no comments. Of course I agree that having a completely different coding style will hurt, since even a simple review process will be disturbed by switching between styles, but the above example of spaces and braces is just beyond reasonable.

As an example, I was asked to read this string:
and now compare it to
w h a t t h e h e c k w a s w r i t t e n h e r e

IMHO both are crappy, that's why a reasonable amount of brain power should be applied to whatever we are doing. Not too much, and not too little.