Tag Archives: POHMELFS

POHMELFS development

Linux kernel workqueue latencies

The kernel usually uses workqueues for process-context execution. Pohmelfs uses them for its network processing engine, which runs on top of the socket stack.

Since it is a kernelspace entity, one would expect it to be at least somewhat fast and to have low latency.
But in the current Linux kernel (3.3.0-rc3) the latency is usually about 100 milliseconds.

That is a whopping 0.1 seconds just to schedule kernelspace work.

Yes, there is a bunch of debug options turned on in my kernel, and it is a virtual machine (vmware), but still, that is one tenth of a second!

pohmelfs: network raid1 example

Pohmelfs configuration is actually trivial:

# mount -t pohmelfs -o "server=,fsid=xxx,groups=3:2:1,noatime,noreadcsum,successful_write_count=1,sync_timeout=600,readdir_allocation=5" none /mnt/

where the ‘server’ mount option specifies an IP address in the form address:port:family (2 for IPv4, 6 for IPv6). It is OK to specify only a subset of all cluster IP addresses – pohmelfs will download the route table itself; it only needs at least one alive node at connection time to discover the other nodes.

The ‘groups’ mount option specifies the groups you want to write data into. A group is a kind of replica ID.
That’s all for pohmelfs.

Let’s configure elliptics and create 2 groups (with ids 2 and 3, for example), which will store essentially identical replicas (they may differ, since writes can be unordered, or one group may be down for some time).
There are 2 configuration files – the elliptics server config (let’s call it ioserv.conf) and the server-side script environment (we use python, so it is python.init).

Here is ioserv.conf
I will highlight the parameters which differ between the groups.

# log file
# set to 'syslog' without inverted commas if you want elliptics to log through syslog
log = syslog

# log mask
#log_mask = 10
log_mask = 15

# specifies whether to join storage network
join = 1

# config flags
# bits start from 0, bit 0 is unused (it is actually the above join flag)
# bit 1 - do not request remote route table
# bit 2 - mix states before read operations according to state's weights
# bit 3 - do not checksum data on upload and check it during data read
# bit 4 - do not update metadata at all
# bit 5 - randomize states for read requests
flags = 4

# node will join nodes in this group
group = 2

# list of remote nodes to connect
# address:port:family where family is either 2 (AF_INET) or 6 (AF_INET6)
# address can be host name or IP
remote =

# local address to bind to
# port 0 means random port
#addr = localhost:1025:2
addr =

# wait timeout specifies number of seconds to wait for command completion
wait_timeout = 60

# this timeout specifies number of seconds to wait before killing
# unacked transaction
check_timeout = 60

# number of IO threads in processing pool
io_thread_num = 64

# number of IO threads in processing pool dedicated to nonblocking operations
# they are invoked from recursive commands like DNET_CMD_EXEC, when script
# tries to read/write some data using the same id/key as in original exec command
nonblocking_io_thread_num = 32

# number of thread in network processing pool
net_thread_num = 64

# specifies history environment directory
# it will host file with generated IDs
# and server-side execution scripts
history = /opt/elliptics/history.2

# specifies whether to go into background
daemon = 1

# authentification cookie
# if this string (32 bytes long max) does not match to server nodes,
# new node can not join and serve IO
auth_cookie = qwerty

# Background jobs (replica checks and recovery) IO priorities
# ionice for background operations (disk scheduler should support it)
# class - number from 0 to 3
# 0 - default class
# 1 - realtime class
# 2 - best-effort class
# 3 - idle class
bg_ionice_class = 3
# prio - number from 0 to 7, sets priority inside class
bg_ionice_prio = 0

# IP priorities
# man 7 socket for IP_PRIORITY
# server_net_prio is set for all joined (server) connections
# client_net_prio is set for other connection
# is only turned on when non zero
server_net_prio = 1
client_net_prio = 6

# anything below this line will be processed
# by backend's parser and will not be able to
# change global configuration

# backend can be 'filesystem' or 'blob'

backend = blob

# zero here means 'sync on every write'
# positive number means data and metadata updates
# are synced every @sync seconds
sync = 300

# eblob objects prefix. System will append .NNN and .NNN.index to new blobs
data = /opt/elliptics/eblob.2/data

# Maximum blob size. New file will be opened after current one
# grows beyond @blob_size limit
# Supports K, M and G modifiers
blob_size = 500G

# Maximum number of records in blob.
# When number of records reaches this level,
# blob is closed and sorted index is generated.
# Its meaning is similar to above @blob_size,
# except that it operates on records and not bytes.
records_in_blob = 10000000

Our second replica will live in group 3, so you should change the above ‘group’ parameter to 3, as well as the node’s address and optionally the ‘remote’ parameter, which is a list of nodes to connect to. It can include the local address itself.

The second configuration file is python.init.
It must live in the directory specified in the ‘history’ parameter above.
You should put all srw/pohmelfs* scripts into the ‘history’ path too.

import sys

from libelliptics_python import *

# groups used in metadata write
pohmelfs_groups = [1, 2, 3]
pohmelfs_log_file = '/opt/elliptics/history.2/python.log'

log = elliptics_log_file(pohmelfs_log_file, 10)
n = elliptics_node_python(log)
# we should only add our own local group, since we do not want all updates to be repeated for all groups
# this should be changed to 3 for group number 3
# this is the IP address of the local node, i.e. the server which belongs to group 2
# you may specify multiple addresses with multiple calls
n.add_remote('', 1025)

__return_data = 'unused'

import gc

import struct
# python sstable implementation
from sstable2 import sstable

import logging
FORMAT = "%(asctime)-15s %(process)d %(script)s %(dentry_name)s %(message)s"
logging.basicConfig(filename=pohmelfs_log_file, level=logging.DEBUG, format=FORMAT)

pohmelfs_offset = 0
pohmelfs_size = 0
# do not check csum
#pohmelfs_ioflags_read = 256
pohmelfs_ioflags_read = 0
pohmelfs_ioflags_write = 0
# do not lock operation, since we are 'inside' DNET_CMD_EXEC command already
pohmelfs_aflags = 16
pohmelfs_column = 0
pohmelfs_link_number_column = 2
pohmelfs_inode_info_column = 3
pohmelfs_group_id = 0

def pohmelfs_write(parent_id, content):
	n.write_data(parent_id, content, pohmelfs_offset, pohmelfs_aflags, pohmelfs_ioflags_write)
	n.write_metadata(parent_id, '', pohmelfs_groups, pohmelfs_aflags)

Putting this initialization script (you may edit the one in the source tree) together with the pohmelfs scripts adds support for server-side scripts executed within the above context.
There is a pool of processes which pick up execution contexts to run your requests.

That’s it, feel free to ask if you hit any problem!

pohmelfs: call for inclusion

I’m pleased to announce a new and completely rewritten distributed filesystem – POHMELFS.

It went a long way from the parallel NFS design which lived in drivers/staging/pohmelfs for years effectively without a usage case – that design was dead.


Let’s see how this will end up.

pohmelfs got http compatibility support

Now an inode’s ID is generated as a hash of its full path.

And it is possible to read data from such an elliptics storage not only via pohmelfs, but also via HTTP and library APIs.
In particular, it is very convenient to upload a huge number of files into the storage via pohmelfs and then serve them to clients via common interfaces.

HTTP compatibility disables rename, since the key changes and the given object would possibly have to be moved to a different node.
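The path-to-ID mapping is easy to model client-side. Raw elliptics IDs are 64 bytes long, so SHA-512 of the full path is assumed here as the transform – a sketch of the idea, not necessarily the exact hash a given elliptics deployment is configured with:

```python
import hashlib

def path_to_id(path):
    """Map a full file path to a 64-byte storage ID.

    SHA-512 is an assumption here: elliptics IDs are 64 bytes,
    which SHA-512 conveniently produces, but the configured
    transform in a real cluster may differ.
    """
    return hashlib.sha512(path.encode('utf-8')).digest()

# The same path always yields the same ID, so any HTTP client that
# knows the path can compute the key without asking pohmelfs.
assert path_to_id('/photos/cat.jpg') == path_to_id('/photos/cat.jpg')
# Renaming changes the key, which is why rename is disabled in this mode.
assert path_to_id('/photos/cat.jpg') != path_to_id('/photos/dog.jpg')
```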

I also added a sync-on-close mount option, since pohmelfs keeps data in the local page cache for quite a long time and an object might not be written to storage until writeback fires. Sync forces data and metadata to be written to storage, making the object accessible to other clients.

I believe that’s all for pohmelfs. I will post a patch to lkml tomorrow; likely I will post 2 of them – one to remove drivers/staging/pohmelfs and one to put the new code into fs/pohmelfs, since it goes into production quite soon here.

pohmelfs got hardlink and socket/pipe support

While adding pipe/socket support is pretty much straightforward – they are just other inode types, which have a full VFS implementation in the kernel – hardlinks are much harder.

Basically, a hardlink is a link to an inode, so that updates made through one path end up as new data read from all other hardlinked paths. Removing one hardlink must not remove the whole file; instead there should be some kind of reference counter – we only remove the file’s data when there are no more hardlinks to the given file, and of course when we remove the file by its original path, we can not remove its data if it can still be accessed via its hardlinks.

POHMELFS uses elliptics as its data storage backend, which supports atomic transactions per replica and per-column data IO. So I put the reference counter update into an elliptics server-side script, which is called for every unlink and hardlink creation command.

Scripts are executed atomically against other operations for a given key, but only on a single replica. It is possible that different operations on the same key are executed on different replicas, but for unlink it does not matter – whether we drop the refcnt by removing ‘hardlink-1’ first and then the original path, or vice versa, the data will only be removed when the last reference drops to zero (actually to -1, since for performance reasons we do not create a reference counter for a plain object, only for its hardlinks).
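The refcount rules above can be sketched in a few lines of Python; a dict stands in for the elliptics link-number column, and the helper names are invented for illustration (the real server-side script reads and writes the counter through the node object):

```python
def create_hardlink(refcnt, key):
    """Called when a new hardlink to `key` is created.

    The first hardlink creates the counter; plain objects have
    none, so a file with N hardlinks stores the value N.
    """
    refcnt[key] = refcnt.get(key, 0) + 1

def unlink(refcnt, key):
    """Called for every unlink; returns True when data should be removed.

    A never-hardlinked object has no counter, so a single unlink
    yields -1 immediately and the data is removed right away.
    """
    count = refcnt.get(key, 0) - 1
    refcnt[key] = count
    return count < 0

refcnt = {}
create_hardlink(refcnt, 'file')      # original path plus one hardlink
assert not unlink(refcnt, 'file')    # drop one name: data is kept
assert unlink(refcnt, 'file')        # drop the last name: count hits -1
```

The order of the two unlinks does not matter, which is exactly why single-replica atomicity is enough here.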

For the last weeks we extensively tested POHMELFS in a lab, but it’s time to move it to a small testing environment which mimics the behaviour of one of our large systems. We will set up all the software we use daily there and start looking for real bugs.
Judging by our previous testing ‘mode’ (we synced about a hundred terabytes of files of different sizes), I do not expect any problems to arise.

Still, I want an http compatibility mode, where we save files with IDs generated from the full path. This implies that move is not supported, but it is a special case where we can afford this. Basically we want to write data via POHMELFS for performance and then read it via HTTP.

Stay tuned, I will send a kernel inclusion request next week. The first patch will replace drivers/staging/pohmelfs with this new code. Hopefully it will move into fs/ in some future release.

Pohmelfs got quorum read support

Due to the eventually consistent nature of elliptics there may exist a window in which replicas are not in sync.

Depending on various factors this window may be rather large; for example, in our production clusters we run this kind of check weekly, so a replica outdated for that long may require additional steps to get synchronized data.
We store metadata for every written object, which among other things contains the update timestamp. So it is quite a simple task to make parallel requests for metadata from multiple replicas in different datacenters and select the one with the latest timestamp.
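Once the metadata timestamps are in hand, the selection itself is trivial; a sketch with hypothetical group-id-to-timestamp pairs (the real metadata record holds more fields than a single timestamp):

```python
def order_groups_by_freshness(metadata):
    """Sort replica groups so reads hit the most recent copy first.

    `metadata` maps group id -> update timestamp (seconds); the ids
    and values below are illustrative only.
    """
    return [group for group, ts in
            sorted(metadata.items(), key=lambda kv: kv[1], reverse=True)]

# Group 3 was updated last, so it is read first; the stale group 1
# is only tried if the fresher replicas fail.
assert order_groups_by_freshness({1: 100, 2: 250, 3: 300}) == [3, 2, 1]
```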

This has existed in the elliptics API for quite a while, and now I have added it to pohmelfs.
At open() time we check all replicas and build a list of the groups where the object is present, sorted by timestamp, so any subsequent read will get the most up-to-date data.

As a bonus I added keepalive mount options to detect broken links early. In our tests, where datacenters may shut down on a weekly basis, waiting a couple of hours to detect that is a bit of an overhead.

There are 2 pohmelfs tasks to date – http compatibility mode, where the object ID is generated from its full path, and an umount bug, which is likely caused by my dirty dcache games.

I’ve just announced pohmelfs in linux-kernel

Here is the link: https://lkml.org/lkml/2011/12/23/161

As mentioned there, there are features yet to complete:

  • quorum read. pohmelfs supports quorum write only so far.
  • http compatibility mode – we do want to upload data via pohmelfs and read it through http applications. And vice versa actually too.
  • column read-write or more generally file-as-directory feature.
  • even more testing – abusing dcache is fun, but likely there are hidden stones
  • replace drivers/staging/pohmelfs with this code.

It is quite a lot of work, but not that much. Maybe a week or so to actually implement all that stuff and then to start real testing. Right now we run into system limitations quite fast – like doing a multi-terabyte rsync between a local raid and pohmelfs-mounted storage and hitting NIC limitations all the time – it is 1 Gbps after all (I want to play with 10 GigE, but still without luck). Pohmelfs uses the local VFS page cache, so it can easily saturate the bandwidth during writeback.
Since we also have to write multiple copies in parallel (we write 3, but only 2 groups are turned on to get quorum), write is limited by the network to 40-50 MB/s.

Also elliptics sometimes scares me – we sync data into eblob once per 300 seconds, and it takes about 3-10 seconds to actually write it to disks on every system. At that time the machine almost freezes – 100% disk utilization. And rsync sometimes loses its mind – it stops and times out, although the sync is already finished.

Another problem is related to eblob. We store 10+ gigabyte log files in our test case, and since eblob is append-only, a record can fill its quota for a given key, and a subsequent write will have to copy the previous record into a new chunk. Copying 10 GB takes some time too, and it also takes space – old records are just marked as deleted until offline defragmentation cleans them up.

And my favourite – dcache abuse. Sometimes I get kernel BUG at /home/apw/COD/linux/fs/dcache.c:873, which means that at umount time there are dentries with positive reference counters. I do not yet know whether it is pohmelfs or not, but it is more than likely the reason :)
I will fix that too, of course.

But overall pohmelfs is close to finish.

The next task is to implement indexing and embedded search in elliptics.
It is quite a good idea to have an out-of-the-box full-text search engine inside the storage system.

Stay tuned, new things are close!

POHMELFS and elliptics server-side

POHMELFS in the meantime got ‘noreadcsum’ mount option support as well as some other tricky bits.

This bumps read performance several times. eblob (the elliptics backend) stores data in contiguous chunks, increasing read/write performance, but optionally forcing data to be copied when the prepared space is not enough.
Reading with checksums enabled forces checksumming of the whole written area, which in turn requires populating it from disk into the page cache and so forth. For a gigabyte file this takes about 5-6 seconds (the first time).

And that’s only to read one page – i.e. it applies to every page (or readahead-requested chunk).
I am not sure that disabling read checksumming is a good idea, so I made it a mount option. Maybe eventually we will end up with some better solution.

I also fixed a nasty remount bug in POHMELFS which uncovered a really unexpected (for me at least) behaviour of the Linux VFS.
Every inode may have a ->drop_inode() callback, which is called each time its reference counter reaches 0 and the inode is about to be freed.
But sometimes, when the inode was recently accessed, it is not evicted, but placed into an lru list with special bits set.
The inode cache shrink code (invoked from the umount path in particular, but likely callable from the memory shrink path too) grabs all inodes in that lru list and later calls plain iput() on them, which in turn invokes ->drop_inode() for the inode in question.

Thus it is possible to get multiple invocations of the callback in question without reinitializing the inode between them. This crashed pohmelfs in some cases. Now it is fixed, with an appropriate comment in the code, but I wonder how many other such tricks are yet to be discovered?

POHMELFS is a great tester for elliptics server-side scripting support. I was lazy and put all the somewhat complex processing into server-side scripts written in Python. Anton Kortunov implemented a simple sstable-like structure in Python for the directory entries used by pohmelfs.

Since every command is processed atomically (on a single replica) in elliptics, we can put the complex directory update mechanism into these ‘kind-of-transactions’. In particular, server-side scripting is used to insert and remove inodes from a directory. Lookup is also implemented using server-side scripting – we read the directory inode in python code, search for the requested name, and return the inode information if something is found, which is sent back to pohmelfs.

Overall this takes about 2-13 msecs: receive a command from pohmelfs, ‘forward’ it to the pool of python executors (the srw project), where python code reads the directory inode data from elliptics (using elliptics_node.read_data_wait()), searches for the inode with the given name and sends it back to pohmelfs.
Insert takes about 30-150 msecs – the script reads the directory content, adds a new entry (or updates an old one) and then writes it back into the storage.

That’s how it looks in python – srw/pohmelfs_lookup.py

Given that we spend 10 msecs in such a not really trivial piece of code, I believe that my implementation is actually not that bad.
Those are recent news. Stay tuned for more!


In the meantime I rewrote pohmelfs from scratch and it has entered a heavy testing stage.
As promised, it became just a POSIX frontend to the elliptics network with weak synchronization. By using elliptics as its backend, it gets multiple-copy support, atomic transactions (in a single replica), multiple-datacenter support with IO balancing, checksums, namespaces and so on.

And by ‘weak synchronization’ I mean that writes are not visible to other users who mounted the external storage until the writer performs a sync or the writer’s host system decides to write back dirty pages to the storage.
This actually mirrors the behaviour of the VFS in all modern OSes – we write data into the page cache, and if the system hits a power failure, that data is lost. Even more – users are not synchronized in any way, and if one of them removes a file, another one will only detect that after reading the directory again (or trying to open/access the given filename).

There is a very interesting approach I use in directory listing. We store directory information as a record indexed by the directory key id. It is atomically updated (as in: in a single replica; multiple replicas are updated independently) for every written/removed object and hosts whole inodes indexed by dentry names.
Directory listing just reads that whole directory structure and parses it, adding inodes/dentries not at lookup time (this is supported too, of course), but at readdir time. Since records are stored as single contiguous areas in elliptics, we only have to download and sequentially iterate over this blob to complete the listing without multiple server lookups per name.
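The readdir pass can be sketched like this; the on-wire dentry layout used below (u32 name length, name bytes, u64 inode id) is invented for illustration and differs from the real record format:

```python
import struct

def pack_dentry(name, ino):
    """Serialize one dentry: u32 name length, name, u64 inode id."""
    raw = name.encode('utf-8')
    return struct.pack('<I', len(raw)) + raw + struct.pack('<Q', ino)

def iterate_dentries(blob):
    """Single sequential pass over the downloaded directory blob.

    No per-name server lookups are needed: the whole record arrives
    as one contiguous chunk and is walked locally.
    """
    offset = 0
    while offset < len(blob):
        (nlen,) = struct.unpack_from('<I', blob, offset)
        offset += 4
        name = blob[offset:offset + nlen].decode('utf-8')
        offset += nlen
        (ino,) = struct.unpack_from('<Q', blob, offset)
        offset += 8
        yield name, ino

blob = pack_dentry('README', 7) + pack_dentry('src', 8)
assert list(iterate_dentries(blob)) == [('README', 7), ('src', 8)]
```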

But we have to clean up the parent direntry list every time we are about to perform a directory listing, since other users may delete some files or rename them. For example, rsync creates ‘.blah.random-crap’ files first and then renames them to ‘blah’ when the copy is completed, which previously resulted in 2 files having the same inode and id.

There is no hardlink support yet, nor balanced/random reads from multiple replicas, nor quorum read, where we try to reach multiple replicas and select the one with the latest consistent data. Writes should also support a quorum option (at the very least, mark pages dirty again if a write did not reach the quorum or the requested number of replicas).

I also plan to add column read/write (this is definitely not a POSIX interface, but a kind of file-is-a-directory feature).
We also want HTTP API compatibility with elliptics, i.e. we write data via pohmelfs and read it via an HTTP (or any other) client, which uses the default id-is-a-name-hash approach.

This is all planned for future releases though; I plan to submit a new stable version in December, hopefully within a week or two.
Stay tuned!

Elliptics: server-side scripting

I added binary data support to server-side scripting (currently we only support Python) and extended the HTTP fastcgi proxy to support server-side script execution.

Previously we were only able to use C/C++/Python to make a server execute a script on our data – the API is rather simple and can be found in the examples in the binding dirs.
Now we can do that through HTTP.

Here is an example of how to set up python server-side scripting and run scripts on data posted through HTTP.
First of all, you have to put a python.init script into the ‘history’ directory (specified in the server config).
It may look like this:

import sys
from libelliptics_python import *

log = elliptics_log_file('/tmp/data/history.2/python.log', 40)
n = elliptics_node_python(log)
n.add_remote('elisto19f.dev', 1030)
__return_data = 'unused'

This creates a global context, which is copied for every incoming request.
Second, you may add some scripts into the same dir; let’s add a test script named test_script.py:

__return_data = "its nothing I can do for you without cookies: "

if len(__cookie_string) > 0:
    # here we can add an arbitrarily complex cookie check, which can go into elliptics to get some data, for example
    # the aforementioned global context ('n' in the above python.init script) is available here
    # so you can do something like data = n.read_data(...)
    __return_data = "cookie: " + __cookie_string + ": "

__return_data = __return_data + __input_binary_data_tuple[0].decode('utf-8')

The HTTP proxy places cookies specified in its config into the variable __cookie_string, which is accessible from your script. Anything you put into __return_data is returned to the caller (and eventually to the client who started the request).

__input_binary_data_tuple is a global tuple whose first element contains the binary data from your request.
When using the high-level API you may find that we send 3 parameters to the server:

  • script name to invoke (can be NULL or empty)
  • script content – code which is invoked on the server before the named script (if present); its context is available to the called script
  • binary data

That binary data is placed into the global tuple without modification. In Python 2.6 it holds a byte array; I do not yet know what it will be in older versions (basically, we limited support to python 2.6+ for now).

The HTTP proxy puts your POST request data into the aforementioned binary data. It also puts a simple string like __cookie_string = 'here is your cookie' into the ‘script content’ part of the request.
One can put whatever data one likes there when using the C/C++/Python API, of course.

Here is example HTTP request and reply:

$ wget -q -O- -S --post-data="this is some data in POST request" "http://elisto19f.dev:9000/elliptics?exec&name=test_script.py&id=0123456789"
  HTTP/1.0 200 OK
  Content-Length: 79
  Connection: keep-alive
  Date: Wed, 02 Nov 2011 23:24:10 GMT
  Server: lighttpd/1.4.26
its nothing I can do for you without cookies: this is some data in POST request

As you can see, we do not have cookies, so our script just concatenated a string with the binary data and returned it to the client. The name used in the POST parameters is actually the name of the server-side script to invoke. The ID is used to select the server to run this code on; otherwise it would run on every server in every group.

It is now possible to run our performance testing tools against the server-side scripting implementation (it is not straightforward – there is a separate SRW library which implements a pool of processes, each of which has its own initialized global context, which is copied for every incoming request, and so on).

We believe that the numbers will be good out of the box. Of course I expect performance to suffer compared to plain data reads or writes, but it should not be that slow – we still expect every request to complete within milliseconds even when it uses python to work with the data.
Our microbenchmark (every command executed on the server is written into the log with the time it took to complete) shows that timings are essentially the same – hundreds of microseconds to complete simple scripts.
So I believe that for IO-intensive tasks we will not be limited by python server scripts and/or various reschedulings.

I’m quite excited by the idea of adding (rather simple) full-text search for all uploaded content into elliptics.
And I’m working hard on this – expect some results really soon!

P.S. We have almost finished directory support for POHMELFS via server-side scripts; expect it quite soon too!

Elliptics network: server-side scripting for data processing engine

More and more commonly, storage systems are expected not just to save data to disk and provide fast access.
It is almost impossible to find a plain storage stack which does not perform some kind of data processing.
This may include applications ranging from simple client-side operations (like wrapping data in HTML in the web case) to complex real-time processing or batch jobs wrapped in mapreduce-like systems.

Writing most server-side work in low-level languages is not convenient and is frequently error-prone. Also, performance is commonly limited not by CPU power, but by disk IO speed.

So, I added server-side scripting support to elliptics. A separate lib (called libsrw) creates a pool of processes (since many scripting languages, like Python or the popular javascript v8, are single-threaded) which talk to the external world via pipes. Each process initializes a global context where it can run a user-provided script.
For example, in elliptics we store a client node, which is connected to the storage and has access to all local data.
Every execution command is started in a context which is copied from the global environment, and although the global context can be affected by a command (for example, a connection can be dropped by a remote node), all variables created for the executed scripts are destroyed when the command/script completes.

To date I have implemented the python context, and we plan to complete v8 javascript soon.
It is possible to create, for example, a simple checker on every node, where the system checks client credentials when processing data via a direct URL. Direct URLs are used when we do not want to proxy data through dedicated servers (like when streaming huge files), but instead allow the client to connect and send requests to the particular server which hosts the needed data objects.

Another example is batch data processing, like background data conversion from one format into another without the need to copy every object to a processing node. We also plan to use this mode for batch recovery – instead of checking replicas for every stored key, we can collect a bunch of objects which are known to be missing on some other node and later upload this blob to the remote host in one go.

Classical mapreduce can also be implemented on the servers. Actually, our contexts are exactly mappers, which have access to data and can iterate over all records, selecting those to ‘reduce’. In our scheme the reduce operation will run on the node which sent the request (or on the client). This may not be the optimal solution though (for example when the reduce data set is rather huge), so we could extend it in the future.
The plan is to allow execution contexts to store their state locally when needed, so that subsequent commands have access to already processed data (for example, consider the case when we count the number of links to a given URL – we generally do not want to check all files every time we start processing, but would rather only parse files uploaded since the previous run).
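The map/reduce split described above can be sketched in plain Python, with word counting standing in for a real selection job; all names here are illustrative, not the srw API:

```python
from collections import Counter

def map_on_server(local_records):
    """Runs inside a per-server execution context over local data only."""
    counts = Counter()
    for record in local_records:
        counts.update(record.split())
    return counts

def reduce_on_client(partials):
    """Runs on the node that sent the request, merging partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two servers hold different records; only the small partial counters
# travel over the network, never the records themselves.
server_a = map_on_server(['to be', 'or not'])
server_b = map_on_server(['to be'])
assert reduce_on_client([server_a, server_b])['to'] == 2
```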

Another application for server-side scripting is the directory structure implementation for POHMELFS. This may be implemented in low-level code like C/C++ though, and linked into the server binary. Storing and processing the complex structure on the server allows the client (which is in the kernel) to be really simple and not mess with complex locks.

What I want to play with right now is some kind of full-text search embedded into elliptics. The most naive implementation could just parse all written documents, build reverse indexes (using appropriate language morphology if needed) and store them in dedicated columns in elliptics as well. Writing this kind of script in Python or JavaScript is a rather simple task, which greatly extends the functionality of the storage kind of ‘for free’ – the server-grade processors in storage systems are usually underused.
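A naive version of such a reverse index fits in a dozen lines; the real script would store the index in dedicated elliptics columns instead of an in-memory dict, and would apply morphology instead of a bare split:

```python
def index_document(index, key, text):
    """Tokenize a document and map each token to the set of keys containing it."""
    for token in text.lower().split():
        index.setdefault(token, set()).add(key)

def search(index, token):
    """Return the keys of all documents containing the token."""
    return index.get(token.lower(), set())

index = {}
index_document(index, 'doc1', 'Elliptics stores data')
index_document(index, 'doc2', 'POHMELFS stores metadata')
assert search(index, 'stores') == {'doc1', 'doc2'}
assert search(index, 'data') == {'doc1'}
```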

Here is context initialization script for elliptics for example:

$ cat /tmp/test/history.2/python.init 
import sys
from libelliptics_python import *

log = elliptics_log_file('/dev/stderr', 40)
n = elliptics_node_python(log)
n.add_remote('localhost', 1025)
__return_data = 'unused'

We create a global elliptics node object ‘n’ (not a good name for such things of course, but that’s just an example), which connects to a storage server listening on localhost:1025. __return_data is a special variable used to return data from the execution context back to the client. It is also possible to write data into a file on the local filesystem or to upload results back into elliptics.

That’s how client requests may look (there are also API extensions for C++ and Python):

$ ./example/dnet_ioclient -r server:1025:2 -C1 -c "local_addr = '`hostname`'" -n script.py
$ ./example/dnet_ioclient -r localhost:1025:2 -C2 -c "__return_data = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)"

The latter example is just a plain python script executed on the server – it only reads data from elliptics.
The first example executes a script named script.py, which should exist on the server (the config specifies the scripting dir).
local_addr = '`hostname`' is part of the final script, which initializes some variable (it can be arbitrarily complex if needed).

Script (on the server) may look something like this:

$ cat /tmp/test/history.2/script.py
from time import time, ctime

ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>>  ' + ctime(time()) + ' --- ' + ret

Prior to execution, the server-side code concatenates the client’s part of the script (the string with local_addr in the above example) with the script itself, so the server executes the following code:

local_addr = 'zbr.localnet'
from time import time, ctime

ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>>  ' + ctime(time()) + ' --- ' + ret

Everything stored in __return_data is returned back to the client.
The next execution command reinitializes the context to the state which existed just after the initialization script ran (although the global elliptics node in our example can be internally modified).

That was our first set of changes to the data processing engine implementation in the elliptics network.
Stay tuned for more results!

POHMELFS got full read/write support

as well as directory listing and file creation.

No operations are group-aware yet, though – all writes go to a single group only, and reads (directory listing and object lookup) do not balance between multiple groups.

There was a fair number of inode/dentry hacks, and I suppose that populating the dentry cache from outside of the directory inode operations (like the ->create() callback) is not a good idea, but I ended up adding new dentries when reading directory content in ->lookup() or ->readdir().

POHMELFS also does not yet handle errors – i.e. it is pure luck that we do not crash if the server returns borked structures for a directory, for example. Network errors are not handled either – the client will not reconnect and will not even drop the connection if errors occur.

But it is a matter of time to clean things up. Stay tuned!

Initial POHMELFS commit

$ git commit -a --stat -m "Initial POHMELFS commit"
[master 19ce0b3] Initial POHMELFS commit
 13 files changed, 3301 insertions(+), 0 deletions(-)
 create mode 100644 fs/pohmelfs/Kconfig
 create mode 100644 fs/pohmelfs/Makefile
 create mode 100644 fs/pohmelfs/Module.symvers
 create mode 100644 fs/pohmelfs/dir.c
 create mode 100644 fs/pohmelfs/file.c
 create mode 100644 fs/pohmelfs/inode.c
 create mode 100644 fs/pohmelfs/net.c
 create mode 100644 fs/pohmelfs/packet.h
 create mode 100644 fs/pohmelfs/pohmelfs.h
 create mode 100644 fs/pohmelfs/route.c
 create mode 100644 fs/pohmelfs/super.c
 create mode 100644 fs/pohmelfs/trans.c

POHMELFS is able to create objects (files only so far) and perform directory listing.
No directories or file IO support yet. No object removal.
The directory structure is a trivial linear array, which does not scale.
And all that in freaking 3.3 thousand lines of code. And 5 real nights of coding.
So do not expect magic here; we will improve it over time.

POHMELFS works with elliptics, and it is possible to read (write and remove) binary data using the elliptics tools and/or via the HTTP proxy if you want to mess with raw IDs (64-byte numbers).

Stay tuned, fun things are about to begin!

Initial POHMELFS bits

I’m a bit busy with other boring tasks, so I only got a chance to hack on POHMELFS today. It is almost 1000 lines of code already, yet its functionality is virtually zero.
But I believe all the VFS interfaces are implemented, so I can now concentrate on the actual interaction with the remote elliptics storage.

$ wc -l fs/pohmelfs/{*.[ch],Makefile,Kconfig}
   41 fs/pohmelfs/dir.c
   27 fs/pohmelfs/file.c
  185 fs/pohmelfs/inode.c
   85 fs/pohmelfs/net.c
  104 fs/pohmelfs/pohmelfs.h
   96 fs/pohmelfs/pohmelfs.mod.c
  437 fs/pohmelfs/super.c
    7 fs/pohmelfs/Makefile
   11 fs/pohmelfs/Kconfig
  993 total

# mount -t pohmelfs -o "server=ipv4_addr1:1025:2;server=ipv6_addr2:1025:6" none /mnt/
# ls -laui /mnt/
total 4
1 drwxr-xr-x   2 root root    0 Aug 26 08:52 .
2 dr-xr-xr-x. 23 root root 4096 Aug 26 07:47 ..

Neither file creation nor directory content reading works yet. Both will be rather trivial for a start.
Maybe I will even implement server-side scripting instead and use it for directory updates, so that I do not have to create the leases needed (in the first release) for the read-modify-write loop of a directory update.

Or (what is more probable) I will just implement a read-modify-write loop for directory updates without server locks, which is a rather bad idea from a concurrency point of view, but it is the simplest approach and can serve as a base for future improvements.
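The lock-free read-modify-write loop can be sketched in a few lines of Python. The `read_data`/`write_data` helpers and the dict standing in for a remote node are assumptions for illustration, not the real elliptics API; the point is the race window between read and write-back that the text warns about.

```python
# Toy sketch of a lock-free read-modify-write directory update.
# 'storage' is a plain dict standing in for a remote elliptics node;
# read_data/write_data are hypothetical stand-ins for the real API.

storage = {}

def read_data(key):
    return storage.get(key, b"")     # current directory blob, empty if missing

def write_data(key, data):
    storage[key] = data

def dir_add_entry(dir_key, name):
    """Append a fixed-size entry to a linear directory blob.

    Without server-side locks or leases, two concurrent callers can both
    read the same old blob, and one update will silently be lost - this
    is exactly the concurrency problem mentioned above.
    """
    blob = read_data(dir_key)                 # read
    entry = name.encode().ljust(64, b"\0")    # modify: 64-byte linear entries
    write_data(dir_key, blob + entry)         # write back (race window!)

dir_add_entry("/mnt", "file1")
dir_add_entry("/mnt", "file2")
blob = read_data("/mnt")
names = [blob[i:i + 64].rstrip(b"\0").decode() for i in range(0, len(blob), 64)]
print(names)  # ['file1', 'file2']
```

The linear-array layout here is also the "trivial directory structure that does not scale" from the earlier post, which is what makes this the simplest possible starting point.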

Stay tuned, I plan to create kind of alpha working version very soon!

Pohmelfs in da haus

The Linux kernel hasn’t changed at all over the last 3 years – I wrote 600 lines of code, and it does not yet even connect properly to a remote elliptics node.

$ wc -l fs/pohmelfs/{*.[ch],Makefile,Kconfig}
   37 fs/pohmelfs/inode.c
   88 fs/pohmelfs/net.c
   58 fs/pohmelfs/pohmelfs.h
   62 fs/pohmelfs/pohmelfs.mod.c
  353 fs/pohmelfs/super.c
    7 fs/pohmelfs/Makefile
   11 fs/pohmelfs/Kconfig
  616 total

Data de-duplication in ZFS and elliptics network (POHMELFS)

Jon Smirl sent me a link describing a new ZFS feature – data deduplication.

This is a technique that stores multiple data objects in the same place when their content is identical, thus saving space. There are three levels of data deduplication – files (objects, actually), blocks and bytes. Each level stores a single entity for multiple identical objects, such as a single block for several equal data blocks, a shared byte range and so on. ZFS supports block-level deduplication.

This feature has effectively existed from the beginning in the elliptics network distributed hash table storage, which has two levels of data deduplication: object and transaction. Well, actually we have transactions only, but the maximum transaction size can be limited to some large enough block (megabytes or more, or unlimited if needed), so if an object is smaller than that, it is deduplicated automatically.

This basically means that if multiple users write the same content into the storage using the same ID, no new storage space is used; instead, the transaction log for the selected object is updated to show that two external objects refer to the given transaction.

Depending on transaction size this may have a negative impact: in particular, when a transaction is smaller than a log entry, it is actually a waste of space. But transactions are required for the log-structured filesystem and to implement things like snapshots and update history. By default a log entry is 56 bytes, so this should not be a problem in the common case.
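A toy Python sketch of transaction-level deduplication as described above. All names here are illustrative, not the real elliptics API; the key idea is that the transaction ID is derived from the content, so identical writes collapse onto the same slot and only a log reference is added.

```python
# Toy sketch of transaction-level deduplication: the transaction ID is
# a hash of the content, so identical writes map to the same slot and
# only a reference is appended to the per-object transaction log.
import hashlib

transactions = {}   # transaction ID -> stored data
tlog = {}           # transaction ID -> objects referring to it

def write(obj_name, data):
    tid = hashlib.sha512(data).hexdigest()  # 64-byte content-derived ID
    if tid not in transactions:
        transactions[tid] = data            # only the first writer pays storage
    tlog.setdefault(tid, []).append(obj_name)
    return tid

a = write("user1/file", b"same content")
b = write("user2/file", b"same content")
print(a == b, len(transactions), tlog[a])
# True 1 ['user1/file', 'user2/file']
```

Two users wrote the same bytes, but the storage holds a single transaction with two log references – the automatic deduplication the post describes.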

POHMELFS, as an elliptics network frontend, will support this feature out of the box, with no extra steps required.

POHMELFS transactions

It turned out that my previous idea of using the socket buffer and VFS pages is very wrong, mainly because of the nature of POHMELFS transactions. A transaction must stay in memory until the remote server acknowledges its data.
But what happens when a second write is about to update the same area? We can not overwrite the data, since then we would lose the previous transaction and there would be no way to resend it and store it elsewhere on timeout or other error. Instead we should allocate a new buffer and copy the data there. But this is not that simple, since we have to update the VFS page cache, and thus evict the previous page first. Also, all pages have to be somehow linked, so that when a transaction is committed, the appropriate pages can be freed.

Other filesystems, namely btrfs, wait until writeback is over on a page about to be overwritten, which may or may not be a good idea for an overwrite workload – I actually expect it to be a bad idea, especially for high-latency storage – but it is noticeably simpler to implement. The buffer heads used to track partial page updates are quite heavy and not really needed in my case, so I will implement trivial tags attached to pages: when an overwrite is about to happen, the system will wait for the pages in question to be flushed to the remote server, and then overwrite them in place, creating a new transaction.

The above tags are also needed for the usual writeback – we will not really write data at writeback time; instead we will find the transactions which refer to the given page and resend them. In the perfect case, which I expect to happen most of the time, there should be no such stale transactions at all, since they will be acked soon after write time, when we send the data to the server. But it is still possible that acks are slow, so writeback can fire for the inode.
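A toy model of this bookkeeping, with illustrative class and method names (not kernel code): pages carry tags pointing at the transactions that hold their data, writeback resends any still-unacked transactions, and an ack releases the page for overwrite.

```python
# Toy model of the page/transaction link: each transaction pins the
# pages it carries until the remote server acks it; writeback resends
# unacked transactions instead of writing page data directly.

class Transaction:
    def __init__(self, tid, pages):
        self.tid = tid
        self.pages = pages          # pages whose data this transaction carries
        self.acked = False
        for p in pages:
            p.transactions.append(self)

class Page:
    def __init__(self, index):
        self.index = index
        self.transactions = []      # the per-page 'tags' from the text

    def writeback(self, resend):
        # Do not write data here; resend any transaction still in flight.
        for t in self.transactions:
            if not t.acked:
                resend(t)

def ack(trans):
    trans.acked = True
    for p in trans.pages:
        p.transactions.remove(trans)   # page may now be freed or overwritten

page = Page(0)
t1 = Transaction(1, [page])
resent = []
page.writeback(resent.append)      # t1 not acked yet -> resent
ack(t1)
page.writeback(resent.append)      # nothing left to resend
print([t.tid for t in resent])     # [1]
```

An overwrite of `page` would have to wait until `page.transactions` is empty, which is exactly the flush-then-overwrite-in-place rule described in the previous paragraph.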

That’s the plan; now back to the drawing board to figure out how pages should be attached to transactions… Stay tuned!

POHMELFS, data integrity and versioning

As you might know, the new POHMELFS will be a fully versioned filesystem, since it works with the transactional log-structured distributed hash table called the elliptics network.

Object versioning implies that each update is a separate version, thus each write should be a separate transaction. We can do this in one of two ways: put data into a separate fixed-size storage and sleep when it is full, or allocate new storage for the new data in each write call.

The former is what the network stack does – there is a limited socket buffer, which we can fill either by copying data into it or by pinning external data pages. Either way, it has a limited size, and when the socket buffer is full no new writes are possible, so we either sleep or return an error.

The other way is a bit different: for a subsequent write into the same area we allocate new storage and copy the data there. Effectively both methods are the same, but in the first one we kind of ‘allocate’ from a fixed-size area, while in the second one we allocate from the main system memory allocator, where we block on reaching either some limit or when there is no more free memory.

Pages or data blocks can then be attached to a transaction, which commits them to disk or to a remote node. I decided to use the first method, so each write will allocate a transaction and its data will be sent to the remote nodes. If the socket buffer is full, the write will block.

This has a fair number of cons, namely the need to copy data twice – from userspace into the page cache and then from the page cache into the socket buffer. It is possible to use sendpage() and friends, but that would force us to have a per-inode write lock to order writes, and even that may not be enough, since the way sendpage() works clearly allows writing into the page being transmitted.
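The ‘fixed-size area’ allocation model chosen above can be sketched as a byte budget that writers block on. `SockBuf` and its methods are hypothetical names for illustration; the real mechanism is the kernel socket buffer, not a userspace class.

```python
# Sketch of the fixed-size allocation model: writes take bytes from a
# bounded budget (standing in for the socket buffer) and block when it
# is full, until earlier transactions are acked and bytes are released.
import threading

class SockBuf:
    def __init__(self, size):
        self.size = size
        self.used = 0
        self.cond = threading.Condition()

    def alloc(self, nbytes):
        with self.cond:
            while self.used + nbytes > self.size:
                self.cond.wait()     # the write sleeps: buffer is full
            self.used += nbytes

    def free(self, nbytes):          # called when a transaction is acked
        with self.cond:
            self.used -= nbytes
            self.cond.notify_all()   # wake blocked writers

buf = SockBuf(1024)
buf.alloc(800)          # first write fits
buf.free(800)           # ack arrives, space is reclaimed
buf.alloc(1000)         # now this fits too
print(buf.used)         # 1000
```

The attractive property is natural backpressure: writers are throttled by outstanding unacked transactions rather than by an explicit rate limit.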

Since I decided to send data at write time, the system has to hash the data at the same time to create the transaction ID (which becomes the data checksum). The hashes used to generate IDs are called transformation functions, since they ‘transform’ data into fixed-size IDs, and are provided as a mount option (HMAC is not supported for now, only plain hashes).
The Linux crypto API requires a preallocated crypto structure to work with, and its allocation as well as freeing is a rather costly process (there are global locks and potentially very long waits).
So I decided to preallocate and initialize a number of those crypto control structures at superblock allocation time, i.e. during mount option parsing.
A write operation will block waiting for a free crypto worker, which will process the data when ready. There is no pool of threads; instead the work is done on behalf of the writing process, thus scaling with the number of writers, but limited by the number of crypto control structures allocated at mount time. It will of course become a remount-configurable option, but later – there is no remount hook for now :)
This decision also has a number of cons, but its pros look very promising.
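The preallocated-crypto-workers idea can be sketched in userspace Python, with hashlib contexts standing in for kernel crypto structures and a semaphore bounding concurrency. The names are illustrative; the real limit comes from the structures allocated at superblock setup.

```python
# Userspace sketch of the preallocated crypto pool: a fixed number of
# hashing slots is chosen up front (as at mount time), and each write
# blocks until it can grab one, hashes its data into a transaction ID,
# then releases the slot.
import hashlib
import threading

class CryptoPool:
    def __init__(self, nr_workers, algo="sha256"):
        self.algo = algo
        self.sem = threading.Semaphore(nr_workers)  # bounds concurrent hashing

    def hash_data(self, data):
        self.sem.acquire()          # a write blocks here if the pool is exhausted
        try:
            h = hashlib.new(self.algo)
            h.update(data)
            return h.hexdigest()    # becomes the transaction ID / checksum
        finally:
            self.sem.release()

pool = CryptoPool(nr_workers=4)
tid = pool.hash_data(b"some page data")
print(tid == hashlib.sha256(b"some page data").hexdigest())  # True
```

Note the design point from the text: hashing runs on behalf of the writing process itself, so throughput scales with writers up to the pool size rather than being funneled through a dedicated thread pool.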

Architecture looks interesting, but only practice will draw the conclusion line. So, stay tuned!

Per mount-point inode and transaction caches

happened to be a bad idea. Well, it is not that bad, but it does not save anything: I want transformation functions to be replaceable at remount time, and having to rebuild the inode cache for that is not a good idea.

So I will have a list of destination IDs attached to each inode (and thus subsequently to transactions), and maybe eventually there will be the ability to attach filters so that some objects are stored with one set of transformation functions (and thus number of copies) and other objects with another. Like having all ‘*.txt’ files stored with sha1(name) and sha256(name) IDs while ‘*.sql’ gets only one copy. But I will leave this for the future.
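The per-inode list of destination IDs, including the hypothetical filter rules, could look roughly like this in Python (hashlib hashes stand in for the kernel transformation functions; the filter table and helper names are assumptions, not an implemented feature):

```python
# Sketch: each inode carries a list of destination IDs, one per
# transformation function configured for its mount or filter rule.
import fnmatch
import hashlib

# Hypothetical filter table: filename pattern -> transformation functions.
FILTERS = [
    ("*.txt", ["sha1", "sha256"]),   # two copies for text files
    ("*.sql", ["sha1"]),             # single copy for SQL dumps
]

def destination_ids(name):
    for pattern, algos in FILTERS:
        if fnmatch.fnmatch(name, pattern):
            return [hashlib.new(a, name.encode()).hexdigest() for a in algos]
    return [hashlib.sha1(name.encode()).hexdigest()]  # default: one copy

print(len(destination_ids("notes.txt")))  # 2 IDs -> stored twice
print(len(destination_ids("dump.sql")))   # 1 ID  -> stored once
```

Since the number of IDs per inode varies with the matched rule, this also illustrates why inode (and transaction) sizes differ between configurations, as the next post discusses.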

We are getting closer to having write support in the new POHMELFS connected to the elliptics network, but there is still a fair amount of work even for this rather simple task.

Stay tuned!

POHMELFS transactions and inode caches

I decided to depart from the common practice of having a single inode cache per filesystem and instead implemented per-mountpoint caches (and memory pools, for that matter). This is because a mount point allows specifying a different inode ID size, which is closely tied to the transformation mechanism used for the given mount.

So, for example, one can mount a remote storage and specify only one transformation function, while another mount has multiple functions, forcing each inode update to be redundantly stored under multiple IDs. Each inode has its IDs cached in the inode structure, so with a different number of transformation functions the inode size also differs. The same applies when various mounts use different transformation functions, which for the kernel POHMELFS client are plain crypto hashes.

Given that the elliptics network topology always allows data to be stored, no matter which ID it has, even when network connections to some servers are broken (the ability to store a given ID range is delegated to the neighbours of the failed nodes), each transaction now has to be acked by all remote nodes before it is considered complete. Thus a network transaction has the same per-mount issues as inodes, and may have a different size depending on the mount point.

So a transaction actually becomes a list of per-ID metadata objects used to create the network command for the given data. When all per-ID objects are acked by the remote servers, the transaction is complete.
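This completion rule is simple enough to sketch directly. The class and method names below are illustrative, not the kernel structures:

```python
# Sketch of a transaction as a list of per-ID metadata objects: the
# transaction completes only when every remote server has acked its ID.

class PerIdMeta:
    def __init__(self, dest_id):
        self.dest_id = dest_id
        self.acked = False

class Transaction:
    def __init__(self, dest_ids):
        # One metadata entry per destination ID, so the transaction size
        # depends on how many transformation functions the mount point
        # was configured with.
        self.ids = [PerIdMeta(i) for i in dest_ids]

    def ack(self, dest_id):
        for m in self.ids:
            if m.dest_id == dest_id:
                m.acked = True

    def completed(self):
        return all(m.acked for m in self.ids)

t = Transaction(["sha1-id", "sha256-id"])
t.ack("sha1-id")
print(t.completed())   # False - the second replica is not acked yet
t.ack("sha256-id")
print(t.completed())   # True
```

This is also why a partially failed network is tolerable: each per-ID entry can be resent to whichever node currently owns that ID range, and the transaction completes once every entry is individually acked.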