Pohmelfs got quorum read support

Tagged:  

Due to eventual consistency nature of elliptics there may exist a window in which replicas are not in sync.

Depending on various factors this window may be rather large, for example in our production clusters we run this kind of check weekly, so having replica outdated for that long may require additional steps to get synchronized data.
We store metadata for every written object, which among others contain update timestamp. So it is quite simple task to make a parallel request for metadata from multiple replicas in different datacenters and select the one with the latest timestamp.

This exists in elliptics API quite for a while, and now I added it to pohmelfs.
At open() time we check all replicas and build a list of groups, where it is present, sorted by timestamp, so any subsequent read will get the must up-to-date data.

As a bonus I added keepalive mount options to detect broken links early. In our tests where datacenters may shutdown on weekly basis waiting for couple of hours is a bit of overhead to detect it.

There are 2 pohmelfs tasks to date - http compatibility mode, where object ID is generated according to its full path, and umount bug, which is likely because of my dirty dcache games.

I've just announced pohmelfs in linux-kernel

Tagged:  

Here is the link: https://lkml.org/lkml/2011/12/23/161

As mentioned there, there are features yet to complete:

  • quorum read. pohmelfs supports quorum write only so far.
  • http compatibility mode - we do want to upload data via pohmelfs and read it through http applications. And vice versa actually too.
  • column read-write or more generally file-as-directory feature.
  • even more testing - abusing dcache is fun, but likely there are hiddens stones
  • replace drivers/staging/pohmelfs with this code.

It is quite a lot of work, but not that much of work. Maybe a week or so to actually implement all that stuff and then to start real testing. Right now we strike into system limitations quite fast - like doing multi-terabyte rsync between local raid and pohmelfs mounted storage and hitting NIC limitations all the time - it is 1 Gbps after all (and I want to play with 10 GigE but still without luck). Pohmelfs uses local VFS page cache so it can easily saturate bandwidth during writeback.
Since we also have to write multiple copies in parallel (we write 3, but only 2 groups are turned on to get quorum), write is limited by network 40-50 MB/s.

Also elliptics sometimes scares me - we sync data into eblob once per 300 seconds, and it takes about 3-10 seconds to actually write it to disks on every system. At that time machine almost freezes - 100% disks utilization. And rsync sometimes looses its mind - it stops and timeouts, although sync is already finished.

Another problem is related to eblob. We store 10+ gigabyte log files in test case, and since eblob is append-only, it can fill its quota for given key and subsequent write will have to copy previous record into new chunk. Copying 10 Gb takes some time too, and it also takes space - old records are just marked as deleted until offline defragmentation cleans it up.

And my favourite - dcache abuse. Sometimes I get kernel BUG at /home/apw/COD/linux/fs/dcache.c:873 which means at umount time there are dentries with positive reference counters. I do not yet know whether it is pohmelfs or not, but more likely that it is the reason :)
I will fix that too of course.

But overall pohmelfs is close to finish.

Next task is to implement indexing and embedded search in elliptics.
It is quite good idea to have out-of-the box full-text search engine inside storage system.

Stay tuned, new things are close!

POHMELFS and elliptics server-side

POHMELFS in a meantime got 'noreadcsum' mount option support as well as some other tricky bits.

This bumps read performance several times, since eblob (elliptics backend) stores data in contiguous chunks increasing read/write performance, but optionally forcing to copy data, when prepared space is not enough.
Reading with csum enabled forces to checksum the whole written area, which in turn requires to populate it from disk to page cache and so forth. For gigabyte file this takes about 5-6 seconds (first time).

And that's only to read one page. I.e. to read every page (or readahead requested chunk).
Not sure that disabling read checksumming is a good idea, so I made this a mount option. Maybe eventually we end up with some better solution.

I also fixed nasty remount bug in POHMELFS which uncovered a really unexpected (for me at least) behaviour of Linux VFS.
Every inode may have ->drop_inode() callback, which is called each time its reference counter reaches 0 and inode is about to be freed.
But sometimes when inode was recently accessed, it is not evicted, but placed into lru list with special bits set.
Inode cache shrink code (invoked from umount path in particular, but likely may be called from memory shrink path too) grabs all inodes in that lru list and later calls plain iput() on them, which in turn invokes ->drop_callback() for inode in question.

Thus it is possible to get multiple invocation of callback in question without reinitializing inode between them. This crashed pohmelfs in some cases. Now it is fixed with appropriate comment in the code, but I'm wonder how many other such tricks are yet to discover?

POHMELFS is a great tester for elliptics server-side scripting support. I was lazy and put all somewhat complex processing into server-side scripts written in Python. Anton Kortunov implemented simple sstable-like structure in Python for directory entries used by pohmelfs.

Since every command is processed atomically (on single replica) in elliptics, we can put complex directory update mechanism in this 'kind-of-transactions'. In particular server-side scripting is used to insert and remove inodes from directory. Lookup is also implemented using server-side scripting - we read directory inode in python code, search for requested name, and return inode information if something is found, which is sent back to pohmelfs.

Overall this takes about 2-13 msecs. I.e. receive command from pohmelfs, 'forward' it to pool of python executers (srw project), where python code will read directory inode data from elliptics (using elliptics_node.read_data_wait()), search for inode with given name there and send it back to pohmelfs.
Insert takes about 30-150 msecs - script reads directory content, adds new entry (or update old) and then writes it back into the storage.

That's how it looks in python - srw/pohmelfs_lookup.py

Given that we spend 10 msecs in such not really trivial piece of code, I believe that my implementation is actually not that bad.
Those are recent news. Stay tuned for more!

POHMELFS

Tagged:  

In a meantime I rewrote pohmelfs from scratch and it enters heavy testing stage.
As promised, it became just a POSIX frontend to elliptics network with weak synchronization. By using elliptics as its backend, it gets multiple copies support, atomic transactions (in single replica), multiple datacenter support with IO balance, checksums, namespaces and so on.

And by 'weak synchronization' here I mean, that all writes are not visible to other users, who mounted external storage, until writer performs sync or writer's host system decides to writeback dirty pages to the storage.
This actually mirrors behaviour of VFS in all modern OSes - we write data into page cache, and if system catches power failure, its data is lost. Even more - users are not synchronized in any way, and if one of them removes file, another one will only detect that after reading directory again (or trying to open/access given filename).

There is a very interesting approach I use in directory listing. We store directory information as a record indexed by directory key id. It is atomically (as in single replica, multiple replicas are updated independently) updated for every written/removed object and hosts whole inode indexed by dentry names.
But directory listing just reads that whole directory structure and parses it adding inodes/dentries not at lookup time (this is supported too of course), but at readdir time. Since records are stored as single continuous areas in elliptics, we only have to download and sequentially iterate over this blob data to get listing completed without multiple server lookups per name.

But we have to cleanup parent direntry list every time we are about to perform directory listing, since other users may delete some files or rename them. For example rsync creates '.blah.random-crap' files first and then renames them to 'blah' when copy is completed, which resulted in 2 files having the same inode and id previously.

There is no hardlink support yet, balanced/random reads from multiple replicas and quorum read, when we try to reach multiple replicas and select one with the latest consistent data. Writes also should support quorum option (at least mark pages back as dirty if write did not reach quorum or requested number of replicas).

I also plan to add column read/write (this is definitely not a POSIX interface, but kind of file-is-a-directory feature).
Also we want HTTP API compatibility with elliptics, i.e. we write data via pohmelfs and read it via HTTP (or any other) client, which uses default id-is-a-name-hash approach.

This is all is planned for future releases though, I plan to submit new stable version in December and better in a week or two.
Stay tuned!

Recent elliptics changes

Tagged:  

It was a while I wrote here last time, but there are a fair number things happened.

First, elliptics.
We fixed number of bugs in server-side scripting implementation, so it is very stable now and easily allows to create complex scripts which not only perform basic tasks, but also rather complex transactions, like read/process/update multiple records.

We added transaction locks, which allow to perform operations on given key atomically on single replica. In this case your server-side script, which reads data, updates its structure and uploads it back, will be performed atomically on single replica.

There are many other distributed storage systems, which allow atomic key operations, but they do not support datacenter replica split. Usually it is implemented (like in Oracle NoSQL or MongoDB) with single master node, which copies data to multiple replicas. This operation may even be synchronous and safe.
And it is possible to put replicas into different datacenters. But what happens, when you want to add more datacenters, but do not want to increase number of copies? In this scheme one has to rebalance replica sets.

In elliptics you just say which datacenters you want to use for given IO transaction. This forces non-atomic replica updates, which happen in parallel. This allows to have faster writes and more flexible setup, but we can not guarantee that replicas are updated synchronously.

And actually with mongodb scheme it is possible, that replicas will not be in sync, since there may be failed node which ost some data, so reads from this node will not return consistent results.

Real fix for this problem lies outside of the low-level storage subsystem. It is like forcing block device to lock against multiple processes writing into the same file. Instead this protection lives on higher layers and generally requires external locks, which are both known and used by concurrent users.

In distributed world this algorithm is based on Paxos with optimizations (like fast Paxos or Zookeeper Atomic Broadcast and friends). We do not yet have such subsystem.

Another change is embedded checksums. Previously we stored them in blobs after each data records, but never used. Instead there was a separate command to calculate and store checksum in metadata (separate column). This may resulted in stall checksum stored in metadata, so parallel read could fail with inconsistent data error. Right now metadata is just a storage for replica information and update/check status dates and other meta information (namespace and so on).
Each write updates checksum in blob atomically (again, this is not synchronized between replicas), so any read will always get correct check if data was not corrupted.

We also added bulk IO operations, namely read and write. Write issues many parallel writes and waits for all of them to complete, thus essentially being equal to the performance of the slowest nodes, while doing sequential write ends up being as long as sum of all writes in question.
Read does similar thing, but splits and optimizes read using offset sorting (i.e. that we read data from disk in sequential order).

Elliptics: server-side scripting

I added binary data support into server-side scripting (currently we only support Python) as well as extended HTTP fastcgi proxy to support server-side script execution.

Previously we only were able to use C/C++/Python to force some server to execute script on our data - API is rather simple and can be found in examples in binding dirs.
Now we can do that through HTTP.

Here is an example of how to setup python server-side scripting and run scripts on posted through HTTP data.
First of all, you have to put python.init script into 'history' directory (specified in server config).
It may look like this:

import sys
sys.path.append('/usr/lib')
from libelliptics_python import *

log = elliptics_log_file('/tmp/data/history.2/python.log', 40)
n = elliptics_node_python(log)
n.add_groups([1,2,3])
n.add_remote('elisto19f.dev', 1030)
__return_data = 'unused'

This creates global context, which is copied for every input request.
Second, you may add some scripts into the same dir, let's add test script named test_script.py:

__return_data = "its nothing I can do for you without cookies: "

if len(__cookie_string) > 0:
       # here we can add arbitrary complex cookie check, which can go into elliptics to get some data for example
       # aforementioned global context ('n' in above python.init script) is available here
       # so you can do something like data = n.read_data(...)
        __return_data = "cookie: " + __cookie_string + ": "

__return_data = __return_data + __input_binary_data_tuple[0].decode('utf-8')

HTTP proxy places cookies specified in its config into variable __cookie_string, and it is accessible from your script. Anything you put into __return_data is returned to the caller (and eventually to the client who started request).

__input_binary_data_tuple[0] is a global tuple which contains binary data from your request.
When using high-level API you may find, that we send 3 parameters to the server:

  • script name to invoke (can be NULL or empty)
  • script content - it is code which is invoked on the server before named script (if present), its context is available for called script
  • binary data

That binary data is placed into global tuple without modification. In Python 2.6 it holds byte array, I do not yet know what it will be in older versions though (basically, we limited support to python 2.6+ for now).

HTTP proxy puts your POST request data into aforementioned binary data. It also puts simple string like __cookie_string = 'here is your cookie' into 'script content' part of the request.
One can put there whatever data you like if using C/C++/Python API of course.

Here is example HTTP request and reply:

$ wget -q -O- -S --post-data="this is some data in POST request" "http://elisto19f.dev:9000/elliptics?exec&name=test_script.py&id=0123456789"
  HTTP/1.0 200 OK
  Content-Length: 79
  Connection: keep-alive
  Date: Wed, 02 Nov 2011 23:24:10 GMT
  Server: lighttpd/1.4.26
its nothing I can do for you without cookies: this is some data in POST request

As you see, we do not have cookies, so our script just concatenated some string with binary data and returned that data to the client. Name used in POST parameters is actually a name of the server-side script to invoke. ID is used to select server to run this code, otherwise it will run on every server in every group.

It is now possible to run our performance testing tools against server-side scripting implementation (it is not straightforward - there is separate SRW library which implements pool of processes each of which has own initialized global context, which is copied for every incoming request and so on

We believe that numbers will be good out of the box, and of course I expect that performance will suffer compared to plain data read or write, but it should not be that slow - we still expect that every request completes within milliseconds even when it uses python to work with data.
Our microbenchmark (every command executed on server is written into log with time it took to complete) shows that timings are essentially the same - hundreds of microseconds to complete simple scripts.
So I believe with IO intensive tasks we will not be limited by python server scripts and/or various reschedulings.

I'm quite excited with idea of adding full-text (rather simple though) search for all uploaded content into elliptics.
And I'm working hard on this - expect some results really soon!

P.S. We have almost finished directory support for POHMELFS via server-side scripts, expect it quite soon also!

Elliptics network: authentication bits, IO/net priorities and write performance

In a meantime elliptics got basic authentication support. It is not aimed at protecting against rogue invaders, but instead to protect against configuration error, where different clusters happen to connect to each other breaking route table and data forwarding.

Auth cookie can be set in elliptics config via 'auth_cookie' global parameter. By default it is empty string.
Nodes with different cookies can not join each other and will not reconnect if connection failed.

Cookie is transferred over unencrypted channel and is not supposed to add security bits, but instead to prevent misconfiguration. Storage channels are not usually cryptographically protected (at least not at this level - client should encrypt content if needed, or it can be done via server-side scripting), and all 'real' authentication happens at higher levels.

We also added IO priorities (for those IO schedulers which support it) - server-server IO operations (like data recovery) may get lower priority. This applies to network coloring - it is possible to assign different network priorities (as in socket(7), which may turn on TOS bits for example) to server-server and client-server traffic.

And small note on write performance.
Elliptics uses pool of IO threads to do actual low-level work. It is possible to have forward progress when some of threads are blocked in long writes for example, although usually system is kind of unresponsible at those moments. Having backlocg thread to perform your work means that there will be number of reschedulings to complete single work request.
Usually this is not a problem, since requests come in queue and IO threads pick them up as soon as processor/kernel permits.

But for single write request this may introduce unneeded latency, disappearing in batch load. Latency came up to 40 ms per single request, which is not acceptable for some cases. So we lightly changed send logic (added CORK usage) to eliminate unneeded rescheduling on header/data split reads and also dropped acknowledge sending in writes - if we wrote data successfully there is a proper reply already (information about where and how record was stored), otherwise negative error code is sent in ack message.

Those changes dropped single write request latency to sub-millisecond range, which should be fine for now - this is a median time (not including disk _sync_, but counting blob write) we perform write request when we have a queue of jobs and not just a single request.

And now I plan to implement a secondary index in elliptics via server-side scripting. For the starters, I will put all wikipedia articles and add prefix search (secondary index over written names), like returning list of urls starting with en.wikipedia.org/wiki/abc*.

Also will post Elliptics vs MongoDB benchmark on huge number of small records on single node.
Stay tuned!

Elliptics network: server-side scripting for data processing engine

Tagged:  

Storage systems more and more commonly demand not just ability to save data to disk and provide fast access.
It is almost impossible to find a plain storage stack, which will not perform some kind of data processing.
This may include applications starting from simple client-side operations (like wrapping data in HTML in web case) and finishing with complex real-time processing or batch jobs wrapped in mapreduce-like systems.

Writing most of server-side work in low-level languages is not convenient and frequently is error-prone. Also performance is not commonly limited by CPU power, but disk IO speeds.

So, I added server-side scripting support to elliptics. Separate lib (called libsrw) creates a pool of processes (since many scripting languages are single-threaded like Python or popular javascript v8) which talk to external world via pipes. Each process initializes global context where it can run user provided script.
For example in elliptics we store a client node, which is connected to the storage and has access to all local data.
Every execution command is started in context, which is copied from global environment, and although global context can be affected by command (for example connection can be dropped by remote node), all variables created for scripts to be executed are destroyed when command/script completes.

To date I implemented python context and we plan to complete v8 javascript soon.
It is possible to create, for example, a simple checker on every node, where systems checks clients credential when processing data via direct URL. Direct URLs are used when we do not want to proxy data through dedicated servers (like when streaming huge files), but instead allow client to connect and send requests to particular server which hosts needed data objects.

Another example is batch data processing, like background data conversion from one format into another without need to copy every object to processing node. We also plan to use this mode for batch recovery - instead of checking replicas for every stored key, we can collect a bunch of objects which are known to me missed on some other node and later upload this blob to remote host in one go.

Classical mapreduce can also be implemented on servers. Actually our contexts are exact mappers, which have access to data and can iterate over all records selecting those to 'reduce'. In our scheme reduce operation will run on the node, which sent request (or client). This may not be the optimal solution though (for example when reduce data set is rather huge), so we could extend it in the future.
Plan is to allow execution contexts to store its state locally when needed, so that subsequent commands have access to already processed data (for example consider the case, when we calculate number of links to given URL - we generally do not want to check all files every time we start processing, instead would like to only parse files uploaded since previous start).

Another application for server-side scripting is directory structure implementation for POHMELFS. This may be implemented in low-level code like C/C++ though and linked to server binary. Storing and processing complex structure on the server allows client (which is in kernel) to be really simple and do not mess with complex locks.

What I want to play right now is some kind of full-text search embedded into elliptics. The most naive implementation could just parse all written documents and build reverse indexes (using appropriate language morphology if needed) and store them in dedicated columns in elliptics as well. Writing this kind of scripts in Python or JavaScript is rather simple task, which greatly extends functionality of the storage kind of 'for free' - bunch of server-grade processors are usually unused in storage systems.

Here is context initialization script for elliptics for example:

$ cat /tmp/test/history.2/python.init 
import sys
sys.path.append('/tmp/dnet/lib')
from libelliptics_python import *

log = elliptics_log_file('/dev/stderr', 40)
n = elliptics_node_python(log)
n.add_groups([1,2,3])
n.add_remote('localhost', 1025)
__return_data = 'unused'

We create global elliptics node object 'n' (not a good name for such things of course, but that's just an example), which connects to storage server listening on a localhost:1025. __return_data is a special variable used to return data from execution context back to client. It is possible to write data into a file on the local filesystem or upload test results back into elliptics though.

That's how client requests may look (there are also API extensions for C++ and Python):

$ ./example/dnet_ioclient -r server:1025:2 -C1 -c "local_addr = '`hostname`'" -n script.py
$ ./example/dnet_ioclient -r localhost:1025:2 -C2 -c "__return_data = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)"

The latter example is just a plain python script executed on the server - it only reads data from elliptics.
First example executes a script named script.py, which should exist on the server (config specifies scripting dir).
local_addr = '`hostname`' is part of the final script which initializes some variable (it can be arbitrary complex if needed).

Script (on the server) may look something like this:

$ cat /tmp/test/history.2/script.py
from time import time, ctime

ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>>  ' + ctime(time()) + ' --- ' + ret

Prior being executed server-side code concatenates client's part of the script (string with local_addr in above example) with script itself, so server executes following code:

local_addr = 'zbr.localnet'
from time import time, ctime

ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>>  ' + ctime(time()) + ' --- ' + ret

Everything stored in __return_data is returned back to client.
Next execution command initializes context back into the state which existed just after initialization script execution (although global elliptics node in our example can be internally modified).

Those were our first set of changes to data processing engine implementation in elliptics network.
Stay tuned for more results!

Elliptics vs HBase on hundreds of millions of small records

Tagged:  

We recently run a test on 215 millions of production records, which we wanted to store on single server.
Objects are rather small about 200 bytes, server hardware is quite common in our environment: 24 Gb of RAM, handful of mostly unused CPU cores, 4 SATA disks of 1-2 Tb in RAD10
Unfortunately due to configuration issues there were only 2 working disks in elliptics server (raid near layout sucks), while HBase had fair load spread over all 4 disks

Usage case: random reads, random range reads.

Out of the box elliptics _yet_ does not provide good enough on-disk index, so it is usually a binary search, which is costly. So we warmed index cache files by reading them into /dev/null

Upload speed was roughly the same in HBase and Ellipitcs - 2.5-3 Krps. But using HBase batch upload it was possible to write data upto 30.000 objects per second, object were packed into 5000 blocks. Elliptics does not yet support such batch upload mode.

So, reading. First of all, we started plain run of 200 random IO rps test. Its duration was about 10 minutes.
Elliptics showed about 30ms median reply times. HBase was closer to 100-150 ms.


Elliptics data (timing scale is wrong, though, median time is 30 ms)


HBase data

Second test was set to find maxium RPS rate of random IOs starting from cold caches and data.
I will not post graphs though, only numbers. Graphs on demand.
Elliptics showed 115 RPS within 100 ms each.
HBase reached almost 300 RPS, but with 100-200 ms median.

And that was with only 2 working disks in elliptics setup (blame on me), while HBase used 4 disks in RAID. Here are proof pictures (first one is elliptics dstat, second - HBase)

HBase also supports index compression, which ended up with 1500 RPS of random IO within 100 ms. Pretty good numbers for single server with 215 millions of records. Our objects are easily compressable, so HBase's data base only occupied about 15 GBs of space.

Elliptics does not compress index, only data, so compression would not help us. Instead we plan to replace current index (binary search on disk, roughly the same happens in every filesystem with b-tree, when you are looking for a file by name). New scheme will cache each N'th key in ram with specifying ID range stored behind given key. Keys are rather small - 64 bytes by default (specified at compile-time), so we could pack a bunch of them into 4k page for example. Reading this page from disk will take single IO, and searching for the key in this page (read into RAM) will be very fast.

This only optimization will kick hbase's ass I think. Getting that HBase can not safely live in multi-datacenter environment, elliptics will fit all needs for huge-number-of-small-records data storage.
But there is also a plan to add bloom filter for faster detection whether given key is present in blob or not.
Since blob file is 'closed' for writes after reaching maximum size or number of records, and no more writes goes into it except when overwrite is turned on or deletions sets the bit, but both do not change keys, it is possible to create rather static and very optimized index. This comes after HBase index design acutally, when its blocks are immutable and so are index ranges.

We currently cache in RAM each key we read from disk, but practice shows that overwhelming number of old enough records (so theirs keys moved to disk from ram) are never (or extremely rarely) read again, so we will add cache timeout for keys to drop them back to disk.

Elliptics recovery process

Tagged:  


Click for full size

24 disks system, 350-450 MB/s
101% disks utilization

I suppose it buzzes very fun in the shelve

Fun set of slides for Yac 2011

It was shown on the 'wall' for eyes pleasure, we did not talk and did not make real presentation, instead there was a stand where several people from our team answered questions and supported talks about distributed processing, universe and so on

Enjoy! :)

P.S. slideshare.net provides only iframe embedding, so you may fail to watch it here, so click to watch on slideshare

Elliptics HOWTO

Tagged:  

Elliptics was put into Ubuntu PPA

deb http://ppa.launchpad.net/eightn/elliptics/ubuntu lucid main
deb-src http://ppa.launchpad.net/eightn/elliptics/ubuntu lucid main

and we started a community site www.elliptics.ru

There is a nice howto there, but only in russian so far.
There are also couple of articles about how elliptics works and what it is all about as well as video from my previous year presentation on YaC 2010.

Elliptics range requests benchmark

Tagged:  

73+ millions of records, 25-30 Gb total space on single node (actually there were 3 replicas, but we used only one)

Here is the graph

3000 rps within 10 milliseconds, where each range request returned 20-4k records.

Elliptics range request is a full analogue of SQL's "SELECT * from TABLE WHERE key > X and key < Y LIMIT (from, num)". In this test each record's key contained timestamp and range request asked for data in some time range.

Each key (elliptics uses 64 bytes for key) looked like this:
xxxyyy...whatever else...timestamp[16 bytes]

and range request was from
xxxyyy...whatever else...0000...0000[16 bytes]
to
xxxyyy...whatever else...ffff...ffff[16 bytes]

POHMELFS got full read/write support

Tagged:  

as well as directory listing and file creation.

All operations are not group-aware though, i.e. all writes are made only into single group and reads (directory listing and object lookup) do not balance between multiple groups.

There was a fair number of inode/dentry hacks and I suppose that populating dentry cache from outside of directory inode operations (like ->create() callback) is not a good idea, but I ended up adding new dentries when reading directory content in ->lookup() or ->readdir().

POHMELFS also does not yet handle errors - i.e. it is a luck we do not crash if server returns borrked structures for directory for example, network errors are not fixed also - client will not reconnect and will not even drop connection if some errors are found.

But it is a matter of time to clean things up. Stay tuned!

Initial POHMELFS commit

Tagged:  
$ git commit -a --stat -m "Initial POHMELFS commit"
[master 19ce0b3] Initial POHMELFS commit
 13 files changed, 3301 insertions(+), 0 deletions(-)
 create mode 100644 fs/pohmelfs/Kconfig
 create mode 100644 fs/pohmelfs/Makefile
 create mode 100644 fs/pohmelfs/Module.symvers
 create mode 100644 fs/pohmelfs/dir.c
 create mode 100644 fs/pohmelfs/file.c
 create mode 100644 fs/pohmelfs/inode.c
 create mode 100644 fs/pohmelfs/net.c
 create mode 100644 fs/pohmelfs/packet.h
 create mode 100644 fs/pohmelfs/pohmelfs.h
 create mode 100644 fs/pohmelfs/route.c
 create mode 100644 fs/pohmelfs/super.c
 create mode 100644 fs/pohmelfs/trans.c

POHMELFS is able to create objects (files only so far) and perform directory listing.
No directories or file IO support yet. No object removal.
Directory structure is trivial linear array which does not scale.
And that in freaking 3.3 thousands of lines of code. And 5 real nights of code.
So do not expect magic here, we will update it in time.

POHMELFS works with elliptics and it is possible to read (write and remove) binary data using elliptics tools and/or via HTTP proxy if you want to mess with raw IDs (64-bytes long numbers).

Stay tuned, fun things are about to begin!

splice() syscall

Tagged:  

I found that splice() syscall does not transfer data, when in and out file descriptors are the same or refer to the same file. Instead destination 'space' is filled with zeroes while supposed to contain input buffer content.

Here is my code for the reference (with some debug added):

static int eblob_splice_data_one(int *fds, int fd_in, loff_t *off_in,
		int fd_out, loff_t *off_out, size_t len)
{
	int err;
	size_t to_write = len;

	while (to_write > 0) {
		err = splice(fd_in, off_in, fds[1], NULL, to_write, 0);
		printf("splice  in: %zu bytes from fd: %d, off: %llu: %d\n",
				to_write, fd_in, *off_in, err);
		if (err == 0) {
			err = -ENOSPC;
			goto err_out_exit;
		}
		if (err < 0) {
			err = -errno;
			perror("splice1");
			goto err_out_exit;
		}
		to_write -= err;
	}

	to_write = len;
	while (to_write > 0) {
		err = splice(fds[0], NULL, fd_out, off_out, to_write, 0);
		printf("splice out: %zu bytes into fd: %d, off: %llu: %d\n",
				to_write, fd_out, *off_out, err);
		if (err == 0) {
			err = -ENOSPC;
			goto err_out_exit;
		}
		if (err < 0) {
			err = -errno;
			perror("splice2");
			goto err_out_exit;
		}
		to_write -= err;
	}

	err = 0;

err_out_exit:
	return err;
}

Unfortunately it does not even return error, but silently corrupts data.

I would be happy to be wrong of course.

Initial POHMELFS bits

Tagged:  

I'm a bit budy with other boring tasks around, so only get chance to hack on POHMELFS today. It is almost 1000 lines of code already, but yet its functionality is virtually near zero.
But I believe all interface for VFS are implemented and I can concentrate on actual interaction with remote elliptics storage.

$ wc -l fs/pohmelfs/{*.[ch],Makefile,Kconfig}
   41 fs/pohmelfs/dir.c
   27 fs/pohmelfs/file.c
  185 fs/pohmelfs/inode.c
   85 fs/pohmelfs/net.c
  104 fs/pohmelfs/pohmelfs.h
   96 fs/pohmelfs/pohmelfs.mod.c
  437 fs/pohmelfs/super.c
    7 fs/pohmelfs/Makefile
   11 fs/pohmelfs/Kconfig
  993 total

# mount -t pohmelfs -o "server=ipv4_addr1:1025:2;server=ipv6_addr2:1025:6" none /mnt/
# ls -laui /mnt/
total 4
1 drwxr-xr-x   2 root root    0 Aug 26 08:52 .
2 dr-xr-xr-x. 23 root root 4096 Aug 26 07:47 ..

File creation does not work as well as directory content reading. It will be rather trivial for start.
Maybe I will even implement server-side scripting instead and will use it for directory updates, so that I do not create leases (in the first release) needed for read-modify-write loop of directory update.

Or (what is more probable) I will just create read-modify-write loop for directory update without server locks, which is rather bad idea from concurrent point of view, but it is the simplest case which can be used as a base for future improvements.

Stay tuned, I plan to create kind of alpha working version very soon!

Pohmelfs in da haus

Linux kernel for the last 3 years didn't change at all - I wrote 600 lines of code, and it does not yet even connect normally to remote elliptics node.

$ wc -l fs/pohmelfs/{*.[ch],Makefile,Kconfig}
   37 fs/pohmelfs/inode.c
   88 fs/pohmelfs/net.c
   58 fs/pohmelfs/pohmelfs.h
   62 fs/pohmelfs/pohmelfs.mod.c
  353 fs/pohmelfs/super.c
    7 fs/pohmelfs/Makefile
   11 fs/pohmelfs/Kconfig
  616 total

Looking forward for the larger datasets

Tagged:  

To date largest (in terms of number of objects) elliptics cluster hosts as much as 400 millions of records on every node. Those are quite small records, so it does not occupy much space (just about 4 Tb per server node), but index becomes quite large (45+ Gb).

We dropped in-memory index in eblob quite for a while already, but having it on disk means that we have to check disk to find out needed key. Currently it is binary-searchable structure on disk, but this is very suboptimal. Well, for 45+ Gb indexes lookup has acceptable timings and likely will have it for 2-3 times larger datasets, but we want every node to host as much as 5-10 times more data, i.e. 40 Tb of data and 4 billions of records.

In this case binary search in 450 Gb index will take too long. We can add more servers to spread data, but in this case we will not optimally use quite limited physical space in datacenters. We have couple of ideas on faster indexes, which basically employ kind of sharding property of our keys, i.e. we can split our key (512 bits in default configuration) into chunks where each part will be an offset into box-like index structure.

In a meantime I added in-memory caching of the read keys - now every key read will be placed into hash table in memory for the fast next-time access. Eblob also got Python bindings and set of handy utils to scan blobs (like regexp match). It also supports statistics file (you will find it in root directory), which shows number of objects present on disk, removed on disk and pushed into memory. Removing this file will force eblob to regenerate it on the next start.

And now one really good news - new POHMELFS will be started next week. Estimated completion time is rather short (we want it to be if not production ready, but more usable at the end of the month), since it will be just elliptics frontend. To date main question is object indexing - the most naive and simple design will have quite slow file/dir renames, i.e. full copy and delete, which is not a good idea generally.
But I did not yet think in details about design problems, so this is an open question.

Stay tuned, there will be very interesting thins shortly!

Elliptics API extension

Tagged:  

I added a bunch of new URI parameters, so HTTP frontend is now almost as capable as clients created using C/C++/Python API. (Github has not yet been updated though)

In particular, added

  • column= - read/write data from particular column. Number of columns is not limited. Well, each new column is a separate large eblob, so it is limited by number of blob files in directory, by default blob size is 50 Gb, so it should not be a limitation
  • size=/offset= - to read/write data from particular offset and size
  • prepare/plain_write/commit - using 'prepare' one can reserve large enough space and then put there chunks of data using 'plain_write'. 'commit' will commit data into index, calculate checksum and so on
  • aflags/ioflags - one can enable/disable checksum, turn on/off compression, APPEND mode and so on
  • range/from/to - range requests. HTTP server returns array of data chunks, where each record is prepended with DNET_ID_SIZE key (64 bytes by default) and 8 bytes of size (little endian)
  • id - to read/write data using own ID (DNET_ID_SIZE max, 64 bytes in default compile - just for sha512 keys)

Here are examples.

1. Write data using own 64-byte keys (in hex: 1122334400... - 1122334600...):

wget -q  -S -O- --post-data="1234567890 id=11223344" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223344"
wget -q  -S -O- --post-data="1234567890 id=11223345" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223345"
wget -q  -S -O- --post-data="1234567890 id=11223346" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223346"

2. Read data by own user-provided key:

wget -q  -S -O- "http://elisto19f.dev/elliptics-proxy?name=1.xml&id=11223345&direct"

3. Range request hex keys are from 1122334400... to 112233ff00...:

wget -q  -S -O- "http://elisto19f.dev/elliptics?range&from=11223344&to=112233ff"

hexdump of the stream (red is key, blue is size):

 wget -q  -S -O- "http://elisto19f.dev.yandex.net/elliptics?range&from=11223344&to=112233ff" | hexdump -C
  HTTP/1.0 200 OK
  Content-Length: 459
  Content-Type: application/octet
  Connection: keep-alive
  Date: Thu, 04 Aug 2011 19:27:57 GMT
  Server: lighttpd/1.4.26

00000000  11 22 33 44 00 00 00 00  00 00 00 00 00 00 00 00  |."3D............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000040  16 00 00 00 00 00 00 00  31 32 33 34 35 36 37 38  |........12345678|
00000050  39 30 20 69 64 3d 31 31  32 32 33 33 34 34 11 22  |90 id=11223344."|
00000060  33 45 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |3E..............|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 16 00  |................|
000000a0  00 00 00 00 00 00 31 32  33 34 35 36 37 38 39 30  |......1234567890|
000000b0  20 69 64 3d 31 31 32 32  33 33 34 35 11 22 33 46  | id=11223345."3F|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 16 00 00 00  |................|
00000100  00 00 00 00 31 32 33 34  35 36 37 38 39 30 20 69  |....1234567890 i|
00000110  64 3d 31 31 32 32 33 33  34 36 11 22 33 47 00 00  |d=11223346."3G..|
00000120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000150  00 00 00 00 00 00 00 00  00 00 16 00 00 00 00 00  |................|
00000160  00 00 31 32 33 34 35 36  37 38 39 30 20 69 64 3d  |..1234567890 id=|
00000170  31 31 32 32 33 33 34 37                           |11223347|
00000178

We will stress-test range requests next week, it is expected that performance should not be radically different from plain random IO performance (benchmarks can be found here)

We also tested Cassandra in 4-nodes configuration, and I'm about to collect raw number and graphs to show.

Stay tuned!

Sometimes I think I'm doing something wrong

I could be an artist, if trained to paint 15 years instead of programming :)

Created under impression about new album from Leningrad band. That's the only way I can imagine Julia Kogan while listening her vocal.

EDITED TO ADD: Blog has been removed from kernelplanet syndication and probably from many others likely because of some raffinated policies.
But that's what I do, and if you do not like that, do not read it. But if you still want to know how to create cool technical stuff, but suddenly missed start of the 21 century - there are freaking tags at the right of this page.

LevelDB benchmarks against Kyoto Cabinet and SQLite

Tagged:  

Holy shit - who the hell tests DATABASE by writing thousands-to-millions records of 100 bytes?

There is a test, where they run 1000 values of 100k each - a bit overkill for database, but yet it is still ALL IN MEMORY.
Those tests are actually "how fast we can read from / write to RAM if our control structure is good enough not to content on locks and the like"

What is really interesting is how it behaves, when database size does not fit memory, or at least is close enough, but not hundred of MB.

Yfrog presentation about HBASE usage

Tagged:  

They removed my drawings from yfrog, so I will say some words about their presentation - mostly about how great we are

First, its setup - 60 servers for 250 millions of photos. I do not really know whether this number includes number of copies, but we use ONE server with elliptics to host 200+ millions of small objects. And we actually have 3 copies, which turns out to 600 millions of records in 3 different datacenters. Well, that server has 48 Gb of RAM and 24 disks in RAID6 - kinda big box.
But our 1 Pb cluster contains not only those machines, but also smaller ones with 4 disks and smaller amount of RAM.

Metadata cluster with 1 Bn of records and 20 nodes is much more interesting, but I'm afraid most of it fits RAM, since metadata records are small.

Presentation speaks about 10 krps - great number, but wait a minute - from 20 or even 60 nodes? Above mentioned setup handles 3krps (2 krps write and 1 krps read) easily. And I do not call it 'Super fast!'

We tested Cassandra in a lab, it handled 2 krps great (1400 rps read, 600 rps write) from 4 nodes (24 Gb of RAM, 4-disk RAID10, 15 Gb of data on every node), but only until data fits memory. Overall it was not very stable and required huge amount of tuning, so that it wouldn't end up in OOM or JVM GC killer.

Huge MongoDB tests are on the road.

Using eblob as low-level IO backend disk bottleneck was pushed away, but it adds limitation on number of records stored effectively - we store key index in RAM. Now when I added on-disk index we are about to retest our benchmarks, which will likely show performance degradation. Although this change allows to put more object into single node and do not heavily depend on amount of RAM, it may be slow enough.

Next change will be to use fast in-memory index for frequently used keys.

Our main disadvantage is lack of documentation. Nobody uses elliptics except our company friends, who bite me about how things are implemented and their meaning. But this will change very soon! We are working, and that's actually kinda harder than code :)

Recent news about projects

Tagged:  

So far I'm mostly concentrated on elliptics and eblob, since we build rather large and loaded storage systems on top of them.
But I plan to step a little bit away and work with language grammatics (starting with simple flex/bison so far), but about this later.

Elliptics got another project to host huge amounts of small files (1-20kbytes) uploaded virtually continuously all day long into servers in 3 different datacenters. Here is a monitoring pic (amount of files uploaded this week) from one of the servers:



There is a small caveat though - eblob stores data and index on disk and also fast index in RAM. With such amount of objects written eventually we will run out of memory for fast index, which in turn will force us to setup more servers, which sometimes is not that simple (mostly because of absence of physical space in datacenters).

Eblob and on-disk vs in-ram lookup indexes

To eliminate this problem I mapped fast index (hash table) into on-disk file, but because of serious fragmentation issues this ended up with huge slowdown - mostly to move disk heads from the beginning of the map file to its end to find out needed object. Although this was O(1) in terms of number of records, it still required fair number of disk seeks to iterate over hash table to get single record. So, it was dropped.

Now system is quite different - I sort on-disk index after blob is completed, which is quite fast operation (index file always fits free memory), then it is written to disk and all records placed there are removed from in-memory fast index. File is mapped and binary search is used to find data records.
Currently we do not propagate frequently accessed records into fast in-memory index, but will do in next releases.
Also if there is an old on-disk index it is not converted on startup, which should be changed too.

We did not yet test new lookup scheme performance, it is scheduled for the end of the week or next one. It is not that simple to generate a hundred of millions of records quickly, so likely I will use existing data first (it requires sorting of the existing on-disk indexes).

Eblob as append-only and overwrite in-place storage

Besides on-disk index eblob also got overwrite support - when new records is less than object (including alignment) already written with this key, it will be overwritten in place instead of appending it at the very end. This feature should eliminate quick metadata grows, since we update it each time object is checked or written.

Language grammatics

I plan to implement mapreduce on top of column data store used in elliptics, and want to create a rather trivial language for that. In particular I plan to use boolean and arithmetic operations, regex string search and generic read/write primitives mostly to eliminate data dereference (kind of string class). Maybe it will even contain loops and functions.
So far I wrote a simple reenterable grammatics parser using Flex/Bison, which supports multiline strings in Flex among others, but grammatics is yet very trivial and basically unusable for language parsing. I will experiment with it more, since I want to add it into elliptics sooner than later.

Those a plans and infos. Stay tuned!

Eblob got automatic defragmentation support

Tagged:  

Eblob is an extremely fast append-only low-level storage used in elliptics network distributed storage.

Since it is append-only all updates will look like new records appended somewhere at the end of the blob file. Thus old entries will just waste space.

Previously there was a special tool which was able to defragment blobs in offline and elliptics could be restarted to get new data location (it worked with old blobs until then). It required lots of space or lots of restarts to defragment really large storage.

Now this is gone - eblob uses online automatic defragmentation, which finds blobs with lots of removed objects (there should be at least one third of objects removed) and sequentially reads records and writes them back into another blobs.
In-memory index is also updated, so when this process completes scanning given blob file it can be removed.

Automatic defragmentation requires at most size of the single blob of free space to defragment whole storage, since blobs are processed one after another without storage restart.

Getting that all reads during defragmentation are sequential and all writes are actually appends properly aligned by VFS to make huge sequential writes, this process is rather fast.

Looks like this is the last feature for eblob, so we can concentrate on recovery tools, tests and eventually new features.

Since elliptics@eblob and API support column storage, I plan to implement incremental mapreduce on top of that. It is a bit far plan though, maybe to the end of the summer...
Anyway, stay tuned!

Eventual consistency and data correctness

Tagged:  

With eventual consistency there is always a window when written data can not be read because of node/network failure.
Depending on how frequently recovery process starts this may be not an issue especially if data is unique.

But it is possible that there is an old copy, which is not yet updated because of failed node or because recovery is not yet completed. In this case client will sometimes get this stall data which may be not consistent with what he expected after write.

To solve this problem eventual consistency systems introduce various kinds of timestamps attached to data. Reading them from multiple nodes can tell us with some degree of probability that given data is stall or valid as well as data consistency between nodes in question.

Elliptics network is such a system - eventual consistency is maintained by external application which starts recovery process on timed base. To get always last written data we have a set of API calls, which read metadata from multiple nodes in parallel and select node(s) with the latest written data.
In theory we can ask for data written to majority of nodes instead of latest write for example, but our practice shows that this behavior is not required if not harmed.

In my sweet dreams there is a nice looking wiki-like site which hosts all documentation and examples, but this is yet to become reality. In a meantime one can grab example sources and look at how this is implemented there.

Another example (C++ this time) is how to read/write data using columns.

Python code should be self containing if read binding implementation first, which will tell what those unnamed fields are like offset, size and various flags.

Eblob got Snappy compression support

Tagged:  

Kudos to Google for opensourcing it.

I did not yet test its performance, but it compresses our small common XML about 2 times.
API was extended to support per-command flag to compress data or not. Eblob stores this information in its structure, so reading detects this and automatically decompresses data.

So, to store textual data elliptics and eblob look very promising, getting that Cassandra and MongoDB do not yet support compression (although there are tickets back from 2009) :)

Enjoy!

Elliptics 2.10.0

Tagged:  

New release with huge number of improvements, the most notable of which are

  • column data storage (eblob only)
  • range requests
  • faster metadata and overall IO performance
  • server-side checksumming and verification support
  • base for ultra-fast recovery and local mapreduce
  • simplified interfaces for POHMELFS

So far this is a beta release, since not all API calls were propagated to high-level interfaces. There are some issues with recovery process - we will not yet fully finished transition process. Eblob offline defragmentation tool was removed, but online processes was not yet added into the core.

We plan to resolve this issues this week and move further.

Elliptics currently hosts not-that-large clusters (the biggest cluster spawns over about 200 nodes which live on ~20 physical servers and close to 1 Pb of data) with about hundreds of millions objects (not counting redundant copies), which covers 4 datacenters including abroad. With above features we plan to expand our usage area. Unfortunately it is quite complex to get into with elliptics usage, and that should be my main goal actually.

In a meantime we got www.elliptics.ru and www.eblob.ru, which will host documentation portal with example configs, API docs, benchmarks and so on. Many thanks to Andrey Godin.

Maybe eventually this will be a community driven project.
Anyway, stay tuned!

Column data store

Tagged:  

In a meantime elliptics got column store support.
It is implemented in eblob only though, filesystem backend does not support this feature.

I'm thinking about proper API for this feature though.
There are million of changes in the current tree, we will polish it and release new version very soon, which will draw the line between current rapid development cycles with extreme feature creation rate and steady maintenance support. Probably :)

Stay tuned, I will make new post, when things are ready for prime use.

Key range requests in elliptics network

Tagged:  

Elliptics network is a horizontally scalable key/value storage, which by default implements distributed hash table for ID spreading scheme.

But it is possible to implement your own ID generation mechanism, which can introduce data locality. Range requests in such scenarios can produce a huge win for users.

I implemented range requests (in eblob backend only), which can be kind of transformed into following SQL command:

SELECT * from eblob WHERE key >= start AND key < end LIMIT (from, num)

One can use half of the key to store master ID and another half (we use 512-bits keys, but it can be changed at compile time) for each object client wants to attach to selected master ID, for example if we name client X as key '123456.000000' then all its data can be stored with keys 123456.000001, 123456.000002 and so on
To get all data objects with 'master ID' 123456 in one request one may execute range query starting with 123456.000000 and ending with 123457.000000

For example using python binding it looks like this (different range is used here):

$ cat dnet/bindings/python/range.py
#!/usr/bin/python
# -*- coding: utf-8 -*-

from libelliptics_python import *

log = elliptics_log_file("/dev/stderr", 8)
n = elliptics_node_python(log)

group = 2
n.add_groups([group])
n.add_remote("localhost", 1025)

r = elliptics_range()
r.start = [1, 2, 3, 4]
r.end = [0xcc, 0xdd]
r.group_id = group
r.limit_start = 2
r.limit_num = 2

print n.read_data_range(r)

API hides cases when requested range is handled by different hosts, although this should be very rare case, it is supported in the underlying methods.

Range requests are supported in C, C++ and Python bindings as well as usual IO methods.

Next huge step is fair column data store. I'm open for ideas about better implementation.

Syndicate content