Elliptics

Elliptics distributed network

5 billions

Tagged:  

It is amount of objects in one of our clusters - it hosts avatars and other rather small objects (hundreds of bytes to several kilobytes)
This amount of objects is hosted just on 2 nodes, each has 24x2Tb disks, 48Gb of ram, only 6Tb are actually used

We should multiply it to 3, since that's amount of objects in whole cluster spread over 3 datacenters in Moscow region
Load is quite small though, about 500 rps of reads and about the same for simultaneous writes

pohmlefs got hardlink and socket/pipe support

Tagged:  

While adding pipe/socket is pretty much straightforward - it is just another inode types, which have full VFS implemnetation in kernel, hardlinks are much harder.

Basically saying, hardlink is a link to inode, so that updates made through one path ended up with new data read from all other hardlinked pathes. Removing one hardlink may not end up removing whole fie, instead there should be some kind of reference counter - we only remove file's data when there are no more hardlinks to given file, and of course when we remove file by original path, we can not remove its data, if it can be accessed via its hardlinks.

POHMELFS uses elliptics as its data storage backend, which supports atomic transactions per replica and per-column data IO. So I put reference counter update into elliptics server-side script, which is called for every unlink and hardlink creation command.

Scripts are executed atomically against other operations for given key, but only on single replica. It is possible that different operations with the same key are executed on different replicas, but for unlink it does not matter - if we drop refcnt by removing 'hardlink-1' and then original path or vice versa data will only be removed when last reference drops to zero (actually to -1, since for performance reasons we do not create reference counter for plain object, only for its hardlinks).

Last weeks we extensively tested POHMELFS in a lab, but its time to move it to small testing environment, which mimics behaviour of one of our large systems. We will setup all software we use daily there and start looking for real bugs.
Getting our previous testing 'mode' (we synced about hundred of terabytes of files of different sizes) I do not expect any problems to arise.

Still, I want http compatibility mode, when we save files with IDs generated from full path length. This implies that move is not supported, but it is a special case where we can afford this. Basically we want it to write data via POHMELFS for performance and then read it via HTTP.

Stay tuned, I will send kernel inclusion request next week. First patch will replace drivers/staging/pohmelfs with this new code. Hopefully it will move into fs/ in some next release.

Pohmelfs got quorum read support

Tagged:  

Due to eventual consistency nature of elliptics there may exist a window in which replicas are not in sync.

Depending on various factors this window may be rather large, for example in our production clusters we run this kind of check weekly, so having replica outdated for that long may require additional steps to get synchronized data.
We store metadata for every written object, which among others contain update timestamp. So it is quite simple task to make a parallel request for metadata from multiple replicas in different datacenters and select the one with the latest timestamp.

This exists in elliptics API quite for a while, and now I added it to pohmelfs.
At open() time we check all replicas and build a list of groups, where it is present, sorted by timestamp, so any subsequent read will get the must up-to-date data.

As a bonus I added keepalive mount options to detect broken links early. In our tests where datacenters may shutdown on weekly basis waiting for couple of hours is a bit of overhead to detect it.

There are 2 pohmelfs tasks to date - http compatibility mode, where object ID is generated according to its full path, and umount bug, which is likely because of my dirty dcache games.

I've just announced pohmelfs in linux-kernel

Tagged:  

Here is the link: https://lkml.org/lkml/2011/12/23/161

As mentioned there, there are features yet to complete:

  • quorum read. pohmelfs supports quorum write only so far.
  • http compatibility mode - we do want to upload data via pohmelfs and read it through http applications. And vice versa actually too.
  • column read-write or more generally file-as-directory feature.
  • even more testing - abusing dcache is fun, but likely there are hiddens stones
  • replace drivers/staging/pohmelfs with this code.

It is quite a lot of work, but not that much of work. Maybe a week or so to actually implement all that stuff and then to start real testing. Right now we strike into system limitations quite fast - like doing multi-terabyte rsync between local raid and pohmelfs mounted storage and hitting NIC limitations all the time - it is 1 Gbps after all (and I want to play with 10 GigE but still without luck). Pohmelfs uses local VFS page cache so it can easily saturate bandwidth during writeback.
Since we also have to write multiple copies in parallel (we write 3, but only 2 groups are turned on to get quorum), write is limited by network 40-50 MB/s.

Also elliptics sometimes scares me - we sync data into eblob once per 300 seconds, and it takes about 3-10 seconds to actually write it to disks on every system. At that time machine almost freezes - 100% disks utilization. And rsync sometimes looses its mind - it stops and timeouts, although sync is already finished.

Another problem is related to eblob. We store 10+ gigabyte log files in test case, and since eblob is append-only, it can fill its quota for given key and subsequent write will have to copy previous record into new chunk. Copying 10 Gb takes some time too, and it also takes space - old records are just marked as deleted until offline defragmentation cleans it up.

And my favourite - dcache abuse. Sometimes I get kernel BUG at /home/apw/COD/linux/fs/dcache.c:873 which means at umount time there are dentries with positive reference counters. I do not yet know whether it is pohmelfs or not, but more likely that it is the reason :)
I will fix that too of course.

But overall pohmelfs is close to finish.

Next task is to implement indexing and embedded search in elliptics.
It is quite good idea to have out-of-the box full-text search engine inside storage system.

Stay tuned, new things are close!

POHMELFS and elliptics server-side

POHMELFS in a meantime got 'noreadcsum' mount option support as well as some other tricky bits.

This bumps read performance several times, since eblob (elliptics backend) stores data in contiguous chunks increasing read/write performance, but optionally forcing to copy data, when prepared space is not enough.
Reading with csum enabled forces to checksum the whole written area, which in turn requires to populate it from disk to page cache and so forth. For gigabyte file this takes about 5-6 seconds (first time).

And that's only to read one page. I.e. to read every page (or readahead requested chunk).
Not sure that disabling read checksumming is a good idea, so I made this a mount option. Maybe eventually we end up with some better solution.

I also fixed nasty remount bug in POHMELFS which uncovered a really unexpected (for me at least) behaviour of Linux VFS.
Every inode may have ->drop_inode() callback, which is called each time its reference counter reaches 0 and inode is about to be freed.
But sometimes when inode was recently accessed, it is not evicted, but placed into lru list with special bits set.
Inode cache shrink code (invoked from umount path in particular, but likely may be called from memory shrink path too) grabs all inodes in that lru list and later calls plain iput() on them, which in turn invokes ->drop_callback() for inode in question.

Thus it is possible to get multiple invocation of callback in question without reinitializing inode between them. This crashed pohmelfs in some cases. Now it is fixed with appropriate comment in the code, but I'm wonder how many other such tricks are yet to discover?

POHMELFS is a great tester for elliptics server-side scripting support. I was lazy and put all somewhat complex processing into server-side scripts written in Python. Anton Kortunov implemented simple sstable-like structure in Python for directory entries used by pohmelfs.

Since every command is processed atomically (on single replica) in elliptics, we can put complex directory update mechanism in this 'kind-of-transactions'. In particular server-side scripting is used to insert and remove inodes from directory. Lookup is also implemented using server-side scripting - we read directory inode in python code, search for requested name, and return inode information if something is found, which is sent back to pohmelfs.

Overall this takes about 2-13 msecs. I.e. receive command from pohmelfs, 'forward' it to pool of python executers (srw project), where python code will read directory inode data from elliptics (using elliptics_node.read_data_wait()), search for inode with given name there and send it back to pohmelfs.
Insert takes about 30-150 msecs - script reads directory content, adds new entry (or update old) and then writes it back into the storage.

That's how it looks in python - srw/pohmelfs_lookup.py

Given that we spend 10 msecs in such not really trivial piece of code, I believe that my implementation is actually not that bad.
Those are recent news. Stay tuned for more!

Recent elliptics changes

Tagged:  

It was a while I wrote here last time, but there are a fair number things happened.

First, elliptics.
We fixed number of bugs in server-side scripting implementation, so it is very stable now and easily allows to create complex scripts which not only perform basic tasks, but also rather complex transactions, like read/process/update multiple records.

We added transaction locks, which allow to perform operations on given key atomically on single replica. In this case your server-side script, which reads data, updates its structure and uploads it back, will be performed atomically on single replica.

There are many other distributed storage systems, which allow atomic key operations, but they do not support datacenter replica split. Usually it is implemented (like in Oracle NoSQL or MongoDB) with single master node, which copies data to multiple replicas. This operation may even be synchronous and safe.
And it is possible to put replicas into different datacenters. But what happens, when you want to add more datacenters, but do not want to increase number of copies? In this scheme one has to rebalance replica sets.

In elliptics you just say which datacenters you want to use for given IO transaction. This forces non-atomic replica updates, which happen in parallel. This allows to have faster writes and more flexible setup, but we can not guarantee that replicas are updated synchronously.

And actually with mongodb scheme it is possible, that replicas will not be in sync, since there may be failed node which ost some data, so reads from this node will not return consistent results.

Real fix for this problem lies outside of the low-level storage subsystem. It is like forcing block device to lock against multiple processes writing into the same file. Instead this protection lives on higher layers and generally requires external locks, which are both known and used by concurrent users.

In distributed world this algorithm is based on Paxos with optimizations (like fast Paxos or Zookeeper Atomic Broadcast and friends). We do not yet have such subsystem.

Another change is embedded checksums. Previously we stored them in blobs after each data records, but never used. Instead there was a separate command to calculate and store checksum in metadata (separate column). This may resulted in stall checksum stored in metadata, so parallel read could fail with inconsistent data error. Right now metadata is just a storage for replica information and update/check status dates and other meta information (namespace and so on).
Each write updates checksum in blob atomically (again, this is not synchronized between replicas), so any read will always get correct check if data was not corrupted.

We also added bulk IO operations, namely read and write. Write issues many parallel writes and waits for all of them to complete, thus essentially being equal to the performance of the slowest nodes, while doing sequential write ends up being as long as sum of all writes in question.
Read does similar thing, but splits and optimizes read using offset sorting (i.e. that we read data from disk in sequential order).

Elliptics: server-side scripting

I added binary data support into server-side scripting (currently we only support Python) as well as extended HTTP fastcgi proxy to support server-side script execution.

Previously we only were able to use C/C++/Python to force some server to execute script on our data - API is rather simple and can be found in examples in binding dirs.
Now we can do that through HTTP.

Here is an example of how to setup python server-side scripting and run scripts on posted through HTTP data.
First of all, you have to put python.init script into 'history' directory (specified in server config).
It may look like this:

import sys
sys.path.append('/usr/lib')
from libelliptics_python import *

log = elliptics_log_file('/tmp/data/history.2/python.log', 40)
n = elliptics_node_python(log)
n.add_groups([1,2,3])
n.add_remote('elisto19f.dev', 1030)
__return_data = 'unused'

This creates global context, which is copied for every input request.
Second, you may add some scripts into the same dir, let's add test script named test_script.py:

__return_data = "its nothing I can do for you without cookies: "

if len(__cookie_string) > 0:
       # here we can add arbitrary complex cookie check, which can go into elliptics to get some data for example
       # aforementioned global context ('n' in above python.init script) is available here
       # so you can do something like data = n.read_data(...)
        __return_data = "cookie: " + __cookie_string + ": "

__return_data = __return_data + __input_binary_data_tuple[0].decode('utf-8')

HTTP proxy places cookies specified in its config into variable __cookie_string, and it is accessible from your script. Anything you put into __return_data is returned to the caller (and eventually to the client who started request).

__input_binary_data_tuple[0] is a global tuple which contains binary data from your request.
When using high-level API you may find, that we send 3 parameters to the server:

  • script name to invoke (can be NULL or empty)
  • script content - it is code which is invoked on the server before named script (if present), its context is available for called script
  • binary data

That binary data is placed into global tuple without modification. In Python 2.6 it holds byte array, I do not yet know what it will be in older versions though (basically, we limited support to python 2.6+ for now).

HTTP proxy puts your POST request data into aforementioned binary data. It also puts simple string like __cookie_string = 'here is your cookie' into 'script content' part of the request.
One can put there whatever data you like if using C/C++/Python API of course.

Here is example HTTP request and reply:

$ wget -q -O- -S --post-data="this is some data in POST request" "http://elisto19f.dev:9000/elliptics?exec&name=test_script.py&id=0123456789"
  HTTP/1.0 200 OK
  Content-Length: 79
  Connection: keep-alive
  Date: Wed, 02 Nov 2011 23:24:10 GMT
  Server: lighttpd/1.4.26
its nothing I can do for you without cookies: this is some data in POST request

As you see, we do not have cookies, so our script just concatenated some string with binary data and returned that data to the client. Name used in POST parameters is actually a name of the server-side script to invoke. ID is used to select server to run this code, otherwise it will run on every server in every group.

It is now possible to run our performance testing tools against server-side scripting implementation (it is not straightforward - there is separate SRW library which implements pool of processes each of which has own initialized global context, which is copied for every incoming request and so on

We believe that numbers will be good out of the box, and of course I expect that performance will suffer compared to plain data read or write, but it should not be that slow - we still expect that every request completes within milliseconds even when it uses python to work with data.
Our microbenchmark (every command executed on server is written into log with time it took to complete) shows that timings are essentially the same - hundreds of microseconds to complete simple scripts.
So I believe with IO intensive tasks we will not be limited by python server scripts and/or various reschedulings.

I'm quite excited with idea of adding full-text (rather simple though) search for all uploaded content into elliptics.
And I'm working hard on this - expect some results really soon!

P.S. We have almost finished directory support for POHMELFS via server-side scripts, expect it quite soon also!

Elliptics network: server-side scripting for data processing engine

Tagged:  

Storage systems more and more commonly demand not just ability to save data to disk and provide fast access.
It is almost impossible to find a plain storage stack, which will not perform some kind of data processing.
This may include applications starting from simple client-side operations (like wrapping data in HTML in web case) and finishing with complex real-time processing or batch jobs wrapped in mapreduce-like systems.

Writing most of server-side work in low-level languages is not convenient and frequently is error-prone. Also performance is not commonly limited by CPU power, but disk IO speeds.

So, I added server-side scripting support to elliptics. Separate lib (called libsrw) creates a pool of processes (since many scripting languages are single-threaded like Python or popular javascript v8) which talk to external world via pipes. Each process initializes global context where it can run user provided script.
For example in elliptics we store a client node, which is connected to the storage and has access to all local data.
Every execution command is started in context, which is copied from global environment, and although global context can be affected by command (for example connection can be dropped by remote node), all variables created for scripts to be executed are destroyed when command/script completes.

To date I implemented python context and we plan to complete v8 javascript soon.
It is possible to create, for example, a simple checker on every node, where systems checks clients credential when processing data via direct URL. Direct URLs are used when we do not want to proxy data through dedicated servers (like when streaming huge files), but instead allow client to connect and send requests to particular server which hosts needed data objects.

Another example is batch data processing, like background data conversion from one format into another without need to copy every object to processing node. We also plan to use this mode for batch recovery - instead of checking replicas for every stored key, we can collect a bunch of objects which are known to me missed on some other node and later upload this blob to remote host in one go.

Classical mapreduce can also be implemented on servers. Actually our contexts are exact mappers, which have access to data and can iterate over all records selecting those to 'reduce'. In our scheme reduce operation will run on the node, which sent request (or client). This may not be the optimal solution though (for example when reduce data set is rather huge), so we could extend it in the future.
Plan is to allow execution contexts to store its state locally when needed, so that subsequent commands have access to already processed data (for example consider the case, when we calculate number of links to given URL - we generally do not want to check all files every time we start processing, instead would like to only parse files uploaded since previous start).

Another application for server-side scripting is directory structure implementation for POHMELFS. This may be implemented in low-level code like C/C++ though and linked to server binary. Storing and processing complex structure on the server allows client (which is in kernel) to be really simple and do not mess with complex locks.

What I want to play right now is some kind of full-text search embedded into elliptics. The most naive implementation could just parse all written documents and build reverse indexes (using appropriate language morphology if needed) and store them in dedicated columns in elliptics as well. Writing this kind of scripts in Python or JavaScript is rather simple task, which greatly extends functionality of the storage kind of 'for free' - bunch of server-grade processors are usually unused in storage systems.

Here is context initialization script for elliptics for example:

$ cat /tmp/test/history.2/python.init 
import sys
sys.path.append('/tmp/dnet/lib')
from libelliptics_python import *

log = elliptics_log_file('/dev/stderr', 40)
n = elliptics_node_python(log)
n.add_groups([1,2,3])
n.add_remote('localhost', 1025)
__return_data = 'unused'

We create global elliptics node object 'n' (not a good name for such things of course, but that's just an example), which connects to storage server listening on a localhost:1025. __return_data is a special variable used to return data from execution context back to client. It is possible to write data into a file on the local filesystem or upload test results back into elliptics though.

That's how client requests may look (there are also API extensions for C++ and Python):

$ ./example/dnet_ioclient -r server:1025:2 -C1 -c "local_addr = '`hostname`'" -n script.py
$ ./example/dnet_ioclient -r localhost:1025:2 -C2 -c "__return_data = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)"

The latter example is just a plain python script executed on the server - it only reads data from elliptics.
First example executes a script named script.py, which should exist on the server (config specifies scripting dir).
local_addr = '`hostname`' is part of the final script which initializes some variable (it can be arbitrary complex if needed).

Script (on the server) may look something like this:

$ cat /tmp/test/history.2/script.py
from time import time, ctime

ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>>  ' + ctime(time()) + ' --- ' + ret

Prior being executed server-side code concatenates client's part of the script (string with local_addr in above example) with script itself, so server executes following code:

local_addr = 'zbr.localnet'
from time import time, ctime

ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>>  ' + ctime(time()) + ' --- ' + ret

Everything stored in __return_data is returned back to client.
Next execution command initializes context back into the state which existed just after initialization script execution (although global elliptics node in our example can be internally modified).

Those were our first set of changes to data processing engine implementation in elliptics network.
Stay tuned for more results!

Elliptics vs HBase on hundreds of millions of small records

Tagged:  

We recently run a test on 215 millions of production records, which we wanted to store on single server.
Objects are rather small about 200 bytes, server hardware is quite common in our environment: 24 Gb of RAM, handful of mostly unused CPU cores, 4 SATA disks of 1-2 Tb in RAD10
Unfortunately due to configuration issues there were only 2 working disks in elliptics server (raid near layout sucks), while HBase had fair load spread over all 4 disks

Usage case: random reads, random range reads.

Out of the box elliptics _yet_ does not provide good enough on-disk index, so it is usually a binary search, which is costly. So we warmed index cache files by reading them into /dev/null

Upload speed was roughly the same in HBase and Ellipitcs - 2.5-3 Krps. But using HBase batch upload it was possible to write data upto 30.000 objects per second, object were packed into 5000 blocks. Elliptics does not yet support such batch upload mode.

So, reading. First of all, we started plain run of 200 random IO rps test. Its duration was about 10 minutes.
Elliptics showed about 30ms median reply times. HBase was closer to 100-150 ms.


Elliptics data (timing scale is wrong, though, median time is 30 ms)


HBase data

Second test was set to find maxium RPS rate of random IOs starting from cold caches and data.
I will not post graphs though, only numbers. Graphs on demand.
Elliptics showed 115 RPS within 100 ms each.
HBase reached almost 300 RPS, but with 100-200 ms median.

And that was with only 2 working disks in elliptics setup (blame on me), while HBase used 4 disks in RAID. Here are proof pictures (first one is elliptics dstat, second - HBase)

HBase also supports index compression, which ended up with 1500 RPS of random IO within 100 ms. Pretty good numbers for single server with 215 millions of records. Our objects are easily compressable, so HBase's data base only occupied about 15 GBs of space.

Elliptics does not compress index, only data, so compression would not help us. Instead we plan to replace current index (binary search on disk, roughly the same happens in every filesystem with b-tree, when you are looking for a file by name). New scheme will cache each N'th key in ram with specifying ID range stored behind given key. Keys are rather small - 64 bytes by default (specified at compile-time), so we could pack a bunch of them into 4k page for example. Reading this page from disk will take single IO, and searching for the key in this page (read into RAM) will be very fast.

This only optimization will kick hbase's ass I think. Getting that HBase can not safely live in multi-datacenter environment, elliptics will fit all needs for huge-number-of-small-records data storage.
But there is also a plan to add bloom filter for faster detection whether given key is present in blob or not.
Since blob file is 'closed' for writes after reaching maximum size or number of records, and no more writes goes into it except when overwrite is turned on or deletions sets the bit, but both do not change keys, it is possible to create rather static and very optimized index. This comes after HBase index design acutally, when its blocks are immutable and so are index ranges.

We currently cache in RAM each key we read from disk, but practice shows that overwhelming number of old enough records (so theirs keys moved to disk from ram) are never (or extremely rarely) read again, so we will add cache timeout for keys to drop them back to disk.

Elliptics recovery process

Tagged:  


Click for full size

24 disks system, 350-450 MB/s
101% disks utilization

I suppose it buzzes very fun in the shelve

Elliptics HOWTO

Tagged:  

Elliptics was put into Ubuntu PPA

deb http://ppa.launchpad.net/eightn/elliptics/ubuntu lucid main
deb-src http://ppa.launchpad.net/eightn/elliptics/ubuntu lucid main

and we started a community site www.elliptics.ru

There is a nice howto there, but only in russian so far.
There are also couple of articles about how elliptics works and what it is all about as well as video from my previous year presentation on YaC 2010.

Elliptics range requests benchmark

Tagged:  

73+ millions of records, 25-30 Gb total space on single node (actually there were 3 replicas, but we used only one)

Here is the graph

3000 rps within 10 milliseconds, where each range request returned 20-4k records.

Elliptics range request is a full analogue of SQL's "SELECT * from TABLE WHERE key > X and key < Y LIMIT (from, num)". In this test each record's key contained timestamp and range request asked for data in some time range.

Each key (elliptics uses 64 bytes for key) looked like this:
xxxyyy...whatever else...timestamp[16 bytes]

and range request was from
xxxyyy...whatever else...0000...0000[16 bytes]
to
xxxyyy...whatever else...ffff...ffff[16 bytes]

Pohmelfs in da haus

Linux kernel for the last 3 years didn't change at all - I wrote 600 lines of code, and it does not yet even connect normally to remote elliptics node.

$ wc -l fs/pohmelfs/{*.[ch],Makefile,Kconfig}
   37 fs/pohmelfs/inode.c
   88 fs/pohmelfs/net.c
   58 fs/pohmelfs/pohmelfs.h
   62 fs/pohmelfs/pohmelfs.mod.c
  353 fs/pohmelfs/super.c
    7 fs/pohmelfs/Makefile
   11 fs/pohmelfs/Kconfig
  616 total

Looking forward for the larger datasets

Tagged:  

To date largest (in terms of number of objects) elliptics cluster hosts as much as 400 millions of records on every node. Those are quite small records, so it does not occupy much space (just about 4 Tb per server node), but index becomes quite large (45+ Gb).

We dropped in-memory index in eblob quite for a while already, but having it on disk means that we have to check disk to find out needed key. Currently it is binary-searchable structure on disk, but this is very suboptimal. Well, for 45+ Gb indexes lookup has acceptable timings and likely will have it for 2-3 times larger datasets, but we want every node to host as much as 5-10 times more data, i.e. 40 Tb of data and 4 billions of records.

In this case binary search in 450 Gb index will take too long. We can add more servers to spread data, but in this case we will not optimally use quite limited physical space in datacenters. We have couple of ideas on faster indexes, which basically employ kind of sharding property of our keys, i.e. we can split our key (512 bits in default configuration) into chunks where each part will be an offset into box-like index structure.

In a meantime I added in-memory caching of the read keys - now every key read will be placed into hash table in memory for the fast next-time access. Eblob also got Python bindings and set of handy utils to scan blobs (like regexp match). It also supports statistics file (you will find it in root directory), which shows number of objects present on disk, removed on disk and pushed into memory. Removing this file will force eblob to regenerate it on the next start.

And now one really good news - new POHMELFS will be started next week. Estimated completion time is rather short (we want it to be if not production ready, but more usable at the end of the month), since it will be just elliptics frontend. To date main question is object indexing - the most naive and simple design will have quite slow file/dir renames, i.e. full copy and delete, which is not a good idea generally.
But I did not yet think in details about design problems, so this is an open question.

Stay tuned, there will be very interesting thins shortly!

Elliptics API extension

Tagged:  

I added a bunch of new URI parameters, so HTTP frontend is now almost as capable as clients created using C/C++/Python API. (Github has not yet been updated though)

In particular, added

  • column= - read/write data from particular column. Number of columns is not limited. Well, each new column is a separate large eblob, so it is limited by number of blob files in directory, by default blob size is 50 Gb, so it should not be a limitation
  • size=/offset= - to read/write data from particular offset and size
  • prepare/plain_write/commit - using 'prepare' one can reserve large enough space and then put there chunks of data using 'plain_write'. 'commit' will commit data into index, calculate checksum and so on
  • aflags/ioflags - one can enable/disable checksum, turn on/off compression, APPEND mode and so on
  • range/from/to - range requests. HTTP server returns array of data chunks, where each record is prepended with DNET_ID_SIZE key (64 bytes by default) and 8 bytes of size (little endian)
  • id - to read/write data using own ID (DNET_ID_SIZE max, 64 bytes in default compile - just for sha512 keys)

Here are examples.

1. Write data using own 64-byte keys (in hex: 1122334400... - 1122334600...):

wget -q  -S -O- --post-data="1234567890 id=11223344" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223344"
wget -q  -S -O- --post-data="1234567890 id=11223345" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223345"
wget -q  -S -O- --post-data="1234567890 id=11223346" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223346"

2. Read data by own user-provided key:

wget -q  -S -O- "http://elisto19f.dev/elliptics-proxy?name=1.xml&id=11223345&direct"

3. Range request hex keys are from 1122334400... to 112233ff00...:

wget -q  -S -O- "http://elisto19f.dev/elliptics?range&from=11223344&to=112233ff"

hexdump of the stream (red is key, blue is size):

 wget -q  -S -O- "http://elisto19f.dev.yandex.net/elliptics?range&from=11223344&to=112233ff" | hexdump -C
  HTTP/1.0 200 OK
  Content-Length: 459
  Content-Type: application/octet
  Connection: keep-alive
  Date: Thu, 04 Aug 2011 19:27:57 GMT
  Server: lighttpd/1.4.26

00000000  11 22 33 44 00 00 00 00  00 00 00 00 00 00 00 00  |."3D............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000040  16 00 00 00 00 00 00 00  31 32 33 34 35 36 37 38  |........12345678|
00000050  39 30 20 69 64 3d 31 31  32 32 33 33 34 34 11 22  |90 id=11223344."|
00000060  33 45 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |3E..............|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 16 00  |................|
000000a0  00 00 00 00 00 00 31 32  33 34 35 36 37 38 39 30  |......1234567890|
000000b0  20 69 64 3d 31 31 32 32  33 33 34 35 11 22 33 46  | id=11223345."3F|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 16 00 00 00  |................|
00000100  00 00 00 00 31 32 33 34  35 36 37 38 39 30 20 69  |....1234567890 i|
00000110  64 3d 31 31 32 32 33 33  34 36 11 22 33 47 00 00  |d=11223346."3G..|
00000120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000150  00 00 00 00 00 00 00 00  00 00 16 00 00 00 00 00  |................|
00000160  00 00 31 32 33 34 35 36  37 38 39 30 20 69 64 3d  |..1234567890 id=|
00000170  31 31 32 32 33 33 34 37                           |11223347|
00000178

We will stress-test range requests next week, it is expected that performance should not be radically different from plain random IO performance (benchmarks can be found here)

We also tested Cassandra in 4-nodes configuration, and I'm about to collect raw number and graphs to show.

Stay tuned!

Yfrog presentation about HBASE usage

Tagged:  

They removed my drawings from yfrog, so I will say some words about their presentation - mostly about how great we are

First, its setup - 60 servers for 250 millions of photos. I do not really know whether this number includes number of copies, but we use ONE server with elliptics to host 200+ millions of small objects. And we actually have 3 copies, which turns out to 600 millions of records in 3 different datacenters. Well, that server has 48 Gb of RAM and 24 disks in RAID6 - kinda big box.
But our 1 Pb cluster contains not only those machines, but also smaller ones with 4 disks and smaller amount of RAM.

Metadata cluster with 1 Bn of records and 20 nodes is much more interesting, but I'm afraid most of it fits RAM, since metadata records are small.

Presentation speaks about 10 krps - great number, but wait a minute - from 20 or even 60 nodes? Above mentioned setup handles 3krps (2 krps write and 1 krps read) easily. And I do not call it 'Super fast!'

We tested Cassandra in a lab, it handled 2 krps great (1400 rps read, 600 rps write) from 4 nodes (24 Gb of RAM, 4-disk RAID10, 15 Gb of data on every node), but only until data fits memory. Overall it was not very stable and required huge amount of tuning, so that it wouldn't end up in OOM or JVM GC killer.

Huge MongoDB tests are on the road.

Using eblob as low-level IO backend disk bottleneck was pushed away, but it adds limitation on number of records stored effectively - we store key index in RAM. Now when I added on-disk index we are about to retest our benchmarks, which will likely show performance degradation. Although this change allows to put more object into single node and do not heavily depend on amount of RAM, it may be slow enough.

Next change will be to use fast in-memory index for frequently used keys.

Our main disadvantage is lack of documentation. Nobody uses elliptics except our company friends, who bite me about how things are implemented and their meaning. But this will change very soon! We are working, and that's actually kinda harder than code :)

Recent news about projects

Tagged:  

So far I'm mostly concentrated on elliptics and eblob, since we build rather large and loaded storage systems on top of them.
But I plan to step a little bit away and work with language grammatics (starting with simple flex/bison so far), but about this later.

Elliptics got another project to host huge amounts of small files (1-20kbytes) uploaded virtually continuously all day long into servers in 3 different datacenters. Here is a monitoring pic (amount of files uploaded this week) from one of the servers:



There is a small caveat though - eblob stores data and index on disk and also fast index in RAM. With such amount of objects written eventually we will run out of memory for fast index, which in turn will force us to setup more servers, which sometimes is not that simple (mostly because of absence of physical space in datacenters).

Eblob and on-disk vs in-ram lookup indexes

To eliminate this problem I mapped fast index (hash table) into on-disk file, but because of serious fragmentation issues this ended up with huge slowdown - mostly to move disk heads from the beginning of the map file to its end to find out needed object. Although this was O(1) in terms of number of records, it still required fair number of disk seeks to iterate over hash table to get single record. So, it was dropped.

Now system is quite different - I sort on-disk index after blob is completed, which is quite fast operation (index file always fits free memory), then it is written to disk and all records placed there are removed from in-memory fast index. File is mapped and binary search is used to find data records.
Currently we do not propagate frequently accessed records into fast in-memory index, but will do in next releases.
Also if there is an old on-disk index it is not converted on startup, which should be changed too.

We did not yet test new lookup scheme performance, it is scheduled for the end of the week or next one. It is not that simple to generate a hundred of millions of records quickly, so likely I will use existing data first (it requires sorting of the existing on-disk indexes).

Eblob as append-only and overwrite in-place storage

Besides on-disk index eblob also got overwrite support - when new records is less than object (including alignment) already written with this key, it will be overwritten in place instead of appending it at the very end. This feature should eliminate quick metadata grows, since we update it each time object is checked or written.

Language grammatics

I plan to implement mapreduce on top of column data store used in elliptics, and want to create a rather trivial language for that. In particular I plan to use boolean and arithmetic operations, regex string search and generic read/write primitives mostly to eliminate data dereference (kind of string class). Maybe it will even contain loops and functions.
So far I wrote a simple reenterable grammatics parser using Flex/Bison, which supports multiline strings in Flex among others, but grammatics is yet very trivial and basically unusable for language parsing. I will experiment with it more, since I want to add it into elliptics sooner than later.

Those a plans and infos. Stay tuned!

Eventual consistency and data correctness

Tagged:  

With eventual consistency there is always a window when written data can not be read because of node/network failure.
Depending on how frequently recovery process starts this may be not an issue especially if data is unique.

But it is possible that there is an old copy, which is not yet updated because of failed node or because recovery is not yet completed. In this case client will sometimes get this stall data which may be not consistent with what he expected after write.

To solve this problem eventual consistency systems introduce various kinds of timestamps attached to data. Reading them from multiple nodes can tell us with some degree of probability that given data is stall or valid as well as data consistency between nodes in question.

Elliptics network is such a system - eventual consistency is maintained by external application which starts recovery process on timed base. To get always last written data we have a set of API calls, which read metadata from multiple nodes in parallel and select node(s) with the latest written data.
In theory we can ask for data written to majority of nodes instead of latest write for example, but our practice shows that this behavior is not required if not harmed.

In my sweet dreams there is a nice looking wiki-like site which hosts all documentation and examples, but this is yet to become reality. In a meantime one can grab example sources and look at how this is implemented there.

Another example (C++ this time) is how to read/write data using columns.

Python code should be self containing if read binding implementation first, which will tell what those unnamed fields are like offset, size and various flags.

Eblob got Snappy compression support

Tagged:  

Kudos to Google for opensourcing it.

I did not yet test its performance, but it compresses our small common XML about 2 times.
API was extended to support per-command flag to compress data or not. Eblob stores this information in its structure, so reading detects this and automatically decompresses data.

So, to store textual data elliptics and eblob look very promising, getting that Cassandra and MongoDB do not yet support compression (although there are tickets back from 2009) :)

Enjoy!

Elliptics 2.10.0

Tagged:  

New release with huge number of improvements, the most notable of which are

  • column data storage (eblob only)
  • range requests
  • faster metadata and overall IO performance
  • server-side checksumming and verification support
  • base for ultra-fast recovery and local mapreduce
  • simplified interfaces for POHMELFS

So far this is a beta release, since not all API calls were propagated to high-level interfaces. There are some issues with recovery process - we will not yet fully finished transition process. Eblob offline defragmentation tool was removed, but online processes was not yet added into the core.

We plan to resolve this issues this week and move further.

Elliptics currently hosts not-that-large clusters (the biggest cluster spawns over about 200 nodes which live on ~20 physical servers and close to 1 Pb of data) with about hundreds of millions objects (not counting redundant copies), which covers 4 datacenters including abroad. With above features we plan to expand our usage area. Unfortunately it is quite complex to get into with elliptics usage, and that should be my main goal actually.

In a meantime we got www.elliptics.ru and www.eblob.ru, which will host documentation portal with example configs, API docs, benchmarks and so on. Many thanks to Andrey Godin.

Maybe eventually this will be a community driven project.
Anyway, stay tuned!

Column data store

Tagged:  

In a meantime elliptics got column store support.
It is implemented in eblob only though, filesystem backend does not support this feature.

I'm thinking about proper API for this feature though.
There are million of changes in the current tree, we will polish it and release new version very soon, which will draw the line between current rapid development cycles with extreme feature creation rate and steady maintenance support. Probably :)

Stay tuned, I will make new post, when things are ready for prime use.

Key range requests in elliptics network

Tagged:  

Elliptics network is a horizontally scalable key/value storage, which by default implements distributed hash table for ID spreading scheme.

But it is possible to implement your own ID generation mechanism, which can introduce data locality. Range requests in such scenarios can produce a huge win for users.

I implemented range requests (in eblob backend only), which can be kind of transformed into following SQL command:

SELECT * from eblob WHERE key >= start AND key < end LIMIT (from, num)

One can use half of the key to store master ID and another half (we use 512-bits keys, but it can be changed at compile time) for each object client wants to attach to selected master ID, for example if we name client X as key '123456.000000' then all its data can be stored with keys 123456.000001, 123456.000002 and so on
To get all data objects with 'master ID' 123456 in one request one may execute range query starting with 123456.000000 and ending with 123457.000000

For example using python binding it looks like this (different range is used here):

$ cat dnet/bindings/python/range.py
#!/usr/bin/python
# -*- coding: utf-8 -*-

from libelliptics_python import *

log = elliptics_log_file("/dev/stderr", 8)
n = elliptics_node_python(log)

group = 2
n.add_groups([group])
n.add_remote("localhost", 1025)

r = elliptics_range()
r.start = [1, 2, 3, 4]
r.end = [0xcc, 0xdd]
r.group_id = group
r.limit_start = 2
r.limit_num = 2

print n.read_data_range(r)

API hides cases when requested range is handled by different hosts, although this should be very rare case, it is supported in the underlying methods.

Range requests are supported in C, C++ and Python bindings as well as usual IO methods.

Next huge step is fair column data store. I'm open for ideas about better implementation.

Mapped file index, on-disk sorting, range requests and per-column data store

Tagged:  

Mapped file for fast in-memory index

I started to test eblob - append-only storage used in elliptics - with mapped file for its in-memory index. Since it is a hash table, I decided to reduce number of seeks in each hash bucket search, so I implemented array of elements and not a classical list. We reallocate array each time collision happens to add another entry at the end.

Unfortunately this tends completely not to scale. Eventually defragmentation of the mapped file becomes so huge (and it becomes clear just after several tens of millions of records) especially if we preallocate large enough hash bucket arrays (like for 5-10 elements), that write of element at the beginning of the hash table tends to flush write at the end of the mapped file, which in turn requires seeks and seeks and more seeks.

After about 40-50 millions of entries in single mmap file kernel flush thread starts eating 100% CPU and write performance degrade to hell. Less than that it works with the same spped as glibc malloc() though.
This idea has to be rethought.

On-disk sorting and range requests

To implement range requests in elliptics I decided first to optimize on-disk eblob storage. Range requests likely will not be supported for filesystem IO backend (where every object is a separate file), eventually I will drop it.

So, I imlemented offline on-disk sorting tool, which uses eblob iterators (C++ interface) and its speed equals to one half of write speed of underlying disk array, since it writes to-be-sorted data 2 times - temporal tables and merged result.
Sorting tool uses combination of in-memory quick sort (std::sort() to be precise) and writes resulted less-than-ram blocks into temporal files, which are in turn merged the way merge sort works.

12.5 millions of records (written objects from 1.5 to 20 Kb each, ~100 Gb total) were sorted with 22-25 Mb/s speed, while write speed of underlying RAID array is about 45-50 Mb/s (we write 2 files at the very end - eblob with data and its on-disk small index), so it took a bit more than an hour.

Range requests will be handled by elliptics and will return data objects for separate keys. On the storage it will iterate over all keys in requested range and send data to client. If range covers multiple machines, API will automatically split request to multiple nodes.

Since by default elliptics uses hash of the name as a key, range requests will not have any reasonable meaning. But it can be very useful when keys are generated manually.
For example client may use half of the key as user login (hash from it) and second half as a written object id (hash from uploaded file for example). In this case range requests allow to get all user's data via single request. Keys can be split whatever client likes, we use 512 bit keys, so there is plenty of space there.

Column data storage

But IMHO range requests and manual key generation is a half step to the real column data store. I do want to implement this feature this summer, so we could compete with systems like Cassandra and MongoDB in calculation and processing markets. On a performance (and optionally reliability) elliptics@eblob are unbeatable already :)

Documentation

It sucks. And I mean it - it is horrible as freakin' bloody hell.
This uglymoron can be really called docuementation, but I will, I promise, I will work with people who can write good docs, so that writing applications and setting up systems would not be a rocket science.
Actually it is very simple, but still this knowledge has to be documented.

Stay tuned, and things will be even better!

Per-column data store in eblob and elliptics

Tagged:  

A simple test to iterate over 38 millions of name metadata records placed separately into own eblob gave us almost 2 times performance increase - regex took about 50 seconds.
When combined metadata (about 3-5 different types) lived in single eblob it took 80 seconds to complete.

Per-column buzz is quite trending, although above data shows that if we store data in separated databases (or eblobs in our case), it is much more efficient. Databases like Cassandra store multiple columns in single record, which albeit not optimal but is quite convenient - we get all columns via single request and then iterate over them.

I'm thinking about putting such feature into elliptics and namely eblob - server will itself update columns in selected row, although placing all columns into single record looks quite questionable for me.

Eblob iterators, mapped file index, mapreduce over column storage and elliptics metadata

Tagged:  

Eblob iterators over elliptics metadata

To continue mapreduce games I decided to check how fast will be EBLOB as a metadata storage. Since eblob is a low-level append-only system, it should allow to have very fast concurent iterators.

So I copied whole metadata database from previous test from Kyoto Cabinet (which in turn will support concurent iterators in the next version) into eblob. We have about 38 millions of records there total of 13 Gb in Kyoto Cabinet, which took 16 Gb of data and 4 Gb of index in eblob.

I used the same regex to check object names and started 16 threads to iterate over eblob storage.
Eblob iterators took 80 seconds to complete (.*).jpg(.*) regex over 38 millions of records with cold VFS cache, which is rougly sequential read speed of underlying 4-disks raid10 array.

Since our database is 16 Gb in size it is quite obvious it fits into RAM (test server has 24 Gb of memory).

Kyoto Cabinet also fits RAM (13 Gb in size), but somehow it did not take advantage of that, since its time (only single iteration thread supported in current version) is about 15 minutes

As of HDFS/Hadoop (500 millions of records / 5 Gb on disk / 130 seconds), one can try to compare our apples to their oranges as I did in previous post, but I suspect that it still will be slower.

Eblob in-memory index as on-disk mapped file

In a meantime I implemented in-memory eblob index as a mapped file on disk, which should solve problems with unused kyes in memory. It is supposed that VFS will drop to disk pages which correspond to unused entries and only actively used keys will take place in RAM. This allows to store more keys to single node than physical memory in turn.

We start to extensively test it and I plan to test more benchmarks with it later. I want to load even more entries (like those in Hadoop test) so they would not fit VFS cache and run iterators again, but expect that performance will be the same - after all we just sequentially read data from disk.

It could be also a good test to run iterators over defragmented eblob (i.e. with removed entries), but I expect that it will be limited by disk speed, since we will just skip removed entries.

Mapreduce over column eblob storage

In a meantime I found a nice way on how to store metadata in eblobs split on per-column basis. Elliptics metadata is virtually a set of columns written by client after data is placed on disk. We support checksum, object name, timestamp and several other records, although none of them are required for system work. List of copies is only required for recovery check.

If we ever move from Kyoto Cabinet as metadata storage to EBLOB I could add a special IO flag, which will force elliptics not only to write metadata, but also update per-column indexes, so we could VERY quickly find out all records within given namespace for example or with given regex name or records older than some time and so on.

I suspect that initial implementation will not be a real mapreduce, since we likely will not process reduce() operation. I.e. we will just iterate or map() records and process them in place, without storing for reduce, without sorting and so on. For our current needs this is more than enough, but it can be extended if we need a real mapreduce.

I even plan to write some kind of a language for this - after all I have a dragon book and compiler principles unread yet :))

Elliptics network and MapReduce

Tagged:  

Yesterday I implemented quck and dirty MapReduce service on top of elliptics storage. It only worked with metadata, i.e. names, timestamps and so on, and was built using Kyoto Cabinet Map Reduce subsystem, since we store object metadata in its database on the same machines which host data.

Unfortunately all Kyoto Cabinet iterators including MapReduce are single-threaded, i.e. you can access data through parallel iterators (or map/reduce processors), but can not spread the same database to multiple iterating threads.

I implemented simple regex grep on uploaded filenames. On a typical storage node containing 40 millions of objects (13 Gb database size) (.*).xml regex took about 15 minutes (single map thread) to complete, which is roughly 44 krps. Smaller databases took less time to iterate, for example database with 200 thousands records can be 'parsed' with the same regexp in 2.5 seconds.

I found Hive@HDFS benchmark, where 500 millions of objects (50 Gb of data) were spread to 380 mappers (likely multiple machines), and similar regex took 130 seconds to complete, which is about 10 krps per mapper.

Unfortunately Kyoto Cabinet Map Reduce is not concurrent and I'm not sure that its internal structure can be easily made parallel for iterators. But eblob - our fast append-only subsystem used by elliptics as underlying storage, contains all its on-disk indexes as simple arrays (it is 'converted' into hash table in memory at startup though), which can be trivially iterated in parallel. In particular, uploading index into hash table in memory can be run in multiple iteration threads.

So I'm thinking about moving metadata from KyotoCabinet into out eblob storage (we can put it before or after data in the same chunks), and implement fair MapReduce in elliptics which can start multiple mappers on top of local data (as well as multiple mappers on different servers of course).

I believe I have to run some tests in particular put metadata for aforementioned 40 millions of objects into eblob storage and run multiple MapReduce iterators on top of our array-like indexes on single machine. If results will be multiple times faster that single-threaded Kyoto Cabinet MapReduce iterator, then this is the way to go.

I do not plan to implement complex Hive-like language for MapReduce processing, but several simple enough commands (like comparison of timestamps and regexes for namespace and object name matches) should be enough for start.

If it will work well, we can even implement MapReduce processors for data (all above is for metadata only so far), but this is a long-term idea to date.

Sounds like a plan?

Elliptics and checksums

Tagged:  

Current version supports data checksums calculated on every metadata write (happens after whole object is written to disk). By default we verify checksum for every READ and LOOKUP - this command returns link to requested object, which includes address of one of the hosting servers, file path on it, permission bits, object size and offset (if it lives within blob) and many other fields.

If checksum does not match error is returned. One can turn checksum verification off with per-request flag.
Check utilis do not yet support this feature, but we will teach them soon.

In a meantime we started to implement migration tool - this will be elliptics wrapper which will be able to move current group from its hosting machine to remote server. It will generate appropriate configs, copy all data and start new elliptics node instead of current one on another server. Very suitable for manual and automatic node migration from datacenter to datacenter.

Streaming from blobs

Tagged:  

Elliptics IO backend is a modular subsystem, which allows to implement different data storages which will be handled by the library core.

In particular we support EBLOB - storage module designed after Facebook's Haystack. We do not run for their data volumes, so eblob has only single index level, which can be locked in memory. This ends up with guaranteed single seek to get your data in the worst case. I posted its benchmark and design notes.

One of the most wanted feature from the storage is ability to stream data to clients. Kind of youtube/whatever. If you store your data in files, it will be very slow (millions of files on ext4 filesystem end up with at most 100 RPS from 4-disks raid10 storage). The same eblob configuration allows to have almost order of magnitude more.

Simple streaming elliptics storage looks like this: we have a set of nodes (one per server or per disk), which are combined into elliptics storage. There is a web server running on the same storage machines. There is a proxy web server(s) which controls client requests and returns 'links' to the actual data, which can be streamed directly from server nodes.

So, client asks proxy about its file. Proxy returns XML describing full path, size and offset of the requested data as well as address of one of the servers, which host requested content. Client sends usual HTTP request to given address (this is a web server on storage node, which runs FLV streamer for example) and gets its stream from provided offset.

But streaming is dangerous. Blobs are append-only storages, where other client's data is placed somewhere behind yours. So, client can hack request and ask not only his own data, but also something else. That's why proxy signs XML data it returned using provate key. Streaming server has to check this sign and confirm that requested size and offset are valid. This should be done in another simple web server module.

That's it. Among other info we also return file's owner, permission bits, access, create and modify time and so on.
This will be used in POHMELFS for inode info.

It is of course possible to return data directly from proxy when client asked for this. It works great with pictures and other small objects. But gigabyte videos and even smaller mp3 files can not be effectively proxied to multiple clients.

In a meantime we started to write simple migration tool - it will allow to move content of selected elliptics node to different location without administration intervention.

Real-time network state weights

Tagged:  

I added a simple quality metrics into elliptics network read calls - we can select network states (which eventually correspond to remote nodes) according to their weights, which are updated for all read transactions.

If completion time is less than median, we increase weights of the state (and vice versa), so next time we select host to read data from, given node will have a higher priority.
Weights are also updated (halved actually) when state hits a timeout - each transaction should be processed within specified time frame (5 seconds by default) otherwise state is marked as stall. 10 subsequent stall marks means state should be destroyed. Keepalive can not always take care about such states, since network packets from kernel can be passed, but userspace data will not reach remote node's processing code, for example, because of very high IO.

IO thread model

Tagged:  

Elliptics started to use pool of threads for network IO as well as disk operations. It removed CPU bottleneck on hugely loaded servers, which were not capable to copy data from memory or even file descriptor into non-blocking sockets in single network processing thread - after all, having several tens-to-hundreds disk processing threads which send their data through single queue/network thread - this will overrun pretty quickly.

Now we have configurable number of them, which are assigned in round-robin fashion to accepted sockets.

We started to work on bulk check operations - currently single group with millions of records will be checked several days and most of the time it will take to fetch records to check its timestamp or object presence. Now it is going to be much faster and will reduce number of operations to check a key.
Namely servers will exchange diff of the key ranges, and check-initiating node will compare and save timestamps of all copies locally (as well as send updates to the outdated nodes) - this will be kind of vector clocks maintained for elliptics groups.

I decided to drop transactions support from elliptics. They were not real transactions - just some obscure way to map multiple data objects into single key. Although it allowed to create snapshots and versions easily, but this functionality can still be implemented on top of low-level elliptics code.

Now we use rewrite-in-place always and are going to clean up database and storage code as well as API calls. Finally we started to drop features from the project, not add them. This will also greatly simplify any external code (like kernel side) which will want to work with elliptics storage and is not able to use our library.

But things are not that shine - not only we stopped adding features, but I do plan to add secondary index very soon, which will allow to have real-time prefix search of the object names (and corresponding data keys), written into the storage.
It can be used by external workers like statistics or search engine.

Another recovery tool is in a process of developement - we want to slowly parse logs and find out records which are known not to be present in some groups, i.e. we can determine whether some object lost one or more of its copies. This is not 100% bullet-proof method to recover missed data, but it is significantly easier for the whole cluster - we do not need to transfer whole hundreds-of-millions bunch of keys.

Syndicate content