Tag Archives: libeblob

The year of eblob

Hi, this is rbtz speaking again. I’ve been the engineer responsible for the eblob codebase for
almost a year now. Here is a small recap of what has been happening with eblob
since v0.17.2, with some commit highlights.


v0.17.2
* Eblob now builds under Mac OS X. This improved the experience of developers working on Macs.
* Changed most links to point to the newly created http://reverbrain.com.
* Added comments to all eblob subsystems: e254fc3. This eases the learning curve for new developers.
v0.17.3
* Added l2hash support: c8fa62c. This reduces memory consumption of elliptics metadata by 25% on LP64.
* Added the first edition of the eblob stress test. A year later it is responsible for catching 99% of the bugs that would otherwise go to testing: 8eab8ed.
v0.17.4

v0.17.6
* Added config variables for the index block and bloom filter: a106d9d. This allows sysadmins to limit the memory occupied by the bloom filter.
* Added a config variable to limit total blob size: f7da001. This allows sysadmins to limit eblob size in case many databases are located on one shared drive.
v0.17.7
* Reduced memory consumption of “unsorted” blobs by 20% on LP64: 19e8612
v0.17.8
* First static analyzer crusade (feat. clang static analyzer) – a number of “almost impossible to spot” bugs found.
* Added data-sort and binlog v1. This allows “on the fly” eblob defragmentation and memory cleanups.
* Added travis-ci tests after each commit: f08fea2.
* Removed the custom in-memory cache in favor of the OS page cache: a7e74a7. This removed a number of nasty races in eblob code and also opened the way for some future optimizations.
* Added a Doxyfile stub, so that in the future libeblob man pages may be autogenerated: aac9cb3.
* Decreased memory consumption of in-memory data structures by 10% on LP64: c6afffa.
v0.18.0

v0.18.2
* Replaced core mutexes with rwlocks. This improves our Intel vTune concurrency benchmarks, along with our QA tests.
v0.18.3

v0.18.5-1
* Second static analyzer crusade (feat. Coverity).
* Switched to adaptive mutexes (https://en.wikipedia.org/wiki/Spinlock#Alternatives) when available: 43b35d8.
* Sped up statistics update v1: 40a60d7. Do not hold the global lock while computing and writing stats to disk.
* Rewritten bloom filter v1: 6f08e07. This improves speed and reduces memory fragmentation.
* Allocate index blocks in one big chunk instead of millions of small ones, thus speeding up initialization and reducing memory fragmentation: b87e273.
v0.18.6
v0.19.0
* Do not hold the global lock for the whole duration of sync(): 6f6be68. This removes “stalls” in configs where sync > 0.
v0.19.1
* Switched to POSIX.1-2008 + XSI extensions: 6ece045.
* Build with -Wextra and -Wall by default: 0e8c713. In the long term this should substantially improve code quality.
* Added options to build with hardening and sanitizers: c8b8a34, 2d8a42c. This improves our internal automated tests.
v0.19.2

v0.19.7
* Do not set bloom filter bits on start for removed entries: 36e7750. This improves lookup times for “long removed” but still not defragmented entries.
v0.19.8
v0.19.9
* Added a separate thread for small periodic tasks: ea17fc0. In the future this can be upgraded to a simple background task manager.
* Moved statvfs(3) to the periodic thread, which speeds up write-only micro benchmarks by 50%: f36ab9d.
* Lock the database on init to prevent data corruption by simultaneous access to the same database by different processes: 5e5039d. See more about EB0000 in the kb article.
v0.19.10
v0.19.11
* Removed columns aka types: 6b1f173. This greatly simplifies code and as a side effect improves elliptics memory usage and startup times.
* Removed compression: 35ac55f. This removes the dependency on Snappy.
* Removed the bsize knob for write alignment: 8d87b32.
* Rewritten stats v2: 94c85ec. Stats updates are now very lightweight and atomic.
* Added a writev(2)-like interface to eblob, so that the elliptics backend can implement very efficient metadata handling: b9e0391.
v0.20.0

v0.21.3
* Replaced the complex binlog with a very tiny binlog v2: 1dde6f3. This greatly simplifies code and improves data-sort speed and memory efficiency.
* Made tests multithreaded: 1bd2f43. Now we can spot even more errors via automated tests before they hit staging.
* Moved to the GNU99 standard: f65955a. It’s already 15 years old =)
* Fixed a very old bug with log message truncation/corruption on multithreaded workloads: 10b6d47.
v0.21.4
* Bloom filter rewrite v2: 1bfadaf. Now we use many hash functions instead of one, trading CPU time for improved IO efficiency. This improved bloom filter efficiency by an order of magnitude.
v0.21.5
* Merge small blobs into one on defrag: ace7ca7. This improves eblob performance on databases with high record rotation by maintaining an almost fixed number of blobs.
v0.21.6
v0.21.7
* Added a record validity check on start: bcdb0be. See more about database corruption EB0001 in the kb article.
v0.21.8

v0.21.11
* A more robust eblob_merge tool that can be used to recover corrupted blobs.
v0.21.12
v0.21.13
* Reduced memory consumption of in-memory data structures by 10% on LP64: e851820.
v0.21.14

v0.21.16
* Added schedule-based data-sort: 2f457b8. More on this topic in a previous post: data-sort implications on eblob performance.
v0.21.17

Here I’ve mentioned only the most notable commits, mostly performance- and memory-usage-oriented changes. There is of course lots of other stuff going on, like bugfixes, minor usability improvements and some internal changes.

Here are some basic stats for this year:

Total commits: 1375
Stats total: 65 files changed, 8057 insertions(+), 4670 deletions(-)
Stats excl. docs, tests and ci: 39 files changed, 5368 insertions(+), 3782 deletions(-)

Also, if you are interested in what’s going to happen in the eblob world in the near future, you should probably take a look at its roadmap.

By the way, for those of you who are interested in numbers and pretty graphs – after a recent upgrade of our internal elliptics cluster (storing billions of records) to the new LTS releases of elliptics 2.24.X and eblob 0.22.X we’ve got:

Response time reduction (log scale):

98th percentile that was around 100 ms dropped below 50 ms.

Disk IO (linear scale):

IO dropped by more than one order of magnitude.

Memory (linear scale):

There is much more “cached” memory now. Also the periodic data-sort routine successfully frees unused cached keys.

Eblob as append-mostly database

In this post I want to dig a little bit deeper into one of eblob’s least
understood parts: record overwrite. Let’s start with how overwrite was
meant to work in eblob: it simply shouldn’t. Indeed, eblob started as an
append-only database, so no overwrites were allowed. Then overwrite was added
to reduce fragmentation for use cases where some records were continuously
overwritten, so that appending them over and over again was inefficient. So a
special “overwrite-in-place” case for same-or-smaller-size overwrites
was added. It was controllable via two knobs – a global per-database flag called
EBLOB_TRY_OVERWRITE and a per-write flag
BLOB_DISK_CTL_OVERWRITE. Both of them are now deprecated; let me
explain why.

From this point on eblob became what I call an “append-mostly” database, which
means it usually only appends data to the end of the file, but in some rare cases it
can overwrite data “in place”.

During the binlog rewrite it was decided that having flags that drastically change
eblob write behavior is too complicated, so they were replaced with a single
“canonical” write path: we try to overwrite a record in place only if the new
record is of the same or smaller size and the blob is not currently under data-sort.
So we basically turned the EBLOB_TRY_OVERWRITE flag on by default, which should
greatly reduce fragmentation on some workloads. There are possible performance
implications of that decision, so we did a lot of testing via our internal QA before
applying this change.
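
As a rough illustration of that decision (a minimal sketch with made-up names, not eblob’s actual code), the write path now boils down to something like this:

# Hedged sketch of the "append-mostly" write decision; names and layout are illustrative.
class Entry(object):
    def __init__(self, offset, reserved):
        self.offset = offset        # position of the record inside the blob
        self.reserved = reserved    # space reserved on disk for this key
        self.removed = False

class Blob(object):
    def __init__(self):
        self.data = bytearray()
        self.index = {}             # key -> Entry
        self.under_datasort = False

    def write(self, key, payload):
        old = self.index.get(key)
        if old and len(payload) <= old.reserved and not self.under_datasort:
            # same-or-smaller overwrite: reuse the already reserved space in place
            self.data[old.offset:old.offset + len(payload)] = payload
            return old
        # canonical path: append to the end; the old copy (if any) becomes garbage for defrag
        if old:
            old.removed = True
        entry = Entry(offset=len(self.data), reserved=len(payload))
        self.data += payload
        self.index[key] = entry
        return entry

b = Blob()
b.write("key", b"hello")
b.write("key", b"hi")       # fits into the reserved space, overwritten in place
b.write("key", b"hello!!")  # larger, appended to the end of the blob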

As a side effect this allowed us to remove 90% of the binlog code, since there is no
longer any need to “log” data overwrites and keep track of their offsets in the original
blob in order to apply them later to the sorted one.

By the way, we considered three possible alternatives:
– Always append data to the end of the “opened” blob on overwrite. (True append-only database)
– Overwrite in place only in the last blob. (Almost append-only)
– Always overwrite in place when the blob is not under data-sort. (Append-mostly database)

Here are some graphs from our benchmark system yandex-tank (Yay! It’s open source).

Setup:
10000 rps, 1kb records, eblob 0.21.16

Always append:

(no-overwrite) read:write 10k eblob_0.21.16 flags 14

Always append

(no-overwrite) read:write 10k eblob_0.21.16 flags 142

Always append + EBLOB_AUTO_DATASORT

Overwrite in last blob only:

(partial-overwrite) read:write 10k eblob_0.21.16 flags 14

Overwrite in last blob only

(partial-overwrite) read:write 10k eblob_0.21.16 flags 142

Overwrite in last blob only + EBLOB_AUTO_DATASORT

Always overwrite:

(overwrite) read:write 10k eblob_0.21.16 flags 14

Always overwrite

(overwrite) read:write 10k eblob_0.21.16 flags 142

Always overwrite + EBLOB_AUTO_DATASORT

We’ve picked “always overwrite in-place” because it behaves really well with
background data-sort – as you can see, there is almost no noticeable difference
between the graph with data-sort in AUTO mode and the one with data-sort disabled.
A single common write path also simplifies the code and makes performance independent
of “magic knobs”.

NB! Future eblob versions may change this behavior or reintroduce some of the
knobs in case regressions of the “always overwrite” approach are found.

Data-sort implications on eblob performance

Lots of stuff has been written about data-sort and defragmentation in recent
eblob versions (>= 0.18.0), both in the documentation and in blog posts. Today I
want to talk about eblob memory management and data structures, and why regular
data-sort is essential for eblob/elliptics performance.

First, when a key is written to the data file, its basic information – the key itself,
the size of the record, and its location on disk – is stored in the in-memory index
(internally an rb-tree) and also written to the so-called “unsorted” index. So both the
data file and the “unsorted” index have records sorted by their write time, but we can
still very efficiently find a given key in the in-memory index or iterate over the
“unsorted” index, because the order of records matches the one used in the data file.
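
For illustration, the write path described above might be sketched like this (a toy model; the real on-disk entry layout and sizes differ):

# Toy sketch of the eblob write path: data file + "unsorted" index + in-memory index.
import io, struct

KEY_SIZE = 64
INDEX_ENTRY = struct.Struct("<%dsQQ" % KEY_SIZE)    # key, record size, offset in the data file

def eblob_write(datafile, indexfile, memindex, key, payload):
    offset = datafile.tell()
    datafile.write(payload)                                    # record body goes into the data blob
    indexfile.write(INDEX_ENTRY.pack(key.ljust(KEY_SIZE, b"\0"),
                                     len(payload), offset))    # "unsorted" index, in write order
    memindex[key] = (len(payload), offset)                     # in-memory index (an rb-tree in eblob)

data, index, mem = io.BytesIO(), io.BytesIO(), {}
eblob_write(data, index, mem, b"key-1", b"value-1")
eblob_write(data, index, mem, b"key-2", b"value-2")
print(mem[b"key-2"])    # -> (7, 7): the second record starts right after the first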

But keeping all keys in memory is not always possible (especially when you have
billions of them). So on each startup eblob sorts by key all “unsorted” indexes but
the last one, so it can use more efficient data structures instead of keeping all
keys in the in-memory rb-tree. Memory-efficient as this is, it breaks the record
ordering between the data file (records sorted by write time) and the new “sorted”
index (records sorted by key). This makes iteration over such a blob very inefficient
(consuming way too many IOPS).
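
A minimal sketch of that lookup (illustrative only; the real sorted index is a binary-searchable on-disk structure rather than Python lists):

# Toy sketch: sort the "unsorted" index by key, then look keys up with a binary search.
import bisect

KEY_SIZE = 64

def sort_index(entries):
    # what eblob does on startup for every "unsorted" index except the last one
    return sorted(entries, key=lambda e: e[0])

def lookup(sorted_entries, sorted_keys, key):
    i = bisect.bisect_left(sorted_keys, key)          # binary search, no in-memory tree needed
    if i < len(sorted_entries) and sorted_entries[i][0] == key:
        return sorted_entries[i]                      # (key, size, offset) -> read the data file
    return None

# entries in write order: (key, size, offset)
entries = [(b"b" * KEY_SIZE, 10, 0), (b"a" * KEY_SIZE, 20, 10)]
sorted_entries = sort_index(entries)
sorted_keys = [e[0] for e in sorted_entries]
print(lookup(sorted_entries, sorted_keys, b"a" * KEY_SIZE))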

To mitigate those problems the data-sort routine was introduced. It serves three purposes:
* It purges records from the in-memory index, so that recently unused keys won’t
occupy precious RAM and cause OOM on write-heavy workloads.
* It defragments data by physically deleting keys that were marked as “removed”.
It’s analogous to the compact procedure of some NoSQLs or SQL’s VACUUM command.
* It restores record ordering between the data file and the index, so that iteration speed
is maximized.

As fast, efficient and useful as it is, data-sort is a rather heavyweight routine:
it can theoretically move terabytes across the drive, so it’s rather unwise to run
it during peak hours. Given that, a number of knobs were introduced so that the
administrator can manage when data-sort starts.

Starting with eblob v0.21.17 and elliptics v2.24.14.11 the admin may select between
four different methods of managing data-sort start times.

  • AUTO – run data-sort on startup for each “unsorted” blob (but the last one). Also run it on every blob “close”. This is the preferred method for small databases.
  • TIMED – the old way of running defrag every defrag_timeout seconds. It’s useful for autogenerated config files where each server gets its own timeout.
  • SCHEDULED – the most sophisticated built-in method. It automagically spreads data-sort load across nodes in time based on the given defrag_time and defrag_splay, so that each node selects some random time in the range [defrag_time - defrag_splay, defrag_time + defrag_splay] hours (see the sketch after this list). This is the preferred method for medium/big clusters.
  • NONE – when none of the given options is selected, one must periodically run data-sort via the provided API – for example, with elliptics one can use the dnet_ioclient -r host:port:family -d start command to run defrag on the node given by the host:port:family tuple. This is the preferred method for very big clusters that require some external synchronization based on e.g. replica location, current time or node load.
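
Roughly speaking (a sketch of the idea only, not the actual implementation), SCHEDULED mode boils down to each node doing something like this with those two knobs:

# Toy sketch of SCHEDULED mode: spread data-sort start times across nodes.
import random

def pick_datasort_hour(defrag_time, defrag_splay):
    # each node independently picks a random start hour in
    # [defrag_time - defrag_splay, defrag_time + defrag_splay]
    return random.randint(defrag_time - defrag_splay,
                          defrag_time + defrag_splay) % 24

print(pick_datasort_hour(defrag_time=3, defrag_splay=2))    # e.g. somewhere between 01:00 and 05:00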

NB! Failure to run data-sort periodically will lead to excessive memory
usage, wasted HDD space and slow iteration (e.g. dnet_recovery) speed.

For more information see:

Eblob got new automatic defragmentation

Eblob is the low-level storage for the elliptics distributed network.
It is an append-only system (with rewrite support though) with tricky disk and memory indexes, which allow very high IO performance.
But everything comes at a price – when a key is deleted, it is only marked as removed in eblob, and later a defragmentation tool has to restructure the blob to actually free that space.

For quite a while this functionality was disabled – there was only an offline tool, which required a node restart and manual admin work. Admins do not like such work and I cannot blame them – it is a rather annoying task to defragment a couple of hundred nodes and then run recovery (since while a node was offline its replicas could get updates, which then have to be synchronized back to that node).

Now eblob has automatic defragmentation again. It is a rather costly task, so it is recommended to run it rarely (the timeout is specified in the config), like once per several hours. Moreover, only a single blob will be processed in one timeout slot, since eblob reserves free space for only one blob – this way the defragmented ‘copy’ can live in the same place as the data blobs.

So far our tests show that things work as expected, so we are starting to test this in our large clusters.

5 billions

That is the number of objects in one of our clusters – it hosts avatars and other rather small objects (hundreds of bytes to several kilobytes).
This number of objects is hosted on just 2 nodes, each with 24x2Tb disks and 48Gb of RAM; only 6Tb are actually used.

We should multiply it by 3, since that’s the number of objects in the whole cluster spread over 3 datacenters in the Moscow region.
The load is quite small though, about 500 rps of reads and about the same for simultaneous writes.

POHMELFS and elliptics server-side

POHMELFS in the meantime got a ‘noreadcsum’ mount option as well as some other tricky bits.

This bumps read performance several times, since eblob (the elliptics backend) stores data in contiguous chunks, increasing read/write performance, but optionally forcing data to be copied when the prepared space is not enough.
Reading with checksums enabled forces checksumming of the whole written area, which in turn requires populating it from disk into the page cache and so forth. For a gigabyte file this takes about 5-6 seconds (the first time).

And that’s only to read one page – i.e. it happens for every page (or readahead-requested chunk).
I’m not sure that disabling read checksumming is a good idea, so I made it a mount option. Maybe eventually we will end up with some better solution.

I also fixed a nasty remount bug in POHMELFS which uncovered a really unexpected (for me at least) behaviour of the Linux VFS.
Every inode may have a ->drop_inode() callback, which is called each time its reference counter reaches 0 and the inode is about to be freed.
But sometimes, when an inode was recently accessed, it is not evicted, but placed into an lru list with special bits set.
The inode cache shrink code (invoked from the umount path in particular, but likely callable from the memory shrink path too) grabs all inodes in that lru list and later calls plain iput() on them, which in turn invokes ->drop_inode() for the inode in question.

Thus it is possible to get multiple invocations of this callback without reinitializing the inode between them. This crashed pohmelfs in some cases. Now it is fixed, with an appropriate comment in the code, but I wonder how many other such tricks are yet to be discovered?

POHMELFS is a great tester for elliptics’ server-side scripting support. I was lazy and put all the somewhat complex processing into server-side scripts written in Python. Anton Kortunov implemented a simple sstable-like structure in Python for the directory entries used by pohmelfs.

Since every command is processed atomically (on a single replica) in elliptics, we can put the complex directory update mechanism into these ‘kind-of-transactions’. In particular, server-side scripting is used to insert and remove inodes from a directory. Lookup is also implemented using server-side scripting – we read the directory inode in Python code, search for the requested name, and return the inode information if something is found, which is then sent back to pohmelfs.

Overall this takes about 2-13 msecs: receive a command from pohmelfs, ‘forward’ it to the pool of Python executors (the srw project), where the Python code reads the directory inode data from elliptics (using elliptics_node.read_data_wait()), searches for the inode with the given name and sends it back to pohmelfs.
Insert takes about 30-150 msecs – the script reads the directory content, adds a new entry (or updates an old one) and then writes it back into the storage.

That’s how it looks in python – srw/pohmelfs_lookup.py
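
The script itself is not reproduced here, but the idea is roughly the following (a hedged sketch: the directory entry format and the FakeNode stand-in are my assumptions; only read_data_wait() is the binding call mentioned above, and its arguments are simplified):

# Rough sketch of a server-side directory lookup; the directory format here is made up.
def lookup(node, dir_key, name):
    raw = node.read_data_wait(dir_key)     # read the serialized directory object (args simplified)
    for line in raw.splitlines():
        # assumed toy format: one "name<TAB>inode" pair per line
        entry_name, _, inode = line.partition("\t")
        if entry_name == name:
            return inode                   # sent back to pohmelfs by the srw wrapper
    return None

class FakeNode(object):
    # stand-in for the elliptics node object available to server-side scripts
    def read_data_wait(self, key):
        return "etc\t1001\nhome\t1002\n"

print(lookup(FakeNode(), "dir-key", "home"))   # -> 1002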

Given that we spend about 10 msecs in such a not-really-trivial piece of code, I believe my implementation is actually not that bad.
That’s the recent news. Stay tuned for more!

Recent elliptics changes

It has been a while since I last wrote here, but a fair number of things have happened.

First, elliptics.
We fixed a number of bugs in the server-side scripting implementation, so it is very stable now and easily allows creating complex scripts which perform not only basic tasks but also rather complex transactions, like read/process/update of multiple records.

We added transaction locks, which allow performing operations on a given key atomically on a single replica. In this case your server-side script, which reads data, updates its structure and uploads it back, will be performed atomically on a single replica.

There are many other distributed storage systems which allow atomic key operations, but they do not support datacenter replica split. Usually it is implemented (as in Oracle NoSQL or MongoDB) with a single master node, which copies data to multiple replicas. This operation may even be synchronous and safe.
And it is possible to put replicas into different datacenters. But what happens when you want to add more datacenters, but do not want to increase the number of copies? In this scheme one has to rebalance the replica sets.

In elliptics you just say which datacenters you want to use for a given IO transaction. This forces non-atomic replica updates, which happen in parallel. This allows faster writes and a more flexible setup, but we cannot guarantee that replicas are updated synchronously.

And actually, with the MongoDB scheme it is possible that replicas will not be in sync, since there may be a failed node which lost some data, so reads from this node will not return consistent results.

The real fix for this problem lies outside of the low-level storage subsystem. It is like forcing a block device to lock against multiple processes writing into the same file. Instead, this protection lives in higher layers and generally requires external locks, which are both known and used by the concurrent users.

In the distributed world this algorithm is based on Paxos with optimizations (like Fast Paxos or ZooKeeper Atomic Broadcast and friends). We do not yet have such a subsystem.

Another change is embedded checksums. Previously we stored them in blobs after each data record, but never used them. Instead there was a separate command to calculate and store the checksum in metadata (a separate column). This could result in a stale checksum stored in metadata, so a parallel read could fail with an inconsistent-data error. Right now metadata is just a storage for replica information, update/check status dates and other meta information (namespace and so on).
Each write updates the checksum in the blob atomically (again, this is not synchronized between replicas), so any read will always get a correct checksum if the data was not corrupted.

We also added bulk IO operations, namely read and write. Bulk write issues many parallel writes and waits for all of them to complete, thus essentially being as slow as the slowest node, while doing the writes sequentially would take as long as the sum of all the writes in question.
Bulk read does a similar thing, but additionally splits and optimizes reads using offset sorting (i.e. so that we read data from disk in sequential order).

Elliptics vs HBase on hundreds of millions of small records

We recently ran a test on 215 million production records, which we wanted to store on a single server.
Objects are rather small, about 200 bytes; the server hardware is quite common in our environment: 24 Gb of RAM, a handful of mostly unused CPU cores, 4 SATA disks of 1-2 Tb in RAID10.
Unfortunately, due to configuration issues there were only 2 working disks in the elliptics server (the RAID near layout sucks), while HBase had its load spread fairly over all 4 disks.

Usage case: random reads, random range reads.

Out of the box elliptics does not _yet_ provide a good enough on-disk index, so a lookup is usually a binary search, which is costly. So we warmed the index files by reading them into /dev/null.

Upload speed was roughly the same in HBase and Elliptics – 2.5-3 Krps. But using HBase batch upload it was possible to write up to 30,000 objects per second; objects were packed into 5000 blocks. Elliptics does not yet support such a batch upload mode.

So, reading. First of all, we did a plain run of a 200 random-IO rps test. Its duration was about 10 minutes.
Elliptics showed about 30 ms median reply times. HBase was closer to 100-150 ms.


Elliptics data (timing scale is wrong, though, median time is 30 ms)


HBase data

The second test was set up to find the maximum RPS rate of random IOs, starting from cold caches and data.
I will not post graphs though, only numbers. Graphs on demand.
Elliptics showed 115 RPS within 100 ms each.
HBase reached almost 300 RPS, but with a 100-200 ms median.

And that was with only 2 working disks in the elliptics setup (blame me), while HBase used 4 disks in RAID. Here are the proof pictures (the first one is the elliptics dstat, the second – HBase).

HBase also supports index compression, which ended up giving 1500 RPS of random IO within 100 ms. Pretty good numbers for a single server with 215 million records. Our objects are easily compressible, so HBase’s database occupied only about 15 GBs of space.

Elliptics does not compress the index, only data, so compression would not help us here. Instead we plan to replace the current index (a binary search on disk – roughly the same thing happens in every filesystem with a b-tree when you are looking for a file by name). The new scheme will cache every N’th key in RAM, together with the ID range stored behind that key. Keys are rather small – 64 bytes by default (specified at compile time) – so we could pack a bunch of them into a 4k page, for example. Reading this page from disk takes a single IO, and searching for the key in this page (read into RAM) is very fast.
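
A toy sketch of that idea (illustrative only; the real layout, page size and key format are different): keep every N’th key in RAM, and spend one IO to read the page of keys it points to.

# Toy sketch of a sparse index: keep every PAGE-th key in RAM, one IO per lookup.
import bisect

PAGE = 64                      # keys per on-disk page (e.g. 4k page / 64-byte keys)

def build_sparse_index(sorted_keys):
    # keep only every PAGE-th key in RAM, remembering which page it starts
    return [(sorted_keys[i], i // PAGE) for i in range(0, len(sorted_keys), PAGE)]

def lookup(sparse, read_page, key):
    i = bisect.bisect_right([k for k, _ in sparse], key) - 1
    if i < 0:
        return None
    page = read_page(sparse[i][1])          # single IO: read one page of keys from disk
    return key if key in page else None     # in-RAM search within the page

keys = sorted("key-%06d" % i for i in range(1000))
pages = [keys[i:i + PAGE] for i in range(0, len(keys), PAGE)]
sparse = build_sparse_index(keys)
print(lookup(sparse, lambda n: pages[n], "key-000123"))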

This optimization alone will kick HBase’s ass, I think. Given that HBase cannot safely live in a multi-datacenter environment, elliptics will fit all needs for huge-number-of-small-records data storage.
But there is also a plan to add a bloom filter for faster detection of whether a given key is present in a blob or not.
Since a blob file is ‘closed’ for writes after reaching its maximum size or number of records, and no more writes go into it (except when overwrite is turned on or a deletion sets the removed bit, but neither changes the keys), it is possible to create a rather static and very optimized index. This actually follows the HBase index design, where its blocks are immutable and so are the index ranges.
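
To illustrate what a bloom filter buys here (a toy sketch, not eblob’s eventual implementation; the hashing scheme is an assumption): a small bit array can answer “definitely not in this blob” without touching the on-disk index, at the cost of occasional false positives.

# Toy bloom filter sketch; eblob's real filter and hash functions differ.
import hashlib

class Bloom(object):
    def __init__(self, nbits=1 << 20, nhashes=7):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8 + 1)

    def _positions(self, key):
        # derive several independent bit positions by salting the hash
        for i in range(self.nhashes):
            h = hashlib.sha1(("%d:%s" % (i, key)).encode()).hexdigest()
            yield int(h, 16) % self.nbits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, key):
        # False means "definitely absent", True means "go check the index"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = Bloom()
bf.add("some-key")
print(bf.maybe_contains("some-key"))     # True
print(bf.maybe_contains("other-key"))    # almost certainly False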

We currently cache in RAM each key we read from disk, but practice shows that the overwhelming majority of old enough records (whose keys have already moved from RAM to disk) are never (or extremely rarely) read again, so we will add a cache timeout for keys to drop them back to disk.

Elliptics range requests benchmark

73+ million records, 25-30 Gb of total space on a single node (actually there were 3 replicas, but we used only one)

Here is the graph

3000 rps within 10 milliseconds, where each range request returned 20-4k records.

An elliptics range request is a full analogue of SQL’s “SELECT * FROM TABLE WHERE key > X AND key < Y LIMIT (from, num)”. In this test each record’s key contained a timestamp, and the range request asked for data in some time range.

Each key (elliptics uses 64 bytes for key) looked like this:
xxxyyy...whatever else...timestamp[16 bytes]

and range request was from
xxxyyy...whatever else...0000...0000[16 bytes]
to
xxxyyy...whatever else...ffff...ffff[16 bytes]
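
In code that key construction looks roughly like this (a sketch; the exact byte layout used in the test is an assumption, only the “prefix plus 16-byte timestamp tail” structure comes from the description above):

# Sketch of the key layout: common prefix + 16-byte timestamp tail in a 64-byte key.
import struct

KEY_SIZE = 64
TS_SIZE = 16

def make_key(prefix, timestamp):
    head = prefix.ljust(KEY_SIZE - TS_SIZE, b"\0")      # "xxxyyy...whatever else..."
    return head + struct.pack(">QQ", 0, timestamp)      # big-endian: byte order == sort order

prefix = b"xxxyyy-whatever"
start = make_key(prefix, 0)                                            # ...0000...0000
end = prefix.ljust(KEY_SIZE - TS_SIZE, b"\0") + b"\xff" * TS_SIZE      # ...ffff...ffff
key = make_key(prefix, 1312400000)                                     # written at some point in time
print(start <= key <= end)                                             # -> True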

Looking forward for the larger datasets

To date the largest (in terms of number of objects) elliptics cluster hosts as many as 400 million records on every node. Those are quite small records, so they do not occupy much space (just about 4 Tb per server node), but the index becomes quite large (45+ Gb).

We dropped the in-memory index in eblob quite a while ago already, but having it on disk means that we have to hit the disk to find the needed key. Currently it is a binary-searchable structure on disk, but this is very suboptimal. Well, for 45+ Gb indexes lookups have acceptable timings, and likely will for 2-3 times larger datasets, but we want every node to host 5-10 times more data, i.e. 40 Tb of data and 4 billion records.

In that case a binary search in a 450 Gb index will take too long. We can add more servers to spread the data, but then we will not optimally use the quite limited physical space in datacenters. We have a couple of ideas for faster indexes, which basically employ a kind of sharding property of our keys, i.e. we can split a key (512 bits in the default configuration) into chunks, where each part is an offset into a box-like index structure.

In the meantime I added in-memory caching of read keys – now every key that is read will be placed into an in-memory hash table for fast next-time access. Eblob also got Python bindings and a set of handy utils to scan blobs (like regexp match). It also supports a statistics file (you will find it in the root directory), which shows the number of objects present on disk, removed on disk and pushed into memory. Removing this file will force eblob to regenerate it on the next start.

And now one piece of really good news – work on the new POHMELFS starts next week. The estimated completion time is rather short (we want it to be, if not production-ready, then at least more usable by the end of the month), since it will be just an elliptics frontend. To date the main question is object indexing – the most naive and simple design would have quite slow file/dir renames, i.e. full copy and delete, which is generally not a good idea.
But I have not yet thought about the design problems in detail, so this is an open question.

Stay tuned, there will be very interesting things shortly!

Elliptics API extension

I added a bunch of new URI parameters, so the HTTP frontend is now almost as capable as clients created using the C/C++/Python API. (Github has not been updated yet though.)

In particular, added

  • column= – read/write data from a particular column. The number of columns is not limited. Well, each new column is a separate large eblob, so it is limited by the number of blob files in a directory; by default the blob size is 50 Gb, so it should not be a limitation
  • size=/offset= – read/write data at a particular offset and size
  • prepare/plain_write/commit – using ‘prepare’ one can reserve large enough space and then put chunks of data there using ‘plain_write’. ‘commit’ will commit the data into the index, calculate the checksum and so on
  • aflags/ioflags – one can enable/disable checksums, turn compression on/off, enable APPEND mode and so on
  • range/from/to – range requests. The HTTP server returns an array of data chunks, where each record is prepended with its DNET_ID_SIZE key (64 bytes by default) and 8 bytes of size (little endian)
  • id – read/write data using your own ID (DNET_ID_SIZE max, 64 bytes in the default compile – sized for sha512 keys)

Here are examples.

1. Write data using own 64-byte keys (in hex: 1122334400… – 1122334600…):

wget -q  -S -O- --post-data="1234567890 id=11223344" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223344"
wget -q  -S -O- --post-data="1234567890 id=11223345" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223345"
wget -q  -S -O- --post-data="1234567890 id=11223346" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223346"

2. Read data by own user-provided key:

wget -q  -S -O- "http://elisto19f.dev/elliptics-proxy?name=1.xml&id=11223345&direct"

3. Range request hex keys are from 1122334400… to 112233ff00…:

wget -q  -S -O- "http://elisto19f.dev/elliptics?range&from=11223344&to=112233ff"

hexdump of the stream (red is key, blue is size):

 wget -q  -S -O- "http://elisto19f.dev.yandex.net/elliptics?range&from=11223344&to=112233ff" | hexdump -C
  HTTP/1.0 200 OK
  Content-Length: 459
  Content-Type: application/octet
  Connection: keep-alive
  Date: Thu, 04 Aug 2011 19:27:57 GMT
  Server: lighttpd/1.4.26

00000000  11 22 33 44 00 00 00 00  00 00 00 00 00 00 00 00  |."3D............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000040  16 00 00 00 00 00 00 00  31 32 33 34 35 36 37 38  |........12345678|
00000050  39 30 20 69 64 3d 31 31  32 32 33 33 34 34 11 22  |90 id=11223344."|
00000060  33 45 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |3E..............|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 16 00  |................|
000000a0  00 00 00 00 00 00 31 32  33 34 35 36 37 38 39 30  |......1234567890|
000000b0  20 69 64 3d 31 31 32 32  33 33 34 35 11 22 33 46  | id=11223345."3F|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 16 00 00 00  |................|
00000100  00 00 00 00 31 32 33 34  35 36 37 38 39 30 20 69  |....1234567890 i|
00000110  64 3d 31 31 32 32 33 33  34 36 11 22 33 47 00 00  |d=11223346."3G..|
00000120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000150  00 00 00 00 00 00 00 00  00 00 16 00 00 00 00 00  |................|
00000160  00 00 31 32 33 34 35 36  37 38 39 30 20 69 64 3d  |..1234567890 id=|
00000170  31 31 32 32 33 33 34 37                           |11223347|
00000178
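
To make the stream layout explicit, here is a small parser of such a reply (a sketch; it only relies on the framing described above – a DNET_ID_SIZE key followed by an 8-byte little-endian size and the data):

# Parse a range-request reply: 64-byte key + 8-byte little-endian size + data, repeated.
import struct

DNET_ID_SIZE = 64

def parse_range_reply(buf):
    records, pos = [], 0
    while pos + DNET_ID_SIZE + 8 <= len(buf):
        key = buf[pos:pos + DNET_ID_SIZE]                                 # record key
        (size,) = struct.unpack_from("<Q", buf, pos + DNET_ID_SIZE)       # data size, little endian
        data = buf[pos + DNET_ID_SIZE + 8:pos + DNET_ID_SIZE + 8 + size]
        records.append((key, data))
        pos += DNET_ID_SIZE + 8 + size
    return records

demo = b"\x11\x22\x33\x44" + b"\x00" * 60 + struct.pack("<Q", 5) + b"hello"
print(parse_range_reply(demo))    # -> one record whose key starts with 11 22 33 44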

We will stress-test range requests next week; it is expected that performance should not be radically different from plain random-IO performance (benchmarks can be found here).

We also tested Cassandra in a 4-node configuration, and I’m about to collect the raw numbers and graphs to show.

Stay tuned!

Yfrog presentation about HBASE usage

They removed my drawings from yfrog, so I will say some words about their presentation – mostly about how great we are

First, their setup – 60 servers for 250 million photos. I do not really know whether this number includes the copies, but we use ONE server with elliptics to host 200+ million small objects. And we actually have 3 copies, which comes to 600 million records in 3 different datacenters. Well, that server has 48 Gb of RAM and 24 disks in RAID6 – a kinda big box.
But our 1 Pb cluster contains not only such machines, but also smaller ones with 4 disks and a smaller amount of RAM.

The metadata cluster with 1 Bn records and 20 nodes is much more interesting, but I’m afraid most of it fits in RAM, since metadata records are small.

The presentation speaks about 10 krps – a great number, but wait a minute – from 20 or even 60 nodes? The above-mentioned setup handles 3 krps (2 krps write and 1 krps read) easily. And I do not call it ‘Super fast!’

We tested Cassandra in a lab; it handled 2 krps just fine (1400 rps read, 600 rps write) from 4 nodes (24 Gb of RAM, 4-disk RAID10, 15 Gb of data on every node), but only while the data fit in memory. Overall it was not very stable and required a huge amount of tuning so that it would not end up in OOM or the JVM GC killer.

Huge MongoDB tests are on the road.

Using eblob as the low-level IO backend pushed the disk bottleneck away, but it adds a limitation on the number of records stored effectively – we store the key index in RAM. Now that I have added an on-disk index we are about to rerun our benchmarks, which will likely show performance degradation. Although this change allows putting more objects onto a single node without depending heavily on the amount of RAM, it may turn out to be noticeably slower.

The next change will be to use a fast in-memory index for frequently used keys.

Our main disadvantage is the lack of documentation. Nobody uses elliptics except our company’s friends, who bug me about how things are implemented and what everything means. But this will change very soon! We are working on it, and that’s actually kinda harder than code :)

Recent news about projects

So far I’m mostly concentrated on elliptics and eblob, since we build rather large and loaded storage systems on top of them.
But I plan to step a little bit away and work on language grammars (starting with simple flex/bison so far), but more about this later.

Elliptics got another project hosting huge amounts of small files (1-20 kbytes) uploaded virtually continuously all day long onto servers in 3 different datacenters. Here is a monitoring pic (the number of files uploaded this week) from one of the servers:



There is a small caveat though – eblob stores the data and index on disk and also a fast index in RAM. With such an amount of objects written we will eventually run out of memory for the fast index, which in turn will force us to set up more servers, which sometimes is not that simple (mostly because of the absence of physical space in datacenters).

Eblob and on-disk vs in-ram lookup indexes

To eliminate this problem I mapped the fast index (a hash table) into an on-disk file, but because of serious fragmentation issues this ended up being a huge slowdown – mostly due to moving the disk heads from the beginning of the map file to its end to find the needed object. Although this was O(1) in terms of number of records, it still required a fair number of disk seeks to walk the hash table and get a single record. So it was dropped.

Now the system is quite different – I sort the on-disk index after a blob is completed, which is quite a fast operation (the index file always fits into free memory); then it is written to disk and all records placed there are removed from the in-memory fast index. The file is mapped and binary search is used to find data records.
Currently we do not propagate frequently accessed records back into the fast in-memory index, but we will in the next releases.
Also, if there is an old on-disk index it is not converted on startup, which should be changed too.

We have not yet tested the new lookup scheme’s performance; that is scheduled for the end of this week or the next one. It is not that simple to generate a hundred million records quickly, so likely I will use existing data first (it requires sorting the existing on-disk indexes).

Eblob as append-only and overwrite in-place storage

Besides the on-disk index eblob also got overwrite support – when a new record is smaller than the object already written with this key (including alignment), it will be overwritten in place instead of being appended at the very end. This feature should eliminate quick metadata growth, since we update metadata each time an object is checked or written.

Language grammatics

I plan to implement mapreduce on top of the column data store used in elliptics, and want to create a rather trivial language for that. In particular I plan to support boolean and arithmetic operations, regex string search and generic read/write primitives, mostly to eliminate data dereference (a kind of string class). Maybe it will even contain loops and functions.
So far I have written a simple reentrant grammar parser using Flex/Bison, which among other things supports multiline strings in Flex, but the grammar is still very trivial and basically unusable for language parsing. I will experiment with it more, since I want to add it into elliptics sooner rather than later.

Those are the plans and news. Stay tuned!

Eblob got automatic defragmentation support

Eblob is an extremely fast append-only low-level storage used in the elliptics network distributed storage.

Since it is append-only, all updates look like new records appended somewhere at the end of the blob file. Thus old entries just waste space.

Previously there was a special tool which could defragment blobs offline, and elliptics could be restarted to pick up the new data location (it kept working with the old blobs until then). It required lots of space or lots of restarts to defragment a really large storage.

Now this is gone – eblob uses online automatic defragmentation, which finds blobs with lots of removed objects (at least one third of the objects should be removed), sequentially reads their records and writes them back into other blobs.
The in-memory index is also updated, so when this process completes scanning a given blob file, that file can be removed.
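
Schematically the pass looks like this (an in-memory toy model, not eblob’s code; the one-third threshold is the one mentioned above):

# Toy model of the online defragmentation pass.
class Blob(object):
    def __init__(self):
        self.records = []                     # [key, payload, removed]

    def append(self, key, payload):
        self.records.append([key, payload, False])

    def removed_ratio(self):
        removed = sum(1 for r in self.records if r[2])
        return float(removed) / max(len(self.records), 1)

def defrag(blobs, threshold=1.0 / 3):
    result = []
    for blob in blobs:
        if blob.removed_ratio() < threshold:
            result.append(blob)               # not enough garbage yet, leave as is
            continue
        target = Blob()                       # defragmented copy, needs one blob of free space
        for key, payload, removed in blob.records:
            if not removed:
                target.append(key, payload)   # sequential read + sequential append
        result.append(target)                 # the old blob file would be unlinked here
    return result

b = Blob()
for i in range(6):
    b.append("key-%d" % i, "v")
b.records[0][2] = b.records[1][2] = b.records[2][2] = True   # half of the records removed
print(len(defrag([b])[0].records))                           # -> 3 live records copied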

Automatic defragmentation requires at most a single blob’s worth of free space to defragment the whole storage, since blobs are processed one after another without a storage restart.

Given that all reads during defragmentation are sequential and all writes are actually appends, properly aligned by the VFS into huge sequential writes, this process is rather fast.

Looks like this is the last feature for eblob for now, so we can concentrate on recovery tools, tests and eventually new features.

Since elliptics@eblob and the API support column storage, I plan to implement incremental mapreduce on top of that. It is a bit of a longer-term plan though, maybe towards the end of the summer…
Anyway, stay tuned!

Eblob got Snappy compression support

Kudos to Google for opensourcing it.

I have not yet tested its performance, but it compresses our typical small XML about 2 times.
The API was extended to support a per-command flag for whether to compress data or not. Eblob stores this information in its record structure, so a read detects it and automatically decompresses the data.

So, for storing textual data elliptics and eblob look very promising, given that Cassandra and MongoDB do not yet support compression (although there are tickets dating back to 2009) :)

Enjoy!

Elliptics 2.10.0

A new release with a huge number of improvements, the most notable of which are

  • column data storage (eblob only)
  • range requests
  • faster metadata and overall IO performance
  • server-side checksumming and verification support
  • base for ultra-fast recovery and local mapreduce
  • simplified interfaces for POHMELFS

So far this is a beta release, since not all API calls have been propagated to the high-level interfaces. There are some issues with the recovery process – we have not yet fully finished the transition. The eblob offline defragmentation tool was removed, but the online process has not yet been added into the core.

We plan to resolve these issues this week and move further.

Elliptics currently hosts not-that-large clusters (the biggest cluster spans about 200 nodes, which live on ~20 physical servers, and close to 1 Pb of data) with hundreds of millions of objects (not counting redundant copies), spread over 4 datacenters including one abroad. With the above features we plan to expand our usage area. Unfortunately it is quite complex to get started with elliptics, and fixing that should actually be my main goal.

In the meantime we got www.elliptics.ru and www.eblob.ru, which will host a documentation portal with example configs, API docs, benchmarks and so on. Many thanks to Andrey Godin.

Maybe eventually this will be a community driven project.
Anyway, stay tuned!

Column data store

In the meantime elliptics got column store support.
It is implemented in eblob only though; the filesystem backend does not support this feature.

I’m still thinking about the proper API for this feature though.
There are millions of changes in the current tree; we will polish it and release a new version very soon, which will draw the line between the current rapid development cycle, with its extreme feature creation rate, and steady maintenance support. Probably :)

Stay tuned, I will make a new post when things are ready for prime time.

Key range requests in elliptics network

The elliptics network is a horizontally scalable key/value storage, which by default implements a distributed hash table as its ID spreading scheme.

But it is possible to implement your own ID generation mechanism, which can introduce data locality. Range requests in such scenarios can be a huge win for users.

I implemented range requests (in the eblob backend only), which can roughly be translated into the following SQL command:

SELECT * from eblob WHERE key >= start AND key < end LIMIT (from, num)

One can use half of the key to store a master ID and the other half (we use 512-bit keys, but this can be changed at compile time) for each object the client wants to attach to the selected master ID. For example, if we name client X with key '123456.000000', then all its data can be stored with keys 123456.000001, 123456.000002 and so on.
To get all data objects with 'master ID' 123456 in one request, one may execute a range query starting at 123456.000000 and ending at 123457.000000.

For example, using the Python binding it looks like this (a different range is used here):

$ cat dnet/bindings/python/range.py
#!/usr/bin/python
# -*- coding: utf-8 -*-

from libelliptics_python import *

# log to /dev/stderr with log level 8 and connect to a local node serving group 2
log = elliptics_log_file("/dev/stderr", 8)
n = elliptics_node_python(log)

group = 2
n.add_groups([group])
n.add_remote("localhost", 1025)

# keys in [start, end) within the given group, with LIMIT (limit_start, limit_num)
r = elliptics_range()
r.start = [1, 2, 3, 4]
r.end = [0xcc, 0xdd]
r.group_id = group
r.limit_start = 2
r.limit_num = 2

print n.read_data_range(r)

The API hides the case when the requested range is handled by different hosts; although this should be a very rare case, it is supported in the underlying methods.

Range requests are supported in the C, C++ and Python bindings, just like the usual IO methods.

The next huge step is a proper column data store. I'm open to ideas about a better implementation.

Mapped file index, on-disk sorting, range requests and per-column data store

Mapped file for fast in-memory index

I started to test eblob – the append-only storage used in elliptics – with a mapped file for its in-memory index. Since it is a hash table, I decided to reduce the number of seeks in each hash-bucket search, so I implemented an array of elements rather than a classical list. We reallocate the array each time a collision happens, to add another entry at the end.

Unfortunately this does not scale at all. Eventually fragmentation of the mapped file becomes so bad (and this becomes apparent after just several tens of millions of records), especially if we preallocate large enough hash-bucket arrays (say, for 5-10 elements), that writing an element at the beginning of the hash table tends to flush a write at the end of the mapped file, which in turn requires seeks and seeks and more seeks.

After about 40-50 million entries in a single mmap file the kernel flush thread starts eating 100% CPU and write performance degrades to hell. Below that it works at the same speed as glibc malloc() though.
This idea has to be rethought.

On-disk sorting and range requests

To implement range requests in elliptics I decided to first optimize the on-disk eblob storage. Range requests will likely not be supported for the filesystem IO backend (where every object is a separate file); eventually I will drop it.

So, I implemented an offline on-disk sorting tool, which uses the eblob iterators (the C++ interface); its speed equals roughly half of the write speed of the underlying disk array, since it writes the to-be-sorted data twice – the temporary tables and the merged result.
The sorting tool uses a combination of in-memory quick sort (std::sort() to be precise), writing the resulting less-than-RAM blocks into temporary files, which are in turn merged the way merge sort works.
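
The same approach in miniature (a toy Python version of the idea, not the actual C++ tool): sort RAM-sized chunks into temporary files, then merge them.

# Toy external sort: in-memory sort of chunks, then a k-way merge of the temporary files.
import heapq, tempfile

def external_sort(keys, chunk_size=4):
    chunk_files = []
    for start in range(0, len(keys), chunk_size):
        block = sorted(keys[start:start + chunk_size])        # in-memory sort of one chunk
        tmp = tempfile.TemporaryFile(mode="w+")
        tmp.writelines(k + "\n" for k in block)               # temporary table on disk
        tmp.seek(0)
        chunk_files.append(tmp)
    # merge the sorted temporary files the way merge sort does
    merged = heapq.merge(*[(line.rstrip("\n") for line in f) for f in chunk_files])
    return list(merged)

print(external_sort(["f", "b", "a", "e", "c", "d"], chunk_size=3))   # -> ['a', ..., 'f']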

12.5 million records (written objects from 1.5 to 20 Kb each, ~100 Gb total) were sorted at 22-25 Mb/s, while the write speed of the underlying RAID array is about 45-50 Mb/s (we write 2 files at the very end – the eblob with data and its small on-disk index), so it took a bit more than an hour.

Range requests will be handled by elliptics and will return data objects for the individual keys. On the storage side it will iterate over all keys in the requested range and send the data to the client. If the range covers multiple machines, the API will automatically split the request across multiple nodes.

Since by default elliptics uses a hash of the name as the key, range requests do not have any reasonable meaning there. But they can be very useful when keys are generated manually.
For example, a client may use half of the key as the user login (a hash of it) and the second half as the written object id (a hash of the uploaded file, for example). In this case range requests allow getting all of a user’s data via a single request. Keys can be split however the client likes; we use 512-bit keys, so there is plenty of space there.

Column data storage

But IMHO range requests and manual key generation are only half a step towards a real column data store. I do want to implement this feature this summer, so we can compete with systems like Cassandra and MongoDB in the computation and processing markets. On performance (and optionally reliability) elliptics@eblob is unbeatable already :)

Documentation

It sucks. And I mean it – it is horrible as freakin’ bloody hell.
What we have now can hardly be called documentation, but I promise I will work with people who can write good docs, so that writing applications and setting up systems will not be rocket science.
Actually it is all very simple, but this knowledge still has to be documented.

Stay tuned, and things will be even better!

Per-column data store in eblob and elliptics

A simple test iterating over 38 million name metadata records placed separately into their own eblob gave us almost a 2x performance increase – the regex scan took about 50 seconds.
When the combined metadata (about 3-5 different types) lived in a single eblob, it took 80 seconds to complete.

The per-column buzz is quite trendy, and the data above shows that storing data in separate databases (or eblobs in our case) is indeed much more efficient. Databases like Cassandra store multiple columns in a single record, which, albeit not optimal, is quite convenient – we get all columns via a single request and then iterate over them.

I’m thinking about putting such a feature into elliptics and namely eblob – the server would itself update columns in a selected row, although placing all columns into a single record looks quite questionable to me.