Tag Archives: eblob

Reverbrain packages repository

The Reverbrain package repository now hosts packages for the following distributions: RHEL6, RHEL7 (CentOS supported), Ubuntu Precise, Ubuntu Trusty, Debian Wheezy and Debian Jessie.

Repository includes all packages needed to install Elliptics distributed storage and Eblob low-level storage.

Here is a small tutorial on how to automatically enable the repository in your setup: http://doc.reverbrain.com/elliptics:server-tutorial

Backrunner HTTP elliptics proxy can be found in Docker repo: https://registry.hub.docker.com/u/reverbrain/backrunner/

LSM and fractal trees and how to really work with large files

An LSM tree (which stands for log-structured merge tree) is a rather simple structure which can hardly be called a tree.

It is an append-only log which is sorted when written to disk. LSM trees are intended for write-heavy workloads, since reading requires at least O(number of on-disk log files) disk seek operations.

There is a fair number of read optimizations for LSM trees, in particular the bloom filter, which can reduce the number of disk seek operations to a minimum, albeit with some false-positive probability (it can be made arbitrarily small though).
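The idea can be sketched in a few lines of Python (a generic bloom filter illustration with made-up class name and parameters, not eblob's actual implementation):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch: k hash functions over an m-bit array."""

    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from one digest (double-hashing trick).
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], 'little')
        h2 = int.from_bytes(d[8:16], 'little') | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely absent" -> the disk seek can be skipped.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add("key-1")
assert bf.might_contain("key-1")   # no false negatives, ever
```

A lookup that returns False guarantees the key is absent, so the disk seek can be skipped entirely; a True answer still has to be verified on disk, which is where the (tunable) false-positive probability comes in.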

LSM trees behave much better under write workloads than B-trees and friends (B+, B* and so on), since there is only one write of the sorted tree and it is always sequential. A B-tree potentially has to update multiple nodes (some logarithm of the total number of keys) when performing a single write. Those nodes are likely located at random positions, which ends up as random writes. These are slow.

On the contrary, B-tree reading is usually faster than that of LSM trees – the logarithm of the number of keys is smaller than the number of sorted logs in an LSM tree. But this does not take bloom filters into account. Which in turn does not take B-tree node caching into account.
The need for multiple operations to serve a single request – like multiple page reads to fetch a single key in the B-tree case – is called amplification. The fractal tree is aimed at write amplification – it is still a B+tree, but it stores data in intermediate nodes (not only in leaves) for some time, until a page split is required. This reduces the number of writes needed to insert or update a key.

Anyway, B-trees are considered to be faster than LSM trees for reading and slower for writing. The latter is a fact – LSM trees are designed for that – while the former is questionable.

Elliptics distributed storage can use many backends, and the most popular one is Eblob – a low-level storage built with the LSM tree design in mind.

LSM trees do not support data rewrite – a key update is appended to the newest log, and the older version is either marked as removed or a special lookup sequence is used that finds the newer version first. Eventually the LSM tree merges the logs and removes old versions of duplicate keys.
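This append-and-merge behaviour can be sketched with a toy Python model (purely illustrative; the names and structure are invented and this is not eblob code):

```python
# Append-only log files: updates go to the newest log, lookups scan
# newest-first, and merge drops shadowed and removed versions.
TOMBSTONE = object()

logs = [[], []]                 # logs[0] is oldest, logs[-1] is newest

def put(key, value):
    logs[-1].append((key, value))

def delete(key):
    logs[-1].append((key, TOMBSTONE))   # removal is just another append

def get(key):
    for log in reversed(logs):          # newest version wins
        for k, v in reversed(log):
            if k == key:
                return None if v is TOMBSTONE else v
    return None

def merge(all_logs):
    """Compact all logs: keep only the latest live version of each key."""
    latest = {}
    for log in all_logs:                # oldest to newest
        for k, v in log:
            latest[k] = v
    return sorted((k, v) for k, v in latest.items() if v is not TOMBSTONE)

put("a", 1); put("b", 2)
logs.append([])                 # "rotate" to a new log file
put("a", 3); delete("b")
assert get("a") == 3 and get("b") is None
assert merge(logs) == [("a", 3)]
```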

In Eblob this process is called defragmentation, and it is a bit different from the LSM tree process. An LSM tree sorts data in RAM and stores it to disk already sorted. But if your storage is intended for large files, as Elliptics is – we store objects which are sometimes considerably larger than the amount of RAM in the system – the plain LSM tree approach (sort in memory and sync to disk) doesn’t work.

Instead, Eblob stores an unsorted log on disk (optionally overwriting data in place) and adds an in-memory index of the keys. This simple scheme would be rather naive, since the number of keys multiplied by the key size – the amount of RAM needed to hold the key index in memory – can be much larger than the amount of RAM on any given server. So we also maintain an on-disk index of the stored keys, which can be sorted – binary search then finds the needed key rather quickly.
But not quickly enough – that would take log2(number of keys) random seek operations – so we split the sorted keys into smaller ranges called index blocks and store a start/stop key pair for each index block in RAM. This lets us find the right index block without any disk operations, then perform a single read to fetch the whole index block (tens to hundreds of keys) and run an in-memory binary search.
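The index-block scheme can be sketched like this (illustrative Python; the block size and key format are made up, and real index blocks hold tens to hundreds of keys):

```python
import bisect

# Sorted on-disk index split into fixed-size blocks; only each block's
# first key ("start") is kept in RAM. Lookup: binary-search the starts
# in memory, then read one block from "disk" and binary-search inside it.
BLOCK = 4   # keys per index block, tiny for the example

sorted_keys = [f"key{idx:04d}" for idx in range(0, 100, 3)]
blocks = [sorted_keys[i:i + BLOCK] for i in range(0, len(sorted_keys), BLOCK)]
starts = [b[0] for b in blocks]     # in-memory start keys, one per block

def lookup(key):
    i = bisect.bisect_right(starts, key) - 1   # pick the block in memory
    if i < 0:
        return False
    block = blocks[i]                          # single "disk" read
    j = bisect.bisect_left(block, key)         # in-memory binary search
    return j < len(block) and block[j] == key

assert lookup("key0033")
assert not lookup("key0034")
```

Only the start keys live in RAM, so memory use grows with the number of blocks rather than the number of keys, at the cost of one disk read per lookup.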

And even this is not enough. Iterators and, for example, recovery work with sorted keys – recovery merges index lists from different nodes and produces a sorted list of keys which have to be recovered. Since our data is not sorted yet, reads of the to-be-recovered keys would actually be random reads. Instead we can turn that purely random read pattern into sequential reads plus occasional head positioning. So we sort the data, which happens when the defragmentation process starts for the first time.

This makes Elliptics+Eblob the ultimate solution for distributed storage of medium-to-large files. For smaller files a pure LSM tree is usually enough.

React monitoring tool released

During the last months we’ve been actively developing our monitoring tools, and now we are ready to present you React!

React (Real-time Call Tree) is a library for measuring the time consumption of different parts of your program. You can think of it as a real-time callgrind with very small overhead and user-defined call branches. A simple and minimalistic API allows you to collect per-thread traces of your system’s workflow.

A trace is represented as JSON that contains the call tree description (actions) and arbitrary user annotations:

    {
        "id": "271c32e9c21d156eb9f1bea57f6ae4f1b1de3b7fd9cee2d9cca7b4c242d26c31",
        "complete": true,
        "actions": [{
            "name": "READ",
            "start_time": 0,
            "stop_time": 1241,
            "actions": [{
                "name": "FIND",
                "start_time": 3,
                "stop_time": 68
            }, {
                "name": "LOAD FROM DISK",
                "start_time": 69,
                "stop_time": 1241,
                "actions": [{
                    "name": "READ FROM DISK",
                    "start_time": 70,
                    "stop_time": 1130
                }, {
                    "name": "PUT INTO CACHE",
                    "start_time": 1132,
                    "stop_time": 1240
                }]
            }]
        }]
    }

This kind of trace can be very informative; on top of it you can build complex analysis tools that, on the one hand, can trace a specific user request and, on the other, can measure the performance of a specific action across all requests.
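As a toy example of such a tool, here is a sketch that sums the time spent per action name across a React-style trace (hypothetical Python; the trace shape follows the example above, the helper is not part of React):

```python
import json

# A small trace in the same shape as the example above.
trace = json.loads("""
{
  "id": "example",
  "actions": [
    {"name": "READ", "start_time": 0, "stop_time": 1241, "actions": [
      {"name": "FIND", "start_time": 3, "stop_time": 68},
      {"name": "LOAD FROM DISK", "start_time": 69, "stop_time": 1241}
    ]}
  ]
}
""")

def per_action(actions, totals=None):
    """Recursively sum (stop_time - start_time) per action name."""
    totals = totals if totals is not None else {}
    for a in actions:
        totals[a["name"]] = (totals.get(a["name"], 0)
                             + a["stop_time"] - a["start_time"])
        per_action(a.get("actions", []), totals)
    return totals

assert per_action(trace["actions"]) == {
    "READ": 1241, "FIND": 65, "LOAD FROM DISK": 1172}
```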

Also, for a human-readable representation of React traces we’ve built a simple web-based visualization instrument. React has already proved itself in practice. Under high load our caching system’s performance degraded dramatically after cache overflow. Call tree statistics showed that during cache operations most of the time was consumed in the function that resizes cache pages. After careful examination we optimized that function, and performance immediately increased significantly.

Now React is used in Elliptics and Eblob, and we intend to use it in other Reverbrain products.

For more details check out documentation and official repository.

Elliptics monitoring spoiler alert

Here is Elliptics per-cmd monitoring

Cache/eblob command timing

Monitoring includes very detailed statistics about the most interesting bits of the storage. The picture above shows a write command execution flow (the whole command and how the time was spent within it) – one can turn it on for one’s own commands if the special trace bit has been set.

Other monitoring parts include queues (network and processing), cache, eblob (command timings, data stats, counters), VFS and other stats.
More details on how to use that data in JSON format are coming.

Eblob documentation and code samples

Eblob is the main low-level storage supported in elliptics. It was created as an append-only base, but eventually we added rewrite support, data sorting, in-memory and on-disk indexes, added and then removed columns, bloom filters, iterators, blob merge and defragmentation, and so on.

Eblob has rather simple interfaces accessible from C, C++ and Python.

So, we documented our most actively used low-level storage from every point of view: design and architecture, interfaces, known features, gotchas and best practices.
It is generally a good idea to read how things are done at the lowest level to understand how the whole system behaves, so I present the eblob documentation link here: http://doc.reverbrain.com/eblob:eblob


250 billion photos on Facebook

Quite an impressive number. I’m curious how many servers they use.
Facebook used to use a photo storage named Haystack; I based the Eblob design solely on the whitepaper of the proprietary Haystack design.

I did remove the highest indirection level though – the one where key indexes can live on separate servers – and only kept the in-memory and on-disk indexes.

That’s probably why the largest Elliptics storage by number of keys hosts only 50+ billion objects (counting all 3 copies though). And that is actually less than a hundred nodes (including all 3 replicas).

The year of eblob

Hi, this is rbtz speaking again. I have been the engineer responsible for the eblob codebase for
almost a year now. Here is a small recap of what has been happening with eblob
since v0.17.2, with some commit highlights.

The year of eblob

* Eblob now builds under Mac OS X. This improved experience of developers with Macs.
* Changed most links to point to newly created http://reverbrain.com.
* Added comments to all eblob subsystems: e254fc3. This improves the learning curve for new developers.
* Added l2hash support: c8fa62c. This reduces memory consumption of elliptics metadata by 25% on LP64.
* Added the first edition of the eblob stress test. A year later it is responsible for catching 99% of the bugs that would otherwise go to testing: 8eab8ed.

* Added config variables for index block and bloom: a106d9d. This allows sysadmins to limit memory occupied by bloom filter.
* Added config variable to limit total blob size: f7da001. This allows sysadmins to limit eblobs size in case many databases are located on one shared drive.
* Reduced memory consumption of “unsorted” blobs by 20% on LP64: 19e8612.
* First static analyzer crusade (feat. clang static analyzer) – a number of “almost impossible to spot” bugs found.
* Added data-sort and binlog v1. This allows “on the fly” eblob defragmentation and memory cleanups.
* Added travis-ci tests after each commit: f08fea2.
* Removed custom in-memory cache in favor of the OS page cache: a7e74a7. This removed a number of nasty races in eblob code and also opened the way for some future optimizations.
* Added Doxyfile stub, so that in future libeblob man pages may be autogenerated: aac9cb3.
* Decreased memory consumption of in-memory data structures by 10% on LP64: c6afffa.

* Replaced core mutexes with rwlocks. This improves our Intel vTune concurrency benchmarks, along with our QA tests.

* Second static analyzer crusade (feat. Coverity);
* Switched to adaptive mutexes (https://en.wikipedia.org/wiki/Spinlock#Alternatives) when available: 43b35d8.
* Sped up statistics update v1: 40a60d7. Do not hold the global lock while computing and writing stats to disk.
* Rewritten bloom filter v1: 6f08e07. This improves speed and reduces memory fragmentation.
* Allocate index blocks in one big chunk instead of millions of small, thus speeding up initialization and reducing memory fragmentation: b87e273.
* Do not hold global lock for the whole duration of sync(): 6f6be68. This removes “stalls” in configs where sync > 0.
* Switched to POSIX.1-2008 + XSI extensions: 6ece045.
* Build with -Wextra and -Wall by default: 0e8c713. This should in long term substantially improve code quality.
* Added options to build with hardening and sanitizers: c8b8a34, 2d8a42c. This improves our internal automated tests.

* Do not set bloom filter bits for removed entries on start: 36e7750. This improves lookup times for “long removed” but not yet defragmented entries.
* Added separate thread for small periodic tasks: ea17fc0. This in future can be upgraded to simple background task manager;
* Moved statvfs(3) to the periodic thread, which speeds up write-only micro benchmarks by 50%: f36ab9d.
* Lock the database on init to prevent data corruption caused by simultaneous access to the same database by different processes: 5e5039d. See more about EB0000 in the kb article.
* Removed columns aka types: 6b1f173; This greatly simplifies code and as side effect improves elliptics memory usage and startup times;
* Removed compression: 35ac55f; This removes dependency on Snappy.
* Removed bsize knob for write alignment: 8d87b32;
* Rewritten stats v2: 94c85ec; Now stats update very lightweight and atomic;
* Added writev(2)-like interface to eblob, so that elliptics backend could implement very efficient metadata handling: b9e0391;

* Replaced complex binlog with very tiny binlog v2: 1dde6f3; This greatly simplifies code, improves data-sort speed and memory efficiency;
* Made tests multithreaded: 1bd2f43. Now we can spot even more errors via automated tests before they hit staging.
* Moved to the GNU99 standard: f65955a. It’s already 15 years old =)
* Fixed a very old bug with log message truncation/corruption on multithreaded workloads: 10b6d47.
* Bloom filter rewrite v2: 1bfadaf. Now we use many hash functions instead of one, trading CPU time for improved IO efficiency. This improved bloom filter efficiency by an order of magnitude.
* Merge small blobs into one on defrag: ace7ca7. This improves eblob performance on databases with high record rotation maintaining almost fixed number of blobs.
* Added a record validity check on start: bcdb0be. See more about database corruption EB0001 in the kb article.

* More robust eblob_merge tool that can be used to recover corrupted blobs.
* Reduced memory consumption of in-memory data-structures by 10% on LP64: e851820;

* Added schedule-based data-sort: 2f457b8; More on this topic in previous post: data-sort implications on eblob performance.

Here I’ve mentioned only the most notable commits, mostly performance and memory usage oriented changes. There is of course lots of other stuff going on, like bugfixes, minor usability improvements and some internal changes.

Here are some basic stats for this year:

Total commits: 1375
Stats total: 65 files changed, 8057 insertions(+), 4670 deletions(-)
Stats excl. docs, tests and ci: 39 files changed, 5368 insertions(+), 3782 deletions(-)

Also, if you are interested in what’s going to happen in the near future in the eblob world, you should probably take a look at its roadmap.

By the way, for those of you who are interested in numbers and pretty graphs – after the recent upgrade of our internal elliptics cluster (storing billions of records) to the new LTS releases of elliptics 2.24.X and eblob 0.22.X we’ve got:

Response time reduction (log scale):

98th percentile that was around 100ms dropped below 50 ms

Disk IO (linear scale):

IO dropped more than one order of magnitude.

Memory (linear scale):

There is much more “cached” memory now. Also periodic data-sort routine successfully frees unused cached keys.

Eblob as append-mostly database

In this post I want to dig a little bit deeper into one of eblob’s least
understood parts: record overwrite. Let’s start with how overwrite was
meant to work in eblob: it simply shouldn’t. Indeed, eblob started as an
append-only database, so no overwrites were allowed. Then overwrite was added
to reduce fragmentation for use cases where some records were continuously
overwritten, so that appending them over and over again was inefficient. So a
special “overwrite-in-place” case for same-or-smaller-size record overwrites
was added. It was controllable via two knobs – one global for the database,
called EBLOB_TRY_OVERWRITE, and a per-write flag,
BLOB_DISK_CTL_OVERWRITE. Both of them are now deprecated; let me
explain why.

From this point on eblob became what I call an “append-mostly” database, which
means it usually only appends data to the end of the file, but in some rare cases it
can overwrite data “in place”.

During the binlog rewrite it was decided that having flags that drastically change
eblob write behavior is too complicated, so they were replaced with one
“canonical” write path: we try to overwrite a record in place only if the overwrite is of
the same or smaller size and the blob is not currently under data-sort. So we basically
turned the EBLOB_TRY_OVERWRITE flag on by default, which should greatly reduce
fragmentation on some workloads. There are possible performance implications of
that decision, so we did a lot of testing through our internal QA before applying
this change.

As a side effect it allowed us to remove 90% of the binlog code, since there is no need to
“log” data overwrites and keep track of their offsets in the original blob in order to apply
them later to the sorted one.

By the way, we considered three possible alternatives:
– Always append data to end of “opened” blob on overwrite. (True append only database)
– Overwrite in-place only in last blob. (Almost append only)
– Always overwrite in-place when blob is not under data-sort. (Append-mostly database)

Here are some graphs from our benchmark system, yandex-tank (Yay! It’s open source).

10000 rps, 1kb records, eblob 0.21.16

Always append:

(no-overwrite) read:write 10k eblob_0.21.16 flags 14
(no-overwrite) read:write 10k eblob_0.21.16 flags 142

Overwrite in last blob only:

(partial-overwrite) read:write 10k eblob_0.21.16 flags 14
(partial-overwrite) read:write 10k eblob_0.21.16 flags 142 (with EBLOB_AUTO_DATASORT)

Always overwrite:

(overwrite) read:write 10k eblob_0.21.16 flags 14
(overwrite) read:write 10k eblob_0.21.16 flags 142 (with EBLOB_AUTO_DATASORT)


We’ve picked “always overwrite in-place” because it behaves really well with
background data-sort – as you can see, there is almost no noticeable difference
between the graph with data-sort in AUTO mode and the one with data-sort disabled.
One common choice also simplifies the write path and makes performance
independent of the write flags used.

NB! Future eblob versions may change this behavior or reintroduce some of the
knobs in case regressions of the “always overwrite” approach are found.

Data-sort implications on eblob performance

Lots of stuff has been written about data-sort and defragmentation in recent
eblob versions (>= 0.18.0), both in documentation and blog posts. Today I
want to talk about eblob memory management and data structures, and about why
regular data-sort is essential for eblob/elliptics performance.

First, when a key is written to the data file, its basic information – the key itself,
the record size and the location on disk – is stored in the in-memory index (internally
an rb-tree) and also written to the so-called “unsorted” index. So both the data file and the
“unsorted” index have records sorted by write time, but we can still very
efficiently find a given key in the in-memory index or iterate over the
“unsorted” index, because the order of records matches the one used in the data file.

But keeping all keys in memory is not always possible (especially when you have
billions of them). So on each startup eblob sorts all but the last “unsorted”
index by key, so that it can use more efficient data structures instead of storing
all keys in an in-memory rb-tree. Memory-efficient as it is, this breaks record ordering
between the data file (records sorted by write time) and the new “sorted” index
(records sorted by key). This makes iteration over such a blob very inefficient
(consuming way too many IOPS).

To mitigate those problems the data-sort routine was introduced. It combines three purposes:
* Purges records from the in-memory index, so that recently unused keys won’t
occupy precious RAM and cause OOM on write-heavy workloads.
* Defragments data by physically deleting keys that were marked as “removed”.
It’s analogous to some NoSQL databases’ compact procedure or SQL’s VACUUM command.
* Restores record ordering between the data file and the index, so that iteration speed
is maximized.
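The defragmentation and reordering effects can be sketched as a toy compaction pass (illustrative Python only, not eblob's actual algorithm):

```python
# Minimal compaction sketch: take the append-ordered records, drop versions
# shadowed by later writes and records marked as removed, and emit the
# survivors sorted by key, restoring key order for fast iteration.
def data_sort(records):
    # records: list of (key, payload, removed_flag) in write order
    latest = {}
    for key, payload, removed in records:
        latest[key] = (payload, removed)          # later writes win
    return sorted((k, p) for k, (p, removed) in latest.items()
                  if not removed)

blob = [("b", "v1", False), ("a", "v1", False),
        ("b", "v2", False), ("a", None, True)]    # "a" was later removed
assert data_sort(blob) == [("b", "v2")]
```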

Fast, efficient and useful as it is, data-sort is a rather heavyweight routine,
since it can theoretically move terabytes across the drive, so it is rather unwise to run
it during peak hours. Given that, a number of knobs were introduced so that an administrator can
manage when data-sort starts.

Starting with eblob v0.21.17 and elliptics v2.24.14.11 admin may select between
four different methods of managing data-sort startup times.

  • AUTO – run data-sort on startup for each “unsorted” blob (but the last). Also run it on every blob “close”. This is the preferred method for small databases.
  • TIMED – the old way of running defrag every defrag_timeout seconds. It is useful for autogenerated config files where each server gets its own timeout.
  • SCHEDULED – the most sophisticated built-in method. It automagically spreads data-sort load across nodes in time, based on the given defrag_time and defrag_splay, so that each node selects some random time in the range [defrag_time - defrag_splay, defrag_time + defrag_splay] hours. This is the preferred method for medium/big clusters.
  • NONE – when none of the given options is selected, one must periodically run data-sort via the provided API – for example, with elliptics one can use the dnet_ioclient -r host:port:family -d start command to run defrag on the node given by the host:port:family tuple. This is the preferred method for very big clusters that require some external synchronization based on e.g. replica location, current time or node load.
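The SCHEDULED spread can be sketched as follows (a hypothetical helper, assuming defrag_time and defrag_splay are whole hours):

```python
import random

# SCHEDULED mode sketch: each node picks a random hour in
# [defrag_time - defrag_splay, defrag_time + defrag_splay], so that
# data-sort load is spread across the cluster instead of hitting
# every node at the same moment.
def pick_defrag_hour(defrag_time, defrag_splay, rng=random):
    hour = rng.randint(defrag_time - defrag_splay,
                       defrag_time + defrag_splay)
    return hour % 24            # wrap around midnight

h = pick_defrag_hour(defrag_time=3, defrag_splay=2)
assert h in {1, 2, 3, 4, 5}     # 3am +/- 2 hours
```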

NB! Failure to run data-sort periodically will lead to excessive memory
usage, wasted HDD space and slow iteration (e.g. dnet_recovery) speed.

For more information see:

Tuning eblob (or any other low-level storage) for maximum write performance

This will be a short article on how to tune your VM for maximum write performance.
I’m not sure whether it is generic enough, but at least it is very useful when you have a write-heavy workload in append-like mode, or when there are multiple files and you write into one after another (changing files when some boundary has been crossed).

This is the case for Eblob – our most widely used low-level local storage in Elliptics. Eblob can be turned into an append system which allows overwrite, but if the new size is larger than the old one, the object being written is copied to a new location and more space is reserved for future writes. The old copy is marked as removed and will be deleted when defragmentation takes place.

Every such ‘copy’ within eblob reserves 2 times more space than current size of the object.

If you write into Elliptics using a different key each time, the records will be appended to the blob’s end. When the blob’s size reaches the maximum described in the config (or it reaches the maximum number of elements, which is also a configurable parameter), a new blob is created.

It’s time for some small VM details. There are kernel flush processes whose main goal is to write your data from the page cache to disk. The VM can be tuned to kick these processes in according to our demand.

The main issue here is that until the data is written to disk, the inode is locked. This means no further writes into the given file are possible. In some cases reads are forbidden too – if the page you want to read is not present in the page cache, it has to be read from disk, and this process may lock the inode too.

So, until the flush process completes, you cannot update data in the corresponding file. Thus the main design goal is to separate the file being updated from the one being flushed to disk.
If we can predict the amount of data written, we can tune eblob to create new files frequently enough, and tune the VM to write out the old ones and not touch the one currently being written.

Let’s suppose we write 1 KB objects at a 20 krps rate. This is about 20 MB/s write speed. Let’s limit the blob size to 1 GB; this can be configured in the elliptics config this way:

backend = blob
blob_size = 1G

With a 20 MB/s write speed, a new blob will be created roughly every 50 seconds.

It’s time to tune the VM now. We want the flush process to kick in frequently, but we do not want it to touch new data. Thus we want it to write (and lock) old blob files.
Let’s say 100-second-old data is OK to be written to disk. Here is a set of sysctls for good write behaviour.

vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 75
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 10000
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 50

vm.dirty_expire_centisecs says how old dirty (i.e. written) data should be for the flush process to push it to disk. It is measured in centiseconds, i.e. 1/100 of a second. In the example above it is 100 seconds.

vm.dirty_background_ratio and vm.dirty_ratio draw boundaries (in percent) on how much of the page cache may be dirty before a flush is forced. If dirty pages take more than 80%, a writing process will block while flushing data to disk; if they take more than 75%, a background flush starts.

vm.dirty_writeback_centisecs says that the flush process should check dirty pages 2 times a second (every 50 centiseconds).

The above eblob + VM config with the specified write load behaves the way we wanted – eblob creates a new blob file every 50 seconds, and twice per second the kernel checks for pages which are 100 seconds old (written 100 seconds ago) and flushes them to disk. In our example these will be the ‘2 blobs ago’ files.
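These numbers can be double-checked with quick arithmetic (a Python sketch using the figures from this post):

```python
# Back-of-the-envelope numbers from the text: 20 krps of 1 KB writes
# against 1 GB blobs, and a 100-second dirty page expiry.
rate_rps = 20_000
record_bytes = 1024
blob_bytes = 1 << 30                 # blob_size = 1G

write_mb_s = rate_rps * record_bytes / 2**20               # ~19.5 MB/s
blob_lifetime_s = blob_bytes / (rate_rps * record_bytes)   # ~52 s

dirty_expire_s = 100                 # vm.dirty_expire_centisecs / 100
blobs_ago = dirty_expire_s / blob_lifetime_s               # ~1.9

assert 19 < write_mb_s < 20          # "about 20 MB/s"
assert 50 < blob_lifetime_s < 55     # "roughly every 50 seconds"
assert 1.5 < blobs_ago < 2.0         # flushed files are '2 blobs ago'
```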

Does LevelDB suck?

We have the following config for the Elliptics leveldb backend:

sync = 0

root = /opt/elliptics/leveldb
log = /var/log/elliptics/leveldb.log
cache_size = 64424509440
write_buffer_size = 1073741824
block_size = 10485760
max_open_files = 10000
block_restart_interval = 16

And with 1 KB chunks of pretty compressible data (ASCII strings) pushed into a single server with 128 GB of RAM and 4 SATA disks combined into RAID10, we end up with a poor 6-7 krps.

If the request rate is about 20 krps, the median reply time is about 7 seconds (!)

Elliptics with Eblob backend on the same machine easily handles the same load.

dstat shows that it is not the disk (well, at 20 krps it is the disk), but below that rate it is neither CPU nor disk bound – leveldb just doesn’t allow more than 5-7 krps with 1 KB data chunks from parallel threads (we have 8-64 IO threads depending on config). When snappy compression is enabled, things get worse.

Is it even possible to push 20 MB/s into LevelDB in small-to-medium (1 KB) chunks?

New 0.18 eblob sorts data on disk

Alexey ‘rbtz’ Ivanov implemented data sorting in eblob version 0.18.
This is the second major patch made by Alexey; the first one squeezed the eblob in-memory index by 40-60%: http://www.ioremap.net/node/725/

This closes races between data defragmentation and blob file updates, but the main goal was to eliminate multiple sorted indexes. Well, now we have only a single sorted index as well as sorted data, so iteration and data lookup are very fast.

This is the first step in preparation for unified iterators in the Elliptics storage, whose ultimate goal is to get rid of metadata and to move the recovery process out of the elliptics core into a separate module/process. This will allow creating different recovery policies: from an rsync-like copy of the whole local storage to cherry-pick-like recovery of particular keys.

Sorted keys in eblob also allow much faster scans and column-like storage creation. We are about to remove column support from eblob and elliptics – not because we do not want to have it, but because a column is actually a very different abstraction level than low-level storage. Columns just don’t belong to that ground level.

Instead, columns can be trivially implemented on top of the existing key mapping in elliptics. We will likely create a special API for this, but it can already easily be done by hand by properly setting up keys.

More about column storage in the Paper Trail Columnar Storage article.

Eblob: squeezing in-memory index by 40%

Eblob is a low-level append-only (with overwrite support though) storage which is also used as default backend in Elliptics.

Eblob operates on entities called (suddenly) blobs. Each blob has an index of the objects written into it. Since eblob is append-only by default, each new record is added to the last blob and potentially marks some previous record (maybe not in the last blob) as removed.

The last blob in this case is called the ‘opened’ blob. Its index is stored on disk and is also populated into memory. Previous blobs have their indexes sorted on disk (and quite soon we will also sort the data on disk), while the last ‘opened’ blob’s index is not sorted; instead we read it into memory on start.

Since the last blob can be really huge (its maximum size and maximum number of records are specified in the config file), its index in RAM may take quite a lot of space. In particular, we have systems where in-memory indexes are about 200 GB per server (close to 1 TB in this small cluster) – and that is all in memory, not on disk (the whole system hosts about 36+ billion records including replicas).

The index in memory is a rather trivial structure – 512-bit keys (by default), size/offset and other parameters, about 100+ bytes in total. We decided to squeeze it as much as we can.

Our first step in this direction is changing the keys. But we cannot just change the key size – on-disk entries must be preserved – so we decided to hash the keys that are populated into RAM using murmur2, and to put all collisions into a separate tree indexed by the real (large 64-byte) keys.

In practice this feature reduces the amount of memory used by the eblob index by 40% (and with almost no collisions on our datasets).
To enable it, one must set bit 6 in blob_flags (i.e. add 64 to its current value).
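The l2hash idea can be roughly sketched as follows (illustrative Python with a stand-in hash function; eblob uses murmur2 and a different, more compact in-memory layout):

```python
import hashlib

# Sketch: keep a short hash of each 64-byte key in the main in-memory
# index, and move the rare colliding keys into a separate map indexed by
# the full key. (blake2b here is just a stand-in for murmur2; the full
# key is kept alongside the hash only so this tiny sketch can detect
# collisions at all.)
index = {}        # short hash -> (full key, record info)
collisions = {}   # full key   -> record info, only for colliding keys

def short_hash(key: bytes) -> int:
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(),
                          'little')

def insert(key: bytes, info):
    h = short_hash(key)
    if h in index and index[h][0] != key:
        collisions[key] = info          # rare: full key stored only here
    else:
        index[h] = (key, info)

def lookup(key: bytes):
    if key in collisions:
        return collisions[key]
    entry = index.get(short_hash(key))
    return entry[1] if entry and entry[0] == key else None

insert(b"k1" * 32, {"offset": 0})       # 64-byte key
assert lookup(b"k1" * 32) == {"offset": 0}
assert lookup(b"k2" * 32) is None
```

The memory win comes from keeping an 8-byte hash instead of a 64-byte key for the overwhelming majority of entries; only the handful of colliding keys pay the full-key cost.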

Feature added by Alexey Ivanov

Eblob: http://reverbrain.com/eblob/
Elliptics: http://reverbrain.com/elliptics/
Docs: http://doc.reverbrain.com/