Tag Archives: Elliptics

Elliptics distributed network

RIFT is now fully API backed: buckets, acl, bucket directories, listing and so on

This day has come – all RIFT (our elliptics HTTP frontend) features – those written in the title and more – are now accessible via REST APIs.

This required URL format changes; URLs now look much more like S3 and REST in general:
http://reverbrain.com/get/example.txt?bucket=testns

The cornerstone of RIFT is the bucket – a metadata entity which describes where your data lives (the group list) and how to access it (ACLs). It also hosts a secondary index of the keys uploaded into that bucket (if configured to do so).
We now also have the bucket directory – an entity which lists your buckets.

Buckets, directories, files and indexes – everything can be created, processed and deleted via REST API calls.
Basically, RIFT + elliptics allow you to create your own private cloud storage and put your data replicas into whatever safe locations you like.

It is like having your own Amazon S3 in the pocket :)
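
For illustration, here is a minimal Python sketch that downloads an object using the /get URL format shown above; the host name is a placeholder, and bucket creation and the other REST calls are described in the documentation linked below.

# Minimal sketch: fetch an object through RIFT's /get handler using the
# URL format shown above. The host name is a placeholder.
import urllib.request

url = "http://rift.example.com/get/example.txt?bucket=testns"

with urllib.request.urlopen(url) as response:
    data = response.read()
    print(len(data), "bytes downloaded")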

Soon we will set up a test cloud at reverbrain.com where everyone can try our technologies before digging deeper: you will be able to create (limited) buckets and upload/download data, which will be stored in Germany and Russia for a limited period of time.

For more details about RIFT please check our documentation page: http://doc.reverbrain.com/rift:rift

Stay tuned!

Rift persistent caching

Rift allows you to store popular content in separate groups for caching. This is quite different from the elliptics cache, where data is stored in memory in segmented LRU lists. Persistent caching allows you to temporarily put your data into additional elliptics groups, which will serve IO requests. This is usually very useful for heavy content like big images or audio/video files, which are rather expensive to put into a memory cache.

One can update the list of objects to be cached as well as the per-object list of additional groups. There is a cache.py tool with extensive help for working with cached keys. This tool grabs the requested keys from the source groups and puts them into the caching groups, and also updates a special elliptics cache list object which Rift checks periodically (the timeout option in the cache configuration block of Rift). As soon as Rift finds new keys in the elliptics cache list object, it starts serving IO from those caching groups as well as from the original groups specified in the bucket. When using the cache.py tool please note that its file-namespace option is actually a bucket name.

To remove an object from the cache one should use the same cache.py tool – it removes data from the caching groups (please note that physically removing objects from disk in elliptics may require running online eblob defragmentation) and updates the special elliptics cache list object. This object will be reread some time in the future, so if a requested key cannot be found in the cache, it will be automatically served from the original bucket groups.
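
To make the flow clearer, here is a rough Python sketch of the logic described above (an illustration, not Rift's actual code): the proxy periodically rereads the cache list object and, for each request, serves the key from its caching groups when it is listed there, in addition to the original bucket groups.

# Conceptual sketch of the persistent-caching lookup; not actual Rift code.
import time

class CacheList:
    def __init__(self, reread_timeout):
        self.timeout = reread_timeout      # 'timeout' option in the Rift cache config block
        self.cached = {}                   # key -> list of caching groups
        self.last_read = 0

    def refresh(self, read_cache_list_object):
        # Reread the special elliptics cache list object when it is stale.
        if time.time() - self.last_read > self.timeout:
            self.cached = read_cache_list_object()
            self.last_read = time.time()

    def groups_for(self, key, bucket_groups):
        # Serve IO from the caching groups too, if the key is cached,
        # otherwise only from the original bucket groups.
        return self.cached.get(key, []) + bucket_groups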

Elliptics monitoring spoiler alert

Here is the Elliptics per-command monitoring.

Cache/eblob command timing

Monitoring includes very detailed statistics about the most interesting bits of the storage. The picture above shows the write command execution flow (the whole command and how time was spent within it) – one can turn it on for one's own commands if a special trace bit has been set.

Other monitoring parts include queues (network and processing), cache, eblob (command timings, data stats, counters), VFS and other stats.
More details on how to use that data in JSON format are coming.

Documentation updates

We are always improving our documentation.
We have added a detailed page about elliptics configuration, available at http://doc.reverbrain.com/elliptics:configuration.
We have also added a description of the dnet_balancer tool, which balances the DHT ring partitioning within one group. It can be found at http://doc.reverbrain.com/elliptics:tools#dnet_balancer.

Elliptics HTTP frontend RIFT got full per-bucket ACL support

The second most wanted feature in the elliptics HTTP frontend RIFT is ACL support.

Rift already provides S3-like buckets – namespace metadata which allows storing data in separate per-bucket unique groups, fetching the whole list of objects stored in a given bucket, and of course using the same object names in different namespaces. Rift also allows you to have a set of objects cached in special groups; each ‘cached’ object may have its own set of groups to check first. This option can be used to cache commonly used objects in additional groups, like temporary in-memory or SSD groups.

And now I’ve added ACL support to buckets. As stated, an ACL is an access control list where each username is associated with a secure token (used to check the Authorization header) and auth flags which allow bypassing some or all auth checks.

The admin must set up the per-bucket ACL using the rift_bucket_ctl tool, where the –acl option can be used multiple times. This control tool uses the following format: user:secure-token:flags

user is the username provided both in the URI (&user=XXX) and in the ACL (–acl XXX:token:0). token is used to check the Authorization header.

Here is the whole state machine of Rift’s authentication checker (once the bucket has been found and successfully read and parsed by the server); a small code sketch follows the list:

  1. if the group list is empty, a not-found error is returned
  2. if the ACL is empty, ok is returned – there is nothing to check against
  3. if no user= URI parameter is found, a forbidden error is returned – one must provide a username if an ACL is configured
  4. the user is searched for in the ACL; if no match is found, a forbidden error is returned
  5. if flags has bit 1 (counting from zero) set, this means bypassing the security check for the given user – ok is returned
  6. the Authorization header is searched for; if there is no such header, bad request is returned
  7. the security data in the Authorization header is checked using the secure token found in the ACL entry; if the auth data mismatches, forbidden is returned
  8. ok is returned
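
The same flow expressed as a rough Python sketch (the bucket/ACL objects and helper names here are illustrative, not Rift's actual types):

# Rough sketch of the verdict logic described above; not Rift's actual code.
def auth_verdict(bucket, uri_user, auth_header, check_signature):
    if not bucket.groups:                      # 1. empty group list
        return "not found"
    if not bucket.acl:                         # 2. empty ACL - nothing to check against
        return "ok"
    if uri_user is None:                       # 3. no user= URI parameter
        return "forbidden"
    entry = bucket.acl.get(uri_user)           # 4. look the user up in the ACL
    if entry is None:
        return "forbidden"
    if entry.flags & (1 << 1):                 # 5. bit 1: bypass the security check
        return "ok"
    if auth_header is None:                    # 6. no Authorization header
        return "bad request"
    if not check_signature(auth_header, entry.token):   # 7. token mismatch
        return "forbidden"
    return "ok"                                # 8. all checks passed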

The result of this check can be found in the log with the verdict: prefix at ERROR/INFO (1/2) log levels and higher.

But even if the state machine returned a non-ok verdict, the operation may still be processed. This can happen if the per-bucket ACL flags allow not all requests (bit 1), but only read requests (bit 0). In this case /get, /list, /download-info and other read-only handlers will check the ACL flag bits and optionally rewrite the verdict. You will find something like this in the logs:

[NOTICE] get-base: checked: url: /get?name=test.txt, original-verdict: 400, passed-no-auth-check

That’s it.
Check out more info about Rift – our full-featured elliptics HTTP frontend: http://doc.reverbrain.com/rift:rift

Listing of the keys uploaded into elliptics

I often get requests on how to get a list of keys written into elliptics. I do not really understand why this is needed, especially considering storage setups where billions of keys have been uploaded, but it remains one of the most frequently asked questions.

Elliptics has secondary indexes for that purpose. Indexes are automatically sharded and evenly distributed across the nodes in the group.

One can tag one's own uploaded keys with special indexes and then intersect those indexes on the servers or read the whole index key by key. That's essentially what RIFT – the elliptics HTTP frontend – does when you upload a file through its HTTP interface.

And I’ve added listing support to the RIFT proxy via the /list URI – it reads an index from the server, iterates over the keys and produces a nice JSON output. It also prints the timestamp of the key's update in the index, both in seconds and as a date in the current timezone.

The URI accepts namespace – the bucket name to get indexes from – and name – a placeholder for future index names (if we ever support multiple indexes).

$ curl "http://example.com/list?namespace=testns&name="

{
    "indexes": [
        {
            "id": "4e040aa8a798d04d56548d4917460f5759434fdf3ed948fd1cf35fd314cad3290e69b80deb0fc9b87a6bfbcbd08583919eb5b966658b3ed65e127236e1632525",
            "key": "test1",
            "timestamp": "1970-01-01 03:00:00.0",
            "time_seconds": "0"
        },
        {
            "id": "e5b7143155f46c9e9023cbf5e04be7276ae2e9a7583fee655c32aaff39755fa213468217291f0e08428a787bf282b416be1d26a5211f244fc66d1ce8ce545382",
            "key": "test7",
            "timestamp": "2014-02-18 03:29:44.835283",
            "time_seconds": "1392679784"
        }
    ]
}

A zero timestamp is used for older indexes created before timestamps were supported. key is the object name given at upload time, id is the numeric elliptics ID (one can read those objects directly from elliptics without the namespace name), time_seconds is a coarse-grained timestamp in seconds since the Epoch, and timestamp is the fully parsed timestamp with microsecond resolution.
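
For example, a few lines of Python are enough to fetch and print such a listing (the host name is a placeholder):

# Minimal sketch: read the /list output shown above and print the keys.
import json
import urllib.request

url = "http://example.com/list?namespace=testns&name="   # host is a placeholder

with urllib.request.urlopen(url) as response:
    listing = json.load(response)

for entry in listing["indexes"]:
    print(entry["key"], entry["timestamp"], entry["id"])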

There is also an example python script which does basically the same – it reads an index, unpacks it and prints it to the console: https://github.com/reverbrain/rift/blob/elliptics-2.25/example/listing.py

Monitoring of Elliptics server node

Meet scalable monitoring subsystem for Elliptics server nodes.

Monitoring allows tracking the performance of various parts of an Elliptics server node, such as command handling, cache, backend, etc.

It includes a simple HTTP server which provides JSON statistics via a REST API. The statistics can be requested in full or partially, by category. The list of available statistics categories can be found at http://host:monitoring_port/list.
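
For example, assuming a node with its monitoring port open, the category list can be fetched like this (host and port are placeholders):

# Fetch the list of available statistics categories from a node's monitoring port.
# The host and port below are placeholders.
import urllib.request

host, monitoring_port = "elliptics-node.example.com", 20000

with urllib.request.urlopen("http://%s:%d/list" % (host, monitoring_port)) as response:
    print(response.read().decode())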

Monitoring can be extended via external providers that deepen the basic statistics.

For more details check out docs:
http://doc.reverbrain.com/elliptics:monitoring – user documentation that describes how to gather and read statistics
http://doc.reverbrain.com/elliptics:monitoring-inside – developer documentation that describes how monitoring is implemented and how you can write a custom statistics provider.

SLRU cache in Elliptics

In order to add some flexibility to our cache layer, we replaced the simple LRU cache with a segmented LRU cache. Its structure perfectly handles the concept of “hot data” which is so common in real-world applications.

In a segmented LRU, cached data is divided into several pages depending on its access frequency. This way one-time requests only touch temporal pages and won’t affect data in popular pages, thus protecting hot data from being evicted on overflow.
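
For the curious, here is a toy Python sketch of the idea with just two segments (the real elliptics cache is written in C++ and uses several pages):

# Toy two-segment LRU sketch illustrating the idea; the real elliptics cache
# is written in C++ and uses several pages, not just two.
from collections import OrderedDict

class SegmentedLRU:
    def __init__(self, probation_size, protected_size):
        self.probation = OrderedDict()   # seen once: one-time requests live and die here
        self.protected = OrderedDict()   # seen again: "hot" data, safe from one-time scans
        self.probation_size, self.protected_size = probation_size, protected_size

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)          # refresh hot entry
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)          # second hit: promote to protected
            self.protected[key] = value
            if len(self.protected) > self.protected_size:
                # demote the coldest protected entry back to probation
                old_key, old_value = self.protected.popitem(last=False)
                self.put(old_key, old_value)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
            return
        self.probation[key] = value
        self.probation.move_to_end(key)
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)       # evict only from probation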

Another feature implemented in the new cache, in order to decrease the size of a cache record, is the replacement of the binary-search-tree + heap combination with a single structure, a Cartesian tree, which covers both roles just as effectively.

For more information and implementation details check out our docs: http://doc.reverbrain.com/elliptics:cache

Elliptics Go bindings

Go is an actively developed language, and it is already used in production in many environments. I decided to use it to implement the SaaS part (basically the frontend and API) of the upcoming Elliptics storage service.

Anton Tyurin started the Elliptics Golang bindings, and to date the implementation contains all major elliptics client features: IO, bulk operations, indexes. I added the possibility to use Go loggers within the elliptics client code (this feature is not present in the Python bindings, for example).

SaaS (storage as a service in our case) will allow clients to register and create new buckets (metadata objects similar to Amazon S3 buckets) with a secondary index containing all records uploaded into them, basic authentication and various limits (like capped collections in MongoDB). All these features are actually implemented in Rift – the HTTP elliptics access point – and I only need to wrap them nicely in an HTTP frontend.

Although I planned to finish this task and open the new service in December 2013, things changed a bit (I am actively developing Wookie for a web-search contest in parallel), but it should still be ready this month.
Stay tuned!

New elliptics HTTP proxy, authentication, caching and Go bindings

I separated the elliptics HTTP proxy from our high-performance HTTP server framework TheVoid. TheVoid continues to be a framework for writing HTTP servers in C++, whilst Rift becomes the elliptics HTTP access point.

Rift supports the usual object upload/get/removal as well as upload/download flow control. The latter (soon to be the default and the only possible mode) is basically an arbiter which does not read more data from the client until the current chunk has been written. It uses chunked upload and download. Rift supports range requests (the Range: HTTP header).
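
For example, a ranged read through the /get handler might look like this in Python (the host and key are placeholders; see the documentation for the exact set of parameters):

# Sketch of a ranged download through Rift's /get handler.
# The host and key names are placeholders.
import urllib.request

request = urllib.request.Request("http://rift.example.com/get?name=test.txt")
request.add_header("Range", "bytes=0-1023")      # first kilobyte only

with urllib.request.urlopen(request) as response:
    chunk = response.read()
    print(response.status, len(chunk), "bytes")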

There is basic authentication support in Rift. I will extend it to a per-bucket fashion similar to what Amazon S3 has (not the same API though). Rift also supports multiple-group caching, which is extremely useful for bigger content, when you suddenly decide that given objects have to be spread across more groups than those they were originally written into. There is ‘a button’ (basically a python script) which copies given keys from their original groups into the caching ones and broadcasts updates to all Rift proxies by updating special keys which the proxies check periodically. Caching can be turned on and off on a per-key basis.

One can create special SSD caching groups, for example, and put the needed files there for some time. Or those can be commodity spinning disks for larger files like video content.
More details on this later at documentation site.

Everything above and several new features will be available both in Rift proxies and in the new cluster we are currently developing. Not that it will be plain Amazon S3, but something similar. More details later this year :)

And right now one can check out the new Go elliptics bindings project started by Anton Tyurin. That will be the base for our HTTP entry point.

Epic elliptics commit

I’ve just accepted a commit by Ruslan Nigmatullin which fixes a 31-bit integer overflow in the atomic counters used as transaction IDs.

This basically means that we have clients each of which has sent us more than 2 billion requests without a restart or disconnect – impressive clients!

Distributed in-memory cache in Elliptics

Ruslan “elessar” Nigmatullin wrote a new article on the elliptics cache architecture: http://doc.reverbrain.com/elliptics:cache

It describes how the cache is organized and how one can use it to speed up disk operations or just to store data in memory only.
Some non-trivial cases are also covered: how to organize the cache for append writes and partial updates, as well as various corner cases.

Enjoy!

250 billion photos on Facebook

Quite an impressive number. I’m curious how many servers they use.
Facebook used to use a photo storage named Haystack; I based the Eblob design solely on the whitepaper describing the proprietary Haystack design.

I removed the highest indirection level though – the one where key indexes can live on separate servers – and left only the in-memory and on-disk indexes.

That’s probably why the largest Elliptics storage by number of keys hosts only 50+ billion objects (counting all 3 copies though). And that’s actually on less than a hundred nodes (including all 3 replicas).

The year of eblob

Hi, this is rbtz speaking again. I’ve been the engineer responsible for the eblob codebase for almost a year now. Here is a small recap of what has been happening with eblob since v0.17.2, with some commit highlights.

v0.17.2
* Eblob now builds under Mac OS X. This improved the experience of developers with Macs.
* Changed most links to point to newly created http://reverbrain.com.
* Added comments to all eblob subsystems: e254fc3. This eases the learning curve for new developers.
v0.17.3
* Added l2hash support: c8fa62c. This reduces memory consumption of elliptics metadata by 25% on LP64.
* Added the first edition of the eblob stress test. A year later it’s responsible for catching 99% of the bugs that would otherwise go to testing: 8eab8ed.
v0.17.4

v0.17.6
* Added config variables for index block and bloom: a106d9d. This allows sysadmins to limit memory occupied by bloom filter.
* Added config variable to limit total blob size: f7da001. This allows sysadmins to limit eblobs size in case many databases are located on one shared drive.
v0.17.7
* Reduce memory consumption of “unsorted” blobs by 20% on LP64: 19e8612
v0.17.8
* First static analyzer crusade (feat. clang static analyzer) – a number of “almost impossible to spot” bugs found.
* Added data-sort and binlog v1. This allows “on the fly” eblob defragmentation and memory cleanups.
* Added travis-ci tests after each commit: f08fea2.
* Removed custom in-memory cache in favor of OS page cache: a7e74a7; This removed a number of nasty races in eblob code and also opened the way for some future optimizations.
* Added Doxyfile stub, so that in future libeblob man pages may be autogenerated: aac9cb3.
* Decreased memory consumption of in-memory data structures by 10% on LP64: c6afffa.
v0.18.0

v0.18.2
* Replaced core mutexes with rwlocks; This improves our Intel vTune concurrency benchmarks, along with our QA tests.
v0.18.3

v0.18.5-1
* Second static analyzer crusade (feat. Coverity);
* Switched to adaptive mutexes (https://en.wikipedia.org/wiki/Spinlock#Alternatives) when available: 43b35d8.
* Sped up statistics update v1: 40a60d7. Do not hold the global lock while computing and writing stats to disk.
* Rewritten bloom filter v1: 6f08e07. This improves speed and reduces memory fragmentation.
* Allocate index blocks in one big chunk instead of millions of small, thus speeding up initialization and reducing memory fragmentation: b87e273.
v0.18.6
v0.19.0
* Do not hold global lock for the whole duration of sync(): 6f6be68. This removes “stalls” in configs where sync > 0.
v0.19.1
* Switched to POSIX.1-2008 + XSI extensions: 6ece045.
* Build with -Wextra and -Wall by default: 0e8c713. This should in long term substantially improve code quality.
* Added options to build with hardening and sanitizers: c8b8a34, 2d8a42c. This improves our internal automated tests.
v0.19.2

v0.19.7
* Do not set bloom filter bits on start for removed entries: 36e7750. This will improve lookup times of “long removed” but still not defragmented entries.
v0.19.8
v0.19.9
* Added a separate thread for small periodic tasks: ea17fc0. This can in the future be upgraded to a simple background task manager;
* Moved statvfs(3) to periodic thread which speeds up write-only micro benchmarks by 50%: f36ab9d.
* Lock the database on init to prevent data corruption by simultaneous accesses to the same database by different processes: 5e5039d. See more about EB0000 in the kb article.
v0.19.10
v0.19.11
* Removed columns aka types: 6b1f173; This greatly simplifies code and as a side effect improves elliptics memory usage and startup times;
* Removed compression: 35ac55f; This removes dependency on Snappy.
* Removed bsize knob for write alignment: 8d87b32;
* Rewritten stats v2: 94c85ec; Now stats updates are very lightweight and atomic;
* Added writev(2)-like interface to eblob, so that elliptics backend could implement very efficient metadata handling: b9e0391;
v0.20.0

v0.21.3
* Replaced complex binlog with very tiny binlog v2: 1dde6f3; This greatly simplifies code, improves data-sort speed and memory efficiency;
* Made tests multithreaded: 1bd2f43. Now we can spot even more errors via automated tests before they hit staging.
* Moved to the GNU99 standard: f65955a. It’s already 15 years old =)
* Fixed a very old bug with log message truncation/corruption on multithreaded workloads: 10b6d47.
v0.21.4
* Bloom filter rewrite v2: 1bfadaf. Now we use many hash functions instead of one, thus trading CPU time for improved IO efficiency. This improved bloom filter efficiency by an order of magnitude.
v0.21.5
* Merge small blobs into one on defrag: ace7ca7. This improves eblob performance on databases with high record rotation, maintaining an almost fixed number of blobs.
v0.21.6
v0.21.7
* Added record validity check on start: bcdb0be; See more about database corruption EB0001 in the kb article.
v0.21.8

v0.21.11
* More robust eblob_merge tool that can be used to recover corrupted blobs.
v0.21.12
v0.21.13
* Reduced memory consumption of in-memory data-structures by 10% on LP64: e851820;
v0.21.14

v0.21.16
* Added schedule-based data-sort: 2f457b8; More on this topic in previous post: data-sort implications on eblob performance.
v0.21.17

Here I’ve mentioned only the most notable commits, mostly performance and memory usage oriented changes. There is of course lots of other stuff going on, like bugfixes, minor usability improvements and some internal changes.

Here are some basic stats for this year:

Total commits: 1375
Stats total: 65 files changed, 8057 insertions(+), 4670 deletions(-)
Stats excl. docs, tests and ci: 39 files changed, 5368 insertions(+), 3782 deletions(-)

Also, if you are interested in what’s going to happen in the near future in the eblob world, you should probably take a look at its roadmap.

By the way, for those of you who are interested in numbers and pretty graphs – after the recent upgrade of our internal elliptics cluster storing billions of records to the new LTS releases of elliptics 2.24.X and eblob 0.22.X we’ve got:

Response time reduction (log scale):

98th percentile that was around 100ms dropped below 50ms.

Disk IO (linear scale):

IO dropped by more than one order of magnitude.

Memory (linear scale):

There is much more “cached” memory now. Also the periodic data-sort routine successfully frees unused cached keys.

Data-sort implications on eblob performance

Lots of stuff has been written about data-sort and defragmentation in recent eblob versions (>= 0.18.0), both in the documentation and in blog posts. Today I want to talk about eblob memory management and data structures, and why regular data-sort is essential for eblob/elliptics performance.

First, when a key is written to the data file, its basic information – the key itself, the record size and its location on disk – is stored in the in-memory index (internally an rb-tree) and also written to the so-called “unsorted” index. So both the data file and the “unsorted” index have records sorted by their write time, but we can still very efficiently find a given key in the in-memory index or iterate over the “unsorted” index, because the order of records matches the one used in the data file.

But keeping all keys in memory is not always possible (especially when you have billions of them). So on each startup eblob sorts all but the last of the “unsorted” indexes by key, so that it can use more efficient data structures instead of keeping all keys in the in-memory rb-tree. Memory-efficient as it is, this breaks the record ordering between the data file (records sorted by write time) and the new “sorted” index (records sorted by key). This makes iteration over such a blob very inefficient (consuming way too many IOPS).
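
The trade-off can be illustrated with a tiny Python sketch (purely conceptual, nothing like the actual C code): a key-sorted index lets us find a record by binary search without keeping every key in an in-memory tree, but the offsets it yields no longer follow the data file's write order, so full iteration turns into random IO.

# Conceptual sketch of the sorted-index trade-off; not the actual eblob code.
import bisect

# "Sorted" index: records ordered by key, each pointing into the data file.
sorted_index = [
    ("key-a", 4096, 100),    # (key, offset in data file, size)
    ("key-b", 0,    200),
    ("key-c", 8192, 300),
]
keys = [record[0] for record in sorted_index]

def lookup(key):
    # O(log n) binary search over the key-sorted index.
    position = bisect.bisect_left(keys, key)
    if position < len(keys) and keys[position] == key:
        return sorted_index[position]
    return None

print(lookup("key-b"))
# Iterating in key order visits offsets 4096, 0, 8192 - random IO
# against a data file that is laid out in write order.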

To mitigate these problems the data-sort routine was introduced. It serves three purposes:
* It purges records from the in-memory index, so that recently unused keys won’t occupy precious RAM and cause OOM on write-heavy workloads.
* It defragments data by physically deleting keys that were marked as “removed”. This is analogous to the compact procedure of some NoSQL databases or SQL’s VACUUM command.
* It restores the record ordering between the data file and the index, so that iteration speed is maximized.

As fast, efficient and useful as it is, data-sort is a rather heavy-weight routine, because it can theoretically move terabytes across the drive, so it’s rather unwise to run it during peak hours. Given that, a number of knobs were introduced so that the administrator can manage when data-sort starts.

Starting with eblob v0.21.17 and elliptics v2.24.14.11 the admin may select between
four different methods of managing data-sort startup times.

  • AUTO – run data-sort on startup for each “unsorted” blob (but the last one). Also run it on every blob’s “close”. This is the preferred method for small databases.
  • TIMED – the old way of running defrag every defrag_timeout seconds. It’s useful for autogenerated config files where each server gets its own timeout.
  • SCHEDULED – the most sophisticated built-in method. It automagically spreads the data-sort load across nodes in time, based on the given defrag_time and defrag_splay, so that each node selects some random time in the range [defrag_time - defrag_splay, defrag_time + defrag_splay] hours (a small sketch of this follows the list). This is the preferred method for medium/big clusters.
  • NONE – when none of the given options is selected one must periodically run data-sort via the provided API – for example, with elliptics one can use the dnet_ioclient -r host:port:family -d start command to run defrag on the node given by the host:port:family tuple. This is the preferred method for very big clusters that require some external synchronization based on e.g. replica location, current time or node load.
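
As a rough illustration of the SCHEDULED behaviour (an approximation of the semantics described above, not the actual implementation):

# Approximation of the SCHEDULED start-time selection; not the actual eblob code.
import random

def pick_defrag_hour(defrag_time, defrag_splay):
    # Each node picks a random hour in [defrag_time - splay, defrag_time + splay],
    # spreading the heavy data-sort load across the cluster in time.
    return random.randint(defrag_time - defrag_splay, defrag_time + defrag_splay) % 24

print(pick_defrag_hour(defrag_time=3, defrag_splay=2))   # e.g. run somewhere between 1am and 5am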

NB! Failure to run data-sort periodically will lead to excessive memory usage, wasted HDD space and slow iteration speed (e.g. dnet_recovery).

For more information see:

Stronger Semantics for Low-Latency Geo-Replicated Storage

An interesting USENIX paper about Eiger – a new consistency system for distributed storage.

A short gist of it is quite simple: the client tracks what it has seen, so operations must obey/fix dependencies on those objects.
Each datacenter maintains a whole replica of the data, and data within a datacenter cannot be lost; its updates are always consistent. This is achieved by Paxos within the datacenter.

Eiger is based on Spanner – the much-hyped Google distributed storage with atomic clocks, GPS and other such cool stuff.
Because of that Eiger has so-called logical clocks – timestamps unique across all datacenters; this is achieved via the aforementioned atomic clocks and GPS. Given those unique IDs, servers order operations and the client can track dependencies.

Eiger is the next step from a simple key-value storage: it supports columns and read-only/write-only transactions. Transactions are based on dependencies.
Write operations are replicated between datacenters; this is done by the server which received the data from the client. Replication just sends the data to other servers in different datacenters, which compare the unique timestamps, and if a timestamp is older than the one in the replica, the update is discarded – the last writer wins.
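
The conflict-resolution rule boils down to a few lines (my own sketch of the idea, not code from the paper):

# Last-writer-wins merge as described above; a sketch of the idea, not the paper's code.
def apply_replicated_write(stored, update):
    # 'stored' and 'update' are (timestamp, value) pairs; timestamps are the
    # globally unique logical clocks assigned by the accepting datacenter.
    if update[0] > stored[0]:
        return update          # newer timestamp wins
    return stored              # older update is discarded

print(apply_replicated_write((10, "old"), (12, "new")))   # -> (12, 'new')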

I did not really read how transactions that span multiple datacenters are implemented – real-life applications do not have atomic clocks and GPS to implement distributed unique timestamps, and thus will not be able to work with such a system. In real life we either have to deal with eventual consistency or give up scaling to web sizes.

Elliptics has eventual consistency model, albeit with ability to read latest updated data among multiple replicas, and that’s so far the only way to implement web-scale volumes (our largest cluster hosts 36+ billions of records – about 2 instagrams).