Rift allows you to store popular content in separate groups for caching. This is quite different from the elliptics cache, where data is stored in memory in segmented LRU lists. Persistent caching allows you to temporarily put your data into additional elliptics groups, which will serve IO requests. This is usually very useful for heavy content like big images or audio/video files, which are rather expensive to put into a memory cache.
One can update the list of objects to be cached as well as the per-object list of additional groups. There is a cache.py tool with extensive help to work with cached keys. This tool will grab the requested keys from the source groups and put them into the caching groups, as well as update a special elliptics cache list object which is periodically checked by Rift (the timeout option in the cache configuration block of Rift). As soon as Rift finds new keys in the elliptics cache list object, it will start serving IO from those cached groups in addition to the original groups specified in the bucket. When using the cache.py tool please note that its file-namespace option is actually a bucket name.
To remove an object from the cache one should use the same cache.py tool – it will remove data from the caching groups (please note that physically removing objects from disk in elliptics may require running online eblob defragmentation) and update the special elliptics cache list object. This object will be reread some time in the future, so if a requested key can not be found in the cache, it will automatically be served from the original bucket groups.
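The group-selection logic this implies can be sketched in a few lines of Python (a hypothetical helper, not actual Rift code; cache_list stands for the parsed elliptics cache list object):

```python
def groups_for_key(key, bucket_groups, cache_list):
    """Pick the groups IO for a given key would be served from."""
    # keys present in the cache list are served from their caching groups
    # in addition to the bucket's original groups
    cached = cache_list.get(key, [])
    return cached + [g for g in bucket_groups if g not in cached]

cache_list = {"big-video.mp4": [10, 11]}
print(groups_for_key("big-video.mp4", [1, 2, 3], cache_list))  # [10, 11, 1, 2, 3]
print(groups_for_key("other.txt", [1, 2, 3], cache_list))      # [1, 2, 3]
```

Once the key disappears from the cache list object, the same lookup naturally falls back to the original bucket groups.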
Here is Elliptics per-command monitoring.
Monitoring includes very detailed statistics about the most interesting bits of the storage. The picture above shows the write command execution flow (the whole command and how time was spent within it) – one can turn it on for one’s own commands by setting a special trace bit.
Other monitoring parts include queues (network and processing), cache, eblob (command timings, data stats, counters), VFS and other stats.
More details on how to use that data in JSON format are coming.
We always improve our documentation.
We have now added a detailed page about elliptics configuration, available at http://doc.reverbrain.com/elliptics:configuration.
We have also added a description of the dnet_balancer tool, which allows balancing DHT ring partitions within one group. It can be found at http://doc.reverbrain.com/elliptics:tools#dnet_balancer.
The second most wanted feature in elliptics HTTP frontend RIFT is ACL support.
Rift already provides S3-like buckets – namespace metadata which allows storing data in separate per-bucket unique groups, fetching the whole list of objects stored in a given bucket, and of course using the same object names in different namespaces. Rift also allows you to have a set of objects cached in special groups; each ‘cached’ object may have its own set of groups to check first. This option can be used to cache commonly used objects in additional groups, such as temporary in-memory or SSD groups.
And now I’ve added ACL support to buckets. As stated, an ACL is an access control list where each username is associated with a secure token used to check the Authorization header, plus auth flags which allow bypassing some or all auth checks.
Admin must set up the per-bucket ACL using the rift_bucket_ctl tool, where multiple --acl options can be used. This control tool uses the following format: --acl user:token:flags. Here user is a username provided both in the URI (&user=XXX) and in the ACL (--acl XXX:token:0), and token is used to check the Authorization header.
Here is the whole state machine of Rift’s authentication checker (run when the bucket has been found and successfully read and parsed by the server):
- if the group list is empty, a not-found error is returned
- if the ACL is empty, ok is returned – there is nothing to check against
- if no user= URI parameter is found, a forbidden error is returned – one must provide a username if an ACL is configured
- the user is looked up in the ACL; if no match is found, a forbidden error is returned
- if flags has bit 1 (counting from zero) set, the security check is bypassed for the given user – ok is returned
- the Authorization header is searched for; if there is no such header, a bad-request error is returned
- the security data in the Authorization header is checked against the secure token found in the ACL entry; if the auth data mismatches, a forbidden error is returned
- otherwise ok is returned
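A minimal Python sketch of that state machine (names here are illustrative, not Rift’s actual code; token_matches stands for whatever signature check is performed on the Authorization header):

```python
FLAG_NOAUTH_READ = 1 << 0   # bit 0: skip auth for read-only handlers
FLAG_NOAUTH_ALL  = 1 << 1   # bit 1: skip every auth check for this user

def check_auth(groups, acl, user, auth_header, token_matches):
    """Return an HTTP-style verdict following the steps above."""
    if not groups:
        return 404              # empty group list -> not found
    if not acl:
        return 200              # empty ACL -> nothing to check against
    if user is None:
        return 403              # ACL configured but no user= parameter
    entry = acl.get(user)
    if entry is None:
        return 403              # user not present in the ACL
    if entry["flags"] & FLAG_NOAUTH_ALL:
        return 200              # bit 1 set -> bypass the security check
    if auth_header is None:
        return 400              # missing Authorization header
    if not token_matches(entry["token"], auth_header):
        return 403              # auth data mismatch
    return 200

acl = {"alice": {"token": "secret", "flags": 0},
       "bob":   {"token": "secret", "flags": FLAG_NOAUTH_ALL}}
match = lambda token, header: header == token   # stand-in signature check
print(check_auth([1, 2], acl, "bob", None, match))    # 200: bit 1 bypasses auth
print(check_auth([1, 2], acl, "alice", None, match))  # 400: no Authorization header
```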
The result of this check can be found in the log with the verdict: prefix at ERROR/INFO (1/2) log levels and higher.
But even if the state machine returned a non-ok verdict, the operation may still be processed. This can happen when the per-bucket ACL flags allow bypassing auth not for all requests (bit 1), but only for read requests (bit 0). In this case /get, /list, /download-info and other read-only handlers will check the ACL flag bits and optionally rewrite the verdict. You will find something like this in the logs:
[NOTICE] get-base: checked: url: /get?name=test.txt, original-verdict: 400, passed-no-auth-check
Check out more info about Rift – our full-featured elliptics HTTP frontend: http://doc.reverbrain.com/rift:rift
I often get asked how to get a list of keys written into elliptics. I do not really understand why this is needed, especially considering storage setups where billions of keys were uploaded, but it is still one of the most frequently asked questions.
Elliptics has secondary indexes for that purpose. Indexes are automatically sharded and evenly distributed across the nodes in the group.
One can tag uploaded keys with special indexes and then intersect those indexes on the servers, or read the whole index key by key. That’s essentially what RIFT – the HTTP elliptics frontend – does when you upload a file through its HTTP interface.
And I’ve added listing support to the RIFT proxy via the /list URI – it reads an index from the server, iterates over the keys and produces a nice JSON output. It also prints the timestamp of the key’s update in the index, both in seconds and in the current timezone.
The URI accepts namespace – the bucket name to get indexes from – and name – a placeholder for future index names (in case we support multiple indexes).
$ curl "http://example.com/list?namespace=testns&name="
...
    "timestamp": "1970-01-01 03:00:00.0",
...
    "timestamp": "2014-02-18 03:29:44.835283",
...
The 1970-01-01 timestamp appears for older indexes, created before timestamps were supported.
key is the object name given at upload time; id is the numeric elliptics ID (one can read those objects directly from elliptics without the namespace name); time_seconds is a coarse-grained timestamp in seconds since the Epoch; timestamp is the real parsed timestamp with microsecond resolution.
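For illustration, decoding one such entry could look like this in Python (the entry layout is assumed from the field descriptions above; the id value is abbreviated and hypothetical):

```python
from datetime import datetime, timezone

def describe_entry(entry):
    # time_seconds is seconds since the Epoch; render it as a UTC timestamp
    ts = datetime.fromtimestamp(entry["time_seconds"], tz=timezone.utc)
    return "%s (id=%s), updated %s" % (entry["key"], entry["id"], ts.isoformat())

entry = {
    "key": "test.txt",
    "id": "0a1b2c...",            # elliptics ID, abbreviated here
    "time_seconds": 1392694184,
    "timestamp": "2014-02-18 03:29:44.835283",
}
print(describe_entry(entry))   # test.txt (id=0a1b2c...), updated 2014-02-18T03:29:44+00:00
```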
There is also an example python script which does basically the same – it reads an index, unpacks it and prints it to the console: https://github.com/reverbrain/rift/blob/elliptics-2.25/example/listing.py
Meet scalable monitoring subsystem for Elliptics server nodes.
Monitoring allows tracking the performance of various parts of an Elliptics server node, such as command handling, cache, backend etc.
It includes a simple HTTP server which provides JSON statistics via a REST API. The statistics can be requested fully or partly, by category. The list of available statistics categories can be found at http://host:monitoring_port/list.
Monitoring can be extended via external providers that allow deepening the basic statistics.
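A short sketch of how a client might talk to that HTTP server from Python (the per-category path is an assumption here – only the /list endpoint is spelled out above; host and port are made up):

```python
import json
from urllib.request import urlopen

def monitoring_url(host, port, category=None):
    # /list enumerates categories; an individual category is assumed
    # to be served at /<category>
    return "http://%s:%d/%s" % (host, port, category or "list")

def fetch_stats(host, port, category=None):
    # fetch and decode one JSON statistics document
    with urlopen(monitoring_url(host, port, category)) as resp:
        return json.load(resp)

print(monitoring_url("storage1", 10025))           # http://storage1:10025/list
print(monitoring_url("storage1", 10025, "cache"))  # http://storage1:10025/cache
```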
For more details check out docs:
http://doc.reverbrain.com/elliptics:monitoring – user documentation that describes how to gather and read statistics
http://doc.reverbrain.com/elliptics:monitoring-inside – developer documentation that describes how monitoring is implemented and how you can write a custom statistics provider.
In order to add some flexibility to our cache layer, we replaced the simple LRU cache with a segmented LRU cache. Its structure perfectly handles the concept of “hot data”, which is so common in real-world applications.
In segmented LRU, cached data is divided into several pages depending on its access frequency. This way one-time requests will only touch temporal pages and won’t affect data in popular pages, thus protecting hot data from overflow evictions.
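The idea can be modeled with a two-segment LRU in Python (a toy sketch, not the actual elliptics implementation): entries enter a probationary segment and are promoted to a protected segment on a second hit, so one-time scans can only evict other one-time entries.

```python
from collections import OrderedDict

class SegmentedLRU:
    def __init__(self, prob_size, prot_size):
        self.prob_size, self.prot_size = prob_size, prot_size
        self.prob = OrderedDict()   # probationary: seen once
        self.prot = OrderedDict()   # protected: seen at least twice

    def get(self, key):
        if key in self.prot:
            self.prot.move_to_end(key)       # refresh the hot entry
            return self.prot[key]
        if key in self.prob:
            value = self.prob.pop(key)       # second hit: promote
            self.prot[key] = value
            if len(self.prot) > self.prot_size:
                # demote the coldest protected entry back to probationary
                k, v = self.prot.popitem(last=False)
                self._put_prob(k, v)
            return value
        return None

    def put(self, key, value):
        if key in self.prot:
            self.prot[key] = value
            self.prot.move_to_end(key)
        else:
            self.prob.pop(key, None)
            self._put_prob(key, value)

    def _put_prob(self, key, value):
        self.prob[key] = value
        self.prob.move_to_end(key)
        if len(self.prob) > self.prob_size:
            self.prob.popitem(last=False)    # one-time entries are evicted first

cache = SegmentedLRU(prob_size=2, prot_size=2)
cache.put("hot", "payload")
cache.get("hot")                  # second touch promotes "hot"
for i in range(100):              # a long one-time scan...
    cache.put("scan-%d" % i, i)
print(cache.get("hot"))           # payload: the hot entry survived the scan
```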
Another feature implemented in the new cache to decrease the size of a cache record was replacing the combination of a binary search tree and a heap with a single structure, a Cartesian tree, that covers both roles just as effectively.
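A Cartesian tree (treap) sketch shows how one node can serve both orders at once – BST order on the key and min-heap order on the last-access time, so the least recently used record is always at the root (an illustrative model, not the elliptics code):

```python
class Node:
    __slots__ = ("key", "prio", "left", "right")
    def __init__(self, key, prio):
        self.key, self.prio = key, prio
        self.left = self.right = None

def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    return l

def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    return r

def insert(root, key, prio):
    # standard treap insert: BST on key, min-heap on prio (last-access time)
    if root is None:
        return Node(key, prio)
    if key < root.key:
        root.left = insert(root.left, key, prio)
        if root.left.prio < root.prio:
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key, prio)
        if root.right.prio < root.prio:
            root = rotate_left(root)
    return root

root = None
for key, last_access in [("b", 30), ("a", 10), ("c", 20)]:
    root = insert(root, key, last_access)
print(root.key, root.prio)   # a 10: the eviction candidate sits at the root
```

One node instead of a tree node plus a heap slot is exactly where the per-record memory saving comes from.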
For more information and implementation details check out our docs: http://doc.reverbrain.com/elliptics:cache
Go is a language in active development, although it is already used in production in many environments. I decided to use it to implement the SaaS part (basically the frontend and API) of the upcoming Elliptics storage service.
Anton Tyurin started the Elliptics Golang bindings, and to date the implementation contains all major elliptics client features: IO, bulk operations, indexes. I added the ability to use Go loggers within elliptics client code (this feature is not present in the Python bindings, for example).
SaaS (storage as a service, in our case) will allow clients to register and create new buckets (metadata objects similar to Amazon S3 buckets) with a secondary index containing all records uploaded into them, basic authentication and various limits (like capped collections in MongoDB). All these features are already implemented in Rift – the HTTP elliptics access point – and I only need to wrap them nicely in an HTTP frontend.
Although I planned to finish this task and open the new service in December 2013, things changed a bit (I am actively developing Wookie for a web-search contest in parallel), but it should still be ready this month.
I’ve just accepted a commit by Ruslan Nigmatullin which fixes a 31-bit integer overflow in the atomic counters used as transaction IDs.
This basically means that we have clients each of which has sent us more than 2 billion requests without a restart or disconnect – impressive clients!
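The failure mode is easy to reproduce: reinterpret an ever-growing counter as a signed 32-bit value and it goes negative after 2^31 increments (a generic illustration, not the actual elliptics code):

```python
def as_int32(x):
    # reinterpret an unbounded Python int as a signed 32-bit value
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & (1 << 31) else x

print(as_int32(2**31 - 1))   # 2147483647: the last safe transaction ID
print(as_int32(2**31))       # -2147483648: the next increment wraps negative
```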
DotCloud has merged pull request made by Anton Tyurin
So Docker’s repository now supports storing its layered ‘filesystem’ in Elliptics – with its automagic recovery, replication into multiple datacenters (all around the world), scalability and so on.
Now you can safely distribute your VM images over Elliptics.
Ruslan “elessar” Nigmatullin wrote a new article on the elliptics cache architecture: http://doc.reverbrain.com/elliptics:cache
It describes how the cache is organized and how one can use it to speed up disk operations or just to store data in memory only.
Some non-trivial cases are also touched upon: how to organize caching for append writes and partial updates, as well as various corner cases.
Quite an impressive number. I’m curious how many servers they use.
Facebook used to use a photo storage system named Haystack; I based the Eblob design solely on the whitepaper describing the proprietary Haystack design.
I did remove the highest indirection level though – the one where key indexes can live on separate servers – and only left the in-memory and on-disk indexes.
That’s probably why the largest Elliptics storage by number of keys hosts only 50+ billion objects (counting all 3 copies). And that’s actually less than a hundred nodes (including all 3 replicas).
Hi, this is rbtz speaking again. I’ve been the engineer responsible for the eblob codebase for almost a year now. Here is a small recap of what has been happening with eblob since v0.17.2, with some commit highlights.
* Eblob now builds under Mac OS X. This improved experience of developers with Macs.
* Changed most links to point to newly created http://reverbrain.com.
* Added comments to all eblob subsystems: e254fc3. This improves learning curve of new developers.
* Added l2hash support: c8fa62c. This reduces memory consumption of elliptics metadata by 25% on LP64.
* Added the first edition of the eblob stress test: 8eab8ed. A year later it’s responsible for catching 99% of the bugs that would otherwise go to testing.
* Added config variables for index block and bloom: a106d9d. This allows sysadmins to limit memory occupied by bloom filter.
* Added config variable to limit total blob size: f7da001. This allows sysadmins to limit eblobs size in case many databases are located on one shared drive.
* Reduced memory consumption of “unsorted” blobs by 20% on LP64: 19e8612.
* First static analyzer crusade (feat. clang static analyzer) – a number of “almost impossible to spot” bugs found.
* Added data-sort and binlog v1. This allows “on the fly” eblob defragmentation and memory cleanups.
* Added travis-ci tests after each commit: f08fea2.
* Removed the custom in-memory cache in favor of the OS page cache: a7e74a7. This removed a number of nasty races in eblob code and also opened the way for some future optimizations.
* Added Doxyfile stub, so that in future libeblob man pages may be autogenerated: aac9cb3.
* Decreased memory consumption of in-memory data structures by 10% on LP64: c6afffa.
* Replaced core mutexes with rwlocks. This improves our Intel vTune concurrency benchmarks, along with our QA tests.
* Second static analyzer crusade (feat. Coverity);
* Switched to adaptive mutexes (https://en.wikipedia.org/wiki/Spinlock#Alternatives) when available: 43b35d8.
* Sped up statistics update v1: 40a60d7. Do not hold the global lock while computing and writing stats to disk.
* Rewrote bloom filter v1: 6f08e07. This improves speed and reduces memory fragmentation.
* Allocate index blocks in one big chunk instead of millions of small ones, thus speeding up initialization and reducing memory fragmentation: b87e273.
* Do not hold the global lock for the whole duration of sync(): 6f6be68. This removes “stalls” in configs where sync > 0.
* Switched to POSIX.1-2008 + XSI extensions: 6ece045.
* Build with -Wextra and -Wall by default: 0e8c713. This should in long term substantially improve code quality.
* Added options to build with hardening and sanitizers: c8b8a34, 2d8a42c. This improves our internal automated tests.
* Do not set bloom filter bits on start for removed entries: 36e7750. This improves lookup times for “long removed” but still not defragmented entries.
* Added a separate thread for small periodic tasks: ea17fc0. In the future this can be upgraded to a simple background task manager.
* Moved statvfs(3) to the periodic thread, which speeds up write-only micro benchmarks by 50%: f36ab9d.
* Lock the database on init to prevent data corruption caused by simultaneous accesses to the same database by different processes: 5e5039d. See more about EB0000 in the kb article.
* Removed columns aka types: 6b1f173. This greatly simplifies the code and as a side effect improves elliptics memory usage and startup times.
* Removed compression: 35ac55f. This removes an external dependency.
* Removed the bsize knob for write alignment: 8d87b32.
* Rewrote stats v2: 94c85ec. Stats updates are now very lightweight and atomic.
* Added writev(2)-like interface to eblob, so that elliptics backend could implement very efficient metadata handling: b9e0391;
* Replaced the complex binlog with a very tiny binlog v2: 1dde6f3. This greatly simplifies the code and improves data-sort speed and memory efficiency.
* Made tests multithreaded: 1bd2f43. Now we can spot even more errors via automated tests before they hit staging.
* Moved to the GNU99 standard: f65955a. It’s already 15 years old =)
* Fixed a very old bug with log message truncation/corruption on multithreaded workloads: 10b6d47.
* Bloom filter rewrite v2: 1bfadaf. Now we use many hash functions instead of one thus trading CPU time for improved IO efficiency. This improved bloom filter efficiency by order of magnitude.
* Merge small blobs into one on defrag: ace7ca7. This improves eblob performance on databases with high record rotation by maintaining an almost fixed number of blobs.
* Added a record validity check on start: bcdb0be. See more about database corruption EB0001 in the kb article.
* A more robust eblob_merge tool that can be used to recover corrupted blobs.
* Reduced memory consumption of in-memory data-structures by 10% on LP64: e851820;
* Added schedule-based data-sort: 2f457b8; More on this topic in previous post: data-sort implications on eblob performance.
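The bloom filter rewrite mentioned above (many hash functions instead of one) follows the textbook scheme; here is a generic sketch, with an illustrative hashing scheme rather than eblob’s:

```python
import hashlib

class BloomFilter:
    def __init__(self, nbits, nhashes):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray((nbits + 7) // 8)

    def _positions(self, key):
        # derive nhashes bit positions by salting one base hash
        for i in range(self.nhashes):
            h = hashlib.md5(b"%d:%s" % (i, key)).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        # all bits set -> "maybe present"; any bit clear -> definitely absent
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter(nbits=1 << 16, nhashes=4)
bf.add(b"written-key")
print(b"written-key" in bf)   # True: a bloom filter never gives false negatives
print(b"missing-key" in bf)   # False with overwhelming probability
```

Spending k hash computations per lookup buys a much lower false-positive rate for the same bit array, which is exactly the CPU-for-IO trade mentioned above.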
Here I’ve mentioned only the most notable commits, mostly performance- and memory-usage-oriented changes. There is of course lots of other stuff going on, like bugfixes, minor usability improvements and some internal changes.
Here are some basic stats for this year:
Total commits: 1375
Stats total: 65 files changed, 8057 insertions(+), 4670 deletions(-)
Stats excl. docs, tests and ci: 39 files changed, 5368 insertions(+), 3782 deletions(-)
Also, if you are interested in what’s going to happen in the eblob world in the near future, you should probably take a look at its roadmap.
By the way, for those of you interested in numbers and pretty graphs: after the recent upgrade of our internal elliptics cluster storing billions of records to the new LTS releases of elliptics 2.24.X and eblob 0.22.X we’ve got:
Response time reduction (log scale):
The 98th percentile, which was around 100 ms, dropped below 50 ms.
Disk IO (linear scale):
IO dropped by more than one order of magnitude.
Memory (linear scale):
There is much more “cached” memory now. Also, the periodic data-sort routine successfully frees unused cached keys.
I made a new presentation at Fedora’s Road to Cloud tech conference.
It’s all about safety – what problems can be found when developing and deploying distributed storage solutions, and how we solved them in Elliptics.
One can find an article (in Russian) for this presentation at http://ioremap.net/tmp/roadtocloud.txt
Lots of stuff has been written about data-sort and defragmentation in recent eblob versions (>= 0.18.0), both in documentation and blog posts. Today I want to talk about eblob memory management and data structures, and why regular data-sort is essential for eblob/elliptics performance.
First, when a key is written to the data file, its basic information – the key itself, the record size, and its location on disk – is stored in the in-memory index (internally an rb-tree) and also written to the so-called “unsorted” index. So both the data file and the “unsorted” index have records sorted by their write time, but we can still very efficiently find a given key in the in-memory index, or iterate over the “unsorted” index, because the order of its records matches the one used in the data file.
But keeping all keys in memory is not always possible (especially when you have billions of them). So on each startup eblob sorts all but the last “unsorted” index by key, so that it can use more efficient data structures instead of storing all keys in the in-memory rb-tree. Memory-efficient as it is, this breaks the record ordering between the data file (records sorted by write time) and the new “sorted” index (records sorted by key). This makes iteration over such a blob very inefficient (consuming way too many IOPS).
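The benefit of a key-sorted index can be shown with a toy model: fixed-size entries ordered by key can be searched with plain binary search, with no per-key RAM structure (a sketch, not eblob’s on-disk format):

```python
import bisect

# a miniature "sorted" index: (key, metadata) entries ordered by key;
# in eblob the entries live on disk, not in an in-memory tree
index = [(b"key-%04d" % i, {"offset": i * 4096, "size": 4096})
         for i in range(1000)]
keys = [k for k, _ in index]

def lookup(key):
    # plain O(log n) binary search replaces the per-key RAM structures
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return index[i][1]
    return None

print(lookup(b"key-0042"))   # {'offset': 172032, 'size': 4096}
print(lookup(b"key-x"))      # None
```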
To mitigate those problems the data-sort routine was introduced. It combines three purposes:
* Purges records from the in-memory index, so that recently unused keys won’t occupy precious RAM and cause OOM on write-heavy workloads.
* Defragments data by physically deleting keys that were marked as “removed”. It’s analogous to the compact procedure of some NoSQL databases or its SQL counterparts.
* Restores the record ordering between the data file and the index, so that iteration speed is restored.
As fast, efficient and useful as it is, data-sort is a rather heavyweight routine – it can theoretically move terabytes across the drive, so it’s rather unwise to run it during peak hours. Given that, a number of knobs were introduced so that an administrator can manage when data-sort starts.
Starting with eblob v0.21.17 and the corresponding elliptics release, an admin may select between four different methods of managing data-sort startup times.
- AUTO – run data-sort on startup for each “unsorted” blob (but the last). Also run it on every blob “close”. This is the preferred method for small databases.
- TIMED – the old way of running defrag every defrag_timeout seconds. It’s useful for autogenerated config files where each server gets its own timeout.
- SCHEDULED – the most sophisticated built-in method. It automagically spreads data-sort load across nodes in time: based on the given defrag_splay, each node selects some random time in the range [defrag_time - defrag_splay, defrag_time + defrag_splay] hours. This is the preferred method for medium and big clusters.
- NONE – when none of the above options is selected, one must periodically run data-sort via the provided API – for example, with elliptics one can use the dnet_ioclient -r host:port:family -d start command to run defrag on the node given by the host:port:family tuple. This is the preferred method for very big clusters that require some external synchronization based on e.g. replica location, current time or node load.
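The SCHEDULED method boils down to every node independently picking a start hour inside the splay window; roughly (a sketch of the idea, not the elliptics implementation):

```python
import random

def pick_defrag_hour(defrag_time, defrag_splay):
    # each node selects a random hour in
    # [defrag_time - defrag_splay, defrag_time + defrag_splay]
    hour = random.randint(defrag_time - defrag_splay,
                          defrag_time + defrag_splay)
    return hour % 24   # wrap around midnight

# with defrag_time=3 and defrag_splay=2, nodes start between 01:00 and 05:00
print(sorted({pick_defrag_hour(3, 2) for _ in range(1000)}))  # [1, 2, 3, 4, 5]
```

Because every node rolls its own start hour, the heavy data-sort IO is spread across the cluster instead of hitting all replicas at once.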
NB! Failure to run data-sort periodically will lead to excessive memory usage, wasted HDD space and slow iteration.
An interesting USENIX paper about Eiger – a new consistency system for distributed storage.
The short gist of it is quite simple: the client tracks what it has seen, so operations must obey/fix dependencies on those objects.
Each datacenter maintains a whole replica of the data; data within a datacenter can not be lost and its updates are always consistent. This is achieved by running Paxos within the datacenter.
Eiger is based on Spanner – much hyped Google distributed storage with atomic clocks, GPS and other such cool stuff.
Because of that Eiger has so-called logical clocks – timestamps unique across all datacenters; this is achieved via the aforementioned atomic clocks and GPS. Given those unique IDs, servers order operations and the client can track dependencies.
Eiger is the next step up from a simple key-value storage: it supports columns and read-only/write-only transactions. Transactions are based on dependencies.
Write operations are replicated between datacenters; this is done by the server which received the data from the client. Replication just sends the data to other servers in different datacenters, which compare the unique timestamps, and if the timestamp is older than the one in the replica, the update is discarded – the last writer wins.
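The last-writer-wins rule can be sketched with a dictionary standing in for a replica (illustrative only):

```python
def replicate(replica, key, value, timestamp):
    """Apply a remote update, keeping the record with the newest timestamp."""
    current = replica.get(key)
    if current is None or current[1] < timestamp:
        replica[key] = (value, timestamp)   # newer write wins
    # otherwise the incoming update carries an older timestamp and is discarded
    return replica[key]

replica = {}
replicate(replica, "photo", "v1", timestamp=100)
replicate(replica, "photo", "v0", timestamp=50)   # stale update, dropped
print(replica["photo"])   # ('v1', 100)
```

Note this only works because the timestamps are globally unique: with ties, two replicas could disagree forever.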
I did not really read how transactions that span multiple datacenters are implemented – real-life applications do not have atomic clocks and GPS to implement distributed unique timestamps, and thus will not be able to work with such a system. In real life we either have to deal with eventual consistency or give up scaling to web sizes.
Elliptics has an eventual consistency model, albeit with the ability to read the latest updated data among multiple replicas, and that’s so far the only way to implement web-scale volumes (our largest cluster hosts 36+ billion records – about 2 Instagrams).