ioremap.net

Storage and beyond

reverbrain.com is up and running

Reverbrain is a company we started to provide support and solutions for distributed storage and realtime computation problems.

We created a whole stack of technologies ranging from the lowest level of data storage up to a high-level processing pipeline.

Time moves on, and we see how complex it is to deal with massively increasing amounts of data. We provide solutions for true horizontal scaling with reasonable system complexity. Our storage appliances cover the range from hosting billions of small-to-medium objects up to huge data-streaming systems.

Your data should not just sit archived in a distributed system: realtime pipeline processing is our vision of how data has to be handled. We use these technologies ourselves and want to provide the best experience for our customers.

Elliptics 2.24 reached end-of-life

2.24 was an LTS release, and instead of half a year we supported it for more than a year.
It has now reached its end of life.

We will answer questions about 2.24 and help with the old version for a little while, but I strongly recommend upgrading to 2.25.


RIFT is now fully API backed: buckets, acl, bucket directories, listing and so on

This day has come: all the features of RIFT, the elliptics HTTP frontend (those named in the title and more), are now accessible via REST APIs.

It required URL format changes, and the new URLs are much closer to S3 and REST conventions in general:
http://reverbrain.com/get/example.txt?bucket=testns
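
For illustration, here is a minimal Python sketch of fetching an object through this URL scheme. It assumes the requests library; the host, bucket and key are just the placeholder values from the example above:

# Minimal sketch: fetch an object from a RIFT proxy using the
# bucket-based URL scheme shown above (host/bucket/key are the
# placeholder values from the example).
import requests

def get_object(host, key, bucket):
    r = requests.get("%s/get/%s" % (host, key), params={"bucket": bucket})
    r.raise_for_status()
    return r.content

data = get_object("http://reverbrain.com", "example.txt", bucket="testns")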

The cornerstone of RIFT is the bucket: a metadata entity which shows where your data lives (the group list) and how to access it (ACLs), and which also hosts a secondary index of the keys uploaded into that bucket (if configured to do so).
Now we also have the bucket directory: an entity which lists your buckets.

Buckets, directories, files and indexes – everything can be created, processed and deleted via REST API calls.
Basically, RIFT + elliptics allow you to create your own private cloud storage and put your data replicas into safe locations you like.

It is like having your own Amazon S3 in your pocket :)

Soon we will set up a test cloud at reverbrain.com where everyone can check out our technologies before digging deeper: you will be able to create (limited) buckets and upload/download data, which will be stored in Germany and Russia for a limited period of time.

For more details about RIFT please check our documentation page: http://doc.reverbrain.com/rift:rift

Stay tuned!


Rift persistent caching

Rift allows you to store popular content in separate groups for caching. This is quite different from the elliptics cache, where data is stored in memory in segmented LRU lists. Persistent caching allows you to temporarily put your data into additional elliptics groups, which will then serve IO requests. This is usually very useful for heavy content like big images or audio/video files, which are rather expensive to put into a memory cache.

One can update the list of objects to be cached as well as the per-object list of additional groups. There is a cache.py tool with extensive help for working with cached keys. This tool grabs the requested keys from the source groups and puts them into the caching groups, and also updates a special elliptics cache list object which Rift checks periodically (the timeout option in the cache configuration block of Rift). As soon as Rift finds new keys in the elliptics cache list object, it starts serving IO from those cached groups as well as from the original groups specified in the bucket. When using the cache.py tool please note that its file-namespace option is actually a bucket name.

To remove an object from the cache one should use the same cache.py tool: it removes the data from the caching groups (note that physically removing objects from disk in elliptics may require running online eblob defragmentation) and updates the special elliptics cache list object. This object will be reread sometime in the future, so if a requested key cannot be found in the cache, it will be automatically served from the original bucket groups.
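
To make the read path concrete, here is a minimal Python sketch of the flow described above: try the caching groups first, then fall back to the original bucket groups. The helper names below are hypothetical illustrations, not the actual Rift internals:

# Hypothetical sketch of the Rift read path described above; the
# helper read_from_group() stands in for an actual elliptics read
# from a single group.

class KeyNotFoundError(Exception):
    pass

def read_from_group(group, key):
    # placeholder: a real implementation would read the key from the
    # given elliptics group and raise KeyNotFoundError on a miss
    raise KeyNotFoundError(key)

def read_key(key, cache_groups, bucket_groups):
    # caching groups are consulted first, then the original groups
    # from the bucket; the first successful read wins
    for group in cache_groups + bucket_groups:
        try:
            return read_from_group(group, key)
        except KeyNotFoundError:
            continue
    raise KeyNotFoundError(key)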


Elliptics monitoring spoiler alert

Here is the elliptics per-command monitoring:

[Image: cache/eblob command timing]

Monitoring includes very detailed statistics about the most interesting bits of the storage. The picture above shows the write command execution flow (the whole command and how time was spent within it); one can turn this on for one's own commands by setting a special trace bit.

Other monitoring parts include queues (network and processing), cache, eblob (command timings, data stats, counters), VFS and other stats.
More details on how to use this data in JSON format are coming.
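
As a rough illustration of the trace bit idea, here is a hypothetical Python sketch of enabling per-command tracing on a client session. The trace_id property, the address and the log level are assumptions about the bindings of that era and may differ in your version:

# Hypothetical sketch: tag every command sent by this session so the
# monitoring can time it. The trace_id property is an assumption
# about the elliptics Python bindings and may differ.
import elliptics

log = elliptics.Logger("/dev/stderr", 1)
node = elliptics.Node(log)
node.add_remote("localhost", 1025)

session = elliptics.Session(node)
session.groups = [1, 2]
session.trace_id = 12345  # assumed property enabling the trace bit

session.write_data("example.txt", "hello")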


Documentation updates

We always improve our documentation.
Now we have added a detailed page about elliptics configuration, which is available at http://doc.reverbrain.com/elliptics:configuration.
We have also added a description of the dnet_balancer tool, which balances DHT ring partitions within one group. It can be found at http://doc.reverbrain.com/elliptics:tools#dnet_balancer.


ZooKeeper Resilience at Pinterest

Or why ZooKeeper sucks at Pinterest.

Here are some quotations:

Too many transactions: When there’s a surge in ZooKeeper transactions, such as a large number of servers restarting in a short period and attempting to re-register themselves with ZooKeeper (a variant of the thundering herd problem). In this case, even if the number of connections isn’t too high, the spike in transactions could take down ZooKeeper.
Protocol bugs: Occasionally under high load, we’ve run into protocol bugs in ZooKeeper that result in data corruption. In this case, recovery usually involves taking down the cluster, bringing it back up from a clean slate and then restoring data from backup.
We ultimately realized that the fundamental problem here was not ZooKeeper itself. Like any service or component in our stack, ZooKeeper can fail. The problem was our complete reliance on it for overall functioning of our site. In essence, ZooKeeper was a Single Point of Failure (SPoF) in our stack.

Pinterest, basically, added a local file cache of the data stored in ZooKeeper and wrote a daemon that switches from ZK to those files if something suspicious is found in the ZooKeeper logs or activity.
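
That pattern is easy to sketch. Below is a minimal, hypothetical Python illustration of such a fallback using the kazoo client; the hosts, paths and blanket exception handling are simplifications, not Pinterest's actual daemon:

# Minimal, hypothetical sketch of the fallback pattern described
# above: prefer live ZooKeeper data, fall back to a local file copy
# when ZooKeeper misbehaves.
import json

from kazoo.client import KazooClient

LOCAL_CACHE = "/var/cache/zk/serverset.json"

def read_serverset(path="/serverset/web"):
    try:
        zk = KazooClient(hosts="127.0.0.1:2181")
        zk.start(timeout=5)
        data, _stat = zk.get(path)
        zk.stop()
        # refresh the local copy while ZooKeeper is healthy
        with open(LOCAL_CACHE, "w") as f:
            f.write(data.decode())
        return json.loads(data.decode())
    except Exception:
        # ZooKeeper is down or suspicious: serve the last known copy
        with open(LOCAL_CACHE) as f:
            return json.load(f)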


An analysis of Facebook photo caching

Rift was created to fill the edge/origin gap in the elliptics stack.

Rift and its caching+bucket layer allow you not only to have multiple levels of caches but also to fine-tune caching behaviour (like having multiple caches for certain content).

Article: An analysis of Facebook photo caching


Elliptics on Wikipedia

https://en.wikipedia.org/wiki/Elliptics

Elliptics HTTP frontend RIFT got full per-bucket ACL support

The second most wanted feature in RIFT, the elliptics HTTP frontend, is ACL support.

Rift already provides S3-like buckets: namespace metadata which allows storing data in separate per-bucket unique groups, fetching the whole list of objects stored in a given bucket, and of course using the same object names in different namespaces. Rift also allows you to have a set of objects cached in special groups; each ‘cached’ object may have its own set of groups to check first. This option can be used to cache commonly used objects in additional groups like temporary in-memory or SSD groups.

And now I’ve added ACL support to buckets. As stated above, an ACL is an access control list where each username is associated with a secure token, used to check the Authorization header, and with auth flags which allow bypassing some or all auth checks.

The admin must set up the per-bucket ACL using the rift_bucket_ctl tool, where the --acl option can be given multiple times. This control tool uses the following format: user:secure-token:flags

user is a username provided both in the URI (&user=XXX) and in the ACL (--acl XXX:token:0). token is used to check the Authorization header.

Here is the whole state machine of Rift's authentication checker, run once the bucket has been found and successfully read and parsed by the server (a Python sketch of the same logic follows the list):

  1. if the group list is empty, a not-found error is returned
  2. if the ACL is empty, ok is returned - there is nothing to check against
  3. if no user= URI parameter is found, a forbidden error is returned - one must provide a username if an ACL is configured
  4. the user is searched for in the ACL; if no match is found, a forbidden error is returned
  5. if flags has bit 1 (counting from zero) set, the security check is bypassed for the given user - ok is returned
  6. the Authorization header is searched for; if there is no such header, a bad-request error is returned
  7. the security data in the Authorization header is checked using the secure token found in the ACL entry; if the auth data mismatches, forbidden is returned
  8. ok is returned
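
Here is that state machine as a small Python sketch. It illustrates the verdict logic above and is not the actual Rift code; in particular, the plain token comparison stands in for the real Authorization header verification, whose exact scheme is not described in this post:

# Illustrative sketch of the verdict state machine above; not the
# actual Rift sources. acl maps username -> (secure token, flags).

NOT_FOUND, OK, FORBIDDEN, BAD_REQUEST = 404, 200, 403, 400

ACL_NOAUTH_ALL = 1 << 1  # bit 1: bypass every auth check

def verdict(group_list, acl, user, auth_header):
    if not group_list:
        return NOT_FOUND      # 1. empty group list
    if not acl:
        return OK             # 2. nothing to check against
    if user is None:
        return FORBIDDEN      # 3. user= parameter is required
    entry = acl.get(user)
    if entry is None:
        return FORBIDDEN      # 4. unknown user
    token, flags = entry
    if flags & ACL_NOAUTH_ALL:
        return OK             # 5. security check bypassed
    if auth_header is None:
        return BAD_REQUEST    # 6. no Authorization header
    if auth_header != token:  # stand-in for the real auth check
        return FORBIDDEN      # 7. auth data mismatch
    return OK                 # 8.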

The result of this check can be found in the log with the verdict: prefix at the ERROR/INFO (1/2) log levels and higher.

But even if the state machine returned a non-ok verdict, the operation may still be processed. This happens when the per-bucket ACL flags allow not all requests (bit 1) but only read requests (bit 0). In this case /get, /list, /download-info and other read-only handlers check the ACL flag bits and optionally rewrite the verdict. You will find something like this in the logs:

[NOTICE] get-base: checked: url: /get?name=test.txt, original-verdict: 400, passed-no-auth-check

That's it.
Check out more info about Rift - our full-featured elliptics HTTP frontend: http://doc.reverbrain.com/rift:rift


Listing of the keys uploaded into elliptics

I often get requests on how to get a list of keys written into elliptics. I do not really understand why this is needed, especially considering storage setups where billions of keys have been uploaded, but still, this is one of the most frequently asked questions.

Elliptics has secondary indexes for that purpose. Indexes are automatically sharded and evenly distributed across the nodes in the group.

One can tag one’s uploaded keys with special indexes and then intersect those indexes on the servers or read the whole index key-by-key. That’s essentially what RIFT, the elliptics HTTP frontend, does when you upload a file through its HTTP interface.
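
For illustration, tagging a key and then finding everything carrying that tag looks roughly like the Python sketch below. The set_indexes and find_all_indexes calls are my assumption about the bindings of that era and may differ in your version:

# Hypothetical sketch: attach a secondary index to an uploaded key
# and later enumerate the keys carrying it. The set_indexes and
# find_all_indexes calls are assumptions about the bindings.
import elliptics

log = elliptics.Logger("/dev/stderr", 1)
node = elliptics.Node(log)
node.add_remote("localhost", 1025)

session = elliptics.Session(node)
session.groups = [1]

session.write_data("test1", "some data")
session.set_indexes("test1", ["testns"], ["optional index payload"])

# later: read/intersect the index to list the tagged keys
for entry in session.find_all_indexes(["testns"]):
    print(entry)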

And I’ve added listing support into the RIFT proxy via the /list URI: it reads an index from the server, iterates over the keys and produces a nice JSON output. It also prints the timestamp of the key’s update in the index, both in seconds and in the current timezone.

The URI accepts namespace, the bucket name to get indexes from, and name, a placeholder for future index names (in case we support multiple indexes).

$ curl "http://example.com/list?namespace=testns&name="

{
    "indexes": [
        {
            "id": "4e040aa8a798d04d56548d4917460f5759434fdf3ed948fd1cf35fd314cad3290e69b80deb0fc9b87a6bfbcbd08583919eb5b966658b3ed65e127236e1632525",
            "key": "test1",
            "timestamp": "1970-01-01 03:00:00.0",
            "time_seconds": "0"
        },
        {
            "id": "e5b7143155f46c9e9023cbf5e04be7276ae2e9a7583fee655c32aaff39755fa213468217291f0e08428a787bf282b416be1d26a5211f244fc66d1ce8ce545382",
            "key": "test7",
            "timestamp": "2014-02-18 03:29:44.835283",
            "time_seconds": "1392679784"
        }
    ]
}

The zero timestamp belongs to older indexes, created before timestamps were supported. key is the object name given at upload time, id is the numeric elliptics ID (one can read those objects directly from elliptics without the namespace name), time_seconds is a coarse-grained time in seconds since the Epoch, and timestamp is the fully parsed timestamp with microsecond resolution.

There is also an example python script which does basically the same: it reads an index, unpacks it and prints it to the console: https://github.com/reverbrain/rift/blob/elliptics-2.25/example/listing.py

