Featured post

reverbrain.com is up and running

Reverbrain is a company we started to provide support and solutions for distributed storage and realtime computation problems.

We created a whole stack of technologies, ranging from the lowest level of data storage up to a high-level processing pipeline.

Time moves on, and we see how complex it is to deal with massively increasing amounts of data. We provide solutions for true horizontal scaling with fair system complexity. Our storage appliances cover the range from hosting billions of small-to-medium objects up to huge data streaming systems.

Your data should not just lie archived in a distributed system — realtime pipeline processing is our vision of how data has to be handled. We use these technologies ourselves and want to provide the best experience for our customers.

Adaptive MPEG-DASH streaming and arbitrary channel muxing

Ever wanted to compose your own content stream without reencoding video files? Like taking the video stream from this file and the audio from those files, then adding another video, and so on and so on?

I have created a simple service to highlight one of our technologies, which creates an adaptive MPEG-DASH stream (the same kind of stream YouTube returns) and mixes many streams from different files stored in Elliptics.

[screenshot: stream1]

In the example above I play a neural-network course video with some of my saxophone music as the audio track.

The DASH stream is created on demand from different sources, so if you want to add a new sound track or link multiple videos one after another, there is no need to reencode the video content each time. The service I created is rather simple and does not use every feature of the technology. In particular, playlist protection is not used in the service, i.e. you can share a playlist with others, and by initializing a DASH player with our URL you will be able to play it on your own site. We also do not turn on stream insertion, i.e. playing the main video and inserting an additional video stream (or part of it) at some time offset. We have both of these features in the server, but there are no interface controls for them in the service yet.
As a side note, this service can be used for gapless audio/video playback, i.e. without the player reinitialization between tracks that you see on popular video streaming sites.

So far we only support MPEG-DASH streaming, and while almost all modern browsers support it (Chrome, Firefox, IE and Opera), Safari is not on that list. We do not yet fully support HLS, and although Apple announced at WWDC2016 that they will move away from mpeg2ts streaming (probably becoming more compatible with DASH), we still plan to implement HLS streaming too.

So, if you are interested in playing with the technology, you are welcome to try our playground service: http://stream.reverbrain.com
Do not be afraid of the mad design: just upload files (they will be tagged automatically) and create playlists!

Audio/Video transcoding server

I’m pleased to announce Nullx – our standalone audio/video/subtitles transcoding server. It accepts files to be transcoded via HTTP and either returns the transcoded files in the reply or optionally uploads the file (and metadata) into Elliptics storage.

So far it is a quite simple service: it does not change stream parameters like height/width or bitrate, but instead transcodes streams into h264/aac/mov_text format, which is suitable for HLS adaptive streaming. Since we plan to add realtime downscaling of the data for adaptive streaming, this service will be extended with per-request HTTP controls telling exactly how a given stream should be transcoded; so far I believe only the h264 profile and height/width for video and the bitrate for audio streams are needed.
That will be our next release.
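
To give a feel for the request shape, here is a hedged sketch in Go of a client pushing a file to a transcoding HTTP service like Nullx and saving the reply. The endpoint URL, port and content type are assumptions for illustration, not Nullx’s documented interface.

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Open the source media file and stream it as the POST body.
	f, err := os.Open("input.mkv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The /transcode endpoint is a placeholder; consult the Nullx docs for the real URL.
	resp, err := http.Post("http://localhost:8080/transcode", "application/octet-stream", f)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The transcoded file is expected in the response body.
	out, err := os.Create("output.mp4")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}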

Nullx – the transcoding server – is used in our broadcast service, which groups audio/video streams uploaded into Elliptics storage, muxes them together with different options (order, timings, splitting at various time positions, sound from a different source and so on) and adaptively streams the resulting data using the MPEG-DASH/HLS protocols (natively supported by Chrome/IE/Firefox and Safari on desktop and mobile).

Ever thought of working with ffmpeg?

Transcoding an audio track into mp4 (aac) takes just about 30kb of hardcore C code.

That’s a small highlight of our progress on the streaming service: we are building a system which accepts user content and lets you create long-lived broadcasts containing many tracks in one stream.

For example, your stream may start with the file ‘introduction’, then ‘adv1’, then part of the file ‘main content’, then ‘adv2’, ‘main content’, ‘final part’ and so on.
The only thing you need to do is upload your audio/video tracks to our service and create your stream using our interface. If you prefer, you can set up a different audio track for your stream.
We will use adaptive HLS/DASH streaming for different client bandwidths.

We will not concatenate your video files together; instead we use real-time stream composition on our streaming servers, which build the video content directly out of the files you uploaded.

Here is an initial presentation (MPEG-DASH) which muxes 2 video streams (5 seconds each, sample after sample) and 2 completely different audio streams: http://video.reverbrain.com/index.html

Elliptics as Video-On-Demand storage

We have moved the video archive of our friends at FC Dynamo to Elliptics storage and, in parallel, upgraded video streaming from Flash to HTML5.

Besides scaling well, this solution is much faster, mostly because of Flash vs HTML5, but also because of the Elliptics underlying storage vs a plain filesystem.

The architecture of the project is rather simple: a .NET frontend creates a redirect link to one of the storage servers located in Germany and the Netherlands, and the server streams data directly to the client’s browser. This solution does not use adaptive HLS/DASH streaming, just plain on-demand or progressive download.

How harmful is eventually consistent model at large scale? (spoiler: it isn’t)

Eventually consistent updates are considered a bad design choice, but sometimes it is not possible to live without updates: one has to overwrite data and cannot always write using new keys.

How harmful could this be at large scale? At Facebook’s scale.
The paper “Measuring and Understanding Consistency at Facebook” measures consistency anomalies in the Facebook TAO storage system, i.e. cases when results returned by the eventually consistent TAO differ from what is allowed by stronger consistency models.

Facebook TAO has quite sophisticated levels of caches and storage; a common update consists of 7 steps, each of which may end up temporarily inconsistent.
[diagram: Facebook TAO]

And it turns out that an eventually consistent system, even at Facebook scale, is quite harmless: violations sit somewhere at the noise level, like 5 out of a million requests violating linearizability, i.e. you overwrite data with new content but read back the older value.

You may also check the shorter summary post describing how Facebook TAO works, how they measured consistency errors (by building an update graph for each request from a random selection out of billions of Facebook updates and proving there are no cycles) and the final results.

http://muratbuffalo.blogspot.ru/2016/03/paper-summary-measuring-and.html
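
Just to illustrate the idea of that check, here is a toy sketch in Go which models the observed operations on a key as a directed graph and looks for cycles; this is only the concept from the paper, not Facebook’s actual analysis pipeline.

package sketch

// hasCycle reports whether the ordering constraints in the observed-update
// graph contradict each other, i.e. whether the graph contains a cycle.
func hasCycle(graph map[string][]string) bool {
	const (
		unvisited = 0
		inStack   = 1
		done      = 2
	)
	state := make(map[string]int)

	var visit func(node string) bool
	visit = func(node string) bool {
		state[node] = inStack
		for _, next := range graph[node] {
			switch state[next] {
			case inStack:
				return true // back edge: constraints form a loop
			case unvisited:
				if visit(next) {
					return true
				}
			}
		}
		state[node] = done
		return false
	}

	for node := range graph {
		if state[node] == unvisited && visit(node) {
			return true
		}
	}
	return false
}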

HLS and DASH adaptive streaming and muxing from the same files from Elliptics

Our streaming engine Nulla is now capable of streaming in both HLS and MPEG-DASH formats from Elliptics storage. We do not require uploading multiple files (in mp4 and mp2ts container formats) or specially repacking frames in the media files (that’s what mp4box is used for) for streaming.

The Elliptics streaming service Nulla creates a DASH/HLS stream in realtime from the data file you have provided: it builds either an mp4 or mp2ts container on demand based on the streaming offset the client requests. This means clients are not forced to upload files in multiple formats (mp2ts for HLS, mp4 for MPEG-DASH) or to repack existing files to meet streaming demands (fragmenting the stream and putting an indexing box in front of the data).

Since we build the stream from data frames and create the container in realtime, we can adjust presentation and decode times and build an audio/video muxer. This allows us, for example, to stream one file and put another one into the middle, or to split a file into chunks each of which is followed by a different file: 5 seconds from file X, 10 seconds from Y, 15 seconds from X starting at a 5-second offset, then 10 seconds from file Z, while the audio track has its own muxing, and so on and so on.

This is all controlled via a simple JSON API and guarded against embedding into hostile sites via random URLs with a limited lifetime.

We built a simple example page which shows this kind of muxing and streaming.
Depending on your browser, our servers will stream either HLS (desktop Safari and iOS) or MPEG-DASH (tested in current stable versions of Chrome, Firefox and IE) from 2 video and 2 audio files uploaded quite a while ago.

Enjoy!

http://video.reverbrain.com/index.html

The source code of this simple HTML page shows how simple and convenient our API is.

The next task is to build a cache for preprocessed chunks and distribute it among multiple geographically distributed Elliptics nodes in your cluster. We also plan to add automatic transcoding of the video stream into smaller bitrates which will be automatically selected by the browser (that’s what makes HLS/DASH streaming adaptive). Currently you have to upload files in multiple bitrates and use them in the API to create the appropriate playlist; this task can be automated and will be implemented in the next release.

Another major goal is to implement live broadcasting from an application (or browser) into Elliptics, which will distribute your broadcast via HLS/DASH to thousands of simultaneous viewers.

Adaptive MPEG-DASH streaming and multiple stream muxing in Elliptics

We built the Elliptics distributed storage quite a long time ago; it is a mature technology which just works when you have to store medium and large objects in geographically distributed locations.

It also supports progressive download: FLV and byte-range (tutorial, mp4) streaming, but we wanted more – native adaptive streaming with channel muxing from Elliptics.

Basically, we wanted to upload multiple mp4 files into Elliptics and then stream them to the client in the required order: 10 seconds of the first file, then 20 seconds of the second, then 10 minutes of the first starting from the 10th second, while another track (like sound) runs in the background and is mixed in its own way. And preferably with adaptive bitrate switching when the client moves between slower and faster networks, like 3G to wifi and vice versa.

Another example of this muxing feature is gapless music playback – you listen to songs one after another without delays in between, without player reinitialization, and without waiting for the next song (or part of it) to be downloaded when the previous one has finished.

There are many technologies which implement streaming: HLS, HDS, MPEG-DASH, RTMP, RTSP and others. It looks like HLS is the winner for now: it is backed by Apple and supported by iOS and Safari, but MPEG-DASH is backed by a large group of vendors and supported by all other browsers, including TVs, except iOS. Since Flash is going to die now that youtube.com, netflix.com and other large vendors have stopped streaming in that format, I believe MPEG-DASH will become more and more popular (its market share to date is rather small) and eventually HLS and MPEG-DASH will be the only default streaming protocols.

There are a fair number of streaming services which support both HLS and MPEG-DASH, but most of the time they require quite a lot of effort to create a fault-tolerant streaming service working with distributed storage. And none of them supports the audio/video muxing described above. Actually, the streaming technology itself partially supports this feature: for example, in MPEG-DASH there is a notion of a “Period”, and multiple periods are played one after another. But this is quite an advanced feature which is not yet supported by the reference player Dash.js (there are commercially available players which somewhat support it, though). There are also implementation questions about whether a player can reinitialize itself to play multiple periods without stream interruption.

We decided to implement MPEG-DASH in our Elliptics streaming service to support both adaptive streaming and stream muxing. To do this we create the whole mpeg container at runtime and only read sample data from audio/video files stored in Elliptics. To allow muxing, though, all files in the stream must be encoded the same way.

Using our technology one can implement 5-second muxing (5 seconds of the first video, then 5 seconds of the second, then the next 5 seconds of the first and so on), as in the example below, using the following control JSON:

"tracks": [
{
  "bucket": "b1",
  "key": "video_1.mp4",
  "duration": 5000
},
{
  "bucket": "b1",
  "key": "video_2.mp4",
  "duration": 5000
},
{
  "bucket": "b1",
  "key": "video_1.mp4",
  "start": 5000,
  "duration": 5000
},
{
  "bucket": "b1",
  "key": "video_2.mp4",
  "start": 5000,
  "duration": 5000
}
]

But enough words, show me the result.

Here it is: muxing 2 video and 2 sound channels in the way described above, without interruptions or gaps. All 4 files are stored in Elliptics as ordinary objects.

http://video.reverbrain.com/index.html

Please note that the Elliptics and streaming servers are located in the USA, which adds 150+ ms to get the first chunk from Russia (or whenever you seek into an area which isn’t yet in the cache); otherwise it is very fast.

You can check the source of the HTML page above to see how the muxing is set up; you can play with different settings and watch the whole files or mix them in other ways.

Enjoy, and stay tuned!

Why will rotating disks matter for the next 10-20 years?

The short answer is SPACE. Not cosmic space, but disk space.

Although rotating disk speeds are not going to increase, there are technologies coming which will drastically improve the density of the platters.

Samsung highlights some of them, from the ones already in testing, like shingled magnetic recording and heat-assisted recording, up to futuristic heat-dot magnetic recording.

[image: ASTC technology roadmap]

More details: http://www.anandtech.com/show/9858/seagate-hard-disk-drives-set-to-stay-relevant-for-20-years

Range headers and native HTML5 streaming support in Backrunner

Plain HTML5 streaming requires the server to properly handle the RFC 2616 Range header; in particular, video seek/rewind uses the Range header to specify the exact position within the file to start streaming from.

The HLS protocol is a bit different, but again the player uses the Range header to specify a position within the timeframe.

Previously the Elliptics HTTP server Backrunner used size/offset URI parameters for this purpose, which are not ajax-friendly and obviously not supported by standard players.

With this new Backrunner update we add support for the Range and If-Modified-Since headers.
The former allows HTML5 players to work with the Elliptics HTTP proxy out of the box; If-Modified-Since is quite useful for client-side caching.
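
For illustration, here is a minimal sketch in Go (not Backrunner’s actual code) of what proper Range/If-Modified-Since handling buys you: Go’s standard http.ServeContent implements both headers, so an HTML5 video element can seek into the file and browsers can revalidate their cache. The file path is a placeholder.

package main

import (
	"log"
	"net/http"
	"os"
)

func main() {
	http.HandleFunc("/video.mp4", func(w http.ResponseWriter, r *http.Request) {
		// Placeholder path; in a proxy this would come from the storage backend.
		f, err := os.Open("/srv/media/video.mp4")
		if err != nil {
			http.NotFound(w, r)
			return
		}
		defer f.Close()

		fi, err := f.Stat()
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		// ServeContent answers Range requests with 206 Partial Content and
		// If-Modified-Since with 304 Not Modified, based on the modtime.
		http.ServeContent(w, r, "video.mp4", fi.ModTime(), f)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}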

Here is a simple example of our video-on-demand service.

Our future plans include realtime HLS generation and transcoding for live broadcasts built on top of Elliptics and related technologies.

Facebook time series storage

Time series databases are quite a niche product, but they are extremely useful in monitoring. I already wrote about the basic design of a timeseries database, but things are more tricky.

Having something like HBase as a TS database is a good choice – it has small overhead, it is distributed, all data is sorted, HBase is designed for small records, it packs its sorted tables, it supports Hadoop (or vice versa if that matters) – there are many reasons why it is a great TS database. Until you are Facebook.

At their scale HBase is not capable of handling the monitoring and statistics read and write load. So Facebook created Gorilla – a fast, scalable, in-memory time series database.

Basically, it is a tricky cache on top of HBase, but it is not just a cache. Gorilla uses a very intelligent algorithm to compress 64-bit monitoring data points by a factor of 12.

This allows Facebook to store Gorilla’s data in memory, reducing query latency by 73x and improving query throughput by 14x when compared to a traditional database (HBase)-backed time series store. This performance improvement has unlocked new monitoring and debugging tools, such as time series correlation search and denser visualization tools. Gorilla also gracefully handles failures from a single node up to entire regions with little to no operational overhead.

The design of the fault-tolerance part is rather straightforward, and Gorilla doesn’t care about consistency; what’s more, there is no recovery of missing monitoring data. But after all, Gorilla is a cache in front of a persistent TS database like HBase.

Gorilla uses a sharding mechanism to deal with write load. Shards are stored in memory and on disk in GlusterFS. Facebook uses its own Paxos-based ShardManager software to store shard-to-host mapping information. If some shards have failed, a read may return partial results – the client knows how to deal with this; in particular, it will automatically try to read the missing data from other replicas.

I personally love Gorilla for its XOR compression algorithm.

[image: Facebook Gorilla compression scheme]

It is based on the idea that subsequent monitoring events generally do not differ in most bits – for example, CPU usage doesn’t jump from zero to 100% in a moment – and thus XORing two monitoring events yields a lot of zeroes which can be replaced with a small meta tag. Impressive.
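
Here is a small sketch in Go of that idea, counting how many bits a naive XOR encoder would emit for a flat series. The real Gorilla format also reuses the previous leading/trailing-zero window and packs everything into a tight bit stream, so treat this only as an illustration of the principle, not the actual encoder.

package main

import (
	"fmt"
	"math"
	"math/bits"
)

// xorEncode XORs each value with the previous one: identical values cost a
// single control bit, small changes cost only the meaningful middle bits.
func xorEncode(values []float64) (totalBits int) {
	var prev uint64
	for i, v := range values {
		cur := math.Float64bits(v)
		if i == 0 {
			totalBits += 64 // the first value is stored verbatim
			prev = cur
			continue
		}
		x := cur ^ prev
		prev = cur
		if x == 0 {
			totalBits++ // control bit '0': value repeated
			continue
		}
		lead := bits.LeadingZeros64(x)
		trail := bits.TrailingZeros64(x)
		meaningful := 64 - lead - trail
		// control bit '1' + 5 bits of leading-zero count +
		// 6 bits of block length + the meaningful bits themselves
		totalBits += 1 + 5 + 6 + meaningful
	}
	return totalBits
}

func main() {
	series := []float64{12.0, 12.0, 12.0, 12.5, 12.5, 13.0}
	fmt.Printf("naive size: %d bits, xor-encoded: %d bits\n",
		64*len(series), xorEncode(series))
}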

Gorilla article is a must-read for monitoring developers: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf

Greylock tutorial – distributed base search engine based on Elliptics

We’ve heavily updated the Reverbrain documentation pages at doc.reverbrain.com, and I’m pleased to present our distributed base search engine Greylock. The documentation includes generic information about the search engine and a tutorial covering the installation process, configs and two types of clients: a plain HTTP API (similar to what is expected from a base search engine like Greylock or ElasticSearch) and a Python client (which works via HTTP too, but also uses Consul to acquire mailbox locks). If you need a C++ tutorial, you can check the greylock test suite, which includes insertion/selection/removal as well as various iterators over data, self-recovery tests, statistics and other interesting bits.

I get a fair number of questions about how Greylock differs from, for instance, ElasticSearch, Solr or Amazon Cloud Search. They all have an enormous number of features and work with large amounts of data, so what’s the point?

And the answer is just two words: scalability and automation.
If you have worked with Elastic, you know which operations are needed to reshard a cluster when the current sharding scheme becomes a bottleneck (how long it takes, what the performance penalty is and how dangerous the process is). When you work in an environment where new documents always arrive and space consumption grows over time, this resharding process has to be started again and again as new servers are added. At some point this becomes a serious issue.

With Greylock this is not needed at all. Thanks to the Elliptics bucket system there is virtually no data or index movement when new servers are added. This design has proved to work really well in Elliptics storage installations, where upload rates reach tens of terabytes daily, and that’s only our clients’ data; there are other, seriously larger installations, for example at Yandex.

We concentrated on the scalability problem and solved it. And yet we do have a set of features. It is not comparable with Elastic, of course, even before counting the NLP tasks which we will release later (language models and spelling correction, for instance, for any language where you can find a rather large corpus). Greylock supports a basic relevance model based on the word distance between the client request and the words in the document.

Probably the two worst issues are the absence of numerical indexes and of client locks. Both choices were made deliberately. Numerical indexes break pagination, which in turn means that if you want 100 documents out of a million, you have to read them all into RAM, re-sort them either into numeric order or into lexical order (that’s how document ids are stored in the inverted indexes), intersect the whole million keys and return the first 100. For any subsequent request this has to be done again and again. Without numerics, pagination works with iterators pointing into the inverted indexes only; the whole index (and its millions of entries) is never read, only some pages are accessed sequentially.

To help with numerics Greylock supports document timestamps, i.e. a single 64-bit number per document ID which is used in the inverted index sorting order. Of course this is not a replacement for a fair numeric index, but it does cover almost all of our use cases.

The second major issue is consistency and client locking. Neither Greylock nor Elliptics is a strictly consistent storage. With Greylock things are even worse – the amount of data overwritten by a single client insert can be large, and the index structure (which originally started as a distributed B+/*-tree) does not tolerate broken pages. Elastic and others implement a consistency model (like Raft, Paxos or ZAB) internally. Greylock doesn’t. That’s why we require clients to acquire locks in some other system like Consul, Etcd or ZooKeeper to work properly. Our tutorial shows a basic locking scheme implemented using the strictly consistent Consul key-value storage.
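
For illustration, here is a minimal sketch in Go of the locking idea using the Consul API client; the key name is made up, and the tutorial’s Python client does the equivalent via Consul sessions over HTTP.

package main

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// The lock key name is arbitrary for this sketch.
	lock, err := client.LockKey("greylock/index.lock")
	if err != nil {
		log.Fatal(err)
	}

	// Blocks until the lock is acquired; the returned channel is closed if the
	// lock is lost (e.g. session expiry), so a real client should watch it and
	// abort its index update.
	lost, err := lock.Lock(nil)
	if err != nil {
		log.Fatal(err)
	}
	defer lock.Unlock()

	select {
	case <-lost:
		log.Println("lock lost before we even started")
	default:
		log.Println("lock held: safe to send index updates to Greylock over HTTP")
		// ... perform the Greylock HTTP insert here ...
	}
}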

We have major plans for the Greylock distributed search engine, so expect new features, and give it a try: http://doc.reverbrain.com/greylock:greylock
If you have any questions, you are welcome to ask:
Google group: https://groups.google.com/forum/?fromgroups=#!forum/reverbrain,
this site and comments.

Server-side operations

The Elliptics distributed storage is built as a client-server architecture, and although servers may discover each other, exchange various statistics and forward requests, most of the time they serve clients’ requests.

Recovery in a distributed storage is a tricky operation which requires serious thinking about which keys have to be copied to which destinations. In Elliptics, recovery is just another client process which iterates over remote nodes, reads the data to be recovered and updates the needed keys.

But there are cases when this round trip through the client is useless: for example, when you need to create a missing replica, or when you have a set of keys you want to copy or move to a new destination.

Thus I introduced two server-side operations which allow sending content from one server to multiple replicas. They are intended for various recovery tools, which can optimize by not copying data from the local node to a temporary recovery location: instead they may tell the remote node to send data directly to the required locations. They can also be used to move data from one low-level backend (for example eblob) to a newer version or a different backend without server interruption.

There is now a new iterator type which sends all iterated keys to a set of remote groups. It runs at the speed of the network or the disk (whichever is slower); in local tests, iterating over 200Gb blobs and sending the data over the network to one remote node via write commands gave a sustained speed of ~78MB/s. There were spikes though, especially when the remote node synced its caches. Both sender and recipient had 30Gb of RAM. Rsync shows ~32MB/s on these machines, not because it is that slow, but because ssh maxed out the CPU with packet encryption.
The iterator sends a dnet_iterator_response structure with the write result for every key it has processed, just like the usual iterator, so neither API nor ABI is broken.

The second server-send command accepts a vector of keys. It finds all remote nodes/backends which host the given keys in the specified group, splits the keys on a per-node/backend basis and tells the remote backends to send the appropriate keys to the specified remote groups. The same iterator response is generated for every key which has been processed.

All operations are async and can run in background with other client requests being handled in parallel.

There are 3 operation modes (a rough sketch of the selection logic follows the list):
1. default – write data to the remote node using compare-and-swap, i.e. only write the data if it either doesn’t exist or is the same on the remote servers. Server sending (iterator or per-key) running in this mode is especially useful for recovery – there is no way it can overwrite a newer copy with old data.
2. overwrite – when a special flag is set, the data is overwritten (the compare-and-swap logic is cleared).
3. move – if the write has been successful, the local key is removed.
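
Here is that rough sketch in Go; all type and method names are hypothetical stand-ins for illustration and do not correspond to the real Elliptics API.

package sketch

import "errors"

type sendMode int

const (
	modeDefault   sendMode = iota // compare-and-swap write (recovery-safe)
	modeOverwrite                 // unconditional write
	modeMove                      // write remotely, then remove the local key
)

// remoteGroup is a hypothetical stand-in for a destination replica group.
type remoteGroup interface {
	WriteCAS(key string, data []byte) error // fails if a different, newer copy exists
	Write(key string, data []byte) error
}

// localBackend is a hypothetical stand-in for the local low-level backend (e.g. eblob).
type localBackend interface {
	Read(key string) ([]byte, error)
	Remove(key string) error
}

func sendKey(local localBackend, remote remoteGroup, key string, mode sendMode) error {
	data, err := local.Read(key)
	if err != nil {
		return err
	}
	switch mode {
	case modeDefault:
		// Only write if the key doesn't exist remotely or is identical,
		// so recovery can never overwrite a newer copy with old data.
		return remote.WriteCAS(key, data)
	case modeOverwrite:
		return remote.Write(key, data)
	case modeMove:
		if err := remote.Write(key, data); err != nil {
			return err
		}
		return local.Remove(key)
	}
	return errors.New("unknown mode")
}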

There is an example tool in the examples directory which iterates over a remote node and its backends and performs a copy/move of the iterated keys. The next step is to update our Backrunner HTTP proxy to use this new logic to automatically recover all buckets in the background.

Stay tuned!

Some years ago in the past life I was a kernel hacker

One of my initial kernel projects was Dallas 1-wire: https://www.kernel.org/doc/Documentation/w1/

I’m still a maintainer, although quite occasional.
It was quite a bit later that I began working with VFS and created Elliptics/PohmelFS, but before that there was a bit of hardware time.

In particular, I really liked soldering bits onto my Matrox VGA card, which had w1 pins available. It used the 1-wire bus for some internal control, but since this bus is addressable, one can add other devices and things will not explode.

Anyway, the w1 bus is quite actively used, and I created (at the same time as w1 itself, like 7-10 years ago?) a netlink interface over the kernel connector from userspace to the kernel w1 bus.
Using netlink from userspace allows searching for devices and performing all other actions with noticeably lower latencies than working with sysfs files.

Over time the userspace example code was lost and recovered and lost again and again, but apparently people do want to use it to monitor the w1 bus from userspace.

So I created a github page with that example code: https://github.com/bioothod/w1
That’s it.

BoltDB – local ACID database written in Golang

BoltDB is a local key/value storage with a simple API but powerful transactions and ACID support.

The main goal of the project is to provide a simple, fast, and reliable database for projects that don’t require a full database server such as Postgres or MySQL. It is used in high-load production with database sizes up to 1Tb.

1Tb is not that much actually, but since its main competitors are RocksDB/LevelDB and LMDB, it is quite a spectacular volume. The main differences between BoltDB and RocksDB/LevelDB are the on-disk structure and transaction support – RocksDB and LevelDB are built on top of LSM trees and thus are well suited for write load, while BoltDB uses a B+-tree, which generally scales better for read workloads. LevelDB doesn’t support transactions, while BoltDB does (and it has a very nice and convenient API).

The main difference between BoltDB and relational databases like Postgres or MySQL is, well, relations – BoltDB is a simple key/value store (although its API supports cursors, transactions and prefix key search).
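
Here is a minimal example of that transactional API; the file and bucket names are arbitrary.

package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("example.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Read-write transaction: committed atomically when the closure returns nil.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("settings"))
		if err != nil {
			return err
		}
		return b.Put([]byte("greeting"), []byte("hello"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read-only transaction.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("settings")).Get([]byte("greeting"))
		fmt.Printf("greeting = %s\n", v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}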

Worth considering for local storage: https://github.com/boltdb/bolt

backrunner: HTTPS support

We at Reverbrain.com develop the highly scalable distributed storage Elliptics for medium and large objects. In the web era the most frequently used API is HTTP, so we developed an HTTP proxy for Elliptics named Backrunner; it supports a wide range of options like ACLs, streaming, redirects, partial upload/download, static file downloading and many others.

But if you build a system behind HTTPS, you likely want to secure your work with the storage too; in particular, your CDN will likely require you to work via HTTPS in this case.

So we have updated Backrunner to support HTTPS. It can listen for unencrypted and secured connections simultaneously on different ports/addresses, and you have to provide certificate/private key files.
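
For illustration, here is a minimal sketch in Go (not Backrunner’s actual code) of serving plain HTTP and TLS simultaneously on different ports; the certificate and key paths are placeholders you would point at your own files.

package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("proxy is alive\n"))
	})

	// Unencrypted listener.
	go func() {
		log.Fatal(http.ListenAndServe(":8080", mux))
	}()

	// TLS listener on a different port, using the provided cert/key files.
	log.Fatal(http.ListenAndServeTLS(":8443", "server.crt", "server.key", mux))
}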

This rather small change allows you to deploy fully secured storage access for your frontend.

MySQL application level HA/loadbalancer

A master-master scheme doesn’t really work for relational databases, so master-slave replication is used almost always, and balancing the load in such a setup is a serious problem.

In particular, most of the time the most demanded setup is to direct all ‘writes’ to the master and balance all ‘reads’ among multiple slaves. This can be done in the application itself (after all, only the client knows whether it is updating data or reading), but sometimes this is not possible.

A plain and simple regexp analysis of the query is most of the time not enough, so the MariaDB guys wrote the MaxScale tool for this.
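
To show why the naive approach falls short, here is a toy sketch in Go of regexp-based read/write splitting; it misroutes things like SELECT ... FOR UPDATE, statements inside transactions or stored-procedure calls with side effects, and it is of course not how MaxScale actually works.

package main

import (
	"fmt"
	"regexp"
	"strings"
)

// readOnly matches statements that look like pure reads.
var readOnly = regexp.MustCompile(`(?i)^\s*(select|show|describe|explain)\b`)

func route(query string) string {
	if readOnly.MatchString(query) &&
		!strings.Contains(strings.ToLower(query), "for update") {
		return "slave"
	}
	return "master"
}

func main() {
	for _, q := range []string{
		"SELECT * FROM users WHERE id = 1",
		"UPDATE users SET name = 'x' WHERE id = 1",
		"SELECT balance FROM accounts WHERE id = 1 FOR UPDATE",
	} {
		fmt.Printf("%-55s -> %s\n", q, route(q))
	}
}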

Percona used it to solve client’s scalability problem and here is their report: https://www.percona.com/blog/2015/06/08/maxscale-a-new-tool-to-solve-your-mysql-scalability-problems/

Quite informative post about MaxScale tool.

Distributed database for small records on top of LevelDB/RocksDB

LevelDB (and actually RocksDB) is a very well known local storage for small records. It is implemented over log-structured merge trees and optimized for write performance.

LevelDB is quite horrible for records larger than 1Kb because of its merge operation – it quickly reaches a level where it constantly has to merge trees, and that takes seconds to complete.
But for small keys it is very useful and fast.

RebornDB is a proxy storage which operates on top of LevelDB/RocksDB and provides the Redis API to clients. Basically, it is Redis on top of on-disk leveldb storage.

There is an interesting sharding scheme: there are 1024 slots, each of which uses its own replica set that can be configured by the admin. When a client writes a key, it is hashed and one of the 1024 slots is selected (using a modulo (% 1024) operation). When the admin decides that some slots should be moved to a new/different machine, they use a command line tool to reconfigure the proxies. During migration the slots in question remain available for IO, although a request may span multiple servers since the proxy doesn’t yet know whether the required key has already been moved to the new destination or not.
Having many slots is a bit more flexible than old-school sharding by number of servers, although it is still quite far from automatic ID range generation – manual resharding doesn’t scale for admins.
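
Here is an illustrative sketch in Go of the slot selection. The hash function (CRC32) is an assumption for the sketch; the important part is the fixed slot space that gets remapped to replica sets, not the keys themselves.

package main

import (
	"fmt"
	"hash/crc32"
)

const numSlots = 1024

// slotForKey maps a key to one of the fixed slots.
func slotForKey(key string) int {
	return int(crc32.ChecksumIEEE([]byte(key)) % numSlots)
}

func main() {
	// slot -> replica set, reconfigured by the admin when slots migrate.
	slotOwners := map[int]string{} // sparse: unlisted slots belong to "replica-set-0"

	for _, key := range []string{"user:1001", "user:1002", "session:abc"} {
		slot := slotForKey(key)
		owner, ok := slotOwners[slot]
		if !ok {
			owner = "replica-set-0"
		}
		fmt.Printf("key %-12s -> slot %4d -> %s\n", key, slot, owner)
	}
}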

RebornDB uses zookeeper/etcd to store information about the slot/server mapping and per-slot replication policies. This doesn’t force every operation to contact zookeeper (which would actually kill that service); instead every proxy holds the above information locally, and every reconfiguration also updates the information on every proxy.

There is not that much information about data recovery (and migration), except that it is implemented on a key-by-key basis. Given that leveldb databases usually contain tens to hundreds of millions of keys, recovery may take a real while; snapshot migration is on the todo list.

Reverbrain packages repository

The Reverbrain package repository now hosts packages for the following distributions: RHEL6, RHEL7 (CentOS supported), Ubuntu Precise, Ubuntu Trusty, Debian Wheezy and Debian Jessie.

Repository includes all packages needed to install Elliptics distributed storage and Eblob low-level storage.

Here is a small tutorial on how to automatically turn on repository in your setup: http://doc.reverbrain.com/elliptics:server-tutorial

Backrunner HTTP elliptics proxy can be found in Docker repo: https://registry.hub.docker.com/u/reverbrain/backrunner/

LSM and fractal trees and how to really work with large files

An LSM tree (log-structured merge tree) is a rather simple structure which can hardly be called a tree.

It is an append-only log which is sorted when written to disk. The LSM tree is intended for write-heavy workloads, since reading may require O(number of on-disk log files) disk-seek operations.

There is a fair number of read optimizations for LSM trees, in particular bloom filters, which can reduce the number of disk seek operations to a minimum, albeit with some false-positive probability (which can be made arbitrarily small).
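
Here is a toy sketch in Go of that read path: each sorted run carries a bloom filter, and the reader only pays a disk seek for runs whose filter says the key might be present. The filter here is deliberately tiny and simplistic; real implementations size it by the expected number of keys.

package main

import (
	"fmt"
	"hash/fnv"
)

const filterBits = 1 << 12

type run struct {
	name   string
	filter [filterBits]bool
	data   map[string]string // stands in for the sorted on-disk file
}

func hashKey(key string, seed uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return (h.Sum32() ^ seed) % filterBits
}

func (r *run) addKey(key, value string) {
	r.filter[hashKey(key, 0)] = true
	r.filter[hashKey(key, 0x9e3779b9)] = true
	r.data[key] = value
}

func (r *run) mayHave(key string) bool {
	return r.filter[hashKey(key, 0)] && r.filter[hashKey(key, 0x9e3779b9)]
}

// get searches runs in newest-to-oldest order, skipping runs whose filter rules the key out.
func get(runs []*run, key string) (string, bool) {
	for _, r := range runs {
		if !r.mayHave(key) {
			continue // no disk seek needed for this run
		}
		if v, ok := r.data[key]; ok { // this lookup models the disk seek
			return v, true
		}
	}
	return "", false
}

func main() {
	l0 := &run{name: "L0", data: map[string]string{}}
	l0.addKey("cpu.user", "42")
	fmt.Println(get([]*run{l0}, "cpu.user"))
	fmt.Println(get([]*run{l0}, "missing-key"))
}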

An LSM tree behaves much better for write workloads compared to a Btree and friends (B+, B* and so on), since there is only one write of the sorted run and it is always sequential. A Btree potentially has to update multiple nodes (about the logarithm of the total number of keys) when performing a single write. The nodes are likely located at random positions, which ends up as random writes. Those are slow.

Quite the contrary, Btree reading is usually faster than that of LSM trees – the logarithm of the number of keys is smaller than the number of sorted logs in an LSM tree. But this does not take bloom filters into account, which in turn does not take node caching in btrees into account.
Needing multiple operations to serve a single request – like multiple page reads to fetch a single key in the btree case – is called amplification. The fractal tree is aimed at write amplification: it is still a B+tree, but it stores data in intermediate nodes (not in leaves) for some time, until a page split is required. This reduces the number of writes needed to insert or update a key.

Anyway, btrees are considered to be faster than LSM trees for reading and slower for writing. The latter is a fact – LSM trees are designed for that, the former is questionable.

The Elliptics distributed storage can use many backends, and the most popular one is Eblob – a low-level storage built with the LSM tree design in mind.

LSM trees do not support in-place data rewrite – a key update is appended to a new log, and the older version is either marked as removed or a special lookup sequence is used to find the newer key first. Eventually the LSM tree merges the logs and removes old versions of duplicate keys.

In Eblob this process is called defragmentation, and it is a bit different from the LSM tree process. An LSM tree writes already sorted data to disk; it sorts it in RAM. But if your storage is intended for large files, like Elliptics – we store objects which are sometimes considerably larger than the amount of RAM in the system – the plain LSM tree approach (sort in memory and sync to disk) doesn’t work.

Instead, Eblob stores an unsorted log on disk (optionally overwriting data in place) and adds an in-memory index of the keys. This simple scheme would be naive on its own, since the number of keys multiplied by the key size – the amount of RAM needed to hold the key index in memory – can be much larger than the amount of RAM on any given server. So we have an additional on-disk index of the stored keys; it can be sorted, and binary search then finds the needed key rather quickly.
But not quickly enough – that would cost log2(number of keys) random seek operations – so we split the sorted keys into smaller ranges called index blocks and store start/stop key pairs for each index block in RAM. This allows us to find the right index block quickly without any on-disk operations, then perform a single read to get the whole index block (tens to hundreds of keys) and do an in-memory binary search.
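
Here is an illustrative sketch in Go of this two-level lookup (not Eblob’s actual on-disk format): a small in-RAM table of index blocks with start/stop keys, a binary search over that table, then a single read of the chosen block followed by an in-memory binary search.

package sketch

import "sort"

type indexBlock struct {
	startKey, stopKey string
	offset, size      int64 // where the block lives in the on-disk index
}

type onDiskIndex interface {
	// readBlock models the single disk read needed per lookup; it returns the
	// sorted keys (and their data offsets) stored in that index block.
	readBlock(offset, size int64) ([]string, []int64, error)
}

// lookup returns the data offset of key, or -1 if the key is not present.
// The blocks slice must be sorted by key range.
func lookup(blocks []indexBlock, idx onDiskIndex, key string) (int64, error) {
	// Find the first block whose stopKey is >= key: the only candidate block.
	i := sort.Search(len(blocks), func(i int) bool { return blocks[i].stopKey >= key })
	if i == len(blocks) || blocks[i].startKey > key {
		return -1, nil // key falls outside every block range
	}

	keys, offsets, err := idx.readBlock(blocks[i].offset, blocks[i].size)
	if err != nil {
		return -1, err
	}

	// In-memory binary search inside the block.
	j := sort.SearchStrings(keys, key)
	if j < len(keys) && keys[j] == key {
		return offsets[j], nil
	}
	return -1, nil
}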

And even this is not enough. Iterators and, for example, recovery work with sorted keys – recovery merges index lists from different nodes and produces a sorted list of keys which have to be recovered – and since our data is not sorted yet, reads of the keys to be recovered would actually be random reads. Instead we can turn those purely random reads into sequential reads plus the occasional head repositioning. So we also sort the data, which is performed when the defragmentation process is started for the first time.

This allows Elliptics+Eblob to be the ultimate solution for distributed storage of medium-to-large files. For smaller files a pure LSM tree is usually enough.

Apache Bookkeeper or how NOT to design replication consistency scheme

Consistency in a distributed system is hard; it is even a fundamental limitation – everyone knows that one has to sacrifice a letter in the CAP theorem to have a consistent distributed system: either it will not be 100% available or it will not tolerate a partition split.

The PAXOS algorithm is known to be a complex solution to the consistency problem – it allows reaching a consensus among multiple opinions, but the algorithm is quite complex to implement. And it is very slow. Cassandra, for instance, kept catching bugs in its implementation for several years (and probably still has some), and it used 6 round trips to converge on a single write.

But there are several simple optimizations of PAXOS; the most interesting (and the fastest) is FAST PAXOS. Basically, this is a single master – multiple learners solution. It is very similar to the Apache Zookeeper Atomic Broadcast (ZAB) protocol. A single master is a point of failure of course, but there are always leader election algorithms which might or might not converge, and usually they do (this is a tricky spot in plain PAXOS – it is quite possible that it will not converge in a single write round).

ZAB and FAST PAXOS work with a single master server which handles all updates and serializes them according to its own rules. Since there is no distributed consensus problem anymore, it can use 2-phase commit to force all learners to receive the update. It works, and it is the fastest consistency-enforcing approach, but there are still 2 round trips.

Apache Bookkeeper decided to work around that. They decided that 2-phase commit can be avoided and instead data is written directly to multiple ‘bookies’, or replicas. And only by a single writer (more on this below). Since there is no commit phase, a replica does not know whether its data was written to a quorum of nodes or not, thus the reader has to read data from multiple replicas to find out whether the data it has received was actually written to a majority of the nodes.

This ‘write once – read multiple replicas’ scheme is a horrible performance killer on real storage devices. I do not know any real-life workload where the reader tolerates an N-fold number of reads to get its data. Maybe only log collection, but even logs are read *at least* once, which eats up any possible profit from this scheme.

I created such a scheme in the Elliptics distributed storage – the writer puts a timestamp on every record, and if there are consistency issues (for example, 2 writers updated the same key in a different replica order), the reader may use a special API call which reads multiple replicas and compares timestamps – the newest data is returned to the client.
And when I tell you this doesn’t scale – it is true. This design turns the distributed storage into a centralized entity for every read request. That can be acceptable for writes, but not for reads.
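
Here is a hedged sketch in Go of that read path; the replica interface is hypothetical and does not correspond to the real Elliptics API, but it shows how every read fans out to all replicas, which is exactly the scalability problem described above.

package sketch

import "errors"

type record struct {
	Data      []byte
	Timestamp int64 // writer-assigned timestamp
}

// replica is a hypothetical stand-in for one copy of the data.
type replica interface {
	Read(key string) (record, error)
}

// readNewest queries every replica and returns the record with the newest timestamp.
func readNewest(replicas []replica, key string) (record, error) {
	var newest record
	found := false
	for _, r := range replicas {
		rec, err := r.Read(key)
		if err != nil {
			continue // a missing or failed replica is simply skipped
		}
		if !found || rec.Timestamp > newest.Timestamp {
			newest = rec
			found = true
		}
	}
	if !found {
		return record{}, errors.New("key not found on any replica")
	}
	return newest, nil
}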

And here comes another issue – the single writer. Apache Bookkeeper operates with a single writer at a time. I do not know how they achieve that (maybe there is a master server, or unique node-specific IDs which allow serializing requests coming through different nodes), but still, it is a single-writer solution.

Apache Bookkeeper traded a single round trip (2-phase commit vs a plain replica update) for an N-fold increase in the number of reads. I call this a very bad deal. Zookeeper has its issues (it has tons of them – Pinterest calls its Zookeeper clusters a ‘single cluster of failure’), but at least it has strong guarantees on its performance.