
reverbrain.com is up and running

Reverbrain is a company we started to provide support and solutions for distributed storage and realtime computation problems.

We created a whole stack of technologies, ranging from the lowest level of data storage up to high-level processing pipelines.

Time moves on, and we see how complex it is to deal with massively increasing amounts of data. We provide solutions for true horizontal scaling with fair system complexity. Our storage appliances cover everything from hosting billions of small-to-medium objects up to huge data streaming systems.

Your data should not just lie archived in a distributed system – realtime pipeline processing is our vision of how data has to be handled. We use these technologies ourselves and want to provide the best experience for our customers.

Setting up client to access Elliptics webdav server

We have installed a simple WebDAV server with an Elliptics backend for you to test: you can connect using your favorite client and check how the POSIX filesystem interface works.

For Linux we recommend wdfs and definitely not davfs2 – the latter is very racy and basically doesn’t work. For Windows, to enable network drive support with authentication you will have to install a WebDAV client (a wrapper, actually) which supports it, since the default Windows WebDAV client doesn’t support auth; we recommend NetDrive.

Here is a Linux setup:

sudo wdfs http://webdav.reverbrain.com:8021/webdav /mnt/webdav/ -o uid=1000,gid=1001,umask=0022,username=test,password=test,allow_other

The uid/gid should correspond to your local user; the username/password is test/test.
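The same endpoint can also be reached without mounting anything, straight over HTTP. Below is a minimal Python sketch using basic auth with the test/test credentials above; the file name is made up for illustration.

# Talk to the test WebDAV endpoint directly over HTTP.
# Credentials are the test/test ones above; 'hello.txt' is a made-up name.
import requests

base = "http://webdav.reverbrain.com:8021/webdav"
auth = ("test", "test")

# Upload a small file (WebDAV uses a plain HTTP PUT for this).
r = requests.put(f"{base}/hello.txt", data=b"hello from webdav", auth=auth)
r.raise_for_status()

# Read it back with a regular GET.
r = requests.get(f"{base}/hello.txt", auth=auth)
print(r.status_code, r.content)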

Please note that this node can be (and will be) destroyed/reset at any moment.

POSIX filesystem interface for Elliptics distributed storage

We at Reverbrain create storage systems, from single-node search engines to multi-petabyte clouds spanning hundreds of servers. All of them are accessed mostly via an HTTP API and sometimes via programming interfaces for Python, Go and C/C++.

But there is a huge number of clients who do not need the complexity of the HTTP API; instead they want a locally mounted storage directory with virtually unlimited free space. They will store and load data using existing applications which operate on files. It is very convenient for a user to mount remote storage into a local directory and run, for example, a backup application which copies local files into this directory, ending up with several physically distributed copies spread around the world, without any additional steps from the user.

The first incarnation of a POSIX filesystem interface, for a simpler distributed storage we had created, was a network block device which contained a connection layer and selected a server to work with depending on the block offset. That was a rather ugly way to work, since a block device doesn’t know which object is being stored or read and what locking should be performed. Locking was always either too coarse or too fine, and it ended up performing a lot of operations for a simple block transfer; it became obvious that locking has to be performed at a higher layer, namely in the filesystem. This distributed block device was named DST and it lived in the Linux kernel for a couple of years.

The second approach was to implement a filesystem. We created POHMELFS – Parallel Optimized Host Message Exchange Layered FileSystem. Its first commits were imported at the very beginning of January 2008. POHMELFS was not very active, and it became clear that the existing Linux VM stack doesn’t really scale to billions of objects; we did not have enough resources to change that – it is a huge task both from the technical and the political side. We implemented several features which were later found in Parallel NFS and Ceph. POHMELFS lived in the Linux kernel for several years.

We removed both projects from the Linux kernel back then and concentrated on the Elliptics distributed storage. And now it’s time to resurrect the POSIX filesystem interface.

But we decided to go another way. A native filesystem interface is fast, but you have to implement it for every OS, and that requires a lot of resources which would be wasted on supporting different versions for different OSes. Do you know the inode allocation differences between Windows 8 and 10?

We found that our clients do not need this; instead they want a network-attached directory, which works pretty well over the WebDAV protocol. Well, not exactly: Windows clients do not support authenticated WebDAV, and an application like NetDrive has to be installed, but surprisingly it turned out to be an almost standard application for NAS/SAN.

We implemented a WebDAV server which supports HTTP authentication and connects to the Elliptics storage. There are limitations both in the WebDAV protocol and in our server; in particular, we do not allow locks to be transferred among servers, i.e. if a client connected to the storage via one gateway and then reconnected via another, the locks will not see each other. But that should not be a problem, since WebDAV prohibits parallel updates of any object.

We can create private folders for every user; it is even possible to add features on top of user files, like indexing for search, version control and so on, but that’s a different story.

The main advantage is that this distributed storage is cheap per gigabyte. You can add many commodity servers into an Elliptics cluster, and this just increases the size of the storage without interruption – the system scales linearly to ~4Tb/day of writes in our setups and 200+Tb/day in Yandex, for example. You can also put replicas of your data into different datacenters – this is an inherent feature of Elliptics, and if the connection to one datacenter drops, the client will just work with the other replicas.

And all those features are now accessible via the usual filesystem interface. The data can still be accessed via HTTP or other APIs as well.

If you are interested, feel free to contact us at info@reverbrain.com

Comparing various compression algorithms for a particular project

We are building a hobby search engine for one blog platform, and there is a challenging problem of fitting the whole search index and archival dataset into one server node.

There are 1.3 billion posts and comments, and the whole uncompressed dataset takes about 2Tb.
We generate *many* indexes for very tricky and fast searches: we cannot afford to wait 20+ seconds to find all comments Alice made in Bob’s blog, which is the number of seconds Sphinx takes to solve this task by enumerating all comments and filtering them by author/blog fields; instead we have to answer in a fraction of a second.

Our algorithms produce about 3x the amount of the uncompressed input, and generally this should be compressed. 1x of those 3x is the original text content (slightly reduced by the supported language, dropped HTML markup and so on); the rest is a set of binary indexes whose content is actually timestamp-like ids. Although these timestamp-like ids are monotonically increasing, they are not fixed-interval timestamps, and we cannot use Facebook’s Gorilla without serious modifications.

As a test we decided to check how common compression algorithms would handle this. We tried zlib, snappy, lz4, lz4hc and zstd. The test dataset was about 220MB uncompressed; we measured the time our tool needed to index this text with exactly the same settings. It produced about 660MB of data, which was compressed during the test at the storage level, and the size of the final database was reported. We did not test bz2, since its compression/decompression time is much higher, which is unacceptable for realtime use.
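For reference, here is a rough sketch of the kind of comparison we ran, only at the level of a single in-memory buffer rather than inside the storage engine. zlib ships with Python; the lz4, snappy and zstandard module names are assumptions about third-party packages.

# Rough single-buffer comparison of the codecs mentioned above.
# zlib ships with Python; lz4, snappy and zstandard are third-party
# packages whose exact names are an assumption here.
import time
import zlib

import lz4.frame
import snappy
import zstandard


def bench(name, compress, data):
    start = time.monotonic()
    out = compress(data)
    elapsed = time.monotonic() - start
    print(f"{name:8s} ratio={len(data) / len(out):5.2f} time={elapsed:.3f}s")


data = open("indexes.bin", "rb").read()  # any representative test file

bench("zlib", lambda d: zlib.compress(d, 6), data)
bench("snappy", snappy.compress, data)
bench("lz4", lz4.frame.compress, data)
bench("zstd", zstandard.ZstdCompressor(level=3).compress, data)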

Here are the results:

[Figure: index compaction time and final database size for each algorithm]

It is a bit surprising that zstd is comparable in speed with zlib, yet the final database takes 10% more space.

Arithmetic coding vs common compression

Arithmetic coding is a type of lossless compression where more frequently used bytes are encoded with fewer bits than less frequent ones.

Arithmetic coding is commonly conflated with Huffman coding, but that is not correct: Huffman coding splits the input string into smaller and smaller chunks until a chunk can be replaced with a smaller entity. Huffman encoding usually cannot reach the Shannon limit (the theoretical compression limit for the given data) in a reasonable number of iterations, while arithmetic coding, which transforms the whole string into a packed binary stream at once, approaches that limit much faster.
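To make the Shannon limit concrete: for a coder that treats bytes independently, it is just the entropy of the byte histogram, which bounds how small any entropy coder (Huffman, arithmetic or rANS) can make the data without higher-order modeling. A small sketch:

# Shannon limit for a byte stream: H = -sum(p * log2(p)) bits per input byte.
# No entropy coder working on independent bytes can beat this bound.
import math
from collections import Counter


def ideal_compressed_size(data: bytes) -> float:
    counts = Counter(data)
    total = len(data)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy * total / 8  # bytes in the ideal case


data = open("input.txt", "rb").read()  # any test file
print(f"ideal size: {ideal_compressed_size(data):.0f} of {len(data)} bytes")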

I work on the Greylock search engine, where a major task is to implement a full-text search index with as small a disk footprint as possible. In our tests, 200k uncompressed input events occupy about 220Mb of disk space and the whole index takes close to 450Mb. The index includes the original content, the common index and a per-author index (this basically doubles the original index). ElasticSearch with morphology takes about 539Mb, without morphology 317Mb; Sphinx (+ MySQL to store the original content) is about 250Mb, but this index is heavily truncated: there is still a notion of stop-words, and morphology is broken.

We compress our binary index entries as well as the original content using lz4hc or gzip, but I wondered whether arithmetic coding could do better than that. I skipped the Huffman encoder because of its theoretical problems approaching the Shannon limit (not to mention the practical issues), and decided to check the rANS entropy coder – an Asymmetric Numeral Systems coder which combines the speed of Huffman coding with the compression rate of arithmetic coding. Well, at least that’s what its author says in the paper.
Here is a figure from the paper showing the Huffman coder approaching the Shannon limit.

It turned out to be worse than any common compression, both for binary and text data.

Text data from original library: 768771 bytes
bz2: 232598
gzip: 312281
lz4: 363236
rans: 435113

Binary data from our index: 9678
bz2: 3652
gzip: 3665
lz4: 5262
rans: 7795

I have not dropped the idea of using arithmetic coding to pack binary data, since Dropbox showed with its Lepton project that JPEG images can be losslessly compressed by 20% using tricky DCT index sorting and prediction (responsible for about 3-4% of the 20%) plus VP8 arithmetic coding of these indexes.

A naive comparison of Lepton and rANS yields a dramatic difference:
Original jpeg file size: 71517
Lepton encoded size: 55700
gzip: 71171
bz2: 71586
lz4: 71414
rANS encoded size: 71336

This is of course an unfair comparison, since Lepton encodes not the binary stream of the original image, but specially crafted DCT indexes.

Since our indexes are basically vectors of document IDs, which are in fact partially an increasing timestamp and partially a sequence number, it might be a good idea to use the timestamp compression algorithm described for Facebook’s Gorilla timeseries database – I wrote about it previously, but that’s another story.
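For reference, the Gorilla timestamp scheme mentioned here is a delta-of-delta encoding: monitoring points arrive at nearly fixed intervals, so the second difference is almost always zero and fits into a few bits. A toy sketch of the transform (the real Gorilla then packs the results at the bit level with specific bit-width buckets):

# Toy delta-of-delta transform behind Gorilla's timestamp compression.
def delta_of_delta(timestamps):
    """Return (first_ts, first_delta, list of delta-of-deltas)."""
    first = timestamps[0]
    first_delta = timestamps[1] - timestamps[0]
    prev, prev_delta, out = timestamps[1], first_delta, []
    for ts in timestamps[2:]:
        delta = ts - prev
        out.append(delta - prev_delta)  # usually zero for regular intervals
        prev, prev_delta = ts, delta
    return first, first_delta, out


ts = [1000, 1060, 1120, 1180, 1241, 1300]   # almost fixed 60s interval
print(delta_of_delta(ts))                   # (1000, 60, [0, 0, 1, -2])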

Adaptive MPEG-DASH streaming and arbitrary channel muxing

Ever wanted to compose your own content stream without re-encoding video files? Like taking the video stream from this file and the audio from those files, then adding another video, and so on?

I have created a simple service to highlight one of our technologies, which allows creating an adaptive MPEG-DASH stream (the same kind of stream YouTube returns) and mixing many streams from different files stored in Elliptics.

[Screenshot: the example stream in the player]

In the example above I play a neural network course video with some of my saxophone music as the audio track.

The DASH stream is created on demand from different sources, so if you want to add a new sound track or link multiple videos one after another, there is no need to re-encode the video content each time. The service I created is rather simple and does not use every feature of the technology. In particular, playlist protection is not used in the service, i.e. you can share it with others, and by initializing a DASH player with our URL you will be able to play it on your own site. We also do not turn on stream insertion, i.e. playing the main video and putting some additional video stream (or a part of it) in at some time offset. We have both of these features in the server, but there are no interface controls for them in the service yet.
As a side note, this service can be used to create gapless audio/video playback, i.e. without the player reinitialization between tracks you get on popular video streaming sites.

So far we only support MPEG-DASH streaming, and while almost all modern browsers support it (Chrome, Firefox, IE and Opera), Safari is out of the list. We do not yet fully support HLS, and although Apple announced at WWDC 2016 that they will move away from mpeg2ts streaming (probably becoming more compatible with DASH), we still plan to implement HLS streaming too.

So, if you are interested in playing with the technology, you are welcome to our playground service: http://stream.reverbrain.com
Do not be afraid of the crude design: just upload files (they will be tagged automatically) and create playlists!

Audio/Video transcoding server

I’m pleased to announce Nullx – our standalone audio/video/subtitles transcoding server. It accepts files to be transcoded via HTTP and either returns the transcoded files in the reply or optionally uploads the file (and metadata) into Elliptics storage.
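A hypothetical client for such a service could look like the sketch below; the host name, path and parameters are made up, and the real Nullx API may differ.

# Hypothetical client: POST a media file to a Nullx-like transcoding endpoint
# and save the transcoded reply. URL and path are placeholders.
import requests

with open("input.avi", "rb") as f:
    resp = requests.post("http://transcode.example.com/transcode",
                         data=f, timeout=600)
resp.raise_for_status()

with open("output.mp4", "wb") as out:
    out.write(resp.content)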

So far it is quite a simple service which does not change stream parameters like height/width or bitrate, but instead transcodes streams into the h264/aac/mov_text formats suitable for HLS adaptive streaming. Since we plan to add realtime downscaling of the data for adaptive streaming, this service will be extended and will receive per-request HTTP controls which tell how exactly a given stream should be transcoded; so far I believe only the h264 profile and height/width for video and the bitrate for audio streams are needed.
That will be our next release.

Nullx, the transcoding server, is used in our broadcast service, which allows grouping audio/video streams uploaded into Elliptics storage, muxing them together with different options (order, timings, splitting at various time positions, sound from a different source and so on) and adaptively streaming the resulting data using the MPEG-DASH/HLS protocols (natively supported by Chrome/IE/Firefox and Safari on desktop and mobile).

Ever thought of working with ffmpeg?

Transcoding an audio track into mp4 (aac) takes just about 30kb of hardcore C code.

That’s a small highlight of our progress on the streaming service: we are building a system which accepts user content and allows creating long-lived broadcasts containing many tracks in one stream.

For example, your stream may start with the file ‘introduction’, then ‘adv1’, then part of the file ‘main content’, then ‘adv2’, ‘main content’, ‘final part’ and so on.
The only thing you need to do is upload your audio/video tracks to our service and create your stream using our interface. If you prefer, you can set up a different audio track for your stream.
We will use adaptive HLS/DASH streaming for different client bandwidths.

We will not concatenate your video files together; instead we use real-time stream composition on our streaming servers, which build the video content directly out of the files you uploaded.

Here is an initial presentation (MPEG-DASH) which muxes 2 video streams (5 seconds each, sample after sample) and 2 completely different audio streams: http://video.reverbrain.com/index.html

Elliptics as Video-On-Demand storage

We have moved the video archive of our friends at FC Dynamo to Elliptics storage and in parallel upgraded video streaming from Flash to HTML5.

Besides the fact that this solution scales well, it is also much faster: mostly because of HTML5 vs Flash, but also because of the Elliptics underlying storage vs a plain filesystem.

The architecture of the project is rather simple: a .NET frontend creates a redirect link to one of the storage servers located in Germany and the Netherlands, and the server streams data directly to the client’s browser. This solution does not use adaptive HLS/DASH streaming, but plain on-demand or progressive download.

How harmful is the eventually consistent model at large scale? (spoiler: it isn’t)

Eventual consistency is considered a bad design choice, but sometimes it is not possible to live without updates: one has to overwrite data and cannot always write using new keys.

How harmful could this be at large scale? At Facebook’s scale.
The paper “Measuring and Understanding Consistency at Facebook” measures consistency anomalies in the Facebook TAO storage system, i.e. cases when results returned by the eventually consistent TAO differ from what stronger consistency models would allow.

Facebook TAO has quite sophisticated levels of caches and storage; a common update consists of 7 steps, each of which may end up in temporary inconsistency.
[Figure: Facebook TAO’s layered caches and storage]

And it turns out that an eventually consistent system even at Facebook’s scale is quite harmless: anomalies are somewhere at the noise level, like 5 out of a million requests violating linearizability, i.e. you overwrite data with new content but read the older value.

You may also check a shorter summary describing how Facebook TAO works, how they measured consistency errors (by building an update graph for each request from a random selection out of billions of Facebook updates and proving there are no loops) and the final results.

http://muratbuffalo.blogspot.ru/2016/03/paper-summary-measuring-and.html

HLS and DASH adaptive streaming and muxing from the same files from Elliptics

Our streaming engine Nulla is now capable of streaming in both HLS and MPEG-DASH formats from Elliptics storage. We do not require uploading multiple files (in mp4 and mp2ts container formats) or specially repacking frames in the media files (that’s what mp4box is used for) for streaming.

The Elliptics streaming service Nulla creates the DASH/HLS stream in realtime from the data file you have provided; it builds either an mp4 or mp2ts container on demand based on the streaming offset the client requests. This means clients are not forced to upload files in multiple formats (mp2ts for HLS, mp4 for MPEG-DASH) or to repack existing files to meet streaming demands (fragmenting the stream and putting an indexing box in front of the data).

Since we build the stream from data frames and create the container in realtime, we can adjust presentation and decode times and build an audio/video muxer. This allows, for example, streaming one file and putting another one into the middle, or splitting a file into X chunks each of which is followed by a different file: 5 seconds from file X, 10 seconds from Y, 15 seconds from X starting at a 5-second offset, then 10 seconds from file Z, while the audio track has its own muxing, and so on.

This is all controlled via a simple JSON API and guarded from embedding into hostile sites via random URLs with a limited lifetime.

We built a simple example page which shows this kind of muxing and streaming.
Depending on your browser, our servers will stream either HLS (desktop Safari and iOS) or MPEG-DASH (tested in current stable versions of Chrome, Firefox and IE) from 2 video and 2 audio files uploaded quite a while ago.

Enjoy!

http://video.reverbrain.com/index.html

The source code of this simple HTML page shows how simple and convenient our API is.

The next task is to build a cache for preprocessed chunks and distribute it among multiple geographically distributed Elliptics nodes in your cluster. We also plan to add automatic transcoding of the video stream into smaller bitrates which will be automatically selected by the browser (that’s why HLS/DASH are called adaptive streaming); currently you have to upload files in multiple bitrates and use them in the API to create the appropriate playlist. This can be automated and will be implemented in the next release.

Another major goal is to implement live broadcasting from an application (or browser) into Elliptics, which will distribute your broadcast via HLS/DASH to thousands of simultaneously watching users.

Adaptive MPEG-DASH streaming and multiple stream muxing in Elliptics

We built the Elliptics distributed storage quite a long time ago; it is a mature technology which just works when you have to store medium and large objects in geographically distributed locations.

It also supports progressive download: FLV and byte-range (tutorial, mp4) streaming, but we wanted more – native adaptive streaming with channel muxing from Elliptics.

Basically, we wanted to upload multiple mp4 files into Elliptics and then stream them to the client in the required order: 10 seconds of the first file, then 20 seconds of the second, then 10 minutes from the first starting at the 10th second, while another track (like sound) plays in the background and is mixed in its own way. And preferably with adaptive bitrate switching if the client moves between slower and faster networks, like 3g-to-wifi and vice versa.

Another example of this muxing feature is gapless music playing: you listen to songs one after another without delays in between, without player reinitialization, without waiting for the next song (or its part) to be downloaded when the previous one has stopped.

There are many technologies which implement streaming: HLS, HDS, MPEG-DASH, RTMP, RTSP and others. It looks like HLS is the winner for now: it is backed by Apple and supported by iOS and Safari, but MPEG-DASH is backed by a large group of vendors and is supported by all other browsers, including TVs, except iOS. Since Flash is going to die now that youtube.com, netflix.com and other large vendors have stopped streaming in that format, I believe MPEG-DASH will become more and more popular (its market share to date is rather small) and eventually only HLS and MPEG-DASH will remain as the default streaming protocols.

There is a fair number of streaming services which support both HLS and MPEG-DASH, but most of the time they require quite a lot of effort to create a fault-tolerant streaming service which works with distributed storage. And none of them supports the audio/video muxing described above. Actually, the streaming technology itself partially supports this feature: for example, MPEG-DASH has a notion of a “Period”, and multiple periods would be played one after another. But this is quite an advanced feature which is not yet supported by the reference player dash.js (there are commercially available players which somewhat support it though), and there are implementation questions about whether a player can reinitialize itself to play multiple periods without stream interruption.

We decided to implement MPEG-DASH in our Elliptics streaming service to support both adaptive streaming and stream muxing. To implement this, we create the whole mpeg container at runtime and only read sample data from the audio/video files stored in Elliptics. To allow muxing, though, all files in the stream must be encoded the same way.

Using our technology, one can implement 5-second muxing (5 seconds of the first video, then 5 seconds of the second, then the next 5 seconds from the first and so on) as in the example below, using the following control JSON:

"tracks": [
{
  "bucket": "b1",
  "key": "video_1.mp4",
  "duration": 5000
},
{
  "bucket": "b1",
  "key": "video_2.mp4",
  "duration": 5000
},
{
  "bucket": "b1",
  "key": "video_1.mp4",
  "start": 5000,
  "duration": 5000
},
{
  "bucket": "b1",
  "key": "video_2.mp4",
  "start": 5000,
  "duration": 5000
}
]

But enough words, show me the result.

Here it is: muxing 2 video and 2 audio channels in the way described above, without interruptions or gaps. All 4 files are stored in Elliptics as usual objects.

http://video.reverbrain.com/index.html

Please note that the Elliptics and streaming servers are located in the USA, which adds 150+ ms when fetching the first chunk from Russia (or when you seek into an area which is not yet present in the cache); otherwise it is very fast.

You can check the source of the HTML page above to see how the muxing is set up; you can play with different settings and watch the whole files or mix them in other ways.

Enjoy, and stay tuned!

Why will rotating disks matter for the next 10-20 years?

The short answer is SPACE. Not cosmic space, but disk space.

Although rotating disk speeds are not going to increase, there are technologies coming which will drastically improve the density of the platters.

Seagate highlights some of them, from the ones already in testing like shingled magnetic recording and heat-assisted magnetic recording, up to futuristic heated-dot magnetic recording.

[Figure: ASTC technology roadmap]

More details: http://www.anandtech.com/show/9858/seagate-hard-disk-drives-set-to-stay-relevant-for-20-years

Range headers and native HTML5 streaming support in Backrunner

Plain HTML5 streaming requires the server to properly handle the RFC 2616 Range header; in particular, video seek/rewind uses the Range header to specify the exact position within the file to start streaming from.

The HLS protocol is a bit different, but yet again the player uses the Range header to specify a position within the timeframe.

Previously the Elliptics HTTP server Backrunner used size/offset URI parameters for this purpose, which are not ajax-friendly and obviously are not supported by standard players.

With this new Backrunner update we add support for the Range and If-Modified-Since headers.
The former allows HTML5 players to work with the Elliptics HTTP proxy out of the box; If-Modified-Since is quite useful for client-side caching.
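Here is a small sketch of what such requests look like from a client; the proxy URL and object name are placeholders.

# Fetch a byte range the way an HTML5 player seeks inside a video, and make
# a conditional request for client-side caching. URL is a placeholder.
import requests

url = "http://proxy.example.com/get/bucket/video.mp4"

# Ask for the first megabyte only; a compliant server answers with 206.
resp = requests.get(url, headers={"Range": "bytes=0-1048575"})
print(resp.status_code)                    # 206 Partial Content
print(resp.headers.get("Content-Range"))   # e.g. bytes 0-1048575/73400320

# The server returns 304 if the object has not changed since that date.
resp = requests.get(url, headers={
    "If-Modified-Since": "Mon, 01 Feb 2016 00:00:00 GMT",
})
print(resp.status_code)                    # 200 or 304 Not Modified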

Here is a simple example of our video-on-demand service.

Our future plans include realtime HLS generation and transcoding for live broadcasts built on top of Elliptics and related technologies.

Facebook time series storage

Time series databases are quite a niche product, but they are extremely useful in monitoring. I already wrote about the basic design of a time series database, but things are more tricky than that.

Having something like HBase as a TS database is a good choice: it has small overhead, it is distributed, all data is sorted, HBase is designed for small records, it packs its sorted tables, it supports Hadoop (or vice versa, if that matters) – there are many reasons why it is a great TS database. Until you are Facebook.

At their scale, HBase is not capable of handling the monitoring and statistics read/write load. So Facebook created Gorilla – a fast, scalable, in-memory time series database.

Basically, it is a tricky cache on top of HBase, but it is not just a cache. Gorilla uses a very intelligent algorithm to compress 64-bit monitoring samples by a factor of 12.

This allows Facebook to store Gorilla’s data in memory, reducing query latency by 73x and improving query throughput by 14x when compared to a traditional database (HBase)-backed time series setup. This performance improvement has unlocked new monitoring and debugging tools, such as time series correlation search and denser visualization tools. Gorilla also gracefully handles failures from a single node up to entire regions with little to no operational overhead.

The design of the fault-tolerance part is rather straightforward: Gorilla doesn’t care about consistency, and moreover, there is no recovery of missing monitoring data. But after all, Gorilla is a cache in front of a persistent TS database like HBase.

Gorilla uses a sharding mechanism to deal with the write load. Shards are stored in memory and on disk in GlusterFS. Facebook uses its own Paxos-based ShardManager software to store the shard-to-host mapping information. If some shards have failed, a read may return partial results – the client knows how to deal with that; in particular, it will automatically try to read the missing data from other replicas.

I personally love Gorilla for its XOR compression algorithm.

[Figure: Gorilla’s XOR compression scheme]

It is based on the idea that subsequent monitoring events generally differ in only a few bits – for example, CPU usage doesn’t jump from zero to 100% in a moment – and thus XORing two monitoring events yields a lot of zeroes, which can be replaced with a small meta tag. Impressive.
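A toy illustration of the XOR idea (real Gorilla then packs the leading-zero count and the meaningful bits at the bit level; this sketch only shows how few bits actually change between samples):

# XOR consecutive 64-bit samples: slowly changing metrics leave almost all
# bits zero, so only a short (leading zeros, meaningful bits) tag is stored.
import struct


def to_bits(value: float) -> int:
    return struct.unpack(">Q", struct.pack(">d", value))[0]


samples = [42.0, 42.0, 42.5, 43.0, 43.0]   # e.g. CPU usage, slowly changing
prev = to_bits(samples[0])
for value in samples[1:]:
    xored = to_bits(value) ^ prev
    print(f"{value:6.1f} -> bits set in xor: {bin(xored).count('1')}")
    prev = to_bits(value)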

The Gorilla article is a must-read for monitoring developers: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf

Greylock tutorial – distributed base search engine based on Elliptics

We’ve heavily updated the Reverbrain documentation pages at doc.reverbrain.com, and I’m pleased to present our distributed base search engine Greylock. The documentation includes generic information about the search engine and a tutorial covering the installation process, configs and two types of clients: a plain HTTP API (similar to what is expected from a base search engine like Greylock or ElasticSearch) and a Python client (it works via HTTP too, but also uses Consul to acquire mailbox locks). If you need a C++ tutorial, you can check the Greylock test suite, which includes insertion/selection/removal as well as various iterators over data, self-recovery tests, statistics and other interesting bits.

I get a fair number of questions about how Greylock differs from ElasticSearch, Solr or Amazon Cloud Search. They all have an enormous number of features and work with large amounts of data, so what’s the point?

And the answer is just two words: scalability and automation.
If you have worked with Elastic, you know which operations have to be performed to reshard a cluster when the current sharding scheme becomes a bottleneck (how long it takes, what the performance penalty is and how dangerous the process is). When you work in an environment where new documents always arrive and space consumption grows with time, this resharding process has to be started again and again as new servers are added. At some point this becomes a serious issue.

With Greylock this is not needed at all. There is virtually no data or index movement when new servers are added, thanks to the Elliptics bucket system. This design has proven to work really well in Elliptics storage installations, where upload rates reach tens of terabytes daily – and that’s only our clients’ data; there are other, seriously larger installations, for example in Yandex.

We concentrated on the scalability problem and solved it. And yet we do have a set of features. It is not comparable with Elastic, of course, even leaving aside the NLP tasks which we will release later (language models and spelling correction, for instance, for any language where you can find a reasonably large corpus). Greylock supports a basic relevance model based on the distance between the words in the client request and the words in the document.

Probably the two worst issues are the absence of numerical indexes and of client locks. Both were made deliberately. Numerical indexes break pagination, which in turn means that if you want 100 documents out of a million, you have to read them all into RAM, re-sort them either into numeric order or into lexical order (that’s how document ids are stored in the inverted indexes), intersect the whole million of keys and return the first 100. For any subsequent request this has to be done again and again. Without numerics, pagination works with iterators pointing into the inverted indexes only; the whole index (and its millions of entries) is never read, only some pages are accessed sequentially.
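A sketch of what iterator-based pagination looks like: intersect two sorted posting lists lazily and stop as soon as the page is filled, without reading either list completely (the document ids and page size below are made up):

# Lazily intersect two sorted streams of document ids and take one page,
# never materializing the full posting lists.
import itertools


def intersect(a, b):
    a, b = iter(a), iter(b)
    try:
        x, y = next(a), next(b)
        while True:
            if x == y:
                yield x
                x, y = next(a), next(b)
            elif x < y:
                x = next(a)
            else:
                y = next(b)
    except StopIteration:
        return


alice = iter([1, 4, 7, 9, 15, 22, 40])    # docs containing "alice"
bob = iter([4, 9, 11, 15, 23, 40, 51])    # docs containing "bob"
page = list(itertools.islice(intersect(alice, bob), 3))
print(page)                               # [4, 9, 15]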

To help with numerics, Greylock supports document timestamps, i.e. a single 64-bit number per document ID which is used in the inverted index sorting order. Of course this is not a replacement for a proper numeric index, but it does solve almost all of our use cases.

The second major issue is consistency and client locking. Greylock, like Elliptics, is not a strictly consistent storage. With Greylock things are even worse: the amount of data overwritten by a single client insert can be large, and the index structure (which originally started as a distributed B+/*-tree) does not tolerate broken pages. Elastic and others implement a consistency model (like Raft, Paxos or ZAB) internally; Greylock doesn’t. That’s why we require clients to acquire locks in some other system like Consul, Etcd or ZooKeeper to work properly. Our tutorial shows a basic locking scheme implemented using the strictly consistent Consul key-value storage.
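A minimal sketch of such a locking scheme against Consul’s HTTP API (a local agent on the default port is assumed; the lock key name is arbitrary):

# Acquire a Consul KV lock before touching the index, release it afterwards.
# Assumes a Consul agent at 127.0.0.1:8500; the key name is arbitrary.
import requests

consul = "http://127.0.0.1:8500"

# A session with a TTL makes sure a crashed client eventually loses the lock.
session = requests.put(consul + "/v1/session/create",
                       json={"TTL": "30s", "Behavior": "release"}).json()["ID"]

# Consul returns true if the lock was acquired, false otherwise.
acquired = requests.put(consul + "/v1/kv/greylock/index.lock",
                        params={"acquire": session}, data=b"owner").json()
if acquired:
    try:
        pass  # safe to update the index here
    finally:
        requests.put(consul + "/v1/kv/greylock/index.lock",
                     params={"release": session})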

We have major plans for the Greylock distributed search engine, so expect new features, and give it a try: http://doc.reverbrain.com/greylock:greylock
If you have any questions, you are welcome to ask in the
Google group: https://groups.google.com/forum/?fromgroups=#!forum/reverbrain,
or on this site in the comments.

Server-side operations

The Elliptics distributed storage is built as a client-server architecture, and although servers may discover each other, exchange various statistics and forward requests, most of the time they serve clients’ requests.

Recovery in a distributed storage is a tricky operation which requires serious thinking about which keys have to be copied to which destinations. In Elliptics, recovery is just another client process which iterates over remote nodes, reads the data to be recovered and updates the needed keys.

But there are cases when this round trip to the client is useless, for example when you need to restore a missing replica, or when you have a set of keys you want to copy or move to a new destination.

Thus I introduced two server-side operations which allow sending content from one server to multiple replicas. They are intended for various recovery tools, which can optimize by not copying data from the local node to a temporary recovery location; instead they may tell the remote node to send the data directly to the required locations. They can also be used to move data from one low-level backend (for example eblob) to a newer version or to a different backend without server interruption.

There is now a new iterator type which sends all keys being iterated to a set of remote groups. It does so at the speed of the network or the disk (whichever is slower); in local tests, iterating over 200Gb blobs and sending the data over the network to one remote node via write commands ended up with ~78MB/s sustained speed. There were spikes though, especially when the remote node synced its caches. Both sender and recipient had 30Gb of RAM. Rsync shows ~32MB/s on these machines, not because it is that slow, but because of ssh, which maxed out the CPU with packet encryption.
The iterator sends a dnet_iterator_response structure with the write result for every key it has processed, just like the usual iterator; neither API nor ABI is broken.

The second server-send command accepts a vector of keys. It finds all remote nodes/backends which host the given keys in the one specified group, splits the keys on a per-node/backend basis and tells the remote backends to send the appropriate keys to the specified remote groups. The same iterator response is generated for every key which has been processed.

All operations are asynchronous and can run in the background while other client requests are handled in parallel.

There are 3 operation modes (sketched below):
1. default – write data to the remote node using compare-and-swap, i.e. only write the data if it either doesn’t exist or is the same on the remote servers. Server sending (iterator or per-key) running in this mode is especially useful for recovery – there is no way it can overwrite a newer copy with old data.
2. overwrite – when a special flag is set, the data is overwritten (the compare-and-swap logic is turned off).
3. move – if the write has been successful, the local key is removed.
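Here is a plain-Python sketch of the mode semantics; this is not the Elliptics API, ‘local’ and ‘remote’ are just dicts standing in for backends.

# Decision logic of the three modes, with dicts standing in for backends.
def server_send(key, local, remote, mode="default"):
    data = local[key]
    if mode == "default":
        # Compare-and-swap: write only if the key is missing remotely or the
        # remote copy already matches, so old data never clobbers a newer copy.
        if key in remote and remote[key] != data:
            return False
    # "overwrite" skips the check; "move" additionally drops the local key.
    remote[key] = data
    if mode == "move":
        del local[key]
    return True


local, remote = {"k1": b"value"}, {}
print(server_send("k1", local, remote, mode="move"), remote, local)
# True {'k1': b'value'} {}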

There is an example tool in the examples directory which iterates over a remote node and its backends and performs a copy/move of the keys being iterated. The next step is to update our Backrunner HTTP proxy to use this new logic to automatically recover all buckets in the background.

Stay tuned!

Some years ago, in a past life, I was a kernel hacker

One of my initial kernel projects was Dallas 1-wire: https://www.kernel.org/doc/Documentation/w1/

I’m still a maintainer, although quite an occasional one.
It was quite a bit later that I began working with VFS and created Elliptics/POHMELFS; before that there was a bit of hardware time.

In particular, I really liked soldering bits onto my Matrox VGA card, which had w1 pins available. It used the 1-wire bus for some internal control, but since this bus is addressable, one can add other devices and things will not explode.

Anyway, the w1 bus is quite actively used, and I created (at the same time as w1 itself, like 7-10 years ago?) a netlink interface, on top of the kernel connector interface, from userspace to the kernel w1 bus.
Using netlink from userspace allows searching for devices and performing all other actions with noticeably lower latencies than working with sysfs files.

Over time the userspace example code was lost and recovered and lost again and again, but apparently people do want to use it to monitor the w1 bus from userspace.

So I created a github page with that example code: https://github.com/bioothod/w1
That’s it.

BoltDB – local ACID database written in Golang

BoltDB is a local key/value store with a simple API but powerful transactions and ACID support.

The main goal of the project is to provide a simple, fast and reliable database for projects that don’t require a full database server such as Postgres or MySQL. It is used in high-load production with database sizes up to 1Tb.

1Tb is not that much actually, but since its main competitors are RocksDB/LevelDB and LMDB, it is quite a spectacular volume. The main differences between BoltDB and RocksDB/LevelDB are the on-disk structure and transaction support: RocksDB and LevelDB are designed on top of LSM trees and thus are well suited for write load, while BoltDB uses a B+-tree, which generally scales better for read workloads. LevelDB doesn’t support transactions, while BoltDB does (and it has a very nice and convenient API for them).

The main difference between BoltDB and relational databases like Postgres or MySQL is, well, relations – BoltDB is a simple key/value store (although its API supports cursors, transactions and prefix key search).

Worth considering for local storage: https://github.com/boltdb/bolt

backrunner: HTTPS support

We at Reverbrain.com develop the highly scalable distributed storage Elliptics for medium and large objects. And in the web era the most frequently used API is HTTP. We have developed an HTTP proxy for Elliptics named Backrunner; it supports a wide range of options like ACLs, streaming, redirects, partial upload/download, static file downloading and many others.

But if you build a system behind HTTPS, you likely want to secure your work with the storage as well; in particular, your CDN will likely require you to work via HTTPS in this case.

So we have updated Backrunner to support HTTPS. It can listen for unencrypted and secured connections simultaneously on different ports/addresses; you only have to provide the certificate and private key files.

This rather small change allows deploying fully secured storage access for your frontend.

MySQL application level HA/loadbalancer

The master-master scheme doesn’t really work for relational databases; master-slave replication is used almost always, and balancing this load is a serious problem.

In particular, the most commonly demanded setup is to direct all writes to the master and balance all reads among multiple slaves. This can be done in the application itself (after all, only the client knows whether it is updating data or reading it), but sometimes this is not possible.
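For illustration, here is the kind of naive application-level routing this is about; the hosts and credentials are placeholders and mysql.connector is just one of several client libraries. As the next paragraph notes, such simple query inspection is often not enough.

# Naive application-level read/write splitting: writes go to the master,
# reads are spread across slaves. Hosts and credentials are placeholders.
import random

import mysql.connector

master = mysql.connector.connect(host="db-master", user="app",
                                 password="secret", database="app")
slaves = [mysql.connector.connect(host=h, user="app",
                                  password="secret", database="app")
          for h in ("db-slave1", "db-slave2")]


def execute(query, params=()):
    # Crude routing: anything that is not a SELECT goes to the master.
    is_read = query.lstrip().upper().startswith("SELECT")
    conn = random.choice(slaves) if is_read else master
    cur = conn.cursor()
    cur.execute(query, params)
    return cur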

A plain and simple regexp analysis of the query is most of the time not enough, so the MariaDB guys wrote the MaxScale tool for this.

Percona used it to solve a client’s scalability problem, and here is their report: https://www.percona.com/blog/2015/06/08/maxscale-a-new-tool-to-solve-your-mysql-scalability-problems/

Quite an informative post about the MaxScale tool.