libeblob

5 billions

Tagged:  

It is amount of objects in one of our clusters - it hosts avatars and other rather small objects (hundreds of bytes to several kilobytes)
This amount of objects is hosted just on 2 nodes, each has 24x2Tb disks, 48Gb of ram, only 6Tb are actually used

We should multiply it to 3, since that's amount of objects in whole cluster spread over 3 datacenters in Moscow region
Load is quite small though, about 500 rps of reads and about the same for simultaneous writes

POHMELFS and elliptics server-side

POHMELFS in a meantime got 'noreadcsum' mount option support as well as some other tricky bits.

This bumps read performance several times, since eblob (elliptics backend) stores data in contiguous chunks increasing read/write performance, but optionally forcing to copy data, when prepared space is not enough.
Reading with csum enabled forces to checksum the whole written area, which in turn requires to populate it from disk to page cache and so forth. For gigabyte file this takes about 5-6 seconds (first time).

And that's only to read one page. I.e. to read every page (or readahead requested chunk).
Not sure that disabling read checksumming is a good idea, so I made this a mount option. Maybe eventually we end up with some better solution.

I also fixed nasty remount bug in POHMELFS which uncovered a really unexpected (for me at least) behaviour of Linux VFS.
Every inode may have ->drop_inode() callback, which is called each time its reference counter reaches 0 and inode is about to be freed.
But sometimes when inode was recently accessed, it is not evicted, but placed into lru list with special bits set.
Inode cache shrink code (invoked from umount path in particular, but likely may be called from memory shrink path too) grabs all inodes in that lru list and later calls plain iput() on them, which in turn invokes ->drop_callback() for inode in question.

Thus it is possible to get multiple invocation of callback in question without reinitializing inode between them. This crashed pohmelfs in some cases. Now it is fixed with appropriate comment in the code, but I'm wonder how many other such tricks are yet to discover?

POHMELFS is a great tester for elliptics server-side scripting support. I was lazy and put all somewhat complex processing into server-side scripts written in Python. Anton Kortunov implemented simple sstable-like structure in Python for directory entries used by pohmelfs.

Since every command is processed atomically (on single replica) in elliptics, we can put complex directory update mechanism in this 'kind-of-transactions'. In particular server-side scripting is used to insert and remove inodes from directory. Lookup is also implemented using server-side scripting - we read directory inode in python code, search for requested name, and return inode information if something is found, which is sent back to pohmelfs.

Overall this takes about 2-13 msecs. I.e. receive command from pohmelfs, 'forward' it to pool of python executers (srw project), where python code will read directory inode data from elliptics (using elliptics_node.read_data_wait()), search for inode with given name there and send it back to pohmelfs.
Insert takes about 30-150 msecs - script reads directory content, adds new entry (or update old) and then writes it back into the storage.

That's how it looks in python - srw/pohmelfs_lookup.py

Given that we spend 10 msecs in such not really trivial piece of code, I believe that my implementation is actually not that bad.
Those are recent news. Stay tuned for more!

Recent elliptics changes

Tagged:  

It was a while I wrote here last time, but there are a fair number things happened.

First, elliptics.
We fixed number of bugs in server-side scripting implementation, so it is very stable now and easily allows to create complex scripts which not only perform basic tasks, but also rather complex transactions, like read/process/update multiple records.

We added transaction locks, which allow to perform operations on given key atomically on single replica. In this case your server-side script, which reads data, updates its structure and uploads it back, will be performed atomically on single replica.

There are many other distributed storage systems, which allow atomic key operations, but they do not support datacenter replica split. Usually it is implemented (like in Oracle NoSQL or MongoDB) with single master node, which copies data to multiple replicas. This operation may even be synchronous and safe.
And it is possible to put replicas into different datacenters. But what happens, when you want to add more datacenters, but do not want to increase number of copies? In this scheme one has to rebalance replica sets.

In elliptics you just say which datacenters you want to use for given IO transaction. This forces non-atomic replica updates, which happen in parallel. This allows to have faster writes and more flexible setup, but we can not guarantee that replicas are updated synchronously.

And actually with mongodb scheme it is possible, that replicas will not be in sync, since there may be failed node which ost some data, so reads from this node will not return consistent results.

Real fix for this problem lies outside of the low-level storage subsystem. It is like forcing block device to lock against multiple processes writing into the same file. Instead this protection lives on higher layers and generally requires external locks, which are both known and used by concurrent users.

In distributed world this algorithm is based on Paxos with optimizations (like fast Paxos or Zookeeper Atomic Broadcast and friends). We do not yet have such subsystem.

Another change is embedded checksums. Previously we stored them in blobs after each data records, but never used. Instead there was a separate command to calculate and store checksum in metadata (separate column). This may resulted in stall checksum stored in metadata, so parallel read could fail with inconsistent data error. Right now metadata is just a storage for replica information and update/check status dates and other meta information (namespace and so on).
Each write updates checksum in blob atomically (again, this is not synchronized between replicas), so any read will always get correct check if data was not corrupted.

We also added bulk IO operations, namely read and write. Write issues many parallel writes and waits for all of them to complete, thus essentially being equal to the performance of the slowest nodes, while doing sequential write ends up being as long as sum of all writes in question.
Read does similar thing, but splits and optimizes read using offset sorting (i.e. that we read data from disk in sequential order).

Elliptics vs HBase on hundreds of millions of small records

Tagged:  

We recently run a test on 215 millions of production records, which we wanted to store on single server.
Objects are rather small about 200 bytes, server hardware is quite common in our environment: 24 Gb of RAM, handful of mostly unused CPU cores, 4 SATA disks of 1-2 Tb in RAD10
Unfortunately due to configuration issues there were only 2 working disks in elliptics server (raid near layout sucks), while HBase had fair load spread over all 4 disks

Usage case: random reads, random range reads.

Out of the box elliptics _yet_ does not provide good enough on-disk index, so it is usually a binary search, which is costly. So we warmed index cache files by reading them into /dev/null

Upload speed was roughly the same in HBase and Ellipitcs - 2.5-3 Krps. But using HBase batch upload it was possible to write data upto 30.000 objects per second, object were packed into 5000 blocks. Elliptics does not yet support such batch upload mode.

So, reading. First of all, we started plain run of 200 random IO rps test. Its duration was about 10 minutes.
Elliptics showed about 30ms median reply times. HBase was closer to 100-150 ms.


Elliptics data (timing scale is wrong, though, median time is 30 ms)


HBase data

Second test was set to find maxium RPS rate of random IOs starting from cold caches and data.
I will not post graphs though, only numbers. Graphs on demand.
Elliptics showed 115 RPS within 100 ms each.
HBase reached almost 300 RPS, but with 100-200 ms median.

And that was with only 2 working disks in elliptics setup (blame on me), while HBase used 4 disks in RAID. Here are proof pictures (first one is elliptics dstat, second - HBase)

HBase also supports index compression, which ended up with 1500 RPS of random IO within 100 ms. Pretty good numbers for single server with 215 millions of records. Our objects are easily compressable, so HBase's data base only occupied about 15 GBs of space.

Elliptics does not compress index, only data, so compression would not help us. Instead we plan to replace current index (binary search on disk, roughly the same happens in every filesystem with b-tree, when you are looking for a file by name). New scheme will cache each N'th key in ram with specifying ID range stored behind given key. Keys are rather small - 64 bytes by default (specified at compile-time), so we could pack a bunch of them into 4k page for example. Reading this page from disk will take single IO, and searching for the key in this page (read into RAM) will be very fast.

This only optimization will kick hbase's ass I think. Getting that HBase can not safely live in multi-datacenter environment, elliptics will fit all needs for huge-number-of-small-records data storage.
But there is also a plan to add bloom filter for faster detection whether given key is present in blob or not.
Since blob file is 'closed' for writes after reaching maximum size or number of records, and no more writes goes into it except when overwrite is turned on or deletions sets the bit, but both do not change keys, it is possible to create rather static and very optimized index. This comes after HBase index design acutally, when its blocks are immutable and so are index ranges.

We currently cache in RAM each key we read from disk, but practice shows that overwhelming number of old enough records (so theirs keys moved to disk from ram) are never (or extremely rarely) read again, so we will add cache timeout for keys to drop them back to disk.

Elliptics range requests benchmark

Tagged:  

73+ millions of records, 25-30 Gb total space on single node (actually there were 3 replicas, but we used only one)

Here is the graph

3000 rps within 10 milliseconds, where each range request returned 20-4k records.

Elliptics range request is a full analogue of SQL's "SELECT * from TABLE WHERE key > X and key < Y LIMIT (from, num)". In this test each record's key contained timestamp and range request asked for data in some time range.

Each key (elliptics uses 64 bytes for key) looked like this:
xxxyyy...whatever else...timestamp[16 bytes]

and range request was from
xxxyyy...whatever else...0000...0000[16 bytes]
to
xxxyyy...whatever else...ffff...ffff[16 bytes]

Looking forward for the larger datasets

Tagged:  

To date largest (in terms of number of objects) elliptics cluster hosts as much as 400 millions of records on every node. Those are quite small records, so it does not occupy much space (just about 4 Tb per server node), but index becomes quite large (45+ Gb).

We dropped in-memory index in eblob quite for a while already, but having it on disk means that we have to check disk to find out needed key. Currently it is binary-searchable structure on disk, but this is very suboptimal. Well, for 45+ Gb indexes lookup has acceptable timings and likely will have it for 2-3 times larger datasets, but we want every node to host as much as 5-10 times more data, i.e. 40 Tb of data and 4 billions of records.

In this case binary search in 450 Gb index will take too long. We can add more servers to spread data, but in this case we will not optimally use quite limited physical space in datacenters. We have couple of ideas on faster indexes, which basically employ kind of sharding property of our keys, i.e. we can split our key (512 bits in default configuration) into chunks where each part will be an offset into box-like index structure.

In a meantime I added in-memory caching of the read keys - now every key read will be placed into hash table in memory for the fast next-time access. Eblob also got Python bindings and set of handy utils to scan blobs (like regexp match). It also supports statistics file (you will find it in root directory), which shows number of objects present on disk, removed on disk and pushed into memory. Removing this file will force eblob to regenerate it on the next start.

And now one really good news - new POHMELFS will be started next week. Estimated completion time is rather short (we want it to be if not production ready, but more usable at the end of the month), since it will be just elliptics frontend. To date main question is object indexing - the most naive and simple design will have quite slow file/dir renames, i.e. full copy and delete, which is not a good idea generally.
But I did not yet think in details about design problems, so this is an open question.

Stay tuned, there will be very interesting thins shortly!

Elliptics API extension

Tagged:  

I added a bunch of new URI parameters, so HTTP frontend is now almost as capable as clients created using C/C++/Python API. (Github has not yet been updated though)

In particular, added

  • column= - read/write data from particular column. Number of columns is not limited. Well, each new column is a separate large eblob, so it is limited by number of blob files in directory, by default blob size is 50 Gb, so it should not be a limitation
  • size=/offset= - to read/write data from particular offset and size
  • prepare/plain_write/commit - using 'prepare' one can reserve large enough space and then put there chunks of data using 'plain_write'. 'commit' will commit data into index, calculate checksum and so on
  • aflags/ioflags - one can enable/disable checksum, turn on/off compression, APPEND mode and so on
  • range/from/to - range requests. HTTP server returns array of data chunks, where each record is prepended with DNET_ID_SIZE key (64 bytes by default) and 8 bytes of size (little endian)
  • id - to read/write data using own ID (DNET_ID_SIZE max, 64 bytes in default compile - just for sha512 keys)

Here are examples.

1. Write data using own 64-byte keys (in hex: 1122334400... - 1122334600...):

wget -q  -S -O- --post-data="1234567890 id=11223344" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223344"
wget -q  -S -O- --post-data="1234567890 id=11223345" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223345"
wget -q  -S -O- --post-data="1234567890 id=11223346" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223346"

2. Read data by own user-provided key:

wget -q  -S -O- "http://elisto19f.dev/elliptics-proxy?name=1.xml&id=11223345&direct"

3. Range request hex keys are from 1122334400... to 112233ff00...:

wget -q  -S -O- "http://elisto19f.dev/elliptics?range&from=11223344&to=112233ff"

hexdump of the stream (red is key, blue is size):

 wget -q  -S -O- "http://elisto19f.dev.yandex.net/elliptics?range&from=11223344&to=112233ff" | hexdump -C
  HTTP/1.0 200 OK
  Content-Length: 459
  Content-Type: application/octet
  Connection: keep-alive
  Date: Thu, 04 Aug 2011 19:27:57 GMT
  Server: lighttpd/1.4.26

00000000  11 22 33 44 00 00 00 00  00 00 00 00 00 00 00 00  |."3D............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000040  16 00 00 00 00 00 00 00  31 32 33 34 35 36 37 38  |........12345678|
00000050  39 30 20 69 64 3d 31 31  32 32 33 33 34 34 11 22  |90 id=11223344."|
00000060  33 45 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |3E..............|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 16 00  |................|
000000a0  00 00 00 00 00 00 31 32  33 34 35 36 37 38 39 30  |......1234567890|
000000b0  20 69 64 3d 31 31 32 32  33 33 34 35 11 22 33 46  | id=11223345."3F|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 16 00 00 00  |................|
00000100  00 00 00 00 31 32 33 34  35 36 37 38 39 30 20 69  |....1234567890 i|
00000110  64 3d 31 31 32 32 33 33  34 36 11 22 33 47 00 00  |d=11223346."3G..|
00000120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000150  00 00 00 00 00 00 00 00  00 00 16 00 00 00 00 00  |................|
00000160  00 00 31 32 33 34 35 36  37 38 39 30 20 69 64 3d  |..1234567890 id=|
00000170  31 31 32 32 33 33 34 37                           |11223347|
00000178

We will stress-test range requests next week, it is expected that performance should not be radically different from plain random IO performance (benchmarks can be found here)

We also tested Cassandra in 4-nodes configuration, and I'm about to collect raw number and graphs to show.

Stay tuned!

Yfrog presentation about HBASE usage

Tagged:  

They removed my drawings from yfrog, so I will say some words about their presentation - mostly about how great we are

First, its setup - 60 servers for 250 millions of photos. I do not really know whether this number includes number of copies, but we use ONE server with elliptics to host 200+ millions of small objects. And we actually have 3 copies, which turns out to 600 millions of records in 3 different datacenters. Well, that server has 48 Gb of RAM and 24 disks in RAID6 - kinda big box.
But our 1 Pb cluster contains not only those machines, but also smaller ones with 4 disks and smaller amount of RAM.

Metadata cluster with 1 Bn of records and 20 nodes is much more interesting, but I'm afraid most of it fits RAM, since metadata records are small.

Presentation speaks about 10 krps - great number, but wait a minute - from 20 or even 60 nodes? Above mentioned setup handles 3krps (2 krps write and 1 krps read) easily. And I do not call it 'Super fast!'

We tested Cassandra in a lab, it handled 2 krps great (1400 rps read, 600 rps write) from 4 nodes (24 Gb of RAM, 4-disk RAID10, 15 Gb of data on every node), but only until data fits memory. Overall it was not very stable and required huge amount of tuning, so that it wouldn't end up in OOM or JVM GC killer.

Huge MongoDB tests are on the road.

Using eblob as low-level IO backend disk bottleneck was pushed away, but it adds limitation on number of records stored effectively - we store key index in RAM. Now when I added on-disk index we are about to retest our benchmarks, which will likely show performance degradation. Although this change allows to put more object into single node and do not heavily depend on amount of RAM, it may be slow enough.

Next change will be to use fast in-memory index for frequently used keys.

Our main disadvantage is lack of documentation. Nobody uses elliptics except our company friends, who bite me about how things are implemented and their meaning. But this will change very soon! We are working, and that's actually kinda harder than code :)

Recent news about projects

Tagged:  

So far I'm mostly concentrated on elliptics and eblob, since we build rather large and loaded storage systems on top of them.
But I plan to step a little bit away and work with language grammatics (starting with simple flex/bison so far), but about this later.

Elliptics got another project to host huge amounts of small files (1-20kbytes) uploaded virtually continuously all day long into servers in 3 different datacenters. Here is a monitoring pic (amount of files uploaded this week) from one of the servers:



There is a small caveat though - eblob stores data and index on disk and also fast index in RAM. With such amount of objects written eventually we will run out of memory for fast index, which in turn will force us to setup more servers, which sometimes is not that simple (mostly because of absence of physical space in datacenters).

Eblob and on-disk vs in-ram lookup indexes

To eliminate this problem I mapped fast index (hash table) into on-disk file, but because of serious fragmentation issues this ended up with huge slowdown - mostly to move disk heads from the beginning of the map file to its end to find out needed object. Although this was O(1) in terms of number of records, it still required fair number of disk seeks to iterate over hash table to get single record. So, it was dropped.

Now system is quite different - I sort on-disk index after blob is completed, which is quite fast operation (index file always fits free memory), then it is written to disk and all records placed there are removed from in-memory fast index. File is mapped and binary search is used to find data records.
Currently we do not propagate frequently accessed records into fast in-memory index, but will do in next releases.
Also if there is an old on-disk index it is not converted on startup, which should be changed too.

We did not yet test new lookup scheme performance, it is scheduled for the end of the week or next one. It is not that simple to generate a hundred of millions of records quickly, so likely I will use existing data first (it requires sorting of the existing on-disk indexes).

Eblob as append-only and overwrite in-place storage

Besides on-disk index eblob also got overwrite support - when new records is less than object (including alignment) already written with this key, it will be overwritten in place instead of appending it at the very end. This feature should eliminate quick metadata grows, since we update it each time object is checked or written.

Language grammatics

I plan to implement mapreduce on top of column data store used in elliptics, and want to create a rather trivial language for that. In particular I plan to use boolean and arithmetic operations, regex string search and generic read/write primitives mostly to eliminate data dereference (kind of string class). Maybe it will even contain loops and functions.
So far I wrote a simple reenterable grammatics parser using Flex/Bison, which supports multiline strings in Flex among others, but grammatics is yet very trivial and basically unusable for language parsing. I will experiment with it more, since I want to add it into elliptics sooner than later.

Those a plans and infos. Stay tuned!

Eblob got automatic defragmentation support

Tagged:  

Eblob is an extremely fast append-only low-level storage used in elliptics network distributed storage.

Since it is append-only all updates will look like new records appended somewhere at the end of the blob file. Thus old entries will just waste space.

Previously there was a special tool which was able to defragment blobs in offline and elliptics could be restarted to get new data location (it worked with old blobs until then). It required lots of space or lots of restarts to defragment really large storage.

Now this is gone - eblob uses online automatic defragmentation, which finds blobs with lots of removed objects (there should be at least one third of objects removed) and sequentially reads records and writes them back into another blobs.
In-memory index is also updated, so when this process completes scanning given blob file it can be removed.

Automatic defragmentation requires at most size of the single blob of free space to defragment whole storage, since blobs are processed one after another without storage restart.

Getting that all reads during defragmentation are sequential and all writes are actually appends properly aligned by VFS to make huge sequential writes, this process is rather fast.

Looks like this is the last feature for eblob, so we can concentrate on recovery tools, tests and eventually new features.

Since elliptics@eblob and API support column storage, I plan to implement incremental mapreduce on top of that. It is a bit far plan though, maybe to the end of the summer...
Anyway, stay tuned!

Eblob got Snappy compression support

Tagged:  

Kudos to Google for opensourcing it.

I did not yet test its performance, but it compresses our small common XML about 2 times.
API was extended to support per-command flag to compress data or not. Eblob stores this information in its structure, so reading detects this and automatically decompresses data.

So, to store textual data elliptics and eblob look very promising, getting that Cassandra and MongoDB do not yet support compression (although there are tickets back from 2009) :)

Enjoy!

Elliptics 2.10.0

Tagged:  

New release with huge number of improvements, the most notable of which are

  • column data storage (eblob only)
  • range requests
  • faster metadata and overall IO performance
  • server-side checksumming and verification support
  • base for ultra-fast recovery and local mapreduce
  • simplified interfaces for POHMELFS

So far this is a beta release, since not all API calls were propagated to high-level interfaces. There are some issues with recovery process - we will not yet fully finished transition process. Eblob offline defragmentation tool was removed, but online processes was not yet added into the core.

We plan to resolve this issues this week and move further.

Elliptics currently hosts not-that-large clusters (the biggest cluster spawns over about 200 nodes which live on ~20 physical servers and close to 1 Pb of data) with about hundreds of millions objects (not counting redundant copies), which covers 4 datacenters including abroad. With above features we plan to expand our usage area. Unfortunately it is quite complex to get into with elliptics usage, and that should be my main goal actually.

In a meantime we got www.elliptics.ru and www.eblob.ru, which will host documentation portal with example configs, API docs, benchmarks and so on. Many thanks to Andrey Godin.

Maybe eventually this will be a community driven project.
Anyway, stay tuned!

Column data store

Tagged:  

In a meantime elliptics got column store support.
It is implemented in eblob only though, filesystem backend does not support this feature.

I'm thinking about proper API for this feature though.
There are million of changes in the current tree, we will polish it and release new version very soon, which will draw the line between current rapid development cycles with extreme feature creation rate and steady maintenance support. Probably :)

Stay tuned, I will make new post, when things are ready for prime use.

Key range requests in elliptics network

Tagged:  

Elliptics network is a horizontally scalable key/value storage, which by default implements distributed hash table for ID spreading scheme.

But it is possible to implement your own ID generation mechanism, which can introduce data locality. Range requests in such scenarios can produce a huge win for users.

I implemented range requests (in eblob backend only), which can be kind of transformed into following SQL command:

SELECT * from eblob WHERE key >= start AND key < end LIMIT (from, num)

One can use half of the key to store master ID and another half (we use 512-bits keys, but it can be changed at compile time) for each object client wants to attach to selected master ID, for example if we name client X as key '123456.000000' then all its data can be stored with keys 123456.000001, 123456.000002 and so on
To get all data objects with 'master ID' 123456 in one request one may execute range query starting with 123456.000000 and ending with 123457.000000

For example using python binding it looks like this (different range is used here):

$ cat dnet/bindings/python/range.py
#!/usr/bin/python
# -*- coding: utf-8 -*-

from libelliptics_python import *

log = elliptics_log_file("/dev/stderr", 8)
n = elliptics_node_python(log)

group = 2
n.add_groups([group])
n.add_remote("localhost", 1025)

r = elliptics_range()
r.start = [1, 2, 3, 4]
r.end = [0xcc, 0xdd]
r.group_id = group
r.limit_start = 2
r.limit_num = 2

print n.read_data_range(r)

API hides cases when requested range is handled by different hosts, although this should be very rare case, it is supported in the underlying methods.

Range requests are supported in C, C++ and Python bindings as well as usual IO methods.

Next huge step is fair column data store. I'm open for ideas about better implementation.

Mapped file index, on-disk sorting, range requests and per-column data store

Tagged:  

Mapped file for fast in-memory index

I started to test eblob - append-only storage used in elliptics - with mapped file for its in-memory index. Since it is a hash table, I decided to reduce number of seeks in each hash bucket search, so I implemented array of elements and not a classical list. We reallocate array each time collision happens to add another entry at the end.

Unfortunately this tends completely not to scale. Eventually defragmentation of the mapped file becomes so huge (and it becomes clear just after several tens of millions of records) especially if we preallocate large enough hash bucket arrays (like for 5-10 elements), that write of element at the beginning of the hash table tends to flush write at the end of the mapped file, which in turn requires seeks and seeks and more seeks.

After about 40-50 millions of entries in single mmap file kernel flush thread starts eating 100% CPU and write performance degrade to hell. Less than that it works with the same spped as glibc malloc() though.
This idea has to be rethought.

On-disk sorting and range requests

To implement range requests in elliptics I decided first to optimize on-disk eblob storage. Range requests likely will not be supported for filesystem IO backend (where every object is a separate file), eventually I will drop it.

So, I imlemented offline on-disk sorting tool, which uses eblob iterators (C++ interface) and its speed equals to one half of write speed of underlying disk array, since it writes to-be-sorted data 2 times - temporal tables and merged result.
Sorting tool uses combination of in-memory quick sort (std::sort() to be precise) and writes resulted less-than-ram blocks into temporal files, which are in turn merged the way merge sort works.

12.5 millions of records (written objects from 1.5 to 20 Kb each, ~100 Gb total) were sorted with 22-25 Mb/s speed, while write speed of underlying RAID array is about 45-50 Mb/s (we write 2 files at the very end - eblob with data and its on-disk small index), so it took a bit more than an hour.

Range requests will be handled by elliptics and will return data objects for separate keys. On the storage it will iterate over all keys in requested range and send data to client. If range covers multiple machines, API will automatically split request to multiple nodes.

Since by default elliptics uses hash of the name as a key, range requests will not have any reasonable meaning. But it can be very useful when keys are generated manually.
For example client may use half of the key as user login (hash from it) and second half as a written object id (hash from uploaded file for example). In this case range requests allow to get all user's data via single request. Keys can be split whatever client likes, we use 512 bit keys, so there is plenty of space there.

Column data storage

But IMHO range requests and manual key generation is a half step to the real column data store. I do want to implement this feature this summer, so we could compete with systems like Cassandra and MongoDB in calculation and processing markets. On a performance (and optionally reliability) elliptics@eblob are unbeatable already :)

Documentation

It sucks. And I mean it - it is horrible as freakin' bloody hell.
This uglymoron can be really called docuementation, but I will, I promise, I will work with people who can write good docs, so that writing applications and setting up systems would not be a rocket science.
Actually it is very simple, but still this knowledge has to be documented.

Stay tuned, and things will be even better!

Per-column data store in eblob and elliptics

Tagged:  

A simple test to iterate over 38 millions of name metadata records placed separately into own eblob gave us almost 2 times performance increase - regex took about 50 seconds.
When combined metadata (about 3-5 different types) lived in single eblob it took 80 seconds to complete.

Per-column buzz is quite trending, although above data shows that if we store data in separated databases (or eblobs in our case), it is much more efficient. Databases like Cassandra store multiple columns in single record, which albeit not optimal but is quite convenient - we get all columns via single request and then iterate over them.

I'm thinking about putting such feature into elliptics and namely eblob - server will itself update columns in selected row, although placing all columns into single record looks quite questionable for me.

Eblob iterators, mapped file index, mapreduce over column storage and elliptics metadata

Tagged:  

Eblob iterators over elliptics metadata

To continue mapreduce games I decided to check how fast will be EBLOB as a metadata storage. Since eblob is a low-level append-only system, it should allow to have very fast concurent iterators.

So I copied whole metadata database from previous test from Kyoto Cabinet (which in turn will support concurent iterators in the next version) into eblob. We have about 38 millions of records there total of 13 Gb in Kyoto Cabinet, which took 16 Gb of data and 4 Gb of index in eblob.

I used the same regex to check object names and started 16 threads to iterate over eblob storage.
Eblob iterators took 80 seconds to complete (.*).jpg(.*) regex over 38 millions of records with cold VFS cache, which is rougly sequential read speed of underlying 4-disks raid10 array.

Since our database is 16 Gb in size it is quite obvious it fits into RAM (test server has 24 Gb of memory).

Kyoto Cabinet also fits RAM (13 Gb in size), but somehow it did not take advantage of that, since its time (only single iteration thread supported in current version) is about 15 minutes

As of HDFS/Hadoop (500 millions of records / 5 Gb on disk / 130 seconds), one can try to compare our apples to their oranges as I did in previous post, but I suspect that it still will be slower.

Eblob in-memory index as on-disk mapped file

In a meantime I implemented in-memory eblob index as a mapped file on disk, which should solve problems with unused kyes in memory. It is supposed that VFS will drop to disk pages which correspond to unused entries and only actively used keys will take place in RAM. This allows to store more keys to single node than physical memory in turn.

We start to extensively test it and I plan to test more benchmarks with it later. I want to load even more entries (like those in Hadoop test) so they would not fit VFS cache and run iterators again, but expect that performance will be the same - after all we just sequentially read data from disk.

It could be also a good test to run iterators over defragmented eblob (i.e. with removed entries), but I expect that it will be limited by disk speed, since we will just skip removed entries.

Mapreduce over column eblob storage

In a meantime I found a nice way on how to store metadata in eblobs split on per-column basis. Elliptics metadata is virtually a set of columns written by client after data is placed on disk. We support checksum, object name, timestamp and several other records, although none of them are required for system work. List of copies is only required for recovery check.

If we ever move from Kyoto Cabinet as metadata storage to EBLOB I could add a special IO flag, which will force elliptics not only to write metadata, but also update per-column indexes, so we could VERY quickly find out all records within given namespace for example or with given regex name or records older than some time and so on.

I suspect that initial implementation will not be a real mapreduce, since we likely will not process reduce() operation. I.e. we will just iterate or map() records and process them in place, without storing for reduce, without sorting and so on. For our current needs this is more than enough, but it can be extended if we need a real mapreduce.

I even plan to write some kind of a language for this - after all I have a dragon book and compiler principles unread yet :))

Elliptics network and MapReduce

Tagged:  

Yesterday I implemented quck and dirty MapReduce service on top of elliptics storage. It only worked with metadata, i.e. names, timestamps and so on, and was built using Kyoto Cabinet Map Reduce subsystem, since we store object metadata in its database on the same machines which host data.

Unfortunately all Kyoto Cabinet iterators including MapReduce are single-threaded, i.e. you can access data through parallel iterators (or map/reduce processors), but can not spread the same database to multiple iterating threads.

I implemented simple regex grep on uploaded filenames. On a typical storage node containing 40 millions of objects (13 Gb database size) (.*).xml regex took about 15 minutes (single map thread) to complete, which is roughly 44 krps. Smaller databases took less time to iterate, for example database with 200 thousands records can be 'parsed' with the same regexp in 2.5 seconds.

I found Hive@HDFS benchmark, where 500 millions of objects (50 Gb of data) were spread to 380 mappers (likely multiple machines), and similar regex took 130 seconds to complete, which is about 10 krps per mapper.

Unfortunately Kyoto Cabinet Map Reduce is not concurrent and I'm not sure that its internal structure can be easily made parallel for iterators. But eblob - our fast append-only subsystem used by elliptics as underlying storage, contains all its on-disk indexes as simple arrays (it is 'converted' into hash table in memory at startup though), which can be trivially iterated in parallel. In particular, uploading index into hash table in memory can be run in multiple iteration threads.

So I'm thinking about moving metadata from KyotoCabinet into out eblob storage (we can put it before or after data in the same chunks), and implement fair MapReduce in elliptics which can start multiple mappers on top of local data (as well as multiple mappers on different servers of course).

I believe I have to run some tests in particular put metadata for aforementioned 40 millions of objects into eblob storage and run multiple MapReduce iterators on top of our array-like indexes on single machine. If results will be multiple times faster that single-threaded Kyoto Cabinet MapReduce iterator, then this is the way to go.

I do not plan to implement complex Hive-like language for MapReduce processing, but several simple enough commands (like comparison of timestamps and regexes for namespace and object name matches) should be enough for start.

If it will work well, we can even implement MapReduce processors for data (all above is for metadata only so far), but this is a long-term idea to date.

Sounds like a plan?

Eblob and large volumes of data

Tagged:  

As mentioned in comments, there are possible limitations in eblob technology.

First, since eblob stores whole index in RAM, it is possible that it will take all the memory and there will be no IO progress.
While this sounds true, practice says quite the opposite. To save 40 millions of keys we only need about 6 Gb of ram (and we can decrease this number). Usually we have several times more keys on single machine. Those numbers are quite common in commodity servers not even saying about high-end machines. System bootstrap itself in several minutes, which may be a limitation though, but again it is 40 millions of objects to store in RAM.

Having index in RAM guarantees fast IO - we have to seek only once per object at most. If we put index to disk, IO can not be served that fast. For some cases this is valid requirement, while for other we do not need the whole index in RAM.

We have plans to move index from pure RAM (anonymous memory) into memory-mapped file. This will solve both problems, but to date we did not yet came close to this limitation, so it was not implemented yet.

Another mentioned issue is 'atomic' writes. I.e. since eblob is append-only storage, it has to have whole data object ready to be written into eblob. This is not true - eblob supports prepare/commit API, where the former reserves requested space in the blob and returns offset which can be used to write multiple portions of object. Commit will, well, commit your object into index.

This API is not exported into elliptics yet, since we, well, do not yet work with such large objects. But we have plans to start (or at least test) hosting multi-gigabytes files this year, so we will extend elliptics commands to support prepare/commit writes.

Streaming from blobs

Tagged:  

Elliptics IO backend is a modular subsystem, which allows to implement different data storages which will be handled by the library core.

In particular we support EBLOB - storage module designed after Facebook's Haystack. We do not run for their data volumes, so eblob has only single index level, which can be locked in memory. This ends up with guaranteed single seek to get your data in the worst case. I posted its benchmark and design notes.

One of the most wanted feature from the storage is ability to stream data to clients. Kind of youtube/whatever. If you store your data in files, it will be very slow (millions of files on ext4 filesystem end up with at most 100 RPS from 4-disks raid10 storage). The same eblob configuration allows to have almost order of magnitude more.

Simple streaming elliptics storage looks like this: we have a set of nodes (one per server or per disk), which are combined into elliptics storage. There is a web server running on the same storage machines. There is a proxy web server(s) which controls client requests and returns 'links' to the actual data, which can be streamed directly from server nodes.

So, client asks proxy about its file. Proxy returns XML describing full path, size and offset of the requested data as well as address of one of the servers, which host requested content. Client sends usual HTTP request to given address (this is a web server on storage node, which runs FLV streamer for example) and gets its stream from provided offset.

But streaming is dangerous. Blobs are append-only storages, where other client's data is placed somewhere behind yours. So, client can hack request and ask not only his own data, but also something else. That's why proxy signs XML data it returned using provate key. Streaming server has to check this sign and confirm that requested size and offset are valid. This should be done in another simple web server module.

That's it. Among other info we also return file's owner, permission bits, access, create and modify time and so on.
This will be used in POHMELFS for inode info.

It is of course possible to return data directly from proxy when client asked for this. It works great with pictures and other small objects. But gigabyte videos and even smaller mp3 files can not be effectively proxied to multiple clients.

In a meantime we started to write simple migration tool - it will allow to move content of selected elliptics node to different location without administration intervention.

Elliptics: datacenters, open functionality, changes, future

Tagged:  

Our team has grown up a little and we started to make things faster.

External projects

First, our friends in Yandex open-sourced FCGI management daemon: fastcgi-daemon
It allows a rapid deployment of C++ FCGI applications and has flexible configuration and management tools.
Kudos to Ilya Golubtsov and Vasily Kiritsev from Yandex team.

The first external elliptics application is HTTP proxy written on top of fastcgi-daemon by Leonid Movsesjan from Yandex team.
Source code contains example configs, which allow to read/write/delete data, get direct link to the object on the server in cluster, which can be streamlined or downloaded directly by client.
It easily handles 1000 random read RPS being connected to 2 elliptics nodes, where each request is handled just within tens of milliseconds. And although data set is not very large (just hundreds of thousands of small files from 10 to 50 kbytes each), it is still a spectacular number.

Datacenters

in a meantime our production clusters grew up. We will get first abroad datacenter this year, which will host content for international users. This cluster currently hosts about 150 millions of rather small objects (10-50 kbytes each) with 3 copies in physically distributed datacenters.
Our second production cluster comes close to half-petabyte scale (4 datacenters, about 200 elliptics nodes), its network load reaches 3.5 gbit/s daytime, although this scale took a fair technical price - we had to rewrite IO model again.

IO thread models

We recently switched IO model in elliptics core from thread-per-client to single network IO thread, which implements non-blocking reactor, and pool of disk IO threads. Elliptics maintains O(1) lookup times which implies that every node knows about every other, which in turn requires N^2 network connections in cluster and thus the same number of IO threads in thread-per-client model.
For 200 nodes this is 4k threads/connections, each machine with 24 disks starts 24 elliptics nodes, ending up with almost 100.000 sockets and threads in peaks. Linux just can not start that much threads by default (we used 2.6.32 and .38 kernels) on 64-bit servers with 24 Gb of RAM.
And even if we tune some things, it will stop at 500 nodes limit and so forth. Eventually this is a dead end.

Now we use IO thread pool and network processing thread, which handles sockets via epoll(). We dropped libevent and friends, since it is still extremely unconvenient and ugly to add events from one thread into other's processing loops.

Future plans

We started to think about secondary indexes in elliptics network. We can easily build slow-enough map-reduce-like processing on top of our data, but for example it will not quickly return prefix-search data, although we can run brute-force regex-like search.

I will continue PAXOS implementation - we do want to build Chubby-like service for small synchronized data storages as well as locks and counters. Combined with elliptics it should provide a full operation stack for data storage.

POHMELFS is not dead, IT IS NOT DEAD, please remember this :)
I'm just a little bit busy with things.

I think about dropping transactions support in elliptics and always perform rewrite-in-place. Ext* filesystems live without this feature compared to Btrfs and everyone is happy. Well, this is a cool feature, but hardly needed for everyone.

Hmm, I bought myself a small Yamaha KX-49 MIDI-keyboard and it even works in Linux with Rosegarden (modulo non-working controls) and continue to play trumpet, but this is another story.
I will try to post more regulary here, although most of the time twitter wins :)

To blob or not to blob

Tagged:  

Disks are slow and filesystems on top of them are even slower. But this is not a huge problem, what really concerns about performance is its degradation with increasing amount of data written to filesystem.

Directory listing containing a million of files will take ages and usually there is no way to speed this up using multiple threads.

But it is possible to create different data structure which will have constant random IO performance no matter how many objects were written into the storage. Well, to some degree.
And it has its own price - there is no silver bullet.

Libeblob is such a system. It was created with two main design goals in mind:

  • constant random IO performance
  • constant high write performancs

it is achieved by maintaining object index fully in RAM while writing data sequentially into the storage. So writes will not seek over the disk, since we use append write mode - new data is always placed after old one.
Any read takes at most one seek to get to the data. Well, it depends on filesystem under the storage, but modern extent-based filesystems will not face a problem with multiple seeks to get to requested file offset. Also we ran tests on raw block device under eblob and its performance was the same as with ext4.

Here is a small elliptics cluster test (just 4 servers) which contains 130+ millions of objects written (each one has 2 copies), total size is close to 4 terabytes. We used ext3 as underlying fs.

2000 rps of random IO out of commodity disks (each server has 4-disks raid10) within 100 ms latency.
Performance drops after 2krps is likely caused by VFS page cache which flushes eblob index to disk, since by default elliptics does not lock eblob index in RAM.

elliptics@eblob: 5000 rps of random IO requests in 1 Tb of data (100 millions of objects)

Tagged:  

Legend.

elliptics network - distributed hash table storage
eblob - low-level data storage used as one of IO backends for elliptics network

Hardware: 2 E5530 servers (16 cores, 24 Gb of RAM) each one is connected to SAS shelf (14 disks, ext4, raid10).
Data: 100 millions of objects (total of 1042 Gb) roughly equally spread over above server nodes.
Requests: random IO reads.

Result? We have it:


Reply time (left, in ms) @ number of requests per second (right, red inclined line)



Reply time (in ms) distribution

Net result: 3500 rps within 100 ms, 4000 rps witin 200 ms.
Not that bad I think...

New elliptics network release: 2.9.1

Tagged:  

This day has come: I made whole new and shiny elliptics network release. Amount of invasive changes is noticeable but yet a lot more I found should be changed.

We made 30 intermediate releases after 2.9.0 which revealed fair number of bugs and showed a number of different ways of further elliptics storage development.

A short changelog includes:

  • Huge log refactoring. Now logs are external entities which are referenced in elliptics core.
  • Allow to embed timestamps into data objects, which makes http cache headers processing (namely if-modified-since and last-modified) as fast as data reading. Timestamps are also stored in transaction histories, which previously had to be read from disk, thus making 2 times more disk seeks to handle client request. It is quite easily possible to implement other URI parameters embedding (one can specify object timestamp in URI as well as getting current time, for more details see example config file which comes in a package/archive).
  • Added region tag in XML returned by HTTP proxy. Region ID can be obtained using external library calls when properly configured.
  • Added EBLOB low-level IO backend. It is an append-only storage system which works with huge blob files and allows to lookup object with O(1) time optionally locking index in memory. I posted a number of its benchmarks here - it outperforms everything, although has a set of constraints, like need to call defragmentation tool. I plan to remove TokyoCabinet database IO backend in next release, since when it starts paging out its performance suffers greatly and elliptics network does not need its rich database API.
  • Added spec file for RPM builds. Debian build files were added quite long time ago, and we use deb packages to actually rollout new releases.
  • Added C++ and Python bindings. I stumbled upon an interesting user behaviour: when API is exteremely rich library is used noticeably less frequently, since people who are not very familiar with the low-level internals can not decide what exactly they need from this huge exported function set. In C++/Python bindings I implemented rather simple classes with limited API which should solve most of the common problems users face daily. Simplifying API leads to better acceptance of the project.
  • Changed library and include filenames from dnet to elliptics. In particular, this solves problem with already existing libdnet found in all popular distros. Now we use libelliptics and include/elliptics.
  • Switched server-side code from command-line options to configuration file. Allow low-level IO backends to have their own config options. Split former dnet_ioserv to server code with the same name and dnet_ioclient which is able to read/write/remove objects in the storage as well as requesting VFS/memory/CPU usage statistics.
  • Fair number of bug fixes.

With this release I mark 2.9 tree as experimental, since I will break compatibility soon to allow faster fault recovery process.

Milestone TODO list includes:

  • Refactor transaction history processing code and move it to elliptics core from low-level IO backends. Transaction logs will be stored in eblobs and will be optimized for performance. IO backends will only store data objects in different formats.
  • Remove TokyoCabinet IO backend in favour of EBLOB one.
  • Switch to thread-per-client IO processing model.
  • POHMELFS port.

Stay tuned!

And now its time to recall about OLOLO-intellect and related lexical and morphological analysis...

elliptics@eblob on 14-disks SAS raids

Tagged:  

We got 2 servers (E5530, 16 cores, 24 Gb of RAM) with 14-disks SAS shelf converted into raid10 in each.
Uploaded 30 millions of records (record size varies from 5 to 20 kb) total of 370 Gb of data.

Fired with random requests and got this graph:

2000 rps within 130-150 ms, according to dynamics we could get more, but stopped test.
Will run longer with additional 60 millions of records (total of about 100 millions objects in the storage).

Eblob index must take all available RAM on such machine, with 15 millions of records on each (30 millions total) we get about 9.9 Gb of virtual memory and 8.7 Gb of 'real' mem.

Should fit 100 millions of records on 2 nodes perfectly :)

In a meantime I cooked up 2.9.1 release, its release candidate will pass final tests and I will make an announcement in a day or so. We got 30 internal releases prior this one already, its time to publish new shiny features in a new package.

Stay tuned!

Shooting at elliptics@libeblob with lots of data

Tagged:  

We got 2 servers (4 SATA disk raid10, default ext4, 24 Gb of RAM, E5530, 16 cores), uploaded 30 millions of objects into elliptics network on top of libeblob backend. Total of 370 Gb on disks.

And got this with completely random requests:

900 rps within 100-150 ms. Quite a superb result for such hardware.
And we have this SAS shelfs:

md4 : active raid10 sdr[13] sdq[12] sdp[11] sdo[10] sdn[9] sdm[8] sdl[7]
                               sdk[6] sdj[5] sdi[4] sdh[3] sdg[2] sdf[1] sde[0]
      2050780256 blocks super 1.2 16K chunks 2 near-copies [14/14] [UUUUUUUUUUUUUU]

which will be used to host about 200 millions of objects (upload will be finished next week - writing server can not cope with such load :). Will also test the same amount of data on 4-sata-disk raid10 machines as well.

Stay tuned!

First low-level file storage libeblob release

Tagged:  

libeblob is a low-level IO library which stores data in huge blob files appending records one after another.

I implemented all missing functionality, added comments and README, and rolled out the first version: 0.0.1

Here is a short changelog:

  • defragmentation tool: entries to be deleted are only marked as removed, eblob_check will iterate over specified blob files and actually remove those blocks
  • off-line blob consistency checker: eblob_check can verify checksums for all records which have them
  • run-time sync support - dedicated thread runs fsync on all files on timed base
  • added documentation and comments

libeblob can be downloaded from git tree ($ git clone http://www.ioremap.net/git/eblob.git/) or archive.

Low-level data storage introduction: libeblob

Tagged:  

Elliptics network is a very modular key/value (distributed hash table) storage. Among others it allows to build pluggable low-level IO backends - those entities which store data.

Currently elliptics network supports following IO backends:

  • file IO backend, where each transaction is stored as a separate file
  • TokyoCabinet database backend - each transaction is stored as a record in appropriate table. I dropped BerkelyDB support because of its low performance, even though TC does not provide ACID contrary to BDB

And now I added append-only blob storage - libeblob.

Following features are already supported:

  • fast append-only updates which do not require disk seeks
  • compact index to populate lookup information from disk
  • multi-threaded index reading during starup
  • O(1) data location lookup time
  • ability to lock in-memory lookup index (hash table) to eliminate memory swap
  • readahead games with data and index blobs for maximum performance
  • multiple blob files support (tested with blob-file-as-block-device too)
  • optional sha256 on-disk checksumming
  • 2-stage write: prepare (which reserves the space) and commit (which calculates checksum and update in-memory and on-disk indexes). One can (re)write data using pwrite() in between without locks
  • usuall 1-stage write interface
  • flexible configuration of hash table size, flags, alignment

TODO list includes:

  • defragmentation tool: entries to be deleted are only marked as removed, we need to have a tool (or embed it into library) to actually remove those blocks from blob files
  • proper off-line blob consistency checker: we put a checksum into data blob, someone may want to check if read data matches
  • run-time sync support - we should have a dedicate thread to call syncs on timed base

Elliptics network uses it as one of its low-level IO backends. Numbers I posted (1, 2, 3) also highlight its advantages.

But during elliptics network integration with libeblob I found how unoptimal transaction history log was implemented in the storage (and maybe found an answer why monsters like Cassandra do not support it at all). Maybe its time to rethink and reinvent it though...

Anyway, there is a set of features I will create to complete this implementation as well as new elliptics network release (with C++ and Python bindings and new IO backend).
Stay tuned!

libeblob

Tagged:  

libeblob is an append-only low-level IO library, which saves data in blob files.

Following features are already supported:

  • fast append-only updates which do not require disk seeks
  • compact index to populate lookup information from disk
  • multi-threaded index reading during starеup
  • O(1) data location lookup time
  • ability to lock in-memory lookup index (hash table) to eliminate memory swap
  • readahead games with data and index blobs for maximum performance
  • multiple blob files support (tested with single blob file on block device too)
  • optional sha256 on-disk checksumming
  • 2-stage write: prepare (which reserves the space) and commit (which calculates checksum and update in-memory and on-disk indexes). One can (re)write data using pwrite() in between without locks
  • usuall 1-stage write interface
  • flexible configuration of hash table size, flags, alignment
  • defragmentation tool: entries to be deleted are only marked as removed, eblob_check will iterate over specified blob files and actually remove those blocks
  • off-line blob consistency checker: eblob_check can verify checksums for all records which have them enabled
  • run-time sync support - dedicated thread runs fsync in background on all files on timed base
  • Google's Snappy compression support

Elliptics network uses it as one of its low-level IO backends. Numbers I posted (1, 2, 3) also highlight its advantages.

libeblob can be downloaded from git tree ($ git clone http://www.ioremap.net/git/eblob.git/) or archive.

Syndicate content