Elliptics network: how much parallel is parallel

Tagged:  

Its time for the new elliptics network benchmark. Today with small Linux cluster (1-4 nodes) and multiple clients (1-3).
As usual small to medium writes (100 and 1024 bytes). I selected Tokyo Cabinet IO backend, which is the fastest among supported (BerkeleyDB and file IO backends were not even considered, they are way too slow for this kind of test).

Server hardware: 2 servers, each of 8 cores (E5345 system, 2.33 GHz each core), 8 Gb RAM, two HITACHI UltraStar 15K300 SAS disks, 1 Gbit ethernet.
Each server hosted two elliptics nodes - each node worked with own disk, but shared network bandwidth.

Clients - 3 nodes, each of 4 cores (E5345 system, 2.33 GHz each core), 8 GB RAM, 1 Gbit ethernet.

Storage nodes were created on top of ext2 mounted volumes, hash Tokyo Cabinet databases were used: single table for data per node, shared by all threads (did you know TC uses single lock for the whole database and not per-bucket one, which greatly degrades its performance in multi-threaded environment?), no history updates.

Following tests were performed: 100-bytes writes into network with 1, 2 (on the same physical server) and 4 nodes (2 physical servers) from 1, 2 and 3 clients; 1024-byte writes with the same configuration.

I tried to rebuild the database for each test, but only 100-bytes write tests worked with clean database each run, 1024 writes updated the same database with increased number of clients (I just started the new clients in addition to already running).

And now the most interesting: the results!

Single writer - multiple servers.


Single writer, multiple servers

Client used 3 CPUs at maximum (while it could use more, its a client implementation issue) when connected to all 4 servers, so results do not represent the maximum speed. Servers were loafing at 50-60% each node all the time waiting for IO completion.

2 and 3 writers, 4 servers.


2 writers, 4 servers


3 writers, 4 servers

Stable linear scalability - aggregated number of operations per second grows with number of clients.
Please note, that there are only two servers, but 4 nodes, so each two nodes share the same gigabit link.

Medium-sized writes with multiple clients.


Multiple writers (1024 byte packets), 4 servers

Linear scalability for 1024 byte writes around 50 MB/s - half of the 1 gigabit link per server node.

Tokyo Cabinet database on top of ext2 filesystem degrades with time though, and this can be observed even on the presented graphs (there is also ext2 sync problems with noticeble gaps on non-smoothed graphs).

For some reason I never saw small-sized IO tests (i.e. effectively latency tests) from major distributed system vendors, which could show bottlenecks. Bulk performance of the elliptics network proved to be good in the presented environment.

Eventually I will run a large-scale test with tens to hundreds server nodes and clients, but for today it is enough. 2.3.1 (or 2.4.0, need to check how many features were implemented) release will see the light in a day or so, which will include hardened failure management, automatic tests and bug fixes. I think this will be the last 'self-containing' release, all further features will depend on external libraries (like distributed locks, new IO backens, POHMELFS port and so on).

Enjoy!

The latest release of Tokyo Cabinet is able to do live defragmentation.
Maybe this can help with the degradation over time you reported in your test.

How can one configure this ? Command Example please

I used 1.4.14 version, while the latest is .21, so it is possible that I missed some interesting features, which could further improve results.

1.4.14 is getting on a bit, and a lot of improvements happened since.

You really should try a new benchmark using a newer TC version.

Live defragmentation is disabled by default, but you can enable it with tchdbsetdfunit() and tcbdbsetdfunit().

Although in japanese (Google translation can help a bit if you can't read japanese), this article describes the fragmentation in TC and how to deal with it, either explicitly or using live defragmentation: http://alpha.mixi.co.jp/blog/?p=862