
Elliptics, golang, GC and performance

Elliptics distributed storage has a native C/C++ client API as well as Python (shipped with the elliptics sources) and Golang bindings.

There is also Rift, the elliptics HTTP proxy.

I like golang because of its static type system, garbage collection and built-in lightweight threading model. Let’s test its HTTP proxying capabilities against an Elliptics node. I already tested the Elliptics cache purely against the native C++ client, and it showed an impressive 2 million requests per second from 10 nodes, or about 200-220 krps per node using the native API (very small requests, up to 100 bytes). What would the HTTP proxying numbers be?

First, I ran a single-client, single Rift proxy, single elliptics node test. After some tuning I got 23 krps for random writes of 1k-5k bytes per request (a very realistic load). I tested two cases: the elliptics node and Rift server on the same machine, and on different physical servers. Maximum latencies at the 98th percentile were about 25 ms at the end of the test (about 23 krps) and 0-3 ms at 18 krps, not counting the rare spikes on the graph below.

Rift HTTP proxy writing data into elliptics cache, 1k-5k bytes per request

Second, I tested a simple golang HTTP proxy with the same setup – a single elliptics node, a single proxy node and the Yandex Tank benchmark tool.
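
For reference, the golang proxy is essentially a thin HTTP handler that reads the request body and pushes it into the storage client. A minimal sketch of that shape is below; the Client interface and its WriteKey method are placeholders standing in for the actual elliptics-go binding, whose real API differs.

package main

import (
	"io/ioutil"
	"log"
	"net/http"
)

// Client stands in for the elliptics Go binding; WriteKey is a hypothetical
// call that stores value under key in the elliptics cache.
type Client interface {
	WriteKey(key string, value []byte) error
}

func proxyHandler(c Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Read the 1k-5k byte request body and hand it to the storage client.
		data, err := ioutil.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		if err := c.WriteKey(r.URL.Path, data); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	var c Client // wire up the real elliptics client here
	http.HandleFunc("/", proxyHandler(c))
	log.Fatal(http.ListenAndServe(":8080", nil))
}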

I ran tests using the following setups: golang 1.2 with GC=100 and GC=off, and golang 1.3 with the same garbage collection settings. Results are impressive: without garbage collection (GC=off) the golang 1.3 test ran with the same RPS and latencies as the native C++ client, although the proxy ate 90+ GB of RAM. Golang 1.2 showed 20% worse numbers.
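
The GC=100 / GC=off settings above refer to the Go collector target, which is normally set via the GOGC environment variable; the same thing can be done from inside the process with runtime/debug.SetGCPercent, as in this small sketch:

package main

import "runtime/debug"

func main() {
	// Equivalent to starting the proxy with GOGC=off: a negative
	// percentage disables the garbage collector entirely.
	debug.SetGCPercent(-1)

	// Equivalent to GOGC=100 (the default): a collection is triggered
	// once the heap has grown by 100% since the previous one.
	debug.SetGCPercent(100)

	// ... start the proxy here ...
}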

Golang HTTP proxy (garbage collection turned off) writing data into elliptics cache, 1k-5k bytes per request

Turning garbage collection on with the GC=100 setting led to much worse results than the native C++ client, but they are still quite impressive. I got about the same 23 krps in this test, but latencies at 20 krps were close to 80-100 ms, and about 20-40 ms in the middle of the test. Golang 1.2 showed 30-50% worse results here.

Golang HTTP proxy (GC=100 garbage collection setting) writing data into elliptics cache, 1k-5k bytes per request

The numbers are not that bad for a single-node setup. Writing asynchronous parallel code in Golang is vastly simpler than doing it in C++ with its forest of callbacks, so I will stick to Golang for async network code for now. I will wait for Rust to stabilize, though.

Monitoring of Elliptics server node

Meet the scalable monitoring subsystem for Elliptics server nodes.

Monitoring allows you to track the performance of various parts of an Elliptics server node, such as command handling, the cache, backends, etc.

It includes a simple HTTP server which provides JSON statistics via a REST API. The statistics can be requested in full or partially, by category. The list of available statistics categories can be found at http://host:monitoring_port/list.
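
As a quick sketch, the statistics can be pulled with any HTTP client; the Go snippet below fetches the category list. The host and monitoring port are placeholders, and the JSON is left untyped since its exact layout is described in the documentation linked below.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// fetchJSON GETs a monitoring URL and decodes the JSON reply into a generic value.
func fetchJSON(url string) (interface{}, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var stats interface{}
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		return nil, err
	}
	return stats, nil
}

func main() {
	// "host" and "20000" are placeholders for the node address and the
	// monitoring port configured on the elliptics server.
	list, err := fetchJSON("http://host:20000/list")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("available statistics categories:", list)
}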

Monitoring can be extended via external providers, which allow you to deepen the basic statistics.

For more details check out the docs:
http://doc.reverbrain.com/elliptics:monitoring – user documentation that describes how to gather and read statistics
http://doc.reverbrain.com/elliptics:monitoring-inside – developer documentation that describes how monitoring is implemented and how you can write a custom statistics provider.

Does LevelDB suck?

We have the following config for the Elliptics LevelDB backend (a 60 GB block cache, a 1 GB write buffer and 10 MB blocks):

sync = 0
root = /opt/elliptics/leveldb
log = /var/log/elliptics/leveldb.log
cache_size = 64424509440
write_buffer_size = 1073741824
block_size = 10485760
max_open_files = 10000
block_restart_interval = 16

Yet pushing 1 KB chunks of quite compressible data (ASCII strings) into a single server with 128 GB of RAM and 4 SATA disks combined into RAID10 ends up at a poor 6-7 krps.

If the request rate is about 20 krps, the median reply time is about 7 seconds (!)

Elliptics with the Eblob backend on the same machine easily handles the same load.

dstat shows that it is not the disk (well, at 20 krps it is the disk), but before that point it is neither CPU nor disk bound – LevelDB just does not allow more than 5-7 krps with 1 KB data chunks from parallel threads (we have 8-64 IO threads depending on config). When Snappy compression is enabled, things get worse.

Is it even possible to push 20 MB/s into LevelDB with small-to-medium (1 KB) chunks?
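
For what it’s worth, a similar write pattern can be reproduced outside of Elliptics with a standalone benchmark. The sketch below uses the goleveldb port (github.com/syndtr/goleveldb) simply because it is easy to drive from Go, so it only roughly mirrors the C++ backend and the config above (write buffer, block size, compression off), and its numbers are not directly comparable.

package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

func main() {
	// Options roughly mirroring the backend config above: large write buffer,
	// big blocks, compression disabled.
	db, err := leveldb.OpenFile("/tmp/leveldb-bench", &opt.Options{
		WriteBuffer: 1 << 30,  // 1 GB
		BlockSize:   10 << 20, // 10 MB
		Compression: opt.NoCompression,
	})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	const (
		threads  = 64   // parallel writers, similar to the elliptics IO threads
		chunk    = 1024 // 1 KB values
		duration = 30 * time.Second
	)

	value := make([]byte, chunk)
	var total int64
	var wg sync.WaitGroup
	deadline := time.Now().Add(duration)

	for i := 0; i < threads; i++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			rnd := rand.New(rand.NewSource(seed))
			key := make([]byte, 16)
			for time.Now().Before(deadline) {
				rnd.Read(key) // random keys, as in the original test
				if err := db.Put(key, value, nil); err != nil {
					panic(err)
				}
				atomic.AddInt64(&total, 1)
			}
		}(int64(i))
	}
	wg.Wait()

	fmt.Printf("%.0f puts/sec\n", float64(total)/duration.Seconds())
}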

2 million requests per second from a 1-terabyte cache

This is a reality now, and most notably, it takes a rather small cluster configuration: 10 machines with 100+ GB of RAM each and the Elliptics cache.

We do not really need that level of requests in any of our clusters, but it was fun to push the limits. Ruslan Nigmatullin implemented a native elliptics plugin for Phantom, a high-performance IO engine.
Our friends at Yandex use Phantom for the most performance-critical tasks when serving ads.

The Elliptics Phantom plugin utilizes Phantom’s coroutines and IO capabilities, so we are able to achieve as much as 220 krps from a single machine to a single elliptics server (configured for in-memory cache operations only).

It comes at a price: 32 cores (actually only 16 real cores plus hyperthreading) on each client and server, 1+ million context switches per second, and 300+ thousand hardware interrupts per second, most likely from the NIC (200+ thousand packets per second and 600+ Mbit/s monitored on the network switch).
8 cores handle only the in-kernel load – mostly interrupt processing – so userspace only has 4 real cores left, and they are fully drained by the socket-receiving machinery (which we can actually improve).

So if you are thinking about an application cache or even an HTTP cache, Elliptics is a good candidate. It uses DHT mode for a single replica, so your applications do not need to think about how messages are routed and where data is stored; it also automatically detects newly joined nodes and so on.
The HTTP interface is implemented via a FastCGI proxy, which also means that you can store the Nginx cache in elliptics (which solves the problem of data replication and deduplication – no more wasted space on multiple cache machines).