Tag Archives: cache

Tachyon – “memory-centric” distributed storage

Tachyon is a distributed filesystem with aggressive caching that runs on top of HDFS. By ‘filesystem’ one has to understand Java File objects, the related methods and Hadoop compatibility.

Tachyon can run on top of HDFS, S3 and GlusterFS; in fact, its fault tolerance is whatever the underlying store provides – for HDFS that means a single name node. Tachyon requires ZooKeeper to maintain the master server, which performs all operations and checks for consistency.

By their own benchmark it scales linearly as new nodes are added.

Because of its aggressive caching, Tachyon's performance is way ahead of in-memory HDFS. Transparent in-memory caching is a great way to speed up Hadoop and particularly Spark, which was designed for immutable datasets with multiple access patterns.

I can also recommend the Andreessen Horowitz article as an example of high-profile investment and management language – it looks like it even introduced ‘memory-centric’ as a new technical term.

As a side note, I cannot resist comparing this caching system with the Elliptics distributed cache. Instead of providing a Java File API, we concentrated on HTTP access through Backrunner – the Elliptics HTTP load balancer and proxy.

Building transparent subsystems which do not require semi-manual caching policies and additional configuration layers is the way forward for distributed systems, which already require quite a lot of configuration to run and maintain. With a16z's investment support I'm pretty sure we will hear more about Tachyon soon.

Facebook’s hot, warm and graph data

Facebook is an amazing place for developing data storage and access algorithms. Their scale allows – and actually requires – them to find new ways of storing information.

For example, social graph data: it is stored in MySQL and used to be fronted by an aggressive memcached cache. Now Facebook uses TAO – a graph-aware store for the social graph with its own graph cache.

Facebook uses 3 levels of persistent storage. Hot data is stored in the Haystack storage – its lowest level is the prototype I used for Eblob. I would map Elliptics storage with a DHT, i.e. multiple servers storing one data replica, to this level of abstraction.

Second comes the so-called WARM data – data which is accessed two orders of magnitude less frequently than that at the HOT level. Warm data is stored in the f4 storage; Facebook describes this split and its motivation in this article.
I saw an Elliptics installation with one disk per group, i.e. without a DHT, as the storage for this access level of the data.

And the last is COLD storage – something weird and scientifically crazy like a robotic arm with a stack of Blu-ray disks.

Various flash caches benchmarked

Guys from Swrve have benchmarked various flash caches in front of an Amazon EBS volume and shared the results.

The data shows that Facebook's Flashcache performs best. Swrve decided to trade safety for performance and started using a RAID-0 stripe of EBS volumes plus an SSD cache in front of them.

Here is the result: the Y axis is events processed per second (higher is better); the X axis is time.

  • single 200GB EBS volume (blue)
  • 4 x 50GB EBS volumes in RAID-0 stripes (red)
  • 4 x 50GB EBS volumes in RAID-0 stripes, with an enhanceio cache device (green)

EnhanceIO is a fork of Facebook's Flashcache with nicer tools. By itself it increases performance by about 25% in their workload, while splitting the EBS volume into RAID-0 added another 25%.

The Swrve link above contains more details on the benchmark and other results.

How Redis is used at Twitter

Everyone knows Redis – the high-performance persistent cache system. Here is an article on how Redis is used at Twitter.

It turns out that Twitter not only forked and extended a year-old Redis version, but it looks like it has no plans to upgrade. Redis and its latencies are much, much faster than Twitter's infrastructure written in Java, because of GC in the JVM. This allows putting a bunch of proxies on top of the Redis caching cluster to do cluster management – the thing Redis has been missing for a while.

Also, Twitter uses Redis only to cache data: it doesn't care about consistency issues and doesn't use persistent caching – at least the article says data is thrown away when a server goes offline.
It is the client's responsibility to read data from disk storage if there is no data in the cache.

The article describes Twitter's timeline architecture, and that's quite weird to me: instead of having a list of semi-fixed (or size-limited) chunks of the timeline which are loaded on demand, they created a bunch of realtime-updated structures in Redis, found non-trivial consistency issues and eventually ended up with the same simple approach of having ‘chunks’ of the timeline stored in the cache.
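That ‘simple approach’ fits in a few lines of Go. The chunk size, the type names and the in-memory layout below are all illustrative, not Twitter's actual design – the point is just that fixed-size chunks make the cache structure trivial:

```go
package main

import "fmt"

// chunkSize is arbitrary for the sketch; a real timeline chunk would
// hold hundreds of tweet IDs.
const chunkSize = 3

// timeline stores tweets as fixed-size chunks, newest chunk last;
// clients page through chunks on demand instead of the cache
// maintaining a realtime-updated structure per user.
type timeline struct {
	chunks [][]string
}

func (t *timeline) append(tweet string) {
	last := len(t.chunks) - 1
	if last < 0 || len(t.chunks[last]) == chunkSize {
		t.chunks = append(t.chunks, nil) // start a new chunk
		last++
	}
	t.chunks[last] = append(t.chunks[last], tweet)
}

// chunk returns one page of the timeline; a real cache would fall
// back to disk storage for a missing chunk here.
func (t *timeline) chunk(i int) []string {
	if i < 0 || i >= len(t.chunks) {
		return nil
	}
	return t.chunks[i]
}

func main() {
	var tl timeline
	for i := 1; i <= 7; i++ {
		tl.append(fmt.Sprintf("tweet-%d", i))
	}
	fmt.Println(len(tl.chunks)) // 7 tweets, chunks of 3 -> 3 chunks
	fmt.Println(tl.chunk(2))    // the last, partially filled chunk
}
```

Consistency here degrades to "a chunk may be slightly stale", which is exactly the property Twitter ended up accepting.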

I started to compare Twitter's cache management with Redis against what we have in Reverbrain for caching: our Elliptics SLRU cache. It is a persistent caching system (persistence was also touched on a bit in the article, in comparison with memcache): it uses persistent storage to back the cache, and while the cache itself is a segmented LRU, its backing store can be arbitrarily large compared to Redis.

Although the article is written as a ‘set of facts’ somewhat cut out of context (it was an interview with a Twitter employee), it is good reading to think about caching, the JVM, Redis and cache cluster architecture.

Elliptics, golang, GC and performance

Elliptics distributed storage has a native C/C++ client API as well as Python (shipped with the elliptics sources) and Golang bindings.

There is also Rift, the elliptics HTTP proxy.

I like golang because of its static type system, garbage collection and built-in lightweight threading model. Let's test its HTTP proxying capabilities against an Elliptics node. I already tested the Elliptics cache purely with the native C++ client: it showed an impressive 2 million requests per second from 10 nodes, or about 200–220 krps per node, using the native API (very small requests, up to 100 bytes). What would the HTTP proxying numbers be?

First, I ran a single-client, single-Rift-proxy, single-elliptics-node test. After some tuning I got 23 krps for random writes of 1k–5k bytes per request (a very realistic load). I tested two cases: the elliptics node and the Rift server on the same machine and on different physical servers. Maximum latencies at the 98th percentile were about 25 ms at the end of the test (around 23 krps) and 0–3 ms at 18 krps, not counting the rare spikes on the graph below.

Rift HTTP proxy writing data into the elliptics cache, 1k–5k bytes per request

Second, I tested a simple golang HTTP proxy with the same setup: a single elliptics node, a single proxy node and the Yandex Tank benchmark tool.

I ran tests with the following setups: golang 1.2 with GOGC=100 and GOGC=off, and golang 1.3 with the same garbage collection settings. The results are impressive: without garbage collection (GOGC=off) the golang 1.3 test ran with the same RPS and latencies as the native C++ client, although the proxy ate 90+ GB of RAM. Golang 1.2 showed 20% worse numbers.

Golang HTTP proxy (garbage collection turned off) writing data into the elliptics cache, 1k–5k bytes per request

Turning garbage collection on with the GOGC=100 setting led to much worse results than the native C++ client, but they are still quite impressive. I got the same RPS of about 23 krps, but latencies at 20 krps were close to 80–100 ms, and about 20–40 ms in the middle of the test. Golang 1.2 showed 30–50% worse results here.

Golang HTTP proxy (GOGC=100 garbage collection setting) writing data into the elliptics cache, 1k–5k bytes per request

The numbers are not that bad for a single-node setup. Writing asynchronous parallel code in Golang is incomparably simpler than in C++ with its forest of callbacks, so I will stick to Golang for async network code for now. I will wait for Rust to stabilize though.

Rift persistent caching

Rift allows you to store popular content in separate groups for caching. This is quite different from the elliptics cache, where data is stored in memory in segmented LRU lists. Persistent caching allows you to temporarily put your data into additional elliptics groups, which will serve IO requests. This is usually very useful for heavy content like big images or audio/video files, which are rather expensive to put into a memory cache.

One can update the list of objects to be cached as well as the per-object list of additional groups. There is a cache.py tool with extensive help for working with cached keys. It grabs the requested keys from the source groups and puts them into the caching groups, and it also updates a special elliptics cache-list object which Rift checks periodically (the timeout option in the cache configuration block of Rift). As soon as Rift finds new keys in the elliptics cache-list object, it starts serving IO from those caching groups as well as from the original groups specified in the bucket. When using the cache.py tool, please note that its file-namespace option is actually a bucket name.

To remove an object from the cache one should use the same cache.py tool: it removes the data from the caching groups (please note that physically removing objects from disk in elliptics may require running online eblob defragmentation) and updates the special elliptics cache-list object. That object will be reread some time in the future, so if a requested key cannot be found in the cache, it will automatically be served from the original bucket groups.
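On the Rift side this flow boils down to a periodically refreshed key list that is consulted on every lookup. A rough Go sketch of that idea – the type and function names here are mine for illustration, not Rift's actual internals:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cacheList mimics how a proxy could track which keys have been
// replicated into caching groups: a map of key -> extra groups,
// refreshed on a timer from a special list object.
type cacheList struct {
	mu     sync.RWMutex
	cached map[string][]int // key -> caching group numbers
}

// refresh periodically re-reads the list; load stands in for reading
// the special elliptics cache-list object from the storage.
func (c *cacheList) refresh(interval time.Duration, load func() map[string][]int) {
	for range time.Tick(interval) {
		fresh := load()
		c.mu.Lock()
		c.cached = fresh
		c.mu.Unlock()
	}
}

// groupsFor returns the groups to serve IO from: the caching groups
// (if the key is in the list) plus the bucket's original groups, so a
// key evicted from the cache list falls back to the original groups.
func (c *cacheList) groupsFor(key string, original []int) []int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append(append([]int{}, c.cached[key]...), original...)
}

func main() {
	cl := &cacheList{cached: map[string][]int{"popular-video": {10, 11}}}
	fmt.Println(cl.groupsFor("popular-video", []int{1, 2})) // [10 11 1 2]
	fmt.Println(cl.groupsFor("rare-file", []int{1, 2}))     // [1 2]
}
```

The important property is visible in groupsFor: removing a key from the list never makes it unreadable, it only stops the caching groups from being tried first.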

SLRU cache in Elliptics

In order to add some flexibility to our cache layer, we replaced the simple LRU cache with a segmented LRU cache. Its structure perfectly matches the concept of “hot data” which is so common in real-world applications.

In a segmented LRU, cached data is divided into several pages depending on its access frequency. This way one-time requests only touch temporary pages and do not affect data in the popular pages, thus protecting hot data from overflow evictions.
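For illustration, here is a minimal two-segment SLRU in Go: a probationary segment for newcomers and a protected segment for keys hit more than once. This is a sketch of the general technique, not the actual elliptics implementation:

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is a cached key/value pair plus the segment it lives in.
type entry struct {
	key, val  string
	protected bool
}

// SLRU: new keys enter the probationary segment; a second hit
// promotes them to the protected segment, so one-time requests
// cannot evict hot data.
type SLRU struct {
	probCap, protCap int
	prob, prot       *list.List // front = most recently used
	items            map[string]*list.Element
}

func NewSLRU(probCap, protCap int) *SLRU {
	return &SLRU{probCap: probCap, protCap: protCap,
		prob: list.New(), prot: list.New(),
		items: map[string]*list.Element{}}
}

func (c *SLRU) Put(key, val string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.touch(el)
		return
	}
	// New keys always start in the probationary segment.
	el := c.prob.PushFront(&entry{key: key, val: val})
	c.items[key] = el
	if c.prob.Len() > c.probCap {
		c.evict() // one-time keys fall out here
	}
}

func (c *SLRU) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.touch(el)
	return el.Value.(*entry).val, true
}

// touch promotes a probationary entry to protected, or refreshes an
// already-protected one; protected overflow demotes the coldest
// protected entry back to probationary instead of dropping it.
func (c *SLRU) touch(el *list.Element) {
	e := el.Value.(*entry)
	if e.protected {
		c.prot.MoveToFront(el)
		return
	}
	c.prob.Remove(el)
	e.protected = true
	c.items[e.key] = c.prot.PushFront(e)
	if c.prot.Len() > c.protCap {
		old := c.prot.Back()
		oe := old.Value.(*entry)
		c.prot.Remove(old)
		oe.protected = false
		c.items[oe.key] = c.prob.PushFront(oe)
		if c.prob.Len() > c.probCap {
			c.evict()
		}
	}
}

func (c *SLRU) evict() {
	old := c.prob.Back()
	c.prob.Remove(old)
	delete(c.items, old.Value.(*entry).key)
}

func main() {
	c := NewSLRU(2, 2)
	c.Put("hot", "1")
	c.Get("hot")       // second hit: promoted to protected
	c.Put("a", "2")
	c.Put("b", "3")
	c.Put("cold", "4") // probationary overflows, "a" is evicted
	_, hotOK := c.Get("hot")
	_, aOK := c.Get("a")
	fmt.Println(hotOK, aOK) // true false
}
```

Note how the one-time keys churn through the probationary list while "hot" sits untouched in the protected segment — that is the whole point of the split.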

One other feature implemented in the new cache, in order to decrease the size of a cache record, is replacing the binary search tree + heap combination with a single structure, the Cartesian tree (treap), which covers both roles just as effectively.

For more information and implementation details check out our docs: http://doc.reverbrain.com/elliptics:cache

New elliptics HTTP proxy, authentication, caching and Go bindings

I have separated the elliptics HTTP proxy from TheVoid, our high-performance HTTP server framework. TheVoid continues to be a framework for writing HTTP servers in C++, whilst Rift becomes the elliptics HTTP access point.

Rift supports the usual object upload/get/removal as well as upload/download flow control. The latter (soon to be the default and the only possible mode) is basically an arbiter which does not allow reading more data from the client until the current chunk has been written. It uses chunked upload and download. Rift supports range requests (the Range: HTTP header).

There is basic authentication support in Rift. I will extend it to a per-bucket fashion similar to what Amazon S3 has (not the same API though). Rift also supports multiple-group caching, which is extremely useful for bigger content, when you suddenly decide that given objects have to be spread over many groups instead of just those they were originally written into. There is ‘a button’ (basically a Python script) which copies given keys from their original groups into the caching ones and broadcasts updates to all Rift proxies by updating special keys which the proxies check periodically. Caching can be turned on and off on a per-key basis.

One can create special SSD caching groups, for example, and put the needed files there for some time. Or those can be commodity spinning disks for larger files like video content.
More details later on the documentation site.

Everything above, plus several new features, will be available both in the Rift proxies and in the new cluster we are currently developing. Not that it will be plain Amazon S3, but something similar. More details later this year :)

And right now one can check out the new Go elliptics bindings project started by Anton Tyurin. It will be the base for our HTTP entry point.

Distributed in-memory cache in Elliptics

Ruslan “elessar” Nigmatullin wrote a new article on the elliptics cache architecture: http://doc.reverbrain.com/elliptics:cache

It describes how the cache is organized and how one can use it to speed up disk operations or just to store data in memory only.
Some non-trivial cases are also touched on: how to organize the cache for append writes and partial updates, as well as various corner cases.


Pure in-memory cache VS persistent cache?

Elliptics contains a built-in LRU in-memory cache with timeouts. This is a rather popular feature used in our most intensive workloads.

The Elliptics distributed cache is preferable to memcache and its derivatives because of the built-in DHT IO balancing, automatic reconnection and overall user-friendliness: generally the client does not care where to find a given key – elliptics itself takes care of lookup, reconnection and rebalancing.

But memory is used not only for temporary caches – clients frequently want a persistent cache, i.e. one where the data stored in such a ‘cache’ can survive server outages. Cache IO performance should not be affected by real disk IO; instead, some tricky schemes must be employed so that speed does not degrade.

Probably the most well-known storage for such a workload is Redis. Whilst it has partitioning support (built in or implemented in the client library – Redis Cluster is not production-ready though) and the beta-stage Sentinel failover daemon, it is still generally a single-master solution.

I thought of adding a Redis backend to Elliptics – this would provide automatic recovery (when the new iterators are ready), failover, multiple replicas and so on. Given Redis' performance…

Elliptics cache operations

I have not yet updated doc.ioremap.net, but it will be done soon. So I will write it down here first.

Elliptics distributed storage has a built-in cache, which can (or can not, depending on how you write your data) be backed by disk storage. The cache is a rather simple LRU, but we will extend it with RED or maybe more complex eviction mechanisms.
Each cached element has an expiration timeout; by default entries never expire.

I have added the possibility to remove a cached entry not only from the in-memory cache but also from disk, either when the expiration timeout fires or when you remove the object by hand. The C++/Python APIs were also extended to support this kind of operation.

Overall this feature is extremely useful for cases like a session store. You generally do not want to lose sessions and want to be able to read them back – otherwise you could just use a plain in-memory cache – but you also know that your in-memory session pool is rather limited for frequently accessing users.

Elliptics processing layers and distributed in-memory cache

It was long ago that elliptics became not only a storage system but a processing engine. It contains several layers, and only the last one is the actual storage backend (elliptics has 3 different storage backends to date, and it is rather trivial to create your own new one).

The highest is the routing layer, where the notion of a group exists. A group is basically a data replica: it is a set of nodes logically bound to each other (basically they are bound by the admin's decision) and identified by a group number.

A group may consist of several nodes, each of which joins the well-known hash ring. This is the layer where the DHT lives. If your groups are small enough (we have production installations with hundreds of single-node groups, where a node is not even a whole server but a smaller instance), then no DHT will operate in your setup.

That was the routing layer: a client issues a command (read, write, execute, stat or whatever else) with an ID which consists of a group number and an object ID. The latter is usually the sha512 of some string you used when calling the high-level API, but it is possible to specify it manually (the object ID is limited to 512 bits by default). A command with a given ID is routed to the appropriate node in the storage and executed there.
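The routing idea – hash the key into an object ID, then pick the node owning that range on the group's ring – can be sketched like this. The 8-byte ID truncation and the data structures are simplifications for illustration; the real elliptics route table operates on the full 512-bit IDs:

```go
package main

import (
	"crypto/sha512"
	"encoding/binary"
	"fmt"
	"sort"
)

// node owns a position on the group's hash ring; a command with a
// given object ID goes to the first node clockwise from the ID.
type node struct {
	name string
	pos  uint64
}

type ring []node // must be kept sorted by pos

// objectID hashes the key the way the high-level API would; elliptics
// IDs are 512 bits, but for the sketch the first 8 bytes of the
// sha512 are enough to place the object on the ring.
func objectID(key string) uint64 {
	sum := sha512.Sum512([]byte(key))
	return binary.BigEndian.Uint64(sum[:8])
}

func (r ring) route(key string) string {
	id := objectID(key)
	// Find the first node at or after the ID.
	i := sort.Search(len(r), func(i int) bool { return r[i].pos >= id })
	if i == len(r) {
		i = 0 // wrap around the ring
	}
	return r[i].name
}

func main() {
	// One data replica (group) of three nodes.
	group := ring{
		{"node-1", 1 << 61},
		{"node-2", 2 << 61},
		{"node-3", 3 << 61},
	}
	for _, key := range []string{"some-key", "another-key"} {
		fmt.Printf("%s -> %s\n", key, group.route(key))
	}
}
```

A single-node group degenerates to a ring of one entry, which is exactly the "no DHT" case mentioned above: every ID routes to the same node.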

There are several commands reserved for the elliptics core, like route table updates, the low-level node join protocol and so on, but the rest of them are directed to the IO layer or the backend layer. Basically, all other commands (including those not listed in the elliptics API – you can create your own commands) are pushed down to the backend, which decides what to return.

In the middle of this process there is a simple LRU cache with timeouts. The elliptics core intercepts read/write/del commands and, if the IO request contains cache flags, also updates records in the cache. Read and del commands always go to the cache first, while a write may write data to cache and disk or just to the cache (depending on the IO flags).

It is also possible to specify a lifetime timeout for every object written into the cache. By default it equals zero, i.e. infinite lifetime. When the timeout fires, the object is removed from the cache, but not from disk (if it was written both to disk and to cache). The timeout is specified in seconds.

With all the power of the elliptics routing protocol and transport layer (which takes on the obligations to discover new nodes, reconnect, update network state weights and so on), one can easily implement a distributed in-memory cache (optionally backed by a disk backend) which does not suffer when nodes are removed or added – elliptics automatically detects that and shifts the load.