LevelDB (and actually RocksDB) is a very well known local storage for small records. It is implemented over log-structured merge trees and optimized for write performance.
LevelDB is quite horrible for records larger than 1Kb because of its merge operation – it quickly reaches level where it has to always merge trees and it takes seconds to complete.
But for small keys it is very useful and fast.
RebornDB is a proxy storage, which operates on top of LevelDB/RocksDB and provides Redis API to clients. Basically, it is a Redis on top of on-disk leveldb storage.
There is an interesting sharding scheme – there are 1024 slots each of which uses its own replica set, which can be configured by admin. When client writes a key, it is hashed and one of the 1024 slots is being selected (using modulo (% 1024) operation). When admin decides that some slots should be moved to new/different machine, it uses command line tool to reconfigure the proxies. During migration slots in question are available for IO, although it may span multiple servers since proxy doesn’t yet know whether required key has been or hasn’t been yet moved to the new destination.
Having many slots is a bit more flexible than old-school sharding which uses number of servers, although quite far from automatic ID range generation – manual resharding doesn’t scale for admins.
RebornDB uses zookeeper/etcd to store information about slot/server matching and per-slot replication policies. This doesn’t force every operation to contact zookeeper (this actually kills this service), instead every proxy has abovementioned info locally and every reconfiguration also updates info on every proxy.
There is not that much information about data recovery (and migration) except that it is implemented on key-by-key basis. Given that leveldb databases usually contain tens-to-hundreds millions of keys recovery may take a real while, snapshot migration is on todo list.
Everyone knows Redis – high-performance persistent cache system. Here is an article on how Redis is being used in Twitter.
It happens that Twitter not only forked and extended 1 year old Redis version, but looks like it doesn’t have plans to upgrade. Redis and its latencies are much-much-much faster than Twitter infrastructure written in Java because of GC in JVM. This allows to put a bunch of proxies on top of Redis caching cluster to do cluster management, the thing Redis misses for a while.
Also Twitter uses Redis only to cache data, doesn’t care about consistency issues, doesn’t use persistent caching, at least article says data is being thrown away when server goes offline.
It is client responsibility to read data from disk storage if there is no data in the cache.
Article desribes Twitter timeline architecture, and that’s quite weird to me: instead of having list of semifixed (or limited by size) chunks of timeline which are loaded on demand, they created a bunch of realtime updated structures in Redis, found non-trivial consistency issues and eventually ended up with the same simple approach of having ‘chunks’ of timeline stored in cache.
I started to compare cache management in Twitter using Redis with what we have in Reverbrain for caching: our Elliptics SLRU cache. It uses persistent caching system (which was also described a bit in article in comparison with memcache), but also uses persistent storage to backup cache, and while cache is actually segmented LRU, its backing store can be arbitrary large at size compared to Redis.
Although article is written as ‘set of facts’ somehow cut out of context (it was interview with the twitter employee), it is a good reading to think about caching, JVM, Redis and cache cluster architecture.
Elliptics contains built-in LRU in-memory cache with timeouts. This is a rather demanded feature used in our most intensive workloads.
Elliptics distributed cache is more preferable than memcache and derivatives because of DHT IO balancing built-in, automatic reconnection and more user-friendliness, i.e. generally client doesn’t care about where to find given key – elliptics takes care itself to lookup, reconnect and rebalance.
But memory itself is used not only for temporal caches – clients want to have a persistent cache frequently, i.e. when data stored in such ‘cache’ can survive server outages. Cache IO performance should not be affected by real disk IO, instead some tricky schemes must be employed not to degrade speed.
Likely the most well-known storage for such workload is Redis. Whilst it has built-in (or implemented in client library) partitioning support (Redis Cluster is not production ready though) and beta-staged Sentinel – failover daemon, it is yet generally a single-master solution.
I thought of adding Redis backend into Elliptics – this provides automatic recovery (when new iterators are ready), failover, multiple replicas and so on. Given Redis performn