Tag Archives: Documentation

Data recovery in Elliptics

As you know, Elliptics includes built-in utility for data synchronization both within one replica and between number of replicas. In this article I describe different modes in which synchronization could be run, how it works and give details on situations when it could be used.

dnet_recovery is the main utility for data synchronization in Elliptics. It is distributed with the package elliptics-client. dnet_recovery supports 2 modes: `merge` and `dc`. `merge` is intended for data synchronization within one replica and `dc` – for data synchronization between number of replicas. Each of the modes can automatically search keys for synchronization or use keys from dump file.

Most of described bellow can be found at recovery doxygen documentation. All doxygen documentation will be published on doc.reverbrain.com soon.

Continue reading

Documentation updates

We always improve our documentation.
Now we added detailed page about elliptics configuration that is available at http://doc.reverbrain.com/elliptics:configuration.
Also we added description of dnet_balancer tool that allows to balance DHT ring partition within one group. It can be found at http://doc.reverbrain.com/elliptics:tools#dnet_balancer.

Monitoring of Elliptics server node

Meet scalable monitoring subsystem for Elliptics server nodes.

Monitoring allows to track performance of various parts of Elliptics server node such as: commands handling, cache, backend etc.

It includes simple HTTP server which provides json statistics via REST API. The statistics can be requested fully or partly by category. List of available statistics categories can be found at http://host:monitoring_port/list.

Monitoring can be extended via external providers that allow to deepen basic statistics.

For more details check out docs:
http://doc.reverbrain.com/elliptics:monitoring – users documentation that describes how to gather and read statistics
http://doc.reverbrain.com/elliptics:monitoring-inside – developers documentation that describes how monitoring is implemented and how you can write custom statistics provider.

New small article on doc wiki: elliptics atomic operations, locks and compare-and-swap write

This small article tells about locks in elliptics and how atomic operations work.
In particular, it desribes how EXEC command is being processed and where locks are taken.

In few words, all operations (READ, WRITE, EXEC or any other) are processed atomically in single group in elliptics unless special flag is taken.

http://doc.reverbrain.com/elliptics:atomic-operations

Discussion group for our technologies

We created the whole stack of technologies ranging from the lowest level of data storage upto high-level processing pipeline.
It happend that we never actually had a normal discussion group for most of them.

So, I opened google groups where stuff we created can be discussed.
https://groups.google.com/forum/?fromgroups=#!forum/reverbrain

Grape: multiple events -> multiple applications

Grape is a realtime pipeline processing engine.
You can load your application into elliptics cluster and start data processing by triggering external events.

Grape is build using signal-slot model, so you can create multiple events which can handle different jobs over your data in parallel on multiple machines in the cluster. Every event can send a data reply back to original caller which concatenates them and returns when completion event is received.

But having single application which performs a whole bunch of processing events is not suitable for every situation. For example you may want to store intermediate processing results in the storage to create fallback mechanism, or you need multiple replies from one set of events to be concatenated, processed and sent to the next step in particular form…

There may be many cases where you may want to split your whole processing graph to subset of smaller applications, where each of which should be started somewhere from within the pipeline topology.

With 0.3.0 grape release we introduce new feature, which allows event within topology to start another application pipeline and wait for its completion (collecting reply data).

Something like this
Multiple applications within grape pipeline

In this example finish node for the first (black) application will start second application (blue) and will wait for its completion. It will receive all replies sent back from nodes within second application and then will send reply back to the original caller (start node of the first application), which will unblock client and return processed data.

I’ve updated grape server-side tutorial as well as example code in grape source package. Example application does exactly what’s described above now.

Enjoy!

Elliptics autodiscovery support

Our clusters grew upto 400 nodes and it becomes quite boring to add new nodes: admin has to create config file, where number of remote nodes should be specified, and those nodes (or at least one of them) have to be alive.
New node will update routing tables on remote servers, it will connect to other nodes and so on…

To simplify this process even further we implemented multicast autodiscovery.
Every (configured to do this) node broadcasts information about itself, so that client can receive this and if authentification cookie matches, client will connect to those nodes. Multicast TTL is set to 3.

Using reserved ‘hostname’ word instead of local address (like ‘addr = hostname:1025:2‘) and this new feature (it is turned on by adding ‘autodiscovery:address:port:family’ string into list of remote nodes like ‘remote = autodiscovery:224.0.0.5:1025:2‘, one can fully eliminate need for admin to edit any configuration file for new nodes.

Elliptics cache operations

I didn’t yet updated doc.ioremap.net, but it will be done soon. So, will write it here first

Elliptics distributed storage has built-in cache, which can (or can not, depending on how you write your data) be backed up by disk storage. Cache is rather simple LRU, but we will extend it with RED or maybe more comlex eviction mechanisms.
Each cached element has an expiration timeout, which by default never expires.

I’ve added possibility to remove cached entry not only from in-memory cache, but also from disk, when expiration timeout fires or when you remove object by hands. Also C++/Python APIs were extended to support this kind of operations.

Overall this feature is extremely useful for cases like session store. You generally do not want to lose them and want to read t, otherwise you could just use a plain in-memory cache, but also you know that your in-memory session pool is rather limited for frequently accessing users.

Server-side processing engine

Server-side (former known as server-side scripting) is an code execution environment created in elliptics on demand. We use Cocaine for this. This allows us to dynamically create pool of cgroups-bound processes or set of LXC containers to execute externally loaded code, which is triggered by client command. One may consider this as a write trigger, but actually it is more than that – it is a special command which may have your data attached and with whole access to local storage as well as any other external elements.

For example you may want to connect to external MySQL servers and trigger special command which will read or write data into elliptics only when special record is valid in SQL database (not that it could not be implemented in elliptics only, but for the sake of example simplicity).

Cocaine server-side pool can execute code in Python, Perl, Javascript (proof-of-concept) and binary compiled into shared library object (*.so). It is implemented through cocaine plugins (libcocaine-plugin-*) which are loaded by cocaine core (started in elliptics) on demand.

Cocaine engine must be enabled and initialized in elliptics config first. Then you have to load your code into elliptics in a way cocaine understands. In current (0.9) version this is done by cocaine-deploy tool. When you start your application (actually when first trigger event is received by elliptics), pools of proper workers are created and code starts. You may want to read detailed tutorial for step-by-step setup.

Serverside processing in elliptics is triggered by special command which operates with events. Event is basically a ‘application-name@event-name’ string with associated raw data embedded into elliptics packet. For every ‘application-name’ we start new cocaine engine (with its pools and so on).

Event can be blocked (if special flag is set) – in this case client blocks until event is fully processed (executed function returns) and error code as well as optional data is returned back to client from elliptics node. If event does not block, successful return means it was queued for execution.

It is possible to create the whole pipeline where each node sends new events to another nodes and so on over some kind of signal-slot topology spread over the whole cluster for parallel realtime execution. Grape is a framework which greatly simplifies this task getting the whole work of network transport away from programmer, who only need to implement processing nodes and register them via signal-slot topology model.

Elliptics: replication and recovery process

Elliptics since its day first uses replication to ensure data availability.
One of the main design goals ws to create a system, which is capable of dealing with the case when whole datacenter or geographical region is out of connectivity. Since those old days (first public elliptics version was released in early 2009) there are not so many players on the distributed storage market who is capable of dealing with this problem.
Basically, elliptics treats yearly Amazon’s whole-region-failure as a usual replica-is-not-available problem, and automatically switches to available copies in different regions.
One of our clusters contains replicas in US (Nevada), Europe (Amsterdam) and Russia (several replicas in Moscow and region and Ryazan’s region).

To provide this level of availability we introduced notion of group. Basically, group is just a set of servers (this can be even single node), which are logically bound to each other by admin. Some of our current installations treat whole datacenter as a single group, so in example above we could have 5 groups: US, Europe and 3 in Russia.

But usually things are a bit more comples: there are different power lines, different connectivity possibilities and so on, so generally group (or replica) is a smaller set of machines in one datacenter. Nothing prevents to spread it over multiple datacenters of course.

Group node belongs to is writen in config in ‘group = ‘ section.

When client configures itself it says what groups he wants to work with.
Elliptics client library stores this info to automatically switch between groups to find available replica. Due to eventual consistency, one of them may be unavaliable or be not in sync, for this case there is timestamp for each record, which used by ‘read-latest’ set of APIs – library reads timestamps from all available replicas, selects one with the latest update time and reads data from that group.

There are 2 types of recovery process in elliptics: merge and copy recovery.
Merge process is basically data moving within the same group from one node to another, which happens when new node is added (or returned back). In this case route table changes and part of ID ranges covered by given node starts belonging to new node. System should move data from old nodes to new one, since new node starts serving requests immediately after its start not waiting for data to be copied to its storage.
Merge is a rather fast process, especially when nothing has to be copied, so we start it at cron job once per hour or so. It is started by dnet_check tool. Its help message should explain the usage.

Second recovery process is needed when one or more replicas lost its data due to disk damages or other problems. To date this process is rather time consuming (hundred of millions of records is checked and copied for roughly a week). We have to check every single key on every system to ensure that it is in sync. Moreover, when we check single server, its replicas do not considered checked (although they can be updated), so, to ensure that the whole cluster is in sync, one must check every node.

It is ok in distributed system when part of it is not available – after all we put several replicas exactly for that. So whole replica recovery may be started not too frequently. One has to determine frequency of data loss in single replica and run recovery according to that. In our systems we start it several times per month, sometimes even less frequent if we know, that things we ok.

Replica recovery process is time consuming and may fail. Even with fair number of optimizations we put in, it is still unsatisfactorly slow. Currently we have to store metadata for every written key to determine where (in which groups) its replicas live. This limits performance of Smack backend for example by 100 times roughly. Metadata overhead is about 500 bytes per record, so eblob is not very useful for smaller writes. If we turn metadata off (which is default in Smack backend) then we lost possibility to recover data.

Getting all that together we plan to completely chage the whole logic of recovery process to check not a single key (or bunch of them), but whole data blob. Even not a blob, but underlying set of keys.
Backend will provide whole set of keys it stores (maybe split on per-blob basis if it has such smaller entity) with timestamps, which will be transferred to appropriate replica. It is possible to create assymetric groups now, where single group may contain replicas from several different groups.
For example, we can store key X in groups 1, 2, 3 and key Y in group 3, 4 and 5.
In this example group 3 will be assymetric, since there are no group which are fully symmetrical with group 3. This will be gone in future with introduction of metadata cluster, but this is a different story.

New update process will send the whole index for one or more blobs, remote side will return keys which have to be updated, and sync process completes by sending huge block of missed data to remote node. This is scheduled to be completed to the end of the year.

Looks like I wrote together all major parts of the elliptics, so its time to write a simple tutorial.
We alreay put all needed config files into source (conf directory in git tree), but setting it up from scratch is a good idea.

Next time I will write about elliptics server-side processing made using Cocaine and Grape – realtime pipeline processing engine made on top of elliptics and cocaine.
This will be the second part of tutorial.

Expect new articles tomorrow and so on. Plan is to create nicely structured documetation site within this month. I plan to use Sphinx as documentation generator, but maybe end up with plain wiki.
Thoughts?

Elliptics: route table, lookup, p2p

Route table is essentially a control layer for network transport. Basically it is a set of node addresses and ID ranges which are handled by given node.

When elliptics starts it connects to all remote nodes specified in ‘remote = ‘ section of the config. Connect to remote node includes asking for ID ranges remote node maintains as well as its route table.

Route table may look like this dump:

Server is now listening at 127.0.0.1:1026.
2012-08-23 22:29:26.917737 27694/27694 4: 2: 10fe9b14804c -> 127.0.0.1:1026
2012-08-23 22:29:26.917753 27694/27694 4: 2: 28901d6094ed -> 127.0.0.1:1025
2012-08-23 22:29:26.917767 27694/27694 4: 2: 2e03dfaa560c -> 127.0.0.1:1026
2012-08-23 22:29:26.917781 27694/27694 4: 2: 354782a7527a -> 127.0.0.1:1023

Where ‘2:’ is a group (or replica set) id and hex strings next to group number are start IDs of ranges asscociated with node, which address is written at the end.

By default each elliptics node connects to every node specified in ‘remote’ part of config. Then it downloads and merges theirs remote table and connects to every node it found there.
This remote tables are periodically refreshed from every group (elliptics randomly selects node from every group and asks its remote table) – route tables are downloaded, updated and new nodes are connected if needed.

Thus every client and server node is connected with every other node in cluster. And those connections are periodically checked.
When client wants to send command it uses its in-memory route table to determine remote node. It is still possible that remote table is not yet properly updated when new command is being sent, in this case node which received this command may forward it to the server which has to handle this request (according to its route table).

Since process of route table refreshing and updating is continous in the whole cluster, it is rather quick to detect new nodes connected to subset of servers or some nodes dropped out of the cluster. This allows to add new servers without disruption of client connections and servers restart.

Elliptics. What is it, why and how?

Elliptics project was started in 2009 as a new backend to pohmelfs (version 1 those days).

There were no open and mature enough NoSQL systems to date, but Amazon Dynamo was on the rise. Initial Elliptics storage system was rather academical – we did not consider ‘infinitely-growing’ storage, experimented with different routing protocols, played with various replication scenarios. There was no clear vision about how such a system should look.

Thousans of experimens later we ended up with what we have now in large set of production clusters from couple of servers to hundreds of nodes.

Elliptics was specially designed for the case of physically disributed data replicas. Even now there are no simple enough systems which can provide the same level of automatizations during datacenters replications or replica separation (hello yearly Amazon outages). One can manually create such setups in other distributed systems, but it is hard task to find out those which allow parallel write and reading balancing. Usually this is a mix of master-slave design, which may provide stronger consistency by the price or availability.

To date elliptics is not a storage system. There are several layers where storage is located at the lowest one. Elliptics supports 3 low-level backends to date: filesystem (where written objects are stored as files), Eblob – a fast append-only (with rewrite support though) storage and Smack – very fast backend designed for small compressible (6 different compressions are supported) objects, stored in sorted tables.

It is a very simple task to write your own backend which may store data in SQL for example. In this case you can even work with plain SQL commands on client side, while providing data distribution across the whole cluster, but beware that complex joins may be non-consistent with parallel updates.

Elliptics uses eventual consistency model to maintain data replicas. This means that number of copies you write may not always be the same. Client receives write status for every replica system tried to save, but for example, if you configured to write 3 replicas, but only 2 of them were successfully written, third one will be synced with others sometime in future. During this period of non-consistency read may return old data or do not return anything at all (this is not a problem, since elliptics client will automatically try another replica in this case).

General rule of thumb for eventual consistency systems is to never overwrite old data, but always write data using new keys. In this case there is virtually no non-consistency, there may only be non-complete replica, and client code automatically (well, automatically in elliptics, in others system you may need to do it manually) switch to another replica and read data from those servers.

Eventually data will be recovered in missed replica, but until it is ready, system generally can not survive loss of another replica. So, recovery process should be frequent enough. In our production we run replica-recovery process once per several days.

In elliptics replica is called a group. Group is a set of nodes (or just one) which admin logically bound together. For example one group may contain all servers in single datacenter, or one group may only contain several disks in a single server.
When group contains multiple nodes, they form DHT – there is ID ring from 0 to 2^512 (by default), and each nodes grabs multiple ranges in this ring. In particular, when elliptics node starts the very first time, it searches for ‘ids’ file in $history directory (it is specified in config), which contains range boundaries (this is just a set of IDs written one after another, by default each ID is 64-byte long, so each 64-bytes block in ‘ids’ file is a new boundary). If there is no such file, elliptics node generates own – it creates new random boundary for every 100 Gb of free space it has on the data partition.

Since each node may have several ranges in DHT ring in particular group, recovery process will fetch data from multiple nodes in parallel.

There is another issue with DHT rings – when new node is added or removed, ID ranges change and some nodes may get new ranges or lost. For example when new node starts it connects to others and says that ranges from its boundaries now belong to it. When it dies, those ranges are ‘returned’ back to neighbours.

When new node comes into the cluster, it starts serving requests immediately. This means that every write will succeed, but reads likely wont until recovery process moves data for this node’s ID ranges to its new location. This recovery process is called ‘merge’ in elliptics – this is basically a move of part of the data from several nodes to newly added one. This recovery process is cheap enough to be called once per hour or so.

Next time I will tell how recovery process operates and what we want to change.
Stay tuned!

Documentation for all our projects

Probably the most frequently asked ‘feature’ in every our project is its documentation.
It was always quite scarce and most of the time ‘a little bit’ outdated.

Now most of our projects went to enough maturity level so that not having up-to-date documentation is a serious hurt, which does not allow new people to work or even setup a system.

So, I will write set of blog articles about what our projects are, how they operate on higher level, and eventually this will become a base for complete documentation.

Well, we even have a deadline – all our technologies will be presented on Yandex’s Yet Another Conference October 1.

And if something will not be quite obvious feel free to join the con