
Server-side operations

Elliptics distributed storage is built on a client-server architecture, and although servers can discover each other, exchange various statistics and forward requests, most of the time they serve clients' requests.

Recovery in a distributed storage is a tricky operation which requires serious thinking about which keys have to be copied to which destinations. In Elliptics, recovery is just another client process which iterates over remote nodes, reads the data to be recovered and updates the needed keys.

But there are cases when this round trip through the client is wasteful, for example when you need to restore a missing replica, or when you have a set of keys you want to copy or move to a new destination.

Thus I introduced two server-side operations which allow sending content from one server to multiple replicas directly. They are intended for various recovery tools, which can optimize by not copying data from the local node to a temporary recovery location; instead they may tell the remote node to send the data directly to the required destinations. They can also be used to move data from one low-level backend (for example eblob) to a newer version or to a different backend without server interruption.

There is now a new iterator type which sends all keys being iterated to a set of remote groups. It does so at the speed of the network or the disk, whichever is slower; in local tests, iterating over 200GB of blobs and sending the data over the network to one remote node via write commands sustained about 78MB/s. There were spikes though, especially when the remote node synced its caches. Both sender and recipient had 30GB of RAM. Rsync shows about 32MB/s on these machines, not because it is that slow, but because ssh maxed out a CPU core with packet encryption.
The iterator sends a dnet_iterator_response structure with the write result for every key it has processed, just like the usual iterator, so neither API nor ABI is broken.
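
To make the shape of that per-key stream concrete, here is a minimal, self-contained sketch of such a response and how a recovery tool might consume it. The field names only approximate the idea of dnet_iterator_response; they are not copied from the elliptics headers.

```cpp
// Illustrative sketch only: a simplified stand-in for the per-key response
// the server-send iterator streams back; not the real elliptics structure.
#include <cstdint>
#include <cstdio>
#include <vector>

struct iterator_response_sketch {
	uint64_t key_id;         // identifier of the key that was iterated and sent
	int      status;         // write status: 0 on success, negative errno otherwise
	uint64_t size;           // size of the data that was written remotely
	uint64_t iterated_keys;  // progress counter: keys processed so far
	uint64_t total_keys;     // total number of keys the iterator will process
};

// A recovery tool would consume these responses as a stream, tracking
// progress and collecting the keys whose remote writes failed.
static void consume(const std::vector<iterator_response_sketch> &responses) {
	for (const auto &r : responses) {
		if (r.status != 0)
			std::fprintf(stderr, "key %llu: write failed with status %d\n",
					(unsigned long long)r.key_id, r.status);
		std::printf("progress: %llu/%llu\n",
				(unsigned long long)r.iterated_keys,
				(unsigned long long)r.total_keys);
	}
}

int main() {
	consume({ {1, 0, 4096, 1, 2}, {2, -5, 0, 2, 2} });
}
```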

The second server-send command accepts a vector of keys. It looks up all remote nodes/backends which host the given keys in the single specified group, splits the keys on a per-node/backend basis and tells the remote backends to send the appropriate keys to the specified remote groups. The same iterator response is generated for every key which has been processed.
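
Conceptually, the per-node split looks like the following self-contained sketch. The locate() routing function is a deliberately naive stand-in (real elliptics consults the group's route table), and all names here are illustrative.

```cpp
// Minimal sketch of the per-node/backend key split the server-send command
// performs. The modulo-based routing below is a toy stand-in only.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <vector>

using node_id = uint32_t;

// Hypothetical routing: map a key to the node/backend that hosts it.
static node_id locate(const std::string &key, uint32_t nodes) {
	return std::hash<std::string>{}(key) % nodes;
}

// Split the requested keys so each hosting node gets only its own batch;
// each node can then be told to send that batch to the destination groups.
static std::map<node_id, std::vector<std::string>>
split_by_node(const std::vector<std::string> &keys, uint32_t nodes) {
	std::map<node_id, std::vector<std::string>> batches;
	for (const auto &k : keys)
		batches[locate(k, nodes)].push_back(k);
	return batches;
}

int main() {
	auto batches = split_by_node({"doc1", "doc2", "doc3", "doc4"}, 3);
	for (const auto &b : batches)
		std::printf("node %u gets %zu keys\n", b.first, b.second.size());
}
```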

All operations are asynchronous and can run in the background while other client requests are handled in parallel.

There are 3 operation modes (the decision logic is sketched below):
1. default – write data to the remote node using compare-and-swap, i.e. only write the data if it either doesn’t exist or is the same on the remote servers. Server-side sending (iterator or per-key) running in this mode is especially useful for recovery – there is no way it can overwrite a newer copy with old data.
2. overwrite – when a special flag is set, the data is overwritten (the compare-and-swap logic is cleared).
3. move – if the write has been successful, the local key is removed.
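
Here is a small, self-contained sketch of that decision logic over in-memory stand-ins for the local and remote backends; it only illustrates the three modes and is not the actual elliptics implementation.

```cpp
// Sketch of the default (CAS) / overwrite / move modes over toy backends.
#include <cstdio>
#include <map>
#include <string>

enum class send_mode { cas_default, overwrite, move };

using backend = std::map<std::string, std::string>; // key -> data

// Returns true if the key ended up present (and correct) on the remote side.
static bool server_send_one(backend &local, backend &remote,
		const std::string &key, send_mode mode) {
	auto it = local.find(key);
	if (it == local.end())
		return false;

	auto rit = remote.find(key);
	if (mode == send_mode::cas_default && rit != remote.end()
			&& rit->second != it->second) {
		// CAS failure: remote already holds different (possibly newer) data,
		// so recovery must not overwrite it.
		return false;
	}

	remote[key] = it->second;      // write (or confirm identical) data
	if (mode == send_mode::move)
		local.erase(it);           // move: drop the local copy after success
	return true;
}

int main() {
	backend local{{"k", "old"}}, remote{{"k", "new"}};
	std::printf("cas write: %d\n", server_send_one(local, remote, "k", send_mode::cas_default));
	std::printf("overwrite: %d\n", server_send_one(local, remote, "k", send_mode::overwrite));
}
```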

There is an example tool in the examples directory which iterates over a remote node and its backends and copies/moves the keys being iterated. The next step is to update our Backrunner HTTP proxy to use this new logic to automatically recover all buckets in the background.

Stay tuned!

Secondary indexes and write-cas

It has been a while since I last wrote, and an enormous amount of changes have been made.

Let’s start with write-cas and secondary indexes.
Write-cas stands for write-compare-and-swap, which means the server performs your write only if your cookie matches what it has. We use the data checksum as the cookie, but that can be changed if needed.

CAS semantics were introduced to solve the problem of multiple clients updating the same key. In some cases you might want to update a complex structure and not lose changes made by others.

Since we do not have distributed locks, a client performs the read-modify-write loop without holding any lock, which ends up racing against other clients. CAS allows detecting that the data was changed after we last read it and that our modification would overwrite those changes.

In this case the client gets an error and performs the read-modify-write loop again. There are convenient functions for write-cas: one takes a functor which just modifies the data (the read and write are hidden), and there is a low-level method which requires the checksum and the new data – if it fails, it is up to the client to read the data again, apply its changes and write it once more.
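
The pattern looks roughly like the following sketch, where the checksum plays the role of the cookie and an in-memory map stands in for the storage; none of the names here are the real elliptics API.

```cpp
// Hedged sketch of the write-cas retry loop: the checksum read together with
// the data acts as the cookie; if the stored checksum changed since the read,
// the write is rejected and the client retries.
#include <cstdio>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

struct store {
	std::map<std::string, std::string> data;

	size_t checksum(const std::string &key) const {
		auto it = data.find(key);
		return it == data.end() ? 0 : std::hash<std::string>{}(it->second);
	}

	// write succeeds only if the caller's cookie matches the current checksum
	bool write_cas(const std::string &key, const std::string &value, size_t cookie) {
		if (checksum(key) != cookie)
			return false;
		data[key] = value;
		return true;
	}
};

// Functor-style helper: read, apply the modification, retry on CAS failure.
static void modify(store &s, const std::string &key,
		const std::function<std::string(const std::string &)> &fn,
		int max_retries = 10) {
	for (int i = 0; i < max_retries; ++i) {
		size_t cookie = s.checksum(key);
		std::string updated = fn(s.data.count(key) ? s.data[key] : "");
		if (s.write_cas(key, updated, cookie))
			return;
		// Another client won the race; loop to re-read and re-apply.
	}
	throw std::runtime_error("write-cas: too many conflicting writers");
}

int main() {
	store s;
	modify(s, "index", [](const std::string &old) { return old + "+doc1"; });
	modify(s, "index", [](const std::string &old) { return old + "+doc2"; });
	std::printf("%s\n", s.data["index"].c_str());
}
```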

The first write-cas client is elliptics’ secondary indexes. You may attach ‘tags’ or ‘indexes’ with data to any key you want. This means that after the secondary index update has completed, you will find your key under that ‘tag’ or ‘index’.

For example, a tag can be a daily timestamp – in this case you will get the list of all keys ‘changed’ on a given day.

Of course you can find all keys which match multiple indexes – this is a kind of logical ‘AND’ operation. Our API returns all keys found under all requested tags/indexes.
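
Conceptually, the multi-index ‘AND’ lookup is a set intersection, as in this self-contained sketch; the index layout and function name are illustrative, not the elliptics data structures.

```cpp
// Toy model of secondary indexes: each index maps to a set of keys, and a
// multi-index query returns the intersection of those sets.
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

using index_table = std::map<std::string, std::set<std::string>>; // index -> keys

static std::set<std::string>
find_all_indexes(const index_table &t, const std::vector<std::string> &indexes) {
	if (indexes.empty())
		return {};
	auto it = t.find(indexes.front());
	std::set<std::string> result = (it == t.end()) ? std::set<std::string>{} : it->second;
	for (size_t i = 1; i < indexes.size() && !result.empty(); ++i) {
		auto jt = t.find(indexes[i]);
		std::set<std::string> keys = (jt == t.end()) ? std::set<std::string>{} : jt->second;
		std::set<std::string> tmp;
		std::set_intersection(result.begin(), result.end(), keys.begin(), keys.end(),
				std::inserter(tmp, tmp.begin()));
		result.swap(tmp);
	}
	return result;
}

int main() {
	index_table t{
		{"2013-10-01", {"doc1", "doc2"}},  // daily-timestamp tag
		{"elliptics",  {"doc2", "doc3"}},
	};
	for (const auto &key : find_all_indexes(t, {"2013-10-01", "elliptics"}))
		std::printf("%s\n", key.c_str());  // prints doc2
}
```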

All indexes (as well as any keys) may live in different namespaces.

So far secondary indexes are implemented on pure write-cas semantics, and our rather simple realtime search engine Wookie uses them to update its reverse indexes.
And that doesn’t scale :)

We found that having multiple clients competing to update the same popular index ends up with so many read-modify-write cycles that the whole system becomes network bound. And we want to be CPU bound instead.

To fix this issue we will change secondary indexes to be invoked from the server-side engine. All server-side operations are atomic by default, so this will eliminate the races between multiple clients.

Stay tuned – I will write more about Wookie soon.
After all, we want to implement a baby (really-really baby) Watson with it :)

Elliptics server-side processing

Elliptics has had a so-called server-side processing engine for a while. What does that mean, and what is it used for?

Basically, server-side processing is a dynamic pool of workers which can execute your commands, triggered by an external event. For example, you want to write data, but only if it matches some options. Or you want to read data, but only if the provided credentials match some internal data.
It is as simple as writing data into elliptics and emitting a server-side processing event with your application and event name plus the uploaded key. Your event-processing code can read that key, obtain the data and perform the needed steps. Or you can provide the whole data blob with the event instead of writing it into elliptics.
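
The flow can be pictured with the toy sketch below: a key is written, an "application@event" is emitted for it, and a registered worker reads the key and processes it. The storage and worker pool here are in-memory stand-ins, not the elliptics or Cocaine APIs.

```cpp
// Conceptual sketch of the write-then-exec flow; everything is a toy stand-in.
#include <cstdio>
#include <functional>
#include <map>
#include <string>

struct toy_storage {
	std::map<std::string, std::string> blobs;
	void write(const std::string &key, const std::string &data) { blobs[key] = data; }
	std::string read(const std::string &key) const { return blobs.at(key); }
};

struct toy_engine {
	toy_storage &storage;
	// handlers registered per "application@event" name
	std::map<std::string, std::function<void(const std::string &)>> handlers;

	// emit an event for a key that is already stored
	void exec(const std::string &app_event, const std::string &key) {
		handlers.at(app_event)(key);
	}
};

int main() {
	toy_storage storage;
	toy_engine engine{storage, {}};

	// worker code: triggered by the event, reads the key and does its processing
	engine.handlers["indexer@update"] = [&](const std::string &key) {
		std::printf("processing '%s': %s\n", key.c_str(), storage.read(key).c_str());
	};

	storage.write("doc1", "hello");          // 1. write the data
	engine.exec("indexer@update", "doc1");   // 2. emit the server-side event
}
```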

Many other tasks can be triggered on your data. At the very lowest layer, server-side processing is just a pool of processes (Linux containers soon) which execute your code over the data you provided in a given event.

In the current elliptics git we switched to Cocaine 0.10 (our cloud management software), which brought us new features and better performance.

It is quite common for a single elliptics node to handle 100k IOPS of reads/writes from the internal LRU cache, or 10+ krps from the on-disk leveldb backend.
About 19+ krps is the number of executions per second our engine can deliver to a single node with empty processing, so those are our overhead numbers. OpenStack on the same hardware delivers about 4 times less, not counting how it works with the storage layer.

And we are working on improving even those numbers.
More on our