Elliptics distributed storage is being built as a client-server architecture, and although servers may discover themselves, exchange various statistic information and forward requests, they most of the time serve client’s requests.
Recovery in a distributed storage is a tricky operation which requires serious thinking on which keys have to be copied to which destinations. In Elliptics recovery is another client process which iterates remote nodes, reads data to be recovered and update needed keys.
But there are cases when this round trip to client is useless. For example when you require missing replica, or when you have a set of keys you want to copy or move to new destination.
Thus I introduced two server-side operations which allow to send content from one server to multiple replicas. It is intended for various recovery tools which optimize by not copying data from local node to recovery temporary location, instead they may tell remote node to send data directly to required locations. It can also be used to move data from one low-level backend (for example eblob) to a newer version or different backend without server interruption.
There is a new iterator type now which sends all keys being iterated to set of remote groups. It does it with the speed of network or disk (what it slower), in local tests iteration over 200Gb blobs sending data over the network to one remote node via write commands ended up with ~78MB/s sustained speed. There were pikes though, especially when remote node synced caches. Both sender and recipient had 30Gb of RAM. Rsync shows ~32MB/s speed on these machines, but not because it is that slow, but because of ssh which maxed out CPU by packet encryption.
Iterator sends dnet_iterator_response structure for each write result for every key it has processed just like for usual iterator, neither API nor ABI is broken.
Second server-send command accepts vector of keys. It searches for all remote nodes/backends which host given keys in the one specified group, splits keys into per-node/backend basis and tells remote backends to send appropriate keys to specified remote groups. The same iterator response is generated for every key which has been processed.
All operations are async and can run in background with other client requests being handled in parallel.
There are 3 operation modes:
1. default – writing data to remote node using compare-and-swap, i.e. only write data if it either doesn’t exist or it is the same on remote servers. Server sending (iterator or per-key) running in this mode is especially useful for recovery – there is no way it can overwrite newer copy with the old data.
2. overwrite – when special flag is set, it overwrites data (clears compare-and-swap logic)
3. move – if write has been successful, remove local key
There is example tool in examples which iterates over remote node and backends and performs copy/move of the keys being iterated. Next step is to update our Backrunner HTTP proxy to use this new logic to automatically recover all buckets in background.