Elliptics network: server-side scripting for data processing engine
Storage systems increasingly demand more than just the ability to save data to disk and provide fast access to it.
It is almost impossible to find a plain storage stack that does not perform some kind of data processing.
This ranges from simple client-side operations (like wrapping data in HTML in the web case) to complex real-time processing or batch jobs wrapped in mapreduce-like systems.
Writing most server-side work in low-level languages is inconvenient and frequently error-prone. Also, performance is commonly limited not by CPU power but by disk IO speed.
For example, in elliptics we start a client node, which is connected to the storage and has access to all local data.
Every execution command starts in a context copied from the global environment; although the global context can be affected by a command (for example, a connection can be dropped by a remote node), all variables created by the executed script are destroyed when the command/script completes.
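This copy-on-execute behaviour can be sketched in plain Python. This is a hypothetical model, not the actual elliptics implementation: the names GLOBAL_ENV and run_command are made up for illustration.

```python
# Hypothetical sketch of per-command context isolation: the server keeps a
# long-lived global environment, copies it for each command, and discards
# the copy when the command completes.

GLOBAL_ENV = {'n': 'elliptics-node-object'}  # shared, long-lived state

def run_command(script):
    # Each command runs in a copy of the global environment, so variables
    # it creates vanish when the command completes.
    ctx = dict(GLOBAL_ENV)
    exec(script, ctx)
    return ctx.get('__return_data')

result = run_command("__return_data = 'node is ' + n")
# GLOBAL_ENV itself is untouched: no '__return_data' leaked into it.
```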
It is possible, for example, to create a simple checker on every node, where the system verifies client credentials when processing data via a direct URL. Direct URLs are used when we do not want to proxy data through dedicated servers (like when streaming huge files), but instead allow the client to connect and send requests to the particular server which hosts the needed data objects.
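Such a checker could look roughly like the sketch below. The token scheme here is entirely made up for illustration; elliptics does not prescribe any particular credential format.

```python
# Hypothetical credential check a server-side script could perform before
# serving data over a direct URL. The secret and token format are invented
# for this example.
import hashlib

SECRET = b'node-local-secret'  # hypothetical per-deployment secret

def make_token(user):
    # Token is a digest binding the user name to the node-local secret.
    return hashlib.sha256(SECRET + user.encode()).hexdigest()

def check_credentials(user, token):
    # A direct-URL request handler would call this before reading local
    # data and set __return_data (or an error) accordingly.
    return token == make_token(user)
```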
Another example is batch data processing, like background data conversion from one format into another without the need to copy every object to a processing node. We also plan to use this mode for batch recovery: instead of checking replicas for every stored key, we can collect a bunch of objects which are known to be missing on some other node and later upload this blob to the remote host in one go.
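The batch-recovery idea can be sketched as a set difference plus batching. This is an illustrative model only; the function names and key representation are assumptions, not the elliptics recovery API.

```python
# Hypothetical sketch of batch recovery: find keys present locally but
# missing on a remote node, then group them for upload in one go.

def missing_keys(local_keys, remote_keys):
    # Keys we host that the remote node does not.
    return sorted(set(local_keys) - set(remote_keys))

def make_batches(keys, batch_size):
    # Fixed-size batches, each to be uploaded to the remote host at once.
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

to_recover = missing_keys(['a', 'b', 'c', 'd'], ['b'])
batches = make_batches(to_recover, 2)  # [['a', 'c'], ['d']]
```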
Classical mapreduce can also be implemented on the servers. In effect, our contexts are mappers which have access to the data and can iterate over all records, selecting those to 'reduce'. In our scheme the reduce operation runs on the node which sent the request (or on the client). This may not be the optimal solution though (for example, when the reduce data set is rather large), so we may extend it in the future.
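The division of labour described above can be sketched as follows: each node's execution context acts as a mapper over its local records, and the requester merges the partial results. The helper names and the word-count task are assumptions for illustration, not part of elliptics.

```python
# Hypothetical sketch: per-node map over local records, reduce on the
# requesting node/client.

def node_map(records, predicate, mapper):
    # Runs inside a node's execution context: iterate over local records,
    # select matching ones and emit mapped values.
    return [mapper(r) for r in records if predicate(r)]

def client_reduce(partials):
    # Runs on the node which sent the request (or the client): merge
    # per-node partial results into a final answer.
    total = {}
    for part in partials:
        for key, count in part:
            total[key] = total.get(key, 0) + count
    return total

# Two storage nodes count words over their local records:
node1 = node_map(['foo', 'bar', 'foo'], lambda r: True, lambda r: (r, 1))
node2 = node_map(['bar'], lambda r: True, lambda r: (r, 1))
counts = client_reduce([node1, node2])  # {'foo': 2, 'bar': 2}
```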
The plan is to allow execution contexts to store their state locally when needed, so that subsequent commands have access to already processed data. For example, consider the case where we count the number of links to a given URL: we generally do not want to check all files every time processing starts; instead we would like to parse only the files uploaded since the previous run.
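Locally stored context state could work roughly like this sketch: the context persists a watermark (the timestamp of the last processed upload) and each run only handles newer files. The state file location and JSON format are assumptions for illustration.

```python
# Hypothetical sketch of persistent context state: remember the last
# processed upload timestamp so later runs skip already processed files.
import json
import os
import tempfile

STATE_FILE = os.path.join(tempfile.gettempdir(), 'link_counter.state')

def load_state():
    # Restore the watermark left by the previous run, if any.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {'last_ts': 0}

def process_new(files, state):
    # files is a list of (upload_timestamp, name) pairs; only those newer
    # than the stored watermark are processed, then the watermark advances.
    new = [(ts, name) for ts, name in files if ts > state['last_ts']]
    if new:
        state['last_ts'] = max(ts for ts, _ in new)
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f)
    return [name for _, name in new]
```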
Another application for server-side scripting is the directory structure implementation for POHMELFS. This may instead be implemented in low-level code like C/C++ and linked into the server binary, though. Storing and processing the complex structure on the server allows the client (which lives in the kernel) to be really simple and avoid dealing with complex locks.
Here is an example context initialization script for elliptics:
$ cat /tmp/test/history.2/python.init
import sys
sys.path.append('/tmp/dnet/lib')
from libelliptics_python import *

log = elliptics_log_file('/dev/stderr', 40)
n = elliptics_node_python(log)
n.add_groups([1,2,3])
n.add_remote('localhost', 1025)

__return_data = 'unused'
We create a global elliptics node object 'n' (not a good name for such a thing, of course, but this is just an example), which connects to a storage server listening on localhost:1025.
__return_data is a special variable used to return data from the execution context back to the client. It is also possible to write data into a file on the local filesystem or to upload results back into elliptics.
Here is how client requests may look (there are also API extensions for C++ and Python):
$ ./example/dnet_ioclient -r server:1025:2 -C1 -c "local_addr = '`hostname`'" -n script.py
$ ./example/dnet_ioclient -r localhost:1025:2 -C2 -c "__return_data = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)"
The latter example is just a plain Python script executed on the server – it only reads data from elliptics.
The first example executes a script named script.py, which must exist on the server (the config specifies the scripting directory). local_addr = '`hostname`' is a part of the final script which initializes a variable (it can be arbitrarily complex if needed).
The script (on the server) may look something like this:
$ cat /tmp/test/history.2/script.py
from time import time, ctime
ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>> ' + ctime(time()) + ' --- ' + ret
Before execution, the server-side code concatenates the client's part of the script (the string setting local_addr in the above example) with the script itself, so the server executes the following code:
local_addr = 'zbr.localnet'
from time import time, ctime
ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>> ' + ctime(time()) + ' --- ' + ret
Everything stored in __return_data is returned back to the client.
The next execution command starts with the context reset to the state that existed just after the initialization script ran (although the global elliptics node in our example can still be internally modified).
This was our first set of changes to the data processing engine implementation in the elliptics network.
Stay tuned for more results!