Elliptics network: server-side scripting for data processing engine
Storage systems more and more commonly demand not just ability to save data to disk and provide fast access.
It is almost impossible to find a plain storage stack, which will not perform some kind of data processing.
This may include applications starting from simple client-side operations (like wrapping data in HTML in web case) and finishing with complex real-time processing or batch jobs wrapped in mapreduce-like systems.
Writing most of server-side work in low-level languages is not convenient and frequently is error-prone. Also performance is not commonly limited by CPU power, but disk IO speeds.
So, I added server-side scripting support to elliptics. Separate lib (called libsrw) creates a pool of processes (since many scripting languages are single-threaded like Python or popular javascript v8) which talk to external world via pipes. Each process initializes global context where it can run user provided script.
For example in elliptics we store a client node, which is connected to the storage and has access to all local data.
Every execution command is started in context, which is copied from global environment, and although global context can be affected by command (for example connection can be dropped by remote node), all variables created for scripts to be executed are destroyed when command/script completes.
To date I implemented python context and we plan to complete v8 javascript soon.
It is possible to create, for example, a simple checker on every node, where systems checks clients credential when processing data via direct URL. Direct URLs are used when we do not want to proxy data through dedicated servers (like when streaming huge files), but instead allow client to connect and send requests to particular server which hosts needed data objects.
Another example is batch data processing, like background data conversion from one format into another without need to copy every object to processing node. We also plan to use this mode for batch recovery – instead of checking replicas for every stored key, we can collect a bunch of objects which are known to me missed on some other node and later upload this blob to remote host in one go.
Classical mapreduce can also be implemented on servers. Actually our contexts are exact mappers, which have access to data and can iterate over all records selecting those to ‘reduce’. In our scheme reduce operation will run on the node, which sent request (or client). This may not be the optimal solution though (for example when reduce data set is rather huge), so we could extend it in the future.
Plan is to allow execution contexts to store its state locally when needed, so that subsequent commands have access to already processed data (for example consider the case, when we calculate number of links to given URL – we generally do not want to check all files every time we start processing, instead would like to only parse files uploaded since previous start).
Another application for server-side scripting is directory structure implementation for POHMELFS. This may be implemented in low-level code like C/C++ though and linked to server binary. Storing and processing complex structure on the server allows client (which is in kernel) to be really simple and do not mess with complex locks.
What I want to play right now is some kind of full-text search embedded into elliptics. The most naive implementation could just parse all written documents and build reverse indexes (using appropriate language morphology if needed) and store them in dedicated columns in elliptics as well. Writing this kind of scripts in Python or JavaScript is rather simple task, which greatly extends functionality of the storage kind of ‘for free’ – bunch of server-grade processors are usually unused in storage systems.
Here is context initialization script for elliptics for example:
$ cat /tmp/test/history.2/python.init
import sys
sys.path.append('/tmp/dnet/lib')
from libelliptics_python import *
log = elliptics_log_file('/dev/stderr', 40)
n = elliptics_node_python(log)
n.add_groups([1,2,3])
n.add_remote('localhost', 1025)
__return_data = 'unused'
We create global elliptics node object ‘n’ (not a good name for such things of course, but that’s just an example), which connects to storage server listening on a localhost:1025. __return_data is a special variable used to return data from execution context back to client. It is possible to write data into a file on the local filesystem or upload test results back into elliptics though.
That’s how client requests may look (there are also API extensions for C++ and Python):
$ ./example/dnet_ioclient -r server:1025:2 -C1 -c "local_addr = '`hostname`'" -n script.py
$ ./example/dnet_ioclient -r localhost:1025:2 -C2 -c "__return_data = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)"
The latter example is just a plain python script executed on the server – it only reads data from elliptics.
First example executes a script named script.py, which should exist on the server (config specifies scripting dir).
local_addr = '`hostname`' is part of the final script which initializes some variable (it can be arbitrary complex if needed).
Script (on the server) may look something like this:
$ cat /tmp/test/history.2/script.py
from time import time, ctime
ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>> ' + ctime(time()) + ' --- ' + ret
Prior being executed server-side code concatenates client’s part of the script (string with local_addr in above example) with script itself, so server executes following code:
local_addr = 'zbr.localnet'
from time import time, ctime
ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>> ' + ctime(time()) + ' --- ' + ret
Everything stored in __return_data is returned back to client.
Next execution command initializes context back into the state which existed just after initialization script execution (although global elliptics node in our example can be internally modified).
Those were our first set of changes to data processing engine implementation in elliptics network.
Stay tuned for more results!
Elliptics vs HBase on hundreds of millions of small records Elliptics network: authentication bits, IO/net priorities and write performance
Comments are currently closed.

Simple is the best !
Being deeply involved in a python-programming, I still believe that Lua is the best tool for such an “embedding” beast. It’s a highly performant, low-footprint tool with a great sandboxing capabilities that couldn’t be fully reproduced by Python environment.