Elliptics

Elliptics distributed network

Recent elliptics changes

Tagged:  

In a meantime I added stream write support to HTTP elliptics network fronend. WIth this change it is possible to atomically append data to existing records (in backends which support it - currently it is file IO backened).

Thus HTTP interface becomes very useful for stream-like data updates in various projects starting from log saving to auido/video recording.

Another major change includes IO backend update: I finally dropped TokyoCabinet database in favour of libeblob. Main objection against TC was its extremely slow performance when index swaps to disk. I dropped VM cache during test and TokyoCabinet never recovered back even if amount of RAM allows to populate whole database (I used several millions of 10k objects and database size was close to 10 Gb while machine had 24 Gb of RAM).

Also TC version I tested (1.4.44 iirc) was very unstable - I even wrote a test script to restart upload, but to my shame I did not contact author with cores and backtraces. When I tried .23 version year ago it was rock stable, so I expect that in newer version things were fixed.

Another TC issue is related to multi-threading: this database just does not allow multiple users to read and write data simultaneously - it locks the whole database each time we start transaction, which is needed to protect against parallel writes. Although this was not an issue in elliptics network tests.

And now dig back into python and morphological data processing... I tested handful of phrase generation techniques, but neither provided good result for words which were not present in the dictionary. I have to find and answer.

elliptics@eblob: 5000 rps of random IO requests in 1 Tb of data (100 millions of objects)

Tagged:  

Legend.

elliptics network - distributed hash table storage
eblob - low-level data storage used as one of IO backends for elliptics network

Hardware: 2 E5530 servers (16 cores, 24 Gb of RAM) each one is connected to SAS shelf (14 disks, ext4, raid10).
Data: 100 millions of objects (total of 1042 Gb) roughly equally spread over above server nodes.
Requests: random IO reads.

Result? We have it:


Reply time (left, in ms) @ number of requests per second (right, red inclined line)



Reply time (in ms) distribution

Net result: 3500 rps within 100 ms, 4000 rps witin 200 ms.
Not that bad I think...

New elliptics network release: 2.9.1

Tagged:  

This day has come: I made whole new and shiny elliptics network release. Amount of invasive changes is noticeable but yet a lot more I found should be changed.

We made 30 intermediate releases after 2.9.0 which revealed fair number of bugs and showed a number of different ways of further elliptics storage development.

A short changelog includes:

  • Huge log refactoring. Now logs are external entities which are referenced in elliptics core.
  • Allow to embed timestamps into data objects, which makes http cache headers processing (namely if-modified-since and last-modified) as fast as data reading. Timestamps are also stored in transaction histories, which previously had to be read from disk, thus making 2 times more disk seeks to handle client request. It is quite easily possible to implement other URI parameters embedding (one can specify object timestamp in URI as well as getting current time, for more details see example config file which comes in a package/archive).
  • Added region tag in XML returned by HTTP proxy. Region ID can be obtained using external library calls when properly configured.
  • Added EBLOB low-level IO backend. It is an append-only storage system which works with huge blob files and allows to lookup object with O(1) time optionally locking index in memory. I posted a number of its benchmarks here - it outperforms everything, although has a set of constraints, like need to call defragmentation tool. I plan to remove TokyoCabinet database IO backend in next release, since when it starts paging out its performance suffers greatly and elliptics network does not need its rich database API.
  • Added spec file for RPM builds. Debian build files were added quite long time ago, and we use deb packages to actually rollout new releases.
  • Added C++ and Python bindings. I stumbled upon an interesting user behaviour: when API is exteremely rich library is used noticeably less frequently, since people who are not very familiar with the low-level internals can not decide what exactly they need from this huge exported function set. In C++/Python bindings I implemented rather simple classes with limited API which should solve most of the common problems users face daily. Simplifying API leads to better acceptance of the project.
  • Changed library and include filenames from dnet to elliptics. In particular, this solves problem with already existing libdnet found in all popular distros. Now we use libelliptics and include/elliptics.
  • Switched server-side code from command-line options to configuration file. Allow low-level IO backends to have their own config options. Split former dnet_ioserv to server code with the same name and dnet_ioclient which is able to read/write/remove objects in the storage as well as requesting VFS/memory/CPU usage statistics.
  • Fair number of bug fixes.

With this release I mark 2.9 tree as experimental, since I will break compatibility soon to allow faster fault recovery process.

Milestone TODO list includes:

  • Refactor transaction history processing code and move it to elliptics core from low-level IO backends. Transaction logs will be stored in eblobs and will be optimized for performance. IO backends will only store data objects in different formats.
  • Remove TokyoCabinet IO backend in favour of EBLOB one.
  • Switch to thread-per-client IO processing model.
  • POHMELFS port.

Stay tuned!

And now its time to recall about OLOLO-intellect and related lexical and morphological analysis...

Reactor vs. thread-per-client models

Tagged:  

I used to believe that reactor (aka state-machine) model is always superior compared to thread-per-client case, since threads are huge, context switch is CPU-pricey and slow.

At least it was 10 years ago when sky was bluer and grass was greener. Now things changed.
When I developed kevent - a kernel dispatching system for file descriptors, process' and other events, I implemented it as a complex enough state machine, which handled different event types and performed scheduled for each event type (like network AIO and so on).

Those days state-machine system did not differ too much from one created on top of thread-per-client model, especially in IO cases where clients frequently sleep waiting for IO completion.
Now, when multi-core systems become commodity hardware we want to utilize all CPUs, which means creating more and more separate state machines each one running on own physical processor. Or we can create a pool of threads where each thread will process its own state machine.

In elliptics network libevent-based state machine handles all network operations. Another state machine handles client commands and third one glues them together (this one is especially visible when we forward data from one socket to another).

System of complex structures is as weak as its weakest member. Thus, when we block in directory reading syscall which takes 10 minutes to populate cold dentry and/or inode cache from directory containing millions of records, that thread is completely uselesss for other operations scheduled. Which means no network transfers or other commands which could be done waiting for blocking operation to complete. In some systems I worked with on top of elliptics it was as much upto 10 thousands of 'clients' per default 128 IO threads.

And in IO-heavy systems amount of such blocking operations prevails number of potentiall non-blocking (like most of writes and all socket events, since it signals only when something can be read/written and sockets are non-blocking). As a proof-of-concept one may consider recent Java NIO vs sync-IO benchmark (shorter and more popular HTML article).

Thus I decided to rewrite event-handling model in elliptics network and use simple thread-per-client approach. It will eliminate possible libevent bugs (and I found some of them in 1.3 versions) as well as greatly simplify processing logic. Also it will be possible to imlpement IO priorities, first by using ionice and then maybe by introducing automatic tools.

elliptics@eblob on 14-disks SAS raids

Tagged:  

We got 2 servers (E5530, 16 cores, 24 Gb of RAM) with 14-disks SAS shelf converted into raid10 in each.
Uploaded 30 millions of records (record size varies from 5 to 20 kb) total of 370 Gb of data.

Fired with random requests and got this graph:

2000 rps within 130-150 ms, according to dynamics we could get more, but stopped test.
Will run longer with additional 60 millions of records (total of about 100 millions objects in the storage).

Eblob index must take all available RAM on such machine, with 15 millions of records on each (30 millions total) we get about 9.9 Gb of virtual memory and 8.7 Gb of 'real' mem.

Should fit 100 millions of records on 2 nodes perfectly :)

In a meantime I cooked up 2.9.1 release, its release candidate will pass final tests and I will make an announcement in a day or so. We got 30 internal releases prior this one already, its time to publish new shiny features in a new package.

Stay tuned!

Shooting at elliptics@libeblob with lots of data

Tagged:  

We got 2 servers (4 SATA disk raid10, default ext4, 24 Gb of RAM, E5530, 16 cores), uploaded 30 millions of objects into elliptics network on top of libeblob backend. Total of 370 Gb on disks.

And got this with completely random requests:

900 rps within 100-150 ms. Quite a superb result for such hardware.
And we have this SAS shelfs:

md4 : active raid10 sdr[13] sdq[12] sdp[11] sdo[10] sdn[9] sdm[8] sdl[7]
                               sdk[6] sdj[5] sdi[4] sdh[3] sdg[2] sdf[1] sde[0]
      2050780256 blocks super 1.2 16K chunks 2 near-copies [14/14] [UUUUUUUUUUUUUU]

which will be used to host about 200 millions of objects (upload will be finished next week - writing server can not cope with such load :). Will also test the same amount of data on 4-sata-disk raid10 machines as well.

Stay tuned!

Presentation bit: Elliptics network implementaion details

Tagged:  

5. Elliptics network implementaion details: data redundancy, fault tolerance, transctions, versions and snapshots, data deduplication and so on.

Elliptics network is a distributed key/value storage. Having more keys associated with the same data being written means that multiple copies of the written block will exist in the storage. Using data hashes to generate keys and having multiple hash functions registered for given block we end up having multiple copies of the data.

Knowing ID distribution among nodes in the storage it is possible to tweak hash generation function the way it will produce hash, which will belong to the interested node. This is a rather complex task to be implemented on client, and instead we introduced virtual datacenters.

Virtual datacenter is a set of nodes combined into some logical group, where nodes may or may not actually be physically groupped together. System modifies hash generated by common function by changing its first 4 bytes to be equal to the datacenter number.

Thus VDC feature allows to specify preconfigured prefix in every transaction ID - the first 4 bytes. Nodes which have their first ID bytes equal to VDC number will receive all transactions indexed by appropriate hash function.

Here is a virtual datacenter example.

Let's suppose we have two transformation functions setup on client: dc1_sha1 and dc2_sha1. They will produce sha1 hash of the data with the first bytes set to 1 and 2 accordingly.

Now let's suppose we add two nodes with IDs being equal to 0x01000000 and 0x01000080 and then two nodes with 0x02000000 and 0x02800000 IDs. Now there will be two transactions made with above transformation functions with its first 4 bytes set either to 1 or 2, so there will be a guaranteed copy in each virtual datacenter.

Now we can implement multiple VDCs with different ID's first bytes for every datacenter we want to work with. If we only need to have 2 copies, but there are more than 2 datacenters, nodes can be spread between those VDCs. '2' is arbitrary here of course.

This feature also allows to implement geographical linking of the requests. Let's suppose some application receives data read request from Moscow, it can check whether its set of transformation functions contains the one with the ID assigned to Moscow. If there is such a function, it can be used first to obtain object ID and to fetch it from Moscow-local servers instead of going to New York, where main datacenter lives. If Moscow storage does not contain our requested object, we will use second transformation function(s), which will 'point' to main storage cluster.

5.1. Fault tolerance in elliptics network.

In a meantime I was told that there will be a special man who will teach me how to make presentations...
They got me. And I did not yet read what I wrote several days ago :)

Elliptics network article for Linux Kongress 2010

Tagged:  

The purpose of Elliptics network storage is to allow users to access a set of physically distributed servers through flat addressing model in decentralized network environment. Key/value distributed storage provides an efficient method of accessing data with limited set of constraints. As a proof that such functionality is useful in a real life scenarios we present practical implementation of the DHT storage server with modular IO backends on top of common filesystems or database and various frontends varying from POSIX interface to HTTP access mode. We will discuss limitations faced with distributed hash table approach and compare them to functionality provided by centralized storage systems, namely high-performance data access and its high-availability in the fault-prone environment. Based on the practical results and flexibility of the implemented storage model we will highlight possible new functionality and the ways it could be made in the discussed system.

This is an abstract, I will post some blocks here, but whole article will be available when LK is over.

Low-level data storage introduction: libeblob

Tagged:  

Elliptics network is a very modular key/value (distributed hash table) storage. Among others it allows to build pluggable low-level IO backends - those entities which store data.

Currently elliptics network supports following IO backends:

  • file IO backend, where each transaction is stored as a separate file
  • TokyoCabinet database backend - each transaction is stored as a record in appropriate table. I dropped BerkelyDB support because of its low performance, even though TC does not provide ACID contrary to BDB

And now I added append-only blob storage - libeblob.

Following features are already supported:

  • fast append-only updates which do not require disk seeks
  • compact index to populate lookup information from disk
  • multi-threaded index reading during starup
  • O(1) data location lookup time
  • ability to lock in-memory lookup index (hash table) to eliminate memory swap
  • readahead games with data and index blobs for maximum performance
  • multiple blob files support (tested with blob-file-as-block-device too)
  • optional sha256 on-disk checksumming
  • 2-stage write: prepare (which reserves the space) and commit (which calculates checksum and update in-memory and on-disk indexes). One can (re)write data using pwrite() in between without locks
  • usuall 1-stage write interface
  • flexible configuration of hash table size, flags, alignment

TODO list includes:

  • defragmentation tool: entries to be deleted are only marked as removed, we need to have a tool (or embed it into library) to actually remove those blocks from blob files
  • proper off-line blob consistency checker: we put a checksum into data blob, someone may want to check if read data matches
  • run-time sync support - we should have a dedicate thread to call syncs on timed base

Elliptics network uses it as one of its low-level IO backends. Numbers I posted (1, 2, 3) also highlight its advantages.

But during elliptics network integration with libeblob I found how unoptimal transaction history log was implemented in the storage (and maybe found an answer why monsters like Cassandra do not support it at all). Maybe its time to rethink and reinvent it though...

Anyway, there is a set of features I will create to complete this implementation as well as new elliptics network release (with C++ and Python bindings and new IO backend).
Stay tuned!

Elliptics python

Tagged:  

I extended C++ and Python bindings for elliptics library, although python part was a little bit messy at first.

Python is massively ... single-threaded language: GIL is a tricky global lock monster, which does not easily allow to implement not only threads but also async communications. Of course python has threads, but they are internal entities which can not be worked with from the outside system threads.

Contrary elliptics network library is a multi-threaded application, and the main problem related to python was its async completion notifications. When transaction is finished or being processed, remote side can send multiple replies about its state (like chunks of data being read for exampl), which are processed in different thread than original sending one.

Python does not expect itself to be interrupted by those callbacks (even if we properly wrap them into python classes). But still we can (or it can be called a hack) invoke async python callbacks from C/C++ code and external threads.

Python may have multiple execution threads, or states, and at startup we have to select the one, which will be used to invoke our C++ callbacks. In older python versions it took quite a bit of efforts: stack selection, saving it somewhere in private data, then switch to/from it and so on. In newer python versions it is just as simple as calling PyEval_InitThreads(). Python thread which called it first will be selected as the one to dispatch exernal callbacks. Then just doing

PyGILState_STATE st = PyGILState_Ensure();
this->get_override("some_virtual_callback_invoked_from_cpp")(its, data);
PyGILState_Release(st);

will schedule C++ callback invocation. It will take care about thread state and GIL.

And when I managed to finally implement all wrappers and helpers for async bidirectional C++-to-Python communication, I dropped its support. Just because it is much simpler to read/write data using blocking calls, which is I believe the most common Python programming model.

That's how this works in python now:

#!/usr/bin/python

from libelliptics_python import *
from array import *
import sys

id = array('B')
for x in xrange(0, 20) :
	id.append(x + 1)

trans = array('B')
for x in xrange(0, 20) :
	trans.append(1)

try:
	log = elliptics_log_file("/dev/stderr", 15)
	n = elliptics_node_python(id.buffer_info()[0], log)

	t = elliptics_transform_openssl("sha1")

	n.add_transform(t)
	# weird thing happens if I write n.add_transform(elliptics_transform_openssl("sha1"))
	# we crash somewhere inside c++ binding, probably because I implemented lazy
	# reference counting model (i.e. not at all :)
	# thus object MUST live after this function is completed
	# this should be fixed of course with proper copy constructors
	# the same applies to logger actually

	n.add_remote("devfs8", 1025)

	#n.write_file(trans.buffer_info()[0], "/tmp/test_file", 0, 0, 0)
	#n.read_file(trans.buffer_info()[0], "/tmp/test_file.read", 0, 0)

	data = array('B', "1234567890")
	n.write_data(trans.buffer_info()[0], data.buffer_info()[0], 0, data.buffer_info()[1])

	read = array('B')
	for x in xrange(0, len(data)) : read.append(0)

	n.read_data(trans.buffer_info()[0], read.buffer_info()[0], 0, read.buffer_info()[1])

	for x in xrange(0, len(data)) :
		print data[x], " ", read[x]
except:
	print "Ooops, error:", sys.exc_info()[0]

$ ./test.py  # written and read data from example above
49   49
50   50
51   51
52   52
53   53
54   54
55   55
56   56
57   57
48   48

Also finished proper object copy for logger, it will clone logger and when proper methods are implemented one can create own private python-made loggers. But that's details.

To date I consider python bindings as well as C++ ones fully finished. C++ has async callbacks as well as blocking sync IO operations.

Completed python elliptics network API bindings

Tagged:  

After several bottles of cold beer with this killing heat in Moscow... And brain suddenly starts 'thinking' in the right or better say - alternative direction.

[zbr@baccara lib]$ ./test.py 
010203040506: successfully initialized notify hash table (256 entries).
Server is now listening at 0.0.0.0:0.
010203040506: new node has been created at 0.0.0.0:0, id_size: 20.
connected to 127.0.0.1:1025.
123abc000000 reverse lookup -> 127.0.0.1:1025.
123abc000000: node list dump:
      id: 123abc000000 [12], addr: 127.0.0.1:1025.
do_transform: transform
calling openssl::transform
transform
000000000000: created trans: 1, cmd: 4, size: 575, offset: 0, local_offset: 0 -> 127.0.0.1:1025.
010101010101: created trans: 2, cmd: 4, size: 64, offset: 0, local_offset: 0 -> 127.0.0.1:1025.
000000000000: transactions sent: 2, error: 0.
010203040506: started resending thread. Timeout: 60 seconds.
000000000000: object write completed: trans: 1, status: 0.
010101010101: object write completed: trans: 2, status: 0.
Successfully wrote file: '/tmp/test_file' into the storage, size: 575.
010101010101: created trans: 3, cmd: 5, size: 0, offset: 0, local_offset: 0 -> 127.0.0.1:1025.
010101010101: read completed: file: '/tmp/test_file.read.history', offset: 0, size: 96, status: 0.
010101010101: read completed: file: '/tmp/test_file.read.history', status: 0, freeing: 1.
/tmp/test_file.read.history: objects: 1, range: 0-18446744073709551615, counting from the most recent.
a13677c93a73: created trans: 4, cmd: 5, size: 575, offset: 0, local_offset: 0 -> 127.0.0.1:1025.
a13677c93a73: reading chunk into file '/tmp/test_file.read', direct: 0, offset: 0/0, size: 575, err: 0.
a13677c93a73: flags: 00000080, offset:        0, size:      575: match: -2, rest: 18446744073709551615
a13677c93a73: read completed: file: '/tmp/test_file.read', offset: 0, size: 575, status: 0.
a13677c93a73: read completed: file: '/tmp/test_file.read', status: 0, freeing: 1.
010203040506: destroying node at 0.0.0.0:0, st: 0x8609f50.


[zbr@baccara lib]$ md5sum /tmp/test_file /tmp/test_file.read
d712a7ccfe66a45bef31892befa250f8  /tmp/test_file
d712a7ccfe66a45bef31892befa250f8  /tmp/test_file.read
[zbr@baccara lib]$

Sort of completed - there are couple of other functions (read/write data via pointer and not file) to test, but overall it is done. Although I can not say I understand how boost::python works, and why it may crash or did not compile if I change A to B.

But what do you want, I started to write c++ bindings 2 days ago and wrote first python program today.
After all: program crashes - its time to make alpha release!

To solve 'unsigned char *' and other non-existing types in Python I use this hack:

class elliptics_node_python : public elliptics_node {
	public:
		elliptics_node_python(unsigned long lptr, elliptics_log &l) :
			elliptics_node((unsigned char *)lptr, &l) {};

		void read_file_by_id(unsigned long lid, const char *file, uint64_t offset, uint64_t size) {
			elliptics_node::read_file((unsigned char *)lid, const_cast(file), offset, size);
		}

I.e. transform 'unsigned long' into pointer in C++ code. I wonder why Python calls it 'int' no matter what :)

[zbr@baccara lib]$ cat test.py 
#!/usr/bin/python

from libelliptics_python import *
from array import *

id = array('B')
for x in xrange(0, 20) :
	id.append(x + 1)

trans = array('B')
for x in xrange(0, 20) :
	trans.append(1)

log = elliptics_log_file("/dev/stderr", 10)
n = elliptics_node_python(id.buffer_info()[0], log)

t = elliptics_transform_openssl("sha1")

n.add_transform(t)
# weird thing happens if I write n.add_transform(elliptics_transform_openssl("sha1"))
# we crash somewhere inside c++ binding, probably because I implemented lazy
# reference counting model (i.e. not at all :)
# thus object MUST live after this function is completed
# this should be fixed of course with proper copy constructors
# the same applies to logger actually

n.add_remote("localhost", 1025)

n.write_file(trans.buffer_info()[0], "/tmp/test_file", 0, 0, 0)
n.read_file(trans.buffer_info()[0], "/tmp/test_file.read", 0, 0)

O, poor python

Tagged:  

My study of this language was started today. And while my knowledge base is somewhere between zero and void I already can ask stupid qustions, which were not answered on #python / irc.freenode.net

What python type to use when c++ to python boost::python binding requires 'unsigned char *' parameter?
Code snippet will tell much more:

class test_class : public test_class_base {
	public:
		test_class(const char *path);
		virtual ~test_class();

		virtual void log(const char *msg);
		unsigned char test(unsigned char *ptr) { return *ptr; };
	private:
		std::ofstream *stream;
};

....

	class_ >("test_class", init())
		.def("log", &test_class::log, &test_class_wrap::default_log)
		.def("test", &test_class::test)
	;

I ommitted wrapper class definition, since it is not critical. It works for python-string to c++ 'const char *' transform, but unsigned char pointer fires up an error. I tried both struct.unpack_from("P", array) and array.buffer_info()[0], but python says that they both return 'int' while c++ code expects 'unsigned char *':

id = array('B')
for x in xrange(0, 20) :
	id.append(x)

s = struct.unpack_from("P", id);
print hex(s[0])

t = test_class("/dev/stderr")
print hex(t.test(s[0]))

error:
[zbr@baccara python]$ ./test.py
0x3020100
Traceback (most recent call last):
  File "./test.py", line 19, in 
    print hex(t.test(s[0]))
Boost.Python.ArgumentError: Python argument types in
    test_class.test(test_class, int)
did not match C++ signature:
    test(test_class {lvalue}, unsigned char*)

Code snippets can be also found at http://paste.pocoo.org/show/239085/.

If no solution will be found, I will refactor elliptics python binding to support new classes for missing types like 'elliptics_id' for 'unsigned char *'.

Argh, those fucking C++ and boost::python

Tagged:  

I study C++ effectively for last two days. Well, about 15 years ago I knew it to some degree (without STL, although with templates), but I would not count it now.

That's why my first sexual experience with C++, STL and boost::python pushed me into deep depression.
Any single error and you will get 7-10 pages of compiler errors, which are completely non-understandible for newbie like me. Googling them for 30 minutes I found, for example, that stl::iostream and its children are non-copyable streams. No need to tell that this was not obvious from things like (5 pages of):

class test_class {
  public:
    test_class(const char *path) : stream(path) {};
    std::ofstream stream;
};

In file included from /usr/lib/gcc/i686-redhat-linux/4.4.4/../../../../include/c++/4.4.4/bits/localefwd.h:43,
                 from /usr/lib/gcc/i686-redhat-linux/4.4.4/../../../../include/c++/4.4.4/string:45,
                 from /usr/lib/gcc/i686-redhat-linux/4.4.4/../../../../include/c++/4.4.4/stdexcept:39,
                 from /usr/include/boost/function/function_base.hpp:14,
                 from /usr/include/boost/function/detail/prologue.hpp:17,
                 from /usr/include/boost/function/function_template.hpp:13,
                 from /usr/include/boost/function/detail/maybe_include.hpp:13,
                 from /usr/include/boost/function/function0.hpp:11,
                 from /usr/include/boost/python/errors.hpp:13,
                 from /usr/include/boost/python/handle.hpp:11,
                 from /usr/include/boost/python/args_fwd.hpp:10,
                 from /usr/include/boost/python/args.hpp:10,
                 from /usr/include/boost/python.hpp:11,
                 from eee_python.cpp:16:
/usr/lib/gcc/i686-redhat-linux/4.4.4/../../../../include/c++/4.4.4/bits/ios_base.h: In copy constructor
‘std::basic_ios<char, std::char_traits<char> >::basic_ios(const std::basic_ios<char, std::char_traits<char> >&)’:
/usr/lib/gcc/i686-redhat-linux/4.4.4/../../../../include/c++/4.4.4/iosfwd:47:   instantiated from
‘boost::python::objects::value_holder::value_holder(PyObject*, A0)
[with A0 = boost::reference_wrapper<const test_class>, Value = test_class]’

I stopped to count such shit error-reporting pages when I tried to export virtual functions via multi-level inheritance through boost::python. ALthough it was quite simple in tutorial...

And yet after 5 A.M. I fucking won:

[zbr@baccara lib]$ python
Python 2.6.2 (r262:71600, Jun  4 2010, 18:28:04)
[GCC 4.4.3 20100127 (Red Hat 4.4.3-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from libelliptics_python import elliptics_log_file
>>> w = elliptics_log_file("/dev/stderr", 31)
>>> w.log(31, "qweqweqe\n")
qweqweqe

Tomorrow I will complete Python bindings and happily hopefully forget C++ for another 15 years.

Elliptics network blob IO backend test: small number of SATA disks

Tagged:  

Our generic system contains just 4 sata disks combined into software RAID10. Putting BLOB IO backend to ext4 fs and installing two nodes, we got quite surprising skyrocketed results:


2 sata storages, each contains 4-disks software RAID10

There are 10 millions of records, total of 87 Gb. Each node contains about 44 Gb of data, and it has 24 Gb of RAM. Although we flush caches prior each test run, readahead games quickly suck blob files back into ram, which I believe explains such results: 700 rps (of completely random IO) within 200 ms, 1000 rps within 300 ms.

I wonder why 4-disks SATA setup is close to 16-disks SAS storage. Looks like raid10 requires a serious tuning for larger storages, otherwise I can not explain such major hardware difference and quite similar performance numbers.

Elliptics network bindings: C++

Tagged:  

It took a little bit more time than we expected: it tends that people suddenly get some other tasks and of course with higher priority.

So, I decided to write thinkgs up myself. To implement python bindings I will use boost::python, but first I have to wrap common elliptics network operations into proper classes.
That's what I ended up with today:

int main()
{
	unsigned char id[DNET_ID_SIZE];
	elliptics_transform_openssl t("sha1");
	elliptics_log_file log("/dev/stderr", 15);

	memset(id, 1, DNET_ID_SIZE);

	elliptics_node n(id, &log);

	n.add_remote("devfs8", 1025, AF_INET);
	n.add_transform(t);
#if 1
	elliptics_callback_io callback(&log);

	memset(id, 0xff, DNET_ID_SIZE);
	n.read_data(id, 0, 0, callback);
#endif
	n.write_file(id, const_cast<char *>("/tmp/some_file.txt.bak"), 0, 0, 0);
	n.read_file(id, const_cast<char *>("/tmp/some_file.txt"), 0, 0);

	n.read_file(reinterpret_cast<void *>(const_cast<char *>("1.xml")), 5,
			const_cast<char *>("/tmp/1.xml.cpp"), 0, 0);

	/* cool, yeah? we have to wait for read_data() to complete actually */
	sleep(10);
}

Most of them can throw numeric exceptions. There are less than a dozen of classes, which I will put into proper boost::python wrappers.

And in a meantime it is of course possible to use them in c++ code. Current version can be found in git.

Why DHTs suck or do they?

Tagged:  

DHT is virtually the same hash table spread over multiple nodes. Thus it shares its advantages, like O(1) access when properly configured, and its weak parts, like full absence of scalability.

Yes, hash table does not scale. Because to change its size one has to perform lots of tricks which are generally end up in a full table content rehash. With millions of entries on disks this may take a while...

Thus we can not easily extend hash table storage - each node addition will require table rebuild. Depending on DHT configuration and routing protocol used, it can be full or partial content copying.

Contrary to hash table, classical distributed storage with dedicated master server is able to scale without need to copy data each time new server is added. Master server will return this new node's address for each new write, then next node and so on.

This looks shine and well in theory. Let's face the practice.

Data tends to wear down with time - we do not lok at old photos and do not read old mails as frequently as access recently written information. Thus servers which host old data will not be loaded compared to the new nodes in the described master server example.
To fix this issue master server has to start data copying - old servers move their old unaccesible data to some new nodes, and new writes can go to old nodes too. This task is full of non-trivial heuristics about what data to move and what server to use. With time it ends up with data copy for each new write (or usually sufficiently large chunk write).

In distributed hash table this copy is not needed, since by design writes are balanced across the whole storage (when cryptographically strong hash function and storage size are configured of course).
In DHT there is no problem of data wearing, since each write goes to (kind of) random node, thus all nodes will contain roughly the same amount of old and new data.

Drawing the line, in DHT we have to copy data when new node is added to take some load from the neighbours, while in master-server scenario we have to copy each time new data is added (this can be limited to large chunks of course). In theory master-server scenario will copy data to the distribution DHT provides out of the box, and still will have to copy again when new nodes added.

In some cases this is not an issue - we may want not to start data redistribution in master-server storage because of some reasons, or may not foresee that this demand will appear though, while need for data copy in DHT when new node is added is a must - otherwise data will not be accessible because of changed hash distribution over nodes.

This two cases frequently (if not all the time :) becomes the most significant corner cases when DHT is not selected to be used for distributed storage.

Elliptics BLOB IO backend testing: single request - single read

Tagged:  

Previous elliptics test showed how good (or bad) is append-only BLOB backend, when HTTP proxy issued two reads to handle single client's request.
Namely it fetched transaction history log to find out which stored transaction has the same version as client asks.

Now let's see how well we behave when single client request results in single data read from the blob.


700 rps witin 100 ms, 900 rps within 300 ms

Surprisingly absolute numbers did not change - we still fit 1000 rps within 300 ms, which is rather unacceptible for single client. But at the beginning we are about 2 times faster than described 2-reads case: we handle 700 rps within 100 ms range.

Testing etup is the same as in previus test: 2 SAS storages attached, each has 16 disks in it. Ext4 over software RAID10. 2.6.34 kernel. Random requests. 10 millions of records (about 87 Gb total, 44 on each SAS storage).

We also ran the same test but moved storage blob to single SAS storage. Also moved it to block device directly instead of using usual file on ext4 filesystem. Results were 2 times degraded as expected: like about 600 rps within 200 ms.

There was no difference whether block device or ext4 was used as low-level storage.

In a meantime blob IO backend got loading index support to speedup its startup. The only missing feature is index truncation aka ability to compact and remove deleted entries. When this is done, I will start POHMELFS - POSIX frontend to elliptics network.
Its initial implementation will not be performance centered as well as feature-rich, instead I will create a rather simple client, which will allow trivial deployment procedure.

And I have to start writing elliptics network paper for Linux Kongress. This may take a while though...
Stay tuned!

Initial BLOB IO backend implementation in elliptics network

Tagged:  

Elliptics network is a quite modular distributed hash table, which allows to implement and build-in different IO backends, enabled via config. IO backend is a quite simple entity which just stores data to media and allows to read it back using provided tranaction ID.

BLOB IO backend is a yet trivial append-only array of variable-sized elements, which are stored one after another on disk. Each entry's offset is stored in hash table in RAM, indexed by transaction IDs. Currently we do not even support ID index - to create this hash table in memory initialization code runs over whole file and jumps from entry to entry. With 10 millions of entries stored on single node (about 44 Gb of data) this takes about 8-9 minutes to initialize, so it is likely a good idea to implement external index.
Hash table is neven swapped to disk if configured to be locked in RAM.

Thus to get an object we ask hash table about object offset and read it directly from the storage (send it via sendfile() actually). In theory it should be noticebly faster than filesystem IO backend, where each object is stored in separate file.

Let's see raw results for 2 sas storages (each one contains 16 disks), about 10 millions of data objects (total of about 30 millions, since we have 2 additional history objects for each data one). To handle single request we have to read two objects from disk: one history (parse it and get ID for selected version) and data (with the ID read previously). It is possible to disable versioning and get data via single disk read of course.

Filesystem is default ext4 on 2.6.34 kernel. Machine has about 24 Gb of RAM. Random requests.


600 rps witin 200 ms, 1000 rps within 300 ms

And compare BLOB to filesystem IO backends.

Clearly blob is about 2 times faster than filesystem at the beginning (green one is blob, violet is file-per-object aka filesystem backend), but with time they become equal, likely because of filled hardware queues.

Although I play some simple tricks with read-ahead in blob backend, I still want to test with data stored in raw block device, thus eliminating potential FS overhead, although ext4 has extents it still may require multuple seeks to read different blocks in single file. Also direct-io case can be useful too.

Main problem with blobs is object removal support. While in common web scenarios we can just mark object as removed and drop it from index not even trying to 'squeeze' blob file, in a real life some external application (or IO backend itself triggered by timeout or whatever else) should be able to compact blobs.
Back to drawing board...

Elliptics network: 2.9.0 release

Tagged:  

This is a rather small release (actually it was made several days ago, but I postponed announcement to allow binding poll to stick on top) - it does not even break library or API at all.

Instead it changes IO server and its backends to use config file insted of zillions command line options.
Main purpose of this step was not to simplify deployment life actually, but instead to make a ground for the further extensions: namely automatic network topology configuration and new single-seek backend.

Currently elliptics network servers are required to be properly configured prior cloud join - they have to have unique IDs, which split address space according to administration policy. But all the time it is just as simple as spread IDs according to node's disk space (the bigger disk space is the larger ID set it covers). And in the common case of the same nodes IDs should be equally spread over covered address space.

This will be made automatic to reduce configuration only to network address selection. It should also alow to reconfigure storage on demand, for example when new nodes added or removed.

As of new IO backend, I plan to implement a rather trivial low-level storage, which will operate with huge blobs of data. With some (common) usage cases it is supposed to perform only single seek to get data by its index. Main users will be storages with lots (tens of millions) of rather small objects. Classical databases like TokyoCabinet IO backend do not work here, since even for several millions of objects it starts paging out which drops it down to floor immediately, which is unacceptible.

File IO backend is much worse in this scenario. For example having about 30 millions of objects (about tens of KB each) with versions (i.e. to get single object we have to read two times from disk) loaded into SAS and SATA 16-disk raid10 machines, we got following numbers (random objects are fetched):


SAS raid10 array (2 storages of 16 disks each)
Got 800 rps within 300 ms


SATA raid10 array (2 storages of 16 disks each)
Got 200 rps within 200 ms

Which in SATA case roughly corresponds to 20ms per seek and we make 2 seeks to get an object (4 seeks, which 2 times decreases performance, since we have to read two files from the disk when versions are used). With its uber-large NCQ depth of 32 this is the end of the story: 400 rps from such setup.

SAS was a little bit (3-4 times) faster - 800 rps within 300 ms and 600 rps within 200 ms.
Again, to handle a single request we have to read two times from the storage (each one in turn resulted in multiple seeks) - one to get version information, and another one to get object with some version itself.

With the new IO backend I believe we can suddenly increase performance. By the factor of 2.
But so far it is a speculation only, let's first implement the idea...

What external bindings should we implement for elliptics network API?

Tagged:  

Pressen frau, trinken bier

Tagged:  

Dear Evgeniy,
Congratulations! The program committee has finished their work and is glad to tell you that your submission for a /Refereed Paper/ with the title

"Elliptics network - a distributed hash table, design and implementation"

was accepted!

Wanna chat? I will show up a small presentation for Linux Kongress this year. Back to writing table now... its time to write some bits.

Metadata server or distributed hash? LRU or weighted caching?

Tagged:  

The former is actually quite religious question, but still it has a fair amount of technical background to talk about.

Main pros of the dedicated metadata server is its incredible flexibility and control. To determine where given object lives we ask special server which can perform whatever we want to make the answer: system can check permissions, locate the least loaded server, update centralized statistics or contact oracle and notify external entities.
Thus metadata server becomes a complex database which is too hard to replicate and generally maintain in consistent state with its copies. And we do need another copies, since every server fails and dedicated metadata one fails too. In some practical scenarious they fail even more frequently than storage ones, although quite contrary I have example where they never fail during several years of maintenance and everyday access.

But no matter what, generally we want to replicate metadata server and preferably to implement master-master operation mode to unload single entry of failure. To date I do not know production-quality master-master replication solution neither in free nor in proprietary world. And by production I mean hundreds of millions of records with millions of records updated/created per day with physically separated datacenters with flaky link between.

A very elegant solution for metadata servers is ... full absence of them. Distributed hash storage is one of them. But it only solves access problem - client can determine needed server itself, but we still have to implement control access, statistics and notifications somewhere. If we put more complex logic into the storage, unflexibility of the central control point absence becomes even more visible.

One such problem is caching. When some object is popular we want to put it to faster media to satisfy increased access rate. For example we can create multiple copies and distribute clients between them.
With metadata server this is a trivial case - we just update appropriate database record, so that connected client could get random (or preferable, doesn't matter) path to the data object. The more popular content is the more copies we put into the storage and update metadata database.

With the distributed storage without central metadata server we have to perform a full lookup to determine whether given object is present in the storage or not. Which means we have to contact remote server and try to read some data from its media. This slows things down noticebly, especially for non-popular content which does not have hundred of cached copies.

A simple solution is to move a control entity from the storage to higher levels. In case of cache it means some external storage, which will contact low-level one only when requested objects are not in the cache. Depending on the cache implementation this can be very cheap price for the access problems.

Thus we build a layerd system where DHT is a low-level storage.

Actually this solution will also work for metadata server too, except that for some workloads classical LRU caches do not work. Thus squid or page cache should not be used for them, and metadata server solution wins again.

But what wins and especially what is needed not only for distributed metadata-free storages is cache with content weights - the more frequently given object is requested the more time it will live in cache even if currently it is not requested. Last access time does not work in this case.

To my shame I do not know such cache systems - very popular memcached and squid do not support this iirc. And they do not allow to distribute cache content among multiple nodes. Memcached actually has quite nice frontends, which can form DHT, but still pure memcached lacks some features.

Plan has been plotted by itself...

Elliptics network release: 2.8.1

Tagged:  

This is a rather small update which brings together various recovery tools bug fixes as well as debug cleanups.
I believe this will be the last 2.8 elliptics network release, since I plan to change server code to drop command line arguments in favour of config file, since this breaks backward compatibility I believe we will change version to 2.9 cycle.

Actually main reason I want to introduce config files (besides zillion command line options mess) is to move further in autoconfiguration land. This will require some options which are not quite fit into command-line interface, or actually into its getopt() parser.

And the main goal is to implement full server node autoconfiguration. Basically we need some simple enough protocol to gather node usage statistics (we have this already via stat command) and a policy engine, which will run over all nodes in the cloud, collect their statistical data and determine ID of the joining node according to loaded heuristics. The main (and to date the only) one is to select joining node ID, that it could get half of the load from the neighbour node. Extension of this mode is to change IDs of all nodes so that data could be spread equally over all nodes.

So according to config options joining node will connect to all nodes, get their storage statistics and IDs and calculate its own ID.

WIth this changes completed we can simplify and automate storage reconfiguration, which basically will not require administrators to perform any steps to add or remove server nodes to/from the cloud.
Automation will not be full though - elliptics network maintains eventual consistency model, thus after new node has been added or removed whole storage does not start reconfiguration/recovery process. This is external step which is not started by the storage itself. Thus when node is removed some data will not be accesible until recovery tools completed its work. And in some cases we want to postpone recovery... This will be administration step, but at least knowledge of very low-level configuration will be taken off admin shoulders.

This is not a trivial task though - we support so called virtual datacenters, when storage nodes are groupped into smaller clouds, usually physically separated, to guarantee that different data copies live in different storage node groups. Dumb storage check can determine that node being added should live in some other virtual datacenter than that admins wanted it to be added, but it is a technical details of the process, which I believe can be easily fixed.

New elliptics version and new features.

Tagged:  

I've just released a shiny new 2.8.0 elliptics network version.

It breaks compatibility with rather short-living 2.7 versions, but it adds a quite optimized metadata storage. Well, 2.7 was only publically a short release, but actually it was developed during the last couple of month and we fixed fair number of bugs and added zillions of tasty features on request during that cycle.

So, what's with the metadata storage? Now transaction log becomes just another metadata object along with parent object name and list of transformation functions. The latter two allow to automatically recover missed copies, while transaction log allows to recover simultaneous updates which happend in parallel in network split scenario. As a bonus append-only operation mode allows to create snapshots of the old data.

Now we solely concentrate on testing and integration process, which in turn may or may not force to rethink some cases.

The main one is versioning right now. Elliptics network allows to have multiple copies of the same object with different versions, since basically each new version is just another transaction written into the referenced object. Since object may have multiple versions, this information must be somewhere stored to allow client to select version he likes more, or to build more complex policies, like returning the first version which is less than missed requested one.

By default we store this information in the history log for object. So we can download version 13 for 'test.xml' and version 23 for 'qwerty.jpg'. But when we suddenly upload several tens of millions of objects into each node in the cluster and start randomly select files and their versions, things become a little bit dizzy - we have to read two objects from the disk to complete client's request.

Read again: two objects from the disk to handle a single request. TWO objects from the fucking slow raid under elder FS which basically laugh at me trying to dance with the freaking tambourine getting another RPS from it.

Yes, we greatly extended functionality and made some people's life damn shine and lightweight, but that flexibility costs us half of the performance. And that's already not that shine...

Well, we can setup two times more machines - it is a matter of dozens and not hundreds for now, but this is not a Jedy way I think. But still we have a very nice functionality, and performance issues are yet to think about in the future when it will hit us (well, I already started to think, since if it doesn't hit us now, it will do this tomorrow, and tomorrow I plan to have a rest watching how good it is ;)

To date that's all news, stay tuned!

Elliptics network got extended statistics support

Tagged:  

We are able to fetch per-client counters of successful and failed commands executed. Thus we can determine in real-time if some client got too much of a 'bandwidth' for given node. Currently we do not have ability to limit clients, but we can implement prioritization technique based on socket processing state machine.

I.e. when high and low priority clients send their commands, we can first try to process the higher one. in theory it should be a good feature to have, but to date we do not have a practical problem associated with this idea, so I will postpone it to some future date.

What we really want to have is ability to read multiple objects and select obly those transactions which were successfully stored on multiple nodes. I.e. if it happend that transaction was updated only on single node, we could read previous one instead.

This is rather simple task to be implemented on client using our API, but HTTP frontend does not support this yet - we return only the latest transaction. We already have upload 'limit', i.e. when number of copies written is less than preconfigured number we fail upload (we do not remove written copies though), but reading uses only single node to get data from.

This is a task for future date also - albeit being a very nice to have we do not have practical requirement for this though, so it is also postponed.

Effectively elliptics network entered testing mode and if things go smooth there is a fair number of projects to put it into real hardcore mode!

New major elliptics network release

Tagged:  

This is a completely new release which effectively draws the line of elliptics network project development, after which I start (maybe only temporarily) thinking about something new and bright future of maintenance mode for elliptics :)

With this new version our distributed hash table storage got ability to automatically recover data when nodes go offline. Elliptics network supports eventual consistency model, when recover process is distributed in time and may take a while to complete, whilst alive nodes still perform their usual IO operations and serve clients.

There are two main types of data recovery: transaction log merge and missed copy recovery.
The former is used when during network split the same object was updated in different parts of the storage. By default we merge transactions by timestamps, which works great for systems which prefer to take the latest data as only valid or when simple append-like/whole-overwrite structure update is used.
But in some complex cases timestamp based merge is not enough, so we provide a way to use external library callbacks. For example it can use merge-by-content or other algorithms.

Merge is performed on a per-node basis.

Second type of recovery process is maintenance of the desired number of copies of some object in the storage. When node goes offline we lose some objects, which can lead to the total storage degradataion with time, when no copies of uploaded object exist in the cloud. This can also happen when data is silently corrupted on the nodes.

Recovery consists of two steps: recovery of the object using its name and set of transformation functions used to create different IDs for different copies, and log file creation where each entry corresponds to the object to be checked.

Recovery tool parses input log file (in simple text format) and checks whether all copies of given object are present in the storage. If some objects are missing, it downloads existing transactions and reupload them with missing IDs thus making object copy accessible again.

Log file can be either manually created for some objects to be checked or can be created on per-node basis autoamatically. The former case only works when object was uploaded with appropriate metadata associated with it, namely it contains transformation functions and object name. Appropriate application parses all metadata stored on the node and creates log file suitable for recovery tool to process.

Currently only HTTP fcgi frontend uses metadata creation during data upload.

With all those changes I believe elliptics network is close to its finish line. Our distributed hash table already supports data redundancy and recovery process, thus tolerating faults to desired degree. Its list of featuers includes automatic rerouting/reconnectino to failed nodes, O(1) lookups, flexible frontends and modular IO backends, versioning and virtual datacenters and many other features.

There is a small set of features we still want to add into elliptics, like extended statistics, history log cutting tool and so on, but their priority is rather low to date. We start testing it in hard environment (multiple nodes, where each one contains tens of millions of objects total of multiple terabytes of data) and eventually push similar production processes to it.
And in a meantime I will take some tee...

Metadata and automatic recovery in elliptics network

Tagged:  

I added a simple enough patch to return listing of stored objects and their types into elliptics IO backends. Now our distributed hash table can easily return information about what given object is.

This can be used by automatic recovery tools to get information about how some object must be saved in the storage. Namely what was the name of the original parent object and what transformation functions were used to generate our transactions. Having this information we can determine, for example, how many copies are supposed to live in the storage, and if number of then differs, we could make things right, for example recovery missed copies.

Currently transaction logs merging tool and copies checker are a bit ugly and are more prototype-like, but it is not a very high priority to change its architecture right now. Instead I would like to think more on how data is stored in elliptics network.

We have a plugin-like way to add IO backends, which actually store data in the way it wants. For example file IO backend stores all data as separate files. Since elliptics network is a distributed hash table, it is supposed that there are no some dedicated servers to store some or another information, instead everything is a data, and only client knows what is stored and how IDs are generated.
Elliptics network is append-only (by default) storage, thus for every new update there will be a bunch of transaction created (its number depends on number of copies and the way data is stored), which in theory will grow up infinitely.

Recovery tool must know about every transaction ever stored, thus it will have to ask storage node to return whole list of objects stored in it. Its processing may take ages, and this is what bothers me alot.

Let's consider a rather medium system we are about to start in production. It contains upto to tens of millions of objects on each node in the storage, where each one is several Kb 'long'. Only listing of this storage takes minutes on ext3/ext4 (I will put recent reading benchmark of elliptics vs plain fs access from HTTP server in our production system if interested, there are multiple filesystems as well as elliptics mode usages), and then recovery tool will have to ask storage about every object and all its copies.

Another bothering issue is node removal. When some server goes offline, others do not actually know what information was stored there. Without special place to store such data every node must scan its local content and check whether all objects stored are actually present in the whole storage with requested number of copies.

If we could know that given host contains transactions for objects A and B, we could only check them when it goes offlline, without need to check all other millions. Unfortunately with plain DHT we do not have such information.

It is possible of course to use low-level elliptics network API and put this data somewhere in the storage, but by default (and that's what everyone actually uses) we do not perform this step. This is another issue to think about.

And in a meantime I'm about to implement a simple tool to get a listing of all stored objects in a given node, which can be used by recovery applications to process. This will draw the line and new elliptics network version will see the light.

Elliptics metadata implementation

Tagged:  

In a meantime our distributed hash table got metadata support. Essentially nothing major happened - metadata is present like usual data transactions, which are logged in the appropriate history files, which IDs are generated either by context hashing (for metadata itself) or specially crafted string to identify object it was generated for.

So, when we upload a data transaction for object '/etc/passwd' system can also create metadata object, which is being built like hash from '/etc/passwd\0meta' data. I use such string extension to simplify POSIX frontend aka POHMEL filesystem, which will only work with plain strings just like usual filesystems.

Metadata objects are not supposed to be updated partially, thus client downloads whole transaction the same way Linux does for inodes, although it could only need a timestamp. By default they are quite small objects, so it should not be a problem.

Currently only HTTP frontend supports metadata creation, and it puts hash function names into the storage. Recovery process is supposed to read it and determine how many copies should be present in the storage for given data transaction. If number of copies is less than requested it will sync their content.
Recovery application does not yet work with metadata. I plan not to force every write to put metadata update, namely I will not extend existing file download/upload tools to work with it, but HTTP an POSIX frontends will do it by default. Also appropriate functions are exported from the elliptics network library to be used when needed.

Effectively this is the first step to fully automatic data consistency support, and when it is ready, we can draw the line. I believe this is the last feature we do not yet support out of the box to be able to have a fair name of distributed hash table.

Stay tuned!

Elliptics network: snapshots and versioning examples

Tagged:  

Elliptics network is a distributed hash table storage with quite a few features, including replication, recovery, transactions and log merges. But it is also a very flexible one.

We are in a middle of negotiations with people who want to use elliptics cloud storage with data files, which are supposed to have multiple versions. As a simple and rather dumb idea it is achievable using different transformation functions, where each one corresponds to new version.

But since elliptics network uses transactions, we are able to do much more clever and flexible solution. It is simply based on the fact that we can manually select needed transaction from the update history log. Thus each update transaction will correspond to the new version of the object. We can implement different policies on how version is embedded in transaction ID, but in the simplest case we just overwrite some bytes there.

This was implemented for the HTTP frontend - it can parse URI and if there is version parameter, it will be used to properly update remote object. It is possible to post and get data with needed version as well as remove only requested one.

And as a simple exercise implemented Last-Modified header support for the HTTP frontend. It returns modification time either for requested version of the object or transction log itself. When data was uploaded without version, it will match its modification time.

And NOW I can concentrate on automatic data recovery (instead of kind-of manuall log based one) and initially to think about simple policy to properly put metadata into the storage.

Elliptics network metadata support: take 2

Tagged:  

Elliptics network is a distributed hash table storage with ability to maintain needed redundancy level and automatically support recovery process. It requires some external help for this though, namely to have a log of written and to be checked objects. System can automatically (in timely manner for example) check stored transactions and merge appropriate update logs if needed.

But since we do not have metadata yet, consistency checks can not be made fully autonomous.

In a meantime I fully reverted all changes made for separate metadata support. Those changes are gone now and I decided to put metadata into separate objects the same way data transactions are stored in the storage.

Which basically means that all merge and consistency support will work out of the box for metadata updates.

To solve unlink problem I currently disabled possibility to physically remove data from the storage. Instead a special transaction log entry is appended which marks given object as removed. I will create a special application according to classical Unix way methodology, which will force storage IO backends to remove transactions from the storage.
With this change we can suffer and recover from parallel update and removal during network split as long as timestamps were generated correctly. Since merge application will recover correct history for specified object.

Saving metadata as plain data is a simple task, so I can add a few-lines-long API function to write something somewhere, but main idea of metadata is a rarely modified place to hold some meaningful data.

Namely I need object name and transformation functions used to generate object ID to be stored somewhere. Since currently consistency maintaining application has to have external log of the object names to be checked. When we will be able to extract names (or other IDs) from metadata this step can be fully automated, since there will be no need to have external log used by admins when they detect failed nodes.

But this is not that simple task. We do not have ID->name mapping at all. There is a back reference in each transaction, which shows ID of the object it belongs to. But how to convert that parent ID into metadata ID?

We can not just mess with hash functions here, since we do not know how transformation functions work. If it is a cryptographically strong hash we can reverse it for example, or add some constant or perform any other fancy stuff to get new ID which will store metadata for original object. But if ID was generated by external application like database index? We can not change it, since that value pretty likely will be reused by other update.

One of the solutions is to only have a back reference from metadata object to data parent one. So we will have, for example of course, two objects named sha1(some_string_or_anything_else) and sha1(some_string_or_anything_else\0metadata) and the latter will have its origin ID to be set to the first ID. It will also have a special flag in history log that it is a metadata object and it can be read by the external application to find out how original object was created.

So, to date plan is to write such automatic metadata update when we perform initial object creation. And create two application: to remove object as described and to scan local storage for metadata objects and perform consistency checks according to data stored there.

And then I will fully switch over to POHMELFS!

Stay tuned.

Syndicate content