Elliptics

Elliptics distributed network

Week and weekend

Tagged:  

New weekend is coming, and it there is a new snow in Moscow.

But unfortunately I'm unlikely to move to the 'mountains' this weekend - I damaged my leg next week so that even walked three days with a crutche. It was a bit warm day, although it was lower than -10C early morning when I moved to the ski resort.
But on the slope weather was noticebly hotter - about 0 degrees Centigrade or so and quite moist. So that my new skis felt quite uncotrollable in the high and middle stand. But when sit rather low I was able to control skis at quite high speeds, although this requre substantial muscle efforts.

I managed to film a small porn video on how I ski over the red trace in Stepanovo. Phone in left hand is not the best way to fight for Oscar, but it was fun. There is completely no feel of speed, although it was substantial for me at least - more than 40 km/h (about 11 meters per second). Calculated by dividing trace length by moving time, so effectively it does not take into accout arc length, which I prefer to make small to medium.

On such speeds I manage to outrun many of the skiers and almost all snowboarders. But since I have essentially no technique (I moved to outdoor traces three times, each time I spent about 3-5 hours on the slope), it is likely that I move quite wrong. And this can explain problems I sometimes get during the movement on the slope.

Add here weather and wet snow and result is quite simple: I fall. I do not care about that until I feel the pain longer than a day or so. And this week was my first time when pain was that strong and long.
I managed to outrun some other boarder and was not able to control skis, so fell and flew several meters away from the trace breaking the boarding :)

well, it was quite simple to break that bearding net, but there was a noticebly gap out of the trace, where I moved several meters crawling over the snow. leg did not hurt that much on the trace, but when I moved home pain started to show up.

Currently I feel mostly ok, although play table tennis quite slowly and can not move without slight lameness. Well, recently I moved with a crutch only :)
So, things are getting better.

In a meantime I added fair number of tasty things into elliptics network project, namely broke addressing storage model - now each node stores IDs which are greater than node's ID. This breaks compatibility but allows simple human understanding of how objects are spread over the storage.
Also implemented random transformation function selection for read IO requests in fastcgi frontend, now we can balance erading among multiple data copies. Dropped BerkeleyDB support - Tokyo Cabinet performs way faster, so I do not see any reason to support both. Made a big step towards completed merge support, I expect it to be finished very soon, which will be the first 2.7.x release - there is a fair number of changes accumulated already.

And as a tasty project to warm up the brain I decided to implement a rhyme generator based on Levenstein-Damerau distance and sound-syllable similarity algorithm. It was not formalized even in my head yet, but it is interesting thing to think about.

Also managed to win a judgement against development company which built my house (without judge and defendant though). I'm quite close to finally get property rights on my appartments and to sold it for good. I believe its time to make living place wider.

So far so good. Stay tuned!

Elliptics changes

Tagged:  

They are quite dramatical, but are very small yet - I committed search protocol changes. Now node stores transactions with IDs greater or equal than node's ID (it stored smaller or equal IDs previously), which is incompatible with current node searching, but allows to maintain human readable and logical (for humans) ID generation.

So, when node has ID, say, 0100..., it will host data transactions, which start from 01 (its the highest byte). It is much more convenient to configure nodes with this in mind, than to calculate what is less than 01, namely FF... IDs.

I also committed initial metadata support, but neither low level IO backend supports that yet, and I will leave only Tokyo Cabinet DB and file backends, BerkeleyDB support will be dropped, because of its slowliness. It is still in a development stage, since there is no clear vision on where this functionality should live - client or server.
I.e. it is possible that client will tell that it wants to insert metadata X into given object, and server will read/modify/write metadata blob itself, or it is possible that client will download whole metadata blob, update it locally and then write it back to server, which will replace old one with the new data. Likely I will use the former case, since it simplified client development, which should be a higher priority than server simplification.

We also found an interesting bug or feature of the storage - in some cases it is not possible to remove object, it will be recovered from the dead. Let's say we have two object copies and one node was turned off. Automatic recovery (not present yet though) will create another copy from the first one on alive nodes. Subsequent object removal will kill both copies on running nodes. When turned off node goes online again, autoamtic recovery tool will resurrect removed object from the copy presented on this node.

To date it is all a pure theory, since there is no separate metadata in the storage, thus no automatic recovery (admin should run special tool with properly crafted log file currently) and it does not remove objects from the storage. But still, described problem will hit us badly when we will actively use it.
And while there is no merge implemented either (it is kind of being materialized in my mind while we talk), solution will involve new history entry creation instead of actual data removal. Thus transaction log will contain a note that given object was removed. In case of network split and parallel object removal and update in different parts (which can not contact each other during this event) of the storage, this will also allow to implement correct and complete transaction history log by synchronization daemon.

Thus object will never be deleted from the storage, and instead its history will be updated to store a note about its status. File system checker will be extended to support a mode, when it will actually remove objects from the storage after they were marked (and resolved during merge with other logs if needed) as deleted after some timeout, which should be big enough to eliminate such ghost nodes appearence.

And the last but not least discussed issue concerns storage size and related limitations. Let's say that we reached our current storage capacity and want to add several another machines, which will add 50% of the current volume. We want to spread data equally between all nodes, thus we will need to update every node's ID to shift it a little, so that new nodes entered addressing ring and formed a fair ID distribution. Amount of transaction copies in this case is quite large - more than a half of all data will have to be transferred over the network, which will take a while.
Also, when we add new empty node into the storage, it will kind of hide data it is supposed to host (according to ID distribution) until it is copied to the new node from the neighbour. Thus there should be a poilicy, which will forbid simultaneous update of all servers, since there is a possibility that suddenly all added nodes hide all copies of some objects. It will be recovered of course, but it will take some time, which in some cases is not appropriate.

One of the solutions for the described storage size issue is different storage policy. We can implement multiple virtual datacenters, where each new virtual datacenter corresponds to newly added set of machines. In this case we will extend write application so that it could 'touch' old hash functions (and thus old virtual datacenters) first to determine whether it can store data there and move to the new machines if there is no space in the old ones. Reading can issue a parallel lookup to all virtual datacenters asking for given object ID.

This scheme has latency limitations as well as network traffic growing with new virtual datacenters involved, but it can be a good decision for smaller setups though.

Virtual datacenters (or configurable hash/transformation functions used to generate transaction ID) becomes one of the most flexible 'tools' to implement different storage setups.

Stay tuned, there will be more news soon!

Elliptics network background fsck

Tagged:  

Its original draft could be read previously, but I believe it became a little bit outdated, so requires some highlighting.

But first, let's clear the status of fsck log checker. I completed its implementation, which is now capable of supporting consistent number of copies in the storage. It does not allow to merge different transaction logs yet.

To determine object to check it uses special text log file, which among other info contains name of the object and transformation functions to work with. Each transformation function will produce unique ID, which will be checked in the storage. For example we can put there sha1 and md5 transformation functions, so we will have two IDs equal to appropriate hash of the input name (and optionally hash of the transactions content).

When some objects are not presented in the storage, checker will download first existing copy and try to upload it using transformation functions corresponding to missing objects. So, if object with ID being equal to md5(name) is present and sha1(name) isn't, then checker will download all transactions stored in the existing object and upload them using sha1 transformation, thus recovering requested number of copies.

Checker currently requires log file to get information from and admin to start the process.
Background fsck is supposed to eliminate both needs.

Basic idea is to store some metadata with each object, which will tell origin of the given object and how it was supposed to be stored in the elliptics network. Thus we can timely or on request parse metadata for all objects in the given node (or only part of them), create a log file and run existing checker against it.

It becomes similar to what extended attributes are in the existing filesystems. Metadata can contain information not only about what object is, but also its IO permissions or access policies, owner information and anything else we would like to have there, which will allow to implement at least basic security model for elliptics network as well as simplify POHMELFS port.

Elliptics network: 2.6.4 release

Tagged:  

It took a while to prepare a new release of the distributed hash table storage elliptics network, but here we go. This is still a minor version bump, although amount of changes is rather large for small update.

Likely this will be the last releae in 2.6 release cycle, since in parallel we are cooking up a completely new versioning and merge logic as well as data synchronization. Btw, this release breaks to some degree that logic, but there is a tool to fix things up. It will be automated in the next versions.

But let's dig into details and changelog:

  • Data integrity checker. Although a little bit undocumented (see example below), it allows to check whether given object is present in the storage with requested number of its copies. And if number of found objects does not correspond to config, it will automatically download and upload data with the desired IDs. Later this tool will also be able to upload data into the storage. This checker will be a base for background FSCK, which will be a simple script, which will parse metadata and start checker with given log. It also supports external library call for requests merge.
  • [FCGI frontend]: cookie, timeouts, tunable headers, variable content types, more and clean XML.
  • Rewritten network state and reconnection logic. This makes NATed box support trivial (we do support it), client nodes became even simpler than ever, less code, less bugs, everyone is happy.
  • Debian debug package.
  • Fair number of bug fixes. This version is used in production, if time permits I will describe this load in details later.

Modulo possible bugs, main work is concentrated on the filesystem checker. There are two problems to solve.
The first one is absence of transaction log made by requested transformation function, or in plain words - absence of copy of the object in the storage. This happens when some node went offline and returned empty or was replaced. Or did not return at all. In this case fsck application will check how many copies are present in the storage and automatially download one of them (the first one from config) and upload with given ID.

Second issue to resolve is transaction merge. Elliptics network by default uses transactions for every update, so there is no object as is in the storage, instead reader will download transaction log, parse it and select transactions which cover requested object range. It is hidden in API of course, but it is possible to manually select needed transactions, for example to support versioning and data snapshots. As tasty effect two fully equal transactions (objects) will not use two times more space, since there are appopriate transaction reference counters.

Currently there are multiple (5) merge strategies, but practice shows that they introduce more harm or misunderstanding at best, than actual goodness. So I decided to drop them all in favour of trivial timestamp based merge algorithm. Of course it is possible to merge transactions based on private algorithm, which can be called from fsck daemon. We have request to allow external modules to merge objects based on actual data.

This version disables content synchronization during node joining. Instead admin has to call fsck application with externally stored log of the uploaded data to check whether things are ok and fixup what was broken. It will be automated and no external log will be required in the next versions.

Fsck application log file should look like this:

3 0,0,0 sha1,md5 object_name

where '3' is object creation flags - without transactions, just like those created by FSCK frontend. Will be removed in the next version.
'0,0,0' is a placeholder for object parsing information meaning start,end,update_existing. Start and end are positions of the starting and ending symbol in the object_name used to generate ID. Zeroes mean automatic detection. Update_existing is not currently supported, in the next version if set will upload local file named object_name into the storage no matter if its copies are already present.
sha1,md5 - transformation functions used to generate ID from object_name. This setup uses two copies - each one created by appropriate hash.
object_name - name of the uploaded object. Its hash (or actually transformation of the name using presented functions, it is allowed to be some other function than plain hash) will be object ID.

Stay tuned, work is boiling and results are very close!

Elliptics network got new on-disk format

Tagged:  

Eventually any storage should go into production mode, which implies not only data storage itself but also access restrictions. Distributed hash table systems, like elliptics network, do not have dedicated servers which could store that information and manage access permissions, so each object should have its own set of rules. Although without proper security framework on top of network media this will not guarantee required data access granularity, but even in this model it is still possible to implement IO permissions to some degree.

Until now elliptics network did not have even a slight mechanism for doing this. And even that rudimentary supported metadata was stored in the transaction log and did not allow any kind of extensions or proper updates.

And although I did not yet write any line of code to deal with metadata, I already broke old-style transaction logs, which now contain only and only transaction information. There are no metadata objects at all, but I will update appropriate parts of the library to generate them and store in the separate entities.

It is possible to store metadata in the different objects like the ones being indexed by the hash of the original object's name plus some extension, but this will force system to perform two lookups to find out needed object and its metadata.
Another way is to add new object type to existing transaction and history log objects - all metadata will be stored close to the object itself and could be fetched using only object ID. In the filesystem backend where each object is stored as separate file, metadata will be indexed by the '.metadata' extension or similar - just like we have $ID.history for transaction logs. In the database backend (BDB and Tokyo Cabinet, although I seriously consider to drop the former, since it is unacceptibly slow compared to TC) it will be a separate table, indexed by the object ID.

Metadata will have flexible format (maybe even human-readable one based on strings?) to allow extensions without breaking backwards compatibility.

But first I should fix background log checker, which although syncs all kinds of objects currently (i.e. when there is no some object in the storage, but there is its copy with different ID, it will upload missing data from that copy), it does it slightly wrong way, namely messing with hashes and producing unneded additional transaction references. When checker is ready, whole storage fsck process will just combine a log based on metadata objects, and start check process for it.

Stay tuned, we are very close to the next major release, which will draw the line of the serious features and changes!

In a meantime elliptics network went production

Tagged:  

We started the first production elliptics network distributed hash table cloud last week. Configuration is rather simple - there are 3 virtual datacenters each one contains two physical machines with several Tbs of space each.

I do not know precise number of HTTP proxies (fastcgi frontends for elliptics network) installed, but we talked about one or two of them in each datacenter. Each proxy has 5 uploading and 50 downloading processes. Uploading ones are hidden behind firewall, and downloading proxies are configured to read small objects through themself (like XML and image files) and big objects (data files) are downloaded directly from storage nodes, proxy only generates XML output with direct URL to some storage node.

This pilot project will host only about 2-4 Tb of data, each object will have 3 copies, each one stored in the appropriate datacenter. We will add geographically-spread datacenters soon, which will host only the most popular content, so that clients local to those storage nodes would not go the main datacenter.

Objects are not supposed to be updated during its lifetime, only once uploaded and removed when time requires. Data synchronization for failed and new nodes will be done using fsck application, which currently does not support advanced merge algorithms we had in the elliptics network, it only checks number of copies and optionally downloads/uploads if something is missing. Fsck application uses log file to get information about what and how objects should be checked, currently storage does not store this metadata with objects themself.

And actually background fsck daemon discussed previously will not be something very different from this application. Instead I suppose that having a script which will parse object's metadata and invoke application is a good approach. Fsck checker does not yet work with all possible types of uploaded objects, namely it was not yet tested with the transaction logs, only with objects themself, since it is what we use in the pilot project.

The more I think about transaction merge algorithms, the more I like the idea, when we only merge them using timestamps. Currently elliptics network has 5 merge algorithms, which may or may not be appropriate in some or other setup. Merge algorithm is invoked when node joins the storage to sync its content with what we have in the storage. Idea is to drop this join syncing and use postponed fsck which will sync data sometime in the future. Thus it is a prerequistic to have multiple object copies in such system, since it can take a while to sync joining node.

I understand that it is not possible to maintain in-sync clocks on multiple machines to have correct timestamps for each update transaction, but it should not be a problem if proper locking is used. Although there is no appropriate distributed locking system yet, it will be implemented anyway (plan is to write PAXOS locking daemon).

Fsck application will be extended to allow external libraries to merge data, namely I have requests to allow external entities to merge data based on its content and not meta informaion I might store, so fsck will just call external functions loaded from provided shared library if configured.

I did not make a new release yet, we will figure out possible bugs and complete fsck application testing first. Next version will not have syncing during join time, so this will be a major update.

Stay tuned!

Elliptics network progress

Tagged:  

Elliptics network - a distributed hash table storage got initial fsck implementation. Actually I can only say this is a data checker, which does not try to recover if something is broken.

And it only works with provided text log and only for non-transactional objects uploaded, i.e. what is generated with HTTP fastcgi frontend. Fsck checks number of copies in the storage according to config and reports them to the user, which in turn can manually start upload process if number does not match.

This is a very first step of course and things will be improved with time, but it is already quite handy for administrators. To date elliptics network storage still requires manual intervention for 100% guarantee that storage contains needed number of copies, although in some cases it can be done automatically.

New elliptics network release: 2.6.3

Tagged:  

Ellipitcs distributed hash table storage does not stay on the same place for too long, so I made a new release.

This is a rather minor one compared to what is scheduled next, but still it is rather big one. It mainly contains new shiny features and bug fixes, so here is a short changelog:

  • Improved Debian package building machinery
  • [HTTP fastcgi frontend] Added more commands: statistics, unlinking
  • [HTTP fastcgi frontend] Improved security context of the dataflows: permission checks, cookie and secret generation
  • [HTTP fastcgi frontend] Implemented ability to download data through the proxy and not via redirect link
  • [HTTP fastcgi frontend] Added DNS lookup option, which allows to return fully qualified names in redirect XML
  • [HTTP fastcgi frontend] Extended documentation
  • [HTTP fastcgi frontend] Switch from CGI mode to more thread-friendly FCGX mode
  • [HTTP fastcgi frontend] Virtual datacenters and geographical binding implementation
  • [HTTP fastcgi frontend] Allow to use external library calls for pre- and post-processing of URLs, FastCGI daemon can request region ID from preconfigured library
  • Added -fstack-protector-all compilation option (which breaks Debian Etch building)
  • Extended file IO backend - now it stores objects in subdirs indexed by configurable number of bits of the ID, previously it used hardcoded 8 bits (256 directories)
  • Fair number of bug fixes

As usual, all sources are available in archive or git tree (latest snapshot is also there).

Now you can build your own distributed fault-tolerant hash table storage virtually by few commands started from script. And although documentation is rather fluffy and there are no plain and simple HOWTOs, it should not take too much time to get in touch with the applications.

There is one issue yet to be resolved: background fsck.
The more I think about it the more I like idea when node does not copy content when joining to the network and instead that background fsck task will do the work, there is a fair number of problems with content sync during joining, which will fire up when we will sync 10 Tb of data from one machine to another (those numbers we talked about recently). Well, 'empty' joined node will not return data (until it is copied there), so client will request it from backup copy, but there will be no need to postpone writes.

Its initial implmentation will use external log to check specified objects, and if some of them do not have preconfigured copies, they will be fetched from the other servers. This application is supposed to be started either by administrator when new node joins the network or automatically during node setup. Log has a simple text format and can be edited manually if needed. Log creation at first place and its maintenance is an administrative task though, for example, it can be sed/awk generated from HTTP server logs, if elliptics fascgi proxy is used for data upload.

Eventually this will be done automatically in background, and fsck daemon will not use some external log, but instead will process stored object metadata on the alive nodes.

The former change is scheduled to the end of the year - this will be the last 2.6.x release, and while actuall background fsck does not break API, it is a rather major change in the server-side logic, so this will be the first 2.7 release early in the next year. So far they are the only major changes for the foreseeable elliptics network future.
Next is POHMELFS port.

And in a meantime I very-very-fucking-very much want to implement a rather simple LR grammatics parser/generator implementation. I just sleep and see how productions are made. Well, not particulary this, but that's what I want to spent some time on.
Aho's Dragon book looks at me like at the shit.

Stay tuned!

Dynamic storage node selection in elliptics network based on IP address and queries

Tagged:  

Let's imagine a distributed hash table storage build on top of elliptics network physically spread over multiple datacenters which are geographically separated all over the world.

Let's further suppose we want to fetch some data from our local Moscow datacenter instead of main basement somewhere in Nepal. We can assign a special transformation function like dc256_sha1 for this, which will force all transaction IDs to have ff000000 as their first 4 bytes. Moscow datacenter will have 0 ID, so if there are no other nodes with IDs more than 0xff, all requests with above transformation function will point to Moscow.

So far so good, we install this function in main load balancer all over the world, but will put it somewhere at the end of the list, since in Nepal in particular people will prefer not to go to Moscow. But what if some clients in Moscow connected to main load balancer in Nepal? If it does not know how to dynamically select transformation functions from its set according to IP or query, it will try the one listed first.

This is changed now - I added two calls to external library (not counting initialization and cleanup callbacks) which will receive address and query, start one is executed before data is processed (i.e. before GET/POST handlers start) and stop one is executed at the very end (just before data is freed). Elliptics fastCGI daemon will call external library to get region ID from query and address, and if it is more or equal than 0, it will search for 'dc$id_' string (like 'dc256_' string) and if there is such transformation fucntion, it will be moved first. Stop callback will move it to the end of the list.

So, when such system is installed all over the world datacenters, their frontends will receive requests, search for appropriate hash (transformation) function and probably (if configured) use local to client storages. Currently aforementioned library to obtain region ID from IP address is not public, so this feature is a bit useless for the non-programmers.

Everything can be turned off of course - it is just a matter of proper configuration options.

This was the last task before background fsck project. Which in turn is the last one for the elliptics network (modulo bug fixes).
Then I will switch to POHMELFS to finally feel again taste of kernel programming.

Virtual datacenters and geographical linking in elliptics network storage

Tagged:  

Elliptics network is a distributed hash table storage, which among zillions of other features allows to implement virtual datacenters now. It is a set of nodes combined into some logical group, where nodes may or may not actually be physically groupped together.

I added a hash prefix (one byte) to implement virtual datacenters.

This feature allows to specify preconfigured prefix in every transaction ID (the first byte). If nodes are added with IDs starting from the same number as in above hash function, then every transaction will go to that ID.

Here is a virtual datacenter example.

Let's suppose we have two transformation functions setup on client: dc1_sha1 and dc2_sha1. They will produce sha1 hash of the data with the first byte set to 1 and 2 accordingly.

Now let's suppose we add two nodes with IDs being equal to 0x0100 and 0x0180 and then two nodes with 0x0200 and 0x0280 IDs. Now there will be two transaction made with above transformation functions with its first byte set either to 1 or 2, so there will be a guaranteed copy in the first set of nodes and in the second set (sets are indexed by the first ID byte, this set is a so called virtual datacenter).

Now we can implement multiple sets with different ID's first byte for every datacenter we want to work with. If we only need to have 2 copies, but there are more than 2 datacenters, machines from them can be spread between those sets of nodes. '2' is arbitrary here of course.

This feature also allows to implement geographical linking of the requests. Let's suppose some application receives data read request from Moscow, it can check whether its set of transformation functions contains the one with the ID assigned to Moscow. If there is such a function, it can be used first to obtain object ID and to fetch it from Moscow-local servers instead of going to New York, where main datacenter lives. If Moscow storage does not contain our requested object, we will use second transformation function, which will 'point' to main storage cluster.

I will extend HTTP elliptics frontend to support external library call, which will determine object ID from received URI and client address. In particular it will be possible to write some external library to parse address and its local geobase to select transformation hash according to that data, or fallback to generic mechanism of trying every preconfigured transformation function to generate ID.

In a meantime elliptics network development

Tagged:  

Its HTTP fastcgi frontend got ability to request statistics from remote nodes and tell clients whether things are good or not. In particular it is able to return XML with CPU and memory usage stats and filesystem data (total and available size, number of files if supported and FSID).

Here are fastcgi (lighttpd) config options:

# request remote nodes statistics
# plain stat will request stats and return 200 status if either
# number of received replies is more than $DNET_FCGI_STAT_BAD_LIMIT (if set)
# or is more than number of bad replies
"DNET_FCGI_STAT_PATTERN_URI" => "/stat",

# this will always return 200 status with XML data showing node statistics:
# CPU and memory usage and filesystem data (total and available sizes, FSID and number of files)
# if such statistics is supported
"DNET_FCGI_STAT_LOG_PATTERN_URI" => "/stat_log",

# plain stat request will return 400 status when number of bad replies returned from the
# storage is more than this number or when number of good replies is less than that.
# It is not used when log statistics is requested.
"DNET_FCGI_STAT_BAD_LIMIT" => "1",

Next task is to actually implement virtual datacenters split via hash functions. This idea is described on elliptics network homepage. This will allow not only to split copies into different datacenters, but also implement special caching nodes, which will be mapped by additional hash function, so that some popular content could be fetched not from the main cloud but from those dedicated nodes, it can be controlled on per-address basis for example.

And the main idea is background fsck. The more I think about it the more I like idea when node does not copy content when joining to the network and instead that background fsck task will do the work, there is a fair number of problems with content sync during joining, which will fire up when we will sync 10 Tb of data from one machine to another (those numbers we talked about recently). Well, 'empty' joined node will not return data (until it is copied there), so client will request it from backup copy, but there will be no need to postpone writes.

Another idea about background fsck is special log it can use to fetch data, which will contain all objects written with its IDs, so it would not ask every single object whether its copy exists or not, but only process those ones which are presented in the log. This is a client responsibility to write one though, for example it can be formatted from POST access logs from HTTP frontend proxy.

That's the plan, stay tuned!

Elliptics network HTTP download performance

Tagged:  

This time I ran download test, where HTTP fastcgi elliptics network proxy downloaded object from the network itself and sent it to client.

We uploaded several tens of thousands files from 4 to 15 kb each into 4-servers storage, each server hosted two elliptics node attached to dedicated SCSI disk.
Proxy server has 2 physical 2.3 Ghz CPUs (+2 HT CPUs) and 8 Gb of RAM. Storage severs and fastcgi proxy are connected over 1 GigE network. I started 4 lighttpd dispatching workers and 200 elliptics network fastcgi daemons. There was fair number of network stack tunables we touched because of problems found in redirect test, which I will describe below.

I do not know what is the client software (called Tanks here), but it is able to generate HTTP GET load with (tens of) thousands requests per second rate with 250 rps steps. Graph below shows reply rate from elliptics network HTTP proxy (legend is the same as in redirect test).


Elliptics network HTTP fastcgi proxy download test

Proxy node was maxed at about 7-8k rps - all 4 CPUs (2 physical + 2 HT) were at 100%, where most of it was eaten by dispatching lighttpd processes. Active elliptics daemons ate at most 10% with turned on small logs, cookie and authentification messages generation and data forwarding itself (proxy downloaded 4-15kb objects from the elliptics storage and sent data to the client).
Elliptics storage nodes ate at most 10-12% of CPUs.

Now about problems we found during redirect test. In this test proxy node does not fetch and sent requested objects to client, instead it generates small XML data with direct URL to the object, which client can use itself.
With the same test setup system loafed at 8k rps and 50% CPU usage (most of which took caching bind9, since system was configured to return domain name instead of IP address, which is needed for correct cookie work). At this moment something cuts the balls happens, and perfromance dropps to about 300 rps, CPU usage decreases to zero, but nothing crashes or goes out of file descriptors (there is a fair number of them configured in ulimits). To fix the problem we tuned network stack (previously we regulary got warning about timewait socket table overflow) a bit, but problem remains (maybe it is related to client software, since I thought it is 8 nodes with 1024 fds ulimit, but the way perfromance drops and the fact, that it did not dissapear after ulimits were tuned says problem is somewhere else).

According to all observations we should be able to process 15k rps in described redirect test on the 2-way (2 physical + 2 HT CPUs) machines.

Testing elliptics network small IO size write performance

Tagged:  

I setup a 6-machine cluster where each node was placed in different datacenter (at least that's what I was told), connected over 10 Gbit network, but actual network speed was about 20-30 MB/s - it is production datacenters after all (I'm curious whether filling up the whole available bandwidth made people unhappy).

Each system runs on 8-way E5440 64-bit Xeons with 16 Gb of RAM and has two test SCSI disks attached, which I formatted into ext2 and ran the latest (1.4.39) Tokyo Cabinet database. System runs old as mamont's shit 5.4 RHEL (for example it has libevent 1.1, 9.8e openssl and 2.6.18 kernel).

I installed one elliptics network node on each of 6 servers and started a singlethreaded IO testing tool, which wrote 100 byte chunks into the storage. When test application was connected to every node it was able to get more than 300k requests per second IO rate, which was effectively limited by the network - it is those 20-30 MB/s of free bandwidth between datacenters. Sometimes it dropped to 70-100k rps, sometimes grew up to 360 thousands of rps.
I was asked to run multithreaded benchmark, like 50-100 threads from single node writing into the storage, but system was able to fill the whole pipe using just one client.

Anyway, everything looked quite good except one small detail - servers regulary crash on those machines. And by regulary I mean it. To date I do not know the reason, but I have to admit that it works without problems on the similar Ubuntu systems (with 4.2.4 gcc).
For example RHEL machines can throw out following dmesg message:

dnet_ioserv[2320] trap invalid opcode rip:2b649487455e rsp:43dcdfe8 error:0

which, I must say, I do not understand how to run into with whatever kind of software error in my userspace application.

Following quite usual gcc options were used during compilation:

 gcc -DHAVE_CONFIG_H -I. -I../config    -I../include -I../config -pthread
  -I/usr/include -I/tmp/rhel//include -I/tmp/rhel//include
  -g -O2 -W -Wall -Wextra -MT iotest.o -MD -MP -MF .deps/iotest.Tpo -c -o iotest.o iotest.c

$ gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I believe I will have to try different compiler, since I extensively use __sync instructions for atomic access, which had some bugs in 4.1 gcc versions. Or at least to try libatomic first.

There are some bugs to hunt on!

Elliptics network background fsck

Tagged:  

Elliptics network is a distributed hash table storage, which, among other features, allows to redundantly store data. Number of copies of written object can be specified on various levels namely on client and server, and it depends on how and which transformation functions were set. Transformation function is a method of generating object ID from provided data, for example it can be object name or content hash, or it can be a database index or whetever else.
Multiple transformation functions will provide multiple IDs object should be stored with, thus implying multiple object copies.

Generally we want that different copies of the same object are stored on different nodes, in some setups we can guarantee that (for more information please take a look at per-datacenter object distribution discussion on the project homepage). But when node with some object fails and returns with empty storage, the same object will not be copied from the storage even if there is another copy somewhere else, since there is no central information server which maintains object<->id relations (eliminating this server is one of the main advantages of DHT storages actually, since it is likely main point of failure in the comon existing distributed storages), since only client itself can generate that relation.

In some cases we can eliminate this problem, namely when we use classical redundancy scheme, when object being written to some node is also copied to its neighbour. When node fails, neighbour copies appropriate objects to its new neighbour instead of failed one, so number of object copies remains constant. When node comes back online again (potentially with empty storage) it will fetch 'its' object from the neighbour node.

This mode is extremely unfriendly from administrative point of view. Admins already hate me since they need to generate IDs for the storage nodes manually to have a good load distibution (when we have small number of nodes random IDs will not evenly spread whole ID range between nodes), and while we can generally implement ID distribuition the way that object copies will be spread among different machines and even datacenters, this becomes very hard when copies go not to multiple specified nodes, but to single specified node and its neighbour (which ID we do not really know on the client).
Thus admins will have to maintain information about how neighbour nodes are spread over datacenters, which becomes a real pain in the ass, when single machine contains shitload of disks where each one has its own elliptics storage node attached, and we do want to put a copy of the object to another physical machine (and even in different datacenter).

So, redundancy scheme, where we put object to some node and its neighbour maintains number of copies when nodes go offline and online, but it does not (easily) allow to implement copies distribution among physical machines and datacenters, which is rather trivial in the case, when different transformation functions (say, sha1 and sha256 with some changes described on the homepage) are used.

To fix problem with multiple transformation functions and precise number of copies maintenance I decided to implement (read: to put into TODO list :) a background fsck daemon, which will run on every elliptics node and check whether each object stored on given node has its copies presented somewhere in the storage according to object metadata.
In a nutshell, each update transaction will have some metadata attached to it on the storage, namely original object name it was generated from and transformation functions used (with time we will want to put there IO permissions also and maybe something else, so it should have extensible on-disk attributes format). Fsck daemon will read that information and check whether copies with IDs created using all listed transformation functions are presented in the storage, otherwise it will write local object into the storage with missing ID. It can also check whether locally stored object is corrupted and get it from the storage if needed.

Having enough redundant copies we can reduce probability of the simultaneous failure of all nodes which store copies of the object to the accepted level, so background fsck will have enough time to scan local object tree and upload missing copies. This task by no means is supposed to be fast, that's why it is a background fsck, which should not affect storage node performance.

That's the plan, although not very immediate, but it has a rather high priority.

First elliptics network HTTP frontend benchmark results

Tagged:  

We ran performance testing quite for a while already, but system was not tuned for the maximum performance, so I wanted to show the best ones, but so far it is a little bit postponed, so in a day or so I will have another set, and now will post what we already got.

HTTP proxy was configured as elliptics node which performed node lookup for every request it got, so it returned small XML with download information and not object itself. We setup single lighttpd process and 100 fastcgi daemons. It ran on 2-way 32-bit Xeon (2 physical + 2 HT processors) with 8 Gb of RAM.

Elliptics storage network contained 8 nodes on 4 physical servers, where each one had 8 cores of 64-bit Xeons with 8 Gb of RAM. Each elliptics node ran on top of its own SCSI disk, it was configured to work with file storage backend, i.e. each object written was separate file. I used reiserfs because of its good performance for this workload.

Everything was connected by 1 Gbit ethernet network.

I do not know what is the client software, but it provides details statistics about reply time and allows to show nice graphs (in flash though).

So, the first results - reply time and rate. We were maxed at 5k rps and up to 4k rps behaviour was very good. At 5k rps reply time started to degrade although still was able to match request rate. At this point single lighttpd process was not able to dispatch requests fast enough, it got close to 100% CPU usage while elliptics fastcgi processes loafed at 3-5 % maximum. That's why I want to rerun this test with more lighttpd processes (namely 4 for 2 CPUs + 2 HT CPUs).
First graph shows number of replies per second changing with time (and number of requests per second). Legend says that green is number of momentary replies, blue is median number of replies, red - load in rps. Second graph shows reply time. Dark blue is momentary reply time in ms, orange is median reply time and red is load in rps.


Number of replies and its time

Next graphs show reply time distribution and HTTP reply status codes depending on workload. The first one is workload scheme and reply time distribution in ms, second one is HTTP status code distribution.


Reply time and HTTP status codes distribution

It was redirect data test, i.e. ellipitcs node on proxy server did not download data but only requested remote nodes whether they contain needed object, and returned some small formatted XML data with download info, which could be parsed by the client to create direct URL to data object to be fetched. Also proxy server generated cookies and secure authentification codes.

Next task is to run a test where data is actually downloaded through the elliptics network HTTP proxy daemon. Plan is to upload several thousands of small files (5-15 Kb each) and fetch them through this proxy.

And of course tune server software for maximum performance. I want to get 10k rps from that rather old 2-way machine in redirect test. I have another test machine with fair 8 modern cores, where I can setup this proxy too, so those data will also be interesting.

Stay tuned!

Elliptics network HTTP frontend extensions

Tagged:  

system got DNS resolving support as well as object removal.

Everything is rather trivial, but it must be done in any somewhat functional distributed storage.
I extended fastcgi config options documentation and will write down some examples. Also will put a small memo about security context and how it should be used to authentificate users.

We scheduled a performance testing soon (I do hope it will be done this week and better if tomorrow), and it will not be my tests at all. Well, I work with those people of course, but neither clients nor testing scenario belongs to me - this will be a fair production workload performance check, which can be shown to the people who make decisions about project deployment. This will likely be ran even without my notice (although server side runs on my test cluster :)

While working on those issues I thought about how to manage object metadata, which is needed both to web projects and filesystem frontend. Namely we have object permissions, which should be stored somewhere. Plan is to add another object with ID generated from the main one, which will store metadata structure as its data, which can be read before accessing main data to resolve control issues.

So far its a plan for the future though, and the nearest plan is to finally run new massivelly parallel IO test with very small data requests. Likely it will be setup in a day or so (tomorrow is a climbing day).

Stay tuned, new shiny (or miserable) graphs are coming :)

Elliptics network fastcgi daemon got direct download option

Tagged:  

One has to specify special flag (configurable in fastcgi daemon) in URI and its requested ID must contain one token from provided in config list (like .jpg or .xml), otherwise error is returned to the caller.

If all constraints are met, fascgi daemon will download given object from the storage into its RAM (which basically means that this proxy is not supposed to push very large object throught itself, instead it should be configured to return direct URL) and push it to the client. It will not set tricky things like content type header or anything else, but libfcgi will take care about setting correct content length header (btw, the way it does this shows it will not work with large objects too, looks like it buffers data from multiple FCGI_fwrite() calls).

So, if fastcgi daemon is properly configured, following request will 'force it' to show picture in the browser:

GET /test.mp3?name=123456.jpg&direct=1
Host: devfs1

and following one will return 'redirect' XML with appropriate configurable status (edited to add some newlines for nicer view):

GET /test.mp3?name=123456.jpg
Host: devfs1

 Location: http://devfs8/1/f5/f58fd51148ce114a88ce9d668fae9b1a28869ea8
 Set-Cookie: Our_complany_cookie=87c62d2279e78da1cbabd038a7b7b16fb2678b66c097c92e4806cbc7191e8e03;
  expires=Mon, 23-Nov-2009 02:15:22 MSK
 Content-type: application/xml

 <download-info>
  <host>devfs8</host>
  <path>/1/f5/f58fd51148ce114a88ce9d668fae9b1a28869ea8</path>
  <ts>4b09b7fa</ts>
  <s>c847dad56d952d44e20a9b57bae4ca9824af531661a8cfb0fc48d65791eca2ab</s>
 </download-info>

where secret is generated using private key, timestamp and cookie, and cookie itself is either set by the client or generated using client address, timestamp and random data. Its expiration time is also configurable. Cookie is supposed to be used by the storage server software to check whether it is allowed to return its data. Direct object reading does not check cookies.

I will desribe it in details in fcgi README file in the new release, which is scheduled to see the light in a day or so.

I have to admit, that it was easier than writing filesystem, which effectively will do very similar things :)
Although not that much - elliptics network provides library API, so it is simple, but I had to write it at first, which effectively equals to write it in kernel. Unfortunately it is not that simple to use that code in kernel, and actually kernel does not need it all, only basic helpers to maintain stack of attributes in the transaction, everything else (low-level IO for exampl) is actually very different from userspace.

That's it, so far there are no more feature requests for elliptics network and fastcgi daemon, so it will be the next version (it also contains bug fixes, very likely 2.6.2 will not work the way you expect, so use git for now, it allows to download a tarball :)

Elliptics network got authentification and upload permissions check

Tagged:  

as well as fair number of bug fixes. Authentification is implemented via pre-shared secret key and digests, stored in cookies, which can be checked on the storage servers. POST is protected via config option. It is possible to disable or configure everything though.

During implementation additional project rised by itself - to store some other files in the storage and download them not via direct links, but through that fastcgi proxy, which already handles redirect URL generation.

It is a rather trivial idea getting that fastcgi daemon supports lookup and data writing, so I will extend it in a couple of days and roll out new version (actually I need just a couple of hours for this implementation, but there are some other things to work on first like piano and trumpet, eventually I will write about my progress in the latter). It is a good idea to download all pictures and text files from the page in one go instead of multiple redirect requests, and only large files will be fetched via additional request.
Actually it will be possible to download all files through the proxy and not via redirect, it should only have a special flag in URI. I'm not yet sure how to implement this correctly, it is posible to store them into files and then send them using sendfile() or to put them into RAM and send through usual socket machinery. Both approaches have its pros and cons.

Then we will test this setup, so I could switch to another elliptics network testing (modulo bugs of course): multiple machines in different data centers, some people want to know aggregated small (about a hundred of bytes per request) read and write performance from large number of threads (like hundreds per node).

And finally to POHMELFS, which I thought I lost because of some troubles, which apparently were not caught by backup, but fortunately there are development/testing machines, where there are latest sources. Those machines tend to be killed regulary (its a doom of filesystem development to lost all data), but everything is yet ok.

Elliptics network "make things easy" release: 2.6.2

Tagged:  

Name says it all: it is not dumb simple to create distributed hash table redundant storage over multiple nodes with HTTP data access.

Data is uploaded using POST method through special FastCGI application, which is linked with elliptics network library and writes data into the storage according to its config file (one can specify data redundancy there for example).

Data receiving is rather different idea - FastCGI application described above only lookups requested object in the network and returns direct URL to appropriate storage server. It has to be configured according to some standards (like data must be placed in subdirs, indexed by the parameter, which is equal to appropriate elliptics network port minus FCGI_DNET_BASE_PORT config option, i.e. if there are two elliptics nodes (for two disks for example) running on 1025 and 1026 ports, FCGI_DNET_BASE_PORT config parameter being set to "1024" and single web server, its document root should contain subdirs 1 and 2 (1025-1024 and 1026-1024).
There is bunch of other useful config parameters, although there is no authentification or any kind of permission checks yet.

Anyway, here is a changelog:

  • Added FastCGI daemon to handle GET (returns direct URL via redirect or XML) and POST requests
  • Extended lookup to optionally check whether requested object is stored locally
  • Added lighttpd fastcgi config
  • Bug fixes

I deployed small storage total of 1 Tb of data spread over 4 physical machines with 2 elliptics node on each (one per storage disk) and uploaded 60 gb of data there (about 3-4 thousans of files and each one has additional copy).

Upload is rather trivial:

wget --post-file=$file http://base_fast_cgi_host.net/name.mp3?name=$some_file_name

You might expect that file downloading will be as simple as

wget http://base_fast_cgi_host.net/name.mp3?name=$some_file_name

and you will be absolutely right, that it will redirect you to some server inside storage cloud via direct link to the requested object.

Also added files to make debian packages. This even works.

Full changelog is available in git tree.

That's it, enjoy!

HTTP fastcgi daemon has been imported into elliptics network tree

Tagged:  

and uploaded into git tree.

Now its time to setup a small 8-node elliptics network cluster (one node per fast scsi disk) on 4 physical machines (total of about 1 Tb) and run some tests, namely massive data upload and download. There will be two external nodes serving as upload proxies (I plan to write at least one additional object copy for redundancy) and data fetching URL generators (objects themself will be downloaded via direct HTTP links from the storage nodes)

Upload as well as download works with wget pretty smoothly, but there was no load yet.
I will go climbing quite soon but would like to start some massive uploads before that to get some results later today.

If things will go smooth as well, this will be next elliptics network release. Next step will be to add some authentification bits into the field, currently neither application checks permissions just because there are no restrictions at all. One can configure web-server instead though...

Stay tuned!

UPDATE: imported my music collection (3500 files), total of just about 30 Gb, not that much actually, will try to find out what else one can find here.

Full cycle elliptics network access over HTTP completed

Tagged:  

Elliptics network - a distributed hash table storage with zillions of tasty things got another gem.

It is possible to upload and download content over HTTP (GET and POST methods are supported) via direct links (download only, upload uses FastCGI proxy), system scales horizontally, allows to implement redundant object storage with multiple copies, automatic data relocation when nodes fail and so forth.
POST processor was originally written in Lisp, but I decided to switch over to plain C because of simplicity elliptics network library provides with its API. Getting that I stopped to use Google in the office (sigh, if you would know how ugly Bing is, but there is no alternative I'm afraid), it takes really long to find out something useful in internet about complex tech tasks now, so continuing working with Lisp in that environment did not look like a good idea. I will use it for AI tasks though.

So, modulo unknown bugs, it should completely solve your download HTTP server scalability issues. Now its time to cleanup debug prints and code a little bit, extend configs to add some latest words, and then to add authentification protocol, namely some cookie generation to forbid unauthorized access. So far I did not commit the latest changes, it will be there tomorrow.

It happend that using elliptics network is really trivial, even when one forgot quite a lot about its internal state. Code extension, albeit quite visible, was not that invasive actually, and I was able to catch up with it very quickly.

I added another flag into the protocol, which allows to upload data without transaction machinery on the disk (transaction are still present in memory on the client, so will be resent if not acked and so on), i.e. it is now possible to put data into the storage without history (was possible before) and to eliminate on-disk format changes for the data, so that placed files only differ in names from what was originally posted, and names may depend not on content, but on provided ID (like hash of the name or precise ID user may provide). Previously there was a history for the object, which in turn contained transaction ID, which was generated based on uploaded content, so there were two steps needed to fetch the data: get history and get appropriate transactions from it.

So, while you are thinking about how to solve scalability problems of the new project, consider checking elliptics homepage to get in touch with its features and capabilities. There is also benchmark section there, which I plan to extend quite soon with the new data.

Also there is pending POSIX filesystem on top of this storage, which should show up not that far away either.

Stay tuned, there will be some news quite soon!

Completed static content elliptics network implementation

Tagged:  

I've completed installation of the small distributed hash table storage with static content delivered via direct URLs. This whole setup slightly differs from more common and expected one in one detail: how data is fetched and accessed by the client.

In the common case it is supposed that elliptics network powered applications will fetch data from the network according to transaction history of the object (optionally in parallel). This requires client code to be linked with elliptics network library and modified according to its API.

But there is another way, which is much simpler although a bit limited - split data lookup and reading itself, and implement the former in the special small application, while rely on other facilities (like HTTP servers) to get the data.

This is what was made. I wrote simple FastCGI application which starts data lookups and form URLs which are returned to the client application, which in turn fetches data from storage HTTP servers. There is one-to-one relation between any potenially failing object within storage cluster (one can install one elliptics node per disk or per server, or even per datacenter) and elliptics network nodes. FastCGI daemons (which can live on separate set of machines if needed) are persistent clients of that network, and the only task they do is elliptics network node's IP address lookup, which is then extended to static URL to actually get the data.
This URL is returned from the fastcgi daemon as redirect, but this is configurable.

I extended lookup message to optionally stat local storage on the node to actually test whether object is presented on the given node. Using multiple IDs for the same data object allows to redundantly store multiple copies, so that client could switch to another copy if object can not be found using previous ID. Elliptics network storage servers will take care about data relocation when servers go offline and online.

The only problem for this setup is how data is treated by the client and storage. Client expects dataflow from the single node starting from the beginning to the end, while elliptics storage uses transactions with its own protocol and on-disk storage format, which is processed by the library when appropriate IO API is called from the client code.

Problem can be solved if we will upload data not via elliptics network, but directly into the strorage, although using the same name conversion which could be done by elliptics internals. I.e. when we manually create directory structure and put there objects with names, which are equal to hash transformations of the real names, which then in turn will be made in fastcgi elliptics network daemon.
Let me show an example of how this is done. Let's consider an object called '/tmp/passwd.c' to be placed into the storage, which will use sha1 transformation function.

sha1('/tmp/passwd.c') = 8c23ac86ef943021cf6524f475c15f3d5d575deb
so we manually put this object into the storage network on the appropriate node (which handles covering ID range) with that 8c23... name.

FastCGI daemon configured to use sha1 transformation will receive URL like

GET some.host.net/blah?name=/tmp/passwd.c

take name part, hash it, lookup object with above 8c23... id and return following header:

Status: 301
Location: http://some_other_host.net/8c/8c23ac86ef943021cf6524f475c15f3d5d575deb

Simple. We only need to make an appropriate script for data upload. If we would use elliptics network for data upload, then above 8c23... transaction will contain history for data updates, and actual data transaction (or there could be multiple transactions if object was split into multiple parts to allow parallel reading) should be read from the history and then fetched from some other nodes using elliptics network API.

I will write such helper script and upload some content (currently I do this manually via ssh/scp :), so that I could stress-test setup before it goes up. So far things went pretty smooth.
Interested parties can check example directory in the git tree.

Stay tuned!

Elliptics network goes production

Tagged:  

Kind of goes - there is a perfect task for this solution, which I can try to hook into this year. The New Year deadlines all deadlines, so there is about a month and a half for the task.

Task is quite simple actually - there is a huge library of files, which does not fit single storage machine. And although it is not that large, about 5-10 Tb of data for starters, next step is to suck in close to 200 Tb of data. Task is to allow on-demand reading without updates of the existing files, only new ones will be added with time. I expect millions of reads per day.

Files should be spread over multiple machines for read balancing, there should be multiple copies of each for redundacy. System should transparently handle failures (storage machines will be spread over multiple data centers). And the main request is to allow to fetch files over direct links, i.e. elliptics network provides data location and some usual HTTP server will give them away.

While I wrote this entry another cool task (re)appeared: clusterize some very popular monitoring system, which to date does not scale very well to existing amount of notification writers (about 200k small writes per second per small cluster). I need to provide fault-tolerant storage which will be able to suffer this load and allow simple horizontal scaling on demand.

Existing performance numbers show that elliptics network can easily handle all those tasks, but some obscure numbers created by the project author are usually not enough for those who deploy new system. As in any other business, people do not eager to try something new. New, shiny and likely buggy...

Well, let's show what we can do. I will post results and setup systems here.

Data de-duplication in ZFS and elliptics network (POHMELFS)

Jon Smirl sent me a link describing new ZFS feature - data deduplication.

This is a technique which allows to store multiple data objects in the same place when their content is the same, thus effectively saving the space. There are three levels of data deduplication - files (objects actually), blocks and bytes. Every level allows to store single entity for the multiple identical objects, like single block for several equal data blocks or byte range and so on. ZFS supports block deduplication.

This feature existed effectively from the beginning in the elliptics network distributed hash table storage, but it has two levels of data deduplication: object and transaction. Well, actually we have transaction only, but maximum transaction size can be limited to some large enough block (like megabytes or more, or can be unlimited if needed), so if object is smaller than that, it will be deduplicated automatically.

Which basically means that if multiple users write the same content into the storage and use the same ID, no new storage space will be used, instead transaction log for the selected object will be updated to show that two external objects refer to given transaction.

Depending on transaction size it may have a negative impact, in particular when transaction size is smaller than log entry, it will be actually a waste of space, but transactions are required for the log-strucutred filesystem and to implement things like snapshots and update history. By default log entry size equals to 56 bytes, so it should not be a problem in the common case.

POHMELFS as elliptics network frontend will support this feature without actually any steps out of the box.

Perl bindings for elliptics network API

Tagged:  

As you might know, elliptics network is a distributed hash table storage system, which supports data partitioning, redundancy and recovery without any master servers as well as lots of other features.
It is created as a simple library with examples which include server and client nodes, notification and statistics receiving utility and so on. Now it also has Perl bindings for the vast majority of the exported APIs.

One can check out a git tree and build appropriate perl object files.

It is synced with the master elliptics network branch, so both trees have the same version number.

Enjoy!

2.6.1 elliptics network release

Tagged:  

Elliptics network is a fault tolerant distributed hash table object storage.

This is a bugfix release without substantial changes.
Its changelog is also quite short:

  • fixed deletion command reference counters
  • improved processing of acked transactions (before this change it could take noticebly longer to complete transaction because of possible resends)
  • automatic tests refactoring

As usual, it is available as tarball from archive or git tree.

There is another news about elliptics network - we created Perl API to work with the network storage. It was implemented as a binding to elliptics library, and currently has support for high-level file IO API (read/write/delete files and histories) and low-level object manipulation functions are being tested.
Kudos to Max Dementyev.

It is not opened yet because I do not directly participate in it, but I will work on integrating bindings into main tree.

Elliptics network: 2.6.0 release

Tagged:  

It is a rather major elliptics network release, although it does not bring something especially cool. Just what should be implemented in all and every distributed network storages.

So, changelog is pretty short:

  • implemented stall transactions reprocessing
  • proper node shutdown, it is now enabled by default even in examples which plainly destroyed objects by exiting previously
  • bug fixes

As usual code is available in archive and git tree.

A short feature list includes:

  • distributed hash tables, no metadata servers, horizontal scaling
  • data replication
  • transactions and parallel IO
  • different IO storage backends, modular architecture which allows to easily implement own transaction storage
  • automatic data repartitioning in case of removed or added nodes
  • ring addressing structure, ability to implement own transaction ID generation model
  • support for NATed connections
  • remote node statistics gathering module (usage of the disk, cpu, ram and so on)
  • IO update notifications support for any object in the network
  • advanced merge strategies
  • automatic tests (IO, joining, merge, deletion)

This is a milestone. And to date we stop here to wait for POHMELFS.
Enjoy!

New ellitpics network features: stall transaction reprocessing

Tagged:  

Elliptics network is a transactional append-only distributed hash table. Each transaction is either stored on the destination node or forwarded to another server.

Client has a completion callback which is invoked when remote node acks transaction. Client may ask several nodes to provide data in parallel or issue next request if the first one failed.
Nodes itself can reconfigure and change own IDs so data will be copied around to different nodes.

But up until now there was no mechanism to work with the failed transactions except than to return error to the client. So, for example if data was placed on the nodes which are not currently available because of network failure, read request will either block or return connection reset/refuse error.

In some cases it is a good idea to resend the same transaction when network or node becomes available again. That's what I imlpemented in the elliptics network - stall transactions are checked and if its resend counter and time match configured parameters, transaction will be resent. When resend counter hits zero, transaction is destroyed and error is returned to the client.
Checkout timeout for each transaction is specified in the configuration structure - this is a time after which transaction is considered stall and its resend counter should be checked, which may result either in transaction destruction or actual resend.

I combined this code with the reconnection logic. In particular it allows all transactions to reappear immediately when remote node goes back online.

It is implemented as a separate checking/recoonnecting thread, so it can be done without signal mess and can be safely used in the library.

Effectively it is the last feature of the elliptics network library itself iirc (but my be wrong :), other extensions will likely be done as external libraries/daemons (namely PAXOS locking). I will test these changes and if things go smooth will make a new release later this week.
Next step is likely to start POHMELFS porting to elliptics network storage.

Stay tuned!

Elliptics network transaction tree checks

Tagged:  

Elliptics network is a transaction based distributed hash table. All transactions are stored on the clients and servers which forward data to the destination nodes. When transaction is acked, it is removed from the tree and its completion callback is invoked whether it was provided by the client or set by the core system itself.

To date there was no way to check for stall transactions and resend or complete them with errors if destination node suddenly failed and no ack was received. Now things started to change - I added a helper function which traverses the whole transaction tree and checks for stall ones the same way it is implemented in POHMELFS. It operates with transction resend counter and fire time. The former is a number which is decreased each time transaction was sent into the network, when it hits zero, transaction is completed with timeout error and destroyed. The latter is a real time when transaction's resend counter should be checked. It is set during transaction creation and rescheduled after each resend.

Since elliptics network is a generic-purpose library, it can not easily mess with the timers, namely because higher-level users may already define own signal handlers. Thus I implemented it as a simple helper function supposed to be called from the safe context (not signal one and allowing to sleep).
As extension to this API I will add a way to create special per-node thread, which will periodically check transaction tree. This internal matter will be hidden from the users though. Maybe this will be a default mode for all client applications, since joining nodes already have such thread responsible for remote node reconnection.

I hate to break one idea think-path to implement something very different, but I feel extremely stupid not being able to implement a must-be-simple recirsive string parser in LISP, so decided to temporarily switch to something I know well.
Stay tuned!

Minor elliptics network release: 2.5.1

Tagged:  

This is a minor release of the distributed hash table object storage which contains following changes:

  • fixed leak in the forwarding path
  • removed unneded checks in the initialization path
  • increased verbose level for rare errors
  • code and debug cleanups

As usual, elliptics network releases are available from archive and git tree.
Enjoy!

Syndicate content