Blogs

Elliptics changes

Tagged:  

They are quite dramatical, but are very small yet - I committed search protocol changes. Now node stores transactions with IDs greater or equal than node's ID (it stored smaller or equal IDs previously), which is incompatible with current node searching, but allows to maintain human readable and logical (for humans) ID generation.

So, when node has ID, say, 0100..., it will host data transactions, which start from 01 (its the highest byte). It is much more convenient to configure nodes with this in mind, than to calculate what is less than 01, namely FF... IDs.

I also committed initial metadata support, but neither low level IO backend supports that yet, and I will leave only Tokyo Cabinet DB and file backends, BerkeleyDB support will be dropped, because of its slowliness. It is still in a development stage, since there is no clear vision on where this functionality should live - client or server.
I.e. it is possible that client will tell that it wants to insert metadata X into given object, and server will read/modify/write metadata blob itself, or it is possible that client will download whole metadata blob, update it locally and then write it back to server, which will replace old one with the new data. Likely I will use the former case, since it simplified client development, which should be a higher priority than server simplification.

We also found an interesting bug or feature of the storage - in some cases it is not possible to remove object, it will be recovered from the dead. Let's say we have two object copies and one node was turned off. Automatic recovery (not present yet though) will create another copy from the first one on alive nodes. Subsequent object removal will kill both copies on running nodes. When turned off node goes online again, autoamtic recovery tool will resurrect removed object from the copy presented on this node.

To date it is all a pure theory, since there is no separate metadata in the storage, thus no automatic recovery (admin should run special tool with properly crafted log file currently) and it does not remove objects from the storage. But still, described problem will hit us badly when we will actively use it.
And while there is no merge implemented either (it is kind of being materialized in my mind while we talk), solution will involve new history entry creation instead of actual data removal. Thus transaction log will contain a note that given object was removed. In case of network split and parallel object removal and update in different parts (which can not contact each other during this event) of the storage, this will also allow to implement correct and complete transaction history log by synchronization daemon.

Thus object will never be deleted from the storage, and instead its history will be updated to store a note about its status. File system checker will be extended to support a mode, when it will actually remove objects from the storage after they were marked (and resolved during merge with other logs if needed) as deleted after some timeout, which should be big enough to eliminate such ghost nodes appearence.

And the last but not least discussed issue concerns storage size and related limitations. Let's say that we reached our current storage capacity and want to add several another machines, which will add 50% of the current volume. We want to spread data equally between all nodes, thus we will need to update every node's ID to shift it a little, so that new nodes entered addressing ring and formed a fair ID distribution. Amount of transaction copies in this case is quite large - more than a half of all data will have to be transferred over the network, which will take a while.
Also, when we add new empty node into the storage, it will kind of hide data it is supposed to host (according to ID distribution) until it is copied to the new node from the neighbour. Thus there should be a poilicy, which will forbid simultaneous update of all servers, since there is a possibility that suddenly all added nodes hide all copies of some objects. It will be recovered of course, but it will take some time, which in some cases is not appropriate.

One of the solutions for the described storage size issue is different storage policy. We can implement multiple virtual datacenters, where each new virtual datacenter corresponds to newly added set of machines. In this case we will extend write application so that it could 'touch' old hash functions (and thus old virtual datacenters) first to determine whether it can store data there and move to the new machines if there is no space in the old ones. Reading can issue a parallel lookup to all virtual datacenters asking for given object ID.

This scheme has latency limitations as well as network traffic growing with new virtual datacenters involved, but it can be a good decision for smaller setups though.

Virtual datacenters (or configurable hash/transformation functions used to generate transaction ID) becomes one of the most flexible 'tools' to implement different storage setups.

Stay tuned, there will be more news soon!

Two days of snow

Tagged:  

I used to hate skiing - I wasted 3 years in running ski section in univercity, while I could play football or, let's say, chess. Well, there was no chess section, but whatever else it could be more interesting than ski.

And this year I opened myself alpine ski. I did it about 15-20 years ago previously when was in school, and it was simple small plastic skis. Technology made a significan progress since then and I got ability to test real skis.

That's what I did this and previous weekends - two days in Stepanovo ski resort. It was essentially the first time I tried big slope (not that big compared to real resorts in Europe of course, just about a kilometer or less and 100 meters drop) and real snow. And it was fucking incredible - it is fast, it is long enough to feel the speed and ground, it is quite different - there are multiple traces and a lot of small roads from main trace, where one can ride over hummocks and small ski jumps.

I bought myself all equipment except skis itself - want to touch different things first, but I believe I will get my own next time. With the proper equipment it is not cold, warm or wet, it is just ubercool. Getting that I basically have no technique, I open lots of cases for myself all the time. And I believe that I have some progress, maybe not that good, but very pleasant for myself.

I tried long blue trace previously, but today I started a red one. And it was fucking beautiful - so fast and so strong. No boring places and long waits, just pure pleasure of speed and control. On this trace I found myself moving noticebly more technically than on a simpler trace.

I started to sit lower, put legs closer and change ski edges using mass center and not ass or legs, pipe changing arcs became shorter and with longer radius, which increased speed compared to plain skiing.

Of course it was not always perfect, and frankly I believe it looked like crap and was a real crap from good technique point of view, but it was very pleasant for me, and that's what matters. I want to get another hour or so with good teacher, who will tell me where main problems are, since I can not see how I made a slope. Sometimes I flew over the trace couple of meters and than landed in 'different positions' usually already without skis moving on my body another dozen of meters. But I like it too - it shows complex cases and sharps instincts.

Currently I believe there are no somewhat big parts of my body, which do not try to scream and ache. Especially shine bones (hard to move or stay long enough) and various leg muscles, but it is not a problem - I will be fresh again in a day, and hundred or so of "The Glenrothes" and couple of hours playing piano and trumpet will quickly help me. So plan is to make another turn next weekend or preferably move to ski resort couple times.

Fucking incredible. Just love it!

Elliptics network background fsck

Tagged:  

Its original draft could be read previously, but I believe it became a little bit outdated, so requires some highlighting.

But first, let's clear the status of fsck log checker. I completed its implementation, which is now capable of supporting consistent number of copies in the storage. It does not allow to merge different transaction logs yet.

To determine object to check it uses special text log file, which among other info contains name of the object and transformation functions to work with. Each transformation function will produce unique ID, which will be checked in the storage. For example we can put there sha1 and md5 transformation functions, so we will have two IDs equal to appropriate hash of the input name (and optionally hash of the transactions content).

When some objects are not presented in the storage, checker will download first existing copy and try to upload it using transformation functions corresponding to missing objects. So, if object with ID being equal to md5(name) is present and sha1(name) isn't, then checker will download all transactions stored in the existing object and upload them using sha1 transformation, thus recovering requested number of copies.

Checker currently requires log file to get information from and admin to start the process.
Background fsck is supposed to eliminate both needs.

Basic idea is to store some metadata with each object, which will tell origin of the given object and how it was supposed to be stored in the elliptics network. Thus we can timely or on request parse metadata for all objects in the given node (or only part of them), create a log file and run existing checker against it.

It becomes similar to what extended attributes are in the existing filesystems. Metadata can contain information not only about what object is, but also its IO permissions or access policies, owner information and anything else we would like to have there, which will allow to implement at least basic security model for elliptics network as well as simplify POHMELFS port.

Elliptics network: 2.6.4 release

Tagged:  

It took a while to prepare a new release of the distributed hash table storage elliptics network, but here we go. This is still a minor version bump, although amount of changes is rather large for small update.

Likely this will be the last releae in 2.6 release cycle, since in parallel we are cooking up a completely new versioning and merge logic as well as data synchronization. Btw, this release breaks to some degree that logic, but there is a tool to fix things up. It will be automated in the next versions.

But let's dig into details and changelog:

  • Data integrity checker. Although a little bit undocumented (see example below), it allows to check whether given object is present in the storage with requested number of its copies. And if number of found objects does not correspond to config, it will automatically download and upload data with the desired IDs. Later this tool will also be able to upload data into the storage. This checker will be a base for background FSCK, which will be a simple script, which will parse metadata and start checker with given log. It also supports external library call for requests merge.
  • [FCGI frontend]: cookie, timeouts, tunable headers, variable content types, more and clean XML.
  • Rewritten network state and reconnection logic. This makes NATed box support trivial (we do support it), client nodes became even simpler than ever, less code, less bugs, everyone is happy.
  • Debian debug package.
  • Fair number of bug fixes. This version is used in production, if time permits I will describe this load in details later.

Modulo possible bugs, main work is concentrated on the filesystem checker. There are two problems to solve.
The first one is absence of transaction log made by requested transformation function, or in plain words - absence of copy of the object in the storage. This happens when some node went offline and returned empty or was replaced. Or did not return at all. In this case fsck application will check how many copies are present in the storage and automatially download one of them (the first one from config) and upload with given ID.

Second issue to resolve is transaction merge. Elliptics network by default uses transactions for every update, so there is no object as is in the storage, instead reader will download transaction log, parse it and select transactions which cover requested object range. It is hidden in API of course, but it is possible to manually select needed transactions, for example to support versioning and data snapshots. As tasty effect two fully equal transactions (objects) will not use two times more space, since there are appopriate transaction reference counters.

Currently there are multiple (5) merge strategies, but practice shows that they introduce more harm or misunderstanding at best, than actual goodness. So I decided to drop them all in favour of trivial timestamp based merge algorithm. Of course it is possible to merge transactions based on private algorithm, which can be called from fsck daemon. We have request to allow external modules to merge objects based on actual data.

This version disables content synchronization during node joining. Instead admin has to call fsck application with externally stored log of the uploaded data to check whether things are ok and fixup what was broken. It will be automated and no external log will be required in the next versions.

Fsck application log file should look like this:

3 0,0,0 sha1,md5 object_name

where '3' is object creation flags - without transactions, just like those created by FSCK frontend. Will be removed in the next version.
'0,0,0' is a placeholder for object parsing information meaning start,end,update_existing. Start and end are positions of the starting and ending symbol in the object_name used to generate ID. Zeroes mean automatic detection. Update_existing is not currently supported, in the next version if set will upload local file named object_name into the storage no matter if its copies are already present.
sha1,md5 - transformation functions used to generate ID from object_name. This setup uses two copies - each one created by appropriate hash.
object_name - name of the uploaded object. Its hash (or actually transformation of the name using presented functions, it is allowed to be some other function than plain hash) will be object ID.

Stay tuned, work is boiling and results are very close!

Elliptics network got new on-disk format

Tagged:  

Eventually any storage should go into production mode, which implies not only data storage itself but also access restrictions. Distributed hash table systems, like elliptics network, do not have dedicated servers which could store that information and manage access permissions, so each object should have its own set of rules. Although without proper security framework on top of network media this will not guarantee required data access granularity, but even in this model it is still possible to implement IO permissions to some degree.

Until now elliptics network did not have even a slight mechanism for doing this. And even that rudimentary supported metadata was stored in the transaction log and did not allow any kind of extensions or proper updates.

And although I did not yet write any line of code to deal with metadata, I already broke old-style transaction logs, which now contain only and only transaction information. There are no metadata objects at all, but I will update appropriate parts of the library to generate them and store in the separate entities.

It is possible to store metadata in the different objects like the ones being indexed by the hash of the original object's name plus some extension, but this will force system to perform two lookups to find out needed object and its metadata.
Another way is to add new object type to existing transaction and history log objects - all metadata will be stored close to the object itself and could be fetched using only object ID. In the filesystem backend where each object is stored as separate file, metadata will be indexed by the '.metadata' extension or similar - just like we have $ID.history for transaction logs. In the database backend (BDB and Tokyo Cabinet, although I seriously consider to drop the former, since it is unacceptibly slow compared to TC) it will be a separate table, indexed by the object ID.

Metadata will have flexible format (maybe even human-readable one based on strings?) to allow extensions without breaking backwards compatibility.

But first I should fix background log checker, which although syncs all kinds of objects currently (i.e. when there is no some object in the storage, but there is its copy with different ID, it will upload missing data from that copy), it does it slightly wrong way, namely messing with hashes and producing unneded additional transaction references. When checker is ready, whole storage fsck process will just combine a log based on metadata objects, and start check process for it.

Stay tuned, we are very close to the next major release, which will draw the line of the serious features and changes!

Opened skiing season

Tagged:  

Of course its downhill skiing, I used to hate runnng skiing wasted 3 years in the section when was in the university.

And actually I not only opened a season, but tried it first time. Some years ago I made a downhill run on the board, but I weared 'grinders' shoes instead of special shoes :)

Anyway, I do not know how to downhill, so I took an hour ski lesson, found myself can not being able to perform even the simplest things like V-deceleration (I do not know how it is called in english, but it is supposed that feet form kind of V figure).
Apparently I found a way to learn this quickly - when you flight into the wall on the bone-breaking speed it is better to find out a way to decelerate and stop. One of them is to fall, but it has own and rather serious problem - pain, haematomas and ego drop.

First trainig was rather painful and without interesting results except that I found myself very liking this stuff. So I went there in a day and enjoyed the hell skiing from the top. Even multiple times down to the bottom without falls. And EVEN once (or at least half-once) I moved the way and speed I wanted.
I'm sure, my carving and V-turns are likely ugly as hell, I like how things go.

A good news is that whole-year hill center (well, I was lazy to move to the real hills :) is very conveniently located between my home and office, so I can spent 1-2 morning hours there. I always wanted to be able to ski and its quite possible now with car.


Not me :)

In a meantime elliptics network went production

Tagged:  

We started the first production elliptics network distributed hash table cloud last week. Configuration is rather simple - there are 3 virtual datacenters each one contains two physical machines with several Tbs of space each.

I do not know precise number of HTTP proxies (fastcgi frontends for elliptics network) installed, but we talked about one or two of them in each datacenter. Each proxy has 5 uploading and 50 downloading processes. Uploading ones are hidden behind firewall, and downloading proxies are configured to read small objects through themself (like XML and image files) and big objects (data files) are downloaded directly from storage nodes, proxy only generates XML output with direct URL to some storage node.

This pilot project will host only about 2-4 Tb of data, each object will have 3 copies, each one stored in the appropriate datacenter. We will add geographically-spread datacenters soon, which will host only the most popular content, so that clients local to those storage nodes would not go the main datacenter.

Objects are not supposed to be updated during its lifetime, only once uploaded and removed when time requires. Data synchronization for failed and new nodes will be done using fsck application, which currently does not support advanced merge algorithms we had in the elliptics network, it only checks number of copies and optionally downloads/uploads if something is missing. Fsck application uses log file to get information about what and how objects should be checked, currently storage does not store this metadata with objects themself.

And actually background fsck daemon discussed previously will not be something very different from this application. Instead I suppose that having a script which will parse object's metadata and invoke application is a good approach. Fsck checker does not yet work with all possible types of uploaded objects, namely it was not yet tested with the transaction logs, only with objects themself, since it is what we use in the pilot project.

The more I think about transaction merge algorithms, the more I like the idea, when we only merge them using timestamps. Currently elliptics network has 5 merge algorithms, which may or may not be appropriate in some or other setup. Merge algorithm is invoked when node joins the storage to sync its content with what we have in the storage. Idea is to drop this join syncing and use postponed fsck which will sync data sometime in the future. Thus it is a prerequistic to have multiple object copies in such system, since it can take a while to sync joining node.

I understand that it is not possible to maintain in-sync clocks on multiple machines to have correct timestamps for each update transaction, but it should not be a problem if proper locking is used. Although there is no appropriate distributed locking system yet, it will be implemented anyway (plan is to write PAXOS locking daemon).

Fsck application will be extended to allow external libraries to merge data, namely I have requests to allow external entities to merge data based on its content and not meta informaion I might store, so fsck will just call external functions loaded from provided shared library if configured.

I did not make a new release yet, we will figure out possible bugs and complete fsck application testing first. Next version will not have syncing during join time, so this will be a major update.

Stay tuned!

Fixed comments from anonymous users

Drupal updates did not sync up with module changes, so captcha was screwed a bit.

Comments work again now.

Improving morphological analyzer and sentence generator

Tagged:  

(сверху впечатлению заикалась дотация , вокруг видению она выныривала)

Since russian has noticebly more complex morphological structure than english, and I basically do not know english good enough (most of my blog readers already noticed that :), I work with my native language, although derived logic could be applied for other language analysis too.

First, automatic grammatics generation is not very complex task, but since every word can have multiple meanings (like the same word can be a verb and noun in multiple forms with different cases and so on), number of grammatics automatically derived from given sentence rarely equals to just one and tends to explode with large numebr of words in the input sentences.

So there should be a way to eliminate some of them from the initial set. One of the factors, which can allow to drop impossible combiations of the word forms is relations between some common word forms, like closest adjective and noun, which should have the same case, or some prepositions which are only applicable to noun in selected case. I'm pretty sure that number of such rules is quite big, but getting that my last russian language lesson was in school kind of 15 or so years ago, it is a bit hard to summarize how it should look like.

I loosely implemented just couple of rules, and result looks noticebly more correct now. From grammatical point of view of course - currently I do not aim at generating content with some correct meaning.

(мы начинаем спокойный видеоряд о кости)

Another issue I was stumbled upon is actually wrong morphological information stored in database and dictionaries. Like verb voice or noun 'animation' i.e. check whether it is related to alive subject or not.

System can derive active voice, but dictionary will suggest passive voice verb for the substitution, and phrase will explode reader's brain first by the fact, that it is a grammatical crap way before it will the head with its meaning. So I added couple of checks (applied to russian language only of course), and results improved noticebly.

Another big problem I have right now is preposition logic. Or actually its absence - in russian each preposition can be related to only limited set of noun forms, and to date I did not find such information structured into the form, suitable for the database. So I need to manually write such information for about several dozens of such 'small words', which I'm a bit lazy to do.

That's why automatically derived grammatics which contain plain prepositions without relation to the appropriate nouns usually produce rather ugly random sentences.

To play with the system I implemented ability to generate content from manually created grammatics. Usually it differs from automatically derived ones only a little bit.

(к решению вернулась правда , сзади сомнению она пожимала)

To date I consider this part of the AI task as ready. Next one is long-term memory and fact extraction.

Basic idea is to create a system which will be able to memorize not only document content, but also relations between separate word forms and phrases. Such memory will allow to extract knowledge from the input data according to already learned information.
Thus we will be able to select words not randomly like now, but according to system's memory, so that selected set of words will be intelligent and related to some of its internal previously learned facts.

And it has to be automatic. The same way system is able to automatically derive grammatics from the random sentences according to some rules, it should extract possibly hidden relations between terms in the input data.

Its time to have some serious thinking on this problem...

Living outside of civilization

Tried to order Kindle DX from Amazon - fuck me, it is only available in US, although its description page clearly says it can be shipped outside (including Russia in menu):

Good news. Kindle DX can now be shipped to customers outside the U.S.

Apparently our dollars differ from ones in US - Kindle can not be ordered.

Amazon has a long negative story of relations with orders from Russia - it even did not want to allow me to order even a bunch of trumpet books, although could deliver them to much longer distances like New Zealand. Recently (well last year iirc) it changed delivery method and dropped air mail support completely. Fortunately UK Amazon still allows this.

And even eBay.com (there is only one Kindle DX seller) does not want to deliver outide of US.

I do not want to read books from notebook screen, although i do not have any negative feeling, but it is just unconvenient. And do not want smallish Sony PRS (even its 900 model) - 9.7" is a good fitting size for reading, close to usual book sizes :). And frequently waiting for paper copy is way too long, also frequently there are no needed books at all.

Crap.

Suddenly: back to business

Tagged:  

At took a bit of time to settle things down all over the place, so I can continue with the things I used to work with. Now huge post about things happened :)

First, recent development things. Altough a bit confused, but still.
Last several days I worked on morphological analysis and mainly on automatic grammatics extraction from texts and text generation based on them. I did not care much about performance, multithreading and the like, but instead concentrated on the idea. So I selected Lisp for prototyping, to date I do not know whether it will have more serious usage.

So, task of grammatics extraction is rather trivial when you have a database of morphological data. Namely I wanted to parse a sentence and get a knowledge about all words and their morphological forms, for example for nouns it could be at least cases and number and so on. Having a huge database for all possible word forms and all words in russian language (I selected it since I know it well and it has a quite reach morphological structure, but technique is applicable to any language of course) is not a feasible task.

So I developed a lemmer/stemmer, which has a limited database (I use about 10k words), obtained by aot.ru output parsing, although originally I wanted to DoS wikipedia with my requests. Getting that there are no on-line dictionaries (for russian at least) with tagged morphological data and structured output, I had to write regexp parsers to get that information from HTML pages. Wikipedia has noticebly worse and noisy output compared to aot.ru though.

To select unique words I downloaded one rather big text (Pelevin's "Generation P") and used Levenstein-Damerau distance to select words, which are rather far from each other (I used 0.25 normalized distance as a threshold, i.e. words are considered different in this metric, when roughly more than 1/4 of the letters are different). Then I received morphological data from aot.ru and stored it into local structures.

My lemmer allows not only to obtain information about words it has stored in the database, but also guess it for the words, which it does not know about based on matched word endings.
Grammatics generation becomes rather trivial - iterate over all words in the sentence and write realted morphological information without word form itself.

Second part - sentence generation, is rather simple, when morphological data is well structured. Namely I use derived grammatics and select random words which have to match morphological data for grammatics. So it looks like a grammatically valid sentence, but it contains nonsence of course, since words are not related to each other and do not follow some 'meaning' of the phrase. Also, there is no information for prepositions, which concatenate different forms, and some of them (in russian at least) can not be used with some forms and vice versa.
Another problematic part is multiple meaning of some words. It is even possible that the same word will not only have multiple noun cases with the same wordform, but simultaneously will be a verb in some form. And I did not yet develop a sentence analyzer, which will drop grammatics which do not match russian sentence rules, like must-have verb and/or noun forms presentd in the sentence and so on.

Below is several examples of how it works (in russian). First sentence is origianal text.

(какая-нибудь простая грамматика с текстом .)
(сдайся эротическая девушка меж мотоциклом .)
(жмурься рыжеволосая межа вне логиком .)
(вдумайся техническая папироса от борисовичем .)
(ответь тренажерная сидоровна изо симптомом .)
(просыпайся непосредственная харольдовна безо фактом .)
(тревожь забытая судьба между бомжом .)

It is possible to generate grammatics by hand and select words around some meaning, but such manual interference is not what I want. Next step is sentence processing rules described above. It is rather simple task, but it is a must-have feature for text generation.

Noticebly more complex is knowledge extraction problem and long-term memory, which in turn will allow to select words tied to each other based on previous experience. Using such technology system will be able to understand meaning of the data in terms of related words and generate reply based on its knowledge of their relations.

This is a task for some future though...
--------------------------------------------------------------------------------------------------------------

Another lexical problem I worked on is language detection. Common algorithms use N-M gramms, where N is number of letters and M is number of subsequent words, such NM-gramms are used to calculate conditional probability of the next characters based on probability of the previous ones in selected NM-gramm, so it is possible to detect languages, when system was trained and language-specific NM-gramms have been selected.

I think that brain works quite differently and does not calculate any kind of NM-gramms at all, but instead use highly parallel fuzzy words search in the memory. I did not yet develop a fast fuzzy searching except calculating Levenstein-Damerau distance against every word in the dictionary for every input word, which is rather costly task. So this is another interesting task to think about.

So I decided to switch to more simple matching - cut off endings and match against learned words. Thus I implemented in LISP a simple RADIX-based algorithm, where downloaded documents are parsed and reversed words (optionally without one or two last letters - kind of endings) are inserted into RADIX tree. Checked words are reversed and looked in this 'dictionary' optionally without one or two letters from its end - I kind of cut off the ending. When lookup returns a match system considers given word as being part of the language it refers to. Of course it is possible that the same word will be present in multiple languages (especially when training corpus contained words from different languages like what I used: raw wikipedia pages), so to determine document language we should check all words and calculate how many of them matched against every known to the system language. It is still possible to check single words of course.

This simple technique (less than 200 lines in LISP not counting RADIX tree implementation) behaves surprisingly good in the test case I ran. Namely I selected 3 big articles (several thousands of words) from wikipedia in english, turkey, ukrainian and russian languages, and then got wikipedia texts (not used in learning of course) and text matched its real language with probability (just a division of the matched words (in the above sence) to all words in the document) noticebly higher than for any other languages. All texts were downloaded automatically and CL-PPCRE based parser removed all tags, numbers and non-letter characters.

Here is an example output for english learning process:

$ ./get-page.lisp :radix-root-path=radix.obj.en :learn-url-path=/tmp/learn.en.txt \
  :output-dir=learn-data :check-url-path=/tmp/check.txt

url: http://en.wikipedia.org/wiki/Bahá'í_Faith, learned words (including dublicates): 9197
url: http://en.wikipedia.org/wiki/Carabane, learned words (including dublicates): 11072
url: http://en.wikipedia.org/wiki/Is_This_It, learned words (including dublicates): 6469

url statistics: http://tr.wikipedia.org/wiki/Uğultulu_Tepeler_(roman)
total match: 35 %

url statistics: http://en.wikipedia.org/wiki/Wuthering_Heights
total match: 60 %

url statistics: http://ru.wikipedia.org/wiki/Заглавная_страница
total match: 7 %

url statistics: http://uk.wikipedia.org/wiki/Головна_сторінка
total match: 6 %

As we see, it detected english language with 2 times higher probability. I skipped other tests (turkey, ukrainian and russian), but they show similar numbers.

Here is example for the most popular russian livejournal blogger Tema Lebedev LJ page and its profile:

TR: tema: 5%  profile: 21%
UA: tema: 30%  profile: 15%
EN: tema: 8%  profile: 27%
RU: tema: 47%  profile: 20%

Profile contains rather large number of english usernames words, so result is quite correct.
Percentage is far from 100% since small number of words were learned, I skipped prepositions and other small words and there are non-russian words there of course.

Not sure whether this is a very useful project, but I did not regret a day spent on thinking and development.

------------------------------------
That's it for noticeble development issues. Now lets more to life happenings.

I made an eye correction operation and can look at women without glasses now. I can also swim, play tennis and football and overall behave like a normal person. It is fucking cool feeling!

Operation itself is painless, but was rather complex from psycological point of view, at least for me. Especially things like vacuum cup on eye, can-opener-like part of the cornea cut and the like. But overall it was not something you should be afraid of.

I absolutely do not regret I did it.

I also filled an action at law against development company, which build a house I bought appartments in, to recognize me as a real owner of the appartments. It will take a while to settle though, I think a month or two.

Ugh, and I play trumpet. I do play it, and it sounds quite good, when I'm in a good mood and can play loudly on my Yamaha. I still can not improvise out of the head, but I usually have no troubles playing some melody after I learnt it. Learning can take a while if I did not hear melody before. Piano playing is rather stuck - I prefer to learn melody first on piano, but I have real troubles playing by both hands, even when left one part is really trivial like 2-3 notes. So I usually learn melody part only before trying it on trumpet :)

Heh, I'm back to business :)
Stay tuned!

Knowledge extraction and morphological analysis

Tagged:  

I started to work on basic lingustic tasks, namely morphological analysis. And started to do that using Lisp of course, although suspect that using C++ (with its cool STL) would be much faster to develop.

To date I implemented a simple algorithm to calculate Levenshtein-Damerau distance between words, which can be used for fuzzy text search. Effectively that's what ispell does.

Now I work on morphological analysis itself. I decided to start with Russian language for this, since I know it quite good, and it is rather complex one. Application should get a word and provide morphological tag for this: whether it is noun, verb or whatever else, which case or conjugation it has and so on.

I want to use a trainig corpora to teach algorithm about lemmes and inflexions, which then can be used to determine other forms and try to guess them if there is no strict match. I will use rather simple algorithm - separate endings and use them as a pointer to RADIX tree roots, which will contain reversed testing word roots. Ending and matched root will allow to determine morphological nature of the word.

To date I finished simple compressed RADIX tree implementation in Lisp, but there is a huge problem with trainig corpora. Did I say that all linguists are greedy bastards? There are no morphologically tagged corporas in public access, and even those who claim having it, does not provide such access, at least I did not find it anywhere except wikipedia.

Which in turn provides it as a HTML page like this, and of course there is no strict template or at least standard format for morphological part of the page. So it can add some additional tags or symbols between strings and so on. I implemented a simple Lisp HTTP downloader, which tries to analyze wikipedia pages and select morphological information, but to date it only work for nouns and adjectives. Next task is verb.

Then I will be able to build a testing corpora and create morphological parser. I expect this is already implemented in ispell, and I could use its dictionaries, but I was not able to find out how to make it dump morphological information about words.

The main question is why do I need this? And the answer is 'for lulz'. Plan is to create a simple grammatics generator, which will take training text, analyze every word and store learned grammatics. Then application will select some other words and produce sentences using those grammatics.

Stupid, but wery interesting, and allows to move furhter...

Stay tuned!

Elliptics network progress

Tagged:  

Elliptics network - a distributed hash table storage got initial fsck implementation. Actually I can only say this is a data checker, which does not try to recover if something is broken.

And it only works with provided text log and only for non-transactional objects uploaded, i.e. what is generated with HTTP fastcgi frontend. Fsck checks number of copies in the storage according to config and reports them to the user, which in turn can manually start upload process if number does not match.

This is a very first step of course and things will be improved with time, but it is already quite handy for administrators. To date elliptics network storage still requires manual intervention for 100% guarantee that storage contains needed number of copies, although in some cases it can be done automatically.

Uberpatch

Tagged:  

Hey, hackers, did you ever commit such a patch:

...
  11 files changed, 4550 deletions(-)

I'm pretty sure this is a rather rare event. And next Linux kernel will include such a brilliant - Greg Kroah "driver killer" Hartman dropped DST driver from the tree - my cool network block device with tons of tasty features will rest in peace.

And I fully acknowledge this decision. Actually it was me who pushed that idea at first place, since block devices are dead. In my opinion, only few niches are left for this low-level stuff, and most if not all of them do not need something new, even if it has something others did not have.

Let's move forward and look for real distributed storage systems, which do not require additional metadata servers to maintain object info, which scale horizontally, which do not dedicate special nodes for network routing and so on. Let's move to peer-to-peer solutions, which I believe are the future, at least the closest one, of the parallel distributed solutions. To the solutions like elliptics network.

There can be multiple interfaces to such storages, like filesystem one (future POHMELFS version) or block level access (like in NTT's SheepDog), but no matter what, it should be much more than a simple point-to-point connection.

So, DST is dead. And this is good. Let's not stay on the same place.

 drivers/staging/Kconfig           |    2
 drivers/staging/Makefile          |    1
 drivers/staging/dst/Kconfig       |   67 --
 drivers/staging/dst/Makefile      |    3
 drivers/staging/dst/crypto.c      |  733 ----------------------------
 drivers/staging/dst/dcore.c       |  968 --------------------------------------
 drivers/staging/dst/export.c      |  660 -------------------------
 drivers/staging/dst/state.c       |  844 ---------------------------------
 drivers/staging/dst/thread_pool.c |  348 ------------
 drivers/staging/dst/trans.c       |  337 -------------
 include/linux/dst.h               |  587 -----------------------
 11 files changed, 4550 deletions(-)

P.S. Found myself reading 'Engineering a compiler' by Keith Cooper and Linda Torczon with much more interest and understanding than Dragon compiler book from Alfred Aho.

New elliptics network release: 2.6.3

Tagged:  

Ellipitcs distributed hash table storage does not stay on the same place for too long, so I made a new release.

This is a rather minor one compared to what is scheduled next, but still it is rather big one. It mainly contains new shiny features and bug fixes, so here is a short changelog:

  • Improved Debian package building machinery
  • [HTTP fastcgi frontend] Added more commands: statistics, unlinking
  • [HTTP fastcgi frontend] Improved security context of the dataflows: permission checks, cookie and secret generation
  • [HTTP fastcgi frontend] Implemented ability to download data through the proxy and not via redirect link
  • [HTTP fastcgi frontend] Added DNS lookup option, which allows to return fully qualified names in redirect XML
  • [HTTP fastcgi frontend] Extended documentation
  • [HTTP fastcgi frontend] Switch from CGI mode to more thread-friendly FCGX mode
  • [HTTP fastcgi frontend] Virtual datacenters and geographical binding implementation
  • [HTTP fastcgi frontend] Allow to use external library calls for pre- and post-processing of URLs, FastCGI daemon can request region ID from preconfigured library
  • Added -fstack-protector-all compilation option (which breaks Debian Etch building)
  • Extended file IO backend - now it stores objects in subdirs indexed by configurable number of bits of the ID, previously it used hardcoded 8 bits (256 directories)
  • Fair number of bug fixes

As usual, all sources are available in archive or git tree (latest snapshot is also there).

Now you can build your own distributed fault-tolerant hash table storage virtually by few commands started from script. And although documentation is rather fluffy and there are no plain and simple HOWTOs, it should not take too much time to get in touch with the applications.

There is one issue yet to be resolved: background fsck.
The more I think about it the more I like idea when node does not copy content when joining to the network and instead that background fsck task will do the work, there is a fair number of problems with content sync during joining, which will fire up when we will sync 10 Tb of data from one machine to another (those numbers we talked about recently). Well, 'empty' joined node will not return data (until it is copied there), so client will request it from backup copy, but there will be no need to postpone writes.

Its initial implmentation will use external log to check specified objects, and if some of them do not have preconfigured copies, they will be fetched from the other servers. This application is supposed to be started either by administrator when new node joins the network or automatically during node setup. Log has a simple text format and can be edited manually if needed. Log creation at first place and its maintenance is an administrative task though, for example, it can be sed/awk generated from HTTP server logs, if elliptics fascgi proxy is used for data upload.

Eventually this will be done automatically in background, and fsck daemon will not use some external log, but instead will process stored object metadata on the alive nodes.

The former change is scheduled to the end of the year - this will be the last 2.6.x release, and while actuall background fsck does not break API, it is a rather major change in the server-side logic, so this will be the first 2.7 release early in the next year. So far they are the only major changes for the foreseeable elliptics network future.
Next is POHMELFS port.

And in a meantime I very-very-fucking-very much want to implement a rather simple LR grammatics parser/generator implementation. I just sleep and see how productions are made. Well, not particulary this, but that's what I want to spent some time on.
Aho's Dragon book looks at me like at the shit.

Stay tuned!

Passive OS fingerprinting extension in iptables-1.4.6 release

Tagged:  

OSF userspace part is now in iptables-1.4.6.

So far without configuration utility, but appropriate OSF shared library is there in extensions/ subdir.

Dynamic storage node selection in elliptics network based on IP address and queries

Tagged:  

Let's imagine a distributed hash table storage build on top of elliptics network physically spread over multiple datacenters which are geographically separated all over the world.

Let's further suppose we want to fetch some data from our local Moscow datacenter instead of main basement somewhere in Nepal. We can assign a special transformation function like dc256_sha1 for this, which will force all transaction IDs to have ff000000 as their first 4 bytes. Moscow datacenter will have 0 ID, so if there are no other nodes with IDs more than 0xff, all requests with above transformation function will point to Moscow.

So far so good, we install this function in main load balancer all over the world, but will put it somewhere at the end of the list, since in Nepal in particular people will prefer not to go to Moscow. But what if some clients in Moscow connected to main load balancer in Nepal? If it does not know how to dynamically select transformation functions from its set according to IP or query, it will try the one listed first.

This is changed now - I added two calls to external library (not counting initialization and cleanup callbacks) which will receive address and query, start one is executed before data is processed (i.e. before GET/POST handlers start) and stop one is executed at the very end (just before data is freed). Elliptics fastCGI daemon will call external library to get region ID from query and address, and if it is more or equal than 0, it will search for 'dc$id_' string (like 'dc256_' string) and if there is such transformation fucntion, it will be moved first. Stop callback will move it to the end of the list.

So, when such system is installed all over the world datacenters, their frontends will receive requests, search for appropriate hash (transformation) function and probably (if configured) use local to client storages. Currently aforementioned library to obtain region ID from IP address is not public, so this feature is a bit useless for the non-programmers.

Everything can be turned off of course - it is just a matter of proper configuration options.

This was the last task before background fsck project. Which in turn is the last one for the elliptics network (modulo bug fixes).
Then I will switch to POHMELFS to finally feel again taste of kernel programming.

Virtual datacenters and geographical linking in elliptics network storage

Tagged:  

Elliptics network is a distributed hash table storage, which among zillions of other features allows to implement virtual datacenters now. It is a set of nodes combined into some logical group, where nodes may or may not actually be physically groupped together.

I added a hash prefix (one byte) to implement virtual datacenters.

This feature allows to specify preconfigured prefix in every transaction ID (the first byte). If nodes are added with IDs starting from the same number as in above hash function, then every transaction will go to that ID.

Here is a virtual datacenter example.

Let's suppose we have two transformation functions setup on client: dc1_sha1 and dc2_sha1. They will produce sha1 hash of the data with the first byte set to 1 and 2 accordingly.

Now let's suppose we add two nodes with IDs being equal to 0x0100 and 0x0180 and then two nodes with 0x0200 and 0x0280 IDs. Now there will be two transaction made with above transformation functions with its first byte set either to 1 or 2, so there will be a guaranteed copy in the first set of nodes and in the second set (sets are indexed by the first ID byte, this set is a so called virtual datacenter).

Now we can implement multiple sets with different ID's first byte for every datacenter we want to work with. If we only need to have 2 copies, but there are more than 2 datacenters, machines from them can be spread between those sets of nodes. '2' is arbitrary here of course.

This feature also allows to implement geographical linking of the requests. Let's suppose some application receives data read request from Moscow, it can check whether its set of transformation functions contains the one with the ID assigned to Moscow. If there is such a function, it can be used first to obtain object ID and to fetch it from Moscow-local servers instead of going to New York, where main datacenter lives. If Moscow storage does not contain our requested object, we will use second transformation function, which will 'point' to main storage cluster.

I will extend HTTP elliptics frontend to support external library call, which will determine object ID from received URI and client address. In particular it will be possible to write some external library to parse address and its local geobase to select transformation hash according to that data, or fallback to generic mechanism of trying every preconfigured transformation function to generate ID.

In a meantime elliptics network development

Tagged:  

Its HTTP fastcgi frontend got ability to request statistics from remote nodes and tell clients whether things are good or not. In particular it is able to return XML with CPU and memory usage stats and filesystem data (total and available size, number of files if supported and FSID).

Here are fastcgi (lighttpd) config options:

# request remote nodes statistics
# plain stat will request stats and return 200 status if either
# number of received replies is more than $DNET_FCGI_STAT_BAD_LIMIT (if set)
# or is more than number of bad replies
"DNET_FCGI_STAT_PATTERN_URI" => "/stat",

# this will always return 200 status with XML data showing node statistics:
# CPU and memory usage and filesystem data (total and available sizes, FSID and number of files)
# if such statistics is supported
"DNET_FCGI_STAT_LOG_PATTERN_URI" => "/stat_log",

# plain stat request will return 400 status when number of bad replies returned from the
# storage is more than this number or when number of good replies is less than that.
# It is not used when log statistics is requested.
"DNET_FCGI_STAT_BAD_LIMIT" => "1",

Next task is to actually implement virtual datacenters split via hash functions. This idea is described on elliptics network homepage. This will allow not only to split copies into different datacenters, but also implement special caching nodes, which will be mapped by additional hash function, so that some popular content could be fetched not from the main cloud but from those dedicated nodes, it can be controlled on per-address basis for example.

And the main idea is background fsck. The more I think about it the more I like idea when node does not copy content when joining to the network and instead that background fsck task will do the work, there is a fair number of problems with content sync during joining, which will fire up when we will sync 10 Tb of data from one machine to another (those numbers we talked about recently). Well, 'empty' joined node will not return data (until it is copied there), so client will request it from backup copy, but there will be no need to postpone writes.

Another idea about background fsck is special log it can use to fetch data, which will contain all objects written with its IDs, so it would not ask every single object whether its copy exists or not, but only process those ones which are presented in the log. This is a client responsibility to write one though, for example it can be formatted from POST access logs from HTTP frontend proxy.

That's the plan, stay tuned!

Elliptics network HTTP download performance

Tagged:  

This time I ran download test, where HTTP fastcgi elliptics network proxy downloaded object from the network itself and sent it to client.

We uploaded several tens of thousands files from 4 to 15 kb each into 4-servers storage, each server hosted two elliptics node attached to dedicated SCSI disk.
Proxy server has 2 physical 2.3 Ghz CPUs (+2 HT CPUs) and 8 Gb of RAM. Storage severs and fastcgi proxy are connected over 1 GigE network. I started 4 lighttpd dispatching workers and 200 elliptics network fastcgi daemons. There was fair number of network stack tunables we touched because of problems found in redirect test, which I will describe below.

I do not know what is the client software (called Tanks here), but it is able to generate HTTP GET load with (tens of) thousands requests per second rate with 250 rps steps. Graph below shows reply rate from elliptics network HTTP proxy (legend is the same as in redirect test).


Elliptics network HTTP fastcgi proxy download test

Proxy node was maxed at about 7-8k rps - all 4 CPUs (2 physical + 2 HT) were at 100%, where most of it was eaten by dispatching lighttpd processes. Active elliptics daemons ate at most 10% with turned on small logs, cookie and authentification messages generation and data forwarding itself (proxy downloaded 4-15kb objects from the elliptics storage and sent data to the client).
Elliptics storage nodes ate at most 10-12% of CPUs.

Now about problems we found during redirect test. In this test proxy node does not fetch and sent requested objects to client, instead it generates small XML data with direct URL to the object, which client can use itself.
With the same test setup system loafed at 8k rps and 50% CPU usage (most of which took caching bind9, since system was configured to return domain name instead of IP address, which is needed for correct cookie work). At this moment something cuts the balls happens, and perfromance dropps to about 300 rps, CPU usage decreases to zero, but nothing crashes or goes out of file descriptors (there is a fair number of them configured in ulimits). To fix the problem we tuned network stack (previously we regulary got warning about timewait socket table overflow) a bit, but problem remains (maybe it is related to client software, since I thought it is 8 nodes with 1024 fds ulimit, but the way perfromance drops and the fact, that it did not dissapear after ulimits were tuned says problem is somewhere else).

According to all observations we should be able to process 15k rps in described redirect test on the 2-way (2 physical + 2 HT CPUs) machines.

Testing elliptics network small IO size write performance

Tagged:  

I setup a 6-machine cluster where each node was placed in different datacenter (at least that's what I was told), connected over 10 Gbit network, but actual network speed was about 20-30 MB/s - it is production datacenters after all (I'm curious whether filling up the whole available bandwidth made people unhappy).

Each system runs on 8-way E5440 64-bit Xeons with 16 Gb of RAM and has two test SCSI disks attached, which I formatted into ext2 and ran the latest (1.4.39) Tokyo Cabinet database. System runs old as mamont's shit 5.4 RHEL (for example it has libevent 1.1, 9.8e openssl and 2.6.18 kernel).

I installed one elliptics network node on each of 6 servers and started a singlethreaded IO testing tool, which wrote 100 byte chunks into the storage. When test application was connected to every node it was able to get more than 300k requests per second IO rate, which was effectively limited by the network - it is those 20-30 MB/s of free bandwidth between datacenters. Sometimes it dropped to 70-100k rps, sometimes grew up to 360 thousands of rps.
I was asked to run multithreaded benchmark, like 50-100 threads from single node writing into the storage, but system was able to fill the whole pipe using just one client.

Anyway, everything looked quite good except one small detail - servers regulary crash on those machines. And by regulary I mean it. To date I do not know the reason, but I have to admit that it works without problems on the similar Ubuntu systems (with 4.2.4 gcc).
For example RHEL machines can throw out following dmesg message:

dnet_ioserv[2320] trap invalid opcode rip:2b649487455e rsp:43dcdfe8 error:0

which, I must say, I do not understand how to run into with whatever kind of software error in my userspace application.

Following quite usual gcc options were used during compilation:

 gcc -DHAVE_CONFIG_H -I. -I../config    -I../include -I../config -pthread
  -I/usr/include -I/tmp/rhel//include -I/tmp/rhel//include
  -g -O2 -W -Wall -Wextra -MT iotest.o -MD -MP -MF .deps/iotest.Tpo -c -o iotest.o iotest.c

$ gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I believe I will have to try different compiler, since I extensively use __sync instructions for atomic access, which had some bugs in 4.1 gcc versions. Or at least to try libatomic first.

There are some bugs to hunt on!

Elliptics network background fsck

Tagged:  

Elliptics network is a distributed hash table storage, which, among other features, allows to redundantly store data. Number of copies of written object can be specified on various levels namely on client and server, and it depends on how and which transformation functions were set. Transformation function is a method of generating object ID from provided data, for example it can be object name or content hash, or it can be a database index or whetever else.
Multiple transformation functions will provide multiple IDs object should be stored with, thus implying multiple object copies.

Generally we want that different copies of the same object are stored on different nodes, in some setups we can guarantee that (for more information please take a look at per-datacenter object distribution discussion on the project homepage). But when node with some object fails and returns with empty storage, the same object will not be copied from the storage even if there is another copy somewhere else, since there is no central information server which maintains object<->id relations (eliminating this server is one of the main advantages of DHT storages actually, since it is likely main point of failure in the comon existing distributed storages), since only client itself can generate that relation.

In some cases we can eliminate this problem, namely when we use classical redundancy scheme, when object being written to some node is also copied to its neighbour. When node fails, neighbour copies appropriate objects to its new neighbour instead of failed one, so number of object copies remains constant. When node comes back online again (potentially with empty storage) it will fetch 'its' object from the neighbour node.

This mode is extremely unfriendly from administrative point of view. Admins already hate me since they need to generate IDs for the storage nodes manually to have a good load distibution (when we have small number of nodes random IDs will not evenly spread whole ID range between nodes), and while we can generally implement ID distribuition the way that object copies will be spread among different machines and even datacenters, this becomes very hard when copies go not to multiple specified nodes, but to single specified node and its neighbour (which ID we do not really know on the client).
Thus admins will have to maintain information about how neighbour nodes are spread over datacenters, which becomes a real pain in the ass, when single machine contains shitload of disks where each one has its own elliptics storage node attached, and we do want to put a copy of the object to another physical machine (and even in different datacenter).

So, redundancy scheme, where we put object to some node and its neighbour maintains number of copies when nodes go offline and online, but it does not (easily) allow to implement copies distribution among physical machines and datacenters, which is rather trivial in the case, when different transformation functions (say, sha1 and sha256 with some changes described on the homepage) are used.

To fix problem with multiple transformation functions and precise number of copies maintenance I decided to implement (read: to put into TODO list :) a background fsck daemon, which will run on every elliptics node and check whether each object stored on given node has its copies presented somewhere in the storage according to object metadata.
In a nutshell, each update transaction will have some metadata attached to it on the storage, namely original object name it was generated from and transformation functions used (with time we will want to put there IO permissions also and maybe something else, so it should have extensible on-disk attributes format). Fsck daemon will read that information and check whether copies with IDs created using all listed transformation functions are presented in the storage, otherwise it will write local object into the storage with missing ID. It can also check whether locally stored object is corrupted and get it from the storage if needed.

Having enough redundant copies we can reduce probability of the simultaneous failure of all nodes which store copies of the object to the accepted level, so background fsck will have enough time to scan local object tree and upload missing copies. This task by no means is supposed to be fast, that's why it is a background fsck, which should not affect storage node performance.

That's the plan, although not very immediate, but it has a rather high priority.

First elliptics network HTTP frontend benchmark results

Tagged:  

We ran performance testing quite for a while already, but system was not tuned for the maximum performance, so I wanted to show the best ones, but so far it is a little bit postponed, so in a day or so I will have another set, and now will post what we already got.

HTTP proxy was configured as elliptics node which performed node lookup for every request it got, so it returned small XML with download information and not object itself. We setup single lighttpd process and 100 fastcgi daemons. It ran on 2-way 32-bit Xeon (2 physical + 2 HT processors) with 8 Gb of RAM.

Elliptics storage network contained 8 nodes on 4 physical servers, where each one had 8 cores of 64-bit Xeons with 8 Gb of RAM. Each elliptics node ran on top of its own SCSI disk, it was configured to work with file storage backend, i.e. each object written was separate file. I used reiserfs because of its good performance for this workload.

Everything was connected by 1 Gbit ethernet network.

I do not know what is the client software, but it provides details statistics about reply time and allows to show nice graphs (in flash though).

So, the first results - reply time and rate. We were maxed at 5k rps and up to 4k rps behaviour was very good. At 5k rps reply time started to degrade although still was able to match request rate. At this point single lighttpd process was not able to dispatch requests fast enough, it got close to 100% CPU usage while elliptics fastcgi processes loafed at 3-5 % maximum. That's why I want to rerun this test with more lighttpd processes (namely 4 for 2 CPUs + 2 HT CPUs).
First graph shows number of replies per second changing with time (and number of requests per second). Legend says that green is number of momentary replies, blue is median number of replies, red - load in rps. Second graph shows reply time. Dark blue is momentary reply time in ms, orange is median reply time and red is load in rps.


Number of replies and its time

Next graphs show reply time distribution and HTTP reply status codes depending on workload. The first one is workload scheme and reply time distribution in ms, second one is HTTP status code distribution.


Reply time and HTTP status codes distribution

It was redirect data test, i.e. ellipitcs node on proxy server did not download data but only requested remote nodes whether they contain needed object, and returned some small formatted XML data with download info, which could be parsed by the client to create direct URL to data object to be fetched. Also proxy server generated cookies and secure authentification codes.

Next task is to run a test where data is actually downloaded through the elliptics network HTTP proxy daemon. Plan is to upload several thousands of small files (5-15 Kb each) and fetch them through this proxy.

And of course tune server software for maximum performance. I want to get 10k rps from that rather old 2-way machine in redirect test. I have another test machine with fair 8 modern cores, where I can setup this proxy too, so those data will also be interesting.

Stay tuned!

Elliptics network HTTP frontend extensions

Tagged:  

system got DNS resolving support as well as object removal.

Everything is rather trivial, but it must be done in any somewhat functional distributed storage.
I extended fastcgi config options documentation and will write down some examples. Also will put a small memo about security context and how it should be used to authentificate users.

We scheduled a performance testing soon (I do hope it will be done this week and better if tomorrow), and it will not be my tests at all. Well, I work with those people of course, but neither clients nor testing scenario belongs to me - this will be a fair production workload performance check, which can be shown to the people who make decisions about project deployment. This will likely be ran even without my notice (although server side runs on my test cluster :)

While working on those issues I thought about how to manage object metadata, which is needed both to web projects and filesystem frontend. Namely we have object permissions, which should be stored somewhere. Plan is to add another object with ID generated from the main one, which will store metadata structure as its data, which can be read before accessing main data to resolve control issues.

So far its a plan for the future though, and the nearest plan is to finally run new massivelly parallel IO test with very small data requests. Likely it will be setup in a day or so (tomorrow is a climbing day).

Stay tuned, new shiny (or miserable) graphs are coming :)

Elliptics network fastcgi daemon got direct download option

Tagged:  

One has to specify special flag (configurable in fastcgi daemon) in URI and its requested ID must contain one token from provided in config list (like .jpg or .xml), otherwise error is returned to the caller.

If all constraints are met, fascgi daemon will download given object from the storage into its RAM (which basically means that this proxy is not supposed to push very large object throught itself, instead it should be configured to return direct URL) and push it to the client. It will not set tricky things like content type header or anything else, but libfcgi will take care about setting correct content length header (btw, the way it does this shows it will not work with large objects too, looks like it buffers data from multiple FCGI_fwrite() calls).

So, if fastcgi daemon is properly configured, following request will 'force it' to show picture in the browser:

GET /test.mp3?name=123456.jpg&direct=1
Host: devfs1

and following one will return 'redirect' XML with appropriate configurable status (edited to add some newlines for nicer view):

GET /test.mp3?name=123456.jpg
Host: devfs1

 Location: http://devfs8/1/f5/f58fd51148ce114a88ce9d668fae9b1a28869ea8
 Set-Cookie: Our_complany_cookie=87c62d2279e78da1cbabd038a7b7b16fb2678b66c097c92e4806cbc7191e8e03;
  expires=Mon, 23-Nov-2009 02:15:22 MSK
 Content-type: application/xml

 <download-info>
  <host>devfs8</host>
  <path>/1/f5/f58fd51148ce114a88ce9d668fae9b1a28869ea8</path>
  <ts>4b09b7fa</ts>
  <s>c847dad56d952d44e20a9b57bae4ca9824af531661a8cfb0fc48d65791eca2ab</s>
 </download-info>

where secret is generated using private key, timestamp and cookie, and cookie itself is either set by the client or generated using client address, timestamp and random data. Its expiration time is also configurable. Cookie is supposed to be used by the storage server software to check whether it is allowed to return its data. Direct object reading does not check cookies.

I will desribe it in details in fcgi README file in the new release, which is scheduled to see the light in a day or so.

I have to admit, that it was easier than writing filesystem, which effectively will do very similar things :)
Although not that much - elliptics network provides library API, so it is simple, but I had to write it at first, which effectively equals to write it in kernel. Unfortunately it is not that simple to use that code in kernel, and actually kernel does not need it all, only basic helpers to maintain stack of attributes in the transaction, everything else (low-level IO for exampl) is actually very different from userspace.

That's it, so far there are no more feature requests for elliptics network and fastcgi daemon, so it will be the next version (it also contains bug fixes, very likely 2.6.2 will not work the way you expect, so use git for now, it allows to download a tarball :)

Elliptics network got authentification and upload permissions check

Tagged:  

as well as fair number of bug fixes. Authentification is implemented via pre-shared secret key and digests, stored in cookies, which can be checked on the storage servers. POST is protected via config option. It is possible to disable or configure everything though.

During implementation additional project rised by itself - to store some other files in the storage and download them not via direct links, but through that fastcgi proxy, which already handles redirect URL generation.

It is a rather trivial idea getting that fastcgi daemon supports lookup and data writing, so I will extend it in a couple of days and roll out new version (actually I need just a couple of hours for this implementation, but there are some other things to work on first like piano and trumpet, eventually I will write about my progress in the latter). It is a good idea to download all pictures and text files from the page in one go instead of multiple redirect requests, and only large files will be fetched via additional request.
Actually it will be possible to download all files through the proxy and not via redirect, it should only have a special flag in URI. I'm not yet sure how to implement this correctly, it is posible to store them into files and then send them using sendfile() or to put them into RAM and send through usual socket machinery. Both approaches have its pros and cons.

Then we will test this setup, so I could switch to another elliptics network testing (modulo bugs of course): multiple machines in different data centers, some people want to know aggregated small (about a hundred of bytes per request) read and write performance from large number of threads (like hundreds per node).

And finally to POHMELFS, which I thought I lost because of some troubles, which apparently were not caught by backup, but fortunately there are development/testing machines, where there are latest sources. Those machines tend to be killed regulary (its a doom of filesystem development to lost all data), but everything is yet ok.

Elliptics network "make things easy" release: 2.6.2

Tagged:  

Name says it all: it is not dumb simple to create distributed hash table redundant storage over multiple nodes with HTTP data access.

Data is uploaded using POST method through special FastCGI application, which is linked with elliptics network library and writes data into the storage according to its config file (one can specify data redundancy there for example).

Data receiving is rather different idea - FastCGI application described above only lookups requested object in the network and returns direct URL to appropriate storage server. It has to be configured according to some standards (like data must be placed in subdirs, indexed by the parameter, which is equal to appropriate elliptics network port minus FCGI_DNET_BASE_PORT config option, i.e. if there are two elliptics nodes (for two disks for example) running on 1025 and 1026 ports, FCGI_DNET_BASE_PORT config parameter being set to "1024" and single web server, its document root should contain subdirs 1 and 2 (1025-1024 and 1026-1024).
There is bunch of other useful config parameters, although there is no authentification or any kind of permission checks yet.

Anyway, here is a changelog:

  • Added FastCGI daemon to handle GET (returns direct URL via redirect or XML) and POST requests
  • Extended lookup to optionally check whether requested object is stored locally
  • Added lighttpd fastcgi config
  • Bug fixes

I deployed small storage total of 1 Tb of data spread over 4 physical machines with 2 elliptics node on each (one per storage disk) and uploaded 60 gb of data there (about 3-4 thousans of files and each one has additional copy).

Upload is rather trivial:

wget --post-file=$file http://base_fast_cgi_host.net/name.mp3?name=$some_file_name

You might expect that file downloading will be as simple as

wget http://base_fast_cgi_host.net/name.mp3?name=$some_file_name

and you will be absolutely right, that it will redirect you to some server inside storage cloud via direct link to the requested object.

Also added files to make debian packages. This even works.

Full changelog is available in git tree.

That's it, enjoy!

HTTP fastcgi daemon has been imported into elliptics network tree

Tagged:  

and uploaded into git tree.

Now its time to setup a small 8-node elliptics network cluster (one node per fast scsi disk) on 4 physical machines (total of about 1 Tb) and run some tests, namely massive data upload and download. There will be two external nodes serving as upload proxies (I plan to write at least one additional object copy for redundancy) and data fetching URL generators (objects themself will be downloaded via direct HTTP links from the storage nodes)

Upload as well as download works with wget pretty smoothly, but there was no load yet.
I will go climbing quite soon but would like to start some massive uploads before that to get some results later today.

If things will go smooth as well, this will be next elliptics network release. Next step will be to add some authentification bits into the field, currently neither application checks permissions just because there are no restrictions at all. One can configure web-server instead though...

Stay tuned!

UPDATE: imported my music collection (3500 files), total of just about 30 Gb, not that much actually, will try to find out what else one can find here.

Full cycle elliptics network access over HTTP completed

Tagged:  

Elliptics network - a distributed hash table storage with zillions of tasty things got another gem.

It is possible to upload and download content over HTTP (GET and POST methods are supported) via direct links (download only, upload uses FastCGI proxy), system scales horizontally, allows to implement redundant object storage with multiple copies, automatic data relocation when nodes fail and so forth.
POST processor was originally written in Lisp, but I decided to switch over to plain C because of simplicity elliptics network library provides with its API. Getting that I stopped to use Google in the office (sigh, if you would know how ugly Bing is, but there is no alternative I'm afraid), it takes really long to find out something useful in internet about complex tech tasks now, so continuing working with Lisp in that environment did not look like a good idea. I will use it for AI tasks though.

So, modulo unknown bugs, it should completely solve your download HTTP server scalability issues. Now its time to cleanup debug prints and code a little bit, extend configs to add some latest words, and then to add authentification protocol, namely some cookie generation to forbid unauthorized access. So far I did not commit the latest changes, it will be there tomorrow.

It happend that using elliptics network is really trivial, even when one forgot quite a lot about its internal state. Code extension, albeit quite visible, was not that invasive actually, and I was able to catch up with it very quickly.

I added another flag into the protocol, which allows to upload data without transaction machinery on the disk (transaction are still present in memory on the client, so will be resent if not acked and so on), i.e. it is now possible to put data into the storage without history (was possible before) and to eliminate on-disk format changes for the data, so that placed files only differ in names from what was originally posted, and names may depend not on content, but on provided ID (like hash of the name or precise ID user may provide). Previously there was a history for the object, which in turn contained transaction ID, which was generated based on uploaded content, so there were two steps needed to fetch the data: get history and get appropriate transactions from it.

So, while you are thinking about how to solve scalability problems of the new project, consider checking elliptics homepage to get in touch with its features and capabilities. There is also benchmark section there, which I plan to extend quite soon with the new data.

Also there is pending POSIX filesystem on top of this storage, which should show up not that far away either.

Stay tuned, there will be some news quite soon!

Completed static content elliptics network implementation

Tagged:  

I've completed installation of the small distributed hash table storage with static content delivered via direct URLs. This whole setup slightly differs from more common and expected one in one detail: how data is fetched and accessed by the client.

In the common case it is supposed that elliptics network powered applications will fetch data from the network according to transaction history of the object (optionally in parallel). This requires client code to be linked with elliptics network library and modified according to its API.

But there is another way, which is much simpler although a bit limited - split data lookup and reading itself, and implement the former in the special small application, while rely on other facilities (like HTTP servers) to get the data.

This is what was made. I wrote simple FastCGI application which starts data lookups and form URLs which are returned to the client application, which in turn fetches data from storage HTTP servers. There is one-to-one relation between any potenially failing object within storage cluster (one can install one elliptics node per disk or per server, or even per datacenter) and elliptics network nodes. FastCGI daemons (which can live on separate set of machines if needed) are persistent clients of that network, and the only task they do is elliptics network node's IP address lookup, which is then extended to static URL to actually get the data.
This URL is returned from the fastcgi daemon as redirect, but this is configurable.

I extended lookup message to optionally stat local storage on the node to actually test whether object is presented on the given node. Using multiple IDs for the same data object allows to redundantly store multiple copies, so that client could switch to another copy if object can not be found using previous ID. Elliptics network storage servers will take care about data relocation when servers go offline and online.

The only problem for this setup is how data is treated by the client and storage. Client expects dataflow from the single node starting from the beginning to the end, while elliptics storage uses transactions with its own protocol and on-disk storage format, which is processed by the library when appropriate IO API is called from the client code.

Problem can be solved if we will upload data not via elliptics network, but directly into the strorage, although using the same name conversion which could be done by elliptics internals. I.e. when we manually create directory structure and put there objects with names, which are equal to hash transformations of the real names, which then in turn will be made in fastcgi elliptics network daemon.
Let me show an example of how this is done. Let's consider an object called '/tmp/passwd.c' to be placed into the storage, which will use sha1 transformation function.

sha1('/tmp/passwd.c') = 8c23ac86ef943021cf6524f475c15f3d5d575deb
so we manually put this object into the storage network on the appropriate node (which handles covering ID range) with that 8c23... name.

FastCGI daemon configured to use sha1 transformation will receive URL like

GET some.host.net/blah?name=/tmp/passwd.c

take name part, hash it, lookup object with above 8c23... id and return following header:

Status: 301
Location: http://some_other_host.net/8c/8c23ac86ef943021cf6524f475c15f3d5d575deb

Simple. We only need to make an appropriate script for data upload. If we would use elliptics network for data upload, then above 8c23... transaction will contain history for data updates, and actual data transaction (or there could be multiple transactions if object was split into multiple parts to allow parallel reading) should be read from the history and then fetched from some other nodes using elliptics network API.

I will write such helper script and upload some content (currently I do this manually via ssh/scp :), so that I could stress-test setup before it goes up. So far things went pretty smooth.
Interested parties can check example directory in the git tree.

Stay tuned!

Syndicate content