Other

Other development notes

Want to do it good - do it yourself

Tagged:  
SDL_Init(SDL_INIT_VIDEO | SDL_INIT_AUDIO);

Python does not have an audio library, which allows to asynchronously play music and control it in close enough to real time. Mainly I need a fast audio start/stop capability, which will be triggered from python code when I press and release different keys on software piano keyboard.

So I will write it in C using SDL and SDL_mixer library in particular:

Mix_OpenAudio(audio_rate, audio_format, audio_channels, audio_buffers);
music = Mix_LoadMUS("music.ogg");
Mix_PlayMusic(music, 0);
Mix_HaltMusic();

The latter two functions can be used to control audio flow and it will be hardware mixed with other system sounds out of the SDL_mixer library box. SDL_mixer supports a wide variety of formats including OGG Vorbis and MP3.

And in a week or so I will publish a nice project I spent a week on. Actually it can be announced right here, but it does not yet work with production servers.
I created a simple enough photo uploader for Shotwell photo management application which uses Yandex.Fotki photohosting, which is the largest on in Russia and I used to use it quite frequently now. I hope my 1000 lines patch will be accepted (some people will even bring me beer for that :)

If I understood correctly Shotwell will replace F-Spot in modern distros.

Shotwell is written in Vala language, which is kind of C#, but its complier is sooo weird, that it is very unlikely I will return to it anytime soon. But java-style memory memory management (i.e. when you do not care about malloc/free) as well as compiler forced exception handling (one has to use try/catch blocks, when some function can throw exceptions) just make me relax during programming...

Python loving psto

Tagged:  

I love languages with rich standard library. Python is just awesome in this regard.

But amount of already written extension is outstanding - I parsed HTML using regexps in Lisp, but in Python with python-lxml it took just couple of hours to parse rather broken html using xpath and small string matching calls.

I spent one day to write a parser of non-structured morphological data (frequently with suddnly unwanted symbols or additional tags within) from aot.ru to create a quite large (300k+ morphems) russian dictionary, and then to store it into prefix array and ouput as XML file.

Yes, default CPython sucks with threads, it is not (yet) suitable for trivial audio processing (play and stop sound when pressing/releasing a key), but it is just bloody ubergood at high-level prototyping.

Returning back to morphological analysis I'm about to start rewriting my experimental knowledge extraction and grammatic generation 'engine' from Lisp to Python. And I expect to have some cool results with it soon.

Python threads: scheduling

Tagged:  

Python threads suck! They do not have deterministic behaviour from scheduling point of view, i.e. one does not know when thread will run and how long will it run.

And the main problem is that it likely may not give up control when faces syscall and/or lock. So it is not possible to control it using lock acquired in main thread and spawned one. PyQt signal will not be handled with enough priority to interrupt spawned thread which drops into thread.lock.

Eventually spawned thread will give up control, but this non-determinism is not acceptible for piano project I work on - we have to start and stop sound when key is pressed (modulo some short enough timeout, but not failing into extreme like using low-latency jack audio servers).

Python threads ...

Tagged:  

Using 'threading' module apparently is not a very good idea

$ python /tmp/threads.py
# ... interrupt by Ctrl-C

Traceback (most recent call last):
  File "/tmp/threads.py", line 12, in 
    TestThread().start()
  File "/usr/lib/python2.5/threading.py", line 442, in start
    _sleep(0.000001)    # 1 usec, to let the thread run (Solaris hack)

$  cat /tmp/threads.py 
import threading, sys

test_num = 0
class TestThread(threading.Thread):
	def run (self):
		global test_num
		test_num = test_num + 1

num = 10000;
for x in xrange (num):
	TestThread().start()

print "var: ", test_num, "num: ", num

I created this trivial Python::threading example which spans 10k threads which update global variable to find out whether GIL will protect it and to check how fast is thread creation. It happend that python threads are damn slow to create (and topmost dump confirms that - python sleeps there!). During this test python interpreter ate about 1% of 16 available CPUs and there was only single thread at a time, which means GIL protects thread creation and execution. When I added time.sleep() there, GIL was dropped to enter syscall and procfs confirmed that there were many simultaneous threads created.

But low-level module 'thread' is different. As documentation states:

This module provides low-level primitives for working with multiple threads (also called light-weight processes or tasks) — multiple threads of control sharing their global data space. For synchronization, simple locks (also called mutexes or binary semaphores) are provided.

I do not know what above high-level stuff does, but low-level threads are created very quickly forcing python interpreter to ate gigabytes of virtual memory and then throw unknown exceptions at the end (practice shows that waiting for all threads to complete does not rise those exceptions, maybe this happens only in 2.5.2):

Unhandled exception in thread started by 
Error in sys.excepthook:

$ cat /tmp/low_level_threads.py 
import thread

test_num = 0
def test_thread(smth):
	global test_num
	test_num = test_num + 1

num = 10000
for x in xrange(num):
	thread.start_new_thread(test_thread, ('qwe',))

print "var: ", test_num, "num: ", num

This application ate about 60-100% of CPU so I wonder whether GIL allows multiple threads to run (and not wait in syscall) on different physical processors without hacks (i.e. explicit GIL state drop and acquire).
There are also nice lock classes in that module.

I plan to use separate thread for each key in my piano keyboard emulator, which will be scheduled from key-pressed signal.

Python piano

Tagged:  

Spent several hours to write a trivial piano emulator using Python. I do not have a book so just googled whatever did not work: apparently all newbie problems are already solved.
I used PyQt bindings and Qt-designer to create UI. pyuic4 generates a bit ugly Python code: first, python 2.6 does not understand it (namely assignment some value to function, like self.some_func() = something, so I had to manually edit it (I found a post that latest Python snapshots already support it, likely it was about Python3), second, it does not know about loops and just creates a static representation of the form screen.
Also I did not work hard on things like resizing (I just forbid it :)

To play sounds I used PyMedia library, unfortunately it is not available in Ubuntu, and its compilation failed, so I had to edit setup scripts and put proper defines for C code :) Furtunately I quite know C a little.

Probably I will switch out from PyMedia, since it does not allow to control output sound buffer size. I need to 'schedule' audio sample playing so that qt-signal handler returned and next one (like key release) could be processed. PyMedia implements async sound output, but it may be not enough - for example it blocks trying to get and play 100.000 samples.
Or I will try to work with Python threads and assign press/release signals to different threads (if it is possible, I do not know Python enough yet).

This is a very ugly proof of concept code to date. I will polish things and split downloaded scales into separate tones, which will be played when appropriate keys are pressed. I will also try to implement nicer classes for keys and sounds.

And in a meantime my workplace screenshot (full-screen on click):

I found it so simple to program using Python that it is very likely I will switch all prototyping development to Python from Lisp. Lisp is great and I do like its features, but it is way too rare when I used its cool abilities like writing complex macros, most of my macros were rather simple with-something () like with-open-file() and friends.
Lambda and closures for internal states are great, but iirc Python also has them.

But main feature of the Python I like is its incredible standard library as well as zillions of already written extensions. I was not able to find graphical bindings for CLISP or audio processing for example, although its quite unusual to write GUI in Lisp.
I had to write HTTP GET/POST utility myself using low-level sockets; regular-expression in CLISP is rather limited, and although cl-ppcre is great, rich regexp is available in Python out of the box. Python has some multithreading support (and ugly GIL to name), while CLISP does not (at least 2.44.1 version, which comes with Ubuntu Lucid).
And so, and so and on...

Stay tuned, I will play more with this project, after all I want to play some simple music at work :)

I am a musician

Tagged:  

A fucking bad musician actually, but I learn...

Yes, there are harminies and scales, accords and arpeggios and so on, which I study every day as well as practicing my trumpet itself. And I have quite good progress I think.

But very frequently I want to play some simple (or complex) motive of the melody I hear. Or 'workaround' it a little... At home I can play piano and try to pick out melody by ears, but in office or anywhere near computer I can not do this.

The reason is simple: Linux does not have a piano synthesizer. Well, it has plenty of them, but each one plays through MIDI. And audio system in Linux is only good for listening. MIDI/alsa/oss/jackd/pulseaudio - thousands of them, and this web only works in some precise setups.

I tried to install plenty of MIDI software on Ubuntu Lucid Lyx - epic fail. Jackd stops, pulseaudio crashes and freezes... It is such a mess.

So I decided to write a trivial application which will look like a piano keyboard and play recorded samples instead of going through MIDI synth when keyboard or mouse keys are pressed. I checked FreshMeat - there are only MIDI synthesizer there.

And I need a simple, fast, nice and non-brain-fucking in setup application, which just plays what I press.

O, poor python

Tagged:  

My study of this language was started today. And while my knowledge base is somewhere between zero and void I already can ask stupid qustions, which were not answered on #python / irc.freenode.net

What python type to use when c++ to python boost::python binding requires 'unsigned char *' parameter?
Code snippet will tell much more:

class test_class : public test_class_base {
	public:
		test_class(const char *path);
		virtual ~test_class();

		virtual void log(const char *msg);
		unsigned char test(unsigned char *ptr) { return *ptr; };
	private:
		std::ofstream *stream;
};

....

	class_ >("test_class", init())
		.def("log", &test_class::log, &test_class_wrap::default_log)
		.def("test", &test_class::test)
	;

I ommitted wrapper class definition, since it is not critical. It works for python-string to c++ 'const char *' transform, but unsigned char pointer fires up an error. I tried both struct.unpack_from("P", array) and array.buffer_info()[0], but python says that they both return 'int' while c++ code expects 'unsigned char *':

id = array('B')
for x in xrange(0, 20) :
	id.append(x)

s = struct.unpack_from("P", id);
print hex(s[0])

t = test_class("/dev/stderr")
print hex(t.test(s[0]))

error:
[zbr@baccara python]$ ./test.py
0x3020100
Traceback (most recent call last):
  File "./test.py", line 19, in 
    print hex(t.test(s[0]))
Boost.Python.ArgumentError: Python argument types in
    test_class.test(test_class, int)
did not match C++ signature:
    test(test_class {lvalue}, unsigned char*)

Code snippets can be also found at http://paste.pocoo.org/show/239085/.

If no solution will be found, I will refactor elliptics python binding to support new classes for missing types like 'elliptics_id' for 'unsigned char *'.

Week and weekend

Tagged:  

New weekend is coming, and it there is a new snow in Moscow.

But unfortunately I'm unlikely to move to the 'mountains' this weekend - I damaged my leg next week so that even walked three days with a crutche. It was a bit warm day, although it was lower than -10C early morning when I moved to the ski resort.
But on the slope weather was noticebly hotter - about 0 degrees Centigrade or so and quite moist. So that my new skis felt quite uncotrollable in the high and middle stand. But when sit rather low I was able to control skis at quite high speeds, although this requre substantial muscle efforts.

I managed to film a small porn video on how I ski over the red trace in Stepanovo. Phone in left hand is not the best way to fight for Oscar, but it was fun. There is completely no feel of speed, although it was substantial for me at least - more than 40 km/h (about 11 meters per second). Calculated by dividing trace length by moving time, so effectively it does not take into accout arc length, which I prefer to make small to medium.

On such speeds I manage to outrun many of the skiers and almost all snowboarders. But since I have essentially no technique (I moved to outdoor traces three times, each time I spent about 3-5 hours on the slope), it is likely that I move quite wrong. And this can explain problems I sometimes get during the movement on the slope.

Add here weather and wet snow and result is quite simple: I fall. I do not care about that until I feel the pain longer than a day or so. And this week was my first time when pain was that strong and long.
I managed to outrun some other boarder and was not able to control skis, so fell and flew several meters away from the trace breaking the boarding :)

well, it was quite simple to break that bearding net, but there was a noticebly gap out of the trace, where I moved several meters crawling over the snow. leg did not hurt that much on the trace, but when I moved home pain started to show up.

Currently I feel mostly ok, although play table tennis quite slowly and can not move without slight lameness. Well, recently I moved with a crutch only :)
So, things are getting better.

In a meantime I added fair number of tasty things into elliptics network project, namely broke addressing storage model - now each node stores IDs which are greater than node's ID. This breaks compatibility but allows simple human understanding of how objects are spread over the storage.
Also implemented random transformation function selection for read IO requests in fastcgi frontend, now we can balance erading among multiple data copies. Dropped BerkeleyDB support - Tokyo Cabinet performs way faster, so I do not see any reason to support both. Made a big step towards completed merge support, I expect it to be finished very soon, which will be the first 2.7.x release - there is a fair number of changes accumulated already.

And as a tasty project to warm up the brain I decided to implement a rhyme generator based on Levenstein-Damerau distance and sound-syllable similarity algorithm. It was not formalized even in my head yet, but it is interesting thing to think about.

Also managed to win a judgement against development company which built my house (without judge and defendant though). I'm quite close to finally get property rights on my appartments and to sold it for good. I believe its time to make living place wider.

So far so good. Stay tuned!

Vigenere crackdown

Tagged:  

Suddenly decided to make something bad, while waiting for other things to settle.

Years ago I used to believe that I know something about hacking. Not kernel hacking, but the one, related to cracking of various supposed to be secure systems. Starting from ciphers down to code 'issues' (described in phrack, yup). It is rather laughable right now :)

But today I know, that reading (and even understanding) smart and hardcore articles is really far from being able to apply given knowledge to some practical problems. So, I decided to start (and complete practical implementation) from really simple things: various mono/poli alphabet ciphers. Ceasar/rot13 and Vigener are good choices afaics.

Bruce Schneier's Self-Study Course in Block Cipher Cryptanalysis does not allow me to easily fall asleep, although its a little bit fun to compare my alphabet ciphers analysis versus even simplest crackdowns described in the article.

Anyway, it took me a day or less to hack up a semi-automatic Vigenere and monoalphabet cipher cracker in LISP. In Vigenere code there are two steps: find out key length and split and decode monoalphabet enciphered text blocks.

The former task is performed manually - the only application created searches trigrams in text and shows their frequencies. When there are 3 or more trigrams it is possible to find key length with rather high probability. Bigrams will work too, but error rate is higher. One have to find distance between the same trigrams and found greatest common divisor, which will likely be equal to cipher key length.

Vigenere cipher is theoretically uncrackable when unique key long enough to cover input message is used. But in practice shorter keys are used, which are then repeated number of times so that resulted key string length becomes equal to plaintext message. So, when cipher starts to repeat itself, the same letters will be encrypted into the same ciphertext.

This method is named after Friedrich Kasiski, who found it 150 years ago. When key length is found, we split ciphertext into separate strings, where each one encrypted using monoalphabet cipher. It consists of letters separated by key length, i.e. the each string will be formed from $string_number + $i * $key_length letters, where $i runs over ciphertext until it is fully covered.

Monoalphabet ciphers are rather trivial to crack using frequency analysis. Namely we should find the most frequent letters in the encrypted text and they will correspond to the most frequent letters in the language used for plaintext. Difference between those letters is a cipher shift, so it is trivial to recover the text just by replacing each letter with the one shifted by calculated number.

Here is example I used (random New York Times article, cut just to show idea, I also converted it to downcase and dropped all non-letter character, since gcipher application does not understand that):

After spending months searching for a bipartisan consensus on financial regulatory reform, Senator Christopher Dodd, chairman of the banking committee, is expected to unveil his own bill on Monday, without one Republican supporter.

afterspendingmonthssearchingforabipartisanconsensusonfinancial
regulatoryreformsenatorchristopherdoddchairmanofthebanking

That's how it was encrypted and resulted ciphertext:

$ gcipher -c Vigenere -k asknfalwiwf v1.enc v2.enc

axdrwsaavznnywbstsoaafrurvsgqkzwgihkeyidwvytnkoaxudkvbnnsxpn
awnmczlsdbwycankwmkoaftznkdwikdbuhpnlkidurnnrxwvkktzoofnv

Trigram search uncovers that the most frequent ciphertext trigram 'tzo' is spread in the text with the distances being multiple of 11, which does correspond to above key length.

Splitting text and calculating frequences is a rather simple technical task. After performing decoding we got following result:

lfterwpeydiygmonxhsdeacchinkfocabtpartmsaycoysensysoyfiyancielrpguwatorcreqorxsenaxornhrtstopler
afterspendingmonthssearchingforabipartisanconsensusonfinancialregulatoryreformsenatorchristopher

Second line is a plaintext message, which differs in less than half symbols. I was not able to decode some of the text parts because of small enough text length, so that frequency analysis did not always provide correct data. But even as is decoded text allows to read and recover data.

On this positive note I will start preparig for the weekend skiing.


Stay tuned!

Hashes and theory of their cracking

Tagged:  

Of course there is no such theory, but practice breaking a hash is fascinating for the researcher.

Currently in netdev@ people started lengthy discussion about new hash for the interface name and (optionally?) for dentry hash, or I just misunderstood the latter.

Anyway there is more than a dozen of different algorithms tested for deviation and speed. It is very interesting to find out which one will be selected.

Actually it is only interesting from the single side - how to break it. By breaking I mean creating application which can generate input data which will produce the same hash value after processed by the selected algorithm.

That's what I did for Jenkins and Bernstein/Torek (hash * 33) hash quite for a while already.

Looking forward for the new hash :)

Lua in NetBSD kernel

Tagged:  

Lua in NetBSD kernel.

As you might know, NetBSD already has XML parser in its kernel for some obscure Apple protocol. Now they want scripting language also.

Btw, there was some work to add Lua bindings for the elliptics network by Daniel Poelzleithner, but looks like it is not very active at the moment. There are Perl bindings for this distributed hash table storage.

NTT Cyber Space Labs presents Sheepdog - distributed storage system for KVM

Tagged:  

MORITA Kazutaka wrote:

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.

The following list describes the features of Sheepdog.

* Linear scalability in performance and capacity
* No single point of failure
* Redundant architecture (data is written to multiple nodes)
- Tolerance against network failure
* Zero configuration (newly added machines will join the cluster automatically)
- Autonomous load balancing
* Snapshot
- Online snapshot from qemu-monitor
* Clone from a snapshot volume
* Thin provisioning
- Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, - = on our todo list)

More details and download links are here:
http://www.osrg.net/sheepdog/

Note that the code is still in an early stage.
There are some critical TODO items:

- VM image deletion support
- Support architectures other than X86_64
- Data recoverys
- Free space management
- Guarantee reliability and availability under heavy load
- Performance improvement
- Reclaim unused blocks
- More documentation

IMHO, block level distrubuted systems are dead overall, although it has its niche.

oLOLO-intellect

Tagged:  

I did not blog about technical stuff for a while since read some articles and books here and there about related topics. To date I did not get enough ground to start development, but there is already a short todo list to be completed in a week or so (well, only first couple of tasks are supposed to be finished, I will see later how it will go, it is unfeasible to finish them all quickly).

So, how knowledge extraction and cognition development processes are about to be implemented with time:

  • Regexp parser implemented as a (non)deterministic finite automata. This task is not about to implement another bicycle, but to get in touch with complex state machines and their automatic generation from some abstract data and its relations.
  • Test regexp implementation with html parsing and tag stack implementation. Can be used as a yet another HTML validator and sanitizer. Not that it is a particular goal, but to date I did not see any such tool with the extensible configuration (like with a tag substitution language) of how tag stack should be fixed upon error detection.
  • Above tasks actually form a minimum program (for the week or so). It is kind of HTML compiler. In particular, knowledge extraction task needs to have a regexp parser to select fact words (and combinations) and their relations if expressed not in a grammatically correct form (like changed prefixes and/or suffixes, i.e. 'plane flew over' -> 'plane' - 'fly' - 'over').

  • Context-free grammatics. LR(1) grammatics. The latter is mainly to get in touch with the area. The former is a very powerful tool for the structured data analysis and its correctness.
  • Knowledge extraction itself - building of the weighted graph of the word relations in the studied information.
  • Input information processing:
    • Knowledge graph of the input data build based not only input text relations but also affected by the knowledge stored in the already processed data.
    • Question processing - generate a reply based on the existing knowledge.
    • Free actions - generate uncondition 'ideas' based on random spikes of action in the specified areas of the knowledge graph. Uncondition means not absolutely random, but random in the area with the highest knowledge weight concentration or in the area specified in the latest input data.
    • Input data analysis looking for known and unknown facts confirmation and extraction. Generating replies/questions based on those data.
  • Linguistically correct reply generation using predefined grammatics.

To date I'm somewhere at the very beginning. Back to drawing board reading room...

On knowledge extraction problem

Tagged:  

Thought of the weighted graph of the dictionary learned by the system, where node corresponds to the word, each link's value corresponds to the weight of the appropriate phrase, and amount of links going out of the every node determines how frequently given word was used in the learned texts.

The same graph can be built out of the new input sentence and already learnt data (like from memory), which will correspond to the meaning of the phrase by the given system. It is still quite rough in my head, but I have a thinking direction.

What I'm a bit stuck with yet, is a response generation. Let's suppose I determined from the question sentence its meaning. I expect it to be similar to what search engines output, but for example it should be quite different for questions like "where to swim on vacation in Moscow" or "do you like Stanislav Lem" and other similar kinds of questions.

Having a local system memory means subjective decision on each question, but I have troubles imagining how system should respond. How to program its own decision on what and how to answer when there is a (some kind of) understanding what was said.

Going to sleep, one says brain will think about the problem in the short sleeping stage, maybe I will be enlighted. I think its time to create AI tag for this kind of ideas :)
And it absolutely does not mean I stopped my other projects!

When physics helps the math

Tagged:  

[video]

Another approach to solve text recognition problem - applying gravitation law between pixels, so that letters which are similar could 'collide' quickly. As we can see from the video, it works, but the same dynamics happens for any other letter, and the matter of the matching is not clear.

In theory the letter whose 'nature' was not changed after gravitation transformation is the one we are interested in, but it is quite unclear how to define a letter. Computer reads it as set of pixels - brain definitely treats it differently. For example when kids learn how to write they draw lines and circles or arcs, but not single points.

So I do not think above idea is right. I believe the right solution should involve a set of rules on how to write a letter. And when new picture is submitted, it could be interpolated by those rules, and if there is a clear match (like more than 3 quarters of the interpolation points fit the letter), then we consider letter as recognized (of course including scaling of the original image and set of interpolation points).
Simply speaking, we need to check if given picture can be drawn using 4 lines each of which has an angle less than 90 degress against other (for letter M or W), or a circle and a line (for letter P or b) and so on.

Problem is how to split a letter into the simplest parts which brain does easily, but computer can not, since it does not operate with higher-level objects. Maybe approach of splitting interpolation 'letters' into simple parts (like lines and circles or arcs), which will fit parts of the recognized picture (according to smallest square rule for example), and then we will check if resulted placement of those parts still match the letter they described originally.

Thinking, thinking, thinking. No clean idea, and that's good!

Lexical and morphological text analysis/generation articles wanted

Tagged:  

If you have some interesting links, please drop me a mail or comment.
I want to implement a bot for the local mail lists and have some fun about it, but do not expect it to be trivial.

There are two main questions: knowledge extraction (to roughly understand what given text is about) and text generation (writing a small essay on given topic according to the lexical and morphological language rules).

So far I expect both problems should be solved using grammatical analysis (LR, productions, yeah), but my knowledge here is somewhat between zero and void (several chapters from the compilers Dragon Book), practical side is definitely nonexistant part (but I will work on regexp implementation to feel comfortable with the complex state machines and their generation).

Userspace RCU relicensed under LGPL 2.1

Tagged:  

liburcu is a LGPLv2.1 userspace RCU (read-copy-update) library.
This data synchronization library provides read-side access which scales linearly with the number of cores. It does so by allowing multiples copies of a given data structure to live at the same time, and by monitoring the data structure accesses to detect grace periods after which memory reclamation is possible.

One can check sources from web git tree or by cloning:

git clone git://lttng.org/userspace-rcu.git

Library currently supports x86 and powerpc. LGPL-compatible low-level primitive headers will be required for other architectures. Note that the build system is at best rudimentary at the moment.

Its low-level details were described in libsync, which is effectively the same library, but with human face. With above changes I suppose there is no need for another lib - liburcu moves in the right direction.

A development roadmap experimental poll - you can make the difference

Documentation, automatic tests and 3.0 elliptics release
26% (38 votes)
SMTP/IMAP elliptics storage backend
7% (11 votes)
POHMELFS elliptics frontend
30% (45 votes)
SSL support and more advanced merge config resolve in elliptics network
3% (5 votes)
LISP (and eventually C) regexp and subsequent LR grammatics analyzer
3% (4 votes)
HTML validator (based on above code though)
1% (1 vote)
PAXOS library and distributed locking
18% (27 votes)
Write your own - vote or lose!
1% (2 votes)
More tight electronics projects (w1 sniffer, digital electronics, robotics)
10% (15 votes)
Total votes: 148

libsync: atomic operations and userspace RCU implementation library

Tagged:  

Mathieu Desnoyers posted his userspace RCU synchronization implementation and I found it very interesting to play with.

Unfortunately it was designed likely as a proof-of-concept code, so it took a while to cleanup the arch-dependant headers, add autoconfiguration, extend atomic operations and wrap Mathieu's code into a library. Now one can use libsync to get a platform-independant API to work with atomic variables (arch-specific add/sub and decrease and test functions are provided) and RCU implementation. This is the first release.

It works great, but there are some caveats.
First, Mathieu decided to use an interesting approach to force all reading threads to flush its data into memory from the caches, namely to get a knowledge about when quiescent periods are completed and writer can grab the lock. URCU uses a signal (SIGUSR1 in the code) sent to each reader thread and writer waits for each thread to 'ack' it. So if your application does not want to enable signals (or there are no unused), this imlementation will not work for you. This can be optimized IMHO: if thread is not currently running in the read-locked section, but sleeping, there is no need to send a signal, since there is no quiescent state to wait for. Also thread rescheduling implies a barrier.

Second, only x86 (both i386 and x86_64) and PPC (not all models though) are supported. I added Sparc64 support, but then found that I only have access to SUNW,Ultra-60 machine (where there is only 2.95 compiler, which does not yet know about __thread specification, and it has 32-bit CPUs) and SUNW,Sun-Fire-V240, which happend to have v8 CPUs, which are also 32-bit:

atomic_64.S: Assembler messages:
atomic_64.S:19: Error: Architecture mismatch on "lduw".
atomic_64.S:19:  (Requires v9|v9a|v9b; requested architecture is v8.)
atomic_64.S:21: Error: Architecture mismatch on "cas".
atomic_64.S:21:  (Requires v9|v9a|v9b; requested architecture is v8.)

so this was not tested either, will try to resolve it tomorrow. But getting that x86 becomes a world-dominating platform, this should not be a show-stopper.

So, I tested libsync library on FreeBSD 7 (AMD64), Debian Lenny (i386 SMP) and Ubuntu Hardy (i386 UP), where reading performance was just freaking awesome (writing performance was miserable though :).
If I will get access to other platforms, I will port it there. Also I will cook up some documentation (there are source code examples) and a homepage soon.

Libsync will be used in the elliptics network for the reference counters and read-mostly lookups where POSIX spinlocks are currently used (which introduces visible overhead).

Elliptics network in a meantime got MacOSX support (its sendfile() differs from FreeBSD one), although one may need to make a small patch to POSIX options header, if /usr/include/bits/posix_opt.h is old enough (thanks to Tuncer Ayaz for the reference). Also added generic read()/send() implementation for the platforms where no sendfile() is available, like OpenBSD.

That's how I felt sick today. But things will change. Stay tuned!

EDITED TO ADD: that MacOSX problem with the POSIX spinlocks is not yet resolved.

BDB deadlocks

Tagged:  

8000009f dd= 0 locks held 2    write locks 2    pid/thread 65927/0
8000009f WRITE         1 WAIT    history.db                page          1
8000009f WRITE         1 HELD    history.db                page          3
8000009f WRITE         1 HELD    data.db                   page          7
800000a4 dd= 0 locks held 2    write locks 1    pid/thread 65927/0
800000a4 WRITE         1 WAIT    history.db                page          3
800000a4 READ          1 HELD    history.db                page          1
800000a4 WRITE         1 HELD    data.db                   page          9
 
Locks grouped by object:
Locker   Mode      Count Status  ----------------- Object ---------------
8000009f WRITE         1 HELD    data.db                   page          7
800000a4 READ          1 HELD    history.db                page          1
8000009f WRITE         1 WAIT    history.db                page          1
       1 READ          1 HELD    data.db                   handle        0
       3 READ          1 HELD    history.db                handle        0
8000009f WRITE         1 HELD    history.db                page          3
800000a4 WRITE         1 WAIT    history.db                page          3
 
800000a4 WRITE         1 HELD    data.db                   page          9

That's a very fun dump, first, since my code does not grab read lock at all, I use read-modify-write flags, which should end up with write lock for the read operation. Second, because to read and then update the entry I have to grab two locks. For the same entry: on page 1 and 3. And two threads get them in diffrent order.

Code in question is rather trivial:

	memset(&key, 0, sizeof(DBT));
	memset(&data, 0, sizeof(DBT));
 
	key.data = cmd->id;
	key.size = DNET_ID_SIZE;
 
	data.size = 0;
	data.flags = DB_DBT_USERMEM;
 
        err = e->db->get(e->db, txn, &key, &data, DB_RMW);
 
        offset = data.size;
 
 
	memset(&key, 0, sizeof(DBT));
	memset(&data, 0, sizeof(DBT));
 
	key.data = cmd->id;
	key.size = DNET_ID_SIZE;
 
	data.data = io;
	data.doff = offset;
	data.ulen = sizeof(struct dnet_io_attr);
	data.dlen = sizeof(struct dnet_io_attr);
	data.size = sizeof(struct dnet_io_attr);
	data.flags = DB_DBT_PARTIAL | DB_DBT_USERMEM;
 
	err = e->db->put(e->db, txn, &key, &data, 0);

There is no other code ever started.
My explaination (and it somehow correlate with the above read locks, which were never taken) is related to the read-modify-write flag and likely can be observed with pure reading also. Page one above likely contains index or some other metadata needed to be checked when reading, so we lock it read-only. But data entry itself is locked read-write (according to read-modify-write flag) on the page 3. Another thread already checked the index and now wants to put some new entry into the btree (also on the page 3), and thus has to write-lock the index, which will deadlock: thread 1 has readlock A and waits to writelock B, thread 2 has writelock B and waits to writelock A.

I can not say if BDB 4.7.25 really has this logic inside, but commenting out RMW reading fixes the problem.
Thinking...

Linux and FreeBSD POSIX mutexes and semaphores

Tagged:  

Decided to check how fast are POSIX mutexes and semaphores in Linux and FreeBSD depending on number of contended threads.

So the test is rather simple: thread aciqures a lock, perform some quick action and releases it. This sequence is repeated multiple times. Number of threads can vary.
I tested time to perform 100 millions of counter increments (two asm instructions) protected either by POSIX mutex or generic semaphore, so effectively number of lock operations per second per thread was measured. 1, 2, 3, 5, 7 and 10 threads were actively racing for the same lock in the tests, which were repeated 5 times and median result was shown.

I ran the same test on three different machines: Linux UP laptop (Intel 1.4 Ghz), Linux SMP desktop (2-way AMD 2.5 Ghz), FreeBSD SMP server (8-way Intel E5345 2.3 Ghz, but usually 2-4 CPUs were fully loaded by external applications).

Single-thread case should show how fast is non-contended lock implementation on different systems. It is expected that two (and more) threads case will have noticebly smaller number of locks per second application is capable to aciqure compared to single-threaded case (for example in Linux there should be no syscalls at all in this case). With increased number of threads the proper implementation should be as close as possible to linear decrease in the number of lock operations per second (i.e. when two threads race for the same lock, each thread should perform the loop two times longer compared single-threaded case).

Let's see the practical results.


Number of locks per second per thread

And compared to the non-contended single-threaded case.


Number of locks per second including single-threaded case

Linux POSIX thread mutex implementation is definitely faster than that of FreeBSD (about 1.5 times), and with time FreeBSD scales worse, for example in some tests with 10 contended threads application performed as much as 20k locks per second - 10 times slower than Linux, in other cases it took 80k locks per second, 2-3 times slower than Linux as in the non-contended case.
But there is an interesting observation - the more CPUs we have the worse is scalability, looks like scheduler decision to which thread to start fires up at the wrong time, instead of allowing thread to complete the locked operation, scheduler wakes up another thread just to check that lock is still not available. And FreeBSD has more problems with it. Also ULE scheduler does not allow test application (with 10 threads for example) to run on more than 4 CPUs, do not even know why.

I can not explain performance of the Linux semaphores in the non-contended case - more than 2 times faster than POSIX mutexes created on top of futexes. In the 2-threads contended case semaphores are about two times slower than POSIX mutexes and with increased number of threads difference dissapears.

Overall conclusion: POSIX thread locks should not be used for the frequently accessible code pathes, which are not supposed to sleep, i.e. dig into the kernel. Elliptic network uses it to implement reference counter (instead of atomic variables) and for tree/list search/modify protection. All operations are supposed to be very lightweight, so heavy POSIX mutexes instroduce huge overhead here and should be replaced with spinlocks and hardware-dependant atomic variables.

Time for liblock and libatomic implementations, but this will be postponed for the futher releases.

No reason given

Tagged:  

Your request to the Linux-arm-kernel mailing list

Posting of your message titled "Re: [WIP/RFC] crypto: add support for Orion5X crypto engine"

has been rejected by the list moderator. The moderator gave the following reason for rejecting your request:

"No reason given"




Love it.

ASCII vs binary formats

Tagged:  

I used to use binary format for as much building blocks as possible, since they by definition have fixed size structure or at least size parameter somewhere in the fixed-size header.

Contrary ASCII strings can only be differentiated by the searching for some special symbol, most of the time it is a null-terminating byte (in a pair with some visible symbol like end-of-line). It is possible to put a length field at the beginning, but then it can not be used as a dereferenced pointer or it has to involve some a bit ugly arithmetic operations in the access macros.
But definitely it is much more convenient to look at the strings in the file instead of calling some special application (or hex editor) to understand the binary format.

And still knowing all this details, I managed to implement (quick and dirty) transaction update log as ASCII file, which contains (variable length) IO offset field, (variable length) size field and (fixed length) ID substring. Thus each entry in the log file (separated by the end-of-line and null bytes) has a variable length and it is not possible to easily traverse them from the end of the file or doing a binary search.
Each access thus has to involve a string parsing, and as we all know, any string parser always has a hidden buffer overflow.

This will be changed, which is rather simple task though. Also I decided to put a special header at the beginning of the transaction history file, which will contain a structure version field among others, so that if we will ever decide to update it to the new format, there would be no data corruption. I will use it when transferring history information and check if versions do not match. Or better implement this transaction log as a set of nested attributes, where the highest level has always the same format (length and type of the inner data is enough), the same way I did it for the network protocol.

While working on this I also found, that currently it is not that simple to use a different hash (from what is supported by OpenSSL) to create an object binding from the common path or data content. There is a good crypto engine abstraction, but its initialization is based on OpenSSL, and thus it is not that simple to implement own hash, which will take into account for example data center ID (or any other geographical data), so that it could be always possible to spread data among different physical locations for redundancy and better load balancing.

So I plan to extend server initialization to include special structure which will contain among other fields (like ID, listening address, root directory to store/load data and list of the initial routes to connect to) hashing function pointer, which by default will use OpenSSL.

Google loves me

Tagged:  

David Rientjes wrote:

I'm quite certain you've spent more time writing emails to me than merging the patch and testing its possibilities, given your lack of understanding of its very basic concepts.

I'm curious what else I will get with the same technical strength.

And things are actually quite simple: I ask how proposed oom-handling system will behave in some conditions. Hopefully I did not understand something, otherwise it just does not work. But apparently it is not easy to describe this in details... Let's see what will happen next :)

Linux killed Kenny, bastard!

Tagged:  

Do you want to own a tame killer? Do you want to control the world?

Start with your computer now and own the planet next: you already have an OOM-killer in the Linux to kill for you. But to date it was quite berserk and usually killed not what you would like him to murder.

Now you can add a name of the victims, which will be checked by the oom-killer, who select the process to kill first among the ones which have given string in their executable name.

By default the process to be killed is called 'Kenny', and if you like him, change the name by calling
echo Java > /proc/sys/vm/oom_victim

Try it!

Patch allows to set the name, which will be checked by the OOM-killer, and if there are processes with given string in the name, it will select the victim between them, otherwise will find one between all processes.

Updated hamock hooks.

Tagged:  

Some time ago one of the main anchor hooks for my hamock was removed, and now I decided to return it back. Apparently applied force was too strong...


Anchor hook, 8 mm diameter

So I got another one and screwed it in with less efforts, now things are just excellent (one can check how it was long ago). If I could play my trumpet like I want, I would definitely spent lots of spare time hunging in the hamock, drining rum and playing...

But so far I only practice several mornings a week and frankly do not have major progress. For example tried to pick out a tune by ears, but succeeded with simple 'Happy birthday', and actually failed to find proper notes after octave jump, since do not remember the melody. Something more complex is likely out of my abilities right now, mostly because lips tire quickly.
But I work (and work, and work, and work...) on this.

Dallas one-wire (w1) subsystem update.

Tagged:  

Just pushed w1 userspace communication update which includes several new commands and documentation update. Added commands:

  • touch block - allows to write and return sampled data back to the userspace
  • master list - sends information about all registered master devices to the userspace
  • slave list (search or alarm search) - starts appropriate search on the bus attached to specified bus master device and returns all found slave devices to the userspace without attaching them to the w1 subsystem

W1 uses netlink connector in the kernel and one can bind to the connector group and work with above commands (requires root to bind though). Also supported read/write commands, but they existed long ago. Search/alarm search got reply as desribed above (previously they just initiated a search).

Great thanks to Paul Alfille (paul.alfille_gmail.com) of OWFS for testing and w1-to-network daemon.

Memory pool and kmem cache constructor gotcha.

Tagged:  

Memory pool is a neat structure which contains array of objects previously allocated from the attached to memory pool cache. When cache is created, it is possible to assign it a constructor, which will be invoked each time new object is allocated from that cache.

Memory pools do not invoke cache allocator's constructor, when attached cache is 'empty' (i.e. requires refill), but memory pool is not. It reuses some previously allocated entries, so if constructor contains some object initialization code, it may be missed.

On development process.

Tagged:  

Jonathan Corbet wrote me an open letter where he describes Linux kernel development process, a little bit biased to my particular blog cryyy though :)

In a nutshell Jonathan selected DST to show what was made wrong from the process point of view. And although making excuses is not interesting, I will show that (from my humble opinion) objections are not entirely correct.
For those who do not like such details, you may jump the the end of the post to read frustration points on the process I have. It is not related to the review, inclusion or Jupiter weather. These things are so small and really do not worth our time. What I do not like is just how communication between people happens. And only that really matters: no kilobytes of code can make a good process, but only people do.

HIFN update pushed upstream.

Tagged:  

Patch series has been sent to the linux-crypto@ mail list for inclusion. This set of patches includes Patrick McHardy's (kaber_trash.net) work and fixes number of bugs mostly related to hardware queue management. I also changed the way how completions are handled and reduced size of the crypto ring.

Now each hardware interrupt (or actually tasklet callback scheduled from the hardware interrupt) does not scan the whole descriptor ring for the completed requests, but maintains a last checked position in each ring, which is used as a starting point to search for ready descriptors. Usually it generates upto 100.000 interrupts per second, which most of the time corresponds only to the single ready entry in the ring, even after tasklet starts execution, so scanning the whole ring is a complete waste of CPU cycles.

HIFN watchdog was also reworked a bit, now it flushes the whole hardware queue and only when things are really very bad (for example when there is a number of started requests but hardware was 'idle' for a long time not processing requests, which is specified by set of constants).

Becuase of the completion handling changes HIFN now is about 2 times faster than it was before and operates (if I understood correctly) faster than 1.7 Ghz CPU.

Here are some results for Soekris (500 MHz CPU) and dm-crypt:

time dd if=/dev/zero of=test.img bs=1024 count=700000
 
        hifn: ~70 seconds
software aes: > 120 seconds

And some data from 1.7 Ghz Athlon XP over dm-crypt read/write speed in MB/s:

write:
        hifn: 10.4 MB/s
software aes: 8.9 MB/s
 
read:
        hifn: 9.9 MB/s
software aes: 9.1 MB/s

And of course CPU usage is two times lower in the second case and is about 4 times lower in case of Soekris machine.

Great thanks to Andreas Gerlich (agl_online.de) and Tobias Geiger (tobias.geiger_vido.info) for testing and performance results.

Syndicate content