Other

Other development notes

Hashes and theory of their cracking

Tagged:  

Of course there is no such theory, but practice breaking a hash is fascinating for the researcher.

Currently in netdev@ people started lengthy discussion about new hash for the interface name and (optionally?) for dentry hash, or I just misunderstood the latter.

Anyway there is more than a dozen of different algorithms tested for deviation and speed. It is very interesting to find out which one will be selected.

Actually it is only interesting from the single side - how to break it. By breaking I mean creating application which can generate input data which will produce the same hash value after processed by the selected algorithm.

That's what I did for Jenkins and Bernstein/Torek (hash * 33) hash quite for a while already.

Looking forward for the new hash :)

Lua in NetBSD kernel

Tagged:  

Lua in NetBSD kernel.

As you might know, NetBSD already has XML parser in its kernel for some obscure Apple protocol. Now they want scripting language also.

Btw, there was some work to add Lua bindings for the elliptics network by Daniel Poelzleithner, but looks like it is not very active at the moment. There are Perl bindings for this distributed hash table storage.

NTT Cyber Space Labs presents Sheepdog - distributed storage system for KVM

Tagged:  

MORITA Kazutaka wrote:

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.

The following list describes the features of Sheepdog.

* Linear scalability in performance and capacity
* No single point of failure
* Redundant architecture (data is written to multiple nodes)
- Tolerance against network failure
* Zero configuration (newly added machines will join the cluster automatically)
- Autonomous load balancing
* Snapshot
- Online snapshot from qemu-monitor
* Clone from a snapshot volume
* Thin provisioning
- Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, - = on our todo list)

More details and download links are here:
http://www.osrg.net/sheepdog/

Note that the code is still in an early stage.
There are some critical TODO items:

- VM image deletion support
- Support architectures other than X86_64
- Data recoverys
- Free space management
- Guarantee reliability and availability under heavy load
- Performance improvement
- Reclaim unused blocks
- More documentation

IMHO, block level distrubuted systems are dead overall, although it has its niche.

oLOLO-intellect

Tagged:  

I did not blog about technical stuff for a while since read some articles and books here and there about related topics. To date I did not get enough ground to start development, but there is already a short todo list to be completed in a week or so (well, only first couple of tasks are supposed to be finished, I will see later how it will go, it is unfeasible to finish them all quickly).

So, how knowledge extraction and cognition development processes are about to be implemented with time:

  • Regexp parser implemented as a (non)deterministic finite automata. This task is not about to implement another bicycle, but to get in touch with complex state machines and their automatic generation from some abstract data and its relations.
  • Test regexp implementation with html parsing and tag stack implementation. Can be used as a yet another HTML validator and sanitizer. Not that it is a particular goal, but to date I did not see any such tool with the extensible configuration (like with a tag substitution language) of how tag stack should be fixed upon error detection.
  • Above tasks actually form a minimum program (for the week or so). It is kind of HTML compiler. In particular, knowledge extraction task needs to have a regexp parser to select fact words (and combinations) and their relations if expressed not in a grammatically correct form (like changed prefixes and/or suffixes, i.e. 'plane flew over' -> 'plane' - 'fly' - 'over').

  • Context-free grammatics. LR(1) grammatics. The latter is mainly to get in touch with the area. The former is a very powerful tool for the structured data analysis and its correctness.
  • Knowledge extraction itself - building of the weighted graph of the word relations in the studied information.
  • Input information processing:
    • Knowledge graph of the input data build based not only input text relations but also affected by the knowledge stored in the already processed data.
    • Question processing - generate a reply based on the existing knowledge.
    • Free actions - generate uncondition 'ideas' based on random spikes of action in the specified areas of the knowledge graph. Uncondition means not absolutely random, but random in the area with the highest knowledge weight concentration or in the area specified in the latest input data.
    • Input data analysis looking for known and unknown facts confirmation and extraction. Generating replies/questions based on those data.
  • Linguistically correct reply generation using predefined grammatics.

To date I'm somewhere at the very beginning. Back to drawing board reading room...

On knowledge extraction problem

Tagged:  

Thought of the weighted graph of the dictionary learned by the system, where node corresponds to the word, each link's value corresponds to the weight of the appropriate phrase, and amount of links going out of the every node determines how frequently given word was used in the learned texts.

The same graph can be built out of the new input sentence and already learnt data (like from memory), which will correspond to the meaning of the phrase by the given system. It is still quite rough in my head, but I have a thinking direction.

What I'm a bit stuck with yet, is a response generation. Let's suppose I determined from the question sentence its meaning. I expect it to be similar to what search engines output, but for example it should be quite different for questions like "where to swim on vacation in Moscow" or "do you like Stanislav Lem" and other similar kinds of questions.

Having a local system memory means subjective decision on each question, but I have troubles imagining how system should respond. How to program its own decision on what and how to answer when there is a (some kind of) understanding what was said.

Going to sleep, one says brain will think about the problem in the short sleeping stage, maybe I will be enlighted. I think its time to create AI tag for this kind of ideas :)
And it absolutely does not mean I stopped my other projects!

When physics helps the math

Tagged:  

[video]

Another approach to solve text recognition problem - applying gravitation law between pixels, so that letters which are similar could 'collide' quickly. As we can see from the video, it works, but the same dynamics happens for any other letter, and the matter of the matching is not clear.

In theory the letter whose 'nature' was not changed after gravitation transformation is the one we are interested in, but it is quite unclear how to define a letter. Computer reads it as set of pixels - brain definitely treats it differently. For example when kids learn how to write they draw lines and circles or arcs, but not single points.

So I do not think above idea is right. I believe the right solution should involve a set of rules on how to write a letter. And when new picture is submitted, it could be interpolated by those rules, and if there is a clear match (like more than 3 quarters of the interpolation points fit the letter), then we consider letter as recognized (of course including scaling of the original image and set of interpolation points).
Simply speaking, we need to check if given picture can be drawn using 4 lines each of which has an angle less than 90 degress against other (for letter M or W), or a circle and a line (for letter P or b) and so on.

Problem is how to split a letter into the simplest parts which brain does easily, but computer can not, since it does not operate with higher-level objects. Maybe approach of splitting interpolation 'letters' into simple parts (like lines and circles or arcs), which will fit parts of the recognized picture (according to smallest square rule for example), and then we will check if resulted placement of those parts still match the letter they described originally.

Thinking, thinking, thinking. No clean idea, and that's good!

Lexical and morphological text analysis/generation articles wanted

Tagged:  

If you have some interesting links, please drop me a mail or comment.
I want to implement a bot for the local mail lists and have some fun about it, but do not expect it to be trivial.

There are two main questions: knowledge extraction (to roughly understand what given text is about) and text generation (writing a small essay on given topic according to the lexical and morphological language rules).

So far I expect both problems should be solved using grammatical analysis (LR, productions, yeah), but my knowledge here is somewhat between zero and void (several chapters from the compilers Dragon Book), practical side is definitely nonexistant part (but I will work on regexp implementation to feel comfortable with the complex state machines and their generation).

Userspace RCU relicensed under LGPL 2.1

Tagged:  

liburcu is a LGPLv2.1 userspace RCU (read-copy-update) library.
This data synchronization library provides read-side access which scales linearly with the number of cores. It does so by allowing multiples copies of a given data structure to live at the same time, and by monitoring the data structure accesses to detect grace periods after which memory reclamation is possible.

One can check sources from web git tree or by cloning:

git clone git://lttng.org/userspace-rcu.git

Library currently supports x86 and powerpc. LGPL-compatible low-level primitive headers will be required for other architectures. Note that the build system is at best rudimentary at the moment.

Its low-level details were described in libsync, which is effectively the same library, but with human face. With above changes I suppose there is no need for another lib - liburcu moves in the right direction.

A development roadmap experimental poll - you can make the difference

Documentation, automatic tests and 3.0 elliptics release
26% (38 votes)
SMTP/IMAP elliptics storage backend
7% (11 votes)
POHMELFS elliptics frontend
30% (45 votes)
SSL support and more advanced merge config resolve in elliptics network
3% (5 votes)
LISP (and eventually C) regexp and subsequent LR grammatics analyzer
3% (4 votes)
HTML validator (based on above code though)
1% (1 vote)
PAXOS library and distributed locking
18% (27 votes)
Write your own - vote or lose!
1% (2 votes)
More tight electronics projects (w1 sniffer, digital electronics, robotics)
10% (15 votes)
Total votes: 148

libsync: atomic operations and userspace RCU implementation library

Tagged:  

Mathieu Desnoyers posted his userspace RCU synchronization implementation and I found it very interesting to play with.

Unfortunately it was designed likely as a proof-of-concept code, so it took a while to cleanup the arch-dependant headers, add autoconfiguration, extend atomic operations and wrap Mathieu's code into a library. Now one can use libsync to get a platform-independant API to work with atomic variables (arch-specific add/sub and decrease and test functions are provided) and RCU implementation. This is the first release.

It works great, but there are some caveats.
First, Mathieu decided to use an interesting approach to force all reading threads to flush its data into memory from the caches, namely to get a knowledge about when quiescent periods are completed and writer can grab the lock. URCU uses a signal (SIGUSR1 in the code) sent to each reader thread and writer waits for each thread to 'ack' it. So if your application does not want to enable signals (or there are no unused), this imlementation will not work for you. This can be optimized IMHO: if thread is not currently running in the read-locked section, but sleeping, there is no need to send a signal, since there is no quiescent state to wait for. Also thread rescheduling implies a barrier.

Second, only x86 (both i386 and x86_64) and PPC (not all models though) are supported. I added Sparc64 support, but then found that I only have access to SUNW,Ultra-60 machine (where there is only 2.95 compiler, which does not yet know about __thread specification, and it has 32-bit CPUs) and SUNW,Sun-Fire-V240, which happend to have v8 CPUs, which are also 32-bit:

atomic_64.S: Assembler messages:
atomic_64.S:19: Error: Architecture mismatch on "lduw".
atomic_64.S:19:  (Requires v9|v9a|v9b; requested architecture is v8.)
atomic_64.S:21: Error: Architecture mismatch on "cas".
atomic_64.S:21:  (Requires v9|v9a|v9b; requested architecture is v8.)

so this was not tested either, will try to resolve it tomorrow. But getting that x86 becomes a world-dominating platform, this should not be a show-stopper.

So, I tested libsync library on FreeBSD 7 (AMD64), Debian Lenny (i386 SMP) and Ubuntu Hardy (i386 UP), where reading performance was just freaking awesome (writing performance was miserable though :).
If I will get access to other platforms, I will port it there. Also I will cook up some documentation (there are source code examples) and a homepage soon.

Libsync will be used in the elliptics network for the reference counters and read-mostly lookups where POSIX spinlocks are currently used (which introduces visible overhead).

Elliptics network in a meantime got MacOSX support (its sendfile() differs from FreeBSD one), although one may need to make a small patch to POSIX options header, if /usr/include/bits/posix_opt.h is old enough (thanks to Tuncer Ayaz for the reference). Also added generic read()/send() implementation for the platforms where no sendfile() is available, like OpenBSD.

That's how I felt sick today. But things will change. Stay tuned!

EDITED TO ADD: that MacOSX problem with the POSIX spinlocks is not yet resolved.

BDB deadlocks

Tagged:  

8000009f dd= 0 locks held 2    write locks 2    pid/thread 65927/0
8000009f WRITE         1 WAIT    history.db                page          1
8000009f WRITE         1 HELD    history.db                page          3
8000009f WRITE         1 HELD    data.db                   page          7
800000a4 dd= 0 locks held 2    write locks 1    pid/thread 65927/0
800000a4 WRITE         1 WAIT    history.db                page          3
800000a4 READ          1 HELD    history.db                page          1
800000a4 WRITE         1 HELD    data.db                   page          9
 
Locks grouped by object:
Locker   Mode      Count Status  ----------------- Object ---------------
8000009f WRITE         1 HELD    data.db                   page          7
800000a4 READ          1 HELD    history.db                page          1
8000009f WRITE         1 WAIT    history.db                page          1
       1 READ          1 HELD    data.db                   handle        0
       3 READ          1 HELD    history.db                handle        0
8000009f WRITE         1 HELD    history.db                page          3
800000a4 WRITE         1 WAIT    history.db                page          3
 
800000a4 WRITE         1 HELD    data.db                   page          9

That's a very fun dump, first, since my code does not grab read lock at all, I use read-modify-write flags, which should end up with write lock for the read operation. Second, because to read and then update the entry I have to grab two locks. For the same entry: on page 1 and 3. And two threads get them in diffrent order.

Code in question is rather trivial:

	memset(&key, 0, sizeof(DBT));
	memset(&data, 0, sizeof(DBT));
 
	key.data = cmd->id;
	key.size = DNET_ID_SIZE;
 
	data.size = 0;
	data.flags = DB_DBT_USERMEM;
 
        err = e->db->get(e->db, txn, &key, &data, DB_RMW);
 
        offset = data.size;
 
 
	memset(&key, 0, sizeof(DBT));
	memset(&data, 0, sizeof(DBT));
 
	key.data = cmd->id;
	key.size = DNET_ID_SIZE;
 
	data.data = io;
	data.doff = offset;
	data.ulen = sizeof(struct dnet_io_attr);
	data.dlen = sizeof(struct dnet_io_attr);
	data.size = sizeof(struct dnet_io_attr);
	data.flags = DB_DBT_PARTIAL | DB_DBT_USERMEM;
 
	err = e->db->put(e->db, txn, &key, &data, 0);

There is no other code ever started.
My explaination (and it somehow correlate with the above read locks, which were never taken) is related to the read-modify-write flag and likely can be observed with pure reading also. Page one above likely contains index or some other metadata needed to be checked when reading, so we lock it read-only. But data entry itself is locked read-write (according to read-modify-write flag) on the page 3. Another thread already checked the index and now wants to put some new entry into the btree (also on the page 3), and thus has to write-lock the index, which will deadlock: thread 1 has readlock A and waits to writelock B, thread 2 has writelock B and waits to writelock A.

I can not say if BDB 4.7.25 really has this logic inside, but commenting out RMW reading fixes the problem.
Thinking...

Linux and FreeBSD POSIX mutexes and semaphores

Tagged:  

Decided to check how fast are POSIX mutexes and semaphores in Linux and FreeBSD depending on number of contended threads.

So the test is rather simple: thread aciqures a lock, perform some quick action and releases it. This sequence is repeated multiple times. Number of threads can vary.
I tested time to perform 100 millions of counter increments (two asm instructions) protected either by POSIX mutex or generic semaphore, so effectively number of lock operations per second per thread was measured. 1, 2, 3, 5, 7 and 10 threads were actively racing for the same lock in the tests, which were repeated 5 times and median result was shown.

I ran the same test on three different machines: Linux UP laptop (Intel 1.4 Ghz), Linux SMP desktop (2-way AMD 2.5 Ghz), FreeBSD SMP server (8-way Intel E5345 2.3 Ghz, but usually 2-4 CPUs were fully loaded by external applications).

Single-thread case should show how fast is non-contended lock implementation on different systems. It is expected that two (and more) threads case will have noticebly smaller number of locks per second application is capable to aciqure compared to single-threaded case (for example in Linux there should be no syscalls at all in this case). With increased number of threads the proper implementation should be as close as possible to linear decrease in the number of lock operations per second (i.e. when two threads race for the same lock, each thread should perform the loop two times longer compared single-threaded case).

Let's see the practical results.


Number of locks per second per thread

And compared to the non-contended single-threaded case.


Number of locks per second including single-threaded case

Linux POSIX thread mutex implementation is definitely faster than that of FreeBSD (about 1.5 times), and with time FreeBSD scales worse, for example in some tests with 10 contended threads application performed as much as 20k locks per second - 10 times slower than Linux, in other cases it took 80k locks per second, 2-3 times slower than Linux as in the non-contended case.
But there is an interesting observation - the more CPUs we have the worse is scalability, looks like scheduler decision to which thread to start fires up at the wrong time, instead of allowing thread to complete the locked operation, scheduler wakes up another thread just to check that lock is still not available. And FreeBSD has more problems with it. Also ULE scheduler does not allow test application (with 10 threads for example) to run on more than 4 CPUs, do not even know why.

I can not explain performance of the Linux semaphores in the non-contended case - more than 2 times faster than POSIX mutexes created on top of futexes. In the 2-threads contended case semaphores are about two times slower than POSIX mutexes and with increased number of threads difference dissapears.

Overall conclusion: POSIX thread locks should not be used for the frequently accessible code pathes, which are not supposed to sleep, i.e. dig into the kernel. Elliptic network uses it to implement reference counter (instead of atomic variables) and for tree/list search/modify protection. All operations are supposed to be very lightweight, so heavy POSIX mutexes instroduce huge overhead here and should be replaced with spinlocks and hardware-dependant atomic variables.

Time for liblock and libatomic implementations, but this will be postponed for the futher releases.

No reason given

Tagged:  

Your request to the Linux-arm-kernel mailing list

Posting of your message titled "Re: [WIP/RFC] crypto: add support for Orion5X crypto engine"

has been rejected by the list moderator. The moderator gave the following reason for rejecting your request:

"No reason given"




Love it.

ASCII vs binary formats

Tagged:  

I used to use binary format for as much building blocks as possible, since they by definition have fixed size structure or at least size parameter somewhere in the fixed-size header.

Contrary ASCII strings can only be differentiated by the searching for some special symbol, most of the time it is a null-terminating byte (in a pair with some visible symbol like end-of-line). It is possible to put a length field at the beginning, but then it can not be used as a dereferenced pointer or it has to involve some a bit ugly arithmetic operations in the access macros.
But definitely it is much more convenient to look at the strings in the file instead of calling some special application (or hex editor) to understand the binary format.

And still knowing all this details, I managed to implement (quick and dirty) transaction update log as ASCII file, which contains (variable length) IO offset field, (variable length) size field and (fixed length) ID substring. Thus each entry in the log file (separated by the end-of-line and null bytes) has a variable length and it is not possible to easily traverse them from the end of the file or doing a binary search.
Each access thus has to involve a string parsing, and as we all know, any string parser always has a hidden buffer overflow.

This will be changed, which is rather simple task though. Also I decided to put a special header at the beginning of the transaction history file, which will contain a structure version field among others, so that if we will ever decide to update it to the new format, there would be no data corruption. I will use it when transferring history information and check if versions do not match. Or better implement this transaction log as a set of nested attributes, where the highest level has always the same format (length and type of the inner data is enough), the same way I did it for the network protocol.

While working on this I also found, that currently it is not that simple to use a different hash (from what is supported by OpenSSL) to create an object binding from the common path or data content. There is a good crypto engine abstraction, but its initialization is based on OpenSSL, and thus it is not that simple to implement own hash, which will take into account for example data center ID (or any other geographical data), so that it could be always possible to spread data among different physical locations for redundancy and better load balancing.

So I plan to extend server initialization to include special structure which will contain among other fields (like ID, listening address, root directory to store/load data and list of the initial routes to connect to) hashing function pointer, which by default will use OpenSSL.

Google loves me

Tagged:  

David Rientjes wrote:

I'm quite certain you've spent more time writing emails to me than merging the patch and testing its possibilities, given your lack of understanding of its very basic concepts.

I'm curious what else I will get with the same technical strength.

And things are actually quite simple: I ask how proposed oom-handling system will behave in some conditions. Hopefully I did not understand something, otherwise it just does not work. But apparently it is not easy to describe this in details... Let's see what will happen next :)

Linux killed Kenny, bastard!

Tagged:  

Do you want to own a tame killer? Do you want to control the world?

Start with your computer now and own the planet next: you already have an OOM-killer in the Linux to kill for you. But to date it was quite berserk and usually killed not what you would like him to murder.

Now you can add a name of the victims, which will be checked by the oom-killer, who select the process to kill first among the ones which have given string in their executable name.

By default the process to be killed is called 'Kenny', and if you like him, change the name by calling
echo Java > /proc/sys/vm/oom_victim

Try it!

Patch allows to set the name, which will be checked by the OOM-killer, and if there are processes with given string in the name, it will select the victim between them, otherwise will find one between all processes.

Updated hamock hooks.

Tagged:  

Some time ago one of the main anchor hooks for my hamock was removed, and now I decided to return it back. Apparently applied force was too strong...


Anchor hook, 8 mm diameter

So I got another one and screwed it in with less efforts, now things are just excellent (one can check how it was long ago). If I could play my trumpet like I want, I would definitely spent lots of spare time hunging in the hamock, drining rum and playing...

But so far I only practice several mornings a week and frankly do not have major progress. For example tried to pick out a tune by ears, but succeeded with simple 'Happy birthday', and actually failed to find proper notes after octave jump, since do not remember the melody. Something more complex is likely out of my abilities right now, mostly because lips tire quickly.
But I work (and work, and work, and work...) on this.

Dallas one-wire (w1) subsystem update.

Tagged:  

Just pushed w1 userspace communication update which includes several new commands and documentation update. Added commands:

  • touch block - allows to write and return sampled data back to the userspace
  • master list - sends information about all registered master devices to the userspace
  • slave list (search or alarm search) - starts appropriate search on the bus attached to specified bus master device and returns all found slave devices to the userspace without attaching them to the w1 subsystem

W1 uses netlink connector in the kernel and one can bind to the connector group and work with above commands (requires root to bind though). Also supported read/write commands, but they existed long ago. Search/alarm search got reply as desribed above (previously they just initiated a search).

Great thanks to Paul Alfille (paul.alfille_gmail.com) of OWFS for testing and w1-to-network daemon.

Memory pool and kmem cache constructor gotcha.

Tagged:  

Memory pool is a neat structure which contains array of objects previously allocated from the attached to memory pool cache. When cache is created, it is possible to assign it a constructor, which will be invoked each time new object is allocated from that cache.

Memory pools do not invoke cache allocator's constructor, when attached cache is 'empty' (i.e. requires refill), but memory pool is not. It reuses some previously allocated entries, so if constructor contains some object initialization code, it may be missed.

On development process.

Tagged:  

Jonathan Corbet wrote me an open letter where he describes Linux kernel development process, a little bit biased to my particular blog cryyy though :)

In a nutshell Jonathan selected DST to show what was made wrong from the process point of view. And although making excuses is not interesting, I will show that (from my humble opinion) objections are not entirely correct.
For those who do not like such details, you may jump the the end of the post to read frustration points on the process I have. It is not related to the review, inclusion or Jupiter weather. These things are so small and really do not worth our time. What I do not like is just how communication between people happens. And only that really matters: no kilobytes of code can make a good process, but only people do.

HIFN update pushed upstream.

Tagged:  

Patch series has been sent to the linux-crypto@ mail list for inclusion. This set of patches includes Patrick McHardy's (kaber_trash.net) work and fixes number of bugs mostly related to hardware queue management. I also changed the way how completions are handled and reduced size of the crypto ring.

Now each hardware interrupt (or actually tasklet callback scheduled from the hardware interrupt) does not scan the whole descriptor ring for the completed requests, but maintains a last checked position in each ring, which is used as a starting point to search for ready descriptors. Usually it generates upto 100.000 interrupts per second, which most of the time corresponds only to the single ready entry in the ring, even after tasklet starts execution, so scanning the whole ring is a complete waste of CPU cycles.

HIFN watchdog was also reworked a bit, now it flushes the whole hardware queue and only when things are really very bad (for example when there is a number of started requests but hardware was 'idle' for a long time not processing requests, which is specified by set of constants).

Becuase of the completion handling changes HIFN now is about 2 times faster than it was before and operates (if I understood correctly) faster than 1.7 Ghz CPU.

Here are some results for Soekris (500 MHz CPU) and dm-crypt:

time dd if=/dev/zero of=test.img bs=1024 count=700000
 
        hifn: ~70 seconds
software aes: > 120 seconds

And some data from 1.7 Ghz Athlon XP over dm-crypt read/write speed in MB/s:

write:
        hifn: 10.4 MB/s
software aes: 8.9 MB/s
 
read:
        hifn: 9.9 MB/s
software aes: 9.1 MB/s

And of course CPU usage is two times lower in the second case and is about 4 times lower in case of Soekris machine.

Great thanks to Andreas Gerlich (agl_online.de) and Tobias Geiger (tobias.geiger_vido.info) for testing and performance results.

Parsers.

Tagged:  

Found an interesting problem to think on while waiting for Andrew Morton and Robert Love to make a decision about my inotify update, which somewhat blocks POHMELFS development, since I have to know how to implement notifications. I will resend it again tomorrow and if there will be no answers, one can can get yet another argument for stupid linux marketing bullshit about new stuff and new developers. For the reference: patch is 47 lines long.

So, the problem: language parsers. For example I want to implement XML parser in LISP which will produce either tree of the tag objects or flow of the objects (like in event-driven model). Task happend to be non trivial for me, and although I do know that there are lots of similar parsers (and more general compiler theory applications), I want to implement my own to get into this interesting subject.
Actually this problem rised from quite practical application: there is an intranet web site which contains short information notes about all my collegues, and I want to create an application (in LISP, maybe this is the main part of the problem :), which will download pages, parse them, form tag trees, find there 'Birthday' leafs and draw the age statistics. I sometimes (frequently) feel that I'm older than most of the people around, and yet believe I'm young. So, I decided to implement a generic enough XML parser (HTML as a subset) as a compiler theory problem solution, to draw age statistics. I think it is a good idea.

POHMELFS will be updated this week no matter what inotify patch status will be, expect new version very soon.

Site development.

Tagged:  

"I know administration kung-fu" or something like that.

So, one-by-one, journey of the installing mail list software and configuring site-integrated web-frontend begins. That was somewhat new for kernel hacker to check postfix configs and hack PHP scripts at 5 A.M.

How financial crisis affects tbench performance.

Tagged:  

There is a clear dependency don't you agree? Well, everyone should be affected, why tbench differs?
Let's see the details.

After scheduler guys were kicked multiple times and not only guilty commits showed but also patches were provided, things have changed a little. Well, not that little, but noticebly. Also some nasty modulo opertion was found in the network fast path, which is not yet completely resolved, but I created a trivial patch for testing and present results with it.

Tbench regression with time
Tbench regression with time

As you may see, 2.6.28-rc2 behaves noticebly better than -rc1. But please note, that 2.5% of the win is from the network modulo elimination, which exist in the tree from the middle of 2005 year, so essentially there is some unresolved bug in the recent changes, which are just hidden by the modulo elimination performance gain. Also tso and gso are turned off via ethtool.

So, the progress looks good, but I can prognose, that this is the end. We will not get any better from here (even worse, when high-resolutin timers will be turned on again), since no one in scheduler camp seems to be interested in digging into way too old commits and try to find out why things regressed there and even try to accept that it may be scheduler (design) bug, although according to Mike Galbraith's tests it is not precisely scheduler to be blamed. Contrary tests by Jiri Kosina likely shows to scheduler.

The more we talk about this issue, the clearer picture become: the more unfair the IO scheduler, the "better" dbench results we get. Even when we have single task doing network load and no more active processes, it is now unfair to allow server to get maximum performance from the system. We should also care about init process: it is the first and the oldest process in the system, we should let him have a seat in the bus and force scheduler to select it more frequently even if it sleeps all the time. We also need to consider scheduler's dinner overhead itself.

David Miller works on modulo elimination from the network code, but likely it is the only change will be made there too, although if we eliminate all checkes in tcp_tso_should_defer(), we could get another couple of megabytes per second.

Also when all major distros will start turning on SLUB allocator instead of SLAB, there will be additional 6-7 MB/s drop in this kind of workload, but I already wrote about this issue.

Things we like and don't. Break something because we believe it to be overall beneit.

Tagged:  

There are always some conditions we like and we don't. Like good weather and rain, clean and dirty streets... benchmarks which show we are wrong and which do not.

There are times when we have made changes which we knew full well reduced dbench's throughput, because we believed them to be of overall benefit.

Theorists do not and can not accept, that they can be wrong. And if practice shows they are, this is just a time for the new theory. This is a world in pink glasses. Without them though each test is created just to show that there are problems. One can infinitely argue that it can not happen in a real life, but test itself already is a real life, and thus there may be other people who use similar techniques. They should not, but they can. And in pink glasses we do not see such people and we can break something just because we believed them to be of overall benefit. We believe in theory and do not accept practical tests which show we are wrong.

Overall this is definitely a direction we should go. Absolutely. I want those who bit Andrew Morton also bite me. I want pink glasses, although no, I already wear ones... Although in a pink world anyone has perfect vision to things they believe to be overall benefit.

That's why tbench performance degradation reported multiple times just dissapeared from the regression list. Only after David Miller showed that high-resolution timers made spark wake_up() to be noticebly slower, they were disabled in scheduler. And this was actually a part of the dbench tests. And in my tests I showed two weeks ago hrticks only earned 80 out of 120 MB/s loss... I would not be surprised if high-resolution timers affect other timer-sensitive operations like BH context 'generation',

P.S. I want to be an idealist, unfortunately transform into cynic.

Kernel/userspace network ping-pong testing module

Tagged:  

This module uses simple ping-pong protocol and exchange number of messages with userspace server.

The latter does not require any configuration except address and port to listen at:

 $ ./upp -p 10250 -a 0.0.0.0
Server is now listening at 0.0.0.0:10250

Kernel module has following parameters:

  • kpp_addr - address to connect to as 32bit value, default is loopback: 0x7f000001
  • kpp_port - port to connect to, default is 1025
  • kpp_protocol - protocol to use (userspace supports only tcp though), default is TCP: 6
  • kpp_num - number of messages to exchange, default is 1000
  • kpp_send_fragments - to use or not to use fragmented sending (first 4 bytes are sent, then rest 60 bytes of the packet), default is not to use fragmented sending: 0
  • kpp_send_msg_more - to use or not to use MSG_MORE flag for the first fragment, default not to use: 0
  • kpp_recv_fragments - to use or not to use fragmented receiving (first 4 bytes are received, then rest
    60 bytes of the packet), default is not to use fragmented receiving: 0

Usage example:

# insmod ./kpp.ko kpp_addr=0xa000001 kpp_port=10250 kpp_num=1000
  kpp_send_fragments=1 kpp_recv_fragments=1 kpp_send_msg_more=0
insmod: error inserting './kpp.ko': -1 Operation not permitted
# dmesg | tail -n3
[61590.379241] Initializing kernel network ping-pong testing module.
[61590.379243]       addr: 10.0.0.1, port: 10250, protocol: 6, num: 1000, send_framgents: 1,
  use_msg_more: 0, recv_fragments: 1.
[61630.268499] Client: 10.0.0.1:10250, num: 1000, time: 40.4257398967.

Source can be found in archive. Replace KDIR variable in Makefile to point to your kernel tree.

Traffic jam simulation

Tagged:  

This is a simple traffic simulator, which contains variable number of cars and lights, each of which can be programmed to different acceleration, maximum allowed speed, stop and deceleration distances.

Each light can be programmed to switch lights after different interval, there are only two lights in real life: red and green, at least in Moscow very unfortunately) no one ever cares about yellow, lots of drivers specially accelerate when see yellow... So, only two colors.

Since I did not bother to implement a nice config for each car and light, there is only signle set of parameters, but command line parameters allow to vary initial number of cars and distance between them, number of time frames before lights change the state or new car enters the road, number of lights
and distance between them.

There are two known problems with the lights on the road: first, bad drivers, who do not maintain a huge enough buffer, so they have to wait until car in from of them moves far enough so they can start, this takes some time from limited timeframe of the green light. If buffer is large enough, drivers can start simultaneously and thus move much faster.
One can simulate this behaviour with variable initial number of cars and with different distances between them, if distance is less than stop distance (i.e. distance where driver has to stop its car, it is 4 in the current setup), then driver will have to wait, until distance becomes more or equal to stop distance, if driver stopped far than stop distance (let say 5 'meters'), then car can start simultaneously with the head. The latter approach allows to move more cars through the light during fixed time frame,
but psychologically it looks better to stay as closer as possible to the head car, which introduces a latency, since we have to wait until head car moves far enough, so we could start. This leads to negative exponential speed increase for each car behind instead of linear speed if drivers would maintain the buffer. Appropriate equations are quite simple: difference of the distance moved by the single time frame is proportional to the acceleration of the head car, which in turn is proportional to its coordinate, so we have a simple differential equation, which solution results in a negative exponential. One can read a bit more here.

Second problem is light interval. If interval is too short, then cars can not start, only couple of them moves forward, and if it is big enough, then during red light a large backlog of cars can be accumulated, and it will not be removed during green light because of the above problem: each car has to wait until head one moves to some distance. The latter is actually worse, since backlog can become so huge, that it will not be removed at all, which will lead to complete stall of the traffic flow (at the back side, front one will move, but number of cars at the tail will be bigger than number of cars which leave the traffic jam).

One can get sources from the archive. It requires gtk2 devel package installed.

Here is an example:

 ./traffic -n2 -d1 -l10 -L50 -I5 -i10

which starts with 2 cars (-nw), distance between them is equal to 1 (remember, stop distance is equal to 4, this is a distance after which car starts moving, or stops completely, if obstacle distance is less or equal), there are 10 lights (-l10), distance between them is 50 (-L50), light interval is 10 (-i10), interval when new car appears at the tail of the line is equal to 5 (-I5)

M-on-N threading library.

Tagged:  

Userspace M-on-N threading model is based on the idea, that when signal is delivered, kernel saves all information related to previous context in stack, so it is possible to find it and replace.

Userspace threading and theirs benefits and drawbacks.
Benefits:

  • Fast scheduling. There is no need to cross userspace/kernelspace boundary to schedule new thread execution (just watch what happens with userspace network stack compared to kernel's one when there are a lot of syscalls performed for small packets receiving/sending).
  • Fast thread creation and destruction. It just becomes an allocation of the structure in the userspace, no need for full creation process which is performed in clone() syscall.
  • Smaller number of cache misses. Since there is only one process instead of several threads, cache locality is increased greatly with reduced number of misses.
    • Drawbacks.

      • Scheduling fairness. Since kernel does not know about multiple threads behind given process, it can not add it appropriate number of timeslices for execution. Can be solved either by more tight collaboarion of the userspace and kernelspace schedulers or simply by increasing process' nice value.
      • All communications are performed through one kevent pipe. All syscalls are going to be converted into non-blocking operations (including nanosleep() and the like), and keep a track of
        what each context performed. In practice glibc rewrite is not what I would like to do, but instead some layer on top of it will be implemented, which will convert syscalls into kevent operations, and become a rescheduling point.
      • Complex code for good SMP scalability and userspace scheduler. Not actually a problem.

      SMP scalability in M-on-N threading model.
      Since only kernel can schedule thread (actually not even thread or process, but its own kernel's representation, so called kernel's virtual process) to run on specified CPU, M-on-N threading model should have several real threads (for example several current POSIX threads), its number should be equal to number of real CPUs, and then library layer will schedule execution of context of different real threads, each of which in turn can run on separate CPU.

      So, userspace will create new real threads when pthread_create() is called until number of them is less than number of real CPUs, each real thread in turn is a context in the global set of contexts, where fake context will be added with all subsequent pthread_create() calls, and userspace scheduler (backed by real threads) will pick up several contexts from the tree and execute them on the real CPUs.

      I would be possible to use existing Linux clone() syscall, but due to complete absence of hte documentation (which is sometimes plain wrong) and ery strong encryption of glibc sources it is quite complex task.

      As NPTL, M-on-N threading library uses stack rlimit for thread stack allocation.

      Benchmarks.
      I only ran simple benchmark of empty thread creation (its function just exits). After I started to use atomic locks ("lock" prefix on x86) instead of semaphores, thread start/empty exec/stop was reduced down to 0.3 microseconds compared to 14 microsecods for POSIX NPTL case.

      But there are problems. First one is that I perform initial context setup through signal invokation,
      which is at least two syscalls. They are slow. Another one is that thread is really started only after rescheduling, which is another signal, so another two syscalls. Third on is that there must exist different locking primitives - for signal context and for process context, which must block signals, which in turn adds additional overhead of sigprocmask() syscall.
      After I fixed all above issues (actually not fixed, but confirmed that they must exist), performance reduced to 9 microseconds compared to 14 microsecods for POSIX NPTL case for empty thread creation/destruction.
      This can be fixed, if I would have created arch-specific getcontext()-like calls, which would be mutually transformable into signal context information (existing getcontext() and friends produces different data than signal context has at least on x86). But I can not right now, since I do not know enough x86 ABI (I learned a lot for past several days, as you can notice from this blog, but it is still even remotely not enough).

      Currently M-on-N threading model uses ugly arch-specific hacks to start new threads, which actually are something remotely similar to makecontext(). So, the solution, which will rock M-on-N threading implementation is to convert or create getcontext() and friends calls which can be used with signal context information.

      Development status can be tracked here.

kevent subsystem

Tagged:  

Kevent is a generic subsytem which allows to handle event notifications.
It supports both level and edge triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, it is faster and
allows to work with essentially eny kind of events.



Development status can be tracked here.

Project is closed.

The latest release is 37, release notes can be checked
here.



Another approach for the same idea is eventfs -
pseudo FS which allows to bind file descriptors to events and poll them using epoll().

Events are provided into kernel through control syscall and can be read back through userspace ring buffer or syscall. Kevent update (i.e. readiness switching) happens directly from internals of the appropriate state machine of the underlying subsytem (like network, filesystem, timer or any other).

Kevent can be used both for edge and level notifications. It supports socket and pipe notifications (accept, sending and receiving), AIO (aio_sendfile(), aio_sendfile_path()), generic poll()/select() notifications, timer notifications, signal notifications, private userspace notifications, POSIX timers.

LWN published a great article about kevent (second one can be found here, and the latest is here.

There is documentation page at OSDL.org.

Real life benchmark with patched lighttpd server can be found
here. Additional benchmark results ran by Johann Borck (johann.borck_densedata.com) can be found here, here and here.
New benchmark created by Eric Dumazet (dada1_cosmosbay.com) which creates two threads
and set of sockets, which are read/written to by those threads. Here is a results. Here is a results for pipes.
Another probably broken benchmark of ab (Apache benchamrk) against lighttpd with epoll and kevent (more than 7k connections per second on amd64 server) can be found here.

Benchmark and some conclusions about FreeBSD kqueue on the same hardware are available here. Briefly saying, FreeBSD showed much worse behaviour than kevent and even epoll, but it is very likely that I did not complete it's configuration, since by default FreeBSD does not allow such performance test at all.

libevent kevent patch can be found here. lighttpdlighttpd patch can be found here.
Kevent patch and userspace applications are available in archive.

Kevent has a git repository, one can clone it via http (164 Mb):

$ git clone http://tservice.net.ru/~s0mbre/archive/kevent/kevent.git/ ./

NAIO - network AIO. More details on the original page.

w1 - Dallas' 1-wire protocol realization for Linux kernel 2.6

Tagged:  

w1 is i2c-like bus with single data wire (and ground one) designed by Dallas.
It was included into 2.6.8-rc2 kernel.

Historical releases available in archive.

For more details check drivers/w1/ and Documentation/w1/ directories in the Linux kernel tree.

Syndicate content