ioremap.net

Storage and beyond

Linux and FreeBSD POSIX mutexes and semaphores

Decided to check how fast are POSIX mutexes and semaphores in Linux and FreeBSD depending on number of contended threads.

So the test is rather simple: thread aciqures a lock, perform some quick action and releases it. This sequence is repeated multiple times. Number of threads can vary.
I tested time to perform 100 millions of counter increments (two asm instructions) protected either by POSIX mutex or generic semaphore, so effectively number of lock operations per second per thread was measured. 1, 2, 3, 5, 7 and 10 threads were actively racing for the same lock in the tests, which were repeated 5 times and median result was shown.

I ran the same test on three different machines: Linux UP laptop (Intel 1.4 Ghz), Linux SMP desktop (2-way AMD 2.5 Ghz), FreeBSD SMP server (8-way Intel E5345 2.3 Ghz, but usually 2-4 CPUs were fully loaded by external applications).

Single-thread case should show how fast is non-contended lock implementation on different systems. It is expected that two (and more) threads case will have noticebly smaller number of locks per second application is capable to aciqure compared to single-threaded case (for example in Linux there should be no syscalls at all in this case). With increased number of threads the proper implementation should be as close as possible to linear decrease in the number of lock operations per second (i.e. when two threads race for the same lock, each thread should perform the loop two times longer compared single-threaded case).

Let’s see the practical results.


Number of locks per second per thread

And compared to the non-contended single-threaded case.


Number of locks per second including single-threaded case

Linux POSIX thread mutex implementation is definitely faster than that of FreeBSD (about 1.5 times), and with time FreeBSD scales worse, for example in some tests with 10 contended threads application performed as much as 20k locks per second – 10 times slower than Linux, in other cases it took 80k locks per second, 2-3 times slower than Linux as in the non-contended case.
But there is an interesting observation – the more CPUs we have the worse is scalability, looks like scheduler decision to which thread to start fires up at the wrong time, instead of allowing thread to complete the locked operation, scheduler wakes up another thread just to check that lock is still not available. And FreeBSD has more problems with it. Also ULE scheduler does not allow test application (with 10 threads for example) to run on more than 4 CPUs, do not even know why.

I can not explain performance of the Linux semaphores in the non-contended case – more than 2 times faster than POSIX mutexes created on top of futexes. In the 2-threads contended case semaphores are about two times slower than POSIX mutexes and with increased number of threads difference dissapears.

Overall conclusion: POSIX thread locks should not be used for the frequently accessible code pathes, which are not supposed to sleep, i.e. dig into the kernel. Elliptic network uses it to implement reference counter (instead of atomic variables) and for tree/list search/modify protection. All operations are supposed to be very lightweight, so heavy POSIX mutexes instroduce huge overhead here and should be replaced with spinlocks and hardware-dependant atomic variables.

Time for liblock and libatomic implementations, but this will be postponed for the futher releases.

Comments are currently closed.

5 Responses to “Linux and FreeBSD POSIX mutexes and semaphores”

  • zbr says:

    With increased number of threads the proper implementation should be as close as possible to linear decrease in the number of lock operations per second (i.e. when two threads race for the same lock, each thread should perform the loop two times longer compared single-threaded case).

    Of course I meant 4 threads should be 2 times slower than 2 threads, 3 threads should be 1.5 slower than 2 threads and so on. Single-threaded non-contended case is special and may be very optimized, so we can not compare against it.

  • Anonymous says:

    Hi, do you think that repeating the test under Windows and posting the results would be worthwhile?

  • zbr says:

    But I do not have Windows machines, nor have enough interest to learn how to port POSIX applications there.

  • Anonymous says:

    How about providing the sources of your test so that I could do it myself?

  • zbr says:

    lock_test.c

    Running script:

    #!/bin/bash
    
    inum=100000000
    log=lock.log
    
    echo `uname -a` > $log
    for num in 1 2 3 5 7 10; do
    	echo $num >> $log
    
    	for ((i=0; i<5; ++i)); do
    		./lock_test -n $num -N $inum >> $log
    	done
    done