Mathieu Desnoyers posted his userspace RCU synchronization implementation and I found it very interesting to play with.
Unfortunately it was likely designed as proof-of-concept code, so it took a while to clean up the arch-dependent headers, add autoconfiguration, extend the atomic operations and wrap Mathieu's code into a library. Now one can use libsync to get a platform-independent API for working with atomic variables (arch-specific add/sub and decrement-and-test functions are provided) and an RCU implementation. This is the first release.
It works great, but there are some caveats.
First, Mathieu took an interesting approach to forcing all reader threads to flush their data from the caches into memory, that is, to learning when the quiescent periods have completed and the writer can grab the lock. URCU sends a signal (SIGUSR1 in the code) to each reader thread, and the writer waits for every thread to 'ack' it. So if your application does not want to enable signals (or has no unused ones), this implementation will not work for you. This can be optimized IMHO: if a thread is sleeping rather than running inside a read-locked section, there is no need to send it a signal, since there is no quiescent state to wait for. Also, thread rescheduling implies a barrier.
Second, only x86 (both i386 and x86_64) and PPC (not all models though) are supported. I added Sparc64 support, but then found that I only have access to a SUNW,Ultra-60 machine (which has only a 2.95 compiler that does not yet know about the __thread specifier, and 32-bit CPUs) and a SUNW,Sun-Fire-V240, which happened to have v8 CPUs, which are also 32-bit:
atomic_64.S: Assembler messages:
atomic_64.S:19: Error: Architecture mismatch on "lduw".
atomic_64.S:19: (Requires v9|v9a|v9b; requested architecture is v8.)
atomic_64.S:21: Error: Architecture mismatch on "cas".
atomic_64.S:21: (Requires v9|v9a|v9b; requested architecture is v8.)
so this was not tested either; I will try to resolve it tomorrow. But given that x86 is becoming the world-dominating platform, this should not be a show-stopper.
So, I tested the libsync library on FreeBSD 7 (AMD64), Debian Lenny (i386 SMP) and Ubuntu Hardy (i386 UP), where reading performance was just freaking awesome (writing performance was miserable though :).
If I get access to other platforms, I will port it to them. I will also cook up some documentation (there are source code examples) and a homepage soon.
Libsync will be used in the elliptics network for reference counters and read-mostly lookups, where POSIX spinlocks are currently used (which introduce visible overhead).
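As an illustration of the reference-counter case, here is a small sketch of a lock-free get/put pair built on an atomic increment and a decrement-and-test. The libsync function names are not shown in this post, so the sketch uses GCC's `__sync` builtins as a stand-in; `struct object`, `object_get` and `object_put` are hypothetical names, not elliptics code.

```c
#include <stdlib.h>

/* Hypothetical refcounted object; the payload is omitted. */
struct object {
    int refcnt;
};

/* Stand-ins for an atomic API, built on GCC __sync builtins. */
static inline void atomic_inc(int *v)
{
    __sync_add_and_fetch(v, 1);
}

/* Returns nonzero when the counter dropped to zero. */
static inline int atomic_dec_and_test(int *v)
{
    return __sync_sub_and_fetch(v, 1) == 0;
}

/* Allocate an object holding one reference (the caller's). */
struct object *object_alloc(void)
{
    struct object *o = malloc(sizeof(*o));

    if (o)
        o->refcnt = 1;
    return o;
}

/* Take an extra reference; no lock is needed. */
struct object *object_get(struct object *o)
{
    atomic_inc(&o->refcnt);
    return o;
}

/* Drop a reference; frees the object and returns 1 on the last put. */
int object_put(struct object *o)
{
    if (atomic_dec_and_test(&o->refcnt)) {
        free(o);
        return 1;
    }
    return 0;
}
```

Compared with taking a spinlock around every counter update, each get/put here is a single atomic instruction, which is the overhead saving the paragraph above is after.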
In the meantime, the elliptics network got MacOSX support (its sendfile() differs from the FreeBSD one), although one may need to apply a small patch to the POSIX options header if /usr/include/bits/posix_opt.h is old enough (thanks to Tuncer Ayaz for the reference). I also added a generic read()/send() implementation for platforms where no sendfile() is available, like OpenBSD.
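For readers wondering what such a fallback looks like, here is a minimal sketch of a read()/send() copy loop. It is not the elliptics implementation: the function name `copy_fd_to_socket` and the 16 KB buffer size are assumptions made for illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Copy up to `count` bytes from in_fd to the socket `sock`.
 * Returns the number of bytes copied (less than `count` on EOF),
 * or -1 on error. */
ssize_t copy_fd_to_socket(int in_fd, int sock, size_t count)
{
    char buf[16384];
    size_t total = 0;

    while (total < count) {
        size_t want = count - total;
        size_t off = 0;
        ssize_t rd;

        if (want > sizeof(buf))
            want = sizeof(buf);

        rd = read(in_fd, buf, want);
        if (rd < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (rd == 0)
            break; /* end of file */

        /* send() may accept fewer bytes than asked, so loop. */
        while (off < (size_t)rd) {
            ssize_t wr = send(sock, buf + off, (size_t)rd - off, 0);

            if (wr < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += (size_t)wr;
        }
        total += (size_t)rd;
    }
    return (ssize_t)total;
}
```

Unlike sendfile(), this copies the data through a userspace buffer, so it is slower, but it only relies on calls that are available everywhere.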
That's how I felt sick today. But things will change. Stay tuned!
EDITED TO ADD: that MacOSX problem with the POSIX spinlocks is not yet resolved.
Could you add ARM support?
If you test it, yes, since I do not have any ARM machines handy.
Please send me (or post here) the <code>configure</code> output and the resulting <code>include/atomic/target.h</code> file.
Those Sun machines you mention are all 64-bit. However it's pretty rare (even in the Solaris world) to run a full 64-bit userspace, which is why you're having trouble getting 64-bit instructions through the toolchain. Typically only those applications that really need the address space or other features will be built 64-bit.
It'd have been more correct for me to refer to v9 there I suppose, but I'm sure you get the drift.
Ok, I see, so it seems that the SUNW,Sun-Fire-V240 is a v9 machine, but its userspace runs in 32-bit emulation mode, which happens to have only v8 instructions? That could explain the results.
Maybe I just need to add sparc32 support into libatomic :)
If at all, this code should use an RT signal, i.e. one from the SIGRTMIN..SIGRTMAX range.
overflow and can be lost. Usual signals are just flags, which is exactly what is needed in this case.
Newer GCC versions natively support atomic operations using __sync on most architectures. If you do atomic ops this is probably the best API to choose (when it is available), instead of cooking your own assembler based ones.
note: __sync_synchronize is broken prior to 4.4:
<a href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793" title="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793">http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793</a>
It appeared in 4.1, so if available, it should be used of course.
the referenced patch from the VideoLan wiki is meant to be used for Linux systems, btw.
it cannot work on OS X.
--
Tuncer
we could try to add support for OSSpinLock on OS X but I am not sure this will be needed with libsync.