M-on-N threading library.

Userspace M-on-N threading model is based on the idea, that when signal is delivered, kernel saves all information related to previous context in stack, so it is possible to find it and replace.

Userspace threading and theirs benefits and drawbacks.

  • Fast scheduling. There is no need to cross userspace/kernelspace boundary to schedule new thread execution (just watch what happens with userspace network stack compared to kernel’s one when there are a lot of syscalls performed for small packets receiving/sending).
  • Fast thread creation and destruction. It just becomes an allocation of the structure in the userspace, no need for full creation process which is performed in clone() syscall.
  • Smaller number of cache misses. Since there is only one process instead of several threads, cache locality is increased greatly with reduced number of misses.
    • Drawbacks.

      • Scheduling fairness. Since kernel does not know about multiple threads behind given process, it can not add it appropriate number of timeslices for execution. Can be solved either by more tight collaboarion of the userspace and kernelspace schedulers or simply by increasing process’ nice value.
      • All communications are performed through one kevent pipe. All syscalls are going to be converted into non-blocking operations (including nanosleep() and the like), and keep a track of
        what each context performed. In practice glibc rewrite is not what I would like to do, but instead some layer on top of it will be implemented, which will convert syscalls into kevent operations, and become a rescheduling point.
      • Complex code for good SMP scalability and userspace scheduler. Not actually a problem.

      SMP scalability in M-on-N threading model.
      Since only kernel can schedule thread (actually not even thread or process, but its own kernel’s representation, so called kernel’s virtual process) to run on specified CPU, M-on-N threading model should have several real threads (for example several current POSIX threads), its number should be equal to number of real CPUs, and then library layer will schedule execution of context of different real threads, each of which in turn can run on separate CPU.

      So, userspace will create new real threads when pthread_create() is called until number of them is less than number of real CPUs, each real thread in turn is a context in the global set of contexts, where fake context will be added with all subsequent pthread_create() calls, and userspace scheduler (backed by real threads) will pick up several contexts from the tree and execute them on the real CPUs.

      I would be possible to use existing Linux clone() syscall, but due to complete absence of hte documentation (which is sometimes plain wrong) and ery strong encryption of glibc sources it is quite complex task.

      As NPTL, M-on-N threading library uses stack rlimit for thread stack allocation.

      I only ran simple benchmark of empty thread creation (its function just exits). After I started to use atomic locks ("lock" prefix on x86) instead of semaphores, thread start/empty exec/stop was reduced down to 0.3 microseconds compared to 14 microsecods for POSIX NPTL case.

      But there are problems. First one is that I perform initial context setup through signal invokation,
      which is at least two syscalls. They are slow. Another one is that thread is really started only after rescheduling, which is another signal, so another two syscalls. Third on is that there must exist different locking primitives – for signal context and for process context, which must block signals, which in turn adds additional overhead of sigprocmask() syscall.
      After I fixed all above issues (actually not fixed, but confirmed that they must exist), performance reduced to 9 microseconds compared to 14 microsecods for POSIX NPTL case for empty thread creation/destruction.
      This can be fixed, if I would have created arch-specific getcontext()-like calls, which would be mutually transformable into signal context information (existing getcontext() and friends produces different data than signal context has at least on x86). But I can not right now, since I do not know enough x86 ABI (I learned a lot for past several days, as you can notice from this blog, but it is still even remotely not enough).

      Currently M-on-N threading model uses ugly arch-specific hacks to start new threads, which actually are something remotely similar to makecontext(). So, the solution, which will rock M-on-N threading implementation is to convert or create getcontext() and friends calls which can be used with signal context information.

      Development status can be tracked here.