network tree allocator

Network tree allocator can be used to allocate memory for all network operations from any context. Main designed features are:

  • reduced fragmentation (self defragmentation)
  • possibility to create zero-copy sending and receiving (ground work for userspace netchannels and high-performance sniffers)
  • greater than SLAB speed
  • full per CPU allocation and freeing (objects are never freed on different CPU)
  • dynamically grown cache
  • separate network allocations from main system’s ones

Design notes.
Original idea was to store meta information used for allocation in AVL tree, but since I found a way to use some “unused” fields in struct page, tree is unused in the allocator.

node – container for per-page (of different order) meta information.
chunk – region of free or allocated memory.

Each allocation is aligned to the minimum allocation size (L1 cache line size bytes). Each node contains a bitmask of used/unused chunks of memory for memory region bound to that node. Each free chunk is placed into double linked list inside array, indexed of size (divided by minimal allocation size).

Allocation algo.
When user requests some memory regiosn, it’s size is rounded upto minimum allocation size (instead of power of two in SLAB) and appropriate entry in array of lists of free chunks is selected. If that list contains free elements, first one is returned, otherwise next list is selected with bigger size until on-empty list is found.
Then allocator determines a node for appropriate chunk (using fields in corresponding struct age->lru), and bits corresponding to selected chunk are marked as used. If chunk had bigger size than requested, the rest is placed back into the list for appropriate size.
Thus allocator gets requested memory aligned to minimum allocation size.

Freeing algo.
When user frees chunk of the memory, appropriate node and allocation CPU are selected for given memory regions (by dereferencing page->lru fields). If current CPU does not correspond to allocation CPU, then chunk is dereferenced into single-linked list entry and placed into list
of semi-free objects for freeing on original (allocation) CPU. If freeing CPU is the same as allocation one, appropriate node is checked and neighbours for being freed chunk are found. Free neighbour(s) are then dereferenced into double linked list entry and removed from appropriate lists of free elements. Then all free chunks are combined into one and appropriate bitmask is updated in the node. Chunk of (possibly) bigger size then is dereferenced into double-linked list and placed into list of free objects for appropriate size. Then freeing algo checks if list of “semi-free” objects (objects which were started to be freed on different CPUs, but should be freed actually on current one) is not empty, and if so, all chunks from that list are freed as described above.

All above lists and arrays are not accessed by different CPUs, except “semi-freed” list, which is accessed under appropriate lock (each CPU has it’s own lock and list).

Defragmentation is a part of freeing algorithm and initial fragmentation avoidance is being done at allocation time by removing power-of-two allocations. Rate of fragmentation can be found in some userspace modlling tests being done for both power-of-two SLAB-like and NTA allocators.
For initial test I’ve run NTA with set of pseudo-random sized allocations until first allocation fails, when it hapens I decrease maximum allocation size in two times. Each graph below shows free and used chunks
inside each page (there are 4094 pages), green points correspond to free and red ones – to used chunks (each one of 32 bytes).
Maximum allocation size is equal to PAGE_SIZE, failed allocation was for 1912 bytes
(60 chunks of 32 bytes):

Fragmentation graph

Maximum allocation size is equal to PAGE_SIZE/2 (decreased after allocation failure),
failed allocation was for 968 bytes (31 chunks of 32 bytes):

Fragmentation graph

Maximum allocation size is equal to PAGE_SIZE/4 (decreased in two times after first and second allocation failures), last failed allocation was for 504 bytes (16 chunks of 32 bytes):

Fragmentation graph

Maximum allocation size is equal to PAGE_SIZE/8 (decreased in two times after each of three allocation failures), last failed allocation was for 252 bytes (8 chunks of 32 bytes):

Fragmentation graph

This tests do not show how fragmentation is changed with the time, when there are a lot of allocations and freeings are completed, but even existing results show that network tree allocator performs very well. Next time I will run the same tests after some pseudo-random allocation and freeing periods.

For comparison I’ve run the same test with power-of-2 slab-like allocator (actually it is much more simple, but it has the same ideas as SLAB allocator and probably can behave even better if we get into account big-sized chunks). Picture does not change when maximum allocation size is being decreased after allocation failures, since most of the overhead and fragmentation is obtained from power of 2 rounds.

SLAB-like power-of-two allocator overhead and fragmentation

This SLAB-like power-of-two allocator overhead and fragmentation actually looks different than on the picture, since almost all allocations have fragmentation overhead, so each vertical line actually must contain several red(used)-green(free or fragmentation overhead) pieces, where sum of all pieces of the same colour will be equal to what is shown on the picture. But picture presents that absolute amount of fragmentation overhead is extremely high for power-of-2 allocators. For the real SLAB allocator picture
will be better for small-sized chunks (since chunks never share pools with different sized ones, except when they can steal pages when cache is refilled), but much worse for big-sized ones.

Difference in used and free chunks position on the pictures is due to the fact, that in network tree allocator chunks in page are shown in reverse order (i.e. higher addresses are first).

Benchmarks with trivial epoll based web server showed noticeble (more than 40%) imrovements of the request rates (1600-1800 requests per second vs. more than 2450 ones). It can be described by more
cache-friendly freeing algorithm, by tighter objects packing and thus reduced cache line ping-pongs, reduced lookups into higher-layer caches and so on.

Design of allocator allows to map all node’s pages into userspace thus allows to have true zero-copy support for both sending and receiving dataflows. Based on this ability zero-copy sniffer and sending programm can be found in archive.

As described in recent threads [1] it is also possible to eliminate any kind of main system OOM influence on network dataflow processing, thus it is possible to prevent deadlock for systems, which use network as memory storage (swap over network, iSCSI, NBD and so on).

Network allocator allows to dynamically grow it’s cache when there is strong demand on this.

Things in TODO list:

  • implement userspace mapping and move netchannels into userspace

Old developement status can be tracked here.

1. Discussions about deadlock prevention in network based storage devices.

Patch is available in archive.