DST benchmark. IO elevators and TCP issues.

Bonnie++ benchmark of the latest DST version.

Previously released versions used blk_alloc_queue() for block IO queue allocation. This is a rather simple function which does not do anything special. It happened that with the latest kernels it does not attach an IO elevator (also known as an IO scheduler, the subsystem which may reorder and combine multiple requests into a single one to improve performance), so all DST requests went into the network as single-segment blocks. This kills performance, but I did not notice it, since this behaviour of blk_alloc_queue() was quite surprising to me (maybe I just imagined it, but previous DST releases, which worked with older kernels before the DST rewrite, had an elevator attached). So the current DST tree uses blk_init_queue(), which is actually quite different (there are two methods of handling IO requests), but does perform IO scheduler initialization, so I use it with the request_fn() callback set to NULL and my own make_request_fn() to handle block IOs.
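For illustration, here is a minimal sketch of that queue setup against the blk_init_queue()/blk_queue_make_request() API of the kernels in question; the names dst_node_example, dst_request_example and max_pages are placeholders, not the actual DST code, and the segment limits at the end are only my guess at how the "maximum IO size in pages" knob mentioned below maps onto the queue:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Placeholder per-device structure; the real DST node layout differs. */
struct dst_node_example {
	struct request_queue	*queue;
	int			max_pages;	/* maximum block IO size, in pages */
};

/* Our own make_request_fn(): all block IO enters here.  The real handler
 * would queue the bio for transfer over the network; this stub just
 * completes it to keep the sketch self-contained. */
static int dst_request_example(struct request_queue *q, struct bio *bio)
{
	bio_endio(bio, 0);
	return 0;
}

static int dst_init_queue_example(struct dst_node_example *n)
{
	/*
	 * blk_init_queue() attaches an IO elevator.  The request_fn()-based
	 * path is never used, so it is left NULL; bios are handled by our
	 * own make_request_fn() instead.
	 */
	n->queue = blk_init_queue(NULL, NULL);
	if (!n->queue)
		return -ENOMEM;

	n->queue->queuedata = n;
	blk_queue_make_request(n->queue, dst_request_example);

	/* Presumably how the per-node page limit would be enforced. */
	blk_queue_max_phys_segments(n->queue, n->max_pages);
	blk_queue_max_hw_segments(n->queue, n->max_pages);

	return 0;
}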

This change allows the attached IO scheduler to combine multiple IO blocks, so the client now sends large requests to the server. Which brings us to the next problem: a TCP ACK storm.
When the client sends big enough read requests (by default up to 32 pages), for example when reading a big file from a filesystem created on top of DST, the server replies with MTU-sized frames, which are in turn transformed into MSS-sized TCP segments. With a 1500-byte MTU we get an MSS of about 1460 bytes (depending on enabled options this may be a little smaller). The Linux TCP stack replies with a single ACK for every two received segments, which comes to about 45 ACKs per maximum-sized request; this messes with the existing TCP state and introduces additional overhead on the server (interrupts and TCP ACK processing). This in turn kills read performance (about 1.5-2 MB/s bulk read when reading from a filesystem mounted over DST). This is also very surprising to me, since having large enough read requests should not kill performance just because of lots of ACKs; I do not observe this behaviour in POHMELFS, for example. I also tested the same workload with SACK and FACK turned off, with essentially the same results.
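A back-of-the-envelope check of that ACK count (a standalone sketch, not DST code; the MSS value assumes TCP timestamps are enabled, and the result roughly matches the ~45 figure above):

#include <stdio.h>

int main(void)
{
	const int pages = 32, page_size = 4096;		/* maximum read request */
	const int mss = 1448;				/* 1500 MTU - 40 (IP+TCP) - 12 (timestamps) */
	int bytes = pages * page_size;			/* 131072 bytes per request */
	int segments = (bytes + mss - 1) / mss;		/* ~91 MSS-sized segments */
	int acks = (segments + 1) / 2;			/* one ACK per two segments: ~46 */

	printf("%d bytes -> %d segments -> ~%d ACKs\n", bytes, segments, acks);
	return 0;
}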
DST has a maximum block IO size (in pages) parameter; the best results are obtained when the client sends single-page (or two-page) read requests. Read performance is then close to local filesystem performance, while write performance drops from being equal to local filesystem performance.
Here is an example of what tcpdump shows:

20:40:08.836975 IP server.1025 > client.46340: . 1772313:1776507(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836978 IP server.1025 > client.46340: . 1776507:1780701(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836980 IP server.1025 > client.46340: . 1780701:1783497(2796) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836984 IP server.1025 > client.46340: . 1783497:1787691(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836987 IP server.1025 > client.46340: . 1787691:1791885(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836990 IP server.1025 > client.46340: . 1791885:1796079(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836993 IP server.1025 > client.46340: . 1796079:1800273(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836995 IP server.1025 > client.46340: . 1800273:1804467(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836998 IP server.1025 > client.46340: . 1804467:1808661(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837001 IP server.1025 > client.46340: . 1808661:1812855(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837003 IP server.1025 > client.46340: . 1812855:1817049(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837019 IP server.1025 > client.46340: . 1838019:1842213(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837095 IP client.46340 > server.1025: . ack 1713597 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837099 IP server.1025 > client.46340: . 1842213:1842401(188) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184289>
20:40:08.837100 IP client.46340 > server.1025: . ack 1716393 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837102 IP client.46340 > server.1025: . ack 1719189 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837346 IP client.46340 > server.1025: . ack 1721985 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837348 IP client.46340 > server.1025: . ack 1724781 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837350 IP client.46340 > server.1025: . ack 1727577 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837352 IP client.46340 > server.1025: . ack 1730373 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837353 IP client.46340 > server.1025: . ack 1733169 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837354 IP client.46340 > server.1025: . ack 1735965 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837356 IP client.46340 > server.1025: . ack 1738761 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837357 IP client.46340 > server.1025: . ack 1741557 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837359 IP client.46340 > server.1025: . ack 1744353 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837360 IP client.46340 > server.1025: . ack 1747149 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837362 IP client.46340 > server.1025: . ack 1749945 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837597 IP client.46340 > server.1025: . ack 1752741 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837600 IP client.46340 > server.1025: . ack 1755537 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837601 IP client.46340 > server.1025: . ack 1758333 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837602 IP client.46340 > server.1025: . ack 1761129 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837604 IP client.46340 > server.1025: . ack 1763925 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837605 IP client.46340 > server.1025: . ack 1766721 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837607 IP client.46340 > server.1025: . ack 1769517 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837608 IP client.46340 > server.1025: . ack 1772313 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837609 IP client.46340 > server.1025: . ack 1775109 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837611 IP client.46340 > server.1025: . ack 1777905 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837612 IP client.46340 > server.1025: . ack 1780701 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837846 IP client.46340 > server.1025: . ack 1783497 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837850 IP client.46340 > server.1025: . ack 1786293 win 13493 <nop,nop,timestamp 1184289 1316925>

What happened with TCP is so strange that I definitely want to dig deeper into the problem, but so far the only idea I have is a difference between POHMELFS and DST: the former sends a read reply as soon as the first page has been read, while the DST server sends the whole reply only when all requested data has been read, so DST produces huge bursts of IO and POHMELFS does not. I expected DST to be a bit slower than AoE, but in some cases (write performance and lots of small requests) the difference is noticeable because of the behaviour described above. For writes and small requests (especially random ones) NFS starts to behave comparably because of its ability to combine requests. AoE is a clear winner in the small-request case, since it is not limited by TCP ping-pong issues.
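To make that difference concrete, here is a rough userspace sketch of the two reply strategies; read_page_example(), send_pages_example() and struct read_req_example are made-up placeholders, not the actual POHMELFS or DST code:

#include <stdio.h>

struct read_req_example { unsigned int nr_pages; };

static void read_page_example(struct read_req_example *req, unsigned int idx)
{
	(void)req; (void)idx;	/* stands in for reading one page from disk */
}

static void send_pages_example(struct read_req_example *req,
			       unsigned int first, unsigned int count)
{
	(void)req;		/* stands in for pushing data into the socket */
	printf("sending pages %u..%u\n", first, first + count - 1);
}

/* POHMELFS-like: each page is sent as soon as it has been read, so the
 * transmission is spread over the duration of the disk IO. */
static void reply_per_page(struct read_req_example *req)
{
	unsigned int i;

	for (i = 0; i < req->nr_pages; i++) {
		read_page_example(req, i);
		send_pages_example(req, i, 1);
	}
}

/* DST-like: the reply goes out only after all requested data has been read,
 * which produces the large bursts visible in the tcpdump trace above. */
static void reply_whole_request(struct read_req_example *req)
{
	unsigned int i;

	for (i = 0; i < req->nr_pages; i++)
		read_page_example(req, i);
	send_pages_example(req, 0, req->nr_pages);
}

int main(void)
{
	struct read_req_example req = { .nr_pages = 32 };

	reply_per_page(&req);		/* 32 small sends */
	reply_whole_request(&req);	/* one big burst */
	return 0;
}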

MTU 1500 data (dst with 1 and 2 pages as maximum IO size):

Version        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP

local           16G 44100  99 68515  16 31931  12 46348  98 75432   7 288.8   0
nfs             16G 42498  98 68708   9 25402   6 30040  66 50496   5 350.6   0
aoe             16G 39055  99 72872  18 27076   9 38785  98 75510  11 201.6   0
dst             16G 39498  98 67943  19 25614  10 37676  91 72807  16  17.8   0
dst-nosack-nofa 16G 39439  98 67041  20 25789  12 37556  92 72802  17  90.7   0
dst-2           16G 38658  92 61499  20 30195  11 43373  93 72451  18  17.6   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
local            16   486   2 +++++ +++   418   1   486   2 +++++ +++   384   1
nfs              16  1619   5  8709  14  2708   7  2578   8 11918  14  1275   3
aoe              16  8581  49 +++++ +++  6626  34  8495  48 +++++ +++  5876  35
dst              16  2031  12 +++++ +++  2795  16  2196  11 +++++ +++   781   4
dst-nosack-nofac 16  1922  13 +++++ +++  2954  20  1830  12 +++++ +++  1079   7
dst-2            16  1949  12 +++++ +++  2503  16  1908  12 +++++ +++   846   5