Bonnie++ benchmark of the latest DST version.
Previously released versions used blk_alloc_queue() for block IO queue allocation. This is a rather simple function which does not do anything special. It turned out that with recent kernels it does not attach an IO elevator (also known as the IO scheduler, the subsystem which may reorder and combine multiple requests into a single one to improve performance), so all DST requests went to the network as single-segment blocks. This kills performance, but I did not notice it, since this behaviour of blk_alloc_queue() was quite surprising to me (maybe I just imagined it, but previous DST releases, which worked with older kernels (before the DST rewrite), seemed to have an elevator attached). So the current DST tree uses blk_init_queue(), which is actually very different (there are two methods of handling IO requests) but includes IO scheduler initialization, so I call it with the request_fn() callback set to NULL and use my own make_request_fn() to handle block IOs.
This change allows the attached IO scheduler to combine multiple IO blocks, so now the client sends large requests to the server. Which brings us to the next problem: a TCP ACK storm.
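To make "reorder and combine multiple requests into a single one" concrete, here is a toy Python model of that merging step. It is not kernel code and does not reflect how any particular elevator is implemented; the `Request` class, the `merge_requests()` helper, the 512-byte sector and the 4096-byte page are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    sector: int      # first 512-byte sector covered by the request (assumed unit)
    nr_sectors: int  # request length in sectors

    @property
    def end(self) -> int:
        return self.sector + self.nr_sectors


def merge_requests(requests, max_sectors=256):
    """Sort requests by start sector and merge the ones that are contiguous.

    max_sectors caps the size of a merged request; 256 sectors of 512 bytes
    correspond to 32 pages of 4096 bytes (page size is an assumption).
    """
    merged = []
    for req in sorted(requests, key=lambda r: r.sector):
        last = merged[-1] if merged else None
        if (last is not None and last.end == req.sector
                and last.nr_sectors + req.nr_sectors <= max_sectors):
            # Back merge: grow the previous request instead of queueing a new one.
            last.nr_sectors += req.nr_sectors
        else:
            merged.append(Request(req.sector, req.nr_sectors))
    return merged


# Eight single-page (8-sector) requests arriving out of order, plus a distant one.
incoming = [Request(s, 8) for s in (16, 0, 8, 24, 40, 32, 56, 48)] + [Request(1000, 8)]
queue = merge_requests(incoming)
for req in queue:
    print(req.sector, req.nr_sectors)
```

Without the merging step the network would see nine single-page requests; with it, the eight contiguous pages travel as one request and only the distant page stays separate.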
When the client sends large enough read requests (by default up to 32 pages), for example when reading a big file from a filesystem created on top of DST, the server replies with MTU-sized frames, which carry MSS-sized TCP segments. With a 1500-byte MTU we get an MSS of about 1460 bytes (depending on the enabled options it may be a little smaller). The Linux TCP stack replies with a single ACK to every two received segments, which is about 45 ACKs per maximum-sized request; this in turn messes with the existing TCP state and introduces additional overhead for the server (the number of interrupts and TCP ACK processing). This in turn kills read performance (about 1.5-2 MB/s bulk read performance when reading from a filesystem mounted over DST). This is also very surprising to me, since having large enough read requests should not kill performance just because of lots of ACKs. I do not observe this behaviour in POHMELFS, for example. I also tested the same workload with SACK and FACK turned off, with essentially the same results.
DST has a maximum block IO size parameter (in pages); the best results are obtained when the client sends read requests of at most a single page (or two). Read performance is then close to local filesystem performance. Write performance, in turn, drops (from being equal to local filesystem performance).
Here is an example of what tcpdump shows:
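The "about 45 ACKs" figure can be reproduced with a few lines of arithmetic. This is only a sketch: the 4096-byte page size and the 40 bytes of IPv4 plus TCP headers (no options) are assumptions, and `acks_per_request()` is a name made up for this example.

```python
import math


def acks_per_request(pages, mtu=1500, header_bytes=40, page_size=4096,
                     segments_per_ack=2):
    """Estimate how many ACKs one read reply of `pages` pages triggers.

    header_bytes=40 assumes 20 bytes of IPv4 and 20 bytes of TCP header
    without options; page_size=4096 is likewise an assumption.
    """
    mss = mtu - header_bytes
    segments = math.ceil(pages * page_size / mss)
    return mss, segments, math.ceil(segments / segments_per_ack)


mss, segments, acks = acks_per_request(32)
print(mss, segments, acks)
```

For the default maximum of 32 pages this gives 90 segments and 45 ACKs. Passing `header_bytes=52` (12 extra bytes for the TCP timestamp option) shrinks the MSS to 1448 and raises the estimate slightly, which matches the remark that the MSS may be a little smaller depending on the enabled options.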
20:40:08.836975 IP server.1025 > client.46340: . 1772313:1776507(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836978 IP server.1025 > client.46340: . 1776507:1780701(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836980 IP server.1025 > client.46340: . 1780701:1783497(2796) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836984 IP server.1025 > client.46340: . 1783497:1787691(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836987 IP server.1025 > client.46340: . 1787691:1791885(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836990 IP server.1025 > client.46340: . 1791885:1796079(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836993 IP server.1025 > client.46340: . 1796079:1800273(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836995 IP server.1025 > client.46340: . 1800273:1804467(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.836998 IP server.1025 > client.46340: . 1804467:1808661(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837001 IP server.1025 > client.46340: . 1808661:1812855(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837003 IP server.1025 > client.46340: . 1812855:1817049(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837019 IP server.1025 > client.46340: . 1838019:1842213(4194) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184288>
20:40:08.837095 IP client.46340 > server.1025: . ack 1713597 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837099 IP server.1025 > client.46340: . 1842213:1842401(188) ack 7392 win 19019 <nop,nop,timestamp 1316925 1184289>
20:40:08.837100 IP client.46340 > server.1025: . ack 1716393 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837102 IP client.46340 > server.1025: . ack 1719189 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837346 IP client.46340 > server.1025: . ack 1721985 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837348 IP client.46340 > server.1025: . ack 1724781 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837350 IP client.46340 > server.1025: . ack 1727577 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837352 IP client.46340 > server.1025: . ack 1730373 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837353 IP client.46340 > server.1025: . ack 1733169 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837354 IP client.46340 > server.1025: . ack 1735965 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837356 IP client.46340 > server.1025: . ack 1738761 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837357 IP client.46340 > server.1025: . ack 1741557 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837359 IP client.46340 > server.1025: . ack 1744353 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837360 IP client.46340 > server.1025: . ack 1747149 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837362 IP client.46340 > server.1025: . ack 1749945 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837597 IP client.46340 > server.1025: . ack 1752741 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837600 IP client.46340 > server.1025: . ack 1755537 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837601 IP client.46340 > server.1025: . ack 1758333 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837602 IP client.46340 > server.1025: . ack 1761129 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837604 IP client.46340 > server.1025: . ack 1763925 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837605 IP client.46340 > server.1025: . ack 1766721 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837607 IP client.46340 > server.1025: . ack 1769517 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837608 IP client.46340 > server.1025: . ack 1772313 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837609 IP client.46340 > server.1025: . ack 1775109 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837611 IP client.46340 > server.1025: . ack 1777905 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837612 IP client.46340 > server.1025: . ack 1780701 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837846 IP client.46340 > server.1025: . ack 1783497 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837850 IP client.46340 > server.1025: . ack 1786293 win 13493 <nop,nop,timestamp 1184289 1316925>
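A quick way to see how much data each of those client ACKs covers is to extract the ACK numbers and take the differences. The sketch below uses four lines copied from the dump above; the regular expression and the `ack_deltas()` helper are written only for this output format and are not a general tcpdump parser.

```python
import re

SAMPLE = """\
20:40:08.837095 IP client.46340 > server.1025: . ack 1713597 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837100 IP client.46340 > server.1025: . ack 1716393 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837102 IP client.46340 > server.1025: . ack 1719189 win 13493 <nop,nop,timestamp 1184289 1316925>
20:40:08.837346 IP client.46340 > server.1025: . ack 1721985 win 13493 <nop,nop,timestamp 1184289 1316925>
"""

# Matches only pure ACKs sent by the client; server data segments are skipped.
PURE_ACK = re.compile(r"IP client\.\d+ > server\.\d+: \. ack (\d+) win")


def ack_deltas(dump):
    """Return the number of bytes newly acknowledged by each successive ACK."""
    acks = [int(m.group(1)) for m in PURE_ACK.finditer(dump)]
    return [later - earlier for earlier, later in zip(acks, acks[1:])]


deltas = ack_deltas(SAMPLE)
print(deltas)
```

Every ACK in the sample advances by 2796 bytes. That would be consistent with one ACK per two segments of 1398 bytes each, but the segment size is an inference from the numbers, not something the dump states directly.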
What happened with TCP is so strange that I definitely want to dig deeper into the problem, but so far the only idea I have is a difference between POHMELFS and DST: the former sends a read reply as soon as the first page has been read, while the DST server sends the whole reply only when all the requested data has been read, so DST produces huge bursts of IO and POHMELFS does not. I expected DST to be a bit slower than AoE, but in some cases (write performance and lots of small requests) the difference is noticeable because of the above tweak. For writes and small requests (especially random ones) NFS starts to behave comparably because of its ability to combine requests. AoE is the clear winner in the small-request case, since it is not limited by TCP ping-pong issues.
MTU 1500 data (dst with 1 and 2 pages as maximum IO size):
Version              ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine         Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
local            16G 44100  99 68515  16 31931  12 46348  98 75432   7 288.8   0
nfs              16G 42498  98 68708   9 25402   6 30040  66 50496   5 350.6   0
aoe              16G 39055  99 72872  18 27076   9 38785  98 75510  11 201.6   0
dst              16G 39498  98 67943  19 25614  10 37676  91 72807  16  17.8   0
dst-nosack-nofa  16G 39439  98 67041  20 25789  12 37556  92 72802  17  90.7   0
dst-2            16G 38658  92 61499  20 30195  11 43373  93 72451  18  17.6   0
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
       files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
local             16   486   2 +++++ +++   418   1   486   2 +++++ +++   384   1
nfs               16  1619   5  8709  14  2708   7  2578   8 11918  14  1275   3
aoe               16  8581  49 +++++ +++  6626  34  8495  48 +++++ +++  5876  35
dst               16  2031  12 +++++ +++  2795  16  2196  11 +++++ +++   781   4
dst-nosack-nofac  16  1922  13 +++++ +++  2954  20  1830  12 +++++ +++  1079   7
dst-2             16  1949  12 +++++ +++  2503  16  1908  12 +++++ +++   846   5
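To put the numbers in perspective, the sketch below expresses sequential block input and random seeks relative to the local filesystem. The values are copied from the first table above; the choice of these two columns and the `relative_to_local()` helper are just for illustration.

```python
# (sequential block input in K/sec, random seeks per second), copied from the table.
results = {
    "local": (75432, 288.8),
    "nfs": (50496, 350.6),
    "aoe": (75510, 201.6),
    "dst": (72807, 17.8),
    "dst-nosack-nofa": (72802, 90.7),
    "dst-2": (72451, 17.6),
}


def relative_to_local(results):
    """Return each setup's block read and seek rate as a fraction of local."""
    base_read, base_seeks = results["local"]
    return {
        name: (read / base_read, seeks / base_seeks)
        for name, (read, seeks) in results.items()
    }


ratios = relative_to_local(results)
for name, (read_ratio, seek_ratio) in ratios.items():
    print(f"{name:16s} block read {read_ratio:6.1%}   seeks {seek_ratio:6.1%}")
```

With one-page requests DST reads at roughly 96-97% of the local block rate, while its random seek rate is a small fraction of the local one, which is the small-request weakness discussed above.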
Hi!
I was bored, surfing the web, and found this: http://data.guug.de/slides/lk2008/hm_LinuxKongress_2008-Remote-Replicati...
I don't know whether you already know about it, or whether you got in touch with them while developing dst. But I thought it could perhaps be useful and help dst go upstream (perhaps getting in touch with them and making dst work together with that would help get dst into mainline? :).
If it's not interesting to you, or if this paper makes no sense at all in connection with dst, I'm very sorry; I won't do this again, really.
Thanks a lot!
DST could be used with the above project as a low-level driver, as shown on the diagrams, though the whole logic used in dm-replicator is quite high-level, as is its code.
The idea of async data replication is very interesting, and I would like to work out a solution tied to DST as a transport; so far DST has been used as a sync RAID target.
It never attached an IO scheduler, I think you are imagining things :-). It does precisely what the function name implies: it simply allocates a queue. Secondly, if you use ->make_request_fn() based queueing, then you are not hitting the IO scheduler even if it is attached. That has also always been so, it's not a new development. It's the responsibility of ->make_request_fn() to ask the IO scheduler for merging, which in turn produces request structures that are extracted from ->request_fn(). So if you are still just queuing from your own ->make_request_fn(), you haven't changed behaviour at all.
So your code sounds a little funny. If you do your own coalescing in your own ->make_request_fn(), then it's of course possible to provide bigger requests. But that really has nothing to do with the IO scheduler chosen, since you will never enter it.
Hmm, strange, but after that change I see BIOs with a large number of bio_vecs, whereas previously they all had only a single sector. It was not the IO scheduler that did the combining, but in the old days BIOs were large, the sky was blue and the grass was green, and right now only the sky still is (BIOs were small before the change, and there is no grass, since winter is coming). I played with the number of physical segments and the maximum sizes, but that did not change the BIO size; it looked like there was a huge flag which did not allow pushing several buffer_heads into a single BIO in the ->readpage() callback or something like that. Just hand-waving though, I did not do a deep analysis.
So apparently something changed, and my unrelated (agreed on that :) change somehow fixed the situation. But I will definitely experiment with ->request_fn() instead of ->make_request_fn() so that the IO scheduler comes into play; that is where I expect DST to improve its small-request performance. Although its performance is lower in the same ratio as its CPU usage is lower, I do not think this is quite the right way.
I was unclear in the original post: I know that it was not the IO scheduler that made the BIOs large, but apparently using the new initialization function (which attaches the scheduler) somehow allowed them to grow past a single sector per request. Maybe it was not the scheduler's doing but something else, but I tried all the blk_queue_*() tweaks and nothing helped.