There is a clear dependency, don't you agree? Well, everyone should be affected, so why does tbench differ?
Let's see the details.
After the scheduler guys were kicked multiple times, and not only were the guilty commits shown but patches were also provided, things have changed a little. Well, not that little, but noticeably. Also, a nasty modulo operation was found in the network fast path; it is not yet completely resolved, but I created a trivial patch for testing and present the results with it.

Tbench regression with time
As you may see, 2.6.28-rc2 behaves noticeably better than -rc1. But please note that 2.5% of the win comes from the network modulo elimination, and that modulo has existed in the tree since the middle of 2005, so essentially there is still some unresolved bug in the recent changes, which is just hidden by the performance gain from the modulo elimination. Also, TSO and GSO are turned off via ethtool.
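To make the "hidden by the gain" point concrete, here is a minimal sketch of the arithmetic. It assumes the unrelated gain is multiplicative; the function name and the numbers in the usage line are hypothetical and are not measurements from this post.

```python
def hidden_regression(baseline, observed, unrelated_gain):
    """Return the fractional regression left once an unrelated gain is removed.

    baseline:       throughput of the reference kernel (for example MB/s)
    observed:       throughput of the newer kernel, measured with the patch applied
    unrelated_gain: relative gain attributed to the unrelated patch (0.025 for 2.5%)
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # Assumption: the gain scales throughput multiplicatively, so dividing removes it.
    without_gain = observed / (1.0 + unrelated_gain)
    return (baseline - without_gain) / baseline


# Hypothetical numbers, for illustration only.
print(f"{hidden_regression(400.0, 395.0, 0.025):.1%}")
```

With the gain set to zero the function returns the plain relative difference, so the gap between the two results is the part of the regression that the patch masks.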
So, the progress looks good, but I can predict that this is the end. We will not get any better from here (and it will get even worse when high-resolution timers are turned on again), since no one in the scheduler camp seems to be interested in digging into way too old commits to find out why things regressed there, or even in accepting that it may be a scheduler (design) bug, although according to Mike Galbraith's tests it is not precisely the scheduler that is to blame. In contrast, tests by Jiri Kosina likely point to the scheduler.
The more we talk about this issue, the clearer the picture becomes: the more unfair the IO scheduler, the "better" dbench results we get. Even when we have a single task doing network load and no other active processes, it is now unfair to allow the server to get maximum performance from the system. We should also care about the init process: it is the first and the oldest process in the system, so we should let him have a seat in the bus and force the scheduler to select it more frequently even if it sleeps all the time. We also need to consider the scheduler's dinner overhead itself.
David Miller is working on eliminating the modulo from the network code, but likely that is the only change that will be made there too, although if we eliminated all checks in tcp_tso_should_defer(), we could get another couple of megabytes per second.
Also, when all major distros start turning on the SLUB allocator instead of SLAB, there will be an additional 6-7 MB/s drop in this kind of workload, but I already wrote about this issue.
Care to test again with a freshly cloned tree?
I saw some performance fixes being pushed to Linus...
Hard to tell...
Here is one of my test machines:
Results are definitely worse for the newer kernel.
Let's now look at a bit slower machine:
And here the results for the newer kernel are slightly better.
A while ago we did some tests of MTU vs throughput and got some extraordinary results. We've unfortunately not got the hardware (or the time) since to repeat this with newer kernels, and our investigations at the time didn't show up anything particularly likely as to why it has such an odd shape. But if you're interested in looking at networking performance, this might be a very interesting place to look. One important point appears to be that packets of 8k in size (such as NFS packets) have exceptionally poor performance.
Anyway, the graph we managed to produce is here: http://wand.net.nz/~perry/mtu.png
It's possible that newer kernels have fixed the problem; we've not had a chance to check. Tests were done with two machines with a crossover cable between them (no switch), running iperf from single user mode.
Just thought you'd possibly be interested. This isn't a request for support, I don't care if you don't do anything about it. Just a "wow, that is /not/ what I expected to happen!" and thought you'd find it interesting too.
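For anyone who wants to repeat the MTU sweep, here is a rough sketch that only builds the command lines for one side of the link. The interface name, server address, MTU range, and run length are all assumptions, and the exact invocations used for the graph are not known; the other machine would need the same MTU set and `iperf -s` running.

```python
def mtu_sweep_commands(iface, server, mtus, seconds=10):
    """Build argv lists for an MTU sweep: set the MTU, then run an iperf client."""
    commands = []
    for mtu in mtus:
        # Change the link MTU on the local interface (needs root when executed).
        commands.append(["ip", "link", "set", "dev", iface, "mtu", str(mtu)])
        # Measure TCP throughput against the iperf server for a fixed duration.
        commands.append(["iperf", "-c", server, "-t", str(seconds)])
    return commands


# Hypothetical interface, address and MTU range.
for argv in mtu_sweep_commands("eth0", "192.168.0.2", range(1500, 9001, 500)):
    print(" ".join(argv))
```

The function returns the commands instead of executing them, so the list can be reviewed, or run with `subprocess.run`, once the matching MTU has been set on the peer.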
This driver has a long history of allocation problems with large MTUs. This could explain some of the drops in your graph, but not all of them. It also looks like this happens only between two e1000 cards, so maybe it is an indication of some kind of pause frame or other flow control that this NIC emits.