[Devel] Re: IO scheduler based IO controller V10

KAMEZAWA Hiroyuki kamezawa.hiroyu at jp.fujitsu.com
Thu Sep 24 18:09:52 PDT 2009


On Thu, 24 Sep 2009 14:33:15 -0700
Andrew Morton <akpm at linux-foundation.org> wrote:
> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky, and the biggest reason is that async
> > writes are cached in higher layers (the page cache), and possibly in the file
> > system layer as well (btrfs, xfs etc.), and are dispatched to the lower
> > layers not necessarily in a proportional manner.
> > 
> > For example, consider two dd threads reading /dev/zero as the input file and
> > writing huge output files. Very soon we cross vm_dirty_ratio and a dd thread
> > is forced to write out some pages to disk before more pages can be dirtied.
> > But the dirty pages picked are not necessarily those of the same thread;
> > writeback can very well pick the inode of the lower-priority dd thread and do
> > some writeout. So effectively the higher-weight dd ends up writing out the
> > lower-weight dd's pages, and we don't see service differentiation.
> > 
> > IOW, the core problem with buffered write fairness is that the higher-weight
> > thread does not throw enough IO traffic at the IO controller to keep its
> > queue continuously backlogged. In my testing, there are many 0.2 to 0.8
> > second intervals where the higher-weight queue is empty, and in that duration
> > the lower-weight queue gets lots of work done, giving the impression that
> > there was no service differentiation.
> > 
> > In summary, from the IO controller's point of view, async write support is
> > there. But because the page cache has not been designed so that a higher
> > prio/weight writer can do more writeout than a lower prio/weight writer,
> > getting service differentiation is hard: it is visible in some cases and not
> > in others.
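
As an aside, a minimal C sketch of one such buffered writer is below (the
output path and size are invented for illustration; the original test simply
used dd with /dev/zero as input). Two instances of this, placed in groups with
different weights, reproduce the behaviour described above: both dirty the page
cache at memory speed until the dirty thresholds kick in.

/*
 * Minimal sketch of one buffered writer from the dd example above.
 * The path and size are invented for illustration; no O_DIRECT or
 * O_SYNC is used, so all writes go through the page cache.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF_SIZE        (1024 * 1024)                   /* 1 MiB per write() */
#define TOTAL_SIZE      (4ULL * 1024 * 1024 * 1024)     /* 4 GiB in total */

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/mnt/test/zerofile";
        unsigned long long written = 0;
        char *buf = calloc(1, BUF_SIZE);        /* zero-filled, like /dev/zero */
        int fd;

        if (!buf)
                return 1;

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        while (written < TOTAL_SIZE) {
                ssize_t ret = write(fd, buf, BUF_SIZE);

                if (ret < 0) {
                        perror("write");
                        break;
                }
                written += ret;
        }

        close(fd);
        free(buf);
        return 0;
}
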
> 
> Here's where it all falls to pieces.
> 
> For async writeback we just don't care about IO priorities.  Because
> from the point of view of the userspace task, the write was async!  It
> occurred at memory bandwidth speed.
> 
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation.  And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
> 
> So when balance_dirty_pages() hits, what do we want to do?
> 
> I suppose that all we can do is to block low-ioprio processes more
> aggressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
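
A tiny userspace sketch of that idea is below: each writer gets a share of the
dirty threshold scaled by its ioprio, so a low-ioprio task hits its limit and
blocks earlier. Every name and constant here is invented for the example; this
is not the kernel's balance_dirty_pages().

/*
 * Userspace illustration (not kernel code) of the idea above: give each
 * writer a share of a dirty threshold proportional to its io priority,
 * so low-ioprio writers are throttled earlier and dirty memory more
 * slowly.  All names and constants are invented for the example.
 */
#include <stdio.h>

#define DIRTY_THRESH_PAGES      100000  /* pretend global dirty threshold */

/* Best-effort io priorities run from 0 (highest) to 7 (lowest). */
static unsigned long task_dirty_limit(int ioprio, int nr_writers)
{
        unsigned long base = DIRTY_THRESH_PAGES / nr_writers;

        /* Weight 8..1 for prio 0..7; a crude stand-in for a real policy. */
        return base * (unsigned long)(8 - ioprio) / 8;
}

/* Should this task block in the write path, given what it has dirtied? */
static int should_throttle(unsigned long task_dirty_pages, int ioprio,
                           int nr_writers)
{
        return task_dirty_pages > task_dirty_limit(ioprio, nr_writers);
}

int main(void)
{
        int prio;

        for (prio = 0; prio <= 7; prio++)
                printf("ioprio %d: limit %lu pages, 40000 dirtied -> %s\n",
                       prio, task_dirty_limit(prio, 2),
                       should_throttle(40000, prio, 2) ? "throttle" : "ok");
        return 0;
}

With two writers, the ioprio-0 task gets a 50000-page allowance while the
ioprio-7 task gets 6250, so the low-priority writer blocks long before the
high-priority one does.
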
> 
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.
> 

I think I must support dirty-ratio in the memcg layer. But not yet.
I can't easily imagine how the system will work if both dirty-ratio and
the io-controller cgroup are supported. But considering using them as a set
of cgroups, called containers (zones?), it will not be bad, I think.
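
The arithmetic could be as simple as the sketch below (names and numbers are
hypothetical): derive the group's dirty threshold from its own memory limit
rather than from total system RAM, and throttle writers in the group against
that.

/*
 * Hypothetical per-memcg dirty limit: the threshold is derived from the
 * cgroup's own memory limit instead of total system RAM.  Names and
 * values are invented for the example; this is not an existing interface.
 */
#include <stdio.h>

struct memcg_example {
        unsigned long limit_pages;      /* group memory limit, in pages */
        unsigned long dirty_pages;      /* dirty pages charged to the group */
        unsigned int  dirty_ratio;      /* per-group vm.dirty_ratio analogue, % */
};

static int memcg_over_dirty_limit(const struct memcg_example *memcg)
{
        unsigned long thresh = memcg->limit_pages * memcg->dirty_ratio / 100;

        return memcg->dirty_pages > thresh;
}

int main(void)
{
        /* A 512 MiB group (4 KiB pages) with a 20% per-group dirty ratio. */
        struct memcg_example g = {
                .limit_pages = 131072,
                .dirty_pages = 30000,
                .dirty_ratio = 20,
        };

        printf("threshold = %lu pages, over limit: %s\n",
               g.limit_pages * g.dirty_ratio / 100,
               memcg_over_dirty_limit(&g) ? "yes" : "no");
        return 0;
}
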

The final bottleneck queue for fairness in a usual workload on a usual (small)
server will be ext3's journal, I wonder ;)

Thanks,
-Kame


> Importantly screwed!  It's a very common workload pattern, and one
> which causes tremendous amounts of IO to be generated very quickly,
> traditionally causing bad latency effects all over the place.  And we
> have no answer to this.
> 
> > Vanilla CFQ Vs IO Controller CFQ
> > ================================
> > We have not fundamentally changed CFQ; instead we have enhanced it to also
> > support hierarchical io scheduling. In the process there are invariably small
> > changes here and there as new scenarios come up. I am running some tests here
> > and comparing both CFQs to see if there is any major deviation in behavior.
> > 
> > Test1: Sequential Readers
> > =========================
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
> > 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
> > 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
> > 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
> > 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  
> > 
> > IO scheduler: IO controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
> > 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
> > 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
> > 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
> > 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  
> > 
> > Test2: Sequential Writers
> > =========================
> > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
> > 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
> > 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
> > 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
> > 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
> > 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
> > 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
> > 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
> > 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  
> > 
> > Test3: Random Readers
> > =========================
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16>]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
> > 2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
> > 4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
> > 8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
> > 16  38KiB/s     8KiB/s      328KiB/s    3965 msec   
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
> > 2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
> > 4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
> > 8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
> > 16  43KiB/s     9KiB/s      327KiB/s    3905 msec   
> > 
> > Test4: Random Writers
> > =====================
> > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16>]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
> > 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
> > 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
> > 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
> > 16  66KiB/s     22KiB/s     829KiB/s    1308 msec   
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
> > 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
> > 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
> > 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
> > 16  71KiB/s     29KiB/s     814KiB/s    1457 msec   
> > 
> > Notes:
> >  - It does not look like anything has changed significantly.
> > 
> > Previous versions of the patches were posted here.
> > ------------------------------------------------
> > 
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > (V6) http://lkml.org/lkml/2009/7/2/369
> > (V7) http://lkml.org/lkml/2009/7/24/253
> > (V8) http://lkml.org/lkml/2009/8/16/204
> > (V9) http://lkml.org/lkml/2009/8/28/327
> > 
> > Thanks
> > Vivek
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers



