[Devel] [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
Andrea Righi
righi.andrea at gmail.com
Wed Aug 27 09:07:32 PDT 2008
The objective of the i/o controller is to improve i/o performance
predictability of different cgroups sharing the same block devices.
Unlike other priority/weight-based solutions, this controller explicitly
chokes applications' requests that directly (or indirectly) generate i/o
activity in the system.
The direct bandwidth and/or iops limiting method has the advantage of improving
the performance predictability at the cost of reducing, in general, the overall
performance of the system (in terms of throughput).
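To illustrate the idea of direct bandwidth limiting, here is a minimal
userspace sketch in Python. It is not the code from the patchset: the function
name throttle_delay() and its parameters are hypothetical, and it only shows
the arithmetic of delaying a submitter so that its average rate stays under a
configured limit.

```python
def throttle_delay(bytes_submitted: int, elapsed_s: float, limit_bps: float) -> float:
    """Return how long (in seconds) a submitter should sleep.

    Hypothetical model: the submitter has pushed `bytes_submitted` bytes in
    `elapsed_s` seconds and must not exceed `limit_bps` bytes per second on
    average. The delay is the extra time needed to bring the average rate
    down to the limit, or 0.0 if the submitter is already under it.
    """
    if limit_bps <= 0:
        raise ValueError("limit_bps must be positive")
    # Minimum time the transfer is allowed to take at the configured rate.
    min_time_s = bytes_submitted / limit_bps
    return max(0.0, min_time_s - elapsed_s)
```

For example, a task that submitted 10 MiB in one second against a 5 MiB/s
limit would have to sleep for one more second, while a task under the limit is
not delayed at all. This is where the throughput cost comes from: the device
may sit idle while the task sleeps.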
Detailed information about the design, its goals and usage is given in the
documentation.
Patchset against 2.6.27-rc1-mm1.
The all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/
This patchset is an experimental implementation: it includes functional
differences with respect to the previous versions (see the changelog below),
and I haven't done much testing yet. So, comments are really welcome.
Changelog: (v8 -> v9)
* introduce struct res_counter_ratelimit as a generic structure to implement
throttling-based cgroup subsystems
* remove the throttling hooks from the page cache (set_page_dirty): set a
single throttling hook in submit_bio() both for read and write operations; a
generic process that is dirtying pages on a limited block device (for the
cgroup it belongs to) is forced to flush the same amount of pages back to the
block device (in this way write operations are forced to occur in the same IO
context of the process that actually generated the IO)
* collect per cgroup, block device and task throttling statistics (throttle
counter and total time slept for throttling) and export them to userspace
through blockio.throttlcnt (in the cgroup filesystem) and
/proc/PID/io-throttle-stat (per-task statistics)
* fair throttling: simple attempt to distribute the sleeps equally among all
the tasks belonging to the same cgroup; instead of imposing a sleep on the
first task that exceeds the IO limits, the time to sleep is divided by the
number of tasks present in the same cgroup
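The fair throttling item in the changelog above can be sketched as follows.
This is a simplified Python model, not the kernel code: fair_sleep() is a
hypothetical helper, and it assumes the full sleep time needed to respect the
cgroup limit has already been computed.

```python
def fair_sleep(total_sleep_s: float, nr_tasks: int) -> float:
    """Return the sleep imposed on the task that hit the cgroup limit.

    Hypothetical model of the "fair throttling" item: instead of making the
    first task that exceeds the limit sleep for the whole `total_sleep_s`,
    the sleep is divided by the number of tasks in the cgroup.
    """
    if nr_tasks < 1:
        raise ValueError("a cgroup must contain at least one task here")
    return total_sleep_s / nr_tasks
```

With a required sleep of 2 seconds and 4 tasks in the cgroup, the task that
trips the limit sleeps 0.5 seconds instead of 2. The remaining delay is
expected to be paid by the other tasks as they exceed the limit in turn.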
TODO:
* Try to push down the throttling and implement it directly in the I/O
schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
to keep track of the right cgroup context. This approach could lead to more
memory consumption and increase the number of dirty pages (hard/slow to
reclaim pages) in the system, since the dirty-page ratio in memory is not
limited. This could even lead to potential OOM conditions, but these problems
can be resolved directly in the memory cgroup subsystem
* Handle I/O generated by kswapd: at the moment there's no control on the I/O
generated by kswapd; try to use the page_cgroup functionality of the memory
cgroup controller to track this kind of I/O and charge the right cgroup when
pages are swapped in/out
* Improve fair throttling: distribute the time to sleep among all the tasks of
a cgroup that exceeded the I/O limits, depending on the amount of I/O activity
generated in the past by each task (see task_io_accounting)
* Try to reduce the cost of calling cgroup_io_throttle() on every submit_bio();
this is not too expensive, but the call to task_subsys_state() surely has a
cost. A possible solution could be to temporarily account I/O in the current
task_struct and call cgroup_io_throttle() only every X MB of I/O, or every Y
I/O requests. Ideally both X and Y could be tuned at runtime by a userspace
tool
* Think about an alternative design for general purpose usage; special purpose
usage right now is restricted to improving I/O performance predictability and
evaluating more precise response timings for applications doing I/O. To a
large degree the block I/O bandwidth controller should implement a more
complex logic to better evaluate the real cost of I/O operations, depending
also on the particular block device profile (e.g. USB stick, optical drive,
hard disk, etc.). This would also allow I/O cost to be accounted appropriately
for seeky workloads, as opposed to large streaming workloads. Instead of
looking at the request stream and trying to predict how expensive the I/O will
be, a totally different approach could be to collect request timings (start
time / elapsed time) and, based on the collected information, try to estimate
the I/O cost and usage
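As a rough illustration of the TODO item about reducing the cost of calling
cgroup_io_throttle() on every submit_bio(), here is a userspace Python sketch
of batched accounting. The class BatchedAccounting and its fields are
hypothetical; max_bytes and max_requests play the role of the X and Y
thresholds mentioned above, and in the kernel the pending counters would live
in the task_struct.

```python
from dataclasses import dataclass


@dataclass
class BatchedAccounting:
    """Hypothetical per-task accumulator for not-yet-charged I/O."""

    max_bytes: int         # "X": charge once this many bytes are pending
    max_requests: int      # "Y": charge once this many requests are pending
    pending_bytes: int = 0
    pending_requests: int = 0

    def account(self, nbytes: int) -> int:
        """Record one request.

        Returns the number of bytes that should be charged to the throttling
        code now, or 0 if the request was only accumulated locally.
        """
        self.pending_bytes += nbytes
        self.pending_requests += 1
        if (self.pending_bytes < self.max_bytes
                and self.pending_requests < self.max_requests):
            return 0  # below both thresholds: skip the expensive call
        charged = self.pending_bytes
        self.pending_bytes = 0
        self.pending_requests = 0
        return charged
```

With max_bytes set to 1 MiB and max_requests set to 3, three 4 KiB requests
result in a single charge of 12 KiB on the third request, while a single 2 MiB
request is charged immediately. The trade-off is that a task can overshoot its
limit by up to one batch before it is throttled.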
-Andrea