[CRIU] aio_estimate_nr_reqs() pretends it knows too much about the kernel internals
Pavel Emelyanov
xemul at parallels.com
Sun Apr 19 22:02:29 PDT 2015
On 04/16/2015 06:28 PM, Oleg Nesterov wrote:
Hi.
Sorry for the late response :) By the time I got to this e-mail it
was Friday evening so I decided to postpone the answer till working
days.
> First of all, I am sorry. I initiated the "--ext-mount-map auto"
> discussion and the disappeared. This is because I hit yet another
> problem, and it took me a lot of time to understand whats going on
> (mostly because of typo in the debugging code I added to criu ;).
>
> Lets look at
>
> static unsigned int aio_estimate_nr_reqs(unsigned int k_max_reqs)
> {
> /*
> * Kernel does
> *
> * nr_reqs = max(nr_reqs, nr_cpus * 4)
> * nr_reqs *= 2
> * nr_reqs += 2
> * ring = roundup(sizeof(head) + nr_reqs * sizeof(req))
> * nr_reqs = (ring - sizeof(head)) / sizeof(req)
> *
> * And the k_max_reqs here is the resulting value.
> *
> * We need to get the initial nr_reqs that would grow
> * up back to the k_max_reqs.
> */
>
> return (k_max_reqs - 2) / 2;
> }
>
> cough... I will not try to comment this hack ;) Yes, yes, I understand
> that it is hardly possible to blame CRIU for this beauty.
Well, yes. AIO kernel API is ... tricky.
> And it mostly works. But on my my testing machine running v3.10 kernel.
> Because it doesn't do "nr_reqs *= 2" mentioned above and thus (depending
> on initial nr_events passed to sys_io_setup) "criu restore" can succeed
> or fail.
>
> This is "fixed" by the
>
> - return (k_max_reqs - 2) / 2;
> + return k_max_reqs - 2;
>
> but somehow I am not sure you will accept this change if I send it to
> criu at openvz.org ;)
>
> Any idea how we can improve this "estimate" logic?
I planned to use the convergence method :) When restoring we should create
context for the "best estimate" and compare the resulted length. If equal,
then we're done, if not -- halve/double the nr_reqs and try again.
> At least. Can we add something like CONFIG_AIO_DOUBLES_NR_EVENTS
> somewhere to avoid the out-of-tree patches?
>
>
> ===============================================================================
> And btw it doesn't look correct anyway. Because the kernel also does
>
> nr_events = max(nr_events, num_possible_cpus() * 4);
>
> before it doubles nr_events. Now suppose that an application does
> io_setup(1) on a single-cpu machine and then you migrate it to another
> machine with 4096 CPUs.
Then restore goes bad :(
> So perhaps it makes sense to change __export_restore_task() ? perhaps
> it should do something like
>
> struct aio_ring *r;
> unsigned long len;
>
> ret = sys_io_setup(raio->nr_req, &ctx);
>
> r = (void *)ctx;
> BUG_ON(r->nr < raio->nr_req);
>
> /*
> * If we failed to get the proper nr_req right and
> * created a larger ring, the new (enlarged) vma can
> * break the layout of other vma's we need to restore.
> */
> len = get_ring_len(ctx);
>
> if (len > raio->len)
> pr_warn("OOPS. lets pray this is fine);
>
> // remap whate we actually have, not raio->len
> sys_mremap(..., len, len, ...);
>
> what do you think?
Agree as the first approximation. In stead of pr_warn() we probably should
do destroy context and goto again with halved nr_reqs.
> (yes, get_ring_len() is not trivial in the context of __export_restore_task,
> this is another complication).
>
>
> ===============================================================================
> Finally. Perhaps something like below makes sense? With this patch "criu check"
> _fails_ on my testing machine (without the oneliner fix above), and this is good.
Unless the "convergence" works, this would make perfect sense :)
-- Pavel
More information about the CRIU
mailing list