[CRIU] aio_estimate_nr_reqs() pretends it knows too much about the kernel internals

Pavel Emelyanov xemul at parallels.com
Sun Apr 19 22:02:29 PDT 2015


On 04/16/2015 06:28 PM, Oleg Nesterov wrote:

Hi.

Sorry for the late response :) By the time I got to this e-mail it
was Friday evening so I decided to postpone the answer till working
days.

> First of all, I am sorry. I initiated the "--ext-mount-map auto"
> discussion and the disappeared. This is because I hit yet another
> problem, and it took me a lot of time to understand whats going on
> (mostly because of typo in the debugging code I added to criu ;).
> 
> Lets look at
> 
> 	static unsigned int aio_estimate_nr_reqs(unsigned int k_max_reqs)
> 	{
> 		/*
> 		 * Kernel does
> 		 *
> 		 * nr_reqs = max(nr_reqs, nr_cpus * 4)
> 		 * nr_reqs *= 2
> 		 * nr_reqs += 2
> 		 * ring = roundup(sizeof(head) + nr_reqs * sizeof(req))
> 		 * nr_reqs = (ring - sizeof(head)) / sizeof(req)
> 		 *
> 		 * And the k_max_reqs here is the resulting value.
> 		 *
> 		 * We need to get the initial nr_reqs that would grow
> 		 * up back to the k_max_reqs.
> 		 */
> 
> 		return (k_max_reqs - 2) / 2;
> 	}
> 
> cough... I will not try to comment this hack ;) Yes, yes, I understand
> that it is hardly possible to blame CRIU for this beauty.

Well, yes. AIO kernel API is ... tricky.

> And it mostly works. But on my my testing machine running v3.10 kernel.
> Because it doesn't do "nr_reqs *= 2" mentioned above and thus (depending
> on initial nr_events passed to sys_io_setup) "criu restore" can succeed
> or fail.
> 
> This is "fixed" by the
> 
> 	-	return (k_max_reqs - 2) / 2;
> 	+	return k_max_reqs - 2;
> 
> but somehow I am not sure you will accept this change if I send it to
> criu at openvz.org ;)
> 
> Any idea how we can improve this "estimate" logic?

I planned to use the convergence method :) When restoring we should create 
context for the "best estimate" and compare the resulted length. If equal, 
then we're done, if not -- halve/double the nr_reqs and try again.

> At least. Can we add something like CONFIG_AIO_DOUBLES_NR_EVENTS
> somewhere to avoid the out-of-tree patches?
> 
> 
> ===============================================================================
> And btw it doesn't look correct anyway. Because the kernel also does
> 
> 	nr_events = max(nr_events, num_possible_cpus() * 4);
> 
> before it doubles nr_events. Now suppose that an application does
> io_setup(1) on a single-cpu machine and then you migrate it to another
> machine with 4096 CPUs.

Then restore goes bad :(

> So perhaps it makes sense to change __export_restore_task() ? perhaps
> it should do something like
> 
> 		struct aio_ring *r;
> 		unsigned long len;
> 
> 		ret = sys_io_setup(raio->nr_req, &ctx);
> 
> 		r = (void *)ctx;
> 		BUG_ON(r->nr < raio->nr_req);
> 
> 		/*
> 		 * If we failed to get the proper nr_req right and
> 		 * created a larger ring, the new (enlarged) vma can
> 		 * break the layout of other vma's we need to restore.
> 		 */
> 		len = get_ring_len(ctx);
> 
> 		if (len > raio->len)
> 			pr_warn("OOPS. lets pray this is fine);
> 
> 		// remap whate we actually have, not raio->len
> 		sys_mremap(..., len, len, ...);
> 
> what do you think?

Agree as the first approximation. In stead of pr_warn() we probably should
do destroy context and goto again with halved nr_reqs.

> (yes, get_ring_len() is not trivial in the context of __export_restore_task,
>  this is another complication).
> 
> 
> ===============================================================================
> Finally. Perhaps something like below makes sense? With this patch "criu check"
> _fails_ on my testing machine (without the oneliner fix above), and this is good.

Unless the "convergence" works, this would make perfect sense :)

-- Pavel



More information about the CRIU mailing list