[CRIU] aio_estimate_nr_reqs() pretends it knows too much about the kernel internals

Oleg Nesterov oleg at redhat.com
Thu Apr 16 08:28:01 PDT 2015


First of all, I am sorry. I initiated the "--ext-mount-map auto"
discussion and the disappeared. This is because I hit yet another
problem, and it took me a lot of time to understand whats going on
(mostly because of typo in the debugging code I added to criu ;).

Lets look at

	static unsigned int aio_estimate_nr_reqs(unsigned int k_max_reqs)
	{
		/*
		 * Kernel does
		 *
		 * nr_reqs = max(nr_reqs, nr_cpus * 4)
		 * nr_reqs *= 2
		 * nr_reqs += 2
		 * ring = roundup(sizeof(head) + nr_reqs * sizeof(req))
		 * nr_reqs = (ring - sizeof(head)) / sizeof(req)
		 *
		 * And the k_max_reqs here is the resulting value.
		 *
		 * We need to get the initial nr_reqs that would grow
		 * up back to the k_max_reqs.
		 */

		return (k_max_reqs - 2) / 2;
	}

cough... I will not try to comment this hack ;) Yes, yes, I understand
that it is hardly possible to blame CRIU for this beauty.

And it mostly works. But on my my testing machine running v3.10 kernel.
Because it doesn't do "nr_reqs *= 2" mentioned above and thus (depending
on initial nr_events passed to sys_io_setup) "criu restore" can succeed
or fail.

This is "fixed" by the

	-	return (k_max_reqs - 2) / 2;
	+	return k_max_reqs - 2;

but somehow I am not sure you will accept this change if I send it to
criu at openvz.org ;)

Any idea how we can improve this "estimate" logic?

At least. Can we add something like CONFIG_AIO_DOUBLES_NR_EVENTS
somewhere to avoid the out-of-tree patches?


===============================================================================
And btw it doesn't look correct anyway. Because the kernel also does

	nr_events = max(nr_events, num_possible_cpus() * 4);

before it doubles nr_events. Now suppose that an application does
io_setup(1) on a single-cpu machine and then you migrate it to another
machine with 4096 CPUs.

So perhaps it makes sense to change __export_restore_task() ? perhaps
it should do something like

		struct aio_ring *r;
		unsigned long len;

		ret = sys_io_setup(raio->nr_req, &ctx);

		r = (void *)ctx;
		BUG_ON(r->nr < raio->nr_req);

		/*
		 * If we failed to get the proper nr_req right and
		 * created a larger ring, the new (enlarged) vma can
		 * break the layout of other vma's we need to restore.
		 */
		len = get_ring_len(ctx);

		if (len > raio->len)
			pr_warn("OOPS. lets pray this is fine);

		// remap whate we actually have, not raio->len
		sys_mremap(..., len, len, ...);

what do you think?

(yes, get_ring_len() is not trivial in the context of __export_restore_task,
 this is another complication).


===============================================================================
Finally. Perhaps something like below makes sense? With this patch "criu check"
_fails_ on my testing machine (without the oneliner fix above), and this is good.

Oleg.

diff --git a/aio.c b/aio.c
index 9965efd..ccda7f5 100644
--- a/aio.c
+++ b/aio.c
@@ -44,7 +44,7 @@ void free_aios(MmEntry *mme)
 	}
 }
 
-static unsigned int aio_estimate_nr_reqs(unsigned int k_max_reqs)
+unsigned int aio_estimate_nr_reqs(unsigned int k_max_reqs)
 {
 	/*
 	 * Kernel does
diff --git a/cr-check.c b/cr-check.c
index 3518773..3d8aeb7 100644
--- a/cr-check.c
+++ b/cr-check.c
@@ -631,14 +631,23 @@ notfound:
 	return 0;
 }
 
+struct aio_ring {
+	unsigned        id;     /* kernel internal index number */
+	unsigned        nr;     /* number of io_events */
+	// everything else doesn't matter
+};
+
+unsigned int aio_estimate_nr_reqs(unsigned int k_max_reqs);
+
 static int check_aio_remap(void)
 {
 	aio_context_t ctx = 0;
 	unsigned long len;
+	unsigned int nr;
 	void *naddr;
 	int r;
 
-	if (sys_io_setup(16, &ctx) < 0) {
+	if (sys_io_setup(1024, &ctx) < 0) {
 		pr_err("No AIO syscall");
 		return -1;
 	}
@@ -668,6 +677,20 @@ static int check_aio_remap(void)
 			pr_warn("Skipping unsupported AIO remap\n");
 	}
 
+
+	nr = ((struct aio_ring *)ctx)->nr;
+	nr = aio_estimate_nr_reqs(nr);
+	ctx = 0;
+	if (sys_io_setup(nr, &ctx) < 0) {
+		pr_err("XXX 1\n");
+		return -1;
+	}
+
+	if (len != get_ring_len((unsigned long) ctx)) {
+		pr_err("XXX 2\n");
+		return -1;
+	}
+
 	return 0;
 }
 



More information about the CRIU mailing list