[CRIU] aio_estimate_nr_reqs() pretends it knows too much about the kernel internals
Oleg Nesterov
oleg at redhat.com
Thu Apr 16 08:28:01 PDT 2015
First of all, I am sorry. I initiated the "--ext-mount-map auto"
discussion and the disappeared. This is because I hit yet another
problem, and it took me a lot of time to understand whats going on
(mostly because of typo in the debugging code I added to criu ;).
Lets look at
static unsigned int aio_estimate_nr_reqs(unsigned int k_max_reqs)
{
/*
* Kernel does
*
* nr_reqs = max(nr_reqs, nr_cpus * 4)
* nr_reqs *= 2
* nr_reqs += 2
* ring = roundup(sizeof(head) + nr_reqs * sizeof(req))
* nr_reqs = (ring - sizeof(head)) / sizeof(req)
*
* And the k_max_reqs here is the resulting value.
*
* We need to get the initial nr_reqs that would grow
* up back to the k_max_reqs.
*/
return (k_max_reqs - 2) / 2;
}
cough... I will not try to comment this hack ;) Yes, yes, I understand
that it is hardly possible to blame CRIU for this beauty.
And it mostly works. But on my my testing machine running v3.10 kernel.
Because it doesn't do "nr_reqs *= 2" mentioned above and thus (depending
on initial nr_events passed to sys_io_setup) "criu restore" can succeed
or fail.
This is "fixed" by the
- return (k_max_reqs - 2) / 2;
+ return k_max_reqs - 2;
but somehow I am not sure you will accept this change if I send it to
criu at openvz.org ;)
Any idea how we can improve this "estimate" logic?
At least. Can we add something like CONFIG_AIO_DOUBLES_NR_EVENTS
somewhere to avoid the out-of-tree patches?
===============================================================================
And btw it doesn't look correct anyway. Because the kernel also does
nr_events = max(nr_events, num_possible_cpus() * 4);
before it doubles nr_events. Now suppose that an application does
io_setup(1) on a single-cpu machine and then you migrate it to another
machine with 4096 CPUs.
So perhaps it makes sense to change __export_restore_task() ? perhaps
it should do something like
struct aio_ring *r;
unsigned long len;
ret = sys_io_setup(raio->nr_req, &ctx);
r = (void *)ctx;
BUG_ON(r->nr < raio->nr_req);
/*
* If we failed to get the proper nr_req right and
* created a larger ring, the new (enlarged) vma can
* break the layout of other vma's we need to restore.
*/
len = get_ring_len(ctx);
if (len > raio->len)
pr_warn("OOPS. lets pray this is fine);
// remap whate we actually have, not raio->len
sys_mremap(..., len, len, ...);
what do you think?
(yes, get_ring_len() is not trivial in the context of __export_restore_task,
this is another complication).
===============================================================================
Finally. Perhaps something like below makes sense? With this patch "criu check"
_fails_ on my testing machine (without the oneliner fix above), and this is good.
Oleg.
diff --git a/aio.c b/aio.c
index 9965efd..ccda7f5 100644
--- a/aio.c
+++ b/aio.c
@@ -44,7 +44,7 @@ void free_aios(MmEntry *mme)
}
}
-static unsigned int aio_estimate_nr_reqs(unsigned int k_max_reqs)
+unsigned int aio_estimate_nr_reqs(unsigned int k_max_reqs)
{
/*
* Kernel does
diff --git a/cr-check.c b/cr-check.c
index 3518773..3d8aeb7 100644
--- a/cr-check.c
+++ b/cr-check.c
@@ -631,14 +631,23 @@ notfound:
return 0;
}
+struct aio_ring {
+ unsigned id; /* kernel internal index number */
+ unsigned nr; /* number of io_events */
+ // everything else doesn't matter
+};
+
+unsigned int aio_estimate_nr_reqs(unsigned int k_max_reqs);
+
static int check_aio_remap(void)
{
aio_context_t ctx = 0;
unsigned long len;
+ unsigned int nr;
void *naddr;
int r;
- if (sys_io_setup(16, &ctx) < 0) {
+ if (sys_io_setup(1024, &ctx) < 0) {
pr_err("No AIO syscall");
return -1;
}
@@ -668,6 +677,20 @@ static int check_aio_remap(void)
pr_warn("Skipping unsupported AIO remap\n");
}
+
+ nr = ((struct aio_ring *)ctx)->nr;
+ nr = aio_estimate_nr_reqs(nr);
+ ctx = 0;
+ if (sys_io_setup(nr, &ctx) < 0) {
+ pr_err("XXX 1\n");
+ return -1;
+ }
+
+ if (len != get_ring_len((unsigned long) ctx)) {
+ pr_err("XXX 2\n");
+ return -1;
+ }
+
return 0;
}
More information about the CRIU
mailing list