[CRIU] aio_estimate_nr_reqs() pretends it knows too much about the kernel internals

Oleg Nesterov oleg at redhat.com
Mon Apr 20 10:07:59 PDT 2015


On 04/20, Pavel Emelyanov wrote:
>
> On 04/16/2015 06:28 PM, Oleg Nesterov wrote:
> >
> > And it mostly works. But on my my testing machine running v3.10 kernel.
> > Because it doesn't do "nr_reqs *= 2" mentioned above and thus (depending
> > on initial nr_events passed to sys_io_setup) "criu restore" can succeed
> > or fail.
> >
> > This is "fixed" by the
> >
> > 	-	return (k_max_reqs - 2) / 2;
> > 	+	return k_max_reqs - 2;
> >
> > but somehow I am not sure you will accept this change if I send it to
> > criu at openvz.org ;)
> >
> > Any idea how we can improve this "estimate" logic?
>
> I planned to use the convergence method :) When restoring we should create
> context for the "best estimate" and compare the resulted length. If equal,
> then we're done, if not -- halve/double the nr_reqs and try again.

I thought about this too. But again, this can't work if a 4096 CPU's machine
insists on the "large" ->nr_req. You simply can not create the AIO mapping
with the same length/nr_req in general.

But I agree, "convergence" makes sense anyway. 

> > And btw it doesn't look correct anyway. Because the kernel also does
> >
> > 	nr_events = max(nr_events, num_possible_cpus() * 4);
> >
> > before it doubles nr_events. Now suppose that an application does
> > io_setup(1) on a single-cpu machine and then you migrate it to another
> > machine with 4096 CPUs.
>
> Then restore goes bad :(
>
> > So perhaps it makes sense to change __export_restore_task() ? perhaps
> > it should do something like
> >
> > 		struct aio_ring *r;
> > 		unsigned long len;
> >
> > 		ret = sys_io_setup(raio->nr_req, &ctx);
> >
> > 		r = (void *)ctx;
> > 		BUG_ON(r->nr < raio->nr_req);
> >
> > 		/*
> > 		 * If we failed to get the proper nr_req right and
> > 		 * created a larger ring, the new (enlarged) vma can
> > 		 * break the layout of other vma's we need to restore.
> > 		 */
> > 		len = get_ring_len(ctx);
> >
> > 		if (len > raio->len)
> > 			pr_warn("OOPS. lets pray this is fine);
> >
> > 		// remap whate we actually have, not raio->len
> > 		sys_mremap(..., len, len, ...);
> >
> > what do you think?
>
> Agree as the first approximation. In stead of pr_warn() we probably should
> do destroy context and goto again with halved nr_reqs.

OK... but see above, we can't assume that we always can re-create vma with
the same nr_reqs/length.

And if length grows, "restore" can obviously fail later.

All I can say is that io_setup() is a very nice API. I was also thinking
about the (kernel) patch below (to avoid mremap), but I am sure it will
be nacked. At least I would nack it if I were a maintainer ;)

Oleg.


--- a/fs/aio.c
+++ b/fs/aio.c
@@ -396,12 +396,12 @@ static const struct address_space_operations aio_ctx_aops = {
 #endif
 };
 
-static int aio_setup_ring(struct kioctx *ctx)
+static int aio_setup_ring(struct kioctx *ctx, unsigned long addr)
 {
 	struct aio_ring *ring;
 	unsigned nr_events = ctx->max_reqs;
 	struct mm_struct *mm = current->mm;
-	unsigned long size, unused;
+	unsigned long size, unused, mmapf;
 	int nr_pages;
 	int i;
 	struct file *file;
@@ -460,7 +460,10 @@ static int aio_setup_ring(struct kioctx *ctx)
 	pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
 
 	down_write(&mm->mmap_sem);
-	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
+	mmapf = MAP_SHARED;
+	if (addr)
+		mmapf |= MAP_FIXED;
+	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, addr, ctx->mmap_size,
 				       PROT_READ | PROT_WRITE,
 				       MAP_SHARED, 0, &unused);
 	up_write(&mm->mmap_sem);
@@ -649,7 +652,7 @@ static void aio_nr_sub(unsigned nr)
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
-static struct kioctx *ioctx_alloc(unsigned nr_events)
+static struct kioctx *ioctx_alloc(unsigned nr_events, unsigned long addr)
 {
 	struct mm_struct *mm = current->mm;
 	struct kioctx *ctx;
@@ -703,7 +706,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	if (!ctx->cpu)
 		goto err;
 
-	err = aio_setup_ring(ctx);
+	err = aio_setup_ring(ctx, addr);
 	if (err < 0)
 		goto err;
 
@@ -1296,7 +1299,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 		goto out;
 	}
 
-	ioctx = ioctx_alloc(nr_events);
+	ioctx = ioctx_alloc(nr_events, ctx);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
 		ret = put_user(ioctx->user_id, ctxp);



More information about the CRIU mailing list