[CRIU] aio_estimate_nr_reqs() pretends it knows too much about the kernel internals
Oleg Nesterov
oleg at redhat.com
Mon Apr 20 10:07:59 PDT 2015
On 04/20, Pavel Emelyanov wrote:
>
> On 04/16/2015 06:28 PM, Oleg Nesterov wrote:
> >
> > And it mostly works. But on my my testing machine running v3.10 kernel.
> > Because it doesn't do "nr_reqs *= 2" mentioned above and thus (depending
> > on initial nr_events passed to sys_io_setup) "criu restore" can succeed
> > or fail.
> >
> > This is "fixed" by the
> >
> > - return (k_max_reqs - 2) / 2;
> > + return k_max_reqs - 2;
> >
> > but somehow I am not sure you will accept this change if I send it to
> > criu at openvz.org ;)
> >
> > Any idea how we can improve this "estimate" logic?
>
> I planned to use the convergence method :) When restoring we should create
> context for the "best estimate" and compare the resulted length. If equal,
> then we're done, if not -- halve/double the nr_reqs and try again.
I thought about this too. But again, this can't work if a 4096 CPU's machine
insists on the "large" ->nr_req. You simply can not create the AIO mapping
with the same length/nr_req in general.
But I agree, "convergence" makes sense anyway.
> > And btw it doesn't look correct anyway. Because the kernel also does
> >
> > nr_events = max(nr_events, num_possible_cpus() * 4);
> >
> > before it doubles nr_events. Now suppose that an application does
> > io_setup(1) on a single-cpu machine and then you migrate it to another
> > machine with 4096 CPUs.
>
> Then restore goes bad :(
>
> > So perhaps it makes sense to change __export_restore_task() ? perhaps
> > it should do something like
> >
> > struct aio_ring *r;
> > unsigned long len;
> >
> > ret = sys_io_setup(raio->nr_req, &ctx);
> >
> > r = (void *)ctx;
> > BUG_ON(r->nr < raio->nr_req);
> >
> > /*
> > * If we failed to get the proper nr_req right and
> > * created a larger ring, the new (enlarged) vma can
> > * break the layout of other vma's we need to restore.
> > */
> > len = get_ring_len(ctx);
> >
> > if (len > raio->len)
> > pr_warn("OOPS. lets pray this is fine);
> >
> > // remap whate we actually have, not raio->len
> > sys_mremap(..., len, len, ...);
> >
> > what do you think?
>
> Agree as the first approximation. In stead of pr_warn() we probably should
> do destroy context and goto again with halved nr_reqs.
OK... but see above, we can't assume that we always can re-create vma with
the same nr_reqs/length.
And if length grows, "restore" can obviously fail later.
All I can say is that io_setup() is a very nice API. I was also thinking
about the (kernel) patch below (to avoid mremap), but I am sure it will
be nacked. At least I would nack it if I were a maintainer ;)
Oleg.
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -396,12 +396,12 @@ static const struct address_space_operations aio_ctx_aops = {
#endif
};
-static int aio_setup_ring(struct kioctx *ctx)
+static int aio_setup_ring(struct kioctx *ctx, unsigned long addr)
{
struct aio_ring *ring;
unsigned nr_events = ctx->max_reqs;
struct mm_struct *mm = current->mm;
- unsigned long size, unused;
+ unsigned long size, unused, mmapf;
int nr_pages;
int i;
struct file *file;
@@ -460,7 +460,10 @@ static int aio_setup_ring(struct kioctx *ctx)
pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
down_write(&mm->mmap_sem);
- ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
+ mmapf = MAP_SHARED;
+ if (addr)
+ mmapf |= MAP_FIXED;
+ ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, addr, ctx->mmap_size,
PROT_READ | PROT_WRITE,
MAP_SHARED, 0, &unused);
up_write(&mm->mmap_sem);
@@ -649,7 +652,7 @@ static void aio_nr_sub(unsigned nr)
/* ioctx_alloc
* Allocates and initializes an ioctx. Returns an ERR_PTR if it failed.
*/
-static struct kioctx *ioctx_alloc(unsigned nr_events)
+static struct kioctx *ioctx_alloc(unsigned nr_events, unsigned long addr)
{
struct mm_struct *mm = current->mm;
struct kioctx *ctx;
@@ -703,7 +706,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
if (!ctx->cpu)
goto err;
- err = aio_setup_ring(ctx);
+ err = aio_setup_ring(ctx, addr);
if (err < 0)
goto err;
@@ -1296,7 +1299,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
goto out;
}
- ioctx = ioctx_alloc(nr_events);
+ ioctx = ioctx_alloc(nr_events, ctx);
ret = PTR_ERR(ioctx);
if (!IS_ERR(ioctx)) {
ret = put_user(ioctx->user_id, ctxp);
More information about the CRIU
mailing list