[CRIU] [PATCH v1] shmem: use mincore page status bit with anon shared mem dumping

Sun Aug 28 07:46:06 PDT 2016

Hi Eugene,

On Sun, Aug 28, 2016 at 01:23:23AM +0300, Eugene Batalov wrote:
> It turned out that ashmem can have pages with non zero content

Just a small nitpick about the "ashmem" term.
There is an "ashmem" driver used by Android, which also implements
anonymous shared memory, but in slightly different way.
AFAIK, CRIU does not support the Android ashmem, so, IMHO, it'll be better
to use "anon shmem" to avoid mis-undestandings...

> and with both PME_PRESENT and PME_SWAP bits unset in all its vmas
> in the whole ps tree.
> Such case is reproduced in issue #209:
> 1. Dump ps tree with ashmem filled using datagen.
> 2. Restore ps tree. ashmem content is restored
>    in open_shmem(). fd is created for it and it is
>    unmapped from restorer process.
> 3. ashmem vma is mapped in restore_mapping() of pie restorer context.
>    ashmem content is already initialized to non zero content
>    but restored process doesn't touch its newly mapped vma.
> 4. Run CRIU dump again. All the pages of ashmem vmas have
>    PME_PRESENT and PME_SWAP bits unset and we don't put
>    vma pages to dump.
> 
> So if we filter ashmem pages using PME_PRESENT and PME_SWAP bits
> the same way as we do it for anon private mem then we have a bug.
> PME_PRESENT and PME_SWAP bits work for anon private mem because
> at least one process would restore content of private anon vma
> in its own address space thus PME bits will be set and pages
> will be damped.
> 
> We can't just stop using PME_PRESENT and PME_SWAP bits and dump all
> non soft dirty and non zero pfn pages. In this case each 1Gb of
> mapped and not used ashmem vma will go to dump. This is too bad.
> 
> To fix the bug in this patch we use mincore bits to finally
> understand should we dump page or not. mincore bits show page
> usage status better because mincore performs deeper checking of
> internal in-kernel state. PME bits filling is based only on
> process page table.
> 
> Using mincore has a drawback. It doesn't work when page is in swap.
> But it's ok for now because mincore was used before we started using
> PME bits. Also mincore doesn't break page changes tracking
> functionality for ashmem that we have now.
> 
> This bug can be fixed in another way. For example we can make ashmem
> restoration work similar to anon private mem restoration.
> But this fix looks much harder to implement.
> 
> Signed-off-by: Eugene Batalov <eabatalov89 at gmail.com>
> ---
>  criu/shmem.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/criu/shmem.c b/criu/shmem.c
> index a21e8c1..3a2e4d1 100644
> --- a/criu/shmem.c
> +++ b/criu/shmem.c
> @@ -607,6 +607,7 @@ static int dump_one_shmem(struct shmem_info *si)
>  	struct page_pipe *pp;
>  	struct page_xfer xfer;
>  	int err, ret = -1, fd;
> +	unsigned char *mc_map = NULL;
>  	void *addr = NULL;
>  	unsigned long pfn, nrpages;
> 
> @@ -625,6 +626,14 @@ static int dump_one_shmem(struct shmem_info *si)
>  	}
> 
>  	nrpages = (si->size + PAGE_SIZE - 1) / PAGE_SIZE;
> +	mc_map = xmalloc(nrpages * sizeof(*mc_map));
> +	if (!mc_map)
> +		goto err_unmap;
> +	/* We can't rely only on PME bits for ashmem */
> +	err = mincore(addr, si->size, mc_map);
> +	if (err)
> +		goto err_unmap;
> +
>  	pp = create_page_pipe((nrpages + 1) / 2, NULL, PP_CHUNK_MODE);
>  	if (!pp)
>  		goto err_unmap;
> @@ -638,7 +647,7 @@ static int dump_one_shmem(struct shmem_info *si)
>  		unsigned long pgaddr;
> 
>  		pgstate = get_pstate(si->pstate_map, pfn);
> -		if (pgstate == PST_DONT_DUMP)
> +		if ((pgstate == PST_DONT_DUMP) && !(mc_map[pfn] & PAGE_RSS))
>  			continue;
> 
>  		pgaddr = (unsigned long)addr + pfn * PAGE_SIZE;
> @@ -669,6 +678,7 @@ err_pp:
>  err_unmap:
>  	munmap(addr,  si->size);
>  err:
> +	xfree(mc_map);
>  	return ret;
>  }
> 
> -- 
> 1.9.1
> 
> _______________________________________________
> CRIU mailing list
> CRIU at openvz.org
> https://lists.openvz.org/mailman/listinfo/criu
>