[CRIU] [PATCH v2 12/14] files: Make tasks set their own service_fd_base

Andrei Vagin avagin at virtuozzo.com
Sat Dec 30 19:48:10 MSK 2017


On Sat, Dec 30, 2017 at 12:22:32PM +0300, Kirill Tkhai wrote:
> On 29.12.2017 22:59, Andrei Vagin wrote:
> >>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
> >>> index 1feabfa9f..d2517b835 100644
> >>> --- a/test/zdtm/static/env00.c
> >>> +++ b/test/zdtm/static/env00.c
> >>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
> >>>  
> >>>  int main(int argc, char **argv)
> >>>  {
> >>> +	int i;
> >>> +	pid_t pid;
> >>>  	char *env;
> >>>  
> >>>  	test_init(argc, argv);
> >>>  
> >>> +	for (i = 0; i < 10; i++) {
> >>> +		pid = fork();
> >>> +		if (pid == 0) {
> >>> +			while (1)
> >>> +				sleep(1);
> >>> +			return 0;
> >>> +		}
> >>> +	}
> >>> +
> >>> +//	dup2(1, 1000000);
> >>> +
> >>>  	if (setenv(envname, test_author, 1)) {
> >>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
> >>>  		exit(1);
> >>>
> >>> before dump:
> >>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>> 11849728
> >>>
> >>> after restore:
> >>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>> 20877312
> >>>
> >>> Criu restore changes many in-kernel structures, and this is a global problem, not
> >>> about sfds.
> >>
> >> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
> >> be 100MB. This is not a global problem, it is the problem that we move
> >> service descriptors in child processes.
> > 
> > I created a small patch for zdtm.py to get kmem before and after c/r:
> > git clone https://github.com/avagin/criu -b 2195
> > 
> > And here are the steps I follow in my experiments:
> > 
> > [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> > [root@fc24 criu]# ulimit -n 1024
> > [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> > === Run 1/1 ================ zdtm/static/env00
> > 
> > ========================== Run zdtm/static/env00 in h ==========================
> > Start test
> > ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> > memory.kmem.usage_in_bytes = 9920512
> > Pause at pre-dump. Press any key to continue.
> > Run criu dump
> > 
> > Unable to kill 36: [Errno 3] No such process
> > Pause at pre-restore. Press any key to continue.
> > Run criu restore
> > Pause at post-restore. Press any key to continue.
> > memory.kmem.usage_in_bytes = 22913024
> > before: memory.kmem.usage_in_bytes =              9920512
> > after:  memory.kmem.usage_in_bytes =             22913024 (230.97%)
> > Send the 15 signal to  36
> > Wait for zdtm/static/env00(36) to die for 0.100000
> > Unable to kill 36: [Errno 3] No such process
> > Removing dump/zdtm/static/env00/36
> > ========================= Test zdtm/static/env00 PASS ==========================
> > 
> > [root@fc24 criu]#
> > [root@fc24 criu]# ulimit -n $((1 << 20))
> > [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> > [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> > === Run 1/1 ================ zdtm/static/env00
> > 
> > ========================== Run zdtm/static/env00 in h ==========================
> > Start test
> > ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> > memory.kmem.usage_in_bytes = 19714048
> > Pause at pre-dump. Press any key to continue.
> > Run criu dump
> > Unable to kill 36: [Errno 3] No such process
> > Pause at pre-restore. Press any key to continue.
> > Run criu restore
> > Pause at post-restore. Press any key to continue.
> > memory.kmem.usage_in_bytes = 914178048
> > before: memory.kmem.usage_in_bytes =             19714048
> > after:  memory.kmem.usage_in_bytes =            914178048 (4637.19%)
> > Send the 15 signal to  36
> > Wait for zdtm/static/env00(36) to die for 0.100000
> > Unable to kill 36: [Errno 3] No such process
> > Removing dump/zdtm/static/env00/36
> > ========================= Test zdtm/static/env00 PASS ==========================
> > 
> > When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
> > the test has only descriptors with small numbers. In this case,
> > kmem consumption increases by about 130%. We need to investigate this,
> > but it isn't so critical if we compare it with the next case.
> > 
> > When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
> > creates descriptor 1000000, and kmem consumption after c/r increases by
> > 4537.19%. This is a real problem that we have to solve with high priority.
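
A minimal sketch of the two cases above, assuming the Linux behaviour that
dup2() to a descriptor number at or above the soft RLIMIT_NOFILE fails with
EBADF. The helper name try_dup_under_limit() is made up for illustration and
is not part of zdtm or CRIU; it uses /dev/null as the source descriptor
instead of fd 1, and expects target_fd to be unused by the caller.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Hypothetical helper: temporarily set the soft RLIMIT_NOFILE to `nofile`,
 * try dup2() onto `target_fd`, then restore the old limit.
 * Returns 0 if dup2() succeeded, or the errno value of the failing call.
 */
int try_dup_under_limit(rlim_t nofile, int target_fd)
{
	struct rlimit old, lim;
	int src, ret = 0;

	if (getrlimit(RLIMIT_NOFILE, &old))
		return errno;

	lim = old;
	lim.rlim_cur = nofile;	/* fails if above the hard limit */
	if (setrlimit(RLIMIT_NOFILE, &lim))
		return errno;

	src = open("/dev/null", O_WRONLY);
	if (src < 0) {
		ret = errno;
	} else {
		if (dup2(src, target_fd) < 0)
			ret = errno;		/* EBADF when target_fd >= nofile */
		else if (target_fd != src)
			close(target_fd);	/* do not keep the high fd around */
		close(src);
	}

	setrlimit(RLIMIT_NOFILE, &old);	/* best effort restore */
	return ret;
}
```

With this helper, try_dup_under_limit(1024, 1000000) is expected to return
EBADF, which corresponds to the first run. The second run corresponds to
calling it with a limit of 1 << 20, which only works if the hard limit
allows it.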
> 
> It's not a problem of my patchset, it's a problem of the current engine. My patchset
> does not aim to rework the whole engine. It just improves
> the engine: it works better than the current engine in many cases and no worse
> in the rest. Yes, there are cases where the patchset behaves like
> the current engine, but there is no case where it behaves worse. It is strange to
> ask me to solve all the file and memory problems at once, isn't it?!

I like this work. Thank you for it. But I see nothing strange in
discussing the other problems of this code and trying to solve them now or
later.

> 
> The goals of the patchset are:
> 1) Disconnect prlimit(RLIMIT_NOFILE) and service fds placement,
> i.e. make the placement depend on task fds, not on prlimit().
> 2) Make people able to restore big fd numbers without increasing
> memory usage on all tasks in a container (as currently this requires
> increasing global limits).
> 
> #ulimit -n $((1 << 20))
> --- a/test/zdtm/static/env00.c
> +++ b/test/zdtm/static/env00.c
> @@ -25,14 +25,13 @@ int main(int argc, char **argv)
>         for (i = 0; i < 100; i++) {
>                 pid = fork();
>                 if (pid == 0) {
> +                       dup2(1, 10000);
>                         while (1)
>                                 sleep(1);
>                         return 0;
>                 }
>         }
>  
> -       dup2(1, 1000000);
> -
>         test_daemon();
>         test_waitsig();
> 
> Before patchset:
> before: memory.kmem.usage_in_bytes =             24076288
> after:  memory.kmem.usage_in_bytes =            921894912 (3829.06%)
> 
> After patchset:
> before: memory.kmem.usage_in_bytes =             23855104
> after:  memory.kmem.usage_in_bytes =             39591936 (165.97%)
> 
> Someone may rework this later and implement a new engine that uses holes in task fd numbers.
> I'm not against that, everybody is welcome! I'm just not sure it will take little time for
> writing/stabilization/etc. And I don't think this is the first-priority problem in criu,
> so this is why I'm not going to solve it now.

I think we can solve this problem in the context of these changes.

We can create service descriptors with a small base in criu, then we can
fork all processes and then we can move service descriptors to a
calculated base for each process. It is not optimal, but it will work.

It can be optimized if we predict the case when a child process has a
smaller service fd base...
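
A rough sketch of that sequence, not the actual CRIU code: all names here
(NR_SFD, sfd_install(), sfd_move(), sfd_fork_and_move()) are hypothetical,
/dev/null stands in for the real service descriptors, and the per-process
base is simply passed in instead of being calculated from the task's files.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { NR_SFD = 3 };	/* hypothetical number of service fds */

/* Step 1: create the service fds at a small base in the parent. */
int sfd_install(int base)
{
	for (int i = 0; i < NR_SFD; i++) {
		int fd = open("/dev/null", O_RDWR);

		if (fd < 0)
			return -1;
		if (fd != base + i) {
			if (dup2(fd, base + i) < 0) {
				close(fd);
				return -1;
			}
			close(fd);
		}
	}
	return 0;
}

/*
 * Step 3: move the service fds to this task's own base.  When the ranges
 * overlap, walk in the direction that never overwrites a not-yet-moved fd.
 */
int sfd_move(int old_base, int new_base)
{
	int up = new_base > old_base;

	if (old_base == new_base)
		return 0;
	for (int k = 0; k < NR_SFD; k++) {
		int i = up ? NR_SFD - 1 - k : k;

		if (dup2(old_base + i, new_base + i) < 0)
			return -1;
		close(old_base + i);
	}
	return 0;
}

/*
 * Step 2 + 3: fork with the small base still in place, then let the child
 * move its own copy.  Returns 0 if the child ended up with all service fds
 * open at child_base.
 */
int sfd_fork_and_move(int small_base, int child_base)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		int ok = sfd_move(small_base, child_base) == 0;

		for (int i = 0; ok && i < NR_SFD; i++)
			ok = fcntl(child_base + i, F_GETFD) != -1;
		_exit(ok ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The point of the ordering is that the descriptors sit at small numbers at
fork() time, and only the task that really needs a big base moves them to
one, in its own fd table.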

> 
> Also, keep in mind that after such a rework you will still hit the situation where
> a task has no spare fds, because all of (1,..., pow(2,n)*128) are occupied, and the spare
> scheme will still require more memory than before the dump. It is just one more boundary case
> like the one you pointed out in your message.
> 
> Just analyse the files of the tasks on your Linux system and see how much the patchset helps
> to reduce memory usage on restore in the generic case.
> 
> Thanks,
> Kirill
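
The pow(2,n)*128 boundary mentioned above seems to match the way the kernel
sizes an fd table: slots are allocated in chunks of 1024 / sizeof(pointer)
(128 on 64-bit), and the number of chunks is rounded up to a power of two.
The code below is a user-space model under that assumption, not kernel or
CRIU code. The function names are hypothetical, and the model ignores the
small default table embedded in the files structure as well as the
open/close-on-exec bitmaps.

```c
#include <stddef.h>

/*
 * Model: number of fd-table slots needed to hold descriptor `max_fd`.
 * Slots come in chunks of 1024 / sizeof(void *), and the chunk count is
 * rounded up to the next power of two.
 */
size_t fdtable_slots(unsigned int max_fd)
{
	size_t per_chunk = 1024 / sizeof(void *);	/* 128 on 64-bit */
	size_t chunks = max_fd / per_chunk + 1;
	size_t pow2 = 1;

	while (pow2 < chunks)
		pow2 <<= 1;

	return pow2 * per_chunk;
}

/* Bytes taken by the array of file pointers alone, per fd table. */
size_t fdtable_ptr_bytes(unsigned int max_fd)
{
	return fdtable_slots(max_fd) * sizeof(void *);
}
```

Under this model, a task whose highest descriptor is 1000000 needs 1048576
slots, i.e. 8 MiB of pointers on 64-bit, while a highest descriptor of 10000
needs 16384 slots (128 KiB). Every task that carries the big table pays that
cost separately.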

