[CRIU] [PATCH v2 12/14] files: Make tasks set their own service_fd_base

Andrei Vagin avagin at virtuozzo.com
Tue Jan 9 19:30:33 MSK 2018


On Tue, Jan 09, 2018 at 02:54:31PM +0300, Kirill Tkhai wrote:
> On 30.12.2017 19:48, Andrei Vagin wrote:
> > On Sat, Dec 30, 2017 at 12:22:32PM +0300, Kirill Tkhai wrote:
> >> On 29.12.2017 22:59, Andrei Vagin wrote:
> >>>>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
> >>>>> index 1feabfa9f..d2517b835 100644
> >>>>> --- a/test/zdtm/static/env00.c
> >>>>> +++ b/test/zdtm/static/env00.c
> >>>>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
> >>>>>  
> >>>>>  int main(int argc, char **argv)
> >>>>>  {
> >>>>> +	int i;
> >>>>> +	pid_t pid;
> >>>>>  	char *env;
> >>>>>  
> >>>>>  	test_init(argc, argv);
> >>>>>  
> >>>>> +	for (i = 0; i < 10; i++) {
> >>>>> +		pid = fork();
> >>>>> +		if (pid == 0) {
> >>>>> +			while (1)
> >>>>> +				sleep(1);
> >>>>> +			return 0;
> >>>>> +		}
> >>>>> +	}
> >>>>> +
> >>>>> +//	dup2(1, 1000000);
> >>>>> +
> >>>>>  	if (setenv(envname, test_author, 1)) {
> >>>>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
> >>>>>  		exit(1);
> >>>>>
> >>>>> before dump:
> >>>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>>>> 11849728
> >>>>>
> >>>>> after restore:
> >>>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>>>> 20877312
> >>>>>
> >>>>> Criu restore changes many in-kernel structures, and this is a global problem, not
> >>>>> about sfds.
> >>>>
> >>>> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
> >>>> be 100MB. This is not a global problem, it is the problem that we move
> >>>> service descriptors in child processes.
> >>>
> >>> I created a small patch for zdtm.py to get kmem before and after c/r:
> >>> git clone https://github.com/avagin/criu -b 2195
> >>>
> >>> And here are the steps of my experiments:
> >>>
> >>> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> >>> [root@fc24 criu]# ulimit -n 1024
> >>> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> >>> === Run 1/1 ================ zdtm/static/env00
> >>>
> >>> ========================== Run zdtm/static/env00 in h ==========================
> >>> Start test
> >>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> >>> memory.kmem.usage_in_bytes = 9920512
> >>> Pause at pre-dump. Press any key to continue.
> >>> Run criu dump
> >>>
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Pause at pre-restore. Press any key to continue.
> >>> Run criu restore
> >>> Pause at post-restore. Press any key to continue.
> >>> memory.kmem.usage_in_bytes = 22913024
> >>> before: memory.kmem.usage_in_bytes =              9920512
> >>> after:  memory.kmem.usage_in_bytes =             22913024 (230.97%)
> >>> Send the 15 signal to  36
> >>> Wait for zdtm/static/env00(36) to die for 0.100000
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Removing dump/zdtm/static/env00/36
> >>> ========================= Test zdtm/static/env00 PASS ==========================
> >>>
> >>> [root@fc24 criu]# 
> >>> [root@fc24 criu]# ulimit -n $((1 << 20))
> >>> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> >>> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> >>> === Run 1/1 ================ zdtm/static/env00
> >>>
> >>> ========================== Run zdtm/static/env00 in h ==========================
> >>> Start test
> >>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> >>> memory.kmem.usage_in_bytes = 19714048
> >>> Pause at pre-dump. Press any key to continue.
> >>> Run criu dump
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Pause at pre-restore. Press any key to continue.
> >>> Run criu restore
> >>> Pause at post-restore. Press any key to continue.
> >>> memory.kmem.usage_in_bytes = 914178048
> >>> before: memory.kmem.usage_in_bytes =             19714048
> >>> after:  memory.kmem.usage_in_bytes =            914178048 (4637.19%)
> >>> Send the 15 signal to  36
> >>> Wait for zdtm/static/env00(36) to die for 0.100000
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Removing dump/zdtm/static/env00/36
> >>> ========================= Test zdtm/static/env00 PASS ==========================
> >>>
> >>> When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
> >>> the test has only descriptors with small numbers. In this case,
> >>> kmem consumption increases by about 130%. We need to investigate this,
> >>> but it isn't so critical compared with the next case.
> >>>
> >>> When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
> >>> creates descriptor 1000000, and kmem consumption after c/r increases by
> >>> 4537.19%. This is a real problem, which we have to solve with high priority.
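
An aside on the two runs above: a minimal sketch, not part of the test,
of why the ulimit value decides whether the big descriptor exists at all.
dup2() onto a number at or above the soft RLIMIT_NOFILE fails with EBADF,
so with ulimit -n 1024 the test never gets fd 1000000. The helper names
set_nofile_soft() and dup_to() are made up for this illustration.

```c
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/* Set the soft RLIMIT_NOFILE of the calling process; returns 0 or -errno. */
int set_nofile_soft(rlim_t soft)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl))
		return -errno;
	rl.rlim_cur = soft;	/* the hard limit is left untouched */
	if (setrlimit(RLIMIT_NOFILE, &rl))
		return -errno;
	return 0;
}

/* Duplicate oldfd onto newfd; returns 0 on success or -errno on failure. */
int dup_to(int oldfd, int newfd)
{
	if (dup2(oldfd, newfd) < 0)
		return -errno;
	return 0;
}
```

With the soft limit at 1024, dup_to(1, 1000000) is expected to return
-EBADF; with the limit at 1 << 20 it should succeed, and the task's fd
table then has to cover descriptor 1000000.
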
> >>
> >> It's not a problem of my patchset, it's a problem of the current engine. My patchset
> >> does not aim to overhaul the engine and rework everything. It just improves
> >> the engine: it works better than the current engine in many cases and not worse
> >> in the rest. Yes, there are cases where the patchset behaves like the
> >> current engine, but there is no case where it behaves worse. Strange to
> >> call on me to solve all the file or memory problems at once, isn't it?!
> > 
> > I like this work. Thank you for it. But I see nothing strange in
> > discussing the other problems of this code and trying to solve them now or
> > later.
> > 
> >>
> >> The goals of the patchset are:
> >> 1) Disconnect prlimit(RLIMIT_NOFILE) and service fds placement,
> >> i.e. make the placement depend on task fds, not on prlimit().
> >> 2) Make people able to restore big fd numbers without increasing
> >> memory usage on all tasks in a container (as currently this requires
> >> increasing global limits).
> >>
> >> #ulimit -n $((1 << 20))
> >> --- a/test/zdtm/static/env00.c
> >> +++ b/test/zdtm/static/env00.c
> >> @@ -25,14 +25,13 @@ int main(int argc, char **argv)
> >>         for (i = 0; i < 100; i++) {
> >>                 pid = fork();
> >>                 if (pid == 0) {
> >> +                       dup2(1, 10000);
> >>                         while (1)
> >>                                 sleep(1);
> >>                         return 0;
> >>                 }
> >>         }
> >>  
> >> -       dup2(1, 1000000);
> >> -
> >>         test_daemon();
> >>         test_waitsig();
> >>
> >> Before patchset:
> >> before: memory.kmem.usage_in_bytes =             24076288
> >> after:  memory.kmem.usage_in_bytes =            921894912 (3829.06%)
> >>
> >> After patchset:
> >> before: memory.kmem.usage_in_bytes =             23855104
> >> after:  memory.kmem.usage_in_bytes =             39591936 (165.97%)
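
For anyone re-checking these figures: the value in parentheses appears
to be the post-restore usage as a percentage of the pre-dump usage, so
the growth is that value minus 100. A small sketch of the arithmetic;
usage_percent() and growth_percent() are names invented here, they are
not taken from zdtm.py.

```c
/* Post-restore usage expressed as a percentage of the pre-dump usage. */
double usage_percent(unsigned long long before, unsigned long long after)
{
	return 100.0 * (double)after / (double)before;
}

/* Growth relative to the pre-dump usage, in percent. */
double growth_percent(unsigned long long before, unsigned long long after)
{
	return usage_percent(before, after) - 100.0;
}
```

Fed with the numbers quoted above, usage_percent(24076288, 921894912)
comes out at about 3829.06 and usage_percent(23855104, 39591936) at
about 165.97, i.e. a growth of roughly 66% with the patchset applied.
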
> >>
> >> Someone may rework this later and implement a new engine using holes in task fd numbers.
> >> I'm not against that, everybody is welcome! I'm just not sure that writing and stabilizing
> >> it will take little time. And I don't think this is the first-priority problem in criu,
> >> which is why I'm not going to solve it now.
> > 
> > I think we can solve this problem in the context of these changes.
> > 
> > We can create service descriptors with a small base in criu, then we can
> > fork all processes and then we can move service descriptors to a
> > calculated base for each process. It is not optimal, but it will work.
> > 
> > It can be optimized if we predict the case when a child process has a
> > smaller service fd base...
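
To make the "move service descriptors to a calculated base" step
concrete, here is a rough sketch under stated assumptions. It is not
criu code: move_fd_block() is a hypothetical helper, and the layout
(descriptors counted downwards from a base) is only an assumption for
the illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Move `count` descriptors laid out downwards from old_base
 * (old_base, old_base - 1, ...) to the same layout under new_base.
 * Slots that are not open are skipped; the two ranges must not overlap.
 * Returns 0 on success or -errno on the first failure.
 */
int move_fd_block(int old_base, int new_base, int count)
{
	int i, dist = new_base - old_base;

	if (dist < 0)
		dist = -dist;
	if (count < 0 || old_base - count + 1 < 0 || new_base - count + 1 < 0)
		return -EINVAL;
	if (dist < count)
		return -EINVAL;	/* overlapping ranges are not handled here */

	for (i = 0; i < count; i++) {
		int from = old_base - i, to = new_base - i;

		if (fcntl(from, F_GETFD) < 0) {
			if (errno == EBADF)
				continue;	/* nothing open in this slot */
			return -errno;
		}
		/* dup2() silently replaces anything already open at `to`. */
		if (dup2(from, to) < 0)
			return -errno;
		close(from);
	}
	return 0;
}
```

Note that dup2() does not carry FD_CLOEXEC over, and a real
implementation would also have to make sure the target slots are not
used by the task's own files.
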
> 
> I've tried this, but it doesn't look good. Tasks with a shared fd table still require
> the fds relocation to be done before forking children. Helpers are included there too.
> Service fds relocation and closing of old files split into several cases, which
> (I assume) won't be easy to maintain.
> 
> In my opinion a parent with big fds is an unlikely case, and we may skip it for a while.
> We may get a better profit if we think about setting up a lower service_fds_base before
> forking root_item, as that always happens.
> 
> What do you think about this?

Ok, let's postpone this task. If you have time, could you describe the
current state of service fds on the criu wiki, with all known issues?

Thanks,
Andrei

> 
> >>
> >> Also, keep in mind that after the rework happens, you will still meet the situation where
> >> a task has no spare fds (1, ..., pow(2,n)*128 are all occupied), and the spare
> >> scheme will still require more memory than before the dump. Just one more boundary case
> >> like the one you pointed out in your message.
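
On the pow(2,n)*128 figure: as far as I understand it, the kernel grows
a task's fd table in power-of-two multiples of 1024 / sizeof(pointer)
slots, which is 128 on a 64-bit machine. The sketch below only models
that rounding. It is an approximation written for this discussion, not
kernel or criu code, and fdtable_slots() / fdtable_bytes() are invented
names.

```c
/* Smallest power of two that is >= n, for n > 0. */
static unsigned long roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Modelled number of slots in an fd table that has to hold max_fd. */
unsigned long fdtable_slots(unsigned long max_fd)
{
	const unsigned long chunk = 1024 / sizeof(void *);	/* 128 on 64-bit */

	return roundup_pow2(max_fd / chunk + 1) * chunk;
}

/* Bytes taken by the pointer array alone (bitmaps etc. are ignored). */
unsigned long fdtable_bytes(unsigned long max_fd)
{
	return fdtable_slots(max_fd) * sizeof(void *);
}
```

Under this model a single descriptor at 1000000 needs 1048576 slots,
about 8 MiB of pointers per fd table on 64-bit, while a descriptor at
10000 needs 16384 slots. That is the same order of magnitude as the
per-task growth discussed above, though I have not measured it
separately.
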
> >>
> >> Just analyse the files of the tasks on your Linux system and see how the patchset helps
> >> to reduce memory usage on restore in the generic case.
> >>
> >> Thanks,
> >> Kirill

