[CRIU] [PATCH v2 12/14] files: Make tasks set their own service_fd_base

Kirill Tkhai ktkhai at virtuozzo.com
Wed Jan 10 10:39:56 MSK 2018


On 09.01.2018 19:30, Andrei Vagin wrote:
> On Tue, Jan 09, 2018 at 02:54:31PM +0300, Kirill Tkhai wrote:
>> On 30.12.2017 19:48, Andrei Vagin wrote:
>>> On Sat, Dec 30, 2017 at 12:22:32PM +0300, Kirill Tkhai wrote:
>>>> On 29.12.2017 22:59, Andrei Vagin wrote:
>>>>>>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
>>>>>>> index 1feabfa9f..d2517b835 100644
>>>>>>> --- a/test/zdtm/static/env00.c
>>>>>>> +++ b/test/zdtm/static/env00.c
>>>>>>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>>>>>>>  
>>>>>>>  int main(int argc, char **argv)
>>>>>>>  {
>>>>>>> +	int i;
>>>>>>> +	pid_t pid;
>>>>>>>  	char *env;
>>>>>>>  
>>>>>>>  	test_init(argc, argv);
>>>>>>>  
>>>>>>> +	for (i = 0; i < 10; i++) {
>>>>>>> +		pid = fork();
>>>>>>> +		if (pid == 0) {
>>>>>>> +			while (1)
>>>>>>> +				sleep(1);
>>>>>>> +			return 0;
>>>>>>> +		}
>>>>>>> +	}
>>>>>>> +
>>>>>>> +//	dup2(1, 1000000);
>>>>>>> +
>>>>>>>  	if (setenv(envname, test_author, 1)) {
>>>>>>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>>>>>>>  		exit(1);
>>>>>>>
>>>>>>> before dump:
>>>>>>> root at pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>>>>>>> 11849728
>>>>>>>
>>>>>>> after restore:
>>>>>>> root at pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>>>>>>> 20877312
>>>>>>>
>>>>>>> CRIU restore changes many in-kernel structures; this is a global problem, not
>>>>>>> one specific to service fds (sfds).
>>>>>>
>>>>>> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
>>>>>> be 100MB. This is not a global problem; the problem is that we move
>>>>>> service descriptors in child processes.
>>>>>
>>>>> I created a small patch for zdtm.py to get kmem before and after c/r:
>>>>> git clone https://github.com/avagin/criu -b 2195
>>>>>
>>>>> And here are the steps I use for my experiments:
>>>>>
>>>>> [root at fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
>>>>> [root at fc24 criu]# ulimit -n 1024
>>>>> [root at fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
>>>>> === Run 1/1 ================ zdtm/static/env00
>>>>>
>>>>> ========================== Run zdtm/static/env00 in h ==========================
>>>>> Start test
>>>>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
>>>>> memory.kmem.usage_in_bytes = 9920512
>>>>> Pause at pre-dump. Press any key to continue.
>>>>> Run criu dump
>>>>>
>>>>> Unable to kill 36: [Errno 3] No such process
>>>>> Pause at pre-restore. Press any key to continue.Run criu restore
>>>>> Pause at post-restore. Press any key to continue.
>>>>> memory.kmem.usage_in_bytes = 22913024
>>>>> before: memory.kmem.usage_in_bytes =              9920512
>>>>> after:  memory.kmem.usage_in_bytes =             22913024 (230.97%)
>>>>> Send the 15 signal to  36
>>>>> Wait for zdtm/static/env00(36) to die for 0.100000
>>>>> Unable to kill 36: [Errno 3] No such process
>>>>> Removing dump/zdtm/static/env00/36
>>>>> ========================= Test zdtm/static/env00 PASS ==========================
>>>>>
>>>>> [root at fc24 criu]# 
>>>>> [root at fc24 criu]# ulimit -n $((1 << 20))
>>>>> [root at fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
>>>>> [root at fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
>>>>> === Run 1/1 ================ zdtm/static/env00
>>>>>
>>>>> ========================== Run zdtm/static/env00 in h ==========================
>>>>> Start test
>>>>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
>>>>> memory.kmem.usage_in_bytes = 19714048
>>>>> Pause at pre-dump. Press any key to continue.
>>>>> Run criu dump
>>>>> Unable to kill 36: [Errno 3] No such process
>>>>> Pause at pre-restore. Press any key to continue.
>>>>> Run criu restore
>>>>> Pause at post-restore. Press any key to continue.
>>>>> memory.kmem.usage_in_bytes = 914178048
>>>>> before: memory.kmem.usage_in_bytes =             19714048
>>>>> after:  memory.kmem.usage_in_bytes =            914178048 (4637.19%)
>>>>> Send the 15 signal to  36
>>>>> Wait for zdtm/static/env00(36) to die for 0.100000
>>>>> Unable to kill 36: [Errno 3] No such process
>>>>> Removing dump/zdtm/static/env00/36
>>>>> ========================= Test zdtm/static/env00 PASS ==========================
>>>>>
>>>>> When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
>>>>> the test has only descriptors with small numbers. In this case,
>>>>> kmem consumption increases by 130%. We need to investigate this,
>>>>> but it isn't so critical compared with the next case.
>>>>>
>>>>> When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
>>>>> creates descriptor 1000000, and kmem consumption after c/r increases by
>>>>> 4537.19%. This is a real problem that we have to solve with high priority.
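
To illustrate the difference between the two ulimit settings: dup2() refuses any target descriptor at or above the soft RLIMIT_NOFILE and fails with EBADF. A minimal sketch follows; try_dup2_under_limit() is a hypothetical helper name used only for illustration and is not part of zdtm or criu.

```c
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Hypothetical helper: temporarily set the soft RLIMIT_NOFILE to soft_limit,
 * try dup2(STDOUT_FILENO, target_fd), then restore the old limit.
 * Returns 0 if dup2() succeeded, otherwise the errno of the failing call.
 */
static int try_dup2_under_limit(rlim_t soft_limit, int target_fd)
{
	struct rlimit old, lim;
	int ret = 0;

	if (getrlimit(RLIMIT_NOFILE, &old))
		return errno;

	/* The soft limit may never exceed the hard limit. */
	if (soft_limit > old.rlim_max)
		soft_limit = old.rlim_max;

	lim = old;
	lim.rlim_cur = soft_limit;
	if (setrlimit(RLIMIT_NOFILE, &lim))
		return errno;

	if (dup2(STDOUT_FILENO, target_fd) < 0)
		ret = errno;		/* EBADF when target_fd >= soft limit */
	else
		close(target_fd);	/* do not leave the test descriptor behind */

	/* Put the original limit back so the caller is not affected. */
	setrlimit(RLIMIT_NOFILE, &old);
	return ret;
}
```

With a soft limit of 1024, try_dup2_under_limit(1024, 1000000) is expected to return EBADF on Linux, so the test keeps only small descriptors, as described above.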
>>>>
>>>> It's not a problem of my patchset, it's a problem of the current engine. My patchset
>>>> does not aim to overhaul the engine and rework everything. It just improves
>>>> the engine: it works better than the current engine in many cases and no worse
>>>> in the rest. Yes, there are cases where the patchset behaves like the
>>>> current engine, but there is no case where it behaves worse. It's strange to
>>>> ask me to solve all the file and memory problems at once, isn't it?!
>>>
>>> I like this work. Thank you for it. But I see nothing strange in
>>> discussing the other problems of this code and trying to solve them now or
>>> later.
>>>
>>>>
>>>> The goals of the patchset are:
>>>> 1) Disconnect prlimit(RLIMIT_NOFILE) from service fd placement,
>>>>    i.e. make the placement depend on the task's fds, not on prlimit().
>>>> 2) Let people restore big fd numbers without increasing
>>>>    memory usage for all tasks in a container (currently this requires
>>>>    increasing global limits).
>>>>
>>>> #ulimit -n $((1 << 20))
>>>> --- a/test/zdtm/static/env00.c
>>>> +++ b/test/zdtm/static/env00.c
>>>> @@ -25,14 +25,13 @@ int main(int argc, char **argv)
>>>>         for (i = 0; i < 100; i++) {
>>>>                 pid = fork();
>>>>                 if (pid == 0) {
>>>> +                       dup2(1, 10000);
>>>>                         while (1)
>>>>                                 sleep(1);
>>>>                         return 0;
>>>>                 }
>>>>         }
>>>>  
>>>> -       dup2(1, 1000000);
>>>> -
>>>>         test_daemon();
>>>>         test_waitsig();
>>>>
>>>> Before patchset:
>>>> before: memory.kmem.usage_in_bytes =             24076288
>>>> after:  memory.kmem.usage_in_bytes =            921894912 (3829.06%)
>>>>
>>>> After patchset:
>>>> before: memory.kmem.usage_in_bytes =             23855104
>>>> after:  memory.kmem.usage_in_bytes =             39591936 (165.97%)
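
One way to picture why the placement of the highest descriptor matters so much: the kernel sizes each task's fd table to cover its highest open fd. The sketch below mirrors, as an assumption, the rounding used when Linux expands an fd table on a 64-bit system (128 file pointers per 1024-byte chunk, with the chunk count rounded up to a power of two). The helper names and the 8-byte pointer size are illustrative, and the result is an estimate of the pointer array only, not a measurement.

```c
/* Assumption: 64-bit kernel, so one 1024-byte chunk holds 128 file pointers. */
#define FDS_PER_CHUNK	128u
#define PTR_SIZE	8u

/* Smallest power of two that is >= v (v must be >= 1). */
static unsigned int roundup_pow2(unsigned int v)
{
	unsigned int r = 1;

	while (r < v)
		r <<= 1;
	return r;
}

/* Approximate number of fd slots a table needs to hold descriptor max_fd. */
static unsigned int fdtable_slots(unsigned int max_fd)
{
	return roundup_pow2(max_fd / FDS_PER_CHUNK + 1) * FDS_PER_CHUNK;
}

/* Approximate size in bytes of the file pointer array for that table. */
static unsigned long fdtable_ptr_bytes(unsigned int max_fd)
{
	return (unsigned long)fdtable_slots(max_fd) * PTR_SIZE;
}
```

Under these assumptions a task whose highest fd sits near 1 << 20 needs 1048576 slots, about 8 MiB of pointers, while a task whose highest fd is 10000 needs 16384 slots, about 128 KiB. Multiplied by 100 children, that is the same order of magnitude as the difference between the two results above.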
>>>>
>>>> Someone may rework this later and implement a new engine that uses holes in task fd numbers.
>>>> I'm not against that, everybody is welcome! I'm just not sure that writing and stabilizing it
>>>> will take little time. And I don't think this is the first-priority problem in criu,
>>>> which is why I'm not going to solve it now.
>>>
>>> I think we can solve this problem in the context of these changes.
>>>
>>> We can create service descriptors with a small base in criu, then we can
>>> fork all processes and then we can move service descriptors to a
>>> calculated base for each process. It is not optimal, but it will work.
>>>
>>> It can be optimized if we predict the case where a child process has a
>>> smaller service fd base...
>>
>> I've tried this, but it doesn't look good. Tasks with a shared fd table still require
>> the fd relocation to be done before their children are forked. Helpers are included there too.
>> Service fd relocation and closing of old files then get split across several cases, which
>> (I assume) won't be easy to maintain.
>>
>> In my opinion a parent with big fds is an unlikely case, and we may skip it for a while.
>> We may get a better gain if we think about setting up a lower service_fds_base before
>> forking root_item, as that always happens.
>>
>> What do you think about this?
> 
> Ok, let's postpone this task. If you have time, could you describe the
> current state of service fds on the criu wiki, with all known issues?

Ok. Should an existing files wiki page be used, or should a new one be introduced?

Kirill
 
> Thanks,
> Andrei
> 
>>
>>>>
>>>> Also, keep in mind that after the rework happens, you will still hit the situation where
>>>> a task has no spare fds (1, ..., pow(2,n)*128 are all occupied), and the spare-fd
>>>> scheme will still require more memory than before the dump. It's just one more boundary case
>>>> like the one you pointed out in your message.
>>>>
>>>> Just analyse the tasks' files on your Linux system and see how the patchset will help
>>>> to reduce memory usage on restore in the generic case.
>>>>
>>>> Thanks,
>>>> Kirill

