[CRIU] [PATCH v2 12/14] files: Make tasks set their own service_fd_base
Kirill Tkhai
ktkhai at virtuozzo.com
Tue Jan 9 14:54:31 MSK 2018
On 30.12.2017 19:48, Andrei Vagin wrote:
> On Sat, Dec 30, 2017 at 12:22:32PM +0300, Kirill Tkhai wrote:
>> On 29.12.2017 22:59, Andrei Vagin wrote:
>>>>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
>>>>> index 1feabfa9f..d2517b835 100644
>>>>> --- a/test/zdtm/static/env00.c
>>>>> +++ b/test/zdtm/static/env00.c
>>>>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>> + int i;
>>>>> + pid_t pid;
>>>>> char *env;
>>>>>
>>>>> test_init(argc, argv);
>>>>>
>>>>> + for (i = 0; i < 10; i++) {
>>>>> + pid = fork();
>>>>> + if (pid == 0) {
>>>>> + while (1)
>>>>> + sleep(1);
>>>>> + return 0;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> +// dup2(1, 1000000);
>>>>> +
>>>>> if (setenv(envname, test_author, 1)) {
>>>>> pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>>>>> exit(1);
>>>>>
>>>>> before dump:
>>>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>>>>> 11849728
>>>>>
>>>>> after restore:
>>>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>>>>> 20877312
>>>>>
>>>>> Criu restore changes many in-kernel structures, and this is a global problem, not
>>>>> about sfds.
>>>>
>>>> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
>>>> be 100MB. This is not a global problem, it is the problem that we move
>>>> service descriptors in child processes.
>>>
>>> I created a small patch for zdtm.py to get kmem before and after c/r:
>>> git clone https://github.com/avagin/criu -b 2195
>>>
>>> And here are the steps of my experiments:
>>>
>>> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
>>> [root@fc24 criu]# ulimit -n 1024
>>> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
>>> === Run 1/1 ================ zdtm/static/env00
>>>
>>> ========================== Run zdtm/static/env00 in h ==========================
>>> Start test
>>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
>>> memory.kmem.usage_in_bytes = 9920512
>>> Pause at pre-dump. Press any key to continue.
>>> Run criu dump
>>>
>>> Unable to kill 36: [Errno 3] No such process
>>> Pause at pre-restore. Press any key to continue.
>>> Run criu restore
>>> Pause at post-restore. Press any key to continue.
>>> memory.kmem.usage_in_bytes = 22913024
>>> before: memory.kmem.usage_in_bytes = 9920512
>>> after: memory.kmem.usage_in_bytes = 22913024 (230.97%)
>>> Send the 15 signal to 36
>>> Wait for zdtm/static/env00(36) to die for 0.100000
>>> Unable to kill 36: [Errno 3] No such process
>>> Removing dump/zdtm/static/env00/36
>>> ========================= Test zdtm/static/env00 PASS ==========================
>>>
>>> [root@fc24 criu]#
>>> [root@fc24 criu]# ulimit -n $((1 << 20))
>>> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
>>> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
>>> === Run 1/1 ================ zdtm/static/env00
>>>
>>> ========================== Run zdtm/static/env00 in h ==========================
>>> Start test
>>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
>>> memory.kmem.usage_in_bytes = 19714048
>>> Pause at pre-dump. Press any key to continue.
>>> Run criu dump
>>> Unable to kill 36: [Errno 3] No such process
>>> Pause at pre-restore. Press any key to continue.
>>> Run criu restore
>>> Pause at post-restore. Press any key to continue.
>>> memory.kmem.usage_in_bytes = 914178048
>>> before: memory.kmem.usage_in_bytes = 19714048
>>> after: memory.kmem.usage_in_bytes = 914178048 (4637.19%)
>>> Send the 15 signal to 36
>>> Wait for zdtm/static/env00(36) to die for 0.100000
>>> Unable to kill 36: [Errno 3] No such process
>>> Removing dump/zdtm/static/env00/36
>>> ========================= Test zdtm/static/env00 PASS ==========================
>>>
>>> When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
>>> the test has only descriptors with small numbers. In this case,
>>> kmem consumption increases by 130%. We need to investigate this,
>>> but it isn't so critical compared with the next case.
>>>
>>> When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
>>> creates descriptor 1000000, and kmem consumption after c/r increases by
>>> 4537.19%. This is a real problem, which we have to solve with high priority.
>>
>> It's not a problem of my patchset, it's a problem of the current engine. My patchset
>> does not aim to replace the engine and rework everything. It just improves
>> the engine: it works better than the current engine in many cases and no worse
>> in the rest. Yes, there are cases where the patchset behaves like
>> the current engine, but there is no case where it behaves worse. It's strange to
>> ask me to solve all the file or memory problems at once, isn't it?!
>
> I like this work. Thank you for it. But I see nothing strange in
> discussing the other problems of this code and trying to solve them now or
> later.
>
>>
>> The goals of the patchset are:
>> 1) Disconnect prlimit(RLIMIT_NOFILE) and service fds placement,
>> i.e. make the placement depend on task fds, not on prlimit().
>> 2) Make people able to restore big fd numbers without increasing
>> memory usage on all tasks in a container (currently this requires
>> increasing global limits).
>>
>> #ulimit -n $((1 << 20))
>> --- a/test/zdtm/static/env00.c
>> +++ b/test/zdtm/static/env00.c
>> @@ -25,14 +25,13 @@ int main(int argc, char **argv)
>> for (i = 0; i < 100; i++) {
>> pid = fork();
>> if (pid == 0) {
>> + dup2(1, 10000);
>> while (1)
>> sleep(1);
>> return 0;
>> }
>> }
>>
>> - dup2(1, 1000000);
>> -
>> test_daemon();
>> test_waitsig();
>>
>> Before patchset:
>> before: memory.kmem.usage_in_bytes = 24076288
>> after: memory.kmem.usage_in_bytes = 921894912 (3829.06%)
>>
>> After patchset:
>> before: memory.kmem.usage_in_bytes = 23855104
>> after: memory.kmem.usage_in_bytes = 39591936 (165.97%)
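A note on reading these numbers: the percentage in each report is the ratio after/before times 100, so 165.97% corresponds to an increase of about 66%, and 3829.06% to an increase of about 3729%. A minimal sketch of that arithmetic follows; the function names are made up for illustration and are not taken from the zdtm.py patch.

```python
def kmem_ratio_percent(before, after):
    """Return 'after' as a percentage of 'before' (the value shown in the reports)."""
    return after / before * 100.0


def kmem_increase_percent(before, after):
    """Return the growth relative to 'before', i.e. the ratio minus 100%."""
    return kmem_ratio_percent(before, after) - 100.0


if __name__ == "__main__":
    # Values copied from the two reports above.
    runs = {
        "before patchset": (24076288, 921894912),
        "after patchset": (23855104, 39591936),
    }
    for name, (before, after) in runs.items():
        print("%s: ratio %.2f%%, increase %.2f%%" % (
            name,
            kmem_ratio_percent(before, after),
            kmem_increase_percent(before, after),
        ))
```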
>>
>> Someone may rework this later and implement a new engine that uses holes in task fd numbers.
>> I'm not against that, everybody is welcome! I'm just not sure that writing/stabilization/etc.
>> will take little time. And I don't think this is the first-priority problem in criu,
>> which is why I'm not going to solve it now.
>
> I think we can solve this problem in the context of these changes.
>
> We can create service descriptors with a small base in criu, then we can
> fork all processes and then we can move service descriptors to a
> calculated base for each process. It is not optimal, but it will work.
>
> It can be optimized if we predict the case when a child process has a
> smaller service fd base...
I've tried this, but it doesn't look good. Tasks with a shared fd table still require
the fds relocation to be done before forking children. Helpers are affected too.
Service fds relocation and closing of old files split into several cases, which
(I assume) won't be easy to maintain.
In my opinion, a parent with big fds is an unlikely case, and we may skip it for now.
We may gain more if we think about setting up a lower service_fds_base before
forking root_item, since that always happens.
What do you think about this?
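
To make the "move service descriptors to a calculated base" step concrete, here is a rough sketch of relocating a single descriptor. It only illustrates the dup2-then-close pattern and is not criu code; move_fd and the target number 200 are hypothetical.

```python
import os


def move_fd(old_fd, new_fd):
    """Relocate old_fd to slot new_fd and free the old slot (hypothetical helper).

    Note that dup2() silently replaces whatever was open at new_fd, so the
    caller has to make sure the target slot is not used by the task.
    """
    if old_fd == new_fd:
        return new_fd
    os.dup2(old_fd, new_fd)
    os.close(old_fd)
    return new_fd


if __name__ == "__main__":
    r, w = os.pipe()
    # 200 is an arbitrary target chosen for the example.
    moved = move_fd(w, 200)
    os.write(moved, b"x")
    print(os.read(r, 1))
```

With a shared fd table the relocation is visible to every task sharing it, which is why it has to happen before the children are forked.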
>>
>> Also, keep in mind that after the rework is done, you will still meet the situation where
>> a task has no spare fds (all of 1,..., pow(2,n)*128 are occupied), and the spare
>> scheme will still require more memory than before the dump. It's just one more boundary case
>> like the one you pointed out in your message.
>>
>> Just analyse the files of the tasks on your Linux system and find out how the patchset will help
>> to reduce memory usage on restore in the generic case.
>>
>> Thanks,
>> Kirill
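
For illustration, the pow(2,n)*128 boundary mentioned above can be sketched like this. It is only a rough sketch under stated assumptions: the helper name, the number of service fds, and the granule of 128 slots (1024 bytes divided by an 8-byte pointer on 64-bit) are assumptions for the example, not the code of the patchset.

```python
FD_GRANULE = 128  # assumption: 1024 bytes / 8-byte pointers on 64-bit


def service_fd_base(max_task_fd, nr_service_fds, granule=FD_GRANULE):
    """Hypothetical helper: smallest pow(2, n) * granule that leaves room
    for the task's own fds (0..max_task_fd) plus nr_service_fds on top."""
    needed = max_task_fd + 1 + nr_service_fds
    base = granule
    while base < needed:
        base *= 2
    return base


if __name__ == "__main__":
    # nr_service_fds=20 is an arbitrary value for the example.
    for max_fd in (100, 127, 10000, 1000000):
        print(max_fd, service_fd_base(max_fd, 20))
```

The boundary case is the second one: when the task's fds already fill a whole pow(2,n)*128 range, there is no spare room below the boundary, so the service fds push the base to the next power of two.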