[CRIU] [PATCH v2 12/14] files: Make tasks set their own service_fd_base

Kirill Tkhai ktkhai at virtuozzo.com
Sat Dec 30 12:22:32 MSK 2017


On 29.12.2017 22:59, Andrei Vagin wrote:
>>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
>>> index 1feabfa9f..d2517b835 100644
>>> --- a/test/zdtm/static/env00.c
>>> +++ b/test/zdtm/static/env00.c
>>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>>>  
>>>  int main(int argc, char **argv)
>>>  {
>>> +	int i;
>>> +	pid_t pid;
>>>  	char *env;
>>>  
>>>  	test_init(argc, argv);
>>>  
>>> +	for (i = 0; i < 10; i++) {
>>> +		pid = fork();
>>> +		if (pid == 0) {
>>> +			while (1)
>>> +				sleep(1);
>>> +			return 0;
>>> +		}
>>> +	}
>>> +
>>> +//	dup2(1, 1000000);
>>> +
>>>  	if (setenv(envname, test_author, 1)) {
>>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>>>  		exit(1);
>>>
>>> before dump:
>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>>> 11849728
>>>
>>> after restore:
>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>>> 20877312
>>>
>>> CRIU restore changes many in-kernel structures. This is a global problem,
>>> not one specific to service fds.
>>
>> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
>> be 100MB. This is not a global problem; the problem is that we move
>> service descriptors in child processes.
> 
> I created a small patch for zdtm.py to get kmem before and after c/r:
> git clone https://github.com/avagin/criu -b 2195
> 
> And here are the steps of my experiments:
> 
> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> [root@fc24 criu]# ulimit -n 1024
> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> === Run 1/1 ================ zdtm/static/env00
> 
> ========================== Run zdtm/static/env00 in h ==========================
> Start test
> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> memory.kmem.usage_in_bytes = 9920512
> Pause at pre-dump. Press any key to continue.
> Run criu dump
> 
> Unable to kill 36: [Errno 3] No such process
> Pause at pre-restore. Press any key to continue.Run criu restore
> Pause at post-restore. Press any key to continue.
> memory.kmem.usage_in_bytes = 22913024
> before: memory.kmem.usage_in_bytes =              9920512
> after:  memory.kmem.usage_in_bytes =             22913024 (230.97%)
> Send the 15 signal to  36
> Wait for zdtm/static/env00(36) to die for 0.100000
> Unable to kill 36: [Errno 3] No such process
> Removing dump/zdtm/static/env00/36
> ========================= Test zdtm/static/env00 PASS ==========================
> 
> [root@fc24 criu]# 
> [root@fc24 criu]# ulimit -n $((1 << 20))
> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> === Run 1/1 ================ zdtm/static/env00
> 
> ========================== Run zdtm/static/env00 in h ==========================
> Start test
> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> memory.kmem.usage_in_bytes = 19714048
> Pause at pre-dump. Press any key to continue.
> Run criu dump
> Unable to kill 36: [Errno 3] No such process
> Pause at pre-restore. Press any key to continue.
> Run criu restore
> Pause at post-restore. Press any key to continue.
> memory.kmem.usage_in_bytes = 914178048
> before: memory.kmem.usage_in_bytes =             19714048
> after:  memory.kmem.usage_in_bytes =            914178048 (4637.19%)
> Send the 15 signal to  36
> Wait for zdtm/static/env00(36) to die for 0.100000
> Unable to kill 36: [Errno 3] No such process
> Removing dump/zdtm/static/env00/36
> ========================= Test zdtm/static/env00 PASS ==========================
> 
> When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
> the test has only descriptors with small numbers. In this case,
> kmem consumption increases by about 130%. We need to investigate this,
> but it isn't so critical compared with the next case.
> 
> When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
> creates descriptor 1000000, and kmem consumption after c/r increases by
> 4537.19%. This is a real problem that we have to solve with high priority.

It's not a problem of my patchset; it's a problem of the current engine. My patchset
does not aim to tear down the engine and rework everything. It just improves
the engine: it works better than the current engine in many cases and no worse
in the rest. Yes, there are cases where the patchset behaves like the
current engine, but there is no case where it behaves worse. It's strange to
ask me to solve all the file and memory problems at once, isn't it?!

The goals of the patchset are:
1) Disconnect prlimit(RLIMIT_NOFILE) from service fds placement,
i.e. make the placement depend on the task's fds, not on prlimit().
2) Let people restore big fd numbers without increasing
memory usage on all tasks in a container (currently this requires
increasing global limits).
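
To illustrate goal 1, here is a minimal sketch of the two placement strategies. It is not the code from the patchset: all function names are hypothetical, and the assumption that the old scheme packs service fds just below RLIMIT_NOFILE is inferred from the goal statement above.

```c
#include <assert.h>

/*
 * Hypothetical model: "base" is the highest descriptor number used by
 * service fds, and service fd number i (0 <= i < nr_service) lives at
 * base - i.
 */

/* Placement tied to the limit: the base sits just under RLIMIT_NOFILE. */
static long sfd_base_from_rlimit(long rlim_cur)
{
	assert(rlim_cur > 0);
	return rlim_cur - 1;
}

/* Placement tied to the task: the block sits right above the task's fds. */
static long sfd_base_from_task(long max_task_fd, long nr_service)
{
	assert(max_task_fd >= 0 && nr_service > 0);
	return max_task_fd + nr_service;
}

/* Descriptor number of the i-th service fd for a given base. */
static long sfd_number(long base, long i)
{
	assert(i >= 0 && i <= base);
	return base - i;
}
```

With ulimit -n $((1 << 20)), the first function puts the base at 1048575 even for a task whose highest fd is 2. The second one, for a task with highest fd 10000 and 16 service fds, gives base 10016 and the lowest service fd at 10001, so the fd table does not have to cover the whole limit.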

#ulimit -n $((1 << 20))
--- a/test/zdtm/static/env00.c
+++ b/test/zdtm/static/env00.c
@@ -25,14 +25,13 @@ int main(int argc, char **argv)
        for (i = 0; i < 100; i++) {
                pid = fork();
                if (pid == 0) {
+                       dup2(1, 10000);
                        while (1)
                                sleep(1);
                        return 0;
                }
        }
 
-       dup2(1, 1000000);
-
        test_daemon();
        test_waitsig();

Before patchset:
before: memory.kmem.usage_in_bytes =             24076288
after:  memory.kmem.usage_in_bytes =            921894912 (3829.06%)

After patchset:
before: memory.kmem.usage_in_bytes =             23855104
after:  memory.kmem.usage_in_bytes =             39591936 (165.97%)

Someone may rework this later and implement a new engine that uses holes in task fd numbers.
I'm not against that, everybody is welcome! I'm just not sure that writing and stabilizing it
will take little time. And I don't think this is the first-priority problem in CRIU,
which is why I'm not going to solve it now.

Also, keep in mind that after such a rework you will still meet the situation where
a task has no spare fds (1, ..., pow(2,n)*128 are all occupied), and the spare-fd
scheme will still require more memory than before the dump. It's just one more boundary case
like the one you pointed out in your message.
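
The pow(2,n)*128 boundary can be modelled roughly as below. This is a sketch and not kernel code: it assumes that fd table capacity comes in steps of pow(2,n)*128 slots and that one slot costs one pointer. The helper names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Round v up to the next power of two; v must be at least 1. */
static unsigned long roundup_pow2(unsigned long v)
{
	unsigned long p = 1;

	assert(v >= 1);
	while (p < v)
		p <<= 1;
	return p;
}

/*
 * Estimated number of fd table slots needed to hold descriptor max_fd,
 * assuming capacities of the form pow(2, n) * 128.
 */
static unsigned long fdtable_slots(unsigned long max_fd)
{
	unsigned long chunks = max_fd / 128 + 1;

	return roundup_pow2(chunks) * 128;
}

/* Estimated size of the slot array alone, one pointer per slot. */
static unsigned long fdtable_bytes(unsigned long max_fd)
{
	return fdtable_slots(max_fd) * sizeof(void *);
}
```

Under these assumptions fd 1000000 needs 1048576 slots (8MB of pointers per task on a 64-bit system), while fd 10000 needs 16384 slots (128KB). A task that occupies every fd up to pow(2,n)*128 - 1 has no free slot left, so one more descriptor doubles the table.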

Just analyse the open files of the tasks on your Linux system, and you will see how the patchset
helps to reduce memory usage on restore in the generic case.

Thanks,
Kirill

