[CRIU] [PATCH v2 12/14] files: Make tasks set their own service_fd_base
Andrei Vagin
avagin at virtuozzo.com
Sat Dec 30 19:48:10 MSK 2017
On Sat, Dec 30, 2017 at 12:22:32PM +0300, Kirill Tkhai wrote:
> On 29.12.2017 22:59, Andrei Vagin wrote:
> >>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
> >>> index 1feabfa9f..d2517b835 100644
> >>> --- a/test/zdtm/static/env00.c
> >>> +++ b/test/zdtm/static/env00.c
> >>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
> >>>
> >>>  int main(int argc, char **argv)
> >>>  {
> >>> +	int i;
> >>> +	pid_t pid;
> >>>  	char *env;
> >>>
> >>>  	test_init(argc, argv);
> >>>
> >>> +	for (i = 0; i < 10; i++) {
> >>> +		pid = fork();
> >>> +		if (pid == 0) {
> >>> +			while (1)
> >>> +				sleep(1);
> >>> +			return 0;
> >>> +		}
> >>> +	}
> >>> +
> >>> +//	dup2(1, 1000000);
> >>> +
> >>>  	if (setenv(envname, test_author, 1)) {
> >>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
> >>>  		exit(1);
> >>>
> >>> before dump:
> >>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>> 11849728
> >>>
> >>> after restore:
> >>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>> 20877312
> >>>
> >>> Criu restore changes many in-kernel structures, and this is a global problem, not
> >>> about sfds.
> >>
> >> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
> >> be 100MB. This is not a global problem, it is the problem that we move
> >> service descriptors in child processes.
> >
> > I created a small patch for zdtm.py to get kmem usage before and after c/r:
> > git clone https://github.com/avagin/criu -b 2195
> >
> > And here are the steps of my experiments:
> >
> > [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> > [root@fc24 criu]# ulimit -n 1024
> > [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> > === Run 1/1 ================ zdtm/static/env00
> >
> > ========================== Run zdtm/static/env00 in h ==========================
> > Start test
> > ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> > memory.kmem.usage_in_bytes = 9920512
> > Pause at pre-dump. Press any key to continue.
> > Run criu dump
> >
> > Unable to kill 36: [Errno 3] No such process
> > Pause at pre-restore. Press any key to continue.Run criu restore
> > Pause at post-restore. Press any key to continue.
> > memory.kmem.usage_in_bytes = 22913024
> > before: memory.kmem.usage_in_bytes = 9920512
> > after: memory.kmem.usage_in_bytes = 22913024 (230.97%)
> > Send the 15 signal to 36
> > Wait for zdtm/static/env00(36) to die for 0.100000
> > Unable to kill 36: [Errno 3] No such process
> > Removing dump/zdtm/static/env00/36
> > ========================= Test zdtm/static/env00 PASS ==========================
> >
> > [root@fc24 criu]#
> > [root@fc24 criu]# ulimit -n $((1 << 20))
> > [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> > [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> > === Run 1/1 ================ zdtm/static/env00
> >
> > ========================== Run zdtm/static/env00 in h ==========================
> > Start test
> > ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> > memory.kmem.usage_in_bytes = 19714048
> > Pause at pre-dump. Press any key to continue.
> > Run criu dump
> > Unable to kill 36: [Errno 3] No such process
> > Pause at pre-restore. Press any key to continue.
> > Run criu restore
> > Pause at post-restore. Press any key to continue.
> > memory.kmem.usage_in_bytes = 914178048
> > before: memory.kmem.usage_in_bytes = 19714048
> > after: memory.kmem.usage_in_bytes = 914178048 (4637.19%)
> > Send the 15 signal to 36
> > Wait for zdtm/static/env00(36) to die for 0.100000
> > Unable to kill 36: [Errno 3] No such process
> > Removing dump/zdtm/static/env00/36
> > ========================= Test zdtm/static/env00 PASS ==========================
> >
> > When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
> > the test has only descriptors with small numbers. In this case,
> > kmem consumption increases by about 130%. We need to investigate this,
> > but it isn't so critical compared with the next case.
> >
> > When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
> > creates descriptor 1000000, and kmem consumption after c/r increases by
> > 4537.19%. This is a real problem that we have to solve with high priority.
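
A note on the arithmetic in the two runs above: the percentage printed in the "after:" lines appears to be after/before, so the growth is that figure minus 100. A minimal sketch in C of both figures; the helper names are hypothetical and belong to neither zdtm nor CRIU:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: "after" expressed as a percentage of "before". */
double kmem_ratio_percent(unsigned long long before, unsigned long long after)
{
	assert(before > 0);	/* a zero baseline has no meaningful ratio */
	return 100.0 * (double)after / (double)before;
}

/* Hypothetical helper: growth relative to "before", in percent. */
double kmem_growth_percent(unsigned long long before, unsigned long long after)
{
	return kmem_ratio_percent(before, after) - 100.0;
}

/* Print both figures for one run of the test. */
void print_kmem_delta(const char *label,
		      unsigned long long before, unsigned long long after)
{
	printf("%s: ratio %.2f%%, growth %.2f%%\n", label,
	       kmem_ratio_percent(before, after),
	       kmem_growth_percent(before, after));
}
```

With the logged values, the ulimit -n 1024 run (9920512 to 22913024 bytes) gives a ratio of about 230.97%, which is growth of about 131%. The ulimit -n $((1 << 20)) run (19714048 to 914178048 bytes) gives a ratio of about 4637.19%, which is growth of about 4537.19%.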
>
> It's not a problem of my patchset, it's a problem of the current engine. My patchset
> does not aim to exploit the engine and rework everything. It just improves
> the engine: it works better than the current engine in many cases and no worse
> in the rest. Yes, there are cases where the patchset behaves like the
> current engine, but there is no case where it behaves worse. It is strange to
> ask me to solve all the file or memory problems at once, isn't it?!
I like this work. Thank you for it. But I see nothing strange in
discussing the other problems of this code and trying to solve them now or
later.
>
> The goals of the patchset are:
> 1) Disconnect prlimit(RLIMIT_NOFILE) and service fds placement,
> i.e. make the placement depend on the task's fds, not on prlimit().
> 2) Make it possible to restore big fd numbers without increasing
> memory usage for all tasks in a container (currently this requires
> increasing global limits).
>
> #ulimit -n $((1 << 20))
> --- a/test/zdtm/static/env00.c
> +++ b/test/zdtm/static/env00.c
> @@ -25,14 +25,13 @@ int main(int argc, char **argv)
>  	for (i = 0; i < 100; i++) {
>  		pid = fork();
>  		if (pid == 0) {
> +			dup2(1, 10000);
>  			while (1)
>  				sleep(1);
>  			return 0;
>  		}
>  	}
>
> -	dup2(1, 1000000);
> -
>  	test_daemon();
>  	test_waitsig();
>
> Before patchset:
> before: memory.kmem.usage_in_bytes = 24076288
> after: memory.kmem.usage_in_bytes = 921894912 (3829.06%)
>
> After patchset:
> before: memory.kmem.usage_in_bytes = 23855104
> after: memory.kmem.usage_in_bytes = 39591936 (165.97%)
>
> Someone may rework this later and implement a new engine that uses holes in task fd numbers.
> I'm not against that, everybody is welcome! I'm just not sure it will take little time to
> write, stabilize, etc. And I don't think this is the first-priority problem in criu,
> which is why I'm not going to solve it now.
I think we can solve this problem in the context of these changes.
We can create service descriptors with a small base in criu, then
fork all processes, and then move the service descriptors to a
calculated base for each process. It is not optimal, but it will work.
It can be optimized if we predict the case when a child process has a
smaller service fd base...
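
A rough sketch of the "move after fork" step, only to make the idea concrete. This is not CRIU's actual code: move_service_fds() is a hypothetical helper, and it assumes the service descriptors occupy a contiguous range starting at a base and that the target range is otherwise unused.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: move nr descriptors from
 * [old_base, old_base + nr) to [new_base, new_base + nr).
 * Returns 0 on success, -1 on error (errno is set by dup2()).
 */
int move_service_fds(int old_base, int new_base, int nr)
{
	int i, start = 0, step = 1;

	assert(old_base >= 0 && new_base >= 0 && nr >= 0);
	if (old_base == new_base)
		return 0;

	/*
	 * Same trick as memmove(): when moving upwards, start from the
	 * top so that overlapping ranges do not clobber a source
	 * descriptor that has not been moved yet.
	 */
	if (new_base > old_base) {
		start = nr - 1;
		step = -1;
	}

	for (i = start; i >= 0 && i < nr; i += step) {
		int from = old_base + i, to = new_base + i;

		if (fcntl(from, F_GETFD) < 0)
			continue;	/* this slot is not in use */
		/*
		 * dup2() silently closes whatever is open at "to" and
		 * does not copy FD_CLOEXEC, so a real implementation
		 * would have to check the target and restore flags.
		 */
		if (dup2(from, to) < 0)
			return -1;
		close(from);
	}
	return 0;
}
```

In this scheme each child would call something like move_service_fds(small_base, own_base, nr) right after fork(), so the large fd table is allocated only in the tasks that really need a high base.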
>
> Also, keep in mind that after the rework has happened, you will still meet the situation where
> a task has no spare fds (1, ..., pow(2,n)*128 are all occupied), and the spare
> scheme will still require more memory than before the dump. It is just one more boundary case
> like the one you pointed out in your message.
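
The boundary case can be illustrated with a small calculation. The sketch below assumes, based only on the "pow(2,n)*128" expression above, that the fd table size is always 128 * 2^n entries; the function names and the placement rule (service fds at the top of the table) are hypothetical and are not taken from the patchset.

```c
#include <assert.h>

/* Assumption: table granularity taken from "pow(2,n)*128" above. */
#define FDT_MIN_SIZE 128UL

/* Smallest 128 * 2^n that can hold descriptor numbers 0..max_fd. */
unsigned long fdtable_size_for(unsigned long max_fd)
{
	unsigned long size = FDT_MIN_SIZE;

	while (size <= max_fd)
		size *= 2;
	return size;
}

/*
 * Hypothetical placement: put nr_sfds service fds at the top of the
 * smallest table that holds both the task's fds and the service fds.
 * Returns the number of the first service fd.
 */
unsigned long service_fd_start(unsigned long max_task_fd, unsigned long nr_sfds)
{
	assert(nr_sfds > 0);
	return fdtable_size_for(max_task_fd + nr_sfds) - nr_sfds;
}

/* Non-zero if the service fds force a bigger table than the task alone. */
int sfds_grow_table(unsigned long max_task_fd, unsigned long nr_sfds)
{
	return fdtable_size_for(max_task_fd + nr_sfds) >
	       fdtable_size_for(max_task_fd);
}
```

Under these assumptions, a task whose highest fd is 10000 needs a 16384-entry table anyway, so 16 service fds fit at 16368..16383 at no extra cost. A task that occupies every fd up to 127 has no spare slot, and any service fd forces the table to double to 256 entries, which is the boundary case described above.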
>
> Just analyse the files of tasks on your linux system and find out how the patchset will help
> to reduce memory usage on restore in the generic case.
>
> Thanks,
> Kirill