[CRIU] [PATCH v2 12/14] files: Make tasks set their own service_fd_base
Andrei Vagin
avagin at virtuozzo.com
Tue Jan 9 19:30:33 MSK 2018
On Tue, Jan 09, 2018 at 02:54:31PM +0300, Kirill Tkhai wrote:
> On 30.12.2017 19:48, Andrei Vagin wrote:
> > On Sat, Dec 30, 2017 at 12:22:32PM +0300, Kirill Tkhai wrote:
> >> On 29.12.2017 22:59, Andrei Vagin wrote:
> >>>>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
> >>>>> index 1feabfa9f..d2517b835 100644
> >>>>> --- a/test/zdtm/static/env00.c
> >>>>> +++ b/test/zdtm/static/env00.c
> >>>>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
> >>>>>
> >>>>> int main(int argc, char **argv)
> >>>>> {
> >>>>> + int i;
> >>>>> + pid_t pid;
> >>>>> char *env;
> >>>>>
> >>>>> test_init(argc, argv);
> >>>>>
> >>>>> + for (i = 0; i < 10; i++) {
> >>>>> + pid = fork();
> >>>>> + if (pid == 0) {
> >>>>> + while (1)
> >>>>> + sleep(1);
> >>>>> + return 0;
> >>>>> + }
> >>>>> + }
> >>>>> +
> >>>>> +// dup2(1, 1000000);
> >>>>> +
> >>>>> if (setenv(envname, test_author, 1)) {
> >>>>> pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
> >>>>> exit(1);
> >>>>>
> >>>>> before dump:
> >>>>> root at pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>>>> 11849728
> >>>>>
> >>>>> after restore:
> >>>>> root at pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>>>> 20877312
> >>>>>
> >>>>> CRIU restore changes many in-kernel structures, so this is a global problem,
> >>>>> not one specific to sfds.
> >>>>
> >>>> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
> >>>> be 100MB. This is not a global problem; it is caused by the fact that we
> >>>> move service descriptors in child processes.
> >>>
> >>> I created a small patch for zdtm.py to get kmem before and after c/r:
> >>> git clone https://github.com/avagin/criu -b 2195
> >>>
> >>> And here is how I run my experiments:
> >>>
> >>> [root at fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> >>> [root at fc24 criu]# ulimit -n 1024
> >>> [root at fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> >>> === Run 1/1 ================ zdtm/static/env00
> >>>
> >>> ========================== Run zdtm/static/env00 in h ==========================
> >>> Start test
> >>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> >>> memory.kmem.usage_in_bytes = 9920512
> >>> Pause at pre-dump. Press any key to continue.
> >>> Run criu dump
> >>>
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Pause at pre-restore. Press any key to continue.
> >>> Run criu restore
> >>> Pause at post-restore. Press any key to continue.
> >>> memory.kmem.usage_in_bytes = 22913024
> >>> before: memory.kmem.usage_in_bytes = 9920512
> >>> after: memory.kmem.usage_in_bytes = 22913024 (230.97%)
> >>> Send the 15 signal to 36
> >>> Wait for zdtm/static/env00(36) to die for 0.100000
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Removing dump/zdtm/static/env00/36
> >>> ========================= Test zdtm/static/env00 PASS ==========================
> >>>
> >>> [root at fc24 criu]#
> >>> [root at fc24 criu]# ulimit -n $((1 << 20))
> >>> [root at fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> >>> [root at fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> >>> === Run 1/1 ================ zdtm/static/env00
> >>>
> >>> ========================== Run zdtm/static/env00 in h ==========================
> >>> Start test
> >>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> >>> memory.kmem.usage_in_bytes = 19714048
> >>> Pause at pre-dump. Press any key to continue.
> >>> Run criu dump
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Pause at pre-restore. Press any key to continue.
> >>> Run criu restore
> >>> Pause at post-restore. Press any key to continue.
> >>> memory.kmem.usage_in_bytes = 914178048
> >>> before: memory.kmem.usage_in_bytes = 19714048
> >>> after: memory.kmem.usage_in_bytes = 914178048 (4637.19%)
> >>> Send the 15 signal to 36
> >>> Wait for zdtm/static/env00(36) to die for 0.100000
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Removing dump/zdtm/static/env00/36
> >>> ========================= Test zdtm/static/env00 PASS ==========================
> >>>
> >>> When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
> >>> the test has only descriptors with small numbers. In this case,
> >>> kmem consumption increases by 130%. We need to investigate this fact,
> >>> but it isn't so critical if we compare it with the next case.
> >>>
> >>> When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
> >>> creates descriptor 1000000, and kmem consumption after c/r increases by
> >>> 4537.19%. This is a real problem, which we have to solve with high priority.
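
For illustration only (not part of the test or of CRIU), a minimal standalone
sketch of the effect being measured: dup2() to a huge fd number fails while
RLIMIT_NOFILE is low, and once the limit is raised the same call forces the
kernel to grow this task's fd table up to that number:

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	struct rlimit rl = { 1 << 20, 1 << 20 };

	/* With the default ulimit -n 1024 this fails with EBADF,
	 * because 1000000 is above RLIMIT_NOFILE. */
	if (dup2(1, 1000000) < 0)
		perror("dup2 with a low RLIMIT_NOFILE");

	/* Raising the hard limit needs root, as in the logs above. */
	if (setrlimit(RLIMIT_NOFILE, &rl) < 0)
		perror("setrlimit");

	/* Now the kernel has to grow this task's fd table up to slot
	 * 1000000, which is where the kmem growth comes from. */
	if (dup2(1, 1000000) < 0)
		perror("dup2 after raising RLIMIT_NOFILE");
	else
		printf("fd 1000000 created\n");
	return 0;
}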
> >>
> >> It's not a problem of my patchset; it's a problem of the current engine. My
> >> patchset does not aim to tear the engine apart and rework everything. It just
> >> improves the engine, and it works better than the current engine in many cases
> >> and not worse in the rest. Yes, there are cases when the patchset works like
> >> the current engine, but there is no case when it behaves worse. It's strange
> >> to ask me to solve all the file or memory problems at once, isn't it?!
> >
> > I like this work. Thank you for it. But I see nothing strange in
> > discussing the other problems of this code and trying to solve them now
> > or later.
> >
> >>
> >> The goals of the patchset are:
> >> 1) Disconnect prlimit(RLIMIT_NOFILE) from service fds placement,
> >> i.e. make the placement depend on the task's fds, not on prlimit()
> >> (see the sketch below).
> >> 2) Let people restore big fd numbers without increasing memory usage
> >> of all tasks in a container (as currently this requires increasing
> >> global limits).
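
As a rough sketch of goal 1 (assumed logic for illustration, not the actual
patchset code): derive a per-task service fd base from the largest fd the task
actually uses, instead of from RLIMIT_NOFILE:

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Find the largest fd currently open in this task by scanning
 * /proc/self/fd (each entry's name is a decimal fd number). */
static int max_used_fd(void)
{
	DIR *d = opendir("/proc/self/fd");
	struct dirent *de;
	int max = 2;

	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		int fd = atoi(de->d_name);
		if (fd > max)
			max = fd;
	}
	closedir(d);
	return max;
}

int main(void)
{
	/* Hypothetical per-task base: just above the task's own fds,
	 * instead of being derived from RLIMIT_NOFILE. */
	printf("per-task service fd base could start at %d\n",
	       max_used_fd() + 1);
	return 0;
}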
> >>
> >> #ulimit -n $((1 << 20))
> >> --- a/test/zdtm/static/env00.c
> >> +++ b/test/zdtm/static/env00.c
> >> @@ -25,14 +25,13 @@ int main(int argc, char **argv)
> >> for (i = 0; i < 100; i++) {
> >> pid = fork();
> >> if (pid == 0) {
> >> + dup2(1, 10000);
> >> while (1)
> >> sleep(1);
> >> return 0;
> >> }
> >> }
> >>
> >> - dup2(1, 1000000);
> >> -
> >> test_daemon();
> >> test_waitsig();
> >>
> >> Before patchset:
> >> before: memory.kmem.usage_in_bytes = 24076288
> >> after: memory.kmem.usage_in_bytes = 921894912 (3829.06%)
> >>
> >> After patchset:
> >> before: memory.kmem.usage_in_bytes = 23855104
> >> after: memory.kmem.usage_in_bytes = 39591936 (165.97%)
> >>
> >> Someone may rework this later and implement a new engine using holes in task fd
> >> numbers. I'm not against that; everybody is welcome! I'm just not sure this will
> >> take little time for writing/stabilization/etc. And I don't think this is the
> >> first-priority problem in criu, which is why I'm not going to solve it now.
> >
> > I think we can solve this problem in the context of these changes.
> >
> > We can create service descriptors with a small base in criu, then we can
> > fork all processes, and then we can move the service descriptors to a
> > calculated base for each process. It is not optimal, but it will work.
> >
> > It can be optimized if we predict the case when a child process has a
> > smaller service fd base...
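
For reference, a minimal sketch of the relocation step this describes, using
fcntl(F_DUPFD), which duplicates onto the lowest free fd at or above the given
base (illustration only, not the CRIU implementation):

#include <fcntl.h>
#include <unistd.h>

/* Move a service fd from a low temporary slot to a per-task base
 * computed after fork(); returns the new fd or -1 on error. */
static int move_service_fd(int old_fd, int new_base)
{
	int new_fd = fcntl(old_fd, F_DUPFD, new_base);

	if (new_fd < 0)
		return -1;
	close(old_fd);
	return new_fd;
}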
>
> I've tried this one, but it does not look good. Tasks with a shared fd table still
> require that the fds relocation be made before forking of children. Helpers are
> affected there too. Service fds relocation and closing of the old files break in
> several cases, which (I assume) won't be easy to maintain.
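
To make the shared fd table point concrete, a small illustration (sketch only):
with CLONE_FILES two tasks share one fd table, so relocating or closing a
service fd after forking in one task changes it for the other as well, which is
why the relocation has to happen before the children are created:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>

static int child(void *arg)
{
	(void)arg;
	sleep(1);
	/* fd 300 was created by the parent after clone(); because the fd
	 * table is shared, it is visible (and closable) here too. */
	if (fcntl(300, F_GETFD) >= 0)
		printf("child sees fd 300 from the shared table\n");
	return 0;
}

int main(void)
{
	char stack[64 * 1024];	/* stack grows down on x86 */
	pid_t pid = clone(child, stack + sizeof(stack),
			  CLONE_FILES | SIGCHLD, NULL);

	if (pid < 0)
		return 1;
	dup2(1, 300);		/* appears in the child's table as well */
	waitpid(pid, NULL, 0);
	return 0;
}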
>
> In my opinion a parent with big fds is an unlikely case, and we may skip it for a
> while. We may get more benefit if we think about setting up a lower
> service_fds_base before forking of root_item, as that always happens.
>
> What do you think about this?
Ok, let's postpone this task. If you have time, could you describe the
current state of service fds on the CRIU wiki, with all known issues?
Thanks,
Andrei
>
> >>
> >> Also, keep in mind that after the rework has happened, you will still meet the
> >> situation when a task has no spare fds (fds 1, ..., pow(2,n)*128 are all
> >> occupied), and the spare scheme will still require more memory than before the
> >> dump. Just one more boundary case like the one you pointed out in your message.
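
For context, the pow(2,n)*128 above refers to how the kernel sizes fd tables:
on a 64-bit kernel the slot count is rounded up to 128 * 2^n entries, so a
single huge fd number inflates the whole table. A rough back-of-the-envelope
sketch:

#include <stdio.h>

/* Slot count a kernel would allocate for a table that must hold fd
 * numbers up to max_fd: rounded up to 128 * 2^n (8-byte 'struct file *'
 * pointers), as hinted by "pow(2,n)*128" above. */
static unsigned long fdtable_slots(unsigned long max_fd)
{
	unsigned long slots = 128;

	while (slots <= max_fd)
		slots *= 2;
	return slots;
}

int main(void)
{
	unsigned long slots = fdtable_slots(1000000);

	/* ~1M slots -> ~8 MB of pointers per fd table, before counting
	 * the open/close-on-exec bitmaps. */
	printf("fd 1000000 -> %lu slots, ~%lu KB of pointers\n",
	       slots, slots * 8 / 1024);
	return 0;
}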
> >>
> >> Just analyse the tasks' files on your Linux system and find out how the patchset
> >> will help to reduce memory usage on restore of the generic case.
> >>
> >> Thanks,
> >> Kirill