[CRIU] [PATCH v2 12/14] files: Make tasks set their own service_fd_base

Kirill Tkhai ktkhai at virtuozzo.com
Sat Dec 30 11:36:39 MSK 2017


On 29.12.2017 21:18, Andrei Vagin wrote:
> On Fri, Dec 29, 2017 at 11:58:25AM +0300, Kirill Tkhai wrote:
>> On 28.12.2017 22:00, Andrei Vagin wrote:
>>> On Thu, Dec 28, 2017 at 04:10:29PM +0300, Kirill Tkhai wrote:
>>>> Currently, we set rlim(RLIMIT_NOFILE) unlimited
>>>> and service_fd_rlim_cur to place service fds.
>>>> This leads to a signify problem: every task uses
>>>> the biggest possible files_struct in kernel, and
>>>> it consumes excess memory after restore
>>>> in comparation to dump. In some situations this
>>>> may end in restore fail as there is no enough
>>>> memory in memory cgroup of on node.
>>>>
>>>> The patch fixes the problem by introducing
>>>> task-measured service_fd_base. It's calculated
>>>> in dependence of max used file fd and is placed
>>>> near the right border of kernel-allocated memory
>>>> hunk for task's fds (see alloc_fdtable() for
>>>> details). This reduces kernel-allocated files_struct
>>>> to 512 fds for the most process in standard linux
>>>> system (I've analysed the processes in my work system).
>>>>
>>>> Also, since the "standard processes" will have the same
>>>> service_fd_base, clone_service_fd() won't have to
>>>> actualy dup() their service fds for them like we
>>>> have at the moment. This is the one of reasons why
>>>> we still keep service fds as a range of fds,
>>>> and do not try to use unused holes in task fds.
>>>>
>>>> Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
>>>>
>>>> v2: Add a handle for very big fd numbers near service_fd_rlim_cur.
>>>> ---
>>>>  criu/util.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>>>  1 file changed, 53 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/criu/util.c b/criu/util.c
>>>> index d45bb7d61..8a25ab2cf 100644
>>>> --- a/criu/util.c
>>>> +++ b/criu/util.c
>>>> @@ -527,7 +527,7 @@ int close_service_fd(enum sfd_type type)
>>>>  static void move_service_fd(struct pstree_item *me, int type, int new_id, int new_base)
>>>>  {
>>>>  	int old = get_service_fd(type);
>>>> -	int new = __get_service_fd(type, new_id);
>>>> +	int new = new_base - type - SERVICE_FD_MAX * new_id;
>>>>  	int ret;
>>>>  
>>>>  	if (old < 0)
>>>> @@ -540,24 +540,73 @@ static void move_service_fd(struct pstree_item *me, int type, int new_id, int ne
>>>>  		close(old);
>>>>  }
>>>>  
>>>> +static int choose_service_fd_base(struct pstree_item *me)
>>>> +{
>>>> +	int nr, real_nr, fdt_nr = 1, id = rsti(me)->service_fd_id;
>>>> +
>>>> +	if (rsti(me)->fdt) {
>>>> +		/* The base is set by owner of fdt (id 0) */
>>>> +		if (id != 0)
>>>> +			return service_fd_base;
>>>> +		fdt_nr = rsti(me)->fdt->nr;
>>>> +	}
>>>> +	/* Now find process's max used fd number */
>>>> +	if (!list_empty(&rsti(me)->fds))
>>>> +		nr = list_entry(rsti(me)->fds.prev,
>>>> +				struct fdinfo_list_entry, ps_list)->fe->fd;
>>>> +	else
>>>> +		nr = -1;
>>>> +
>>>> +	nr = max(nr, inh_fd_max);
>>>> +	/*
>>>> +	 * Service fds go after max fd near right border of alignment:
>>>> +	 *
>>>> +	 * ...|max_fd|max_fd+1|...|sfd first|...|sfd last (aligned)|
>>>> +	 *
>>>> +	 * So, they take maximum numbers of area allocated by kernel.
>>>> +	 * See linux alloc_fdtable() for details.
>>>> +	 */
>>>> +	nr += (SERVICE_FD_MAX - SERVICE_FD_MIN) * fdt_nr;
>>>> +	nr += 16; /* Safety pad */
>>>> +	real_nr = nr;
>>>> +
>>>> +	nr /= (1024 / sizeof(void *));
>>>> +	nr = 1 << (32 - __builtin_clz(nr + 1));
>>>> +	nr *= (1024 / sizeof(void *));
>>>> +
>>>> +	if (nr > service_fd_rlim_cur) {
>>>> +		/* Right border is bigger, than rlim. OK, then just aligned value is enough */
>>>> +		nr = round_down(service_fd_rlim_cur, (1024 / sizeof(void *)));
>>>> +		if (nr < real_nr) {
>>>> +			pr_err("Can't chose service_fd_base: %d %d\n", nr, real_nr);
>>>> +			return -1;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	return nr;
>>>> +}
>>>> +
>>>>  int clone_service_fd(struct pstree_item *me)
>>>>  {
>>>>  	int id, new_base, i, ret = -1;
>>>>  
>>>> -	new_base = service_fd_base;
>>>> +	new_base = choose_service_fd_base(me);
>>>>  	id = rsti(me)->service_fd_id;
>>>>  
>>>> -	if (service_fd_id == id)
>>>> +	if (new_base == -1)
>>>> +		return -1;
>>>> +	if (service_fd_base == new_base && service_fd_id == id)
>>>>  		return 0;
>>>
>>> For the root process, service_fd_base is changed 1048576 -> 256 and
>>> there is one service fd, so for the root process fdtable will be
>>> bigger than it has to be. Actually we have a problem each time when
>>> a child process has a smaller fdtable than a parent process.
>>
>> It's because of process's fds are read in root_prepare_shared()->prepare_fd_pid().
>> This is the reason, why I moved setup_newborn_fds() in "files: Do setup_newborn_fds() later".
>> It seems this was made to finish CR_STATE_PREPARE_NAMESPACES stage faster (it's being finished
>> before root_prepare_shared()), because it's important
>>
>>  
>>> And here is a test which shows this problem with more collors
>>> t a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
>>> index 1feabfa9f..2505cb54c 100644
>>> --- a/test/zdtm/static/env00.c
>>> +++ b/test/zdtm/static/env00.c
>>> @@ -1,6 +1,7 @@
>>>  #include <errno.h>
>>>  #include <stdlib.h>
>>>  #include <string.h>
>>> +#include <sched.h>
>>>  
>>>  #include "zdtmtst.h"
>>>  
>>> @@ -13,9 +14,22 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>>>  int main(int argc, char **argv)
>>>  {
>>>         char *env;
>>> +       int i;
>>> +       pid_t pid;
>>>  
>>>         test_init(argc, argv);
>>>  
>>> +       for (i = 0; i < 10; i++) {
>>> +               pid = fork();
>>> +               if (pid == 0) {
>>> +                       while (1)
>>> +                               sleep(1);
>>> +                       return 0;
>>> +               }
>>> +       }
>>> +
>>> +       dup2(1, 1000000);
>>> +
>>>         if (setenv(envname, test_author, 1)) {
>>>                 pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>>>                 exit(1);
>>>
>>> before dump:
>>> [root at fc24 criu]# echo  $((`cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes` >> 10))
>>> 12276
>>>
>>> after restore:
>>> [root at fc24 criu]# echo  $((`cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes` >> 10))
>>> 108432
>>
>> It's not not obligatory connected with sfds. It's very difficult to measure the memory
>> connected with file table only. See below:
>>
>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
>> index 1feabfa9f..d2517b835 100644
>> --- a/test/zdtm/static/env00.c
>> +++ b/test/zdtm/static/env00.c
>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>>  
>>  int main(int argc, char **argv)
>>  {
>> +	int i;
>> +	pid_t pid;
>>  	char *env;
>>  
>>  	test_init(argc, argv);
>>  
>> +	for (i = 0; i < 10; i++) {
>> +		pid = fork();
>> +		if (pid == 0) {
>> +			while (1)
>> +				sleep(1);
>> +			return 0;
>> +		}
>> +	}
>> +
>> +//	dup2(1, 1000000);
>> +
>>  	if (setenv(envname, test_author, 1)) {
>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>>  		exit(1);
>>
>> before dump:
>> root at pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>> 11849728
>>
>> after restore:
>> root at pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>> 20877312
>>
>> Criu restore changes many in-kernel structures, and this is a global problem, not
>> about sfds.
> 
> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
> be 100MB. This is not a global problem, it is the problem that we move
> service descriptors in child processes.

Yes, but it's just the consequent of the service fds design in general, and this
is not result of this patchset.
 
>>
>>>>  
>>>>  	/* Dup sfds in memmove() style: they may overlap */
>>>> -	if (get_service_fd(LOG_FD_OFF) > __get_service_fd(LOG_FD_OFF, id))
>>>> +	if (get_service_fd(LOG_FD_OFF) > new_base - LOG_FD_OFF - SERVICE_FD_MAX * id)
>>>>  		for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
>>>>  			move_service_fd(me, i, id, new_base);
>>>>  	else
>>>>  		for (i = SERVICE_FD_MAX - 1; i > SERVICE_FD_MIN; i--)
>>>>  			move_service_fd(me, i, id, new_base);
>>>>  
>>>> +	service_fd_base = new_base;
>>>>  	service_fd_id = id;
>>>>  	ret = 0;
>>
>> Kirill



More information about the CRIU mailing list