[CRIU] [PATCH 4/4] restore: Fix hang if root task is waiting on zombie

Andrey Vagin avagin at gmail.com
Tue Dec 11 09:51:12 MSK 2018


On Fri, Dec 07, 2018 at 02:57:12PM +0300, Cyrill Gorcunov wrote:
> When we're waiting for a zombie we might be the last task,
> and the others have already altered the nr_in_progress futex.
> In this situation we should rather use nanosleep or similar
> to relieve syscall pressure. Otherwise we can simply get stuck
> in the restore procedure, waiting for @nr_in_progress forever.

I don't understand this description. Please provide the sequence of
actions that leads to getting stuck at restore.

Thanks,
Andrei

> 
> Signed-off-by: Cyrill Gorcunov <gorcunov at gmail.com>
> ---
>  criu/pie/restorer.c | 36 +++++++++++++++++++++++++++++++-----
>  1 file changed, 31 insertions(+), 5 deletions(-)
> 
> diff --git a/criu/pie/restorer.c b/criu/pie/restorer.c
> index d3b459c6c8b1..b4c0eecd44cc 100644
> --- a/criu/pie/restorer.c
> +++ b/criu/pie/restorer.c
> @@ -1162,6 +1162,13 @@ static int wait_helpers(struct task_restore_args *task_args)
>  
>  static int wait_zombies(struct task_restore_args *task_args)
>  {
> +	static const uint32_t step_ms = 100;
> +	static const struct timespec req = {
> +		.tv_nsec        = step_ms * 1000000,
> +		.tv_sec         = 0,
> +	};
> +	struct timespec rem;
> +	uint32_t nr_prints = 10;
>  	int i;
>  
>  	for (i = 0; i < task_args->zombies_n; i++) {
> @@ -1171,12 +1178,31 @@ static int wait_zombies(struct task_restore_args *task_args)
>  
>  		ret = sys_waitid(P_PID, task_args->zombies[i], NULL, WNOWAIT | WEXITED, NULL);
>  		if (ret == -ECHILD) {
> -			/* A process isn't reparented to this task yet.
> -			 * Let's wait when someone complete this stage
> -			 * and try again.
> +			/*
> +			 * A zombie is not reparented to us yet, so we need to
> +			 * wait for it. If there are some tasks left, we can
> +			 * use @nr_in_progress to avoid calling waitid too often.
> +			 * But if we're the root process, @nr_in_progress
> +			 * won't be altered, so we use nanosleep with @step_ms
> +			 * to relieve syscall pressure.
>  			 */
> -			futex_wait_while_eq(&task_entries_local->nr_in_progress,
> -								nr_in_progress);
> +			if (nr_in_progress > 1) {
> +				futex_wait_while_eq(&task_entries_local->nr_in_progress,
> +						    nr_in_progress);
> +			} else {
> +				if (nr_prints)
> +					pr_debug("wait_zombies %d (nanosleep %ld %ld)\n",
> +						 task_args->zombies[i], req.tv_sec, req.tv_nsec);
> +				ret = sys_nanosleep((struct timespec *)&req, &rem);
> +				if (ret == -EINTR) {
> +					if (nr_prints)
> +						pr_debug("\twait_zombies %d (nanosleep %ld %ld)\n",
> +							 task_args->zombies[i], rem.tv_sec, rem.tv_nsec);
> +					sys_nanosleep((struct timespec *)&rem, NULL);
> +				}
> +				if (nr_prints)
> +					nr_prints--;
> +			}
>  			i--;
>  			continue;
>  		}
> -- 
> 2.17.2
> 
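
For reference, the fallback path in the hunk above (retry waitid on -ECHILD,
sleeping in between and resuming the sleep when a signal interrupts it) can be
sketched outside of the restorer roughly as follows. This is only a userspace
illustration under stated assumptions: it uses the libc nanosleep() and
waitid() instead of the PIE sys_nanosleep() and sys_waitid() wrappers,
sleep_step_ms() and poll_zombie() are hypothetical names that do not exist in
CRIU, and the max_tries bound is an addition that the patch does not have.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/*
 * Hypothetical helper: sleep for step_ms milliseconds. If a signal
 * interrupts the sleep, continue with the remaining time.
 * Returns 0 on success, -1 on any error other than EINTR.
 */
static int sleep_step_ms(unsigned int step_ms)
{
	struct timespec req = {
		.tv_sec  = step_ms / 1000,
		.tv_nsec = (long)(step_ms % 1000) * 1000000L,
	};
	struct timespec rem;

	while (nanosleep(&req, &rem) == -1) {
		if (errno != EINTR)
			return -1;
		req = rem;	/* resume with what is left */
	}
	return 0;
}

/*
 * Hypothetical helper: poll until @pid is an exited child of the caller.
 * WNOWAIT leaves the zombie in place, as in wait_zombies().
 * Returns 0 once the zombie is visible, -1 on an error other than
 * ECHILD or after max_tries attempts (assumed bound, not in the patch).
 */
static int poll_zombie(pid_t pid, unsigned int step_ms, unsigned int max_tries)
{
	siginfo_t info;

	while (max_tries--) {
		if (waitid(P_PID, pid, &info, WNOWAIT | WEXITED) == 0)
			return 0;
		if (errno != ECHILD)
			return -1;
		/* Not our child (yet): back off instead of spinning. */
		if (sleep_step_ms(step_ms))
			return -1;
	}
	return -1;
}
```

The sketch only covers the polling branch. It says nothing about which
sequence of events leaves @nr_in_progress unchanged while the zombie is still
not reparented, which is the part of the description that needs explaining.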

