[CRIU] [PATCH] restore: handle exit code of the unlock network script
Pavel Emelyanov
xemul at parallels.com
Tue Mar 25 06:06:53 PDT 2014
On 03/25/2014 12:41 PM, Andrew Vagin wrote:
> On Tue, Mar 25, 2014 at 02:27:33AM +0400, Pavel Emelyanov wrote:
>> On 03/24/2014 03:07 PM, Andrey Vagin wrote:
>>> When we migrate processes from one host to another, we need to know
>>> the moment when the processes can be killed on the source host.
>>> If a migration script is killed (segv, exception, etc.), the process tree
>>> must not stay alive on both nodes, and we also need to reduce the chance
>>> of the processes being killed on both.
>>
>> I didn't quite get why the existing scheme used by p.haul is flawed.
>> Can you draw a two-sided diagram of source-destination interaction
>> and show where the problem is and how you propose to solve it?
>
> source                          destination
> criu dump
>   post-dump
>                                 criu restore
>                                   network unlock
>                                   post-restore
> kill p.haul before receiving cr_rpc.RESTORE
>   resume
>
> In this case both hosts end up with live process trees...
>
> And I want to move post-restore before network_unlock, because we can't
> fail after unlocking the network.
OK, but this patch does something different.
>
>>
>>> In this patch I suggest checking the exit code of the network-unlock
>>> scripts and using the following scheme to restore processes:
>>>
>>> 1. The source host dumps processes and stops in the post-dump script.
>>> 2. Then it sends a signal to restore the process tree on the other side.
>>> 3. The destination host restores processes and stops in the
>>> network-unlock script.
>>> 4. Then it sends a signal to the source host to kill the original process
>>> tree.
>>> 5. At the final stage the destination host unlocks the network and resumes
>>> the migrated process tree.
>>>
>>> This leaves a small window between steps 4 and 5: if the migration
>>> script is killed at that moment, the process tree is killed on both
>>> nodes. In return, there is zero chance of leaving two copies of
>>> the process tree.
>>>
>>> Signed-off-by: Andrey Vagin <avagin at openvz.org>
>>> ---
>>> cr-restore.c | 4 +++-
>>> include/net.h | 2 +-
>>> net.c | 8 +++++---
>>> 3 files changed, 9 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/cr-restore.c b/cr-restore.c
>>> index b352daa..22396a2 100644
>>> --- a/cr-restore.c
>>> +++ b/cr-restore.c
>>> @@ -1517,7 +1517,9 @@ static int restore_root_task(struct pstree_item *init)
>>>  	}
>>>
>>>  	/* Unlock network before disabling repair mode on sockets */
>>> -	network_unlock();
>>> +	ret = network_unlock();
>>> +	if (ret < 0)
>>> +		goto out_kill;
>>>
>>>  	/*
>>>  	 * -------------------------------------------------------------
>>> diff --git a/include/net.h b/include/net.h
>>> index a9f0d46..a02f98a 100644
>>> --- a/include/net.h
>>> +++ b/include/net.h
>>> @@ -15,7 +15,7 @@ struct veth_pair {
>>> };
>>>
>>> extern int network_lock(void);
>>> -extern void network_unlock(void);
>>> +extern int network_unlock(void);
>>>
>>> extern struct ns_desc net_ns_desc;
>>>
>>> diff --git a/net.c b/net.c
>>> index ac66374..f7df130 100644
>>> --- a/net.c
>>> +++ b/net.c
>>> @@ -594,15 +594,17 @@ int network_lock(void)
>>>  	return run_scripts("network-lock");
>>>  }
>>>
>>> -void network_unlock(void)
>>> +int network_unlock(void)
>>>  {
>>>  	pr_info("Unlock network\n");
>>>
>>> +	if ((current_ns_mask & CLONE_NEWNET) && run_scripts("network-unlock"))
>>> +		return -1;
>>> +
>>>  	cpt_unlock_tcp_connections();
>>>  	rst_unlock_tcp_connections();
>>>
>>> -	if (current_ns_mask & CLONE_NEWNET)
>>> -		run_scripts("network-unlock");
>>> +	return 0;
>>>  }
>>>
>>>  int veth_pair_add(char *in, char *out)
>>>
>>
>>
> .
>