[CRIU] Error CRIU restore because pid not matched

Aris Setyawan aris.sety at gmail.com
Wed Dec 31 05:52:34 PST 2014


Hi,

I still get many PID mismatches when the process is restored a long time
(more than one hour) after it was checkpointed. Please note that I run
this on a busy system, where many processes are started and killed very
often.

Regarding your earlier suggestions:

> How to prevent this?
> So it can't be fixed?

In theory we can let the process live with whatever PID the kernel
allocates for it, but our knowledge of glibc says that most likely there
will be BUGs.

One way to work around this is to unshare the pid namespace with
unshare -p, then call restore. But in this case you may suffer from
/proc still being the proc of the former pid namespace, not the new one.
This, in turn, can be solved by unsharing the mount namespace too and
re-mounting /proc.
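
A minimal sketch of the workaround above, assuming a util-linux unshare
that supports --fork and --mount-proc. The function names and the DRY_RUN
variable are hypothetical helpers for illustration, not part of CRIU. By
default the sketch only prints the command, because the real call needs
root (or CAP_SYS_ADMIN) and an existing images directory.

```shell
#!/bin/sh
# Sketch only: restore_cmd, restore_in_new_pidns and DRY_RUN are
# hypothetical names used for illustration.

# Print the command line that would be run for images directory $1.
# (The printed form is not shell-quoted; paths with spaces need care.)
restore_cmd() {
    printf '%s\n' "unshare --pid --mount --fork --mount-proc criu restore -D $1"
}

restore_in_new_pidns() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only show what would be executed.
        restore_cmd "$1"
    else
        # unshare forks a child that becomes PID 1 of the new pid
        # namespace; --mount-proc mounts a fresh /proc inside the new
        # mount namespace; the child then execs criu.
        # No -d here: criu has to stay alive as PID 1 of the namespace.
        unshare --pid --mount --fork --mount-proc criu restore -D "$1"
    fi
}
```

Calling restore_in_new_pidns /path/to/images with DRY_RUN unset prints
the command; setting DRY_RUN=0 would actually run it. Whether a given
checkpoint restores cleanly this way depends on the dumped process tree.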

The most viable solution for this type of use case is to checkpoint
and restore tasks that live in namespaces from the very beginning, i.e.
start them in some form of container.

> Btw, the error caused by "pid mismatch" can still occur. Is this
> expected behavior?

Yes, some possibility of the PID being re-used still exists. On a running
system, doing C/R is only "safe" for containers.
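
To see who currently holds a wanted PID before a restore, something like
the following could help. This is only a sketch that relies on the Linux
/proc layout; pid_holder and pid_max are hypothetical helper names. The
check is racy by nature: a PID that looks free can be taken by another
process before the restore starts, so it narrows the problem down but
does not prevent it.

```shell
#!/bin/sh
# Sketch only: pid_holder and pid_max are hypothetical helper names.

# pid_holder PID
# Print the command name of the process that currently uses PID and
# return 0. Return non-zero if no such process is visible in /proc.
pid_holder() {
    cat "/proc/$1/comm" 2>/dev/null
}

# pid_max
# Print the current value of kernel.pid_max (readable without root).
pid_max() {
    cat /proc/sys/kernel/pid_max
}
```

For example, pid_holder 1813 prints the name of the task occupying PID
1813, if any, and pid_max shows the limit that sysctl -w kernel.pid_max
changes.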


My questions:
Is the PID mismatch error "guaranteed" to be impossible if I do C/R for a container?
Is there any documentation about this?

On 10/27/14, Aris Setyawan <aris.sety at gmail.com> wrote:
>> This is suspicious. On an inactive system this shouldn't happen at that
>> rate. Can you find out who gets the PID of your task?
>
> That is where the problem came from.
> I deployed CRIU on a busy system and use it to C/R a program that
> consumes ~100 MB of RAM.
>
> But I have fixed this problem.
> I increased the system PID limit, and the success rate of the
> restore command rose to more than 95%. It only fails sometimes.
>
> I use this command: sysctl -w kernel.pid_max=4194303
>
> Btw, the error caused by "pid mismatch" can still occur. Is this
> expected behavior?
>
>> Has the CRIU binary suid bit set?
> Yes.
>
>> Try to unshare a new shell into pid namespace, then run criu restore.
>> In this case PID1 will be assigned to criu process. And don't use the
>> -d option in this case, otherwise criu will exit after restore thus
>> causing all the pid namespace to die.
>
> I have fixed the problem by increasing the system PID limit.
>
> On 10/27/14, Pavel Emelyanov <xemul at parallels.com> wrote:
>> On 10/26/2014 01:55 PM, Aris Setyawan wrote:
>>> I have run a benchmark. The result is that around 50% of my restore
>>> commands failed with the error "pid not matched". That is a high
>>> number. Is this normal?
>>
>> This is suspicious. On an inactive system this shouldn't happen at that
>> rate. Can you find out who gets the PID of your task?
>>
>>> I ran the restore command as a non-root user.
>>
>> Has the CRIU binary suid bit set?
>>
>>>> One way to work around this is to unshare the pid namespace with
>>>> unshare -p, then call restore. But in this case you may suffer from
>>>> /proc still being the proc of the former pid namespace, not the new one.
>>>> This, in turn, can be solved by unsharing the mount namespace too and
>>>> re-mounting /proc.
>>>
>>> I have tried this, but it does not work.
>>> Every time I restore using "unshare -pid restore", the new PID is
>>> always "1". That, of course, is a mismatch.
>>
>> Try to unshare a new shell into pid namespace, then run criu restore.
>> In this case PID1 will be assigned to criu process. And don't use the
>> -d option in this case, otherwise criu will exit after restore thus
>> causing all the pid namespace to die.
>>
>>> On 10/25/14, Pavel Emelyanov <xemul at parallels.com> wrote:
>>>> On 10/24/2014 08:08 PM, Aris Setyawan wrote:
>>>>>> This means that pid 1813, with which the restored task wants to
>>>>>> live, is already busy.
>>>>>
>>>>> Does this mean that the pid is already used by another process?
>>>>
>>>> Yes
>>>>
>>>>>> This option was obsoleted some time ago, sorry. Now CRIU detects
>>>>>> and creates the namespaces itself.
>>>>>
>>>>> I have read the code, and it will "always" raise an error if a
>>>>> mismatch happens.
>>>>
>>>> Exactly.
>>>>
>>>>> How to prevent this?
>>>>> So it can't be fixed?
>>>>
>>>> In theory we can let the process live with whatever PID the kernel
>>>> allocates for it, but our knowledge of glibc says that most likely there
>>>> will be BUGs.
>>>>
>>>> One way to work around this is to unshare the pid namespace with
>>>> unshare -p, then call restore. But in this case you may suffer from
>>>> /proc still being the proc of the former pid namespace, not the new one.
>>>> This, in turn, can be solved by unsharing the mount namespace too and
>>>> re-mounting /proc.
>>>>
>>>> The most viable solution for this type of use case is to checkpoint
>>>> and restore tasks that live in namespaces from the very beginning, i.e.
>>>> start them in some form of container.
>>>>
>>>>
>>>> Thanks,
>>>> Pavel

