[CRIU] Error CRIU restore because pid not matched

Aris Setyawan aris.sety at gmail.com
Wed Dec 31 15:20:12 PST 2014


> P.S. I'd appreciate if any discussion about CRIU happens with the mailing
> list in Cc. My responsiveness throughput is limited :) but on the mailing
> list there are quite a lot of other people that can help.

Ok.

> [1] http://criu.org/When_C/R_fails

OK, I successfully restored it using the unshare command, after dumping it
(the test.sh process):

	criu dump -t 422439 --images-dir /tmp/test --shell-job
	unshare -p --fork --mount-proc criu restore --images-dir /tmp/test  --shell-job

The test.sh process now runs as a child process (this is because we
cannot run restore with the -d option):

	$ ps ax | grep test
	 422516 pts/0    S      0:00 /path/unshare -p --fork --mount-proc /path/criu restore --images-dir /tmp/test --shell-job
	 422517 pts/0    S      0:00 /path/criu restore --images-dir /tmp/test --shell-job
	 422518 pts/0    S      0:15 /bin/sh ./test.sh
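
The PIDs in the listing above are the ones seen from the host PID
namespace. A minimal sketch for checking which PID a task has inside the
new namespace, assuming a kernel that exposes the NSpid field in
/proc/<pid>/status (Linux 4.1 or later). The helper name ns_pids is
hypothetical, not part of CRIU or util-linux:

```shell
#!/bin/sh
# ns_pids STATUS_FILE
# Print the PIDs of one task in each nested PID namespace, outermost
# first, as listed on the NSpid line of a /proc/<pid>/status file.
# Prints nothing if the file has no NSpid line (older kernels).
ns_pids() {
    awk '$1 == "NSpid:" {
        for (i = 2; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? " " : "\n")
    }' "$1"
}
```

For example, `ns_pids /proc/422518/status` run on the host would be
expected to print the host PID followed by the PID the task has inside
the unshared namespace.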

In my use case, I need to dump the test.sh process (and its state) again,
restore it again, let it do some computation, and then dump it
again.

Now a new problem comes up: how can I dump the process again?

-Aris

On 12/31/14, Pavel Emelyanov <xemul at parallels.com> wrote:
> On 12/31/2014 04:50 PM, Aris Setyawan wrote:
>> Hi,
>>
>> I still get many PID mismatches when the restored process has been
>> checkpointed for a long time (more than one hour). Please note that I
>> run this on a busy system, where many processes are started and killed
>> very often.
>>
>> About your suggestion, I still can understand:
>>
>>> How to prevent this?
>>> So it can't be fixed?
>>
>> In theory we can let process live with whatever PID kernel allocates
>> for it, but our knowledge of glibc says that most likely there will
>> be BUGs.
>>
>> One way to work around this is to unshare the pid namespace with
>> unshare -p, then call restore. But in this case you may suffer from
>> /proc being the proc from former pid namespace, not the new one. This,
>> in turn, can be solved by unsharing the mount namespace too and
>> re-mounting the /proc.
>>
>> The most viable solution for this type of use case is to checkpoint
>> and restore tasks living in namespaces from the very beginning, i.e.
>> start them in this or that form of container.
>>
>>> Btw, the error caused by "pid mismatch" still can occur. Is this an
>>> expected behavior?
>>
>> Yes, some possibility to re-use the PID still exists. On a running
>> system, doing C/R is only "safe" for containers.
>>
>>
>> My questions:
>> Is the PID mismatch error "guaranteed" to be impossible if I do C/R for
>> a container?
>
> Yes. When you C/R a whole container (even just a pid namespace) the "pid
> mismatch" error is guaranteed NOT to happen.
>
>> Is there any documentation about this?
>
> Not yet, but you've asked a great question :) I've created a wiki page [1]
> that will eventually get filled with typical C/R failures and descriptions
> of why they happen and what to do next.
>
> [1] http://criu.org/When_C/R_fails
>
> Thanks,
> Pavel
>
> P.S. I'd appreciate if any discussion about CRIU happens with the mailing
> list in Cc. My responsiveness throughput is limited :) but on the mailing
> list there are quite a lot of other people that can help.
>
>

