[CRIU] PID mismatch problem

Pavel Emelyanov xemul at parallels.com
Thu Dec 17 04:35:56 PST 2015


On 12/17/2015 03:22 PM, Federico Reghenzani wrote:
> Hi Pavel,
> thank you for your answer. I noticed today that it "sometimes" works, so the problem is intermittent.

I see. Then it's likely some race with the rest of the system.

>     >     /  5485: Error (cr-restore.c:1262): 5488 exited, status=1/
>     >     /Killed/
>     >
>     >
>     > other times it tells me:
>     >
>     >     /  5485: Error (tty.c:531): tty: Unable to open dev/ptmx with specified index 0/
>     >     /  5485: Error (tty.c:917): tty: Can't open a (index 0): Bad file descriptor/
> 
>     Did you use the -j option on dump? If so, then the likely cause is
>     that the same -j option is missing on restore.
> 
> 
> I'm using -j neither for dump nor for restore. I also tried adding it, but it seems to change
> nothing. (I'm using the C API, so that is criu_set_shell_job.)

Hm... Then it's some tty issue. Cyrill, would you help us, please?
BTW, Federico, feel free to file an issue about it on github.com/xemul/criu
if you find it easier to track issue progress out there.

>     >     /Error (files-reg.c:445):  `- XFail [/dev/shm/open_mpi.0000.cr.1.ghost] ghost: No such file or directory/
> 
>     MPI? Are you trying to C/R mpi jobs?
> 
> 
> Yes, we are trying to add to Open MPI the capability to migrate orted daemons between nodes. Currently 
> we do not checkpoint the single mpi process, but the entire daemon with its children.

OK :) Then I have two suggestions.

The first is -- for reliable C/R it's better to start the tasks you plan
to migrate inside namespaces from the very beginning, so that you don't
need the --unshare option on restore.

The second is -- we have a --restore-sibling option in CRIU that creates
a more natural process tree. Like this:

when you restore w/o the option, the ps tree would look like

<your application>
  `- <criu restore process>
        `- <the restored task>
             `- <the kids of restored task>

so when the criu process exits, the whole restored tree would get
reparented to init

with the --restore-sibling option the result would be

<your application>
  `- <criu restore process>
  `- <the restored task>
       `- <the kids of restored task>

i.e. -- the restored tree would become a child of your process from the
very beginning.
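
To make the reparenting in the first tree concrete, here is a minimal sketch
in plain C. It does not use CRIU at all: a "middle" process stands in for
criu restore, a "leaf" stands in for the restored task, and the helper name
ppid_after_middle_exit is made up for this illustration. It only shows that
once the middle process exits, the leaf's parent is no longer the process
that created it.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of CRIU.
 *
 * Forks a middle process, which forks a leaf and then exits. The leaf
 * waits until its parent has changed and reports the new parent PID.
 * Returns that PID, or -1 on error. *middle_pid receives the PID of the
 * middle process.
 */
pid_t ppid_after_middle_exit(pid_t *middle_pid)
{
    int fds[2];
    pid_t middle, new_ppid = -1;

    if (pipe(fds) < 0)
        return -1;

    middle = fork();
    if (middle < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (middle == 0) {
        /* Middle process: plays the role of "criu restore". */
        pid_t self = getpid();
        pid_t leaf = fork();

        if (leaf == 0) {
            /* Leaf: plays the role of the restored task. */
            struct timespec ms = { 0, 1000000 };   /* 1 millisecond */
            pid_t now;
            int i;

            close(fds[0]);
            /* Poll (up to about 5 s) until the middle process is gone. */
            for (i = 0; i < 5000 && getppid() == self; i++)
                nanosleep(&ms, NULL);
            now = getppid();
            if (write(fds[1], &now, sizeof(now)) != (ssize_t)sizeof(now))
                _exit(1);
            _exit(0);
        }
        /* Exit right away, leaving the leaf without its original parent. */
        _exit(leaf < 0 ? 1 : 0);
    }

    /* "Your application": reap the middle process, then ask the leaf. */
    close(fds[1]);
    *middle_pid = middle;
    waitpid(middle, NULL, 0);
    if (read(fds[0], &new_ppid, sizeof(new_ppid)) != (ssize_t)sizeof(new_ppid))
        new_ppid = -1;
    close(fds[0]);
    return new_ppid;
}
```

Calling ppid_after_middle_exit(&middle) from a test program should return a
PID different from middle: on Linux that is init, or the nearest process
marked as a child subreaper. In either case the caller can no longer
waitpid() on the leaf, unless the caller is itself that subreaper. That is
the practical difference from the second tree.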

>     >     /Error (cr-restore.c:1995): Restoring FAILED./
>     >
>     >
>     > Note that in the first case I have no active process with that PID, and all other processes have PID under 1000.
> 
>     Hm... If that's so, can you strace the restore with the -f option so
>     that we can check where the "bad" process comes from?
> 
> 
> I'll try this option and the --unshare option in the next few days (probably on Monday), and I'll let you know.

OK, but as I said above -- the --unshare option is not recommended for
everyday live migration usage. It's rather a helper to fix the restore in
case the dump was taken in unfortunate conditions. Starting the tasks you
want to migrate in namespaces is better.

-- Pavel


