[CRIU] PID mismatch problem
Pavel Emelyanov
xemul at parallels.com
Thu Dec 17 04:35:56 PST 2015
On 12/17/2015 03:22 PM, Federico Reghenzani wrote:
> Hi Pavel,
> thank you for your answer. I notice today that "sometimes" it works, so the problem is intermittent.
I see. Then it's likely some race with the rest of the system.
> > / 5485: Error (cr-restore.c:1262): 5488 exited, status=1/
> > /Killed/
> >
> >
> > other times tells me:
> >
> > / 5485: Error (tty.c:531): tty: Unable to open dev/ptmx with specified index 0/
> > / 5485: Error (tty.c:917): tty: Can't open a (index 0): Bad file descriptor/
>
> Did you use the -j option on dump? If so, then it's likely the lack of
> same -j option for restore.
>
>
> I'm neither using -j for dump nor for restore. I tried also adding that but it seems it changes
> nothing. (I'm using the C API, so the criu_set_shell_job)
Hm... Then it's some tty issue. Cyrill, would you help us, please?
BTW, Federico, feel free to file an issue about it on github.com/xemul/criu
if you find it easier to track issue progress out there.
> > /Error (files-reg.c:445): `- XFail [/dev/shm/open_mpi.0000.cr.1.ghost] ghost: No such file or directory/
>
> MPI? Are you trying to C/R mpi jobs?
>
>
> Yes, we are trying to add to Open MPI the capability to migrate orted daemons between nodes. Currently
> we do not checkpoint the single mpi process, but the entire daemon with its children.
OK :) Then I have two suggestions.
First is -- for reliable C/R it's better to start the tasks you plan
to migrate in the namespaces from the very beginning to avoid using
the --unshare option on restore.
The second is -- we have the --restore-sibling option in CRIU that creates
a more natural process tree. Like this:
when you restore w/o the option the ps tree would look like
<your application>
 `- <criu restore process>
     `- <the restored task>
         `- <the kids of restored task>
so when the criu process exits the whole restored tree would get
re-parented to init
with the --restore-sibling option the result would be
<your application>
 |- <criu restore process>
 `- <the restored task>
     `- <the kids of restored task>
i.e. -- the restored tree would become a child of your process from the
very beginning.
> > /Error (cr-restore.c:1995): Restoring FAILED./
> >
> >
> > Note that in the first case I have no active process with that PID, and all other processes have PID under 1000.
>
> Hm... If it's so, can you strace the restore with -f option so that we
> could check where the "bad" process comes from?
>
>
> I'll try this option and the --unshare option in next days (probably on Monday), and I let you know.
OK, but as I said above -- the --unshare option is not recommended for
everyday live migration usage. It's rather a helper to fix the restore in
case the dump was taken in unfortunate conditions. Starting the tasks you
want to migrate in namespaces is better.
-- Pavel