[CRIU] [PATCH] Make restored processes inherit file descriptors from criu

Tue Dec 2 00:33:41 PST 2014

On 12/01/2014 08:15 PM, Saied Kazemi wrote:

>> Actually, after more thinking I came to the conclusion that with pipes the situation
>> may be even more confusing. If for some reason the process we restore holds one
>> pipe with two "ends" -- reading and writing -- and we say that this pipe should
>> be inherited from criu, we make this process put one of criu's pipe ends into
>> both of its descriptors. Thus one of them will end up in the wrong read/write mode.
> 
> I would go back to the use case.  We're not making a broad statement
> that with inherit fd one can replace an arbitrary descriptor at
> checkpoint time with another arbitrary descriptor at restore time for
> the entire process tree and expect everything to work.  What we
> can/should say is that inherit fd functionality can restore certain
> descriptors that cannot otherwise be restored at all (causing the
> entire restore to fail).
> 
> I suggest that we take the same approach here that we took with
> external bind mounts.  Start with a simple implementation that does
> not change core CRIU functionality (hence very low risk) and see in
> practice what issues it may/will uncover.  Based on the issues, we can
> decide whether to extend its functionality or drop it.

I don't disagree :) Though dropping something that has an external
API would not be extremely easy... 

BTW, while we're at it, we have a list of problems in our image
files [1]. I think that some time soon I'll start a page with a similar
list of API problems. Any ideas on how to fix those w/o breaking
the users are always welcome.

[1] http://criu.org/What%27s_bad_with_V1_images

> 
>> I think that the root cause of it is that with the --inherit-fd option we're
>> mapping into each other objects of different types. We ask criu to replace an
>> "inode" object (/foo/bar from reg-files.img or pipe[12345] from pipes.img)
>> with a "file" one (the fd[3]).
> 
> No, the idea is not to replace one inode type with another.  We're
> asking CRIU to use a live pipe fd in its descriptors instead of a
> dead/broken pipe fd in the image file.

Agreed.

> Or, we're asking CRIU to use a
> new regular file fd (e.g., /tmp/newfoo) instead of an old regular file
> fd (e.g., /tmp/oldfoo).  But we're not asking CRIU to use a regular
> file inode in place of a pipe inode or vice versa.  Sorry if this
> wasn't clear.

Well, yes, now it's clearer. But the issue with addressing the
object still stands. Here's what I mean: if a task has some file opened, this
chain of objects exists in kernel memory:

task -> file-descriptors-table [ FD ] -> struct file -> struct dentry -> struct inode

Several struct files may reference one dentry/inode pair, and several
FDs may reference one struct file. Like this:

task1 -> FDT [FD] -+
                   |
task2 -> FDT [FD] -+-> struct file -+-> struct dentry -> struct inode
                                    |
task3 -> FDT [FD] ---> struct file -+

When we say that we want to --inherit-fd fd[X]:/foo/bar, with the tail
part of the option (the /foo/bar) we address the dentry/inode pair, and with
the head part of it (the fd[X]) we point to a struct file object in
criu's FDT.

I'm ready to make the --inherit-fd option a "use at your own risk"
one, but the problem is that a CRIU caller has no API to find out what
will happen after they use it. IOW -- there's currently no way to reveal
the diagram above. For each of the three tasks in this example /proc/pid/fd
would show the very same line

3 -> /foo/bar

and that's it. This complex structure is hidden inside the kernel.
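
To make the hidden sharing concrete, here is a small sketch (Python, Linux
assumed for the /proc part). It is not CRIU code and the helper name
sharing_demo is made up for illustration. It uses one process with three
descriptors instead of three tasks: fd_a and fd_b share one struct file
(via dup), while fd_c is a second struct file on the same inode. The file
offset lives in the struct file, so it exposes the difference that the
/proc/pid/fd links do not.

```python
import os
import tempfile


def sharing_demo():
    """Return (offset seen via dup'ed fd, offset seen via re-opened fd, fd links)."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"0123456789")
        tmp.flush()

        fd_a = os.open(tmp.name, os.O_RDONLY)  # first struct file
        fd_b = os.dup(fd_a)                    # second FD, same struct file
        fd_c = os.open(tmp.name, os.O_RDONLY)  # second struct file, same inode
        try:
            os.lseek(fd_a, 4, os.SEEK_SET)     # move the offset through fd_a only
            dup_offset = os.lseek(fd_b, 0, os.SEEK_CUR)
            reopen_offset = os.lseek(fd_c, 0, os.SEEK_CUR)

            links = []
            if os.path.isdir("/proc/self/fd"):  # Linux-only view of the FD table
                links = [os.readlink(f"/proc/self/fd/{fd}")
                         for fd in (fd_a, fd_b, fd_c)]
        finally:
            for fd in (fd_a, fd_b, fd_c):
                os.close(fd)
    return dup_offset, reopen_offset, links


if __name__ == "__main__":
    dup_offset, reopen_offset, links = sharing_demo()
    print("offset via dup'ed fd:   ", dup_offset)
    print("offset via re-opened fd:", reopen_offset)
    print("links:", links)
```

The dup'ed descriptor follows the seek and the re-opened one does not, yet
all three links name the same path.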

You can find this in criu's image files as well. In reg-files.img
there are records like

id: 0xabc ... name: /foo/bar

and while the id addresses the struct file, the name field is about the
dentry/inode. Same with pipes.img

id: 0xcde ... pipe_id: 0x123

The id is struct file and the pipe_id is struct inode (dentries are fake
for pipes).

And the same is true for any other image with struct file-s.
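
The pipe case can be observed on a live pipe with a short sketch (Python,
not CRIU code; pipe_ends is a made-up helper name). Each end of the pipe is
its own struct file with its own access mode, while on Linux both ends
report the same inode number, which is the role pipe_id plays in the image.

```python
import fcntl
import os


def pipe_ends():
    """Return ((inode, access_mode) of the read end, same for the write end)."""
    r, w = os.pipe()
    try:
        def describe(fd):
            # The access mode is a property of the struct file, not of the inode.
            mode = fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_ACCMODE
            return os.fstat(fd).st_ino, mode

        return describe(r), describe(w)
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    (r_ino, r_mode), (w_ino, w_mode) = pipe_ends()
    print("same inode:", r_ino == w_ino)
    print("read end is O_RDONLY:", r_mode == os.O_RDONLY)
    print("write end is O_WRONLY:", w_mode == os.O_WRONLY)
```

Addressing a pipe by its inode alone therefore cannot say which of the two
struct files is meant.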

If with the --inherit-fd option we point to the "id" field, then it will
be clear what would happen. Maybe we can go this route?

>> What would you say if we make the --inherit-fd option (on restore only) work like
>> this. User would say "--inherit-fd $pid.$fd:$fd2", thus making criu get the struct
>> file referenced by task $pid with the descriptor $fd, find everybody else who
>> references the same struct file and replace this struct file with the one pointed
>> to by the criu process with the $fd2 descriptor. With this we map one file to another one.
>> And the collect_fd function from files.c does exactly what we need -- it resolves
>> the struct file sharing, finding the one who will "create" the inode and
>> serve one out to the others.
> 
> If I understand the gist of your proposal, it's almost the same as
> what we're already doing because task $pid is CRIU itself and $fd is
> the descriptor that is set up for CRIU when it's called.  

No, it's reversed. Sorry for not being clear enough.

> If I misunderstood, please explain what is task $pid and how is it created?

I meant that $pid.$fd points to some task from images and $fd2
points to criu's descriptor. Thus with the first pair we address
a struct file in the tree we restore, and with the $fd2 we also
address a struct file but in the criu's FDT.

Actually it's the same as with the "id" field I've described above.
The task's fdinfo.img image entry looks like

id: 0xfde ... type: xxx fd: 1

If you take task's $pid entry with fd being $fd you uniquely
identify the (type, id) pair, which is the struct file.
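
As a rough model of this lookup (Python, purely illustrative): the
FdinfoEntry class, the resolve and sharers helpers and the sample values
are all hypothetical and do not reflect criu's real image format or code.
The sketch only shows how $pid.$fd pins down one (type, id) pair, and how
everybody else referencing the same struct file can then be found.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FdinfoEntry:
    """Hypothetical stand-in for one fdinfo record of one task."""
    pid: int
    fd: int
    type: str
    id: int


def resolve(entries, pid, fd):
    """Return the (type, id) pair, i.e. the struct file, addressed by $pid.$fd."""
    for e in entries:
        if e.pid == pid and e.fd == fd:
            return (e.type, e.id)
    raise KeyError(f"no fdinfo entry for {pid}.{fd}")


def sharers(entries, key):
    """Return every (pid, fd) that references the struct file named by key."""
    return [(e.pid, e.fd) for e in entries if (e.type, e.id) == key]


if __name__ == "__main__":
    # Made-up sample: tasks 10 and 11 share one struct file, task 12 has its own.
    entries = [
        FdinfoEntry(pid=10, fd=3, type="REG", id=0xABC),
        FdinfoEntry(pid=11, fd=3, type="REG", id=0xABC),
        FdinfoEntry(pid=12, fd=3, type="REG", id=0xDEF),
    ]
    key = resolve(entries, 10, 3)
    print(key, sharers(entries, key))
```

Replacing the struct file behind that key would then affect exactly the
descriptors returned by sharers, and nobody else.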

>  Is it a task that was previously checkpointed or is it a new task
> that the user has to start in addition to starting CRIU itself?
> 
> Should I go ahead to remove checkpoint side of inherit fd and send you
> a patch (for pipe and regular file) or do you prefer to discuss an
> alternate design/implementation?

I agree with the implementation; right now I'd like to sort out
the option semantics :)

Thanks,
Pavel