[CRIU] [PATCH] Make restored processes inherit file descriptors from criu

Tue Dec 2 10:40:42 PST 2014

On Tue, Dec 2, 2014 at 12:33 AM, Pavel Emelyanov <xemul at parallels.com> wrote:
> On 12/01/2014 08:15 PM, Saied Kazemi wrote:
>
>>> Actually, after more thinking I came to conclusion that with pipes the situation
>>> may be even more confusing. If for some reason the process we restore hold one
>>> pipe with two "ends" -- reading and writing -- and we say that this pipe should
>>> be inherited from criu, we make this process put one criu's pipe end into its
>>> both descriptors. Thus one of them will happen in the wrong read/write mode.
>>
>> I would go back to the use case.  We're not making a broad statement
>> that with inherit fd one can replace an arbitrary descriptor at
>> checkpoint time with another arbitrary descriptor at restore time for
>> the entire process tree and expect everything to work.  What we
>> can/should say is that inherit fd functionality can restore certain
>> descriptors that cannot otherwise be restored at all (causing the
>> entire restore to fail).
>>
>> I suggest that we take the same approach here that we took with
>> external bind mounts.  Start with a simple implementation that does
>> not change core CRIU functionality (hence very low risk) and see in
>> practice what issues it may/will uncover.  Based on the issues, we can
>> decide whether to extend its functionality or drop it.
>
> I don't disagree :) Though dropping something that has an external
> API would not be extremely easy...
>
> BTW, while we're at it, we have a list of problems in our image
> files [1], I think that some time soon I'll start a page with the
> list of API problems. And any idea how to fix those w/o breaking
> the users are always welcome.
>
> [1] http://criu.org/What%27s_bad_with_V1_images
>
>>
>>> I think that the root cause of it is that with the --inherit-fd option we're
>>> mapping into each other objects of different types. We ask criu to replace an
>>> "inode" object (/foor/bar from reg-file.img or pipe[12345] from pipes.img)
>>> with a "file" one (the fd[3]).
>>
>> No, the idea is not to replace one inode type with another.  We're
>> asking CRIU to use a live pipe fd in its descriptors instead of a
>> dead/broken pipe fd in the image file.
>
> Agreed.
>
>> Or, we're asking CRIU to use a
>> new regular file fd (e.g., /tmp/newfoo) instead of an old regular file
>> fd (e.g., /tmp/oldfoo).  But we're not asking CRIU to use a regular
>> file inode in place of a pipe inode or vice versa.  Sorry if this
>> wasn't clear.
>
> Well, yes, now it's more clear. But still the issue with addressing the
> object stays. Here's what I mean, if a task has some file opened, in the
> kernel memory this chain of objects exists:
>
> task -> file-descriptors-table [ FD ] -> struct file -> struct dentry -> struct inode
>
> Several file-s may reference one dentry/inode part, as well as several
> FD-s may reference one file. Like this
>
>
> task1 -> FDT [FD] -+
>                    |
> task2 -> FDT [FD] -+-> struct file -+-> struct dentry -> struct inode
>                                     |
> task3 -> FDT [FD] ---> struct file -+
>
> When we say that we want to --inherit-fd fd[X]:/foo/bar, with the tail
> part of the option (the /foo/bar) we address the dentry/inode pair and with
> the head part of it (the fd[X]) we point to the struct file object in
> the criu's FDT.
>
> I'm ready to make the --inherit-fd option be the "use at your own risk"
> one, but the problem is that a CRIU caller has no API to find out what
> will happen after he uses it. IOW -- there's currently no way to reveal
> the diagram above. For three tasks in this example the /proc/pid/fd
> would show you the same three lines
>
> 3 -> /foo/bar
>
> and that's it. This complex structure is hidden inside the kernel.
>
> You can find this in the criu's image files as well. In the reg-files.img
> there are records like
>
> id: 0xabc ... name: /foo/bar
>
> and while the id addresses the struct file, the name field is about the
> dentry/inode. Same with pipes.img
>
> id: 0xcde ... pipe_id: 0x123
>
> The id is struct file and the pipe_id is struct inode (dentries are fake
> for pipes).
>
> And the same is true for any other image with struct file-s.
>
> If with the --inherit-fd option we point to the "id" field, then this will
> be clear what would happen. Maybe we can go this route?
>
>>> What would you say if we make the --inherid-fd option (on restore only) work like
>>> this. User would say "--inherit-fd $pid.$fd:$fd2" thus making criu to get a struct
>>> file referenced by task $pid with the descriptor $fd, find everybody else who
>>> references the same struct file and replace this struct file with the one, pointed
>>> by criu process with the $fd2 descriptor. With this we map one file to another one.
>>> And the collect_fd function from files.c does exactly what we need -- it resolves
>>> the struct file sharing, finding the one who will "create" the inode and
>>> serve one out to the others.
>>
>> If I understand the gist of your proposal, it's almost the same as
>> what we're already doing because task $pid is CRIU itself and $fd is
>> the descriptor that is set up for CRIU when it's called.
>
> No, it's reversed. Sorry for not being clear enough.
>
>> If I misunderstood, please explain what is task $pid and how is it created?
>
> I meant that $pid.$fd points to some task from images and $fd2
> points to criu's descriptor. Thus with the first pair we address
> a struct file in the tree we restore, and with the $fd2 we also
> address a struct file but in the criu's FDT.
>
> Actually it's the same as with "id" field I've described above.
> The task's fdinfo.img image entry looks like
>
> id: 0xfde ... type: xxx fd: 1
>
> If you take task's $pid entry with fd being $fd you uniquely
> identify the (type, id) pair, which is the struct file.
>
>>  Is it a task that was previously checkpointed or is it a new task
>> that the user has to start in addition to starting CRIU itself?
>>
>> Should I go ahead to remove checkpoint side of inherit fd and send you
>> a patch (for pipe and regular file) or do you prefer to discuss an
>> alternate design/implementation?
>
> I agree with the implementation, right now I'd like to sort out
> the option semantics :)
>
> Thanks,
> Pavel
>

I understand the problem you're explaining with the state of the
struct file between the file descriptor table and the inode but
because of the following reasons we can't address this issue with a
command line option like $pid$fd:$fd2 or anything else for that
matter.

1. The root/init process may have passed down its descriptor to a
distant descendant, so there is no way to know which $pid to specify.

2. Whatever process using the descriptor may have moved its fd, so
there is no way to know which $fd to specify.

As an example, consider the following simple shell script in which
$</bin/sh-pid>:$fd[1] becomes $</bin/dash-pid>:$fd[3].

$ /bin/sh -c 'exec 3>&1; exec 1>&-; /bin/dash -c "i=0; while true; do
echo \$i >&3; i=\$(expr \$i + 1); sleep 1; done"'

As you said, it's best to introduce --inherit-fd as an experimental
"use it at your own risk" command line option that is suitable for
specific use cases (most notably pipes and log files).  We can even
add a --experimental command line option that should be specified to
use --inherit-fd (and also other experimental options, if any).  This
way the users will definitely realize that the API is experimental and
may change or disappear.

--Saied