<div dir="ltr"><div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">The snippets below are from my conversation with Tycho Andersen about this issue; he looked into it as well and confirmed the fix:</div>
<div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">Garrison:</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">
This is the situation in which the bug arises:</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">1. Make a namespace jail. (I was using the pid and mnt namespaces)</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">
2. Run a process in that namespace jail (e.g., infiniteLoop)</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">3. Put nsinit in cgroups such as:</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">
/memory/TMP0</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">/cpu/TMP0</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">4. Put the child process of nsinit in cgroups such as:</div>
<div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><div>/memory/TMP0/TMP1</div><div>/cpu/TMP0/TMP1</div></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">5. Dump the nsinit process (which dumps and kills the child process too)</div>
<div style="font-family:arial,sans-serif;font-size:12.727272033691406px">6. Remove the cgroups TMP0 and TMP1</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">7. Restore the nsinit process</div>
<div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">This fails on the attempt to move the child process into TMP0/TMP1/tasks.</div>
<div style="font-family:arial,sans-serif;font-size:12.727272033691406px"> </div></div><div><span style="font-family:arial,sans-serif;font-size:12.727272033691406px">Yes, move_in_cgroup() is being called twice and fails on the second call. The first call puts both the parent and the child process in the TMP0/ cgroup set. The second call tries to move the child into the TMP0/TMP1 cgroup set and fails because get_service_fd() does not return the correct fd, even though that fd still seems to be around.</span></div>
<div><font face="arial, sans-serif"><br></font></div><div><span style="font-family:arial,sans-serif;font-size:12.727272033691406px">By calling close_service_fd() at the end of move_in_cgroup(), you turn off the CGROUP_YARD bit in sfd_map, which keeps track of all service file descriptors. Then, when move_in_cgroup() gets called a second time, get_service_fd() fails because the bit is still off. </span><font face="arial, sans-serif"><br>
</font><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Tycho:</div><div class="gmail_extra"><span style="font-family:arial,sans-serif;font-size:12.727272033691406px">It looks like these are all mutually recursive; here is the call graph:</span></div>
<div class="gmail_extra"><br style="font-family:arial,sans-serif;font-size:12.727272033691406px"><span style="font-family:arial,sans-serif;font-size:12.727272033691406px">restore_task_with_children -> create_children_and_session -></span><br style="font-family:arial,sans-serif;font-size:12.727272033691406px">
<span style="font-family:arial,sans-serif;font-size:12.727272033691406px">fork_with_pid -> restore_task_with_children</span><br style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br style="font-family:arial,sans-serif;font-size:12.727272033691406px">
<span style="font-family:arial,sans-serif;font-size:12.727272033691406px">So it seems definitely incorrect to close the cgroup fd in</span><br style="font-family:arial,sans-serif;font-size:12.727272033691406px"><span style="font-family:arial,sans-serif;font-size:12.727272033691406px">move_in_cgroup. Looking at 203c291 (when move_in_cgroup was</span><br style="font-family:arial,sans-serif;font-size:12.727272033691406px">
<span style="font-family:arial,sans-serif;font-size:12.727272033691406px">introduced) it seems that this loop has always existed. Seems best to</span><br style="font-family:arial,sans-serif;font-size:12.727272033691406px">
<span style="font-family:arial,sans-serif;font-size:12.727272033691406px">fix it (and I think you're right that we can just delete the call,</span><br style="font-family:arial,sans-serif;font-size:12.727272033691406px">
<span style="font-family:arial,sans-serif;font-size:12.727272033691406px">since it is closed in fini_cgroup() on error or success).</span><br></div><div class="gmail_extra"><br></div><div class="gmail_extra"><span style="font-family:arial,sans-serif;font-size:12.727272033691406px">Without the close_service_fd call it all seems to</span><br style="font-family:arial,sans-serif;font-size:12.727272033691406px">
<span style="font-family:arial,sans-serif;font-size:12.727272033691406px">work just fine. </span><br style="font-family:arial,sans-serif;font-size:12.727272033691406px"><span style="font-family:arial,sans-serif;font-size:12.727272033691406px">Anyway, I can confirm this bug and fix.</span><br>
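<div><br></div>
<div>To illustrate the failure mode discussed above, here is a minimal standalone sketch. It is a toy model, not CRIU's code: every toy_ name, the bitmap layout, and the fd value 42 are assumptions made up for illustration. Only the overall behavior (closing clears the bit, so the next lookup fails) follows the description above.</div>

```c
#include <stdio.h>

/* Toy model only: these are NOT CRIU's real definitions. */
enum toy_sfd_type { TOY_CGROUP_YARD = 0, TOY_SFD_MAX };

static unsigned long toy_sfd_map;       /* one bit per installed service fd */
static int toy_sfd_table[TOY_SFD_MAX];  /* pretend fd numbers */

/* Record an fd for the given type and mark its bit as set. */
static void toy_install_service_fd(enum toy_sfd_type t, int fd)
{
	toy_sfd_table[t] = fd;
	toy_sfd_map |= 1UL << t;
}

/* Return the fd, or -1 if the bit for this type is not set. */
static int toy_get_service_fd(enum toy_sfd_type t)
{
	if (!(toy_sfd_map & (1UL << t)))
		return -1;
	return toy_sfd_table[t];
}

/* Clear the bit; later lookups for this type will fail. */
static void toy_close_service_fd(enum toy_sfd_type t)
{
	toy_sfd_map &= ~(1UL << t);
}

/* Models one move_in_cgroup() call; close_at_end mimics the deleted line. */
static int toy_move_in_cgroup(const char *cg_set, int close_at_end)
{
	int fd = toy_get_service_fd(TOY_CGROUP_YARD);

	if (fd < 0) {
		printf("move into %s: failed, no yard fd\n", cg_set);
		return -1;
	}
	printf("move into %s: ok via fd %d\n", cg_set, fd);
	if (close_at_end)
		toy_close_service_fd(TOY_CGROUP_YARD);
	return 0;
}

/* Runs the two moves from the report; returns the second call's result. */
static int toy_restore(int close_at_end)
{
	toy_install_service_fd(TOY_CGROUP_YARD, 42);
	if (toy_move_in_cgroup("TMP0", close_at_end))
		return -1;
	return toy_move_in_cgroup("TMP0/TMP1", close_at_end);
}
```

<div>In this model, toy_restore(1) returns -1 because the first move clears the bit before the second move runs, while toy_restore(0) returns 0 because the fd stays registered until something else (fini_cgroup() in the real code, per Tycho's note) closes it.</div>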
<br><div class="gmail_quote">On Tue, Aug 5, 2014 at 10:35 PM, Pavel Emelyanov <span dir="ltr"><<a href="mailto:xemul@parallels.com" target="_blank">xemul@parallels.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div>On 08/05/2014 11:50 PM, <a href="mailto:gbellack@google.com" target="_blank">gbellack@google.com</a> wrote:<br>
> From: gbellack <<a href="mailto:gbellack@google.com" target="_blank">gbellack@google.com</a>><br>
><br>
> There is an issue where, if the process to be killed spawns a child process and<br>
> moves it into a child cgroup of the one the parent process is in, then upon restore<br>
> move_in_cgroup() is called twice, as it should be (once to move the parent<br>
> process and once to move the child process), but the file descriptor has<br>
> already been closed, causing the second call to move_in_cgroup() to fail.<br>
<br>
</div>Can you provide more details, please? move_in_cgroup() is supposed to<br>
move a task only once; how does the 2nd time happen?<br>
<div><div><br>
> Change-Id: I6ae88b95c5410a7f56108e28eb3133f113e868d0<br>
> Signed-off-by: Garrison Bellack <<a href="mailto:gbellack@google.com" target="_blank">gbellack@google.com</a>><br>
> ---<br>
> cgroup.c | 1 -<br>
> 1 file changed, 1 deletion(-)<br>
><br>
> diff --git a/cgroup.c b/cgroup.c<br>
> index 06311e4..8c99e9d 100644<br>
> --- a/cgroup.c<br>
> +++ b/cgroup.c<br>
> @@ -617,7 +617,6 @@ static int move_in_cgroup(CgSetEntry *se)<br>
> }<br>
> }<br>
><br>
> - close_service_fd(CGROUP_YARD);<br>
> return 0;<br>
> }<br>
><br>
><br>
<br>
</div></div></blockquote></div><br></div></div></div>