[Devel] [PATCH] kernel: call task_work_run() before exit_task_namespaces()

Konstantin Khorenko khorenko at virtuozzo.com
Mon Aug 7 12:31:31 MSK 2017


ping

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 07/20/2017 11:44 AM, Konstantin Khorenko wrote:
> Andrey, i've applied the patch in order not to block,
> but please extend the comment a little bit more:
> - ms commits (which are big to backport)
> - and what is the difference between those ms commits and our small commit.
>
> i'll edit the commit message during next rebase.
>
> Thank you.
>
> --
> Best regards,
>
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team
>
> On 07/19/2017 09:47 PM, Stanislav Kinsburskiy wrote:
>>
>>
>> 19 июля 2017 г. 9:37 PM пользователь Andrey Vagin <avagin at virtuozzo.com> написал:
>>
>>     On Wed, Jul 19, 2017 at 11:31:28AM -0700, Stanislav Kinsburskiy wrote:
>>     >
>>     >
>>     > 19 июля 2017 г. 9:14 PM пользователь Andrey Vagin <avagin at virtuozzo.com>
>>     > написал:
>>     >
>>     >     On Wed, Jul 19, 2017 at 08:04:22PM +0300, Andrey Ryabinin wrote:
>>     >     >
>>     >     >
>>     >     > On 07/19/2017 04:14 AM, Andrei Vagin wrote:
>>     >     > > From: Andrei Vagin <avagin at virtuozzo.com>
>>     >     > >
>>     >     > > task_work_run() has to be called before exit_task_namespaces(),
>>     >     > > because fuse_abort_conn() is called from __fput(). If it will not
>>     >     > > be executed, we can hang in request_wait_answer(). We have seen this
>>     >     > > situation when a process was the last member of a mount namespace
>>     >     > > and the mount namespace has a vstorage fuse mount.
>>     >     > >
>>     >     >
>>     >     > Can we pleas have a changelog that doesn't look like an output of random
>>     >     text generator?
>>     >
>>     >     Thanks!
>>     >
>>     >     > The fact that "fuse_abort_conn() is called from __fput()" doesn't really
>>     >     explain why
>>     >     > task_work_run() needs to be called before exit_task_namespaces.
>>     >     >
>>     >
>>     >     Here is another version of my random text generator. It has to be more
>>     >     detailed.
>>     >
>>     >     This patch solves a problem for a following case. We have a container (a
>>     >     group of processes in pid and mount namespaces) with a fuse mount. An
>>     >     init process exits and the kernel kills all process in its pid
>>     >     namespace. There is a fuse daemon, which handle the fuse mount.
>>     >     Currently the kernel kills this process and closes all its file
>>     >     descriptors, but __fput() for them is postponed and they will be
>>     >     called from a task_work.  Then the kernel starts destroying the mount
>>     >     namespace and the fuse mount, it sees that a control descriptor for
>>     >     this mount is alive and sends a request to a fuse daemon:
>>     >
>>     >     $ cat /proc/4353/task/4355/stack
>>     >     [<ffffffffa04c3451>] request_wait_answer+0x91/0x270 [fuse]
>>     >     [<ffffffffa04c36b7>] __fuse_request_send+0x87/0xe0 [fuse]
>>     >     [<ffffffffa04c6c47>] fuse_request_check_and_send+0x27/0x30 [fuse]
>>     >     [<ffffffffa04c6c60>] fuse_request_send+0x10/0x20 [fuse]
>>     >     [<ffffffffa04d2f35>] fuse_put_super+0x55/0xc0 [fuse]
>>     >     [<ffffffff81218b32>] generic_shutdown_super+0x72/0xf0
>>     >     [<ffffffff81218f12>] kill_anon_super+0x12/0x20
>>     >     [<ffffffffa04d2577>] fuse_kill_sb_anon+0x47/0x50 [fuse]
>>     >     [<ffffffff812194a9>] deactivate_locked_super+0x49/0x80
>>     >     [<ffffffff81219526>] deactivate_super+0x46/0x60
>>     >     [<ffffffff81237145>] mntput_no_expire+0xc5/0x120
>>     >     [<ffffffff812371c4>] mntput+0x24/0x40
>>     >     [<ffffffff812372f8>] namespace_unlock+0x118/0x130
>>     >     [<ffffffff81239f2b>] put_mnt_ns+0x4b/0x60
>>     >     [<ffffffff810b786b>] free_nsproxy+0x1b/0x90
>>     >     [<ffffffff810b7a0a>] switch_task_namespaces+0x5a/0x70
>>     >     [<ffffffff810b7ae0>] exit_task_namespaces+0x10/0x20
>>     >     [<ffffffff8108c883>] do_exit+0x2f3/0xb20
>>     >     [<ffffffff8108d12f>] do_group_exit+0x3f/0xa0
>>     >     [<ffffffff8109e760>] get_signal_to_deliver+0x1d0/0x6d0
>>     >     [<ffffffff8102a357>] do_signal+0x57/0x6b0
>>     >     [<ffffffff8102aa0f>] do_notify_resume+0x5f/0xb0
>>     >     [<ffffffff8169273d>] int_signal+0x12/0x17
>>     >     [<ffffffffffffffff>] 0xffffffffffffffff
>>     >
>>     >     But we know that a fuse daemon is already dead and the control
>>     >     descriptor isn't closed completely, because __fput() was postponed.
>>     >
>>     >     This patch calls task_work_run() before destroying namespaces to
>>     >     complete closing all process files.
>>     >
>>     >
>>     > Now I have a question. Because rised questions look reasonable.
>>     > This sounds like a generic issue.
>>     > I.e. it's either solved in upstream likewise or otherwise. Or not solved at
>>     > all.
>>     > I.e. what's the status of this issue in linux-next?
>>     >
>>
>>     In the upstream kernel deactivate_super() is called from a task_work too,
>>     so there is not this problem. But we can't backport these changes from
>>     the upstream, because they are too big.
>>
>>
>> Then probably it worth to mention the upsteam solution in the change log and why we can't use it, and how our patch logic correlates with the upstream one.
>> Hm?
>>
>>
>>     >
>>     >
>>     >     >
>>     >     >
>>     >     > > https://jira.sw.ru/browse/PSBM-68266
>>     >     > > Signed-off-by: Andrei Vagin <avagin at virtuozzo.com>
>>     >     > > ---
>>     >     > >  include/linux/task_work.h | 9 +++++++--
>>     >     > >  kernel/exit.c             | 9 +++++++++
>>     >     > >  kernel/task_work.c        | 4 ++--
>>     >     > >  3 files changed, 18 insertions(+), 4 deletions(-)
>>     >     > >
>>     >     > > diff --git a/include/linux/task_work.h b/include/linux/task_work.h
>>     >     > > index ca5a1cf..b3af76d 100644
>>     >     > > --- a/include/linux/task_work.h
>>     >     > > +++ b/include/linux/task_work.h
>>     >     > > @@ -14,11 +14,16 @@ init_task_work(struct callback_head *twork,
>>     >     task_work_func_t func)
>>     >     > >
>>     >     > >  int task_work_add(struct task_struct *task, struct callback_head
>>     >     *twork, bool);
>>     >     > >  struct callback_head *task_work_cancel(struct task_struct *,
>>     >     task_work_func_t);
>>     >     > > -void task_work_run(void);
>>     >     > > +void __task_work_run(bool exiting);
>>     >     > > +
>>     >     > > +static inline void task_work_run(void)
>>     >     > > +{
>>     >     > > +   return __task_work_run(false);
>>     >     > > +}
>>     >     > >
>>     >     > >  static inline void exit_task_work(struct task_struct *task)
>>     >     > >  {
>>     >     > > -   task_work_run();
>>     >     > > +   __task_work_run(true);
>>     >     > >  }
>>     >     > >
>>     >     > >  #endif     /* _LINUX_TASK_WORK_H */
>>     >     > > diff --git a/kernel/exit.c b/kernel/exit.c
>>     >     > > index 3c83db2..ea54a73 100644
>>     >     > > --- a/kernel/exit.c
>>     >     > > +++ b/kernel/exit.c
>>     >     > > @@ -827,6 +827,15 @@ void do_exit(long code)
>>     >     > >      exit_fs(tsk);
>>     >     > >      if (group_dead)
>>     >     > >              disassociate_ctty(1);
>>     >     > > +
>>     >     > > +   /*
>>     >     > > +    * task_work_run() has to be called before exit_task_namespaces(),
>>     >     > > +    * because fuse_abort_conn() is called from __fput(). If it will
>>     >     not
>>     >     > > +    * be executed, we can hang in request_wait_answer(). We have seen
>>     >     this
>>     >     > > +    * situation when a process was the last member of a mount
>>     >     namespace
>>     >     > > +    * and the mount namespace has a vstorage fuse mount.
>>     >     > > +    */
>>     >     > > +   task_work_run();
>>     >     >
>>     >     > Given that this is purely fuse's problem, maybe request_wait_answer()
>>     >     could just call task_work_run()?
>>     >     >
>>     >     > Or maybe we can just call exit_task_work(tsk) before exit_task_namespaces
>>     >     (tsk). This seems fine to me,
>>     >     > but perhaps I'm missing something.
>>     >     >
>>     >     > >      exit_task_namespaces(tsk);
>>     >     > >      exit_task_work(tsk);
>>     >     > >
>>     >     > > diff --git a/kernel/task_work.c b/kernel/task_work.c
>>     >     > > index 65bd3c9..f0000c4 100644
>>     >     > > --- a/kernel/task_work.c
>>     >     > > +++ b/kernel/task_work.c
>>     >     > > @@ -46,7 +46,7 @@ task_work_cancel(struct task_struct *task,
>>     >     task_work_func_t func)
>>     >     > >      return work;
>>     >     > >  }
>>     >     > >
>>     >     > > -void task_work_run(void)
>>     >     > > +void __task_work_run(bool exiting)
>>     >     > >  {
>>     >     > >      struct task_struct *task = current;
>>     >     > >      struct callback_head *work, *head, *next;
>>     >     > > @@ -58,7 +58,7 @@ void task_work_run(void)
>>     >     > >               */
>>     >     > >              do {
>>     >     > >                      work = ACCESS_ONCE(task->task_works);
>>     >     > > -                   head = !work && (task->flags & PF_EXITING) ?
>>     >     > > +                   head = !work && exiting ?
>>     >     >
>>     >     > Why we need this change? AFAIU this will allow to add more task_works in
>>     >     exit_task_namespaces()
>>     >     > before final exit_task_work(). What's the point of this?
>>     >     >
>>     >     > >                              &work_exited : NULL;
>>     >     > >              } while (cmpxchg(&task->task_works, work, head) != work);
>>     >     > >
>>     >     > >
>>     >     _______________________________________________
>>     >     Devel mailing list
>>     >     Devel at openvz.org
>>     >     https://lists.openvz.org/mailman/listinfo/devel
>>     >
>>     >
>>
>>
>>
>>
>> _______________________________________________
>> Devel mailing list
>> Devel at openvz.org
>> https://lists.openvz.org/mailman/listinfo/devel
>>
> _______________________________________________
> Devel mailing list
> Devel at openvz.org
> https://lists.openvz.org/mailman/listinfo/devel
>


More information about the Devel mailing list