[Devel] [PATCH] kernel: call task_work_run() before exit_task_namespaces()

Stanislav Kinsburskiy skinsbursky at virtuozzo.com
Wed Jul 19 21:47:17 MSK 2017



19 июля 2017 г. 9:37 PM пользователь Andrey Vagin <avagin at virtuozzo.com> написал:
On Wed, Jul 19, 2017 at 11:31:28AM -0700, Stanislav Kinsburskiy wrote:
>
>
> 19 июля 2017 г. 9:14 PM пользователь Andrey Vagin <avagin at virtuozzo.com>
> написал:
>
>     On Wed, Jul 19, 2017 at 08:04:22PM +0300, Andrey Ryabinin wrote:
>     >
>     >
>     > On 07/19/2017 04:14 AM, Andrei Vagin wrote:
>     > > From: Andrei Vagin <avagin at virtuozzo.com>
>     > >
>     > > task_work_run() has to be called before exit_task_namespaces(),
>     > > because fuse_abort_conn() is called from __fput(). If it will not
>     > > be executed, we can hang in request_wait_answer(). We have seen this
>     > > situation when a process was the last member of a mount namespace
>     > > and the mount namespace has a vstorage fuse mount.
>     > >
>     >
>     > Can we pleas have a changelog that doesn't look like an output of random
>     text generator?
>
>     Thanks!
>
>     > The fact that "fuse_abort_conn() is called from __fput()" doesn't really
>     explain why
>     > task_work_run() needs to be called before exit_task_namespaces.
>     >
>
>     Here is another version of my random text generator. It has to be more
>     detailed.
>
>     This patch solves a problem for a following case. We have a container (a
>     group of processes in pid and mount namespaces) with a fuse mount. An
>     init process exits and the kernel kills all process in its pid
>     namespace. There is a fuse daemon, which handle the fuse mount.
>     Currently the kernel kills this process and closes all its file
>     descriptors, but __fput() for them is postponed and they will be
>     called from a task_work.  Then the kernel starts destroying the mount
>     namespace and the fuse mount, it sees that a control descriptor for
>     this mount is alive and sends a request to a fuse daemon:
>
>     $ cat /proc/4353/task/4355/stack
>     [<ffffffffa04c3451>] request_wait_answer+0x91/0x270 [fuse]
>     [<ffffffffa04c36b7>] __fuse_request_send+0x87/0xe0 [fuse]
>     [<ffffffffa04c6c47>] fuse_request_check_and_send+0x27/0x30 [fuse]
>     [<ffffffffa04c6c60>] fuse_request_send+0x10/0x20 [fuse]
>     [<ffffffffa04d2f35>] fuse_put_super+0x55/0xc0 [fuse]
>     [<ffffffff81218b32>] generic_shutdown_super+0x72/0xf0
>     [<ffffffff81218f12>] kill_anon_super+0x12/0x20
>     [<ffffffffa04d2577>] fuse_kill_sb_anon+0x47/0x50 [fuse]
>     [<ffffffff812194a9>] deactivate_locked_super+0x49/0x80
>     [<ffffffff81219526>] deactivate_super+0x46/0x60
>     [<ffffffff81237145>] mntput_no_expire+0xc5/0x120
>     [<ffffffff812371c4>] mntput+0x24/0x40
>     [<ffffffff812372f8>] namespace_unlock+0x118/0x130
>     [<ffffffff81239f2b>] put_mnt_ns+0x4b/0x60
>     [<ffffffff810b786b>] free_nsproxy+0x1b/0x90
>     [<ffffffff810b7a0a>] switch_task_namespaces+0x5a/0x70
>     [<ffffffff810b7ae0>] exit_task_namespaces+0x10/0x20
>     [<ffffffff8108c883>] do_exit+0x2f3/0xb20
>     [<ffffffff8108d12f>] do_group_exit+0x3f/0xa0
>     [<ffffffff8109e760>] get_signal_to_deliver+0x1d0/0x6d0
>     [<ffffffff8102a357>] do_signal+0x57/0x6b0
>     [<ffffffff8102aa0f>] do_notify_resume+0x5f/0xb0
>     [<ffffffff8169273d>] int_signal+0x12/0x17
>     [<ffffffffffffffff>] 0xffffffffffffffff
>
>     But we know that a fuse daemon is already dead and the control
>     descriptor isn't closed completely, because __fput() was postponed.
>
>     This patch calls task_work_run() before destroying namespaces to
>     complete closing all process files.
>
>
> Now I have a question. Because rised questions look reasonable.
> This sounds like a generic issue.
> I.e. it's either solved in upstream likewise or otherwise. Or not solved at
> all.
> I.e. what's the status of this issue in linux-next?
>

In the upstream kernel deactivate_super() is called from a task_work too,
so there is not this problem. But we can't backport these changes from
the upstream, because they are too big.

Then probably it worth to mention the upsteam solution in the change log and why we can't use it, and how our patch logic correlates with the upstream one.
Hm?


>
>
>     >
>     >
>     > > https://jira.sw.ru/browse/PSBM-68266
>     > > Signed-off-by: Andrei Vagin <avagin at virtuozzo.com>
>     > > ---
>     > >  include/linux/task_work.h | 9 +++++++--
>     > >  kernel/exit.c             | 9 +++++++++
>     > >  kernel/task_work.c        | 4 ++--
>     > >  3 files changed, 18 insertions(+), 4 deletions(-)
>     > >
>     > > diff --git a/include/linux/task_work.h b/include/linux/task_work.h
>     > > index ca5a1cf..b3af76d 100644
>     > > --- a/include/linux/task_work.h
>     > > +++ b/include/linux/task_work.h
>     > > @@ -14,11 +14,16 @@ init_task_work(struct callback_head *twork,
>     task_work_func_t func)
>     > >
>     > >  int task_work_add(struct task_struct *task, struct callback_head
>     *twork, bool);
>     > >  struct callback_head *task_work_cancel(struct task_struct *,
>     task_work_func_t);
>     > > -void task_work_run(void);
>     > > +void __task_work_run(bool exiting);
>     > > +
>     > > +static inline void task_work_run(void)
>     > > +{
>     > > +   return __task_work_run(false);
>     > > +}
>     > >
>     > >  static inline void exit_task_work(struct task_struct *task)
>     > >  {
>     > > -   task_work_run();
>     > > +   __task_work_run(true);
>     > >  }
>     > >
>     > >  #endif     /* _LINUX_TASK_WORK_H */
>     > > diff --git a/kernel/exit.c b/kernel/exit.c
>     > > index 3c83db2..ea54a73 100644
>     > > --- a/kernel/exit.c
>     > > +++ b/kernel/exit.c
>     > > @@ -827,6 +827,15 @@ void do_exit(long code)
>     > >      exit_fs(tsk);
>     > >      if (group_dead)
>     > >              disassociate_ctty(1);
>     > > +
>     > > +   /*
>     > > +    * task_work_run() has to be called before exit_task_namespaces(),
>     > > +    * because fuse_abort_conn() is called from __fput(). If it will
>     not
>     > > +    * be executed, we can hang in request_wait_answer(). We have seen
>     this
>     > > +    * situation when a process was the last member of a mount
>     namespace
>     > > +    * and the mount namespace has a vstorage fuse mount.
>     > > +    */
>     > > +   task_work_run();
>     >
>     > Given that this is purely fuse's problem, maybe request_wait_answer()
>     could just call task_work_run()?
>     >
>     > Or maybe we can just call exit_task_work(tsk) before exit_task_namespaces
>     (tsk). This seems fine to me,
>     > but perhaps I'm missing something.
>     >
>     > >      exit_task_namespaces(tsk);
>     > >      exit_task_work(tsk);
>     > >
>     > > diff --git a/kernel/task_work.c b/kernel/task_work.c
>     > > index 65bd3c9..f0000c4 100644
>     > > --- a/kernel/task_work.c
>     > > +++ b/kernel/task_work.c
>     > > @@ -46,7 +46,7 @@ task_work_cancel(struct task_struct *task,
>     task_work_func_t func)
>     > >      return work;
>     > >  }
>     > >
>     > > -void task_work_run(void)
>     > > +void __task_work_run(bool exiting)
>     > >  {
>     > >      struct task_struct *task = current;
>     > >      struct callback_head *work, *head, *next;
>     > > @@ -58,7 +58,7 @@ void task_work_run(void)
>     > >               */
>     > >              do {
>     > >                      work = ACCESS_ONCE(task->task_works);
>     > > -                   head = !work && (task->flags & PF_EXITING) ?
>     > > +                   head = !work && exiting ?
>     >
>     > Why we need this change? AFAIU this will allow to add more task_works in
>     exit_task_namespaces()
>     > before final exit_task_work(). What's the point of this?
>     >
>     > >                              &work_exited : NULL;
>     > >              } while (cmpxchg(&task->task_works, work, head) != work);
>     > >
>     > >
>     _______________________________________________
>     Devel mailing list
>     Devel at openvz.org
>     https://lists.openvz.org/mailman/listinfo/devel
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openvz.org/pipermail/devel/attachments/20170719/970d4361/attachment-0001.html>


More information about the Devel mailing list