[Devel] [PATCH RHEL7 COMMIT] ms/workqueue: fix ghost PENDING flag while doing MQ IO

Konstantin Khorenko khorenko at virtuozzo.com
Thu Aug 31 13:33:32 MSK 2017


Please consider releasing it as a ReadyKernel patch.

https://readykernel.com/

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/31/2017 01:28 PM, Konstantin Khorenko wrote:
> The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
> after rh7-3.10.0-514.26.1.vz7.35.5
> ------>
> commit f24bbb53d5035c7b13b5ecb61728d5f12240f139
> Author: Roman Pen <roman.penyaev at profitbricks.com>
> Date:   Thu Aug 31 13:28:46 2017 +0300
>
>     ms/workqueue: fix ghost PENDING flag while doing MQ IO
>
>     We had a whole-node hang: many processes were stuck with a stack
>     similar to this one:
>
>     crash> ps -m ffff8802b7f00000
>     [0 00:20:36.663] [UN]  PID: 22713  TASK: ffff8802b7f00000  CPU: 1   COMMAND: "worker"
>
>     crash> bt ffff8802b7f00000
>     PID: 22713  TASK: ffff8802b7f00000  CPU: 1   COMMAND: "worker"
>      #0 [ffff88031b04f980] __schedule at ffffffff8256cdd1
>      #1 [ffff88031b04f9f8] schedule at ffffffff8256e239
>      #2 [ffff88031b04fa18] schedule_timeout at ffffffff82561cea
>      #3 [ffff88031b04fb88] io_schedule_timeout at ffffffff8256c0d9
>      #4 [ffff88031b04fbb8] wait_for_completion_io at ffffffff8256f3e0
>      #5 [ffff88031b04fc90] blkdev_issue_flush at ffffffff8193a207
>      #6 [ffff88031b04fe08] ext4_sync_file at ffffffffa0af6d34 [ext4]
>      #7 [ffff88031b04fe68] vfs_fsync_range at ffffffff8173212c
>      #8 [ffff88031b04fec8] do_fsync at ffffffff817330dc
>      #9 [ffff88031b04ff68] sys_fdatasync at ffffffff8173437e
>         RIP: 00007f474a581ddd  RSP: 00007f46ba3fe8a0  RFLAGS: 00000282
>         RAX: 000000000000004b  RBX: ffffffff8258f609  RCX: ffffffffffffffff
>         RDX: 00007f4754ffd458  RSI: 0000000000000000  RDI: 0000000000000011
>         RBP: 0000000000000000   R8: 0000000000000000   R9: 00000000000058b9
>         R10: 00007f46ba3fe8b0  R11: 0000000000000293  R12: 00007f475be25d80
>         R13: ffffffff8173437e  R14: ffff88031b04ff78  R15: 00007f4755141452
>         ORIG_RAX: 000000000000004b  CS: 0033  SS: 002b
>
>     crash> ps -m ffff8802b7f00000
>     [0 00:20:36.663] [UN]  PID: 22713  TASK: ffff8802b7f00000  CPU: 1   COMMAND: "worker"
>
>     The task sleeps for 20 minutes waiting for bio completion:
>
>     blkdev_issue_flush:
>             submit_bio(WRITE_FLUSH, bio);
>     here>	wait_for_completion_io(&wait);
>
>     As bio->bi_rw = (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH), we had:
>
>     submit_bio->generic_make_request->dm_make_request->queue_io->queue_work
>
>     So in wait_for_completion_io() we wait for dm_wq_work to process this
>     bio. But the work is no longer on the workqueue (work->entry is an
>     empty list), so it looks as if it has already completed. dm_wq_work
>     could have skipped the bio only if the DMF_BLOCK_IO_FOR_SUSPEND bit
>     was set in md->flags. But that bit is already clear, and right after
>     clearing it dm_queue_flush queues another dm_wq_work on the same wq.
>
>     So what could have happened here is that, due to memory-operation
>     reordering, dm_wq_work loaded the DMF_BLOCK_IO_FOR_SUSPEND bit before
>     it was cleared in dm_queue_flush. Adding an smp_mb() in
>     set_work_pool_and_clear_pending() orders these operations properly.
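>
>     For illustration, a simplified sketch of the dm.c side of this race
>     (paraphrased: locking and the deferred bio list handling are omitted,
>     and process_one_deferred_bio() is a placeholder, not the exact RHEL7
>     code; only the flag/queue_work ordering matters here):
>
>         /* Worker: runs as work->current_func() on md's workqueue. */
>         static void dm_wq_work(struct work_struct *work)
>         {
>                 struct mapped_device *md =
>                         container_of(work, struct mapped_device, work);
>
>                 /* Stop processing deferred bios while I/O is blocked. */
>                 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))
>                         process_one_deferred_bio(md);   /* placeholder */
>         }
>
>         /* Resume path: unblock I/O and kick the worker once more. */
>         static void dm_queue_flush(struct mapped_device *md)
>         {
>                 clear_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
>                 queue_work(md->wq, &md->work); /* no-op if PENDING set */
>         }
>
>     If the worker that is about to run a previously queued dm_wq_work
>     clears the PENDING bit and then speculatively loads md->flags before
>     that clear is ordered, it can still see DMF_BLOCK_IO_FOR_SUSPEND set
>     and return without touching the deferred bios, while the concurrent
>     queue_work() above observes PENDING still set and queues nothing.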
>
>     https://jira.sw.ru/browse/PSBM-69788
>
>     original commit message:
>
>     The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
>     with the following backtrace:
>
>     [  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
>     [  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
>     [  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>     [  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
>     [  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
>     [  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
>     [  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
>     [  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
>     [  601.350965] Call Trace:
>     [  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
>     [  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
>     [  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
>     [  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
>     [  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
>     [  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
>     [  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
>     [  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
>     [  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
>     [  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
>     [  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
>     [  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
>     [  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
>     [  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
>     [  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
>     [  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
>     [  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
>     [  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
>     [  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
>     [  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
>     [  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50
>
>     The underlying device is a null_blk, with default parameters:
>
>       queue_mode    = MQ
>       submit_queues = 1
>
>     Verification that nullb0 has something inflight:
>
>     root@pserver8:~# cat /sys/block/nullb0/inflight
>            0        1
>     root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
>     ...
>     /sys/block/nullb0/mq/0/cpu2/rq_list
>     CTX pending:
>             ffff8838038e2400
>     ...
>
>     During debugging it became clear that the stalled request is always
>     inserted into the rq_list from the following path:
>
>        save_stack_trace_tsk + 34
>        blk_mq_insert_requests + 231
>        blk_mq_flush_plug_list + 281
>        blk_flush_plug_list + 199
>        wait_on_page_bit + 192
>        __filemap_fdatawait_range + 228
>        filemap_fdatawait_range + 20
>        filemap_write_and_wait_range + 63
>        blkdev_fsync + 27
>        vfs_fsync_range + 73
>        blkdev_write_iter + 202
>        __vfs_write + 170
>        vfs_write + 169
>        kernel_write + 56
>
>     So blk_flush_plug_list() was called with from_schedule == true.
>
>     If from_schedule is true, that means that blk_mq_insert_requests()
>     offloads execution of __blk_mq_run_hw_queue() to the kblockd
>     workqueue, i.e. it calls kblockd_schedule_delayed_work_on().
>
>     That means we race with another CPU, which is about to execute the
>     __blk_mq_run_hw_queue() work.
>
>     Further debugging shows the following traces from different CPUs:
>
>       CPU#0                                  CPU#1
>       ----------------------------------     -------------------------------
>       request A inserted
>       STORE hctx->ctx_map[0] bit marked
>       kblockd_schedule...() returns 1
>       <schedule to kblockd workqueue>
>                                              request B inserted
>                                              STORE hctx->ctx_map[1] bit marked
>                                              kblockd_schedule...() returns 0
>       *** WORK PENDING bit is cleared ***
>       flush_busy_ctxs() is executed, but
>       bit 1, set by CPU#1, is not observed
>
>     As a result, request B remained pending forever.
>
>     This behaviour can be explained by a speculative LOAD of
>     hctx->ctx_map on CPU#0, which is reordered with the clear of the
>     PENDING bit and executed _before_ the actual STORE of bit 1 on CPU#1.
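>
>     In code, the two sides of that race look roughly like this (heavily
>     simplified call outline, not the literal blk-mq source; only the
>     marked STORE and LOAD matter):
>
>         /* Insert side, per CPU: publish work for this ctx, then kick
>          * the hw queue; the kick is a no-op if PENDING is still set. */
>         blk_mq_insert_requests()
>             list_add_tail(&rq->queuelist, &ctx->rq_list);
>             set_bit(ctx->index_hw, hctx->ctx_map);           /* STORE */
>             kblockd_schedule_delayed_work_on(...);           /* kick */
>
>         /* Work function, executed by the kblockd worker right after
>          * the PENDING bit has been cleared: */
>         __blk_mq_run_hw_queue()
>             flush_busy_ctxs(hctx, &rq_list);    /* LOAD of hctx->ctx_map */
>             /* ... dispatch the collected requests ... */
>
>     The speculative LOAD in flush_busy_ctxs() is what misses the bit set
>     by CPU#1 in the diagram above.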
>
>     The proper fix is an explicit full barrier (<mfence> on x86), which
>     guarantees that the clear of the PENDING bit is executed before any
>     possible speculative LOADs or STOREs inside the actual work function.
>
>     Signed-off-by: Roman Pen <roman.penyaev at profitbricks.com>
>     Cc: Gioh Kim <gi-oh.kim at profitbricks.com>
>     Cc: Michael Wang <yun.wang at profitbricks.com>
>     Cc: Tejun Heo <tj at kernel.org>
>     Cc: Jens Axboe <axboe at kernel.dk>
>     Cc: linux-block at vger.kernel.org
>     Cc: linux-kernel at vger.kernel.org
>     Cc: stable at vger.kernel.org
>     Signed-off-by: Tejun Heo <tj at kernel.org>
>
>     Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
> ---
>  kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index ec41322a..ffd9f4e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -638,6 +638,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>  	 */
>  	smp_wmb();
>  	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +	/*
> +	 * The following mb guarantees that previous clear of a PENDING bit
> +	 * will not be reordered with any speculative LOADS or STORES from
> +	 * work->current_func, which is executed afterwards.  This possible
> +	 * reordering can lead to a missed execution on attempt to queue
> +	 * the same @work.  E.g. consider this case:
> +	 *
> +	 *   CPU#0                         CPU#1
> +	 *   ----------------------------  --------------------------------
> +	 *
> +	 * 1  STORE event_indicated
> +	 * 2  queue_work_on() {
> +	 * 3    test_and_set_bit(PENDING)
> +	 * 4 }                             set_..._and_clear_pending() {
> +	 * 5                                 set_work_data() # clear bit
> +	 * 6                                 smp_mb()
> +	 * 7                               work->current_func() {
> +	 * 8				      LOAD event_indicated
> +	 *				   }
> +	 *
> +	 * Without an explicit full barrier speculative LOAD on line 8 can
> +	 * be executed before CPU#0 does STORE on line 1.  If that happens,
> +	 * CPU#0 observes the PENDING bit is still set and new execution of
> +	 * a @work is not queued in a hope, that CPU#1 will eventually
> +	 * finish the queued @work.  Meanwhile CPU#1 does not see
> +	 * event_indicated is set, because speculative LOAD was executed
> +	 * before actual STORE.
> +	 */
> +	smp_mb();
>  }
>
>  static void clear_work_data(struct work_struct *work)
> .
>

