[Devel] [PATCH VZ9 v2 0/6] ploop BAT optimizations

Andrey Zhadchenko andrey.zhadchenko at virtuozzo.com
Wed Oct 9 18:45:10 MSK 2024



On 10/9/24 17:35, Alexander Atanasov wrote:
> Hi,
> 
> I haven't went in deep but one thing that looks suspicious is in 6/6 patch
> 
> +    if (pio->bi_op & REQ_PREFLUSH && (pio->bi_op & REQ_OP_MASK) != 
> REQ_OP_FLUSH) {
> +        flush_pio = ploop_alloc_pio(ploop, GFP_NOIO);
> +        if (!flush_pio) {
> +            pio->bi_status = BLK_STS_RESOURCE;
> +            ploop_pio_endio(pio);
> +            return;
> +        }
> +
> +        ploop_init_pio(ploop, REQ_OP_FLUSH, flush_pio);
> +        flush_pio->endio_cb = ploop_preflush_endio;
> +        flush_pio->endio_cb_data = pio;
> +        flush_pio->queue_list_id = PLOOP_LIST_FLUSH;
> +        flush_pio->free_on_endio = true;
> +    }
> +
>       write_lock_irq(&ploop->bat_rwlock);
> 
>       list_for_each_entry(md, &ploop->wb_batch_list_prio, wb_link) {
>           md_inflight++;
> -        md->piwb->flush_pio = pio;
> +        md->piwb->flush_pio = flush_pio;
>       }
> 
> 
> one pio is allocted and then assigned to multiple md->piwb
> 
> notably : flush_pio->free_on_endio = true;

That is expected: flush pio is dispatched only when all piwb has 
completed atomic_dec(&flush_pio->md_inflight) to zero.
free_on_endio is needed to automatically drop these extra flush pios.

Crash is introduced with patch "dm-ploop: add delay for metadata 
writeback". Seems like it somehow broke BLKDISCARD. mkfs.xfs uses them, 
and there is some suspicious things:

55296 ioctl(5, BLKDISCARD, [0, 2147483648]) = 0
55296 write(1, "Discarding blocks...", 20) = 20
55296 ioctl(5, BLKDISCARD, [2147483648, 2147483648]) = 0
55296 ioctl(5, BLKDISCARD, [4294967296, 2147483648]) = 0
55296 ioctl(5, BLKDISCARD, [6442450944, 2147483648]) = 0
55296 write(1, "Done.\n", 6)            = 6
...
55296 pwrite64(5, "XFSB\0\0\20\0\0\0\0\0\0 
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512
55296 fallocate(5, FALLOC_FL_ZERO_RANGE, 4294991872, 67108864) = 0
55296 pwrite64(5, 
"\376\355\272\276\0\0\0\1\0\0\0\2\0\0\2\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0"..., 
1024, 4294991872) = 1024
55296 pread64(5, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
512, 8589934080) = 512
55296 pwrite64(5, "XFSB\0\0\20\0\0\0\0\0\0 
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512
55296 pwrite64(5, 
"XAGF\0\0\0\1\0\0\0\0\0\10\0\0\0\0\0\1\0\0\0\2\0\0\0\0\0\0\0\1"..., 512, 
512) = 512
...
55296 write(2, "Metadata CRC error detected at 0"..., 70) = 70
55296 write(2, "\n", 1)                 = 1
55296 pread64(5, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
512, 2147484160) = 512
55296 write(2, "Metadata CRC error detected at 0"..., 75) = 75
55296 write(2, "\n", 1)

Second run on the same disk just hangs on
ioctl(4, BLKDISCARD, [0, 2147483648]

> 
> 
> On 9.10.24 17:35, Denis V. Lunev wrote:
>> On 10/9/24 16:33, Denis V. Lunev wrote:
>>> On 10/9/24 14:17, Andrey Zhadchenko wrote:
>>>> With this patchset, we start to check if the upcoming requests for 
>>>> ploop
>>>> have REQ_FUA flag.
>>>> If the flag is present:
>>>>   - IO submits immediately, completion waits for BAT update
>>>>   - BAT updates submit immediately
>>>>   - set IOCB_DSYNC for writes and BAT updates
>>>>   - BAT update does not trigger flush
>>>> If the flag is absent:
>>>>   - IO completion does not wait for BAT update
>>>>   - all BAT updates are periodically submitted in a batch
>>>>   - BAT update does not trigger flush
>>>>
>>>> New parameter with metadata submit delay can be specified when setting
>>>> up ploop table
>>>>
>>>> v2 changes:
>>>> - drop flush after BAT updates even for FUA requests
>>>> - set IOCB_DSYNC flag on writes triggered by FUA
>>>> - complete all accumulated metadata on FLUSH requests
>>>> - support REQ_PREFLUSH flag
>>>>
>>>> Andrey Zhadchenko (3):
>>>>    dm-ploop: do not flush BAT writes if request was not FUA
>>>>    dm-ploop: do not wait for BAT update for non-FUA requests
>>>>    dm-ploop: add delay for metadata writeback
>>>>
>>>>
>>>> *** BLURB HERE ***
>>>>
>>>> Andrey Zhadchenko (6):
>>>>    dm-ploop: do not flush after metadata writes
>>>>    dm-ploop: set IOCB_DSYNC on all FUA requests
>>>>    dm-ploop: do not wait for BAT update for non-FUA requests
>>>>    dm-ploop: add delay for metadata writeback
>>>>    dm-ploop: submit all postponed metadata on REQ_OP_FLUSH
>>>>    dm-ploop: handle REQ_PREFLUSH
>>>>
>>>>   drivers/md/dm-ploop-bat.c    |   1 +
>>>>   drivers/md/dm-ploop-map.c    | 154 
>>>> +++++++++++++++++++++++++++--------
>>>>   drivers/md/dm-ploop-target.c |  13 +++
>>>>   drivers/md/dm-ploop.h        |   9 ++
>>>>   4 files changed, 143 insertions(+), 34 deletions(-)
>>>>
>>> these patches do not work at all over 32 GB ramdisk
>>>
>>> dev007 /mnt/ramdisk # ploop init -s 8G /mnt/ramdisk/root.hds
>>> Creating delta /mnt/ramdisk/root.hds bs=2048 size=16777216 sectors v2
>>> Adding snapshot {5fbaabe3-6958-40ff-92a7-860e329aab41}
>>> Storing /mnt/ramdisk/DiskDescriptor.xml
>>> Opening delta /mnt/ramdisk/root.hds
>>> Opening delta /mnt/ramdisk/root.hds
>>> Adding delta dev=/dev/mapper/ploop30904 img=/mnt/ramdisk/root.hds (rw)
>>> Update inuse state /mnt/ramdisk/root.hds state=1953459801
>>> Opening delta /mnt/ramdisk/root.hds
>>> Running: partprobe /dev/mapper/ploop30904
>>> Set untrusted /dev/mapper/ploop30904
>>> Running: parted -s /dev/mapper/ploop30904 mklabel gpt
>>> Running: sgdisk -n 1:2048:16775167 /dev/mapper/ploop30904
>>> Warning: The kernel is still using the old partition table.
>>> The new table will be used at the next reboot or after you
>>> run partprobe(8) or kpartx(8)
>>> The operation has completed successfully.
>>> Running: partprobe /dev/mapper/ploop30904
>>> Running: mkfs.xfs -b size=4096 /dev/mapper/ploop30904p1
>>> meta-data=/dev/mapper/ploop30904p1 isize=512    agcount=4, 
>>> agsize=524160 blks
>>>          =                       sectsz=512   attr=2, projid32bit=1
>>>          =                       crc=1        finobt=1, sparse=1, 
>>> rmapbt=0
>>>          =                       reflink=1    bigtime=1 inobtcount=1 
>>> nrext64=0
>>> data     =                       bsize=4096   blocks=2096640, imaxpct=25
>>>          =                       sunit=0      swidth=0 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>>> log      =internal log           bsize=4096   blocks=16384, version=2
>>>          =                       sectsz=512   sunit=0 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>> Discarding blocks...Done.
>>> Metadata CRC error detected at 0x561dffbd10f0, xfs_agf block 0x1/0x200
>>> Metadata CRC error detected at 0x561dffbd10f0, xfs_agf block 
>>> 0x3ffc01/0x200
>>> Metadata CRC error detected at 0x561dffbd10f0, xfs_agf block 
>>> 0x7ff801/0x200
>>> Metadata CRC error detected at 0x561dffbd10f0, xfs_agf block 
>>> 0xbff401/0x200
>>> Metadata CRC error detected at 0x561dffbee040, xfs_agi block 0x2/0x200
>>> Metadata CRC error detected at 0x561dffbee040, xfs_agi block 
>>> 0x3ffc02/0x200
>>> Metadata CRC error detected at 0x561dffbee040, xfs_agi block 
>>> 0x7ff802/0x200
>>> Metadata CRC error detected at 0x561dffbee040, xfs_agi block 
>>> 0xbff402/0x200
>>> mkfs.xfs: Inode allocation failed [28 - No space left on device]
>>> Error in run_prg_rc (util.c:309): Command mkfs.xfs -b size=4096 
>>> /dev/mapper/ploop30904p1  exited with code 1
>>> Umount /dev/mapper/ploop30904
>>> DM command: remove /dev/mapper/ploop30904p1
>>> DM message: /dev/mapper/ploop30904 get_img_name 0
>>> Opening delta /mnt/ramdisk/root.hds
>>> DM command: suspend /dev/mapper/ploop30904
>>> DM command: resume /dev/mapper/ploop30904
>>> Unmounting device /dev/mapper/ploop30904
>>> DM command: resume /dev/mapper/ploop30904
>>> DM command: remove /dev/mapper/ploop30904
>>> Clear inuse state
>>> Dropping image /mnt/ramdisk/root.hds
>>> dev007 /mnt/ramdisk #
>>>
>>> rmmod after that has been hanged.
>>>
>>> Den
>> [  351.210027] 
>> =============================================================================
>> [  351.210348] BUG ploop-pio (Tainted: G            E  X  ------- --- 
>> ): Objects remaining in ploop-pio on __kmem_cache_shutdown()
>> [  351.210620] 
>> -----------------------------------------------------------------------------
>>
>> [  351.210886] Slab 0x00000000afe69346 objects=36 used=3 
>> fp=0x0000000028f0e0af 
>> flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
>> [  351.211192] CPU: 2 PID: 15519 Comm: rmmod ve: / Kdump: loaded 
>> Tainted: G            E  X  -------  ---  5.14.0-427.37.1.ovz9.78.1 #6 
>> ovz9.78.1
>> [  351.211564] Hardware name: Supermicro SYS-5019C-L/X11SCL-IF, BIOS 
>> 1.5 10/05/2020
>> [  351.211878] Call Trace:
>> [  351.212146]  <TASK>
>> [  351.212452]  dump_stack_lvl+0x34/0x48
>> [  351.212732]  slab_err+0xa2/0x110
>> [  351.213006]  __kmem_cache_shutdown+0x16a/0x310
>> [  351.213305]  kmem_cache_destroy+0x51/0x160
>> [  351.213583]  dm_ploop_exit+0x24/0x20d [ploop]
>> [  351.213867]  __do_sys_delete_module.constprop.0+0x175/0x280
>> [  351.214148]  ? syscall_trace_enter.constprop.0+0x126/0x1a0
>> [  351.214428]  do_syscall_64+0x59/0x90
>> [  351.214703]  ? __rseq_handle_notify_resume+0x26/0xb0
>> [  351.214980]  ? exit_to_user_mode_loop+0xd0/0x130
>> [  351.215257]  ? exit_to_user_mode_prepare+0xb6/0x100
>> [  351.215553]  ? syscall_exit_to_user_mode+0x22/0x40
>> [  351.215830]  ? do_syscall_64+0x69/0x90
>> [  351.216105]  ? exc_page_fault+0x62/0x150
>> [  351.216414]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> [  351.216693] RIP: 0033:0x7f72bc83f5ab
>> [  351.216988] Code: 73 01 c3 48 8b 0d 75 a8 1b 00 f7 d8 64 89 01 48 
>> 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 
>> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 a8 1b 00 f7 d8 64 89 
>> 01 48
>> [  351.217391] RSP: 002b:00007fff49e1e798 EFLAGS: 00000206 ORIG_RAX: 
>> 00000000000000b0
>> [  351.217707] RAX: ffffffffffffffda RBX: 00005580d76568f0 RCX: 
>> 00007f72bc83f5ab
>> [  351.217991] RDX: 000000000000000a RSI: 0000000000000800 RDI: 
>> 00005580d7656958
>> [  351.218313] RBP: 0000000000000000 R08: 0000000000000000 R09: 
>> 0000000000000000
>> [  351.218596] R10: 00007f72bc99eac0 R11: 0000000000000206 R12: 
>> 00007fff49e1e9f0
>> [  351.218881] R13: 00007fff49e1f070 R14: 00005580d76562a0 R15: 
>> 00005580d76568f0
>> [  351.219166]  </TASK>
>> [  351.219453] Disabling lock debugging due to kernel taint
>> [  351.219453] Disabling lock debugging due to kernel taint
>> [  351.219732] Kernel panic - not syncing: panic_on_warn set ...
>> [  351.220012] CPU: 2 PID: 15519 Comm: rmmod ve: / Kdump: loaded 
>> Tainted: G    B       E  X  -------  ---  5.14.0-427.37.1.ovz9.78.1 #6 
>> ovz9.78.1
>> [  351.220345] Hardware name: Supermicro SYS-5019C-L/X11SCL-IF, BIOS 
>> 1.5 10/05/2020
>> [  351.220677] Call Trace:
>> [  351.220945]  <TASK>
>> [  351.221230]  dump_stack_lvl+0x34/0x48
>> [  351.221523]  panic+0x107/0x2f7
>> [  351.221796]  slab_err.cold+0xc/0xc
>> [  351.222073]  __kmem_cache_shutdown+0x16a/0x310
>> [  351.222368]  kmem_cache_destroy+0x51/0x160
>> [  351.222643]  dm_ploop_exit+0x24/0x20d [ploop]
>> [  351.222921]  __do_sys_delete_module.constprop.0+0x175/0x280
>> [  351.223216]  ? syscall_trace_enter.constprop.0+0x126/0x1a0
>> [  351.223519]  do_syscall_64+0x59/0x90
>> [  351.223793]  ? __rseq_handle_notify_resume+0x26/0xb0
>> [  351.224070]  ? exit_to_user_mode_loop+0xd0/0x130
>> [  351.224345]  ? exit_to_user_mode_prepare+0xb6/0x100
>> [  351.224622]  ? syscall_exit_to_user_mode+0x22/0x40
>> [  351.224897]  ? do_syscall_64+0x69/0x90
>> [  351.225170]  ? exc_page_fault+0x62/0x150
>> [  351.225460]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> [  351.225737] RIP: 0033:0x7f72bc83f5ab
>> [  351.226011] Code: 73 01 c3 48 8b 0d 75 a8 1b 00 f7 d8 64 89 01 48 
>> 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 
>> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 a8 1b 00 f7 d8 64 89 
>> 01 48
>> [  351.226394] RSP: 002b:00007fff49e1e798 EFLAGS: 00000206 ORIG_RAX: 
>> 00000000000000b0
>> [  351.226709] RAX: ffffffffffffffda RBX: 00005580d76568f0 RCX: 
>> 00007f72bc83f5ab
>> [  351.226995] RDX: 000000000000000a RSI: 0000000000000800 RDI: 
>> 00005580d7656958
>> [  351.227316] RBP: 0000000000000000 R08: 0000000000000000 R09: 
>> 0000000000000000
>> [  351.227601] R10: 00007f72bc99eac0 R11: 0000000000000206 R12: 
>> 00007fff49e1e9f0
>> [  351.227887] R13: 00007fff49e1f070 R14: 00005580d76562a0 R15: 
>> 00005580d76568f0
>> [  351.228171]  </TASK>
>>
> 


More information about the Devel mailing list