[Devel] [PATCH] dm-ploop: fix discard writeback
Andrey Zhadchenko
andrey.zhadchenko at virtuozzo.com
Thu Jun 19 12:03:55 MSK 2025
On 6/19/25 05:59, Pavel Tikhomirov wrote:
>
>
> On 6/18/25 19:50, Andrey Zhadchenko wrote:
>> When doing simple discard tests, we encountered the following crash
>> with panic_on_warn:
>>
>> [4421147.245184] ------------[ cut here ]------------
>> [4421147.245569] WARNING: CPU: 6 PID: 505054 at drivers/md/dm-
>> ploop.h:392 ploop_advance_local_after_bat_wb+0x2ae/0x2d0 [ploop]
>> [4421147.246073] Modules linked in: loop tls dm_qcow2(E) vhost_net tap
>> xt_physdev nft_meta_bridge tun dm_zero_rq vhost_blk vhost_vsock vhost
>> vhost_iotlb ip6t_REJECT nf_reject_ipv6 xt_CHECKSUM xt_MASQUERADE
>> xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat
>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter 8021q garp mrp
>> nf_tables nfnetlink rpcrdma rfkill intel_rapl_msr intel_rapl_common
>> intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm
>> iTCO_wdt irqbypass iTCO_vendor_support rapl vmw_vsock_virtio_transport
>> vmw_vsock_virtio_transport_common virtio_console vsock i2c_i801 pcspkr
>> lpc_ich i2c_smbus pvpanic_mmio pvpanic joydev nfsd br_netfilter veth
>> overlay auth_rpcgss nfs_acl lockd grace bridge stp llc sunrpc
>> binfmt_misc xfs libcrc32c sd_mod sr_mod cdrom t10_pi sg drm_ttm_helper
>> ttm drm_kms_helper crct10dif_pclmul syscopyarea crc32_pclmul
>> sysfillrect ahci sysimgblt libahci fb_sys_fops virtio_net
>> ghash_clmulni_intel libata net_failover drm virtio_scsi failover
>> [4421147.246120] serio_raw crc32c_intel fuse_kio_pcs rdma_cm iw_cm
>> ib_cm ib_core fuse push_backup dm_tracking ploop dm_mod [last
>> unloaded: dm_qcow2]
>> [4421147.249733] CPU: 6 PID: 505054 Comm: kworker/6:1 ve: / Kdump:
>> loaded Tainted: G E X ------- ---
>> 5.14.0-427.44.1.ovz9.80.29 #1 ovz9.80.29
>> [4421147.250261] Hardware name: Virtuozzo KVM/Virtuozzo, BIOS
>> 1.16.3-2.vz9.2 04/01/2014
>> [4421147.250651] Workqueue: dio/dm-1 iomap_dio_complete_work
>> [4421147.250914] RIP:
>> 0010:ploop_advance_local_after_bat_wb+0x2ae/0x2d0 [ploop]
>> [4421147.251216] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 48
>> 8b b5 88 00 00 00 3b 8d 90 00 00 00 73 0c 89 cf f0 48 0f ab 3e e9 81
>> fe ff ff <0f> 0b e9 7a fe ff ff 0f 0b 0f 0b e9 22 ff ff ff 0f 0b e9 c9
>> fe ff
>> [4421147.251938] RSP: 0018:ffffc07dc8447d98 EFLAGS: 00010002
>> [4421147.252202] RAX: ffff9dc778699040 RBX: ffff9dc7413bd800 RCX:
>> 0000000000000000
>> [4421147.252578] RDX: 0000000000000010 RSI: 0000000068746957 RDI:
>> ffff9dcc5414c420
>> [4421147.252943] RBP: ffff9dcac2299000 R08: 0000000000000000 R09:
>> ffff9dcac2299400
>> [4421147.253328] R10: 0000000000000001 R11: ffff9dc747460000 R12:
>> ffffc07dc8447dd0
>> [4421147.253706] R13: ffffffffcedc7000 R14: 0000000000000074 R15:
>> ffff9dc8e56aa380
>> [4421147.254071] FS: 0000000000000000(0000) GS:ffff9dce9fb80000(0000)
>> knlGS:0000000000000000
>> [4421147.254468] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [4421147.254743] CR2: 00007ffd825d3000 CR3: 00000001052a4002 CR4:
>> 0000000000372ee0
>> [4421147.255109] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [4421147.255474] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7:
>> 0000000000000400
>> [4421147.255848] Call Trace:
>> [4421147.256040] <TASK>
>> [4421147.256225] ? show_trace_log_lvl+0x1c4/0x2df
>> [4421147.256466] ? show_trace_log_lvl+0x1c4/0x2df
>> [4421147.256712] ? ploop_put_piwb+0x187/0x1f0 [ploop]
>> [4421147.256959] ? ploop_advance_local_after_bat_wb+0x2ae/0x2d0 [ploop]
>> [4421147.257272] ? __warn+0x81/0x110
>> [4421147.257485] ? ploop_advance_local_after_bat_wb+0x2ae/0x2d0 [ploop]
>> [4421147.257775] ? report_bug+0x10a/0x140
>> [4421147.257997] ? handle_bug+0x3c/0x70
>> [4421147.258212] ? exc_invalid_op+0x14/0x70
>> [4421147.258438] ? asm_exc_invalid_op+0x16/0x20
>> [4421147.258678] ? ploop_advance_local_after_bat_wb+0x2ae/0x2d0 [ploop]
>> [4421147.258962] ? ploop_advance_local_after_bat_wb+0xcc/0x2d0 [ploop]
>> [4421147.259246] ploop_put_piwb+0x187/0x1f0 [ploop]
>> [4421147.259493] ploop_do_pio_endio+0x31/0x90 [ploop]
>> [4421147.259748] process_one_work+0x1e2/0x3b0
>> [4421147.259980] ? __pfx_worker_thread+0x10/0x10
>> [4421147.261238] worker_thread+0x50/0x3a0
>> [4421147.261460] ? __pfx_worker_thread+0x10/0x10
>> [4421147.261704] kthread+0xdd/0x100
>> [4421147.261915] ? __pfx_kthread+0x10/0x10
>> [4421147.262138] ret_from_fork+0x29/0x50
>> [4421147.262360] </TASK>
>> [4421147.262553] Kernel panic - not syncing: panic_on_warn set ...
>>
>> The extra debug looks like that:
>>
>> [ 3022.561985] got discard: pos 0, size 1048576
>> [ 3022.562373] processing discard: clu 0 (idx 16), len 1048576
>> [ 3022.563182] dropped clu 1752459607 (total 101) [page 0,
>> clu 0 (i 16 + off -16)]
>>
>> 1752459607 -> 0x68746957 -> 57 69 74 68 ("With") which is a start of
>> ploop
>> header. So we clearly missed the first metadata page handling.
>>
>> Since ploop first metadata page contains 64bit header, we shift all
>> clusters in bat by this value when accessing them.
>> In ploop_advance_local_after_bat_wb() we iterate over the page either
>> from
>> 0 or 16, if it is a first page. Therefore:
>> - i is the position of iterated cluster in the metadata page, so we
>> should
>> use it for accessing/writing bat_entries and bat_levels.
>> - i + off is the real cluster number, we use it for
>> ploop_hole_set_bit()
>>
>> Also the current code does out-of-bound access if the metadata page is
>> not the first, as it uses real cluster number as index when accessing
>> the bat page (which is only 4096 bytes long)
>>
>> Fixes: Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
>> ---
>> drivers/md/dm-ploop-map.c | 11 +++++------
>> 1 file changed, 5 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/md/dm-ploop-map.c b/drivers/md/dm-ploop-map.c
>> index 6dd94af5fd94..a09be9e3a62f 100644
>> --- a/drivers/md/dm-ploop-map.c
>> +++ b/drivers/md/dm-ploop-map.c
>> @@ -853,17 +853,16 @@ static void
>> ploop_advance_local_after_bat_wb(struct ploop *ploop,
>> for (; i < last; i++) {
>> if (piwb->type == PIWB_TYPE_DISCARD) {
>> /* discard completed */
>> - u32 clu = i + off;
>> - u8 level = md->bat_levels[clu];
>> - u32 d_clu = READ_ONCE(bat_entries[clu]);
>> + u8 level = md->bat_levels[i];
>> + u32 d_clu = READ_ONCE(bat_entries[i]);
>
> In original code in ploop_piwb_discard_completed (before da5f60147b62
> ("dm-ploop: dm-ploop: simplify discard completion")) we had:
>
> ploop_piwb_discard_completed(clu = i + off) {
> clu = ploop_bat_clu_idx_in_page(clu) {
> return (clu + PLOOP_MAP_OFFSET) % (PAGE_SIZE / sizeof(map_index_t));
> }
> dst_clu = READ_ONCE(bat_entries[clu]);
> ...
> ploop_hole_set_bit(dst_clu, ploop);
> }
>
> I'm struggling to understand how is that equivalent to how it is now
> where we just:
>
> ploop_hole_set_bit(i + off, ploop);
>
> Without any conversion. Can you please explain?
>
>> if (success && !dst_clu[i] &&
>> (!(d_clu == BAT_ENTRY_NONE ||
>> level < ploop_top_level(ploop)))) {
>> WARN_ON_ONCE(ploop->nr_deltas != 1);
>> - WRITE_ONCE(bat_entries[clu], BAT_ENTRY_NONE);
>> - WRITE_ONCE(md->bat_levels[clu], 0);
>> - ploop_hole_set_bit(d_clu, ploop);
>> + WRITE_ONCE(bat_entries[i], BAT_ENTRY_NONE);
>> + WRITE_ONCE(md->bat_levels[i], 0);
>> + ploop_hole_set_bit(i + off, ploop);
>> }
>> continue;
>
Yeah, I have forgot that we should set hole not for the iterated cluster
itself but for the one we read from iterated cluster. I will send v2. So
here the correct code would be the same as before:
ploop_hole_set_bit(d_clu, ploop);
More information about the Devel
mailing list