[Devel] [PATCH v3 vz9/vz10] dm-ploop: fallback to kvmalloc for large bvec allocations
    Vasileios Almpanis
    vasileios.ampanis at virtuozzo.com
    Thu Oct 23 12:33:31 MSK 2025

Hi all,
First of all, thank you for all the feedback and time you’ve spent 
reviewing this patch.
I’d like to add that gfpflags_allow_blocking() returns false if the 
flags don’t include __GFP_DIRECT_RECLAIM.
Since GFP_ATOMIC doesn’t include that flag, calling kvmalloc() with 
GFP_ATOMIC effectively disables the vmalloc() fallback, so it behaves 
like plain kmalloc().
Regarding Pavel’s question about atomic context: in our case, we’re 
called by the block layer under rcu_read_lock(), as shown below:
blk_mq_run_dispatch_ops(hctx->queue, blk_mq_sched_dispatch_requests(hctx))
   └─ rcu_read_lock()
      └─ blk_mq_sched_dispatch_requests(hctx)
          └─ blk_mq_dispatch_rq_list()
              └─ q->mq_ops->queue_rq()
                  └─ map_request()
                      └─ ploop_clone_and_map()
   └─ rcu_read_unlock()
__blk_mq_run_dispatch_ops() will use either srcu_read_lock() or 
rcu_read_lock() depending on whether BLK_MQ_F_BLOCKING is set in 
request_queue->tag_set->flags.
In our case, request-based targets have their flags set as:
md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
To confirm that we truly cannot sleep, I tried calling 
kvmalloc(GFP_NOIO) inside ploop_clone_and_map(), which triggered:
Voluntary context switch within RCU read-side critical section!
...
[  312.012324]  __vmalloc_node_range+0xc8/0x220
[  312.012602]  kvmalloc_node+0xaa/0xe0
[  312.012876]  ? ploop_create_bvec_from_rq+0x93/0x130 [ploop]
[  312.013176]  ploop_create_bvec_from_rq+0x93/0x130 [ploop]
[  312.024807]  ploop_prepare_one_embedded_pio.constprop.0+0x3b/0xc0 [ploop]
[  312.025100]  ploop_submit_embedded_pio+0x20e/0x300 [ploop]
[  312.025388]  ploop_clone_and_map+0x12b/0x1c0 [ploop]
[  312.025674]  map_request+0x56/0x240 [dm_mod]
[  312.025971]  dm_mq_queue_rq+0xe7/0x220 [dm_mod]
[  312.026267]  blk_mq_dispatch_rq_list+0x14e/0x560
[  312.026543]  __blk_mq_do_dispatch_sched+0xb8/0x330
[  312.026829]  __blk_mq_sched_dispatch_requests+0x14d/0x190
[  312.027110]  ? finish_task_switch.isra.0+0x8c/0x2a0
[  312.027389]  blk_mq_sched_dispatch_requests+0x33/0x60
[  312.027666]  blk_mq_run_work_fn+0x66/0x80
...
Given this, it seems clear that we can’t safely use any sleeping 
allocation here.
I don’t see a good way to set allocation flags per target, and changing 
dm-rq for that would be more intrusive.
All other targets implementing clone_and_map_rq also allocate with 
GFP_ATOMIC.
I’m fine with deferring allocations larger than one page to a workqueue 
if that’s acceptable to you, Pavel.
Thanks again for the review and discussion.
Best regards,
On 10/23/25 6:29 AM, Pavel Tikhomirov wrote:
> On 10/22/25 23:20, Vasileios Almpanis wrote:
>> @@ -2615,7 +2624,7 @@ static void ploop_submit_embedded_pio(struct 
>> ploop *ploop, struct pio *pio)
>>           goto out;
>>       }
>>   -    ploop_prepare_one_embedded_pio(ploop, pio, &deferred_pios);
>> +    ploop_prepare_one_embedded_pio(ploop, pio, &deferred_pios, 
>> GFP_ATOMIC | __GFP_NOWARN);
>>       /*
>>        * Disable fast path due to rcu lockups fs -> ploop -> fs - 
>> fses are not reentrant
>>        * we can however try another fast path skip dispatcher thread 
>> and pass directly to
>
> Stupid question, do we really need GFP_ATOMIC at all on request 
> sending path?
>
> For all the paths where ploop_submit_embedded_pio is called:
>
>   +-< ploop_submit_embedded_pio
>     +-< ploop_submit_embedded_pios
>     | +-< ploop_resume_submitting_pios
>     | | +-< ploop_suspend_submitting_pios
>     | | +-< ploop_resize
>     | | +-< ploop_merge_latest_snapshot
>     | | +-< ploop_delta_clusters_merged
>     | | +-< ploop_update_delta_index
>     | +-< ploop_resubmit_enospc_pios
>     | | +-< do_ploop_run_work
>     | | +-< ploop_presuspend
>     +-< ploop_clone_and_map
>       +-< map_request
>         +-< dm_mq_queue_rq
>           +-< blk_mq_dispatch_rq_list
>           +-< __blk_mq_issue_directly
>           +-< __blk_mq_flush_plug_list
>
> I don't really see any interrupt-path. Can device mapper really call 
> us from interrupt context?
>
> Note in original bug where we've decided to add GFP_ATOMIC 
> https://virtuozzo.atlassian.net/browse/VSTOR-98291 we have stacks that 
> are definitely not from interrupts. It might just be that we take some 
> rcu/spinlocks on those paths and thus we had problems with sleep =) :
>
> [  468.029539] BUG: sleeping function called from invalid context at 
> include/linux/sched/mm.h:273
> [  468.030477] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 
> 6110, name: dd
> [  468.031288] preempt_count: 0, expected: 0
> [  468.031731] RCU nest depth: 1, expected: 0
>
> [  466.575891] BUG: sleeping function called from invalid context at 
> include/linux/sched/mm.h:273
> [  466.576971] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 
> 5954, name: ploop22352-r-8
> [  466.577871] preempt_count: 1, expected: 0
>
> If I checkout to "99e18f6fee415ac23a9193d001bcc80fd520da16~" I clearly 
> see that it was spinlock:
>
> ploop_prepare_bat_update() {
>         lockdep_assert_held(&ploop->bat_lock); // which is spinlock
>         piwb = kmalloc(sizeof(*piwb), GFP_NOIO); // sleeping allocation
> }
>
> Maybe we just should not try allocations under spinlock/rcu and that 
> would save us from using GFP_ATOMIC everywhere.
>
> Maybe I'm missing something?
>