[Devel] [PATCH vz9] dm-ploop: fix and rework md updates

Wed Feb 12 12:42:12 MSK 2025

On 12.02.25 11:02, Pavel Tikhomirov wrote:
> 
> 
> On 2/12/25 14:38, Alexander Atanasov wrote:
>> On 12.02.25 7:54, Pavel Tikhomirov wrote:
>>>
>>>
>>> On 2/11/25 22:25, Alexander Atanasov wrote:
>>>> With the locking reduced it opened windows for races between
>>>> running updates, pending and new updates. Current logic to
>>>> deal with them is not correct.
>>>
>>> Examples of exact race situation in code would be appreciated, as now 
>>> it's impossible to unserstand which exact race we are trying to fix.
>>>
>>> E.g. preferably in this format:
>>>
>>> Thread 1:
>>> A1
>>> B1
>>>                  Thread2:
>>>                  A2
>>>                  B2
>>> C1
>>>                  C2
>>
>>
>>
>> Too many to describe them that way - we have three actors.
>> Locking and logic is wrong.
> 
> Sorry, but it is not obvious what was wrong with it in the first place. 
> To review and verify the fix, I need to have full information about what 
> the problem was (it's unclear from JIRA too, unless I reinvestigate all 
> the crashes there).
> 
>> Should i go over the code and explain it?
> 
> Can you please explain general lock logic a bit and maybe with couple of 
> references to code. Because for me "races between running updates, 
> pending and new updates" brings more questions when answers: What is 
> "running" updates? What is "pending" and "new" updates? What is "updates"?

Ok let me try - an md page can be in three states / which comments are 
updated in .h file/
  - MD_DIRTY - page awaits writeback - needs to be updated on disk
  - MD_WRITEBACK - page update is being  processed - write out started 
and is in progress

there is only one piwb - in md_page struct.
Let's walk over ploop_locate_new_cluster_and_attach_pio as it is the 
base case, others are derivatives.

First under lock spin_lock_irqsave(&ploop->bat_lock, flags);
The there is :
if (piwb && (piwb->type != type || test_bit(MD_WRITEBACK, &md->status)))

It uses piwb to check for dirty - if we have piwb we delay.
But the problem is in the case MD_WRITEBACK , i.e. it is not started
page is waiting for it to start.

Next check the type ALLOC vs DISCARD - we have only one piwb.
If it is the same it does not delay. And the currently waiting piwb
can be executed before it is completely updated.
For example in cow case a pio is added to cow_llist  -> the update can 
go and complete before it is added, so it won't be executed.

RACE - bat_write_complete clears  MD_WRITEBACK and sets piwb to NULL
early before the update is completed. Delayes pios can cause new updates
while the initial one is not completed.

Next it marks page dirty - but along that it is adding it to the 
writeback list - RACE - thread can see it before update is ready.
Next lock is unlocked.

To summarize - the race is between update preparation and thread
executing the updates - in several forms.

Request thread                            Execution Thread
Start to prepare update
take lock
mark for update and put into the list
release lock
                                           Executes unfinished update
                                           clears flags and NULLs piwb 
-> crash somewhere along the path

do more work to finish preparation

> 
>>
>>>
>>>>
>>>> Current flags are:
>>>> MD_DIRTY - means md page is dirty and is waiting for writeback
>>>> MD_WRITEBACK - write back is in progress
>>>> But we drop the lock after we make page dirty so it races with
>>>> writeback.
>>>>
>>>> To fix this introduce a new flag:
>>>> MD_UPDATING - which means we are updating the pending writeback
>>>> or we are creating a new one. Remove the check if piwb is present
>>>> on the page and clear piwb early on submit, check was using a type
>>>> alloc vs discard but in case of concurrent alloc and discard the piwb
>>>> is overwritten by the second operation. Using a flag solves this
>>>> and makes it more clear in what state page is.
>>>>
>>>> Nice side effect of closing that race is that we have a minimalistic
>>>> delay for writeback updates. This happens when we try to submit a
>>>> pending writeback and we see the MD_UPDATING flags, update is not
>>>> submitted but it is retried on next cycle. So if we have a lot of
>>>> update to the same page in a very short periot they are accumulated
>>>> as one.
>>>>
>>>> On error only clear MD_DIRTY if it is set by us before dropping
>>>> the lock and when breaking the prepared bat update only clear
>>>> md->piwb  if it is the same as the piwb allocated in the function.
>>>>
>>>> Add some clarification error messages around WARN_ONs.
>>>>
>>>> While at it fix:
>>>> - mapping bat_bage two times in same function, do it only once
>>>> - missed unmapping on error path
>>>> - update proper bat levels in delayed update case
>>>> - locking around prepare_bat_update from ploop_prepare_reloc_index_wb
>>>>
>>>> https://virtuozzo.atlassian.net/browse/VSTOR-98807
>>>> Signed-off-by: Alexander Atanasov <alexander.atanasov at virtuozzo.com>
>>>> ---
>>>>   drivers/md/dm-ploop-cmd.c |   4 +-
>>>>   drivers/md/dm-ploop-map.c | 153 
>>>> +++++++++++++++++++++++++++-----------
>>>>   drivers/md/dm-ploop.h     |   8 +-
>>>>   3 files changed, 117 insertions(+), 48 deletions(-)
>>>>
>>>> diff --git a/drivers/md/dm-ploop-cmd.c b/drivers/md/dm-ploop-cmd.c
>>>> index e032af52b64e..daec9827b42c 100644
>>>> --- a/drivers/md/dm-ploop-cmd.c
>>>> +++ b/drivers/md/dm-ploop-cmd.c
>>>> @@ -315,10 +315,8 @@ static int ploop_grow_relocate_cluster(struct 
>>>> ploop *ploop,
>>>>           goto out;
>>>>       }
>>>> -    spin_lock_irq(&ploop->bat_lock);
>>>>       ret = ploop_prepare_reloc_index_wb(ploop, &md, clu, &new_dst,
>>>>                          ploop_top_delta(ploop)->file);
>>>> -    spin_unlock_irq(&ploop->bat_lock);
>>>>       if (ret < 0) {
>>>>           PL_ERR("reloc: can't prepare it: %d", ret);
>>>>           goto out;
>>>> @@ -329,7 +327,7 @@ static int ploop_grow_relocate_cluster(struct 
>>>> ploop *ploop,
>>>>       ret = ploop_write_cluster_sync(ploop, pio, new_dst);
>>>>       if (ret) {
>>>>           PL_ERR("reloc: failed write: %d", ret);
>>>> -        ploop_break_bat_update(ploop, md);
>>>> +        ploop_break_bat_update(ploop, md, piwb);
>>>
>>> You should probably understand that this code:
>>> ```
>>>          piwb = md->piwb;
>>>
>>>          /* Write clu to new destination */
>>>          ret = ploop_write_cluster_sync(ploop, pio, new_dst);
>>>          if (ret) {
>>>                  PL_ERR("reloc: failed write: %d", ret);
>>>                  ploop_break_bat_update(ploop, md, piwb);
>>>                  goto out;
>>>          }
>>>
>>> ```
>>> can be optimized by the compiler to:
>>> ```
>>>          /* Write clu to new destination */
>>>          ret = ploop_write_cluster_sync(ploop, pio, new_dst);
>>>          if (ret) {
>>>                  PL_ERR("reloc: failed write: %d", ret);
>>>                  ploop_break_bat_update(ploop, md, md->piwb);
>>>                  goto out;
>>>          }
>>> ```
>>
>>
>>> And in this case there is no gain from having this extra md->piwb 
>>> argument as we can already see changed md->piwb at this point. Even 
>>> if it somehow helps, I believe the fix is still racy. 
>>
>> Yes it can be optimised but this case is a sideway case for the 
>> commands and it is a question of style what to use.
> 
> We can do two versions of ploop_break_bat_update, one for three 
> arguments and one for two (which explicitly does not care about piwb 
> change). Or just add comment on sideway case that piwb change is 
> impossible.

Ok, i can add comment, i don't see a point to make another version just 
for that.

>>
>>> If you really want to read md->piwb to piwb variable BEFORE 
>>> ploop_write_cluster_sync, you should use READ_ONCE, or other 
>>> appropriately used memory barrier.
>>>
>>
>> READ_ONCE is not a memory barrier - it is a COMPILER BARRIER.
>> https://docs.kernel.org/core-api/wrappers/memory-barriers.html
>> see COMPILER BARRIER section for indeep explanation. There you will 
>> see that infact READ_ONCE forces variable to be re-read , and not read 
>> once as in one time.
> 
> Unrelated to the patch, as you've proved that patch is ok. But we've 
> already had this discussion 
> https://lists.openvz.org/pipermail/devel/2025-January/081790.html , and 
> I've provided you with citation from documentation from "COMPILER 
> BARRIER". READ_ONCE disables different compiler optimisations, including 
> that it does not allow local variable to be optimized by instead 
> directly using the global variable which can be changed in concurrent 
> thread, which looks from the user perspective as if local variable is 
> read twice from global variable. So it actually makes reads "one time" 
> reads.

No, it actually marks the variable volatile /i missed that/ and forces
access as a whole. Which results in variable to be read every time 
instead of cached one to be used, so it is not a one time read.

The example from the docs:

  (*) The compiler is within its rights to merge successive loads from
      the same variable.  Such merging can cause the compiler to "optimize"
      the following code:

         while (tmp = a)
                 do_something_with(tmp);

      into the following code, which, although in some sense legitimate
      for single-threaded code, is almost certainly not what the developer
      intended:

         if (tmp = a)
                 for (;;)
                         do_something_with(tmp);

      Use READ_ONCE() to prevent the compiler from doing this to you:

         while (tmp = READ_ONCE(a))
                 do_something_with(tmp);

READ_ONCE(a) -> forces read every time - cast to volatile
and the side efect is it can not split access in two

The last paragaph is what you missed:

"All that aside, it is never necessary to use READ_ONCE() and
WRITE_ONCE() on a variable that has been marked volatile.  For example,
because 'jiffies' is marked volatile, it is never necessary to
say READ_ONCE(jiffies).  The reason for this is that READ_ONCE() and
WRITE_ONCE() are implemented as volatile casts, which has no effect when
its argument is already marked volatile.

Please note that these compiler barriers have no direct effect on the 
CPU, which may then reorder things however it wishes."

Yes we did but let's get on the same page.

-- 
Regards,
Alexander Atanasov