[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

Maxim Patlasov mpatlasov at virtuozzo.com
Thu Jun 16 14:41:04 PDT 2016


On 06/16/2016 09:30 AM, Dmitry Monakhov wrote:
> Dmitry Monakhov <dmonakhov at openvz.org> writes:
>
>> Maxim Patlasov <mpatlasov at virtuozzo.com> writes:
>>
>>> Dima,
>>>
>>> I agree that the ploop barrier code is broken in many ways, but I don't
>>> think the patch actually fixes it. I hope you would agree that the
>>> completion of REQ_FUA guarantees only that this particular bio has landed
>>> on disk; it says nothing about flushing previously submitted (and
>>> completed) bio-s, and it is also possible that a power outage catches us
>>> when this REQ_FUA has already landed on disk but the previous bio-s have
>>> not.
>> Actually it does, but implicitly: Linux handles FUA as FLUSH,W,FLUSH.
>> So yes, it would be more correct to tag WBI with FLUSH_FUA.
>>> Hence, for RELOC_{A|S} requests we actually need something like this:
>>>
>>>        RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>>>        RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>>>
>>> (i.e. we do need to flush all previously submitted data before starting
>>> to update the BAT on disk)
>>>
>> Correct sequence:
>> RELOC_S: R1, W2, WBI:FLUSH_FUA
>> RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA
>>
>>> not simply:
>>>
>>>> RELOC_S: R1, W2, WBI:FUA
>>>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and
>>> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we
>>> could remove them completely (along with that optimization delaying
>>> incoming FUA) and re-implement all this stuff from scratch:
>>>
>>> 1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set
>>> REQ_FUA in preq->req_rw before calling ->submit(preq)
>>>
>>> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send the bio updating
>>> the BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for
>>> RELOC_A|S in ploop_index_update and map_wb_complete
>>>
>>> 3) For that optimization delaying incoming FUA (what we do now if
>>> ploop_req_delay_fua_possible() returns true) we could introduce a new
>>> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update
>>> and map_wb_complete (the same thing as 2) above). And, yes, let's
>>> WARN_ON if we somehow missed its processing.
>> Yes. This was one of my ideas.
>> 1) FORCE_FLUSH and FORCE_FUA are redundant states which simply mirror
>> RELOC_{A,S} semantics. Let's get rid of that crap and simply introduce
>> PLOOP_IO_FLUSH_DELAYED.
>> 2) Fix ->write_page to handle FLUSH the same way it already handles FUA.
>>> The only complication I foresee is how to teach kaio to pre-flush
>>> in kaio_write_page -- it's doable, but involves kaio_resubmit, which is
>>> already pretty convoluted.
>>>
>> Yes. kaio_submit is correct, but kaio_write_page does not care about REQ_FLUSH.
> Crap. Currently kaio can handle fsync only via kaio_queue_fsync_req
> which is async and not suitable for page_write.

I think it's doable to process page_write via kaio_fsync_thread, but 
it's tricky.

> Max, let's agree on terminology.
> The reason I wrote this is that Linux internally interprets FUA as
> preflush,write,postflush, which is wrong from an academic point of view, but
> it is the world we live in with Linux.

Are you sure that this (FUA == preflush,write,postflush) is universally 
true (i.e. no exceptions)? What about bio-based block-device drivers?

> This is the reason I read the code
> differently from the way it was designed.
> Let's state that ploop is an ideal world where:
> FLUSH ==> preflush
> FUA   ==> WRITE,postflush

In an ideal world FUA is not obliged to be handled by a postflush: it's 
enough to guarantee that *this* particular request went to the platter; 
other requests may remain not-flushed-yet. 
Documentation/block/writeback_cache_control.txt is absolutely clear 
about it:

> The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from
> the filesystem and will make sure that I/O completion for this request is only
> signaled after the data has been committed to non-volatile storage.
> ...
> If the FUA bit is not natively supported the block
> layer turns it into an empty REQ_FLUSH request after the actual write.
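
Just to make that distinction concrete, here is a minimal sketch -- not ploop
code, only the stock RHEL7-era block API (REQ_FLUSH/REQ_FUA and the old
submit_bio(rw, bio) signature); the function and bio names are made up for
illustration:

#include <linux/bio.h>
#include <linux/fs.h>

static void submit_data_then_index(struct bio *data_bio, struct bio *index_bio)
{
        /* Plain write: may sit in the device write-back cache indefinitely. */
        submit_bio(WRITE, data_bio);

        /*
         * FUA write: on completion *this* bio is on stable storage, but
         * data_bio above may still be only in the volatile cache.  If the
         * device has no native FUA, the block layer emulates it with an
         * empty post-flush -- which happens to flush data_bio too, but that
         * is an implementation detail, not a contract.
         */
        submit_bio(WRITE_FUA, index_bio);
}

To really order previously written data ahead of the index update, the second
submit has to carry a preflush as well, i.e. WRITE_FLUSH_FUA
(REQ_FLUSH|REQ_FUA) -- which is exactly the point about RELOC_{A|S} above.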


> For that reason we can perform the reloc scheme as:
>
> RELOC_A: R1,W2:FUA,WBI:FUA,W1:NULLIFY|FUA
> RELOC_S: R1,W2:FUA,WBI:FUA
>
> This allows us to handle FUA effectively and convert it to DELAYED_FLUSH where
> possible.

Ploop has the concept of map_multi_updates. In short, while you're 
handling one update, many others may come to the PLOOP_E_INDEX_DELAY state. 
And, as soon as the first one is done, we modify many indices in one 
loop (see map_wb_complete), then write that page to disk only once. 
With map_multi_update in mind, it may be suboptimal to issue many 
W2:FUA-s -- it may be better to do many ordinary W2-s instead, and only 
one pre-FLUSH later, when we're going to write the BAT page to disk.
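
As a rough sketch of what I mean (the helper name and the need_preflush
argument are hypothetical; this is not the real map_wb_complete() interface):

#include <linux/bio.h>
#include <linux/fs.h>

/* Write the merged BAT page once; pay for the ordering only here. */
static void write_merged_bat_page(struct bio *bat_bio, int need_preflush)
{
        int rw = WRITE_FUA;        /* the BAT page itself must be durable */

        if (need_preflush)         /* RELOC_A|S or a delayed incoming FUA */
                rw |= REQ_FLUSH;   /* one preflush covers all ordinary W2-s */

        submit_bio(rw, bat_bio);
}

That way N merged index updates cost one FLUSH+FUA write of the BAT page
instead of N FUA-tagged data writes.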

> Also, let's tie may_fua_delay semantics to the exact eng_state:
>
> static int may_fua_delay(struct ploop_request *preq)
> {
>    int may_delay = 1;
>
>    /* effectively this is equivalent to
>       preq->eng_state != PLOOP_E_COMPLETE,
>       but it is more readable and less error-prone in the future */
>    if (preq->eng_state != PLOOP_E_DATA_WBI)
>        may_delay = 0;
>    if (test_bit(PLOOP_REQ_RELOC_S, &preq->state) ||
>        test_bit(PLOOP_REQ_RELOC_A, &preq->state))
>        may_delay = 0;
>
>    return may_delay;
> }

It's much easier if we consider the reloc and the ordinary cases separately:

1) PLOOP_REQ_RELOC_A|S -- they always involve an index update, so we can 
always pre-FLUSH when handling the BAT write_page for them.

2) All other requests. Seeing REQ_FUA in dio_submit, we can delay the FUA if 
we're sure an index update happens later. The correct check for this is: 
"preq->eng_state == PLOOP_E_DATA_WBI"

The check "preq->eng_state != PLOOP_E_COMPLETE" is incorrect even right 
now, because map_wb_complete_post_process() calls the nullifying submit with 
preq->eng_state = PLOOP_E_RELOC_NULLIFY. So if we delay FUA if and only 
if preq->eng_state == PLOOP_E_DATA_WBI, we should be on the safe side, imho.
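
Spelled out as code, the whole scheme would look roughly like this (a loose
sketch only: PLOOP_IO_FLUSH_DELAYED does not exist yet, the helper names are
made up, the real dio_submit/ploop_index_update call sites are simplified, and
struct ploop_request plus the PLOOP_* constants are assumed from the ploop
headers):

#include <linux/bitops.h>
#include <linux/fs.h>

/* dio_submit() side: strip an incoming FUA when an index update is certain. */
static unsigned long maybe_delay_fua(struct ploop_request *preq,
                                     unsigned long rw)
{
        if ((rw & REQ_FUA) && preq->eng_state == PLOOP_E_DATA_WBI &&
            !test_bit(PLOOP_REQ_RELOC_A, &preq->state) &&
            !test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
                rw &= ~REQ_FUA;
                set_bit(PLOOP_IO_FLUSH_DELAYED, &preq->state);
        }
        return rw;
}

/* ploop_index_update()/map_wb_complete() side: pay the debt on the BAT write. */
static unsigned long bat_page_rw(struct ploop_request *preq)
{
        unsigned long rw = WRITE_FUA;

        if (test_and_clear_bit(PLOOP_IO_FLUSH_DELAYED, &preq->state) ||
            test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
            test_bit(PLOOP_REQ_RELOC_S, &preq->state))
                rw |= REQ_FLUSH;   /* one preflush orders all earlier data */

        return rw;
}

The WARN_ON from 3) above would then simply check that PLOOP_IO_FLUSH_DELAYED
is never still set when a preq completes.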

Max

