[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

Mon Jun 20 16:35:00 PDT 2016

Dima,

On 06/19/2016 06:06 AM, Dmitry Monakhov wrote:
> Maxim Patlasov <mpatlasov at virtuozzo.com> writes:
>
>> On 06/16/2016 09:30 AM, Dmitry Monakhov wrote:
>>> Dmitry Monakhov <dmonakhov at openvz.org> writes:
>>>
>>>> Maxim Patlasov <mpatlasov at virtuozzo.com> writes:
>>>>
>>>>> Dima,
>>>>>
>>>>> I agree that the ploop barrier code is broken in many ways, but I don't
>>>>> think the patch actually fixes it. I hope you would agree that
>>>>> completion of REQ_FUA guarantees only landing that particular bio to the
>>>>> disk; it says nothing about flushing previously submitted (and
>>>>> completed) bio-s and it is also possible that power outage may catch us
>>>>> when this REQ_FUA is already landed to the disk, but previous bio-s are
>>>>> not yet.
>>>> Actually it does (but implicitly) linux handles FUA as FLUSH,W,FLUSH.
>>>> So yes. it would be more correct to tag WBI with FLUSH_FUA
>>>>> Hence, for RELOC_{A|S} requests we actually need something like that:
>>>>>
>>>>>         RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>>>>>         RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>>>>>
>>>>> (i.e. we do need to flush all previously submitted data before starting
>>>>> to update BAT on disk)
>>>>>
>>>> Correct sequence:
>>>> RELOC_S: R1, W2, WBI:FLUSH_FUA
>>>> RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA
>>>>
>>>>> not simply:
>>>>>
>>>>>> RELOC_S: R1, W2, WBI:FUA
>>>>>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>>>> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and
>>>>> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we
>>>>> could remove them completely (along we that optimization delaying
>>>>> incoming FUA) and re-implement all this stuff from scratch:
>>>>>
>>>>> 1) The final "NULLIFY:FUA" is a peace of cake -- it's enough to set
>>>>> REQ_FUA in preq->req_rw before calling ->submit(preq)
>>>>>
>>>>> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating
>>>>> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for
>>>>> RELOC_A|S in ploop_index_update and map_wb_complete
>>>>>
>>>>> 3) For that optimization delaying incoming FUA (what we do now if
>>>>> ploop_req_delay_fua_possible() returns true) we could introduce new
>>>>> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update
>>>>> and map_wb_complete (the same thing as 2) above). And, yes, let's
>>>>> WARN_ON if we somehow missed its processing.
>>>> Yes. This was one of my ideas.
>>>> 1)FORCE_FLUSH, FORCE_FUA are redundant states which simply mirrors
>>>> RELOC_{A,S} semantics. Lets get rid of that crap and simply introduce
>>>> PLOOP_IO_FLUSH_DELAYED.
>>>> 2) fix ->write_page to handle flush as it does with fua.
>>>>> The only complication I foresee is about how to teach kaio to pre-flush
>>>>> in kaio_write_page -- it's doable, but involves kaio_resubmit that's
>>>>> already pretty convoluted.
>>>>>
>>>> Yes. kio_submit is correct, but kaio_write_page do not care about REQ_FLUSH.
>>> Crap. Currently kaio can handles fsync only via kaio_queue_fsync_req
>>> which is async and not suitable for page_write.
>> I think it's doable to process page_write via kaio_fsync_thread, but
>> it's tricky.
>>
>>> Max let's make an agreement about terminology.
>>> The reason I wrote this is because linux internally interpret FUA as
>>> preflush,write,postflush which is wrong from academic point of view but
>>> it is the world we live it linux.
>> Are you sure that this  (FUA == preflush,write,postflush) is universally
>> true (i.e. no exceptions)? What about bio-based block-device drivers?
>>
>>> This is the reason I read code
>>> diferently from the way it was designed.
>>> Let's state that ploop is an ideal world where:
>>> FLUSH ==> preflush
>>> FUA   ==> WRUTE,postflush
>> In ideal word FUA is not obliged to be handled by postflush: it's enough
>> to guarantee that *this* particular request went to platter, other
>> requests may remain not-flushed-yet.
>> Documentation/block/writeback_cache_control.txt is absolutely clear
>> about it:
>>
>>> The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted
>>> from the
>>> filesystem and will make sure that I/O completion for this request is only
>>> signaled after the data has been committed to non-volatile storage.
>>> ...
>>> If the FUA bit is not natively supported the block
>>> layer turns it into an empty REQ_FLUSH request after the actual write.
>>
>>> For what reasona we can perform reloc scheme as:
>>>
>>> RELOC_A: R1,W2:FUA,WBI:FUA,W1:NULLIFY|FUA
>>> RELOC_A: R1,W2:FUA,WBI:FUA
>>>
>>> This allows effectively handle FUA and convert it to DELAYED_FLUSH where
>>> possible.
>> Ploop has the concept of map_multi_updates. In short words, while you're
>> handling one update, many others may come to PLOOP_E_INDEX_DELAY state.
>> And, as soon as the first one is done, we modify many indices in one
>> loop (see map_wb_complete), then write that page to disk only once.
>> Having map_multi_update in mind, it may be suboptimal to make many
>> W2:FUA-s -- it may be better to do many ordinary W2-s instead, and only
>> one pre-FLUSH later -- when we're going to write BAT page on disk.
>>
>>> Also let's clarify may_fua_delay semantics to exact eng_state
>>>
>>> may_fua_delay {
>>>
>>>     int may_delay = 1;
>>>     /* effectively this is equivalent of
>>>        preq->eng_state != PLOOP_E_COMPLETE
>>>        but it is more readable, and more error prone in future
>>>     */
>>>     if (preq->eng_state != PLOOP_E_DATA_WBI)
>>>         may_delay = 0
>>>     if ((test_bit(PLOOP_REQ_RELOC_S, &preq->state)) ||
>>>        (test_bit(PLOOP_REQ_RELOC_A, &preq->state)))
>>>         may_delay = 0;
>>>     return may_delay;
>>> }
>> It's much easier if we consider reloc- and ordinary cases separately:
>>
>> 1) PLOOP_REQ_RELOC_A|S -- they always involve index-update. So we can
>> always pre-FLUSH handling BAT write_page for them.
>>
>> 2) All other requests. Seeing REQ_FUA in dio_submit we can delay FUA if
>> we're sure index-update happens later. Correct check for this:
>> "preq->eng_state == PLOOP_E_DATA_WBI"
>>
>> The check "preq->eng_state != PLOOP_E_COMPLETE" is incorrect even right
>> now, because map_wb_complete_post_process() calls nullifying submit with
>> preq->eng_state = PLOOP_E_RELOC_NULLIFY. So if we delay FUA if and only
>> if preq->eng_state == PLOOP_E_DATA_WBI, we must be on the safe side, imho.
> I've investigated FUA handling and it looks sub optimal and wrong to me.
> dio_submit builds list of bios from preq. This is true because cluster =>
> fs-block mapping is not always cluster alligned multible bios is not impossible.
>
> We  try to optimize flush/fua handling for given preq by
> tagging first with FLUSH, and last with FUA.
>
> b1:FLUSH,b2,b3,b4:FUA

Since 2015, it's b1:FLUSH,b2,b3,b4:FLUSH|FUA. But yes, this is still 
incorrect, as you explained below.

> Normally this will be handled by block layer as: b1:FLUSH,b2,b3,b4:FLUSH, FLUSH
> which is correct, but suboptimal because requires 3flushes instead of 2.

Please point me to the place in sources where incoming b4:FUA is 
translated to b4:FLUSH (the second FLUSH of those 3flushes).

> But iosched may postpone b2 and b3 before b4 . Which result in silent corruption.

Yes, you right. Good catch!

>
> My proposal:
> 0) Use only PREFLUSH, POSTFLUSH primitives which are basically good
> known barriers.
> 1) Throw all this fancy scheme with FUA.
>     And pass FLUSH as usual, but convert FUA to FLUSH1,write,FLUSH2

I wouldn't like to sacrifice the main (most probable) case of FLUSH/FUA: 
if jbd2 commit generated one-page bio marked as FLUSH|FUA, it's nice 
(for ploop) to translate it in a new bio also marked as FLUSH|FUA. No 
need to post-flush everything in this case.

> 2) (1) allow us transparently delay FLUSH2 via preq->state PLOOP_REQ_FSYNC_DELAYED
> 3) queue such preq to io->delayed_sync_queue and handle that list in
>     fsync_thread, but call ->issue_flush if io->fsync_queue was empty and
>     ->sync is not required.

Coincidence when fsync_queue and delayed_sync_queue are both not empty 
is rare case, imho. Also, as I wrote in previous email, it may be 
beneficial to postpone FLUSH2 until index-update (due to map_multi_updates).

> 4) Also handle FSYNC_DELAYED in ploop_index_update similar PLOOP_MAP_WRITEBACK

Not sure I understand. PLOOP_MAP_WRITEBACK requires special ad-hoc 
PLOOP_E_INDEX_DELAY eng_state, while FSYNC_DELAYED is only flag in 
preq->state. Btw, your patch-set v2 does not introduce 
delayed_sync_queue, do you plan it in another patch? If yes, please wait 
-- let's discuss it first -- it seems that we can go without it (w/o 
adding another io queue).

Thanks,
Maxim