[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

Sun Jun 19 06:06:22 PDT 2016

Maxim Patlasov <mpatlasov at virtuozzo.com> writes:

> On 06/16/2016 09:30 AM, Dmitry Monakhov wrote:
>> Dmitry Monakhov <dmonakhov at openvz.org> writes:
>>
>>> Maxim Patlasov <mpatlasov at virtuozzo.com> writes:
>>>
>>>> Dima,
>>>>
>>>> I agree that the ploop barrier code is broken in many ways, but I don't
>>>> think the patch actually fixes it. I hope you would agree that
>>>> completion of REQ_FUA guarantees only landing that particular bio to the
>>>> disk; it says nothing about flushing previously submitted (and
>>>> completed) bio-s and it is also possible that power outage may catch us
>>>> when this REQ_FUA is already landed to the disk, but previous bio-s are
>>>> not yet.
>>> Actually it does (but implicitly) linux handles FUA as FLUSH,W,FLUSH.
>>> So yes. it would be more correct to tag WBI with FLUSH_FUA
>>>> Hence, for RELOC_{A|S} requests we actually need something like that:
>>>>
>>>>        RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>>>>        RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>>>>
>>>> (i.e. we do need to flush all previously submitted data before starting
>>>> to update BAT on disk)
>>>>
>>> Correct sequence:
>>> RELOC_S: R1, W2, WBI:FLUSH_FUA
>>> RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA
>>>
>>>> not simply:
>>>>
>>>>> RELOC_S: R1, W2, WBI:FUA
>>>>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>>> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and
>>>> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we
>>>> could remove them completely (along we that optimization delaying
>>>> incoming FUA) and re-implement all this stuff from scratch:
>>>>
>>>> 1) The final "NULLIFY:FUA" is a peace of cake -- it's enough to set
>>>> REQ_FUA in preq->req_rw before calling ->submit(preq)
>>>>
>>>> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating
>>>> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for
>>>> RELOC_A|S in ploop_index_update and map_wb_complete
>>>>
>>>> 3) For that optimization delaying incoming FUA (what we do now if
>>>> ploop_req_delay_fua_possible() returns true) we could introduce new
>>>> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update
>>>> and map_wb_complete (the same thing as 2) above). And, yes, let's
>>>> WARN_ON if we somehow missed its processing.
>>> Yes. This was one of my ideas.
>>> 1)FORCE_FLUSH, FORCE_FUA are redundant states which simply mirrors
>>> RELOC_{A,S} semantics. Lets get rid of that crap and simply introduce
>>> PLOOP_IO_FLUSH_DELAYED.
>>> 2) fix ->write_page to handle flush as it does with fua.
>>>> The only complication I foresee is about how to teach kaio to pre-flush
>>>> in kaio_write_page -- it's doable, but involves kaio_resubmit that's
>>>> already pretty convoluted.
>>>>
>>> Yes. kio_submit is correct, but kaio_write_page do not care about REQ_FLUSH.
>> Crap. Currently kaio can handles fsync only via kaio_queue_fsync_req
>> which is async and not suitable for page_write.
>
> I think it's doable to process page_write via kaio_fsync_thread, but 
> it's tricky.
>
>> Max let's make an agreement about terminology.
>> The reason I wrote this is because linux internally interpret FUA as
>> preflush,write,postflush which is wrong from academic point of view but
>> it is the world we live it linux.
>
> Are you sure that this  (FUA == preflush,write,postflush) is universally 
> true (i.e. no exceptions)? What about bio-based block-device drivers?
>
>> This is the reason I read code
>> diferently from the way it was designed.
>> Let's state that ploop is an ideal world where:
>> FLUSH ==> preflush
>> FUA   ==> WRUTE,postflush
>
> In ideal word FUA is not obliged to be handled by postflush: it's enough 
> to guarantee that *this* particular request went to platter, other 
> requests may remain not-flushed-yet. 
> Documentation/block/writeback_cache_control.txt is absolutely clear 
> about it:
>
>> The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted 
>> from the
>> filesystem and will make sure that I/O completion for this request is only
>> signaled after the data has been committed to non-volatile storage.
>> ...
>> If the FUA bit is not natively supported the block
>> layer turns it into an empty REQ_FLUSH request after the actual write.
>
>
>> For what reasona we can perform reloc scheme as:
>>
>> RELOC_A: R1,W2:FUA,WBI:FUA,W1:NULLIFY|FUA
>> RELOC_A: R1,W2:FUA,WBI:FUA
>>
>> This allows effectively handle FUA and convert it to DELAYED_FLUSH where
>> possible.
>
> Ploop has the concept of map_multi_updates. In short words, while you're 
> handling one update, many others may come to PLOOP_E_INDEX_DELAY state. 
> And, as soon as the first one is done, we modify many indices in one 
> loop (see map_wb_complete), then write that page to disk only once. 
> Having map_multi_update in mind, it may be suboptimal to make many 
> W2:FUA-s -- it may be better to do many ordinary W2-s instead, and only 
> one pre-FLUSH later -- when we're going to write BAT page on disk.
>
>> Also let's clarify may_fua_delay semantics to exact eng_state
>>
>> may_fua_delay {
>>
>>    int may_delay = 1;
>>    /* effectively this is equivalent of
>>       preq->eng_state != PLOOP_E_COMPLETE
>>       but it is more readable, and more error prone in future
>>    */
>>    if (preq->eng_state != PLOOP_E_DATA_WBI)
>>        may_delay = 0
>>    if ((test_bit(PLOOP_REQ_RELOC_S, &preq->state)) ||
>>       (test_bit(PLOOP_REQ_RELOC_A, &preq->state)))
>>        may_delay = 0;
>>    return may_delay;
>> }
>
> It's much easier if we consider reloc- and ordinary cases separately:
>
> 1) PLOOP_REQ_RELOC_A|S -- they always involve index-update. So we can 
> always pre-FLUSH handling BAT write_page for them.
>
> 2) All other requests. Seeing REQ_FUA in dio_submit we can delay FUA if 
> we're sure index-update happens later. Correct check for this: 
> "preq->eng_state == PLOOP_E_DATA_WBI"
>
> The check "preq->eng_state != PLOOP_E_COMPLETE" is incorrect even right 
> now, because map_wb_complete_post_process() calls nullifying submit with 
> preq->eng_state = PLOOP_E_RELOC_NULLIFY. So if we delay FUA if and only 
> if preq->eng_state == PLOOP_E_DATA_WBI, we must be on the safe side, imho.
I've investigated FUA handling and it looks sub optimal and wrong to me.
dio_submit builds list of bios from preq. This is true because cluster =>
fs-block mapping is not always cluster alligned multible bios is not impossible.

We  try to optimize flush/fua handling for given preq by
tagging first with FLUSH, and last with FUA.

b1:FLUSH,b2,b3,b4:FUA
Normally this will be handled by block layer as: b1:FLUSH,b2,b3,b4:FLUSH, FLUSH
which is correct, but suboptimal because requires 3flushes instead of 2.
But iosched may postpone b2 and b3 before b4 . Which result in silent corruption.

My proposal:
0) Use only PREFLUSH, POSTFLUSH primitives which are basically good
known barriers.
1) Throw all this fancy scheme with FUA.
   And pass FLUSH as usual, but convert FUA to FLUSH1,write,FLUSH2
2) (1) allow us transparently delay FLUSH2 via preq->state PLOOP_REQ_FSYNC_DELAYED
3) queue such preq to io->delayed_sync_queue and handle that list in
   fsync_thread, but call ->issue_flush if io->fsync_queue was empty and
   ->sync is not required.
4) Also handle FSYNC_DELAYED in ploop_index_update similar PLOOP_MAP_WRITEBACK

>
> Max
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 472 bytes
Desc: not available
URL: <http://lists.openvz.org/pipermail/devel/attachments/20160619/9c9e6cc2/attachment.sig>