[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling
Dmitry Monakhov
dmonakhov at openvz.org
Thu Jun 16 09:30:26 PDT 2016
Dmitry Monakhov <dmonakhov at openvz.org> writes:
> Maxim Patlasov <mpatlasov at virtuozzo.com> writes:
>
>> Dima,
>>
>> I agree that the ploop barrier code is broken in many ways, but I don't
>> think the patch actually fixes it. I hope you would agree that
>> completion of REQ_FUA guarantees only landing that particular bio to the
>> disk; it says nothing about flushing previously submitted (and
>> completed) bio-s and it is also possible that power outage may catch us
>> when this REQ_FUA is already landed to the disk, but previous bio-s are
>> not yet.
> Actually it does, but implicitly: Linux handles FUA as FLUSH,W,FLUSH.
> So yes, it would be more correct to tag WBI with FLUSH_FUA.
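To make that concrete, here is a minimal userspace model of the
FLUSH,W,FLUSH decomposition (only the flag names mirror the kernel; the
rest is illustrative, not the actual blk-flush.c code):

#include <stdio.h>

#define REQ_FLUSH (1u << 0)
#define REQ_FUA   (1u << 1)

static void submit(const char *what)
{
	printf("submit: %s\n", what);
}

/* A FUA write on a device without native FUA becomes
 * preflush, write, postflush. */
static void emulate_fua_write(unsigned int rw, int hw_fua)
{
	if ((rw & REQ_FUA) && !hw_fua) {
		submit("FLUSH (pre)");   /* flush earlier completed writes */
		submit("WRITE");         /* the data itself */
		submit("FLUSH (post)");  /* make this write durable */
	} else {
		submit("WRITE|FUA");     /* hardware handles durability */
	}
}

int main(void)
{
	emulate_fua_write(REQ_FUA, 0);
	return 0;
}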
>> Hence, for RELOC_{A|S} requests we actually need something like that:
>>
>> RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>> RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>>
>> (i.e. we do need to flush all previously submitted data before starting
>> to update BAT on disk)
>>
> Correct sequence:
> RELOC_S: R1, W2, WBI:FLUSH_FUA
> RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA
>
>> not simply:
>>
>>> RELOC_S: R1, W2, WBI:FUA
>>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>
>> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and
>> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we
>> could remove them completely (along with that optimization delaying
>> incoming FUA) and re-implement all this stuff from scratch:
>>
>> 1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set
>> REQ_FUA in preq->req_rw before calling ->submit(preq)
>>
>> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating
>> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for
>> RELOC_A|S in ploop_index_update and map_wb_complete
>>
>> 3) For that optimization delaying incoming FUA (what we do now if
>> ploop_req_delay_fua_possible() returns true) we could introduce new
>> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update
>> and map_wb_complete (the same thing as 2) above). And, yes, let's
>> WARN_ON if we somehow missed its processing.
> Yes. This was one of my ideas.
> 1) FORCE_FLUSH and FORCE_FUA are redundant states which simply mirror
> RELOC_{A,S} semantics. Let's get rid of that crap and simply introduce
> PLOOP_IO_FLUSH_DELAYED.
> 2) Fix ->write_page to handle flush the same way it already handles fua.
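A sketch of what 2) could look like on the dio side (the flush/fua split
is the idea from this thread; the helper name submit_index_bio and the
exact signature are my assumptions, not the current ploop API):

/* Sketch only: let write_page() receive explicit flush/fua and build
 * the bio flags from them, the same way dio_submit() does for data. */
static void write_index_page(struct ploop_io *io, struct ploop_request *preq,
			     struct page *page, sector_t sec,
			     int flush, int fua)
{
	unsigned long rw = WRITE;

	if (flush)
		rw |= REQ_FLUSH;	/* flush previously completed data bios */
	if (fua)
		rw |= REQ_FUA;		/* make the index write itself durable */

	submit_index_bio(io, preq, page, sec, rw); /* hypothetical helper */
}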
>>
>> The only complication I foresee is about how to teach kaio to pre-flush
>> in kaio_write_page -- it's doable, but involves kaio_resubmit that's
>> already pretty convoluted.
>>
> Yes. kaio_submit is correct, but kaio_write_page does not care about REQ_FLUSH.
Crap. Currently kaio can handle fsync only via kaio_queue_fsync_req,
which is async and not suitable for write_page.
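One synchronous option for the write_page case -- vfs_fsync() is a real
kernel API, but whether we can afford calling it from this context is an
open question:

/* Sketch: pre-flush the backing file synchronously before the index
 * page write, instead of going through the async kaio_queue_fsync_req.
 * io->files.file follows the ploop convention; treat this as untested. */
static int kaio_preflush(struct ploop_io *io)
{
	return vfs_fsync(io->files.file, 1); /* datasync is enough for data */
}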
Max, let's agree on terminology.
The reason I wrote this is that Linux internally interprets FUA as
preflush,write,postflush, which is wrong from an academic point of view,
but it is the world we live in. This is the reason I read the code
differently from the way it was designed.
Let's state that ploop is an ideal world where:
FLUSH ==> preflush
FUA   ==> WRITE,postflush
For that reason we can perform the reloc scheme as:
RELOC_A: R1,W2:FUA,WBI:FUA,W1:NULLIFY|FUA
RELOC_S: R1,W2:FUA,WBI:FUA
This allows us to handle FUA effectively and convert it to DELAYED_FLUSH
where possible. Also let's pin may_fua_delay semantics to the exact
eng_state:
static int may_fua_delay(struct ploop_request *preq)
{
	int may_delay = 1;

	/* Effectively this is equivalent to
	 * preq->eng_state != PLOOP_E_COMPLETE,
	 * but it is more readable and less error prone in the future.
	 */
	if (preq->eng_state != PLOOP_E_DATA_WBI)
		may_delay = 0;

	if (test_bit(PLOOP_REQ_RELOC_S, &preq->state) ||
	    test_bit(PLOOP_REQ_RELOC_A, &preq->state))
		may_delay = 0;

	return may_delay;
}
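And an illustrative caller (not the exact dio_submit() code; where the
proposed PLOOP_IO_FLUSH_DELAYED bit lives is an assumption):

	if ((rw & REQ_FUA) && may_fua_delay(preq)) {
		rw &= ~REQ_FUA;	/* strip FUA from the data write... */
		/* ...and let ploop_index_update() add REQ_FLUSH later */
		set_bit(PLOOP_IO_FLUSH_DELAYED, &preq->state);
	}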
>> Btw, I accidentally noticed an awful silly bug in kaio_complete_io_state():
>> we check for REQ_FUA after clearing it! This makes all FUA-s on the
>> ordinary kaio_submit path silently lost...
>>
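Schematically (illustrative, not the literal kaio_complete_io_state() code):

	preq->req_rw &= ~(REQ_FLUSH | REQ_FUA);	/* flags cleared here...  */

	if (preq->req_rw & REQ_FUA)		/* ...always false now,   */
		post_fsync = 1;			/* so the FUA is lost     */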
>> Thanks,
>> Maxim
>>
>>
>> On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:
>>> barrier code is broken in many ways:
>>> Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
>>> But a request can also go through ->dio_submit_alloc()->dio_submit_pad and write_page (for indexes).
>>> So in the case of grow_dev we have the following sequence:
>>>
>>> E_RELOC_DATA_READ:
>>> ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>> ->delta->allocate
>>> ->io->submit_alloc: dio_submit_alloc
>>> ->dio_submit_pad
>>> E_DATA_WBI : data written, time to update index
>>> ->delta->allocate_complete:ploop_index_update
>>> ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>> ->write_page
>>> ->ploop_map_wb_complete
>>> ->ploop_wb_complete_post_process
>>> ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>> E_RELOC_NULLIFY:
>>>
>>> ->submit()
>>>
>>> This patch unifies barrier handling as follows:
>>> - Add an assertion to ploop_complete_request for the FORCE_{FLUSH,FUA} state.
>>> - Perform explicit FUA inside index_update for RELOC requests.
>>>
>>> This makes the reloc sequence optimal:
>>> RELOC_S: R1, W2, WBI:FUA
>>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>>
>>> https://jira.sw.ru/browse/PSBM-47107
>>> Signed-off-by: Dmitry Monakhov <dmonakhov at openvz.org>
>>> ---
>>> drivers/block/ploop/dev.c | 10 +++++++---
>>> drivers/block/ploop/map.c | 29 ++++++++++++-----------------
>>> 2 files changed, 19 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
>>> index 96f7850..998fe71 100644
>>> --- a/drivers/block/ploop/dev.c
>>> +++ b/drivers/block/ploop/dev.c
>>> @@ -1224,6 +1224,11 @@ static void ploop_complete_request(struct ploop_request * preq)
>>>
>>> __TRACE("Z %p %u\n", preq, preq->req_cluster);
>>>
>>> + if (!preq->error) {
>>> + unsigned long state = READ_ONCE(preq->state);
>>> + WARN_ON(state & (1 << PLOOP_REQ_FORCE_FUA));
>>> + WARN_ON(state & (1 << PLOOP_REQ_FORCE_FLUSH));
>>> + }
>>> while (preq->bl.head) {
>>> struct bio * bio = preq->bl.head;
>>> preq->bl.head = bio->bi_next;
>>> @@ -2530,9 +2535,8 @@ restart:
>>> top_delta = ploop_top_delta(plo);
>>> sbl.head = sbl.tail = preq->aux_bio;
>>>
>>> - /* Relocated data write required sync before BAT updatee */
>>> - set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>> -
>>> + /* Relocated data write requires sync before BAT update;
>>> + * this will happen inside index_update */
>>> if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>>> preq->eng_state = PLOOP_E_DATA_WBI;
>>> plo->st.bio_out++;
>>> diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
>>> index 3a6365d..c17e598 100644
>>> --- a/drivers/block/ploop/map.c
>>> +++ b/drivers/block/ploop/map.c
>>> @@ -896,6 +896,7 @@ void ploop_index_update(struct ploop_request * preq)
>>> struct ploop_device * plo = preq->plo;
>>> struct map_node * m = preq->map;
>>> struct ploop_delta * top_delta = map_top_delta(m->parent);
>>> + int fua = !!(preq->req_rw & REQ_FUA);
>>> u32 idx;
>>> map_index_t blk;
>>> int old_level;
>>> @@ -953,13 +954,13 @@ void ploop_index_update(struct ploop_request * preq)
>>> __TRACE("wbi %p %u %p\n", preq, preq->req_cluster, m);
>>> plo->st.map_single_writes++;
>>> top_delta->ops->map_index(top_delta, m->mn_start, &sec);
>>> - /* Relocate requires consistent writes, mark such reqs appropriately */
>>> + /* Relocate requires consistent index update */
>>> if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
>>> test_bit(PLOOP_REQ_RELOC_S, &preq->state))
>>> - set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>> -
>>> - top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
>>> - !!(preq->req_rw & REQ_FUA));
>>> + fua = 1;
>>> + if (fua)
>>> + clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
>>> + top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
>>> put_page(page);
>>> return;
>>>
>>> @@ -1078,7 +1079,7 @@ static void map_wb_complete(struct map_node * m, int err)
>>> int delayed = 0;
>>> unsigned int idx;
>>> sector_t sec;
>>> - int fua, force_fua;
>>> + int fua;
>>>
>>> /* First, complete processing of written back indices,
>>> * finally instantiate indices in mapping cache.
>>> @@ -1149,7 +1150,6 @@ static void map_wb_complete(struct map_node * m, int err)
>>>
>>> main_preq = NULL;
>>> fua = 0;
>>> - force_fua = 0;
>>>
>>> list_for_each_safe(cursor, tmp, &m->io_queue) {
>>> struct ploop_request * preq;
>>> @@ -1168,13 +1168,12 @@ static void map_wb_complete(struct map_node * m, int err)
>>> break;
>>> }
>>>
>>> - if (preq->req_rw & REQ_FUA)
>>> + if (preq->req_rw & REQ_FUA ||
>>> + test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
>>> + test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>>> + clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
>>> fua = 1;
>>> -
>>> - if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
>>> - test_bit(PLOOP_REQ_RELOC_S, &preq->state))
>>> - force_fua = 1;
>>> -
>>> + }
>>> preq->eng_state = PLOOP_E_INDEX_WB;
>>> get_page(page);
>>> preq->sinfo.wi.tpage = page;
>>> @@ -1199,10 +1198,6 @@ static void map_wb_complete(struct map_node * m, int err)
>>> __TRACE("wbi2 %p %u %p\n", main_preq, main_preq->req_cluster, m);
>>> plo->st.map_multi_writes++;
>>> top_delta->ops->map_index(top_delta, m->mn_start, &sec);
>>> -
>>> - if (force_fua)
>>> - set_bit(PLOOP_REQ_FORCE_FUA, &main_preq->state);
>>> -
>>> top_delta->io.ops->write_page(&top_delta->io, main_preq, page, sec, fua);
>>> put_page(page);
>>> }