[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling
Dmitry Monakhov
dmonakhov at openvz.org
Thu Jun 16 00:26:42 PDT 2016
Maxim Patlasov <mpatlasov at virtuozzo.com> writes:
> Dima,
>
> I agree that the ploop barrier code is broken in many ways, but I don't
> think the patch actually fixes it. I hope you would agree that
> completion of REQ_FUA guarantees only landing that particular bio to the
> disk; it says nothing about flushing previously submitted (and
> completed) bio-s and it is also possible that power outage may catch us
> when this REQ_FUA is already landed to the disk, but previous bio-s are
> not yet.
Actually it does (but implicitly) linux handles FUA as FLUSH,W,FLUSH.
So yes. it would be more correct to tag WBI with FLUSH_FUA
> Hence, for RELOC_{A|S} requests we actually need something like that:
>
> RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
> RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>
> (i.e. we do need to flush all previously submitted data before starting
> to update BAT on disk)
>
Correct sequence:
RELOC_S: R1, W2, WBI:FLUSH_FUA
RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA
> not simply:
>
>> RELOC_S: R1, W2, WBI:FUA
>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>
> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and
> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we
> could remove them completely (along we that optimization delaying
> incoming FUA) and re-implement all this stuff from scratch:
>
> 1) The final "NULLIFY:FUA" is a peace of cake -- it's enough to set
> REQ_FUA in preq->req_rw before calling ->submit(preq)
>
> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating
> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for
> RELOC_A|S in ploop_index_update and map_wb_complete
>
> 3) For that optimization delaying incoming FUA (what we do now if
> ploop_req_delay_fua_possible() returns true) we could introduce new
> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update
> and map_wb_complete (the same thing as 2) above). And, yes, let's
> WARN_ON if we somehow missed its processing.
Yes. This was one of my ideas.
1)FORCE_FLUSH, FORCE_FUA are redundant states which simply mirrors
RELOC_{A,S} semantics. Lets get rid of that crap and simply introduce
PLOOP_IO_FLUSH_DELAYED.
2) fix ->write_page to handle flush as it does with fua.
>
> The only complication I foresee is about how to teach kaio to pre-flush
> in kaio_write_page -- it's doable, but involves kaio_resubmit that's
> already pretty convoluted.
>
Yes. kio_submit is correct, but kaio_write_page do not care about REQ_FLUSH.
> Btw, I accidentally noticed awful silly bug in kaio_complete_io_state():
> we checks for REQ_FUA after clearing it! This makes all FUA-s on
> ordinary kaio_submit path silently lost...
>
> Thanks,
> Maxim
>
>
> On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:
>> barrier code is broken in many ways:
>> Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
>> But request also can goes though ->dio_submit_alloc()->dio_submit_pad and write_page (for indexes)
>> So in case of grow_dev we have following sequance:
>>
>> E_RELOC_DATA_READ:
>> ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> ->delta->allocate
>> ->io->submit_allloc: dio_submit_alloc
>> ->dio_submit_pad
>> E_DATA_WBI : data written, time to update index
>> ->delta->allocate_complete:ploop_index_update
>> ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> ->write_page
>> ->ploop_map_wb_complete
>> ->ploop_wb_complete_post_process
>> ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> E_RELOC_NULLIFY:
>>
>> ->submit()
>>
>> This patch unify barrier handling like follows:
>> - Add assertation to ploop_complete_request for FORCE_{FLUSH,FUA} state
>> - Perform explicit FUA inside index_update for RELOC requests.
>>
>> This makes reloc sequence optimal:
>> RELOC_S: R1, W2, WBI:FUA
>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>
>> https://jira.sw.ru/browse/PSBM-47107
>> Signed-off-by: Dmitry Monakhov <dmonakhov at openvz.org>
>> ---
>> drivers/block/ploop/dev.c | 10 +++++++---
>> drivers/block/ploop/map.c | 29 ++++++++++++-----------------
>> 2 files changed, 19 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
>> index 96f7850..998fe71 100644
>> --- a/drivers/block/ploop/dev.c
>> +++ b/drivers/block/ploop/dev.c
>> @@ -1224,6 +1224,11 @@ static void ploop_complete_request(struct ploop_request * preq)
>>
>> __TRACE("Z %p %u\n", preq, preq->req_cluster);
>>
>> + if (!preq->error) {
>> + unsigned long state = READ_ONCE(preq->state);
>> + WARN_ON(state & (1 << PLOOP_REQ_FORCE_FUA));
>> + WARN_ON(state & (1 <<PLOOP_REQ_FORCE_FLUSH));
>> + }
>> while (preq->bl.head) {
>> struct bio * bio = preq->bl.head;
>> preq->bl.head = bio->bi_next;
>> @@ -2530,9 +2535,8 @@ restart:
>> top_delta = ploop_top_delta(plo);
>> sbl.head = sbl.tail = preq->aux_bio;
>>
>> - /* Relocated data write required sync before BAT updatee */
>> - set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> -
>> + /* Relocated data write required sync before BAT updatee
>> + * this will happen inside index_update */
>> if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>> preq->eng_state = PLOOP_E_DATA_WBI;
>> plo->st.bio_out++;
>> diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
>> index 3a6365d..c17e598 100644
>> --- a/drivers/block/ploop/map.c
>> +++ b/drivers/block/ploop/map.c
>> @@ -896,6 +896,7 @@ void ploop_index_update(struct ploop_request * preq)
>> struct ploop_device * plo = preq->plo;
>> struct map_node * m = preq->map;
>> struct ploop_delta * top_delta = map_top_delta(m->parent);
>> + int fua = !!(preq->req_rw & REQ_FUA);
>> u32 idx;
>> map_index_t blk;
>> int old_level;
>> @@ -953,13 +954,13 @@ void ploop_index_update(struct ploop_request * preq)
>> __TRACE("wbi %p %u %p\n", preq, preq->req_cluster, m);
>> plo->st.map_single_writes++;
>> top_delta->ops->map_index(top_delta, m->mn_start, &sec);
>> - /* Relocate requires consistent writes, mark such reqs appropriately */
>> + /* Relocate requires consistent index update */
>> if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
>> test_bit(PLOOP_REQ_RELOC_S, &preq->state))
>> - set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> -
>> - top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
>> - !!(preq->req_rw & REQ_FUA));
>> + fua = 1;
>> + if (fua)
>> + clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
>> + top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
>> put_page(page);
>> return;
>>
>> @@ -1078,7 +1079,7 @@ static void map_wb_complete(struct map_node * m, int err)
>> int delayed = 0;
>> unsigned int idx;
>> sector_t sec;
>> - int fua, force_fua;
>> + int fua;
>>
>> /* First, complete processing of written back indices,
>> * finally instantiate indices in mapping cache.
>> @@ -1149,7 +1150,6 @@ static void map_wb_complete(struct map_node * m, int err)
>>
>> main_preq = NULL;
>> fua = 0;
>> - force_fua = 0;
>>
>> list_for_each_safe(cursor, tmp, &m->io_queue) {
>> struct ploop_request * preq;
>> @@ -1168,13 +1168,12 @@ static void map_wb_complete(struct map_node * m, int err)
>> break;
>> }
>>
>> - if (preq->req_rw & REQ_FUA)
>> + if (preq->req_rw & REQ_FUA ||
>> + test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
>> + test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>> + clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
>> fua = 1;
>> -
>> - if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
>> - test_bit(PLOOP_REQ_RELOC_S, &preq->state))
>> - force_fua = 1;
>> -
>> + }
>> preq->eng_state = PLOOP_E_INDEX_WB;
>> get_page(page);
>> preq->sinfo.wi.tpage = page;
>> @@ -1199,10 +1198,6 @@ static void map_wb_complete(struct map_node * m, int err)
>> __TRACE("wbi2 %p %u %p\n", main_preq, main_preq->req_cluster, m);
>> plo->st.map_multi_writes++;
>> top_delta->ops->map_index(top_delta, m->mn_start, &sec);
>> -
>> - if (force_fua)
>> - set_bit(PLOOP_REQ_FORCE_FUA, &main_preq->state);
>> -
>> top_delta->io.ops->write_page(&top_delta->io, main_preq, page, sec, fua);
>> put_page(page);
>> }
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 472 bytes
Desc: not available
URL: <http://lists.openvz.org/pipermail/devel/attachments/20160616/e62bd11e/attachment-0001.sig>
More information about the Devel
mailing list