[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

Dmitry Monakhov dmonakhov at openvz.org
Thu Jun 16 00:26:42 PDT 2016


Maxim Patlasov <mpatlasov at virtuozzo.com> writes:

> Dima,
>
> I agree that the ploop barrier code is broken in many ways, but I don't 
> think the patch actually fixes it. I hope you would agree that 
> completion of REQ_FUA guarantees only landing that particular bio to the 
> disk; it says nothing about flushing previously submitted (and 
> completed) bio-s and it is also possible that power outage may catch us 
> when this REQ_FUA is already landed to the disk, but previous bio-s are 
> not yet.
Actually it does (but implicitly) linux handles FUA as FLUSH,W,FLUSH.
So yes. it would be more correct to tag WBI with FLUSH_FUA
> Hence, for RELOC_{A|S} requests we actually need something like that:
>
>       RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>       RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>
> (i.e. we do need to flush all previously submitted data before starting 
> to update BAT on disk)
>
Correct sequence:
RELOC_S: R1, W2, WBI:FLUSH_FUA
RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA

> not simply:
>
>> RELOC_S: R1, W2, WBI:FUA
>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>
> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and 
> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we 
> could remove them completely (along we that optimization delaying 
> incoming FUA) and re-implement all this stuff from scratch:
>
> 1) The final "NULLIFY:FUA" is a peace of cake -- it's enough to set 
> REQ_FUA in preq->req_rw before calling ->submit(preq)
>
> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating 
> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for 
> RELOC_A|S in ploop_index_update and map_wb_complete
>
> 3) For that optimization delaying incoming FUA (what we do now if 
> ploop_req_delay_fua_possible() returns true) we could introduce new 
> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update 
> and map_wb_complete (the same thing as 2) above). And, yes, let's 
> WARN_ON if we somehow missed its processing.
Yes. This was one of my ideas.
1)FORCE_FLUSH, FORCE_FUA are redundant states which simply mirrors
RELOC_{A,S} semantics. Lets get rid of that crap and simply introduce
PLOOP_IO_FLUSH_DELAYED.
2) fix ->write_page to handle flush as it does with fua.
>
> The only complication I foresee is about how to teach kaio to pre-flush 
> in kaio_write_page -- it's doable, but involves kaio_resubmit that's 
> already pretty convoluted.
>
Yes. kio_submit is correct, but kaio_write_page do not care about REQ_FLUSH.
> Btw, I accidentally noticed awful silly bug in kaio_complete_io_state(): 
> we checks for REQ_FUA after clearing it! This makes all FUA-s on 
> ordinary kaio_submit path silently lost...
>
> Thanks,
> Maxim
>
>
> On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:
>> barrier code is broken in many ways:
>> Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
>> But request also can goes though ->dio_submit_alloc()->dio_submit_pad and write_page (for indexes)
>> So in case of grow_dev we have following sequance:
>>
>> E_RELOC_DATA_READ:
>>               ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>                ->delta->allocate
>>                   ->io->submit_allloc: dio_submit_alloc
>>                     ->dio_submit_pad
>> E_DATA_WBI : data written, time to update index
>>                ->delta->allocate_complete:ploop_index_update
>>                  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>                  ->write_page
>>                  ->ploop_map_wb_complete
>>                    ->ploop_wb_complete_post_process
>>                      ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> E_RELOC_NULLIFY:
>>
>>                 ->submit()
>>
>> This patch unify barrier handling like follows:
>> - Add assertation to ploop_complete_request for FORCE_{FLUSH,FUA} state
>> - Perform explicit FUA inside index_update for RELOC requests.
>>
>> This makes reloc sequence optimal:
>> RELOC_S: R1, W2, WBI:FUA
>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>
>> https://jira.sw.ru/browse/PSBM-47107
>> Signed-off-by: Dmitry Monakhov <dmonakhov at openvz.org>
>> ---
>>   drivers/block/ploop/dev.c | 10 +++++++---
>>   drivers/block/ploop/map.c | 29 ++++++++++++-----------------
>>   2 files changed, 19 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
>> index 96f7850..998fe71 100644
>> --- a/drivers/block/ploop/dev.c
>> +++ b/drivers/block/ploop/dev.c
>> @@ -1224,6 +1224,11 @@ static void ploop_complete_request(struct ploop_request * preq)
>>   
>>   	__TRACE("Z %p %u\n", preq, preq->req_cluster);
>>   
>> +	if (!preq->error) {
>> +		unsigned long state =  READ_ONCE(preq->state);
>> +		WARN_ON(state & (1 << PLOOP_REQ_FORCE_FUA));
>> +		WARN_ON(state & (1 <<PLOOP_REQ_FORCE_FLUSH));
>> +	}
>>   	while (preq->bl.head) {
>>   		struct bio * bio = preq->bl.head;
>>   		preq->bl.head = bio->bi_next;
>> @@ -2530,9 +2535,8 @@ restart:
>>   		top_delta = ploop_top_delta(plo);
>>   		sbl.head = sbl.tail = preq->aux_bio;
>>   
>> -		/* Relocated data write required sync before BAT updatee */
>> -		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> -
>> +		/* Relocated data write required sync before BAT updatee
>> +		 * this will happen inside index_update */
>>   		if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>>   			preq->eng_state = PLOOP_E_DATA_WBI;
>>   			plo->st.bio_out++;
>> diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
>> index 3a6365d..c17e598 100644
>> --- a/drivers/block/ploop/map.c
>> +++ b/drivers/block/ploop/map.c
>> @@ -896,6 +896,7 @@ void ploop_index_update(struct ploop_request * preq)
>>   	struct ploop_device * plo = preq->plo;
>>   	struct map_node * m = preq->map;
>>   	struct ploop_delta * top_delta = map_top_delta(m->parent);
>> +	int fua = !!(preq->req_rw & REQ_FUA);
>>   	u32 idx;
>>   	map_index_t blk;
>>   	int old_level;
>> @@ -953,13 +954,13 @@ void ploop_index_update(struct ploop_request * preq)
>>   	__TRACE("wbi %p %u %p\n", preq, preq->req_cluster, m);
>>   	plo->st.map_single_writes++;
>>   	top_delta->ops->map_index(top_delta, m->mn_start, &sec);
>> -	/* Relocate requires consistent writes, mark such reqs appropriately */
>> +	/* Relocate requires consistent index update */
>>   	if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
>>   	    test_bit(PLOOP_REQ_RELOC_S, &preq->state))
>> -		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> -
>> -	top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
>> -				      !!(preq->req_rw & REQ_FUA));
>> +		fua = 1;
>> +	if (fua)
>> +		clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
>> +	top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
>>   	put_page(page);
>>   	return;
>>   
>> @@ -1078,7 +1079,7 @@ static void map_wb_complete(struct map_node * m, int err)
>>   	int delayed = 0;
>>   	unsigned int idx;
>>   	sector_t sec;
>> -	int fua, force_fua;
>> +	int fua;
>>   
>>   	/* First, complete processing of written back indices,
>>   	 * finally instantiate indices in mapping cache.
>> @@ -1149,7 +1150,6 @@ static void map_wb_complete(struct map_node * m, int err)
>>   
>>   	main_preq = NULL;
>>   	fua = 0;
>> -	force_fua = 0;
>>   
>>   	list_for_each_safe(cursor, tmp, &m->io_queue) {
>>   		struct ploop_request * preq;
>> @@ -1168,13 +1168,12 @@ static void map_wb_complete(struct map_node * m, int err)
>>   				break;
>>   			}
>>   
>> -			if (preq->req_rw & REQ_FUA)
>> +			if (preq->req_rw & REQ_FUA ||
>> +			    test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
>> +			    test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>> +				clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
>>   				fua = 1;
>> -
>> -			if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
>> -			    test_bit(PLOOP_REQ_RELOC_S, &preq->state))
>> -				force_fua = 1;
>> -
>> +			}
>>   			preq->eng_state = PLOOP_E_INDEX_WB;
>>   			get_page(page);
>>   			preq->sinfo.wi.tpage = page;
>> @@ -1199,10 +1198,6 @@ static void map_wb_complete(struct map_node * m, int err)
>>   	__TRACE("wbi2 %p %u %p\n", main_preq, main_preq->req_cluster, m);
>>   	plo->st.map_multi_writes++;
>>   	top_delta->ops->map_index(top_delta, m->mn_start, &sec);
>> -
>> -	if (force_fua)
>> -		set_bit(PLOOP_REQ_FORCE_FUA, &main_preq->state);
>> -
>>   	top_delta->io.ops->write_page(&top_delta->io, main_preq, page, sec, fua);
>>   	put_page(page);
>>   }
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 472 bytes
Desc: not available
URL: <http://lists.openvz.org/pipermail/devel/attachments/20160616/e62bd11e/attachment-0001.sig>


More information about the Devel mailing list