[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

Maxim Patlasov mpatlasov at virtuozzo.com
Wed Jun 15 18:47:39 PDT 2016


Dima,

I agree that the ploop barrier code is broken in many ways, but I don't 
think the patch actually fixes it. I hope you would agree that 
completion of REQ_FUA guarantees only landing that particular bio to the 
disk; it says nothing about flushing previously submitted (and 
completed) bio-s and it is also possible that power outage may catch us 
when this REQ_FUA is already landed to the disk, but previous bio-s are 
not yet.

Hence, for RELOC_{A|S} requests we actually need something like that:

      RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
      RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA

(i.e. we do need to flush all previously submitted data before starting 
to update BAT on disk)

not simply:

> RELOC_S: R1, W2, WBI:FUA
> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA

Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and 
PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we 
could remove them completely (along we that optimization delaying 
incoming FUA) and re-implement all this stuff from scratch:

1) The final "NULLIFY:FUA" is a peace of cake -- it's enough to set 
REQ_FUA in preq->req_rw before calling ->submit(preq)

2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating 
BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for 
RELOC_A|S in ploop_index_update and map_wb_complete

3) For that optimization delaying incoming FUA (what we do now if 
ploop_req_delay_fua_possible() returns true) we could introduce new 
ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update 
and map_wb_complete (the same thing as 2) above). And, yes, let's 
WARN_ON if we somehow missed its processing.

The only complication I foresee is about how to teach kaio to pre-flush 
in kaio_write_page -- it's doable, but involves kaio_resubmit that's 
already pretty convoluted.

Btw, I accidentally noticed awful silly bug in kaio_complete_io_state(): 
we checks for REQ_FUA after clearing it! This makes all FUA-s on 
ordinary kaio_submit path silently lost...

Thanks,
Maxim


On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:
> barrier code is broken in many ways:
> Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
> But request also can goes though ->dio_submit_alloc()->dio_submit_pad and write_page (for indexes)
> So in case of grow_dev we have following sequance:
>
> E_RELOC_DATA_READ:
>               ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>                ->delta->allocate
>                   ->io->submit_allloc: dio_submit_alloc
>                     ->dio_submit_pad
> E_DATA_WBI : data written, time to update index
>                ->delta->allocate_complete:ploop_index_update
>                  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>                  ->write_page
>                  ->ploop_map_wb_complete
>                    ->ploop_wb_complete_post_process
>                      ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
> E_RELOC_NULLIFY:
>
>                 ->submit()
>
> This patch unify barrier handling like follows:
> - Add assertation to ploop_complete_request for FORCE_{FLUSH,FUA} state
> - Perform explicit FUA inside index_update for RELOC requests.
>
> This makes reloc sequence optimal:
> RELOC_S: R1, W2, WBI:FUA
> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>
> https://jira.sw.ru/browse/PSBM-47107
> Signed-off-by: Dmitry Monakhov <dmonakhov at openvz.org>
> ---
>   drivers/block/ploop/dev.c | 10 +++++++---
>   drivers/block/ploop/map.c | 29 ++++++++++++-----------------
>   2 files changed, 19 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 96f7850..998fe71 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -1224,6 +1224,11 @@ static void ploop_complete_request(struct ploop_request * preq)
>   
>   	__TRACE("Z %p %u\n", preq, preq->req_cluster);
>   
> +	if (!preq->error) {
> +		unsigned long state =  READ_ONCE(preq->state);
> +		WARN_ON(state & (1 << PLOOP_REQ_FORCE_FUA));
> +		WARN_ON(state & (1 <<PLOOP_REQ_FORCE_FLUSH));
> +	}
>   	while (preq->bl.head) {
>   		struct bio * bio = preq->bl.head;
>   		preq->bl.head = bio->bi_next;
> @@ -2530,9 +2535,8 @@ restart:
>   		top_delta = ploop_top_delta(plo);
>   		sbl.head = sbl.tail = preq->aux_bio;
>   
> -		/* Relocated data write required sync before BAT updatee */
> -		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
> -
> +		/* Relocated data write required sync before BAT updatee
> +		 * this will happen inside index_update */
>   		if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>   			preq->eng_state = PLOOP_E_DATA_WBI;
>   			plo->st.bio_out++;
> diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
> index 3a6365d..c17e598 100644
> --- a/drivers/block/ploop/map.c
> +++ b/drivers/block/ploop/map.c
> @@ -896,6 +896,7 @@ void ploop_index_update(struct ploop_request * preq)
>   	struct ploop_device * plo = preq->plo;
>   	struct map_node * m = preq->map;
>   	struct ploop_delta * top_delta = map_top_delta(m->parent);
> +	int fua = !!(preq->req_rw & REQ_FUA);
>   	u32 idx;
>   	map_index_t blk;
>   	int old_level;
> @@ -953,13 +954,13 @@ void ploop_index_update(struct ploop_request * preq)
>   	__TRACE("wbi %p %u %p\n", preq, preq->req_cluster, m);
>   	plo->st.map_single_writes++;
>   	top_delta->ops->map_index(top_delta, m->mn_start, &sec);
> -	/* Relocate requires consistent writes, mark such reqs appropriately */
> +	/* Relocate requires consistent index update */
>   	if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
>   	    test_bit(PLOOP_REQ_RELOC_S, &preq->state))
> -		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
> -
> -	top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
> -				      !!(preq->req_rw & REQ_FUA));
> +		fua = 1;
> +	if (fua)
> +		clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
> +	top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
>   	put_page(page);
>   	return;
>   
> @@ -1078,7 +1079,7 @@ static void map_wb_complete(struct map_node * m, int err)
>   	int delayed = 0;
>   	unsigned int idx;
>   	sector_t sec;
> -	int fua, force_fua;
> +	int fua;
>   
>   	/* First, complete processing of written back indices,
>   	 * finally instantiate indices in mapping cache.
> @@ -1149,7 +1150,6 @@ static void map_wb_complete(struct map_node * m, int err)
>   
>   	main_preq = NULL;
>   	fua = 0;
> -	force_fua = 0;
>   
>   	list_for_each_safe(cursor, tmp, &m->io_queue) {
>   		struct ploop_request * preq;
> @@ -1168,13 +1168,12 @@ static void map_wb_complete(struct map_node * m, int err)
>   				break;
>   			}
>   
> -			if (preq->req_rw & REQ_FUA)
> +			if (preq->req_rw & REQ_FUA ||
> +			    test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
> +			    test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
> +				clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
>   				fua = 1;
> -
> -			if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
> -			    test_bit(PLOOP_REQ_RELOC_S, &preq->state))
> -				force_fua = 1;
> -
> +			}
>   			preq->eng_state = PLOOP_E_INDEX_WB;
>   			get_page(page);
>   			preq->sinfo.wi.tpage = page;
> @@ -1199,10 +1198,6 @@ static void map_wb_complete(struct map_node * m, int err)
>   	__TRACE("wbi2 %p %u %p\n", main_preq, main_preq->req_cluster, m);
>   	plo->st.map_multi_writes++;
>   	top_delta->ops->map_index(top_delta, m->mn_start, &sec);
> -
> -	if (force_fua)
> -		set_bit(PLOOP_REQ_FORCE_FUA, &main_preq->state);
> -
>   	top_delta->io.ops->write_page(&top_delta->io, main_preq, page, sec, fua);
>   	put_page(page);
>   }



More information about the Devel mailing list