[Devel] [PATCH RHEL9 COMMIT] dm-ploop: introduce per-md page locking

Konstantin Khorenko khorenko at virtuozzo.com
Mon Jan 27 16:12:38 MSK 2025


The commit is pushed to "branch-rh9-5.14.0-427.44.1.vz9.80.x-ovz" and will appear at git at bitbucket.org:openvz/vzkernel.git
after rh9-5.14.0-427.44.1.vz9.80.6
------>
commit 42a5438a26c0d8ae2e05d61871cd3557973ce436
Author: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
Date:   Fri Jan 24 17:35:50 2025 +0200

    dm-ploop: introduce per-md page locking
    
    Currently we have a single bat_rwlock for the whole ploop. However,
    runtime locking granularity can be reduced to a single metadata page.
    Add an rwlock to the metadata structure and take it when accessing
    md->levels and md->page at the same time, to protect readers against
    writers. Since I (aa) faced reader/writer starvation leading to
    lockups, even though it should not happen, change rwlock_t to
    spinlock_t to avoid this. Reader and writer sides are marked with
    comments so we can investigate further why rwlock_t didn't work as
    expected.
    
    https://virtuozzo.atlassian.net/browse/VSTOR-91817
    Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
    
    Signed-off-by: Alexander Atanasov <alexander.atanasov at virtuozzo.com>
    
    ======
    Patchset description:
    ploop: optimizations and scaling
    
    Ploop processes requests in different threads in parallel where
    possible, which results in a significant performance improvement and
    makes further optimizations possible.
    
    Known bugs:
      - delayed metadata writeback is not working and is missing error handling
         - patch to disable it until fixed
      - fast path is not working - causes rcu lockups - patch to disable it
    
    Further improvements:
      - optimize md pages lookups
    
    Alexander Atanasov (50):
      dm-ploop: md_pages map all pages at creation time
      dm-ploop: Use READ_ONCE/WRITE_ONCE to access md page data
      dm-ploop: fsync after all pios are sent
      dm-ploop: move md status to use proper bitops
      dm-ploop: convert wait_list and wb_batch_llist to use lockless lists
      dm-ploop: convert enospc handling to use lockless lists
      dm-ploop: convert suspended_pios list to use lockless list
      dm-ploop: convert the rest of the lists to use llist variant
      dm-ploop: combine processing of pios thru prepare list and remove
        fsync worker
      dm-ploop: move from wq to kthread
      dm-ploop: move preparations of pios into the caller from worker
      dm-ploop: fast path execution for reads
      dm-ploop: do not use a wrapper for set_bit to make a page writeback
      dm-ploop: BAT use only one list for writeback
      dm-ploop: make md writeback timeout to be per page
      dm-ploop: add interface to disable bat writeback delay
      dm-ploop: convert wb_batch_list to lockless variant
      dm-ploop: convert high_prio to status
      dm-ploop: split cow processing into two functions
      dm-ploop: convert md page rw lock to spin lock
      dm-ploop: convert bat_rwlock to bat_lock spinlock
      dm-ploop: prepare bat updates under bat_lock
      dm-ploop: make ploop_bat_write_complete ready for parallel pio
        completion
      dm-ploop: make ploop_submit_metadata_writeback return number of
        requests sent
      dm-ploop: introduce pio runner threads
      dm-ploop: add pio list ids to be used when passing pios to runners
      dm-ploop: process pios via runners
      dm-ploop: disable metadata writeback delay
      dm-ploop: disable fast path
      dm-ploop: use lockless lists for chained cow updates list
      dm-ploop: use lockless lists for data ready pios
      dm-ploop: give runner threads better name
      dm-ploop: resize operation - add holes bitmap locking
      dm-ploop: remove unnecessary operations
      dm-ploop: use filp per thread
      dm-ploop: catch if we try to advance pio past bio end
      dm-ploop: support REQ_FUA for data pios
      dm-ploop: properly access nr_bat_entries
      dm-ploop: fix locking and improve error handling when submitting pios
      dm-ploop: fix how ENOTBLK is handled
      dm-ploop: sync when suspended or stopping
      dm-ploop: rework bat completion logic
      dm-ploop: rework logic in pio processing
      dm-ploop: end fsync pios in parallel
      dm-ploop: make filespace preallocations async
      dm-ploop: resubmit enospc pios from dispatcher thread
      dm-ploop: simplify discard completion
      dm-ploop: use GFP_ATOMIC instead of GFP_NOIO
      dm-ploop: fix locks used in mixed context
      dm-ploop: fix how current flags are managed inside threads
    
    Andrey Zhadchenko (13):
      dm-ploop: do not flush after metadata writes
      dm-ploop: set IOCB_DSYNC on all FUA requests
      dm-ploop: remove extra ploop_cluster_is_in_top_delta()
      dm-ploop: introduce per-md page locking
      dm-ploop: reduce BAT accesses on discard completion
      dm-ploop: simplify llseek
      dm-ploop: speed up ploop_prepare_bat_update()
      dm-ploop: make new allocations immediately visible in BAT
      dm-ploop: drop ploop_cluster_is_in_top_delta()
      dm-ploop: do not wait for BAT update for non-FUA requests
      dm-ploop: add delay for metadata writeback
      dm-ploop: submit all postponed metadata on REQ_OP_FLUSH
      dm-ploop: handle REQ_PREFLUSH
    
    Feature: dm-ploop: ploop target driver
---
 drivers/md/dm-ploop-bat.c | 14 ++++++++------
 drivers/md/dm-ploop-cmd.c | 47 +++++++++++++++++++++++++++++------------------
 drivers/md/dm-ploop-map.c | 16 ++++++++--------
 drivers/md/dm-ploop.h     |  6 ++++++
 4 files changed, 51 insertions(+), 32 deletions(-)

diff --git a/drivers/md/dm-ploop-bat.c b/drivers/md/dm-ploop-bat.c
index 655d0e4c91ab..96ba099b1ca4 100644
--- a/drivers/md/dm-ploop-bat.c
+++ b/drivers/md/dm-ploop-bat.c
@@ -88,6 +88,7 @@ static struct md_page *ploop_alloc_md_page(u32 id)
 	md->page = page;
 	md->kmpage = kmap(page);
 	md->id = id;
+	spin_lock_init(&md->md_lock);
 	return md;
 err_page:
 	kfree(levels);
@@ -134,20 +135,23 @@ bool ploop_try_update_bat_entry(struct ploop *ploop, u32 clu, u8 level, u32 dst_
 {
 	u32 *bat_entries, id = ploop_bat_clu_to_page_nr(clu);
 	struct md_page *md = ploop_md_page_find(ploop, id);
-
-	lockdep_assert_held(&ploop->bat_rwlock);
+	unsigned long flags;
+	bool ret = false;
 
 	if (!md)
 		return false;
 
 	clu = ploop_bat_clu_idx_in_page(clu); /* relative offset */
 
+	spin_lock_irqsave(&md->md_lock, flags); /* write */
 	if (READ_ONCE(md->bat_levels[clu]) == level) {
 		bat_entries = md->kmpage;
 		WRITE_ONCE(bat_entries[clu], dst_clu);
-		return true;
+		ret = true;
 	}
-	return false;
+	spin_unlock_irqrestore(&md->md_lock, flags);
+
+	return ret;
 }
 
 /* Alloc holes_bitmap and set bits of free clusters */
@@ -411,7 +415,6 @@ static void ploop_apply_delta_mappings(struct ploop *ploop,
 	if (!is_raw)
 		d_md = ploop_md_first_entry(md_root);
 
-	write_lock_irq(&ploop->bat_rwlock);
 	ploop_for_each_md_page(ploop, md, node) {
 		bat_entries = md->kmpage;
 		if (!is_raw)
@@ -455,7 +458,6 @@ static void ploop_apply_delta_mappings(struct ploop *ploop,
 		if (!is_raw)
 			d_md = ploop_md_next_entry(d_md);
 	}
-	write_unlock_irq(&ploop->bat_rwlock);
 }
 
 int ploop_check_delta_length(struct ploop *ploop, struct file *file, loff_t *file_size)
diff --git a/drivers/md/dm-ploop-cmd.c b/drivers/md/dm-ploop-cmd.c
index 1cd215800a16..649e9de029bf 100644
--- a/drivers/md/dm-ploop-cmd.c
+++ b/drivers/md/dm-ploop-cmd.c
@@ -27,6 +27,7 @@ static void ploop_advance_holes_bitmap(struct ploop *ploop,
 	u32 i, end, size, dst_clu, *bat_entries;
 	struct rb_node *node;
 	struct md_page *md;
+	unsigned long flags;
 
 	/* This is called only once */
 	if (cmd->resize.stage != PLOOP_GROW_STAGE_INITIAL)
@@ -44,6 +45,8 @@ static void ploop_advance_holes_bitmap(struct ploop *ploop,
 	ploop_for_each_md_page(ploop, md, node) {
 		ploop_init_be_iter(ploop, md->id, &i, &end);
 		bat_entries = md->kmpage;
+
+		spin_lock_irqsave(&md->md_lock, flags); /* read */
 		for (; i <= end; i++) {
 			if (!ploop_md_page_cluster_is_in_top_delta(ploop, md, i))
 				continue;
@@ -54,6 +57,7 @@ static void ploop_advance_holes_bitmap(struct ploop *ploop,
 				ploop_hole_clear_bit(dst_clu, ploop);
 			}
 		}
+		spin_unlock_irqrestore(&md->md_lock, flags);
 	}
 	write_unlock_irq(&ploop->bat_rwlock);
 }
@@ -157,11 +161,13 @@ static u32 ploop_find_bat_entry(struct ploop *ploop, u32 dst_clu, bool *is_locke
 	u32 i, end, *bat_entries, clu = U32_MAX;
 	struct rb_node *node;
 	struct md_page *md;
+	unsigned long flags;
 
-	read_lock_irq(&ploop->bat_rwlock);
 	ploop_for_each_md_page(ploop, md, node) {
 		ploop_init_be_iter(ploop, md->id, &i, &end);
 		bat_entries = md->kmpage;
+
+		spin_lock_irqsave(&md->md_lock, flags); /* read */
 		for (; i <= end; i++) {
 			if (READ_ONCE(bat_entries[i]) != dst_clu)
 				continue;
@@ -170,10 +176,10 @@ static u32 ploop_find_bat_entry(struct ploop *ploop, u32 dst_clu, bool *is_locke
 				break;
 			}
 		}
+		spin_unlock_irqrestore(&md->md_lock, flags);
 		if (clu != UINT_MAX)
 			break;
 	}
-	read_unlock_irq(&ploop->bat_rwlock);
 
 	*is_locked = false;
 	if (clu != UINT_MAX) {
@@ -344,10 +350,8 @@ static int ploop_grow_relocate_cluster(struct ploop *ploop,
 	}
 
 	/* Update local BAT copy */
-	write_lock_irq(&ploop->bat_rwlock);
 	WARN_ON(!ploop_try_update_bat_entry(ploop, clu,
 			ploop_top_level(ploop), new_dst));
-	write_unlock_irq(&ploop->bat_rwlock);
 not_occupied:
 	/*
 	 * Now dst_clu is not referenced in BAT, so increase the value
@@ -698,12 +702,10 @@ static int ploop_merge_latest_snapshot(struct ploop *ploop)
 	if (ret)
 		goto out;
 
-	write_lock_irq(&ploop->bat_rwlock);
 	level = ploop->nr_deltas - 2;
 	file = ploop->deltas[level].file;
 	ploop->deltas[level] = ploop->deltas[level + 1];
 	ploop->nr_deltas--;
-	write_unlock_irq(&ploop->bat_rwlock);
 	fput(file);
 
 	ploop_resume_submitting_pios(ploop);
@@ -721,15 +723,19 @@ static void notify_delta_merged(struct ploop *ploop, u8 level,
 	struct rb_node *node;
 	struct file *file;
 	bool stop = false;
+	unsigned long flags;
 	u32 clu;
 
 	d_md = ploop_md_first_entry(md_root);
 
-	write_lock_irq(&ploop->bat_rwlock);
 	ploop_for_each_md_page(ploop, md, node) {
 		init_be_iter(nr_be, md->id, &i, &end);
 		bat_entries = md->kmpage;
 		d_bat_entries = d_md->kmpage;
+
+		spin_lock_irqsave(&md->md_lock, flags); /* write */
+		spin_lock(&d_md->md_lock); /* write */
+
 		for (; i <= end; i++) {
 			clu = ploop_page_clu_idx_to_bat_clu(md->id, i);
 			if (clu == nr_be - 1)
@@ -762,6 +768,10 @@ static void notify_delta_merged(struct ploop *ploop, u8 level,
 			else
 				WRITE_ONCE(md->bat_levels[i], level);
 		}
+
+		spin_unlock(&d_md->md_lock);
+		spin_unlock_irqrestore(&md->md_lock, flags);
+
 		if (stop)
 			break;
 		d_md = ploop_md_next_entry(d_md);
@@ -773,7 +783,6 @@ static void notify_delta_merged(struct ploop *ploop, u8 level,
 		ploop->deltas[i - 1] = ploop->deltas[i];
 	memset(&ploop->deltas[--ploop->nr_deltas], 0,
 	       sizeof(struct ploop_delta));
-	write_unlock_irq(&ploop->bat_rwlock);
 	fput(file);
 }
 
@@ -784,7 +793,6 @@ static int ploop_process_update_delta_index(struct ploop *ploop, u8 level,
 	u32 clu, dst_clu, n;
 	int ret;
 
-	write_lock_irq(&ploop->bat_rwlock);
 	/* Check all */
 	while (sscanf(map, "%u:%u;%n", &clu, &dst_clu, &n) == 2) {
 		/*
@@ -809,7 +817,6 @@ static int ploop_process_update_delta_index(struct ploop *ploop, u8 level,
 	}
 	ret = 0;
 unlock:
-	write_unlock_irq(&ploop->bat_rwlock);
 	return ret;
 }
 ALLOW_ERROR_INJECTION(ploop_process_update_delta_index, ERRNO);
@@ -900,12 +907,9 @@ static int ploop_get_delta_name_cmd(struct ploop *ploop, u8 level,
 
 	/*
 	 * Nobody can change deltas in parallel, since
-	 * another cmds are prohibited, but do this
-	 * for uniformity.
+	 * another cmds are prohibited
 	 */
-	read_lock_irq(&ploop->bat_rwlock);
 	file = get_file(ploop->deltas[level].file);
-	read_unlock_irq(&ploop->bat_rwlock);
 
 	p = file_path(file, result, maxlen);
 	if (p == ERR_PTR(-ENAMETOOLONG)) {
@@ -973,7 +977,11 @@ static int process_flip_upper_deltas(struct ploop *ploop)
         bat_clusters = DIV_ROUND_UP(size, CLU_SIZE(ploop));
 	hb_nr = ploop->hb_nr;
 
-	write_lock_irq(&ploop->bat_rwlock);
+	/*
+	 * We can be here only if ploop is suspended:
+	 * no other IO nor command is possible
+	 */
+
 	/* Prepare holes_bitmap */
 	memset(holes_bitmap, 0xff, hb_nr/8);
 	for (i = (hb_nr & ~0x7); i < hb_nr; i++)
@@ -985,6 +993,7 @@ static int process_flip_upper_deltas(struct ploop *ploop)
 	ploop_for_each_md_page(ploop, md, node) {
 		ploop_init_be_iter(ploop, md->id, &i, &end);
 		bat_entries = md->kmpage;
+
 		for (; i <= end; i++) {
 			if (READ_ONCE(bat_entries[i]) == BAT_ENTRY_NONE)
 				continue;
@@ -999,7 +1008,6 @@ static int process_flip_upper_deltas(struct ploop *ploop)
 
 	/* FIXME */
 	swap(ploop->deltas[level], ploop->deltas[level+1]);
-	write_unlock_irq(&ploop->bat_rwlock);
 	return 0;
 }
 
@@ -1048,10 +1056,12 @@ static int ploop_check_delta_before_flip(struct ploop *ploop, struct file *file)
 	/* Points to hdr since md_page[0] also contains hdr. */
 	d_md = ploop_md_first_entry(&md_root);
 
-	write_lock_irq(&ploop->bat_rwlock);
 	ploop_for_each_md_page(ploop, md, node) {
 		init_be_iter(nr_be, md->id, &i, &end);
 		d_bat_entries = d_md->kmpage;
+
+		spin_lock_irq(&md->md_lock); /* read */
+		spin_lock(&d_md->md_lock);
 		for (; i <= end; i++) {
 			if (ploop_md_page_cluster_is_in_top_delta(ploop, md, i) &&
 			    d_bat_entries[i] != BAT_ENTRY_NONE) {
@@ -1060,6 +1070,8 @@ static int ploop_check_delta_before_flip(struct ploop *ploop, struct file *file)
 				goto unmap;
 			}
 		}
+		spin_unlock(&d_md->md_lock);
+		spin_unlock_irq(&md->md_lock);
 
 		clu = ploop_page_clu_idx_to_bat_clu(md->id, i);
 		if (clu == nr_be - 1) {
@@ -1072,7 +1084,6 @@ static int ploop_check_delta_before_flip(struct ploop *ploop, struct file *file)
 		d_md = ploop_md_next_entry(d_md);
 	}
 
-	write_unlock_irq(&ploop->bat_rwlock);
 	ploop_free_md_pages_tree(&md_root);
 out:
 #endif
diff --git a/drivers/md/dm-ploop-map.c b/drivers/md/dm-ploop-map.c
index 9943f2825e62..62fca243f755 100644
--- a/drivers/md/dm-ploop-map.c
+++ b/drivers/md/dm-ploop-map.c
@@ -388,13 +388,13 @@ static bool ploop_delay_if_md_busy(struct ploop *ploop, struct md_page *md,
 	WARN_ON_ONCE(!list_empty(&pio->list));
 
 	/* lock protects piwb */
-	read_lock_irqsave(&ploop->bat_rwlock, flags);
+	spin_lock_irqsave(&md->md_lock, flags); /* read */
 	piwb = md->piwb;
 	if (piwb && (piwb->type != type || test_bit(MD_WRITEBACK, &md->status))) {
 		llist_add((struct llist_node *)(&pio->list), &md->wait_llist);
 		busy = true;
 	}
-	read_unlock_irqrestore(&ploop->bat_rwlock, flags);
+	spin_unlock_irqrestore(&md->md_lock, flags);
 
 	return busy;
 }
@@ -805,11 +805,11 @@ static void ploop_advance_local_after_bat_wb(struct ploop *ploop,
 	}
 
 	WARN_ON_ONCE(!test_bit(MD_WRITEBACK, &md->status));
+	spin_lock_irqsave(&md->md_lock, flags); /* write */
 	clear_bit(MD_WRITEBACK, &md->status);
 	/* protect piwb */
-	write_lock_irqsave(&ploop->bat_rwlock, flags);
 	md->piwb = NULL;
-	write_unlock_irqrestore(&ploop->bat_rwlock, flags);
+	spin_unlock_irqrestore(&md->md_lock, flags);
 	kunmap_local(dst_clu);
 
 	wait_llist_pending = llist_del_all(&md->wait_llist);
@@ -925,10 +925,10 @@ static int ploop_prepare_bat_update(struct ploop *ploop, struct md_page *md,
 
 	bat_entries = md->kmpage;
 
-	write_lock_irq(&ploop->bat_rwlock);
+	spin_lock_irq(&md->md_lock); /* write */
 	md->piwb = piwb;
 	piwb->md = md;
-	write_unlock_irq(&ploop->bat_rwlock);
+	spin_unlock_irq(&md->md_lock);
 
 	piwb->page_id = page_id;
 	to = kmap_local_page(page);
@@ -973,10 +973,10 @@ void ploop_break_bat_update(struct ploop *ploop, struct md_page *md)
 	struct ploop_index_wb *piwb;
 	unsigned long flags;
 
-	write_lock_irqsave(&ploop->bat_rwlock, flags);
+	spin_lock_irqsave(&md->md_lock, flags); /* write */
 	piwb = md->piwb;
 	md->piwb = NULL;
-	write_unlock_irqrestore(&ploop->bat_rwlock, flags);
+	spin_unlock_irqrestore(&md->md_lock, flags);
 
 	ploop_free_piwb(piwb);
 }
diff --git a/drivers/md/dm-ploop.h b/drivers/md/dm-ploop.h
index 1e5927365d81..8e148e2a479f 100644
--- a/drivers/md/dm-ploop.h
+++ b/drivers/md/dm-ploop.h
@@ -124,6 +124,8 @@ struct md_page {
 
 	struct llist_node wb_llink;
 	struct ploop_index_wb *piwb;
+
+	spinlock_t md_lock;
 };
 
 enum {
@@ -423,6 +425,7 @@ static inline u32 ploop_bat_entries(struct ploop *ploop, u32 clu,
 {
 	u32 *bat_entries, dst_clu, id;
 	struct md_page *md;
+	unsigned long flags;
 
 	id = ploop_bat_clu_to_page_nr(clu);
 	md = ploop_md_page_find(ploop, id);
@@ -431,6 +434,7 @@ static inline u32 ploop_bat_entries(struct ploop *ploop, u32 clu,
 	/* Cluster index related to the page[page_id] start */
 	clu = ploop_bat_clu_idx_in_page(clu);
 
+	spin_lock_irqsave(&md->md_lock, flags); /* read */
 	if (bat_level)
 		*bat_level = READ_ONCE(md->bat_levels[clu]);
 	if (md_ret)
@@ -438,6 +442,8 @@ static inline u32 ploop_bat_entries(struct ploop *ploop, u32 clu,
 
 	bat_entries = md->kmpage;
 	dst_clu = READ_ONCE(bat_entries[clu]);
+	spin_unlock_irqrestore(&md->md_lock, flags);
+
 	return dst_clu;
 }
 
