[Devel] [PATCH RHEL9 COMMIT] dm-ploop: introduce pio runner threads
Konstantin Khorenko
khorenko at virtuozzo.com
Mon Jan 27 16:12:46 MSK 2025
The commit is pushed to "branch-rh9-5.14.0-427.44.1.vz9.80.x-ovz" and will appear at git at bitbucket.org:openvz/vzkernel.git
after rh9-5.14.0-427.44.1.vz9.80.6
------>
commit a0a2f1d32890bcd05eda57ca877034ca295aa42a
Author: Alexander Atanasov <alexander.atanasov at virtuozzo.com>
Date: Fri Jan 24 17:36:12 2025 +0200
dm-ploop: introduce pio runner threads
Create threads to execute pios in parallel - call them pio runners.
Use the number of CPUs to determine how many threads to start.
From the worker, each pio is sent to a runner thread in round-robin
fashion through work_llist. Maintain a count of the pios sent so we
can wait for them to be processed - note that we only want to preserve
the order of execution of the different pio types, which can differ
from the order of their completion. We send a batch of pios to the
runners and, if necessary, wait for them to be processed before moving
forward - we need this for metadata writeback and flushes.
https://virtuozzo.atlassian.net/browse/VSTOR-91821
Signed-off-by: Alexander Atanasov <alexander.atanasov at virtuozzo.com>
======
Patchset description:
ploop: optimisations and scaling
Ploop now processes requests in different threads in parallel
where possible, which results in a significant performance
improvement and makes further optimisations possible.
Known bugs:
- delayed metadata writeback is not working and is missing error handling
- patch to disable it until fixed
- fast path is not working - causes rcu lockups - patch to disable it
Further improvements:
- optimize md pages lookups
Alexander Atanasov (50):
dm-ploop: md_pages map all pages at creation time
dm-ploop: Use READ_ONCE/WRITE_ONCE to access md page data
dm-ploop: fsync after all pios are sent
dm-ploop: move md status to use proper bitops
dm-ploop: convert wait_list and wb_batch_llist to use lockless lists
dm-ploop: convert enospc handling to use lockless lists
dm-ploop: convert suspended_pios list to use lockless list
dm-ploop: convert the rest of the lists to use llist variant
dm-ploop: combine processing of pios thru prepare list and remove
fsync worker
dm-ploop: move from wq to kthread
dm-ploop: move preparations of pios into the caller from worker
dm-ploop: fast path execution for reads
dm-ploop: do not use a wrapper for set_bit to make a page writeback
dm-ploop: BAT use only one list for writeback
dm-ploop: make md writeback timeout to be per page
dm-ploop: add interface to disable bat writeback delay
dm-ploop: convert wb_batch_list to lockless variant
dm-ploop: convert high_prio to status
dm-ploop: split cow processing into two functions
dm-ploop: convert md page rw lock to spin lock
dm-ploop: convert bat_rwlock to bat_lock spinlock
dm-ploop: prepare bat updates under bat_lock
dm-ploop: make ploop_bat_write_complete ready for parallel pio
completion
dm-ploop: make ploop_submit_metadata_writeback return number of
requests sent
dm-ploop: introduce pio runner threads
dm-ploop: add pio list ids to be used when passing pios to runners
dm-ploop: process pios via runners
dm-ploop: disable metadata writeback delay
dm-ploop: disable fast path
dm-ploop: use lockless lists for chained cow updates list
dm-ploop: use lockless lists for data ready pios
dm-ploop: give runner threads better name
dm-ploop: resize operation - add holes bitmap locking
dm-ploop: remove unnecessary operations
dm-ploop: use filp per thread
dm-ploop: catch if we try to advance pio past bio end
dm-ploop: support REQ_FUA for data pios
dm-ploop: proplerly access nr_bat_entries
dm-ploop: fix locking and improve error handling when submitting pios
dm-ploop: fix how ENOTBLK is handled
dm-ploop: sync when suspended or stopping
dm-ploop: rework bat completion logic
dm-ploop: rework logic in pio processing
dm-ploop: end fsync pios in parallel
dm-ploop: make filespace preallocations async
dm-ploop: resubmit enospc pios from dispatcher thread
dm-ploop: dm-ploop: simplify discard completion
dm-ploop: use GFP_ATOMIC instead of GFP_NOIO
dm-ploop: fix locks used in mixed context
dm-ploop: fix how current flags are managed inside threads
Andrey Zhadchenko (13):
dm-ploop: do not flush after metadata writes
dm-ploop: set IOCB_DSYNC on all FUA requests
dm-ploop: remove extra ploop_cluster_is_in_top_delta()
dm-ploop: introduce per-md page locking
dm-ploop: reduce BAT accesses on discard completion
dm-ploop: simplify llseek
dm-ploop: speed up ploop_prepare_bat_update()
dm-ploop: make new allocations immediately visible in BAT
dm-ploop: drop ploop_cluster_is_in_top_delta()
dm-ploop: do not wait for BAT update for non-FUA requests
dm-ploop: add delay for metadata writeback
dm-ploop: submit all postponed metadata on REQ_OP_FLUSH
dm-ploop: handle REQ_PREFLUSH
Feature: dm-ploop: ploop target driver
---
drivers/md/dm-ploop-map.c | 136 +++++++++++++++++++++++++++++++++++++++----
drivers/md/dm-ploop-target.c | 44 ++++++++++++--
drivers/md/dm-ploop.h | 13 ++++-
3 files changed, 175 insertions(+), 18 deletions(-)
diff --git a/drivers/md/dm-ploop-map.c b/drivers/md/dm-ploop-map.c
index f59889d2cd89..b820e9ba1218 100644
--- a/drivers/md/dm-ploop-map.c
+++ b/drivers/md/dm-ploop-map.c
@@ -20,6 +20,8 @@
#include "dm-ploop.h"
#include "dm-rq.h"
+static inline int ploop_runners_add_work(struct ploop *ploop, struct pio *pio);
+
#define PREALLOC_SIZE (128ULL * 1024 * 1024)
static void ploop_handle_cleanup(struct ploop *ploop, struct pio *pio);
@@ -1894,6 +1896,11 @@ static void ploop_process_resubmit_pios(struct ploop *ploop,
}
}
+static inline int ploop_runners_have_pending(struct ploop *ploop)
+{
+ return atomic_read(&ploop->kt_worker->inflight_pios);
+}
+
static int ploop_submit_metadata_writeback(struct ploop *ploop)
{
unsigned long flags;
@@ -1958,6 +1965,33 @@ static void process_ploop_fsync_work(struct ploop *ploop, struct llist_node *llf
}
}
+static inline int ploop_runners_add_work(struct ploop *ploop, struct pio *pio)
+{
+ struct ploop_worker *wrkr;
+
+ wrkr = READ_ONCE(ploop->last_used_runner)->next;
+ WRITE_ONCE(ploop->last_used_runner, wrkr);
+
+ atomic_inc(&ploop->kt_worker->inflight_pios);
+ llist_add((struct llist_node *)(&pio->list), &wrkr->work_llist);
+ wake_up_process(wrkr->task);
+
+ return 0;
+}
+
+static inline int ploop_runners_add_work_list(struct ploop *ploop, struct llist_node *list)
+{
+ struct llist_node *pos, *t;
+ struct pio *pio;
+
+ llist_for_each_safe(pos, t, list) {
+ pio = list_entry((struct list_head *)pos, typeof(*pio), list);
+ ploop_runners_add_work(ploop, pio);
+ }
+
+ return 0;
+}
+
void do_ploop_run_work(struct ploop *ploop)
{
LLIST_HEAD(deferred_pios);
@@ -2017,30 +2051,110 @@ void do_ploop_work(struct work_struct *ws)
do_ploop_run_work(ploop);
}
-int ploop_worker(void *data)
+int ploop_pio_runner(void *data)
{
struct ploop_worker *worker = data;
struct ploop *ploop = worker->ploop;
+ struct llist_node *llwork;
+ struct pio *pio;
+ struct llist_node *pos, *t;
+ unsigned int old_flags = current->flags;
+ int did_process_pios = 0;
for (;;) {
+ current->flags = old_flags;
set_current_state(TASK_INTERRUPTIBLE);
- if (kthread_should_stop()) {
- __set_current_state(TASK_RUNNING);
- break;
+check_for_more:
+ llwork = llist_del_all(&worker->work_llist);
+ if (!llwork) {
+ if (did_process_pios) {
+ did_process_pios = 0;
+ wake_up_interruptible(&ploop->dispatcher_wq_data);
+ }
+ /* Only stop when there is no more pios */
+ if (kthread_should_stop()) {
+ __set_current_state(TASK_RUNNING);
+ break;
+ }
+ schedule();
+ continue;
}
+ __set_current_state(TASK_RUNNING);
+ old_flags = current->flags;
+ current->flags |= PF_IO_THREAD|PF_LOCAL_THROTTLE|PF_MEMALLOC_NOIO;
+
+ llist_for_each_safe(pos, t, llwork) {
+ pio = list_entry((struct list_head *)pos, typeof(*pio), list);
+ INIT_LIST_HEAD(&pio->list);
+ switch (pio->queue_list_id) {
+ case PLOOP_LIST_FLUSH:
+ WARN_ON_ONCE(1); /* We must not see flushes here */
+ break;
+ case PLOOP_LIST_PREPARE:
+ // fsync pios can come here for endio
+ // XXX: make it a FSYNC list
+ ploop_pio_endio(pio);
+ break;
+ case PLOOP_LIST_DEFERRED:
+ ploop_process_one_deferred_bio(ploop, pio);
+ break;
+ case PLOOP_LIST_COW:
+ ploop_process_one_delta_cow(ploop, pio);
+ break;
+ case PLOOP_LIST_DISCARD:
+ ploop_process_one_discard_pio(ploop, pio);
+ break;
+ // XXX: make it list MDWB
+ case PLOOP_LIST_INVALID: /* resubmit sets the list id to invalid */
+ ploop_submit_rw_mapped(ploop, pio);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ }
+ atomic_dec(&ploop->kt_worker->inflight_pios);
+ }
+ cond_resched();
+ did_process_pios = 1;
+ goto check_for_more;
+ }
+ return 0;
+}
+
+int ploop_worker(void *data)
+{
+ struct ploop_worker *worker = data;
+ struct ploop *ploop = worker->ploop;
+
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+
if (llist_empty(&ploop->pios[PLOOP_LIST_FLUSH]) &&
- llist_empty(&ploop->pios[PLOOP_LIST_PREPARE]) &&
- llist_empty(&ploop->pios[PLOOP_LIST_DEFERRED]) &&
- llist_empty(&ploop->pios[PLOOP_LIST_DISCARD]) &&
- llist_empty(&ploop->pios[PLOOP_LIST_COW]) &&
- llist_empty(&ploop->llresubmit_pios)
- )
+ llist_empty(&ploop->pios[PLOOP_LIST_PREPARE]) &&
+ llist_empty(&ploop->pios[PLOOP_LIST_DEFERRED]) &&
+ llist_empty(&ploop->pios[PLOOP_LIST_DISCARD]) &&
+ llist_empty(&ploop->pios[PLOOP_LIST_COW]) &&
+ llist_empty(&ploop->llresubmit_pios) &&
+ !ploop->force_md_writeback) {
+ if (kthread_should_stop()) {
+ wait_event_interruptible(ploop->dispatcher_wq_data,
+ (!ploop_runners_have_pending(ploop)));
+ __set_current_state(TASK_RUNNING);
+ break;
+ }
schedule();
+ /* now check for pending work */
+ }
__set_current_state(TASK_RUNNING);
do_ploop_run_work(ploop);
- cond_resched();
+ cond_resched(); /* give other processes chance to run */
+ if (kthread_should_stop()) {
+ wait_event_interruptible(ploop->dispatcher_wq_data,
+ (!ploop_runners_have_pending(ploop)));
+ __set_current_state(TASK_RUNNING);
+ break;
+ }
}
return 0;
}
diff --git a/drivers/md/dm-ploop-target.c b/drivers/md/dm-ploop-target.c
index dc63c18cece8..3fed26137831 100644
--- a/drivers/md/dm-ploop-target.c
+++ b/drivers/md/dm-ploop-target.c
@@ -164,6 +164,7 @@ static void ploop_destroy(struct ploop *ploop)
int i;
if (ploop->kt_worker) {
+ ploop->force_md_writeback = 1;
wake_up_process(ploop->kt_worker->task);
/* try to send all pending - if we have partial io and enospc end bellow */
while (!llist_empty(&ploop->pios[PLOOP_LIST_FLUSH]) ||
@@ -175,9 +176,22 @@ static void ploop_destroy(struct ploop *ploop)
schedule();
}
+ if (ploop->kt_runners) {
+ for (i = 0; i < ploop->nkt_runners; i++) {
+ if (ploop->kt_runners[i]) {
+ wake_up_process(ploop->kt_runners[i]->task);
+ kthread_stop(ploop->kt_runners[i]->task);
+ kfree(ploop->kt_runners[i]);
+ }
+ }
+ }
+
kthread_stop(ploop->kt_worker->task); /* waits for the thread to stop */
+
WARN_ON(!llist_empty(&ploop->pios[PLOOP_LIST_PREPARE]));
WARN_ON(!llist_empty(&ploop->llresubmit_pios));
+ WARN_ON(!llist_empty(&ploop->enospc_pios));
+ kfree(ploop->kt_runners);
kfree(ploop->kt_worker);
}
@@ -347,7 +361,8 @@ ALLOW_ERROR_INJECTION(ploop_add_deltas_stack, ERRNO);
argv++; \
} while (0);
-static struct ploop_worker *ploop_worker_create(struct ploop *ploop)
+static struct ploop_worker *ploop_worker_create(struct ploop *ploop,
+ int (*worker_fn)(void *), const char *pref, int id)
{
struct ploop_worker *worker;
struct task_struct *task;
@@ -357,12 +372,13 @@ static struct ploop_worker *ploop_worker_create(struct ploop *ploop)
return NULL;
worker->ploop = ploop;
- task = kthread_create(ploop_worker, worker, "ploop-%d-0",
- current->pid);
+ task = kthread_create(worker_fn, worker, "ploop-%d-%s-%d",
+ current->pid, pref, id);
if (IS_ERR(task))
goto out_err;
worker->task = task;
+ init_llist_head(&worker->work_llist);
wake_up_process(task);
@@ -521,10 +537,30 @@ static int ploop_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto err;
- ploop->kt_worker = ploop_worker_create(ploop);
+ init_waitqueue_head(&ploop->dispatcher_wq_data);
+
+ ploop->kt_worker = ploop_worker_create(ploop, ploop_worker, "d", 0);
if (!ploop->kt_worker)
goto err;
+/* make it a param = either module or cpu based or dev req queue */
+#define PLOOP_PIO_RUNNERS nr_cpu_ids
+ ploop->kt_runners = kcalloc(PLOOP_PIO_RUNNERS, sizeof(struct kt_worker *), GFP_KERNEL);
+ if (!ploop->kt_runners)
+ goto err;
+
+ ploop->nkt_runners = PLOOP_PIO_RUNNERS;
+ for (i = 0; i < ploop->nkt_runners; i++) {
+ ploop->kt_runners[i] = ploop_worker_create(ploop, ploop_pio_runner, "r", i+1);
+ if (!ploop->kt_runners[i])
+ goto err;
+ }
+
+ for (i = 0; i < ploop->nkt_runners-1; i++)
+ ploop->kt_runners[i]->next = ploop->kt_runners[i+1];
+ ploop->kt_runners[ploop->nkt_runners-1]->next = ploop->kt_runners[0];
+ ploop->last_used_runner = ploop->kt_runners[0];
+
ret = ploop_add_deltas_stack(ploop, &argv[0], argc);
if (ret)
goto err;
diff --git a/drivers/md/dm-ploop.h b/drivers/md/dm-ploop.h
index 6f1b1e284fc3..452fc4a6c58f 100644
--- a/drivers/md/dm-ploop.h
+++ b/drivers/md/dm-ploop.h
@@ -146,14 +146,17 @@ enum {
struct ploop_worker {
struct ploop *ploop;
struct task_struct *task;
- u64 kcov_handle;
+ struct llist_head work_llist;
+ atomic_t inflight_pios;
+ struct ploop_worker *next;
};
struct ploop {
+ struct wait_queue_head dispatcher_wq_data;
struct dm_target *ti;
#define PLOOP_PRQ_POOL_SIZE 512 /* Twice nr_requests from blk_mq_init_sched() */
mempool_t *prq_pool;
-#define PLOOP_PIO_POOL_SIZE 256
+#define PLOOP_PIO_POOL_SIZE 512
mempool_t *pio_pool;
struct rb_root bat_entries;
@@ -198,7 +201,10 @@ struct ploop {
struct work_struct worker;
struct work_struct event_work;
- struct ploop_worker *kt_worker;
+ struct ploop_worker *kt_worker; /* dispatcher thread */
+ struct ploop_worker **kt_runners; /* pio runners */
+ unsigned int nkt_runners;
+ struct ploop_worker *last_used_runner;
struct completion inflight_bios_ref_comp;
struct percpu_ref inflight_bios_ref[2];
bool inflight_ref_comp_pending;
@@ -608,6 +614,7 @@ extern void ploop_enospc_timer(struct timer_list *timer);
extern loff_t ploop_llseek_hole(struct dm_target *ti, loff_t offset, int whence);
extern int ploop_worker(void *data);
+extern int ploop_pio_runner(void *data);
extern void ploop_disable_writeback_delay(struct ploop *ploop);
extern void ploop_enable_writeback_delay(struct ploop *ploop);