[Devel] [PATCH v4 VZ9 3/5] dm-qcow2: make merge_backward command asyncronous

Pavel Tikhomirov ptikhomirov at virtuozzo.com
Wed Mar 5 06:28:21 MSK 2025



On 3/4/25 19:48, Andrey Zhadchenko wrote:
> 
> 
> On 3/4/25 12:32, Pavel Tikhomirov wrote:
>>
>>
>> On 3/4/25 18:51, Andrey Zhadchenko wrote:
>>>
>>>
>>> On 3/3/25 10:37, Pavel Tikhomirov wrote:
>>>> This adds merge_backward "start", "complete" and "cancel" commands. By
>>>> that we are able to split single merge_backward into two stages: start
>>>> asyncronous merging and completion. That can be usefull for restarting
>>>> qemu process while allowing backward merging to run asyncronously in
>>>> kernel.
>>>>
>>>> The "start" command runs merging preparations in workqueue work. After
>>>> it finishes, the "complete" command can be called to finish the process
>>>> and actually replace the top qcow2 with it's lower. The "cancel" 
>>>> command
>>>> forces the work to stop and flushes it. In case we are in completion
>>>> waiting state already and there is no work running, the "cancel" 
>>>> command
>>>> also reverts merging preparations.
>>>>
>>>> Locking:
>>>>
>>>> Data in tgt->backward_merge is protected by tgt->ctl_mutex. The "start"
>>>> and "complete" commands are fully under this lock, and the "cancel"
>>>> operation takes the lock explicitly and releases it for work flushing.
>>>> The work also takes the lock but only when updating tgt->backward_merge
>>>> data. For checks, if the work was caneled in the middle, we read the
>>>> state without locking as we don't modify the state there, also we would
>>>> re-check the state again before exiting the work function under lock.
>>>>
>>>> Now on target suspend we "cancel" currently running backward merge,
>>>> previously we were just hanging untill backward merge have been
>>>> finished for possibly a long time, cancelling seems cleaner. Though we
>>>> don't really expect hypervisor suspending the target in the middle of
>>>> backward merge that it by itself started.
>>>>
>>>> https://virtuozzo.atlassian.net/browse/VSTOR-100466
>>>> Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
>>>>
>>>> -- 
>>>> v2: Cancel from BACKWARD_MERGE_START state should not try to stop the
>>>> work via BACKWARD_MERGE_STOP state, else we will deadlock in this 
>>>> state.
>>>> ---
>>>>   drivers/md/dm-qcow2-cmd.c    | 142 ++++++++++++++++++++++++++++++ 
>>>> +----
>>>>   drivers/md/dm-qcow2-target.c |   6 ++
>>>>   drivers/md/dm-qcow2.h        |  19 +++++
>>>>   3 files changed, 153 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/drivers/md/dm-qcow2-cmd.c b/drivers/md/dm-qcow2-cmd.c
>>>> index 7b4b0ee68ad9f..04a992f3ebba6 100644
>>>> --- a/drivers/md/dm-qcow2-cmd.c
>>>> +++ b/drivers/md/dm-qcow2-cmd.c
>>>> @@ -52,6 +52,8 @@ static void service_qio_endio(struct qcow2_target 
>>>> *tgt, struct qio *qio,
>>>>       wake_up(&tgt->service_wq);
>>>>   }
>>>> +static bool qcow2_backward_merge_should_stop(struct qcow2_target 
>>>> *tgt);
>>>> +
>>>>   static int qcow2_service_iter(struct qcow2_target *tgt, struct 
>>>> qcow2 *qcow2,
>>>>             loff_t end, loff_t step, unsigned int bi_op, u8 qio_flags)
>>>>   {
>>>> @@ -63,7 +65,7 @@ static int qcow2_service_iter(struct qcow2_target 
>>>> *tgt, struct qcow2 *qcow2,
>>>>       WRITE_ONCE(service_status, BLK_STS_OK);
>>>>       for (pos = 0; pos < end; pos += step) {
>>>> -        if (fatal_signal_pending(current)) {
>>>> +        if (qcow2_backward_merge_should_stop(tgt)) {
>>>>               ret = -EINTR;
>>>>               break;
>>>>           }
>>>> @@ -161,10 +163,11 @@ static void 
>>>> set_backward_merge_in_process(struct qcow2_target *tgt,
>>>>       qcow2_submit_embedded_qios(tgt, &list);
>>>>   }
>>>> -static int qcow2_merge_backward(struct qcow2_target *tgt)
>>>> +static int qcow2_merge_backward_start(struct qcow2_target *tgt)
>>>>   {
>>>>       struct qcow2 *qcow2 = tgt->top, *lower = qcow2->lower;
>>>> -    int ret, ret2;
>>>> +
>>>> +    lockdep_assert_held(&tgt->ctl_mutex);
>>>>       if (!lower)
>>>>           return -ENOENT;
>>>> @@ -174,6 +177,35 @@ static int qcow2_merge_backward(struct 
>>>> qcow2_target *tgt)
>>>>           return -EOPNOTSUPP;
>>>>       if (lower->hdr.size < qcow2->hdr.size)
>>>>           return -EBADSLT;
>>>> +
>>>> +    if (tgt->backward_merge.state != BACKWARD_MERGE_STOPPED)
>>>> +        return -EBUSY;
>>>> +    tgt->backward_merge.state = BACKWARD_MERGE_START;
>>>> +    tgt->backward_merge.error = 0;
>>>> +
>>>> +    schedule_work(&tgt->backward_merge.work);
>>>
>>> Does this imply we potentially occupy one of the workers of the 
>>> global pool for the indefinite amount of time? What if we run as much 
>>> as nworkers (probably ncpus) merges simultaneously?
>>
>> System_wq has 1024*NCPU execution contexts:
>>
>>  > ``@max_active`` determines the maximum number of execution contexts 
>> per CPU
>>
>>  > The maximum limit for ``@max_active`` is 2048 and the default value 
>> used
>> when 0 is specified is 1024.
>>
>> If we try to run ~1024 works per cpu at the same time we might have a 
>> problem, and will need to either swithch to our own work-queue or 
>> create explicit kernel thread for each merge.
>>
>>
>> As flushing system-wide workqueues is now deprecated we are also fine 
>> with long running work in system_wq and not in system_long_wq, but we 
>> can move it to system_long_wq just to be on the safe side.
>>
>>   * system_wq is the one used by schedule[_delayed]_work[_on]().
>>   * Multi-CPU multi-threaded.  There are users which expect relatively
>>   * short queue flush time.  Don't queue works which can run for too
>>   * long.
>>
>>   * system_long_wq is similar to system_wq but may host long running
>>   * works.  Queue flushing might take relatively long.
>>
>> What do you think?
> 
> In close approximation we can just run on dm-qcow2 workqueue, as each 
> image allocates it's own wq, tgt->wq. And we only run 2 works there, 
> general and flush.
> However Alexander's dm-ploop rework dropped workqueues in favor of 
> threads, so I do not know if and when we intend to merge it into dm-qcow2.
> 
> With given limits 1024 per cpu we probably could ignore constraints on 
> global workqueue. Could you please also give me some links where I can 
> read this? Couldn't find anything myself

Strange, both citations are grepable in both mainstream and vz9 trees:

linux$ git grep "system_wq is the one used by schedule"
include/linux/workqueue.h: * system_wq is the one used by 
schedule[_delayed]_work[_on]().

https://github.com/torvalds/linux/blob/48a5eed9ad584315c30ed35204510536235ce402/include/linux/workqueue.h#L430

linux$ git grep "determines the maximum number of execution contexts"
Documentation/core-api/workqueue.rst:``@max_active`` determines the 
maximum number of execution contexts per

https://github.com/torvalds/linux/blob/48a5eed9ad584315c30ed35204510536235ce402/Documentation/core-api/workqueue.rst?plain=1#L242

In VZ9:

kernel-vz9$ git grep "system_wq is the one used by schedule"
include/linux/workqueue.h: * system_wq is the one used by 
schedule[_delayed]_work[_on]().

kernel-vz9$ git grep "determines the maximum number of execution contexts"
Documentation/core-api/workqueue.rst:``@max_active`` determines the 
maximum number of execution contexts

> 
>>
>>>
>>>> +    return 0;
>>>> +}
>>>> +ALLOW_ERROR_INJECTION(qcow2_merge_backward_start, ERRNO);
>>>> +
>>>> +void qcow2_merge_backward_work(struct work_struct *work)
>>>> +{
>>>> +    struct qcow2_target *tgt = container_of(work, struct qcow2_target,
>>>> +                        backward_merge.work);
>>>> +    struct qcow2 *qcow2, *lower;
>>>> +    int ret, ret2;
>>>> +
>>>> +    mutex_lock(&tgt->ctl_mutex);
>>>> +    if (tgt->backward_merge.state != BACKWARD_MERGE_START) {
>>>> +        mutex_unlock(&tgt->ctl_mutex);
>>>> +        return;
>>>> +    }
>>>> +    tgt->backward_merge.state = BACKWARD_MERGE_RUN;
>>>> +    mutex_unlock(&tgt->ctl_mutex);
>>>> +
>>>> +    qcow2 = tgt->top;
>>>> +    lower = qcow2->lower;
>>>> +
>>>>       /*
>>>>        * Break all COW clus at L1 level. Otherwise, later
>>>>        * there would be problems with unusing them:
>>>> @@ -183,13 +215,13 @@ static int qcow2_merge_backward(struct 
>>>> qcow2_target *tgt)
>>>>       ret = qcow2_break_l1cow(tgt);
>>>>       if (ret) {
>>>>           QC_ERR(tgt->ti, "Can't break L1 COW");
>>>> -        return ret;
>>>> +        goto out_err;
>>>>       }
>>>>       ret = qcow2_set_image_file_features(lower, true);
>>>>       if (ret) {
>>>>           QC_ERR(tgt->ti, "Can't set dirty bit");
>>>> -        return ret;
>>>> +        goto out_err;
>>>>       }
>>>>       set_backward_merge_in_process(tgt, qcow2, true);
>>>> @@ -200,22 +232,85 @@ static int qcow2_merge_backward(struct 
>>>> qcow2_target *tgt)
>>>>           ret2 = qcow2_set_image_file_features(lower, false);
>>>>           if (ret2 < 0)
>>>>               QC_ERR(tgt->ti, "Can't unuse lower (%d)", ret2);
>>>> -        return ret;
>>>>       }
>>>> +
>>>> +out_err:
>>>> +    mutex_lock(&tgt->ctl_mutex);
>>>> +    if (ret) {
>>>> +        /* Error */
>>>> +        tgt->backward_merge.state = BACKWARD_MERGE_STOPPED;
>>>> +        tgt->backward_merge.error = ret;
>>>> +    } else if (tgt->backward_merge.state == BACKWARD_MERGE_STOP) {
>>>> +        /* Merge is canceled */
>>>> +        set_backward_merge_in_process(tgt, qcow2, false);
>>>> +        tgt->backward_merge.state = BACKWARD_MERGE_STOPPED;
>>>> +        tgt->backward_merge.error = -EINTR;
>>>> +    } else {
>>>> +        /* Finish merge */
>>>> +        tgt->backward_merge.state = BACKWARD_MERGE_WAIT_COMPLETION;
>>>> +    }
>>>> +    mutex_unlock(&tgt->ctl_mutex);
>>>> +}
>>>> +
>>>> +static int qcow2_merge_backward_complete(struct qcow2_target *tgt)
>>>> +{
>>>> +    struct qcow2 *qcow2 = tgt->top, *lower = qcow2->lower;
>>>> +    int ret;
>>>> +
>>>> +    lockdep_assert_held(&tgt->ctl_mutex);
>>>> +
>>>> +    if (tgt->backward_merge.state != BACKWARD_MERGE_WAIT_COMPLETION)
>>>> +        return -EBUSY;
>>>> +
>>>>       tgt->top = lower;
>>>>       smp_wmb(); /* Pairs with qcow2_ref_inc() */
>>>>       qcow2_inflight_ref_switch(tgt); /* Pending qios */
>>>>       qcow2_flush_deferred_activity(tgt, qcow2); /* Delayed md pages */
>>>>       qcow2->lower = NULL;
>>>> -    ret2 = qcow2_set_image_file_features(qcow2, false);
>>>> -    if (ret2 < 0)
>>>> -        QC_ERR(tgt->ti, "Can't unuse merged img (%d)", ret2);
>>>> +    ret = qcow2_set_image_file_features(qcow2, false);
>>>> +    if (ret < 0)
>>>> +        QC_ERR(tgt->ti, "Can't unuse merged img (%d)", ret);
>>>>       qcow2_destroy(qcow2);
>>>> +    tgt->backward_merge.state = BACKWARD_MERGE_STOPPED;
>>>> +
>>>>       return 0;
>>>>   }
>>>> -ALLOW_ERROR_INJECTION(qcow2_merge_backward, ERRNO);
>>>> +ALLOW_ERROR_INJECTION(qcow2_merge_backward_complete, ERRNO);
>>>> +
>>>> +void qcow2_merge_backward_cancel(struct qcow2_target *tgt)
>>>> +{
>>>> +    bool flush = false;
>>>> +
>>>> +    mutex_lock(&tgt->ctl_mutex);
>>>> +    if (tgt->backward_merge.state == BACKWARD_MERGE_STOPPED) {
>>>> +        mutex_unlock(&tgt->ctl_mutex);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (tgt->backward_merge.state == BACKWARD_MERGE_START) {
>>>> +        tgt->backward_merge.state = BACKWARD_MERGE_STOPPED;
>>>> +        flush = true;
>>>> +    } else if (tgt->backward_merge.state == BACKWARD_MERGE_RUN) {
>>>> +        tgt->backward_merge.state = BACKWARD_MERGE_STOP;
>>>> +        flush = true;
>>>> +    } else if (tgt->backward_merge.state == BACKWARD_MERGE_STOP) {
>>>> +        flush = true;
>>>> +    } else if (tgt->backward_merge.state == 
>>>> BACKWARD_MERGE_WAIT_COMPLETION) {
>>>> +        set_backward_merge_in_process(tgt, tgt->top, false);
>>>> +        tgt->backward_merge.state = BACKWARD_MERGE_STOPPED;
>>>> +    }
>>>> +    mutex_unlock(&tgt->ctl_mutex);
>>>> +
>>>> +    if (flush)
>>>> +        flush_work(&tgt->backward_merge.work);
>>>> +}
>>>> +
>>>> +static bool qcow2_backward_merge_should_stop(struct qcow2_target *tgt)
>>>> +{
>>>> +    return READ_ONCE(tgt->backward_merge.state) == 
>>>> BACKWARD_MERGE_STOP;
>>>> +}
>>>>   static struct qcow2 *qcow2_get_img(struct qcow2_target *tgt, u32 
>>>> img_id, u8 *ref_index)
>>>>   {
>>>> @@ -374,11 +469,19 @@ int qcow2_message(struct dm_target *ti, 
>>>> unsigned int argc, char **argv,
>>>>           }
>>>>           ret = qcow2_get_event(tgt, result, maxlen);
>>>>           goto out;
>>>> +    } else if (!strcmp(argv[0], "merge_backward")) {
>>>> +        if (argc != 2) {
>>>> +            ret = -EINVAL;
>>>> +            goto out;
>>>> +        }
>>>> +        if (!strcmp(argv[1], "cancel")) {
>>>> +            qcow2_merge_backward_cancel(tgt);
>>>> +            ret = 0;
>>>> +            goto out;
>>>> +        }
>>>>       }
>>>> -    ret = mutex_lock_killable(&tgt->ctl_mutex);
>>>> -    if (ret)
>>>> -        goto out;
>>>> +    mutex_lock(&tgt->ctl_mutex);
>>>>       if (!strcmp(argv[0], "get_errors")) {
>>>>           ret = qcow2_get_errors(tgt, result, maxlen);
>>>> @@ -388,7 +491,18 @@ int qcow2_message(struct dm_target *ti, 
>>>> unsigned int argc, char **argv,
>>>>       } else if (!strcmp(argv[0], "merge_forward")) {
>>>>           ret = qcow2_merge_forward(tgt);
>>>>       } else if (!strcmp(argv[0], "merge_backward")) {
>>>> -        ret = qcow2_merge_backward(tgt);
>>>> +        if (argc != 2) {
>>>> +            ret = -EINVAL;
>>>> +            mutex_unlock(&tgt->ctl_mutex);
>>>> +            goto out;
>>>> +        }
>>>> +        if (!strcmp(argv[1], "start")) {
>>>> +            ret = qcow2_merge_backward_start(tgt);
>>>> +        } else if (!strcmp(argv[1], "complete")) {
>>>> +            ret = qcow2_merge_backward_complete(tgt);
>>>> +        } else {
>>>> +            ret = -ENOTTY;
>>>> +        }
>>>>       } else {
>>>>           ret = -ENOTTY;
>>>>       }
>>>> diff --git a/drivers/md/dm-qcow2-target.c b/drivers/md/dm-qcow2- 
>>>> target.c
>>>> index 540c03cb3c44f..6e2e583ba0b8b 100644
>>>> --- a/drivers/md/dm-qcow2-target.c
>>>> +++ b/drivers/md/dm-qcow2-target.c
>>>> @@ -25,6 +25,8 @@ static void qcow2_set_service_operations(struct 
>>>> dm_target *ti, bool allowed)
>>>>       mutex_lock(&tgt->ctl_mutex);
>>>>       tgt->service_operations_allowed = allowed;
>>>>       mutex_unlock(&tgt->ctl_mutex);
>>>> +    if (!allowed)
>>>> +        qcow2_merge_backward_cancel(tgt);
>>>>   }
>>>>   static void qcow2_set_wants_suspend(struct dm_target *ti, bool wants)
>>>>   {
>>>> @@ -251,6 +253,7 @@ static void qcow2_tgt_destroy(struct 
>>>> qcow2_target *tgt)
>>>>           /* Now kill the queue */
>>>>           destroy_workqueue(tgt->wq);
>>>>       }
>>>> +    qcow2_merge_backward_cancel(tgt);
>>>>       mempool_destroy(tgt->qio_pool);
>>>>       mempool_destroy(tgt->qrq_pool);
>>>> @@ -494,6 +497,9 @@ static struct qcow2_target 
>>>> *alloc_qcow2_target(struct dm_target *ti)
>>>>       timer_setup(&tgt->enospc_timer, qcow2_enospc_timer, 0);
>>>>       ti->private = tgt;
>>>>       tgt->ti = ti;
>>>> +
>>>> +    INIT_WORK(&tgt->backward_merge.work, qcow2_merge_backward_work);
>>>> +
>>>>       qcow2_set_service_operations(ti, false);
>>>>       return tgt;
>>>> diff --git a/drivers/md/dm-qcow2.h b/drivers/md/dm-qcow2.h
>>>> index a89fe3db2196d..bebfdc50ed6d4 100644
>>>> --- a/drivers/md/dm-qcow2.h
>>>> +++ b/drivers/md/dm-qcow2.h
>>>> @@ -149,6 +149,20 @@ struct md_page {
>>>>       struct list_head wpc_readers_wait_list;
>>>>   };
>>>> +enum qcow2_backward_merge_state {
>>>> +    BACKWARD_MERGE_STOPPED = 0,
>>>> +    BACKWARD_MERGE_START,
>>>> +    BACKWARD_MERGE_RUN,
>>>> +    BACKWARD_MERGE_WAIT_COMPLETION,
>>>> +    BACKWARD_MERGE_STOP,
>>>> +};
>>>> +
>>>> +struct qcow2_backward_merge {
>>>> +    struct work_struct work;
>>>> +    enum qcow2_backward_merge_state state;
>>>> +    int error;
>>>> +};
>>>> +
>>>>   struct qcow2_target {
>>>>       struct dm_target *ti;
>>>>   #define QCOW2_QRQ_POOL_SIZE 512 /* Twice nr_requests from 
>>>> blk_mq_init_sched() */
>>>> @@ -180,6 +194,8 @@ struct qcow2_target {
>>>>       struct work_struct event_work;
>>>>       spinlock_t event_lock;
>>>>       struct mutex ctl_mutex;
>>>> +
>>>> +    struct qcow2_backward_merge backward_merge;
>>>>   };
>>>>   enum {
>>>> @@ -375,6 +391,9 @@ int qcow2_inflight_ref_switch(struct 
>>>> qcow2_target *tgt);
>>>>   void qcow2_flush_deferred_activity(struct qcow2_target *tgt, 
>>>> struct qcow2 *qcow2);
>>>>   int qcow2_truncate_safe(struct file *file, loff_t new_len);
>>>> +void qcow2_merge_backward_work(struct work_struct *work);
>>>> +void qcow2_merge_backward_cancel(struct qcow2_target *tgt);
>>>> +
>>>>   static inline struct qcow2_target *to_qcow2_target(struct 
>>>> dm_target *ti)
>>>>   {
>>>>       return ti->private;
>>>
>>
> 

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.



More information about the Devel mailing list