[Devel] [PATCH RHEL9 COMMIT] core: Add glob_kstat, percpu kstat and account mm stat

Konstantin Khorenko khorenko at virtuozzo.com
Wed Oct 20 11:39:24 MSK 2021


The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh9-5.14.0-4.vz9.10.12
------>
commit 5d538c31e84c27f63d9514ebb7a1d864151ef611
Author: Kirill Tkhai <ktkhai at virtuozzo.com>
Date:   Wed Oct 20 11:39:24 2021 +0300

    core: Add glob_kstat, percpu kstat and account mm stat
    
    Adds latency calculation for:
      kstat_glob.swap_in
      kstat_glob.page_in
      kstat_glob.alloc_lat
    And a fail count in:
      kstat_glob.alloc_fails
    (the measurement pattern they share is sketched just below).
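    Condensed from the mm/memory.c hunks in the diff (a sketch, not the
    complete code), the pattern is:

        cycles_t start = get_cycles();  /* timestamp before the operation */

        /* ... swap-in / fault handling work ... */

        local_irq_disable();
        KSTAT_LAT_PCPU_ADD(&kstat_glob.swap_in,
                           CLKS2NSEC(get_cycles() - start));
        local_irq_enable();

    alloc_lat and alloc_fails are updated analogously by
    __alloc_collect_stats() in mm/page_alloc.c, using sched_clock() rather
    than get_cycles().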
    
    Also incorporates the following fixup patches:
      kstat: Make kstat_glob::swap_in percpu - core part
      ve/mm/kstat: Port diff-ve-kstat-disable-interrupts-around-seqcount-write-lock
    
    Related buglinks:
    https://jira.sw.ru/browse/PCLIN-31259
    https://jira.sw.ru/browse/PSBM-33650
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    Signed-off-by: Konstantin Khlebnikov <khlebnikov at openvz.org>
    
    Signed-off-by: Vladimir Davydov <vdavydov at parallels.com>
    
    Rebase to vz8:
    
    The commit [1] tries to reimplement the swap_in part of this patch, but
    on rebase it loses the "goto out" hunk added in vz7.150.1; bring the
    hunk back.
    
    Note: on rebase I would prefer merging [1] into this patch rather than
    merging this one into [1].
    
    Add vzstat.h includes where needed and replace __GFP_WAIT with its
    successor __GFP_RECLAIM; skip kstat_init as it is already present.
    
    https://jira.sw.ru/browse/PSBM-127780
    (cherry-picked from vz7 commit 9caa91f6a857 ("core: Add glob_kstat, percpu kstat
    and account mm stat"))
    
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    
    +++
    vzstat: account "page_in" and "swap_in" in nanoseconds
    
    Up to now, "page_in" and "swap_in" in /proc/vz/latency have been
    reported in CPU cycles, while the other latencies there are in
    nanoseconds.
    
    Let's use a single unit of measure for all latencies and report swap_in
    and page_in in nanoseconds as well.
    
    Note: we keep the time accounting based on direct rdtsc(), converting
    to nanoseconds afterwards. We understand this has possible correctness
    issues and that using ktime_to_ns(ktime_get()) would be better (as is
    done for the other latencies), but switching to ktime_get() results in
    a 2% performance loss on first memory access (page fault + memory
    read), so we decided not to slow down the fast path and to accept the
    possible inaccuracy in the stats.
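    For illustration, the conversion uses the CLKS2NSEC() macro added in
    the diff below; with an assumed 3 GHz TSC (tsc_khz = 3000000, an
    example value, not taken from the patch):

        #define CLKS2NSEC(c)    ((c) * 1000000 / tsc_khz)

        /* 6000 cycles * 1000000 / 3000000 kHz = 2000 ns (2 us) */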
    
    https://pmc.acronis.com/browse/VSTOR-16659
    
    Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
    
    (cherry-picked from vz7 commit aedfe36c7fc5 ("vzstat: account "page_in" and
    "swap_in" in nanoseconds"))
    Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
    
    +++
    kstat: Make kstat_glob::swap_in percpu
    
    Patchset description:
    Make kstat_glob::swap_in percpu and cleanup
    
    This patchset continues moving away from kstat_glb_lock and makes
    swap_in percpu. Newly unused primitives are dropped, and memory usage
    is reduced by using a single percpu seqcount instead of a separate
    percpu seqcount for every kstat percpu variable.
    
    Kirill Tkhai (4):
          kstat: Make kstat_glob::swap_in percpu
          kstat: Drop global kstat_lat_struct
          kstat: Drop cpu argument in KSTAT_LAT_PCPU_ADD()
          kstat: Make global percpu kstat_pcpu_seq instead of percpu seq for every
                 variable
    
    ==========================================
    This patch description:
    
    Using a global lock is not good for scalability. Better to make
    swap_in percpu, so it is updated locklessly like the other statistics
    (e.g., page_in).
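    A generic sketch of this percpu pattern (illustrative only: the
    struct, variable and function names below are hypothetical, not the
    actual vzstat.h definitions). Each CPU updates only its own
    accumulator under a seqcount with interrupts disabled; a reader sums
    over all CPUs, retrying any CPU whose writer was in progress:

        #include <linux/percpu.h>
        #include <linux/seqlock.h>
        #include <linux/types.h>

        struct lat_pcpu {
                seqcount_t seq;         /* guards the three fields below */
                u64 totlat;             /* sum of observed latencies, ns */
                u64 maxlat;             /* worst observed latency, ns    */
                u64 count;              /* number of samples             */
        };

        static DEFINE_PER_CPU(struct lat_pcpu, swap_in_lat);
        /* each CPU's ->seq is seqcount_init()'ed once at boot */

        static void lat_pcpu_add(u64 delta)     /* caller disabled IRQs */
        {
                struct lat_pcpu *p = this_cpu_ptr(&swap_in_lat);

                write_seqcount_begin(&p->seq);
                p->totlat += delta;
                if (delta > p->maxlat)
                        p->maxlat = delta;
                p->count++;
                write_seqcount_end(&p->seq);
        }

    A reader would walk for_each_possible_cpu(), sampling each CPU's
    fields under read_seqcount_begin()/read_seqcount_retry() and summing
    them.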
    
    Signed-off-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    Ported to vz8:
     - Dropped the whole patchset except this patch, since it is already
       partially included
     - Introduced 'start' in do_swap_page() and use it for kstat_glob.swap_in
    
    (cherry picked from vz7 commit ed033a381e01 ("kstat: Make kstat_glob::swap_in
    percpu"))
    Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
    
    Reviewed-by: Kirill Tkhai <ktkhai at virtuozzo.com>
    
    +++
    mm/page_alloc: use sched_clock() instead of jiffies to measure latency
    
    sched_clock() (which is based on rdtsc() on x86) gives a more precise
    result than jiffies.
    
    Q: Why do we need greater accuracy?
    A: Because if we target, say, 10000 IOPS per CPU, then 1 ms of memory
       allocation latency is already too much; we need to achieve lower
       allocation latency, and therefore need to be able to measure it
       (see the arithmetic below).
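    To make the precision argument explicit (assuming HZ=1000, i.e. a 1 ms
    jiffy -- an assumption for illustration, not stated in the patch):

        10000 IOPS per CPU  =>  1 s / 10000 = 100 us budget per request
        jiffies granularity  =  1 ms (10x the whole per-request budget)
        sched_clock()        =  nanosecond resolution

    so a jiffies-based latency would read 0 or 1 jiffy for nearly every
    allocation, while sched_clock() can actually resolve it.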
    
    https://pmc.acronis.com/browse/VSTOR-19040
    
    Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    (cherry-picked from vz7 commit 99407f6d6f50 ("mm/page_alloc: use
    sched_clock() instead of jiffies to measure latency"))
    
    Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
    
    +++
    ve/kstat/alloc_lat: Don't separate GFP_HIGHMEM and !GFP_HIGHMEM allocation latencies
    
    We mostly use 64-bit systems these days. Since they don't have
    highmem, it's better not to segregate GFP_HIGHMEM and !GFP_HIGHMEM
    latencies. For backward compatibility we still output the
    alochigh/alochighmp fields in /proc/vz/latency, but they now show only
    zeroes.
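    The resulting classification, condensed from __alloc_collect_stats()
    in the mm/page_alloc.c hunk below (note that no highmem bucket is
    selected any more):

        if (!(gfp_mask & __GFP_RECLAIM))
                ind = KSTAT_ALLOCSTAT_ATOMIC;   /* atomic allocation        */
        else if (order > 0)
                ind = KSTAT_ALLOCSTAT_LOW_MP;   /* multi-page, normal zone  */
        else
                ind = KSTAT_ALLOCSTAT_LOW;      /* single page, normal zone */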
    
    https://jira.sw.ru/browse/PSBM-81395
    Signed-off-by: Andrey Ryabinin <aryabinin at virtuozzo.com>
    
    https://jira.sw.ru/browse/PSBM-127780
    (cherry-picked from vz7 commit 1fcbaf6d1fb2 ("ve/kstat/alloc_lat: Don't separate
    GFP_HIGHMEM and !GFP_HIGHMEM allocation latencies"))
    
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    
    (cherry-picked from vz8 commit ad75d76f5a08 ("core: Add glob_kstat,
    percpu kstat and account mm stat"))
    
    Signed-off-by: Nikita Yushchenko <nikita.yushchenko at virtuozzo.com>
---
 mm/memory.c     | 22 +++++++++++++++++++---
 mm/page_alloc.c | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2d0bc5ab5884..2511db99634e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -83,6 +83,7 @@
 #include <linux/uaccess.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
+#include <asm/tsc.h>
 
 #include "pgalloc-track.h"
 #include "internal.h"
@@ -3470,6 +3471,8 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+#define CLKS2NSEC(c)	((c) * 1000000 / tsc_khz)
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3489,7 +3492,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	int exclusive = 0;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
+	cycles_t start;
 
+	start = get_cycles();
 	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
 		goto out;
 
@@ -3693,6 +3698,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 out:
 	if (si)
 		put_swap_device(si);
+
+	local_irq_disable();
+	KSTAT_LAT_PCPU_ADD(&kstat_glob.swap_in,
+			   CLKS2NSEC(get_cycles() - start));
+	local_irq_enable();
+
 	return ret;
 out_nomap:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3704,9 +3715,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		unlock_page(swapcache);
 		put_page(swapcache);
 	}
-	if (si)
-		put_swap_device(si);
-	return ret;
+	goto out;
 }
 
 /*
@@ -3834,6 +3843,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
+	cycles_t start;
 
 	/*
 	 * Preallocate pte before we take page_lock because this might lead to
@@ -3857,6 +3867,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		smp_wmb(); /* See comment in __pte_alloc() */
 	}
 
+	start = get_cycles();
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
@@ -3875,6 +3886,11 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
 
+	local_irq_disable();
+	KSTAT_LAT_PCPU_ADD(&kstat_glob.page_in,
+			   CLKS2NSEC(get_cycles() - start));
+	local_irq_enable();
+
 	return ret;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c9afe924231..3fa186ba631f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -73,6 +73,7 @@
 #include <linux/padata.h>
 #include <linux/khugepaged.h>
 #include <linux/buffer_head.h>
+#include <linux/vzstat.h>
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -5405,6 +5406,34 @@ static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
 	}
 }
 
+static void __alloc_collect_stats(gfp_t gfp_mask, unsigned int order,
+		struct page *page, u64 time)
+{
+#ifdef CONFIG_VE
+	unsigned long flags;
+	u64 current_clock, delta;
+	int ind, cpu;
+
+	current_clock = sched_clock();
+	delta = current_clock - time;
+	if (!(gfp_mask & __GFP_RECLAIM))
+		ind = KSTAT_ALLOCSTAT_ATOMIC;
+	else
+		if (order > 0)
+			ind = KSTAT_ALLOCSTAT_LOW_MP;
+		else
+			ind = KSTAT_ALLOCSTAT_LOW;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	KSTAT_LAT_PCPU_ADD(&kstat_glob.alloc_lat[ind], delta);
+
+	if (!page)
+		kstat_glob.alloc_fails[cpu][ind]++;
+	local_irq_restore(flags);
+#endif
+}
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -5415,6 +5444,7 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = { };
+	u64 start;
 
 	/*
 	 * There are several places where we assume that the order value is sane
@@ -5448,6 +5478,7 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 	 */
 	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);
 
+	start = sched_clock();
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
 	if (likely(page))
@@ -5471,6 +5502,7 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 		page = NULL;
 	}
 
+	__alloc_collect_stats(alloc_gfp, order, page, start);
 	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
 
 	return page;

