[Devel] [PATCH vz9 1/9] ms/x86/crash: Disable virt in core NMI crash handler to avoid double shootdown

Konstantin Khorenko khorenko at virtuozzo.com
Wed Nov 1 16:38:48 MSK 2023


From: Maxim Levitsky <mlevitsk at redhat.com>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2177720

commit 26044aff37a5455b19a91785086914fd33053ef4
Author: Sean Christopherson <seanjc at google.com>
Date:   Wed Nov 30 23:36:47 2022 +0000

    x86/crash: Disable virt in core NMI crash handler to avoid double shootdown

    Disable virtualization in crash_nmi_callback() and rework the
    emergency_vmx_disable_all() path to do an NMI shootdown if and only if a
    shootdown has not already occurred.  NMI crash shootdown fundamentally
    can't support multiple invocations as responding CPUs are deliberately
    put into halt state without unblocking NMIs.  But, the emergency reboot
    path doesn't have any work of its own, it simply cares about disabling
    virtualization, i.e. so long as a shootdown occurred, emergency reboot
    doesn't care who initiated the shootdown, or when.

    If "crash_kexec_post_notifiers" is specified on the kernel command line,
    panic() will invoke crash_smp_send_stop() and result in a second call to
    nmi_shootdown_cpus() during native_machine_emergency_restart().
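
    With this change the emergency restart path goes through a small
    wrapper that performs the shootdown only if the crash path has not
    already done one; condensed from the reboot.c hunk below:

      static inline void nmi_shootdown_cpus_on_restart(void)
      {
              /* A NULL callback is fine here: the NMI handler itself now
               * disables virtualization on every responding CPU. */
              if (!crash_ipi_issued)
                      nmi_shootdown_cpus(NULL);
      }

    emergency_vmx_disable_all() calls this wrapper instead of passing
    vmxoff_nmi() to nmi_shootdown_cpus() directly.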

    Invoke the callback _before_ disabling virtualization, as the current
    VMCS needs to be cleared before doing VMXOFF.  Note, this results in a
    subtle change in ordering between disabling virtualization and stopping
    Intel PT on the responding CPUs.  While VMX and Intel PT do interact,
    VMXOFF and writes to MSR_IA32_RTIT_CTL do not induce faults between one
    another, which is all that matters when panicking.
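
    Concretely, each responding CPU in crash_nmi_callback() now executes,
    in this order (condensed from the reboot.c hunk below):

      if (shootdown_callback)
              shootdown_callback(cpu, regs);  /* e.g. kdump: VMCLEAR loaded VMCSs */

      /* Only after the callback, so it can still execute VMX instructions. */
      cpu_emergency_disable_virtualization();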

    Harden nmi_shootdown_cpus() against multiple invocations to try and
    capture any such kernel bugs via a WARN instead of hanging the system
    during a crash/dump, e.g. prior to the recent hardening of
    register_nmi_handler(), re-registering the NMI handler would trigger a
    double list_add() and hang the system if CONFIG_BUG_ON_DATA_CORRUPTION=y.

     list_add double add: new=ffffffff82220800, prev=ffffffff8221cfe8, next=ffffffff82220800.
     WARNING: CPU: 2 PID: 1319 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
     Call Trace:
      __register_nmi_handler+0xcf/0x130
      nmi_shootdown_cpus+0x39/0x90
      native_machine_emergency_restart+0x1c9/0x1d0
      panic+0x237/0x29b
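
    The resulting one-shot guard at the top of nmi_shootdown_cpus() is
    simply (from the reboot.c hunk below):

      /* A second shootdown cannot work: the other CPUs are parked in HLT
       * with NMIs still blocked, and re-registering the NMI handler would
       * corrupt the handler list (the splat above). */
      if (WARN_ON_ONCE(crash_ipi_issued))
              return;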

    Extract the disabling logic to a common helper to deduplicate code, and
    to prepare for doing the shootdown in the emergency reboot path if SVM
    is supported.
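
    The helper is nothing more than the two per-vendor disables that were
    previously open-coded in the kdump and crash-shutdown paths:

      void cpu_emergency_disable_virtualization(void)
      {
              cpu_emergency_vmxoff();
              cpu_emergency_svm_disable();
      }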

    Note, prior to commit ed72736183c4 ("x86/reboot: Force all cpus to exit
    VMX root if VMX is supported"), nmi_shootdown_cpus() was subtly protected
    against a second invocation by a cpu_vmx_enabled() check as the kdump
    handler would disable VMX if it ran first.

    Fixes: ed72736183c4 ("x86/reboot: Force all cpus to exit VMX root if VMX is supported")
    Cc: stable at vger.kernel.org
    Reported-by: Guilherme G. Piccoli <gpiccoli at igalia.com>
    Cc: Vitaly Kuznetsov <vkuznets at redhat.com>
    Cc: Paolo Bonzini <pbonzini at redhat.com>
    Link: https://lore.kernel.org/all/20220427224924.592546-2-gpiccoli@igalia.com
    Tested-by: Guilherme G. Piccoli <gpiccoli at igalia.com>
    Reviewed-by: Thomas Gleixner <tglx at linutronix.de>
    Link: https://lore.kernel.org/r/20221130233650.1404148-2-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc at google.com>

Signed-off-by: Maxim Levitsky <mlevitsk at redhat.com>

(cherry picked from CentOS 9 Stream commit 0695a61e4f96)
https://pmc.acronis.work/browse/VSTOR-76102
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>

Feature: fix ms/KVM
---
 arch/x86/include/asm/reboot.h |  2 ++
 arch/x86/kernel/crash.c       | 17 +--------
 arch/x86/kernel/reboot.c      | 65 ++++++++++++++++++++++++++++-------
 3 files changed, 56 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 04c17be9b5fd..bc5b4d788c08 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -25,6 +25,8 @@ void __noreturn machine_real_restart(unsigned int type);
 #define MRR_BIOS	0
 #define MRR_APM		1
 
+void cpu_emergency_disable_virtualization(void);
+
 typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
 void nmi_panic_self_stop(struct pt_regs *regs);
 void nmi_shootdown_cpus(nmi_shootdown_cb callback);
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e8326a8d1c5d..eddd5e066990 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -37,7 +37,6 @@
 #include <linux/kdebug.h>
 #include <asm/cpu.h>
 #include <asm/reboot.h>
-#include <asm/virtext.h>
 #include <asm/intel_pt.h>
 #include <asm/crash.h>
 #include <asm/cmdline.h>
@@ -81,15 +80,6 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
 	 */
 	cpu_crash_vmclear_loaded_vmcss();
 
-	/* Disable VMX or SVM if needed.
-	 *
-	 * We need to disable virtualization on all CPUs.
-	 * Having VMX or SVM enabled on any CPU may break rebooting
-	 * after the kdump kernel has finished its task.
-	 */
-	cpu_emergency_vmxoff();
-	cpu_emergency_svm_disable();
-
 	/*
 	 * Disable Intel PT to stop its logging
 	 */
@@ -148,12 +138,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	 */
 	cpu_crash_vmclear_loaded_vmcss();
 
-	/* Booting kdump kernel with VMX or SVM enabled won't work,
-	 * because (among other limitations) we can't disable paging
-	 * with the virt flags.
-	 */
-	cpu_emergency_vmxoff();
-	cpu_emergency_svm_disable();
+	cpu_emergency_disable_virtualization();
 
 	/*
 	 * Disable Intel PT to stop its logging
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index ebfb91108232..5dd46146dd1d 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -535,10 +535,7 @@ static inline void kb_wait(void)
 	}
 }
 
-static void vmxoff_nmi(int cpu, struct pt_regs *regs)
-{
-	cpu_emergency_vmxoff();
-}
+static inline void nmi_shootdown_cpus_on_restart(void);
 
 /* Use NMIs as IPIs to tell all CPUs to disable virtualization */
 static void emergency_vmx_disable_all(void)
@@ -561,7 +558,7 @@ static void emergency_vmx_disable_all(void)
 		__cpu_emergency_vmxoff();
 
 		/* Halt and exit VMX root operation on the other CPUs. */
-		nmi_shootdown_cpus(vmxoff_nmi);
+		nmi_shootdown_cpus_on_restart();
 	}
 }
 
@@ -802,6 +799,17 @@ void machine_crash_shutdown(struct pt_regs *regs)
 /* This is the CPU performing the emergency shutdown work. */
 int crashing_cpu = -1;
 
+/*
+ * Disable virtualization, i.e. VMX or SVM, to ensure INIT is recognized during
+ * reboot.  VMX blocks INIT if the CPU is post-VMXON, and SVM blocks INIT if
+ * GIF=0, i.e. if the crash occurred between CLGI and STGI.
+ */
+void cpu_emergency_disable_virtualization(void)
+{
+	cpu_emergency_vmxoff();
+	cpu_emergency_svm_disable();
+}
+
 #if defined(CONFIG_SMP)
 
 static nmi_shootdown_cb shootdown_callback;
@@ -824,7 +832,14 @@ static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
 		return NMI_HANDLED;
 	local_irq_disable();
 
-	shootdown_callback(cpu, regs);
+	if (shootdown_callback)
+		shootdown_callback(cpu, regs);
+
+	/*
+	 * Prepare the CPU for reboot _after_ invoking the callback so that the
+	 * callback can safely use virtualization instructions, e.g. VMCLEAR.
+	 */
+	cpu_emergency_disable_virtualization();
 
 	atomic_dec(&waiting_for_crash_ipi);
 	/* Assume hlt works */
@@ -835,18 +850,32 @@ static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
 	return NMI_HANDLED;
 }
 
-/*
- * Halt all other CPUs, calling the specified function on each of them
+/**
+ * nmi_shootdown_cpus - Stop other CPUs via NMI
+ * @callback:	Optional callback to be invoked from the NMI handler
+ *
+ * The NMI handler on the remote CPUs invokes @callback, if not
+ * NULL, first and then disables virtualization to ensure that
+ * INIT is recognized during reboot.
  *
- * This function can be used to halt all other CPUs on crash
- * or emergency reboot time. The function passed as parameter
- * will be called inside a NMI handler on all CPUs.
+ * nmi_shootdown_cpus() can only be invoked once. After the first
+ * invocation all other CPUs are stuck in crash_nmi_callback() and
+ * cannot respond to a second NMI.
  */
 void nmi_shootdown_cpus(nmi_shootdown_cb callback)
 {
 	unsigned long msecs;
+
 	local_irq_disable();
 
+	/*
+	 * Avoid certain doom if a shootdown already occurred; re-registering
+	 * the NMI handler will cause list corruption, modifying the callback
+	 * will do who knows what, etc...
+	 */
+	if (WARN_ON_ONCE(crash_ipi_issued))
+		return;
+
 	/* Make a note of crashing cpu. Will be used in NMI callback. */
 	crashing_cpu = safe_smp_processor_id();
 
@@ -874,7 +903,17 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
 		msecs--;
 	}
 
-	/* Leave the nmi callback set */
+	/*
+	 * Leave the nmi callback set, shootdown is a one-time thing.  Clearing
+	 * the callback could result in a NULL pointer dereference if a CPU
+	 * (finally) responds after the timeout expires.
+	 */
+}
+
+static inline void nmi_shootdown_cpus_on_restart(void)
+{
+	if (!crash_ipi_issued)
+		nmi_shootdown_cpus(NULL);
 }
 
 /*
@@ -904,6 +943,8 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
 	/* No other CPUs to shoot down */
 }
 
+static inline void nmi_shootdown_cpus_on_restart(void) { }
+
 void run_crash_ipi_callback(struct pt_regs *regs)
 {
 }
-- 
2.39.3


