[Devel] [PATCH RHEL8 COMMIT] fence-watchdog: Print alive messages
Konstantin Khorenko
khorenko at virtuozzo.com
Tue May 11 18:48:28 MSK 2021
The commit is pushed to "branch-rh8-4.18.0-240.1.1.vz8.5.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-240.1.1.vz8.5.27
------>
commit a8728988a7d157782bd41198c631c98e465b7794
Author: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Date: Tue May 11 18:48:28 2021 +0300
fence-watchdog: Print alive messages
We have a situation when node worked for ~8 days and came to The Point
where jiffies > fence_wdog_jiffies64 (what ever updated
/sys/kernel/watchdog_timer stopped doing these) but node lived after
The Point for 17 more hours, these means that nobody called
fence_wdog_check_timer() for 17 hours, else crash or reboot would've
happened earlier.
When fence_wdog_check_timer() was called the first time after The Point
we started reboot action. As jiffies were even bigger than
(fence_wdog_jiffies64 + 30 * HZ) and we got to fence_wdog_check_timer()
again we crashed (we assume reboot action took to long if it does not
fit in 30 seconds after fence_wdog_jiffies64).
So it seem to be quiet a valid situation, except that
fence_wdog_check_timer() was not called for 17 hours. That means(should
mean) that there were no network activity on the node for 17 hours - no
packets send/received.
note: I mentioned vmxnet3 NIC on the node.
So to be sure that we never got to fence_wdog_check_timer make it print
alive messages to syslog. If no messages in syslog we might need to
understand why vmxnet3 does not get to dev_hard_start_xmit() and
net_rx_action().
Crashdump backtrace:
[ 84.606838] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
...
[ 198.756251] arp_tables: (C) 2002 David S. Miller
[855063.554579] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=61572840 jiffies, g=55111116, c=55111115, q=15)
[855063.554659] Task dump for CPU 1:
[855063.554662] MDNS-resolver R ffff88023250a020 0 1705 1 0 0x00000008
[855063.554665] ffff88023250a020 ffff88023250a7c0 ffff88022d613da0 ffffffff8109d2b1
[855063.554667] ffff88022d613e28 ffff88022d613e70 0000000000000293 0000000000001d4c
[855063.554669] 0000000000000004 ffff880035a35188 24e96a4dd65d19eb 00000000000c1b96
[855063.554670] Call Trace:
[855063.554677] [<ffffffff8109d2b1>] ? dequeue_signal+0xd1/0x170
[855063.554681] [<ffffffff810ef910>] ? ktime_get_ts64+0xa0/0xf0
[855063.554683] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
[855063.554687] [<ffffffff812303c0>] ? select_estimate_accuracy+0x40/0x120
[855063.554688] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
[855063.554693] [<ffffffff81267291>] ? ep_poll+0xa1/0x360
[855063.554695] [<ffffffff8109da81>] ? __set_task_blocked+0x41/0xa0
[855063.554697] [<ffffffff8126877d>] ? SyS_epoll_wait+0xed/0x120
[855063.554702] [<ffffffff816b15c9>] ? system_call_fastpath+0x16/0x1b
[855063.555345] Kernel panic - not syncing: fence-watchdog: reboot
[855063.555375] CPU: 1 PID: 1705 Comm: MDNS-resolver ve: 0 Not tainted 3.10.0-693.1.1.vz7.37.18 #1 37.18
[855063.555404] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
[855063.555436] 000000000000000e 00000000f2e82b7f ffff88023fd03aa8 ffffffff816a0288
[855063.555462] ffff88023fd03b28 ffffffff8169b4a7 ffffffff00000010 ffff88023fd03b38
[855063.555487] ffff88023fd03ad8 00000000f2e82b7f 0000000000000000 ffffffff81905aad
[855063.555513] Call Trace:
[855063.556123] <IRQ> [<ffffffff816a0288>] dump_stack+0x19/0x1b
[855063.556555] [<ffffffff8169b4a7>] panic+0xe8/0x20d
[855063.556998] [<ffffffffc0569020>] ? iptable_nat_ipv4_fn+0x20/0x20 [iptable_nat]
[855063.557434] [<ffffffff8119130f>] fence_wdog_do_fence+0x4f/0x50
[855063.557880] [<ffffffff81191382>] fence_wdog_check_timer+0x72/0x80
[855063.558310] [<ffffffff8158b458>] dev_hard_start_xmit+0x38/0x3b0
[855063.558745] [<ffffffff8158e5b8>] __dev_queue_xmit+0x448/0x550
[855063.559173] [<ffffffff8158e6d0>] dev_queue_xmit+0x10/0x20
[855063.559599] [<ffffffff815d548e>] ip_finish_output+0x52e/0x780
[855063.560041] [<ffffffff815d5753>] ip_output+0x73/0xe0
[855063.560468] [<ffffffff815d4f60>] ? __ip_append_data.isra.51+0xa50/0xa50
[855063.560924] [<ffffffff815d3601>] ip_local_out_sk+0x31/0x40
[855063.561355] [<ffffffff815d3973>] ip_queue_xmit+0x143/0x390
[855063.561797] [<ffffffff815ed4ff>] tcp_transmit_skb+0x4cf/0x9b0
[855063.562234] [<ffffffff815f0cf0>] ? tcp_init_xmit_timers+0x30/0x30
[855063.562697] [<ffffffff815ef64b>] tcp_xmit_probe_skb+0xdb/0xf0
[855063.563147] [<ffffffff815f09a9>] tcp_write_wakeup+0x59/0x170
[855063.563584] [<ffffffff815f0e9e>] tcp_keepalive_timer+0x1ae/0x260
[855063.564030] [<ffffffff81098da6>] call_timer_fn+0x36/0x110
[855063.564465] [<ffffffff815f0cf0>] ? tcp_init_xmit_timers+0x30/0x30
[855063.564913] [<ffffffff8109b2bd>] run_timer_softirq+0x22d/0x310
[855063.565365] [<ffffffff816b3e9f>] __do_softirq+0x10f/0x2aa
[855063.565862] [<ffffffff816b255c>] call_softirq+0x1c/0x30
[855063.566323] [<ffffffff8102d2e5>] do_softirq+0x65/0xa0
[855063.566814] [<ffffffff81092425>] irq_exit+0x105/0x110
[855063.567265] [<ffffffff816b3cc2>] smp_apic_timer_interrupt+0x42/0x50
[855063.567722] [<ffffffff816b310d>] apic_timer_interrupt+0x6d/0x80
[855063.568152] <EOI> [<ffffffff810ef910>] ? ktime_get_ts64+0xa0/0xf0
[855063.568589] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
[855063.569033] [<ffffffff812303c0>] select_estimate_accuracy+0x40/0x120
[855063.569454] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
[855063.569880] [<ffffffff81267291>] ep_poll+0xa1/0x360
[855063.570271] [<ffffffff8109da81>] ? __set_task_blocked+0x41/0xa0
[855063.570670] [<ffffffff8126877d>] SyS_epoll_wait+0xed/0x120
[855063.571049] [<ffffffff816b15c9>] system_call_fastpath+0x16/0x1b
https://pmc.acronis.com/browse/TTASK-22056
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktkhai at virtuozzo.com>
(cherry picked from vz7 commit 774bb049ba0d ("fence-watchdog: print alive
messages"))
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko at virtuozzo.com>
---
kernel/fence-watchdog.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/fence-watchdog.c b/kernel/fence-watchdog.c
index 34c01045c197..f165c407d6ed 100644
--- a/kernel/fence-watchdog.c
+++ b/kernel/fence-watchdog.c
@@ -150,6 +150,12 @@ void fence_wdog_do_fence(void)
inline int fence_wdog_check_timer(void)
{
+ static unsigned long print_alive_time;
+
+ if (fence_wdog_jiffies64 != MAX_U64)
+ if (printk_timed_ratelimit(&print_alive_time, 30*60*HZ))
+ printk("fence-watchdog: alive\n");
+
if (unlikely(get_jiffies_64() > fence_wdog_jiffies64 &&
fence_wdog_action != FENCE_WDOG_NETFILTER)) {
if (atomic_cmpxchg(&fence_stage, NOT_FENCED, FENCED) == NOT_FENCED
More information about the Devel
mailing list