[Devel] [PATCH RHEL7 COMMIT] fence-watchdog: print alive messages

Konstantin Khorenko khorenko at virtuozzo.com
Wed Jan 31 16:08:55 MSK 2018


The commit is pushed to "branch-rh7-3.10.0-693.11.6.vz7.42.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-693.11.6.vz7.42.3
------>
commit 11bddd0543b88ae36d2e3fa7d5a60294365a1c4d
Author: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Date:   Wed Jan 31 16:08:55 2018 +0300

    fence-watchdog: print alive messages
    
    We have a situation when node worked for ~8 days and came to The Point
    where jiffies > fence_wdog_jiffies64 (what ever updated
    /sys/kernel/watchdog_timer stopped doing these) but node lived after
    The Point for 17 more hours, these means that nobody called
    fence_wdog_check_timer() for 17 hours, else crash or reboot would've
    happened earlier.
    
    When fence_wdog_check_timer() was called the first time after The Point
    we started reboot action. As jiffies were even bigger than
    (fence_wdog_jiffies64 + 30 * HZ) and we got to fence_wdog_check_timer()
    again we crashed (we assume reboot action took to long if it does not
    fit in 30 seconds after fence_wdog_jiffies64).
    
    So it seem to be quiet a valid situation, except that
    fence_wdog_check_timer() was not called for 17 hours. That means(should
    mean) that there were no network activity on the node for 17 hours - no
    packets send/received.
    
    note: I mentioned vmxnet3 NIC on the node.
    
    So to be sure that we never got to fence_wdog_check_timer make it print
    alive messages to syslog. If no messages in syslog we might need to
    understand why vmxnet3 does not get to dev_hard_start_xmit() and
    net_rx_action().
    
    Crashdump backtrace:
    [   84.606838] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
    ...
    [  198.756251] arp_tables: (C) 2002 David S. Miller
    [855063.554579] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=61572840 jiffies, g=55111116, c=55111115, q=15)
    [855063.554659] Task dump for CPU 1:
    [855063.554662] MDNS-resolver   R ffff88023250a020     0  1705      1    0 0x00000008
    [855063.554665]  ffff88023250a020 ffff88023250a7c0 ffff88022d613da0 ffffffff8109d2b1
    [855063.554667]  ffff88022d613e28 ffff88022d613e70 0000000000000293 0000000000001d4c
    [855063.554669]  0000000000000004 ffff880035a35188 24e96a4dd65d19eb 00000000000c1b96
    [855063.554670] Call Trace:
    [855063.554677]  [<ffffffff8109d2b1>] ? dequeue_signal+0xd1/0x170
    [855063.554681]  [<ffffffff810ef910>] ? ktime_get_ts64+0xa0/0xf0
    [855063.554683]  [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
    [855063.554687]  [<ffffffff812303c0>] ? select_estimate_accuracy+0x40/0x120
    [855063.554688]  [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
    [855063.554693]  [<ffffffff81267291>] ? ep_poll+0xa1/0x360
    [855063.554695]  [<ffffffff8109da81>] ? __set_task_blocked+0x41/0xa0
    [855063.554697]  [<ffffffff8126877d>] ? SyS_epoll_wait+0xed/0x120
    [855063.554702]  [<ffffffff816b15c9>] ? system_call_fastpath+0x16/0x1b
    [855063.555345] Kernel panic - not syncing: fence-watchdog: reboot
    
    [855063.555375] CPU: 1 PID: 1705 Comm: MDNS-resolver ve: 0 Not tainted 3.10.0-693.1.1.vz7.37.18 #1 37.18
    [855063.555404] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
    [855063.555436]  000000000000000e 00000000f2e82b7f ffff88023fd03aa8 ffffffff816a0288
    [855063.555462]  ffff88023fd03b28 ffffffff8169b4a7 ffffffff00000010 ffff88023fd03b38
    [855063.555487]  ffff88023fd03ad8 00000000f2e82b7f 0000000000000000 ffffffff81905aad
    [855063.555513] Call Trace:
    [855063.556123]  <IRQ>  [<ffffffff816a0288>] dump_stack+0x19/0x1b
    [855063.556555]  [<ffffffff8169b4a7>] panic+0xe8/0x20d
    [855063.556998]  [<ffffffffc0569020>] ? iptable_nat_ipv4_fn+0x20/0x20 [iptable_nat]
    [855063.557434]  [<ffffffff8119130f>] fence_wdog_do_fence+0x4f/0x50
    [855063.557880]  [<ffffffff81191382>] fence_wdog_check_timer+0x72/0x80
    [855063.558310]  [<ffffffff8158b458>] dev_hard_start_xmit+0x38/0x3b0
    [855063.558745]  [<ffffffff8158e5b8>] __dev_queue_xmit+0x448/0x550
    [855063.559173]  [<ffffffff8158e6d0>] dev_queue_xmit+0x10/0x20
    [855063.559599]  [<ffffffff815d548e>] ip_finish_output+0x52e/0x780
    [855063.560041]  [<ffffffff815d5753>] ip_output+0x73/0xe0
    [855063.560468]  [<ffffffff815d4f60>] ? __ip_append_data.isra.51+0xa50/0xa50
    [855063.560924]  [<ffffffff815d3601>] ip_local_out_sk+0x31/0x40
    [855063.561355]  [<ffffffff815d3973>] ip_queue_xmit+0x143/0x390
    [855063.561797]  [<ffffffff815ed4ff>] tcp_transmit_skb+0x4cf/0x9b0
    [855063.562234]  [<ffffffff815f0cf0>] ? tcp_init_xmit_timers+0x30/0x30
    [855063.562697]  [<ffffffff815ef64b>] tcp_xmit_probe_skb+0xdb/0xf0
    [855063.563147]  [<ffffffff815f09a9>] tcp_write_wakeup+0x59/0x170
    [855063.563584]  [<ffffffff815f0e9e>] tcp_keepalive_timer+0x1ae/0x260
    [855063.564030]  [<ffffffff81098da6>] call_timer_fn+0x36/0x110
    [855063.564465]  [<ffffffff815f0cf0>] ? tcp_init_xmit_timers+0x30/0x30
    [855063.564913]  [<ffffffff8109b2bd>] run_timer_softirq+0x22d/0x310
    [855063.565365]  [<ffffffff816b3e9f>] __do_softirq+0x10f/0x2aa
    [855063.565862]  [<ffffffff816b255c>] call_softirq+0x1c/0x30
    [855063.566323]  [<ffffffff8102d2e5>] do_softirq+0x65/0xa0
    [855063.566814]  [<ffffffff81092425>] irq_exit+0x105/0x110
    [855063.567265]  [<ffffffff816b3cc2>] smp_apic_timer_interrupt+0x42/0x50
    [855063.567722]  [<ffffffff816b310d>] apic_timer_interrupt+0x6d/0x80
    [855063.568152]  <EOI>  [<ffffffff810ef910>] ? ktime_get_ts64+0xa0/0xf0
    [855063.568589]  [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
    [855063.569033]  [<ffffffff812303c0>] select_estimate_accuracy+0x40/0x120
    [855063.569454]  [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
    [855063.569880]  [<ffffffff81267291>] ep_poll+0xa1/0x360
    [855063.570271]  [<ffffffff8109da81>] ? __set_task_blocked+0x41/0xa0
    [855063.570670]  [<ffffffff8126877d>] SyS_epoll_wait+0xed/0x120
    [855063.571049]  [<ffffffff816b15c9>] system_call_fastpath+0x16/0x1b
    
    https://pmc.acronis.com/browse/TTASK-22056
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    Reviewed-by: Kirill Tkhai <ktkhai at virtuozzo.com>
---
 kernel/fence-watchdog.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/fence-watchdog.c b/kernel/fence-watchdog.c
index 40d258a68d7c..3f65dbc69216 100644
--- a/kernel/fence-watchdog.c
+++ b/kernel/fence-watchdog.c
@@ -151,6 +151,12 @@ void fence_wdog_do_fence(void)
 
 inline int fence_wdog_check_timer(void)
 {
+	static unsigned long print_alive_time;
+
+	if (fence_wdog_jiffies64 != MAX_U64)
+		if (printk_timed_ratelimit(&print_alive_time, 30*60*HZ))
+			printk("fence-watchdog: alive\n");
+
 	if (unlikely(get_jiffies_64() > fence_wdog_jiffies64 &&
 			fence_wdog_action != FENCE_WDOG_NETFILTER)) {
 		if (atomic_cmpxchg(&fence_stage, NOT_FENCED, FENCED) == NOT_FENCED


More information about the Devel mailing list