[Devel] [PATCH RH7] fence-watchdog: print alive messages
Kirill Tkhai
ktkhai at virtuozzo.com
Wed Jan 24 12:41:39 MSK 2018
On 24.01.2018 11:50, Pavel Tikhomirov wrote:
> We have a situation when node worked for ~8 days and came to The Point
> where jiffies > fence_wdog_jiffies64 (what ever updated
> /sys/kernel/watchdog_timer stopped doing these) but node lived after
> The Point for 17 more hours, these means that nobody called
> fence_wdog_check_timer() for 17 hours, else crash or reboot would've
> happened earlier.
>
> When fence_wdog_check_timer() was called the first time after The Point
> we started reboot action. As jiffies were even bigger than
> (fence_wdog_jiffies64 + 30 * HZ) and we got to fence_wdog_check_timer()
> again we crashed (we assume reboot action took to long if it does not
> fit in 30 seconds after fence_wdog_jiffies64).
>
> So it seem to be quiet a valid situation, except that
> fence_wdog_check_timer() was not called for 17 hours. That means(should
> mean) that there were no network activity on the node for 17 hours - no
> packets send/received.
>
> note: I mentioned vmxnet3 NIC on the node.
>
> So to be sure that we never got to fence_wdog_check_timer make it print
> alive messages to syslog. If no messages in syslog we might need to
> understand why vmxnet3 does not get to dev_hard_start_xmit() and
> net_rx_action().
>
> Crashdump backtrace:
> [ 84.606838] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
> ...
> [ 198.756251] arp_tables: (C) 2002 David S. Miller
> [855063.554579] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=61572840 jiffies, g=55111116, c=55111115, q=15)
> [855063.554659] Task dump for CPU 1:
> [855063.554662] MDNS-resolver R ffff88023250a020 0 1705 1 0 0x00000008
> [855063.554665] ffff88023250a020 ffff88023250a7c0 ffff88022d613da0 ffffffff8109d2b1
> [855063.554667] ffff88022d613e28 ffff88022d613e70 0000000000000293 0000000000001d4c
> [855063.554669] 0000000000000004 ffff880035a35188 24e96a4dd65d19eb 00000000000c1b96
> [855063.554670] Call Trace:
> [855063.554677] [<ffffffff8109d2b1>] ? dequeue_signal+0xd1/0x170
> [855063.554681] [<ffffffff810ef910>] ? ktime_get_ts64+0xa0/0xf0
> [855063.554683] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
> [855063.554687] [<ffffffff812303c0>] ? select_estimate_accuracy+0x40/0x120
> [855063.554688] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
> [855063.554693] [<ffffffff81267291>] ? ep_poll+0xa1/0x360
> [855063.554695] [<ffffffff8109da81>] ? __set_task_blocked+0x41/0xa0
> [855063.554697] [<ffffffff8126877d>] ? SyS_epoll_wait+0xed/0x120
> [855063.554702] [<ffffffff816b15c9>] ? system_call_fastpath+0x16/0x1b
> [855063.555345] Kernel panic - not syncing: fence-watchdog: reboot
>
> [855063.555375] CPU: 1 PID: 1705 Comm: MDNS-resolver ve: 0 Not tainted 3.10.0-693.1.1.vz7.37.18 #1 37.18
> [855063.555404] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
> [855063.555436] 000000000000000e 00000000f2e82b7f ffff88023fd03aa8 ffffffff816a0288
> [855063.555462] ffff88023fd03b28 ffffffff8169b4a7 ffffffff00000010 ffff88023fd03b38
> [855063.555487] ffff88023fd03ad8 00000000f2e82b7f 0000000000000000 ffffffff81905aad
> [855063.555513] Call Trace:
> [855063.556123] <IRQ> [<ffffffff816a0288>] dump_stack+0x19/0x1b
> [855063.556555] [<ffffffff8169b4a7>] panic+0xe8/0x20d
> [855063.556998] [<ffffffffc0569020>] ? iptable_nat_ipv4_fn+0x20/0x20 [iptable_nat]
> [855063.557434] [<ffffffff8119130f>] fence_wdog_do_fence+0x4f/0x50
> [855063.557880] [<ffffffff81191382>] fence_wdog_check_timer+0x72/0x80
> [855063.558310] [<ffffffff8158b458>] dev_hard_start_xmit+0x38/0x3b0
> [855063.558745] [<ffffffff8158e5b8>] __dev_queue_xmit+0x448/0x550
> [855063.559173] [<ffffffff8158e6d0>] dev_queue_xmit+0x10/0x20
> [855063.559599] [<ffffffff815d548e>] ip_finish_output+0x52e/0x780
> [855063.560041] [<ffffffff815d5753>] ip_output+0x73/0xe0
> [855063.560468] [<ffffffff815d4f60>] ? __ip_append_data.isra.51+0xa50/0xa50
> [855063.560924] [<ffffffff815d3601>] ip_local_out_sk+0x31/0x40
> [855063.561355] [<ffffffff815d3973>] ip_queue_xmit+0x143/0x390
> [855063.561797] [<ffffffff815ed4ff>] tcp_transmit_skb+0x4cf/0x9b0
> [855063.562234] [<ffffffff815f0cf0>] ? tcp_init_xmit_timers+0x30/0x30
> [855063.562697] [<ffffffff815ef64b>] tcp_xmit_probe_skb+0xdb/0xf0
> [855063.563147] [<ffffffff815f09a9>] tcp_write_wakeup+0x59/0x170
> [855063.563584] [<ffffffff815f0e9e>] tcp_keepalive_timer+0x1ae/0x260
> [855063.564030] [<ffffffff81098da6>] call_timer_fn+0x36/0x110
> [855063.564465] [<ffffffff815f0cf0>] ? tcp_init_xmit_timers+0x30/0x30
> [855063.564913] [<ffffffff8109b2bd>] run_timer_softirq+0x22d/0x310
> [855063.565365] [<ffffffff816b3e9f>] __do_softirq+0x10f/0x2aa
> [855063.565862] [<ffffffff816b255c>] call_softirq+0x1c/0x30
> [855063.566323] [<ffffffff8102d2e5>] do_softirq+0x65/0xa0
> [855063.566814] [<ffffffff81092425>] irq_exit+0x105/0x110
> [855063.567265] [<ffffffff816b3cc2>] smp_apic_timer_interrupt+0x42/0x50
> [855063.567722] [<ffffffff816b310d>] apic_timer_interrupt+0x6d/0x80
> [855063.568152] <EOI> [<ffffffff810ef910>] ? ktime_get_ts64+0xa0/0xf0
> [855063.568589] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
> [855063.569033] [<ffffffff812303c0>] select_estimate_accuracy+0x40/0x120
> [855063.569454] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
> [855063.569880] [<ffffffff81267291>] ep_poll+0xa1/0x360
> [855063.570271] [<ffffffff8109da81>] ? __set_task_blocked+0x41/0xa0
> [855063.570670] [<ffffffff8126877d>] SyS_epoll_wait+0xed/0x120
> [855063.571049] [<ffffffff816b15c9>] system_call_fastpath+0x16/0x1b
>
> https://pmc.acronis.com/browse/TTASK-22056
> Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktkhai at virtuozzo.com>
> ---
> kernel/fence-watchdog.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/fence-watchdog.c b/kernel/fence-watchdog.c
> index 40d258a68d7c..3f65dbc69216 100644
> --- a/kernel/fence-watchdog.c
> +++ b/kernel/fence-watchdog.c
> @@ -151,6 +151,12 @@ void fence_wdog_do_fence(void)
>
> inline int fence_wdog_check_timer(void)
> {
> + static unsigned long print_alive_time;
> +
> + if (fence_wdog_jiffies64 != MAX_U64)
> + if (printk_timed_ratelimit(&print_alive_time, 30*60*HZ))
> + printk("fence-watchdog: alive\n");
> +
> if (unlikely(get_jiffies_64() > fence_wdog_jiffies64 &&
> fence_wdog_action != FENCE_WDOG_NETFILTER)) {
> if (atomic_cmpxchg(&fence_stage, NOT_FENCED, FENCED) == NOT_FENCED
>
More information about the Devel
mailing list