[Devel] [PATCH RH7] fence-watchdog: print alive messages

Kirill Tkhai ktkhai at virtuozzo.com
Wed Jan 24 12:35:40 MSK 2018


On 24.01.2018 11:50, Pavel Tikhomirov wrote:
> We have a situation when node worked for ~8 days and came to The Point
> where jiffies > fence_wdog_jiffies64 (what ever updated
> /sys/kernel/watchdog_timer stopped doing these) but node lived after
> The Point for 17 more hours, these means that nobody called
> fence_wdog_check_timer() for 17 hours, else crash or reboot would've
> happened earlier.
> 
> When fence_wdog_check_timer() was called the first time after The Point
> we started reboot action. As jiffies were even bigger than
> (fence_wdog_jiffies64 + 30 * HZ) and we got to fence_wdog_check_timer()
> again we crashed (we assume reboot action took to long if it does not
> fit in 30 seconds after fence_wdog_jiffies64).
> 
> So it seem to be quiet a valid situation, except that
> fence_wdog_check_timer() was not called for 17 hours. That means(should
> mean) that there were no network activity on the node for 17 hours - no
> packets send/received.
> 
> note: I mentioned vmxnet3 NIC on the node.
> 
> So to be sure that we never got to fence_wdog_check_timer make it print
> alive messages to syslog. If no messages in syslog we might need to
> understand why vmxnet3 does not get to dev_hard_start_xmit() and
> net_rx_action().
> 
> Crashdump backtrace:
> [   84.606838] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
> ...
> [  198.756251] arp_tables: (C) 2002 David S. Miller
> [855063.554579] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=61572840 jiffies, g=55111116, c=55111115, q=15)
> [855063.554659] Task dump for CPU 1:
> [855063.554662] MDNS-resolver   R ffff88023250a020     0  1705      1    0 0x00000008
> [855063.554665]  ffff88023250a020 ffff88023250a7c0 ffff88022d613da0 ffffffff8109d2b1
> [855063.554667]  ffff88022d613e28 ffff88022d613e70 0000000000000293 0000000000001d4c
> [855063.554669]  0000000000000004 ffff880035a35188 24e96a4dd65d19eb 00000000000c1b96
> [855063.554670] Call Trace:
> [855063.554677]  [<ffffffff8109d2b1>] ? dequeue_signal+0xd1/0x170
> [855063.554681]  [<ffffffff810ef910>] ? ktime_get_ts64+0xa0/0xf0
> [855063.554683]  [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
> [855063.554687]  [<ffffffff812303c0>] ? select_estimate_accuracy+0x40/0x120
> [855063.554688]  [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
> [855063.554693]  [<ffffffff81267291>] ? ep_poll+0xa1/0x360
> [855063.554695]  [<ffffffff8109da81>] ? __set_task_blocked+0x41/0xa0
> [855063.554697]  [<ffffffff8126877d>] ? SyS_epoll_wait+0xed/0x120
> [855063.554702]  [<ffffffff816b15c9>] ? system_call_fastpath+0x16/0x1b
> [855063.555345] Kernel panic - not syncing: fence-watchdog: reboot
> 
> [855063.555375] CPU: 1 PID: 1705 Comm: MDNS-resolver ve: 0 Not tainted 3.10.0-693.1.1.vz7.37.18 #1 37.18
> [855063.555404] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
> [855063.555436]  000000000000000e 00000000f2e82b7f ffff88023fd03aa8 ffffffff816a0288
> [855063.555462]  ffff88023fd03b28 ffffffff8169b4a7 ffffffff00000010 ffff88023fd03b38
> [855063.555487]  ffff88023fd03ad8 00000000f2e82b7f 0000000000000000 ffffffff81905aad
> [855063.555513] Call Trace:
> [855063.556123]  <IRQ>  [<ffffffff816a0288>] dump_stack+0x19/0x1b
> [855063.556555]  [<ffffffff8169b4a7>] panic+0xe8/0x20d
> [855063.556998]  [<ffffffffc0569020>] ? iptable_nat_ipv4_fn+0x20/0x20 [iptable_nat]
> [855063.557434]  [<ffffffff8119130f>] fence_wdog_do_fence+0x4f/0x50
> [855063.557880]  [<ffffffff81191382>] fence_wdog_check_timer+0x72/0x80
> [855063.558310]  [<ffffffff8158b458>] dev_hard_start_xmit+0x38/0x3b0
> [855063.558745]  [<ffffffff8158e5b8>] __dev_queue_xmit+0x448/0x550
> [855063.559173]  [<ffffffff8158e6d0>] dev_queue_xmit+0x10/0x20
> [855063.559599]  [<ffffffff815d548e>] ip_finish_output+0x52e/0x780
> [855063.560041]  [<ffffffff815d5753>] ip_output+0x73/0xe0
> [855063.560468]  [<ffffffff815d4f60>] ? __ip_append_data.isra.51+0xa50/0xa50
> [855063.560924]  [<ffffffff815d3601>] ip_local_out_sk+0x31/0x40
> [855063.561355]  [<ffffffff815d3973>] ip_queue_xmit+0x143/0x390
> [855063.561797]  [<ffffffff815ed4ff>] tcp_transmit_skb+0x4cf/0x9b0
> [855063.562234]  [<ffffffff815f0cf0>] ? tcp_init_xmit_timers+0x30/0x30
> [855063.562697]  [<ffffffff815ef64b>] tcp_xmit_probe_skb+0xdb/0xf0
> [855063.563147]  [<ffffffff815f09a9>] tcp_write_wakeup+0x59/0x170
> [855063.563584]  [<ffffffff815f0e9e>] tcp_keepalive_timer+0x1ae/0x260
> [855063.564030]  [<ffffffff81098da6>] call_timer_fn+0x36/0x110
> [855063.564465]  [<ffffffff815f0cf0>] ? tcp_init_xmit_timers+0x30/0x30
> [855063.564913]  [<ffffffff8109b2bd>] run_timer_softirq+0x22d/0x310
> [855063.565365]  [<ffffffff816b3e9f>] __do_softirq+0x10f/0x2aa
> [855063.565862]  [<ffffffff816b255c>] call_softirq+0x1c/0x30
> [855063.566323]  [<ffffffff8102d2e5>] do_softirq+0x65/0xa0
> [855063.566814]  [<ffffffff81092425>] irq_exit+0x105/0x110
> [855063.567265]  [<ffffffff816b3cc2>] smp_apic_timer_interrupt+0x42/0x50
> [855063.567722]  [<ffffffff816b310d>] apic_timer_interrupt+0x6d/0x80
> [855063.568152]  <EOI>  [<ffffffff810ef910>] ? ktime_get_ts64+0xa0/0xf0
> [855063.568589]  [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
> [855063.569033]  [<ffffffff812303c0>] select_estimate_accuracy+0x40/0x120
> [855063.569454]  [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0
> [855063.569880]  [<ffffffff81267291>] ep_poll+0xa1/0x360
> [855063.570271]  [<ffffffff8109da81>] ? __set_task_blocked+0x41/0xa0
> [855063.570670]  [<ffffffff8126877d>] SyS_epoll_wait+0xed/0x120
> [855063.571049]  [<ffffffff816b15c9>] system_call_fastpath+0x16/0x1b
> 
> https://pmc.acronis.com/browse/TTASK-22056
> Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
> ---
>  kernel/fence-watchdog.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/fence-watchdog.c b/kernel/fence-watchdog.c
> index 40d258a68d7c..3f65dbc69216 100644
> --- a/kernel/fence-watchdog.c
> +++ b/kernel/fence-watchdog.c
> @@ -151,6 +151,12 @@ void fence_wdog_do_fence(void)
>  
>  inline int fence_wdog_check_timer(void)
>  {
> +	static unsigned long print_alive_time;
> +
> +	if (fence_wdog_jiffies64 != MAX_U64)
> +		if (printk_timed_ratelimit(&print_alive_time, 30*60*HZ))
> +			printk("fence-watchdog: alive\n");

Except looks good for me except that printk() has no loglevel prefix.
Is this OK for readers of this message, we print it with default/last
loglevel?

Thanks,
Kirill


More information about the Devel mailing list