[Users] CT produces corrupt dump files

Kir Kolyshkin kir at openvz.org
Tue Apr 23 15:52:12 EDT 2013


On 04/15/2013 02:14 AM, Roman Haefeli wrote:
> Hi all
>
> One of our CTs cannot be migrated online because it seems to produce
> corrupt dump files. Checkpointing finishes without errors, but restoring
> fails. The CT is running Debian 6.0.7 like all our other CTs. I haven't
> figured out yet why only this CT is causing trouble.
>
> The more severe problem is that the cluster management starts the CT
> with 'vzctl start' when restoring fails. However, 'vzctl start' reads
> the corrupt dump file again, which fails again, and only then tries
> to start the CT normally. Although the CT comes up, it produces a lot of
> errors in /var/log/messages and it cannot be entered:
>
> $ vzctl enter services0
> enter into CT 103 failed
> Unable to open pty: No such file or directory
>
>  From /var/log/messages:
>
> Apr 15 10:55:01 virtuetest1 kernel: ------------[ cut here ]------------
> Apr 15 10:55:01 virtuetest1 kernel: WARNING: at kernel/cpt/rst_delayfs.c:443 delayfs_wait_mnt+0x69/0x1c0 [vzrst]() (Tainted: G        W  ---------------   )
> Apr 15 10:55:01 virtuetest1 kernel: Hardware name: IBM eServer BladeCenter HS21 -[7995L3G]-
> Apr 15 10:55:01 virtuetest1 kernel: Modules linked in: ext4 jbd2 vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit xt_dscp ipt_REJECT ip_tables nfs lockd fscache nfs_acl auth_rpcgss sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 8021q garp softdog snd_pcsp snd_pcm snd_timer radeon snd ttm soundcore drm_kms_helper snd_page_alloc drm i5000_edac edac_core i2c_algo_bit i5k_amb i2c_core serio_raw shpchp tpm_tis tpm tpm_bios ext3 jbd mbcache ata_generic pata_acpi mptsas ata_piix bnx2 mptscsih mptbase scsi_transport_sas [last unloaded: scsi_wait_scan]
> Apr 15 10:55:01 virtuetest1 kernel: Pid: 3806, comm: php5 veid: 103 Tainted: G        W  ---------------    2.6.32-19-pve #1
> Apr 15 10:55:01 virtuetest1 kernel: Call Trace:
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff8106d6c8>] ? warn_slowpath_common+0x88/0xc0
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff8106d71a>] ? warn_slowpath_null+0x1a/0x20
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffffa0624e09>] ? delayfs_wait_mnt+0x69/0x1c0 [vzrst]
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff811a7bc5>] ? do_lookup+0xb5/0x270
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffffa0624f95>] ? delay_permission+0x15/0x20 [vzrst]
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff811a80c9>] ? __link_path_walk+0xf9/0x1050
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff8115bfec>] ? handle_pte_fault+0x64c/0x1190
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff811a925a>] ? path_walk+0x6a/0xe0
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff811a942b>] ? do_path_lookup+0x5b/0xa0
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff811aa2cb>] ? do_filp_open+0xfb/0xca0
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff8104327c>] ? __do_page_fault+0x1ec/0x490
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff8127ebda>] ? strncpy_from_user+0x4a/0x90
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff811b7133>] ? alloc_fd+0x53/0x140
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff81195ba9>] ? do_sys_open+0x69/0x140
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff811ea81a>] ? compat_sys_open+0x1a/0x20
> Apr 15 10:55:01 virtuetest1 kernel: [<ffffffff810497c0>] ? sysenter_dispatch+0x7/0x2e
> Apr 15 10:55:01 virtuetest1 kernel: ---[ end trace 9edec813234428a4 ]---
>
>
> Those traces appear every few seconds; trying to log in with ssh also
> triggers them. It seems many processes in the CT cannot run correctly.
>
> The only way to fix this situation is to delete the dump file and then
> perform 'vzctl restart <ctid>'.
>
> On one of our host nodes this problem even triggered a kernel panic and
> thus killed all CTs running on that host. The problem could have been
> mitigated if 'vzctl start' did not try to read corrupt dump files. Or,
> if it detects a corrupt dump file, it should ignore the dump file and
> start the CT normally.
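
The fallback described above could be sketched as a small wrapper around
vzctl (a hypothetical `safe_start` helper, not something vzctl ships; the
dump path below is the usual default and may differ per installation):

```shell
#!/bin/sh
# Hypothetical safe_start wrapper (an assumption, not vzctl's actual
# behaviour): try to restore from the dump first; if the restore fails,
# set the possibly corrupt dump aside so a plain start cannot read it again.
safe_start() {
    ctid=$1
    dump="/var/lib/vz/dump/Dump.${ctid}"   # assumed default dump location
    if vzctl restore "$ctid"; then
        return 0
    fi
    echo "restore of CT $ctid failed; ignoring dump, starting normally" >&2
    [ -f "$dump" ] && mv "$dump" "$dump.corrupt"
    vzctl start "$ctid"
}
```

Called as `safe_start 103`, this never feeds a dump file that has already
failed to restore back into 'vzctl start'.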
>
> Specs:
> * CT is running Debian 6.0.7 and has some NFS shares mounted inside.
> * Hostnode is Debian 6.0.7
> * vzkernel-2.6.32-042stab075.2 (~= pve-kernel-2.6.32-19-pve)
> * vzctl 4.2
>
> Does this look like a bug? If so, what else is necessary for a proper
> report? I have the corrupt dump file ready for further investigation,
> but would like to give it only to a 'trusted' entity (i.e. someone from
> the OpenVZ core team).

This looks related to NFS inside the CT. delayfs is a neat trick used to 
work around a chicken-and-egg problem between vzctl restore and NFS 
mounts inside a CT. If these traces keep repeating, it probably means 
the NFS server cannot be reached.
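
A quick way to check that from the host node is to probe the server's
portmapper and export list (a rough sketch; "nfs.example.com" is a
placeholder for whatever server the CT actually mounts from):

```shell
# check_nfs: rough reachability probe for the NFS server a CT mounts from.
# "nfs.example.com" below is a placeholder host name, not a real server.
check_nfs() {
    srv=$1
    # Is the portmapper answering, and can the export list be fetched?
    rpcinfo -p "$srv" >/dev/null 2>&1 && showmount -e "$srv" >/dev/null 2>&1
}

# usage (from the hardware node):
#   check_nfs nfs.example.com && echo "NFS server reachable"
```

If either probe hangs or fails, the repeated delayfs warnings after a
restore are the expected symptom.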

In any case, whenever you see anything suspicious like this, it won't 
hurt to file a bug in Bugzilla, so next time you see it, please do.

Regards,
   Kir
