[Users] Issues after updating to 7.0.14 (136)

Kevin Drysdale kevin.drysdale at iomart.com
Mon Jun 29 13:26:44 MSK 2020


Hello,

After updating one of our OpenVZ VPS hosting nodes at the end of last 
week, we've started to have issues with corruption apparently occurring 
inside containers.  Issues of this nature have never affected the node 
previously, and there do not appear to be any hardware issues that could 
explain this.

Specifically, a few hours after updating, we began to see containers 
experiencing errors such as this in the logs:

[90471.678994] EXT4-fs (ploop35454p1): error count since last fsck: 25
[90471.679022] EXT4-fs (ploop35454p1): initial error at time 1593205255: ext4_ext_find_extent:904: inode 136399
[90471.679030] EXT4-fs (ploop35454p1): last error at time 1593232922: ext4_ext_find_extent:904: inode 136399
[95189.954569] EXT4-fs (ploop42983p1): error count since last fsck: 67
[95189.954582] EXT4-fs (ploop42983p1): initial error at time 1593210174: htree_dirblock_to_tree:918: inode 926441: block 3683060
[95189.954589] EXT4-fs (ploop42983p1): last error at time 1593276902: ext4_iget:4435: inode 1849777
[95714.207432] EXT4-fs (ploop60706p1): error count since last fsck: 42
[95714.207447] EXT4-fs (ploop60706p1): initial error at time 1593210489: ext4_ext_find_extent:904: inode 136272
[95714.207452] EXT4-fs (ploop60706p1): last error at time 1593231063: ext4_ext_find_extent:904: inode 136272
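
(In case it is useful to anyone checking their own containers: as far as 
I understand it, these counters live in the ext4 superblock, so they can 
be read straight off the ploop device with dumpe2fs; the device name 
below is a placeholder for whichever /dev/ploopXXXXX the image is 
attached as.)

   dumpe2fs -h /dev/ploopXXXXXp1 | grep -i error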

Shutting the containers down and manually mounting and e2fsck'ing their 
filesystems did clear these errors, but each of the containers (which 
were mostly used for running Plesk) had widespread corrupt or missing 
files once the fscks completed, and had to be restored from backup.
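
For reference, the repair procedure on each affected container was 
roughly as follows (CTID and the ploop device name are placeholders; 
the device is whatever 'ploop mount' reports attaching):

   vzctl stop CTID
   ploop mount /vz/private/CTID/root.hdd/DiskDescriptor.xml
   e2fsck -fy /dev/ploopXXXXXp1
   ploop umount /vz/private/CTID/root.hdd/DiskDescriptor.xml
   vzctl start CTID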

Concurrently, we also began to see messages like this appearing in 
/var/log/vzctl.log, which, again, had never appeared at any point prior 
to this update being installed:

/var/log/vzctl.log:2020-06-26T21:05:19+0100 : Error in fill_hole (check.c:240): Warning: ploop image '/vz/private/8288448/root.hdd/root.hds' is sparse
/var/log/vzctl.log:2020-06-26T21:09:41+0100 : Error in fill_hole (check.c:240): Warning: ploop image '/vz/private/8288450/root.hdd/root.hds' is sparse
/var/log/vzctl.log:2020-06-26T21:16:22+0100 : Error in fill_hole (check.c:240): Warning: ploop image '/vz/private/8288451/root.hdd/root.hds' is sparse
/var/log/vzctl.log:2020-06-26T21:19:57+0100 : Error in fill_hole (check.c:240): Warning: ploop image '/vz/private/8288452/root.hdd/root.hds' is sparse
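
(A sparse delta is easy to confirm from the shell, if anyone wants to 
check their own images: the space actually allocated falls short of the 
apparent file size.  For example, against one of the images named above:

   du -h --apparent-size /vz/private/8288448/root.hdd/root.hds
   du -h /vz/private/8288448/root.hdd/root.hds

or equivalently 'stat -c "apparent=%s allocated_blocks=%b" <file>'.)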

The basic procedure we follow when updating our nodes is as follows:

1. Update the standby node we keep spare for this process
2. vzmigrate all containers from the live node being updated to the 
standby node (example invocation after this list)
3. Update the live node
4. Reboot the live node
5. vzmigrate the containers from the standby node back to the live node 
they originally came from
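
For completeness, the migration in steps 2 and 5 is just the ordinary 
vzmigrate invocation, roughly along these lines (the hostname is a 
placeholder, and 101 stands in for each container ID):

   vzmigrate standby-node.example.com 101

and the same in reverse to move the containers back in step 5.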

So the only tool which has been used to affect these containers is 
'vzmigrate' itself, and I'm at something of a loss to explain how the 
root.hdd images for these containers have ended up containing sparse 
gaps.  Creating sparse ploop images is something we have never done, as 
we have always been aware that OpenVZ does not support sparse files 
inside a container's hard drive image.  And the fact that these images 
have suddenly become sparse at the same time they have started to 
exhibit filesystem corruption is somewhat concerning.
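
To get an idea of how widespread this is on a given node, each 
container's top delta can be scanned and the sparse ones flagged; a 
quick sketch, assuming the standard /vz/private layout:

   for hds in /vz/private/*/root.hdd/root.hds; do
       read alloc bsize size <<< "$(stat -c '%b %B %s' "$hds")"
       # flag files whose allocated bytes fall short of the apparent size
       [ $((alloc * bsize)) -lt "$size" ] && echo "sparse: $hds"
   done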

We can restore all affected containers from backups, but I wanted to get 
in touch with the list to see if anyone else at any other site has 
experienced these or similar issues after applying the 7.0.14 (136) 
update.

Thank you,
Kevin Drysdale.


