[Users] Vzmigration nightmare, ssh bug

Thu Apr 22 20:31:31 EDT 2010

Hi,

On 22.04.2010 16:36, JR Richardson wrote:
> I had a terrible time migrating containers from one hardware node to
> another.  I ran into the ssh bug with the public key not being
> accepted error:
>
> error: RSA_public_decrypt failed: error:0407006A:lib(4):func(112):reason(106)
>
> It was so random, migrating 50 containers, some would work and some
> migrations would prompt for the ssh password, truly random.  I had to
> migrate some containers several times before the ssh session would
> establish and complete.
>
> The strange thing is while migration between HN 1 & 2 was very
> problematic, migration between HN 4 & 5 worked as expected.  All 4
> nodes were built on the same day, same hardware and all have matching
> package versions and configuration.
>   

That points to hardware problems, either in your HNs or your network. I
once had to debug randomly failing SSH sessions, and it turned out to be
a faulty ethernet switch that corrupted data but recomputed the
checksums of the ethernet packets, so the corruption went unnoticed at
that level and caused SSH to abort at various stages.

Can you use ssh reliably between HN 1+2? If yes, try running scp of a
large file in both directions 100 times or so. The best approach is to
copy the same file back and forth, so that errors accumulate, and to
compute the checksum of the file after each copy. I once had SSH corrupt
data in a tunnel: the SSH connection stayed stable although the tunneled
data was clearly corrupt. Admittedly this was due to a faulty ethernet
switch, but I would still have expected SSH to abort the connection or
at least ensure the integrity of my data, especially because the
corruption was caused by a hardware failure and not by a determined
attacker.
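
A minimal sketch of the round-trip test described above, in Python. It
assumes key-based scp already works without a prompt between the two
nodes. The function names (sha256_of, scp_copy, first_corrupt_round)
are made up for illustration, and SHA-256 is just one possible checksum.

```python
import hashlib
import subprocess


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scp_copy(src, dst):
    """Copy src to dst with scp; either side may be 'host:/path'.

    -B disables password prompts and -q silences the progress meter.
    A failed transfer raises subprocess.CalledProcessError, which is
    itself a useful symptom to record.
    """
    subprocess.run(["scp", "-B", "-q", src, dst], check=True)


def first_corrupt_round(local_path, remote_spec, rounds=100, copy=scp_copy):
    """Copy one file out and back `rounds` times, hashing after each trip.

    The returning copy overwrites the local file, so any corruption is
    carried into later rounds. Returns the 1-based number of the first
    round whose digest differs from the original, or None if all match.
    """
    expected = sha256_of(local_path)
    for round_number in range(1, rounds + 1):
        copy(local_path, remote_spec)   # e.g. HN 1 -> HN 2
        copy(remote_spec, local_path)   # e.g. HN 2 -> HN 1
        if sha256_of(local_path) != expected:
            return round_number
    return None
```

Run on HN 1, this could be called as
first_corrupt_round("/tmp/testfile", "hn2:/tmp/testfile"), where hn2 and
the path are placeholders. A result of None means every round matched; a
number is the first round after which the file no longer matched the
original. Keep a pristine copy of the test file elsewhere, since the
local one is overwritten on every round.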

And yes, sometimes one machine in a batch of identical machines corrupts
data, or one ethernet port corrupts data and all others are OK.

Regards,
Carl-Daniel

-- 
http://www.hailfinger.org/