[Users] Re: Vzmigration nightmare, ssh bug

Fri Apr 23 15:36:47 EDT 2010

> On 22.04.2010 16:36, JR Richardson wrote:
>> I had a terrible time migrating containers from one hardware node to
>> another.  I ran into the ssh bug with the public key not being
>> accepted error:
>>
>> error: RSA_public_decrypt failed: error:0407006A:lib(4):func(112):reason(106)
>>
>> It was so random, migrating 50 containers, some would work and some
>> migrations would prompt for the ssh password, truly random.  I had to
>> migrate some containers several times before the ssh session would
>> establish and complete.
>>
>> The strange thing is while migration between HN 1 & 2 was very
>> problematic, migration between HN 4 & 5 worked as expected.  All 4
>> nodes were built on the same day, same hardware and all have matching
>> package versions and configuration.
>>
>
> That points to hardware problems, either in your HNs or your network. I
> once had to debug randomly failing SSH sessions, and it turned out to be
> a faulty ethernet switch which corrupted data, but auto-corrected the
> checksums of the ethernet packets, and this caused SSH to abort in
> various stages.
>
> Can you use ssh reliably between HN 1+2? If yes, try running scp of a
> large file in both directions 100 times or so. The best idea is to copy
> the same file back and forth, so errors will accumulate. Compute the
> checksum of the file after each copy. I had SSH corrupt data in a tunnel
> and the SSH connection stayed stable although the tunneled data was
> clearly corrupt. Admittedly this was due to a faulty ethernet switch,
> but still I had expected SSH to abort the connection or at least ensure
> the integrity of my data, especially because the corruption was hardware
> failure and not a determined attacker.
>
> And yes, sometimes one machine in a batch of identical machines corrupts
> data, or one ethernet port corrupts data and all others are OK.
>
> Regards,
> Carl-Daniel

While I was researching the ssh bug, 100 posts or so, I came across
one that mentioned a hardware error on a Sun workstation.  So I
started testing and did isolate the issue to HN1.  There are no
hardware errors reported on the Ethernet interfaces of the server or
the connected switch port.  During the next maintenance window I will
migrate all the production containers off and bring the server into the
lab and hammer on it.
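
For anyone who wants to script the back-and-forth copy test suggested
in the quoted message above, here is a minimal sketch in Python.  It is
only an illustration, not what I actually ran: the function names and
the remote target "root@hn2:/tmp/probe.bin" are made up, and it assumes
key-based scp already works between the two nodes.  The file is bounced
repeatedly so that errors accumulate, and the SHA-256 checksum is
compared with the original after every round.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scp_round_trip(src, dst, remote_spec="root@hn2:/tmp/probe.bin"):
    """Push src to the remote node with scp, then pull it back into dst.

    remote_spec is a hypothetical placeholder; adjust it to your setup.
    """
    subprocess.run(["scp", "-q", str(src), remote_spec], check=True)
    subprocess.run(["scp", "-q", remote_spec, str(dst)], check=True)


def first_corrupt_round(source, rounds, round_trip=scp_round_trip):
    """Bounce a file back and forth and compare checksums after each round.

    Each round copies the file returned by the previous round, so any
    corruption carries forward.  Returns the 1-based number of the first
    round whose checksum differs from the original, or None if all
    rounds match.
    """
    expected = sha256_of(source)
    workdir = Path(tempfile.mkdtemp(prefix="scp-probe-"))
    # Alternate between two scratch files so disk use stays bounded.
    slots = [workdir / "even.bin", workdir / "odd.bin"]
    current = Path(source)
    for n in range(1, rounds + 1):
        returned = slots[n % 2]
        round_trip(current, returned)
        if sha256_of(returned) != expected:
            return n
        current = returned
    return None
```

Calling first_corrupt_round("/tmp/probe.bin", 100) on HN1 would run 100
round trips through the default scp helper.  The round_trip argument can
be swapped for any other copy function, which also makes it possible to
try the checksum logic locally without touching the network.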

Thanks for reporting your experience; I'm leaning toward a hardware
error as well.

JR
-- 
JR Richardson
Engineering for the Masses