[Users] Idea: adding online memory testing (RAMpage) to the VZ kernel

Corrado Fiore lists at corradofiore.it
Fri Jun 3 09:20:14 PDT 2016


Dear All,

As is customary in any datacenter environment, we use ECC RAM on all of our machines.  Therefore, on the rare occasions when we had data corruption issues or sudden crashes, I used to think that RAM couldn't be the culprit (we've got ECC, right?), until I discovered this article:

https://www.neuhalfen.name/2013/09/05/your-data-is-corrupted-and-you-dont-know-it/

In a nutshell, the author proposes a software solution (a kernel module) to enhance hardware error correction, i.e. to catch and correct a number of errors that the hardware alone cannot fix.  This concept surprised me, as I used to think quite naively that hardware ECC was enough to catch and correct _100%_ of errors.

What's more interesting is that RAM sticks tend to exhibit more errors as they age, so there's little point in relying on a burn-in test of your new server:  the problem will most likely appear once you've got valuable data on your machines, maybe two years later.  Hence the need for an online detection and correction system.

Such a system has been developed already.  It's called RAMpage, it's based on the work of Jens Neuhalfen (the author of the article linked above) and it is currently available in a beta release:

https://github.com/schirmeier/rampage

Given the delicate area in which RAMpage operates, I would never use its beta version on a production server, but my colleagues and I agreed that the idea is absolutely terrific.  If RAMpage were merged into the VZ kernel, several engineers I know would be very interested.

What's your opinion?

Best,
Corrado Fiore

