[Users] vzctl chkpnt speed

Roman Haefeli reduzent at gmail.com
Thu Jan 24 14:11:58 EST 2013


On Mon, 2013-01-21 at 13:32 +0400, Stanislav Kinsbursky wrote:
> > On Fri, 2013-01-18 at 16:26 +0100, Roman Haefeli wrote:
> >> Hi all
> >>
> >> Only recently did I discover that online migration now seems to work
> >> for us. A CT on NFS, or NFS mounted inside a CT, is a non-issue now.
> >>
> >> We are running all our CTs on an NFS filesystem shared between
> >> hostnodes. While checkpointing and restoring work flawlessly with that
> >> setup, I noticed that "vzctl chkpnt CTID" writes to the NFS mount more
> >> slowly than, for instance, a dd write.
> >>
> >> 'dd if=/dev/zero of=/mnt/nfs/deleteme bs=1M' writes at approx. 70MB/s.
> >> 'iotop' shows that 'vzctl chkpnt CTID' writes at only 18MB/s to the
> >> same directory.
> >>
> >> However, the checkpoint write speed is similar to that of dd when I
> >> use the 'bs=8k' option for dd. This makes me assume that quite some
> >> write performance could be gained if checkpointing wrote bigger blocks
> >> at a time. I haven't read the relevant source code to confirm my
> >> assumption that small block sizes are used, as my skills are far too
> >> limited, but if that really is the case, wouldn't it make sense to use
> >> bigger writes in order to improve checkpointing performance?
> >>
> >> What do you think?
> >
> > I found a way to speed up checkpointing so that it uses the maximum
> > possible write speed. On the hostnodes we mount the NFS share with the
> > 'sync' mount option (in order to avoid mutual storage lags between the
> > CTs). However, when using 'async' the write speed no longer depends on
> > the block size and is always fast, which means checkpointing is fast
> > with 'async' as well. As we still want the CTs' private areas to be
> > mounted with 'sync', the solution was to use a separate mount for the
> > /vz/dump directory with the mount option 'async'. This way we achieve
> > maximum checkpointing speed (which is ~70MB/s on our machines).
> >
> 
> Hello, Roman.
> It's not correct to compare checkpointing to dd, because checkpointing
> doesn't perform sequential writes. We perform a lot of disk seek
> operations during checkpointing.

I see.
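
For completeness, this is how I measured the difference from my first
mail (the 'count' values are added here only to bound the test; the path
is our NFS mount):

  dd if=/dev/zero of=/mnt/nfs/deleteme bs=1M count=512    # ~70MB/s
  dd if=/dev/zero of=/mnt/nfs/deleteme bs=8k count=65536  # ~18MB/s,
                                                          # about the
                                                          # chkpnt speed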

> And yes, NFS works much faster in async mode than in sync mode (async
> mode hides seek operations and allows many writes to be performed before
> waiting for the attribute update from the server). This can help you
> reduce CPT time.

It does so significantly in our case.
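
For the record, the hostnode mounts now look roughly like this (the
server name and export paths below are placeholders, not our real ones):

  # /etc/fstab on the hostnodes
  nfsserver:/export/vz       /vz/private  nfs  sync   0 0
  nfsserver:/export/vz/dump  /vz/dump     nfs  async  0 0

Only the dump directory is 'async'; the private areas stay on 'sync'.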

> But you have to fsync the resulting checkpoint image on the source node
> to make sure that it's consistent on shared storage before resuming on
> another node.

Checkpointing and restoring are done by Pacemaker - by the ManageVE
resource agent [1], to be precise. I checked the script, and as far as I
can see it doesn't take any precautions to make sure everything is
synced before restoring.
So far I had the impression that everything was running fine. What would
be the effect of restoring from an incomplete dump file? Would I notice
it immediately (if it works at all)?
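
If an explicit sync really is required, I guess the agent could be
extended with something as simple as this before the target node takes
over (just a sketch, untested; $CTID stands for the container ID):

  vzctl chkpnt $CTID   # writes the dump below /vz/dump
  sync                 # flush all dirty data to the NFS server before
                       # the target node runs 'vzctl restore $CTID'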

Roman


[1]
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/ManageVE


