[Devel] Re: [LPC] Notes from Checkpoint/Restart BOF
Oren Laadan
orenl at librato.com
Mon Oct 12 11:52:38 PDT 2009
Hi,
Thanks for posting the notes. I placed a (modified) summary of the BOF
on the linux-c/r wiki:
http://ckpt.wiki.kernel.org/index.php/LPC2009
Oren.
Sukadev Bhattiprolu wrote:
>
> Notes from Checkpoint/Restart BOF at Linux Plumbers Conference, Sep 24, 2009.
>
> (I am missing some details and a couple of names. They said they were on
> the Containers mailing list though. If you have any other topics that we
> discussed or have any details, please add to this mail).
>
> ---
>
> Attendees:
> Oren Laadan, Joeseph Ruscio, <One more person> (Librato)
> Pavel Emelyanov, <One more person ?> (OpenVZ)
> Ying Han, Salman Qazi (Google)
> Dan Smith, Matt Helsley, Sukadev Bhattiprolu (IBM)
>
> 1. Pavel: A few months ago there were discussions about making a "dry-run"
> to see if checkpoint of an application will succeed. What is the
> current status of that ?
>
> The answer was that there is no dry-run - the user should just try the
> actual C/R. If the application is using an uncheckpointable resource,
> the C/R will fail cleanly without side-effects.
> The dry-run may not mean anything unless we freeze the application
> during the check and leave it frozen until the checkpoint is done.
> IOW, the dry-run does not guarantee that the application is checkpointable
> unless the application is frozen.
>
> 2. Pavel: Alexey Dobriyan had earlier submitted some code for leak-detection. Do
> we still have that ?
>
> The answer was that most of the code was used and we also added reverse
> detection.
>
> 3. Do we have a config-option to make a process checkpointable ?
>
> <Missed the context of this question> We have CONFIG_CHECKPOINT.
>
> 4. Checkpointing network connections:
>
> We quickly reviewed the status (AF_UNIX done, AF_INET done in a
> prototype and needs to be forward ported). Checkpointing one end
> of a network connection can cause the connection to be reset.
>
> 5. Briefly discussed the distinction between live migration and static migration.
>
> 6. Do we need a pre-check during restart to ensure that the application can
> be restarted ? Eg: if the application used a specific math co-processor
> or futex at checkpoint and that resource is not available at restart,
> the restart may encounter some undefined behavior. Should we encode the
> hardware/OS capabilities in the checkpoint image and check these
> capabilities during restart (before the actual restart) ? The reason for
> this check is that the restart may not fail cleanly if the resource is missing.
>
> Conclusion was that there could be too many such capabilities that
> we would have to track and even so there may be some unexpected
> difference between checkpoint machine and restart machine.
>
> For now, let the restart fail and/or deal with it in user-space.
>
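For item 6, a rough user-space sketch (in Python) of what such a pre-check could look like. Everything here is an assumption for illustration: the capability names, the idea of a "required" list in the image header, and the helpers missing_capabilities() and precheck_restart() are hypothetical and not part of the existing C/R code.

```python
def missing_capabilities(required, available):
    """Return the capabilities the image needs but the restart host lacks.

    Both arguments are iterables of opaque, hypothetical capability names
    (for example "fpu" or "futex"). The result is sorted for stable output.
    """
    return sorted(set(required) - set(available))


def precheck_restart(image_header, host_caps):
    """Refuse a restart up front if something recorded at checkpoint is missing.

    image_header is a hypothetical dict with a "required" list written at
    checkpoint time. Raises RuntimeError before any state would be restored.
    """
    missing = missing_capabilities(image_header.get("required", ()), host_caps)
    if missing:
        raise RuntimeError("cannot restart, missing: " + ", ".join(missing))
    return True
```

As the conclusion above notes, a list like this can never be complete, so a check of this kind could only reduce unclean failures, not rule them out.
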
> 7. Briefly discussed clone2() aka clone_with_pids().
>
> Everyone seemed to agree that restoring the process tree in user-space
> will work and can be used.
>
> 8. Oren: Error reporting during restart
>
> We currently fail the system call with an error code, and if we want
> more information on the failure, we have to add debug messages to
> the code. We discussed a couple of options for error reporting on restart:
> - Log detailed message(s) to the console (risks wrapping the dmesg buffer).
> - Pass an extra buffer to the system call and have the kernel
>   fill in a more detailed error message (would need two new
>   parameters, one pointer to the buf, one size of the buf).
> - Pass in an extra 'log_fd' parameter to the system call and have the
>   kernel write detailed messages to that log_fd (unless log_fd
>   is -1). This seemed more flexible than the other two.
>
> We agreed that the format of the log messages can be free-format
> and that there is no guarantee that the format of the log
> messages will not change.
>
> But it was not clear (at least to me) if the log file should
> contain all log messages relating to the C/R or just the
> last (few) error messages.
>
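To illustrate the 'log_fd' option from item 8, here is a small user-space mock in Python. It does not call any kernel interface: restart_with_log() is a hypothetical stand-in that only mimics the proposed contract (return an error code, and write free-format details to log_fd unless it is -1). The message text and the choice of EINVAL are made up for the example.

```python
import errno
import os


def restart_with_log(image, log_fd=-1):
    """Mock of a restart call that reports details through log_fd.

    Returns 0 on success or a negative errno-style code on failure.
    Unless log_fd is -1, free-format detail lines are written to it.
    """
    def log(msg):
        if log_fd != -1:
            os.write(log_fd, (msg + "\n").encode())

    if not image:
        log("restart: empty checkpoint image")
        return -errno.EINVAL
    log("restart: ok")
    return 0


# The caller owns the descriptor, so it can be a file, a pipe or a socket.
read_end, write_end = os.pipe()
status = restart_with_log(b"", log_fd=write_end)
os.close(write_end)
details = os.read(read_end, 4096).decode()
os.close(read_end)
```

Because the caller chooses the descriptor, the same call could log to a file for later inspection or to a pipe read by a management tool, which is the flexibility mentioned above.
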
> 9. Any application to summarize the checkpoint ?
>
> We have a 'ckptinfo' that could summarize the contents of a checkpoint.
>
> 10. Ying Han: Is there a performance difference between the original instance
> of the application and the restarted instance ? (Eg: on NUMA if application
> was on one node at checkpoint and after restart, ended up on another node).
>
> Not sure if there was a conclusion to this point.
>
> 11. Discussed that devices like tty, /dev/rtc etc must be virtualized before
> we can checkpoint them.
>
> 12. Oren: Checkpointing/Restoring mount namespaces
>
> Bind mounts are restored in container.
>
> NFS: at least on OpenVZ, since network is frozen, reopening files over
> NFS is not possible until restart is complete. OpenVZ creates fake
> dentries to allow the open to proceed.
>
> Loopback devices - cannot open them in a container since they can
> lock up the system with a huge memory footprint ??
>
> We should disable shared-mount propagation at least for now.
>
> 13. Oren: cradvise()
>
> Use a single system call to optimize the checkpoint/restart ?
> Eg: If an fd refers to /dev/tty1 in the checkpoint-image and that tty
> is not available on restart, user-space could open another tty and
> teach the kernel to use a different tty, /dev/tty2, during
> restart. Another example is if an application has several megs of
> "scratch" memory that does not need to be checkpointed, they could
> use the 'cradvise()' system call to optimize the checkpoint or restart.
>
> The conclusion was that it would be hard to get acceptance from the
> community for a new variant of the ioctl/fcntl call. So, we should instead try to
> add the necessary features to existing system calls like fcntl(),
> shmctl() or madvise().
>
> 14. Oren: Unlinked files/directories
>
> May need to copy the contents of the deleted file to the
> checkpoint image (only on ext4?). Create a fake hard link to the
> file so the file still exists in the filesystem snapshot and remove
> the link during restart.
>
> There is a good paper discussing snapshot/restore of unlinked files
> on Xen. The same concept could be used in C/R too ?
>
> (If you have links to the paper, please add)
>
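For item 14, a short Python sketch of how user-space can tell that an open file has been unlinked and still read its contents through the descriptor. It assumes a local Linux filesystem, where the link count of an unlinked but still open file drops to zero. The helper names is_unlinked() and save_unlinked_contents() are hypothetical, and this is not how the C/R code itself handles the case.

```python
import os
import tempfile


def is_unlinked(fd):
    """True if the open file behind fd has no remaining directory entries."""
    return os.fstat(fd).st_nlink == 0


def save_unlinked_contents(fd):
    """Read the file through the fd without moving its current offset."""
    size = os.fstat(fd).st_size
    return os.pread(fd, size, 0)


# Demo: create a file, keep it open, then unlink it.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "scratch")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
os.write(fd, b"still here")
unlinked_before = is_unlinked(fd)
os.unlink(path)
unlinked_after = is_unlinked(fd)
saved = save_unlinked_contents(fd)
os.close(fd)
os.rmdir(workdir)
```

Copying the data this way corresponds to the first option above; the hard-link approach avoids the copy but needs the filesystem's cooperation.
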
> 15. Network namespaces
>
> Restore namespaces in user-space, restore sockets in-kernel.
>
> Cannot create devices in user-space unless we know the index for
> the network device ?
>
> (Missed details on this discussion)
>
> 16. Time
>
> Will need some policies on restart like:
> - use absolute time or relative time
> - do new children inherit the policy ?
> - do we gradually adjust from relative to absolute time ?
>
> If not cradvise(), maybe timectl() :-p
>
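A tiny Python sketch of the first policy question in item 16, assuming a timer deadline saved in the image and a single clock measured in seconds. The function restored_expiry() and the policy names "absolute" and "relative" are hypothetical, chosen only to make the distinction concrete.

```python
def restored_expiry(expiry, checkpoint_time, restart_time, policy):
    """Translate a timer deadline saved at checkpoint for the restarted task.

    "absolute" keeps the original deadline on the clock.
    "relative" keeps the remaining interval, i.e. shifts the deadline by
    the time the application spent checkpointed.
    """
    if policy == "absolute":
        return expiry
    if policy == "relative":
        return expiry + (restart_time - checkpoint_time)
    raise ValueError("unknown policy: %r" % (policy,))
```

For example, with a checkpoint at t=100, a timer due at t=110 and a restart at t=500, the absolute policy leaves the deadline at 110 (already expired), while the relative policy moves it to 510.
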
> 17. VDSO
>
> (Missed details on this discussion)
>
> 18. Async I/O
>
> Getting a lockdep report during checkpoint ?
> OpenVZ flushes I/O, waits for pending I/O and then retries the checkpoint.
> We may need to do the same for mmap I/O ?
>
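The flush-and-retry approach described for item 18 can be sketched as a simple loop in Python. This is only an outline of the control flow, not OpenVZ's code: try_checkpoint() and flush_io() are hypothetical callables supplied by the caller, and the attempt limit is an arbitrary assumption.

```python
def checkpoint_with_retry(try_checkpoint, flush_io, max_attempts=3):
    """Attempt a checkpoint; on failure drain pending I/O and try again.

    try_checkpoint() returns True on success. flush_io() is expected to
    block until outstanding I/O has completed. Returns the number of
    attempts used, or raises RuntimeError when the limit is reached.
    """
    for attempt in range(1, max_attempts + 1):
        if try_checkpoint():
            return attempt
        flush_io()  # drain before the next attempt
    raise RuntimeError("checkpoint still failing after %d attempts" % max_attempts)
```
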
> 19. Checkpoint data structures:
>
> - Try to keep extensions to existing data structures minimal
> - If necessary, add to end of data structures
> - But do not get locked down to an ABI at this point. i.e. even after
> entering mainline, format of checkpoint image may change for a while
> before stabilizing.
>
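The "add to end of data structures" guideline in item 19 can be shown with Python's struct module. The two layouts below are invented for the example (they are not the real checkpoint image headers). The point is only that a reader which knows the old leading fields can still parse a record that gained a trailing field.

```python
import struct

# Hypothetical little-endian layouts; field names are illustrative only.
HDR_V1 = struct.Struct("<II")    # type, len
HDR_V2 = struct.Struct("<IIQ")   # type, len, plus a new trailing "flags"


def read_v1_fields(buf):
    """Parse only the original leading fields, ignoring anything appended."""
    return HDR_V1.unpack_from(buf, 0)


# A newer writer emits the extended record; an older reader still
# recovers (type, len) because those fields kept their offsets.
new_record = HDR_V2.pack(7, HDR_V2.size, 0x1)
old_view = read_v1_fields(new_record)
```

Inserting the new field in the middle instead would shift the offsets of the existing fields and break the older reader.
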
> 20. Test suite:
>
> OpenVZ has some test cases that have various applications go to specific
> states and wait for a checkpoint. After that and after restart they
> check that nothing has changed unexpectedly.
> _______________________________________________
> Containers mailing list
> Containers at lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers