[CRIU] [PATCH 01/10] p.haul: implement migration over existing connections
Nikita Spiridonov
nspiridonov at odin.com
Tue Oct 20 02:16:01 PDT 2015
On Mon, 2015-10-19 at 14:55 -0600, Tycho Andersen wrote:
> On Mon, Oct 19, 2015 at 11:39:35AM +0300, Pavel Emelyanov wrote:
> > On 10/15/2015 10:20 PM, Tycho Andersen wrote:
> > > On Thu, Oct 15, 2015 at 12:21:35PM +0300, Pavel Emelyanov wrote:
> > >> On 10/14/2015 10:27 PM, Tycho Andersen wrote:
> > >>> Hi Nikita,
> > >>>
> > >>> Thanks for this work, it will be very useful for us.
> > >>>
> > >>> On Fri, Oct 09, 2015 at 09:11:33PM +0400, Nikita Spiridonov wrote:
> > >>>> Remove standalone mode; p.haul can now work only over existing
> > >>>> connections specified via command line arguments as file
> > >>>> descriptors.
> > >>>>
> > >>>> Three arguments are required: --fdrpc for rpc calls, --fdmem for
> > >>>> c/r image migration and --fdfs for disk migration. Each file
> > >>>> descriptor is expected to represent a socket opened in blocking
> > >>>> mode with domain AF_INET and type SOCK_STREAM.
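For illustration, a caller might set these descriptors up like the
sketch below (the destination address is made up, other p.haul
arguments are omitted, and pass_fds is just one way to keep the
sockets inherited by the child):

    import socket
    import subprocess

    # Hypothetical destination node; all three channels are plain
    # blocking AF_INET/SOCK_STREAM connections, as described above.
    DST = ("dst.example.com", 12345)

    rpc_sk = socket.create_connection(DST)  # --fdrpc: rpc calls
    mem_sk = socket.create_connection(DST)  # --fdmem: c/r images
    fs_sk = socket.create_connection(DST)   # --fdfs: disk migration

    fds = (rpc_sk.fileno(), mem_sk.fileno(), fs_sk.fileno())

    # Pass the raw descriptor numbers on the command line; pass_fds
    # keeps the sockets open and inheritable in the p.haul child.
    subprocess.check_call(
        ["p.haul", "--fdrpc", str(fds[0]),
         "--fdmem", str(fds[1]),
         "--fdfs", str(fds[2])],
        pass_fds=fds)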
> > >>>
> > >>> Do we have to require --fdfs here for anything? I haven't looked
> > >>> through the code to see why exactly it is required.
> > >>
> > >> The --fdfs socket is required to copy the filesystem, but (!) only
> > >> when that is actually needed. If the storage the container's files
> > >> are on is shared, then this fd will effectively go unused.
> > >>
> > >> I think we can do it like this -- one can omit this parameter, but
> > >> if the htype driver says that fs migration _is_ required, then
> > >> p.haul will fail with the error "no data channel for fs migration".
> > >> Does this sound OK to you?
> > >
> > > Yep, that sounds fine.
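That behavior could be a few lines in p.haul's setup path; a sketch,
with every name invented for illustration (not the actual p.haul API):

    # --fdfs becomes optional; fail only when the htype driver
    # actually needs a filesystem channel.
    def pick_fs_channel(args, htype):
        if not htype.needs_fs_migration():
            return None  # shared storage -- the fs channel goes unused
        if args.fdfs is None:
            raise Exception("no data channel for fs migration")
        return args.fdfs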
> > >
> > >>> In LXD (and I guess openvz as well, with your ploop patch) we are
> > >>> managing our own storage backends, and have our own mechanism for
> > >>> transporting the rootfs.
> > >>
> > >> Can you shed more light on this? :) If there's some backend that can
> > >> be used by us as well, maybe it would make sense to put migration code
> > >> into p.haul?
> > >
> > > Right now we have backends for zfs, lvm, btrfs, and just a regular
> > > directory on a filesystem. I'm not aware of us planning support for
> > > any other backends right now, but it's not out of the question.
> > > Additionally, we also want to migrate a container's snapshots when we
> > > migrate the container, which requires something to know about how we
> > > handle snapshotting for these various storage backends as well.
> >
> > Yup, pretty much the same for us :)
> >
> > > We also support non-live copying of containers, so we need the code
> > > even without p.haul, and ideally we would not maintain it in two
> > > places, but,
> >
> > You mean off-line copying of a container across nodes?
>
> Yep exactly.
>
> > >>> Ideally, I could invoke p.haul over an fd to
> > >>> just do the criu iterative piece, and potentially do some callbacks to
> > >>> tell LXD when the process is stopped so that we can do a final fs
> > >>> sync.
> > >>
> > >> The issue is that fs sync is tightly coupled with the memory
> > >> migration iterations; that's why I planned to put all this stuff
> > >> into p.haul. If you do the final fs sync and, while doing it, the
> > >> amount of memory to be copied increases, it might make sense to do
> > >> one more iteration of pre-copy. Without p.haul having full control
> > >> over both (memory and fs) that's hardly possible.
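Roughly, the loop Pavel has in mind could look like the sketch below;
every name is invented for illustration, this is not the actual p.haul
code:

    # Illustrative only: memory pre-copy and fs sync driven by one
    # loop. If memory keeps growing while the fs sync runs, another
    # pre-copy pass may pay off before freezing the container.
    def migrate(task, fs, max_iters=8, mem_threshold=64 << 20):
        for _ in range(max_iters):
            task.pre_dump()   # one iteration of memory pre-copy
            fs.sync()         # copy the fs changes accumulated so far
            if task.dirty_memory_size() < mem_threshold:
                break         # both channels are quiet enough
        task.freeze()
        fs.final_sync()       # nothing can change from now on
        task.final_dump()     # final dump, then restore on destination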
> > >
> > > What about passing p.haul a socket and inventing a messaging protocol?
> > > Then p.haul could ask LXD (or whoever) to sync the filesystem, but
> > > also report any errors during migration better than just exit(1).
> >
> > Let's try. Can you suggest what such a protocol might look like?
>
> What about something like,
>
> enum phaulmsgtype {
>     ERROR = 0;
>     SYNCFS = 1;
>     SUCCESS = 2;
>     /* other message types as necessary */
> }
>
> message phaul {
>     required phaulmsgtype type = 1;
>
>     /* for ERROR and SUCCESS, perhaps just the contents of the
>      * CRIU log?
>      */
>     optional string message = 2;
> }
>
> which you pass to p.haul via a --msgfd. I can think of a few ways it
> could work:
>
> * if you pass msgfd, your client always has to move the filesystem.
> This seems a little ugly though, as getting the logs (and not just
> p.haul's exit code) may be useful for others, so they don't have to
> know how p.haul drives CRIU to know where to look for the logs.
>
> * when you pass msgfd, p.haul will send a SYNCFS message. If it gets
> an UNSUP message back, it uses the htype driver's storage backend
> (or fails if this also fails). If it is supported, the p.haul caller
> either sends a SUCCESS or ERROR message depending on what happened.
>
> Does that make sense? I haven't looked at the p.haul code much, so I
> could be totally off base.
>
> Tycho
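To make Tycho's second option concrete, here is a sketch of p.haul's
side of the protocol. It assumes length-prefixed protobuf framing, an
UNSUP value added to the enum above, a hypothetical compiled module
phaul_pb2, and an htype.migrate_fs() standing in for the driver's own
storage path -- none of this is existing p.haul code:

    import struct

    import phaul_pb2  # hypothetical module compiled from the .proto above

    def _recvn(sk, n):
        # read exactly n bytes from a blocking socket
        buf = b""
        while len(buf) < n:
            chunk = sk.recv(n - len(buf))
            if not chunk:
                raise EOFError("msgfd closed")
            buf += chunk
        return buf

    def send_msg(sk, msg):
        # 4-byte big-endian length prefix, then the serialized message
        data = msg.SerializeToString()
        sk.sendall(struct.pack("!I", len(data)) + data)

    def recv_msg(sk):
        (size,) = struct.unpack("!I", _recvn(sk, 4))
        msg = phaul_pb2.phaul()
        msg.ParseFromString(_recvn(sk, size))
        return msg

    def sync_fs(msg_sk, htype):
        # ask the caller (e.g. LXD) to do the fs sync
        req = phaul_pb2.phaul()
        req.type = phaul_pb2.SYNCFS
        send_msg(msg_sk, req)

        reply = recv_msg(msg_sk)
        if reply.type == phaul_pb2.UNSUP:
            htype.migrate_fs()  # fall back to the driver's own backend
        elif reply.type == phaul_pb2.ERROR:
            raise Exception("caller failed to sync fs: " + reply.message)
        # SUCCESS means the caller moved the filesystem itself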
I can suggest slightly different semantics for --msgfd:
* We can use it unconditionally (if it was specified) for diagnostic
messages (e.g. ERROR, SUCCESS) and interaction with the parent.
* We can encapsulate more complicated logic (e.g. SYNCFS) inside a
specific phaul module (lxc in your case). The lxc module will create a
dedicated fs driver (similar to fs_haul_subtree or fs_haul_ploop)
which will send messages through msgfd instead of doing the actual
work itself -- see the sketch below.
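Something like this, reusing the send_msg/recv_msg helpers from the
sketch above; the method names are modeled loosely on the existing
fs_haul_* drivers and may not match the real interface:

    # An fs driver that delegates the actual copying to the parent
    # over msgfd instead of doing it itself.
    class fs_haul_msgfd(object):
        def __init__(self, msg_sk):
            self.msg_sk = msg_sk

        def start_migration(self):
            pass  # nothing to pre-copy, the parent owns the fs

        def next_iteration(self):
            pass  # likewise, no per-iteration fs work on our side

        def stop_migration(self):
            # the container is frozen now; ask the parent (e.g. lxd)
            # to do the final fs sync and wait for the verdict
            req = phaul_pb2.phaul()
            req.type = phaul_pb2.SYNCFS
            send_msg(self.msg_sk, req)
            reply = recv_msg(self.msg_sk)
            if reply.type != phaul_pb2.SUCCESS:
                raise Exception("final fs sync failed: " + reply.message)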
It is not much better than your two ways actually, but it is easier to
implement and it will not affect other phaul modules. Some changes to
fs/fs_receiver handling in p_haul_iters/p_haul_service are needed for
such an implementation. msgfd is definitely needed, but I can't suggest
a good design for it right away.
Btw, I like the idea of using protobuf for msgfd.