[CRIU] p.haul and lxc

Pavel Emelyanov xemul at parallels.com
Fri Nov 14 06:34:51 PST 2014


On 11/14/2014 06:05 PM, Tycho Andersen wrote:
> On Fri, Nov 14, 2014 at 01:09:04PM +0400, Pavel Emelyanov wrote:
>> On 11/14/2014 01:06 AM, Tycho Andersen wrote:
>>> Hi all,
>>>
>>> I've been looking at p.haul a bit and thinking about how we might improve it for
>>> use with lxc. Based on my read of the code, I think there are two conceptual
>>> changes that would be needed:
>>>
>>> 1. We need to add incremental dump support to lxc-checkpoint, so that p.haul
>>>    can shell out to lxc-checkpoint, instead of calling criu directly. This
>>>    means that p.haul doesn't need to know about lxc internals (e.g. how veths,
>>>    ttys are set up and configured) in order to do its thing.
>>
>> Makes sense to me. This also means that p.haul should do the final dump via
>> lxc-checkpoint too. Which in turn means that we should move more stuff into
>> the htype-s, not just the atomic callbacks. But I'd like to make it so that
>> p.haul keeps the ability to live-migrate just a task, w/o the lxc/docker/openvz
>> backends. This would require p.haul to still call criu directly.
> 
> Great! I'll work on this ASAP, although I think the patches will be
> mostly for lxc :)

That will be awesome! Thanks :)

>>> 2. We can get rid of any p.haul specific handling of cgroups for lxc, since
>>>    these can be restored via criu and lxc-checkpoint and lxc-checkpoint will
>>>    try to do the right thing w.r.t. multiple containers with the same name or
>>>    any other cgroup collisions.
>>
>> Ah, yes :) Cgroups should have left p.haul a long time ago. They stay there
>> simply because nobody has had time to rip them out.
>>
>>
>> I would add one more thing. Consider the "reattach" problem you were solving
>> for plain lxc-restore. The restored tasks should become the lxc daemon's
>> children, not criu's. And we introduced --restore-sibling for that.
>>
>> The same should be true for p.haul. Right now restored tasks become children
>> of the p.haul-service process. This is not good. We should make them children
>> of the lxc daemon again. From my perspective this can be implemented in two ways.
>>
>> 1. p.haul is not a utility, but a library that the lxc daemon links with and calls.
>> In this case criu will become a child of the daemon and with --restore-sibling
>> will restore tasks as the daemon's kids.
>>
>> 2. p.haul-service is a utility, the lxc daemon fork()-s it, but then p.haul-service
>> forks criu as the lxc daemon's child again, so that criu, in turn, restores
>> tasks with --restore-sibling. This is hard to implement in python (CLONE_PARENT
>> doesn't exist there), but possible.
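
(As an aside: approach 2 would need roughly the sketch below. This is an
illustration only, not p.haul code -- the syscall number is x86_64-specific,
and mixing a raw clone() with the python runtime is exactly what makes this
hard.)

  # Illustration only: a fork()-like clone with CLONE_PARENT via ctypes,
  # since python's os module exposes no such flag.
  import ctypes, os

  libc = ctypes.CDLL("libc.so.6", use_errno=True)
  CLONE_PARENT = 0x00008000
  SIGCHLD = 17
  SYS_clone = 56  # x86_64 syscall number

  def fork_as_sibling():
      # clone() with a NULL child stack behaves like fork(); CLONE_PARENT
      # makes the new task a child of *our* parent (the lxc daemon),
      # not of this p.haul-service process.
      pid = libc.syscall(SYS_clone, CLONE_PARENT | SIGCHLD, 0, 0, 0, 0)
      if pid < 0:
          raise OSError(ctypes.get_errno(), "clone failed")
      return pid  # 0 in the child, the child's pid in the parent

  if fork_as_sibling() == 0:
      # The child is the lxc daemon's kid; exec criu, which then restores
      # the tasks with --restore-sibling. Image path is a placeholder.
      os.execvp("criu", ["criu", "restore", "-D", "/path/to/images",
                         "--restore-sibling", "--restore-detached"])
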
> 
> I realized I had overlooked this last night, but you're right that it
> is a sticky issue. Fortunately, at least in the lxc case I think it is
> easy, it looks something like this:
> 
> * The top level daemon spawns a p.haul server, to receive the
>   migrated container.

And optionally (maybe mandatory in your case) gives it the socket
to communicate over.

> * The p.haul server receives the deltas and the final checkpoint
>   through the normal iteration process.

Agree, this is what it does now.

> * The p.haul server uses the lxc python api to do ->restore(), which
>   by default execs criu, so the p.haul server process is replaced by
>   the criu process, which does CLONE_PARENT because of
>   --restore-sibling, and everything is happy.

Wow, you're right. If we delegate the ->restore callback to the parent
LXC process, then we've solved the reattach issue :)
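
So on the dst side the tail of the migration is just an exec. A minimal
sketch (the helper name and flags mix are mine, not the actual lxc api):

  import os

  def restore_in_place(images_dir):
      # No fork(): execvp() replaces the p.haul server process image,
      # so criu inherits the server's parent -- the lxc daemon. With
      # --restore-sibling the restored tasks then become the daemon's
      # children, exactly as with plain lxc-restore.
      os.execvp("criu", ["criu", "restore", "-D", images_dir,
                         "--restore-sibling", "--restore-detached"])
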

> For our purposes, we'd then use p.haul as an executable (although it
> would be fine as a library, we could just write a little wrapper to
> invoke it as a process).
> 
>> Thinking more about p.haul + docker, I tend to think that the 1st approach is
>> worse, as it would require rewriting most of p.haul in C (as nobody
>> links with python) :)
>>
>> What do you think?
>>
>>> This would require a slight architectural change, since lxc-checkpoint is
>>> setuid and execs criu, rather than using the service. However, since the
>>> service mechanism is just to get around this problem, I think it should be ok.
>>>
>>> Another question is what to do about rewriting the images. Based on our last
>>> thread, we decided that in-process (e.g. while criu is restoring) rewriting is
>>> most efficient, so we want to pass some --crit-flags to criu to tell it how to
>>> rewrite things. Is p.haul the thing that would decide how to rewrite e.g.
>>> cpusets, or would that be at some higher level?
>>
>> Absolutely. The existing p.haul options handling allows specifying and pushing
>> arbitrary flags to arbitrary p.haul sub-modules. In particular, the criu-api one
>> can request any crit flags.
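
(Not the actual p.haul code, but the forwarding idea is roughly this:)

  import argparse

  # Rough idea of per-module flag forwarding; the real p.haul option
  # names and module layout differ.
  parser = argparse.ArgumentParser()
  parser.add_argument("--criu-opt", action="append", default=[],
                      help="extra flag handed to the criu-api sub-module")
  opts, _ = parser.parse_known_args()
  # The criu-api module then splices opts.criu_opt (e.g. the crit
  # rewriting flags discussed above) into the criu invocation it builds.
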
>>
>>> Finally, a minor conceptual change is that it would be nice to be able to set
>>> up the channel that p.haul communicates over (e.g. a TLS socket or so), vs.
>>> just having p.haul do a connect() over a raw socket. If nothing else, I think
>>> we can just implement some other version of the `rpc_proxy` class to do this.
>>
>> Yes, Ruslan is working on it. He implemented the ssh tunnel, but I agree that
>> there should be an option to use a pre-established channel.
>>
>> Also note one thing. Currently in p.haul there are two channels -- one for RPC
>> commands and the other for memory (pre-)dumps. The latter is the socket
>> that is fed to criu's pre-dump, dump and page-server actions. With a
>> pre-established channel we'll have to do something about that. And pushing
>> control commands AND memory over the same socket doesn't seem like a good
>> solution to me.
> 
> Yes, agreed. I think the best way is to allow users (perhaps via some
> plugins to the library, if not just executable arguments) to spawn new
> sockets, and then just have p.haul ask that plugin for a socket to the
> server; maybe with some ordering like: the first socket is the control
> socket, and the type of every socket after that is negotiated over the
> control socket. We're interested in spawning only authenticated (TLS)
> sockets, so a simple connect() won't work for us.

Yes, I agree that a plain socket is not the way to go. It was done this way
in p.haul just for simplicity. I couldn't quickly find any secure proxies,
so I decided to make the proof-of-concept on a plain connect().
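
Something like the sketch below would fit, I think (all names here are made
up, not an existing p.haul interface):

  import socket, ssl

  # A pluggable connection factory: the caller mints authenticated
  # sockets; p.haul asks for the control channel first and negotiates
  # the purpose of each later socket over it.
  class TlsChannelPlugin:
      def __init__(self, host, port, cafile):
          self.host, self.port = host, port
          self.ctx = ssl.create_default_context(cafile=cafile)

      def open_channel(self):
          raw = socket.create_connection((self.host, self.port))
          return self.ctx.wrap_socket(raw, server_hostname=self.host)

  plugin = TlsChannelPlugin("dst.example.com", 9999, "ca.pem")
  ctl = plugin.open_channel()       # 1st socket: RPC/control
  mem = plugin.open_channel()       # 2nd socket: memory pages
  ctl.sendall(b"CHANNEL memory\n")  # made-up negotiation message
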

If p.haul uses LXC's sockets and uses LXC as the "checkpoint-restore API",
then the workflow would look like this:

  src p.haul tells the dst one: "start page server"
  src p.haul tells the local "criu api (lxc daemon)": start pre-dump

After these two steps the criu page server on dst and criu pre-dump on the
src should be connected. Can the LXC daemon provide this? Note that it would
not be nice if a new socket were created for every such iteration -- there
can be several iterations.
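
(For reference, the two steps sketched with subprocess; pid, image paths,
address and port are placeholders:)

  import subprocess

  # dst: start a criu page server waiting for memory pages.
  subprocess.Popen(["criu", "page-server",
                    "--images-dir", "/dst/imgs/1", "--port", "27"])

  # src: pre-dump task 1234's memory, sending the pages to dst.
  subprocess.check_call(["criu", "pre-dump", "-t", "1234",
                         "--images-dir", "/src/imgs/1",
                         "--page-server", "--address", "dst.example.com",
                         "--port", "27"])
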

Hmm...

I guess this can be solved if, during the LXC-to-LXC migration handshake,
they open two sockets (3 in the FS migration case): one is fed to
p.haul-service, the 2nd to criu pre-dump and criu page-server.

At the same time, fork() + exec() of criu on every iteration doesn't sound
nice either (it can take long). We have the "swrk" mode of criu -- criu
gets a socket and reads RPC commands from it instead of parsing command
line arguments. Page-server start, pre-dump, dump and restore all work
nicely through this mode. I guess we need to polish it in 1.4, document it
and use _it_ in the migration case. Does this sound OK to you?
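
For reference, spawning a swrk criu looks roughly like this (the RPC
traffic itself is the protobuf criu_req/criu_resp messages, omitted here):

  import os, socket

  # Spawn criu once in swrk mode and keep driving it over a socketpair
  # instead of doing fork()+exec() per iteration.
  ours, theirs = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
  theirs.set_inheritable(True)  # must survive the exec below (python3)
  if os.fork() == 0:
      ours.close()
      # criu swrk takes the fd number of its end of the pair.
      os.execvp("criu", ["criu", "swrk", str(theirs.fileno())])
  theirs.close()
  # Now serialize criu_req messages (page-server start, pre-dump, dump,
  # restore) onto `ours` and read the criu_resp replies back.
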

>> And one more thing about channels :) There are cases when we have to copy the
>> file system to the remote host. IIRC you used rsync in your demo :) So we will
>> need a 3rd channel. Can we make the channel set-up independent of the p.haul
>> caller, i.e. p.haul should be able to set up channels itself. Somehow.
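
(Today that 3rd channel is basically an rsync run; paths and host below are
placeholders:)

  import subprocess

  # Copy the container's rootfs, preserving hardlinks, ACLs and xattrs.
  subprocess.check_call(["rsync", "-aHAX", "/var/lib/lxc/c1/rootfs/",
                         "dst.example.com:/var/lib/lxc/c1/rootfs/"])
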
>>
>>> Do these sound reasonable? Does anyone have any thoughts?
>>
>> Thanks for joining the p.haul efforts :)
> 
> No problem! I'm very excited to be working on p.haul and criu :)

:)

Thanks,
Pavel



