[CRIU] p.haul and lxc

Pavel Emelyanov xemul at parallels.com
Mon Nov 17 03:00:39 PST 2014


On 11/14/2014 09:26 PM, Tycho Andersen wrote:
> On Fri, Nov 14, 2014 at 07:47:15PM +0400, Pavel Emelyanov wrote:
>> On 11/14/2014 08:04 PM, Tycho Andersen wrote:
>>
>>>> If p.haul uses LXC's sockets and uses LXC as the "checkpoint-restore API",
>>>> then the workflow would look like this.
>>>>
>>>>   src p.haul says to dst one "start page server"
>>>>   src p.haul says to local "criu api (lxc daemon)" -- start pre-dump
>>>>
>>>> After these two steps the criu page server on the dst and the criu
>>>> pre-dump on the src should be connected. Can the LXC daemon provide this?
>>>
>>> Yes, I think we can provide the authenticated socket (or just pass a
>>> message for criu as a proxy). In fact, the proxy method might be the
>>> easiest -- p.haul sends stuff to lxd, and then lxd forwards it on to the
>>> other lxd, which sends it back to the other end's p.haul.
>>
>> Wait, we seem to be talking about different sockets :) Maybe not, but let
>> me clarify the whole picture anyway :)
>>
>> The socket I'm talking about is the socket that will be used by criu
>> pre-dump to send the memory contents of tasks to the page server, not the
>> one that will be used by the p.haul ends to talk to each other.
>>
>> The in-progress picture should look like this
>>
>> src-LXD                                dst-LXD
>>  `- p.haul --[ channel for commands ]-- `- p.haul-service
>>  `- criu   --[  channel for memory  ]-- `- criu
>>                                         `- init <-- will get CLONE_PARENT by criu
>>                                             `- ...
>>
>> There are two network channels and four local ones, via which both
>> p.haul-s can talk to the LXD-s as a "CRIU API" and the LXD-s can make
>> calls to the criu-s.
>>
>> As far as the network channels are concerned:
>>
>> The 1st channel (for commands) can be implemented "via" the LXDs, since
>> it's nothing but synchronization of the pre-dump/dump/restore stages. But
>> the 2nd channel (for memory) should be just a socket for data
>> (authenticated and encrypted, but there's no need for a whole LXD in
>> between, from my POV).
>>
>> BTW, the same channel is currently used by p.haul to transfer non-memory 
>> images at the very end, so p.haul-s should "know" about it too.
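For concreteness, one pre-dump iteration over that memory channel looks
roughly like the sketch below when driven with plain exec. This is only an
illustration: the paths, port and pid are made up, and the option names
should be double-checked against the criu man page.

    import subprocess

    IMG_DIR = '/var/local/phaul-imgs'   # made-up image directory
    ADDR, PORT = '10.0.0.2', 27583      # made-up dst address and port
    INIT_PID = 1234                     # made-up pid of the container's init

    # dst side: accept the memory pages for this iteration
    page_server = subprocess.Popen(['criu', 'page-server',
                                    '-D', IMG_DIR, '--port', str(PORT)])

    # src side: push the tasks' memory to the dst page server
    subprocess.check_call(['criu', 'pre-dump', '-t', str(INIT_PID),
                           '-D', IMG_DIR, '--track-mem',
                           '--page-server', '--address', ADDR,
                           '--port', str(PORT)])

The final dump points at the same page server in the same way, and p.haul
pushes the rest of the images over that socket at the very end.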
> 
> I see. My concern is about the auth, actually. We're building some
> auth scheme (http-based certificates) into lxd, so we'll probably use
> a websocket as the p.haul command layer, and potentially the data
> layer. I think we can just write a python module that understands
> lxd's websocket scheme and exposes it as a file-like object in python,
> and that should be ok. (At least, based on my read of the current
> p.haul code, it looks like it should work.)
> 
> I agree, though, that in principle there is no reason (and ideally we
> wouldn't have) an lxd in the middle. The only reason to do it would be
> for some custom auth mechanism.

Ah, OK. Then it looks like we can just teach p.haul to take a list of
interconnected sockets... file descriptors :) and work on them.
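On the p.haul side this would amount to something as simple as the sketch
below. The fd numbers and the split into "command" and "memory" channels are
assumptions; whether the descriptor is a plain TCP socket or your websocket
wrapper shouldn't matter to p.haul, as long as it arrives already connected
and authenticated.

    import socket

    CMD_FD, MEM_FD = 3, 4   # made-up fd numbers agreed upon with the caller (LXD)

    def from_inherited_fd(fd):
        # re-create a socket object from a descriptor handed over by LXD
        return socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)

    cmd_sk = from_inherited_fd(CMD_FD)   # p.haul <-> p.haul-service commands
    mem_sk = from_inherited_fd(MEM_FD)   # criu pre-dump <-> criu page-server data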

>>>> Note that it will not be nice if a new socket is created for every
>>>> such iteration; there can be several iterations.
>>>
>>> I think the socket we give to p.haul would be for use exclusively by
>>> p.haul, so since it's not necessary, I don't think a new one would be
>>> created each time.
>>>
>>>
>>>> Hmm...
>>>>
>>>> I guess this can be solved if, during the LXC-to-LXC migration handshake,
>>>> they open two sockets (3 in the FS migration case): one is fed to the
>>>> p.haul-s, the 2nd to criu pre-dump and criu page-server.
>>>>
>>>> At the same time, a fork() + exec() of criu on every iteration doesn't
>>>> sound nice either (it can be slow). We have the "swrk" mode of criu --
>>>> it's when criu gets a socket and reads RPC commands from it instead of
>>>> parsing command line arguments. The page-server start, pre-dump, dump
>>>> and restore all work nicely through this mode. I guess we need to polish
>>>> it in 1.4, document it and use _it_ in the migration case. Does this
>>>> sound OK to you?
>>>
>>> Ah, that's interesting. I hadn't thought about the multiple forks
>>> being expensive. 
>>
>> Fork()-s -- no. Execve()-s will be (or at least can be) :)
> 
> Ah, ok. Is this because it has to remap the binary every time, or look
> through the path?

If everything is in the cache, then there's no problem. But on a loaded
system the caches can get shrunk, and in that case the kernel will not
only have to look up the path and read the binary, but repeat this for
every single library it depends on. This may take time, and if we're in
the middle of a live migration any delay makes the numbers worse.

>>> So we'd start lxc-checkpoint in some sort of daemon
>>> mode, which would then read rpc commands over the socket from p.haul
>>> until the final dump was done? Then on the restore side I guess it
>>> would just be the same single command thing.
>>
>> Not a single one, unfortunately. During the iterations the destination
>> LXD will have to ask CRIU to start page servers to accept the memory pages.
>>
>> Can LXD fork criu in swrk mode and just forward to it anything that
>> comes from p.haul, without de/en-coding the contents?
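I mean something as dumb as the loop below (just a sketch): LXD would simply
shuttle bytes between the connection coming from p.haul and criu's swrk
socket, never parsing the RPC itself.

    import select

    def forward(a, b, bufsize=65536):
        # blindly copy bytes in both directions until one side closes
        while True:
            ready, _, _ = select.select([a, b], [], [])
            for s in ready:
                data = s.recv(bufsize)
                if not data:
                    return
                (b if s is a else a).sendall(data)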
> 
> Yes, that is one option. Ideally we'd be able to connect them
> directly, via some custom auth implementation that we can plug into
> (or wrap) p.haul with.

OK, so basically we have two major things to work on.

The first is to feed sockets to p.haul from the caller; this is fairly easy.
The 2nd is to teach p.haul to use LXD as the C/R API. This would eventually
bring us to the exec() problem, but that, in turn, can be solved by using
CRIU's swrk mode.
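A rough sketch of the swrk part from python is below. The rpc_pb2 module is
assumed to be generated with protoc from criu's rpc.proto, and the exact
message/field names should be checked against that file; the point is only
that starting one criu per migration and feeding it requests over a
socketpair avoids the repeated exec()-s.

    import socket, subprocess
    import rpc_pb2            # assumed: protoc-generated from criu's rpc.proto

    def spawn_criu_swrk():
        # criu inherits one end of the socketpair and reads RPC requests
        # from it instead of parsing command-line arguments
        us, them = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
        swrk = subprocess.Popen(['criu', 'swrk', str(them.fileno())],
                                close_fds=False)  # keep the fd across exec (py2 default)
        them.close()
        return swrk, us

    def do_request(sk, req):
        # one page-server / pre-dump / dump / restore request per call
        sk.send(req.SerializeToString())
        resp = rpc_pb2.criu_resp()
        resp.ParseFromString(sk.recv(1 << 16))
        return resp

The dst p.haul-service would then ask for a page server before every
iteration, and the src side for the pre-dumps and the final dump, all over
the same socket to the same criu process.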

:)

Thanks,
Pavel


