[CRIU] Fwd: Checkpoint failure on arm64 platform

Pavel Emelyanov xemul at parallels.com
Tue Dec 22 02:43:55 PST 2015


On 12/22/2015 09:03 AM, Vijay Kilari wrote:
> On Mon, Dec 21, 2015 at 11:41 PM, Vijay Kilari <vijay.kilari at gmail.com> wrote:
>> On Mon, Dec 21, 2015 at 6:47 PM, Pavel Emelyanov <xemul at parallels.com> wrote:
>>>
>>>> (00.106975) Error (parasite-syscall.c:815): Can't retrieve FD from socket
>>>> pie: Daemon waits for command
>>>> (00.106999) Wait for ack 15 on daemon socket
>>>> (00.107036) Error (parasite-syscall.c:298): Message reply from daemon
>>>> is trimmed (12/0)
>>>> (00.107047) Error (cr-dump.c:1216): Can't get proc fd (pid: 1456)
>>>> (00.107066) Waiting for 1456 to trap
>>>> (00.107080) Daemon 1456 exited trapping
>>>>
>>>> In the kernel's readlinkat syscall, I added a printk to see the context
>>>> in which /proc/self is read. It shows the same process ID, 1456, and
>>>> the name 'tail', which is the process running inside the container.
>>>>
>>>> [ 6461.973166] In readlinkat error < 0 -9 pid 1456 name tail
>>>
>>> Heh.. Would you dig this deeper to find out where the EBADF comes from?
>>
>> The parameter values received by the readlink system call are wrong.
>> It looks like there is a mismatch in the system call numbers: readlink()
>> ends up calling readlinkat().
>>
>> Need to debug more and confirm it.
> 
> sys_readlink has syscall number 78, which is actually the syscall number of
> sys_readlinkat. Because the parameters do not match, the system call
> arguments are misinterpreted and the call fails.
> 
> Changes as below worked
> 
> pie/parasite.c
> 
> -ret = sys_readlink("/proc/self", buf, sizeof(buf));
> +ret = sys_readlinkat(AT_FDCWD, "/proc/self", buf, sizeof(buf));
> 
> arch/arm/syscall.def
> 
> -readlink                         78      85       (const char *path, char *buf, int bufsize)
> +readlinkat                       78      85       (int fd, const char *path, char *buf, int bufsize)

Nice catch :) Would you, please, submit a patch for this?
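
For reference, a minimal user-space sketch of the replacement call. It is not
the CRIU parasite code: it goes through the libc syscall() wrapper with
SYS_readlinkat, and the helper name read_proc_self is made up for
illustration. arm64 has no readlink syscall of its own, which is why the
lookup has to go through readlinkat with AT_FDCWD. Separately, assuming the
second numeric column in syscall.def is the 32-bit ARM number, 85 is readlink
there, so that column probably needs the 32-bit readlinkat number as well;
worth checking before sending the patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE       /* for the syscall() declaration */
#endif
#include <assert.h>
#include <fcntl.h>        /* AT_FDCWD */
#include <sys/syscall.h>  /* SYS_readlinkat */
#include <sys/types.h>    /* ssize_t */
#include <unistd.h>       /* syscall() */

/*
 * Hypothetical helper: resolve the /proc/self symlink with a raw
 * readlinkat syscall. Returns the link length, or -1 with errno set.
 */
ssize_t read_proc_self(char *buf, size_t size)
{
    assert(buf != NULL && size > 1);

    /*
     * readlinkat(dirfd, path, buf, bufsize): with AT_FDCWD as dirfd it
     * behaves like readlink(path, buf, bufsize). One byte is kept back
     * for the terminator.
     */
    long ret = syscall(SYS_readlinkat, AT_FDCWD, "/proc/self", buf, size - 1);
    if (ret < 0)
        return -1;

    buf[ret] = '\0';      /* the kernel does not NUL-terminate the result */
    return (ssize_t)ret;
}
```

Called with a small buffer, the helper should fill it with the caller's PID
as a decimal string, since /proc/self is a symlink to the PID directory.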

> After these changes, plus changing PAGE_SIZE to 64KB, no errors are
> reported during checkpoint.

Good :) Christopher (in Cc) once started to fix the page-size problem for
arm and aarch64.
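
On the page-size side, the usual way to avoid a compile-time PAGE_SIZE is to
ask the kernel at run time. A small sketch of that idea, not taken from CRIU
or from Christopher's patches; both helper names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>   /* sysconf(), _SC_PAGESIZE */

/*
 * Hypothetical helper: page size reported by the running kernel
 * (for example 4096 or 65536 on arm64, depending on kernel config).
 */
size_t runtime_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    assert(sz > 0);
    return (size_t)sz;
}

/* Round len up to a multiple of page; page must be a power of two. */
size_t page_align_up(size_t len, size_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return (len + page - 1) & ~(page - 1);
}
```

With something like this, sizes are rounded with
page_align_up(len, runtime_page_size()) rather than with a constant that is
only right for one kernel configuration.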

> However, restore fails. Below are the steps I tried:
> 
> ubuntu at ubuntu:~/criu/criu-1.8$ sudo docker ps
> CONTAINER ID        IMAGE               COMMAND             CREATED
>          STATUS              PORTS               NAMES
> ubuntu at ubuntu:~/criu/criu-1.8$ sudo docker run -d
> justinzh/arm64-vivid:latest tail -f /dev/null
> 2ee07bc4adf3493473801651ce8f030a1fcc778bc79eae79fedb3afde39d7438
> ubuntu at ubuntu:~/criu/criu-1.8$ sudo docker ps
> CONTAINER ID        IMAGE                         COMMAND
>  CREATED             STATUS              PORTS               NAMES
> 2ee07bc4adf3        justinzh/arm64-vivid:latest   "tail -f /dev/null"
>  5 seconds ago       Up 4 seconds
> adoring_kowalevski
> ubuntu at ubuntu:~/criu/criu-1.8$ sudo docker checkpoint 2ee07bc4adf3
> 2ee07bc4adf3
> ubuntu at ubuntu:~/criu/criu-1.8$
> ubuntu at ubuntu:~/criu/criu-1.8$ sudo docker ps
> CONTAINER ID        IMAGE               COMMAND             CREATED
>          STATUS              PORTS               NAMES
> ubuntu at ubuntu:~/criu/criu-1.8$ sudo docker restore 2ee07bc4adf3
> Error response from daemon: Cannot restore container 2ee07bc4adf3:
> cantstart: Cannot start container
> 2ee07bc4adf3493473801651ce8f030a1fcc778bc79eae79fedb3afde39d7438: criu
> failed: type NOTIFY errno 0
> log file: /var/lib/docker/0.0/containers/2ee07bc4adf3493473801651ce8f030a1fcc778bc79eae79fedb3afde39d7438/criu.work/restore.log
> Error: failed to restore one or more containers
> ubuntu at ubuntu:~/criu/criu-1.8$
> 
> Questions
> 
> 1) After checkpoint, 'docker ps' does not show the checkpointed
> container. Is this expected behaviour?

It is. Proper support for docker should come soon; the feature is being
discussed here:

https://github.com/docker/docker/pull/13602

> 2) Restore fails at the clone() call.
> 
> restore.log
> --------------
> (00.006447) Warn  (cr-restore.c:1075): Set CLONE_PARENT | CLONE_NEWPID
> but it might cause restore problem,because not all kernels support
> such clone flags combinations!

This shouldn't be the case for the 4.2 kernel you're using.

> (00.006469) Forking task with 1 pid (flags 0x6c028000)
> (00.168693) Error (cr-restore.c:1175): Can't fork for 1: Invalid argument
> (00.236429) Error (cr-restore.c:1995): Restoring FAILED.

OK, the flags are 0x6c028000, that is NEWPID, NEWNET, NEWUTS, NEWIPC,
NEWNS and CLONE_PARENT; on 4.2 this shouldn't cause any problems. Would
you strace the criu restore command to see whether it actually issues the
system call at all?
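
The decoding of 0x6c028000 above can be reproduced with a few lines of C.
This is only a sketch: the table hard-codes the CLONE_* values as defined in
<linux/sched.h>, covers just the six flags named here, and
decode_clone_flags is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Values mirror the CLONE_* definitions in <linux/sched.h>. */
struct flag_name {
    unsigned long bit;
    const char *name;
};

static const struct flag_name clone_flag_names[] = {
    { 0x00008000UL, "CLONE_PARENT" },
    { 0x00020000UL, "CLONE_NEWNS"  },
    { 0x04000000UL, "CLONE_NEWUTS" },
    { 0x08000000UL, "CLONE_NEWIPC" },
    { 0x20000000UL, "CLONE_NEWPID" },
    { 0x40000000UL, "CLONE_NEWNET" },
};

/*
 * Hypothetical helper: write "A | B | ..." into out and return the bits
 * that were not written out (0 means every bit was recognised and fit).
 */
unsigned long decode_clone_flags(unsigned long flags, char *out, size_t size)
{
    size_t used = 0;
    size_t count = sizeof(clone_flag_names) / sizeof(clone_flag_names[0]);

    assert(out != NULL && size > 0);
    out[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        if (!(flags & clone_flag_names[i].bit))
            continue;

        int n = snprintf(out + used, size - used, "%s%s",
                         used ? " | " : "", clone_flag_names[i].name);
        if (n < 0 || (size_t)n >= size - used) {
            out[used] = '\0';   /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        flags &= ~clone_flag_names[i].bit;
    }
    return flags;
}
```

For the value from restore.log the helper leaves no unknown bits, so the
EINVAL is not explained by a stray flag in the mask itself.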

-- Pavel

