[CRIU] Fwd: Checkpoint failure on arm64 platform
Vijay Kilari
vijay.kilari at gmail.com
Tue Dec 22 03:48:41 PST 2015
On Tue, Dec 22, 2015 at 4:13 PM, Pavel Emelyanov <xemul at parallels.com> wrote:
> On 12/22/2015 09:03 AM, Vijay Kilari wrote:
>> On Mon, Dec 21, 2015 at 11:41 PM, Vijay Kilari <vijay.kilari at gmail.com> wrote:
>>> On Mon, Dec 21, 2015 at 6:47 PM, Pavel Emelyanov <xemul at parallels.com> wrote:
>>>>
>>>>> (00.106975) Error (parasite-syscall.c:815): Can't retrieve FD from socket
>>>>> pie: Daemon waits for command
>>>>> (00.106999) Wait for ack 15 on daemon socket
>>>>> (00.107036) Error (parasite-syscall.c:298): Message reply from daemon
>>>>> is trimmed (12/0)
>>>>> (00.107047) Error (cr-dump.c:1216): Can't get proc fd (pid: 1456)
>>>>> (00.107066) Waiting for 1456 to trap
>>>>> (00.107080) Daemon 1456 exited trapping
>>>>>
>>>>> In the kernel in readlinkat syscall, I have put printk to know the context
>>>>> in which /proc/self is read. It shows the same process id 1456 and
>>>>> name as 'tail'
>>>>> which is the process running inside container.
>>>>>
>>>>> [ 6461.973166] In readlinkat error < 0 -9 pid 1456 name tail
>>>>
>>>> Heh.. Would you dig this deeper to find out where the EBADF comes from?
>>>
>>> The parameter values received by the readlink system call are wrong.
>>> It looks like there is a mismatch in the system call numbers: readlink()
>>> ends up calling readlinkat().
>>>
>>> Need to debug more and confirm it.
>>
>> sys_readlink has syscall number 78, which is actually the syscall number of
>> sys_readlinkat. Because the parameters do not match, the system call
>> arguments are misinterpreted and the call fails.
>>
>> The changes below worked:
>>
>> pie/parasite.c
>>
>> -ret = sys_readlink("/proc/self", buf, sizeof(buf));
>> +ret = sys_readlinkat(AT_FDCWD, "/proc/self", buf, sizeof(buf));
>>
>> arch/arm/syscall.def
>>
>> -readlink 78 85 (const char *path, char *buf, int bufsize)
>> +readlinkat 78 85 (int fd, const char *path, char *buf, int bufsize)
>
> Nice catch :) Would you, please, submit a patch for this?
Yes, I will do so
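
For reference, the readlink fix quoted above can be sketched in plain C.
This is only a minimal illustration, assuming a Linux system with /proc
mounted. The helper names readlink_via_at() and proc_self_pid() are made up
for this example and are not CRIU code; the parasite uses its own sys_*
wrappers rather than libc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>    /* AT_FDCWD */
#include <stdlib.h>   /* strtol */
#include <unistd.h>   /* readlinkat, ssize_t */

/*
 * Hypothetical helper (not part of CRIU): behaves like readlink(), but
 * goes through readlinkat(), which is what syscall number 78 really is
 * on arm64.
 */
static ssize_t readlink_via_at(const char *path, char *buf, size_t bufsize)
{
	assert(path != NULL && buf != NULL);
	/*
	 * AT_FDCWD makes a relative path resolve against the current
	 * directory, as plain readlink() would; an absolute path such as
	 * /proc/self ignores the directory fd entirely.
	 */
	return readlinkat(AT_FDCWD, path, buf, bufsize);
}

/* Returns the pid that /proc/self points to, or -1 on failure. */
static long proc_self_pid(void)
{
	char buf[32];
	ssize_t n = readlink_via_at("/proc/self", buf, sizeof(buf) - 1);

	if (n < 0)
		return -1;
	buf[n] = '\0';	/* readlinkat() does not NUL-terminate */
	return strtol(buf, NULL, 10);
}
```

Calling proc_self_pid() from a process should give the same value as
getpid(), which is an easy way to check that the arguments arrive in the
right order.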
>
>> After these changes, plus changing PAGE_SIZE to 64KB, no errors are
>> reported during checkpoint.
>
> Good :) Christopher (in Cc) once started to fix the page-size problem for
> arm and aarch64.
>
>> However, restore fails. Below are the steps I tried:
>>
>> ubuntu@ubuntu:~/criu/criu-1.8$ sudo docker ps
>> CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
>> ubuntu@ubuntu:~/criu/criu-1.8$ sudo docker run -d justinzh/arm64-vivid:latest tail -f /dev/null
>> 2ee07bc4adf3493473801651ce8f030a1fcc778bc79eae79fedb3afde39d7438
>> ubuntu@ubuntu:~/criu/criu-1.8$ sudo docker ps
>> CONTAINER ID  IMAGE                        COMMAND              CREATED        STATUS        PORTS  NAMES
>> 2ee07bc4adf3  justinzh/arm64-vivid:latest  "tail -f /dev/null"  5 seconds ago  Up 4 seconds         adoring_kowalevski
>> ubuntu@ubuntu:~/criu/criu-1.8$ sudo docker checkpoint 2ee07bc4adf3
>> 2ee07bc4adf3
>> ubuntu@ubuntu:~/criu/criu-1.8$
>> ubuntu@ubuntu:~/criu/criu-1.8$ sudo docker ps
>> CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
>> ubuntu@ubuntu:~/criu/criu-1.8$ sudo docker restore 2ee07bc4adf3
>> Error response from daemon: Cannot restore container 2ee07bc4adf3: cantstart: Cannot start container 2ee07bc4adf3493473801651ce8f030a1fcc778bc79eae79fedb3afde39d7438: criu failed: type NOTIFY errno 0
>> log file: /var/lib/docker/0.0/containers/2ee07bc4adf3493473801651ce8f030a1fcc778bc79eae79fedb3afde39d7438/criu.work/restore.log
>> Error: failed to restore one or more containers
>> ubuntu@ubuntu:~/criu/criu-1.8$
>>
>> Questions
>>
>> 1) After checkpoint, 'docker ps' does not show the checkpointed
>> container. Is this expected behaviour?
>
> It is. The proper support for docker should come some time soon, the
> feature is being discussed here:
>
> https://github.com/docker/docker/pull/13602
>
>> 2) Restore fails at the clone() call.
>>
>> restore.log
>> --------------
>> (00.006447) Warn (cr-restore.c:1075): Set CLONE_PARENT | CLONE_NEWPID
>> but it might cause restore problem,because not all kernels support
>> such clone flags combinations!
>
> This shouldn't be the case for 4.2 you're using.
>
>> (00.006469) Forking task with 1 pid (flags 0x6c028000)
>> (00.168693) Error (cr-restore.c:1175): Can't fork for 1: Invalid argument
>> (00.236429) Error (cr-restore.c:1995): Restoring FAILED.
>
> OK, the flags are 0x6c028000, that is NEWPID, NEWNET, NEWUTS, NEWIPC,
> NEWNS and CLONE_PARENT, on 4.2 this shouldn't cause any problems. Would
> you strace the criu restore command to see whether it goes calling the
> system call at all?
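
The flag decoding quoted above can be double-checked with a short sketch.
This is only an illustration: decode_clone_flags() and the table are made
up for this example and are not part of CRIU. The numeric values are the
CLONE_* constants from the kernel's include/uapi/linux/sched.h, written out
literally so that the example does not depend on libc feature macros.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
	unsigned long bit;
	const char *name;
};

/* Values as defined in the kernel's include/uapi/linux/sched.h. */
static const struct flag_name clone_flag_names[] = {
	{ 0x00008000UL, "CLONE_PARENT"  },
	{ 0x00020000UL, "CLONE_NEWNS"   },
	{ 0x04000000UL, "CLONE_NEWUTS"  },
	{ 0x08000000UL, "CLONE_NEWIPC"  },
	{ 0x10000000UL, "CLONE_NEWUSER" },
	{ 0x20000000UL, "CLONE_NEWPID"  },
	{ 0x40000000UL, "CLONE_NEWNET"  },
};

/*
 * Hypothetical helper: writes the names of the known flags set in 'flags'
 * into 'out', separated by '|', and returns whatever bits were not found
 * in the table above.
 */
static unsigned long decode_clone_flags(unsigned long flags,
					char *out, size_t outsz)
{
	size_t used = 0;
	size_t i;

	assert(out != NULL && outsz > 0);
	out[0] = '\0';
	for (i = 0; i < sizeof(clone_flag_names) / sizeof(clone_flag_names[0]); i++) {
		const struct flag_name *f = &clone_flag_names[i];

		if (!(flags & f->bit))
			continue;
		if (used < outsz) {
			/* snprintf() truncates safely if 'out' is too small */
			int n = snprintf(out + used, outsz - used, "%s%s",
					 used ? "|" : "", f->name);
			if (n > 0)
				used += (size_t)n;
		}
		flags &= ~f->bit;	/* mark this bit as explained */
	}
	return flags;
}
```

For 0x6c028000 this yields the six flags listed above and leaves no
unexplained bits, so the flag combination itself matches what the log
reports.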
The clone() system call is failing because arm64 requires the stack pointer
to be 16-byte aligned. With the change below in cr-restore.c, clone()
succeeds. I will send a patch for this as well. Many places inside test/zdtm
also require similar changes.
 struct cr_clone_arg {
 	/*
 	 * Reserve some space for clone() to locate arguments
 	 * and retcode in this place
 	 */
-	char stack[128] __attribute__((aligned (8)));
+	char stack[128] __attribute__((aligned (16)));
 	char stack_ptr[0];
 	struct pstree_item *item;
 	unsigned long clone_flags;
 	int fd;
 	CoreEntry *core;
 };
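
The effect of the alignment change can be illustrated with a small sketch.
This is not the CRIU code: struct clone_arg_sketch, is_stack_aligned() and
align_stack_down() are names invented for this example, and the struct is a
cut-down stand-in for cr_clone_arg. It relies on the GNU zero-length array
and aligned attribute extensions, like the original.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_ALIGN 16	/* alignment arm64 expects for the child stack */

/* Cut-down, hypothetical stand-in for struct cr_clone_arg. */
struct clone_arg_sketch {
	/*
	 * The attribute raises the alignment of the whole struct to 16, so
	 * 'stack' starts on a 16-byte boundary wherever the struct lives.
	 */
	char stack[128] __attribute__((aligned(STACK_ALIGN)));
	/* Top of the stack: 128 bytes past an aligned start is aligned too. */
	char stack_ptr[0];
	unsigned long clone_flags;
};

/* Returns non-zero if 'sp' is usable as a 16-byte aligned stack pointer. */
static int is_stack_aligned(const void *sp)
{
	return ((uintptr_t)sp % STACK_ALIGN) == 0;
}

/* Rounds an arbitrary address down to the previous 16-byte boundary. */
static void *align_stack_down(void *sp)
{
	return (void *)((uintptr_t)sp & ~(uintptr_t)(STACK_ALIGN - 1));
}
```

With aligned (8) the struct may legitimately be placed at an address that
is 8 modulo 16, in which case stack_ptr is not 16-byte aligned. For stacks
that come from elsewhere (for example the ones in test/zdtm), rounding the
top-of-stack address down as in align_stack_down() is one possible
approach.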
However, restore now fails with signal 11 when executing
vdso_fill_symtable(). Any ideas?
restore.log:
---------------
pie: vdso: Parsing at 0x3ff81110000 0x3ff81120000
pie: vdso: PT_LOAD p_vaddr: 0x0
pie: vdso: DT_HASH: 0x120
pie: vdso: DT_STRTAB: 0x1f8
pie: vdso: DT_SYMTAB: 0x150
pie: vdso: DT_STRSZ: 0x77
pie: vdso: DT_SYMENT: 0x18
pie: vdso: nbucket 0x3 nchain 0x7 bucket 0x3ff81110128 chain 0x3ff81110134
(01.062166) Error (cr-restore.c:1266): 24818 killed by signal 11
(01.103990) Error (cr-restore.c:1266): 24818 killed by signal 9
(01.294182) Error (cr-restore.c:1999): Restoring FAILED.
Regards
Vijay