[Devel] Re: Linux Checkpoint-Restart - v19
Jiro SEKIBA
jir at dependable-os.net
Sat Apr 3 02:03:34 PDT 2010
Hi,
Sorry for the late reply.
On 2010/03/30, at 12:05, Serge E. Hallyn wrote:
> Quoting Jiro SEKIBA (jir at dependable-os.net):
>> Hi
>>
>> On 2010/03/25, at 1:47, Serge E. Hallyn wrote:
>>
>>> Quoting Jiro SEKIBA (jir at dependable-os.net):
>>>>> If it doesn't work, can you please describe again the exact order of
>>>>> commands that you use and the reported error(s) ?
>>>>>
>>>> I'll let you know in any cases.
>>>>
>>>> Thank you very much for the advice
>>>
>>> Hi Jiro,
>>>
>>> Can you fetch the latest cr_tests
>>> (git clone git://git.sr71.net/~hallyn/cr_tests)
>>>
>>> and
>>> cd cr_tests; make; cd simple
>>> sh runtests.sh
>>>
>>> and tell me whether the second (restart --self) test succeeds?
>>> If it fails, can you send me the cr_*/log2 contents?
>>>
>>
>> I've tried on ckpt-v20 and the above test looks OK.
>> And looks like self_checkpointing is working fine so far.
>>
>> However, I'm still not able to restart external checkpoint correctly.
>>
>> Here are the program and scripts I used for the test.
>> I used user-cr ckpt-v20 branch for checkpoint/restart program.
>>
>> This time I disconnect the program from tty completely.
>>
>> ----------8<----------8<----------test.c----------8<----------8<----------
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> int main(void)
>> {
>> FILE *fp;
>> int i;
>> pid_t pid;
>> int st;
>>
>> if(fork()) {
>> return 0;
>
> Odd thing to do, not sure if you had a reason for it. Still,
> should be fine :)
>
>> } else {
>> waitpid(getppid(), &st, NULL);
>>
>> close(0);
>> close(1);
>> close(2);
>> setsid();
>>
>> if(fork()) {
>> return 0;
>> } else
>> waitpid(getppid(), &st, NULL);
>> }
>>
>> //unlink("/tmp/test.out");
>> fp = fopen("/tmp/test.out","w");
>>
>> for(i=0;i<10;i++) {
>> fprintf(fp,"%d\n",i);
>> fflush(fp);
>> sleep(1);
>> }
>>
>> fclose(fp);
>> return 0;
>> }
>> ----------8<----------8<----------test.c----------8<----------8<----------
>>
>> ----------8<----------8<----------checkpoint.sh----------8<----------8<----------
>> #!/bin/sh
>>
>> CLOG=checkpoint.log
>> RLOG=restart.log
>> rm -f $CLOG $RLOG
>>
>> ./test &
>> sleep 1
>> PID=$(ps x | grep test | grep -v grep |cut -f 2 -d' ')
>>
>> sleep 2
>> echo $PID > /cgroup/0/tasks
>>
>> echo FROZEN > /cgroup/0/freezer.state
>> ./checkpoint -l $CLOG -v $PID > ckpt.image
>>
>> mv /tmp/test.out /tmp/test.out.orig
>> cp /tmp/test.out.orig /tmp/test.out
>>
>> echo THAWED > /cgroup/0/freezer.state
>>
>> ./restart --pidns -l $RLOG -v -i ckpt.image;
>> ----------8<----------8<----------checkpoint.sh----------8<----------8<----------
>>
>> When I run the above script, I got following:
>>
>> # mount -t cgroup -o freezer cgroup /cgroup
>> # mkdir /cgroup/0
>> # sh checkpoint.sh
>> checkpoint id 8
>> Success
>>
>> Then, I'm expecting to see number 0 to 9 in /tmp/test.out, but
>> I only got 0 to 3, which is the state I froze and checkpointed the process.
>>
>> checkpoint.log and restart.log are empty.
>> I guess it means the programs worked fine.
>>
>> I attached the dmesg I got by the single session of the script.
>> It looks the restart tries to reopen /tmp/test.out.
>>
>> Could you give me any clues that I should check with?
>
> Hmm, with ckpt-v20 of both kernel and user, on a powerpc system, I get:
>
> elm3b203:/usr/src/jiro # sh checkpoint.sh
> checkpoint id 146
> Success
> elm3b203:/usr/src/jiro # ls
> checkpoint.log checkpoint.sh ckpt.image restart.log test test.c
> elm3b203:/usr/src/jiro # cat /tmp/test.out
> 0
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> 8
> 9
Hmm, OK. Thank you very much for testing the script.
So what I'm doing is not pointless so far.
>> My environment is Virtualbox VM.
>> I tried both with VT and without VT.
>> No virtualbox guest module is installed.
>
> What distro are you on?
>
I'm using Ubuntu 9.10.
I found that this Ubuntu release uses eglibc instead of glibc.
The version is eglibc-2.10.1.
> Anyway, two things to do. First, add '-d' to your restart flags, so
>
> restart --pidns -l $RLOG -vd -i ckpt.image
>
I got the following. It looks like the task somehow gets a SEGV right after restarting.
I have attached the whole log.
--------8<--------8<--------8<--------8<--------8<--------
<6113>number of tasks: 1
<6113>total tasks (including ghosts): 1
<6113>====== TASKS
<6113> [0] pid 6102 ppid 1 sid 0 creator 0
<6113>............
<6114>====== PIDS ARRAY
<6114>[0] pid 6102 ppid 1 sid 0 pgid 0
<6114>............
<6113>new pidns without init
<6113>forking coordinator in new pidns
<1>forking child vpid 6102 flags 0x1
<6102>root task pid 6102
<6102>pid 6102: pid 6102 sid 0 parent 1
<6114>c/r read input 16384
<6114>c/r read input 16384
<6114>c/r read input 16384
<6114>c/r read input 16384
<6102>about to call sys_restart(), flags 0
<1>forked child vpid 6102 (asked 6102)
<6114>c/r read input 16384
<6114>c/r read input 16384
...
<6114>c/r read input 16384
<6114>c/r read input 3605
<6114>c/r read input 0
<1>restart succeeded
<1>SIGCHLD: already collected
<1>task terminated with signal 11
<1>mimic sig 11
<1>c/r succeeded
<6113>SIGCHLD: already collected
<6113>task exited with status 0
--------8<--------8<--------8<--------8<--------8<--------
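For reference, the visible difference between the two debug logs is the final wait status of the restarted task: "task exited with status 0" on the working system, "task terminated with signal 11" here. A minimal sketch of how such a status can be decoded in C is below. The names describe_status and spawn_and_wait are made up for illustration and are not taken from user-cr, so the real restart code may well be organised differently.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: format a waitpid()-style status into buf, using the
 * same two wordings that appear in the restart debug logs. */
static const char *describe_status(int status, char *buf, size_t len)
{
    if (WIFEXITED(status))
        snprintf(buf, len, "task exited with status %d", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        snprintf(buf, len, "task terminated with signal %d", WTERMSIG(status));
    else
        snprintf(buf, len, "unexpected wait status 0x%x", (unsigned)status);
    return buf;
}

/* Hypothetical helper: fork a child that either exits with 0 (sig == 0) or
 * kills itself with the given signal, then return its raw wait status.
 * Returns -1 if fork() or waitpid() fails. */
static int spawn_and_wait(int sig)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (sig) {
            signal(sig, SIG_DFL);   /* make sure the default action applies */
            raise(sig);
        }
        _exit(0);
    }
    /* Only children can be waited for, so this is called from the parent. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

Calling describe_status(spawn_and_wait(SIGSEGV), buf, sizeof buf) should give the "terminated with signal" wording, and spawn_and_wait(0) the "exited with status 0" wording.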
> That will give you debugging info. For instance I get:
>
> checkpoint id 147
> <2507>number of tasks: 1
> <2507>total tasks (including ghosts): 1
> <2507>====== TASKS
> <2507> [0] pid 2497 ppid 1 sid 0 creator 0
> <2507>............
> <2507>new pidns without init
> <2507>forking coordinator in new pidns
> <2508>====== PIDS ARRAY
> <2508>[0] pid 2497 ppid 1 sid 0 pgid 0
> <2508>............
> <1>forking child vpid 2497 flags 0x1
> <1>forked child vpid 2497 (asked 2497)
> <2497>root task pid 2497
> <2497>pid 2497: pid 2497 sid 0 parent 1
> <2497>about to call sys_restart(), flags 0
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 8336
> <2508>c/r read input 0
> Success
> <1>restart succeeded
> <1>SIGCHLD: already collected
> <1>task exited with status 0
> <1>mimic ret 0
> <1>c/r succeeded
> <2507>SIGCHLD: already collected
> <2507>task exited with status 0
>
>
> The other thing is to restart frozen and attach strace or gdb to the
> restarted test before thawing. So perhaps
>
> # cc -g -o test test.c
> # sh checkpoint.sh
>
> Then when that has failed, do
>
> # mkdir /cgroup/1
> # restart -F /cgroup/1 -i ckpt.image
>
> That will hang. Then in another terminal, you can
>
> # gdb -se test -p `pidof test`
>
> and in a third terminal,
>
> # echo THAWED > /cgroup/1/freezer.state
>
> Now in gdb you can figure out where the task is and step through
> to see where it dies.
I attached gdb to the restarted process and found where I got the SEGV.
Here is the corresponding gdb log:
--------8<--------8<--------8<--------8<--------8<--------
0xb77eba50 in __nanosleep_nocancel () from /lib/tls/i686/cmov/libc.so.6
(gdb) n
Single stepping until exit from function __nanosleep_nocancel,
which has no line number information.
__sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:139
139 if (result == 0 && seconds != 0)
(gdb) n
148 }
(gdb)
main () at test.c:33
33 for(i=0;i<10;i++) {
(gdb)
34 fprintf(fp,"%d\n",i);
(gdb) s
__fprintf (stream=0x93b3008, format=0x8048801 "%d\n") at fprintf.c:27
27 __fprintf (FILE *stream, const char *format, ...)
(gdb)
33 done = vfprintf (stream, format, arg);
(gdb) s
_IO_vfprintf_internal (s=0x93b3008, format=0x8048801 "%d\n", ap=0xbf9c5448 "\004") at vfprintf.c:210
210 {
(gdb)
245 int save_errno = errno;
(gdb)
210 {
(gdb) n
245 int save_errno = errno;
(gdb) p save_errno
$1 = 2
(gdb) p errno
Cannot find thread-local variables on this target
(gdb) n
Program received signal SIGSEGV, Segmentation fault.
_IO_vfprintf_internal (s=0x93b3008, format=0x8048801 "%d\n", ap=0xbf9c5448 "\004") at vfprintf.c:245
245 int save_errno = errno;
--------8<--------8<--------8<--------8<--------8<--------
It looks like errno is missing. Hmm.
I have also attached the whole gdb log.
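One possible reading of the trace, which I have not confirmed: with glibc-style C libraries, errno is not a plain global but a per-thread lvalue, so the read in "int save_errno = errno;" goes through thread-local storage. A fault on that line, together with gdb's "Cannot find thread-local variables on this target", would be consistent with the thread-local state not being valid after restart. The small C sketch below only illustrates the save/restore idiom and the fact that errno has an address of its own. The function names are made up for illustration and are not taken from libc or from the checkpoint/restart code.

```c
#include <errno.h>

/* Hypothetical helper: return the address that backs errno for the calling
 * thread. Taking this address is itself a thread-local access. */
static int *errno_address(void)
{
    return &errno;
}

/* Hypothetical callee that overwrites errno, standing in for any library
 * call made between the save and the restore. */
static void clobber_errno(void)
{
    errno = EINVAL;
}

/* Same idiom as the "int save_errno = errno;" line in the trace: remember
 * errno, run the callee, then put the old value back. Returns the saved
 * value so the caller can inspect it. */
static int call_preserving_errno(void (*fn)(void))
{
    int save_errno = errno;   /* the kind of read that faulted in the trace */

    fn();
    errno = save_errno;
    return save_errno;
}
```

On Linux, ENOENT is 2, which matches the value gdb printed for save_errno.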
Thanks,
regards,
>
> thanks,
> -serge
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gdb.log
Type: application/octet-stream
Size: 2416 bytes
Desc: not available
URL: <http://lists.openvz.org/pipermail/devel/attachments/20100403/7d3eb09a/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: restart-error.log
Type: application/octet-stream
Size: 7280 bytes
Desc: not available
URL: <http://lists.openvz.org/pipermail/devel/attachments/20100403/7d3eb09a/attachment-0003.obj>