[Devel] Re: Linux Checkpoint-Restart - v19
Serge E. Hallyn
serue at us.ibm.com
Mon Mar 29 20:05:35 PDT 2010
Quoting Jiro SEKIBA (jir at dependable-os.net):
> Hi
>
> On 2010/03/25, at 1:47, Serge E. Hallyn wrote:
>
> > Quoting Jiro SEKIBA (jir at dependable-os.net):
> >>> If it doesn't work, can you please describe again the exact order of
> >>> commands that you use and the reported error(s) ?
> >>>
> >> I'll let you know in any case.
> >>
> >> Thank you very much for the advice
> >
> > Hi Jiro,
> >
> > Can you fetch the latest cr_tests
> > (git clone git://git.sr71.net/~hallyn/cr_tests)
> >
> > and
> > cd cr_tests; make; cd simple
> > sh runtests.sh
> >
> > and tell me whether the second (restart --self) test succeeds?
> > If it fails, can you send me the cr_*/log2 contents?
> >
>
> I tried it on ckpt-v20 and the above test looks OK.
> It also looks like self-checkpointing is working fine so far.
>
> However, I'm still not able to restart an external checkpoint correctly.
>
> Here are the program and scripts I used for the test.
> I used the ckpt-v20 branch of user-cr for the checkpoint/restart programs.
>
> This time I disconnected the program from the tty completely.
>
> ----------8<----------8<----------test.c----------8<----------8<----------
> #include <stdio.h>
> #include <unistd.h>
>
> int main(void)
> {
>     FILE *fp;
>     int i;
>     pid_t pid;
>     int st;
>
>     if(fork()) {
>         return 0;
Odd thing to do, not sure if you had a reason for it. Still,
should be fine :)
>     } else {
>         waitpid(getppid(), &st, NULL);
>
>         close(0);
>         close(1);
>         close(2);
>         setsid();
>
>         if(fork()) {
>             return 0;
>         } else
>             waitpid(getppid(), &st, NULL);
>     }
>
>     //unlink("/tmp/test.out");
>     fp = fopen("/tmp/test.out","w");
>
>     for(i=0;i<10;i++) {
>         fprintf(fp,"%d\n",i);
>         fflush(fp);
>         sleep(1);
>     }
>
>     fclose(fp);
>     return 0;
> }
> ----------8<----------8<----------test.c----------8<----------8<----------
>
> ----------8<----------8<----------checkpoint.sh----------8<----------8<----------
> #!/bin/sh
>
> CLOG=checkpoint.log
> RLOG=restart.log
> rm -f $CLOG $RLOG
>
> ./test &
> sleep 1
> PID=$(ps x | grep test | grep -v grep |cut -f 2 -d' ')
>
> sleep 2
> echo $PID > /cgroup/0/tasks
>
> echo FROZEN > /cgroup/0/freezer.state
> ./checkpoint -l $CLOG -v $PID > ckpt.image
>
> mv /tmp/test.out /tmp/test.out.orig
> cp /tmp/test.out.orig /tmp/test.out
>
> echo THAWED > /cgroup/0/freezer.state
>
> ./restart --pidns -l $RLOG -v -i ckpt.image;
> ----------8<----------8<----------checkpoint.sh----------8<----------8<----------
>
> When I ran the above script, I got the following:
>
> # mount -t cgroup -o freezer cgroup /cgroup
> # mkdir /cgroup/0
> # sh checkpoint.sh
> checkpoint id 8
> Success
>
> I then expected to see the numbers 0 to 9 in /tmp/test.out, but
> I only got 0 to 3, which is the state at which I froze and checkpointed the process.
>
> checkpoint.log and restart.log are empty.
> I guess that means the programs worked fine.
>
> I attached the dmesg output from a single run of the script.
> It looks like the restart tries to reopen /tmp/test.out.
>
> Could you give me any clues about what I should check?
Hmm, with ckpt-v20 of both kernel and user, on a powerpc system, I get:
elm3b203:/usr/src/jiro # sh checkpoint.sh
checkpoint id 146
Success
elm3b203:/usr/src/jiro # ls
checkpoint.log checkpoint.sh ckpt.image restart.log test test.c
elm3b203:/usr/src/jiro # cat /tmp/test.out
0
1
2
3
4
5
6
7
8
9
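If it helps to make the pass/fail check mechanical, here is a minimal sketch of a helper that compares the file against the expected 0..N sequence. check_sequence is a hypothetical name, not something from user-cr or cr_tests:

```shell
#!/bin/sh
# check_sequence FILE [LAST]  (hypothetical helper, not part of user-cr)
# Succeeds only if FILE holds exactly the lines 0..LAST (default 9),
# which is what test.c writes when it runs to completion.
check_sequence() {
    file=$1
    last=${2:-9}
    i=0
    while [ "$i" -le "$last" ]; do
        echo "$i"
        i=$((i + 1))
    done | cmp -s - "$file"
}
```

For example, `check_sequence /tmp/test.out` should succeed after a good restart, while `check_sequence /tmp/test.out 3` succeeding means the file still holds only the checkpointed state.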
> My environment is a VirtualBox VM.
> I tried both with VT and without VT.
> No VirtualBox guest module is installed.
What distro are you on?
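One thing in checkpoint.sh that can behave differently from system to system is the `ps x | grep test | ... | cut -f 2 -d' '` line. cut splits on single spaces, so the field holding the PID moves with the padding of the right-aligned PID column, and `grep test` matches any command line containing "test". I have not confirmed that this is what bites you, so treat the following as a sketch only. pids_by_name is a hypothetical helper; it matches the command name exactly and lets awk deal with the whitespace:

```shell
#!/bin/sh
# pids_by_name NAME  (hypothetical helper, not part of user-cr)
# Reads "PID COMMAND" lines on stdin and prints the PIDs whose command
# name is exactly NAME. awk splits on runs of blanks, so leading padding
# in front of the PID does not shift the fields.
pids_by_name() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Intended use (the "=" after each column name suppresses the header):
#   PID=$(ps -e -o pid= -o comm= | pids_by_name test)
```

If more than one PID comes back, the script should probably stop rather than write all of them to /cgroup/0/tasks.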
Anyway, two things to do. First, add '-d' to your restart flags, so
restart --pidns -l $RLOG -vd -i ckpt.image
That will give you debugging info. For instance I get:
checkpoint id 147
<2507>number of tasks: 1
<2507>total tasks (including ghosts): 1
<2507>====== TASKS
<2507> [0] pid 2497 ppid 1 sid 0 creator 0
<2507>............
<2507>new pidns without init
<2507>forking coordinator in new pidns
<2508>====== PIDS ARRAY
<2508>[0] pid 2497 ppid 1 sid 0 pgid 0
<2508>............
<1>forking child vpid 2497 flags 0x1
<1>forked child vpid 2497 (asked 2497)
<2497>root task pid 2497
<2497>pid 2497: pid 2497 sid 0 parent 1
<2497>about to call sys_restart(), flags 0
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 8336
<2508>c/r read input 0
Success
<1>restart succeeded
<1>SIGCHLD: already collected
<1>task exited with status 0
<1>mimic ret 0
<1>c/r succeeded
<2507>SIGCHLD: already collected
<2507>task exited with status 0
The other thing is to restart frozen and attach strace or gdb to the
restarted test before thawing. So perhaps
# cc -g -o test test.c
# sh checkpoint.sh
Then when that has failed, do
# mkdir /cgroup/1
# restart -F /cgroup/1 -i ckpt.image
That will hang. Then in another terminal, you can
# gdb -se test -p `pidof test`
and in a third terminal,
# echo THAWED > /cgroup/1/freezer.state
Now in gdb you can figure out where the task is and step through
to see where it dies.
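A related detail for the freeze and thaw steps: reading freezer.state can return FREEZING for a while after FROZEN has been written, so a script may want to wait for the state to settle before checkpointing or attaching. A minimal sketch, assuming the cgroup freezer layout used above; set_freezer_state is a hypothetical helper, not part of user-cr:

```shell
#!/bin/sh
# set_freezer_state DIR STATE [TRIES]  (hypothetical helper)
# Writes STATE (FROZEN or THAWED) to DIR/freezer.state, then checks the
# file up to TRIES times (default 10), sleeping one second between
# checks, until it reads back as STATE. Returns non-zero if the write
# fails or the state never settles.
set_freezer_state() {
    dir=$1
    want=$2
    tries=${3:-10}
    echo "$want" > "$dir/freezer.state" || return 1
    while [ "$tries" -gt 0 ]; do
        [ "$(cat "$dir/freezer.state")" = "$want" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

In checkpoint.sh this would be `set_freezer_state /cgroup/0 FROZEN` before calling checkpoint, and `set_freezer_state /cgroup/1 THAWED` in place of the echo in the third terminal.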
thanks,
-serge