[CRIU] Restore failed. Exit code: 43
Pavel Emelyanov
xemul at parallels.com
Tue Jan 20 11:53:24 PST 2015
On 01/20/2015 06:51 PM, Paschalis Mpeis wrote:
>
>
>
> I'm now confused, sorry. Can you post
>
> * the full program you run (either, but only one way of calling restore, not two)
> * the way you run it (the shell commands you execute, one by one)
> * the result you see (if dump and restore are OK, then logs are not required)
> * and what you expect it to look like
>
> Thanks,
> Pavel
>
>
> I have created a gist, as it would be the easier way to share code.
> Gist: https://gist.github.com/Paschalis/a96b2747ed85b8e5a796
> I have omitted the header files for the source I have provided.
>
> *Short explanation of files:*
> *crlib.c:* It's a wrapper of your library, to provide a more "clean" interface.
> It contains the functions I've told you about: initialise_criu, dumpApplication, restoreApplication.
>
> *linpack_h1_.c:* It contains one function taken out of the linpack benchmark (daxpy_real), and a proxy function (daxpy). The proxy function calls the real function every time, and the 1st time it is invoked it also takes a snapshot.
>
> *main_linpack.c:* It is the source code of the linpack benchmark, except the daxpy function. All calls to daxpy are handled by the linpack_h1_.c. ***Nothing to see here! ***
>
> *main.c:* Takes a command-line argument and either executes the benchmark, where a capture occurs on the 1st invocation of daxpy, or does a restore. On restore, I was expecting the program to continue execution from just before the 1st invocation of daxpy.
>
> *run.sh:* This is how I run the program. I run it in a sudo environment. It runs with 3 arguments:
> .1: it prepares the environment for CRIU, and compiles the app
> .2: the function "execute_criu_app" executes the app one time with argument 1, which is capture, and
> another time with argument 2 which is replay
> .3: I terminate the service and clear directories
>
> *output:* The current output
>
> *expected output:* The expected output
It's much clearer now, thanks :) So here is what I see happening there.
You start a process A to be the main.c's main with "1" as an argument. At some point A calls
dumpApplication(). dumpApplication() fork()-s kid B and A waitpid()-s for it. B calls
criu_dump(), which dumps the B process and leaves it running. B sees 0 ret code from criu_dump()
and then exit()-s with ret-code SUCC_DUMP_ECODE. Its parent (the A task) waits for the kid, checks
for exit status being SUCC_DUMP_ECODE and prints the
== Captured successfully ==\n\n
message, then dumpApplication() ends, A continues execution. This pretty much coincides with what
is there in the output file.
Next you want to replay and start a new process, C, with the main.c's main as entry point and
"2" as an argument. OK. In this case restoreApplication() is called immediately. The latter
calls criu_restore_child(). Now what happens here is complex, confusing, but very interesting and
kinda unavoidable :) C fork()-s a new child process D. When D is created it is "restored" by criu
and is put into the former B's state -- the state as if it is in the dumpApplication() call,
returning from the call to criu_dump(), but getting the code 1 (not 0 as it was in B) in the ret
variable there. Next D will behave just like B did, with the only difference that ret is 1, not 0,
which will be decoded into SUCC_RSTR_ECODE by this check
    if (ret == 0)
        ret = SUCC_DUMP_ECODE;
    else if (ret == 1)
        ret = SUCC_RSTR_ECODE;
    else
        ret = 1;
and then D will call exit(ret), thus exiting with the SUCC_RSTR_ECODE code. D's parent (C) will
be woken up from the waitpid() call (line 119 of crlib.c) and will just exit. So this is what you
should get and do get.
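
The exit-code handshake described above can be tried out without CRIU at all. In the sketch
below the value that criu_dump() would hand back is passed in as a plain argument, and the
SUCC_*_ECODE values are placeholders (the real ones live in the header files omitted from the
gist). The helper names are made up for illustration; this is not the actual crlib.c code.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder values: the real ones are in the headers omitted from the gist. */
#define SUCC_DUMP_ECODE 42
#define SUCC_RSTR_ECODE 43

/* Same mapping as the check quoted above. */
static int ecode_from_ret(int ret)
{
	if (ret == 0)
		return SUCC_DUMP_ECODE;
	else if (ret == 1)
		return SUCC_RSTR_ECODE;
	return 1;
}

/*
 * Fork a kid that pretends criu_dump() returned simulated_ret and exits
 * with the mapped code. The parent waits for it and returns the kid's exit
 * status, or -1 if fork/waitpid failed or the kid did not exit normally.
 */
static int kid_exit_code(int simulated_ret)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(ecode_from_ret(simulated_ret));
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Here kid_exit_code(0) plays the A/B pair (the parent sees SUCC_DUMP_ECODE), while
kid_exit_code(1) plays the C/D pair (the parent sees SUCC_RSTR_ECODE and has nothing else to do).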
I suspect this is not what you planned to see. Most likely you want D to continue doing what A
was doing, not B. In that case you should fix the dumpApplication() code not to exit() upon
seeing SUCC_RSTR_ECODE, but to return from this function. This is the unavoidable nature of dump
and restore. If you dumped yourself (this is what dumpApplication does) and then restored, you go
back in time to the state you were in right after you requested to dump yourself.
The mentioned check for ret that sets one of the SUCC_*_ECODE values differentiates these two
cases -- whether you have just been dumped, or have just been restored.
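
A rough sketch of that fix, under the same assumptions as before: the SUCC_DUMP_ECODE value is
a placeholder, ret stands for whatever criu_dump() returned in the kid, and the names
next_step_for() and finish_dump_in_kid() are hypothetical, not something from crlib.c or libcriu.

```c
#include <stdlib.h>

/* Placeholder value: the real one is in the headers omitted from the gist. */
#define SUCC_DUMP_ECODE 42

/* What the kid should do once criu_dump() has returned. */
enum next_step { EXIT_DUMPED, CONTINUE_RESTORED, EXIT_FAILED };

static enum next_step next_step_for(int ret)
{
	if (ret == 0)
		return EXIT_DUMPED;		/* B: has just been dumped */
	if (ret == 1)
		return CONTINUE_RESTORED;	/* D: has just been restored */
	return EXIT_FAILED;
}

/*
 * Meant to be called in the kid right after criu_dump().
 * It returns only in the restored case, so the caller can return from
 * dumpApplication() and carry on with the benchmark.
 */
static void finish_dump_in_kid(int ret)
{
	switch (next_step_for(ret)) {
	case EXIT_DUMPED:
		exit(SUCC_DUMP_ECODE);	/* report success to the waiting parent */
	case EXIT_FAILED:
		exit(1);
	case CONTINUE_RESTORED:
		return;			/* keep running instead of exiting */
	}
}
```

With something like this, D would run on into the first daxpy call instead of exiting, and C
would presumably stay in its waitpid() until D finishes.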
Is this explanation clear and helpful?
Now I have questions about your output and expected-output. The lines
##########################################################
##### HERE IT IS THE OUTPUT OF THE LINPACK EXECUTION #####
##########################################################
and
After waitpid!
are not present in the sources. Where do they come from?
Thanks,
Pavel