[CRIU] Restore failed. Exit code: 43

Pavel Emelyanov xemul at parallels.com
Wed Jan 21 09:26:26 PST 2015


On 01/21/2015 05:19 PM, Paschalis Mpeis wrote:
>     It's much clearer now, thanks :) So what I see happens there.
> 
>     You start a process A to be the main.c's main with "1" as an argument. At some point A calls the
>     dumpApplication(). The dumpApplication() fork()-s kid B and A waitpid()-s for it. B calls the
>     criu_dump() which dumps the B process and leaves it running. B sees 0 ret code from criu_dump()
>     and then exit()-s with ret-code SUCC_DUMP_ECODE. Its parent (the A task) waits for the kid, checks
>     for the exit status being SUCC_DUMP_ECODE and prints the
> 
>         == Captured successfully ==\n\n
> 
>     message, then dumpApplication() ends, A continues execution. This pretty much coincides with what
>     is there in the output file.
> 
>     Next you want to replay and start new process, C, with the main.c's main as entry point and the
>     "2" as an argument. OK. In this case the restoreApplication() is called immediately. The latter
>     calls criu_restore_child(). Now what happens here is complex, confusing, but very interesting and
>     kinda unavoidable :) C fork()-s a new child process D; when D is created it is "restored" by criu and
>     is put into the former B's state -- the state as if it is in the dumpApplication() call returning
>     from the call to criu_dump(), but getting the code 1 (not 0 as it was in B) into ret variable in
>     there. Next D will behave just like B did with the only difference that ret is 1, not 0, which
>     will be decoded into SUCC_RSTR_ECODE by this check
> 
>          if (ret == 0)
>              ret = SUCC_DUMP_ECODE;
>          else if (ret == 1)
>              ret = SUCC_RSTR_ECODE;
>          else
>              ret = 1;
> 
>     and then D will call exit(ret) thus exiting with SUCC_RSTR_ECODE code. The D's parent (C) will
>     be woken up from the waitpid() call (line 119 of crlib.c) and will just exit. So this is what you
>     should get and do get.
> 
> 
> Hmmm...
> There are in total 4 processes: A & B for capture, and C & D for restore.
> So for each capture or restore, there are two exit points (when each process terminates).
> Let's name the process that does CRIU magic for capture the "capturer", and the process that does CRIU magic for restore the "restorer".
> 
> From what you have told me, I have understood the following:
> 
> *Capture:*
> Process A is my program. Then, it is forked, so we have B,

Which is the complete copy of A.

> in which you do your magic, so my program is captured. B is a "capturer". Right?

Not exactly. The capturer in your case is criu service. And B is what is being captured.

> So, when B continues, it does stuff unrelated to my program, maybe some CRIU stuff, and then it finally exits.

No, it does what's written in your dumpApplication() function.

> Then, process A, waits for the dump to be finished, and when this happens, it continues execution.

Yes.
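
To make the A/B handshake concrete, here is a minimal sketch of it. It does
not call libcriu: fake_dump_ok() and fake_restored() are hypothetical stand-ins
for criu_dump() returning 0 (just dumped) and 1 (just restored), and the
SUCC_*_ECODE values are placeholders, since the real ones from crlib.c are not
shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder values; the real constants in crlib.c are not shown here. */
#define SUCC_DUMP_ECODE 10
#define SUCC_RSTR_ECODE 11

/* Hypothetical stand-ins for criu_dump(): 0 = just dumped, 1 = just restored. */
typedef int (*dump_fn)(void);
int fake_dump_ok(void)  { return 0; }
int fake_restored(void) { return 1; }

/* Same decoding as the ret check in dumpApplication(). */
int decode_ret(int ret)
{
    if (ret == 0)
        return SUCC_DUMP_ECODE;
    if (ret == 1)
        return SUCC_RSTR_ECODE;
    return 1;
}

/* A fork()-s B; B "dumps" itself and exits with the decoded code; A waits.
 * Returns B's exit code as seen by A, or -1 on error. */
int fork_dump_and_wait(dump_fn dump)
{
    int status;
    pid_t pid;

    assert(dump != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)                      /* B: the task being captured */
        _exit(decode_ret(dump()));

    if (waitpid(pid, &status, 0) < 0)  /* A: blocks until B exits */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With these stubs, fork_dump_and_wait(fake_dump_ok) gives A the SUCC_DUMP_ECODE
value, and fork_dump_and_wait(fake_restored) gives the SUCC_RSTR_ECODE value,
which mirrors what C sees from D in your current code.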

> Specifically, A will continue executing from line 30 here <https://gist.github.com/Paschalis/a96b2747ed85b8e5a796#file-linpack_h1_-c-L30>.
> Is that correct?

No. A will continue execution here: https://gist.github.com/Paschalis/a96b2747ed85b8e5a796#file-crlib-c-L86

> Also, I have a question regarding command "criu_set_leave_running(true)".
> It will be executed by child B, right? 

Yes.

> Why should I bother setting this, since B is the capturer?

B is what is being captured. Well, I think you can avoid setting this in B. In that
case criu would kill B after the dump, and A would wake up and continue running. After
restore everything should look the same.

> What I thought this setting was, is that it let process A continue its execution (not B), after the dump occurred.
> 
> 
> *Replay:*
> C starts execution. It calls restoreApplication() here <https://gist.github.com/Paschalis/a96b2747ed85b8e5a796#file-main-c-L30>.
> Then, C is forked so we have process D.

Yes.

> Is C the "restorer"? (I am a bit confused about this)

No. C forks criu, criu forks D as C's child and restores B's state into D. So after
restore D is C's child and is a 100% clone of B.

> Then you say that D is restored into B state. I do not want this, since B is the "capturer" and not my program.

B is not the capturer, it's being captured. This is how criu_dump() works if you don't
set the pid of the task to dump.

>  
> 
>     I suspect this is not what you planned to see. Most likely you want D to continue doing what A
>     was, not B.
> 
> Yes, I want precisely this, given that I have described the processes A, B, C, and D above correctly.
> 
>  
> 
>     In that case you should fix the dumpApplication() code not to exit() upon seeing the
>     SUCC_RSTR_ECODE, but to return from this function. This is the unavoidable nature of dump and
>     restore. If you dumped yourself (this is what dumpApplication does) and then restored, you get
>     back in time into the state where you have been right after you have requested to dump yourself.
> 
>     The mentioned check for ret that sets one of SUCC_*_ECODE values is differentiating these two
>     cases -- whether you have just been dumped, or have just been restored.
> 
>     Is this explanation clear and helpful?
> 
> 
> So basically, I will exploit the fact that D will magically travel in time into the "dumpApplication" function, right after
> the dump,

Right after the dump no magic happens. The process is either killed or allowed to continue running.
Magic happens after restore :)
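
A rough sketch of the fix, assuming the ret semantics described above (0 right
after the dump, 1 right after the restore). The names after_dump(),
finish_dump() and enum next_step are made up for illustration and are not part
of libcriu or of your crlib.c; the SUCC_DUMP_ECODE value is a placeholder.

```c
#include <assert.h>
#include <stdlib.h>

#define SUCC_DUMP_ECODE 10   /* placeholder, real value not shown in the thread */

enum next_step { KEEP_RUNNING, EXIT_DUMPED, EXIT_FAILED };

/* Decide what the task that called criu_dump() should do with ret. */
enum next_step after_dump(int ret)
{
    if (ret == 0)
        return EXIT_DUMPED;   /* B, just dumped: report success to A */
    if (ret == 1)
        return KEEP_RUNNING;  /* D, just restored: go on with the program */
    return EXIT_FAILED;
}

/* Hypothetical tail of dumpApplication(), run by the forked child.
 * Returns only in the restored task; otherwise it exits. */
void finish_dump(int ret)
{
    /* wait statuses keep only the low 8 bits of an exit code,
     * and 1 is already used for the failure case */
    assert(SUCC_DUMP_ECODE > 1 && SUCC_DUMP_ECODE < 256);

    switch (after_dump(ret)) {
    case KEEP_RUNNING:
        return;
    case EXIT_DUMPED:
        exit(SUCC_DUMP_ECODE);
    case EXIT_FAILED:
        exit(1);
    }
}
```

The only change against the current logic is the ret == 1 branch: the restored
task returns to its caller instead of calling exit(). With that change C would
presumably stay in its waitpid() until D finishes its work.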

> and I will not terminate it. I will try this right away. I hope that I won't run into further problems! :)
> 
> Ultimately, I'd want to capture and replay just one function. Do you provide any API calls for doing such a thing?

Right now no, CRIU doesn't differentiate individual functions. It works on whole processes.

> One solution might be to capture everything, as I do right now, and then instrument the function I want to exit
> right after execution. However, that would store a lot of unnecessary program state in the images!
> 
> The explanation was extremely extremely helpful. Are these explanations somewhere in your wiki pages?

Our wiki is our weakest side :( This explanation is scattered all over the wiki I guess.

> A simple description of these 4 processes on a capture and restore would have been extremely helpful for
> all naive users!

Now _this_ feedback from you is extremely helpful, thank you! I will add this to the wiki.

> 
> 
>     Now I have questions about your output and expected-output. The lines
> 
>         ##########################################################
>         ##### HERE IT IS THE OUTPUT OF THE LINPACK EXECUTION #####
>         ##########################################################
> 
> This was the output that main_linpack.c produces. See here <https://gist.github.com/Paschalis/a96b2747ed85b8e5a796#file-main_linpack-c-L178>.
> I just replaced this output with the above 3 lines so you could read it more easily!
>  
> 
>         After waitpid!
> 
> This is a printf that I have removed from the gist code. It was on this line:
> <https://gist.github.com/Paschalis/a96b2747ed85b8e5a796#file-crlib-c-L124>.

OK



