[Devel] Re: [RFC][PATCH 0/2] CR: save/restore a single, simple task

Wed Jul 30 16:46:31 PDT 2008

Disclaimer: long reply :)

Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl at cs.columbia.edu):
>> In the recent mini-summit at OLS 2008 and the following days it was
>> agreed to tackle the checkpoint/restart (CR) by beginning with a very
>> simple case: save and restore a single task, with simple memory
>> layout, disregarding other task state such as files, signals etc.
>>
>> Following these discussions I coded a prototype that can do exactly
>> that, as a starter. This code adds two system calls - sys_checkpoint
>> and sys_restart - that a task can call to save and restore its state
>> respectively. It also demonstrates how the checkpoint image file can
>> be formatted, as well as show its nested nature (e.g. cr_write_mm()
>> -> cr_write_vma() nesting).
>>
>> The state that is saved/restored is the following:
>> * some of the task_struct
>> * some of the thread_struct and thread_info
>> * the cpu state (including FPU)
>> * the memory address space
>>
>> [The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
>> of Linus's tree (uhhh.. don't ask why), but against tonight's head too].
>>
>> In the current code, sys_checkpoint will checkpoint the current task,
>> although the logic exists to checkpoint other tasks (not in the
>> checkpointee's execution context). A simple loop will extend this to
>> handle multiple processes. sys_restart restarts the current tasks, and
>> with multiple tasks each task will call the syscall independently.
> 
> I assume that approach worked in Zap, so there must be a simple solution
> to this, but I don't see how having each process in a container
> independently call sys_restart works for sharing.  Oh, or is that where

The main reason to do that (and I thought openvz works similarly ?) is
that I want to re-use as much of the existing kernel functionality as
possible. Restart differs from checkpoint in that you have to construct new
resources as opposed to only inspecting existing ones. To inspect, you only
need a reference to the object, and then you obtain its state by accessing
it. To construct, in contrast, you need to create a new resource.

In almost all cases, creating a resource for a process is easiest if done by
the process itself. For instance - to restore the memory map, you want the
process that owns the target mm to call mmap() (in particular do_mmap_pgoff(),
the lower-level function that is more convenient for us). If the process
that restores a given vma didn't own that mm, it would be much more painful
to build the vma into a "foreign" mm.

Thus, there is a huge advantage to doing everything in the context of the
target process: we can re-use the existing kernel code (and spirit) to
create the resources, instead of having to hand-craft them carefully with
specialized code.

> a 'container restart context' comes in?  An nsproxy has a pointer to a

More or less. At a first approximation, this is how I envision it:

0) in user space, a new (empty) container will be created with all the
needed settings for the file system etc (mounts .. and the like)

1) the first task (container init) will call sys_restart with the checkpoint
image file.

2) the code will verify the header, then read in the global section; it will
create a restart-context which will be referenced from the container-object
(one option we considered is to have the freezer-cgroup be that object).

3) using the info from that section, it will create the task tree (forest)
to be restored. In particular, new tasks will be created and each will end
up in do_restart_task() inside the kernel.

[note that in Zap, step 3 is still done in user space...]

Since all tasks live in the container, they will all have access to the
restart-context, through which all coordination is done.

At first, the restart will be performed _one task at a time_, in the order
the tasks were dumped. So while the init task restores itself, the remaining
tasks sleep. When the init task finishes - it will wake the next in line
and so on. The last one will wake the init task to finalize the work. So:

4) each task waits (sleeps) until it is prompted to restore its own state.
When it completes, it wakes up the next task in line and goes to a freeze
state.

5) the init task finalizes the restart, and either completes the freeze or
unfreezes the container, depending on what the user requested.
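
[Steps 4 and 5 can be sketched in user space with threads standing in for
tasks. This is only an illustration of the wake-up chain, under the
assumption that the restart-context carries a "whose turn is it" counter;
struct restart_ctx, restart_task() and run_restart() are hypothetical names,
and the actual restore work is reduced to logging the task index.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_TASKS 16

/* Hypothetical restart context shared by every task in the container. */
struct restart_ctx {
	pthread_mutex_t lock;
	pthread_cond_t  wake;
	int nr_tasks;
	int turn;              /* index of the task allowed to restore now */
	int nr_restored;       /* how many tasks have restored so far */
	int order[MAX_TASKS];  /* log: order in which tasks restored */
	int finalized;         /* set by the init task (step 5) */
};

struct task_arg {
	struct restart_ctx *ctx;
	int idx;               /* position in the image; 0 is the init task */
};

static void *restart_task(void *p)
{
	struct task_arg *arg = p;
	struct restart_ctx *ctx = arg->ctx;

	pthread_mutex_lock(&ctx->lock);
	/* Step 4: sleep until prompted to restore our own state. */
	while (ctx->turn != arg->idx)
		pthread_cond_wait(&ctx->wake, &ctx->lock);

	ctx->order[ctx->nr_restored++] = arg->idx;  /* "restore" */

	/* Wake the next task in line (or init, if we were the last). */
	ctx->turn++;
	pthread_cond_broadcast(&ctx->wake);

	/* Step 5: init waits for the last task, then finalizes. */
	if (arg->idx == 0) {
		while (ctx->turn != ctx->nr_tasks)
			pthread_cond_wait(&ctx->wake, &ctx->lock);
		ctx->finalized = 1;
	}
	pthread_mutex_unlock(&ctx->lock);
	return NULL;
}

/* Runs the chain; copies the restore order to order_out. 0 on success. */
int run_restart(int nr_tasks, int *order_out)
{
	struct restart_ctx ctx = { .nr_tasks = nr_tasks };
	struct task_arg args[MAX_TASKS];
	pthread_t tids[MAX_TASKS];
	int i;

	assert(nr_tasks >= 1 && nr_tasks <= MAX_TASKS && order_out != NULL);
	pthread_mutex_init(&ctx.lock, NULL);
	pthread_cond_init(&ctx.wake, NULL);

	/* Create tasks in reverse: the chain, not creation order, decides. */
	for (i = nr_tasks - 1; i >= 0; i--) {
		args[i].ctx = &ctx;
		args[i].idx = i;
		if (pthread_create(&tids[i], NULL, restart_task, &args[i]) != 0)
			abort();  /* sketch only: no partial cleanup */
	}
	for (i = 0; i < nr_tasks; i++)
		pthread_join(tids[i], NULL);
	for (i = 0; i < nr_tasks; i++)
		order_out[i] = ctx.order[i];

	pthread_cond_destroy(&ctx.wake);
	pthread_mutex_destroy(&ctx.lock);
	return ctx.finalized ? 0 : -1;
}
```

Build with -pthread. Calling run_restart(4, order) fills order[] with
0, 1, 2, 3 even though the threads are created in the opposite order.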

This scheme makes sense because we assume that the data is streamed. So it
does not make much sense to try to restart the 5th job before the 2nd job
because the data isn't there yet. Moreover, if they refer to the same shared
object, job#5 will have to wait for job#2 to create the object, since its
state was saved with that job.

In the future, to speed up the process by restarting multiple tasks
concurrently, we'll have to read data from the stream into a buffer
(read-ahead); restarting tasks could then skip data that doesn't belong to
them. While they may still need to wait for shared resources to be created,
other work can be done in parallel in the meantime.
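
[To make the "skip data that isn't yours" idea concrete, here is a rough
sketch of walking a read-ahead buffer. It assumes each record is prefixed
by a small header naming its owner and payload length; that layout, struct
rec_hdr and scan_for_task() are invented for illustration and are not the
image format used by the patch.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header preceding every payload in the buffer. */
struct rec_hdr {
	uint32_t owner;  /* index of the task this record belongs to */
	uint32_t len;    /* number of payload bytes following the header */
};

/*
 * Walk a read-ahead buffer and return how many payload bytes belong to
 * `task`, skipping records owned by other tasks.
 * Returns -1 if the buffer ends in the middle of a record.
 */
long scan_for_task(const unsigned char *buf, size_t size, uint32_t task)
{
	size_t off = 0;
	long mine = 0;

	assert(buf != NULL);
	while (off < size) {
		struct rec_hdr hdr;

		if (size - off < sizeof(hdr))
			return -1;               /* truncated header */
		memcpy(&hdr, buf + off, sizeof(hdr));  /* no alignment assumed */
		off += sizeof(hdr);

		if (hdr.len > size - off)
			return -1;               /* truncated payload */

		if (hdr.owner == task)
			mine += hdr.len;         /* a real restart would consume it */

		off += hdr.len;                  /* move to the next record */
	}
	return mine;
}
```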

> checkpoint/restart context which the first task creates and all tasks
> reference and update?  So task 5 created its mm_struct, task 6 is
> supposed to use the same mm_struct, so it finds that out from the
> context?  I wonder whether that would start to become complicated
> when checkpointing nested containers.

Yes, that's what I had in mind - the restart context holds a hash table
that references all the shared objects that are created during the restart.
(Like the checkpoint context that will hold references to objects that
have been inspected).
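
[Roughly, and only as a sketch: the table maps the identifier that was
recorded for a shared object in the image to the object created for it
during restart. The first task that needs the object creates and registers
it; later tasks look it up and take a reference. All names below
(struct obj_table, obj_lookup(), obj_register(), obj_table_clear()) and the
fixed bucket count are assumptions for illustration.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define OBJ_HASH_SIZE 64

struct obj_entry {
	unsigned long id;        /* identifier recorded in the image */
	void *obj;               /* object created during restart */
	struct obj_entry *next;  /* chain for colliding identifiers */
};

struct obj_table {
	struct obj_entry *bucket[OBJ_HASH_SIZE];
};

/* Return the object registered under `id`, or NULL if none yet. */
void *obj_lookup(const struct obj_table *t, unsigned long id)
{
	const struct obj_entry *e;

	assert(t != NULL);
	for (e = t->bucket[id % OBJ_HASH_SIZE]; e != NULL; e = e->next)
		if (e->id == id)
			return e->obj;
	return NULL;
}

/* Register a newly created object. Returns 0, or -1 if `id` is taken
 * or memory runs out. */
int obj_register(struct obj_table *t, unsigned long id, void *obj)
{
	struct obj_entry *e;

	assert(t != NULL && obj != NULL);
	if (obj_lookup(t, id) != NULL)
		return -1;

	e = malloc(sizeof(*e));
	if (e == NULL)
		return -1;
	e->id = id;
	e->obj = obj;
	e->next = t->bucket[id % OBJ_HASH_SIZE];
	t->bucket[id % OBJ_HASH_SIZE] = e;
	return 0;
}

/* Drop all entries (the objects themselves are not freed). */
void obj_table_clear(struct obj_table *t)
{
	size_t i;

	assert(t != NULL);
	for (i = 0; i < OBJ_HASH_SIZE; i++) {
		while (t->bucket[i] != NULL) {
			struct obj_entry *e = t->bucket[i];

			t->bucket[i] = e->next;
			free(e);
		}
	}
}
```

In the task 5 / task 6 example: task 5 finds nothing under the mm's
identifier, creates the mm and registers it; task 6 later finds the entry
and shares the same mm instead of creating a second one.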

Checkpointing nested containers ???   Why ?
I'm not sure why that would be a problem; but sure, we need to discuss
that using a concrete use-case and identify the needs and difficulties.

> So I still prefer the idea that the init process calls restart, and that
> creates all the tasks in the container and rebuilds them.  But you have
> code, so you win :)

I agree: the init task calls restart, and that creates all the tasks in
the container. And then each of them is made to call do_restart_task() in
some way :)

> 
> Anyway I'm still reading through patch 2.  It looks great to me - the
> only comments I have written so far are:
> 	1. why not just store LINUX_VERSION_CODE in the header instead
> 	of breaking it up

hmph ... good question. Avoid 32/64 bit conversion complications ?

> 	2. the x86-specific code should of course go into arch-specific
> 	directories, but 

of course. I left it there for simplicity right now.

> neither of which really is worth the bother right now imo :)
> 
>> (Actually, to checkpoint outside the context of a task, it is also
>> necessary to also handle restart-block logic when saving/restoring the
>> thread data).
>>
>> It takes longer to describe what isn't implemented or supported by
>> this prototype ... basically everything that isn't as simple as the
>> above.
>>
>> As for containers - since we still don't have a representation for a
>> container, this patch has no notion of a container. The tests for
>> consistent namespaces (and isolation) are also omitted.
>>
>> Below are two example programs: one uses checkpoint (called ckpt) and
>> one uses restart (called rstr). Execute like this (as a superuser):
>>
>> orenl:~/test$ ./ckpt > out.1
>> hello, world!  (ret=1)		<-- sys_checkpoint returns positive id
>>  				<-- ctrl-c
>> orenl:~/test$ ./ckpt > out.2
>> hello, world!  (ret=2)
>>  				<-- ctrl-c
>> orenl:~/test$ ./rstr < out.1
>> hello, world!  (ret=0)		<-- sys_restart return 0
>>
>> (if you check the output of ps, you'll see that "rstr" changed its
>> name to "ckpt", as expected).
>>
>> Hoping this will accelerate the discussion. Comments are welcome.
>> Let the fun begin :)
>>
>> Oren.
>>
>>
>> ============================== ckpt.c ================================
>>
>> #define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <errno.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <asm/unistd_32.h>
>> #include <sys/syscall.h>
>>
>> int main(int argc, char *argv[])
>> {
>>  	pid_t pid = getpid();
>>  	int ret;
>>
>>  	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
>>  	if (ret < 0)
>>  		perror("checkpoint");
>>
>>  	fprintf(stderr, "hello, world!  (ret=%d)\n", ret);
>>
>>  	while (1)
>>  		;
>>
>>  	return 0;
>> }
>>
>> ============================== rstr.c ================================
>>
>> #define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <errno.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <asm/unistd_32.h>
>> #include <sys/syscall.h>
>>
>> int main(int argc, char *argv[])
>> {
>>  	pid_t pid = getpid();
>>  	int ret;
>>
>>  	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
>>  	if (ret < 0)
>>  		perror("restart");
>>
>>  	printf("should not reach here !\n");
>>
>>  	return 0;
>> }
>> _______________________________________________
>> Containers mailing list
>> Containers at lists.linux-foundation.org
>> https://lists.linux-foundation.org/mailman/listinfo/containers