[Devel] [cr][git]linux-cr branch, ckpt-v17-rc2, created. v2.6.27-rc5-45616-g96b7bc2
orenl at cs.columbia.edu
Tue Jul 14 14:15:25 PDT 2009
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "linux-cr".
The branch, ckpt-v17-rc2 has been created
at 96b7bc2d23eafb041f72be1f33911385e31835df (commit)
- Log -----------------------------------------------------------------
commit 96b7bc2d23eafb041f72be1f33911385e31835df
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:27 2009 -0400
c/r: checkpoint and restore (shared) task's sighand_struct
This patch adds the checkpointing and restart of signal handling
state - 'struct sighand_struct'. Since the contents of this state
only affect userspace, no input validation is required.
Add _NSIG to kernel constants saved/tested with image header.
The number of signals (_NSIG) is arch-dependent, but it is defined under
__KERNEL__ and not visible to userspace builds. Therefore, define a per-arch
CKPT_ARCH_NSIG in <asm/checkpoint_hdr.h>.
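As a rough illustration of how a per-arch constant can be saved with the image header and tested again at restart, consider the sketch below. The struct toy_hdr_const, the value 64 and both function names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-arch constant; 64 is only an example. */
#define CKPT_ARCH_NSIG 64

/* Hypothetical slice of the constants record stored in the image header. */
struct toy_hdr_const {
	uint16_t arch_nsig;
};

/* Checkpoint side: record the value this build was compiled with. */
static void toy_fill_const(struct toy_hdr_const *h)
{
	assert(h);
	h->arch_nsig = CKPT_ARCH_NSIG;
}

/* Restart side: refuse an image taken with a different signal count. */
static int toy_check_const(const struct toy_hdr_const *h)
{
	assert(h);
	return h->arch_nsig == CKPT_ARCH_NSIG ? 0 : -EINVAL;
}
```

A restart path would call toy_check_const() right after reading the header and bail out on a mismatch, before touching any signal state.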
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit c5f8ef7d9fb0281fa31f94094ac6122661965ca6
Author: Serge E. Hallyn <serue at us.ibm.com>
Date: Tue Jul 14 17:08:26 2009 -0400
cr: restore file->f_cred
Restore a file's f_cred. This is set to the cred of the task doing
the open, so often it will be the same as that of the restarted task.
Signed-off-by: Serge E. Hallyn <serue at us.ibm.com>
commit b6c9c855baafcb301bf8ad413cfd53cb86c44ce5
Author: Serge E. Hallyn <serue at us.ibm.com>
Date: Tue Jul 14 17:08:26 2009 -0400
cr: checkpoint and restore task credentials
This patch adds the checkpointing and restart of credentials
(uids, gids, and capabilities) to Oren's c/r patchset (on top
of v14). It goes to great pains to re-use (and define when
needed) common helpers, in order to make sure that as security
code is modified, the cr code will be updated. Some of the
helpers should still be moved (i.e. _creds() functions should
be in kernel/cred.c).
When building the credentials for the restarted process, I
1. create a new struct cred as a copy of the running task's
cred (using prepare_cred())
2. always authorize any changes to the new struct cred
based on the permissions of current_cred() (not the current
transient state of the new cred).
While this may mean that certain transient_cred1->transient_cred2
states are allowed which otherwise wouldn't be allowed, the
fact remains that current_cred() is allowed to transition to
transient_cred2.
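The two steps above can be sketched in userspace as follows. This is a toy model under stated assumptions: struct toy_cred, TOY_CAP_SETUID and toy_restore_cred() are hypothetical, and the real code works on struct cred with the kernel's own helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_CAP_SETUID (1ULL << 7)	/* toy bit, modelled on CAP_SETUID */

/* Hypothetical, heavily reduced credential set. */
struct toy_cred {
	unsigned int uid;
	uint64_t cap_permitted;
};

/*
 * Build @new from @saved, authorizing every change against @caller
 * (the task running restart), never against the half-built @new.
 */
static int toy_restore_cred(const struct toy_cred *caller,
			    const struct toy_cred *saved,
			    struct toy_cred *new)
{
	assert(caller && saved && new);
	*new = *caller;			/* step 1: start from a copy */

	/* step 2: a uid change needs the setuid bit in the caller's creds */
	if (saved->uid != caller->uid &&
	    !(caller->cap_permitted & TOY_CAP_SETUID))
		return -EPERM;
	new->uid = saved->uid;

	/* capabilities may only be a subset of what the caller holds */
	if (saved->cap_permitted & ~caller->cap_permitted)
		return -EPERM;
	new->cap_permitted = saved->cap_permitted;
	return 0;
}
```

Because both checks look only at @caller, the order in which fields are restored cannot change the outcome, which is the point of rule 2.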
The reconstructed creds are applied to the task at the very
end of the sys_restart call. This ensures that any objects which
need to be re-created (file, socket, etc) are re-created using
the creds of the task calling sys_restart - preventing an unpriv
user from creating a privileged object, and ensuring that a
root task can restart a process which had started out privileged,
created some privileged objects, then dropped its privilege.
With these patches, the root user can restart checkpoint images
(created by either hallyn or root) of user hallyn's tasks,
resulting in a program owned by hallyn.
Changelog:
Jun 15: Fix user_ns handling when !CONFIG_USER_N
Set creator_ref=0 for root_ns (discard @flags)
Don't overwrite global user-ns if CONFIG_USER_NS
Jun 10: Merge with ckpt-v16-dev (Oren Laadan)
Jun 01: Don't check ordering of groups in group_info, bc
set_groups() will sort it for us.
May 28: 1. Restore securebits
2. Address Alexey's comments: move prototypes out of
sched.h, validate ngroups < NGROUPS_MAX, validate
groups are sorted, and get rid of ckpt_hdr_cred->version.
3. remove bogus unused flag RESTORE_CREATE_USERNS
May 26: Move group, user, userns, creds c/r functions out
of checkpoint/process.c and into the appropriate files.
May 26: Define struct ckpt_hdr_task_creds and move task cred
objref c/r into {checkpoint_restore}_task_shared().
May 26: Take cred refs around checkpoint_write_creds()
May 20: Remove the limit on number of groups in groupinfo
at checkpoint time
May 20: Remove the depth limit on empty user namespaces
May 20: Better document checkpoint_user
May 18: fix more refcounting: if (userns 5, uid 0) had
no active tasks or child user_namespaces, then
it shouldn't exist at restart or it, its namespace,
and its whole chain of creators will be leaked.
May 14: fix some refcounting:
1. a new user_ns needs a ref to remain pinned
by its root user
2. current_user_ns needs an extra ref bc objhash
drops two on restart
3. cred needs a ref for the real credentials bc
commit_creds eats one ref.
May 13: folded in fix to userns refcounting.
Signed-off-by: Serge E. Hallyn <serue at us.ibm.com>
[orenl at cs.columbia.edu: merge with ckpt-v16-dev]
commit 14c39d14af79b65edef43393c28de6739ebc2109
Author: Serge E. Hallyn <serue at us.ibm.com>
Date: Tue Jul 14 17:08:26 2009 -0400
cr: capabilities: define checkpoint and restore fns
[ Andrew: I am punting on dealing with the subsystem cooperation
issues in this version, in favor of trying to get LSM issues
straightened out ]
An application checkpoint image will store capability sets
(and the bounding set) as __u64s. Define checkpoint and
restart functions to translate between those and kernel_cap_t's.
Define a common function do_capset_tocred() which applies capability
set changes to a passed-in struct cred.
The restore function uses do_capset_tocred() to apply the restored
capabilities to the struct cred being crafted, subject to the
current task's (task executing sys_restart()) permissions.
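A minimal sketch of the translation between a 64-bit image field and a two-word capability set, assuming the set is stored as two 32-bit words. struct toy_kernel_cap and both helpers are hypothetical names for illustration, not the functions added by this patch.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CAP_U32S 2	/* assumed: two 32-bit words per capability set */

/* Hypothetical userspace mirror of a kernel capability set. */
struct toy_kernel_cap {
	uint32_t cap[TOY_CAP_U32S];
};

/* Checkpoint direction: fold the two words into one 64-bit image field. */
static uint64_t toy_cap_to_u64(const struct toy_kernel_cap *c)
{
	assert(c);
	return (uint64_t)c->cap[0] | ((uint64_t)c->cap[1] << 32);
}

/* Restart direction: split the image field back into the two words. */
static void toy_u64_to_cap(uint64_t v, struct toy_kernel_cap *c)
{
	assert(c);
	c->cap[0] = (uint32_t)(v & 0xffffffffu);
	c->cap[1] = (uint32_t)(v >> 32);
}
```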
Changelog:
Jun 09: Can't choose securebits or drop bounding set if
file capabilities aren't compiled into the kernel.
Also just store caps in __u32s (looks cleaner).
Jun 01: Made the checkpoint and restore functions and the
ckpt_hdr_capabilities struct more opaque to the
rest of the c/r code, as suggested by Andrew Morgan,
and using naming suggested by Oren.
Jun 01: Add commented BUILD_BUG_ON() to point out that the
current implementation depends on 64-bit capabilities.
(Andrew Morgan and Alexey Dobriyan).
May 28: add helpers to c/r securebits
Signed-off-by: Serge E. Hallyn <serue at us.ibm.com>
commit 61bc6aaa6c6367f76fc42f8436d03f000fa8271e
Author: Serge E. Hallyn <serue at us.ibm.com>
Date: Tue Jul 14 17:08:25 2009 -0400
From: Serge E. Hallyn <serue at us.ibm.com>
clone_with_pids: define the s390 syscall
Hook up the clone_with_pids system call for s390x. clone_with_pids()
takes an additional argument over clone(), which we pass in through
register 7. Stub code for using the syscall looks like:
struct target_pid_set {
int num_pids;
pid_t *target_pids;
unsigned long flags;
};
register unsigned long int __r2 asm ("2") = (unsigned long int)(stack);
register unsigned long int __r3 asm ("3") = (unsigned long int)(flags);
register unsigned long int __r4 asm ("4") = (unsigned long int)(NULL);
register unsigned long int __r5 asm ("5") = (unsigned long int)(NULL);
register unsigned long int __r6 asm ("6") = (unsigned long int)(NULL);
register unsigned long int __r7 asm ("7") = (unsigned long int)(setp);
register unsigned long int __result asm ("2");
__asm__ __volatile__(
" lghi %%r1,332\n"
" svc 0\n"
: "=d" (__result)
: "0" (__r2), "d" (__r3),
"d" (__r4), "d" (__r5), "d" (__r6), "d" (__r7)
: "1", "cc", "memory"
);
__result;
})
struct target_pid_set pid_set;
int pids[1] = { 19799 };
pid_set.num_pids = 1;
pid_set.target_pids = &pids[0];
pid_set.flags = 0;
rc = do_clone_with_pids(topstack, clone_flags, setp);
if (rc == 0)
printf("Child\n");
else if (rc > 0)
printf("Parent: child pid %d\n", rc);
else
printf("Error %d\n", rc);
Signed-off-by: Serge E. Hallyn <serue at us.ibm.com>
commit af648275ebbe6713b8f7e476966fd649e34d3952
Author: Dan Smith <danms at us.ibm.com>
Date: Tue Jul 14 17:08:25 2009 -0400
c/r: define s390-specific checkpoint-restart code
Implement the s390 arch-specific checkpoint/restart helpers. This
is on top of Oren Laadan's c/r code.
With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro. While on x86 I never had to freeze a single task
to checkpoint it, on s390 I do need to. That is a prereq for consistent
snapshots (esp with multiple processes) anyway so I don't see that as
a problem.
Changelog:
Jun 15:
. Fix checkpoint and restart compat wrappers
May 28:
. Export asm/checkpoint_hdr.h to userspace
. Define CKPT_ARCH_ID for S390
Apr 11:
. Introduce ckpt_arch_vdso()
Feb 27:
. Add checkpoint_s390.h
. Fixed up save and restore of PSW, with the non-address bits
properly masked out
Feb 25:
. Make checkpoint_hdr.h safe for inclusion in userspace
. Replace comment about vdso code
. Add comment about restoring access registers
. Write and read an empty ckpt_hdr_head_arch record to appease
code (mktree) that expects it to be there
. Utilize NUM_CKPT_WORDS in checkpoint_hdr.h
Feb 24:
. Use CKPT_COPY() to unify the un/loading of cpu and mm state
. Fix fprs definition in ckpt_hdr_cpu
. Remove debug WARN_ON() from checkpoint.c
Feb 23:
. Macro-ize the un/packing of trace flags
. Fix the crash when externally-linked
. Break out the restart functions into restart.c
. Remove unneeded s390_enable_sie() call
Jan 30:
. Switched types in ckpt_hdr_cpu to __u64 etc.
(Per Oren suggestion)
. Replaced direct inclusion of structs in
ckpt_hdr_cpu with the struct members.
(Per Oren suggestion)
. Also ended up adding a bunch of new things
into restart (mm_segment, ksp, etc) in vain
attempt to get code using fpu to not segfault
after restart.
Signed-off-by: Serge E. Hallyn <serue at us.ibm.com>
Signed-off-by: Dan Smith <danms at us.ibm.com>
commit 50ee8dcafcaf75b63ff0f51f017ac62d4e6a7c92
Author: Dan Smith <danms at us.ibm.com>
Date: Tue Jul 14 17:08:25 2009 -0400
c/r: add CKPT_COPY() macro
As suggested by Dave[1], this provides us a way to make the copy-in and
copy-out processes symmetric. CKPT_COPY_ARRAY() provides us a way to do
the same thing but for arrays. It's not critical, but it helps us unify
the checkpoint and restart paths for some things.
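The idea of a direction-agnostic copy can be sketched like this. The TOY_ names, the two structs and the runtime size check are assumptions made for the example; the changelog below mentions a build-time __must_be_array check instead.

```c
#include <assert.h>
#include <stddef.h>

enum toy_op { TOY_CPT, TOY_RST };	/* hypothetical direction flag */

/* One macro, two directions: save live state or load it back. */
#define TOY_COPY(op, saved, live)		\
	do {					\
		if ((op) == TOY_CPT)		\
			(saved) = (live);	\
		else				\
			(live) = (saved);	\
	} while (0)

/* Array variant: both operands must be arrays of the same size. */
#define TOY_COPY_ARRAY(op, saved, live)					\
	do {								\
		size_t i_;						\
		assert(sizeof(saved) == sizeof(live));			\
		for (i_ = 0; i_ < sizeof(saved) / sizeof((saved)[0]); i_++) \
			TOY_COPY(op, (saved)[i_], (live)[i_]);		\
	} while (0)

struct toy_hdr_cpu { unsigned long psw_addr; unsigned long gprs[4]; };
struct toy_regs    { unsigned long psw_addr; unsigned long gprs[4]; };

/* The same function body serves both checkpoint and restart. */
static void toy_copy_cpu(enum toy_op op, struct toy_hdr_cpu *h,
			 struct toy_regs *r)
{
	TOY_COPY(op, h->psw_addr, r->psw_addr);
	TOY_COPY_ARRAY(op, h->gprs, r->gprs);
}
```

Listing each field once keeps the save and load paths from drifting apart when a field is added.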
Changelog:
Mar 04:
. Removed semicolons
. Added build-time check for __must_be_array in CKPT_COPY_ARRAY
Feb 27:
. Changed CKPT_COPY() to use assignment, eliminating the need
for the CKPT_COPY_BIT() macro
. Add CKPT_COPY_ARRAY() macro to help copying register arrays,
etc
. Move the macro definitions inside the CR #ifdef
Feb 25:
. Changed WARN_ON() to BUILD_BUG_ON()
Signed-off-by: Dan Smith <danms at us.ibm.com>
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
1: https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html (all the way at the bottom)
commit 5107325f60e56dc04677090b564655e6561670eb
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:24 2009 -0400
c/r: (s390): expose a constant for the number of words (CRs)
We need to use this value in the checkpoint/restart code and would like to
have a constant instead of a magic '3'.
Changelog:
Mar 30:
. Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Mar 03:
. Picked up additional use of magic '3' in ptrace.h
Signed-off-by: Dan Smith <danms at us.ibm.com>
commit e7c114704b543781c1b27f8d22894b197224fd22
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:24 2009 -0400
c/r: support semaphore sysv-ipc
Checkpoint of sysvipc semaphores is performed by iterating through all
sem objects and dumping the contents of each one. The semaphore array
of each sem is dumped with that object.
The semaphore array (sem->sem_base) holds an array of 'struct sem',
which is a {int, int}. Because this translates into the same format
on 32- and 64-bit architectures, the checkpoint format is simply the
dump of this array as is.
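A small sketch of why the array can be dumped as is, assuming a {int, int} layout as described above. struct toy_sem and toy_dump_sems() are hypothetical userspace stand-ins, and the field names are chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed mirror of the {int, int} semaphore entry described above. */
struct toy_sem {
	int semval;
	int sempid;
};

/* Two ints with no padding, so the layout does not depend on word size. */
_Static_assert(sizeof(struct toy_sem) == 2 * sizeof(int),
	       "toy_sem must be exactly two ints");

/* Dump @nsems entries verbatim; returns bytes written, 0 if @buf is too small. */
static size_t toy_dump_sems(const struct toy_sem *base, size_t nsems,
			    void *buf, size_t buflen)
{
	size_t len = nsems * sizeof(*base);

	assert(base && buf);
	if (len > buflen)
		return 0;
	memcpy(buf, base, len);
	return len;
}
```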
TODO: this patch does not handle semaphore-undo -- this data should be
saved per-task while iterating through the tasks.
Changelog[v17]:
- Restore objects in the right namespace
- Forward declare struct msg_msg (instead of include linux/msg.h)
- Fix typo in comment
- Don't unlock ipc before calling freeary in error path
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 45d915251fea5df431536c0dd9318942003f08a6
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:24 2009 -0400
c/r: support message-queues sysv-ipc
Checkpoint of sysvipc message-queues is performed by iterating through
all 'msq' objects and dumping the contents of each one. The messages
queued on each 'msq' are dumped with that object.
Messages of a specific queue are written one by one. The queue lock
cannot be held while dumping them, but the loop must be protected from
someone (who ?) writing or reading. To do that we grab the lock, then
hijack the entire chain of messages from the queue, drop the lock,
and then safely dump them in a loop. Finally, with the lock held, we
re-attach the chain while verifying that there isn't other (new) data
on that queue.
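The detach, dump and re-attach sequence can be sketched in userspace as below. Everything here is a toy: the integer "lock", struct toy_queue and the emit callback are assumptions, and the choice to put the old chain back in front of any new data is made for the example only.

```c
#include <assert.h>
#include <stddef.h>

struct toy_msg {
	struct toy_msg *next;
	int payload;
};

struct toy_queue {
	int locked;		/* stand-in for the real queue lock */
	struct toy_msg *head;
};

static void toy_lock(struct toy_queue *q)   { assert(!q->locked); q->locked = 1; }
static void toy_unlock(struct toy_queue *q) { assert(q->locked);  q->locked = 0; }

/* Example sink used by the usage note: adds up the payloads it is given. */
static int toy_sum;
static void toy_emit_sum(int payload) { toy_sum += payload; }

/*
 * Dump all messages without holding the lock across the slow loop.
 * Returns the number of messages dumped, or -1 if new data showed up
 * on the queue while the chain was detached.
 */
static int toy_dump_queue(struct toy_queue *q, void (*emit)(int payload))
{
	struct toy_msg *chain, *m;
	int n = 0, ret;

	toy_lock(q);
	chain = q->head;		/* hijack the whole chain */
	q->head = NULL;
	toy_unlock(q);

	for (m = chain; m; m = m->next) {	/* nobody else can see it now */
		emit(m->payload);
		n++;
	}

	toy_lock(q);
	ret = q->head ? -1 : n;		/* new data arrived meanwhile? */
	if (chain) {
		for (m = chain; m->next; m = m->next)
			;
		m->next = q->head;	/* put the old chain back in front */
		q->head = chain;
	}
	toy_unlock(q);
	return ret;
}
```

Calling toy_dump_queue(&q, toy_emit_sum) on a two-message queue leaves the queue intact and the lock released, with toy_sum holding the sum of both payloads.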
Writing the message contents themselves is straightforward. The code
is similar to that in ipc/msgutil.c, the main difference being that
we deal with kernel memory and not user memory.
Changelog[v17]:
- Allocate security context for msg_msg
- Restore objects in the right namespace
- Don't unlock ipc before freeing
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit af2c185e20e0cf71bc05341913bdaaebc6e0749c
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:23 2009 -0400
c/r: support share-memory sysv-ipc
Checkpoint of sysvipc shared memory is performed in two steps: first,
the entire ipc namespace is dumped as a whole by iterating through all
shm objects and dumping the contents of each one. The shmem inode is
registered in the objhash. Second, for each vma that refers to ipc
shared memory we find the inode in the objhash, and save the objref.
(If we find a new inode, that indicates that the ipc namespace is not
entirely frozen and someone must have manipulated it since step 1).
Handling of shm objects that have been deleted (via IPC_RMID) is left
to a later patch in this series.
Changelog[v17]:
- Restore objects in the right namespace
- Properly initialize ctx->deferqueue
- Fix compilation with CONFIG_CHECKPOINT=n
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 5850cd1d2170655d97a17fbc0f085055270c1bf9
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:23 2009 -0400
c/r: save and restore sysvipc namespace basics
Add the helpers to checkpoint and restore the contents of 'struct
kern_ipc_perm'. Add header structures for ipc state. Put place-holders
to save and restore ipc state.
Save and restore the common state (parameters) of the ipc namespace.
Add generic code to iterate through the objects of sysvipc shared memory,
message queues and semaphores. The logic to save and restore the state
of these objects will be added in the next few patches.
Right now, we return -EPERM if the user calling sys_restart() isn't
allowed to create an object with the checkpointed uid. We may prefer
to simply use the caller's uid in that case - but that could lead to
subtle userspace bugs? Unsure, so going for the stricter behavior.
TODO: restore kern_ipc_perms->security.
Changelog[v17]:
- Collect nsproxy->ipc_ns
- Restore objects in the right namespace
- If !CONFIG_IPC_NS only restore objects, not global settings
- Don't overwrite global ipc-ns if !CONFIG_IPC_NS
- Reset the checkpointed uid and gid info on ipc objects
- Fix compilation with CONFIG_SYSVIPC=n
Changelog [Dan Smith <danms at us.ibm.com>]
- Fix compilation with CONFIG_SYSVIPC=n
- Update to match UTS changes
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit c6315e096564a043058c83f4b41eb97ff7cc7f1f
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:23 2009 -0400
c/r (ipc): allow allocation of a desired ipc identifier
During restart, we need to allocate ipc objects with the same
identifiers as recorded during checkpoint. Modify the allocation
code to allow an in-kernel caller to request a specific ipc identifier.
The system call interface remains unchanged.
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 2e009ef7a182c1024f8c089cef743c677c5a77d5
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:23 2009 -0400
deferqueue: generic queue to defer work
Add an interface to postpone an action until the end of the entire
checkpoint or restart operation. This is useful when during the
scan of tasks an operation cannot be performed in place, to avoid
the need for a second scan.
One use case is restoring an ipc shared memory region that has been
deleted (but is still attached): during restart it needs to be
created, attached and then deleted. However, creation and attachment
are performed in distinct locations, so deletion can not be performed
on the spot. Instead, this work (delete) is deferred until later.
(This example is in one of the following patches).
This interface allows chronic procrastination in the kernel:
deferqueue_create(void):
Allocates and returns a new deferqueue.
deferqueue_run(deferqueue):
Executes all the pending works in the queue. Returns the number
of works executed, or an error upon the first error reported by
a deferred work.
deferqueue_add(deferqueue, data, size, func, dtor):
Enqueue a deferred work. @func is the callback function to
do the work, which will be called with @data as an argument.
@size tells the size of data. @dtor is a destructor callback
that is invoked for deferred works remaining in the queue when
the queue is destroyed. NOTE: for a given deferred work, @dtor
is _not_ called if @func was already called (regardless of the
return value of the latter).
deferqueue_destroy(deferqueue):
Free the deferqueue and any queued items while invoking the
@dtor callback for each queued item.
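The semantics listed above can be mimicked in userspace roughly as follows. This is only a sketch: the toy_ names, the callback signature and the singly linked FIFO are assumptions, and the kernel implementation may differ in all of them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int (*toy_func_t)(void *data);

struct toy_work {
	struct toy_work *next;
	toy_func_t func, dtor;
	char data[];		/* private copy of the caller's data */
};

struct toy_deferqueue {
	struct toy_work *head, **tail;
};

/* Example callbacks used by the usage note. */
static int toy_total, toy_dropped;
static int toy_add_work(void *data) { toy_total += *(int *)data; return 0; }
static int toy_drop_work(void *data) { (void)data; toy_dropped++; return 0; }

static struct toy_deferqueue *toy_deferqueue_create(void)
{
	struct toy_deferqueue *q = malloc(sizeof(*q));

	if (q) {
		q->head = NULL;
		q->tail = &q->head;
	}
	return q;
}

/* Queue @func to run later on a copy of @data; returns 0 or -1. */
static int toy_deferqueue_add(struct toy_deferqueue *q, const void *data,
			      size_t size, toy_func_t func, toy_func_t dtor)
{
	struct toy_work *w = malloc(sizeof(*w) + size);

	assert(q && func);
	if (!w)
		return -1;
	w->next = NULL;
	w->func = func;
	w->dtor = dtor;
	if (size)
		memcpy(w->data, data, size);
	*q->tail = w;
	q->tail = &w->next;
	return 0;
}

/* Run works in FIFO order; stop at the first error and return it. */
static int toy_deferqueue_run(struct toy_deferqueue *q)
{
	struct toy_work *w;
	int nr = 0, ret;

	while ((w = q->head)) {
		q->head = w->next;
		if (!q->head)
			q->tail = &q->head;
		ret = w->func(w->data);
		free(w);	/* @dtor is not called once @func has run */
		if (ret < 0)
			return ret;
		nr++;
	}
	return nr;
}

/* Free the queue, calling @dtor for works that never ran. */
static void toy_deferqueue_destroy(struct toy_deferqueue *q)
{
	struct toy_work *w;

	while ((w = q->head)) {
		q->head = w->next;
		if (w->dtor)
			w->dtor(w->data);
		free(w);
	}
	free(q);
}
```

Adding two works with toy_add_work and calling toy_deferqueue_run() executes both in order; a work added afterwards and never run only sees its destructor, from toy_deferqueue_destroy().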
Why aren't we using the existing kernel workqueue mechanism? We need
to defer work until the end of the operation: not earlier, since we
need other things to be in place; not later, to not block waiting for
it. However, the workqueue schedules the work for 'some time later'.
Also, the kernel workqueue may run in any task context, but we require
many times that an operation be run in the context of some specific
restarting task (e.g., restoring IPC state of a certain ipc_ns).
Instead, this mechanism is a simple way for the c/r operation as a
whole, and later a task in particular, to defer some action until
later (but not arbitrarily later) _in the restore_ operation.
Changelog[v17]
- Fix deferqueue_add() function
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 60d441c5a3315d65dce2a3ad3741ee9a0ca8898f
Author: Dan Smith <danms at us.ibm.com>
Date: Tue Jul 14 17:08:22 2009 -0400
c/r: support for UTS namespace
This patch adds a "phase" of checkpoint that saves out information about any
namespaces the task(s) may have. Do this by tracking the namespace objects
of the tasks and making sure that tasks with the same namespace that follow
get properly referenced in the checkpoint stream.
Changes[v17]:
- Collect nsproxy->uts_ns
- Save uts string lengths once in ckpt_hdr_const
- Save and restore all fields of uts-ns
- Don't overwrite global uts-ns if !CONFIG_UTS_NS
- Replace sys_unshare() with create_uts_ns()
- Take uts_sem around access to uts data
Changes:
- Remove the kernel restore path
- Punt on nested namespaces
- Use __NEW_UTS_LEN in nodename and domainname buffers
- Add a note to Documentation/checkpoint/internals.txt to indicate where
in the save/restore process the UTS information is kept
- Store (and track) the objref of the namespace itself instead of the
nsproxy (based on comments from Dave on IRC)
- Remove explicit check for non-root nsproxy
- Store the nodename and domainname lengths and use ckpt_write_string()
to store the actual name strings
- Catch failure of ckpt_obj_add_ptr() in ckpt_write_namespaces()
- Remove "types" bitfield and use the "is this new" flag to determine
whether or not we should write out a new ns descriptor
- Replace kernel restore path
- Move the namespace information to be directly after the task
information record
- Update Documentation to reflect new location of namespace info
- Support checkpoint and restart of nested UTS namespaces
Signed-off-by: Dan Smith <danms at us.ibm.com>
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit f1c59e1daa86933efa233974da953463677a3e87
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:22 2009 -0400
c/r: make ckpt_may_checkpoint_task() check each namespace individually
For a given namespace type, say XXX, if a checkpoint that was taken on a
CONFIG_XXX_NS system is restarted on a !CONFIG_XXX_NS system, then ensure
that:
1) The global settings of the global (init) namespace do not get
overwritten. Creating new objects in that namespace is ok, as long as
the requested identifier is available.
2) All restarting tasks use a single namespace - because it is
impossible to create additional namespaces to accommodate for what had
been checkpointed.
Original patch introducing nsproxy c/r by Dan Smith <danms at us.ibm.com>
Changelog[v17]:
- Only collect sub-objects of struct_nsproxy once.
- Restore namespace pieces directly instead of using sys_unshare()
- Proper handling of restart from namespace(s) without namespace(s)
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 50c3d33d1163cca20007ed8aee8efad76f90b40f
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:22 2009 -0400
c/r: support for open pipes
A pipe is a double-headed inode with a buffer attached to it. We
checkpoint the pipe buffer only once, as soon as we hit one side of
the pipe, regardless of whether it is the read or the write end.
To checkpoint a file descriptor that refers to a pipe (either end), we
first look up the inode in the hash table: if not found, it is the
first encounter of this pipe. Besides the file descriptor, we also (a)
save the pipe data, and (b) register the pipe inode in the hash. If
found, it is the second encounter of this pipe, namely, as we hit the
other end of the same pipe. In both cases we write the pipe-objref of
the inode.
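The first-encounter versus second-encounter logic can be sketched as below. The fixed-size linear table is a deliberately simple stand-in for the object hash, and all toy_ names and the 1-based objref numbering are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HASH_MAX 16

/* Linear table standing in for the object hash. */
struct toy_objhash {
	const void *ptr[TOY_HASH_MAX];	/* registered objects, e.g. pipe inodes */
	int nr;
};

/*
 * Return the objref for @obj (1-based), registering it on first sight.
 * *@first is set to 1 on first encounter, 0 otherwise. Returns -1 if full.
 */
static int toy_obj_lookup_add(struct toy_objhash *h, const void *obj, int *first)
{
	int i;

	assert(h && obj && first);
	for (i = 0; i < h->nr; i++) {
		if (h->ptr[i] == obj) {
			*first = 0;
			return i + 1;
		}
	}
	if (h->nr == TOY_HASH_MAX)
		return -1;
	h->ptr[h->nr++] = obj;
	*first = 1;
	return h->nr;
}

/* Checkpoint one pipe end: dump the buffer only on the first encounter. */
static int toy_checkpoint_pipe_end(struct toy_objhash *h, const void *inode,
				   int *buffers_dumped)
{
	int first, objref = toy_obj_lookup_add(h, inode, &first);

	if (objref > 0 && first)
		(*buffers_dumped)++;	/* stand-in for saving the pipe data */
	return objref;			/* written in both cases */
}
```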
To restore, create a new pipe and thus have two file pointers (read-
and write- ends). We only use one of them, depending on which side was
checkpointed first. We register the file pointer of the other end in
the hash table, with the pipe_objref given for this pipe from the
checkpoint, to be used later when the other arrives. At this point we
also restore the contents of the pipe buffers.
To save the pipe buffer, given a source pipe, use do_tee() to clone
its contents into a temporary 'struct pipe_inode_info', and then use
do_splice_from() to transfer it directly to the checkpoint image file.
To restore the pipe buffer, with a fresh newly allocated target pipe,
use do_splice_to() to splice the data directly between the checkpoint
image file and the pipe.
Changelog[v17]:
- Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 74bf2931853ccb1d7a0a55770b353f1e1c981613
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:21 2009 -0400
splice: export pipe/file-to-pipe/file functionality
During c/r of pipes we need to save and restore pipe buffers. But
do_splice() requires two file descriptors, therefore we can't use it,
as we always have one file descriptor (checkpoint image) and one
pipe_inode_info.
This patch exports interfaces that work at the pipe_inode_info level,
namely link_pipe(), do_splice_to() and do_splice_from(). They are used
in the following patch to save and restore pipe buffers without
unnecessary data copy.
It slightly modifies both do_splice_to() and do_splice_from() to
detect the case of pipe-to-pipe transfer, in which case they invoke
splice_pipe_to_pipe() directly.
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 31ac5ecf3232b5f7cd68fffd5d227ed19b63f423
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:21 2009 -0400
c/r: restore anonymous- and file-mapped- shared memory
The bulk of the work is in ckpt_read_vma(), which has been refactored:
the part that creates the suitable 'struct file *' for the mapping is
now larger and moved to a separate function. What's left is to read
the VMA description, get the file pointer, create the mapping, and
proceed to read the contents in.
Both anonymous shared VMAs that have been read earlier (as indicated
by a lookup in the objhash) and file-mapped shared VMAs are skipped.
Anonymous shared VMAs seen for the first time have their contents
read in directly to the backing inode, as indexed by the page numbers
(as opposed to virtual addresses).
Changelog[v14]:
- Introduce patch
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 8ecc6f36671611bacc1228c957fe9575767fbe6e
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:21 2009 -0400
c/r: dump anonymous- and file-mapped- shared memory
We now handle anonymous and file-mapped shared memory. Support for IPC
shared memory requires support for IPC first. We extend ckpt_write_vma()
to detect shared memory VMAs and handle them separately from private
memory.
There is not much to do for file-mapped shared memory, except to force
msync() on the region to ensure that the file system is consistent
with the checkpoint image. Use our internal type CKPT_VMA_SHM_FILE.
Anonymous shared memory is always backed by an inode in the shmem filesystem.
We use that inode to look up the VMA in the objhash and register it if
not found (on first encounter). In this case, the type of the VMA is
CKPT_VMA_SHM_ANON, and we dump the contents. On the other hand, if it is
found there, we must have already saved it before, so we change the
type to CKPT_VMA_SHM_ANON_SKIP and skip it.
To dump the contents of a shmem VMA, we loop through the pages of the
inode in the shmem filesystem, and dump the contents of each dirty
(allocated) page - unallocated pages must be clean.
Note that we save the original size of a shmem VMA because it may have
been re-mapped partially. The format itself remains the same as for private
VMAs, except that instead of addresses we record _indices_ (page nr)
into the backing inode.
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 89e8044099e705beb3d7b6448036a3e3752dd65d
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:20 2009 -0400
c/r: export shmem_getpage() to support shared memory
Export functionality to retrieve specific pages from shared memory
given an inode in shmem-fs; this will be used in the next two patches
to provide support for c/r of shared memory.
mm/shmem.c:
- shmem_getpage() and 'enum sgp_type' moved to linux/mm.h
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 6fca0df22281982abd6fee0a36e9592e6c2bfea1
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:20 2009 -0400
c/r: restore memory address space (private memory)
Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the vma state and contents.
Call do_mmap_pgoffset() for each vma and then read in the data.
Changelog[v17]:
- Restore mm->{flags,def_flags,saved_auxv}
- Fix bogus warning in do_restore_mm()
Changelog[v16]:
- Restore mm->exe_file
Changelog[v14]:
- Introduce per vma-type restore() function
- Merge restart code into same file as checkpoint (memory.c)
- Compare saved 'vdso' field of mm_context with current value
- Check whether calls to ckpt_hbuf_get() fail
- Discard field 'h->parent'
- Revert change to pr_debug(), back to ckpt_debug()
Changelog[v13]:
- Avoid access to hh->vma_type after the header is freed
- Test for no vma's in exit_mmap() before calling unmap_vma() (or it
may crash if restart fails after having removed all vma's)
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
- Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
Changelog[v7]:
- Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)
Changelog[v5]:
- Improve memory restore code (following Dave Hansen's comments)
- Change dump format (and code) to allow chunks of <vaddrs, pages>
instead of one long list of each
- Memory restore now maps user pages explicitly to copy data into them,
instead of reading directly to user space; got rid of mprotect_fixup()
Changelog[v4]:
- Use standard list_... for ckpt_pgarr
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit ce4e519bf5d3082a8ff150b966262076390adf0b
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:20 2009 -0400
c/r: dump memory address space (private memory)
For each vma, there is a 'struct ckpt_vma'; then comes the actual
contents, in one or more chunks: each chunk begins with a header that
specifies how many pages it holds, then the virtual addresses of all
the dumped pages in that chunk, followed by the actual contents of all
dumped pages. A header with zero number of pages marks the end of the
contents. Then comes the next vma and so on.
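The chunk layout described above (header with a page count, then the addresses, then the page contents, ended by a zero-count header) can be sketched with an in-memory buffer. The field widths, the tiny page size and the toy_ function names are assumptions made for the example and do not reflect the actual image format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 16	/* tiny "pages" keep the example small */

/* Append one chunk: header (page count), all vaddrs, then all contents. */
static size_t toy_write_chunk(uint8_t *out, uint32_t nr,
			      const uint64_t *vaddrs, const uint8_t *pages)
{
	size_t off = 0;

	assert(out);
	memcpy(out + off, &nr, sizeof(nr));
	off += sizeof(nr);
	if (nr) {
		memcpy(out + off, vaddrs, nr * sizeof(*vaddrs));
		off += nr * sizeof(*vaddrs);
		memcpy(out + off, pages, (size_t)nr * TOY_PAGE_SIZE);
		off += (size_t)nr * TOY_PAGE_SIZE;
	}
	return off;
}

/* Walk chunks until the zero-page header; return the total pages seen. */
static uint32_t toy_count_pages(const uint8_t *in)
{
	uint32_t nr, total = 0;

	assert(in);
	for (;;) {
		memcpy(&nr, in, sizeof(nr));
		in += sizeof(nr);
		if (nr == 0)
			return total;	/* end-of-contents marker */
		in += nr * sizeof(uint64_t) + (size_t)nr * TOY_PAGE_SIZE;
		total += nr;
	}
}
```

Writing a chunk with a count of zero produces just the header, which is what terminates the reader's loop.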
To checkpoint a vma, call the ops->checkpoint() method of that vma.
Normally the per-vma function will invoke generic_vma_checkpoint()
which first writes the vma description, followed by the specific
logic to dump the contents of the pages.
Currently for private mapped memory we save the pathname of the file
that is mapped (restart will use it to re-open it and then map it).
Later we change that to reference a file object.
Changelog[v17]:
- Only collect sub-objects of mm_struct once
- Save mm->{flags,def_flags,saved_auxv}
Changelog[v16]:
- Precede vaddrs/pages with a buffer header
- Checkpoint mm->exe_file
- Handle shared task->mm
Changelog[v14]:
- Modify the ops->checkpoint method to be much more powerful
- Improve support for VDSO (with special_mapping checkpoint callback)
- Save new field 'vdso' in mm_context
- Revert change to pr_debug(), back to ckpt_debug()
- Check whether calls to ckpt_hbuf_get() fail
- Discard field 'h->parent'
Changelog[v13]:
- pgprot_t is an abstract type; use the proper accessor (fix for
64-bit powerpc) (Nathan Lynch <ntl at pobox.com>)
Changelog[v12]:
- Hide pgarr management inside ckpt_private_vma_fill_pgarr()
- Fix management of pgarr chain reset and alloc/expand: keep empty
pgarr in a pool chain
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
- Copy contents of 'init->fs->root' instead of pointing to them.
- Add missing test for VM_MAYSHARE when dumping memory
Changelog[v10]:
- Acquire dcache_lock around call to __d_path() in ckpt_fill_name()
Changelog[v9]:
- Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
- Test if __d_path() changes mnt/dentry (when crossing filesystem
namespace boundary). for now ckpt_fill_fname() fails the checkpoint.
Changelog[v7]:
- Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)
Changelog[v5]:
- Improve memory dump code (following Dave Hansen's comments)
- Change dump format (and code) to allow chunks of <vaddrs, pages>
instead of one long list of each
- Fix use of follow_page() to avoid faulting in non-present pages
Changelog[v4]:
- Use standard list_... for ckpt_pgarr
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit dba0c7ca96d9008bfeede6d63d3c4003c5c08866
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:19 2009 -0400
c/r: introduce method '->checkpoint()' in struct vm_operations_struct
Changelog[v17]
- Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit b7348f33ec9db1b1cf29dd290dfdb33c1ef9802a
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:19 2009 -0400
c/r: add generic '->checkpoint()' f_op to simple devices
* /dev/null
* /dev/zero
* /dev/random
* /dev/urandom
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit b1990378eacbb39729bb6f9d1badcb4f42200fd0
Author: Dave Hansen <dave at linux.vnet.ibm.com>
Date: Tue Jul 14 17:08:18 2009 -0400
c/r: add generic '->checkpoint' f_op to ext fses
This marks ext[234] as being checkpointable. There will be many
more to do this to, but this is a start.
Signed-off-by: Dave Hansen <dave at linux.vnet.ibm.com>
commit e7dd7f78fcd0d475f38ea6c5d56c46bd21e6b802
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:18 2009 -0400
c/r: restore open file descriptors
For each fd read 'struct ckpt_hdr_file_desc' and lookup objref in the
hash table; if not found in the hash table (first occurrence), read in
'struct ckpt_hdr_file', create a new file and register in the hash.
Otherwise attach the file pointer from the hash as an FD.
Changelog[v17]:
- Validate f_mode after restore against saved f_mode
- Fail if f_flags have O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC
- Reorder patch (move earlier in series)
- Handle shared files_struct objects
Changelog[v14]:
- Introduce a per file-type restore() callback
- Revert change to pr_debug(), back to ckpt_debug()
- Rename: restore_files() => restore_fd_table()
- Rename: ckpt_read_fd_data() => restore_file()
- Check whether calls to ckpt_hbuf_get() fail
- Discard field 'hh->parent'
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit e64abbd0cd0b36362c1ba0bae6fde52107a9ef23
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:18 2009 -0400
c/r: dump open file descriptors
Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, they are assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).
Also provide generic_checkpoint_file() and generic_restore_file(),
which are good for normal files and directories. They do not yet
support unlinked files or directories.
Changelog[v17]:
- Only collect sub-objects of files_struct once
- Better file error debugging
- Use (new) d_unlinked()
Changelog[v16]:
- Fix compile warning in checkpoint_bad()
Changelog[v16]:
- Reorder patch (move earlier in series)
- Handle shared files_struct objects
Changelog[v14]:
- File objects are dumped/restored prior to the first reference
- Introduce a per file-type restore() callback
- Use struct file_operations->checkpoint()
- Put code for generic file descriptors in a separate function
- Use one CKPT_FILE_GENERIC for both regular files and dirs
- Revert change to pr_debug(), back to ckpt_debug()
- Use only unsigned fields in checkpoint headers
- Rename: ckpt_write_files() => checkpoint_fd_table()
- Rename: ckpt_write_fd_data() => checkpoint_file()
- Discard field 'h->parent'
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
- Discard handling of opened symlinks (there is no such thing)
- ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
- Fix a couple of leaks in ckpt_write_files()
- Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
- initialize 'coe' to workaround gcc false warning
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 8ede64ce064c8cbb789c79913f4d48c47c425b44
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:18 2009 -0400
c/r: introduce '->checkpoint()' method in 'struct file_operations'
While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.
This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.
As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single "generic"
f_op entry.
Also introduce vfs_fcntl() so that it can be called from restart (see
patch adding restart of files).
Changelog[v17]
- Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 5595c713347321dc0dfe3c691a1245c0a0235ae8
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:17 2009 -0400
c/r: detect resource leaks for whole-container checkpoint
Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
checkpoint, return an error code if the actual objects' counts are
higher, indicating leaks (references to the objects from a task not
being checkpointed). Of course, by this time most of the checkpoint
image has been written out to disk, so this is purely advisory. But
then, it's probably naive to argue that anything more than an advisory
'this went wrong' error code is useful.
The comparison of the objhash user counts to object refcounts as a
basis for checking for leaks comes from Alexey's OpenVZ-based c/r
patchset.
"Leak detection" occurs _before_ any real state is saved, as a
pre-step. This prevents races due to sharing with outside world where
the sharing ceases before the leak test takes place, thus protecting
the checkpoint image from inconsistencies.
Once leak testing concludes, checkpoint will proceed. Because objects
are already in the objhash, checkpoint_obj() cannot distinguish
between the first and subsequent encounters. This is solved with a
flag (CKPT_OBJ_CHECKPOINTED) per object.
Two additional checks take place during checkpoint: one for objects
that were created, and one for objects that were destroyed, while the
leak-detection pre-step took place.
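The comparison behind leak detection can be sketched in plain userspace C.
This is only an illustration under stated assumptions: 'struct obj_entry',
its fields and leak_check() are hypothetical names, and the sketch ignores
the extra reference that the hash itself holds on each object.

```c
#include <stddef.h>

/* Hypothetical stand-in for an objhash entry; names are illustrative. */
struct obj_entry {
	int users;     /* references counted from checkpointed tasks */
	int refcount;  /* the object's actual reference count */
};

/*
 * Compare the counted users against the real refcount of every object.
 * Returns 1 on a leak (someone outside the checkpointed set holds a
 * reference), -1 on a reverse-leak (a reference disappeared), 0 if all
 * counts match.
 */
int leak_check(const struct obj_entry *objs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (objs[i].refcount > objs[i].users)
			return 1;
		if (objs[i].refcount < objs[i].users)
			return -1;
	}
	return 0;
}
```

In this model a non-zero result would fail a whole-container checkpoint,
while a CHECKPOINT_SUBTREE checkpoint would skip the test altogether.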
Changelog[v17]:
- Leak detection is performed in two-steps
- Detect reverse-leaks (objects disappearing unexpectedly)
- Skip reverse-leak detection if ops->ref_users isn't defined
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit a98590564f0ce626b6d05c35f1894e96aae24d7f
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:17 2009 -0400
c/r: infrastructure for shared objects
The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.
On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.
The hash is "one-way": objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.
The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.
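The checkpoint-side behaviour (full state on first encounter, only the
identifier afterwards) can be sketched as follows. This is a toy model, not
the patch's code: it uses a linear table instead of a real hash, takes no
references, and 'struct objtable' and obj_ref() are hypothetical names.

```c
#include <stddef.h>

#define MAX_OBJS 64

/* Toy table; the real code hashes on the pointer and grabs a reference. */
struct objtable {
	const void *ptr[MAX_OBJS];
	int count;
};

/*
 * Return the objref for @ptr. *first is set to 1 on the first encounter
 * (the caller would then dump the full state) and to 0 on later ones
 * (only the objref is written). Returns -1 for a NULL pointer or when
 * the table is full.
 */
int obj_ref(struct objtable *t, const void *ptr, int *first)
{
	if (!ptr)
		return -1;
	for (int i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (t->count == MAX_OBJS)
		return -1;
	t->ptr[t->count++] = ptr;
	*first = 1;
	return t->count;	/* objrefs start at 1 in this sketch */
}
```

Entries are never removed, mirroring the "one-way" property: the table is
simply dropped as a whole when the operation ends.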
Changelog[v17]:
- Add ckpt_obj->flags with CKPT_OBJ_CHECKPOINTED flag
- Add prototype of ckpt_obj_lookup
- Complain on attempt to add NULL ptr to objhash
- Prepare for 'leaks detection'
Changelog[v16]:
- Introduce ckpt_obj_lookup() to find an object by its ptr
Changelog[v14]:
- Introduce 'struct ckpt_obj_ops' to better modularize shared objs.
- Replace long 'switch' statements with table lookups and callbacks.
- Introduce checkpoint_obj() and restart_obj() helpers
- Shared objects now dumped/saved right before they are referenced
- Cleanup interface of shared objects
Changelog[v13]:
- Use hash_long() with 'unsigned long' cast to support 64bit archs
(Nathan Lynch <ntl at pobox.com>)
Changelog[v11]:
- Doc: be explicit about grabbing a reference and object lifetime
Changelog[v4]:
- Fix calculation of hash table size
Changelog[v3]:
- Use standard hlist_... for hash table
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit fea789e4a2ccaa1cdd1e5e4c9fc9a2679b70ea9e
Author: Matt Helsley <matthltc at us.ibm.com>
Date: Tue Jul 14 17:08:17 2009 -0400
Save and restore the [compat_]robust_list member of the task struct.
These lists record which futexes the task holds. To keep the overhead of
robust futexes low the list is kept in userspace. When the task exits the
kernel carefully walks these lists to recover held futexes that
other tasks may be attempting to acquire with FUTEX_WAIT.
Because they point to userspace memory that is saved/restored by
checkpoint/restart, saving the list pointers themselves is safe.
While saving the pointers is safe during checkpoint, restart is tricky
because the robust futex ABI contains provisions for changes based on
checking the size of the list head. So we need to save the length of
the list head too in order to make sure that the kernel used during
restart is capable of handling that ABI. Since there is only one ABI
supported at the moment taking the list head's size is simple. Should
the ABI change we will need to use the same size as specified during
sys_set_robust_list() and hence some new means of determining the length
of this userspace structure in sys_checkpoint would be required.
Rather than rewrite the logic that checks and handles the ABI we reuse
sys_set_robust_list() by factoring out the body of the function and
calling it during restart.
Signed-off-by: Matt Helsley <matthltc at us.ibm.com>
commit 681aac505a46b46177275bdf6cae425389a17ed5
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:16 2009 -0400
c/r: support for zombie processes
During checkpoint, a zombie process needs only to save p->comm,
p->state, p->exit_state, and p->exit_code.
During restart, zombie processes are created like all other
processes. They validate the saved exit_code and restore p->comm
and p->exit_code. Then they call do_exit() instead of waking
up the next task in line.
But before, they place the @ctx in p->checkpoint_ctx, so that
only at exit time they will wake up the next task in line,
and drop the reference to the @ctx.
This provides the guarantee that when the coordinator's wait
completes, all normal tasks completed their restart, and all
zombie tasks are already zombified (as opposed to perhaps only
becoming a zombie).
Changelog[v17]:
- Validate t->exit_signal for both threads and leader
- Skip zombies in most of may_checkpoint_task()
- Save/restore t->pdeath_signal
- Validate ->exit_signal and ->pdeath_signal
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 07d0033191b3089d9dd65696951ac41ca5cc98e6
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:16 2009 -0400
c/r: introduce PF_RESTARTING, and skip notification on exit
To restore zombies we will create a task that, on its turn to
run, calls do_exit(). Unlike normal tasks that exit, we need to
prevent notification side effects that send signals to other
processes, e.g. parent (SIGCHLD) or child tasks (per child's request).
There are three main cases for such notifications:
1) do_notify_parent(): parent of a process is notified about a change
in status (e.g. become zombie, reparent, etc). If parent ignores,
then mark child for immediate release (skip zombie).
2) kill_orphan_pgrp(): a process group that becomes orphaned will
signal stopped jobs (HUP then CONT).
3) reparent_thread(): children of a process are signaled (per request)
with p->pdeath_signal
Remember that restoring signal state (for any restarting task) must
complete _before_ it is allowed to resume execution, and not during
the resume. Otherwise, a running task may send a signal to another
task that hasn't restored yet, so the new signal will be lost
soon-after.
I considered two possible ways to address this:
1. Add another sync point to restart: all tasks will first restore
their state without signals (all signals blocked), and zombies call
do_exit(). A sync point then will ensure that all zombies are gone and
their effects done. Then all tasks restore their signal state (and
mask), and sync (new point) again. Only then they may resume
execution.
The main disadvantage is the added complexity and inefficiency,
for no good reason.
2. Introduce PF_RESTARTING: mark all restarting tasks with a new flag,
and teach the above three notifications to skip sending the signal if
this flag is set.
The main advantage is simplicity and completeness. Also, such a flag
may be useful later on. This is the method implemented.
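A minimal userspace model of option 2, assuming a single notification path
for brevity. The flag value, 'struct task_sketch' and
notify_parent_on_exit() are hypothetical and only illustrate the "skip the
signal if the flag is set" idea; they are not the kernel's definitions.

```c
#include <stdbool.h>

#define PF_RESTARTING_SKETCH 0x1u	/* hypothetical bit value */

struct task_sketch {
	unsigned int flags;
	int signals_received;	/* notifications delivered to this task */
};

/*
 * Model of an exit-time notification. Returns true if the parent was
 * signaled, false if the notification was skipped because the exiting
 * child is still marked as restarting.
 */
bool notify_parent_on_exit(const struct task_sketch *child,
			   struct task_sketch *parent)
{
	if (child->flags & PF_RESTARTING_SKETCH)
		return false;	/* restored zombie: no SIGCHLD side effect */
	parent->signals_received++;
	return true;
}
```

The same test would guard the orphaned-pgrp and pdeath_signal paths, so no
extra sync point is needed.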
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit e8304d20bc3ad7c7cf725cd323763d2de3df5068
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 17:08:11 2009 -0400
c/r: restart multiple processes
Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself in the same order in which the tasks were saved. The internals of
the syscall will take care of in-kernel synchronization between tasks.
This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.
There is one special task - the coordinator - that is not part of the
restarted hierarchy. The coordinator task allocates the restart
context (ctx) and orchestrates the restart. Thus even if a restart
fails after, or during the restore of the root task, the user
perceives a clean exit and an error message.
The coordinator task will:
1) read header and tree, create @ctx (wake up restarting tasks)
2) set the ->checkpoint_ctx field of itself and all descendants
3) wait for all restarting tasks to reach sync point #1
4) activate first restarting task (root task)
5) wait for all other tasks to complete and reach sync point #3
6) wake up everybody
(Note that in step #2 the coordinator assumes that the entire task
hierarchy exists by the time it enters sys_restart; this is arranged
in user space by 'mktree')
Tasks that are restarting have three sync points:
1) wait for its ->checkpoint_ctx to be set (by the coordinator)
2) wait for the task's turn to restore (be active)
[...now the task restores its state...]
3) wait for all other tasks to complete
The third sync point ensures that a task may only resume execution
after all tasks have successfully restored their state (or fail if an
error has occurred). This prevents tasks from returning to user space
prematurely, before the entire restart completes.
If a single task wishes to restart, it can set the "RESTART_TASKSELF"
flag to restart(2) to skip the logic of the coordinator.
The root-task is a child of the coordinator, identified by the @pid
given to sys_restart() in the pid-ns of the coordinator. Restarting
tasks that aren't the coordinator, should set the @pid argument of
restart(2) syscall to zero.
All tasks explicitly test for an error flag on the checkpoint context
when they wakeup from sync points. If an error occurs during the
restart of some task, it will mark the @ctx with an error flag, and
wakeup the other tasks.
An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (@ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.
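The "current position" hand-off can be modeled in a few lines. This is a
single-threaded sketch with hypothetical names ('struct
restart_ctx_sketch', is_active(), restore_done()); the real code sleeps
and wakes tasks instead of returning values.

```c
struct restart_ctx_sketch {
	const int *pids;	/* pids in the order saved at checkpoint */
	int nr_pids;
	int active;		/* index of the task whose turn it is */
};

/* Is it @pid's turn to restore its state? */
int is_active(const struct restart_ctx_sketch *ctx, int pid)
{
	return ctx->active < ctx->nr_pids && ctx->pids[ctx->active] == pid;
}

/*
 * Called by the active task once its own restore is complete. Advances
 * the position and returns the pid of the task to wake up next, or 0
 * when every task in the array has been restored.
 */
int restore_done(struct restart_ctx_sketch *ctx)
{
	if (ctx->active < ctx->nr_pids)
		ctx->active++;
	return ctx->active < ctx->nr_pids ? ctx->pids[ctx->active] : 0;
}
```

A return value of 0 corresponds to the point where the last task is done
and everybody can be released from the final sync point.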
Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.
In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.
Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes
(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.
Changelog[v17]:
- Add uflag RESTART_FROZEN to freeze tasks after restart
- Fix restore_retval() and use only for restarting tasks
- Coordinator converts -ERSTART... to -EINTR
- Coordinator marks and sets descendants' ->checkpoint_ctx
- Coordinator properly detects errors when woken up from wait
- Fix race where root_task could kick start too early
- Add a sync point for restarting tasks
- Multiple fixes to restart logic
Changelog[v14]:
- Revert change to pr_debug(), back to ckpt_debug()
- Discard field 'h.parent'
- Check whether calls to ckpt_hbuf_get() fail
Changelog[v13]:
- Clear root_task->checkpoint_ctx regardless of error condition
- Remove unused argument 'ctx' from do_restore_task() prototype
- Remove unused member 'pids_err' from 'struct ckpt_ctx'
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 1123b4105fe5bd8c29c93ea9f31ef43dc6d90e1d
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 16:45:35 2009 -0400
c/r: checkpoint multiple processes
Checkpointing of multiple processes works by recording the tasks tree
structure below a given "root" task. The root task is expected to be a
container init, and then an entire container is checkpointed. However,
passing CHECKPOINT_SUBTREE to checkpoint(2) relaxes this requirement
and allows checkpointing a subtree of processes from the root task.
For a given root task, do a DFS scan of the tasks tree and collect them
into an array (keeping a reference to each task). Using DFS simplifies
the recreation of tasks either in user space or kernel space. For each
task collected, test if it can be checkpointed, and save its pid, tgid,
and ppid.
The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.
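The count-then-fill scheme can be sketched on a toy tree. 'struct tnode'
and the three functions are hypothetical; the real code walks task_struct
children under tasklist_lock and keeps a reference to each task, which
this userspace model omits.

```c
#include <stdlib.h>

struct tnode {
	int pid;
	struct tnode *child;	/* first child */
	struct tnode *sibling;	/* next sibling */
};

/* Pass 1: count the tasks in the subtree rooted at @t. */
int count_tasks(const struct tnode *t)
{
	int n = 1;

	for (const struct tnode *c = t->child; c; c = c->sibling)
		n += count_tasks(c);
	return n;
}

/* Pass 2: write pids in DFS pre-order; returns how many were written. */
int fill_tasks(const struct tnode *t, int *out)
{
	int n = 0;

	out[n++] = t->pid;
	for (const struct tnode *c = t->child; c; c = c->sibling)
		n += fill_tasks(c, out + n);
	return n;
}

/* Returns a malloc'ed pid array (caller frees) and its length in *nr. */
int *collect_tasks(const struct tnode *root, int *nr)
{
	int *arr;

	*nr = count_tasks(root);
	arr = malloc((size_t)*nr * sizeof(*arr));
	if (arr)
		fill_tasks(root, arr);
	return arr;
}
```

Pre-order guarantees that a parent always appears before its children,
which is what makes recreating the tree from the array straightforward.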
Whether checkpoints and restarts require CAP_SYS_ADMIN is determined
by sysctl 'ckpt_unpriv_allowed': if 1, unprivileged use is allowed and
regular permission checks are intended to prevent privilege escalation;
if 0, CAP_SYS_ADMIN is required, which prevents unprivileged users from
exploiting any privilege escalation bugs.
The logic is suitable for creation of processes during restart either
in userspace or by the kernel.
Currently we ignore threads and zombies.
Changelog[v16]:
- CHECKPOINT_SUBTREE flags allows subtree (not whole container)
- sysctl variable 'ckpt_unpriv_allowed' controls needed privileges
Changelog[v14]:
- Refuse non-self checkpoint if target task isn't frozen
- Refuse checkpoint (for now) if task is ptraced
- Revert change to pr_debug(), back to ckpt_debug()
- Use only unsigned fields in checkpoint headers
- Check retval of ckpt_tree_count_tasks() in ckpt_build_tree()
- Discard 'h.parent' field
- Check whether calls to ckpt_hbuf_get() fail
- Disallow threads or siblings to container init
Changelog[v13]:
- Release tasklist_lock in error path in ckpt_tree_count_tasks()
- Use separate index for 'tasks_arr' and 'hh' in ckpt_write_pids()
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit aa57087333333f0dea9ec5b08c7312d922050174
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 16:45:28 2009 -0400
c/r: restart-blocks
(Paraphrasing what's said in this message:
http://lists.openwall.net/linux-kernel/2007/12/05/64)
Restart blocks are callbacks used to cause a system call to be restarted
with the arguments specified in the system call restart block. They are
useful for system calls that are not idempotent, i.e. the argument(s)
might be a relative timeout, where some adjustments are required when
restarting the system call. They rely on the system call itself to set
up its restart point and the argument save area. They are rare: an
actual signal would turn that into an EINTR. The only case that should
ever trigger this is some kernel action that interrupts the system
call, but does not actually result in any user-visible state changes -
like freeze and thaw.
So restart blocks are about time remaining for the system call to
sleep/wait. Generally in c/r, there are two possible time models that
we can follow: absolute, relative. Here, I chose to save the relative
timeout, measured from the beginning of the checkpoint. The time when
the checkpoint (and restart) begin is also saved. This information is
sufficient to restart in either model (absolute or relative).
Which model to use should eventually be a per application choice (and
possibly configurable via cradvise() or something of the sort). For now, we adopt
the relative model, namely, at restart the timeout is set relative to
the beginning of the restart.
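The arithmetic of the two time models can be sketched as follows. The
function names are hypothetical and times are plain integers in an
arbitrary unit; the point is only that the saved relative timeout plus the
two saved start times are enough to compute either expiry.

```c
#include <stdint.h>

/* Timeout saved at checkpoint: time left, measured from checkpoint start. */
int64_t saved_timeout(int64_t expires, int64_t ckpt_begin)
{
	return expires > ckpt_begin ? expires - ckpt_begin : 0;
}

/* Relative model (the one adopted): count from the start of the restart. */
int64_t restart_expiry_relative(int64_t restart_begin, int64_t remaining)
{
	return restart_begin + remaining;
}

/* Absolute model: keep the original wall-clock expiry. */
int64_t restart_expiry_absolute(int64_t ckpt_begin, int64_t remaining)
{
	return ckpt_begin + remaining;
}
```

Under the absolute model the expiry may already lie in the past by the
time of the restart, which is one reason the choice is best left to the
application.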
To checkpoint, we check if a task has a valid restart block, and if so
we save the *remaining* time that it has to wait/sleep, and the type
of the restart block.
To restart, we fill in the data required at the proper place in the
thread information. If the system call returns an error (which is
possibly -ERESTARTSYS, e.g.), we not only use that error as our own
return value, but also arrange for the task to execute the signal
handler (by faking a signal). The handler, in turn, already has the
code to handle these restart requests gracefully.
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit a76235f381ee4ce422b00b746aa5fd5e259ce4ff
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 16:44:37 2009 -0400
c/r: export functionality used in next patch for restart-blocks
To support c/r of restart-blocks (system calls that need to be
restarted because they were interrupted but there was no userspace
visible side-effect), export restart-block callbacks for poll()
and futex() syscalls.
More details on c/r of restart-blocks and how it works in the
following patch.
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
Acked-by: Serge Hallyn <serue at us.ibm.com>
commit 521f272dfa3a46d56f32403b35f5d3aaa7309410
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 16:44:16 2009 -0400
c/r: external checkpoint of a task other than ourself
Now we can do "external" checkpoint, i.e. act on another task.
sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints that corresponding task. That task should be the root of
a container, unless CHECKPOINT_SUBTREE flag is given.
Set state of freezer cgroup of checkpointed task hierarchy to
"CHECKPOINTING" during a checkpoint, to ensure that task(s) cannot be
thawed while at it.
Ensure that all tasks belong to root task's freezer cgroup (the root
task is also tested, to detect if it changes its freezer cgroup
before it moves to "CHECKPOINTING").
sys_restart() remains nearly the same, as the restart is always done
in the context of the restarting task. However, the original task may
have been frozen from user space, or interrupted from a syscall for
the checkpoint. This is accounted for by restoring a suitable retval
for the restarting task, according to how it was checkpointed.
Changelog[v17]:
- Move restore_retval() to this patch
- Tighten ptrace check for checkpoint to PTRACE_MODE_ATTACH
- Use CHECKPOINTING state for hierarchy's freezer for checkpoint
Changelog[v16]:
- Use CHECKPOINT_SUBTREE to allow subtree (partial container)
Changelog[v14]:
- Refuse non-self checkpoint if target task isn't frozen
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
- Copy contents of 'init->fs->root' instead of pointing to them
Changelog[v10]:
- Grab vfs root of container init, rather than current process
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 29ca904cf0c01960ae30e37ab181d3e6ae40bf46
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 16:37:43 2009 -0400
c/r: x86_32 support for checkpoint/restart
Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.
In addition, architecture capabilities are saved in an architecture
specific extension of the header (ckpt_hdr_head_arch); currently this
includes only FPU capabilities.
Currently only x86-32 is supported.
Changelog[v17]:
- Fix compilation for architectures that don't support checkpoint
- Validate cpu registers and TLS descriptors on restart
- Validate debug registers on restart
- Export asm/checkpoint_hdr.h to userspace
Changelog[v16]:
- All objects are preceded by ckpt_hdr (TLS and xstate_buf)
- Add architecture identifier to main header
Changelog[v14]:
- Use new interface ckpt_hdr_get/put()
- Embed struct ckpt_hdr in struct ckpt_hdr...
- Remove preempt_disable/enable() around init_fpu() and fix leak
- Revert change to pr_debug(), back to ckpt_debug()
- Move code related to task_struct to checkpoint/process.c
Changelog[v12]:
- A couple of missed calls to ckpt_hbuf_put()
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
- Add arch-specific header that details architecture capabilities;
split FPU restore to send capabilities only once.
- Test for zero TLS entries in ckpt_write_thread()
- Fix asm/checkpoint_hdr.h so it can be included from user-space
Changelog[v7]:
- Fix save/restore state of FPU
Changelog[v5]:
- Remove preempt_disable() when restoring debug registers
Changelog[v4]:
- Fix header structure alignment
Changelog[v2]:
- Pad header structures to 64 bits to ensure compatibility
- Follow Dave Hansen's refactoring of the original post
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit fd25315ce9d1cbe2957b66b60d4323e079d5e942
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 16:37:39 2009 -0400
c/r: basic infrastructure for checkpoint/restart
Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:
checkpoint/sys.c - user/kernel data transfer, as well as setup of the
c/r context (a per-checkpoint data structure for housekeeping)
checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
checkpoint/restart.c - input wrappers and basic restart handling
checkpoint/process.c - c/r of task data
For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.
Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.
Changelog[v17]:
- Fix compilation for architectures that don't support checkpoint
- Save/restore t->{set,clear}_child_tid
- Restart(2) isn't idempotent: must return -EINTR if interrupted
- ckpt_debug does not depend on DYNAMIC_DEBUG, on by default
- Export generic checkpoint headers to userspace
- Fix comment for prototype of sys_restart
- Have ckpt_debug() print global-pid and __LINE__
- Only save and test kernel constants once (in header)
Changelog[v16]:
- Split ctx->flags to ->uflags (user flags) and ->kflags (kernel flags)
- Introduce __ckpt_write_err() and ckpt_write_err() to report errors
- Allow @ptr == NULL to write (or read) header only without payload
- Introduce _ckpt_read_obj_type()
Changelog[v15]:
- Replace header buffer in ckpt_ctx (hbuf,hpos) with kmalloc/kfree()
Changelog[v14]:
- Cleanup interface to get/put hdr buffers
- Merge checkpoint and restart code into a single file (per subsystem)
- Take uts_sem around access to uts->{release,version,machine}
- Embed ckpt_hdr in all ckpt_hdr_...., cleanup read/write helpers
- Define sys_checkpoint(0,...) as asking for a self-checkpoint (Serge)
- Revert use of 'pr_fmt' to avoid tainting whom includes us (Nathan Lynch)
- Explicitly indicate length of UTS fields in header
- Discard field 'h->parent' from ckpt_hdr
Changelog[v12]:
- ckpt_kwrite/ckpt_kread() again use vfs_read(), vfs_write() (safer)
- Split ckpt_write/ckpt_read() to two parts: _ckpt_write/read() helper
- Befriend with sparse : explicit conversion to 'void __user *'
- Redefine 'pr_fmt' instead of using special ckpt_debug()
Changelog[v10]:
- add ckpt_write_buffer(), ckpt_read_buffer() and ckpt_read_buf_type()
- force end-of-string in ckpt_read_string() (fix possible DoS)
Changelog[v9]:
- ckpt_kwrite/ckpt_kread() use file->f_op->write() directly
- Drop ckpt_uwrite/ckpt_uread() since they aren't used anywhere
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(although it's not really needed)
Changelog[v5]:
- Rename headers files s/ckpt/checkpoint/
Changelog[v2]:
- Added utsname->{release,version,machine} to checkpoint header
- Pad header structures to 64 bits to ensure compatibility
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 672fa8ad9ed86a9faea572b7de9f2c9cb4f4cadf
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 16:21:55 2009 -0400
c/r: documentation
Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.
Changelog[v16]:
- Update documentation
- Unify into readme.txt and usage.txt
Changelog[v14]:
- Discard the 'h.parent' field
- New image format (shared objects appear before they are referenced
unless they are compound)
Changelog[v8]:
- Split into multiple files in Documentation/checkpoint/...
- Extend documentation, fix typos and comments from feedback
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
Acked-by: Serge Hallyn <serue at us.ibm.com>
Signed-off-by: Dave Hansen <dave at linux.vnet.ibm.com>
commit 0a5b0caac4574e0d1a399e6722a26e068c14bc17
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 16:21:55 2009 -0400
c/r: create syscalls: sys_checkpoint, sys_restart
Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.
The syscalls take a pid, a file descriptor (for the image file) and
flags as arguments. The pid identifies the top-most (root) task in the
process tree, e.g. the container init: for sys_checkpoint the first
argument identifies the pid of the target container/subtree; for
sys_restart it will identify the pid of restarting root task.
A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.
By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.
We don't use copy_from_user()/copy_to_user() because it requires
holding the entire image in user space, and does not make sense for
restart. Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic. They also would significantly complicate
checkpoint that includes self.
Changelog[v17]:
- Move checkpoint closer to namespaces (kconfig)
- Kill "Enable" in c/r config option
Changelog[v16]:
- Change sys_restart() first argument to be 'pid_t pid'
Changelog[v14]:
- Change CONFIG_CHEKCPOINT_RESTART to CONFIG_CHECKPOINT (Ingo)
- Remove line 'def_bool n' (default is already 'n')
- Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Changelog[v5]:
- Config is 'def_bool n' by default
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
Acked-by: Serge Hallyn <serue at us.ibm.com>
Signed-off-by: Dave Hansen <dave at linux.vnet.ibm.com>
commit c2ceb7f7fe66b1285e7954c0acde5809239385d1
Author: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Date: Tue Jul 14 16:21:54 2009 -0400
pids 7/7: Define clone_with_pids syscall
Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.
clone_with_pids(), intended for use during restart, is the same as clone(),
except that it takes a 'target_pid_set' parameter. This parameter lets the caller
choose specific pid numbers for the child process, in the process's active
and ancestor pid namespaces. (Descendant pid namespaces in general don't
matter since processes don't have pids in them anyway, but see comments
in copy_target_pids() regarding CLONE_NEWPID).
Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
prevent unprivileged processes from misusing this interface.
Call clone_with_pids as follows:
pid_t pids[] = { 0, 77, 99 };
struct target_pid_set pid_set;
pid_set.num_pids = sizeof(pids) / sizeof(pids[0]);
pid_set.target_pids = pids;
syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);
If a target pid is 0, the kernel assigns a pid for the process in that
namespace as usual. In the above example, pids[0] is 0, meaning the kernel will
assign the next available pid to the process in init_pid_ns. The kernel will,
however, assign pid 77 in child pid namespace 1 and pid 99 in pid namespace 2.
If either 77 or 99 is taken, the system call fails with -EBUSY.
If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
the system call fails with -EINVAL.
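The rules above can be modelled in userspace as a toy sketch. Everything named toy_* and the MAX_* limits are hypothetical and exist only for illustration; this is not the kernel implementation. Two further assumptions: when fewer pids are given than there are nesting levels, they apply to the youngest namespaces (one reading of the v3 changelog below), and -EAGAIN is returned when a namespace has no free pid.

```c
#include <assert.h>
#include <errno.h>

#define MAX_LEVELS 8	/* hypothetical limit on namespace nesting */
#define MAX_PID    128	/* hypothetical range: valid pids are 1..MAX_PID-1 */

/* Toy pid namespace: used[p] != 0 means pid p is taken. */
struct toy_pid_ns {
	unsigned char used[MAX_PID];
};

/* Lowest free pid in ns, or 0 if none is free. */
int toy_next_free(const struct toy_pid_ns *ns)
{
	for (int p = 1; p < MAX_PID; p++)
		if (!ns->used[p])
			return p;
	return 0;
}

/*
 * Choose one pid per namespace level for a new process.
 * ns[0] is the outermost namespace (init_pid_ns in the example) and
 * ns[depth - 1] is the process's active namespace.
 * A target of 0 lets the model pick; otherwise that exact pid is required.
 * On success fills out[0..depth-1] and returns 0. On failure returns a
 * negative errno and leaves every namespace unchanged.
 */
int toy_clone_with_pids(struct toy_pid_ns *ns, int depth,
			const int *target, int num_pids, int *out)
{
	assert(ns && out);
	if (depth < 1 || depth > MAX_LEVELS)
		return -EINVAL;
	if (num_pids < 0 || num_pids > depth)
		return -EINVAL;	/* more pids than nesting levels */

	/* Pass 1: decide every pid without modifying anything. */
	for (int i = 0; i < depth; i++) {
		/* Targets cover only the youngest num_pids namespaces. */
		int t = i - (depth - num_pids);
		int want = (t >= 0) ? target[t] : 0;

		if (want == 0) {
			want = toy_next_free(&ns[i]);
			if (want == 0)
				return -EAGAIN;	/* assumed error code */
		} else if (want < 0 || want >= MAX_PID) {
			return -EINVAL;
		} else if (ns[i].used[want]) {
			return -EBUSY;
		}
		out[i] = want;
	}

	/* Pass 2: commit the chosen pids. */
	for (int i = 0; i < depth; i++)
		ns[i].used[out[i]] = 1;
	return 0;
}
```

With three nesting levels and targets { 0, 77, 99 }, the model picks the next free pid at level 0 and the requested pids at levels 1 and 2. Repeating the same call fails with -EBUSY because 77 is now taken, and passing four pids for three levels fails with -EINVAL.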
It's mostly an exploratory patch seeking feedback on the interface.
NOTE:
Compared to clone(), clone_with_pids() needs to pass in two more
pieces of information:
- number of pids in the set
- user buffer containing the list of pids.
But since clone() already takes 5 parameters, use a 'struct
target_pid_set'.
TODO:
- Gently tested.
- May need additional sanity checks in do_fork_with_pids().
Changelog[v3]:
- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
in the target_pids[] list and setting it 0. See copy_target_pids()).
- (Oren Laadan) Specified target pids should apply only to youngest
pid-namespaces (see copy_target_pids())
- (Matt Helsley) Update patch description.
Changelog[v2]:
- Remove unnecessary printk and add a note to callers of
copy_target_pids() to free target_pids.
- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
'num_pids == 0' (fall back to normal clone()).
- Move arch-independent code (sanity checks and copy-in of target-pids)
into kernel/fork.c and simplify sys_clone_with_pids()
Changelog[v1]:
- Fixed some compile errors (had fixed these errors earlier in my
git tree but had not refreshed patches before emailing them)
Signed-off-by: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
commit 17ea3ea9c73b1f85b1119563cdfd6a3bd1012ffa
Author: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Date: Tue Jul 14 16:21:54 2009 -0400
pids 6/7: Define do_fork_with_pids()
do_fork_with_pids() is the same as do_fork(), except that it takes an
additional 'pid_set' parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.
Changelog[v3]:
- Fix "long-line" warning from checkpatch.pl
Changelog[v2]:
- To facilitate moving architecture-independent code to kernel/fork.c
pass in 'struct target_pid_set __user *' to do_fork_with_pids()
rather than 'pid_t *' (next patch moves the arch-independent
code to kernel/fork.c)
Signed-off-by: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue at us.ibm.com>
Reviewed-by: Oren Laadan <orenl at cs.columbia.edu>
commit 9259ece4e673844149d6d08803c66abdeefa8243
Author: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Date: Tue Jul 14 16:21:54 2009 -0400
pids 5/7: Add target_pids parameter to copy_process()
The new parameter will be used in a follow-on patch when clone_with_pids()
is implemented.
Signed-off-by: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue at us.ibm.com>
Reviewed-by: Oren Laadan <orenl at cs.columbia.edu>
commit 7ffa84de27c8e51974dc65d50788f009910d440e
Author: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Date: Tue Jul 14 16:21:53 2009 -0400
pids 4/7: Add target_pids parameter to alloc_pid()
This parameter is currently NULL, but will be used in a follow-on patch.
Signed-off-by: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue at us.ibm.com>
Reviewed-by: Oren Laadan <orenl at cs.columbia.edu>
commit 01d990434c6b77be5ca6a38071167f0fa5217ed0
Author: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Date: Tue Jul 14 16:21:53 2009 -0400
pids 3/7: Add target_pid parameter to alloc_pidmap()
With support for setting a specific pid number for a process,
alloc_pidmap() will need a 'target_pid' parameter.
Changelog[v2]:
- (Serge Hallyn) Check for 'pid < 0' in set_pidmap(). (Code
actually checks for 'pid <= 0' for completeness.)
Signed-off-by: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue at us.ibm.com>
Reviewed-by: Oren Laadan <orenl at cs.columbia.edu>
commit db5d3b14baeb5122abb7b60db5fa6e6c3d7eccf9
Author: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Date: Tue Jul 14 16:21:53 2009 -0400
pids 2/7: Have alloc_pidmap() return actual error code
alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed. With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.
Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.
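A minimal userspace sketch of the idea, with each failure mode mapped to its own error code. The toy_* names and TOY_PID_MAX are hypothetical. -ENOMEM and -EBUSY follow the descriptions in this series; -EINVAL for an invalid pid and -EAGAIN for an exhausted pid space are assumptions made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define TOY_PID_MAX 64	/* hypothetical; valid pids are 1..TOY_PID_MAX-1 */

/* Toy pid map: the backing page is allocated lazily, on first use. */
struct toy_pidmap {
	unsigned char *page;	/* NULL until first allocation */
};

/* Allocate the backing page if needed; 0 on success, -ENOMEM on failure. */
int toy_alloc_pidmap_page(struct toy_pidmap *map)
{
	if (map->page)
		return 0;
	map->page = calloc(TOY_PID_MAX, 1);
	return map->page ? 0 : -ENOMEM;
}

/*
 * Returns the allocated pid (> 0), or a negative errno that says why
 * the allocation failed, so callers need not assume -ENOMEM.
 * target_pid == 0 means "any free pid".
 */
int toy_alloc_pidmap(struct toy_pidmap *map, int target_pid)
{
	int err;

	assert(map);
	err = toy_alloc_pidmap_page(map);
	if (err)
		return err;		/* memory allocation failed */
	if (target_pid < 0 || target_pid >= TOY_PID_MAX)
		return -EINVAL;		/* invalid pid number (assumed code) */
	if (target_pid > 0) {
		if (map->page[target_pid])
			return -EBUSY;	/* requested pid is in use */
		map->page[target_pid] = 1;
		return target_pid;
	}
	for (int pid = 1; pid < TOY_PID_MAX; pid++) {
		if (!map->page[pid]) {
			map->page[pid] = 1;
			return pid;
		}
	}
	return -EAGAIN;			/* all pids in use (assumed code) */
}
```

A caller can then propagate the return value unchanged instead of substituting a fixed error.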
Signed-off-by: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue at us.ibm.com>
Reviewed-by: Oren Laadan <orenl at cs.columbia.edu>
commit a903c365952e582aee4da1fe827168afccf17eaf
Author: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Date: Tue Jul 14 16:21:52 2009 -0400
pids 1/7: Factor out code to allocate pidmap page
To support the clone_with_pids() system call, we need to allocate
a pidmap page in more than one place. Move this code to a new
function, alloc_pidmap_page().
Changelog[v2]:
- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
-ENOMEM on error instead of -1.
Signed-off-by: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue at us.ibm.com>
Reviewed-by: Oren Laadan <orenl at cs.columbia.edu>
commit 588dced6f5300597456015fe6a72b704e26428b9
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 16:21:51 2009 -0400
c/r: make file_pos_read/write() public
These two are used in the next patch when calling vfs_read/write()
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
commit 76ae59a36e3bda55b8a833e52d566ed5fe1be44d
Author: Dave Hansen <dave at linux.vnet.ibm.com>
Date: Tue Jul 14 16:21:50 2009 -0400
Namespaces submenu
Let's not steal too much space in the 'General Setup' menu.
Take a cue from the cgroups code and create a submenu.
This can go upstream now.
Signed-off-by: Dave Hansen <dave at linux.vnet.ibm.com>
Acked-by: Oren Laadan <orenl at cs.columbia.edu>
commit b1c27a8cf51088f0005c20b7e666ca614d2f59d2
Author: Oren Laadan <orenl at cs.columbia.edu>
Date: Tue Jul 14 16:21:46 2009 -0400
cgroup freezer: interface to freeze a cgroup from within the kernel
Add public interface to freeze a cgroup freezer given a task that
belongs to that cgroup: cgroup_freezer_make_frozen(task)
Freezing the root cgroup is not permitted. Freezing the cgroup to
which the current process belongs is also not permitted.
This will be used for restart(2) to be able to leave the restarted
processes in a frozen state, instead of resuming execution.
This is useful for debugging, if the user would like to attach a
debugger to the restarted task(s).
It is also useful if the restart procedure would like to perform
additional setup once the tasks are restored but before they are
allowed to resume execution.
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
CC: Matt Helsley <matthltc at us.ibm.com>
Cc: Paul Menage <menage at google.com>
Cc: Li Zefan <lizf at cn.fujitsu.com>
Cc: Cedric Le Goater <legoater at free.fr>
commit b39db090fdac3e0cbebac6fee952ad0a3c1d079d
Author: Matt Helsley <matthltc at us.ibm.com>
Date: Tue Jul 14 15:04:51 2009 -0400
cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint
The CHECKPOINTING state prevents userspace from unfreezing tasks until
sys_checkpoint() is finished. When doing container checkpoint userspace
will do:
echo FROZEN > /cgroups/my_container/freezer.state
...
rc = sys_checkpoint( <pid of container root> );
To ensure a consistent checkpoint image userspace should not be allowed
to thaw the cgroup (echo THAWED > /cgroups/my_container/freezer.state)
during checkpoint.
"CHECKPOINTING" can only be set on a "FROZEN" cgroup using the checkpoint
system call. Once in the "CHECKPOINTING" state, the cgroup may not leave until
the checkpoint system call is finished and ready to return. Then the
freezer state returns to "FROZEN". Writing any new state to freezer.state while
checkpointing will return EBUSY. These semantics ensure that userspace cannot
unfreeze the cgroup midway through the checkpoint system call.
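The transitions described above can be sketched as a small state machine. This is a simplified userspace model, not the freezer code: the toy_* names are hypothetical, the real freezer's intermediate states are left out, and the -EINVAL returns are assumptions (the description only specifies EBUSY while checkpointing).

```c
#include <assert.h>
#include <errno.h>

enum toy_freezer_state { TOY_THAWED, TOY_FROZEN, TOY_CHECKPOINTING };

struct toy_freezer {
	enum toy_freezer_state state;
};

/* Models a userspace write of THAWED or FROZEN to freezer.state. */
int toy_freezer_write(struct toy_freezer *f, enum toy_freezer_state req)
{
	if (req == TOY_CHECKPOINTING)
		return -EINVAL;	/* assumption: not settable from userspace */
	if (f->state == TOY_CHECKPOINTING)
		return -EBUSY;	/* no state change during a checkpoint */
	f->state = req;
	return 0;
}

/* Entry into the checkpoint system call: FROZEN -> CHECKPOINTING. */
int toy_begin_checkpoint(struct toy_freezer *f)
{
	if (f->state == TOY_CHECKPOINTING)
		return -EBUSY;	/* another checkpoint is in progress */
	if (f->state != TOY_FROZEN)
		return -EINVAL;	/* assumption: cgroup must be frozen first */
	f->state = TOY_CHECKPOINTING;
	return 0;
}

/* Exit from the checkpoint system call: CHECKPOINTING -> FROZEN. */
void toy_end_checkpoint(struct toy_freezer *f)
{
	assert(f->state == TOY_CHECKPOINTING);
	f->state = TOY_FROZEN;
}
```

In this model a thaw attempted between toy_begin_checkpoint() and toy_end_checkpoint() fails with -EBUSY, and it succeeds again once the state is back to FROZEN.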
The cgroup_freezer_begin_checkpoint() and cgroup_freezer_end_checkpoint() functions
make relatively few assumptions about the task that is passed in. However, the
way they are called in do_checkpoint() assumes that the root of the container
is in the same freezer cgroup as all the other tasks that will be
checkpointed.
Notes:
As a side effect, this prevents multiple tasks from entering the
CHECKPOINTING state simultaneously. All but one will get -EBUSY.
Signed-off-by: Oren Laadan <orenl at cs.columbia.edu>
Signed-off-by: Matt Helsley <matthltc at us.ibm.com>
Cc: Paul Menage <menage at google.com>
Cc: Li Zefan <lizf at cn.fujitsu.com>
Cc: Cedric Le Goater <legoater at free.fr>
-----------------------------------------------------------------------
hooks/post-receive
--
linux-cr
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers