[Devel] Re: [PATCH 10/30] cr: core stuff
Oren Laadan
orenl at cs.columbia.edu
Tue Apr 14 11:19:21 PDT 2009
Alexey Dobriyan wrote:
> On Tue, Apr 14, 2009 at 01:22:03AM -0400, Oren Laadan wrote:
>> Alexey Dobriyan wrote:
>>> * add struct file_operations::checkpoint
>>>
>>> The point of hook is to serialize enough information to allow restoration
>>> of an opened file.
>>>
>>> The idea (good one!) is that the code which supplies struct file_operations
>>> know better what to do with file.
>> Actually, credit is due to Dave Hansen (or Christoph Hellwig, or both?).
>>
>>> Hook gets C/R context (a cookie more or less) on which dump code can
>>> cr_write() and small restrictions on what to write: globally unique object id
>>> and correct object length to allow jumping through objects.
>>>
>>> For usual files on on-disk filesystem add generic_file_checkpoint()
>>>
>>> Add ext3 opened regular files and directories for start.
>>>
>>> No ->checkpoint, checkpointing is aborted -- deny by default.
>>>
>>> FIXME: unlinked, but opened files aren't supported yet.
>>>
>>> * C/R image design
>>>
>>> The thing should be flexible -- kernel internals changes every day, so we can't
>>> really afford a format with much enforced structure.
>>>
>>> Image consists of header, object images and terminator.
>>>
>>> Image header consists of immutable part and mutable part (for future).
>>>
>>> Immutable header part is magic and image version: "LinuxC/R" + __le32
>>>
>>> Image version determines everything including image header's mutable part.
>>> Image version is going to be bumped at earliest opportunity following changes
>>> in kernel internals.
>>>
>>> So far image header mutable part consists of arch of the kernel which dumped
>>> the image (i386, x86_64, ...) and kernel version as found in utsname.
>>>
>>> Kernel version as string is for distributions. Distro can support C/R for
>>> their own kernels, but can't realistically be expected to bump image version --
>>> this will conflict with mainline kernels having used same version. We also don't
>>> want requests for private parts of image version space.
>> So far so good, like in our patch-set.
>>
>> You also need to address differences in configuration (kernel could
>> have been recompiled) and runtime environment (boot params, etc).
>>
>> We deferred this issue to a later time.
>>
>>> Distro expected to keep image version alone and on restart(2) check utsname
>>> version and compare it against previously release kernel versions and based
>>> on that turn on compatibility code.
>> Are you suggesting that conversion of a checkpoint image from an older
>> version to a newer version be done in the kernel ?
>
> For mainline kernel it's completely unrealistic to support all backwards
> compatibility code for previous versions. Some mythical userspace
> program will convert images.
>
> But it's completely realistic and much easier for distro kernel because
> distro kernel doesn't generally include patches with significant in-kernel
> internals changes, so they simply can support
> '2.6.26-1-amd64' => '2.6.26-2-amd64' situation.
>
> Distros can write conversion program too, but I don't expect they will.
>
>> It may work for a few versions, and then you'll get a spaghetti of
>> #ifdef's in the code, together with a plethora of legacy code.
>
> Expectation is for one kernel branch like RHEL5 kernel updates during
> RHEL5 lifecycle.
>
> For RHEL5 => RHEL6, it's up to them what to do.
>
> Anyway distro can add compat code _anyway_, for this we help them with
> this image format tweak, so they won't bug mainline with "reserve bit 31
> for Red Hat".
>
> Image version is kept small (__le32) for this reason too :-)
>
>> It is much better/easier to handle checkpoint image transformations
>> in user space. The kernel will only understand its "current" version
>> (for some definition of version).
>>
>>> Object image is very flexible, the only required parts are a) object type (u32)
>>> and b) object total length (u32, [knocks wood]) which must be at the beginning
>>> of an image. The rest is not generic C/R code problem.
>>>
>>> Object images follow one another without holes. Holes are in theory possible but
>>> unneeded.
>>>
>> When would you need holes ?
>>
>>> Image ends with terminator object. This is mostly to be sure, that, yes, image
>>> wasn't truncated for some reason.
>>>
>>>
>>> * Objects subject to C/R
>>>
>>> The idea is to not be very smart but directly dump core kernel data structures
>>> related to processes. This includes in this patch:
>>>
>>> struct task_struct
>>> struct mm_struct
>>> VMAs
>>> dirty pages
>>> struct file
>>>
>>> Relations between objects (task_struct has pointer to mm_struct) are fullfilled
>>> by dumping pointed to object first, keeping it's position in dumpfile and saving
>>> position in a image of pointe? object:
>> Unless you use the physical position to actually lseek to there to
>> re-read the data, there is no reason to use the actual position. In
>> fact it is easier to debug when the shared object identifier is a
>> simple counter.
>>
>> If you do use it to lseek, then it's a poor decision -- sounds fragile:
>> what if we change the file (legitimately) adding data in the middle -
>> the whole concept breaks.
>
> Adder of data is expected to understand image format and update all references
> just like surgeon is expected to understand human anatomy.
>
>>> struct cr_image_task_struct {
>>> cr_pos_t cr_pos_mm;
>>> ...
>>> };
>>>
>>> Code so far tries hard to dump objects in certain order so there won't be any loops.
>>> This property of process that dumpfile can in theory be O_APPEND, will likely be
>>> sacrifised (read: child can ptrace parent)
>> The ability to streamline the checkpoint image IMHO is invaluable.
>> It's the unix way (TM) of doing things; it makes the process pipe-able.
>>
>> You can do many nice things when the checkpoint can be streamed: you
>> can compress, sign, encrypt etc on the fly without taking additional
>> diskspace. You can transfer over the network (e.g. for migration),
>> or store remotely without explicit file system support. You can easily
>> transform the stream from one c/r version to another etc.
>>
>> This should be a design principle. In my experience I never hit a wall
>> that forced me to "sacrifice" this decision.
>>
>>> sacrifised (read: child can ptrace parent)
>> Hmmm... if all tasks are created in user space, then this specific
>> becomes a no-brainer !
>
> No!
Actually yes :)
>
> A ptraces B. Container is checkpointed.
>
> Kernel realizes ptrace is going on. A and B in theory can have any
> realitionship.
>
> Consequently, kernel doesn't know in which order to dump A and B.
>
> And there is no such order:
> *) A can be parent of B (you dump A, B),
> *) A can be child of B (you want to dump B, A, but this conflicts with
> ->real_parent order)
> *) A and B just tasks (any order).
Current code does not support ptrace() - which has a multitude
if tidy-bits issues to solve during restart regardless.
However, creating tasks in userspace uses (and will uses) only
"real" process relationships, not ptrace-relationships, when it
comes to decide on the fork/clone order.
Technically, that can be done in checkpoint (dumping the task tree)
or in restart-user-space (rearranging the data before fork/clone).
>
> I'm showing that whole issue can be avoided:
If the issue can be avoided, then why would you need to sacrifice
the stream-ability of the checkpoint image ?
> *) all tasks are simply created regardless of who is parent of whom
> (see kernel_thread())
> *) Every task_struct image among other things contains references to
> ->real_parent and ->parent.
> *) After every task is created it's time to change references:
> **) lookup who is ->real_parent, change ->real_parent _by hand_
> not with some "correct clone(2)" order.
> **) lookup who is ->parent, change ->parent.
>
> You're probably escaping all of this with object numbers?
(Will be) escaping this by arranging to fork/clone in the proper order.
>
>>> * add struct vm_operations_struct::checkpoint
>>>
>>> just like with files, code that creates special VMAs should know what to do with them
>>> used.
>>>
>>> just like with files, deny checkpointing by default
>>>
>>> So far used to install vDSO to same place.
>> VDSO can be a troublemaker; in recent kernels its location in the MM
>> can be randomized.
>
> See arch_setup_additional_pages() patch.
>
>> It is not necessarily immutable - it can reflect
>> ynamic kernel data. It may contain different code on newer versions,
>> so must be compared or worked around during restart etc.
>
> i386 if I'm not mistaken only contain syscall entry code, but, yes,
> generally one should check if PC is inside such page.
If you up restart on a different kernel that has a different VDSO,
then you need to bring the old VDSO with you, and tweak it so it
pulls the dynamic kernel data from the right place. Ugh ... :(
Oren.
>
>>> * add checkpoint(2)
>>>
>>> Done by determining which tasks are subject to checkpointing, freezeing them,
>>> collecting pointers to necessary kernel internals (task_struct, mm_struct, ...),
>>> doing that checking supported/unsupported status and aborting if necessary,
>>> actual dumping, unfreezeing/killing set of tasks.
>>>
>>> Also in-checkpoint refcount is maintained to abort on possible invisible changes.
>>> Now it works:
>>>
>>> For every collected object (mm_struct) keep numbers of references from
>>> other collected objects. It should match object's own refcount.
>>> If there is a mismatch, something is likely pinning object, which means
>>> there is "leak" to outside which means checkpoint(2) can't realistically and
>>> without consequences proceed.
>>>
>>> This is in some sense independent check. It's designed to protect from internals
>>> change when C/R code was forgotten to be updated.
>>>
>>> Userpsace supplies pid of root task and opened file descriptor of future dump file.
>>> Kernel reports 0/-E as usual.
>>>
>>> Runtime tracking of "checkpointable" property is explicitly not done.
>>> This introduces overhead even if checkpoint(2) is not done as shown by proponents.
>>> Instead any check is done at checkpoint(2) time and -E is returned if something is
>>> suspicious or known to be unsupported.
>>>
>>> FIXME: more checks especially in cr_check_task_struct().
>>>
>>> * add restart(2)
>>>
>>> Recreate tasks and evething dumped by checkpoint(2) as if nothing happened.
>>>
>>> The focus is on correct recreating, checking every possibility that target kernel
>>> can be on different arch (i386 => x86_64) and target kernel can be very different
>>> from source kernel by mistake (i386 => x86_64 COMPAT=n) kernel.
>>>
>>> restart(2) is done first by creating kernel thread and that demoting it to usual
>>> process by adding mm_struct, VMAs, et al. This saves time against method when
>>> userspace does fork(2)+restart(2) -- forked mm_struct will be thrown out anyway
>>> or at least everything will be unmapped in any case.
>> Do have figures to support your claims about "saves time" ?
>>
>> The *largest* component of the restart time, as you probably know,
>> is the time it takes to restore the memory address space (pages, pages)
>> of the tasks.
>>
>> If you do show that this optimization is worth our attention, then it
>> takes < 10 lines to change current mktree.c to use CLONE_VM ... voila.
>>
>> I'm interested in hearing more convincing arguments in favor of kernel
>> creations of restarting tasks (see my other post about it).
>
> OK, in another post.
>
>>> Restoration is done in current context except CPU registers at last stage.
>>> This is because "creation is done by current" is in many, many places,
>>> e.g. mmap(2) code.
>>>
>>> It's expected that filesystem state will be the same. Kernel can't do anything
>>> about it expect probably virtual filesystems. If a file is not there anymore,
>>> it's not kernel fault, -E will be returned, restart aborted.
>>>
>>> FIXME: errors aren't propagated correctly out of kernel thread context
>> Heh .. I guess they always propagate correctly out of regular task
>> context ;)
>
> :-)
>
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
More information about the Devel
mailing list