[Devel] Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart

Matthieu Fertré matthieu.fertre at kerlabs.com
Mon May 4 02:20:37 PDT 2009

Oren Laadan wrote:
> Matthieu Fertré wrote:
>> Hi,
>> Louis Rilling wrote:
>>> On 29/04/09 18:47 -0400, Oren Laadan wrote:
>>>> Hi Louis,
>>>> Louis Rilling wrote:
>>>>> Hi,
>>>>> On 28/04/09 19:23 -0400, Oren Laadan wrote:
>>>>>> Here is the latest and greatest of the checkpoint/restart (c/r)
>>>>>> patchset. The logic and image format were reworked and simplified,
>>>>>> the code was refactored, and support was added for PPC, s390,
>>>>>> sysvipc, shared memory of all sorts, and namespaces (uts and ipc).
>>>>> I should have asked before, but what are the reasons to checkpoint SYSV IPCs
>>>>> in the same file/stream as tasks? Would it be better to checkpoint them
>>>>> independently, like the file system state?
>>>>> In Kerrighed we chose to checkpoint SYSV IPCs independently, a bit like the
>>>>> file system state, because the lifetime of SYSV IPC objects does not depend
>>>>> on the lifetime of tasks, and we gain more flexibility this way. In
>>>>> particular we envision cases in which two applications share state in a
>>>>> SYSV SHM (something like a producer-consumer scheme), but do not need to be
>>>>> checkpointed together. In such a case the SYSV SHM itself could even need
>>>>> stronger high availability (using active replication) than a
>>>>> checkpoint/restart facility provides.
>>>> Thanks for the feedback, this is actually an interesting idea.
>>>> Indeed in the past I also considered SYSV IPC to be a "global" resource
>>>> that was checkpointed before iterating through the tasks.
>>>> However, in the presence of namespaces, the lifetime of an IPC namespace
>>>> does depend on the lifetime of tasks: when the last task referring to a
>>>> given namespace exits, that namespace is destroyed. Of course, the
>>>> root namespace is truly global, because init(1) never exits.
>>>> What would 'checkpoint them independently' mean in this case ?
>>> I mean that the producer and the consumer could have separate checkpointing
>>> policies (if any), and the IPC SHM as well.
>>>> In your use-case, can you restart either application without first
>>>> restoring the relevant SYSVIPC ?
>>> Probably not.
>> Well, it depends. It makes no sense to restart the application without
>> restoring the relevant SHM, but it may make sense for a message queue
>> (this is application specific, of course). A message queue is not tied
>> to the process; it can disappear during the life of the application.
> Agreed - the concern regards mainly the SHM case.
>>>> Can you think of other use-cases for such a division ?  Am I right to
>>>> guess that your use case is specific to the distributed (and SSI-)
>>>> nature of your system ?  (Active-replication of SYSV_SHM sounds
>>>> awfully related to DSM :)
>>> The case of active-replication may be specific to DSM-based systems, but the
>>> case of independent policies is already interesting in standalone boxes.
>>>> While not focusing on such use cases, I want to keep the design flexible
>>>> enough to not exclude them a priori, and be able to address them later
>>>> on. Indeed, the code is split such that the function to save a given
>>>> IPC namespace does not depend on the task that uses it. Future code
>>>> could easily use the same functionality.
>>>> One way to be flexible enough to support your use case is to have
>>>> some mechanism in place to select whether a given resource (virtually
>>>> any) is to be checkpointed/restored.
>>>> For example, you could imagine checkpoint(..., CHECKPOINT_SYSVIPC)
>>>> to checkpoint (also) IPC, and not checkpoint IPC in its absence.
>>>> So normally you'd have checkpoint(..., CHECKPOINT_ALL). When you don't
>>>> want IPC, you'd use CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC. When you
>>>> want only IPC, you'd use CHECKPOINT_SYSVIPC only.
>>>> Same thing for restart, only that it will get trickier in the "only IPC"
>>>> case, since you will need to tell which IPC namespace is affected.
>>>> Also, I envision a task saying cradvise(CHECKPOINT_SYSVIPC, false),
>>>> telling the kernel to not c/r its IPC namespace. (Or any other
>>>> resource). Again there would need to be a way to add a restored
>>>> namespace.
>>>> Does this address your concerns ?
>>> Yes this sounds flexible enough. Thanks for taking this into account.
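
To make the flag idea above concrete, here is a minimal sketch of how such
a selection mask could be tested. Only CHECKPOINT_SYSVIPC and CHECKPOINT_ALL
are named in the thread; the other bits, all numeric values, and the helper
ckpt_wants() are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical resource bits. Only CHECKPOINT_SYSVIPC and CHECKPOINT_ALL
 * come from the discussion; the other names and all values are made up. */
#define CHECKPOINT_TASKS   (1u << 0)   /* hypothetical */
#define CHECKPOINT_FILES   (1u << 1)   /* hypothetical */
#define CHECKPOINT_SYSVIPC (1u << 2)
#define CHECKPOINT_ALL \
    (CHECKPOINT_TASKS | CHECKPOINT_FILES | CHECKPOINT_SYSVIPC)

/* Return true if every bit of `resource` is set in the caller's `flags`,
 * i.e. the caller asked for that resource class to be saved. */
static bool ckpt_wants(unsigned int flags, unsigned int resource)
{
    assert(resource != 0);   /* an empty resource mask is a caller bug */
    return (flags & resource) == resource;
}
```

With this, CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC selects everything except
IPC, and CHECKPOINT_SYSVIPC alone selects only IPC, matching the three
cases described above.
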
>> I see one drawback with this approach if you allow checkpointing an
>> application that is not isolated in a container. In that case, you may
>> want to select which IPC objects to dump, so as not to dump all the IPC
>> objects living in the system. Indeed, this is why we have chosen in
>> Kerrighed to checkpoint IPC objects independently of tasks, since we
>> currently have no container/namespace support.
> I assume that in this case it will be the application itself that
> will somehow tell the system which specific sysvipc objects (ids) it
> cares about.

Sure, the system cannot know that on its own.

> (I'm not sure how the system would otherwise know what to dump and
> what to leave out).
> I originally proposed the construct of cradvise() syscall to handle
> exactly those cases where the application would like to advise the
> kernel about certain resources. So, extending the previous example,
> a task may call something like:
>    cradvise(CHECKPOINT_SYSVIPC_SHM, false);  /* generally skip shm */
>    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, true);  /* but include this */
> or:
>    cradvise(CHECKPOINT_SYSVIPC_SHM, true);  /* generally include shm */
>    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, false);  /* but skip this */
> Anyway, these are just examples of the concept and what sort of generic
> interface can be used to implement it; don't pick on the details...

Ok, seems good :)
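
For the record, the "class-wide default plus per-id exception" semantics of
the cradvise() examples could be modelled roughly as below. This is only a
user-space sketch of the bookkeeping, not a proposed kernel interface:
struct shm_policy, the policy_*() helpers and MAX_OVERRIDES are all
hypothetical names introduced for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OVERRIDES 16   /* arbitrary limit for this sketch */

/* Hypothetical per-task record of the advice given for SYSV SHM. */
struct shm_policy {
    bool default_include;                 /* class-wide advice */
    size_t n;                             /* number of overrides in use */
    struct { int id; bool include; } overrides[MAX_OVERRIDES];
};

/* Models cradvise(CHECKPOINT_SYSVIPC_SHM, include). */
static void policy_set_default(struct shm_policy *p, bool include)
{
    assert(p != NULL);
    p->default_include = include;
}

/* Models cradvise(CHECKPOINT_SYSVIPC_SHMID, id, include).
 * Returns false if the override table is full. */
static bool policy_set_id(struct shm_policy *p, int id, bool include)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->n; i++) {
        if (p->overrides[i].id == id) {   /* update an existing entry */
            p->overrides[i].include = include;
            return true;
        }
    }
    if (p->n == MAX_OVERRIDES)
        return false;
    p->overrides[p->n].id = id;
    p->overrides[p->n].include = include;
    p->n++;
    return true;
}

/* Decide whether segment `id` is dumped: a per-id override wins,
 * otherwise the class-wide default applies. */
static bool policy_should_dump(const struct shm_policy *p, int id)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->n; i++)
        if (p->overrides[i].id == id)
            return p->overrides[i].include;
    return p->default_include;
}
```

Starting from a zeroed struct shm_policy, setting the default to false and
then including one id gives the "generally skip shm, but include this"
behaviour; the opposite pair of calls gives the second example.
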


