Oren Laadan orenl at cs.columbia.edu
Mon May 4 02:06:35 PDT 2009

Matthieu Fertré wrote:
> Hi,
> Louis Rilling wrote:
>> On 29/04/09 18:47 -0400, Oren Laadan wrote:
>>> Hi Louis,
>>> Louis Rilling wrote:
>>>> Hi,
>>>> On 28/04/09 19:23 -0400, Oren Laadan wrote:
>>>>> Here is the latest and greatest of the checkpoint/restart (c/r) patchset.
>>>>> The logic and image format have been reworked and simplified, the code
>>>>> refactored, and support added for PPC, s390, sysvipc, shared memory of
>>>>> all sorts, and namespaces (uts and ipc).
>>>> I should have asked before, but what are the reasons to checkpoint SYSV IPCs
>>>> in the same file/stream as tasks? Would it be better to checkpoint them
>>>> independently, like the file system state?
>>>> In Kerrighed we chose to checkpoint SYSV IPCs independently, a bit like the file
>>>> system state, because the lifetime of SYSV IPC objects does not depend on the
>>>> lifetime of tasks, and we can gain more flexibility this way. In particular we
>>>> envision cases in which two applications share state in a SYSV SHM (something
>>>> like a producer-consumer scheme), but do not need to be checkpointed together.
>>>> In such a case the SYSV SHM itself might even need stronger high availability
>>>> (using active replication) than a checkpoint/restart facility can provide.
>>> Thanks for the feedback, this is actually an interesting idea.
>>> Indeed in the past I also considered SYSV IPC to be a "global" resource
>>> that was checkpointed before iterating through the tasks.
>>> However, in the presence of namespaces, the lifetime of an IPC namespace
>>> does depend on the lifetime of tasks: when the last task referring to a
>>> given namespace exits, that namespace is destroyed. Of course, the
>>> root namespace is truly global, because init(1) never exits.
>>> What would 'checkpoint them independently' mean in this case?
>> I mean that the producer and the consumer could have separate checkpointing
>> policies (if any), and the IPC SHM as well.
>>> In your use-case, can you restart either application without first
>>> restoring the relevant SYSVIPC ?
>> Probably not.
> Well, it depends. It makes no sense to restart the application without
> restoring the relevant SHM, but it may make sense for a message queue (this
> is application specific of course). A message queue is not tied to the
> process; it can disappear during the lifetime of the application.

Agreed; the concern is mainly about the SHM case.

>>> Can you think of other use cases for such a division? Am I right to
>>> guess that your use case is specific to the distributed (and SSI)
>>> nature of your system? (Active replication of SYSV_SHM sounds
>>> awfully related to DSM :)
>> The case of active-replication may be specific to DSM-based systems, but the
>> case of independent policies is already interesting in standalone boxes.
>>> While not focusing on such use cases, I want to keep the design flexible
>>> enough not to exclude them a priori, and to be able to address them later
>>> on. Indeed, the code is split such that the function to save a given
>>> IPC namespace does not depend on the task that uses it. Future code
>>> could easily reuse the same functionality.
>>> One way to be flexible enough to support your use case is to have some
>>> mechanism in place to select whether a resource (virtually any) is
>>> to be checkpointed/restored.
>>> For example, you could imagine checkpoint(..., CHECKPOINT_SYSVIPC)
>>> to checkpoint (also) IPC, and not checkpoint IPC in its absence.
>>> So normally you'd have checkpoint(..., CHECKPOINT_ALL). When you don't
>>> want IPC, you'd use CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC. When you
>>> want only IPC, you'd use CHECKPOINT_SYSVIPC only.
>>> Same thing for restart, only that it will get trickier in the "only IPC"
>>> case, since you will need to tell which IPC namespace is affected.
>>> Also, I envision a task saying cradvise(CHECKPOINT_SYSVIPC, false),
>>> telling the kernel to not c/r its IPC namespace. (Or any other
>>> resource). Again there would need to be a way to add a restored
>>> namespace.
>>> Does this address your concerns?
>> Yes this sounds flexible enough. Thanks for taking this into account.
> I see one drawback with this approach if you allow checkpointing an
> application that is not isolated in a container. In that case, you may
> want to select which IPC objects to dump, rather than dumping all the IPC
> objects living in the system. Indeed, this is why we chose in Kerrighed to
> checkpoint IPC objects independently of tasks, since we currently have no
> container/namespace support.
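The flag-based selection proposed in the thread (CHECKPOINT_ALL, CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC, or CHECKPOINT_SYSVIPC alone) is plain bitmask arithmetic. A minimal sketch, assuming hypothetical bit values (the thread only names CHECKPOINT_SYSVIPC and CHECKPOINT_ALL), hypothetical placeholder classes CHECKPOINT_TASKS and CHECKPOINT_FILES, and a hypothetical helper ckpt_wants(); none of these are taken from the actual patchset:

```c
#include <stdbool.h>

/* Hypothetical flag values: the bit assignments are illustrative only.
 * CHECKPOINT_TASKS and CHECKPOINT_FILES are invented placeholders for
 * "everything else" the checkpoint would normally save. */
enum ckpt_flags {
	CHECKPOINT_TASKS   = 1u << 0,  /* hypothetical: task state */
	CHECKPOINT_FILES   = 1u << 1,  /* hypothetical: open files */
	CHECKPOINT_SYSVIPC = 1u << 2,  /* SYSV IPC namespace contents */
	CHECKPOINT_ALL     = CHECKPOINT_TASKS | CHECKPOINT_FILES |
	                     CHECKPOINT_SYSVIPC,
};

/* Hypothetical helper: true if the caller-supplied mask 'flags'
 * requests the resource class 'what'. */
static bool ckpt_wants(unsigned int flags, unsigned int what)
{
	return (flags & what) != 0;
}
```

For example, ckpt_wants(CHECKPOINT_ALL & ~CHECKPOINT_SYSVIPC, CHECKPOINT_SYSVIPC) is false, so the IPC namespace would be skipped while task state is still saved; with CHECKPOINT_SYSVIPC alone, only the IPC class is selected.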

I assume that in this case it will be the application itself that
will somehow tell the system which specific sysvipc objects (ids) it
cares about.

(I'm not sure how the system would otherwise know what to dump and
what to leave out.)

I originally proposed the cradvise() syscall to handle
exactly those cases where the application wants to advise the
kernel about certain resources. So, extending the previous example,
a task may call something like:

   cradvise(CHECKPOINT_SYSVIPC_SHM, false);  /* generally skip shm */
   cradvise(CHECKPOINT_SYSVIPC_SHMID, id, true);  /* but include this */

   cradvise(CHECKPOINT_SYSVIPC_SHM, true);  /* generally include shm */
   cradvise(CHECKPOINT_SYSVIPC_SHMID, id, false);  /* but skip this */

Anyway, these are just examples of the concept and what sort of generic
interface can be used to implement it; don't pick on the details...
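In the same spirit, the two cradvise() patterns shown earlier reduce to a class-wide default plus per-id exceptions, where an exception wins over the default. A minimal user-space sketch of how such advice might be recorded and consulted, assuming a small fixed-size exception table; cradvise() itself is only a proposal, and struct shm_advice, advise_shm_default(), advise_shm_id() and shm_should_checkpoint() are hypothetical names, not part of the c/r patchset:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SHM_OVERRIDES 16  /* arbitrary bound chosen for the sketch */

/* Hypothetical per-task advice state: a default for the whole SHM class
 * plus a table of per-shmid exceptions. */
struct shm_advice {
	bool   default_include;             /* ~ cradvise(CHECKPOINT_SYSVIPC_SHM, b) */
	size_t n_overrides;
	int    ids[MAX_SHM_OVERRIDES];      /* ~ cradvise(CHECKPOINT_SYSVIPC_SHMID, id, b) */
	bool   include[MAX_SHM_OVERRIDES];
};

/* Set the class-wide default. */
static void advise_shm_default(struct shm_advice *a, bool include)
{
	assert(a != NULL);
	a->default_include = include;
}

/* Record or update a per-id exception; returns false if the table is full. */
static bool advise_shm_id(struct shm_advice *a, int id, bool include)
{
	assert(a != NULL);
	for (size_t i = 0; i < a->n_overrides; i++) {
		if (a->ids[i] == id) {
			a->include[i] = include;
			return true;
		}
	}
	if (a->n_overrides == MAX_SHM_OVERRIDES)
		return false;
	a->ids[a->n_overrides] = id;
	a->include[a->n_overrides] = include;
	a->n_overrides++;
	return true;
}

/* At checkpoint time: a per-id exception overrides the class default. */
static bool shm_should_checkpoint(const struct shm_advice *a, int id)
{
	assert(a != NULL);
	for (size_t i = 0; i < a->n_overrides; i++)
		if (a->ids[i] == id)
			return a->include[i];
	return a->default_include;
}
```

With this model, the first pair of calls in the example becomes a default of false with one exception set to true, and the second pair the reverse. A kernel implementation would presumably attach this state to the task or IPC namespace and use a proper lookup structure; the linear scan over a fixed array is only for brevity.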


