[Devel] Re: multi-threaded app fails to restart

John Paul Walters jpnwalters at gmail.com
Thu Jul 22 09:23:32 PDT 2010


Hi Oren,

Thanks for the patch.  For the --pidns case, that seems to have solved
the problem.  In the case of --no-pidns, restart still hangs as
described before.  Should this work in the --no-pidns case, or is
it expected to fail in this case?

JP

On Wed, Jul 21, 2010 at 9:04 PM, Oren Laadan <orenl at cs.columbia.edu> wrote:
> Hi John,
>
> This is a bit embarrassing, the behavior sounds too familiar --
> please try the following patch:
>
> --
> diff --git a/arch/x86/kernel/checkpoint.c b/arch/x86/kernel/checkpoint.c
> index 3fb9deb..b770f70 100644
> --- a/arch/x86/kernel/checkpoint.c
> +++ b/arch/x86/kernel/checkpoint.c
> @@ -104,7 +104,7 @@ int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
>        h->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
>        h->sizeof_tls_array = tls_size;
>        h->sysenter_return = (__u64) (unsigned long)
> -               task_thread_info(current)->sysenter_return;
> +               task_thread_info(t)->sysenter_return;
>
>        /* For simplicity dump the entire array */
>        memcpy(h + 1, t->thread.tls_array, tls_size);
> --
>
> On Wed, 21 Jul 2010, John Paul Walters wrote:
>
>> >>
>> >> Hi Oren,
>> >>
>> >> I'm still unable to fully restart the application with your patch, but
>> >> the result is now different.  If I attempt to restart using --pidns
>> >> and -F, both threads are created and frozen.  However, as soon as I
>> >> thaw them I get a segfault.  If I attempt to restart them without the
>> >> --pidns option, I get a message from restart indicating that it's
>> >> about to call sys_restart and restart hangs.  I also have the
>> >> following in my syslog:
>> >
>> > Hi John,
>> >
>> > I assume the log below is for the --no-pidns case, right?
>> > Can you also post the output of 'restart -vd ...' ?
>> > (Unfortunately I won't have a chance to try it until the weekend)
>> >
>>
>> Hi Oren,
>>
>> That's correct, the original log was for the --no-pidns case.  Below
>> I've included the restart log up to the point where it hangs at
>> sys_restart.  Thanks again for all of your help.
>>
>> best,
>> JP
>>
>> ./restart -v -d --no-pidns < checkpoint_out
>> <4124>number of tasks: 2
>> <4124>number of vpids: 0
>> <4124>total tasks (including ghosts): 3
>> <4124>pid 3583: thread tgid 3582
>> <4124>pid 3583: creator set to 3582
>> <4124>pid 1: propagate session 3582
>> <4124>pid 1: creator set to 3582
>> <4124>pid 1: set session
>> <4124>pid 1: moving up to 3582
>> <4124>====== TASKS
>> <4124>        [0] pid 3582 ppid 3349 sid 0 creator 0
>> <4124>        [1] pid 3583 ppid 3349 sid 0 creator 3582 prev 1 T
>> <4124>        [2] pid 1 ppid 3582 sid 3582 creator 3582 next 3583   S G
>> <4124>............
>> <4124>task[0].vidx = -1
>> <4124>task[1].vidx = -1
>> <4124>subtree (existing pidns)
>> <4124>forking child vpid 3582 flags 0x1
>> <4124>task 3582 forking with flags 11 numpids 1
>> <4124>task 3582 pid[0]=0
>> <4124>forked child vpid 4126 (asked 3582)
>> <4126>root task pid 4126
>> <4126>pid 3582: pid 4126 sid 3386 parent 4124
>> <4126>pid 3582: fork child 1 with session
>> <4126>forking child vpid 1 flags 0x12
>> <4126>task 1 forking with flags 11 numpids 1
>> <4126>task 1 pid[0]=0
>> <4126>forked child vpid 4127 (asked 1)
>> <4126>pid 3582: fork child 3583 without session
>> <4126>forking child vpid 3583 flags 0x4
>> <4126>task 3583 forking with flags 10911 numpids 1
>> <4126>task 3583 pid[0]=0
>> <4126>forked child vpid 4128 (asked 3583)
>> <4126>about to call sys_restart(), flags 0
>> <4125>====== PIDS ARRAY
>> <4125>[0] pid 3582 ppid 1 sid 1 pgid 3582
>> <4125>[1] pid 3583 ppid 1 sid 1 pgid 3582
>> <4125>............
>> <4125>c/r swap old 3582 new 4126
>> <4128>pid 3583: pid 4128 sid 3386 parent 4124
>> <4128>about to call sys_restart(), flags 0
>> <4125>c/r swap old 3583 new 4128
>> <4127>pid 1: pid 4127 sid 3386 parent 4126
>> <4125>c/r swap old 1 new 4127
>> <4125>====== PIDS ARRAY (swaped)
>> <4125>[0] pid 4126 ppid 1 sid 4127 pgid 4126
>> <4125>[1] pid 4128 ppid 1 sid 4127 pgid 4126
>> <4125>............
>> <4125>c/r read input 16384
>> <4127>about to call sys_restart(), flags 0x4
>> <4125>c/r read input 16384
>> <4125>c/r read input 16384
>> <4125>c/r read input 16384
>> <4125>c/r read input 16384
>>
>> > Thanks,
>> >
>> > Oren.
>> >
>> >>
>> >>
>> >> [ 1482.348060] [3753:3753:c/r:walk_task_subtree:633] total 2 ret 1
>> >> [ 1482.348060] [3753:3753:c/r:prepare_descendants:1148] nr 2/2
>> >> [ 1482.348060] [3753:3753:c/r:do_restore_coord:1320] restore prepare: 2
>> >> [ 1541.864073] [err -512][pos 419][E @ do_ghost_task:973]ghost restart failed
>> >> [ 1541.864343] [err -512][pos 419][E @ do_restore_task:1084]task restart failed
>> >> [ 1541.864346] [3755:3755:c/r:clear_task_ctx:852] task 3755 clear checkpoint_ctx
>> >> [ 1541.864349] [3755:3755:c/r:do_restart:1444] restart err -4, exiting
>> >> [ 1541.864352] [3755:3755:c/r:do_restart:1451] sys_restart returns -4
>> >> [ 1541.864366] [3757:3757:c/r:wait_checkpoint_ctx:938] wait_checkpoint_ctx: failed (-512)
>> >> [ 1541.864368] [3757:3757:c/r:do_restart:1444] restart err -4, exiting
>> >> [ 1541.864371] [3757:3757:c/r:do_restart:1451] sys_restart returns -4
>> >> [ 1541.864689] [3753:3753:c/r:wait_all_tasks_finish:1173] final sync kflags 0x1a (ret 0)
>> >> [ 1541.864692] [3753:3753:c/r:do_restore_coord:1325] restore finish: 0
>> >> [ 1541.864694] [3753:3753:c/r:do_restore_coord:1331] restore deferqueue: 0
>> >> [ 1541.864698] [err -512][pos 419][E @ ckpt_read_obj_type:426]Expecting to read type 9001
>> >> [ 1541.864700] [3753:3753:c/r:do_restore_coord:1336] restore tail: -512
>> >> [ 1541.864703] [err -512][pos 419][E @ do_restore_coord:1350]restart failed (coordinator)
>> >> [ 1541.864706] [3753:3753:c/r:walk_task_subtree:633] total 0 ret 0
>> >> [ 1541.864709] [3753:3753:c/r:clear_task_ctx:852] task 3753 clear checkpoint_ctx
>> >> [ 1541.864715] [3753:3753:c/r:do_restart:1451] sys_restart returns -4
>> >> [ 1541.864718] [3753:3753:c/r:restore_debug_free:144] 3 tasks registered, nr_tasks was 0 nr_total 1
>> >> [ 1541.864721] [3753:3753:c/r:restore_debug_free:147] active pid was 0, ctx->errno -512
>> >> [ 1541.864723] [3753:3753:c/r:restore_debug_free:149] kflags 26 uflags 0 oflags 1
>> >> [ 1541.864726] [3753:3753:c/r:restore_debug_free:151] task[0] to run 3755
>> >> [ 1541.864728] [3753:3753:c/r:restore_debug_free:151] task[1] to run 3757
>> >> [ 1541.864731] [3753:3753:c/r:restore_debug_free:176] pid 3753 type Coord state Failed
>> >> [ 1541.864735] [3753:3753:c/r:restore_debug_free:176] pid 3755 type Root state Failed
>> >> [ 1541.864737] [3753:3753:c/r:restore_debug_free:176] pid 3756 type Ghost state Failed
>> >>
>> >> thanks,
>> >> JP
>> >>
>> >> >
>> >> > ---
>> >> > diff --git a/kernel/checkpoint/sys.c b/kernel/checkpoint/sys.c
>> >> > index 171c867..3288af0 100644
>> >> > --- a/kernel/checkpoint/sys.c
>> >> > +++ b/kernel/checkpoint/sys.c
>> >> > @@ -605,13 +605,13 @@ int walk_task_subtree(struct task_struct *root,
>> >> >                        continue;
>> >> >                }
>> >> >
>> >> > +               /* if not last thread - proceed with thread */
>> >> > +               task = next_thread(task);
>> >> > +               if (!thread_group_leader(task))
>> >> > +                       continue;
>> >> > +
>> >> >                /* by definition, skip siblings of root */
>> >> >                while (task != root) {
>> >> > -                       /* if not last thread - proceed with thread */
>> >> > -                       task = next_thread(task);
>> >> > -                       if (!thread_group_leader(task))
>> >> > -                               break;
>> >> > -
>> >> >                        /* if has sibling - proceed with sibling */
>> >> >                        if (!list_is_last(&task->sibling, &parent->children)) {
>> >> >                                task = list_entry(task->sibling.next,
>> >> > ---
>> >>
>> >>
>>
>>