[CRIU] cgroups restore issue at AWS
Mark
fl0yd at me.com
Tue Aug 25 13:12:08 PDT 2015
I haven't tried that. I was trying to use the native features within CRIU to do this. I'm not particularly interested in another workaround since I already have one; I'm more interested in understanding the root cause.
Mark
> On Aug 25, 2015, at 10:42 AM, Hui Kang <hkang.sunysb at gmail.com> wrote:
>
> On Tue, Aug 25, 2015 at 11:40 AM, Mark <fl0yd at me.com> wrote:
>> I'm using Docker 1.9 experimental, the one that has your network changes in
>> it from Boucher's branch.
>>
>> I've tried CG_MODE_SOFT and CG_MODE_FULL, and in both cases the fclose(f)
>> still returns Invalid argument.
>
> I remember I used full mode and then manually ran "echo 0 >
> /sys/fs/cgroup/cpuset/docker/cpuset.cpus" and "echo 0 >
> /sys/fs/cgroup/cpuset/docker/cpuset.mems".
>
> Then the restore succeeds. Have you tried this?
>
> - Hui
>
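For reference, here is a minimal C sketch of the manual step Hui describes above: writing "0" into the docker cpuset parent's cpuset.cpus and cpuset.mems before retrying the restore. The two paths come from the quoted echo commands; the helper name and error handling are purely illustrative and not part of CRIU or Docker.

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Write a single value into a cgroup control file, the way
 * "echo 0 > /sys/fs/cgroup/cpuset/docker/cpuset.cpus" does. */
static int write_cgroup_file(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f) {
		fprintf(stderr, "open %s: %s\n", path, strerror(errno));
		return -1;
	}
	if (fprintf(f, "%s", value) < 0)
		ret = -1;
	/* The kernel only sees the data when the stdio buffer is flushed,
	 * so errors may be reported here rather than by fprintf(). */
	if (fclose(f) != 0)
		ret = -1;
	if (ret)
		fprintf(stderr, "write %s: %s\n", path, strerror(errno));
	return ret;
}

int main(void)
{
	if (write_cgroup_file("/sys/fs/cgroup/cpuset/docker/cpuset.cpus", "0"))
		return 1;
	if (write_cgroup_file("/sys/fs/cgroup/cpuset/docker/cpuset.mems", "0"))
		return 1;
	return 0;
}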
>>
>> Mark
>>
>> On Aug 25, 2015, at 10:36 AM, Hui Kang <hkang.sunysb at gmail.com> wrote:
>>
>> Hi, Mark
>> I think the failure is caused by criu restoring the root directory of
>> the cgroup.
>>
>> Which branch of docker did you use to restore the container? You probably
>> need to set manage-cgroup=full.
>>
>> - Hui
>>
>> On Tue, Aug 25, 2015 at 5:44 AM, Pavel Emelyanov <xemul at parallels.com> wrote:
>>
>> On 08/22/2015 01:39 AM, Mark wrote:
>>
>> Hi,
>>
>> We're seeing some issues doing docker-based restores on AWS machines. On
>> the first try the restore.log shows the following output:
>> (00.000551) cg: Preparing cgroups yard (cgroups restore mode 0x4)
>> (00.000607) cg: Opening .criu.cgyard.aRmYI0 as cg yard
>> (00.000617) cg: Making controller dir .criu.cgyard.aRmYI0/cpuset (cpuset)
>> (00.000691) cg: Created cgroup dir cpuset/system.slice/docker-f00d0fe34bcc352377f4750f99fc4a649bd14db65fc15639df35043c62f7733a.scope
>> (00.000733) Error (cgroup.c:978): cg: Failed closing cpuset/system.slice/docker-f00d0fe34bcc352377f4750f99fc4a649bd14db65fc15639df35043c62f7733a.scope/cpuset.cpus: Invalid argument
>> (00.000737) Error (cgroup.c:1083): cg: Restoring special cpuset props failed!
>>
>>
>> Failure to close the file is actually because fprintf fails.
>>
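To make Pavel's point concrete, here is a small illustrative C program (not CRIU code, and the path is just an example) showing why the failure only shows up at fclose(3): fprintf() merely fills the stdio buffer, the real write(2) to cpuset.cpus happens at flush/close time, and the kernel's EINVAL is therefore reported by fflush()/fclose().

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Example path only; any cpuset control file behaves the same way. */
	FILE *f = fopen("/sys/fs/cgroup/cpuset/example/cpuset.cpus", "w");

	if (!f)
		return 1;

	/* This "succeeds" even for an invalid value: the data just sits
	 * in the stdio buffer and has not reached the kernel yet. */
	if (fprintf(f, "%s", "bogus-value") < 0)
		fprintf(stderr, "fprintf: %s\n", strerror(errno));

	/* The buffered data is handed to write(2) here; if the kernel
	 * rejects it, fclose() returns EOF with errno set to EINVAL --
	 * the same "Failed closing ... Invalid argument" seen above. */
	if (fclose(f) != 0)
		fprintf(stderr, "fclose: %s\n", strerror(errno));

	return 0;
}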
>> On the 2nd try the restore works because it skips the attempt:
>>
>> (00.000785) cg: Preparing cgroups yard (cgroups restore mode 0x4)
>> (00.000840) cg: Opening .criu.cgyard.BQ2bKQ as cg yard
>> (00.000850) cg: Making controller dir .criu.cgyard.BQ2bKQ/cpuset (cpuset)
>> (00.000877) cg: Determined cgroup dir cpuset/system.slice/docker-404a13eab68e35753ee2c66f636aa727aa2c9a7723671d25cc9ffb0ede574178.scope already exist
>> (00.000880) cg: Skip restoring properties on cgroup dir cpuset/system.slice/docker-404a13eab68e35753ee2c66f636aa727aa2c9a7723671d25cc9ffb0ede574178.scope
>>
>>
>> Well, yes, this is because the directory was created on first restore.
>>
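A minimal sketch of the create-or-skip pattern Pavel is describing (names and structure are illustrative, not taken from cgroup.c): if the cgroup directory is created fresh, its properties get restored; if mkdir() reports EEXIST, the directory is treated as left over from an earlier restore and property restore is skipped, which is why the second attempt gets past the point where the first one failed.

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if the directory was freshly created, 0 if it already
 * existed, -1 on any other error. */
static int make_cgroup_dir(const char *path)
{
	if (mkdir(path, 0755) == 0) {
		printf("cg: Created cgroup dir %s\n", path);
		return 1;
	}
	if (errno == EEXIST) {
		printf("cg: Determined cgroup dir %s already exist\n", path);
		return 0;
	}
	perror("mkdir");
	return -1;
}

int main(void)
{
	/* Example path; the real one is the docker-<id>.scope dir from the log. */
	const char *dir = "/sys/fs/cgroup/cpuset/system.slice/example.scope";
	int created = make_cgroup_dir(dir);

	if (created < 0)
		return 1;
	if (created == 0) {
		printf("cg: Skip restoring properties on cgroup dir %s\n", dir);
		return 0;
	}
	/* Only a freshly created directory would get its cpuset props written. */
	return 0;
}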
>> It appears to be a timing issue on the fclose(f) call in cgroup.c. I've
>> tried using CG_MODE_SOFT and CG_MODE_FULL and neither has an effect: the
>> 1st attempt fails and the 2nd succeeds.
>>
>> To work around the issue, we've created a fork with the changes below, and
>> the issue hasn't recurred. In fact there hasn't been a single "Failed to
>> flush..." message printed in the logs, so it seems to be a matter of
>> split-second timing: the for loop simply allows enough time for the handle
>> to flush.
>>
>> diff --git a/cgroup.c b/cgroup.c
>> index a4e0146..9495206 100644
>> --- a/cgroup.c
>> +++ b/cgroup.c
>> @@ -950,6 +950,8 @@ static int restore_cgroup_prop(const CgroupPropEntry * cg_prop_entry_p,
>>  {
>>  	FILE *f;
>>  	int cg;
>> +	int flushcounter = 0;
>> +	int maxtries = 500;
>>
>>  	if (!cg_prop_entry_p->value) {
>>  		pr_err("cg_prop_entry->value was empty when should have had a value");
>> @@ -974,9 +976,26 @@ static int restore_cgroup_prop(const CgroupPropEntry * cg_prop_entry_p,
>>  		return -1;
>>  	}
>>
>> +	/* The fclose() below was failing intermittently with EINVAL at AWS */
>> +	/* So we try fflush() in a loop until it succeeds or we've */
>> +	/* tried it a bunch. */
>> +	for (;;) {
>> +		flushcounter++;
>> +		if (fflush(f) == 0) {
>> +			break;
>> +		}
>> +		if (flushcounter > maxtries) {
>> +			pr_perror("Max fflush() tries %d exceeded. Moving along anyway.\n", maxtries);
>> +			break;
>> +		}
>> +		if (fflush(f) != 0) {
>> +			pr_perror("Failed to flush %s [%d/%d]\n", path, flushcounter, maxtries);
>> +		}
>> +	}
>> +
>>
>>
>> Does this help?!
>>
>>  	if (fclose(f) != 0) {
>> -		pr_perror("Failed closing %s", path);
>> -		return -1;
>> +		pr_perror("Failed closing %s\n", path);
>> +		return -1;
>>  	}
>>
>> Can anyone reproduce the issue or offer a suggestion on how we should
>> proceed?
>>
>>
>> Hui (in Cc) sees similar in his experiments.
>>
>> -- Pavel
>>
>>