<div dir="ltr">Ok, it's reasonable, will redo.<br>
</div><div dir="ltr"><br>
</div><div class="wps_signature">Best regards, Tikhomirov Pavel.</div><div class="wps_quotion">Dmitry Safonov <0x7f454c46@gmail.com> | Date: 9 Feb 2018, 20:02 | Message:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><p></p><p dir="ltr">2018-02-09 16:06 GMT+00:00 Pavel Tikhomirov <<a href="mailto:ptikhomirov@virtuozzo.com">ptikhomirov@virtuozzo.com</a>>:
<br>
> We have a problem: when a pid is reused between consecutive dumps, we can't
<br>
> tell whether the pagemap and pages from the parent dump's images are still
<br>
> valid for restoring that pid. This can even lead to wrong memory being
<br>
> restored for that pid, see the test in the last patch.
<br>
>
<br>
> So this is an attempt to separate processes with a (likely) invalid previous
<br>
> memory dump from processes with a 100% valid previous dump.
<br>
>
<br>
> For that we use the value of /proc/<pid>/stat's start_time and also the
<br>
> timestamp of each (pre)dump. If the start time is strictly less than the
<br>
> timestamp, the pagemap for that pid from the previous dump
<br>
> is valid: it was made for exactly the same process.
<br>
>
<br>
> Creation time is in centiseconds by default, so if a predump is really fast
<br>
> (<1csec) we can get false negatives for some processes, but for
<br>
> long-running processes we are fine.
<br>
>
<br>
> <a href="https://jira.sw.ru/browse/PSBM-67502">https://jira.sw.ru/browse/PSBM-67502</a>
<br>
>
<br>
> Signed-off-by: Pavel Tikhomirov <<a href="mailto:ptikhomirov@virtuozzo.com">ptikhomirov@virtuozzo.com</a>>
<br>
> ---
<br>
> criu/mem.c | 37 ++++++++++++++++++++++++++++++++++++-
<br>
> 1 file changed, 36 insertions(+), 1 deletion(-)
<br>
>
<br>
> diff --git a/criu/mem.c b/criu/mem.c
<br>
> index 4c6942a11..355c992c7 100644
<br>
> --- a/criu/mem.c
<br>
> +++ b/criu/mem.c
<br>
> @@ -30,9 +30,11 @@
<br>
> #include "fault-injection.h"
<br>
> #include "prctl.h"
<br>
> #include <compel/compel.h>
<br>
> +#include "proc_parse.h"
<br>
>
<br>
> #include "protobuf.h"
<br>
> #include "images/pagemap.pb-c.h"
<br>
> +#include "images/stats.pb-c.h"
<br>
>
<br>
> static int task_reset_dirty_track(int pid)
<br>
> {
<br>
> @@ -303,6 +305,7 @@ static int __parasite_dump_pages_seized(struct pstree_item *item,
<br>
> int ret = -1;
<br>
> unsigned cpp_flags = 0;
<br>
> unsigned long pmc_size;
<br>
> + bool possible_pid_reuse = false;
<br>
>
<br>
> if (opts.check_only)
<br>
> return 0;
<br>
> @@ -360,6 +363,38 @@ static int __parasite_dump_pages_seized(struct pstree_item *item,
<br>
> xfer.parent = NULL + 1;
<br>
> }
<br>
>
<br>
> + if (xfer.parent) {
<br>
> + struct proc_pid_stat pps_buf;
<br>
> + StatsEntry *stats = NULL;
<br>
> + unsigned long dump_ticks;
<br>
> + unsigned long clock_ticks;
<br>
> +
<br>
> + clock_ticks = sysconf(_SC_CLK_TCK);
<br>
> + if (clock_ticks == -1) {
<br>
> + pr_perror("Failed to get clock ticks via sysconf");
<br>
> + goto out_xfer;
<br>
> + }
<br>
> +
<br>
> + ret = parse_pid_stat(item->pid->real, &pps_buf);
<br>
> + if (ret < 0)
<br>
> + goto out_xfer;
<br>
> +
<br>
> + ret = get_parent_stats((void**)&stats);
<br>
> + if (ret < 0)
<br>
> + goto out_xfer;
<br>
> + dump_ticks = stats->dump->dump_uptime/(USEC_PER_SEC / clock_ticks);
<br>
> + stats_entry__free_unpacked(stats, NULL);
<br>
> +
<br>
> + if (pps_buf.start_time >= dump_ticks) {
<br>
> + pr_warn("Detected possible pid reuse pid=%d, " \
<br>
> + "start_time=%llu, parent's dump_uptime=%lu\n",
<br>
> + item->pid->real, pps_buf.start_time,
<br>
> + dump_ticks);
<br>
> + possible_pid_reuse = true;
<br>
<br>
What is the meaning of this warning in the logs?
<br>
Can we separate the two cases:
<br>
1. pps_buf.start_time > dump_ticks
<br>
Real pid reuse, silently re-dump the pid.
<br>
2. pps_buf.start_time == dump_ticks
<br>
Warn that reuse is possible and re-dump.
<br>
<br>
For (2) we may really want to know how often it happens,
<br>
because if it happens too often we might want to
<br>
improve this detection, e.g. by inserting a 1csec delay before
<br>
saving the uptime.
<br>
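A minimal sketch of the split proposed above, only to illustrate the idea. The names enum pid_reuse, classify_pid_reuse(), must_redump() and should_warn() are hypothetical and do not exist in the patch or in criu; both arguments are assumed to already be in clock ticks, as pps_buf.start_time and dump_ticks are in the quoted hunk.

```c
#include <stdbool.h>

/* Hypothetical helper types and names, not part of the patch or of criu. */
enum pid_reuse {
	PID_SAME_PROCESS,   /* start_time <  dump_ticks: parent pagemap is valid */
	PID_REUSE_POSSIBLE, /* start_time == dump_ticks: cannot tell at tick resolution */
	PID_REUSE_CERTAIN,  /* start_time >  dump_ticks: process started after the parent dump */
};

/* Both arguments are assumed to be expressed in clock ticks. */
static enum pid_reuse classify_pid_reuse(unsigned long long start_time,
					 unsigned long long dump_ticks)
{
	if (start_time < dump_ticks)
		return PID_SAME_PROCESS;
	if (start_time == dump_ticks)
		return PID_REUSE_POSSIBLE;
	return PID_REUSE_CERTAIN;
}

/* In both reuse cases the parent pagemap is not trusted and memory is re-dumped. */
static bool must_redump(enum pid_reuse r)
{
	return r != PID_SAME_PROCESS;
}

/* Only the ambiguous case is worth a warning, so its frequency shows up in the logs. */
static bool should_warn(enum pid_reuse r)
{
	return r == PID_REUSE_POSSIBLE;
}
```

With this shape the caller would set possible_pid_reuse from must_redump() and call pr_warn() only when should_warn() is true, so case (1) stays silent.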
<br>
--
<br>
Dmitry
<br>
</p>
</blockquote></div>