[CRIU] [PATCH 3/3] Issue #360: Anonymize image files
Harshavardhan Unnibhavi
hvubfoss at gmail.com
Tue Jun 25 16:38:17 MSK 2019
On Mon, Jun 24, 2019 at 8:45 PM Radostin Stoyanov <rstoyanov1 at gmail.com> wrote:
>
> On 24/06/2019 13:00, Pavel Emelianov wrote:
> > On 6/22/19 12:37 PM, Harshavardhan Unnibhavi wrote:
> >> This commit adds the file anonymizer function which anonymizes file names present in images.
> >>
> >> The anonymized file names are just the shuffled names along the path from root.
> >>
> >> Signed-off-by: Harshavardhan Unnibhavi <hvubfoss at gmail.com>
> >> ---
> >> lib/py/cli.py | 9 ++++++-
> >> lib/py/strip.py | 66 +++++++++++++++++++++++++++++++++++++++++++++++++
> >> 2 files changed, 74 insertions(+), 1 deletion(-)
> >> create mode 100644 lib/py/strip.py
> >>
> >> diff --git a/lib/py/cli.py b/lib/py/cli.py
> >> index 17622fd2..4a8efeff 100755
> >> --- a/lib/py/cli.py
> >> +++ b/lib/py/cli.py
> >> @@ -5,6 +5,7 @@ import json
> >> import os
> >>
> >> import pycriu
> >> +import strip
> The name 'strip' is very common (e.g. Python strings have a strip()
> method), and using it as a module name may cause confusion. Perhaps
> 'anonymize' would be better?
Yes, 'anonymize' would be a better name. I will change it.
> >>
> >> def inf(opts):
> >> if opts['in']:
> >> @@ -281,15 +282,21 @@ def anonymize(opts):
> >> img_files = os.listdir(opts['in'])
> >>
> >> for i in img_files:
> >> - temp = {'in':os.path.join(opts['in'], i)}
> >> + temp = {'in':os.path.join(opts['in'], i), 'out':os.path.join(opts['out'], i)}
> >>
> >> try:
> >> m, img = pycriu.images.load(inf(temp), anon_info = True)
> >> + print("Processing File name:{} with magic:{}".format(i, m))
> >> except pycriu.images.MagicException as exc:
> >> print("Unknown magic %#x.\n"\
> >> "Found a raw image, continuing ..."% exc.magic, file=sys.stderr)
> >> continue
> >>
> >> + anon_dict = strip.anon_handler(img, m)
> >> + if anon_dict != -1:
> >> + pycriu.images.dump(anon_dict, outf(temp))
> > A message about skipping the file is needed.
> >
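One possible shape for the skip message requested above is sketched here. It is only an illustration: the helper name anonymize_or_skip, the ANONYMIZERS table, and the message wording are hypothetical, and the pycriu load/dump calls from the patch are left out so the snippet runs on its own. It returns None instead of -1 for a skipped image, which is a deliberate change from the patch.

```python
import sys


def files_anon(image):
    # Placeholder for the real FILES anonymizer from the patch.
    return image


# Hypothetical handler table; the patch only registers 'FILES'.
ANONYMIZERS = {'FILES': files_anon}


def anonymize_or_skip(name, magic, image, err=sys.stderr):
    """Return the anonymized image, or None after reporting the skip."""
    handler = ANONYMIZERS.get(magic)
    if handler is None:
        # Tell the user why no output file is written for this image.
        print("Skipping {}: no anonymizer for magic {}".format(name, magic),
              file=err)
        return None
    return handler(image)
```

The caller would then dump the result only when it is not None, so every input file produces either an output image or a line on stderr.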
> >> +
> >> +
> >>
> >> explorers = { 'ps': explore_ps, 'fds': explore_fds, 'mems': explore_mems, 'rss': explore_rss }
> >>
> >> diff --git a/lib/py/strip.py b/lib/py/strip.py
> >> new file mode 100644
> >> index 00000000..4069275c
> >> --- /dev/null
> >> +++ b/lib/py/strip.py
> The indentation of this file uses spaces (IMHO we should be using
> spaces for Python code); however, the rest of the code base uses tabs.
> For consistency, it might be better to use tabs in this file as well.
> >> @@ -0,0 +1,66 @@
> >> +# This file contains methods to deal with anonymising images.
> >> +#
> >> +# Contents being anonymised can be found at: https://github.com/checkpoint-restore/criu/issues/360
> Could you please add the content that is being anonymised instead of
> providing an external link to the github issue? This will be helpful
> when reading the source code offline.
Sure.
> >> +#
> >> +# Inorder to anonymise the image files three steps are followed:
> s/Inorder/In order/g
> >> +# - decode the binary image to json
> >> +# - strip the necessary information from the json dict
> >> +# - encode the json dict back to a binary image, which is now anonymised
> >> +
> >> +import sys
> >> +import json
> >> +import random
> >> +
> >> +def files_anon(image):
> >> + levels = {}
> >> +
> >> + for e in image['entries']:
> >> + f_path = e['reg']['name']
> we should handle KeyError: 'reg' or check if the reg key exists.
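A minimal sketch of the check suggested above, assuming only what the patch itself shows (entries are dicts, and regular files carry e['reg']['name']). The generator name reg_file_names is hypothetical, and the 'pipe' key in the usage example is only a stand-in for any entry without a 'reg' record.

```python
def reg_file_names(image):
    """Yield (index, name) for entries that have a 'reg' record with a name."""
    for i, entry in enumerate(image.get('entries', [])):
        reg = entry.get('reg')
        # Entries without a 'reg' record are left untouched.
        if reg is None or 'name' not in reg:
            continue
        yield i, reg['name']


# Usage: only the first and third entries are reported.
image = {'entries': [{'reg': {'name': '/a/b'}}, {'pipe': {}}, {'reg': {'name': '/'}}]}
names = list(reg_file_names(image))
```

Both loops in files_anon could iterate over this generator, so the guard lives in one place.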
> >> + f_path = f_path.split('/')
> >> +
> >> + lev_num = 0
> >> + for p in f_path:
> >> + if p == '':
> >> + continue
> >> + if lev_num in levels.keys():
> >> + if p not in levels[lev_num].keys():
> is .keys() necessary here?
> >> + temp = list(p)
> >> + random.shuffle(temp)
> > Erm, I'm not 100% sure it's OK to anonymize file paths like that.
> Computing a hash could be another option?
Yes, I will look into it.
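As a rough idea of what the hash option could look like: the sketch below replaces each path component with a truncated SHA-256 digest. The function names, the optional salt parameter, and the 16-character truncation are all assumptions made for illustration, not something agreed in this thread. Unlike the shuffle in the patch, the result does not depend on the directory level, and the same component always maps to the same token.

```python
import hashlib


def anon_component(component, salt=b''):
    """Map one path component to a fixed-length hex token."""
    digest = hashlib.sha256(salt + component.encode('utf-8')).hexdigest()
    return digest[:16]  # truncated only to keep paths short


def anon_path(path, salt=b''):
    """Hash every non-empty component, keeping the '/' structure intact."""
    parts = path.split('/')
    return '/'.join(anon_component(p, salt) if p else p for p in parts)
```

Without a salt, hashes of common names such as "usr" or "lib" can be recovered by hashing a dictionary of likely names, so a per-run random salt is probably needed if the goal is to hide them.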
> >
> >> + levels[lev_num][p] = ''.join(temp)
> >> + else:
> >> + levels[lev_num] = {}
> >> + temp = list(p)
> >> + random.shuffle(temp)
> >> + levels[lev_num][p] = ''.join(temp)
> > Can we factor out these two branches a bit? Something like
> >
> >     if lev_num not in levels.keys():
> >         levels[lev_num] = {}
> >     if p not in levels[lev_num].keys():
> >         temp = list(p)
> >         random.shuffle(temp)
> >         levels[lev_num][p] = ''.join(temp)
> >
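Taking the suggestion above one step further, the sketch below uses dict.setdefault and plain "in" membership tests, which also answers the earlier .keys() question: "p in d" already checks the keys. The function name build_levels and the rng parameter are hypothetical; rng is only there so a seeded random.Random can be passed in for reproducible runs.

```python
import random


def build_levels(paths, rng=random):
    """Build {level: {component: shuffled component}} for the given paths."""
    levels = {}
    for path in paths:
        # Drop empty components produced by leading or repeated slashes.
        parts = [p for p in path.split('/') if p]
        for lev_num, p in enumerate(parts):
            names = levels.setdefault(lev_num, {})
            if p not in names:
                temp = list(p)
                rng.shuffle(temp)
                names[p] = ''.join(temp)
    return levels
```

This keeps the per-level mapping of the patch, so the same component at the same depth is always replaced by the same shuffled string.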
> >> + lev_num += 1
> >> +
> >> + for i, e in enumerate(image['entries']):
> >> + f_path = e['reg']['name']
> >> + if f_path == '/':
> >> + continue
> >> + f_path = f_path.split('/')
> >> +
> >> + lev_num = 0
> >> + for j, p in enumerate(f_path):
> >> + if p == '':
> >> + continue
> >> + f_path[j] = levels[lev_num][p]
> >> + lev_num += 1
> >> + f_path = '/'.join(f_path)
> >> + image['entries'][i]['reg']['name'] = f_path
> >> +
> >> + return image
> >> +
> >> +
> >> +
> >> +
> >> +anonymizers = {
> >> + 'FILES': files_anon,
> >> + }
> >> +
> > Please run the lint tool on this file; I'm afraid this coding style is not correct. The config
> > file for lint is in scripts/flake8.cfg.
>
> To run the lint tool you will need to install flake8:
> $ pip install flake8
> $ make lint
>
> >> +def anon_handler(image, magic):
> >> + if magic != 'FILES':
> >> + return -1
> >> + handler = anonymizers[magic]
> >> + anon_image = handler(image)
> >> + return anon_image
> Please add an empty line before the return statement.
Sure.
> >> \ No newline at end of file
> Please add a newline at the end of the file.