[Devel] design of user namespaces

Thu Jun 19 19:01:08 PDT 2008

I got to thinking about the user namespace again 
today and I wanted to summarize my thoughts.

First and foremost I want a design that solves the suid problem,
and I think I have.

In particular if we can let an unprivileged user run a process
in a different user namespace, and become a user with 0 == uid == gid
and have all capabilities except CAP_SYS_RAWIO and they still can not
do anything security wise they could not before. We have a very useful design.

My thinking is as follows.
- All user visible security credentials uid, gid, selinux filesystem lables,
  etc will be relative to a user namespace.

  This makes security easier, and it the only thing that makes
  sense in the concept of separate administrative domain.

- Each struct user will live in exactly one user namespace.

- Each user namespace will hold the struct user of the user who
  created the user namespace.

- capable() will be extended to take a user namespace parameter.  So
  the question capable asks is does the current process have the
  capability with respect to the current user namespace.

  This makes security namespaces loosely hierarchical.  And the rule
  would be that if you could do something to the files and directories
  and other objects owned by the creator of the namespace you can do
  it to objects owned by users in the user namespace.

  Going the other direction from child to parent user namespaces,
  users in user namespace no matter what capabilities they have will
  have no permissions except those given to every process.

- Each user namespace will have a bounding set of capabilities that
  suid executables and the initial process get.  Basically something
  so fundamentally machine specific capabilities like CAP_SYS_RAWIO
  can never be set inside of a user namespace unless the creator of
  the user namespace had those capabilities.

- The initial user in a user namespace will be uid 0 and have all
  capabilities, non machine specific cpabilities CAP_NET_ADMIN,
  CAP_SYS_ADMIN, CAP_SYS_CHROOT etc.  Since those capabilities will be
  restricted to objects in the user namespace they will have no
  security implications.

- struct vfsmount will include a user namespace pointer.  Recording
  in which mount namespace the mount occurred, and not changing
  when struct vfsmount is copied or propagated (including through bind
  mounts).  This user namespace will act as the user namespace to map
  all security credentials on the filesystem into.  If you are
  accessing the vfsmount from another user namespace the best you can
  get is a the world readable permissions.

  struct vfsmount will also include a filesystem data pointer.  Not
  changing except when struct user changes.  Giving filesystems a
  place to store their per user namespace credential translation
  state.

  This allows is shared filesystem caches between user namespaces.

- For filesystems to be safely shared between 2 user namespaces we
  need two things.  The owner of the filesystem has to allow it.  The
  user of the filesystem needs to request it.

  For user namespaces we have two cases.  Allowing native mounts
  to a user namespace of a previously mounted filesystem.  Allowing
  native mounts to of an unmounted filesystem.

  Working with previously mounted filesystems is the safest as we
  don't have to deal with the hard problem of poisoned filesystem data
  trying to crash our filesystem implementation.

  For previously filesystems the simplest and most comprehensive way I can
  see to do this is to implement a special case of mount -o remount.
  In which a parameter is passed telling the filesystem which
  credential mapping strategy to employ.  The credential mapping
  strategies I have seen to date are
  - identity mapping read-only.  
  - identity mapping read-write (this is a problem with quotas)
  - uid/gid offset by a constant (solves the quota problem)
  - security label for namespaces (can solve the quota problem).

  Then the already mounted filesystem can either performs an upcall
  or examine it's configuration (potentially stored in a normal file
  in the filesystem like the quotas are) to see if the remount into
  native mode for the user namespace can be allowed. 

  For the specific and very common case of identity mapping read-only
  filesystems we could even have a completely generic mount flag you
  can set a priori to allow any user namespace to remount it native.

  Most unix filesystems have common enough properties that we can
  implement a library they can wire up to implement this functionality
  without requiring an on disk format change.  So despite it being
  filesystem specific functionality we can extensive share the code.

  For network filesystems like nfs I expect the request would go to
  the server and authenticating a new mount.  It is still safer then
  a totally new nfs mount because an unprivileged user can not specify
  the server.

- Previously I really wanted to say just do something like idmapd and
  we would be golden.  The problem of quotas in different namespaces
  would be solved, etc.  After thinking about it I realized there is
  no same place to run such a mapping daemon that could map between
  arbitrary user namespaces that could do something useful and not
  also compromise security.

  Mapping daemons are good at changing the form of security tokens.
  Say by looking up user names in /etc/passwd.  They are not
  sufficient to perform a general mapping.

  The owner of the filesystem has to configure what you are allowed
  to see and access.  Which uids you can use, which directories you
  can use etc.  While it is the job of the mapping daemon to map the
  resources it has available to it (username strings typically) to
  something the users of the filesystem can actually use.

Eric
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers