[Devel] user namespace - introduction
Serge E. Hallyn
serue at us.ibm.com
Mon Jun 4 12:39:57 PDT 2007
[ I've been sitting on this for some months, and am just dumping it so
people can talk if they like, maybe even build on the patchset by
adding support for more filesystems or implementing the keyring. Or
tell me how much the approach sucks. ]
First, I point out once more that the base user namespace patchset
Cedric originally sent out really is sufficient. We just need for users
to have different quotas, limits, and in-kernel key storage. Signal
delivery, file controls, etc. can be set up using pidspaces and separate
mount trees, using SELinux policy or other LSMs, and even using
ecryptfs.
But if it will be insisted upon that uid checks be enhanced, here is
a new patchset which just might satisfy everyone, and which is based
on user namespace discussions from the last year, particularly
comments by Eric Biederman and David Howells.
Below is how I think the user namespace controls would work. The
patches that follow only touch on parts of steps 1-4.
1. Let filesystems tag inodes and superblocks with one user namespace.
2. Let generic_permission - and through inode->i_op->permission, the
fs, if it wants to be smarter - enforce user namespaces.
3. By default, inode->i_userns comes from sb->s_userns, just as is
done in these patches.
4. By default, if inode->i_userns != task->userns, the process gets
treated as 'nobody'. This is a change from my current patches and
what is done in -lxc, where all permission is denied. I think it
is a far preferable behavior. It allows read-only bind mount
sharing among user namespaces without a silly MS_USER_NS flag.
5. Capabilities relating to actions on subjects or objects associated
with a user namespace are only effective for targets in the same
user namespace as the actor.
This *could* be changed to also work for targets in descendant
user namespaces, but that could slow things down.
6. Create a new keychain for user namespaces. Two types of entries.
The first type of entry, (user_ns 5, uid 501) means that whichever
user has that key will be recognized in user namespace 5 as uid
501. Presumably, uid 501 in user_ns 5 would have started a vserver
with a new user namespace, say user_ns 7. He would likely want to
give uid 0 in user_ns 7 a (user_ns 5, uid 501) key.
The second type of key, (user_ns 5, CAP_FOWNER) gives the user
holding the key the ability to have CAP_FOWNER in userns 5. By
default, uid 0 in userns 7 cannot have CAP_FOWNER in userns 5.
(Only) a task with (userns 5, CAP_SETPCAP) can give that key to
any user in userns 7. The key by itself does not grant the
capability, but allows a task with that uid which has CAP_FOWNER
in its P set to assert it for userns 5.
7. Eventually filesystems could begin storing global uids in inode
xattrs on disk, and use these in inode->i_op->permission() along
with data in the user's userns key to do global uid permission
checking. Really this should be almost trivial to implement once
the above has been implemented. It could be done right in ext234
etc, or in a small stackable fs.
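To make steps 3-5 concrete, here is a rough user-space sketch of the check I have in mind. It is not taken from the patches: every struct, field layout and function name below (sketch_permission, inode_init, the cap_dac_override flag) is made up for illustration, group bits are left out for brevity, and the real generic_permission is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; these are not the kernel's structures. */
struct user_ns { int id; };

struct superblock { const struct user_ns *s_userns; };

struct inode {
    const struct user_ns *i_userns; /* step 3: copied from the superblock */
    unsigned int i_uid;
    unsigned int i_mode;            /* low 9 bits: rwx for owner/group/other */
};

struct task {
    const struct user_ns *userns;
    unsigned int uid;
    bool cap_dac_override;          /* stand-in for a capability check */
};

/* Same numeric values as the kernel's MAY_* access masks. */
enum { MAY_EXEC = 1, MAY_WRITE = 2, MAY_READ = 4 };

/* Step 3: by default an inode inherits its namespace from its superblock. */
static void inode_init(struct inode *in, const struct superblock *sb,
                       unsigned int uid, unsigned int mode)
{
    assert(sb != NULL && sb->s_userns != NULL);
    in->i_userns = sb->s_userns;
    in->i_uid = uid;
    in->i_mode = mode;
}

/* Returns 0 if access is granted, -1 if it is denied. */
static int sketch_permission(const struct task *t, const struct inode *in,
                             int mask)
{
    unsigned int bits;

    if (in->i_userns != t->userns) {
        /*
         * Step 4: a task from another namespace is treated as 'nobody',
         * so only the "other" bits apply. Step 5: its capabilities are
         * not effective against this inode, so they are not consulted.
         */
        bits = in->i_mode & 7;
    } else {
        if (t->cap_dac_override)
            return 0;
        /* Group bits omitted: the owner gets owner bits, the rest "other". */
        bits = (t->uid == in->i_uid) ? (in->i_mode >> 6) & 7
                                     : in->i_mode & 7;
    }
    return ((bits & (unsigned int)mask) == (unsigned int)mask) ? 0 : -1;
}
```

With a mode 0644 file owned by uid 501 in user_ns 5, a uid 0 task from user_ns 7 can read the file through the "other" bits but cannot write it, even if it holds the capability flag. That is the read-only sharing behavior described in step 4.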
-serge
Note: step 1 has been complained about because some think it should be
done at the vfsmount level. If you read through the whole set of steps I
think you'll see why doing it at the superblock and inode level is not
more limited. The fs gets to decide the
real owner of a file, and despite there being one real owner, any number
of users can be made to be treated as the owner, so there is no
limitation in this approach.
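The keys from step 6 are what would let several users be treated as the owner. Below is a rough user-space sketch of how the two key types might be consulted. None of this exists in the patchset: the struct layout, uid_in_ns, cap_in_ns and the use of 65534 for 'nobody' are all assumptions made for illustration. Only the capability number 3 for CAP_FOWNER matches linux/capability.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two key types from step 6. */
enum key_type { KEY_UID, KEY_CAP };

struct userns_key {
    enum key_type type;
    int ns_id;            /* namespace the key applies to */
    unsigned int value;   /* uid for KEY_UID, capability number for KEY_CAP */
};

struct task {
    int ns_id;                      /* namespace the task runs in */
    unsigned int uid;
    unsigned long permitted;        /* bitmask standing in for the P set */
    const struct userns_key *keys;  /* keys held by the task's user */
    size_t nkeys;
};

#define SKETCH_CAP_FOWNER 3u        /* CAP_FOWNER in linux/capability.h */
#define SKETCH_NOBODY     65534u    /* assumed uid for 'nobody' */

static bool has_key(const struct task *t, enum key_type type,
                    int ns_id, unsigned int value)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->nkeys; i++) {
        const struct userns_key *k = &t->keys[i];
        if (k->type == type && k->ns_id == ns_id && k->value == value)
            return true;
    }
    return false;
}

/* Which uid is this task recognized as inside namespace ns_id? */
static unsigned int uid_in_ns(const struct task *t, int ns_id)
{
    if (t->ns_id == ns_id)
        return t->uid;
    for (size_t i = 0; i < t->nkeys; i++) {
        if (t->keys[i].type == KEY_UID && t->keys[i].ns_id == ns_id)
            return t->keys[i].value;  /* first type: (user_ns, uid) key */
    }
    return SKETCH_NOBODY;             /* no key: treated as 'nobody' */
}

/* May this task assert capability cap against targets in ns_id? */
static bool cap_in_ns(const struct task *t, int ns_id, unsigned int cap)
{
    /* The key alone grants nothing: the cap must be in the P set. */
    if (((t->permitted >> cap) & 1UL) == 0)
        return false;
    if (t->ns_id == ns_id)
        return true;
    /* Second type: a (user_ns, capability) key for a foreign namespace. */
    return has_key(t, KEY_CAP, ns_id, cap);
}
```

In this model, uid 0 in user_ns 7 holding a (user_ns 5, uid 501) key is seen as uid 501 in user_ns 5, and it can assert CAP_FOWNER there only if it holds both the (user_ns 5, CAP_FOWNER) key and CAP_FOWNER in its permitted set.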