[Devel] [PATCH RHEL8 COMMIT] ms/Revert "vfs: Allow userns root to call mknod on owned filesystems."
Konstantin Khorenko
khorenko at virtuozzo.com
Wed Jan 22 17:52:32 MSK 2020
The commit is pushed to "branch-rh8-4.18.0-80.1.2.vz8.3.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-80.1.2.vz8.3.1
------>
commit 6fa346a6cacc2ecd78be636a972b079ef1f27999
Author: Christian Brauner <christian at brauner.io>
Date: Thu Jul 5 17:51:20 2018 +0200
ms/Revert "vfs: Allow userns root to call mknod on owned filesystems."
This reverts commit 55956b59df336f6738da916dbb520b6e37df9fbd.
commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.")
enabled mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since any filesystem
mounted from a non-initial user namespace will set the SB_I_NODEV flag on
the filesystem. Now, when a device node s created in a non-initial user
namespace a call to open() on said device node will fail due to:
bool may_open_dev(const struct path *path)
{
return !(path->mnt->mnt_flags & MNT_NODEV) &&
!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}
The problem with this is that as of the aforementioned commit mknod()
creates partially functional device nodes in non-initial user namespaces.
In particular, it has the consequence that as of the aforementioned commit
open() will be more privileged with respect to device nodes than mknod().
Before it was the other way around. Specifically, if mknod() succeeded
then it was transparent for any userspace application that a fatal error
must have occured when open() failed.
All of this breaks multiple userspace workloads and a widespread assumption
about how to handle mknod(). Basically, all container runtimes and systemd
live by the slogan "ask for forgiveness not permission" when running user
namespace workloads. For mknod() the assumption is that if the syscall
succeeds the device nodes are useable irrespective of whether it succeeds
in a non-initial user namespace or not. This logic was chosen explicitly
to allow for the glorious day when mknod() will actually be able to create
fully functional device nodes in user namespaces.
A specific problem people are already running into when running 4.18 rc
kernels are failing systemd services. For any distro that is run in a
container systemd services started with the PrivateDevices= property set
will fail to start since the device nodes in question cannot be
opened (cf. the arguments in [1]).
Full disclosure, Seth made the very sound argument that it is already
possible to end up with partially functional device nodes. Any filesystem
mounted with MS_NODEV set will allow mknod() to succeed but will not allow
open() to succeed. The difference to the case here is that the MS_NODEV
case is transparent to userspace since it is an explicitly set mount option
while the SB_I_NODEV case is an implicit property enforced by the kernel
and hence opaque to userspace.
[1]: https://github.com/systemd/systemd/pull/9483
Signed-off-by: Christian Brauner <christian at brauner.io>
Cc: "Eric W. Biederman" <ebiederm at xmission.com>
Cc: Seth Forshee <seth.forshee at canonical.com>
Cc: Serge Hallyn <serge at hallyn.com>
Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
https://jira.sw.ru/browse/PSBM-100581
Warning: next patch will allow mknod in CT init userns.
The explanation above is good, but we always had such a behavior (mknod
succeeds but later open() can fail) and never had a problem because of
that, so let it be the same until we face the problem.
Signed-off-by: Konstantin Khorenko <khorenko at virtuozzo.com>
---
fs/namei.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index fac616427913..07ce37de342a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3700,8 +3700,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
if (error)
return error;
- if ((S_ISCHR(mode) || S_ISBLK(mode)) &&
- !ns_capable(dentry->d_sb->s_user_ns, CAP_MKNOD))
+ if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
return -EPERM;
if (!dir->i_op->mknod)
More information about the Devel
mailing list