[Devel] Re: [RFC patch 0/2] posix mqueue namespace (v11)
Serge E. Hallyn
serue at us.ibm.com
Tue Dec 16 07:14:19 PST 2008
Quoting Cedric Le Goater (clg at fr.ibm.com):
> Serge E. Hallyn wrote:
> > (Ok I don't know what the actual version number is - it's
> > high but 11 is probably safe)
> >
> > Cedric and Nadia took several approaches to making posix
> > message queues per-namespace. I ended up mamking some
> > deep changes so am not retaining their Signed-off-by:s
> > on this version, but this is definately very much based
> > on work by both of them.
>
> you can keep mine. i have had a similar version on 2.6.26.
>
> http://legoater.free.fr/patches/2.6.26/2.6.26/
>
> and it's easier to track where the patches go.
>
Thanks, Cedric, will put those back.
> > Patch 2 hopefully explains my approach. Briefly,
> > 1. sysv and posix ipc are both under CLONE_NEWIPC
> > 2. the mqueue sb is per-ipc-namespace
> >
> > So to create a new ipc namespace, you would
> >
> > unshare(CLONE_NEWIPC|CLONE_NEWNS);
>
> does CLONE_NEWIPC require CLONE_NEWNS?
No, the mq_* syscalls don't need the filesystem to actually be
mounted, and a container could just chroot("/vs1") and mount -t mqueue
under /vs1/dev/mqueue, without requiring a new mounts namespace.
> > umount /dev/mqueue
> > mount -t mqueue mqueue /dev/mqueue
>
> the semantic looks good, much better than a 'newinstance' mount
> option.
Agreed. newinstance works for a pure filesystem like devpts,
but it simply isn't a good fit for mqueue.
> if CLONE_NEWNS is not required, what happens to the user mount (and
> the mq_ns below it) when the task dies? that's the big issue. if
> CLONE_NEWNS is required we're safe, but I think Pavel made
> some objection to that.
(Huh, I just noticed get_ns_from_sb() doesn't seem to be called
anywhere <scribble><scribble>)
Short version:
The user mount hangs around until someone umounts it. Now of course
I expect that most users WILL want to do CLONE_NEWIPC|CLONE_NEWNS.
Long version:
Any VFS actions through mqueuefs will do:

	spin_lock(&mq_lock);
	ipc_ns = inode->i_sb->s_fs_info;
	if (ipc_ns)
		get_ipc_ns(ipc_ns);
	spin_unlock(&mq_lock);

where s_fs_info is the ipc_ns. Freeing an ipc_ns does:

	if (atomic_dec_and_lock(&ipc_ns->count, &mq_lock)) {
		ipc_ns->mnt->mnt_sb->s_fs_info = NULL;
		spin_unlock(&mq_lock);
		mntput(ipc_ns->mnt);
	}
So if a vfs_create() by a task in another ipc_ns is racing with the
task exit of the last task in the ipc_ns, then either
1. the vfs_create() manages to pin the ipc_ns before
the other task exits. So the task exit won't
free the ipc_ns. The put_ipc_ns() at the end
of vfs_create() will.
or
2. the task exits first, vfs_create() finds
s_fs_info NULL, and returns -EACCES. Unlink
simply succeeds.
Pavel, please let me know if you have issues with my approach.
> > It's perfectly valid to do vfs operations on files
> > in another ipc_namespace's /dev/mqueue, but any use
> > of mq_open(3) and friends will act in your own ipc_ns.
>
> ok.
Nadia had written a cool set of ltp tests. They were based
around the mount -o newinstance semantics, so I'll have to
see which ones are still relevant and rework some others,
then will post them and repost the kernel patchset.
Thanks for taking a look, Cedric, and for getting this set
going before.
-serge
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers