<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman <span dir="ltr">&lt;<a href="mailto:gregkh@linuxfoundation.org" target="_blank">gregkh@linuxfoundation.org</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div>On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:<br>
> So the big issues for a device namespace to solve are filtering which<br>
> devices a container has access to and being able to dynamically change<br>
> which devices those are at run time (aka hotplug).<br>
<br>
</div>As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG<br>
anymore, because it was redundant), I think you need to really think<br>
this through better (pci, memory, cpus, etc.) before you do anything in<br>
the kernel.<br>
<div><br>
> After having thought about this for a bit I don't know if a pure<br>
> userspace solution is sufficient or actually a good idea.<br>
><br>
> - We can manually manage a tmpfs with device nodes in userspace.<br>
> (But that is deprecated functionality in the mainstream kernel).<br>
<br>
</div>Yes, but I'm not going to namespace devtmpfs, as that is going to be an<br>
impossible task, right?<br></blockquote><div><br></div><div>That sounds like a challenge ;-)</div><div>Seriously, as Serge correctly noted, it would not be that different from devpts</div><div>if you start from an empty devtmpfs and populate it with devices that are "added</div>
<div>in the context of that namespace".</div><div>The semantics by which devices are "added in the context of a namespace"</div><div>are the missing piece of the puzzle.</div><div><br></div><div>What we would really like to see is a setns()-style API that can be used to</div>
<div>add a device in the context of a namespace in either a "shared" or "private"</div><div>mode.</div><div>This kind of API is a required building block for us to write device drivers</div><div>that are namespace-aware in a way that gives userspace enough flexibility</div>
<div>for dynamic configuration.</div><div><br></div><div>We are trying to come up with a proposal for that sort of API.</div><div>When we have something decent, we shall post it.</div><div><br></div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
And remember, udev doesn't create device nodes anymore...<br>
<div><br>
> - We can manually export a subset of sysfs with bind mounts.<br>
> (But that feels hacky, and is essentially incompatible with hotplug).<br>
<br>
</div>True.<br>
<div><br>
> - We can relay a call of /sbin/hotplug from outside of a container<br>
> to inside of a container based on policy.<br>
> (But no one uses /sbin/hotplug anymore).<br>
<br>
</div>That's right, they should be listening to libudev events, so why can't<br>
your daemon shuffle them off to the proper container, all in userspace?<br>
<div><br>
> - There is no way to fake netlink uevents for a container to see them.<br>
> (The best we could do is replace udev everywhere with something that<br>
> listens on a unix domain socket).<br>
<br>
</div>You shouldn't need to do this.<br>
<div><br>
> - It would be nice to replace the device cgroup with a comprehensive<br>
> solution that really works. (Among other things the device cgroup<br>
> does not work in terms of struct device, the underlying kernel<br>
> abstraction for devices).<br>
<br>
</div>I didn't even know there was a device cgroup.<br>
<br>
Which means that if there is one, odds are it's useless.<br>
<div><br>
> We must manage sysfs entries as well as device nodes because:<br>
> - Seeing more than we should has the real potential to confuse<br>
> userspace, especially a userspace that replays uevents.<br>
<br>
</div>You should never replay uevents. If you don't do that, why can't you<br>
see all of sysfs?<br>
<div><br>
> - Some device control must happen through writing to sysfs files and<br>
> if we don't remove all root privileges from a container only by<br>
> exporting a subset of sysfs to that container can we limit which<br>
> sysfs nodes can be written to.<br>
<br>
</div>But you have the issue of controlling devices in a "shared" way, which<br>
isn't going to be usable for almost all devices.<br>
<div><br>
> The current kernel tagged sysfs entry support does not look like a good<br>
> match for implementing device filtering. The common case will<br>
> be allowing devices like /dev/zero, and /dev/null that live in<br>
> /sys/devices/virtual and are the devices we are most likely to care<br>
> about. Those devices need to live in multiple device namespaces so<br>
> everyone can use them. Perhaps exclusive assignment will be the more<br>
> common paradigm for device namespaces like it is for network devices in<br>
> the network namespace but from what little I can see of this problem right now I<br>
> don't think so.<br>
><br>
> I definitely think we should hold off on a kernel level implementation<br>
> until we really understand the issues and are ready to implement device<br>
> namespaces correctly.<br>
<br>
</div>I agree, especially as I don't think this will ever work.<br>
<div><br>
> A userspace implementation looks like it can only do about 95% of what<br>
> is really needed, but at the same time looks like an easy way to<br>
> experiment until the problem is sufficiently well understood.<br>
<br>
</div>95% is probably way better than what you have today, and will fit the<br>
needs of almost everyone today, so why not do it?<br>
<br>
I'd argue that those last 5% either are custom solutions that never get<br>
merged, or candidates for true virtualization.<br>
<div><br>
> In summary the situation with device hotplug and containers sucks today,<br>
> and we need to do something. Running a Linux desktop in a container is<br>
> a reasonably good example use case.<br>
<br>
</div>No it isn't. I'd argue that this is a horrible use case, one that you<br>
shouldn't do. Why not just use multi-head machines like people do who<br>
really want to do this, relying on user separation? That's a workable<br>
solution that is quite common and works very well today.<br>
<div><br>
> Having one standard common maintainable implementation would be very<br>
> useful and the most logical place for that would be in the kernel.<br>
> For now we should focus on simple device filtering and hotplug.<br>
<br>
</div>Just listen for libudev stuff, don't try to filter them, or ever<br>
"replay" them, that way lies madness, and lots of nasty race conditions<br>
that are guaranteed to break things.<br>
<br>
good luck,<br>
<br>
greg k-h<br>
</blockquote></div><br></div></div>
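<div dir="ltr"><div><br></div><div>For reference, the userspace approach discussed above (a daemon outside the containers that listens for device events and passes each one to the appropriate container) can be sketched roughly as follows. This is only a sketch under stated assumptions: it reads the raw kernel uevent netlink socket instead of going through libudev, purely to avoid a dependency; the POLICY table, the container names, and the forward() callback are hypothetical; and how an event is actually delivered into a container is left to the caller.</div>
<pre>
```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # netlink protocol number for kernel uevents
KERNEL_UEVENT_GROUP = 1      # multicast group carrying raw kernel uevents

# Hypothetical policy: container name mapped to the subsystems it may see.
POLICY = {
    "desktop": {"input", "sound", "drm"},
    "builder": {"block"},
}


def parse_uevent(raw):
    """Parse one kernel uevent datagram into a dict of its KEY=VALUE fields."""
    parts = raw.split(b"\0")
    # parts[0] is the "action@devpath" summary; the rest are KEY=VALUE pairs.
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition(b"=")
        if sep:
            fields[key.decode()] = value.decode(errors="replace")
    return fields


def containers_for(fields, policy=POLICY):
    """Return the sorted names of containers allowed to see this event."""
    subsystem = fields.get("SUBSYSTEM")
    return sorted(name for name, allowed in policy.items() if subsystem in allowed)


def listen(forward):
    """Receive kernel uevents and call forward(container, fields) per match.

    Linux only. The delivery mechanism behind forward() is an assumption
    and is not implemented here.
    """
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM, NETLINK_KOBJECT_UEVENT)
    sock.bind((0, KERNEL_UEVENT_GROUP))
    while True:
        fields = parse_uevent(sock.recv(65536))
        for name in containers_for(fields):
            forward(name, fields)
```
</pre>
<div>Only parse_uevent() and containers_for() carry any logic; listen() just wires them to the socket. Keying the policy on SUBSYSTEM is an arbitrary choice for illustration, and a real policy would more likely match on individual devices.</div></div>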