user namespace delta over 3.7

Mon Nov 19 15:48:36 UTC 2012

Colin Ian King <colin.king at canonical.com> writes:

> On 14/11/12 20:55, Serge Hallyn wrote:
>> Quoting Tim Gardner (tim.gardner at canonical.com):
>>> On 11/06/2012 09:36 AM, Serge Hallyn wrote:
>>>> Hi,
>>>>
>>>> the core of user namespace code has landed upstream, however some more
>>>> is needed to run full ubuntu containers in a user namespace.  Some of
>>>> this will land in 3.8, but probably not all.  Eric's development tree
>>>> is at http://git.kernel.org/?p=linux/kernel/git/ebiederm/user-namespace.git;a=summary
>>>>
>>>> I have pushed that tree on top of a recent raring tree at
>>>> git://kernel.ubuntu.com/serge/quantal-userns.git in branch
>>>> master.oct25.userns-v70.  It consists of 84 patches (including 5 just
>>>> updating under debian/, one by me fix to account for ubuntu delta, and
>>>> one not (yet) in Eric's tree to allow tmpfs mounts in a container),
>>>> which I can git-email if desired.  The built kernel is in
>>>> ppa:serge-hallyn/userns-natty and does allow me to boot a full ubuntu
>>>> container in a user namespace - meaning every root owned process and
>>>> file is actually owned by userid 100000 on the host and contained.
>>>>
>>>> I'm sending this now in the hopes that whatever bits don't land in
>>>> 3.8 can be pushed onto the raring kernel.  Our goal this cycle is to
>>>> support user namespaces, and next cycle to support completely
>>>> unprivileged creation and starting of containers.
>>>>
>>>> -serge
>>>>
>>>
>>> Serge - how about a pull request for a branch that has been rebased
>>> on Raring master-next ? I took a quick stab at it and quickly ran
>>> into uapi transition conflicts (I think).
>>
>> A successfully built kernel is at
>> git://kernel.ubuntu.com/serge/quantal-userns.git (branch
>> master-next.nov14.userns which should be the default).
>>
>> -serge
>>
>
> I've got some questions and/or observations about the following commits:
>
> b3f4f523c8c20f2ca2ac031900f1a252d750ec1d
> debian changes to build in ppa
>
> 	..this fiddles around with the skipabi, skipmodules to allow building in
> a PPA, but we should not pull that into the raring kernel.
>
> 1c428901dcae93832f13a01492539cb77fea6c85
> net: Allow opening an af_unix socket
>
> sock_open() has:
>         /* file->f_flags??? */
>         //file->f_flags = O_RDWR | (flags & O_NONBLOCK);
>
> ..the comment seems to be alluding to the fact we're not sure if we should be
> setting f->f_flags and that the code was put in during development (for testig?)
> and then commented out.  Anyhow, it's confusing and I'm now not sure what this
> is meant to be doing. Should this be removed?

Yes.  That patch is one of the patches at the end of my user namespace
tree that I have been playing with but is not fully baked.  The goal of
that change is to allow us to just open a unix domain sockect with open
instead of connect.

>
> 8b16d00119a210ad9ed6d62fd0addb37f23c683b
> fuse: Teach fuse how to handle the pid namespace.
>
> convert_fuse_file_lock() has:
>
> 	fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
>
> is it seems possible (but unlikely) for find_pid_ns() to return NULL which
> passes NULL into pid_vnr() which in turn passes NULL into
> pid_nr_ns() which returns 0. Is a zero pid of fl->fl_pid valid?

Good question. My fuse state is paged out at the moment.

> 05ee1be568bf321167fe93219aaac029a49b3d3a
> devpts: Remove the devpty cleanup special case.
>
> in ptmx_open():
>
> 	/* Find the devpts instance we are working with */
> 	mnt = devpts_mntget(filp);
>
> getpts_mntget() can return ERR_PTR(-ENODEV), so mnt probably needs checking for
> this kind of unlikely failure case.

You mean the IS_ERR(mnt) check on the next line in ptmx_open?

Certainly I see it there in my tree.

> ebb713c0e5b2bbedcb4c94715a3973c0b084e723
> devpts: Make the newinstance option historical
>
> parse_mount_options():
> 	
> case Opt_newinstance:  this is now a historical mount option and now silently
> does nothing.  Perhaps we should print some kind of warning or info message to
> indicate this just to warn users that this is now being ignored, however this is
> documented in the changes in Documentation/filesystems/devpts.txt so maybe this
> is totally unnecessary.

> 7350816a4fea9a9368f94a81cd44c3eb31fbbf3d
> net: Push capable(CAP_NET_ADMIN) into the rtnl methods
>
> Comment in this patch:
>     "Later patches will remove the extra capable calls from methods
>     that are safe for unprivilged users."
>
> ..are these later patches in this patch set, if so, which ones are they?

I assume Serge has picked them up.  They are the net patches where
capable(CAP_NET_ADMIN) in netlink methods are deleted.  Basically for
ipv4, ipv6 and bridge as I recall.  With all of the weird protocols that
don't work in network namespaces were left with capable(CAP_NET_ADMIN)
calls.

> 375cb2956423a172e017c1bfbc0c32d96250f41f
> net: Don't export sysctls to unprivileged users
>
> __ip_vs_lblc_init():
>
> the following change added a '\' which looks like a typo:
>
> -                       kfree(ipvs->lblc_ctl_table);
> +                       kfree(ipvs->lblc_ctl_table);\
>                 return -ENOMEM;

It is definitely a typo.

It is the best kind of typo that causes no build problems and gets
merged before anyone notices.  Ouch!

I have just sent a patch to David so this doesn't linger in net-next.

> d4856b6128ec5735e654259407fd8573faec4b88
> userns: Convert nfs and nfsd to use kuid/kgid where appropriate
>
> find_guid():
>
> When a gid is not found then a new one is added to the aces[] array. I don't see
> any bounds checking on this, so can this potentially fall off the end of the
> aces[] array at some point?

init_state allocates enough space so that this is not needed.  This
logic is unchanged from the existing nfsd logic.  The only practical
change here is that find_uid and find_gid are split into separate
functions for type safety.

Eric