This commit is contained in:
Xingyu Wang 2019-08-20 23:33:35 +08:00
parent 6286ef33d9
commit fee9c60702
2 changed files with 146 additions and 145 deletions

View File

@ -1,145 +0,0 @@
[#]: collector: (lujun9972)
[#]: translator: (wxy)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Podman and user namespaces: A marriage made in heaven)
[#]: via: (https://opensource.com/article/18/12/podman-and-user-namespaces)
[#]: author: (Daniel J Walsh https://opensource.com/users/rhatdan)
Podman and user namespaces: A marriage made in heaven
======
Learn how to use Podman to run containers in separate user namespaces.
![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/architecture_structure_planning_design_.png?itok=KL7dIDct)
[Podman][1], part of the [libpod][2] library, enables users to manage pods, containers, and container images. In my last article, I wrote about [Podman as a more secure way to run containers][3]. Here, I'll explain how to use Podman to run containers in separate user namespaces.
I have always thought of [user namespace][4], primarily developed by Red Hat's Eric Biederman, as a great feature for separating containers. User namespace allows you to specify a user identifier (UID) and group identifier (GID) mapping to run your containers. This means you can run as UID 0 inside the container and UID 100000 outside the container. If your container processes escape the container, the kernel will treat them as UID 100000. Not only that, but any file object owned by a UID that isn't mapped into the user namespace will be treated as owned by "nobody" (65534, kernel.overflowuid), and the container process will not be allowed access unless the object is accessible by "other" (world readable/writable).
If you have a file owned by "real" root with permissions [660][5], and the container processes in the user namespace attempt to read it, they will be prevented from accessing it and will see the file as owned by nobody.
### An example
Here's how that might work. First, I create a file in my system owned by root.
```
$ sudo bash -c "echo Test > /tmp/test"
$ sudo chmod 600 /tmp/test
$ sudo ls -l /tmp/test
-rw-------. 1 root root 5 Dec 17 16:40 /tmp/test
```
Next, I volume-mount the file into a container running with a user namespace map 0:100000:5000.
```
$ sudo podman run -ti -v /tmp/test:/tmp/test:Z --uidmap 0:100000:5000 fedora sh
# id
uid=0(root) gid=0(root) groups=0(root)
# ls -l /tmp/test
-rw-rw----. 1 nobody nobody 8 Nov 30 12:40 /tmp/test
# cat /tmp/test
cat: /tmp/test: Permission denied
```
The **\--uidmap** setting above tells Podman to map a range of 5000 UIDs inside the container, starting with UID 100000 outside the container (so the range is 100000-104999) to a range starting at UID 0 inside the container (so the range is 0-4999). Inside the container, if my process is running as UID 1, it is 100001 on the host
Since the real UID=0 is not mapped into the container, any file owned by root will be treated as owned by nobody. Even if the process inside the container has **CAP_DAC_OVERRIDE** , it can't override this protection. **DAC_OVERRIDE** enables root processes to read/write any file on the system, even if the process was not owned by root or world readable or writable.
User namespace capabilities are not the same as capabilities on the host. They are namespaced capabilities. This means my container root has capabilities only within the container—really only across the range of UIDs that were mapped into the user namespace. If a container process escaped the container, it wouldn't have any capabilities over UIDs not mapped into the user namespace, including UID=0. Even if the processes could somehow enter another container, they would not have those capabilities if the container uses a different range of UIDs.
Note that SELinux and other technologies also limit what would happen if a container process broke out of the container.
### Using `podman top` to show user namespaces
We have added features to **podman top** to allow you to examine the usernames of processes running inside a container and identify their real UIDs on the host.
Let's start by running a sleep container using our UID mapping.
```
$ sudo podman run --uidmap 0:100000:5000 -d fedora sleep 1000
```
Now run **podman top** :
```
$ sudo podman top --latest user huser
USER   HUSER
root   100000
$ ps -ef | grep sleep
100000   21821 21809  0 08:04 ?         00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000
```
Notice **podman top** reports that the user process is running as root inside the container but as UID 100000 on the host (HUSER). Also the **ps** command confirms that the sleep process is running as UID 100000.
Now let's run a second container, but this time we will choose a separate UID map starting at 200000.
```
$ sudo podman run --uidmap 0:200000:5000 -d fedora sleep 1000
$ sudo podman top --latest user huser
USER   HUSER
root   200000
$ ps -ef | grep sleep
100000   21821 21809  0 08:04 ?         00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000
200000   23644 23632  1 08:08 ?         00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000
```
Notice that **podman top** reports the second container is running as root inside the container but as UID=200000 on the host.
Also look at the **ps** command—it shows both sleep processes running: one as 100000 and the other as 200000.
This means running the containers inside separate user namespaces gives you traditional UID separation between processes, which has been the standard security tool of Linux/Unix from the beginning.
### Problems with user namespaces
For several years, I've advocated user namespace as the security tool everyone wants but hardly anyone has used. The reason is there hasn't been any filesystem support or a shifting file system.
In containers, you want to share the **base** image between lots of containers. The examples above use the Fedora base image in each example. Most of the files in the Fedora image are owned by real UID=0. If I run a container on this image with the user namespace 0:100000:5000, by default it sees all of these files as owned by nobody, so we need to shift all of these UIDs to match the user namespace. For years, I've wanted a mount option to tell the kernel to remap these file UIDs to match the user namespace. Upstream kernel storage developers continue to investigate and make progress on this feature, but it is a difficult problem.
Podman can use different user namespaces on the same image because of automatic [chowning][6] built into [containers/storage][7] by a team led by Nalin Dahyabhai. Podman uses containers/storage, and the first time Podman uses a container image in a new user namespace, container/storage "chowns" (i.e., changes ownership for) all files in the image to the UIDs mapped in the user namespace and creates a new image. Think of this as the **fedora:0:100000:5000** image.
When Podman runs another container on the image with the same UID mappings, it uses the "pre-chowned" image. When I run the second container on 0:200000:5000, containers/storage creates a second image, let's call it **fedora:0:200000:5000**.
Note if you are doing a **podman build** or **podman commit** and push the newly created image to a container registry, Podman will use container/storage to reverse the shift and push the image with all files chowned back to real UID=0.
This can cause a real slowdown in creating containers in new UID mappings since the **chown** can be slow depending on the number of files in the image. Also, on a normal [OverlayFS][8], every file in the image gets copied up. The normal Fedora image can take up to 30 seconds to finish the chown and start the container.
Luckily, the Red Hat kernel storage team, primarily Vivek Goyal and Miklos Szeredi, added a new feature to OverlayFS in kernel 4.19. The feature is called **metadata only copy-up**. If you mount an overlay filesystem with **metacopy=on** as a mount option, it will not copy up the contents of the lower layers when you change file attributes; the kernel creates new inodes that include the attributes with references pointing at the lower-level data. It will still copy up the contents if the content changes. This functionality is available in the Red Hat Enterprise Linux 8 Beta, if you want to try it out.
This means container chowning can happen in a couple of seconds, and you won't double the storage space for each container.
This makes running containers with tools like Podman in separate user namespaces viable, greatly increasing the security of the system.
### Going forward
I want to add a new flag, like **\--userns=auto** , to Podman that will tell it to automatically pick a unique user namespace for each container you run. This is similar to the way SELinux works with separate multi-category security (MCS) labels. If you set the environment variable **PODMAN_USERNS=auto** , you won't even need to set the flag.
Podman is finally allowing users to run containers in separate user namespaces. Tools like [Buildah][9] and [CRI-O][10] will also be able to take advantage of user namespaces. For CRI-O, however, Kubernetes needs to understand which user namespace will run the container engine, and the upstream is working on that.
In my next article, I will explain how to run Podman as non-root in a user namespace.
--------------------------------------------------------------------------------
via: https://opensource.com/article/18/12/podman-and-user-namespaces
作者:[Daniel J Walsh][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/rhatdan
[b]: https://github.com/lujun9972
[1]: https://podman.io/
[2]: https://github.com/containers/libpod
[3]: https://opensource.com/article/18/10/podman-more-secure-way-run-containers
[4]: http://man7.org/linux/man-pages/man7/user_namespaces.7.html
[5]: https://chmodcommand.com/chmod-660/
[6]: https://en.wikipedia.org/wiki/Chown
[7]: https://github.com/containers/storage
[8]: https://en.wikipedia.org/wiki/OverlayFS
[9]: https://buildah.io/
[10]: http://cri-o.io/

View File

@ -0,0 +1,146 @@
[#]: collector: (lujun9972)
[#]: translator: (wxy)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Podman and user namespaces: A marriage made in heaven)
[#]: via: (https://opensource.com/article/18/12/podman-and-user-namespaces)
[#]: author: (Daniel J Walsh https://opensource.com/users/rhatdan)
Podman 和用户名字空间:天作之合
======
> 了解如何使用 Podman 在单独的用户空间运行容器。
![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/architecture_structure_planning_design_.png?itok=KL7dIDct)
[Podman][1] 是 [libpod][2] 库的一部分,使用户能够管理 pod、容器和容器镜像。在我的上一篇文章中我写过 [Podman 作为一种更安全的运行容器的方式][3]。在这里,我将解释如何使用 Podman 在单独的用户命名空间中运行容器。
我一直在思考<ruby>[用户命名空间][4]<rt>user namespace</rt></ruby>,它主要是由 Red Hat 的 Eric Biederman 开发的作为分离容器的一个很棒的功能。用户命名空间允许你指定用于运行容器的用户标识符UID和组标识符GID映射。这意味着你可以在容器内运行 UID 0在容器外运行 UID 100000。如果容器进程逃逸出了容器内核会将它们视为 UID 100000。不仅如此任何未映射到用户命名空间的 UID 所拥有的任何文件对象都将被视为 `nobody` 所拥有65534`kernel.overflowuid`),并且不允许容器进程访问,除非该对象可由“其他人”访问(世界可读/可写)。
如果你拥有一个权限为 [660][5] 的属主为“真实” `root` 的文件,并且用户命名空间中的容器进程尝试读取它,则会阻止它们访问它,并且会将该文件视为 `nobody` 所拥有。
### 示例
以下是它是如何工作的。首先,我在 `root` 拥有的系统中创建一个文件。
```
$ sudo bash -c "echo Test > /tmp/test"
$ sudo chmod 600 /tmp/test
$ sudo ls -l /tmp/test
-rw-------. 1 root root 5 Dec 17 16:40 /tmp/test
```
接下来,我将该文件卷挂载到一个使用用户命名空间映射 `0:100000:5000` 运行的容器中。
```
$ sudo podman run -ti -v /tmp/test:/tmp/test:Z --uidmap 0:100000:5000 fedora sh
# id
uid=0(root) gid=0(root) groups=0(root)
# ls -l /tmp/test
-rw-rw----. 1 nobody nobody 8 Nov 30 12:40 /tmp/test
# cat /tmp/test
cat: /tmp/test: Permission denied
```
上面的 `--uidmap` 设置告诉 Podman 在容器内映射一系列 5000 个 UID从容器外的 UID 100000开始因此范围是 100000-104999到容器内 UID 0 开始的范围(所以范围是 0-4999。在容器内部如果我的进程以 UID 1 运行,则它在主机上为 100001。
由于实际的 `UID=0` 未映射到容器中,因此 `root` 拥有的任何文件都将被视为 `nobody` 所拥有。即使容器内的进程具有 `CAP_DAC_OVERRIDE`,也无法覆盖此种保护。`DAC_OVERRIDE` 使根进程能够读/写系统上的任何文件,即使进程不是 `root` 用户拥有,也不是全局可读或可写的。
用户命名空间功能与主机上的功能不同。它们是命名空间功能。这意味着我的容器根只在容器内具有功能,实际上只在映射到用户命名空间的 UID 范围内。如果容器进程逃逸了容器,则它将没有任何功能而不是映射到用户命名空间的 UID包括 UID=0。即使进程可能以某种方式进入另一个容器如果容器使用不同范围的 UID它们也不具备这些功能。
请注意SELinux 和其他技术还限制了容器进程破开容器时会发生的情况。
### 使用 podman top 来显示用户名字空间
我们在 `podman top` 中添加了一些功能,允许你检查容器内运行的进程的用户名,并在主机上标识它们的真实 UID。
让我们首先使用我们的 UID 映射运行一个 `sleep` 容器。
```
$ sudo podman run --uidmap 0:100000:5000 -d fedora sleep 1000
```
现在运行 `podman top`
```
$ sudo podman top --latest user huser
USER   HUSER
root   100000
$ ps -ef | grep sleep
100000   21821 21809  0 08:04 ?         00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000
```
注意 `podman top` 报告用户进程在容器内以 `root` 身份运行,但在主机(`HUSER`)上以 UID 100000 运行。此外,`ps` 命令确认 `sleep` 过程以 UID 100000 运行。
现在让我们运行第二个容器,但这次我们将选择一个单独的 UID 映射,从 200000 开始。
```
$ sudo podman run --uidmap 0:200000:5000 -d fedora sleep 1000
$ sudo podman top --latest user huser
USER   HUSER
root   200000
$ ps -ef | grep sleep
100000   21821 21809  0 08:04 ?         00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000
200000   23644 23632  1 08:08 ?         00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000
```
请注意,`podman top` 报告第二个容器在容器内以 `root` 身份运行,但主机上的 UID=200000。
另请参阅 `ps` 命令,它显示两个 `sleep` 进程都在运行:一个为 100000另一个为 200000。
这意味着在单独的用户命名空间内运行容器可以在进程之间进行传统的 UID 分离,这从一开始就是 Linux/Unix 的标准安全工具。
### 用户名字空间的问题
几年来,我一直主张用户命名空间应该作为每个人应该有的安全工具,但几乎没有人使用过。原因是没有任何文件系统支持或转移文件系统。
在容器中,你希望在许多容器之间共享**基本**镜像。上面的示例在每个示例中使用 Fedora 基本镜像。Fedora 镜像中的大多数文件都由实际的 UID=0 拥有。如果我使用用户名称空间 0:100000:5000 在此镜像上运行容器,默认情况下它会将所有这些文件视为 `nobody` 所拥有,因此我们需要移动所有这些 UID 以匹配用户名称空间。多年来,我想要一个挂载选项来告诉内核重新映射这些文件 UID 以匹配用户命名空间。上游内核存储开发人员继续调查并在此功能上取得进展,但这是一个难题。
Podman 可以在同一镜像上使用不同的用户名称空间,是由于自动 [chown][6] 内置于由 Nalin Dahyabhai 领导的团队开发的[容器/存储][7]中。Podman使用容器/存储Podman 第一次在新的用户命名空间中使用容器镜像,容器/存储 “chowns”更改所有权镜像中的所有文件到用户命名空间中映射的 UID 并创建新镜像。可以把它想象成 `fedora:0:100000:5000` 镜像。
当 Podman 在具有相同 UID 映射的镜像上运行另一个容器时它使用“预先设置所有权”的图像。当我在0:200000:5000 上运行第二个容器时,容器/存储会创建第二个镜像,我们称之为 `fedora:0:200000:5000`
请注意,如果你正在执行 `podman build``podman commit` 并将新创建的镜像推送到容器注册表Podman 将使用容器/存储来反转移位并将所有文件推回到实际 UID=0 的镜像。
这可能会导致在新的 UID 映射中创建容器时出现真正的减速,因为 `chown` 可能会很慢,具体取决于镜像中的文件数。此外,在普通 [OverlayFS][8] 上,镜像中的每个文件都会被复制。正常的 Fedora 镜像最多可能需要 30 秒才能完成 `chown` 并启动容器。
幸运的是Red Hat 内核存储团队(主要是 Vivek Goyal 和 Miklos Szeredi在内核 4.19 中为 OverlayFS 添加了一项新功能。该功能称为“仅元数据复制”。如果使用 `metacopy=on` 挂载覆盖文件系统作为挂载选项,则在更改文件属性时,它不会复制较低层的内容;内核创建新的 inode其中包含引用指向较低级别数据的属性。如果内容发生变化它仍会复制内容。如果你想试用它可以在 Red Hat Enterprise Linux 8 Beta 中使用此功能。
这意味着容器 `chown` 可能在几秒钟内发生,并且你不会将每个容器的存储空间加倍。
这使得像 Podman 这样的工具在不同的用户命名空间中运行容器是可行的,大大提高了系统的安全性。
### 前瞻
我想向 Podman 添加一个新标志,比如 `--userns=auto`,它会告诉它为你运行的每个容器自动选择一个唯一的用户命名空间。这类似于 SELinux 与单独的多类别安全MCS标签一起使用的方式。如果设置环境变量 `PODMAN_USERNS=auto`,则甚至不需要设置标志。
Podman 最终允许用户在不同的用户名称空间中运行容器。像 [Buildah][9] 和 [CRI-O][10] 这样的工具也可以利用用户命名空间。但是,对于 CRI-OKubernetes 需要了解哪个用户命名空间将运行容器引擎,而上游正在处理它。
在我的下一篇文章中,我将解释如何在用户命名空间中将 Podman 作为非 root 用户运行。
--------------------------------------------------------------------------------
via: https://opensource.com/article/18/12/podman-and-user-namespaces
作者:[Daniel J Walsh][a]
选题:[lujun9972][b]
译者:[wxy](https://github.com/wxy)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/rhatdan
[b]: https://github.com/lujun9972
[1]: https://podman.io/
[2]: https://github.com/containers/libpod
[3]: https://opensource.com/article/18/10/podman-more-secure-way-run-containers
[4]: http://man7.org/linux/man-pages/man7/user_namespaces.7.html
[5]: https://chmodcommand.com/chmod-660/
[6]: https://en.wikipedia.org/wiki/Chown
[7]: https://github.com/containers/storage
[8]: https://en.wikipedia.org/wiki/OverlayFS
[9]: https://buildah.io/
[10]: http://cri-o.io/