Merge pull request #26267 from lkxed/20220628-Notes-on-running-containers-with-bubblewrap

[手动选题][tech]: 20220628 Notes on running containers with bubblewrap.md
This commit is contained in:
六开箱 2022-06-29 09:57:11 +08:00 committed by GitHub
commit de2ebdf222
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -0,0 +1,480 @@
[#]: subject: "Notes on running containers with bubblewrap"
[#]: via: "https://jvns.ca/blog/2022/06/28/some-notes-on-bubblewrap/"
[#]: author: "Julia Evans https://jvns.ca/"
[#]: collector: "lkxed"
[#]: translator: " "
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
Notes on running containers with bubblewrap
======
Hello! About a year ago I got mad about Docker container startup time. This was
because I was building an [nginx playground][1]
where I was starting a new “container” on every HTTP request, and so for it to
feel reasonably snappy, nginx needed to start quickly.
Also, I was running this project on a pretty small cloud machine (256MB RAM), a
small CPU, so I really wanted to avoid unnecessary overhead.
Ive been looking for a way to run containers faster since then, but I couldnt
find one until last week when I discovered
[bubblewrap][2]!! Its very fast and I
think its super cool, but I also ran into a bunch of fun problems that I
wanted to write down for my future self.
#### some disclaimers
* Im not sure if the way Im using bubblewrap in this post is maybe not how its intended to be used
* there are a lot of sharp edges when using bubblewrap in this way, you need to
think a lot about Linux namespaces and how containers work
* bubblewrap is a security tool but I am not a security person and I am only
doing this for weird tiny projects. you should definitely not take security
advice from me.
Okay, all of that said, lets talk about Im trying to use bubblewrap to run
containers fast and in a relatively secure way :)
#### Docker containers take ~300ms to start on my machine
I ran a quick benchmark to see how long a Docker container takes to run a
simple command (`ls` ). For both Docker and Podman, its about 300ms.
```
$ time docker run --network none -it ubuntu:20.04 ls / > /dev/null
Executed in 378.42 millis
$ time podman run --network none -it ubuntu:20.04 ls / > /dev/null
Executed in 279.27 millis
```
Almost all of this time is overhead from docker and podman just running `ls`
by itself takes about 3ms:
```
$ time ls / > /dev/null
Executed in 2.96 millis
```
I want to stress that, while Im not sure exactly what the slowest part of
Docker and podman startup time is (I spent 5 minutes trying to profile them and
gave up), Im 100% sure its something important.
The way were going to run containers faster with bubblewrap has a lot of
limitations and its a lower level interface which is a lot trickier to use.
#### goal 1: containers that start quickly
I felt like it *should* be possible to have containers that start essentially
instantly or at least in less than 5ms. My thought process:
* creating a new namespace with `unshare` is basically instant
* [containers are basically just a bunch of namespaces][3]
* whats the problem?
#### container startup time is (usually) not that important
Most of the time when people are using containers, theyre running some
long-running process inside the container like a webserver, so it doesnt
really matter if it takes 300ms to start.
So it makes sense to me that there arent a lot of container tools that
optimize for startup time. But I still wanted to optimize for startup time :)
#### goal 2: run the containers as an unprivileged user
Another goal I had was to be able to run my containers as an unprivileged user
instead of root.
I was surprised the first time I learned that Docker actually runs containers
as root even though I run `docker run ubuntu:20.04` as an unprivileged user (`bork` ), that
message is actually sent to a daemon running as root, and the Docker container
process itself also runs as root (albeit a `root` thats stripped of all its
capabilities).
Thats fine for Docker (they have lots of very smart people making sure that
they get it right!), but if Im going to do container stuff *without* using
Docker (for the speed reasons mentioned above), Id rather not do it as root to
keep everything a bit more secure.
#### podman can run containers as an non-root user
Before we start talking about how to do weird stuff with bubblewrap, I want to
quickly talk about a much more normal tool to run containers: podman!
Podman, unlike Docker, can run containers as an unprivileged user!
If I run this from my normal user:
```
$ podman run -it ubuntu:20.04 ls
```
it doesnt secretly run as root behind the scenes! It just starts the container
as my normal user, and then uses something called “user namespaces” so that
*inside the container* I appear to be root.
The other cool thing aboud podman is that it has exactly the same interface as
Docker, so you can just take a Docker command and replace `docker` with
`podman` and itll Just Work. Ive found that sometimes I need to do some extra
work to get podman to work in practice, but its still pretty nice that it has
the same command line interface.
This “run containers as a non-root user” feature is normally called “rootless
containers”. (I find that name kind of counterintuitive, but thats what people call it)
#### failed attempt 1: write my own tool using runc
```
runc
```
I knew that Docker and podman use
[runc][4] under the hood, so I thought
well, maybe I can just use `runc` directly to make my own tool that starts
containers faster than Docker does!
I tried to do this 6 months ago and I dont remember most of the details, but basically
I spent 8 hours working on it, got frustrated because I couldnt get anything
to work, and gave up.
One specific detail I remember struggling with was setting up a working `/dev`
for my programs to use.
#### enter bubblewrap
Okay, that was a very long preamble so lets get to the point! Last week, I
discovered a tool called `bubblewrap` that was basically exactly the thing I
was trying to build with `runc` in my failed attempt, except that it actually
works and has many more features and its built by people who know things about
security! Hooray!
The interface to bubblewrap is pretty different than the interface to Docker
its much lower level. Theres no concept of a container image instead you
map a bunch of directories on your host to directories in the container.
For example, heres how to run a container with the same root directory as your
host operating system, but with only read access to that root directory, and only write access to `/tmp`.
```
bwrap \
--ro-bind / / \
--bind /tmp /tmp \
--proc /proc --dev /dev \
--unshare-pid \
--unshare-net \
bash
```
For example, you could imagine running some untrusted process under bubblewrap
this way and then putting all the files you the process to access in `/tmp`.
#### bubblewrap runs containers as an unprivileged (non-root) user
Like podman, bubblewrap runs containers as a non-root user, using user
namespaces. It can also run containers as root, but in this post were just
going to be talking about using it as an unprivileged user.
#### bubblewrap is fast
Lets see how long it takes to run `ls` in a bubblewrap container!
```
$ time bwrap --ro-bind / / --proc /proc --dev /dev --unshare-pid ls /
Executed in 8.04 millis
```
Thats a big difference! 8ms is a lot faster than 279ms.
Of course, like we said before, the reason bubblewrap is faster is that it does
a lot less. So lets talk about some things bubblewrap doesnt do.
#### some things bubblewrap doesnt do
Here are some things that Docker/podman do that bubblewrap doesnt do:
* set up overlayfs mounts for you, so that your changes to the filesystem dont affect the base image
* set up networking bridges so that you can connect to a webserver inside the container
* probably a bunch more stuff that Im not thinking of
In general, bubblewrap is a much lower level tool than something like Docker.
Also, bubblewrap seems to have pretty different goals than Docker the README
seems to say that its intended as a tool for sandboxing desktop software (I
think it comes from [flatpak][5]).
#### running a container image with bubblewrap
I couldnt find instructions for running a Docker container image with
bubblewrap, so here they are. Basically I just use Docker to download the
container image and put it into a directory and then run it with `bwrap` :
Theres also a tool called [bwrap-oci][6] which looks cool but I
couldnt get it to compile.
```
mkdir rootfs
docker export $(docker create frapsoft/fish) | tar -C rootfs -xf -
bwrap \
--bind $PWD/rootfs / \
--proc /proc --dev /dev \
--uid 0 \
--unshare-pid \
--unshare-net \
fish
```
One important thing to note is that this doesnt create a temporary overlay
filesystem for the containers file writes, so itll let the container edit
files in the image.
I wrote a post about [overlay filesystems][7] if
you want to see how you could do that yourself though.
#### running “containers” with bubblewrap isnt the same as with podman
I just gave an example of how to “run a container” with bubblewrap, and you
might think “cool, this is just like podman but faster!”. It is not, and its
actually unlike using podman in even more ways than I expected.
I put “container” in scare quotes because there are two ways to define “container”:
* something that implements [OCI runtime specification][8]
* any way of running a process in a way thats somehow isolated from the host system
bubblewrap is a “container” tool in the second sense. It definitely provides
isolation, and it does that using the same features Linux namespaces as
Docker.
But its not a container tool in the first sense. And its a lower level tool
so you can get into a bunch of weird states and you really need to think about
all the weird details of how container work while using it.
For the rest of the post Im going to talk about some weird things that can
happen with bubblewrap that would not happen with podman/Docker.
#### weird thing 1: processes that dont exist
Heres an example of a weird situation I got into with bubblewrap that confused
me for a minute:
```
$ bwrap --ro-bind / / --unshare-all bash
$ ps aux
... some processes
root 390073 0.0 0.0 2848 124 pts/9 S 14:28 0:00 bwrap --ro-bind / / --unshare-all --uid 0 bash
... some other processes
$ kill 390073
bash: kill: (390073) - No such process
$ ps aux | grep 390073
root 390073 0.0 0.0 2848 124 pts/9 S 14:28 0:00 bwrap --ro-bind / / --unshare-all --uid 0 bash
```
Heres what happened
* I started a bash shell inside bubblewrap
* I ran `ps aux`, and saw a process with PID `390073`
* I try to kill the process. It fails with the error `no such process`. What?
* I ran `ps aux`, and still see the process with PID `390073`
Whats going on? Why doesnt the process `390073` exist, even though `ps` says it does? Isnt that impossible?
Well, the problem is that `ps` doesnt actually list all the processes in your
current PID namespace. Instead, it iterates through all the entries in `/proc`
and prints those out. Usually, whats in `/proc` is actually the same as the processes on your system.
But with Linux containers these things can get out of sync. Whats happening in
this example is that we have the `/proc` from the host PID namespace, but those
arent actually the processes that we have access to in our PID namespace.
Passing `--proc /proc` to bwrap fixes the issue `ps` then actually lists the correct processes.
```
$ bwrap --ro-bind / / --unshare-all --dev /dev --proc /proc ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
bork 1 0.0 0.0 3644 136 ? S+ 16:21 0:00 bwrap --ro-bind / / --unshare-all --dev /dev --proc /proc ps au
bork 2 0.0 0.0 21324 1552 ? R+ 16:21 0:00 ps aux
```
Just 2 processes! Everything is normal!
#### weird thing 2: trying to listen on port 80
Passing `--uid 0` to bubblewrap makes the user inside the container `root`. You
might think that this means that the root user has administrative privileges
inside the container, but thats not true!
For example, lets try to listen on port 80:
```
$ bwrap --ro-bind / / --unshare-all --uid 0 nc -l 80
nc: Permission denied
```
Whats going on here is that the new root user actually doesnt have the
**capabilities** it needs to listen on port 80. (you need special permissions
to listen on ports less than 1024, and 80 is less than 1024)
Theres actually a capability specifically for listening on privileged ports
called `CAP_NET_BIND_SERVICE`.
So to fix this all we need to do is to tell bubblewrap to give our user that
capability.
```
$ bwrap --ro-bind / / --unshare-all --uid 0 --cap-add cap_net_bind_service nc -l 80
(no output, success!!!)
```
This works! Hooray!
#### finding the right capabilities is pretty annoying
bubblewrap doesnt give out any capabilities by default, and I find that
figuring out all the right capabilities and adding them manually is kind of
annoying. Basically my process is
* run the thing
* see what fails
* read `man capabilities` to figure out what capabilities Im missing
* add the capability with `--cap-add`
* repeat until everything is running
But thats the price I pay for wanting things to be fast I guess :)
#### weird thing 2b: --dev /dev makes listening on privileged ports not work
```
--dev /dev
```
One other strange thing is that if I take the exact same command above (which
worked!) and add `--dev /dev` (to set up the `/dev/` directory), it causes it to not work again:
```
$ bwrap --ro-bind / / --dev /dev --unshare-all --uid 0 --cap-add cap_net_bind_service nc -l 80
nc: Permission denied
```
I think this might be a bug in bubblewrap, but I havent mustered the courage
to dive into the bubblewrap code and start investigating yet. Or maybe theres
something obvious Im missing!
#### weird thing 3: UID mappings
Another slightly weird thing was I tried to run `apt-get update` inside a bubblewrap Ubuntu container and everything went very poorly.
Heres how I ran `apt-get update` inside the Ubuntu container:
```
mkdir rootfs
docker export $(docker create ubuntu:20.04) | tar -C rootfs -xf -
bwrap \
--bind $PWD/rootfs / \
--proc /proc\
--uid 0 \
--unshare-pid \
apt-get update
```
And here are the error messages:
```
E: setgroups 65534 failed - setgroups (1: Operation not permitted)
E: setegid 65534 failed - setegid (22: Invalid argument)
E: seteuid 100 failed - seteuid (22: Invalid argument)
E: setgroups 0 failed - setgroups (1: Operation not permitted)
.... lots more similar errors
```
At first I thought “ok, this is a capabilities problem, I need to set
`CAP_SETGID` or something to give the container permission to change groups. But I did that and it didnt help at all!
I think whats going on here is a problem with UID maps. What are UID maps?
Well, every time you run a container using “user namespaces” (which podman is
doing), it creates a mapping of UIDs inside the container to UIDs on the host.
Lets look that the UID maps! Heres how do that:
````
[[email protected]][9]:/# cat /proc/self/uid_map
0 1000 1
[[email protected]][10]:/# cat /proc/self/gid_map
1000 1000 1
```
This is saying that user 0 in the container is mapped to user 1000 on in the
host, and group 1000 is mapped to group 1000. (My normal user's UID/GID is 1000, so this makes sense). You can find out
about this `uid_map` file in `man user_namespaces`.
All other users/groups that aren't 1000 are mapped to user 65534 by default, according
to `man user_namespaces`.
### what's going on: non-mapped users can't be used
The only users and groups that have been mapped are `0` and `1000`. But `man user_namespaces` says:
> After the uid_map and gid_map files have been written, only the mapped values may be used in system calls that change user and group IDs.
`apt` is trying to use users 100 and 65534. Those aren't on the list of mapped
users! So they can't be used!
This works fine in podman, because podman sets up its UID and GID mappings differently:
```
$ podman run -it ubuntu:20.04 bash
[[email protected]][11]:/# cat /proc/self/uid_map
0 1000 1
1 100000 65536
[[email protected]][12]:/# cat /proc/self/gid_map
0 1000 1
1 100000 65536
```
All the users get mapped, not just 1000.
I dont quite know how to fix this, but I think its probably possible in
bubblewrap to set up the uid mappings the same way as podman does theres an
[issue about it here that links to a workaround][13].
But this wasnt an actual problem I was trying to solve so I didnt dig further
into it.
#### it works pretty great!
Ive talked about a bunch of issues, but the things Ive been trying to do in bubblewrap
have been very constrained and its actually been pretty simple. For example, I
was working on a git project where I really just want to run `git` inside a
container and map a git repository from the host.
Thats very simple to get to work with bubblewrap! There were basically no weird problems!
Its really fast!
So Im pretty excited about this tool and I might use it for more stuff in the
future.
--------------------------------------------------------------------------------
via: https://jvns.ca/blog/2022/06/28/some-notes-on-bubblewrap/
作者:[Julia Evans][a]
选题:[lkxed][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://jvns.ca/
[b]: https://github.com/lkxed
[1]: https://jvns.ca/blog/2021/09/24/new-tool--an-nginx-playground/
[2]: https://github.com/containers/bubblewrap
[3]: https://jvns.ca/blog/2016/10/10/what-even-is-a-container/
[4]: https://github.com/opencontainers/runc
[5]: https://flatpak.org/
[6]: https://github.com/projectatomic/bwrap-oci
[7]: https://jvns.ca/blog/2019/11/18/how-containers-work--overlayfs/
[8]: https://opencontainers.org/about/overview/
[9]: https://jvns.ca/cdn-cgi/l/email-protection
[10]: https://jvns.ca/cdn-cgi/l/email-protection
[11]: https://jvns.ca/cdn-cgi/l/email-protection
[12]: https://jvns.ca/cdn-cgi/l/email-protection
[13]: https://github.com/containers/bubblewrap/issues/468