translated by @haoqixu

This commit is contained in:
xu0o0 2017-08-23 14:08:20 +08:00
parent 300e2b4378
commit f65abb33c4
2 changed files with 364 additions and 350 deletions

View File

@ -1,350 +0,0 @@
【haoqixu翻译中】How Linux containers have evolved
============================================================
### Containers have come a long way in the past few years. We walk through the timeline.
![How Linux containers have evolved](https://opensource.com/sites/default/files/styles/image-full-size/public/images/business/containers_2015-3-osdc-lead.png?itok=yOKCc8ze "How Linux containers have evolved")
Image credits: [Daniel Ramirez][11]. [CC BY-SA 4.0][12]
In the past few years, containers have become a hot topic among not just developers, but also enterprises. This growing interest has created an increased need for security improvements and hardening, as well as preparation for scalability and interoperability. All of that has required a lot of engineering, and here's the story of how much of that engineering has happened at an enterprise level at Red Hat.
When I first met up with representatives from Docker Inc. (Docker.io) in the fall of 2013, we were looking at how to make Red Hat Enterprise Linux (RHEL) use Docker containers. (Part of the Docker project has since been rebranded as  _Moby_ .) We had several problems getting this technology into RHEL. The first big hurdle was getting a supported Copy On Write (COW) file system to handle container image layering. Red Hat ended up contributing a few COW implementations, including [Device Mapper][13], [btrfs][14], and the first version of [OverlayFS][15]. For RHEL, we defaulted to Device Mapper, although we are getting a lot closer on OverlayFS support.
Linux Containers
* [What are Linux containers?][1]
* [What is Docker?][2]
* [What is Kubernetes?][3]
* [An introduction to container terminology][4]
The next major hurdle was the tooling to launch the container. At that time, upstream docker was using [LXC][16] tools for launching containers, and we did not want to support the LXC tool set in RHEL. Prior to working with upstream docker, I had been working with the [libvirt][17] team on a tool called [virt-sandbox][18], which used **libvirt-lxc** for launching containers.
At the time, some people at Red Hat thought swapping out the LXC tools and adding a bridge so the Docker daemon would communicate with libvirt using **libvirt-lxc** to launch containers was a good idea. There were serious concerns with this approach. Consider the following example of starting a container with the Docker client (**docker-cli**) and the layers of calls before the container process (**pid1OfContainer**) is started:
**docker-cli → docker-daemon → libvirt-lxc → pid1OfContainer**
I did not like the idea of having two daemons between your tool to launch containers and the final running container.
My team worked hard with the upstream docker developers on a native [Go programming language][19] implementation of the container runtime, called [libcontainer][20]. This library eventually got released as the initial implementation of the [OCI Runtime Specification][21], along with runc.
**docker-cli → docker-daemon @ pid1OfContainer**
Although most people mistakenly think that when they execute a container, the container process is a child of the **docker-cli**, they actually have executed a client/server operation and the container process is running as a child of a totally separate environment. This client/server operation can lead to instability and potential security concerns, and it blocks useful features. For example, [systemd][22] has a feature called socket activation, where you can set up a daemon to run only when a process connects to a socket. This means your system uses less memory and only has services executing when they are needed. The way socket activation works is systemd listens at a TCP socket, and when a packet arrives for the socket, systemd activates the service that normally listens on the socket. Once the service is activated, systemd hands the socket to the newly started daemon. Moving this daemon into a Docker-based container causes issues. The unit file would start the container using the Docker CLI and there was no easy way for systemd to pass the connected socket to the Docker daemon through the Docker CLI.
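To make socket activation concrete, here is a minimal sketch of a socket-activated service pair; the unit names and the daemon path are purely illustrative, and the daemon is assumed to support receiving a pre-opened socket from systemd:
```
# Hedged sketch: systemd owns the listening socket and starts the daemon on first connect.
cat > /etc/systemd/system/echo.socket <<'EOF'
[Socket]
ListenStream=7777

[Install]
WantedBy=sockets.target
EOF

cat > /etc/systemd/system/echo.service <<'EOF'
[Service]
# systemd hands the already-listening socket to this process when it starts it
ExecStart=/usr/bin/my-echo-daemon
EOF

systemctl daemon-reload
systemctl enable --now echo.socket   # nothing runs until the first client connects
```
The point is that systemd, not the daemon, owns the listening socket, which is exactly what the Docker CLI/daemon split gets in the way of.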
Problems like this made us realize that we needed alternate ways to run containers.
### The container orchestration problem
The upstream docker project made using containers easy, and it continues to be a great tool for learning about Linux containers. You can quickly experience launching a container by running a simple command like **docker run -ti fedora sh** and instantly you are in a container.
The real power of containers comes about when you start to run many containers simultaneously and hook them together into a more powerful application. The problem with setting up a multi-container application is the complexity quickly grows and wiring it up using simple Docker commands falls apart. How do you manage the placement or orchestration of container applications across a cluster of nodes with limited resources? How does one manage their lifecycle, and so on?
At the first DockerCon, at least seven different companies/open source projects showed how you could orchestrate containers. Red Hat's [OpenShift][23] had a project called [geard][24], loosely based on OpenShift v2 containers (called "gears"), which we were demonstrating. Red Hat decided that we needed to re-look at orchestration and maybe partner with others in the open source community.
Google was demonstrating [Kubernetes][25] container orchestration based on all of the knowledge Google had developed in orchestrating their own internal architecture. OpenShift decided to drop our Gear project and start working with Google on Kubernetes. Kubernetes is now one of the largest community projects on GitHub.
#### Kubernetes
Kubernetes was developed to use Google's [lmctfy][26] container runtime. Lmctfy was ported to work with Docker during the summer of 2014. Kubernetes runs a daemon on each node in the Kubernetes cluster called a [kubelet][27]. This means the original Kubernetes with Docker 1.8 workflow looked something like:
**kubelet → dockerdaemon @ PID1**
Back to the two-daemon system.
But it gets worse. With every release of Docker, Kubernetes broke. Docker 1.10 switched the backing store, causing all images to be rebuilt. Docker 1.11 started using **runc** to launch containers:
**kubelet → dockerdaemon @ runc @PID1**
Docker 1.12 added a container daemon to launch containers. Its main purpose was to satisfy Docker Swarm (a Kubernetes competitor):
**kubelet → dockerdaemon → containerd @runc @ pid1**
As was stated previously,  _every_  Docker release has broken Kubernetes functionality, which is why Kubernetes and OpenShift require us to ship older versions of Docker for their workloads.
Now we have a three-daemon system, where if anything goes wrong on any of the daemons, the entire house of cards falls apart.
### Toward container standardization
### CoreOS, rkt, and the alternate runtime
Due to the issues with the Docker runtime, several organizations were looking at alternative runtimes. One such organization was CoreOS. CoreOS had offered an alternative container runtime to upstream docker, called  _rkt_  (rocket). They also introduced a standard container specification called  _appc_  (App Container). Basically, they wanted to get everyone to use a standard specification for how you store applications in a container image bundle.
This threw up red flags. When I first started working on containers with upstream docker, my biggest fear was that we would end up with multiple specifications. I did not want an RPM vs. Debian-like war to affect the next 20 years of shipping Linux software. One good outcome from the appc introduction was that it convinced upstream docker to work with the open source community to create a standards body called the [Open Container Initiative][28] (OCI).
The OCI has been working on two specifications:
**[OCI Runtime Specification][6]**: The OCI Runtime Specification "aims to specify the configuration, execution environment, and lifecycle of a container." It defines what a container looks like on disk, the JSON file that describes the application(s) that will run within the container, and how to spawn and execute the container. Upstream docker contributed the libcontainer work and built runc as a default implementation of the OCI Runtime Specification.
**[OCI Image Format Specification][7]**: The Image Format Specification is based mainly on the upstream docker image format and defines the actual container image bundle that sits at container registries. This specification allows application developers to standardize on a single format for their applications. Some of the ideas described in appc, although it still exists, have been added to the OCI Image Format Specification. Both of these OCI specifications are nearing 1.0 release. Upstream docker has agreed to support the OCI Image Specification once it is finalized. Rkt now supports running OCI images as well as traditional upstream docker images.
The Open Container Initiative, by providing a place for the industry to standardize around the container image and the runtime, has helped free up innovation in the areas of tooling and orchestration.
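As a minimal sketch of what the Runtime Specification describes on disk, here is one way to assemble and run an OCI runtime bundle by hand; the paths are illustrative, and docker export is just one convenient way to get a rootfs:
```
mkdir -p /tmp/mycontainer/rootfs
# Populate a rootfs from any existing image (docker export is one option among many).
docker export $(docker create fedora) | tar -x -C /tmp/mycontainer/rootfs
cd /tmp/mycontainer
runc spec        # writes a default config.json, the JSON the Runtime Specification defines
runc run demo    # spawns the container process directly, with no daemon in between
```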
### Abstracting the runtime interface
One of the innovations taking advantage of this standardization is in the area of Kubernetes orchestration. As a big supporter of the Kubernetes effort, CoreOS submitted a bunch of patches to Kubernetes to add support for communicating and running containers via rkt in addition to the upstream docker engine. Google and upstream Kubernetes saw that adding these patches, and possibly adding new container runtime interfaces in the future, was going to complicate the Kubernetes code too much. The upstream Kubernetes team decided to implement an API protocol specification called the Container Runtime Interface (CRI). Then they would rework Kubernetes to call into the CRI rather than into the Docker engine, so anyone who wants to plug in a container runtime can just implement the server side of the CRI and support Kubernetes. Upstream Kubernetes created a large test suite for CRI developers to test against to prove they could service Kubernetes. There is an ongoing effort to remove all of the Docker-engine calls from Kubernetes and put them behind a shim called the docker-shim.
### Innovations in container tooling
### Container registry innovations with skopeo
A few years ago, we were working with the Project Atomic team on the [atomic CLI][29]. We wanted the ability to examine a container image while it sat on a container registry. At that time, the only way to look at the JSON data associated with a container image at a container registry was to pull the image to the local server and then use **docker inspect** to read the JSON files. These images can be huge, up to multiple gigabytes. Because we wanted to allow users to examine the images and decide not to pull them, we wanted to add a new **--remote** interface to **docker inspect**. Upstream docker rejected the pull request, telling us that they did not want to complicate the Docker CLI, and that we could easily build our own tooling to do the same.
My team, led by [Antonio Murdaca][30], ran with the idea and created [skopeo][31]. Antonio did not stop at just pulling the JSON file associated with the image—he decided to implement the entire protocol for pulling and pushing container images from container registries to/from the local host.
Skopeo is now used heavily within the atomic CLI for things such as checking for new updates for containers and inside of [atomic scan][32]. Atomic also uses skopeo for pulling and pushing images, instead of using the upstream docker daemon.
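As a quick, hedged example of what this enables (the image name is chosen only for illustration), skopeo can read an image's metadata straight off a registry and copy images around without any daemon:
```
# Inspect the JSON metadata without pulling multi-gigabyte layers first.
skopeo inspect docker://docker.io/library/fedora:latest
# Copy the image from a registry into a local directory, no docker daemon involved.
skopeo copy docker://docker.io/library/fedora:latest dir:/tmp/fedora-image
```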
### Containers/image
We had been talking to CoreOS about potentially using skopeo with rkt, and they said that they did not want to **exec** out to a helper application, but would consider using the library that skopeo used. We decided to split skopeo apart into a library and executable and created **[image][8]**.
The [containers/image][33] library and skopeo are used in several other upstream projects and cloud infrastructure tools. Skopeo and containers/image have evolved to support multiple storage backends in addition to Docker, can move container images between container registries, and offer many other useful features. A [nice thing about skopeo][34] is that it does not require any daemons to do its job. The breakout of the containers/image library has also allowed us to add enhancements such as [container image signing][35].
### Innovations in image handling and scanning
I mentioned the **atomic** CLI command earlier in this article. We built this tool to add features to containers that did not fit in with the Docker CLI, and things that we did not feel we could get into the upstream docker. We also wanted to allow flexibility to support additional container runtimes, tools, and storage as they developed. Skopeo is an example of this.
One feature we wanted to add to atomic was **atomic mount**. Basically we wanted to take content that was stored in the Docker image store (upstream docker calls this a graph driver), and mount the image somewhere, so that tools could examine the image. Currently if you use upstream docker, the only way to look at an image is to start the container. If you have untrusted content, executing code inside of the container to look at the image could be dangerous. The second problem with examining an image by starting it is that the tools to examine the container are probably not in the container image.
Most container image scanners seem to have the following pattern: They connect to the Docker socket, do a **docker save** to create a tarball, then explode the tarball on disk, and finally examine the contents. This is a slow operation.
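For illustration, that pattern looks roughly like this (paths are arbitrary):
```
# The slow path most scanners take: ask the daemon for a tarball, then unpack it.
docker save fedora -o /tmp/fedora.tar
mkdir -p /tmp/fedora-extracted
tar -xf /tmp/fedora.tar -C /tmp/fedora-extracted   # now scan the exploded contents
```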
With **atomic mount**, we wanted to go into the Docker graph driver and mount the image. If the Docker daemon was using device mapper, we would mount the device. If it was using overlay, we would mount the overlay. This is an incredibly quick operation and satisfies our needs. You can now do:
```
# atomic mount fedora /mnt
# cd /mnt
```
And start examining the content. When you are done, do a:
```
# atomic umount /mnt
```
We use this feature inside of **atomic scan**, which allows you to have some of the fastest container scanners around.
#### **Issues with tool coordination**
One big problem is that **atomic mount** is doing this under the covers. The Docker daemon does not know that another process is using the image. This could cause problems (for example, if you mounted the Fedora image above and then someone went and executed **docker rmi fedora**, the Docker daemon would fail weirdly when trying to remove the Fedora image saying it was busy). The Docker daemon could get into a weird state.
### Containers storage
To solve this issue, we started looking at pulling the graph driver code out of the upstream docker daemon into its own repository. The Docker daemon did all of its locking in memory for the graph driver. We wanted to move this locking into the file system so that we could have multiple distinct processes able to manipulate the container storage at the same time, without having to go through a single daemon process.
We created a project called [container/storage][36], which can do all of the COW features required for running, building, and storing containers, without requiring one process to control and monitor it (i.e., no daemon required). Now skopeo and other tools and projects can take advantage of the storage. Other open source projects have begun to use containers/storage, and at some point we would like to merge this project back into the upstream docker project.
### Undock and let's innovate
If you think about what happens when Kubernetes runs a container on a node with the Docker daemon, first Kubernetes executes a command like:
```
kubelet run nginx image=nginx
```
This command tells the kubelet to run the NGINX application on the node. The kubelet calls into the CRI and asks it to start the NGINX application. At this point, the container runtime that implemented the CRI must do the following steps:
1. Check local storage for a container named **nginx**. If not local, the container runtime will search for a standardized container image at a container registry.
2. If the image is not in local storage, download it from the container registry to the local system.
3. Explode the downloaded container image on top of container storage (usually a COW storage) and mount it up.
4. Execute the container using a standardized container runtime.
Let's look at the features described above:
1. OCI Image Format Specification defines the standard image format for images stored at container registries.
2. Containers/image is the library that implements all features needed to pull a container image from a container registry to a container host.
3. Containers/storage provides a library for exploding OCI image formats onto COW storage and lets you work with the image.
4. OCI Runtime Specification and **runc** provide tools for executing the containers (the same tool that the Docker daemon uses for running containers).
This means we can use these tools to implement the ability to use containers without requiring a big container daemon.
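Here is a hedged, end-to-end sketch of those four steps with no container daemon involved. umoci is not mentioned above; it is used here only as one example of a tool that can unpack an OCI layout into a runtime bundle, and the paths are illustrative:
```
# Steps 1-2: fetch the image from a registry into a local OCI layout.
skopeo copy docker://docker.io/library/nginx:latest oci:/tmp/nginx:latest
# Step 3: explode the OCI image into a runtime bundle (rootfs + config.json).
umoci unpack --image /tmp/nginx:latest /tmp/nginx-bundle
# Step 4: execute it with a standardized OCI runtime.
runc run -b /tmp/nginx-bundle nginx-demo
```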
In a moderate- to large-scale DevOps-based CI/CD environment, efficiency, speed, and security are important. And as long as your tools conform to the OCI specifications, then a developer or an operator should be using the best tools for automation through the CI/CD pipeline and into production. Most of the container tooling is hidden beneath orchestration or higher-up container platform technology. We envision a time in which runtime or image bundle tool selection perhaps becomes an installation option of the container platform.
### System (standalone) containers
On Project Atomic we introduced the **atomic host**, a new way of building an operating system in which the software can be "atomically" updated and most of the applications that run on it will be run as containers. Our goal with this platform is to prove that most software can be shipped in the future in OCI Image Format, and use standard protocols to get images from container registries and install them on your system. Providing software as container images allows you to update the host operating system at a different pace than the applications that run on it. The traditional RPM/yum/DNF way of distributing packages locks the applications to the life cycle of the host operating system.
One problem we see with shipping most of the infrastructure as containers is that sometimes you must run an application before the container runtime daemon is executing. Let's look at our Kubernetes example running with the Docker daemon: Kubernetes requires a network to be set up so that it can put its pods/containers into isolated networks. The default daemon we use for this currently is **[flanneld][9]**, which must be running before the Docker daemon is started in order to hand the Docker daemon the network interfaces to run the Kubernetes pods. Also, flanneld uses [**etcd**][37] for its data store. This daemon is required to be run before flanneld is started.
If we want to ship etcd and flanneld as container images, we have a chicken and egg situation. We need the container runtime daemon to start the containerized applications, but these applications need to be running before the container runtime daemon is started. I have seen several hacky setups to try to handle this situation, but none of them are clean. Also, the Docker daemon currently has no decent way to configure the priority order that containers start. I have seen suggestions on this, but they all look like the old SysVInit way of starting services (and we know the complexities that caused).
### systemd
One reason for replacing SysVInit with systemd was to handle the priority and ordering of starting services, so why not take advantage of this technology? In Project Atomic, we decided that we wanted to run containers on the host without requiring a container runtime daemon, especially for early boot. We enhanced the atomic CLI to allow you to install container images. If you execute **atomic install --system etcd**, it uses skopeo to go out to a container registry and pull down the etcd OCI image. Then it explodes (or expands) the image onto an OSTree backing store. Because we are running etcd in production, we treat the image as read-only. Next the **atomic** command grabs the systemd unit file template from the container image and creates a unit file on disk to start the image. The unit file actually uses **runc** to start the container on the host (although **runc** is not necessary).
Similar things happen if you execute **atomic install --system flanneld**, except this time the flanneld unit file specifies that it needs the etcd unit running before it starts.
When the system boots up, systemd ensures that etcd is running before flanneld, and that the container runtime is not started until after flanneld is started. This allows you to move the Docker daemon and Kubernetes into system containers. This means you can boot up an atomic host or a traditional rpm-based operating system that runs the entire container orchestration stack as containers. This is powerful because we know customers want to continue to patch their container hosts independently of these components. Furthermore, it keeps the host's operating system footprint to a minimum.
There even has been discussion about putting traditional applications into containers that can run either as standalone/system containers or as an orchestrated container. Consider an Apache container that you could install with the **atomic install --system httpd** command. This container image would be started the same way you start an rpm-based httpd service (**systemctl start httpd**, except httpd will be started in a container). The storage could be local, meaning /var/www from the host gets mounted into the container, and the container listens on the local network at port 80. This shows that you could run traditional workloads on a host inside of a container without requiring a container runtime daemon.
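Putting the pieces of this section together, here is a minimal sketch of the workflow; the commands are the ones named above, and it assumes the installed unit names follow the image names:
```
# Install early-boot services as system containers, no container runtime daemon needed.
atomic install --system etcd
atomic install --system flanneld
atomic install --system httpd        # a traditional workload, managed the same way
systemctl start etcd flanneld httpd  # systemd enforces the etcd -> flanneld ordering
```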
### Building container images
From my perspective, one of the saddest things about container innovation over the past four years has been the lack of innovation on mechanisms to build container images. A container image is nothing more than a tarball of tarballs and some JSON files. The base image of a container is a rootfs along with a JSON file describing the base image. Then as you add layers, the difference between the layers gets tarred up along with changes to the JSON file. These layers and the base file get tarred up together to form the container image.
Almost everyone is building with **docker build** and the Dockerfile format. Upstream docker stopped accepting pull requests to modify or improve the Dockerfile format and build process a couple of years ago. The Dockerfile played an important part in the evolution of containers. Developers or administrators/operators could build containers in a simple and straightforward manner; however, in my opinion, the Dockerfile is really just a poor man's bash script and creates several problems that have never been solved. For example:
* To build a container image, Dockerfile requires a Docker daemon to be running.
* No one has built standard tooling to create the OCI image outside of executing Docker commands.
* Even tools such as **ansible-containers** and OpenShift S2I (Source2Image) use **docker-engine** under the covers.
* Each line in a Dockerfile creates a new image, which helps in the development process of creating the container because the tooling is smart enough to know that the lines in the Dockerfile have not changed, so the existing images can be used and the lines do not need to be reprocessed. This can lead to a  _huge_  number of layers.
* Because of this, several people have requested mechanisms to squash the images, eliminating the layers. I think upstream docker has finally accepted something to satisfy the need.
* To pull content from secured sites to put into your container image, you often need some form of secrets. For example, you need access to the RHEL certificates and subscriptions in order to add RHEL content to an image.
* These secrets can end up in layers stored in the image. And the developer needs to jump through hoops to remove the secrets.
* To allow volumes to be mounted in during Docker build, we have added a **-v** volume switch to the projectatomic/docker package that we ship, but upstream docker has not accepted these patches.
* Build artifacts end up inside of the container image. So although Dockerfiles are great for getting started or building containers on a laptop while trying to understand the image you may want to build, they really are not an effective or efficient means to build images in a high-scaled enterprise environment. And behind an automated container platform, you shouldn't care if you are using a more efficient means to build OCI-compliant images.
### Undock with Buildah
At DevConf.cz 2017, I asked [Nalin Dahyabhai][38] on my team to look at building what I called **containers-coreutils**, basically, to use the containers/storage and containers/image libraries and build a series of command-line tools that could mimic the syntax of the Dockerfile. Nalin decided to call it [buildah][39], making fun of my Boston accent. With a few buildah primitives, you can build a container image:
* One of the main concepts of security is to keep the amount of content inside of an OS image as small as possible to eliminate unwanted tools. The idea is that a hacker might need tools to break through an application, and if the tools such as **gcc**, **make**, **dnf** are not present, the attacker can be stopped or confined.
* Because these images are being pulled and pushed over the internet, shrinking the size of the container is always a good idea.
* The way Docker build works is that commands to install or compile software have to run in the **buildroot** of the container.
* Executing the **run** command requires all of the executables to be inside of the container image. Just using **dnf** inside of the container image requires that the entire Python stack be present, even if you never use Python in the application.
* **ctr=$(buildah from fedora)**:
* Uses containers/image to pull the Fedora image from a container registry.
* Returns a container ID (**ctr**).
* **mnt=$(buildah mount $ctr)**:
* Mounts up the newly created container image (**$ctr**).
* Returns the path to the mount point.
* You can now use this mount point to write content.
* **dnf install httpd --installroot=$mnt**:
* You can use commands on the host to redirect content into the container, which means you can keep your secrets on the host, you don't have to put them inside of the container, and your build tools can be kept on the host.
* You don't need **dnf** inside of the container or the Python stack unless your application is going to use it.
* **cp foobar $mnt/dir**:
* You can use any command available in bash to populate the container.
* **buildah commit $ctr**:
* You can create a layer whenever you decide. You control the layers rather than the tool.
* **buildah config --env container=oci --entrypoint /usr/bin/httpd $ctr**:
* All of the commands available inside of Dockerfile can be specified.
* **buildah run $ctr dnf -y install httpd**:
* Buildah **run** is supported, but instead of relying on a container runtime daemon, buildah executes **runc** to run the command inside of a locked down container.
* **buildah build-using-dockerfile -f Dockerfile .**:
* And yes, buildah can build a container image using a Dockerfile.
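Chained together, those primitives make a small, hedged build script; the package, paths, and file names below are illustrative:
```
ctr=$(buildah from fedora)                     # pull the base image, get a working container
mnt=$(buildah mount "$ctr")                    # mount its rootfs on the host
dnf install -y --installroot="$mnt" httpd      # install from the host; no dnf or Python inside the image
cp index.html "$mnt"/var/www/html/             # populate content with ordinary host commands
buildah umount "$ctr"
buildah config --env container=oci --entrypoint /usr/bin/httpd "$ctr"
buildah commit "$ctr" my-httpd                 # create the layer when you decide, not when the tool does
```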
We want to move tools like **ansible-containers** and OpenShift S2I to use **buildah** rather than requiring a container runtime daemon.
Another big issue with building in the same container runtime that is used to run containers in production is that you end up with the lowest common denominator when it comes to security. Building containers tends to require a lot more privileges than running containers. For example, we allow the **mknod** capability by default. The **mknod** capability allows processes to create device nodes. Some package installs attempt to create device nodes, yet in production almost no applications do. Removing the **mknod** capability from your containers in production would make your systems more secure.
Another example is that we default container images to read/write because the install process means writing packages to **/usr**. Yet in production, I argue that you really should run all of your containers in read-only mode. Only allow the containers to write to **tmpfs** or directories that have been volume mounted into the container. By splitting the running of containers from the building, we could change the defaults and make for a much more secure environment.
### CRI-O a runtime abstraction for Kubernetes
Kubernetes added an API to plug in any runtime for the pods, called the Container Runtime Interface (CRI). I am not a big fan of having lots of daemons running on my system, but we have added another. My team, led by [Mrunal Patel][40], started working on the [CRI-O][41] daemon in late 2016. This is a Container Runtime Interface daemon for running OCI-based applications. Theoretically, in the future we could compile the CRI-O code directly into the kubelet to eliminate the second daemon.
Unlike other container runtimes, CRI-O's only purpose in life is satisfying Kubernetes' needs. Remember the steps described above for what Kubernetes needs to run a container.
Kubernetes sends a message to the kubelet that it wants it to run the NGINX server:
1. The kubelet calls out to the CRI-O to tell it to run NGINX.
2. CRI-O answers the CRI request.
3. CRI-O finds an OCI Image at a container registry.
4. CRI-O uses containers/image to pull the image from the registry to the host.
5. CRI-O unpacks the image onto local storage using containers/storage.
6. CRI-O launches an OCI-compliant runtime, usually **runc**, and starts the container. As I stated previously, the Docker daemon launches its containers using **runc** in exactly the same way.
7. If desired, the kubelet could also launch the container using an alternate runtime, such as Clear Containers **runv**.
CRI-O is intended to be a stable platform for running Kubernetes, and we will not ship a new version of CRI-O unless it passes the entire Kubernetes test suite. All pull requests that go to [https://github.com/Kubernetes-incubator/cri-o][42] run against the entire Kubernetes test suite. You cannot get a pull request into CRI-O without passing the tests. CRI-O is fully open, and we have had contributors from several different companies, including Intel, SUSE, IBM, Google, and Hyper.sh. As long as a majority of maintainers agree to a patch to CRI-O, it will get accepted, even if the patch is not something that Red Hat wants.
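As a rough sketch of how this plugs in, the kubelet is simply pointed at CRI-O's socket instead of the Docker engine; flag names are those used by Kubernetes releases of that era, the socket path may vary by CRI-O version, and the rest of the kubelet configuration is omitted:
```
systemctl start crio                 # the CRI-O daemon answers CRI requests on its socket
kubelet --container-runtime=remote \
        --container-runtime-endpoint=unix:///var/run/crio/crio.sock
```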
### Conclusion
I hope this deep dive helps you understand how Linux containers have evolved. At one point, Linux containers were an every-vendor-for-themselves situation. Docker helped establish a de facto standard for image creation and simplified the tools used to work with containers. The Open Container Initiative now means that the industry is working around a core image format and runtime, which fosters innovation around making tooling more efficient for automation, more secure, highly scalable, and easier to use. Containers allow us to examine installing software in new and novel ways, whether they are traditional applications running on a host, or orchestrated micro-services running in the cloud. In many ways, this is just the beginning.
--------------------------------------------------------------------------------
作者简介:
Daniel J Walsh - Daniel Walsh has worked in the computer security field for almost 30 years. Dan joined Red Hat in August 2001.
via: https://opensource.com/article/17/7/how-linux-containers-evolved
作者:[ Daniel J Walsh (Red Hat)][a]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]:https://opensource.com/users/rhatdan
[1]:https://opensource.com/resources/what-are-linux-containers?utm_campaign=containers&intcmp=70160000000h1s6AAA
[2]:https://opensource.com/resources/what-docker?utm_campaign=containers&intcmp=70160000000h1s6AAA
[3]:https://opensource.com/resources/what-is-kubernetes?utm_campaign=containers&intcmp=70160000000h1s6AAA
[4]:https://developers.redhat.com/blog/2016/01/13/a-practical-introduction-to-docker-container-terminology/utm_campaign=containers&intcmp=70160000000h1s6AAA
[5]:https://opensource.com/article/17/7/how-linux-containers-evolved?rate=k1UcW7wzh6axaB_z8ScE-U8cux6fLXXgW_vboB5tIwk
[6]:https://github.com/opencontainers/runtime-spec/blob/master/spec.md
[7]:https://github.com/opencontainers/image-spec/blob/master/spec.md
[8]:https://github.com/containers/image
[9]:https://github.com/coreos/flannel
[10]:https://opensource.com/user/16673/feed
[11]:https://www.flickr.com/photos/danramarch/
[12]:https://creativecommons.org/licenses/by-sa/4.0/
[13]:https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/device_mapper.html
[14]:https://btrfs.wiki.kernel.org/index.php/Main_Page
[15]:https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt
[16]:https://linuxcontainers.org/
[17]:https://libvirt.org/
[18]:http://sandbox.libvirt.org/
[19]:https://opensource.com/article/17/6/getting-started-go
[20]:https://github.com/opencontainers/runc/tree/master/libcontainer
[21]:https://github.com/opencontainers/runtime-spec
[22]:https://opensource.com/business/15/10/lisa15-interview-alison-chaiken-mentor-graphics
[23]:https://www.openshift.com/
[24]:https://openshift.github.io/geard/
[25]:https://opensource.com/resources/what-is-kubernetes
[26]:https://github.com/google/lmctfy
[27]:https://kubernetes.io/docs/admin/kubelet/
[28]:https://www.opencontainers.org/
[29]:https://github.com/projectatomic/atomic
[30]:https://twitter.com/runc0m
[31]:https://github.com/projectatomic/skopeo
[32]:https://developers.redhat.com/blog/2016/05/02/introducing-atomic-scan-container-vulnerability-detection/
[33]:https://github.com/containers/image
[34]:http://rhelblog.redhat.com/2017/05/11/skopeo-copy-to-the-rescue/
[35]:https://access.redhat.com/articles/2750891
[36]:https://github.com/containers/storage
[37]:https://github.com/coreos/etcd
[38]:https://twitter.com/nalind
[39]:https://github.com/projectatomic/buildah
[40]:https://twitter.com/mrunalp
[41]:https://github.com/Kubernetes-incubator/cri-o
[42]:https://github.com/Kubernetes-incubator/cri-o
[43]:https://opensource.com/users/rhatdan
[44]:https://opensource.com/users/rhatdan
[45]:https://opensource.com/article/17/7/how-linux-containers-evolved#comments

View File

@ -0,0 +1,364 @@
Linux 容器的演化过程
============================================================
### 容器在过去几年内取得很大的进展。现在我们来回顾它发展的时间线。
![How Linux containers have evolved](https://opensource.com/sites/default/files/styles/image-full-size/public/images/business/containers_2015-3-osdc-lead.png?itok=yOKCc8ze "Linux 容器的演化过程")
图片来自:[Daniel Ramirez][11],[CC BY-SA 4.0][12]
在过去几年内,容器不仅成为了开发者们热议的话题,还受到了企业的关注。持续增长的关注使得在安全性、可扩展性以及互用性等方面的需求也得以增长。满足这些需求需要很大的工程量,下面我们讲讲在红帽这些企业级工程是如何发展的。
我在 2013 年秋季第一次遇到 Docker 公司(Docker.io)的代表,那时我们在设法使 Red Hat Enterprise Linux (RHEL) 支持 Docker 容器(Docker 项目的一部分后来更名为 _Moby_)的运行。在移植过程中,我们遇到了一些问题。处理容器镜像分层所需的写时拷贝(COW)文件系统成了我们第一个重大阻碍。Red Hat 最终贡献了一些 COW 文件系统实现,包括 [Device Mapper][13]、[btrfs][14],以及 [OverlayFS][15] 的第一个版本。在 RHEL 上,我们默认使用 Device Mapper,但是我们在 OverlayFS 的支持上也已经取得了很大进展。
Linux 容器
* [什么是 Linux 容器?][1]
* [什么是 Docker?][2]
* [什么是 Kubernetes?][3]
* [容器术语介绍][4]
我们在用于启动容器的工具上遇到了第二个主要障碍。那时的上游 docker 使用 [LXC][16] 工具来启动容器,然而我们不想在 RHEL 上支持 LXC 工具集。而且在与上游 docker 合作之前,我们已经与 [libvirt][17] 团队携手构建了 [virt-sandbox][18] 工具,它使用 **libvirt-lxc** 来启动容器。
在那时,红帽里有员工提议先将 LXC 工具集移除,然后在 docker 守护进程与 libvirt 之间添加桥接器来调用 **libvirt-lxc** 启动容器。这个方案也有一些顾虑。考虑下面这个例子,使用 Docker 客户端(**docker-cli**)来启动容器,各层调用会在容器进程(**pid1OfContainer**)之前依次启动:
**docker-cli → docker-daemon → libvirt-lxc → pid1OfContainer**
我不是很喜欢这个方案,因为它在启动容器的工具与最终的容器进程之间有两个守护进程。
我的团队与上游 docker 开发者合作实现了一个原生的 [Go 编程语言][19] 版本的容器运行时,叫作 [libcontainer][20]。这个库作为 [OCI 运行时规范]的最初版实现与 runc 一同发布。
**docker- ****cli**** → docker-daemon @ pid1OfContainer**
大多数人误认为当他们执行一个容器时,容器进程是作为 **docker-cli** 的子进程运行的。实际上他们执行的是一个客户端/服务端请求操作,容器进程是在一个完全单独的环境中作为子进程运行的。客户端/服务端请求会导致不稳定性和潜在的安全问题,而且会阻碍一些实用特性的实现。举个例子,[systemd][22] 有个叫做套接字唤醒的特性,你可以将一个守护进程设置成仅当相应的套接字被连接时才启动。这意味着你的系统可以节约内存并按需执行服务。套接字唤醒的工作原理是 systemd 代为监听 TCP 套接字,并在包到达套接字时启动相应的服务。一旦服务启动完毕,systemd 将套接字交给新启动的守护进程。如果将守护进程运行在基于 docker 的容器中就会出现问题。systemd 的 unit 文件通过 Docker CLI 执行容器,然而这时 systemd 却无法简单地经由 Docker CLI 将套接字转交给 Docker 守护进程。
类似这样的问题让我们意识到我们需要一个运行容器的替代方案。
### 容器编排问题
上游的 docker 项目简化了容器的使用过程,同时也是一个绝佳的 Linux 容器学习工具。你可以通过一条简单的命令快速地体验如何启动一个容器,例如运行 **docker run -ti fedora sh** 然后立即得到一个容器进程。
当开始把许多容器组织成一个功能更为强大的应用时,你才能体会到容器真正的能力。但是问题在于伴随多容器应用而来的高复杂度使得简单的 Docker 命令无法胜任编排工作。你要如何管理容器应用在有限资源的集群节点间的布局与编排?如何管理它们的生命周期等等?
在第一届 DockerCon 上,至少有 7 个不同的公司/开源项目展示了其容器编排方案。红帽演示了 [OpenShift][23] 的 [geard][24] 项目,它基于 OpenShift v2 的容器(叫作 gears)。红帽觉得我们需要重新审视容器编排,而且可能要与开源社区的其他人合作。
Google 则演示了 [Kubernetes][25] 容器编排工具,它源自 Google 在编排自己内部架构的过程中积累的知识经验。OpenShift 决定放弃 Gear 项目,开始和 Google 一同开发 Kubernetes。现在 Kubernetes 是 GitHub 上最大的社区项目之一。
#### Kubernetes
Kubernetes 原先被设计成使用 Google 的 [lmctfy][26] 容器运行时来完成工作。在 2014 年夏天,lmctfy 被移植以兼容 docker。Kubernetes 还会在 kubernetes 集群的每个节点上运行一个叫作 [kubelet][27] 的守护进程,这意味着原先使用 docker 1.8 的 kubernetes 工作流看起来是这样的:
**kubelet → dockerdaemon @ PID1**
回退到双守护进程的模式。
然而更糟糕的是,每次 docker 的新版本发布都会使 kubernetes 无法工作。Docker 1.10 切换了镜像的存储方案,导致所有镜像需要重建。而 Docker 1.11 开始使用 **runc** 来启动容器:
**kubelet → dockerdaemon @ runc @PID1**
Docker 1.12 则增加了一个容器守护进程用于启动容器。其主要目的是为了支持 Docker Swarm(Kubernetes 的竞争者之一):
**kubelet → dockerdaemon → containerd @runc @ pid1**
如上所述,_每一次_ docker 发布都破坏了 Kubernetes 的功能,这也是为什么 Kubernetes 和 OpenShift 要求我们为其工作负载维护老版本 Docker 的原因。
现在我们有了一个三守护进程的系统,只要任何一个出现问题,整个系统都将崩溃。
### 走向容器标准化
### CoreOS、rkt 和其它替代运行时
因为 docker 运行时带来的问题,几个组织都在寻求替代的运行时。CoreOS 就是其中之一。他们提供了一个替代上游 docker 的容器运行时,叫作 _rkt_(rocket)。他们同时还引入了一个标准容器规范,称作 _appc_(App Container)。从根本上讲,他们是希望能使所有人都使用一个标准规范来管理容器镜像中的应用。
这引起了我们的警觉。当我第一次开始和上游 docker 合作时,我最大的担忧就是最终我们会分裂出多个标准。我不希望类似 RPM 和 DEB 之间的战争影响接下来 20 年的 Linux 软件发布。appc 的一个积极成果是,它说服了上游 docker 与开源社区合作,创建了一个称作 [Open Container Initiative][28](OCI)的标准组织。
OCI 已经着手制定两个规范:
**[OCI 运行时规范][6]**:OCI 运行时规范“旨在规范容器的配置、执行环境以及生命周期”。它定义了容器在磁盘上的存在形式、用于描述容器内应用程序的 JSON 文件,以及生成和执行容器的方式。上游 docker 贡献了 libcontainer 并构建了 runc 作为 OCI 运行时规范的默认实现。
**[OCI 镜像文件格式规范][7]**:镜像文件格式规范主要基于上游 docker 所使用的镜像格式,定义了容器仓库中实际存储的容器镜像格式。该规范使得应用开发者能为应用使用单一的标准化格式。一些 appc 中描述的概念被加入到 OCI 镜像格式规范中得以保留。这两份规范 1.0 版本的发布已经临近。上游 docker 已经同意在 OCI 镜像规范定案后支持该规范。Rkt 现在既支持运行 OCI 镜像也支持传统的上游 docker 镜像。
Open Container Initiative 通过为工业界提供容器镜像与运行时标准化的环境,帮助在工具与编排领域解放创新的力量。
### 抽象运行时接口
得益于标准化工作,Kubernetes 编排领域也有所创新。作为 Kubernetes 的一大支持者,CoreOS 提交了一堆补丁,使 Kubernetes 除了 docker 引擎外还能通过 rkt 运行容器并且与容器通讯。Google 和 Kubernetes 上游预见到这些补丁以及将来可能新增的容器运行时接口会给 Kubernetes 带来过高的代码复杂度,于是决定实现一个叫作 Container Runtime Interface(CRI)的 API 协议规范。然后他们将 Kubernetes 由原来的直接调用 docker 引擎改为调用 CRI,这样任何人都可以通过实现服务器端的 CRI 来创建支持 Kubernetes 的容器运行时。Kubernetes 上游还为 CRI 开发者们创建了一个大型测试集,以验证他们的运行时对 Kubernetes 的支持情况。开发者们还在努力地移除 Kubernetes 对 docker 引擎的调用,并将它们隐藏在一个叫作 docker-shim 的薄抽象层后。
### 容器工具的创新
### 伴随 skopeo 而来的容器仓库创新
几年前我们正与 Atomic 项目团队合作构建 [atomic CLI][29]。我们希望实现一个功能,在镜像还在镜像仓库时查看它的细节。在那时,查看容器镜像相关 JSON 文件的唯一方法是将镜像拉取到本地服务器,再通过 **docker inspect** 来查看 JSON 文件。这些镜像可能会很大,上至几 GiB。为了允许用户在不拉取镜像的情况下查看镜像细节,我们希望在 **docker inspect** 接口添加新的 **--remote** 参数。上游 docker 拒绝了我们的 pull request,告知我们他们不希望将 Docker CLI 复杂化,我们可以构建我们自己的工具去实现相同的功能。
我们的团队在 [Antonio Murdaca][30] 的领导下执行这个提议,构建了 [skopeo][31]。Antonio 没有止步于拉取镜像相关的 JSON 文件,而是决定实现一个完整的协议,用于在容器仓库与本地主机之间拉取与推送容器镜像。
skopeo 现在被 atomic CLI 用于类似检查容器更新的功能以及 [atomic scan][32] 当中。Atomic 也使用 skopeo 取代上游 docker 守护进程拉取和推送镜像的功能。
### Containers/image
我们也曾和 CoreOS 讨论过在 rkt 中使用 skopeo 的可能,然而他们表示不希望运行一个外部的协助程序,但是会考虑使用 skopeo 所使用的代码库。于是我们决定将 skopeo 分离为一个代码库和一个可执行程序,创建了 **[image][8]**。
[containers/image][33] 代码库和 skopeo 被几个其它上游项目和云基础设施工具所使用。skopeo 和 containers/image 已经支持除 docker 之外的多个存储后端,而且能够在容器仓库之间移动容器镜像,还拥有许多很棒的特性。[skopeo 的一个优点][34]是它不需要任何守护进程的协助来完成任务。containers/image 代码库的诞生使得类似[容器镜像签名][35]等增强功能得以实现。
### 镜像处理与扫描的创新
我在前文提到 **atomic** CLI。我们构建这个工具是为了添加不适合 docker CLI 或者我们无法在上游 docker 中实现的特性。我们也希望获得足够灵活性将其用于开发额外的容器运行时工具和存储系统。Skopeo 就是一例。
我们想要在 atomic 中实现的一个功能是 **atomic mount**。从根本上讲,我们希望从 Docker 镜像存储(上游 docker 称之为 graph driver)中获取内容,把镜像挂载到某处,以便用工具来查看该镜像。如果你使用上游的 docker,查看镜像内容的唯一方法就是启动一个容器。如果镜像里有不可信的内容,仅仅为了查看镜像就在容器中执行其代码,会有潜在的危险。通过启动容器来查看镜像内容的另一个问题是,所需的工具可能并没有被包含在容器镜像当中。
大多数容器镜像扫描器遵循以下流程:它们连接 Docker 的套接字,执行一个 **docker save** 来创建一个 tar 打包文件,然后在磁盘上分解这个打包文件,最后查看其中的内容。这是一个很慢的过程。
通过 **atomic mount**,我们希望直接使用 Docker graph driver 挂载镜像。如果 docker 守护进程使用 device mapper,我们将挂载这个设备;如果它使用 overlay,我们会挂载 overlay。这个操作很快,而且满足我们的需求。现在你可以执行:
```
# atomic mount fedora /mnt
# cd /mnt
```
然后开始探查内容。你完成相应工作后,执行:
```
# atomic umount /mnt
```
我们在 **atomic scan** 中使用了这一特性,实现了一个快速的容器扫描器。
#### **工具协作的问题**
其中一个严重的问题是 **atomic mount** 是隐式地完成这些工作的。Docker 守护进程不知道有另一个进程在使用这个镜像。这会导致一些问题(例如,如果你先挂载了 Fedora 镜像,然后某个人执行了 **docker rmi fedora** 命令,docker 守护进程在试图删除 Fedora 镜像时就会奇怪地报错,提示相应的资源忙碌)。Docker 守护进程可能因此进入一个奇怪的状态。
### 容器存储系统
为了解决这个问题,我们开始尝试从上游 docker 守护进程中剥离 graph driver 的代码。Docker 守护进程在内存中为 graph driver 完成所有锁的获取。我们想要将这些锁操作转移到文件系统中,这样我们可以支持多个不同的进程来同时操作容器的存储系统,而不用通过单一的守护进程。
我们创建了 [containers/storage][36] 项目,它实现了容器运行、构建、存储所需的所有写时拷贝特性,同时不再需要一个单一进程来控制和监控这个过程(也就是不需要守护进程)。现在 skopeo 以及其它工具和项目可以直接利用镜像的存储系统。其它开源项目也开始使用 containers/storage,在某个时间点我们也希望把这个项目合并回上游 docker 项目。
### 放弃 docker 进行创新
当 Kubernetes 在一个节点上使用 docker 守护进程运行容器时,会发生什么?首先,Kubernetes 会执行一条类似这样的命令:
```
kubelet run nginx image=nginx
```
这个命令告诉 kubelet 在节点上运行 NGINX 应用。kubelet 调用 CRI,请求启动 NGINX 应用。在这时,实现了 CRI 规范的容器运行时必须执行以下步骤:
1. 检查本地是否存在名为 **nginx** 的容器。如果没有,容器运行时会在容器仓库中搜索容器镜像。
2. 如果镜像不存在于本地,从容器仓库下载到本地系统。
3. 使用容器存储系统(通常是写时拷贝存储系统)解析下载的容器镜像并挂载它。
4. 使用标准化的容器运行时执行容器。
让我们看看上述过程使用到的特性:
1. OCI 镜像格式规范定义了容器仓库存储的标准镜像格式。
2. Containers/image 代码库实现了从容器仓库拉取镜像到容器主机所需的所有特性。
3. Containers/storage 提供了代码库以供在写时拷贝的存储系统上探查并处理 OCI 镜像格式。
4. OCI 运行时规范以及 **runc** 提供了执行容器的工具(同时也是 docker 守护进程用来运行容器的工具)。
这意味着我们可以利用这些工具来使用容器,而无需一个大型的容器守护进程。
在中等到大规模的基于 DevOps 的持续集成/持续交付环境下,效率、速度和安全性至关重要。只要你的工具遵循 OCI 规范,开发者和运维人员就能在持续集成、持续交付到生产环境的自动化流程中使用最佳的工具。大多数容器工具被隐藏在容器编排或上层容器平台技术之下。我们预想有朝一日,运行时和镜像工具的选择会变成容器平台的一个安装选项。
### 系统(独立)容器
在 Atomic 项目中我们引入了 **atomic host**,这是一种新的操作系统构建方式:其中的软件可以“原子地”更新,并且大多数应用会以容器的形式运行。这个平台的目的是证明将来大多数软件都能以 OCI 镜像格式发布,使用标准协议从容器仓库拉取镜像,然后安装到系统上。以容器镜像的形式发布软件,允许你以不同的节奏分别升级应用程序和操作系统。而传统的 RPM/yum/DNF 包分发方式会把应用更新锁定在操作系统的生命周期中。
在以容器形式部署大部分基础设施时会遇到一个问题:有时一些应用必须在容器运行时守护进程启动之前运行。我们看一个使用 docker 守护进程运行 Kubernetes 的例子:Kubernetes 为了将 pods(或者说容器)部署在独立的网络中,要求先建立一个网络。我们目前默认使用的网络守护进程是 **[flanneld][9]**,它必须在 docker 守护进程之前启动,以便把运行 Kubernetes pods 所需的网络接口交给 docker 守护进程。而且,flanneld 使用 [**etcd**][37] 来存储数据,这个守护进程必须在 flanneld 启动之前运行。
如果你想把 etcd 和 flanneld 部署到容器镜像中,那就陷入了先有鸡还是先有蛋的困境。我们需要容器运行时守护进程来启动容器化的应用,但这些应用又需要在容器运行时守护进程之前启动。我见过几个尝试解决这个问题的取巧方法,但这些方法都不干净利落。而且,docker 守护进程目前没有合适的方法来配置容器启动的优先级顺序。我见过一些这方面的提议,但它们看起来都和 SysVInit 所使用的启动服务的方式相似(我们知道那种方式带来的复杂度)。
### systemd
用 systemd 替代 SysVInit 的原因之一就是为了处理服务启动的优先级和顺序,我们为什么不充分利用这种技术呢?在 Atomic 项目中,我们决定让容器在没有容器运行时守护进程的情况下也能在主机上运行,尤其是在系统启动早期。我们增强了 atomic CLI 的功能,让用户可以安装容器镜像。当你执行 **atomic install --system etcd** 时,它会利用 skopeo 从外部的容器仓库拉取 etcd 的 OCI 镜像,然后把它分解(扩展)到 OSTree 存储中。因为 etcd 运行在生产环境中,我们把镜像处理为只读。接着,**atomic** 命令会抓取容器镜像中的 systemd unit 文件模板,并用它在磁盘上创建用于启动镜像的 unit 文件。这个 unit 文件实际上使用 **runc** 在主机上启动容器(虽然 **runc** 不是必需的)。
执行 **atomic install --system flanneld** 时会进行相似的过程,但是这时 flanneld 的 unit 文件中会指明它依赖 etcd。
在系统引导时,systemd 会保证 etcd 先于 flanneld 运行,并且直到 flanneld 启动完毕后才启动容器运行时。这样我们就能把 docker 守护进程和 Kubernetes 部署到系统容器当中。这意味着你可以启动一台 atomic host 或者一台传统的基于 rpm 的操作系统,让整个容器编排工具栈以容器的形式运行。这是一个强大的特性,因为我们知道用户希望能独立于这些组件,持续为其容器主机打补丁。而且,它能把主机操作系统的占用降至最低。
大家甚至讨论过把传统的应用程序放入容器,使其既能作为独立/系统容器运行,也能作为被编排的容器运行。设想一下,可以用 **atomic install --system httpd** 命令安装一个 Apache 容器,这个容器可以和用 RPM 安装的 httpd 服务以相同的方式启动(**systemctl start httpd**,区别是这里的 httpd 运行在容器中)。存储可以使用本地的,换言之,主机的 /var/www 会挂载到容器当中,而容器监听本地网络的 80 端口。这表明我们可以在不需要容器运行时守护进程的情况下,把传统的工作负载以容器的形式运行在主机上。
### 构建容器镜像
在我看来,过去 4 年容器领域最让人失望的地方是镜像构建机制上缺乏创新。容器镜像不过是一个把若干 tar 包和一些 JSON 文件再打包在一起的 tar 包。基础镜像则是一个 rootfs 加上一个描述该基础镜像的 JSON 文件。然后当你增加镜像层时,层与层之间的差异会被打包,同时 JSON 文件也做出相应修改。这些镜像层与基础层一起被打包,共同构成一个容器镜像。
现在几乎所有人都使用 **docker build** 与 Dockerfile 格式来构建镜像。上游 docker 已经在几年前停止了接受修改或改进 Dockerfile 格式的 pull request 了。Dockerfile 在容器的演进过程中起着重要作用,开发者和管理员可以通过简单直接的方式来构建镜像;然而我觉得 Dockerfile 就像一个简陋的 bash 脚本,还带来了一些尚未解决的问题,例如:
* 使用 Dockerfile 创建容器镜像要求运行着 Docker 守护进程。
* 没有可以独立于 docker 命令的标准工具用于创建 OCI 镜像。
* 甚至类似 **ansible-containers** 和 OpenShift S2I (Source2Image) 的工具也在底层使用 **docker-engine**
* Dockerfile 中的每一行都会创建一个新的镜像层,这有助于开发容器的过程,因为构建工具能够识别 Dockerfile 中的未改动行,复用已经存在的镜像层,从而避免重复执行未改动的行。但这也会产生 _大量_ 的镜像层。
* 因此,不少人希望构建机制能压缩镜像消除这些镜像层。我猜想上游 docker 最后应该接受了一些提交满足了这个需求。
* 要从受保护的站点拉取内容放入容器镜像,你往往需要某种形式的密钥。比如你为了添加 RHEL 的内容到镜像中,就需要访问 RHEL 的证书和订阅。
* 这些密钥最终会被保存在镜像中。开发者要费很大工夫去移除它们。
* 为了允许在 docker 构建过程中挂载卷,我们在自己维护的 projectatomic/docker 软件包中加入了 **-v** 选项,但是这些修改没有被上游 docker 接受。
* 构建过程的中间产物最终会保留在容器镜像中,所以尽管 Dockerfile 易于学习和使用,但是如果你深入其原理,就会明白它在大规模企业环境下还不够高效。然而在自动化容器平台下,你应该不会关心用于构建 OCI 镜像的方式是否高效。
### 放弃 docker 改用 Buildah
在 DevConf.cz 2017 上,我请我们团队的 [Nalin Dahyabhai][38] 考虑构建一个我称之为 **containers-coreutils** 的工具,基本上就是基于 containers/storage 和 containers/image 库的一系列命令行工具,它们可以模仿 Dockerfile 的语法。Nalin 为了调侃我的波士顿口音,决定把它叫做 [buildah][39]。我们只需要少量的 buildah 原语就可以构建一个容器镜像:
* 最小化 OS 镜像,消除不必要的工具是主要的安全原则之一。因为黑客在攻击应用时需要一些工具,如果类似 **gcc****make****dnf** 这样的工具根本不存在,就能阻碍攻击者的行动。
* 减小容器的体积总是有益的,因为这些镜像会通过互联网拉取与推送。
* 使用 Docker 进行构建的基本原理是在容器构建的根目录下利用命令安装或编译软件。
* 执行 **run** 命令要求所有的可执行文件都包含在容器镜像内。而在容器镜像中使用 **dnf** 需要完整的 Python 栈,即使在应用中从未使用到 Python。
* **ctr=$(buildah from fedora)**:
* 使用 containers/image 从容器仓库拉取 Fedora 镜像。
* 返回一个容器 ID **ctr**)。
* **mnt=$(buildah mount $ctr)**:
* 挂载新建的容器镜像(**$ctr**)。
* 返回挂载点路径。
* 现在你可以使用挂载点来写入内容。
* **dnf install httpd --installroot=$mnt**:
* 你可以使用主机上的命令把内容重定向到容器中,这样你可以把密钥保留在主机而不导入到容器内,同时构建所用的工具也仅仅存在于主机上。
* 容器内不需要包含 **dnf** 或者 Python 栈,除非你的应用用到它们。
* **cp foobar $mnt/dir**:
* 你可以使用任何 bash 中可用的命令来构造镜像。
* **buildah commit $ctr**:
* 你可以随时创建一个镜像层,镜像的分层由用户而不是工具来决定。
* **buildah config --env container=oci --entrypoint /usr/bin/httpd $ctr**:
* Buildah 支持所有 Dockerfile 的命令。
* **buildah run $ctr dnf -y install httpd**:
* Buildah 支持 **run** 命令,但它是在一个锁定的容器内利用 **runc** 执行命令,而不依赖容器运行时守护进程。
* **buildah build-using-dockerfile -f Dockerfile .**:
我们希望将类似 **ansible-containers** 和 OpenShift S2I 这样的工具移植为使用 **buildah**,以去除对容器运行时守护进程的依赖。
在与生产环境运行容器相同的容器运行时中构建镜像,还会遇到另一个大问题:在安全性上你最终只能得到最小公分母。构建容器往往比运行容器需要更多的权限。举个例子,我们默认允许 **mknod** 能力,它允许进程创建设备节点。有些软件包的安装过程会尝试创建设备节点,然而在生产环境中几乎没有应用会这么做。在生产环境中移除容器的 **mknod** 能力会让系统更为安全。
另一个例子是,容器镜像默认是可读写的,因为安装过程意味着向 **/usr** 存入软件包。然而在生产环境中,我强烈建议把所有容器设为只读模式,仅仅允许它们写 **tmpfs** 或者是挂载了 volume 的目录。通过分离容器的构建与运行环境,我们可以更改这些默认设置,提供一个更为安全的环境。
* 当然,buildah 可以使用 Dockerfile 构建容器镜像。
### CRI-O —— 一个 Kubernetes 的运行时抽象
Kubernetes 添加了一个叫作 Container Runtime Interface(CRI)的 API,用于接入任意的 pod 运行时。虽然我不是很喜欢在系统上运行太多守护进程,然而我们还是又加了一个。我的团队在 [Mrunal Patel][40] 的领导下,于 2016 年后期开始构建 [CRI-O][41] 守护进程。这是一个用来运行基于 OCI 的应用程序的 Container Runtime Interface 守护进程。理论上,将来我们能够把 CRI-O 的代码直接编译进 kubelet,从而消除这个多余的守护进程。
不像其它容器运行时,CRI-O 的唯一目的就是满足 Kubernetes 的需求。回想一下前文描述的 Kubernetes 运行容器所需的步骤。
Kubernetes 传递消息给 kubelet 告知其运行 NGINX 服务器:
1. kubelet 调用 CRI-O,告知它运行 NGINX。
2. CRI-O 回应 CRI 请求。
3. CRI-O 在容器仓库查找 OCI 镜像。
4. CRI-O 使用 containers/image 从仓库拉取镜像到主机。
5. CRI-O 使用 containers/storage 解压镜像到本地磁盘。
6. CRI-O 按照 OCI 运行时规范(通常使用 **runc**启动容器。如前文所述Docker 守护进程也同样使用 **runc** 启动它的容器。
7. 如果需要,kubelet 也可以使用其它的运行时启动容器,例如 Clear Containers 的 **runv**。
CRI-O 旨在成为运行 Kubernetes 的稳定平台,只有在通过完整的 Kubernetes 测试集后,新版本的 CRI-O 才会发布。所有提交到 [https://github.com/Kubernetes-incubator/cri-o][42] 的 pull request 都会运行完整的 Kubernetes 测试集,没有通过测试集的 pull request 不会被接受。CRI-O 是完全开放的,我们已经收到了来自 Intel、SUSE、IBM、Google、Hyper.sh 等多家公司的代码贡献。只要大多数维护者同意,提交给 CRI-O 的补丁就会被接受,即使这个补丁并不是红帽想要的特性。
### 小结
我希望这份深入的介绍能够帮助你理解 Linux 容器的演化过程。Linux 容器曾经处于各自为战的局面,Docker 帮助建立了镜像创建的事实标准,并简化了容器的使用工具。Open Container Initiative 则意味着业界正在围绕核心的镜像格式与运行时展开合作,这促进了工具在自动化效率、安全性、高可扩展性、易用性方面的创新。容器使我们能够以新颖的方式部署软件,无论是运行于主机上的传统应用,还是部署在云端的编排微服务。而在许多方面,这一切还仅仅是个开始。
--------------------------------------------------------------------------------
作者简介:
Daniel J Walsh - Daniel 有将近 30 年的计算机安全领域工作经验。他在 2001 年 8 月加入 Red Hat。
via: https://opensource.com/article/17/7/how-linux-containers-evolved
作者:[Daniel J Walsh (Red Hat)][a]
译者:[haoqixu](https://github.com/haoqixu)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]:https://opensource.com/users/rhatdan
[1]:https://opensource.com/resources/what-are-linux-containers?utm_campaign=containers&intcmp=70160000000h1s6AAA
[2]:https://opensource.com/resources/what-docker?utm_campaign=containers&intcmp=70160000000h1s6AAA
[3]:https://opensource.com/resources/what-is-kubernetes?utm_campaign=containers&intcmp=70160000000h1s6AAA
[4]:https://developers.redhat.com/blog/2016/01/13/a-practical-introduction-to-docker-container-terminology/utm_campaign=containers&intcmp=70160000000h1s6AAA
[5]:https://opensource.com/article/17/7/how-linux-containers-evolved?rate=k1UcW7wzh6axaB_z8ScE-U8cux6fLXXgW_vboB5tIwk
[6]:https://github.com/opencontainers/runtime-spec/blob/master/spec.md
[7]:https://github.com/opencontainers/image-spec/blob/master/spec.md
[8]:https://github.com/containers/image
[9]:https://github.com/coreos/flannel
[10]:https://opensource.com/user/16673/feed
[11]:https://www.flickr.com/photos/danramarch/
[12]:https://creativecommons.org/licenses/by-sa/4.0/
[13]:https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/device_mapper.html
[14]:https://btrfs.wiki.kernel.org/index.php/Main_Page
[15]:https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt
[16]:https://linuxcontainers.org/
[17]:https://libvirt.org/
[18]:http://sandbox.libvirt.org/
[19]:https://opensource.com/article/17/6/getting-started-go
[20]:https://github.com/opencontainers/runc/tree/master/libcontainer
[21]:https://github.com/opencontainers/runtime-spec
[22]:https://opensource.com/business/15/10/lisa15-interview-alison-chaiken-mentor-graphics
[23]:https://www.openshift.com/
[24]:https://openshift.github.io/geard/
[25]:https://opensource.com/resources/what-is-kubernetes
[26]:https://github.com/google/lmctfy
[27]:https://kubernetes.io/docs/admin/kubelet/
[28]:https://www.opencontainers.org/
[29]:https://github.com/projectatomic/atomic
[30]:https://twitter.com/runc0m
[31]:https://github.com/projectatomic/skopeo
[32]:https://developers.redhat.com/blog/2016/05/02/introducing-atomic-scan-container-vulnerability-detection/
[33]:https://github.com/containers/image
[34]:http://rhelblog.redhat.com/2017/05/11/skopeo-copy-to-the-rescue/
[35]:https://access.redhat.com/articles/2750891
[36]:https://github.com/containers/storage
[37]:https://github.com/coreos/etcd
[38]:https://twitter.com/nalind
[39]:https://github.com/projectatomic/buildah
[40]:https://twitter.com/mrunalp
[41]:https://github.com/Kubernetes-incubator/cri-o
[42]:https://github.com/Kubernetes-incubator/cri-o
[43]:https://opensource.com/users/rhatdan
[44]:https://opensource.com/users/rhatdan
[45]:https://opensource.com/article/17/7/how-linux-containers-evolved#comments