This started with a consulting snafu: Government organisation A got government organisation B to develop a web application. Government organisation B subcontracted part of the work to somebody. Hosting and maintenance of the project were later contracted out to a private-sector company C. Company C discovered that the subcontracted somebody (who was long gone) had built a custom Docker image and made it a dependency of the build system, but without committing the original Dockerfile. That left company C with a contractual obligation to manage a Docker image they had no source code for. Company C calls me in once in a while to do various things, so doing something about this mystery-meat Docker image became my job.
Fortunately, the Docker image format is a lot more transparent than it could be. A little detective work is needed, but a lot can be figured out just by pulling apart an image file. As an example, here’s a quick walkthrough of an image for [the Prettier code formatter][1].
First let’s get the Docker daemon to pull the image, then extract the image to a file:
$ docker pull tmknom/prettier:2.0.5
$ docker save tmknom/prettier:2.0.5 > prettier.tar
Yes, the file is just an archive in the classic tarball format:
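If you want to poke at the archive yourself, here’s a minimal sketch using nothing but `tar`. The function names and the scratch directory are my own inventions, and newer Docker releases may lay the archive out differently, so go by whatever your own listing shows.

```shell
# List the top-level entries of a saved image archive without extracting it.
# Assumes the archive came from `docker save`, as above.
list_image_entries() {
    tar tf "$1" | cut -d/ -f1 | sort -u
}

# Extract the archive into a scratch directory for closer inspection.
# The directory name is whatever you pass in; nothing here is Docker-specific.
extract_image() {
    mkdir -p "$2" && tar xf "$1" -C "$2"
}

# Example (hypothetical directory name):
#   list_image_entries prettier.tar
#   extract_image prettier.tar prettier-image
```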
As you can see, Docker uses hashes a lot for naming things. Let’s have a look at the `manifest.json`. It’s in hard-to-read compacted JSON, but the [`jq` JSON Swiss Army knife][2] can pretty print it for us:
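Besides pretty printing, `jq` can pull out just the fields that matter here. This is a rough sketch that assumes `manifest.json` is a JSON array with one object per image, each carrying `Config` and `Layers` keys; the function names are made up for illustration.

```shell
# Print the name of the config file for the first image in the manifest.
image_config() {
    jq -r '.[0].Config' "$1"
}

# Print the layer tarball paths, one per line, in the order they are listed.
image_layers() {
    jq -r '.[0].Layers[]' "$1"
}

# Example:
#   image_config manifest.json
#   image_layers manifest.json
```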
Note that the three layers correspond to the three hash-named directories. We’ll look at them later. For now, let’s look at the JSON file pointed to by the `Config` key. It’s a little long, so I’ll just dump the first bit here:
$ jq . 88f38be28f05f38dba94ce0c1328ebe2b963b65848ab96594f8172a9c3b0f25b.json | head -n 20
The most interesting part is the `history` list, which lists every single layer in the image. A Docker image is a stack of these layers. Almost every statement in a Dockerfile turns into a layer that describes the changes to the image made by that statement. If you have a `RUN` statement that creates `really_big_file` that you then delete with `RUN rm really_big_file`, you actually get two layers in the Docker image: one that contains `really_big_file`, and one that contains a `.wh.really_big_file` tombstone to cancel it out. The overall image file isn’t any smaller. That’s why you often see Dockerfile statements chained together like `RUN <commands> && rm really_big_file`: it ensures all changes are coalesced into one layer.
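Those tombstones are ordinary entries in the layer tarball, so you can go looking for them. Here’s a small sketch (the function name is mine) that lists any entry whose base name starts with `.wh.`:

```shell
# List whiteout ("tombstone") entries in a layer tarball.
# A whiteout is a file whose base name starts with ".wh.".
# Note: grep exits non-zero when the layer contains no whiteouts.
list_whiteouts() {
    tar tf "$1" | grep -E '(^|/)\.wh\.[^/]+$'
}

# Example:
#   list_whiteouts layer.tar
```

A layer that turns up entries like this is deleting something that an earlier layer still carries around.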
Here are all the layers recorded in the Docker image. Notice that most layers don’t change the filesystem image and are marked `"empty_layer": true`. Only three are non-empty, which matches up with what we saw before.
Fantastic! All the statements are right there in the `created_by` fields, so we can almost reconstruct the Dockerfile just from this. Almost. The `ADD` statement at the very top doesn’t actually give us the file we need to `ADD`. `COPY` statements are also going to be opaque. We also lose `FROM` statements because they expand out to all the layers inherited from the base Docker image.
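To get a first draft of that Dockerfile, something like the sketch below can help. It assumes the `created_by` strings follow the pattern the classic Docker builder uses, where metadata-only steps are recorded as `/bin/sh -c #(nop) ...` and everything else as `/bin/sh -c ...`. Images built with other tooling may record their history differently, and the function names are my own.

```shell
# Turn `created_by` strings (one per line on stdin) into Dockerfile-like lines.
# Steps marked "#(nop)" are metadata statements (CMD, ENV, ADD, ...), so the
# marker is stripped; anything else run through the shell becomes a RUN line.
created_by_to_dockerfile() {
    sed -e 's|^/bin/sh -c #(nop) *||' \
        -e 's|^/bin/sh -c |RUN |'
}

# Print a rough draft of a Dockerfile from an image config file.
draft_dockerfile() {
    jq -r '.history[].created_by' "$1" | created_by_to_dockerfile
}

# Example:
#   draft_dockerfile 88f38be28f05f38dba94ce0c1328ebe2b963b65848ab96594f8172a9c3b0f25b.json
```

The output still needs hand editing: the inherited base-image lines have to be swapped for a `FROM`, and any `ADD` or `COPY` sources have to be recovered some other way.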
We can group the layers by Dockerfile by looking at the timestamps. Most layer timestamps are under a minute apart, representing how long each layer took to build. However, the first two layers are from `2020-04-24`, and the rest of the layers are from `2020-04-29`. This would be because the first two layers are from a base Docker image. Ideally we’d figure out a `FROM` statement that gets us that image, so that we have a maintainable Dockerfile.
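A quick way to see that grouping is to count history entries per day. This sketch assumes each history entry has a `created` timestamp that begins with a `YYYY-MM-DD` date; the function name is just for illustration.

```shell
# Count history entries per creation date (YYYY-MM-DD).
# Takes the first ten characters of each `created` timestamp as the date.
history_dates() {
    jq -r '.history[].created[0:10]' "$1" | sort | uniq -c
}

# Example:
#   history_dates 88f38be28f05f38dba94ce0c1328ebe2b963b65848ab96594f8172a9c3b0f25b.json
```

A cluster of entries on an earlier date than the rest is a decent hint that they came from a base image.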
The `manifest.json` says that the first non-empty layer is `a9cc4ace48cd792ef888ade20810f82f6c24aaf2436f30337a2a712cd054dc97/layer.tar`. Let’s take a look:
$ cd a9cc4ace48cd792ef888ade20810f82f6c24aaf2436f30337a2a712cd054dc97/
$ tar tf layer.tar | head
Okay, that looks like it might be an operating system base image, which is what you’d expect from a typical Dockerfile. There are 488 entries in the tarball, and if you scroll through them, some interesting ones stand out:
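Rather than scrolling, you can count the entries and search for files that tend to identify a distribution. This is a sketch: the function names are mine, and the file name pattern is only a guess at a common convention (files under `etc/` whose names end in `release`), not an exhaustive list.

```shell
# Count the entries in a layer tarball.
count_entries() {
    tar tf "$1" | wc -l | tr -d ' '
}

# List files under etc/ whose names end in "release".
# This naming convention is an assumption; not every base image follows it.
find_release_files() {
    tar tf "$1" | grep -E '(^|/)etc/[^/]*release$'
}

# Example:
#   count_entries layer.tar
#   find_release_files layer.tar
```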
Sure enough, it’s an [Alpine][3] image, which you might have guessed if you noticed that the other layers used an `apk` command to install packages. Let’s extract the tarball and look around:
$ mkdir files
$ cd files
$ tar xf ../layer.tar
$ ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
$ cat etc/alpine-release
If you pull `alpine:3.11.6` and extract it, you’ll find that there’s one non-empty layer inside it, and the `layer.tar` is identical to the `layer.tar` in the base layer of the Prettier image.
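Checking that two layers really are identical doesn’t need anything fancy. Here’s a minimal sketch using `cmp`; the function name and the directory holding the extracted Alpine image are hypothetical.

```shell
# Succeed (exit 0) if two layer tarballs are byte-for-byte identical.
same_layer() {
    cmp -s "$1" "$2"
}

# Example, assuming the Alpine image was extracted into a directory
# called alpine-image (a made-up name) and <layer-dir> is its layer directory:
#   same_layer alpine-image/<layer-dir>/layer.tar \
#       a9cc4ace48cd792ef888ade20810f82f6c24aaf2436f30337a2a712cd054dc97/layer.tar \
#       && echo identical
```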
Just for the heck of it, what’s in the other two non-empty layers? The second layer is the main layer containing the Prettier installation. It has 528 entries, including Prettier, a bunch of dependencies and certificate updates: