Mirror of https://github.com/LCTT/TranslateProject.git (commit 46cc91e975)
9 ways to improve collaboration between developers and designers
======

> Stereotypes aside, designers' and developers' fates are forever intertwined. Here is how to get everyone on the same page.

![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/BUS_consensuscollab1.png?itok=ULQdGjlV)

This article was co-written with [Jason Porter][1].

Design is critical in any software project. Designers are not as familiar with the internals as the development team is, but sooner or later they need to understand the intent behind the developers' code.

Both sides carry stereotypes about the other: engineers often assume designers are flaky and irrational, while designers assume engineers are rigid and demanding. At the end of the day, though, the picture is more subtle, and the fates of designers and developers remain intertwined.

--------------------------------------------------------------------------------

via: https://opensource.com/article/18/5/9-ways-improve-collaboration-developers-designers

Author: [Jason Brock][a], [Jason Porter][1]
Selected by: [lujun9972](https://github.com/lujun9972)
Translated by: [LuuMing](https://github.com/LuuMing)
Proofread by: [pityonline](https://github.com/pityonline)
Capturing packets with tcpdump at the Linux command line
======

> `tcpdump` is a flexible, powerful packet-capture tool that can help you troubleshoot network problems effectively.

![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/terminal_command_linux_desktop_code.jpg?itok=p5sQ6ODE)

In my experience as an administrator, network connections often present problems that are very hard to troubleshoot. In those situations, `tcpdump` comes in handy.

`tcpdump` is a command-line utility that lets you capture and analyze the packets passing through your system. It is commonly used as a network troubleshooting tool as well as a security tool.

`tcpdump` is a powerful tool that supports many options and filter rules, which makes it suitable for a wide range of scenarios. Since it is a command-line tool, it works well for collecting packets on remote servers or devices without a GUI, for analysis later. It can be started in the background, or launched on a schedule by a timer tool such as cron.

In this article, we'll look at some of `tcpdump`'s most common features.

### 1. Install tcpdump on Linux

`tcpdump` is packaged for many Linux distributions, so chances are it is already on your system. Check whether it is installed with the following command:

```
$ which tcpdump
/usr/sbin/tcpdump
```

If `tcpdump` is not installed, you can install it with your package manager. For example, on CentOS or Red Hat Enterprise Linux, install `tcpdump` like this:

```
$ sudo yum install -y tcpdump
```
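
On Debian or Ubuntu the package goes by the same name, so the equivalent step, assuming the standard `apt` tooling, would be:

```
$ sudo apt-get install -y tcpdump
```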

`tcpdump` depends on `libpcap`, the library that performs the actual packet capture. If it is not installed either, the package manager will pull it in automatically as a dependency.

Now you are ready to capture some packets.

### 2. Capture packets with tcpdump

Capturing packets with `tcpdump` requires administrator privileges, so most of the commands in the examples below start with `sudo`.

First, use `tcpdump -D` to list the network interfaces available for capture, then start capturing on all interfaces with `tcpdump -i any`. Only the tail of that session survives in this excerpt:

```
[...]
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
[...]
$
```

`tcpdump` keeps capturing packets until it receives an interrupt signal; you can stop it by pressing `Ctrl+C`. As the example above showed, `tcpdump` captured more than 9,000 packets. In this case, since I was connected to the server over `ssh`, `tcpdump` captured all of those packets too. The `-c` option limits the number of packets `tcpdump` captures:

```
$ sudo tcpdump -i any -c 5
[...]
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
[...]
$
```

As shown above, `tcpdump` automatically stopped capturing after five packets. This is useful in scenarios where you only need a few packets for analysis, and the `-c` option really shines once we start using filter rules to capture specific packets, as shown below.

In the examples above, `tcpdump` by default resolves IP addresses to hostnames and port numbers to service names. In network troubleshooting it is usually easier to analyze problems with the raw IP addresses and port numbers, so use the `-n` option to display IP addresses and `-nn` to display port numbers as well:

```
$ sudo tcpdump -i any -c5 -nn
[...]
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
[...]
0 packets dropped by kernel
```

As shown above, the captured packets now display IP addresses and port numbers. This also keeps `tcpdump` from issuing DNS lookups, which helps reduce traffic while you troubleshoot a network problem.

Now that you can capture packets, let's examine what that output means.

### 3. Understand the captured packets

`tcpdump` can capture and decode many kinds of protocol packets, such as TCP, UDP, and ICMP. While we can't cover them all here, analyzing TCP packets is a good way to get started. More detail is available in the `tcpdump` [man page][1]. A TCP packet captured by `tcpdump` looks like this:

```
08:41:13.729687 IP 192.168.64.28.22 > 192.168.64.1.41916: Flags [P.], seq 196:568, ack 1, win 309, options [nop,nop,TS val 117964079 ecr 816509256], length 372
```

After the source and destination IP addresses comes the TCP flags field, `Flags [P.]`. Typical values for this field are:

| Value | Flag type | Description |
| ----- | --------- | ----------------- |
| S | SYN | Connection start |
| F | FIN | Connection finish |
| P | PUSH | Data push |
| R | RST | Connection reset |
| . | ACK | Acknowledgment |

Flags can also appear combined; in this example, `[P.]` indicates a PUSH-ACK packet.
Next comes the sequence range of the data contained in the packet. For the first packet captured this is an absolute number; subsequent packets use relative values so the stream is easier to follow. Here, `seq 196:568` means this packet carries bytes 196 through 568 of this data stream.

Next is the ack value, `ack 1`. This packet is on the side sending data, so the ack value is 1. On the side receiving data, this field holds the next byte expected on the stream; for example, the ack in the next packet of this stream should be 568.

The next field is the receive window size, `win 309`, which is the number of bytes available in the receive buffer, followed by TCP options such as the MSS (maximum segment size) or the window scale value. For the full details of TCP protocol options, see [Transmission Control Protocol (TCP) Parameters][2].

Finally, `length 372` is the length, in bytes, of the packet's payload, which matches the span of the sequence numbers (568 - 196 = 372).
Now let's look at how to filter packets so it is easier to analyze and pinpoint a problem.

### 4. Filter packets

As mentioned above, `tcpdump` can capture many kinds of packets, many of which may have nothing to do with the problem you are investigating. For example, if you are troubleshooting a connectivity issue with a web server, you don't care about SSH traffic, so filtering SSH packets out of the capture makes the problem easier to analyze.

`tcpdump` has plenty of options for building packet filter rules, for example by source or destination IP address, port number, protocol, and so on. Here are some of the most common filters.

#### Protocol

To capture a particular protocol, pass its name as a filter. For example, start `tcpdump` with the `icmp` filter, then `ping` a host from a second terminal:

```
$ ping opensource.com
PING opensource.com (54.204.39.132) 56(84) bytes of data.
64 bytes from ec2-54-204-39-132.compute-1.amazonaws.com (54.204.39.132): icmp_seq=1 ttl=47 time=39.6 ms
```

Back in the terminal running `tcpdump`, you can see that it captured only the ICMP traffic. Note that `tcpdump` does not display the DNS packets generated while resolving `opensource.com`:

```
09:34:20.136766 IP rhel75 > ec2-54-204-39-132.compute-1.amazonaws.com: ICMP echo request, id 20361, seq 1, length 64
[...]
```

#### Port number

`tcpdump` can filter packets by service name or port number. For example, to capture packets related to the HTTP service:

```
$ sudo tcpdump -i any -c5 -nn port 80
[...]
```
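
Filters can also be combined with boolean operators. The exact command for the next example was elided from this excerpt, but a sketch using `tcpdump`'s standard `and`/`or` filter syntax, with the addresses named in the paragraph below, would look like this:

```
$ sudo tcpdump -i any -c5 -nn "port 80 and (src 192.168.122.98 or src 54.204.39.132)"
```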
In that example, we captured only HTTP (port 80) packets whose source IP is either `192.168.122.98` or `54.204.39.132`. Combining filters this way makes it easy to capture both sides of a conversation in a data stream.

### 5. Check packet content

In the examples so far, we have built filter rules only from packet header information, such as the source and destination addresses and port numbers. Sometimes, when analyzing a network connection problem, you need to inspect the packets' contents to judge what is being sent and what is being received. `tcpdump` provides two options for viewing packet content: `-X` prints the content in hexadecimal, and `-A` prints it as ASCII.

For example, the content of an HTTP request looks like this:

```
$ sudo tcpdump -i any -c10 -nn -A port 80
[...]
E..4..@.@.....zb6.'....P....o..............
[...]
```
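
If you prefer a hexadecimal dump, swap `-A` for `-X`; a small variant of the same capture (not shown in the original excerpt):

```
$ sudo tcpdump -i any -c10 -nn -X port 80
```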
This comes in handy for troubleshooting problems with plain HTTP API calls. Of course, if the traffic is encrypted, this output is far less useful.

### 6. Save capture data

`tcpdump` can save captured packets to a file so you can analyze them later. For example, you can leave it capturing overnight and look at the results in the morning. Saving to a file also helps when packets scroll by too quickly to read: store them first, then analyze the problem at your own pace.

Use the `-w` option to save packets to a file instead of displaying them on the screen:
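
The command itself was elided from this excerpt; given the `-c10` option and the `webserver.pcap` file referenced below, it had this shape (a reconstruction, not a verbatim quote):

```
$ sudo tcpdump -i any -c10 -nn -w webserver.pcap port 80
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
```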

As the example shows, nothing is printed to the screen while packets are being saved to the file, and `-c10` stops the capture after ten packets. If you want some feedback that packets are actually being captured, use the `-v` option.

`tcpdump` saves the packets in a binary file, so you cannot simply open it with a text editor. Use the `-r` option to read the packets back from the file:

```
$ tcpdump -nn -r webserver.pcap
[...]
$
```

You don't need `sudo` privileges here, because at this point you are reading a file rather than capturing on a live network interface.

You can also use any of the filter rules we've discussed on the contents of the file, just as you would with live data. For example, inspect the packets in the file that came from the source IP address `54.204.39.132`:

```
$ tcpdump -nn -r webserver.pcap src 54.204.39.132
reading from file webserver.pcap, link-type LINUX_SLL (Linux cooked)
[...]
```

### What's next?

These basics will get you going with `tcpdump`, a powerful packet-capture tool. For much more, see the [tcpdump website][3] and its [man pages][4].

The `tcpdump` command-line interface provides great flexibility for capturing and analyzing network traffic. If you need a graphical tool for capturing packets, have a look at [Wireshark][5].

Wireshark can also read pcap files saved by `tcpdump`, so you can use `tcpdump` to capture packets on a remote machine without a GUI and then analyze them in Wireshark.

--------------------------------------------------------------------------------

via: https://opensource.com/article/18/10/introduction-tcpdump

Author: [Ricardo Gerardi][a]
Selected by: [lujun9972][b]
Translated by: [jrg](https://github.com/jrglinux)
Proofread by: [wxy](https://github.com/wxy)

This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/).
How to analyze and explore the contents of Docker container images
======

![](https://www.ostechnix.com/wp-content/uploads/2018/10/dive-tool-720x340.png)

You may already know that a Docker container image is a lightweight, standalone, executable package that includes everything needed to run an application, which is why container images are so often used by developers to build and distribute applications. If you are curious about what exactly is inside a Docker image, this brief guide may help. Today, we will learn to analyze and explore the contents of each layer of a Docker image using a tool named **Dive**.

By analyzing a Docker image, we can discover files that may be duplicated across layers and remove them to reduce the size of the image. Dive is not just a Docker image analysis tool, it also helps us build images. Dive is a free and open source tool written in the Go programming language.

### Install Dive

First, download the latest version from the project's [releases page][1], then install it as shown below for the distribution you use.

If you are on **Debian** or **Ubuntu**, run the following commands to download and install it.

```
$ wget https://github.com/wagoodman/dive/releases/download/v0.0.8/dive_0.0.8_linux_amd64.deb
```

```
$ sudo apt install ./dive_0.0.8_linux_amd64.deb
```

**On RHEL or CentOS:**

```
$ wget https://github.com/wagoodman/dive/releases/download/v0.0.8/dive_0.0.8_linux_amd64.rpm
```

```
$ sudo rpm -i dive_0.0.8_linux_amd64.rpm
```

Dive can also be installed with the [Linuxbrew][2] package manager.

```
$ brew tap wagoodman/dive
```

```
$ brew install dive
```

### Analyze and explore the contents of a Docker image

To analyze a Docker image, simply run the `dive` command with the image's ID. You can get your Docker image IDs with the `sudo docker images` command.
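
For reference, this is the shape of that listing; the repository name, creation date, and size here are illustrative rather than taken from the article:

```
$ sudo docker images
REPOSITORY          TAG       IMAGE ID       CREATED        SIZE
ostechnix/example   latest    ea4c82dcd15a   2 weeks ago    737MB
```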
```
$ sudo dive ea4c82dcd15a
```

Here, `ea4c82dcd15a` is the ID of an image.

The `dive` command will then quickly analyze the given Docker image and display its contents in the terminal.

![](https://www.ostechnix.com/wp-content/uploads/2018/10/Dive-1.png)

As you can see in the screenshot above, the pane on the left lists the layers of the given Docker image along with details such as the amount of wasted space, while the pane on the right shows the contents of each layer. Use `Ctrl+Space` to switch between the left and right panes, and the `Up`/`Down` cursor keys to move through the directory tree.

Here is the list of `dive` keyboard shortcuts:

* `Ctrl+Space`: switch between the left and right panes
* `Space`: expand or collapse the directory tree
* `Ctrl+A`: file tree view: show or hide added files
* `Ctrl+R`: file tree view: show or hide removed files
* `Ctrl+M`: file tree view: show or hide modified files
* `Ctrl+U`: file tree view: show or hide unmodified files
* `Ctrl+L`: layer view: show changes in the current layer
* `Ctrl+A`: layer view: show aggregated changes
* `Ctrl+/`: filter files
* `Ctrl+C`: exit

In the example above I used `sudo`, because my Docker images are stored in the `/var/lib/docker/` directory. If your images live in your home directory (`$HOME`), or in some other directory not owned by `root`, you don't need `sudo`.

You can also build a Docker image and analyze it immediately with a single command:

```
$ dive build -t <some-tag>
```
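
`dive build` passes its arguments through to `docker build`, so a build context is still required. A hypothetical invocation, assuming a Dockerfile in the current directory and an illustrative tag:

```
$ dive build -t myimage:latest .
```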

--------------------------------------------------------------------------------

via: https://www.ostechnix.com/how-to-analyze-and-explore-the-contents-of-docker

Author: [SK][a]
Selected by: [lujun9972][b]
Translated by: [FSSlc](https://github.com/FSSlc)
Proofread by: [wxy](https://github.com/wxy)

This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/).

[b]: https://github.com/lujun9972
[1]: https://github.com/wagoodman/dive/releases
[2]: https://www.ostechnix.com/linuxbrew-common-package-manager-linux-mac-os-x/
[3]: https://github.com/wagoodman/dive
HackChow translating

5 alerting and visualization tools for sysadmins
======

These open source tools help users understand system behavior and output, and provide alerts for potential problems.

![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/metrics_data_dashboard_system_computer_analytics.png?itok=oxAeIEI-)

You probably know (or can guess) what alerting and visualization tools are used for. Why would we discuss them as observability tools, especially since some systems include visualization as a feature?

Observability comes from control theory and describes our ability to understand a system based on its inputs and outputs. This article focuses on the output component of observability.

Alerting and visualization tools analyze the outputs of other systems and provide structured representations of these outputs. Alerts are basically a synthesized understanding of negative system outputs, and visualizations are disambiguated structured representations that facilitate user comprehension.

### Common types of alerts and visualizations

#### Alerts

Let's first cover what alerts are _not_. Alerts should not be sent if the human responder can't do anything about the problem. This includes alerts that are sent to multiple individuals with only a few who can respond, or situations where every anomaly in the system triggers an alert. This leads to alert fatigue and receivers ignoring all alerts within a specific medium until the system escalates to a medium that isn't already saturated.

For example, if an operator receives hundreds of emails a day from the alerting system, that operator will soon ignore all emails from the alerting system. The operator will respond to a real incident only when he or she is experiencing the problem, emailed by a customer, or called by the boss. In this case, alerts have lost their meaning and usefulness.

Alerts are not a constant stream of information or a status update. They are meant to convey a problem from which the system can't automatically recover, and they are sent only to the individual most likely to be able to recover the system. Everything that falls outside this definition isn't an alert and will only damage your employees and company culture.

Everyone has a different set of alert types, so I won't discuss things like priority levels (P1-P5) or models that use words like "Informational," "Warning," and "Critical." Instead, I'll describe the generic categories emergent in complex systems' incident response.

You might have noticed I mentioned an "Informational" alert type right after I wrote that alerts shouldn't be informational. Well, not everyone agrees, but I don't consider something an alert if it isn't sent to anyone. It is a data point that many systems refer to as an alert. It represents some event that should be known but not responded to. It is generally part of the visualization system of the alerting tool and not an event that triggers actual notifications. Mike Julian covers this and other aspects of alerting in his book [Practical Monitoring][1]. It's a must-read for work in this area.

Non-informational alerts consist of types that can be responded to or require action. I group these into two categories: internal outage and external outage. (Most companies have more than two levels for prioritizing their response efforts.) Degraded system performance is considered an outage in this model, as the impact to each user is usually unknown.

Internal outages are a lower priority than external outages, but they still need to be responded to quickly. They often include internal systems that company employees use or components of applications that are visible only to company employees.

External outages consist of any system outage that would immediately impact a customer. These don't include a system outage that prevents releasing updates to the system. They do include customer-facing application failures, database outages, and networking partitions that hurt availability or consistency if either can impact a user. They also include outages of tools that may not have a direct impact on users, as the application continues to run but this transparent dependency impacts performance. This is common when the system uses some external service or data source that isn't necessary for full functionality but may cause delays as the application performs retries or handles errors from this external dependency.

#### Visualizations

There are many visualization types, and I won't cover them all here. It's a fascinating area of research. On the data analytics side of my career, learning and applying that knowledge is a constant challenge. We need to provide simple representations of complex system outputs for the widest dissemination of information. [Google Charts][2] and [Tableau][3] have a wide selection of visualization types. We'll cover the most common visualizations and some innovative solutions for quickly understanding systems.

#### Line chart

The line chart is probably the most common visualization. It does a pretty good job of producing an understanding of a system over time. A line chart in a metrics system would have a line for each unique metric or some aggregation of metrics. This can get confusing when there are a lot of metrics in the same dashboard (as shown below), but most systems can select specific metrics to view rather than having all of them visible. Also, anomalous behavior is easy to spot if it's significant enough to escape the noise of normal operations. Below we can see purple, yellow, and light blue lines that might indicate anomalous behavior.

![](https://opensource.com/sites/default/files/uploads/monitoring_guide_line_chart.png)

Another feature of a line chart is that you can often stack them to show relationships. For example, you might want to look at requests on each server individually, but also in aggregate. This allows you to understand the overall system as well as each instance in the same graph.

![](https://opensource.com/sites/default/files/uploads/monitoring_guide_line_chart_aggregate.png)

#### Heatmaps

Another common visualization is the heatmap. It is useful when looking at histograms. This type of visualization is similar to a bar chart but can show gradients within the bars representing the different percentiles of the overall metric. For example, suppose you're looking at request latencies and you want to quickly understand the overall trend as well as the distribution of all requests. A heatmap is great for this, and it can use color to disambiguate the quantity of each section with a quick glance.

The heatmap below shows the higher concentration around the centerline of the graph with an easy-to-understand visualization of the distribution vertically for each time bucket. We might want to review a couple of points in time where the distribution gets wide while the others are fairly tight, like at 14:00. This distribution might be a negative performance indicator.

![](https://opensource.com/sites/default/files/uploads/monitoring_guide_histogram.png)

#### Gauges

The last common visualization I'll cover here is the gauge, which helps users understand a single metric quickly. Gauges can represent a single metric, like your speedometer represents your driving speed or your gas gauge represents the amount of gas in your car. Similar to the gas gauge, most monitoring gauges clearly indicate what is good and what isn't. Often (as is shown below), good is represented by green, getting worse by orange, and "everything is breaking" by red. The middle row below shows traditional gauges.

![](https://opensource.com/sites/default/files/uploads/monitoring_guide_gauges.png)

This image shows more than just traditional gauges. The other gauges are single stat representations that are similar to the function of the classic gauge. They all use the same color scheme to quickly indicate system health with just a glance. Arguably, the bottom row is probably the best example of a gauge that allows you to glance at a dashboard and know that everything is healthy (or not). This type of visualization is usually what I put on a top-level dashboard. It offers a full, high-level understanding of system health in seconds.

#### Flame graphs

A less common visualization is the flame graph, introduced by [Netflix's Brendan Gregg][4] in 2011. It's not ideal for dashboarding or quickly observing high-level system concerns; it's normally seen when trying to understand a specific application problem. This visualization focuses on CPU and memory and the associated frames. The X-axis lists the frames alphabetically, and the Y-axis shows stack depth. Each rectangle is a stack frame and includes the function being called. The wider the rectangle, the more it appears in the stack. This method is invaluable when trying to diagnose system performance at the application level, and I urge everyone to give it a try.

![](https://opensource.com/sites/default/files/uploads/monitoring_guide_flame_graph_0.png)
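
The article doesn't show how such a graph is produced. As a hedged sketch, assuming Linux `perf` and Brendan Gregg's FlameGraph scripts (from https://github.com/brendangregg/FlameGraph) are available on the machine:

```
# Sample on-CPU stack traces system-wide at 99 Hz for 60 seconds
$ sudo perf record -F 99 -a -g -- sleep 60
$ sudo perf script > out.perf

# Fold the samples into one line per unique stack, then render the SVG
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > flamegraph.svg
```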

### Tool options

There are several commercial options for alerting, but since this is Opensource.com, I'll cover only systems that are being used at scale by real companies that you can use at no cost. Hopefully, you'll be able to contribute new and innovative features to make these systems even better.

### Alerting tools

#### Bosun

If you've ever done anything with computers and gotten stuck, the help you received was probably thanks to a Stack Exchange system. Stack Exchange runs many different websites around a crowdsourced question-and-answer model. [Stack Overflow][5] is very popular with developers, and [Super User][6] is popular with operations. However, there are now hundreds of sites ranging from parenting to sci-fi and philosophy to bicycles.

Stack Exchange open-sourced its alert management system, [Bosun][7], around the same time Prometheus and its [AlertManager][8] system were released. There were many similarities in the two systems, and that's a really good thing. Like Prometheus, Bosun is written in Golang. Bosun's scope is more extensive than Prometheus' as it can interact with systems beyond metrics aggregation. It can also ingest data from log and event aggregation systems. It supports Graphite, InfluxDB, OpenTSDB, and Elasticsearch.

Bosun's architecture consists of a single server binary, a backend like OpenTSDB, Redis, and [scollector agents][9]. The scollector agents automatically detect services on a host and report metrics for those processes and other system resources. This data is sent to a metrics backend. The Bosun server binary then queries the backends to determine if any alerts need to be fired. Bosun can also be used by tools like [Grafana][10] to query the underlying backends through one common interface. Redis is used to store state and metadata for Bosun.

A really neat feature of Bosun is that it lets you test your alerts against historical data. This was something I missed in Prometheus several years ago, when I had data for an issue I wanted alerts on but no easy way to test it. To make sure my alerts were working, I had to create and insert dummy data. This system alleviates that very time-consuming process.

Bosun also has the usual features like showing simple graphs and creating alerts. It has a powerful expression language for writing alerting rules. However, it only has email and HTTP notification configurations, which means connecting to Slack and other tools requires a bit more customization ([which its documentation covers][11]). Similar to Prometheus, Bosun can use templates for these notifications, which means they can look as awesome as you want them to. You can use all your HTML and CSS skills to create the baddest email alert anyone has ever seen.

#### Cabot

[Cabot][12] was created by a company called [Arachnys][13]. You may not know who Arachnys is or what it does, but you have probably felt its impact: It built the leading cloud-based solution for fighting financial crimes. That sounds pretty cool, right? At a previous company, I was involved in similar functions around ["know your customer"][14] laws. Most companies would consider it a very bad thing to be linked to a terrorist group, for example, funneling money through their systems. These solutions also help defend against less-atrocious offenders like fraudsters who could also pose a risk to the institution.

So why did Arachnys create Cabot? Well, it is kind of a Christmas present to everyone, as it was a Christmas project built because its developers couldn't wrap their heads around [Nagios][15]. And really, who can blame them? Cabot was written with Django and Bootstrap, so it should be easy for most to contribute to the project. (Another interesting factoid: The name comes from the creator's dog.)

The Cabot architecture is similar to Bosun in that it doesn't collect any data. Instead, it accesses data through the APIs of the tools it is alerting for. Therefore, Cabot uses a pull (rather than a push) model for alerting. It reaches out into each system's API and retrieves the information it needs to make a decision based on a specific check. Cabot stores the alerting data in a Postgres database and also has a cache using Redis.

Cabot natively supports [Graphite][16], but it also supports [Jenkins][17], which is rare in this area. [Arachnys][13] uses Jenkins like a centralized cron, but I like this idea of treating build failures like outages. Obviously, a build failure isn't as critical as a production outage, but it could still alert the team and escalate if the failure isn't resolved. Who actually checks Jenkins every time an email comes in about a build failure? Yeah, me too!

Another interesting feature is that Cabot can integrate with Google Calendar for on-call rotations. Cabot calls this feature Rota, which is a British term for a roster or rotation. This makes a lot of sense, and I wish other systems would take this idea further. Cabot doesn't support anything more complex than primary and backup personnel, but there is certainly room for additional features. The docs say if you want something more advanced, you should look at a commercial option.

#### StatsAgg

[StatsAgg][18]? How did that make the list? Well, it's not every day you come across a publishing company that has created an alerting platform. I think that deserves recognition. Of course, [Pearson][19] isn't just a publishing company anymore; it has several web presences and a joint venture with [O'Reilly Media][20]. However, I still think of it as the company that published my schoolbooks and tests.

StatsAgg isn't just an alerting platform; it's also a metrics aggregation platform. And it's kind of like a proxy for other systems. It supports Graphite, StatsD, InfluxDB, and OpenTSDB as inputs, but it can also forward those metrics to their respective platforms. This is an interesting concept, but potentially risky as loads increase on a central service. However, if the StatsAgg infrastructure is robust enough, it can still produce alerts even when a backend storage platform has an outage.

StatsAgg is written in Java and consists only of the main server and UI, which keeps complexity to a minimum. It can send alerts based on regular expression matching and is focused on alerting by service rather than host or instance. Its goal is to fill a void in the open source observability stack, and I think it does that quite well.

### Visualization tools

#### Grafana

Almost everyone knows about [Grafana][10], and many have used it. I have used it for years whenever I need a simple dashboard. The tool I used before was deprecated, and I was fairly distraught about that until Grafana made it okay. Grafana was gifted to us by Torkel Ödegaard. Like Cabot, Grafana was also created around Christmastime, and released in January 2014. It has come a long way in just a few years. It started life as a Kibana dashboarding system, and Torkel forked it into what became Grafana.

Grafana's sole focus is presenting monitoring data in a more usable and pleasing way. It can natively gather data from Graphite, Elasticsearch, OpenTSDB, Prometheus, and InfluxDB. There's an Enterprise version that uses plugins for more data sources, but there's no reason those other data source plugins couldn't be created as open source, as the Grafana plugin ecosystem already offers many other data sources.

What does Grafana do for me? It provides a central location for understanding my system. It is web-based, so anyone can access the information, although it can be restricted using different authentication methods. Grafana can provide knowledge at a glance using many different types of visualizations. However, it has started integrating alerting and other features that aren't traditionally combined with visualizations.

Now you can set alerts visually. That means you can look at a graph, maybe even one showing where an alert should have triggered due to some degradation of the system, click on the graph where you want the alert to trigger, and then tell Grafana where to send the alert. That's a pretty powerful addition that won't necessarily replace an alerting platform, but it can certainly help augment it by providing a different perspective on alerting criteria.

Grafana has also introduced more collaboration features. Users have been able to share dashboards for a long time, meaning you don't have to create your own dashboard for your [Kubernetes][21] cluster because there are several already available, with some maintained by Kubernetes developers and others by Grafana developers.

The most significant addition around collaboration is annotations. Annotations allow a user to add context to part of a graph. Other users can then use this context to understand the system better. This is an invaluable tool when a team is in the middle of an incident and communication and common understanding are critical. Having all the information right where you're already looking makes it much more likely that knowledge will be shared across the team quickly. It's also a nice feature to use during blameless postmortems when the team is trying to understand how the failure occurred and learn more about their system.
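
Annotations can also be pushed programmatically. A hedged sketch against Grafana's HTTP annotations API, where the host, token variable, and dashboard ID are illustrative:

```
$ curl -X POST http://localhost:3000/api/annotations \
    -H "Authorization: Bearer $GRAFANA_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"dashboardId": 1, "text": "Deployed v2.1, watching error rates", "tags": ["deploy"]}'
```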
#### Vizceral

Netflix created [Vizceral][22] to understand its traffic patterns better when performing a traffic failover. Unlike Grafana, which is a more general tool, Vizceral serves a very specific use case. Netflix no longer uses this tool internally and says it is no longer actively maintained, but it still updates the tool periodically. I highlight it here primarily to point out an interesting visualization mechanism and how it can help solve a problem. It's worth running it in a demo environment just to better grasp the concepts and witness what's possible with these systems.

--------------------------------------------------------------------------------

via: https://opensource.com/article/18/10/alerting-and-visualization-tools-sysadmins

Author: [Dan Barker][a]
Selected by: [lujun9972][b]
Translated by: [译者ID](https://github.com/译者ID)
Proofread by: [校对者ID](https://github.com/校对者ID)

This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/).

[a]: https://opensource.com/users/barkerd427
[b]: https://github.com/lujun9972
[1]: https://www.practicalmonitoring.com/
[2]: https://developers.google.com/chart/interactive/docs/gallery
[3]: https://libguides.libraries.claremont.edu/c.php?g=474417&p=3286401
[4]: http://www.brendangregg.com/flamegraphs.html
[5]: https://stackoverflow.com/
[6]: https://superuser.com/
[7]: http://bosun.org/
[8]: https://prometheus.io/docs/alerting/alertmanager/
[9]: https://bosun.org/scollector/
[10]: https://grafana.com/
[11]: https://bosun.org/notifications
[12]: https://cabotapp.com/
[13]: https://www.arachnys.com/
[14]: https://en.wikipedia.org/wiki/Know_your_customer
[15]: https://www.nagios.org/
[16]: https://graphiteapp.org/
[17]: https://jenkins.io/
[18]: https://github.com/PearsonEducation/StatsAgg
[19]: https://www.pearson.com/us/
[20]: https://www.oreilly.com/
[21]: https://opensource.com/resources/what-is-kubernetes
[22]: https://github.com/Netflix/vizceral
zianglei translating

How Do We Find Out Which Repository the Installed Packages Came From?
======
Sometimes you might want to know which repository an installed package came from. This helps you troubleshoot when you are facing a package conflict issue.
5 alerting and visualization tools for sysadmins
======

These open source tools help users understand their systems' behavior through their output, and raise alerts for potential problems.

![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/metrics_data_dashboard_system_computer_analytics.png?itok=oxAeIEI-)

You probably already know what alerting and visualization tools are used for. Here we'll look at why they are discussed as observability tools, even though some systems treat visualization as a special built-in feature.

The concept of observability comes from control theory; it describes our ability to understand a system from its inputs and outputs. This article focuses on the output component of observability.

Alerting and visualization tools analyze the output of other systems and give that output structure. An alert is essentially a description of an abnormal system state, and a visualization is a structured representation that users can understand intuitively.

### Common types of alerts and visualizations

#### Alerts

First, let's be clear about what an alert means. An alert should not be sent if a human cannot respond to its content. That includes alerts sent to many people when only a few of them can respond, and alerts triggered by every single anomaly in the system. Those produce alert fatigue, and recipients tend to start ignoring the flood of alerts.

For example, if an administrator receives hundreds of alert emails from the alerting system every day, it becomes easy to ignore all of the system's mail. Only when a problem actually occurs, and a customer or a superior comes asking about it, does the administrator take the alert information seriously again. At that point, the alerts have lost their original meaning and purpose.

An alert is not a continuous stream of information or a status update. The purpose of an alert is to expose a problem the system cannot recover from automatically, and it should go only to the people most likely to be able to resolve it. Anything beyond this definition should not be an alert, or it will harm your staff and your company culture.

Every alerting system has its own alert types, so instead of generalizing with priority levels (P1-P5) or labels like "informational", "warning", and "critical", I'll use some generic categories that emerge when describing incidents in complex systems.

I just mentioned an "informational" alert type, but an alert really shouldn't be mere information, although some people may disagree. In my view, if an alert isn't sent to anyone, it is not an alert; it is just a data point that many systems happen to call an alert, representing some event that should be known but not responded to. It belongs in the visualization layer of the alerting tool rather than among the events that trigger actual notifications. [Practical Monitoring][1] is a must-read in this field, and its author, Mike Julian, lays out his own views on alerting in the book.

Non-informational alerts need to be responded to, and acted on. I group them broadly into two types, internal failures and external failures, and most companies use more than two levels to prioritize their responses. Degraded system performance counts as a failure, since its impact on each user is usually unknown.

Internal failures are lower priority than external failures, but they still need a quick response. They usually involve internal systems used by company employees, or application components visible only to employees.

External failures include any system failure with business impact, excluding failures that only affect releasing updates to the system. They generally cover client-facing application failures, database failures, and network failures that break the system's availability or consistency, all of which affect users' normal usage. Failures of dependent components that don't impact users directly are external failures too: as the application keeps running, a failed dependency can still drag down its performance. This is common for systems that use some external service or data source that isn't needed for the core functionality, but that can cause noticeable delays while the application retries or handles errors from that dependency.

#### Visualizations

There are many kinds of visualizations, and I won't go through them one by one. It is a fascinating research area, and through my years of data analytics work, learning and applying knowledge of visualization has been quite a challenge. We need to present complex system output in an intuitive way for the information to spread effectively. [Google Charts][2] and [Tableau][3] both offer many visualization tools. Below are some of the most common visualizations, along with some innovative solutions for understanding systems quickly.

#### Line chart

The line chart is probably the most common visualization; it lets users intuitively follow the system along the time dimension. Each distinct metric in the system appears as a separate line in the chart. When many lines share one chart, reading it can suffer (as shown below), so most of the time you can choose to view just a few of the lines rather than show them all at once. If a metric fluctuates beyond its normal range, it is easy to spot, like the anomalous purple, yellow, and light blue lines in the chart below.

![](https://opensource.com/sites/default/files/uploads/monitoring_guide_line_chart.png)

Another use of line charts is stacking several of them to show relationships. For example, when charting the number of requests to your servers, you can show each server's requests separately, or combine the data from all the servers. This lets you flexibly observe the whole system, as well as each instance, in the same graph.

![](https://opensource.com/sites/default/files/uploads/monitoring_guide_line_chart_aggregate.png)

#### Heatmaps

Another common visualization is the heatmap. Heatmaps are fairly similar to bar charts, but on top of a bar chart they can show how the parts of each bar contribute to the whole over time. For example, when looking at network request latency, a heatmap gives you a quick view of the overall trend and distribution of all requests, and it can use colors to mark the magnitude of each section.

In the heatmap below, the distribution of blocks in the vertical direction for each time bucket clearly shows that most of the data is concentrated around the center of the range. We can also see that while most time buckets are distributed fairly tightly, the distribution spreads out wide in places such as 14:00 to 15:00, and a distribution like that may signal an unhealthy state.

![](https://opensource.com/sites/default/files/uploads/monitoring_guide_histogram.png)

#### Gauges

Another common visualization is the gauge, which helps users grasp a single metric quickly. Gauges are generally used to display one metric, the way a speedometer represents a car's speed or a fuel gauge represents the gasoline in the tank. Most gauges have one thing in common: they mark out the states the metric being shown can be in. As shown below, green indicates a normal state, orange a poor one, and red a very bad one. The middle row mimics how physical gauges display their readings.

![](https://opensource.com/sites/default/files/uploads/monitoring_guide_gauges.png)

Besides the conventional gauge style, the chart above includes more direct single-stat displays. With the same color scheme, the state of each metric is visible at a glance, much like a gauge. The bottom row may be the best display style of all: without reading carefully, a user can get a rough sense of the state of every metric. This is the type of visualization I use the most; within a few seconds it gives me a comprehensive overview of how the system is running in every respect.

#### Flame graphs

The flame graph, first used by [Netflix's Brendan Gregg][4] in 2011, is a less common visualization. Unlike a gauge, it is not for picking key information off a chart quickly; it usually comes out only when you need to troubleshoot a specific application problem. Flame graphs mainly represent CPU, memory, and the associated frames: the X axis lists the frames in alphabetical order, while the Y axis represents stack depth. Each rectangle in the graph is a stack frame labeled with the function being called, and the wider a rectangle, the more frequently it appears in the stacks. Flame graphs can play a large role when analyzing system performance problems; do give them a try.

![](https://opensource.com/sites/default/files/uploads/monitoring_guide_flame_graph_0.png)

### Tool options

There are several quite good commercial alerting tools, but since this is an article about open source technology, I will only cover free tools that are already widely used. I hope you will contribute your own code to make these tools even better.

### Alerting tools

#### Bosun

If you have ever had a computer problem and found the fix online, you probably have a Stack Exchange site to thank. Stack Exchange runs many different websites on a crowdsourced question-and-answer model, among them [Stack Overflow][5], hugely popular with developers, and [Super User][6], well known in operations. Beyond those, Stack Exchange covers everything from parenting to science fiction and from philosophy to bicycles.

Stack Exchange open-sourced its alert management system, [Bosun][7], around the same time Prometheus and its [AlertManager][8] were released, and the two systems have a lot in common. Like Prometheus, Bosun is developed in Golang, but Bosun is broader in scope, because it can interact with systems through more than metrics aggregation. Bosun can also pull data from log collection systems, and it supports Graphite, InfluxDB, OpenTSDB, and Elasticsearch.

Bosun's architecture includes a single server binary, a backend such as OpenTSDB, Redis, and the [scollector agents][9]. The scollector agents automatically detect the services running on a host and report metrics for those processes and other system resources. That data is sent to the backend. The Bosun server binary then queries the backend to determine whether any alerts need to fire. Tools such as [Grafana][10] can also query Bosun's underlying backends through one common interface. Redis is used to store Bosun's state and metadata.

Bosun has a very clever feature: it can test alerts against historical data. This is something I badly needed when using Prometheus a few years ago: I had anomalous data I wanted to alert on, but no simple way to test the rule. To make sure my alerts would actually fire, I had to fabricate matching data. Bosun shortens that very time-consuming step considerably.

Bosun also covers all the commonly used features, including simple graphing and alert creation, and it has a powerful expression language for writing alerting rules. By default, though, it only ships with email and HTTP notification configurations, so connecting it to Slack or other tools takes a greater degree of customization. Similar to Prometheus, Bosun can use templates for notifications, so you can use your HTML and CSS skills to build the email notifications you need.

#### Cabot

[Cabot][12] was developed by a company called [Arachnys][13]. You may not know Arachnys, but it is influential: Arachnys built a leading cloud-based solution for fighting financial crime. At a previous company, I was involved in similar work around "[know your customer][14]" laws. Most companies consider a link to, say, a terrorist group funneling money through their systems to be a very bad thing. These solutions also help defend against fraud-type crimes, which, although less severe, still put the institution at risk.

Why did Arachnys develop Cabot? Simply because its developers couldn't wrap their heads around [Nagios][15]. Cabot's arrival was good news for a lot of people. It is built with Django and Bootstrap, so the bar for contributing to the project is not high. (Another tidbit: the name Cabot comes from the developer's dog.)

Like Bosun, Cabot does not collect any data itself; it uses the data exposed by the APIs of the things it monitors. So Cabot's alerting model is pull rather than push. It reaches out to each monitored system's API, retrieves the data it needs for a specific check, caches the alerting data in Redis, and persists it in a Postgres database.

A fairly rare Cabot trait is that it natively supports [Graphite][16] while also supporting [Jenkins][17]. Jenkins is treated here as a centralized cron, with build failures handled the way outages are. A build failure is of course less urgent than a system outage, but the team still needs to act on it; after all, not everyone goes to check Jenkins personally every time a build-failure email arrives.

Another interesting feature is that Cabot can hook into Google Calendar to schedule on-call staff. This feature, called Rota, is very useful, and I hope other alerting systems adopt something like it. Cabot currently only supports scheduling a primary and a backup contact, so there is room to improve. Its own documentation notes that if you want something more advanced, you should consider a commercial solution.

#### StatsAgg

It is remarkably rare for a publishing company like [Pearson][19] to develop an alerting platform such as [StatsAgg][18], and that deserves respect. Beyond publishing, Pearson operates several other websites and a joint venture with [O'Reilly Media][20], but I still think of it as the company that published my textbooks and tests.

StatsAgg is not only an alerting platform but also a metrics aggregation platform, and it even acts somewhat like a proxy for other systems. It accepts input via Graphite, StatsD, InfluxDB, and OpenTSDB, and it can also forward those metrics on to their respective platforms. This is an interesting concept, though the risk grows as load on a central service increases. Even so, if the StatsAgg infrastructure is robust enough, it can keep producing alerts even when a backend storage platform has an outage.

StatsAgg is written in Java and consists of just the main server and a UI, to keep complexity as low as possible. It can send alerts based on regular-expression matching, and it focuses on alerting by service rather than by server or instance. It aims to fill a gap in the open source monitoring stack, and I think it does that quite well.

### Visualization tools

#### Grafana

[Grafana][10] is very well known and very widely adopted. Whenever I need a dashboard, I think of it first, because it is better than any comparable product I have used. Grafana was developed by Torkel Ödegaard over a Christmas break and released in January 2014, and in just a few years it has come a very long way. Grafana grew out of Kibana: Torkel forked it into a new branch and named it Grafana.

Grafana emphasizes both practicality and the visual appeal of the data it presents. It can natively collect data from Graphite, Elasticsearch, OpenTSDB, Prometheus, and InfluxDB. A commercial Grafana plugin can pull data from more sources, and although that plugin is not open source, the variety of data sources already provided by Grafana's ecosystem is plenty.

Grafana offers a platform that brings all kinds of system data together in one place. It presents data over the web, so anyone has a chance to access the information, which means access needs to be restricted with appropriate authentication methods. Grafana also supports many types of visualization, including integrated alerting visualization features.

Now you can configure alerts more intuitively. With Grafana, you can look at a graph, set the threshold at which degraded system performance should trigger an alert, and tell Grafana where to send the alert. That is a very powerful addition to any alerting setup. It won't necessarily replace an alerting platform, but alerting systems are sure to draw inspiration and development from it.

Grafana has also introduced many team collaboration features. Users can share dashboards with each other, so you no longer need to build a dedicated dashboard for your [Kubernetes][21] cluster: several dashboards maintained by Kubernetes developers and Grafana developers are available to plug right in.

An important collaboration feature is annotations. Annotations let users attach context to a graph, and other users can rely on that context to understand the graph more directly. This matters when a team is in the middle of an incident and needs communication and shared understanding: putting all the relevant information right where everyone is already looking helps the whole team reach consensus quickly. It is also valuable when the team is investigating the cause of a failure and working out accountability for an incident.

#### Vizceral

[Vizceral][22] was developed by Netflix to understand its traffic patterns more effectively during failovers. Grafana is a more general-purpose tool, while Vizceral targets certain specific scenarios. Although Netflix has said it no longer uses Vizceral internally and no longer actively maintains it, the tool still receives periodic updates. I highlight it here mainly for its visualization mechanism and how that can assist in troubleshooting. You can run it in a demo environment to get a better grasp of what this class of systems can do.

--------------------------------------------------------------------------------

via: https://opensource.com/article/18/10/alerting-and-visualization-tools-sysadmins

Author: [Dan Barker][a]
Selected by: [lujun9972][b]
Translated by: [HankChow](https://github.com/HankChow)
Proofread by: [校对者ID](https://github.com/校对者ID)

This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/).

[a]: https://opensource.com/users/barkerd427
[b]: https://github.com/lujun9972
[1]: https://www.practicalmonitoring.com/
[2]: https://developers.google.com/chart/interactive/docs/gallery
[3]: https://libguides.libraries.claremont.edu/c.php?g=474417&p=3286401
[4]: http://www.brendangregg.com/flamegraphs.html
[5]: https://stackoverflow.com/
[6]: https://superuser.com/
[7]: http://bosun.org/
[8]: https://prometheus.io/docs/alerting/alertmanager/
[9]: https://bosun.org/scollector/
[10]: https://grafana.com/
[11]: https://bosun.org/notifications
[12]: https://cabotapp.com/
[13]: https://www.arachnys.com/
[14]: https://en.wikipedia.org/wiki/Know_your_customer
[15]: https://www.nagios.org/
[16]: https://graphiteapp.org/
[17]: https://jenkins.io/
[18]: https://github.com/PearsonEducation/StatsAgg
[19]: https://www.pearson.com/us/
[20]: https://www.oreilly.com/
[21]: https://opensource.com/resources/what-is-kubernetes
[22]: https://github.com/Netflix/vizceral