[Translated] 20141014 How to monitor and troubleshoot a Linux server with sysdig.md

GOLinux 2014-10-17 15:10:18 +08:00
parent e024c326b7
commit 27d6e11a88
2 changed files with 150 additions and 150 deletions


@ -1,150 +0,0 @@
Translating by GOLinux!
How to monitor and troubleshoot a Linux server using sysdig
================================================================================
What is the first thing that comes to mind when you need to track system calls made and received by a process? You'll probably think of strace, and you are right. What tool would you use to monitor raw network traffic from the command line? If you thought about tcpdump, you made an excellent choice again. And if you ever need to keep track of open files (in the Unix sense of the word: everything is a file), chances are you'll use lsof.
strace, tcpdump, and lsof are indeed great utilities that should be part of every sysadmin's toolset, and that is precisely the reason why you will love [sysdig][1], a powerful open source tool for system-level exploration and troubleshooting, introduced by its creators as "strace + tcpdump + lsof + awesome sauce with a little Lua cherry on top." Humor aside, one of the great features of sysdig resides in its ability not only to analyze the "live" state of a Linux system, but also to save the state in a dump file for offline inspection. What's more, you can customize sysdig's behavior or even enhance its capabilities by using built-in (or writing your own) small scripts called chisels. Individual chisels are used to analyze sysdig-captured event streams in various script-specific fashions.
In this tutorial we'll explore the installation and basic usage of sysdig to perform system monitoring and troubleshooting on Linux.
### Installing Sysdig ###
For the sake of simplicity, brevity, and distribution independence, this tutorial uses the automatic installation process described on the official website. The installation script detects the operating system and installs all the necessary dependencies.
Run the following command as root to install sysdig from the official apt/yum repository:
# curl -s https://s3.amazonaws.com/download.draios.com/stable/install-sysdig | bash
![](https://farm6.staticflickr.com/5616/15334678580_99c5d38328_z.jpg)
Once the installation is complete, we can invoke sysdig as follows to get a feel for it:
# sysdig
Our screen will be immediately filled with all that is going on in our system, not allowing us to do much more with that information. For that reason, we will run:
# sysdig -cl | less
to see a list of available chisels.
![](https://farm4.staticflickr.com/3938/15334678610_e5956a26e8_z.jpg)
The following categories are available by default, each of which is populated by multiple built-in chisels.
- CPU Usage
- Errors
- I/O
- Logs
- Misc
- Net
- Performance
- Security
- System State
To display information (including detailed command-line usage) on a particular chisel, run:
# sysdig -i [chisel_name]
For example, we can check information about the spy_port chisel under the "Net" category by running:
# sysdig -i spy_port
![](https://farm4.staticflickr.com/3945/15521424095_0365bf20c3_z.jpg)
Chisels can be combined with filters (which can be applied to either live data or a trace file) to obtain more useful output.
Filters follow a "class.field" structure. For example:
- **fd.cip**: client IP address.
- **evt.dir**: event direction can be either '>' for enter events or '<' for exit events.
The complete filter list can be displayed with:
# sysdig -l
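Several filter conditions can be combined into one expression with the `and` operator. The following is a minimal sketch of that idea, assuming a small helper named `join_filter` (a hypothetical name, not part of sysdig) that only builds the filter string. The fields `fd.cip` and `proc.name` are the ones used elsewhere in this tutorial.

```shell
# join_filter COND...: join filter conditions with "and".
# (join_filter is a hypothetical helper, not something sysdig ships.)
join_filter() {
    out=""
    for cond in "$@"; do
        if [ -z "$out" ]; then
            out="$cond"
        else
            out="$out and $cond"
        fi
    done
    printf '%s\n' "$out"
}

# Example: one client IP address and one process name.
filter=$(join_filter "fd.cip=192.168.0.100" "proc.name=apache2")
echo "$filter"

# The resulting string is passed to sysdig as a single quoted argument:
#   sysdig "$filter"
```

Passing the whole expression as one quoted argument keeps the shell from splitting it, which matters for conditions that contain spaces, such as `fd.name contains /home/`.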
In the rest of the tutorial, I will demonstrate several use cases of sysdig.
### Sysdig Example: Troubleshooting Server Performance ###
Suppose your server is experiencing performance issues (e.g., unresponsiveness or significant delays in responding). You can use the bottlenecks chisel to display a list of the 10 slowest system calls at the moment.
Use the following command to check up on a live server in real time. The "-c" flag followed by a chisel name tells sysdig to run the specified chisel.
# sysdig -c bottlenecks
Alternatively, you can conduct a server performance analysis offline. In that case, you can save a complete sysdig trace to a file, and run the bottlenecks chisel against the trace as follows.
First, save a sysdig trace (use Ctrl+c to stop the collection):
# sysdig -w trace.scap
Once the trace is collected, you can check the slowest system calls that were performed during the capture interval by running:
# sysdig -r trace.scap -c bottlenecks
![](https://farm6.staticflickr.com/5611/15334678670_ebbe93265e_z.jpg)
Pay attention to columns #2, #3, and #4, which indicate execution time, process name, and PID, respectively.
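For an offline analysis it can be convenient to record for a fixed interval rather than stopping the capture by hand. The sketch below is one possible approach, assuming the `timeout` utility from GNU coreutils is available; `capture_for` and the `DRY_RUN` variable are hypothetical names introduced only for this illustration.

```shell
# capture_for SECONDS FILE: record a sysdig trace for a fixed time.
# capture_for and DRY_RUN are hypothetical; timeout(1) is from GNU coreutils.
# With DRY_RUN=1 the command is printed instead of executed.
capture_for() {
    secs=$1
    file=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "timeout -s INT $secs sysdig -w $file"
    else
        # SIGINT is the signal Ctrl+c sends, so sysdig stops as it
        # would when interrupted interactively.
        timeout -s INT "$secs" sysdig -w "$file"
    fi
}

# Show the command that a 30-second capture would run.
DRY_RUN=1 capture_for 30 trace.scap
```

The saved file can then be analyzed with `sysdig -r trace.scap -c bottlenecks`, exactly as described above.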
### Sysdig Example: Monitoring Interactive User Activities ###
Suppose you, as a sysadmin, want to monitor interactive user activities in a system (e.g., what commands a user typed on the command line, and what directories the user went to). That is when the spy_users chisel comes in handy.
Let's first collect a sysdig trace with a couple of extra options.
# sysdig -s 4096 -z -w /mnt/sysdig/$(hostname).scap.gz
- "-s 4096" tells sysdig to capture up to 4096 bytes of each event.
- "-z" (used with "-w") enables compression for a trace file.
- "-w <trace-file>" saves sysdig traces to a specified file.
In the above, we customize the name of the compressed trace file on a per-host basis. Remember that you can interrupt the execution of sysdig at any moment by pressing Ctrl + c.
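The per-host naming scheme can be wrapped in a small function so that the target directory is created before the capture starts. This is only a sketch: `trace_path` is a hypothetical helper, and the scratch directory used in the example is an assumption standing in for `/mnt/sysdig`.

```shell
# trace_path DIR: print a per-host compressed trace file name under DIR,
# creating DIR first if it does not exist.
# (trace_path is a hypothetical helper, not part of sysdig.)
trace_path() {
    dir=$1
    mkdir -p "$dir" || return 1
    printf '%s/%s.scap.gz\n' "$dir" "$(hostname)"
}

# A scratch directory is used here; the tutorial itself uses /mnt/sysdig.
out=$(trace_path "${TMPDIR:-/tmp}/sysdig-traces")
echo "$out"

# The capture from the tutorial would then become:
#   sysdig -s 4096 -z -w "$out"
```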
Once we've collected a reasonable amount of data, we can view interactive activities of every user in a system by running:
# sysdig -r /mnt/sysdig/debian.scap.gz -c spy_users
![](https://farm4.staticflickr.com/3932/15518254291_5c9671ca41_z.jpg)
The first column in the above output indicates the PID of the process associated with a given user's activity.
What if you want to target a specific user, and monitor the user's activities only? You can filter the results of the spy_users chisel by username:
# sysdig -r /mnt/sysdig/debian.scap.gz -c spy_users "user.name=xmodulo"
![](https://farm4.staticflickr.com/3944/15498248556_66d15422b1_z.jpg)
### Sysdig Example: Monitoring File I/O ###
We can customize the output format of sysdig traces with "-p" flag, and indicate desired fields (e.g., user name, process name, and file or socket name) enclosed inside double quotes. In this example, we will create a trace file that will only contain writing events in home directories (which we can inspect later with "sysdig -r writetrace.scap.gz").
# sysdig -p "%user.name %proc.name %fd.name" "evt.type=write and fd.name contains /home/" -z -w writetrace.scap.gz
![](https://farm6.staticflickr.com/5610/15498248586_de5f5fc93d_z.jpg)
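A "-p" format string is a list of field names, each prefixed with "%". As a small illustration, the sketch below builds such a string from plain field names; `fmt_fields` is a hypothetical helper introduced here, not a sysdig feature.

```shell
# fmt_fields FIELD...: turn field names into a sysdig "-p" format string.
# (fmt_fields is a hypothetical helper, not part of sysdig.)
fmt_fields() {
    fmt=""
    for f in "$@"; do
        # Prefix each field with "%" and separate fields with one space.
        fmt="${fmt:+$fmt }%$f"
    done
    printf '%s\n' "$fmt"
}

format=$(fmt_fields user.name proc.name fd.name)
echo "$format"

# The same format can be applied when reading the saved trace back:
#   sysdig -r writetrace.scap.gz -p "$format"
```

Applying the format at read time is likely the more useful place for it, since the format controls what is printed on screen.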
### Sysdig Example: Monitoring Network I/O ###
As part of server troubleshooting, you may want to snoop on network traffic, which is typically done with tcpdump. With sysdig, traffic sniffing can be done just as easily, and in a more user-friendly way.
For example, you can inspect data (in ASCII) that has been exchanged with a particular IP address, served by a particular process (e.g., apache2). Both conditions go into a single filter expression, joined with "and" and placed at the end of the command line:
# sysdig -s 4096 -A -r /mnt/sysdig/debian.scap.gz -c echo_fds "fd.cip=192.168.0.100 and proc.name=apache2"
If you want to see the raw data as a hex dump (with ASCII alongside) instead, replace "-A" with "-X":
# sysdig -s 4096 -X -r /mnt/sysdig/debian.scap.gz -c echo_fds "fd.cip=192.168.0.100 and proc.name=apache2"
For more information, examples, and case studies, you can check out the [project website][1]. Believe me, the possibilities are limitless. But don't just take my word for it. Install sysdig and start digging today!
--------------------------------------------------------------------------------
via: http://xmodulo.com/monitor-troubleshoot-linux-server-sysdig.html
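Author: [Gabriel Cánepa][a]
Translator: [Translator ID](https://github.com/译者ID)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国 (Linux China)](http://linux.cn/)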
[a]:http://xmodulo.com/author/gabriel
[1]:http://www.sysdig.org/


@ -0,0 +1,150 @@
sysdig: a powerful tool for monitoring and troubleshooting Linux servers
================================================================================
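What first comes to mind when you need to trace the system calls made and received by a process? You will probably think of strace, and you would be right. Which command-line tool would you use to monitor raw network traffic? If you thought of tcpdump, you have made another excellent choice. And if you ever need to keep track of open files (in the Unix sense, where everything is a file), chances are you will use lsof.
strace, tcpdump, and lsof are indeed great tools that belong in every system administrator's toolbox, and that is precisely why you will love [sysdig][1]. It is a powerful open source tool for system-level exploration and troubleshooting, introduced by its creators as "strace + tcpdump + lsof + awesome sauce with a little Lua cherry on top." Humor aside, one of sysdig's best features is that it can not only analyze the "live" state of a Linux system, but also save that state to a dump file for offline inspection. What's more, you can customize sysdig's behavior, or even extend its capabilities, with built-in small scripts called chisels (you can also write your own). Individual chisels analyze the event streams captured by sysdig in various script-specific ways.
In this tutorial we will explore the installation and basic usage of sysdig for system monitoring and troubleshooting on Linux.
### Installing Sysdig ###
For the sake of simplicity, brevity, and distribution independence, this tutorial uses the automatic installation process provided on the official website. The installation script detects the operating system and installs the necessary dependencies.
Run the following command as root to install sysdig from the official apt/yum repository: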
# curl -s https://s3.amazonaws.com/download.draios.com/stable/install-sysdig | bash
![](https://farm6.staticflickr.com/5616/15334678580_99c5d38328_z.jpg)
Once the installation is complete, we can invoke sysdig as follows to get a feel for it:
# sysdig
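The screen will immediately fill with everything that is happening on the system, which leaves us unable to do much with the information. To go further, we can run: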
# sysdig -cl | less
to see the list of available chisels.
![](https://farm4.staticflickr.com/3938/15334678610_e5956a26e8_z.jpg)
The following categories are available by default, each of which is populated by multiple built-in chisels.
- CPU Usage
- Errors
- I/O
- Logs
- Misc
- Net
- Performance
- Security
- System State
To display information about a particular chisel (including detailed command-line usage), run:
# sysdig -i [chisel_name]
For example, we can check information about the spy_port chisel under the "Net" category:
# sysdig -i spy_port
![](https://farm4.staticflickr.com/3945/15521424095_0365bf20c3_z.jpg)
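Chisels can be combined with filters (which can be applied to either live data or a trace file) to obtain more useful output.
Filters follow a "class.field" structure. For example:
- **fd.cip**: client IP address.
- **evt.dir**: event direction, either '>' for enter events or '<' for exit events.
The complete filter list can be displayed with: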
# sysdig -l
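In the rest of the tutorial, I will demonstrate several use cases of sysdig.
### Sysdig Example: Troubleshooting Server Performance ###
Suppose your server is experiencing performance issues (e.g., unresponsiveness or significant delays in responding). You can use the bottlenecks chisel to display a list of the 10 slowest system calls at the moment.
Use the following command to check a live server in real time. The "-c" flag followed by a chisel name tells sysdig to run the specified chisel.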
# sysdig -c bottlenecks
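Alternatively, you can analyze server performance offline. In that case, save a complete sysdig trace to a file, and then run the bottlenecks chisel against the trace as shown below.
First, save a sysdig trace (use Ctrl+c to stop the collection):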
# sysdig -w trace.scap
Once the trace is collected, you can check the slowest system calls performed during the capture interval by running:
# sysdig -r trace.scap -c bottlenecks
![](https://farm6.staticflickr.com/5611/15334678670_ebbe93265e_z.jpg)
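Pay attention to columns #2, #3, and #4, which indicate execution time, process name, and PID, respectively.
### Sysdig Example: Monitoring Interactive User Activities ###
Suppose you, as a system administrator, want to monitor interactive user activities in a system (e.g., what commands a user typed on the command line, and what directories the user went to). That is when the spy_users chisel comes in handy.
Let's first collect a sysdig trace with a couple of extra options.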
# sysdig -s 4096 -z -w /mnt/sysdig/$(hostname).scap.gz
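- "-s 4096" tells sysdig to capture up to 4096 bytes of each event.
- "-z" (used with "-w") enables compression for the trace file.
- "-w <trace-file>" saves the sysdig trace to the specified file.
In the example above, the compressed trace file is named after the host it was captured on. Remember that you can interrupt sysdig at any moment by pressing Ctrl+c.
Once we have collected a reasonable amount of data, we can view the interactive activities of every user in the system by running: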
# sysdig -r /mnt/sysdig/debian.scap.gz -c spy_users
![](https://farm4.staticflickr.com/3932/15518254291_5c9671ca41_z.jpg)
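The first column in the above output indicates the PID of the process associated with a given user's activity.
What if you want to target a specific user and monitor only that user's activities? You can filter the results of the spy_users chisel by username: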
# sysdig -r /mnt/sysdig/debian.scap.gz -c spy_users "user.name=xmodulo"
![](https://farm4.staticflickr.com/3944/15498248556_66d15422b1_z.jpg)
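### Sysdig Example: Monitoring File I/O ###
We can customize the output format of sysdig traces with the "-p" flag, listing the desired fields (e.g., user name, process name, and file or socket name) inside double quotes. In this example, we will create a trace file that contains only write events in home directories (which we can inspect later with "sysdig -r writetrace.scap.gz").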
# sysdig -p "%user.name %proc.name %fd.name" "evt.type=write and fd.name contains /home/" -z -w writetrace.scap.gz
![](https://farm6.staticflickr.com/5610/15498248586_de5f5fc93d_z.jpg)
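### Sysdig Example: Monitoring Network I/O ###
As part of server troubleshooting, you may want to snoop on network traffic, which is typically done with tcpdump. With sysdig, traffic sniffing can be done just as easily, and in a more user-friendly way.
For example, you can inspect data (in ASCII) that has been exchanged with a particular IP address, served by a particular process (e.g., apache2). Both conditions go into a single filter expression, joined with "and" and placed at the end of the command line:
# sysdig -s 4096 -A -r /mnt/sysdig/debian.scap.gz -c echo_fds "fd.cip=192.168.0.100 and proc.name=apache2"
If you want to see the raw data as a hex dump (with ASCII alongside) instead, replace "-A" with "-X":
# sysdig -s 4096 -X -r /mnt/sysdig/debian.scap.gz -c echo_fds "fd.cip=192.168.0.100 and proc.name=apache2"
For more information, examples, and case studies, see the [project website][1]. Believe me, the possibilities are limitless. But don't just take my word for it. Install sysdig and start digging today!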
--------------------------------------------------------------------------------
via: http://xmodulo.com/monitor-troubleshoot-linux-server-sysdig.html
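Author: [Gabriel Cánepa][a]
Translator: [GOLinux](https://github.com/GOLinux)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国 (Linux China)](http://linux.cn/)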
[a]:http://xmodulo.com/author/gabriel
[1]:http://www.sysdig.org/