Xingyu Wang 2021-04-05 09:44:24 +08:00
parent 195be1d650
commit 54fe7c136a
2 changed files with 134 additions and 144 deletions

@@ -1,144 +0,0 @@
[#]: subject: (What problems do people solve with strace?)
[#]: via: (https://jvns.ca/blog/2021/04/03/what-problems-do-people-solve-with-strace/)
[#]: author: (Julia Evans https://jvns.ca/)
[#]: collector: (lujun9972)
[#]: translator: (wxy)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
What problems do people solve with strace?
======
Yesterday I [asked on Twitter about what problems people are solving with strace][1] and as usual everyone really delivered! I got 200 answers and then spent a bunch of time manually categorizing them into 9 categories of problems.
All of the problems are about either finding files a program depends on, figuring out why a program is stuck or slow, or finding out why a program is failing. These generally matched up with what I use strace for myself, but there were some things I hadn't thought of too!
I'm not going to explain what strace is in this post but I have a [free zine about it][2] and [a talk][3] and [lots of blog posts][4].
### problem 1: where's the config file?
The #1 most popular problem was “this program has a configuration file and I don't know where it is”. This is probably my most common use for strace too, because it's such a simple question.
This is great because there are a million ways for a program to document where its config file is (in a man page, on its website, in `--help`, etc.), but there's only one way for it to actually open it (with a system call!)
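A minimal sketch of that idea, assuming an x86-64 Linux machine with strace installed (where both `open` and `openat` exist as system calls). `opened_paths` is a hypothetical helper name, not part of strace. It records every `open`/`openat` call a command makes, including failed attempts, and prints the unique paths:

```shell
# opened_paths CMD [ARGS...]: print every path CMD (and its children) tries to open.
# Hypothetical helper for illustration; only the strace flags themselves are real.
opened_paths() (
    log=$(mktemp) || exit 1
    # -f follows child processes, -e trace=... limits tracing to open calls,
    # -o writes the trace to a file so it does not mix with the program's output.
    strace -f -e trace=open,openat -o "$log" "$@" >/dev/null 2>&1
    # Each traced call quotes the path it was given; extract and deduplicate them.
    grep -E 'open(at)?\(' "$log" | grep -oE '"[^"]*"' | tr -d '"' | sort -u
    rm -f "$log"
)
```

Something like `opened_paths someprogram | grep -i conf` (with `someprogram` standing in for the program in question) would then narrow the list to likely config files. Failed opens are kept on purpose, since the paths a program looks for and does not find are often the interesting ones.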
### problem 2: what other files does this program depend on?
You can also use strace to find other types of files a program depends on, like:
  * dynamically linked libraries (“why is my program loading the wrong version of this `.so` file?”) like [this ruby problem I debugged in 2014][5]
  * where it's looking for its Ruby gems (Ruby specifically came up a few times!)
  * SSL root certificates
  * a game's save files
  * a closed-source program's data files
  * [which node_modules files aren't being used][6]
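The shared-library case from the list above can be sketched in a similar way, again assuming x86-64 Linux with strace installed. `loaded_libs` is a hypothetical helper that keeps only successful opens of `.so` files, so it shows which copy of each library was actually loaded:

```shell
# loaded_libs CMD [ARGS...]: print the shared objects CMD successfully opened.
# Hypothetical helper for illustration.
loaded_libs() (
    log=$(mktemp) || exit 1
    strace -f -e trace=open,openat -o "$log" "$@" >/dev/null 2>&1
    # Failed calls contain " = -1 "; drop them so only files that were really
    # opened remain, then keep paths ending in .so or .so.<version>.
    grep -v ' = -1 ' "$log" | grep -oE '"[^"]*\.so(\.[0-9]+)*"' | tr -d '"' | sort -u
    rm -f "$log"
)
```

Dropping the `grep -v` step would instead show every location the dynamic loader tried, which can help explain why the "wrong" copy won.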
### problem 3: why is this program hanging?
You have a program, it's just sitting there doing nothing, what's going on? This one is especially easy to answer because a lot of the time you just need to run `strace -p PID` and look at what system call is currently running. You don't even have to look through hundreds of lines of output!
The answer is usually “waiting for some kind of I/O”. Some possible answers for “why is this stuck” (though there are a lot more!):
  * it's polling forever on a `select()`
  * it's `wait()`ing for a subprocess to finish
  * it's making a network request to something that isn't responding
  * it's doing `write()` but it's blocked because the buffer is full
  * it's doing a `read()` on stdin and it's waiting for input
Someone also gave a nice example of using strace to debug a stuck `df`: “with `strace df -h` you can find the stuck mount and unmount it”.
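As a rough sketch of the `strace -p PID` workflow (assuming Linux, strace, and the coreutils `timeout` command), the hypothetical helper below attaches to a process for a couple of seconds and prints the first few lines strace reports. For a blocked process, the first call shown is usually the one it is stuck in. Attaching to another process may require root or relaxed ptrace settings, depending on the system.

```shell
# peek_syscalls PID [SECONDS]: show what a running process is doing right now.
# Hypothetical helper for illustration.
peek_syscalls() (
    pid=$1
    secs=${2:-2}
    # strace reports on stderr, so merge it into stdout for head.
    # timeout stops strace after $secs seconds; strace then detaches and the
    # traced process keeps running.
    timeout "$secs" strace -p "$pid" 2>&1 | head -n 5
)
```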
### problem 4: is this program stuck?
A variation on the previous one: sometimes a program has been running for longer than you expected, and you just want to know if it's stuck or if it's still making progress.
As long as the program makes system calls while it's running, this is super easy to answer with strace: just strace it and see if it's making new system calls!
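One way to turn “is it making new system calls?” into a number, sketched under some assumptions (Linux, strace, and the coreutils `timeout` command): the hypothetical helper below traces a process for a short window and counts the lines strace logged.

```shell
# syscall_count PID [SECONDS]: count trace lines logged during a short window.
# Hypothetical helper for illustration.
syscall_count() (
    pid=$1
    secs=${2:-5}
    log=$(mktemp) || exit 1
    # -f also attaches to the process's threads; -o keeps the trace in a file.
    timeout "$secs" strace -f -p "$pid" -o "$log" >/dev/null 2>&1
    wc -l < "$log"
    rm -f "$log"
)
```

A count of zero or one suggests the process is blocked in a single call, while a large count suggests it is still doing something, though not necessarily something useful.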
### problem 5: why is this program slow?
You can use strace as a sort of coarse profiling tool: `strace -t` will show the timestamp of each system call, so you can look for big gaps and find the culprit.
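Besides eyeballing timestamps, a related approach is strace's `-T` flag (a different flag from the `-t` mentioned above), which appends the time spent inside each system call as `<seconds>` at the end of the line. The sketch below assumes a log recorded that way, for example with `strace -f -T -o trace.log COMMAND` (`trace.log` is just an example file name). `slowest_calls` is a hypothetical helper that sorts such a log by duration:

```shell
# slowest_calls LOGFILE [N]: print the N slowest system calls (default 10).
# Hypothetical helper; expects a log written by `strace -T`.
slowest_calls() {
    # Lines from `strace -T` end in "<seconds>"; put that number first so
    # sort can order by it, and skip lines without a duration.
    awk 'match($0, /<[0-9]+\.[0-9]+>$/) {
             print substr($0, RSTART + 1, RLENGTH - 2), $0
         }' "$1" | sort -rn | head -n "${2:-10}"
}
```

Note that `-T` measures time inside each call. Long pauses between calls, where the program is busy in its own code, only show up as gaps between timestamps.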
Here are 9 short stories from Twitter of people using strace to debug “why is this program slow?”.
  * Back in 2000, a Java-based web site that I helped support was dying under modest load: pages loaded slowly, if at all. We straced the J2EE application server and found that it was reading class files one. byte. at. a. time. Devs weren't using BufferedReader, classic Java mistake.
* Optimizing app startup times… running strace can be an eye-opening experience, in terms of the amount of unnecessary file system interaction going on (e.g. open/read/close on the same config file over and over again; loading gobs of font files over a slow NFS mount, etc)
* Asked myself why reading from session files in PHP (usually <100 bytes) was incredibly slow. Turned out some `flock`-syscalls took ~60s
* A program was behaving abnormally slow. Used strace to figure out it was re-initializing its internal pseudo-random number generator on every request by reading from /dev/random and exhausting entropy
* Last thing I remember was attaching to a job worker and seeing just how many network calls it was making (which was unexpected).
* Why is this program so slow to start? strace shows it opening/reading the same config file thousands of times.
  * Server using 100% CPU time randomly with low actual traffic. Turns out it's hitting the number of open files limit accepting a socket, and retrying forever after getting EMFILE and not reporting it.
* A workflow was running super slow but no logs, ends up it was trying to do a post request that was taking 30s before timing out and then retrying 5 times… ends up the backend service was overwhelmed but also had no visibility
  * using strace to notice that `gethostbyname()` is taking a long time to return (you can't see the `gethostbyname` directly but you can see the DNS packets in strace)
### problem 6: hidden permissions errors
Sometimes a program is failing for a mysterious reason, but the problem is just that there's some file that it doesn't have permission to open. In an ideal world programs would report those errors (“Error opening file /dev/whatever: permission denied”), but of course the world is not perfect, so strace can really help with this!
This is actually the most recent thing I used strace for: I was using an AxiDraw pen plotter and it printed out an inscrutable error message when I tried to start it. I `strace`d it and it turned out that my user just didn't have permission to open the USB device.
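A sketch of how such hidden failures could be surfaced, assuming Linux with strace installed: failed system calls appear in the trace with a return value of `-1` followed by the error name, so filtering for `EACCES` and `EPERM` shows the permission problems. `permission_errors` is a hypothetical helper name.

```shell
# permission_errors CMD [ARGS...]: show system calls that failed for lack of permission.
# Hypothetical helper for illustration.
permission_errors() (
    log=$(mktemp) || exit 1
    # Trace everything (no -e filter), since the failing call may not be an open.
    strace -f -o "$log" "$@" >/dev/null 2>&1
    grep -E ' = -1 (EACCES|EPERM) ' "$log"
    rm -f "$log"
)
```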
### problem 7: what command line arguments are being used?
Sometimes a script is running another program, and you want to know what command line flags it's passing!
A couple of examples from Twitter:
* find what compiler flags are actually being used to build some code
* a command was failing due to having too long a command line
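A sketch for this case, assuming Linux with strace installed: starting a program goes through the `execve` system call, and strace prints the arguments passed to it. `exec_calls` is a hypothetical helper. The `-s 4096` option raises strace's string length limit (32 characters by default) so that long arguments are not cut off.

```shell
# exec_calls CMD [ARGS...]: list every program CMD starts, with its arguments.
# Hypothetical helper for illustration.
exec_calls() (
    log=$(mktemp) || exit 1
    # -f follows children, so programs started by scripts and build tools show up too.
    strace -f -e trace=execve -s 4096 -o "$log" "$@" >/dev/null 2>&1
    grep -E 'execve\(' "$log"
    rm -f "$log"
)
```

For instance, `exec_calls make` would list the commands that `make` spawns, which is one way to see the compiler flags that are really in use.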
### problem 8: why is this network connection failing?
Basically the goal here is just to find which domain / IP address the network connection is being made to. You can look at the DNS request to find the domain or the `connect` system call to find the IP.
In general there are a lot of stories about using strace to debug network issues when `tcpdump` isn't available for some reason or just because it's what the person is more familiar with.
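The `connect` part can be sketched like this, assuming Linux with strace installed. `connect_targets` is a hypothetical helper that traces only `connect` calls and keeps the IPv4 and IPv6 ones, dropping connections to local Unix sockets:

```shell
# connect_targets CMD [ARGS...]: show the IP addresses and ports CMD connects to.
# Hypothetical helper for illustration.
connect_targets() (
    log=$(mktemp) || exit 1
    strace -f -e trace=connect -o "$log" "$@" >/dev/null 2>&1
    # strace decodes the socket address; AF_INET / AF_INET6 entries carry the
    # destination address and port, and the return value shows success or failure.
    grep -E 'sa_family=AF_INET6?,' "$log"
    rm -f "$log"
)
```

Connections to port 53 in this output are typically the DNS lookups, and the remaining lines are the services the program is actually trying to reach.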
### problem 9: why does this program succeed when run one way and fail when run in another way?
For example:
* the same binary works on one machine, fails on another machine
* works when you run it, fails when spawned by a systemd unit file
* works when you run it, fails when you run it as “su - user /some/script”
* works when you run it, fails when run as a cron job
Being able to compare the strace output in both cases is very helpful. Though my first step when debugging “this works as my user and fails when run in a different way on the same computer” would be “look at my environment variables”.
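Raw traces from two runs differ in PIDs, addresses, and ordering, so a plain `diff` tends to be noisy. One possible way to make the comparison manageable, sketched below, is to reduce each trace to its unique failing calls first. The sketch assumes both runs were recorded with something like `strace -f -o good.log ...` and `strace -f -o bad.log ...` (example file names), and `failed_calls` and `diff_failures` are hypothetical helpers.

```shell
# failed_calls LOGFILE: unique failing system calls from an strace log.
failed_calls() {
    # Drop the leading PID column that -f adds, keep calls that returned -1.
    sed -E 's/^[0-9]+ +//' "$1" | grep -E ' = -1 E[A-Z]+ ' | sort -u
}

# diff_failures GOOD.log BAD.log: show failures that differ between two runs.
diff_failures() (
    a=$(mktemp) || exit 1
    b=$(mktemp) || exit 1
    failed_calls "$1" > "$a"
    failed_calls "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    exit "$status"
)
```

Lines marked `>` in the output are failures that only happen in the second run, which is usually where to start looking.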
### what I'm doing with this: slowly building some challenges
The reason I'm thinking about this is that I've been slowly working on some challenges to help people practice using strace and other command line tools. The idea is that you're given a problem to solve, a terminal, and you're free to solve it in any way you want.
So my goal is to use this to build some practice problems that you can solve with strace that reflect the kinds of problems that people actually use it for in real life.
### that's all!
There are probably more problems that can be solved with strace that I haven't covered here. I'd love to hear what I've missed!
I really loved seeing how many of the same uses came up over and over and over again: at least 20 different people replied saying that they use strace to find config files. And as always I think it's really delightful how such a simple tool (“trace system calls!”) can be used to solve so many different kinds of problems.
--------------------------------------------------------------------------------
via: https://jvns.ca/blog/2021/04/03/what-problems-do-people-solve-with-strace/
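Author: [Julia Evans][a]
Topic selection: [lujun9972][b]
Translator: [translator ID](https://github.com/译者ID)
Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)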
[a]: https://jvns.ca/
[b]: https://github.com/lujun9972
[1]: https://twitter.com/b0rk/status/1378014888405168132
[2]: https://wizardzines.com/zines/strace
[3]: https://www.youtube.com/watch?v=4pEHfGKB-OE
[4]: https://jvns.ca/categories/strace
[5]: https://jvns.ca/blog/2014/03/10/debugging-shared-library-problems-with-strace/
[6]: https://indexandmain.com/post/shrink-node-modules-with-refining

@@ -0,0 +1,134 @@
[#]: subject: (What problems do people solve with strace?)
[#]: via: (https://jvns.ca/blog/2021/04/03/what-problems-do-people-solve-with-strace/)
[#]: author: (Julia Evans https://jvns.ca/)
[#]: collector: (lujun9972)
[#]: translator: (wxy)
[#]: reviewer: (wxy)
[#]: publisher: ( )
[#]: url: ( )
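What problems can strace solve?
======
Yesterday I [asked on Twitter what problems people solve with strace][1], and as usual everyone really delivered! I got about 200 answers and then spent a lot of time manually sorting them into 9 categories.
All of the problems are about finding the files a program depends on, figuring out why a program is stuck or slow, or finding out why a program is failing. These generally match what I use `strace` for myself, but there were also some things I hadn't thought of!
I'm not going to explain what `strace` is in this post, but I have a [free zine about it][2], [a talk][3], and [lots of blog posts][4].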
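### Problem 1: where's the config file?
The most popular problem was “this program has a configuration file, but I don't know where it is”. This is probably also the problem I most often solve with `strace`, because it's such a simple question.
This is great because there are a million ways for a program to document where its config file is (in a man page, on its website, in `--help`, etc.), but there's only one way for it to actually open the file (with a system call!).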
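### Problem 2: what other files does this program depend on?
You can also use `strace` to find other kinds of files a program depends on, such as:
  * dynamically linked libraries (“why is my program loading the wrong version of this `.so` file?”), like [this ruby problem I debugged in 2014][5]
  * where it's looking for its Ruby gems (Ruby came up a few times!)
  * SSL root certificates
  * a game's save files
  * a closed-source program's data files
  * [which node_modules files aren't being used][6]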
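### Problem 3: why is this program hanging?
You have a program, it's just sitting there doing nothing, what's going on? This one is especially easy to answer, because a lot of the time you just need to run `strace -p PID` and look at which system call is currently running. You don't even have to look through hundreds of lines of output.
The answer is usually “waiting for some kind of I/O”. Some possible answers to “why is this stuck” (though there are many more):
  * it's polling forever on a `select()`
  * it's `wait()`ing for a subprocess to finish
  * it's making a network request to something that isn't responding
  * it's doing a `write()`, but it's blocked because the buffer is full
  * it's doing a `read()` on stdin and waiting for input
Someone also gave a nice example of using `strace` to debug a stuck `df` command: “with `strace df -h` you can find the stuck mount and unmount it”.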
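### Problem 4: is this program stuck?
This is a variation on the previous one: sometimes a program has been running for longer than you expected, and you just want to know whether it's stuck or still making progress.
As long as the program makes system calls while it's running, this is super easy to answer with `strace`: just `strace` it and see if it's making new system calls!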
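### Problem 5: why is this program slow?
You can use `strace` as a sort of coarse profiling tool: `strace -t` shows the timestamp of each system call, so you can look for big gaps and find the culprit.
Here are 9 short stories from Twitter of people using `strace` to debug “why is this program slow?”.
  * Back in 2000, a Java-based web site that I helped support was dying under modest load: pages loaded slowly, if at all. We straced the J2EE application server and found that it was reading class files one byte at a time. The devs weren't using BufferedReader, a classic Java mistake.
  * Optimizing app startup times… running `strace` can be an eye-opening experience, in terms of the amount of unnecessary file system interaction going on (e.g. open/read/close on the same config file over and over again; loading lots of font files over a slow NFS mount, etc.).
  * Asked myself why reading from session files in PHP (usually under 100 bytes) was incredibly slow. Turned out some `flock` system calls took about 60 seconds.
  * A program was behaving abnormally slowly. Used `strace` to figure out that it was re-initializing its internal pseudo-random number generator on every request by reading from `/dev/random` and exhausting entropy.
  * The last thing I remember was attaching to a job worker and seeing just how many network calls it was making (which was unexpected).
  * Why is this program so slow to start? `strace` shows it opening/reading the same config file thousands of times.
  * Server randomly using 100% CPU time with low actual traffic. It turned out it was hitting the open files limit when accepting a socket, then retrying forever after getting EMFILE without reporting it.
  * A workflow was running super slowly with no logs. It turned out it was making a POST request that took 30 seconds before timing out and then retrying 5 times… the backend service was overwhelmed, but there was also no visibility into it.
  * Using `strace` to notice that `gethostbyname()` is taking a long time to return (you can't see `gethostbyname` directly, but you can see the DNS packets in `strace`).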
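### Problem 6: hidden permissions errors
Sometimes a program fails for a mysterious reason, but the problem is just that there's some file it doesn't have permission to open. In an ideal world programs would report those errors (“Error opening file /dev/whatever: permission denied”), but of course the world is not perfect, so `strace` can really help with this!
This is actually the most recent thing I used `strace` for. I was using an AxiDraw pen plotter, and it printed an inscrutable error message when I tried to start it. I `strace`d it, and it turned out that my user didn't have permission to open the USB device.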
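### Problem 7: what command line arguments are being used?
Sometimes a script is running another program, and you want to know what command line flags it's passing!
A couple of examples from Twitter:
  * finding out what compiler flags are actually being used to build some code
  * a command was failing because its command line was too long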
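### Problem 8: why is this network connection failing?
Basically, the goal here is to find which domain / IP address the network connection is being made to. You can look at the DNS request to find the domain, or at the `connect` system call to find the IP.
In general, people often use `strace` to debug network issues when `tcpdump` isn't available for some reason, or just because they are more familiar with `strace`.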
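### Problem 9: why does this program succeed when run one way and fail when run another way?
For example:
  * the same binary works on one machine and fails on another
  * works when you run it, fails when spawned by a systemd unit file
  * works when you run it, fails when run as `su - user /some/script`
  * works when you run it, fails when run as a cron job
Being able to compare the `strace` output in both cases is very helpful. That said, my first step when debugging “this works as my user and fails when run a different way on the same computer” would be “look at my environment variables”.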
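### What I'm doing with this: slowly building some challenges
The reason I'm thinking about this is that I've been slowly working on some challenges to help people practice using `strace` and other command line tools. The idea is that you're given a problem and a terminal, and you're free to solve it any way you want.
So my goal is to use this to build some practice problems that you can solve with `strace` and that reflect the kinds of problems people actually use it for in real life.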
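### That's all!
There are probably more problems that can be solved with `strace` that I haven't covered here, and I'd love to hear what I've missed!
I really loved seeing so many of the same uses come up over and over again: at least 20 different people replied saying that they use `strace` to find config files. And as always, I think it's really delightful that such a simple tool (“trace system calls!”) can be used to solve so many different kinds of problems.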
--------------------------------------------------------------------------------
via: https://jvns.ca/blog/2021/04/03/what-problems-do-people-solve-with-strace/
Author: [Julia Evans][a]
Topic selection: [lujun9972][b]
Translator: [wxy](https://github.com/wxy)
Proofreader: [wxy](https://github.com/wxy)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]: https://jvns.ca/
[b]: https://github.com/lujun9972
[1]: https://twitter.com/b0rk/status/1378014888405168132
[2]: https://wizardzines.com/zines/strace
[3]: https://www.youtube.com/watch?v=4pEHfGKB-OE
[4]: https://jvns.ca/categories/strace
[5]: https://jvns.ca/blog/2014/03/10/debugging-shared-library-problems-with-strace/
[6]: https://indexandmain.com/post/shrink-node-modules-with-refining