@wxy
Xingyu Wang 2020-08-22 21:40:12 +08:00
parent 1763278298
commit 09096c9d4e
2 changed files with 244 additions and 267 deletions


@@ -1,267 +0,0 @@
[#]: collector: (lujun9972)
[#]: translator: (wxy)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Creating and debugging Linux dump files)
[#]: via: (https://opensource.com/article/20/8/linux-dump)
[#]: author: (Stephan Avenwedde https://opensource.com/users/hansic99)
Creating and debugging Linux dump files
======
Knowing how to deal with dump files will help you find and fix
hard-to-reproduce bugs in an application.
![Magnifying glass on code][1]
Crash dump, memory dump, core dump, system dump … all produce the same outcome: a file containing the state of an application's memory at a specific time—usually when the application crashes.
Knowing how to deal with these files can help you find the root cause(s) of a failure. Even if you are not a developer, dump files created on your system can be very helpful (as well as approachable) in understanding software.
This is a hands-on article, and you can follow along with the example by cloning the sample application repository with:
```
git clone https://github.com/hANSIc99/core_dump_example.git
```
### How signals relate to dumps
Signals are a kind of interprocess communication between the operating system and the user applications. Linux uses the signals defined in the [POSIX standard][2]. On your system, you can find the standard signals defined in `/usr/include/bits/signum-generic.h`. There is also an informative [man signal][3] page if you want more on using signals in your application. Put simply, Linux uses signals to trigger further activities based on whether they were expected or unexpected.
When you quit a running application, the application will usually receive the `SIGTERM` signal. Because this type of exit signal is expected, this action will not create a memory dump.
The following signals will cause a dump file to be created (source: [GNU C Library][4]):
* SIGFPE: Erroneous arithmetic operation
* SIGILL: Illegal instruction
* SIGSEGV: Invalid access to storage
* SIGBUS: Bus error
* SIGABRT: An error detected by the program and reported by calling abort
* SIGIOT: Labeled archaic on Fedora, this signal used to trigger on `abort()` on a [PDP-11][5] and now maps to SIGABRT
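To make the list above concrete, here is a minimal C++ sketch that checks whether a signal number is one of the signals listed. It is not part of the sample repository, and the helper name `isListedDumpSignal` is hypothetical. It covers only the subset quoted in this article; the signal(7) man page documents further signals whose default action also dumps core.

```cpp
#include <algorithm>
#include <array>
#include <signal.h>

// Hypothetical helper: returns true if `sig` is one of the signals listed above.
// SIGIOT is not stored separately because on Linux it shares its number with SIGABRT.
bool isListedDumpSignal(int sig) {
    static const std::array<int, 5> listed = {SIGFPE, SIGILL, SIGSEGV, SIGBUS, SIGABRT};
    return std::find(listed.begin(), listed.end(), sig) != listed.end();
}
```

Calling `isListedDumpSignal(SIGTERM)` returns `false`, which matches the earlier point that an expected exit signal does not create a dump.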
### Creating dump files
Navigate to the `core_dump_example` directory, run `make`, and execute the sample with the `-c1` switch:
```
./coredump -c1
```
The application should exit in state 4 with an error:
![Dump written][6]
(Stephan Avenwedde, [CC BY-SA 4.0][7])
"Abgebrochen (Speicherabzug geschrieben)" is German and roughly translates to "Aborted (core dumped)."
Whether it creates a core dump or not is determined by the resource limit of the user running the process. You can modify the resource limits with the `ulimit` command.
Check the current setting for core dump creation:
```
ulimit -c
```
If it outputs `unlimited`, then it is using the (recommended) default. Otherwise, correct the limit with:
```
ulimit -c unlimited
```
To disable creating core dumps, type:
```
ulimit -c 0
```
The number specifies the maximum size of a core file. In Bash, the unit is 1,024-byte blocks (512-byte blocks in POSIX mode).
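The same limit can also be changed from inside a program. The following is a minimal sketch, assuming a Linux system with `getrlimit()` and `setrlimit()` from `<sys/resource.h>`; the helper name `raiseCoreLimitToHardMax` is hypothetical and not part of the sample repository. Note that these calls express the limit in bytes, not in the units that `ulimit` prints.

```cpp
#include <sys/resource.h>

// Hypothetical helper: raise the soft core-file limit to the hard limit.
// This is roughly what `ulimit -c unlimited` achieves when the hard limit
// is itself unlimited.
bool raiseCoreLimitToHardMax() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;  // could not read the current limits
    }
    lim.rlim_cur = lim.rlim_max;  // an unprivileged process may raise soft up to hard
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

A process would typically call such a helper early in `main()`, before anything can crash. The limit is inherited by child processes, just as with `ulimit` in a shell.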
### What are core dumps?
The way the kernel handles core dumps is defined in:
```
/proc/sys/kernel/core_pattern
```
I'm running Fedora 31, and on my system, the file contains:
```
/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
```
This shows core dumps are forwarded to the `systemd-coredump` utility. The contents of `core_pattern` can vary widely between the different flavors of Linux distributions. When `systemd-coredump` is in use, the dump files are saved compressed under `/var/lib/systemd/coredump`. You don't need to touch the files directly; instead, you can use `coredumpctl`. For example:
```
coredumpctl list
```
shows all available dump files saved on your system.
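Returning briefly to `core_pattern`: according to the core(5) man page, a pattern whose first character is `|` is treated as a command that the kernel pipes the dump to, while anything else is used as a file name template. The following C++ sketch shows how a program might tell the two cases apart. The helper names `isPipedToProgram` and `readCorePattern` are hypothetical and not part of the sample repository.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: a pattern that starts with '|' is handed to a program.
bool isPipedToProgram(const std::string &pattern) {
    return !pattern.empty() && pattern[0] == '|';
}

// Hypothetical helper: read the first line of a core_pattern-style file.
// Returns an empty string if the file cannot be opened or read.
std::string readCorePattern(const std::string &path = "/proc/sys/kernel/core_pattern") {
    std::ifstream in(path);
    std::string line;
    std::getline(in, line);
    return line;
}
```

Calling `isPipedToProgram(readCorePattern())` gives a quick hint about whether a helper such as `systemd-coredump` receives the dumps or whether the kernel writes plain core files.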
With `coredumpctl dump`, you can retrieve information from the last dump file saved:
```
[stephan@localhost core_dump_example]$ ./coredump 
Application started…
(…….)
Message: Process 4598 (coredump) of user 1000 dumped core.
Stack trace of thread 4598:
#0 0x00007f4bbaf22625 __GI_raise (libc.so.6)
#1 0x00007f4bbaf0b8d9 __GI_abort (libc.so.6)
#2 0x00007f4bbaf664af __libc_message (libc.so.6)
#3 0x00007f4bbaf6da9c malloc_printerr (libc.so.6)
#4 0x00007f4bbaf6f49c _int_free (libc.so.6)
#5 0x000000000040120e n/a (/home/stephan/Dokumente/core_dump_example/coredump)
#6 0x00000000004013b1 n/a (/home/stephan/Dokumente/core_dump_example/coredump)
#7 0x00007f4bbaf0d1a3 __libc_start_main (libc.so.6)
#8 0x000000000040113e n/a (/home/stephan/Dokumente/core_dump_example/coredump)
Refusing to dump core to tty (use shell redirection or specify --output).
```
This shows that the process was stopped by `SIGABRT`. The stack trace in this view is not very detailed because it shows no function names for the application's own frames (they appear as `n/a`). However, with `coredumpctl debug`, you can open the dump file in a debugger ([GDB][8] by default). Type `bt` (short for backtrace) to get a more detailed view:
```
Core was generated by `./coredump -c1'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  return ret;
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fc37a9aa8d9 in __GI_abort () at abort.c:79
#2  0x00007fc37aa054af in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fc37ab14f4b "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007fc37aa0ca9c in malloc_printerr (str=str@entry=0x7fc37ab130e0 "free(): invalid pointer") at malloc.c:5339
#4  0x00007fc37aa0e49c in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4173
#5  0x000000000040120e in freeSomething(void*) ()
#6  0x0000000000401401 in main ()
```
The memory addresses of `main()` and `freeSomething()` are quite low compared to those of the other frames. Because shared objects are mapped to an area at the end of the virtual address space, you can conclude that the `SIGABRT` was raised by a call in a shared library. The memory addresses of shared objects are not constant between invocations, so it is perfectly normal to see different addresses from one run to the next.
The stack trace shows that subsequent calls originate from `malloc.c`, which indicates that something with memory (de-)allocation could have gone wrong.
In the source code, you can see (even without any knowledge of C++) that it tries to free a pointer that was not returned by a memory management function. This results in undefined behavior and causes the `SIGABRT`:
```
void freeSomething(void *ptr){
    free(ptr);
}
int nTmp = 5;
int *ptrNull = &nTmp;
freeSomething(ptrNull);
```
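For contrast, here is a minimal sketch of the well-defined case, in which the pointer passed to `free()` really does come from a memory management function. The function name `roundTripThroughHeap` is hypothetical and does not appear in the sample repository.

```cpp
#include <cstdlib>

// Hypothetical counterpart to the snippet above: the pointer handed to
// free() was returned by malloc(), so the call is well defined.
int roundTripThroughHeap(int value) {
    int *ptr = static_cast<int *>(std::malloc(sizeof(int)));
    if (ptr == nullptr) {
        return -1;  // allocation failed; nothing to free
    }
    *ptr = value;
    int copy = *ptr;
    std::free(ptr);  // valid: ptr came from malloc()
    return copy;
}
```

The only difference from the crashing code is where the pointer comes from: heap memory returned by `malloc()` instead of the address of a local variable.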
The `systemd-coredump` utility can be configured in `/etc/systemd/coredump.conf`. The cleanup schedule for dump files can be configured in `/etc/systemd/system/systemd-tmpfiles-clean.timer`.
You can find more information about `coredumpctl` on its [man page][10].
### Compiling with debug symbols
Open the `Makefile` and comment out the last part of line 9. It should now look like:
```
CFLAGS =-Wall -Werror -std=c++11 -g
```
The `-g` switch enables the compiler to create debug information. Start the application, this time with the `-c2` switch:
```
./coredump -c2
```
You will get a floating-point exception. Open the dump in GDB with:
```
coredumpctl debug
```
This time, you are pointed directly to the line in the source code that caused the error:
```
Reading symbols from /home/stephan/Dokumente/core_dump_example/coredump…
[New LWP 6218]
Core was generated by `./coredump -c2'.
Program terminated with signal SIGFPE, Arithmetic exception.
#0 0x0000000000401233 in zeroDivide () at main.cpp:29
29 nRes = 5 / nDivider;
(gdb)
```
Type `list` to get a better overview of the source code:
```
(gdb) list
24      int zeroDivide(){
25          int nDivider = 5;
26          int nRes = 0;
27          while(nDivider > 0){
28              nDivider--;
29              nRes = 5 / nDivider;
30          }
31          return nRes;
32      }
```
Use the command `info locals` to retrieve the values of the local variables from the point in time when the application failed:
```
(gdb) info locals
nDivider = 0
nRes = 5
```
In combination with the source code, you can see that you ran into a division by zero:
```
nRes = 5 / 0
```
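One possible way to avoid the crash is to check the divisor before dividing. The following sketch adapts the `zeroDivide()` loop shown in the `list` output; the name `guardedDivide` and the early `break` are illustrative assumptions, not the author's fix.

```cpp
// Hypothetical variant of zeroDivide(): leaves the loop before the
// divisor reaches zero, so the division that raised SIGFPE never runs.
int guardedDivide() {
    int nDivider = 5;
    int nRes = 0;
    while (nDivider > 0) {
        nDivider--;
        if (nDivider == 0) {
            break;  // 5 / 0 would raise SIGFPE here
        }
        nRes = 5 / nDivider;
    }
    return nRes;
}
```

The function returns 5, the value of `nRes` that `info locals` reported at the moment of the crash, because the last successful division was `5 / 1`.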
### Conclusion
Knowing how to deal with dump files will help you find and fix hard-to-reproduce random bugs in an application. And if it is not your application, forwarding a core dump to the developer will help her or him find and fix the problem.
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/8/linux-dump
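Author: [Stephan Avenwedde][a]
Topic selection: [lujun9972][b]
Translator: [Translator ID](https://github.com/译者ID)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)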
[a]: https://opensource.com/users/hansic99
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/find-file-linux-code_magnifying_glass_zero.png?itok=E2HoPDg0 (Magnifying glass on code)
[2]: https://en.wikipedia.org/wiki/POSIX
[3]: https://man7.org/linux/man-pages/man7/signal.7.html
[4]: https://www.gnu.org/software/libc/manual/html_node/Program-Error-Signals.html#Program-Error-Signals
[5]: https://en.wikipedia.org/wiki/PDP-11
[6]: https://opensource.com/sites/default/files/uploads/dump_written.png (Dump written)
[7]: https://creativecommons.org/licenses/by-sa/4.0/
[8]: https://www.gnu.org/software/gdb/
[9]: http://www.opengroup.org/onlinepubs/009695399/functions/free.html
[10]: https://man7.org/linux/man-pages/man1/coredumpctl.1.html


@@ -0,0 +1,244 @@
[#]: collector: (lujun9972)
[#]: translator: (wxy)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Creating and debugging Linux dump files)
[#]: via: (https://opensource.com/article/20/8/linux-dump)
[#]: author: (Stephan Avenwedde https://opensource.com/users/hansic99)
Creating and debugging dump files on Linux
======
> Knowing how to deal with dump files will help you find hard-to-reproduce bugs in an application.
![Magnifying glass on code][1]
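Crash dump, memory dump, core dump, system dump... these all produce the same result: a file containing the state of an application's memory at a specific moment, usually when the application crashes.
This is a hands-on article, and you can follow along by cloning the sample application repository: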
```
git clone https://github.com/hANSIc99/core_dump_example.git
```
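### How signals relate to dumps
Signals are a form of interprocess communication between the operating system and user applications. Linux uses the signals defined in the [POSIX standard][2]. On your system, you can find the standard signals defined in `/usr/include/bits/signum-generic.h`. If you want to know more about using signals in your application, there is also an informative [signal man page][3]. Put simply, Linux triggers further activities based on whether a signal was expected or unexpected.
When you quit a running application, the application usually receives the `SIGTERM` signal. Because this type of exit signal is expected, this action does not create a memory dump.
The following signals cause a dump file to be created (source: [GNU C Library][4]):
* `SIGFPE`: Erroneous arithmetic operation
* `SIGILL`: Illegal instruction
* `SIGSEGV`: Invalid access to storage
* `SIGBUS`: Bus error
* `SIGABRT`: An error detected by the program and reported by calling `abort()`
* `SIGIOT`: Labeled archaic on Fedora, this signal used to be triggered by `abort()` on a [PDP-11][5] and now maps to `SIGABRT`
### Creating dump files
Navigate to the `core_dump_example` directory, run `make`, and execute the sample binary with the `-c1` switch: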
```
./coredump -c1
```
The application should exit in state 4 with the following error:
![Dump written][6]
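"Abgebrochen (Speicherabzug geschrieben)" (LCTT translator's note: this is German, presumably because the author's system uses a German locale) roughly translates to "Aborted (core dumped)."
Whether a core dump is created is determined by the resource limits of the user running the process. You can modify the resource limits with the `ulimit` command.
Check the current setting for core dump creation: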
```
ulimit -c
```
If it outputs `unlimited`, then it is using the (recommended) default. Otherwise, correct the limit with:
```
ulimit -c unlimited
```
To disable core dump creation, set the size to 0:
```
ulimit -c 0
```
The number specifies the maximum size of the core dump file, in blocks.
### What are core dumps?
The way the kernel handles core dumps is defined in:
```
/proc/sys/kernel/core_pattern
```
I'm running Fedora 31, and on my system, the file contains:
```
/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
```
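This shows that core dumps are forwarded to the `systemd-coredump` utility. The contents of `core_pattern` can vary widely between Linux distributions. When `systemd-coredump` is in use, the dump files are saved compressed under `/var/lib/systemd/coredump`. You don't need to touch these files directly; instead, you can use `coredumpctl`. For example: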
```
coredumpctl list
```
shows all available dump files saved on your system.
With `coredumpctl dump`, you can retrieve information from the last dump file saved:
```
[stephan@localhost core_dump_example]$ ./coredump 
Application started…
(…….)
Message: Process 4598 (coredump) of user 1000 dumped core.
Stack trace of thread 4598:
#0 0x00007f4bbaf22625 __GI_raise (libc.so.6)
#1 0x00007f4bbaf0b8d9 __GI_abort (libc.so.6)
#2 0x00007f4bbaf664af __libc_message (libc.so.6)
#3 0x00007f4bbaf6da9c malloc_printerr (libc.so.6)
#4 0x00007f4bbaf6f49c _int_free (libc.so.6)
#5 0x000000000040120e n/a (/home/stephan/Dokumente/core_dump_example/coredump)
#6 0x00000000004013b1 n/a (/home/stephan/Dokumente/core_dump_example/coredump)
#7 0x00007f4bbaf0d1a3 __libc_start_main (libc.so.6)
#8 0x000000000040113e n/a (/home/stephan/Dokumente/core_dump_example/coredump)
Refusing to dump core to tty (use shell redirection or specify --output).
```
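This shows that the process was stopped by `SIGABRT`. The stack trace in this view is not very detailed because it shows no function names for the application's own frames. However, with `coredumpctl debug`, you can open the dump file in a debugger ([GDB][8] by default). Type `bt` (short for backtrace) to get a more detailed view: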
```
Core was generated by `./coredump -c1'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  return ret;
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fc37a9aa8d9 in __GI_abort () at abort.c:79
#2  0x00007fc37aa054af in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fc37ab14f4b "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007fc37aa0ca9c in malloc_printerr (str=str@entry=0x7fc37ab130e0 "free(): invalid pointer") at malloc.c:5339
#4  0x00007fc37aa0e49c in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4173
#5  0x000000000040120e in freeSomething(void*) ()
#6  0x0000000000401401 in main ()
```
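The memory addresses of `main()` and `freeSomething()` are quite low compared to those of the other frames. Because shared objects are mapped to an area at the end of the virtual address space, you can conclude that the `SIGABRT` was raised by a call in a shared library. The memory addresses of shared objects are not constant between invocations, so it is perfectly normal to see different addresses from one run to the next.
The stack trace shows that subsequent calls originate from `malloc.c`, which indicates that something could have gone wrong with memory (de-)allocation.
In the source code, you can see (even without any knowledge of C++) that it tries to free a pointer that was not returned by a memory management function. This results in undefined behavior and causes the `SIGABRT`: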
```
void freeSomething(void *ptr){
    free(ptr);
}
int nTmp = 5;
int *ptrNull = &nTmp;
freeSomething(ptrNull);
```
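The `systemd-coredump` utility can be configured in `/etc/systemd/coredump.conf`. The cleanup schedule for dump files can be configured in `/etc/systemd/system/systemd-tmpfiles-clean.timer`.
You can find more information about `coredumpctl` on its [man page][10].
### Compiling with debug symbols
Open the `Makefile` and comment out the last part of line 9. It should now look like: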
```
CFLAGS =-Wall -Werror -std=c++11 -g
```
The `-g` switch enables the compiler to create debug information. Start the application, this time with the `-c2` switch:
```
./coredump -c2
```
You will get a floating-point exception. Open the dump in GDB with:
```
coredumpctl debug
```
This time, you are pointed directly to the line in the source code that caused the error:
```
Reading symbols from /home/stephan/Dokumente/core_dump_example/coredump…
[New LWP 6218]
Core was generated by `./coredump -c2'.
Program terminated with signal SIGFPE, Arithmetic exception.
#0 0x0000000000401233 in zeroDivide () at main.cpp:29
29 nRes = 5 / nDivider;
(gdb)
```
Type `list` to get a better overview of the source code:
```
(gdb) list
24      int zeroDivide(){
25          int nDivider = 5;
26          int nRes = 0;
27          while(nDivider > 0){
28              nDivider--;
29              nRes = 5 / nDivider;
30          }
31          return nRes;
32      }
```
Use the command `info locals` to retrieve the values of the local variables at the point in time when the application failed:
```
(gdb) info locals
nDivider = 0
nRes = 5
```
In combination with the source code, you can see that you ran into a division by zero:
```
nRes = 5 / 0
```
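### Conclusion
Knowing how to deal with dump files will help you find and fix hard-to-reproduce random bugs in an application. And if it is not your application, forwarding a core dump to the developer will help her or him find and fix the problem.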
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/8/linux-dump
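Author: [Stephan Avenwedde][a]
Topic selection: [lujun9972][b]
Translator: [wxy](https://github.com/wxy)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)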
[a]: https://opensource.com/users/hansic99
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/find-file-linux-code_magnifying_glass_zero.png?itok=E2HoPDg0 (Magnifying glass on code)
[2]: https://en.wikipedia.org/wiki/POSIX
[3]: https://man7.org/linux/man-pages/man7/signal.7.html
[4]: https://www.gnu.org/software/libc/manual/html_node/Program-Error-Signals.html#Program-Error-Signals
[5]: https://en.wikipedia.org/wiki/PDP-11
[6]: https://opensource.com/sites/default/files/uploads/dump_written.png (Dump written)
[7]: https://creativecommons.org/licenses/by-sa/4.0/
[8]: https://www.gnu.org/software/gdb/
[9]: http://www.opengroup.org/onlinepubs/009695399/functions/free.html
[10]: https://man7.org/linux/man-pages/man1/coredumpctl.1.html