选题[tech]: 20210727 Analyze the Linux kernel with ftrace

sources/tech/20210727 Analyze the Linux kernel with ftrace.md
This commit is contained in:
DarkSun 2021-07-28 05:03:14 +08:00
parent ad9af9f08a
commit b5aba291ee

View File

@ -0,0 +1,404 @@
[#]: subject: (Analyze the Linux kernel with ftrace)
[#]: via: (https://opensource.com/article/21/7/linux-kernel-ftrace)
[#]: author: (Gaurav Kamathe https://opensource.com/users/gkamathe)
[#]: collector: (lujun9972)
[#]: translator: ( )
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
Analyze the Linux kernel with ftrace
======
Ftrace is a great way to learn more about the internal workings of the
Linux kernel.
![Linux keys on the keyboard for a desktop computer][1]
An operating system's kernel is one of the most elusive pieces of software out there. It's always there running in the background from the time your system gets turned on. Every user achieves their computing work with the help of the kernel, yet they never interact with it directly. The interaction with the kernel occurs by making system calls or having those calls made on behalf of the user by various libraries or applications that they use daily.
I've covered how to trace system calls in an earlier article using `strace`. However, with `strace`, your visibility is limited. It allows you to view the system calls invoked with specific parameters and, after the work gets done, see the return value or status indicating whether they passed or failed. But you had no idea what happened inside the kernel during this time. Besides just serving system calls, there's a lot of other activity happening inside the kernel that you're oblivious to.
### Ftrace Introduction
This article aims to shed some light on tracing the kernel functions by using a mechanism called `ftrace`. It makes kernel tracing easily accessible to any Linux user, and with its help you can learn a lot about Linux kernel internals.
The default output generated by the `ftrace` is often massive, given that the kernel is always busy. To save space, I've kept the output to a minimum and, in many cases truncated the output entirely.
I am using Fedora for these examples, but they should work on any of the latest Linux distributions.
### Enabling ftrace
`Ftrace` is part of the Linux kernel now, and you no longer need to install anything to use it. It is likely that, if you are using a recent Linux OS, `ftrace` is already enabled. To verify that the `ftrace` facility is available, run the mount command and search for `tracefs`. If you see output similar to what is below, `ftrace` is enabled, and you can easily follow the examples in this article:
```
$ sudo mount | grep tracefs
none on /sys/kernel/tracing type tracefs (rw,relatime,seclabel)
```
To make use of `ftrace`, you first must navigate to the special directory as specified in the mount command above, from where you'll run the rest of the commands in the article:
```
`$ cd /sys/kernel/tracing`
```
### General work flow
First of all, you must understand the general workflow of capturing a trace and obtaining the output. If you're using `ftrace` directly, there isn't any special `ftrace-`specific commands to run. Instead, you basically write to some files and read from some files using standard command-line Linux utilities.
The general steps:
1. Write to some specific files to enable/disable tracing.
2. Write to some specific files to set/unset filters to fine-tune tracing.
3. Read generated trace output from files based on 1 and 2.
4. Clear earlier output or buffer from files.
5. Narrow down to your specific use case (kernel functions to trace) and repeat steps 1, 2, 3, 4.
### Types of available tracers
There are several different kinds of tracers available to you. As mentioned earlier, you need to be in a specific directory before running any of these commands because the files of interest are present there. I use relative paths (as opposed to absolute paths) in my examples.
You can view the contents of the `available_tracers` file to see all the types of tracers available. You can see a few listed below. Don't worry about all of them just yet:
```
$ pwd
/sys/kernel/tracing
$ sudo cat available_tracers
hwlat blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop
```
Out of all the given tracers, I focus on three specific ones: `function` and `function_graph` to enable tracing, and `nop` to disable tracing.
### Identify current tracer
Usually, by default, the tracer is set to `nop`. That is, "No operation" in the special file `current_tracer`, which usually means tracing is currently off:
```
$ pwd
/sys/kernel/tracing
$ sudo cat current_tracer
nop
```
### View Tracing output
Before you enable any tracing, take a look at the file where the tracing output gets stored. You can view the contents of the file named `trace` using the [cat][2] command:
```
$ sudo cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
```
### Enable function tracer
You can enable your first tracer called `function` by writing `function` to the file `current_tracer` (its earlier content was `nop`, indicating that tracing was off.) Think of this operation as a way of enabling tracing:
```
$ pwd
/sys/kernel/tracing
$ sudo cat current_tracer
nop
$ echo function > current_tracer
$
$ cat current_tracer
function
```
### View updated tracing output for function tracer
Now that you've enabled tracing, it's time to view the output. If you view the contents of the `trace` file, you see a lot of data being written to it continuously. I've piped the output and am currently viewing only the top 20 lines to keep things manageable. If you follow the headers in the output on the left, you can see which task and Process ID are running on which CPU. Toward the right side of the output, you see the exact kernel function running, followed by its parent function. There is also time stamp information in the center:
```
$ sudo cat trace | head -20
# tracer: function
#
# entries-in-buffer/entries-written: 409936/4276216   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
          <idle>-0       [000] d...  2088.841739: tsc_verify_tsc_adjust <-arch_cpu_idle_enter
          <idle>-0       [000] d...  2088.841739: local_touch_nmi <-do_idle
          <idle>-0       [000] d...  2088.841740: rcu_nocb_flush_deferred_wakeup <-do_idle
          <idle>-0       [000] d...  2088.841740: tick_check_broadcast_expired <-do_idle
          <idle>-0       [000] d...  2088.841740: cpuidle_get_cpu_driver <-do_idle
          <idle>-0       [000] d...  2088.841740: cpuidle_not_available <-do_idle
          <idle>-0       [000] d...  2088.841741: cpuidle_select <-do_idle
          <idle>-0       [000] d...  2088.841741: menu_select <-do_idle
          <idle>-0       [000] d...  2088.841741: cpuidle_governor_latency_req <-menu_select
```
Remember that tracing is on, which means the output of tracing continues to get written to the trace file until you turn tracing off.
### Turn off tracing
Turning off tracing is simple. All you have to do is replace `function` tracer with `nop` in the `current_tracer` file and tracing gets turned off:
```
$ sudo cat current_tracer
function
$ sudo echo nop > current_tracer
$ sudo cat current_tracer
nop
```
### Enable function_graph tracer
Now try the second tracer, called `function_graph`. You can enable this using the same steps as before: write `function_graph` to the `current_tracer` file:
```
$ sudo echo function_graph > current_tracer
$ sudo cat current_tracer
function_graph
```
### Tracing output of function_graph tracer
Notice that the output format of the `trace` file has changed. Now, you can see the CPU ID and the duration of the kernel function execution. Next, you see curly braces indicating the beginning of a function and what other functions were called from inside it:
```
$ sudo cat trace | head -20
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 6)               |              n_tty_write() {
 6)               |                down_read() {
 6)               |                  __cond_resched() {
 6)   0.341 us    |                    rcu_all_qs();
 6)   1.057 us    |                  }
 6)   1.807 us    |                }
 6)   0.402 us    |                process_echoes();
 6)               |                add_wait_queue() {
 6)   0.391 us    |                  _raw_spin_lock_irqsave();
 6)   0.359 us    |                  _raw_spin_unlock_irqrestore();
 6)   1.757 us    |                }
 6)   0.350 us    |                tty_hung_up_p();
 6)               |                mutex_lock() {
 6)               |                  __cond_resched() {
 6)   0.404 us    |                    rcu_all_qs();
 6)   1.067 us    |                  }
```
### Enable trace settings to increase the depth of tracing
You can always tweak the tracer slightly to see more depth of the function calls using the steps below. After which, you can view the contents of the `trace` file and see that the output is slightly more detailed. For readability, the output of this example is omitted:
```
$ sudo cat max_graph_depth
0
$ sudo echo 1 > max_graph_depth
$ # or
$ sudo echo 2 > max_graph_depth
$ sudo cat trace
```
### Finding functions to trace
The steps above are sufficient to get started with tracing. However, the amount of output generated is enormous, and you can often get lost while trying to find out items of interest. Often you want the ability to trace specific functions only and ignore the rest. But how do you know which processes to trace if you don't know their exact names? There is a file that can help you with this—`available_filter_functions` provides you with a list of available functions for tracing:
```
$ sudo wc -l available_filter_functions  
63165 available_filter_functions
```
### Search for general kernel functions
Now try searching for a simple kernel function that you are aware of. User-space has `malloc` to allocate memory, while the kernel has its `kmalloc` function, which provides similar functionality. Below are all the `kmalloc` related functions:
```
$ sudo grep kmalloc available_filter_functions
debug_kmalloc
mempool_kmalloc
kmalloc_slab
kmalloc_order
kmalloc_order_trace
kmalloc_fix_flags
kmalloc_large_node
__kmalloc
__kmalloc_track_caller
__kmalloc_node
__kmalloc_node_track_caller
[...]
```
### Search for kernel module or driver related functions
From the output of `available_filter_functions`, you can see some lines ending with text in brackets, such as `[kvm_intel]` in the example below. These functions are related to the kernel module `kvm_intel`, which is currently loaded. You can run the `lsmod` command to verify:
```
$ sudo grep kvm available_filter_functions | tail
__pi_post_block [kvm_intel]
vmx_vcpu_pi_load [kvm_intel]
vmx_vcpu_pi_put [kvm_intel]
pi_pre_block [kvm_intel]
pi_post_block [kvm_intel]
pi_wakeup_handler [kvm_intel]
pi_has_pending_interrupt [kvm_intel]
pi_update_irte [kvm_intel]
vmx_dump_dtsel [kvm_intel]
vmx_dump_sel [kvm_intel]
$ lsmod  | grep -i kvm
kvm_intel             335872  0
kvm                   987136  1 kvm_intel
irqbypass              16384  1 kvm
```
### Trace specific functions only
To enable tracing of specific functions or patterns, you can make use of the `set_ftrace_filter` file to specify which functions from the above output you want to trace.
This file also accepts the `*` pattern, which expands to include additional functions with the given pattern. As an example, I am using the `ext4` filesystem on my machine. I can specify `ext4` specific kernel functions to trace using the following commands:
```
$ sudo mount | grep home
/dev/mapper/fedora-home on /home type ext4 (rw,relatime,seclabel)
$ pwd
/sys/kernel/tracing
$ sudo cat set_ftrace_filter
#### all functions enabled ####
$
$ echo ext4_* > set_ftrace_filter
$
$ cat set_ftrace_filter
ext4_has_free_clusters
ext4_validate_block_bitmap
ext4_get_group_number
ext4_get_group_no_and_offset
ext4_get_group_desc
[...]
```
Now, when you see the tracing output, you can only see functions `ext4` related to kernel functions for which you had set a filter earlier. All the other output gets ignored:
```
$ sudo cat trace |head -20
# tracer: function
#
# entries-in-buffer/entries-written: 3871/3871   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
           cupsd-1066    [004] ....  3308.989545: ext4_file_getattr <-vfs_fstat
           cupsd-1066    [004] ....  3308.989547: ext4_getattr <-ext4_file_getattr
           cupsd-1066    [004] ....  3308.989552: ext4_file_getattr <-vfs_fstat
           cupsd-1066    [004] ....  3308.989553: ext4_getattr <-ext4_file_getattr
           cupsd-1066    [004] ....  3308.990097: ext4_file_open <-do_dentry_open
           cupsd-1066    [004] ....  3308.990111: ext4_file_getattr <-vfs_fstat
           cupsd-1066    [004] ....  3308.990111: ext4_getattr <-ext4_file_getattr
           cupsd-1066    [004] ....  3308.990122: ext4_llseek <-ksys_lseek
           cupsd-1066    [004] ....  3308.990130: ext4_file_read_iter <-new_sync_read
```
### Exclude functions from being traced
You don't always know what you want to trace but, you surely know what you don't want to trace. For that, there is this file aptly named `set_ftrace_notrace`—notice the "no" in there. You can write your desired pattern in this file and enable tracing, upon which everything except the mentioned pattern gets traced. This is often helpful to remove common functionality that clutters our output:
```
$ sudo cat set_ftrace_notrace
#### no functions disabled ####
```
### Targetted tracing
So far, you've been tracing everything that has happened in the kernel. But that won't help us if you wish to trace events related to a specific command. To achieve this, you can turn tracing on and off on-demand and, and in between them, run our command of choice so that you do not get extra output in your trace output. You can enable tracing by writing `1` to `tracing_on`, and `0` to turn it off:
```
$ sudo cat tracing_on
0
$ sudo echo 1 > tracing_on
$ sudo cat tracing_on
1
$ # Run some specific command that we wish to trace here
$ sudo echo 0 > tracing_on
$ cat tracing_on
0
```
### Tracing specific PID
If you want to trace activity related to a specific process that is already running, you can write that PID to a file named `set_ftrace_pid` and then enable tracing. That way, tracing is limited to this PID only, which is very helpful in some instances:
```
`$ sudo echo $PID > set_ftrace_pid`
```
### Conclusion
`Ftrace` is a great way to learn more about the internal workings of the Linux kernel. With some practice, you can learn to fine-tune `ftrace` and narrow down your searches. To understand `ftrace` in more detail and its advanced usage, see these excellent articles written by the core author of `ftrace` himself—Steven Rostedt.
* [Debugging the Linux kernel, part 1][3]
* [Debugging the Linux kernel, part 2][4]
* [Debugging the Linux kernel, part 3][5]
--------------------------------------------------------------------------------
via: https://opensource.com/article/21/7/linux-kernel-ftrace
作者:[Gaurav Kamathe][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/gkamathe
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/linux_keyboard_desktop.png?itok=I2nGw78_ (Linux keys on the keyboard for a desktop computer)
[2]: https://opensource.com/article/19/2/getting-started-cat-command
[3]: https://lwn.net/Articles/365835/
[4]: https://lwn.net/Articles/366796/
[5]: https://lwn.net/Articles/370423/