Notes on BPF & eBPF
===================

Today was Papers We Love, my favorite meetup! [Suchakra Sharma][6] ([@tuxology][7] on Twitter/GitHub) gave a GREAT talk about the original BPF paper and recent work in Linux on eBPF. It really made me want to go write eBPF programs!

The paper is [The BSD Packet Filter: A New Architecture for User-level Packet Capture][8].

I wanted to write some notes on the talk here because I thought it was super, super good.

To start, here are the [slides][9] and a [pdf][10]. The PDF is good because there are links at the end, and in the PDF you can click them.

### what’s BPF?

Before BPF, if you wanted to do packet filtering you had to copy all the packets into userspace and then filter them there (with a “tap”). This had 2 problems:

1. if you filter in userspace, you have to copy all the packets into userspace first, and copying data is expensive
2. the filtering algorithms people were using were inefficient

The solution to problem #1 seems sort of obvious: move the filtering logic into the kernel somehow. Okay. (Though the details of how that’s done aren’t obvious; we’ll talk about that in a second.)

But why were the filtering algorithms inefficient? Well!! If you run `tcpdump host foo`, it actually runs a relatively complicated query, which you could represent with this tree:

![](https://jvns.ca/images/bpf-1.png)

Evaluating this tree is kind of expensive. So the first insight is that you can actually represent this tree in a simpler way, like this:

![](https://jvns.ca/images/bpf-2.png)

Then if you have `ether.type = IP` and `ip.src = foo`, you automatically know that the packet matches `host foo`; you don’t need to check anything else. So this data structure (they call it a “control flow graph”, or “CFG”) is a way better representation of the program you actually want to execute to check for matches than the tree we started with.

### How BPF works in the kernel

The main important thing here is that packets are just arrays of bytes. BPF programs run on these arrays of bytes. They’re not allowed to have loops, but they  _can_  have smart stuff to figure out the length of the IP header (IPv4 & IPv6 headers are different lengths!) and then find the TCP port based on that length:

```
x = ip_header_length
port = *(packet_start + x + port_offset)
```

(It looks different from that, but it’s basically the same idea; there’s a plain-C version of this trick at the end of this section.) There’s a nice description of the virtual machine in the paper/slides so I won’t explain it.

When you run `tcpdump host foo`, this is what happens, as far as I understand:

1. convert `host foo` into an efficient DAG of the rules
2. convert that DAG into a BPF program (in BPF bytecode) for the BPF virtual machine
3. send the BPF bytecode to the Linux kernel, which verifies it
4. compile the BPF bytecode program into native code. For example, [here’s the JIT code for ARM][1] and for [x86][2]
5. when packets come in, Linux runs the native code to decide if that packet should be filtered or not. It’ll often run only 100-200 CPU instructions for each packet that needs to be processed, which is super fast!
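To make the header-length trick above concrete, here’s a little sketch of the same logic in plain C (my own illustration, not code from the paper or from tcpdump): given a raw Ethernet + IPv4 frame, it finds the TCP destination port even though the IP header is variable-length.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: find the TCP destination port in a raw Ethernet + IPv4 frame.
 * The IP header is variable-length, so we read the IHL field first,
 * just like the BPF pseudocode above. (Assumes the buffer is big enough
 * and really contains an IPv4/TCP packet.) */
uint16_t tcp_dst_port(const uint8_t *pkt) {
    const uint8_t *ip  = pkt + 14;             /* skip the 14-byte Ethernet header */
    uint8_t ihl        = ip[0] & 0x0f;         /* IP header length, in 32-bit words */
    const uint8_t *tcp = ip + ihl * 4;         /* start of the TCP header */
    return (uint16_t)((tcp[2] << 8) | tcp[3]); /* dest port is bytes 2-3, big-endian */
}

int main(void) {
    /* a fake minimal frame, just to demonstrate the call */
    uint8_t pkt[54] = {0};
    pkt[14] = 0x45;                 /* IPv4, IHL = 5 (20-byte header) */
    pkt[36] = 0x01; pkt[37] = 0xbb; /* dest port bytes at offset 14 + 20 + 2 */
    printf("port = %u\n", tcp_dst_port(pkt)); /* prints 443 */
    return 0;
}
```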
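And here’s roughly what step 3 (“send the BPF bytecode to the kernel”) looks like from userspace for classic BPF: a minimal sketch (assuming Linux and root privileges) that hand-writes a tiny “only IPv4 packets” filter and attaches it to a raw socket with `SO_ATTACH_FILTER`. In real life tcpdump generates the bytecode for you; you can see what it generates for any expression with `tcpdump -d host foo`.

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int main(void) {
    /* Classic BPF bytecode: accept the packet only if the ethertype
     * field (bytes 12-13 of the frame) says IPv4, otherwise drop it. */
    struct sock_filter code[] = {
        BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),              /* A = ethertype */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 1), /* IPv4? fall through : skip */
        BPF_STMT(BPF_RET | BPF_K, 0xffff),                   /* accept (up to 0xffff bytes) */
        BPF_STMT(BPF_RET | BPF_K, 0),                        /* drop */
    };
    struct sock_fprog prog = {
        .len    = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };

    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0) { perror("socket"); return 1; }
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
        perror("setsockopt");
        return 1;
    }
    /* every packet read from `sock` is now pre-filtered in the kernel */
    close(sock);
    return 0;
}
```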
### the present: eBPF

But BPF has been around for a long time! Now we live in the EXCITING FUTURE, which is eBPF. I’d heard about eBPF a bunch before, but I felt like this helped me put the pieces together a little better. (I wrote this [XDP & eBPF post][11] back in April when I was at netdev.)

Some facts about eBPF:

* eBPF programs have their own bytecode language, and are compiled from that bytecode language into native code in the kernel, just like BPF programs
* eBPF programs run in the kernel
* eBPF programs can’t access arbitrary kernel memory. Instead, the kernel provides functions to get at some restricted subset of things.
* they  _can_  communicate with userspace programs through BPF maps
* there’s a `bpf` syscall as of Linux 3.18 (there’s a sketch of calling it directly near the end of this post)

### kprobes & eBPF

You can pick a function (any function!) in the Linux kernel and execute a program that you write every time that function is called. This seems really amazing and magical.

For example! There’s this [BPF program called disksnoop][12] which tracks when you start/finish writing a block to disk. Here’s a snippet from the code:

```
BPF_HASH(start, struct request *);

void trace_start(struct pt_regs *ctx, struct request *req) {
    // stash start timestamp by request ptr
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
}

...

b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
```

This basically declares a BPF hash (which the program uses to keep track of when the request starts/finishes), a function called `trace_start` which is going to be compiled into BPF bytecode, and then attaches `trace_start` to the `blk_start_request` and `blk_mq_start_request` kernel functions.

This is all using the `bcc` framework, which lets you write Python-ish programs that generate BPF code. You can find it (it has tons of example programs) at [https://github.com/iovisor/bcc][13].

### uprobes & eBPF

So I sort of knew you could attach eBPF programs to kernel functions, but I didn’t realize you could attach eBPF programs to userspace functions! That’s really exciting. Here’s [an example of counting malloc calls in Python using an eBPF program][14].

### things you can attach eBPF programs to

* network cards, with XDP (which I wrote about a while back)
* tc egress/ingress (in the network stack)
* kprobes (any kernel function)
* uprobes (any userspace function, apparently?? like any function in a C program with symbols)
* probes that were built for dtrace, called “USDT probes” (like [these MySQL probes][3]). Here’s an [example program using dtrace probes][4]
* [the JVM][5]
* tracepoints (not sure what those are yet)
* seccomp / landlock security things
* a bunch more things
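All of these hooks get data back out to userspace through the BPF maps and the `bpf` syscall mentioned earlier. bcc hides this behind things like `BPF_HASH(...)`, but there’s no deep magic underneath. Here’s a minimal sketch (my own illustration; assumes Linux 3.18+ kernel headers and root) of creating a hash map by calling the syscall directly:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int main(void) {
    /* Describe the map we want: a hash from 32-bit keys to 64-bit values,
     * roughly what BPF_HASH(start, ...) in disksnoop sets up for you. */
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(__u32);
    attr.value_size  = sizeof(__u64);
    attr.max_entries = 256;

    /* The kernel hands back a file descriptor for the map; eBPF programs
     * update it from the kernel side and userspace reads it through the fd. */
    int map_fd = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    if (map_fd < 0) { perror("bpf(BPF_MAP_CREATE)"); return 1; }
    printf("created map, fd = %d\n", map_fd);
    close(map_fd);
    return 0;
}
```

That fd-on-one-side, kernel-helpers-on-the-other split is how something like disksnoop gets its timing data out of the kernel.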
### this talk was super cool

There are a bunch of great links in the slides and in [LINKS.md][15] in the iovisor repository. It is late now, but soon I want to actually write my first eBPF program!

--------------------------------------------------------------------------------

via: https://jvns.ca/blog/2017/06/28/notes-on-bpf---ebpf/

Author: [Julia Evans][a]

[a]: https://jvns.ca/
[1]: https://github.com/torvalds/linux/blob/v4.10/arch/arm/net/bpf_jit_32.c#L512
[2]: https://github.com/torvalds/linux/blob/v3.18/arch/x86/net/bpf_jit_comp.c#L189
[3]: https://dev.mysql.com/doc/refman/5.7/en/dba-dtrace-ref-query.html
[4]: https://github.com/iovisor/bcc/blob/master/examples/tracing/mysqld_query.py
[5]: http://blogs.microsoft.co.il/sasha/2016/03/31/probing-the-jvm-with-bpfbcc/
[6]: http://suchakra.in/
[7]: https://twitter.com/tuxology
[8]: http://www.vodun.org/papers/net-papers/van_jacobson_the_bpf_packet_filter.pdf
[9]: https://speakerdeck.com/tuxology/the-bsd-packet-filter
[10]: http://step.polymtl.ca/~suchakra/PWL-Jun28-MTL.pdf
[11]: https://jvns.ca/blog/2017/04/07/xdp-bpf-tutorial/
[12]: https://github.com/iovisor/bcc/blob/0c8c179fc1283600887efa46fe428022efc4151b/examples/tracing/disksnoop.py
[13]: https://github.com/iovisor/bcc
[14]: https://github.com/iovisor/bcc/blob/00f662dbea87a071714913e5c7382687fef6a508/tests/lua/test_uprobes.lua
[15]: https://github.com/iovisor/bcc/blob/master/LINKS.md