Merge pull request #1129 from GOLinux/master

Translated:20140603 Write Your First Linux Kernel Module.md
Xingyu.Wang 2014-06-06 15:45:16 +08:00
commit f042596f20
2 changed files with 369 additions and 376 deletions

@ -1,376 +0,0 @@
GOLinux has claimed this article; come and grab it if you are not afraid of the stink...
Write your first Linux Kernel module
================================================================================
> Ever wanted to start hacking the kernel? Don't have a clue how to begin? Let us show you how it's done…
Kernel programming is often seen as black magic. In Arthur C. Clarke's sense, it probably is. The Linux kernel is quite different from its user space: many abstractions are waived, and you have to take extra care, as a bug in your code affects the whole system. There is no easy way to do floating-point maths, the stack is fixed and small, and the code you write is always asynchronous, so you need to think about concurrency. Despite all of this, though, the Linux kernel is just a very large and complex C program that is open for everyone to read, learn from and improve, and you too can be a part of it.
> “The easiest way to start kernel programming
> is to write a module: a piece of code that
> can be dynamically loaded into the kernel.”
Probably the easiest way to start kernel programming is to write a module: a piece of code that can be dynamically loaded into the kernel and removed from it. There are limits to what modules can do; for example, they can't add fields to, or remove fields from, common data structures like process descriptors. But in all other ways they are full-fledged kernel-level code, and they can always be compiled into the kernel (thus removing all the restrictions) if needed. It is entirely possible to develop and compile a module outside the Linux source tree (this is unsurprisingly called an out-of-tree build), which is very convenient if you just want to play a bit and do not wish to submit your changes for inclusion into the mainline kernel.
In this tutorial, we'll develop a simple kernel module that creates a **/dev/reverse** device. A string written to this device is read back with the word order reversed (“Hello World” becomes “World Hello”). It is a popular programmer interview puzzle, and you are likely to get some bonus points when you show the ability to implement it at the kernel level as well. A word of warning before we start: a bug in your module may lead to a system crash and (unlikely, but possible) data loss. Be sure you've backed up all your important data before you start, or, even better, experiment in a virtual machine.
### Avoid root if possible ###
> By default, **/dev/reverse** is available to root only, so you'll have to run your test programs with **sudo**. To fix this, create a **/lib/udev/rules.d/99-reverse.rules** file that contains:
>
>     SUBSYSTEM=="misc", KERNEL=="reverse", MODE="0666"
>
> Don't forget to reinsert the module. Making device nodes accessible to non-root users is generally not a good idea, but it is quite useful during development. This is not to mention that running test binaries as root is not a good idea either.
#### A module's anatomy ####
As most Linux kernel modules are written in C (apart from low-level architecture-specific parts), it is recommended that you keep your module in a single file (say, reverse.c). We've put the full source code on GitHub, and here we'll look at some snippets of it. To begin, let's include some common headers and describe the module using predefined macros:
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");
    MODULE_AUTHOR("Valentine Sinitsyn <valentine.sinitsyn@gmail.com>");
    MODULE_DESCRIPTION("In-kernel phrase reverser");

Everything is straightforward here, except for **MODULE_LICENSE()**: it is not a mere marker. The kernel strongly favours GPL-compatible code, so if you set the licence to something that is not GPL-compatible (say, “Proprietary”), certain kernel functions (those exported with **EXPORT_SYMBOL_GPL()**) will not be available to your module.
### When not to write a kernel module ###
> Kernel programming is fun, but writing (and especially debugging) kernel code in a real-world project requires certain skills. In general, you should descend to the kernel level only if there is no other way to solve your problem. Chances are you can stay in userspace if:
>
> - You develop a USB driver: have a look at [libusb][1].
> - You develop a filesystem: try [FUSE][2].
> - You are extending Netfilter: [libnetfilter_queue][3] may help you then.
>
> Generally, native kernel code will perform better, but for many projects this performance loss isn't crucial.
Since kernel programming is always asynchronous, there is no **main()** function that Linux executes sequentially to run your module. Instead, you provide callbacks for various events, like this:
    static int __init reverse_init(void)
    {
        printk(KERN_INFO "reverse device has been registered\n");
        return 0;
    }

    static void __exit reverse_exit(void)
    {
        printk(KERN_INFO "reverse device has been unregistered\n");
    }

    module_init(reverse_init);
    module_exit(reverse_exit);

Here, we define functions to be called on the module's insertion and removal. Only the first one is required. For now, they simply print a message to the kernel ring buffer (accessible from userspace via the **dmesg** command); **KERN_INFO** is a log level (note there is no comma after it). **__init** and **__exit** are attributes: pieces of metadata attached to functions (or variables). Attributes are rarely seen in userspace C code but are pretty common in the kernel. Everything marked with **__init** is recycled after initialisation (remember the old “Freeing unused kernel memory…” message?). **__exit** denotes functions that are safe to optimise out when the code is built statically into the kernel. Finally, the **module_init()** and **module_exit()** macros set the **reverse_init()** and **reverse_exit()** functions as lifecycle callbacks for our module. The actual function names aren't important; you can call them **init()** and **exit()** or **start()** and **stop()**, if you wish. They are declared static and hence invisible outside your module. In fact, any function in the kernel is invisible unless explicitly exported. However, prefixing your functions with the module name is a common convention among kernel programmers.
These are the bare bones; let's make things more interesting. Modules can accept parameters, like this:
    # modprobe foo bar=1

The **modinfo** command displays all parameters accepted by the module, and these are also available under **/sys/module//parameters** as files. Our module will need a buffer to store phrases; let's make its size user-configurable. Add the following three lines just below **MODULE_DESCRIPTION()**:

    static unsigned long buffer_size = 8192;
    module_param(buffer_size, ulong, (S_IRUSR | S_IRGRP | S_IROTH));
    MODULE_PARM_DESC(buffer_size, "Internal buffer size");

Here, we define a variable to store the value, wrap it into a parameter, and make it readable by everyone via sysfs. The parameter's description (the last line) appears in **modinfo**'s output.
As the user can set **buffer_size** directly, we need to sanitise it in **reverse_init()**. You should always check data that comes from outside the kernel; if you don't, you are opening yourself to kernel panics or even security holes.
    static int __init reverse_init(void)
    {
        /* Reject a zero-sized buffer with a meaningful errno code */
        if (!buffer_size)
            return -EINVAL;
        printk(KERN_INFO
               "reverse device has been registered, buffer size is %lu bytes\n",
               buffer_size);
        return 0;
    }

A non-zero return value from a module init function indicates failure; by convention, it is a negative errno code.
### Navigation ###
> The Linux kernel is the ultimate source for everything you may need when developing modules. However, it's quite big, and you may have trouble trying to find what you are after. Luckily, there are tools that make it easier to navigate large codebases. First of all, there is Cscope, a venerable tool that runs in a terminal. Simply run **make cscope && cscope** in the kernel sources' top-level directory. Cscope integrates well with Vim and Emacs, so you can use it without leaving the comfort of your favourite editor.
>
> If terminal-based tools aren't your cup of tea, visit [http://lxr.free-electrons.com][4]. It is a web-based kernel navigation tool with not quite as many features as Cscope (for example, you can't easily find usages of a function), but it still provides enough for quick lookups.
Now it's time to compile the module. You will need the headers for the kernel version you are running (**linux-headers** or an equivalent package) and **build-essential** (or analogous). Next, it's time to create a boilerplate Makefile (remember that the command lines under **all:** and **clean:** must start with a tab):
    obj-m += reverse.o

    all:
    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

    clean:
    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

Now, call **make** to build your first module. If you typed everything correctly, you will find **reverse.ko** in the current directory. Insert it with **sudo insmod reverse.ko**, and run:
    $ dmesg | tail -1
    [ 5905.042081] reverse device has been registered, buffer size is 8192 bytes

Congratulations! However, for now this line is telling lies: there is no device node yet. Let's fix it.
#### Miscellaneous devices ####
In Linux, there is a special character device type called “miscellaneous” (or simply “misc”). It is designed for small device drivers with a single entry point, and is exactly what we need. All misc devices share the same major number (10), so one driver (**drivers/char/misc.c**) can look after all of them, and they are distinguished by their minor numbers. In all other senses, they are just normal character devices.
To register a minor number (and an entry point) for the device, you declare a **struct miscdevice**, fill in its fields (note the syntax), and call **misc_register()** with a pointer to this structure. For this to work, you will also need to include the **linux/miscdevice.h** header file:

    static struct miscdevice reverse_misc_device = {
        .minor = MISC_DYNAMIC_MINOR,
        .name = "reverse",
        .fops = &reverse_fops
    };

    static int __init reverse_init(void)
    {
        ...
        misc_register(&reverse_misc_device);
        printk(KERN_INFO ...
    }

Here, we request the first available (dynamic) minor number for the device named “reverse”; the ellipsis indicates omitted code that we've already seen. Don't forget to unregister the device on the module's teardown:

    static void __exit reverse_exit(void)
    {
        misc_deregister(&reverse_misc_device);
        ...
    }

The fops field stores a pointer to a struct **file_operations** (declared in linux/fs.h), and this is the entry point for our module. **reverse_fops** is defined as:

    static struct file_operations reverse_fops = {
        .owner = THIS_MODULE,
        .open = reverse_open,
        ...
        .llseek = noop_llseek
    };

Again, **reverse_fops** contains a set of callbacks (also known as methods) to be executed when userspace code opens a device, reads from it, writes to it or closes the file descriptor. If you omit any of these, a sensible fallback will be used instead. That's why we explicitly set the **llseek** method to **noop_llseek()**, which (as the name implies) does nothing. The default implementation changes a file pointer, and we don't want our device to be seekable now (this will be your home assignment for today).
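
The fallback idea can be illustrated outside the kernel. What follows is a simplified userspace analogy, not kernel code: `struct demo_ops`, `demo_do_seek()` and the seek helpers are hypothetical names invented for this sketch, and the real kernel's defaults differ between methods and kernel versions. It shows the designated-initialiser syntax used for **reverse_fops** and how a caller might fall back to a default when a method is left unset (NULL).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical method table, loosely modelled on struct file_operations. */
struct demo_ops {
	int (*open)(void);
	long (*seek)(long pos, long offset);
};

static int demo_open(void)
{
	return 0;
}

/* Default behaviour: actually move the position. */
static long default_seek(long pos, long offset)
{
	return pos + offset;
}

/* Explicit "do nothing" method, in the spirit of noop_llseek(). */
static long noop_seek(long pos, long offset)
{
	(void)offset;
	return pos;
}

/* .seek is omitted here, so it is implicitly NULL. */
static const struct demo_ops seekable_ops = {
	.open = demo_open,
};

/* .seek is set explicitly to override the default. */
static const struct demo_ops fixed_ops = {
	.open = demo_open,
	.seek = noop_seek,
};

/* The caller picks the method if present, the fallback otherwise. */
static long demo_do_seek(const struct demo_ops *ops, long pos, long offset)
{
	assert(ops != NULL);
	if (ops->seek)
		return ops->seek(pos, offset);
	return default_seek(pos, offset);
}
```

Calling `demo_do_seek(&seekable_ops, 10, 5)` goes through the default and moves the position, while the same call with `&fixed_ops` leaves it where it was.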
#### I open at the close ####
Let's implement the methods. We'll allocate a new buffer for each file descriptor opened, and free it on close. This is not really safe: if a userspace application leaks descriptors (perhaps intentionally), it may hog the RAM and render the system unusable. You should always think about these possibilities in the real world, but for the tutorial, it's acceptable.
We'll need a structure to describe the buffer. The kernel provides many generic data structures: linked lists (which are doubly linked), hash tables, trees and so on. However, buffers are usually implemented from scratch. We will call ours “struct buffer”:

    struct buffer {
        char *data, *end, *read_ptr;
        unsigned long size;
    };

**data** is a pointer to the string this buffer stores, and **end** is the first byte after the end of the string. **read_ptr** is where **read()** should start reading the data from. The buffer size is stored for completeness; for now, we don't use this field. You shouldn't assume the users of your structure will correctly initialise all of these, so it is better to encapsulate buffer allocation and deallocation in functions. They are usually named **buffer_alloc()** and **buffer_free()**.

    static struct buffer *buffer_alloc(unsigned long size)
    {
        struct buffer *buf;
        buf = kzalloc(sizeof(*buf), GFP_KERNEL);
        if (unlikely(!buf))
            goto out;
        ...
    out:
        return buf;
    }

Kernel memory is allocated with **kmalloc()** and freed with **kfree()**; the **kzalloc()** flavour sets the memory to all zeroes. Unlike standard **malloc()**, its kernel counterpart receives flags specifying the type of memory requested in the second argument. Here, **GFP_KERNEL** means we need normal kernel memory (not in DMA or high-memory zones) and the function can sleep (reschedule the process) if needed. **sizeof(*buf)** is a common way to get the size of a structure accessible via a pointer.
You should always check **kmalloc()**'s return value: dereferencing a NULL pointer will result in a kernel panic. Also note the use of the **unlikely()** macro. It (and the opposite **likely()** macro) is widely used in the kernel to signify that a condition is almost always false (or, for **likely()**, almost always true). It doesn't affect control flow, but it lets the compiler arrange the code so that the common path runs faster on modern processors with branch prediction.
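
If you want to try these hints in userspace, they can be approximated with a compiler builtin. The sketch below assumes GCC or Clang (`__builtin_expect` is their extension, not standard C); the macro definitions mirror the common idiom rather than quoting the kernel headers, and `check_allocation()` is a hypothetical helper made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel macros (GCC/Clang builtin assumed). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: returns 0 for a valid pointer, -1 for NULL. */
static int check_allocation(const void *ptr)
{
	if (unlikely(!ptr))
		return -1;	/* rare error path */
	return 0;		/* common path */
}
```

The macros only pass a hint to the compiler; the value of the condition, and therefore the program's behaviour, is unchanged.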
Finally, note the **goto** statements. They are often considered evil; however, the Linux kernel (and some other system software) employs them to implement centralised function exiting. This results in less deeply nested and more readable code, and is much like the **try-catch** blocks used in higher-level languages.
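
To see the centralised-exit pattern end to end, here is a userspace sketch of what the elided parts of **buffer_alloc()** and **buffer_free()** might look like. It is an assumption for illustration, not the author's code: `calloc()` and `free()` stand in for **kzalloc()** and **kfree()**, and the `out_free` label is a name chosen for this example.

```c
#include <assert.h>
#include <stdlib.h>

struct buffer {
	char *data, *end, *read_ptr;
	unsigned long size;
};

/* Release the string storage first, then the descriptor itself. */
static void buffer_free(struct buffer *buf)
{
	if (!buf)
		return;
	free(buf->data);
	free(buf);
}

static struct buffer *buffer_alloc(unsigned long size)
{
	struct buffer *buf;

	assert(size > 0);

	buf = calloc(1, sizeof(*buf));	/* zeroed, like kzalloc() */
	if (!buf)
		goto out;

	buf->data = calloc(size, 1);
	if (!buf->data)
		goto out_free;

	/* An empty buffer: nothing stored, nothing left to read. */
	buf->end = buf->data;
	buf->read_ptr = buf->data;
	buf->size = size;
out:
	return buf;

out_free:
	/* Undo the first allocation if the second one failed. */
	free(buf);
	return NULL;
}
```

Each failure jumps to the label that undoes exactly what has been done so far, so the success path stays flat and every exit is visible at the bottom of the function.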
With **buffer_alloc()** and **buffer_free()** in place, the implementation of the **open** and **close** methods becomes pretty straightforward.
    static int reverse_open(struct inode *inode, struct file *file)
    {
        int err = 0;
        file->private_data = buffer_alloc(buffer_size);
        ...
        return err;
    }

**struct file** is a standard kernel data structure that stores information about an opened file, like the current file position (**file->f_pos**), flags (**file->f_flags**), or open mode (**file->f_mode**). Another field, **file->private_data**, is used to associate the file with some arbitrary data. Its type is void *, and it is opaque to the kernel outside the file's owner. We store a buffer there.
If the buffer allocation fails, we indicate this to the calling userspace code by returning a negative value (**-ENOMEM**). The C library making the **open(2)** system call (probably **glibc**) will detect this and set **errno** appropriately.
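
Seen from userspace, the convention looks like this. The helper below is a hypothetical sketch (`open_or_errno()` is not part of any library): it calls **open(2)** and converts the C library's "-1 plus **errno**" result back into the kernel-style negative error code.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open path for reading and writing.
 * Returns the file descriptor, or a negative errno value in kernel style.
 */
static int open_or_errno(const char *path)
{
	int fd;

	assert(path != NULL);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -errno;	/* e.g. -ENOMEM if the driver could not allocate */
	return fd;
}
```

A caller would close the returned descriptor with `close()` when done; a negative result can be passed through `strerror(-ret)` to print a readable message.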
#### Learn to read and write ####
“Read” and “write” methods are where the real job is done. When data is written to a buffer, we drop its previous contents and reverse the phrase in-place, without any temporary storage. The **read** method simply copies the data from the kernel buffer into the userspace. But what should the **reverse_read()** method do if there is no data in the buffer yet? In userspace, the **read()** call would block until the data is available. In the kernel, you must wait. Luckily, there is a mechanism for this, and it is called wait queues.
The idea is simple. If the current process needs to wait for some event, its descriptor (a **struct task_struct**, available as **current**) is put into a non-runnable (sleeping) state and added to a queue. Then **schedule()** is called to select another process to run. Code that generates the event uses the queue to wake up the waiters by putting them back into the **TASK_RUNNING** state. The scheduler will select one of them at some point in the future. Linux has several non-runnable process states, most notably **TASK_INTERRUPTIBLE** (a sleep that can be interrupted with a signal) and **TASK_KILLABLE** (a sleeping process that can be killed). All of this should be handled correctly, and wait queues do this for you.
A natural place to store our read wait queue head is struct buffer, so start by adding a **wait_queue_head_t read_queue** field to it. You should also include **linux/sched.h**. A wait queue head can be declared statically with the **DECLARE_WAIT_QUEUE_HEAD()** macro. In our case, dynamic initialisation is needed, so add this line to **buffer_alloc()**:

    init_waitqueue_head(&buf->read_queue);

We wait for the data to be available, that is, for the **read_ptr != end** condition to become true. We also want the wait to be interruptible (say, by Ctrl+C). So the “read” method should start like this:

    static ssize_t reverse_read(struct file *file, char __user * out,
                                size_t size, loff_t * off)
    {
        struct buffer *buf = file->private_data;
        ssize_t result;
        while (buf->read_ptr == buf->end) {
            if (file->f_flags & O_NONBLOCK) {
                result = -EAGAIN;
                goto out;
            }
            if (wait_event_interruptible
                (buf->read_queue, buf->read_ptr != buf->end)) {
                result = -ERESTARTSYS;
                goto out;
            }
        }
        ...

We loop until the data is available and use **wait_event_interruptible()** (it's a macro, not a function, which is why the queue is passed by value) to wait if it isn't. If **wait_event_interruptible()** is, well, interrupted, it returns a non-zero value, which we translate to **-ERESTARTSYS**. This code means the system call should be restarted. The **file->f_flags** check accounts for files opened in non-blocking mode: if there is no data, we return **-EAGAIN**.
We can't use **if()** instead of **while()**, since there can be many processes waiting for the data. When the **write** method wakes them, the scheduler chooses the one to run in an unpredictable way, so by the time this code is given a chance to execute, the buffer can be empty again. Now we need to copy the data from **buf->data** to userspace. The **copy_to_user()** kernel function does just that:

        size = min(size, (size_t) (buf->end - buf->read_ptr));
        if (copy_to_user(out, buf->read_ptr, size)) {
            result = -EFAULT;
            goto out;
        }

The call can fail if the userspace pointer is wrong; if this happens, we return **-EFAULT**. Remember not to trust anything coming from outside the kernel!

        buf->read_ptr += size;
        result = size;
    out:
        return result;
    }

Simple arithmetic is needed so the data can be read in arbitrary chunks. The method returns the number of bytes read or an error code.
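
The chunking arithmetic is easy to check in isolation. Below is a userspace sketch with hypothetical names (`struct cursor` and `cursor_read()` are invented here, and `memcpy()` stands in for **copy_to_user()**): each call hands out at most the bytes that remain and advances the read pointer by exactly that amount.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-down buffer: just the two pointers the read path uses. */
struct cursor {
	const char *read_ptr;	/* next byte to hand out */
	const char *end;	/* first byte after the stored string */
};

/* Copy up to size bytes to out; returns how many bytes were copied. */
static size_t cursor_read(struct cursor *c, char *out, size_t size)
{
	size_t avail;

	assert(c != NULL && out != NULL);

	avail = (size_t)(c->end - c->read_ptr);
	if (size > avail)
		size = avail;		/* same idea as min(size, end - read_ptr) */
	memcpy(out, c->read_ptr, size);
	c->read_ptr += size;		/* the next call continues from here */
	return size;
}
```

Reading an 11-byte string with a 5-byte request followed by an 8-byte request returns 5 and then 6 bytes; a further call returns 0, which is the point at which the kernel version would start waiting.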
The write method is simpler and shorter. First, we check that the buffer has enough space, then we use the **copy_from_user()** function to get the data. Then the **read_ptr** and **end** pointers are reset and the buffer contents are reversed:

        buf->end = buf->data + size;
        buf->read_ptr = buf->data;
        if (buf->end > buf->data)
            reverse_phrase(buf->data, buf->end - 1);

Here, **reverse_phrase()** does all the heavy lifting. It relies on the **reverse_word()** function, which is quite short and marked inline. This is another common optimisation; however, you shouldn't overuse it, since aggressive inlining makes the kernel image unnecessarily large.
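
The article does not list these two helpers, so here is one possible userspace sketch of them; it is an assumption about how they could work, not the code from the author's repository. It uses the classic in-place trick: reverse the characters of every word, then reverse the whole range. As in the call above, both pointers are inclusive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reverse the characters between start and end (both inclusive). */
static inline void reverse_word(char *start, char *end)
{
	while (start < end) {
		char tmp = *start;
		*start++ = *end;
		*end-- = tmp;
	}
}

/* Reverse the word order of the non-empty range [start, end] in place. */
static void reverse_phrase(char *start, char *end)
{
	char *word_start = start;
	char *p;

	assert(start != NULL && end != NULL && start <= end);

	/* Step 1: reverse each space-delimited word on its own. */
	for (p = start; p <= end; p++) {
		if (*p == ' ') {
			if (p > word_start)
				reverse_word(word_start, p - 1);
			word_start = p + 1;
		}
	}
	if (word_start <= end)
		reverse_word(word_start, end);

	/* Step 2: reverse the whole range, which restores every word. */
	reverse_word(start, end);
}
```

For a NUL-terminated string `s`, the call is `reverse_phrase(s, s + strlen(s) - 1)`; the caller has to skip empty strings, just as the `buf->end > buf->data` check does in the write method.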
Finally, we need to wake up processes waiting for the data at **read_queue**, as described earlier. **wake_up_interruptible()** does just that:
        wake_up_interruptible(&buf->read_queue);

Phew! You now have a kernel module that at least compiles successfully. Now it's time to test it.
### Debugging kernel code ###
> Perhaps the most common debugging method in the kernel is printing. You can use plain **printk()** (presumably with the **KERN_DEBUG** log level) if you wish. However, there are better ways. Use **pr_debug()**, or **dev_dbg()** if you are writing a device driver that has its own “struct device”: they support the dynamic debug (**dyndbg**) feature and can be enabled or disabled on request (see **Documentation/dynamic-debug-howto.txt**). For pure development messages, use **pr_devel()**, which becomes a no-op unless DEBUG is defined. To enable DEBUG for our module, include:
>
>     CFLAGS_reverse.o := -DDEBUG
>
> in the Makefile. After that, use **dmesg** to view debug messages generated by **pr_debug()** or **pr_devel()**.
>
> Alternatively, you can send debug messages directly to the console. To do this, either set the **console_loglevel** kernel variable to 8 or greater (**echo 8 > /proc/sys/kernel/printk**) or temporarily print the debug message in question at a high log level like **KERN_ERR**. Naturally, you should remove debug statements of this kind before publishing your code.
>
> Note that kernel messages appear on the console, not in a terminal emulator window such as Xterm; that's why you'll find recommendations not to do kernel development in the X environment.
### Surprise, surprise! ###
Compile the module and load it into the kernel:
    $ make
    $ sudo insmod reverse.ko buffer_size=2048
    $ lsmod
    reverse 2419 0
    $ ls -l /dev/reverse
    crw-rw-rw- 1 root root 10, 58 Feb 22 15:53 /dev/reverse

Everything seems to be in place. Now, to test how the module works, we'll write a small program that reverses its first command line argument. The **main()** function (sans error checking) may look like this:

    int fd = open("/dev/reverse", O_RDWR);
    write(fd, argv[1], strlen(argv[1]));
    read(fd, argv[1], strlen(argv[1]));
    printf("Read: %s\n", argv[1]);

Run it as:

    $ ./test 'A quick brown fox jumped over the lazy dog'
    Read: dog lazy the over jumped fox brown quick A

It works! Play with it a little: try passing single-word or single-letter phrases, empty or non-English strings (if you have a keyboard layout set) and anything else.
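
For experiments beyond the happy path, a version of the test with the error checking put back can be handy. The function below is a sketch under assumptions: `reverse_via_device()` is a hypothetical name, and the device path is passed in so that the same code can be pointed at **/dev/reverse** or at a path that does not exist.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write phrase to the device at path, then read the result back into the
 * same buffer. Returns the number of bytes read, or a negative errno value.
 */
static ssize_t reverse_via_device(const char *path, char *phrase)
{
	size_t len;
	ssize_t n;
	int fd, saved;

	assert(path != NULL && phrase != NULL);
	len = strlen(phrase);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -errno;

	n = write(fd, phrase, len);
	if (n < 0)
		goto out_err;

	n = read(fd, phrase, len);
	if (n < 0)
		goto out_err;

	close(fd);
	return n;

out_err:
	saved = errno;	/* close() may overwrite errno */
	close(fd);
	return -saved;
}
```

A `main()` built on it would call `reverse_via_device("/dev/reverse", argv[1])`, print the phrase when the result is non-negative, and print `strerror(-result)` otherwise.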
Now let's make things a little trickier. We'll create two processes that share the file descriptor (and hence the kernel buffer). One will continuously write strings to the device, and the other will read them. The **fork(2)** system call is used in the example below, but pthreads will work as well. I also omitted the code that opens and closes the device and does the error checking (again):

    char *phrase = "A quick brown fox jumped over the lazy dog";
    if (fork())
        /* Parent is the writer */
        while (1)
            write(fd, phrase, len);
    else
        /* child is the reader */
        while (1) {
            read(fd, buf, len);
            printf("Read: %s\n", buf);
        }

What do you expect this program to output? Below is what I've got on my laptop:

    Read: dog lazy the over jumped fox brown quick A
    Read: A kcicq brown fox jumped over the lazy dog
    Read: A kciuq nworb xor jumped fox brown quick A
    Read: A kciuq nworb xor jumped fox brown quick A
    ...

What's going on here? It's a race. We thought **read** and **write** were atomic, or executed one instruction at a time from the beginning till the end. However, the kernel is a concurrent beast, and it can easily reschedule the process running the kernel-mode part of the **write** operation somewhere inside the **reverse_phrase()** function. If the process that does **read()** is scheduled before the writer is given a chance to finish, it will see the data in an inconsistent state. Such bugs are really hard to debug. But how do we fix it?
Basically, we need to ensure that no **read** method can be executed until the write method returns. If you have ever programmed a multi-threaded application, you've probably seen synchronisation primitives (locks) like mutexes or semaphores. Linux has them as well, but there are nuances. Kernel code can run in the process context (working “on behalf” of the userspace code, as our methods do) and in the interrupt context (for example, in an IRQ handler). If you are in the process context and a lock you need has already been taken, you simply sleep and retry until you succeed. You can't sleep in the interrupt context, so the code spins in a loop until the lock becomes available. The corresponding primitive is called a spinlock, but in our case, a simple mutex (an object that only one process can “hold” at any given time) is sufficient. Real-world code may also use a read-write semaphore, for performance reasons.
Locks always protect some data (in our case, a “struct buffer” instance), and it is very common to embed them in the structure they are protecting. So we add a mutex (**struct mutex lock**) to “struct buffer”. We must also initialise the mutex with **mutex_init()**; **buffer_alloc()** is a good place for this. Code that uses mutexes must also include **linux/mutex.h**.
A mutex is much like a traffic light: it's useless unless drivers look at it and follow the signals. So we need to update **reverse_read()** and **reverse_write()** to acquire the mutex before doing anything to the buffer and release it when they are done. Let's have a look at the **read** method; **write** works in just the same way:
    static ssize_t reverse_read(struct file *file, char __user * out,
                                size_t size, loff_t * off)
    {
        struct buffer *buf = file->private_data;
        ssize_t result;
        if (mutex_lock_interruptible(&buf->lock)) {
            result = -ERESTARTSYS;
            goto out;
        }

We acquire the lock at the very beginning of the function. **mutex_lock_interruptible()** either grabs the mutex and returns or puts the process to sleep until the mutex is available. As before, the **_interruptible** suffix means the sleep can be interrupted with a signal.

        while (buf->read_ptr == buf->end) {
            mutex_unlock(&buf->lock);
            /* ... wait_event_interruptible() here ... */
            if (mutex_lock_interruptible(&buf->lock)) {
                result = -ERESTARTSYS;
                goto out;
            }
        }

This is our “wait for the data” loop. You should never wait for the data while holding the mutex, or a situation called a “deadlock” may occur: the writer needs the same mutex to deliver the data. So, if there is no data, we release the mutex and call **wait_event_interruptible()**. When it returns, we reacquire the mutex and continue as usual:

        if (copy_to_user(out, buf->read_ptr, size)) {
            result = -EFAULT;
            goto out_unlock;
        }
        ...
    out_unlock:
        mutex_unlock(&buf->lock);
    out:
        return result;
    }

Finally, the mutex is unlocked when the function ends or if an error occurs while the mutex is being held. Recompile the module (don't forget to reload it) and run the second test again. You should see no corrupted data now.
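
If you have met the same pattern in userspace, the parallel may help it stick. The sketch below is an analogy, not kernel code, and every name in it (`struct slot`, `slot_init()`, `slot_write()`, `slot_read()`, `SLOT_SIZE`) is hypothetical. A pthread mutex plays the role of **buf->lock** and a condition variable plays the role of **read_queue**; `pthread_cond_wait()` releases the mutex while waiting and reacquires it before returning, which is the same unlock, wait, relock sequence that the kernel loop spells out by hand.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 64

/* Hypothetical userspace counterpart of struct buffer. */
struct slot {
	pthread_mutex_t lock;		/* plays the role of buf->lock */
	pthread_cond_t read_queue;	/* plays the role of buf->read_queue */
	char data[SLOT_SIZE];
	size_t len;			/* bytes waiting to be read; 0 means empty */
};

static void slot_init(struct slot *s)
{
	assert(s != NULL);
	pthread_mutex_init(&s->lock, NULL);
	pthread_cond_init(&s->read_queue, NULL);
	s->len = 0;
}

/* Writer: replace the contents under the lock, then wake the readers. */
static size_t slot_write(struct slot *s, const char *src, size_t size)
{
	if (size > SLOT_SIZE)
		size = SLOT_SIZE;

	pthread_mutex_lock(&s->lock);
	memcpy(s->data, src, size);
	s->len = size;
	pthread_mutex_unlock(&s->lock);

	pthread_cond_broadcast(&s->read_queue);
	return size;
}

/* Reader: sleep until there is data, then take up to size bytes. */
static size_t slot_read(struct slot *s, char *out, size_t size)
{
	pthread_mutex_lock(&s->lock);

	/* "while", not "if": another reader may have emptied the slot first. */
	while (s->len == 0)
		pthread_cond_wait(&s->read_queue, &s->lock);

	if (size > s->len)
		size = s->len;
	memcpy(out, s->data, size);

	/* Keep whatever was not read at the front of the slot. */
	memmove(s->data, s->data + size, s->len - size);
	s->len -= size;

	pthread_mutex_unlock(&s->lock);
	return size;
}
```

With one thread calling `slot_write()` in a loop and another calling `slot_read()`, the reader never observes a half-written phrase, because both sides touch `data` and `len` only while holding the lock.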
### What's next? ###
Now you have a taste of kernel hacking. We've just scratched the surface of the topic, and there is much more to see. Our first module was intentionally simple; however, the concepts you learned will stay the same in more complex scenarios as well. Concurrency, method tables, registering callbacks, putting processes to sleep and waking them up are things that every kernel hacker should be comfortable with, and now you've seen all of them in action. Maybe your kernel code will end up in the mainline Linux source tree some day. Drop us a line if this happens!
--------------------------------------------------------------------------------
via: http://www.linuxvoice.com/be-a-kernel-hacker/
Translator: [translator ID](https://github.com/译者ID) Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was originally translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux.cn](http://linux.cn/)
[1]:http://www.libusb.org/
[2]:http://fuse.sf.net/
[3]:http://www.netfilter.org/projects/libnetfilter_queue
[4]:http://lxr.free-electrons.com/

@ -0,0 +1,369 @@
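Write your first Linux kernel module
================================================================================
> Ever wanted to roam around inside the kernel, but had no idea which way to go? Stop feeling lost and let us point the way…
Kernel programming often looks like black magic, and in Arthur C. Clarke's sense it probably is. The Linux kernel is very different from its user space: many abstractions are gone, and you must be extremely careful, because a single bug in your code affects the whole system. Floating-point maths is not easy to do, the stack is fixed and small, and the code you write is always asynchronous, so you have to think about concurrency. Apart from all that, the Linux kernel is just a very large, very complex C program that is open to everyone: anyone can read it, learn from it and improve it, and you can be one of them.
> “The easiest way to start kernel programming
> is to write a module: a piece of code
> that can be dynamically loaded into the kernel.”
Probably the easiest way to start kernel programming is to write a module: a piece of code that can be dynamically loaded into the kernel and removed from it. What a module can do is limited; for example, it cannot add fields to, or remove fields from, common data structures such as process descriptors. In every other respect, modules are full-fledged kernel-level code, and they can always be compiled into the kernel when needed, which lifts all the restrictions. It is entirely possible to develop and compile a module outside the Linux source tree (unsurprisingly, this is called an out-of-tree build), which is convenient if you only want to play around a little and do not intend to submit your changes for inclusion in the mainline kernel.
In this tutorial, we will develop a simple kernel module that creates a **/dev/reverse** device. A string written to this device is read back with its word order reversed (“Hello World” is read as “World Hello”). This is a popular programmer interview puzzle, and you may well earn some bonus points by showing that you can implement it at the kernel level too. A word of warning before we start: a bug in your module can crash the system and (unlikely, but possible) cause data loss. Make sure you have backed up your important data before you begin or, better still, experiment in a virtual machine.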
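### Avoid root whenever possible ###
> By default, **/dev/reverse** is available to root only, so you have to run your test programs with **sudo**. To fix this, create a **/lib/udev/rules.d/99-reverse.rules** file with the following contents:
>
>     SUBSYSTEM=="misc", KERNEL=="reverse", MODE="0666"
>
> Don't forget to reinsert the module. Letting non-root users access device nodes is usually not a good idea, but it is very useful during development, not to mention that running test binaries as root is not a good idea either.
#### Anatomy of a module ####
Since most Linux kernel modules are written in C (apart from low-level architecture-specific parts), it is recommended that you keep your module in a single file (for example, reverse.c). We have put the complete source code on GitHub, and here we will look at some snippets from it. To begin, we include some common headers and describe the module with predefined macros: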
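    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");
    MODULE_AUTHOR("Valentine Sinitsyn <valentine.sinitsyn@gmail.com>");
    MODULE_DESCRIPTION("In-kernel phrase reverser");

Everything here is straightforward except **MODULE_LICENSE()**, which is more than a mere marker. The kernel strongly favours GPL-compatible code, so if you set the licence to something that is not GPL-compatible (say, “Proprietary”), certain kernel functions will not be available to your module.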
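### When not to write a kernel module ###
> Kernel programming is fun, but writing (and especially debugging) kernel code in a real-world project requires particular skills. In general, you should descend to the kernel level only when there is no other way to solve your problem. You can probably stay in userspace if:
> - You are developing a USB driver: have a look at [libusb][1].
> - You are developing a filesystem: try [FUSE][2].
> - You are extending Netfilter: [libnetfilter_queue][3] may help you.
>
> Native kernel code generally performs better, but for many projects this performance loss is not critical.
Since kernel programming is always asynchronous, there is no **main()** function that Linux executes sequentially to run your module. Instead, you provide callbacks for various events, like this: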
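    static int __init reverse_init(void)
    {
        printk(KERN_INFO "reverse device has been registered\n");
        return 0;
    }

    static void __exit reverse_exit(void)
    {
        printk(KERN_INFO "reverse device has been unregistered\n");
    }

    module_init(reverse_init);
    module_exit(reverse_exit);

Here we define the functions that are called when the module is inserted and removed; only the first one is required. For now, they simply print a message to the kernel ring buffer (which can be read from userspace with the **dmesg** command); **KERN_INFO** is the log level (note that there is no comma). **__init** and **__exit** are attributes: pieces of metadata attached to functions (or variables). Attributes are rare in userspace C code but very common in the kernel. Everything marked **__init** is reclaimed after initialisation (remember the old “Freeing unused kernel memory…” message?). **__exit** marks functions that can safely be optimised out when the code is built statically into the kernel. Finally, the **module_init()** and **module_exit()** macros set **reverse_init()** and **reverse_exit()** as the lifecycle callbacks of our module. The actual function names don't matter: you can call them **init()** and **exit()**, or **start()** and **stop()**, or whatever you like. They are declared static and are therefore invisible outside your module. In fact, any function in the kernel is invisible unless it is explicitly exported. Nevertheless, prefixing your functions with the module name is an established convention among kernel programmers.
These are the basics; let's make things more interesting. Modules can accept parameters, like this: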
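    # modprobe foo bar=1

The **modinfo** command displays all the parameters a module accepts, and they are also available as files under **/sys/module//parameters**. Our module needs a buffer to store phrases, so let's make its size user-configurable. Add the following three lines just below **MODULE_DESCRIPTION()**:

    static unsigned long buffer_size = 8192;
    module_param(buffer_size, ulong, (S_IRUSR | S_IRGRP | S_IROTH));
    MODULE_PARM_DESC(buffer_size, "Internal buffer size");

Here we define a variable to store the value, wrap it in a parameter, and make it readable by everyone through sysfs. The parameter's description (the last line) appears in the output of modinfo.
Since the user can set **buffer_size** directly, we need to sanitise it in **reverse_init()**. You should always check data that comes from outside the kernel; if you don't, you expose yourself to kernel panics or even security holes.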
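    static int __init reverse_init(void)
    {
        /* Reject a zero-sized buffer with a meaningful errno code */
        if (!buffer_size)
            return -EINVAL;
        printk(KERN_INFO
               "reverse device has been registered, buffer size is %lu bytes\n",
               buffer_size);
        return 0;
    }

A non-zero return value from a module's init function means that initialisation failed.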
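### Navigation ###
> When you develop modules, the Linux kernel itself is the source of everything you need. However, it is quite large, and you may have trouble finding what you are looking for. Fortunately, there are tools that make browsing a huge codebase easier. The first is Cscope, a venerable tool that runs in a terminal. All you have to do is run **make cscope && cscope** in the top-level directory of the kernel sources. Cscope integrates well with Vim and Emacs, so you can use it without leaving the comfort of your favourite editor.
> If terminal-based tools are not your cup of tea, visit [http://lxr.free-electrons.com][4]. It is a web-based kernel navigation tool; it has fewer features than Cscope (for example, you cannot easily find where a function is used), but it still offers enough for quick lookups.
Now it is time to compile the module. You will need the headers for the kernel version you are running (**linux-headers** or an equivalent package) and **build-essential** (or a similar package). Next, create a boilerplate Makefile (the command lines under **all:** and **clean:** must start with a tab):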
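    obj-m += reverse.o

    all:
    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

    clean:
    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

Now call **make** to build your first module. If you typed everything correctly, you will find the reverse.ko file in the current directory. Insert it with **sudo insmod reverse.ko**, then run:

    $ dmesg | tail -1
    [ 5905.042081] reverse device has been registered, buffer size is 8192 bytes

Congratulations! For now, however, this line is teasing you: there is no device node yet. Let's fix that.
#### Miscellaneous devices ####
Linux has a special character device type called “miscellaneous” (or simply “misc”). It is designed for small device drivers with a single entry point, which is exactly what we need. All misc devices share the same major number (10), so one driver (**drivers/char/misc.c**) can look after all of them, and the devices are told apart by their minor numbers. In every other sense, they are ordinary character devices.
To register a minor number (and an entry point) for the device, you declare a **struct miscdevice**, fill in its fields (note the syntax), and call **misc_register()** with a pointer to that structure. For this to work, you also need to include the **linux/miscdevice.h** header: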
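    static struct miscdevice reverse_misc_device = {
        .minor = MISC_DYNAMIC_MINOR,
        .name = "reverse",
        .fops = &reverse_fops
    };

    static int __init reverse_init(void)
    {
        ...
        misc_register(&reverse_misc_device);
        printk(KERN_INFO ...
    }

Here we request the first available (dynamic) minor number for the device named “reverse”; the ellipsis stands for omitted code that we have already seen. Don't forget to unregister the device when the module is torn down: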
static void __exit reverse_exit(void)
{
misc_deregister(&reverse_misc_device);
...
}
fops字段存储了一个指针指向结构函数**file_operations**在Linux/fs.h中已声明而这真是我们模块的接入点。**reverse_fops**定义如下:
static struct file_operations reverse_fops = {
.owner = THIS_MODULE,
.open = reverse_open,
...
.llseek = noop_llseek
};
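Again, **reverse_fops** contains a set of callbacks (also known as methods) that are executed when userspace code opens the device, reads from it, writes to it or closes the file descriptor. If you omit any of these, a sensible fallback is used instead. That is why we explicitly set the **llseek** method to **noop_llseek()**, which (as the name implies) does nothing. The default implementation changes the file pointer, and we don't want our device to be seekable for now (making it seekable is your home assignment for today).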
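The idea of a method table with fallbacks can be sketched in plain userspace C. Everything below is hypothetical and heavily simplified: `struct toy_ops`, `do_llseek()` and the two handlers are stand-ins invented for illustration, and the real kernel dispatch is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for struct file_operations. */
struct toy_ops {
    long (*llseek)(long pos, long offset);
    int  (*open)(void);
};

/* Fallback used when a driver leaves .llseek unset: move the position. */
static long default_llseek(long pos, long offset)
{
    return pos + offset;
}

/* Counterpart of noop_llseek(): leave the position alone. */
static long toy_noop_llseek(long pos, long offset)
{
    (void)offset;
    return pos;
}

/* The "core" dispatches through the table, substituting the fallback. */
static long do_llseek(const struct toy_ops *ops, long pos, long offset)
{
    if (ops->llseek)
        return ops->llseek(pos, offset);
    return default_llseek(pos, offset);
}

/* .llseek omitted: the fallback applies. */
static const struct toy_ops seekable_ops = { .open = NULL };
/* .llseek set explicitly: seeking becomes a no-op. */
static const struct toy_ops reverse_like_ops = { .llseek = toy_noop_llseek };
```

With these tables, `do_llseek(&seekable_ops, 10, 5)` moves the position while `do_llseek(&reverse_like_ops, 10, 5)` leaves it where it was, which is the behaviour the module asks for by naming **noop_llseek()** explicitly.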
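#### I open at the close ####

Let's implement the methods. We will allocate a new buffer for each opened file descriptor and free it on close. This is not really safe: if a userspace application leaks descriptors (perhaps intentionally), it may hog the RAM and render the system unusable. In the real world you should always think about such possibilities, but for this tutorial the approach is acceptable.

We need a structure to describe the buffer. The kernel provides many generic data structures: linked lists (which are doubly linked), hash tables, trees and so on. Buffers, however, are usually implemented from scratch. We will call ours "struct buffer":

    struct buffer {
        char *data, *end, *read_ptr;
        unsigned long size;
    };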
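**data** is a pointer to the string this buffer stores, and **end** points to the first byte after the end of the string. **read_ptr** is where **read()** should start reading data from. The buffer size is stored for completeness; for now, we don't use this field. You shouldn't assume that the users of your structure will initialise all of these correctly, so it is better to encapsulate buffer allocation and deallocation in functions. They are usually named **buffer_alloc()** and **buffer_free()**.

    static struct buffer *buffer_alloc(unsigned long size)
    {
        struct buffer *buf;

        buf = kzalloc(sizeof(*buf), GFP_KERNEL);
        if (unlikely(!buf))
            goto out;
        ...
    out:
        return buf;
    }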
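Kernel memory is allocated with **kmalloc()** and freed with **kfree()**; the **kzalloc()** flavour sets the memory to all zeroes. Unlike the standard **malloc()**, its kernel counterpart takes a second argument: flags that specify the type of memory requested. Here, **GFP_KERNEL** means that we need normal kernel memory (not in the DMA or high-memory zones) and that the function may sleep (reschedule the process) if needed. **sizeof(*buf)** is a common way to get the size of a structure that is accessed through a pointer.

You should always check the return value of **kmalloc()**: dereferencing a NULL pointer results in a kernel panic. Also note the use of the **unlikely()** macro. It (and its opposite, **likely()**) is widely used in the kernel to signal that a condition is almost always false (or true). It does not affect control flow, but it helps modern processors boost performance through branch prediction.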
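For readers who want to try the hint outside the kernel, here is a small userspace sketch. It assumes GCC or Clang, since `__builtin_expect` is a compiler extension; as far as I know the kernel macros are thin wrappers of this same shape, but treat the definitions below as illustrative. `make_table()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace definitions in the spirit of the kernel macros;
 * __builtin_expect is a GCC/Clang extension. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Returns a zeroed array of n ints, or NULL if allocation fails. */
static int *make_table(size_t n)
{
    int *table = calloc(n, sizeof(*table));

    if (unlikely(!table))   /* the failure path is expected to be rare */
        return NULL;
    return table;
}
```

The macros evaluate to the truth value of their argument, so wrapping a condition in them never changes which branch is taken; it only tells the compiler which one to favour.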
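Finally, note the **goto** statements. They are often considered evil, but the Linux kernel (and some other system software) uses them to implement centralised function exits. The result is code that is less deeply nested and more readable, much like the **try-catch** blocks of higher-level languages.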
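To see how the centralised-exit style scales to more than one resource, consider this userspace sketch. It is not the article's elided code: `calloc()` and `free()` stand in for `kzalloc()` and `kfree()`, and the second allocation and the `out_free` label are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct buffer {
    char *data, *end, *read_ptr;
    unsigned long size;
};

/* Userspace analogue of buffer_alloc(): each failure jumps to the
 * label that undoes exactly what has been set up so far. */
static struct buffer *buffer_alloc(unsigned long size)
{
    struct buffer *buf;

    buf = calloc(1, sizeof(*buf));
    if (!buf)
        goto out;

    buf->data = calloc(1, size);
    if (!buf->data)
        goto out_free;

    buf->size = size;
    buf->end = buf->read_ptr = buf->data;   /* empty buffer */
out:
    return buf;

out_free:
    free(buf);
    return NULL;
}

/* Release a buffer created by buffer_alloc(); NULL is accepted. */
static void buffer_free(struct buffer *buf)
{
    if (!buf)
        return;
    free(buf->data);
    free(buf);
}
```

Every error path reads top to bottom without nesting, and adding a third resource would only mean adding one more label.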
With **buffer_alloc()** and **buffer_free()** in place, the **open** and **close** methods become pretty straightforward.

    static int reverse_open(struct inode *inode, struct file *file)
    {
        int err = 0;

        file->private_data = buffer_alloc(buffer_size);
        ...
        return err;
    }
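**struct file** is a standard kernel data structure that stores information about an opened file, such as the current file position (**file->f_pos**), flags (**file->f_flags**) or open mode (**file->f_mode**). Another field, **file->private_data**, is used to associate the file with some arbitrary data. Its type is void *, and it is opaque to the kernel outside the file's owner. We store a buffer there.

If the buffer allocation fails, we signal this to the calling userspace code by returning a negative value (**-ENOMEM**).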
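#### Learn to read and write ####

The "read" and "write" methods are where the real job is done. When data is written to the buffer, we drop its previous contents and reverse the phrase in place, without any temporary storage. The **read** method simply copies the data from the kernel buffer to userspace. But what should **reverse_read()** do if there is no data in the buffer yet? In userspace, the **read()** call would block until data is available. In the kernel, you must wait. Luckily, there is a mechanism for this, called wait queues.

The idea is simple. If the current process needs to wait for some event, its descriptor (a **struct task_struct**, available as "current") is put into a non-runnable (sleeping) state and added to a queue. Then **schedule()** is called to select another process to run. The code that generates the event uses the queue to wake up the waiters by putting them back into the **TASK_RUNNING** state. The scheduler will pick one of them at some point in the future. Linux has several non-runnable process states, most notably **TASK_INTERRUPTIBLE** (a sleep that can be interrupted by a signal) and **TASK_KILLABLE** (a sleep that can be interrupted only by a fatal signal). All of this has to be handled correctly, and wait queues do it for you.

A natural place to store our read wait queue head is struct buffer, so start by adding a **wait_queue_head_t read_queue** field to it. You should also include **linux/sched.h**. A wait queue head can be declared statically with the DECLARE_WAIT_QUEUE_HEAD() macro. In our case dynamic initialisation is needed, so add this line to **buffer_alloc()**:

    init_waitqueue_head(&buf->read_queue);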
We wait for data to become available, that is, for the condition **read_ptr != end** to become true. We also want the wait to be interruptible (say, by Ctrl+C). So the "read" method should start like this:

    static ssize_t reverse_read(struct file *file, char __user *out,
                                size_t size, loff_t *off)
    {
        struct buffer *buf = file->private_data;
        ssize_t result;

        while (buf->read_ptr == buf->end) {
            if (file->f_flags & O_NONBLOCK) {
                result = -EAGAIN;
                goto out;
            }
            if (wait_event_interruptible(buf->read_queue,
                                         buf->read_ptr != buf->end)) {
                result = -ERESTARTSYS;
                goto out;
            }
        }
        ...
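We loop until data is available, and use **wait_event_interruptible()** (it is a macro, not a function, which is why the queue is passed by value) to wait if it isn't. If **wait_event_interruptible()** is, well, interrupted, it returns a non-zero value, which we translate to **-ERESTARTSYS**. This code means that the system call should be restarted. The **file->f_flags** check accounts for files opened in non-blocking mode: if there is no data, we return **-EAGAIN**.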
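The same wait-in-a-loop shape appears in userspace with POSIX condition variables, which may be more familiar. The sketch below is an analogy, not kernel code: the one-slot `struct mailbox` and its two functions are hypothetical, and `pthread_cond_wait()` merely plays a role similar to the wait queue.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical one-slot mailbox; not part of the module. */
struct mailbox {
    pthread_mutex_t lock;
    pthread_cond_t  has_data;   /* plays the role of read_queue */
    int value;
    int full;
};

/* Store a value and wake a waiting reader, if any. */
static void mailbox_put(struct mailbox *mb, int value)
{
    pthread_mutex_lock(&mb->lock);
    mb->value = value;
    mb->full = 1;
    pthread_cond_signal(&mb->has_data);   /* cf. wake_up_interruptible() */
    pthread_mutex_unlock(&mb->lock);
}

/* Block until a value is present, then take it. */
static int mailbox_get(struct mailbox *mb)
{
    int value;

    pthread_mutex_lock(&mb->lock);
    while (!mb->full)                     /* re-check after every wake-up */
        pthread_cond_wait(&mb->has_data, &mb->lock);
    value = mb->value;
    mb->full = 0;
    pthread_mutex_unlock(&mb->lock);
    return value;
}
```

A reader that wakes up takes nothing for granted: it tests the condition again before touching the data. On most toolchains this is built with `-pthread`.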
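We can't use **if()** instead of **while()**, because many processes may be waiting for the data. When the **write** method wakes them up, the scheduler chooses which one runs in an unpredictable way, so by the time this code gets a chance to execute, the buffer may be empty again. Next, we need to copy the data from **buf->data** to userspace. The **copy_to_user()** kernel function does just that:

        size = min(size, (size_t) (buf->end - buf->read_ptr));
        if (copy_to_user(out, buf->read_ptr, size)) {
            result = -EFAULT;
            goto out;
        }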
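The call can fail if the userspace pointer is wrong; if that happens, we return **-EFAULT**. Remember: never trust anything that comes from outside the kernel!

        buf->read_ptr += size;
        result = size;
    out:
        return result;
    }

This simple arithmetic is needed so that the data can be read in chunks of arbitrary size. The method returns the number of bytes read, or an error code.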
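The pointer arithmetic is easy to try out in userspace. In this sketch `memcpy()` replaces `copy_to_user()` and an empty buffer simply returns 0 instead of sleeping; `buffer_read()` is a hypothetical helper, and the struct is trimmed to the three pointers the logic needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct buffer {
    char *data, *end, *read_ptr;
};

/* Userspace stand-in for the tail of reverse_read(). Copies at most
 * `size` bytes to `out` and returns the number of bytes copied. */
static long buffer_read(struct buffer *buf, char *out, size_t size)
{
    size_t avail = (size_t)(buf->end - buf->read_ptr);

    if (size > avail)
        size = avail;               /* never read past the stored string */
    memcpy(out, buf->read_ptr, size);
    buf->read_ptr += size;          /* the next call continues from here */
    return (long)size;
}
```

Reading the 11-byte string "World Hello" four bytes at a time yields chunks of 4, 4 and 3 bytes, after which the function reports 0 bytes because `read_ptr` has caught up with `end`.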
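The write method is shorter. First, we check that the buffer has enough space, then we use the **copy_from_user()** function to fetch the data. After that, **read_ptr** and the **end** pointer are reset and the buffer contents are reversed:

        buf->end = buf->data + size;
        buf->read_ptr = buf->data;
        if (buf->end > buf->data)
            reverse_phrase(buf->data, buf->end - 1);

Here, **reverse_phrase()** does all the heavy lifting. It relies on the **reverse_word()** function, which is quite short and marked inline. This is another common optimisation, but you shouldn't overuse it: aggressive inlining makes the kernel image unnecessarily large.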
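The text does not show the bodies of these two functions. One common way to reverse word order in place, given here as an assumption rather than as the author's code, is to flip the whole phrase character by character and then flip each word back. The signatures follow the call above, where the second argument points at the last character (inclusive).

```c
#include <assert.h>
#include <string.h>

/* Reverse the characters in [start, end], both ends inclusive. */
static inline void reverse_word(char *start, char *end)
{
    while (start < end) {
        char tmp = *start;

        *start++ = *end;
        *end-- = tmp;
    }
}

/* Reverse word order in place: flip the whole phrase, then flip each
 * space-delimited word back. Returns start for convenience. */
static char *reverse_phrase(char *start, char *end)
{
    char *word_start = start, *p;

    reverse_word(start, end);
    for (p = start; p <= end; p++) {
        if (*p == ' ') {
            if (p > word_start)
                reverse_word(word_start, p - 1);
            word_start = p + 1;
        }
    }
    reverse_word(word_start, end);   /* the last word has no trailing space */
    return start;
}
```

No temporary storage is needed, which matches the in-place requirement. Like the caller in the listing above, this sketch expects a non-empty range.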
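Finally, we need to wake up the processes waiting for data on **read_queue**, as described earlier. **wake_up_interruptible()** does exactly that:

        wake_up_interruptible(&buf->read_queue);

Phew! You now have a kernel module that at least compiles successfully. It's time to test it.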
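### Debugging kernel code ###

> Perhaps the most common debugging method in the kernel is printing. You can use plain **printk()** (presumably at the **KERN_DEBUG** log level) if you wish. However, there are better ways. Use **pr_debug()**, or **dev_dbg()** if you are writing a device driver that has its own "struct device": they support the dynamic debug (**dyndbg**) feature and can be enabled or disabled on request (see **Documentation/dynamic-debug-howto.txt**). For pure development messages, use **pr_devel()**, which becomes a no-op unless DEBUG is defined. To enable DEBUG for our module, add the following line to the Makefile:
>
>     CFLAGS_reverse.o := -DDEBUG
>
> After that, use **dmesg** to view the debug messages generated by **pr_debug()** or **pr_devel()**.
>
> Alternatively, you can send debug messages directly to the console. To do this, either set the **console_loglevel** kernel variable to 8 or greater (**echo 8 > /proc/sys/kernel/printk**), or temporarily print the debug message in question at a high log level such as **KERN_ERR**. Naturally, you should remove debug statements of this kind before publishing your code.
>
> Note that kernel messages appear on the console, not in a terminal emulator window such as Xterm; that is why you will often see advice not to do kernel development under X.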
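The "no-op unless DEBUG is defined" behaviour can be imitated in userspace. The macro below is hypothetical (`devel_note()` is not a kernel API) and relies on the GNU `##__VA_ARGS__` extension supported by GCC and Clang. The `if (0)` form keeps the compiler checking the format string while making sure the arguments are never evaluated.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace macro modelled on the idea behind pr_devel():
 * a real print when DEBUG is defined, otherwise dead code that the
 * compiler still type-checks but never executes. */
#ifdef DEBUG
#define devel_note(fmt, ...) fprintf(stderr, "reverse: " fmt, ##__VA_ARGS__)
#else
#define devel_note(fmt, ...) \
    do { if (0) fprintf(stderr, "reverse: " fmt, ##__VA_ARGS__); } while (0)
#endif

static int calls;   /* counts evaluations of noisy() */

static int noisy(void)
{
    return ++calls;
}

/* Returns how many times the macro's argument was evaluated. */
static int demo(void)
{
    calls = 0;
    devel_note("value %d\n", noisy());
    return calls;
}
```

Built without `-DDEBUG`, `demo()` reports that `noisy()` was never called; built with it, the message is printed and the call count is one.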
### Surprise, surprise! ###

Compile the module and load it into the kernel:

    $ make
    $ sudo insmod reverse.ko buffer_size=2048
    $ lsmod
    reverse 2419 0
    $ ls -l /dev/reverse
    crw-rw-rw- 1 root root 10, 58 Feb 22 15:53 /dev/reverse
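Everything seems to be in place. Now, to test that the module works, we will write a small program that reverses its first command-line argument. The body of **main()** (sans error checking) may look like this:

    int fd = open("/dev/reverse", O_RDWR);
    write(fd, argv[1], strlen(argv[1]));
    read(fd, argv[1], strlen(argv[1]));
    printf("Read: %s\n", argv[1]);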
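Run it as:

    $ ./test 'A quick brown fox jumped over the lazy dog'
    Read: dog lazy the over jumped fox brown quick A

It works! Play with it a little: try passing single-word or single-letter phrases, empty strings, non-English strings (if you have a suitable keyboard layout set up), and anything else you can think of.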
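Since the snippet above skips error checking, here is a hedged sketch of what a more careful version might look like. `reverse_via_device()` is a hypothetical helper; it takes the device path as a parameter so that it can also be pointed at a path that does not exist to see the failure handling.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write `phrase` to the device at `path` and read the result into `out`
 * (capacity `cap`, NUL-terminated on success).
 * Returns 0 on success, -1 on any failure. */
static int reverse_via_device(const char *path, const char *phrase,
                              char *out, size_t cap)
{
    size_t len = strlen(phrase);
    ssize_t n;
    int fd, ret = -1;

    if (cap == 0 || len >= cap)
        return -1;                  /* result would not fit */

    fd = open(path, O_RDWR);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (write(fd, phrase, len) != (ssize_t)len) {
        perror("write");
        goto out;
    }
    n = read(fd, out, cap - 1);
    if (n < 0) {
        perror("read");
        goto out;
    }
    out[n] = '\0';
    ret = 0;
out:
    close(fd);
    return ret;
}
```

With the module loaded, a `main()` could call `reverse_via_device("/dev/reverse", argv[1], out, sizeof(out))` and print `out` on success.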
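Now let's make things a little trickier. We will create two processes that share the file descriptor (and hence the kernel buffer). One will continuously write strings to the device, and the other will read them. The example below uses the **fork(2)** system call, but pthreads would work just as well. I have also omitted the code that opens and closes the device and does the error checking (again):

    char *phrase = "A quick brown fox jumped over the lazy dog";

    if (fork())
        /* Parent is the writer */
        while (1)
            write(fd, phrase, len);
    else
        /* child is the reader */
        while (1) {
            read(fd, buf, len);
            printf("Read: %s\n", buf);
        }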
What do you expect this program to output? Below is what I got on my laptop:

    Read: dog lazy the over jumped fox brown quick A
    Read: A kcicq brown fox jumped over the lazy dog
    Read: A kciuq nworb xor jumped fox brown quick A
    Read: A kciuq nworb xor jumped fox brown quick A
    ...
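What is going on here? It's a race. We assumed that **read** and **write** were atomic, that is, executed in one go from beginning to end. However, the kernel is a concurrent beast, and it can easily reschedule the process that is running the kernel-mode part of the write operation somewhere inside the **reverse_phrase()** function. If the process doing the **read()** is scheduled before the writer gets a chance to finish, it sees the data in an inconsistent state. Such bugs are really hard to track down. So how do we fix this one?

Basically, we need to ensure that no **read** method can run until the write method returns. If you have ever written a multi-threaded application, you have probably seen synchronisation primitives (locks) such as mutexes or semaphores. Linux has them as well, but there are nuances. Kernel code can run in process context (working "on behalf of" userspace code, as our methods do) and in interrupt context (for example, in an IRQ handler). If you are in process context and the lock you need has already been taken, you simply sleep and retry until you succeed. You can't sleep in interrupt context, so the code spins in a loop until the lock becomes available. The corresponding primitive is called a spinlock, but in our case a simple mutex (an object that only one process can "hold" at any given time) is sufficient. For performance reasons, real-world code might also use a read-write semaphore.

A lock always protects some data (in our case, a "struct buffer" instance), and it is very common to embed it in the structure it protects. So we add a mutex ("struct mutex lock") to "struct buffer". We must also initialise the mutex with **mutex_init()**; **buffer_alloc()** is a good place to do this. Code that uses mutexes must also include **linux/mutex.h**.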
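The "lock lives next to the data it guards" layout looks much the same in userspace. The sketch below uses a POSIX mutex as a stand-in for `struct mutex`; `struct locked_buffer` and its helpers are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define PHRASE_MAX 64

/* Hypothetical userspace buffer with its lock embedded in the struct. */
struct locked_buffer {
    pthread_mutex_t lock;
    char data[PHRASE_MAX];
    size_t len;
};

/* Initialise the lock together with the data, as buffer_alloc() would. */
static void locked_buffer_init(struct locked_buffer *buf)
{
    pthread_mutex_init(&buf->lock, NULL);   /* cf. mutex_init() */
    buf->len = 0;
}

/* Replace the contents; returns 0, or -1 if the text does not fit. */
static int locked_buffer_store(struct locked_buffer *buf, const char *text)
{
    size_t len = strlen(text);

    if (len >= PHRASE_MAX)
        return -1;
    pthread_mutex_lock(&buf->lock);
    memcpy(buf->data, text, len + 1);
    buf->len = len;
    pthread_mutex_unlock(&buf->lock);
    return 0;
}

/* Copy the contents into `out` (at least PHRASE_MAX bytes) under the
 * same lock; returns the length. */
static size_t locked_buffer_load(struct locked_buffer *buf, char *out)
{
    size_t len;

    pthread_mutex_lock(&buf->lock);
    len = buf->len;
    memcpy(out, buf->data, len);
    out[len] = '\0';
    pthread_mutex_unlock(&buf->lock);
    return len;
}
```

Because both accessors take the same lock, a reader can never observe a half-written phrase, which is the property the module is missing at this point.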
A mutex is much like a traffic light: it is useless unless drivers look at it and follow the signals. So we need to update **reverse_read()** and **reverse_write()** to acquire the mutex before doing anything to the buffer, and to release it when they are done. Let's look at the **read** method; **write** works in just the same way:

    static ssize_t reverse_read(struct file *file, char __user *out,
                                size_t size, loff_t *off)
    {
        struct buffer *buf = file->private_data;
        ssize_t result;

        if (mutex_lock_interruptible(&buf->lock)) {
            result = -ERESTARTSYS;
            goto out;
        }
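We acquire the lock at the very beginning of the function. **mutex_lock_interruptible()** either grabs the mutex and returns, or puts the process to sleep until the mutex becomes available. As before, the **_interruptible** suffix means that the sleep can be interrupted by a signal.

        while (buf->read_ptr == buf->end) {
            mutex_unlock(&buf->lock);
            /* ... wait_event_interruptible() here ... */
            if (mutex_lock_interruptible(&buf->lock)) {
                result = -ERESTARTSYS;
                goto out;
            }
        }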
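The loop above is our "wait for the data" loop. You should never go to sleep waiting for data while holding the mutex: the writer would never be able to take it, and you would end up in a situation called a "deadlock". So, if there is no data, we release the mutex and call **wait_event_interruptible()**. When it returns, we reacquire the mutex and continue as usual:

        if (copy_to_user(out, buf->read_ptr, size)) {
            result = -EFAULT;
            goto out_unlock;
        }
        ...
    out_unlock:
        mutex_unlock(&buf->lock);
    out:
        return result;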
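Finally, the mutex is unlocked when the function ends, or if an error occurs while the mutex is held. Recompile the module (don't forget to reload it) and run the test again. This time you should see no corrupted data.

### What's next? ###

Now you have had a taste of kernel hacking. We have only scratched the surface of the topic, and there is much more to explore. Our first module was intentionally simple, but the concepts you learned from it stay the same in more complex scenarios. Concurrency, method tables, registering callbacks, putting processes to sleep and waking them up are things every kernel hacker should be comfortable with, and now you have seen them all in action. Maybe one day your kernel code will end up in the mainline Linux source tree. Drop us a line if that happens!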
--------------------------------------------------------------------------------
via: http://www.linuxvoice.com/be-a-kernel-hacker/
Translator: [GOLinux](https://github.com/GOLinux) Proofreader: [校对者ID](https://github.com/校对者ID)
This article was translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](http://linux.cn/)
[1]:http://www.libusb.org/
[2]:http://fuse.sf.net/
[3]:http://www.netfilter.org/projects/libnetfilter_queue
[4]:http://lxr.free-electrons.com/