diff --git a/sources/tech/20171220 The current state of kernel page-table isolation.md b/sources/tech/20171220 The current state of kernel page-table isolation.md
deleted file mode 100644
index 03dbdf66a5..0000000000
--- a/sources/tech/20171220 The current state of kernel page-table isolation.md
+++ /dev/null
@@ -1,54 +0,0 @@
-Translating by qhwdw The current state of kernel page-table isolation
-============================================================
-
-At the end of October, the [KAISER][8] patch set was unveiled; this work separates the page tables used by the kernel from those belonging to user space in an attempt to address x86 processor bugs that can disclose the layout of the kernel to an attacker. Those patches have seen significant work in the weeks since their debut, but they appear to be approaching a final state. It seems like an appropriate time for another look.
-
-This work has since been renamed to "kernel page-table isolation" or KPTI, but the objective remains the same: split the page tables, which are currently shared between user and kernel space, into two sets of tables, one for each side. This is a fundamental change to how the kernel's memory management works and is the sort of thing that one would ordinarily expect to see debated for years, especially given its associated performance impact. KPTI remains on the fast track, though. [A set of preparatory patches][2] was merged into the mainline after the 4.15-rc4 release, when only important fixes would ordinarily be allowed, and the rest seems destined for the 4.16 merge window. Many of the core kernel developers have clearly put a lot of time into this work, and Linus Torvalds is [expecting][3] it to be backported to the long-term stable kernels.
-
-KPTI, in other words, has all the markings of a security patch being readied under pressure from a deadline. Just in case there are any smug ARM-based readers out there, it's worth noting that there is [an equivalent patch set for arm64][4] in the works.
-
-#### 51 Patches and counting
-
-As of this writing, the x86 patch series is at [version 163][5]. It contains 51 patches, so we can all be grateful that most of the intervening versions were not posted publicly. The initial patch set, posted by Dave Hansen, has been extensively reworked by Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, and Hugh Dickins, with suggestions from many others. Any bugs that remain in this work won't be there as the result of a lack of experienced eyeballs on the code.
-
-Page tables on contemporary systems are organized in a tree-like structure that makes for efficient storage of a sparse memory map and supports the huge pages feature; see [this 2005 article][6] for more details and a diagram of how it works. On a system with four levels of page tables (most largish systems, these days), the top level is the page global directory (PGD). Below that come the page upper directory (PUD), page middle directory (PMD), and page-table entries (PTE). Systems with five-level page tables insert a level (called the P4D) just below the PGD.
-
-Page-fault resolution normally traverses this entire tree to find the PTE of interest, but huge pages can be represented by special entries at the higher levels. For example, a 2MB chunk of memory could be represented by either a single huge-page entry at the PMD level or a full page of single-page PTE entries.
-
-In current kernels, each process has a single PGD; one of the first steps taken in the KPTI patch series is to create a second PGD. The original remains in use when the kernel is running; it maps the full address space. The second is made active (at the end of the patch series) when the process is running in user space. It points to the same directory hierarchy for pages belonging to the process itself, but the portion describing kernel space (which sits at the high end of the virtual address space) is mostly absent.
-
-Page-table entries contain permission bits describing how the memory they describe can be accessed; these bits are, naturally, set to prevent user space from accessing kernel pages, even though those pages are mapped into the address space. Unfortunately, a number of hardware-level bugs allow a user-space attacker to determine whether a given kernel-space address is mapped or not, regardless of whether any page mapped at that address is accessible. That information can be used to defeat kernel address-space layout randomization, making life much easier for a local attacker. The core idea behind KPTI is that switching to a PGD lacking a kernel-space mapping will defeat attacks based on these vulnerabilities, of which we have apparently not yet seen the last.
-
-#### Details
-
-The idea is simple but, as is so often the case, there are a number of troublesome details that turn a simple idea into a 51-part patch series. The first of those is that, if the processor responds to a hardware interrupt while running in user mode, the kernel code needed to deal with the interrupt will no longer exist in the address space. So there must be enough kernel code mapped in user mode to switch back to the kernel PGD and make the rest available. A similar situation exists for traps, non-maskable interrupts, and system calls. This code is small and can be isolated from the rest, but there are a number of tricky details involved in handling that switch safely and efficiently.
-
-Another complication comes in the form of the x86 local descriptor table (LDT), which can be used to change how the user-space memory layout looks. It can be tweaked with the little-known [`modify_ldt()`][7] system call. The early POSIX threads implementation on Linux used the LDT to create a thread-local storage area, for example. On current Linux systems, the LDT is almost unused but some applications (Wine, for example) still need it. When it is used, the LDT must be available to both kernel and user space, but it must live in kernel space. The KPTI patch set shuffles kernel memory around to reserve an entire entry at the PGD level for the LDT; the space available for `vmalloc()` calls shrinks to a mere 12,800TB as a result. That allows space for a large number of LDTs, needed on systems with many CPUs. One result of this change is that the location of the LDT is fixed and known to user space, a potential problem, since the ability to overwrite the LDT is easily exploited to compromise the system as a whole. The final patch in the series maps the LDT read-only in an attempt to head off any such attacks.
-
-Another potential vulnerability comes about if the kernel can ever be manipulated into returning to user space without switching back to the sanitized PGD. Since the kernel-space PGD also maps user-space memory, such an omission could go unnoticed for some time. The response here is to map the user-space portion of the virtual address space as non-executable in the kernel PGD. Should user space ever start running with the wrong page tables, it will immediately crash as a result.
-
-Finally, while all existing x86 processors are seemingly affected by information-disclosure vulnerabilities, future processors may not be. KPTI comes with a measurable run-time cost, estimated at about 5%. That is a cost that some users may not want to pay, especially once they get newer processors that lack these problems. There will be a `nopti` command-line option to disable this mechanism at boot time. The patch series also adds a new "feature" flag (`X86_BUG_CPU_INSECURE`) to indicate vulnerable CPUs; it is set on all x86 CPUs currently, but might not be on future hardware. In the absence of this feature flag, page-table isolation will automatically be turned off.
-
-Approximately one month remains before the opening of the 4.16 merge window. During that time, the KPTI patch set will undoubtedly go through a number of additional revisions as the inevitable glitches come to light. Once things settle down, though, it would appear that this code will be merged and backported to stable kernels in a relative hurry. Apparently, we can look forward to slower, but more secure, kernels as a new-year's present.
-
---------------------------------------------------------------------------------
-
-via: https://lwn.net/SubscriberLink/741878/eb6c9d3913d7cb2b/
-
-Author: [Jonathan Corbet][a]
-Translator: [translator ID](https://github.com/译者ID)
-Proofreader: [proofreader ID](https://github.com/校对者ID)
-
-This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
-
-[a]:https://lwn.net/SubscriberLink/741878/eb6c9d3913d7cb2b/
-[1]:https://lwn.net/Promo/slink-trial2-2/claim
-[2]:https://git.kernel.org/linus/64a48099b3b31568ac45716b7fafcb74a0c2fcfe
-[3]:https://lwn.net/Articles/741882/
-[4]:https://lwn.net/Articles/740393/
-[5]:https://lwn.net/Articles/741883/
-[6]:https://lwn.net/Articles/117749/
-[7]:http://man7.org/linux/man-pages/man2/modify_ldt.2.html
-[8]:https://lwn.net/Articles/738975/
-
-
diff --git a/translated/tech/20171220 The current state of kernel page-table isolation.md b/translated/tech/20171220 The current state of kernel page-table isolation.md
new file mode 100644
index 0000000000..06ce3bba15
--- /dev/null
+++ b/translated/tech/20171220 The current state of kernel page-table isolation.md
@@ -0,0 +1,54 @@
+The current state of kernel page-table isolation
+============================================================
+
+At the end of October, the [KAISER][8] patch set was made public; it separates the page tables used by the kernel from those used by user space, in order to address x86 processor bugs that can reveal the kernel's layout to an attacker. The patches have seen significant work in the weeks since they were published, but they now appear to be approaching their final state. This looks like a good time to take another look at them.
+
+This work has since been renamed "kernel page-table isolation", or KPTI, but the goal is unchanged: split the page tables, which are currently shared between user space and kernel space, into two sets, one for each side. This is a fundamental change to the kernel's memory management, and the sort of thing one would normally expect to be debated for years, especially given its performance impact. KPTI is nonetheless on the fast track. [A set of preparatory patches][2] was merged into the mainline after the 4.15-rc4 release, a point at which only important fixes are normally accepted, and the rest appears destined for the 4.16 merge window. Many of the core kernel developers have put a great deal of time into this work, and Linus Torvalds [expects][3] it to be backported to the long-term stable kernels.
+
+KPTI, in other words, bears all the marks of a security patch being readied under deadline pressure. For any ARM-based readers, it is worth noting that [an equivalent patch set for arm64][4] is in the works as well.
+
+#### 51 patches and counting
+
+As of this writing, the x86 patch series is at [version 163][5]. It contains 51 patches, so we can be grateful that most of the intermediate versions were never posted publicly. The initial patch set, posted by Dave Hansen, has been extensively reworked by Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, and Hugh Dickins, with suggestions from many others. Whatever bugs remain in this work, they will not be there for lack of experienced eyes on the code.
+
+Page tables on modern systems are organized as a tree, which allows sparse memory maps to be stored efficiently and supports the huge-pages feature; see [this 2005 article][6] for more details and a diagram of how it works. On a system with four levels of page tables (most large systems these days), the top level is the page global directory (PGD). Below it come the page upper directory (PUD), the page middle directory (PMD), and the page-table entries (PTE). Systems with five-level page tables insert one more level (called the P4D) just below the PGD.
+
+Resolving a page fault normally walks the whole tree to find the PTE of interest, but huge pages can be represented by special entries at the higher levels. For example, a 2MB chunk of memory can be represented either by a single huge-page entry at the PMD level or by a full page of single-page PTE entries.
+
+In current kernels, each process has a single PGD; one of the first steps in the KPTI patch series is to create a second one. The original remains in use while the kernel is running; it maps the full address space. The second is made active (at the end of the patch series) while the process is running in user space. It points to the same directory hierarchy for the pages belonging to the process itself, but the portion describing kernel space (which sits at the top of the virtual address space) is mostly absent.
+
+Page-table entries contain permission bits describing how the memory they map may be accessed; these bits are, naturally, set to prevent user space from accessing kernel pages, even though those pages are mapped into its address space. Unfortunately, a number of hardware-level bugs allow a user-space attacker to determine whether a given kernel-space address is mapped, regardless of whether any page mapped at that address is accessible. That information can be used to defeat kernel address-space layout randomization, making life much easier for a local attacker. The core idea behind KPTI is that switching to a PGD with no kernel-space mapping will defeat attacks based on these vulnerabilities, of which we have apparently not yet seen the last.
+
+#### Details
+
+The idea is simple but, as is so often the case, a number of troublesome details turn it into a 51-part patch series. The first is that, if the processor takes a hardware interrupt while running in user mode, the kernel code needed to handle the interrupt will no longer be present in the address space. So enough kernel code must remain mapped in user mode to switch back to the kernel PGD and make the rest available. The same applies to traps, non-maskable interrupts, and system calls. This code is small and can be isolated from the rest, but handling the switch safely and efficiently involves some tricky details.
+
+Another complication comes from the x86 local descriptor table (LDT), which can be used to change how the user-space memory layout looks. It can be adjusted with the little-known [`modify_ldt()`][7] system call. Early POSIX threads implementations on Linux, for example, used the LDT to create a thread-local storage area. On current Linux systems the LDT is almost unused, but some applications (Wine, for example) still need it. When it is used, the LDT must be accessible to both kernel and user space, but it must live in kernel space. The KPTI patch set rearranges kernel memory to reserve an entire PGD-level entry for the LDT; as a result, the space available for `vmalloc()` calls shrinks to a mere 12,800TB. That leaves room for a large number of LDTs, as needed on systems with many CPUs. One result of this change is that the location of the LDT is fixed and known to user space, which is a potential problem, since the ability to overwrite the LDT is easily exploited to compromise the whole system. The final patch in the series maps the LDT read-only in an attempt to head off such attacks.
+
+Another potential vulnerability arises if the kernel can ever be manipulated into returning to user space without switching back to the sanitized PGD. Since the kernel PGD also maps user-space memory, such an omission could go unnoticed for some time. The response is to map the user-space portion of the virtual address space as non-executable in the kernel PGD. Should user space ever start running with the wrong page tables, it will crash immediately.
+
+Finally, while all existing x86 processors seem to be affected by information-disclosure vulnerabilities, future processors may not be. KPTI has a measurable run-time cost, estimated at about 5%. That is a cost some users may not want to pay, especially once they have newer processors without these problems. There will be a `nopti` command-line option to disable the mechanism at boot time. The patch series also adds a new "feature" flag (`X86_BUG_CPU_INSECURE`) to mark vulnerable CPUs; it is currently set on all x86 CPUs, but might not be on future hardware. In the absence of this flag, page-table isolation is turned off automatically.
+
+About one month remains before the 4.16 merge window opens. During that time, the KPTI patch set will undoubtedly go through a number of further revisions as the inevitable glitches come to light. Once things settle down, though, it appears this code will be merged and backported to the stable kernels in a relative hurry. Apparently, we can look forward to slower, but more secure, kernels as a new-year's present.
+
+--------------------------------------------------------------------------------
+
+via: https://lwn.net/SubscriberLink/741878/eb6c9d3913d7cb2b/
+
+Author: [Jonathan Corbet][a]
+Translator: [qhwdw](https://github.com/qhwdw)
+Proofreader: [proofreader ID](https://github.com/校对者ID)
+
+This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
+
+[a]:https://lwn.net/SubscriberLink/741878/eb6c9d3913d7cb2b/
+[1]:https://lwn.net/Promo/slink-trial2-2/claim
+[2]:https://git.kernel.org/linus/64a48099b3b31568ac45716b7fafcb74a0c2fcfe
+[3]:https://lwn.net/Articles/741882/
+[4]:https://lwn.net/Articles/740393/
+[5]:https://lwn.net/Articles/741883/
+[6]:https://lwn.net/Articles/117749/
+[7]:http://man7.org/linux/man-pages/man2/modify_ldt.2.html
+[8]:https://lwn.net/Articles/738975/
+
+