Skip to content

Commit

Permalink
Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linu…
Browse files Browse the repository at this point in the history
…x/kernel/git/tip/tip

Pull x86 pti updates from Thomas Gleixner:
 "This contains:

   - a PTI bugfix to avoid setting reserved CR3 bits when PCID is
     disabled. This seems to cause issues on a virtual machine at least
     and is incorrect according to the AMD manual.

   - a PTI bugfix which disables the perf BTS facility if PTI is
     enabled. The BTS AUX buffer is not globally visible and causes the
     CPU to fault when the mapping disappears on switching CR3 to user
     space. A full fix which restores BTS on PTI is non trivial and will
     be worked on.

   - PTI bugfixes for EFI and trusted boot which make sure that the user
     space visible page table entries have the NX bit cleared

   - removal of dead code in the PTI pagetable setup functions

   - add PTI documentation

   - add a selftest for vsyscall to verify that the kernel actually
     implements what it advertises.

   - a sysfs interface to expose vulnerability and mitigation
     information so there is a coherent way for users to retrieve the
     status.

   - the initial spectre_v2 mitigations, aka retpoline:

      + The necessary ASM thunk and compiler support

      + The ASM variants of retpoline and the conversion of affected ASM
        code

      + Make LFENCE serializing on AMD so it can be used as speculation
        trap

      + The RSB fill after vmexit

   - initial objtool support for retpoline

  As I said in the status mail this is the most of the set of patches
  which should go into 4.15 except two straight forward patches still on
  hold:

   - the retpoline add on of LFENCE which waits for ACKs

   - the RSB fill after context switch

  Both should be ready to go early next week and with that we'll have
  covered the major holes of spectre_v2 and go back to normality"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  x86,perf: Disable intel_bts when PTI
  security/Kconfig: Correct the Documentation reference for PTI
  x86/pti: Fix !PCID and sanitize defines
  selftests/x86: Add test_vsyscall
  x86/retpoline: Fill return stack buffer on vmexit
  x86/retpoline/irq32: Convert assembler indirect jumps
  x86/retpoline/checksum32: Convert assembler indirect jumps
  x86/retpoline/xen: Convert Xen hypercall indirect jumps
  x86/retpoline/hyperv: Convert assembler indirect jumps
  x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
  x86/retpoline/entry: Convert entry assembler indirect jumps
  x86/retpoline/crypto: Convert crypto assembler indirect jumps
  x86/spectre: Add boot time option to select Spectre v2 mitigation
  x86/retpoline: Add initial retpoline support
  objtool: Allow alternatives to be ignored
  objtool: Detect jumps to retpoline thunks
  x86/pti: Make unpoison of pgd for trusted boot work for real
  x86/alternatives: Fix optimize_nops() checking
  sysfs/cpu: Fix typos in vulnerability documentation
  x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
  ...
  • Loading branch information
torvalds committed Jan 14, 2018
2 parents 2c1cfa4 + 99a9dc9 commit 40548c6
Show file tree
Hide file tree
Showing 44 changed files with 1,525 additions and 100 deletions.
16 changes: 16 additions & 0 deletions Documentation/ABI/testing/sysfs-devices-system-cpu
Original file line number Diff line number Diff line change
Expand Up @@ -375,3 +375,19 @@ Contact: Linux kernel mailing list <[email protected]>
Description: information about CPUs heterogeneity.

cpu_capacity: capacity of cpu#.

What: /sys/devices/system/cpu/vulnerabilities
/sys/devices/system/cpu/vulnerabilities/meltdown
/sys/devices/system/cpu/vulnerabilities/spectre_v1
/sys/devices/system/cpu/vulnerabilities/spectre_v2
Date: January 2018
Contact: Linux kernel mailing list <[email protected]>
Description: Information about CPU vulnerabilities

The files are named after the code names of CPU
vulnerabilities. The output of those files reflects the
state of the CPUs in the system. Possible output values:

"Not affected" CPU is not affected by the vulnerability
"Vulnerable" CPU is affected and no mitigation in effect
"Mitigation: $M" CPU is affected and mitigation $M is in effect
49 changes: 42 additions & 7 deletions Documentation/admin-guide/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2623,6 +2623,11 @@
nosmt [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.

nospectre_v2 [X86] Disable all mitigations for the Spectre variant 2
(indirect branch prediction) vulnerability. System may
allow data leaks with this option, which is equivalent
to spectre_v2=off.

noxsave [BUGS=X86] Disables x86 extended register state save
and restore using xsave. The kernel will fallback to
enabling legacy floating-point and sse state.
Expand Down Expand Up @@ -2709,8 +2714,6 @@
steal time is computed, but won't influence scheduler
behaviour

nopti [X86-64] Disable kernel page table isolation

nolapic [X86-32,APIC] Do not enable or use the local APIC.

nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
Expand Down Expand Up @@ -3291,11 +3294,20 @@
pt. [PARIDE]
See Documentation/blockdev/paride.txt.

pti= [X86_64]
Control user/kernel address space isolation:
on - enable
off - disable
auto - default setting
pti= [X86_64] Control Page Table Isolation of user and
kernel address spaces. Disabling this feature
removes hardening, but improves performance of
system calls and interrupts.

on - unconditionally enable
off - unconditionally disable
auto - kernel detects whether your CPU model is
vulnerable to issues that PTI mitigates

Not specifying this option is equivalent to pti=auto.

nopti [X86_64]
Equivalent to pti=off

pty.legacy_count=
[KNL] Number of legacy pty's. Overwrites compiled-in
Expand Down Expand Up @@ -3946,6 +3958,29 @@
sonypi.*= [HW] Sony Programmable I/O Control Device driver
See Documentation/laptops/sonypi.txt

spectre_v2= [X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.

on - unconditionally enable
off - unconditionally disable
auto - kernel detects whether your CPU model is
vulnerable

Selecting 'on' will, and 'auto' may, choose a
mitigation method at run time according to the
CPU, the available microcode, the setting of the
CONFIG_RETPOLINE configuration option, and the
compiler with which the kernel was built.

Specific mitigations can also be selected manually:

retpoline - replace indirect branches
retpoline,generic - google's original retpoline
retpoline,amd - AMD-specific minimal thunk

Not specifying this option is equivalent to
spectre_v2=auto.

spia_io_base= [HW,MTD]
spia_fio_base=
spia_pedr=
Expand Down
186 changes: 186 additions & 0 deletions Documentation/x86/pti.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,186 @@
Overview
========

Page Table Isolation (pti, previously known as KAISER[1]) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach[2].

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications. When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy. When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT). There are a few strictly unnecessary things that get mapped
such as the first C function when entering an interrupt (see
comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled. It can be
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
Once enabled at compile-time, it can be disabled at boot with the
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI. This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although _complete_, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level. This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.

The userspace page tables map only the kernel data needed to enter
and exit the kernel. This data is entirely contained in the 'struct
cpu_entry_area' structure which is placed in the fixmap which gives
each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables like normal. The only difference is when the kernel
makes entries in the top (PGD) level. In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables. This leaves a single, shared set of
userspace page tables to manage. One PTE to lock, one set of
accessed bits, dirty bits, etc...

Overhead
========

Protection against side-channel attacks is important. But,
this protection comes at a cost:

1. Increased Memory Use
a. Each process now needs an order-1 PGD instead of order-0.
(Consumes an additional 4k per process).
b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
aligned so that it can be mapped by setting a single PMD
entry. This consumes nearly 2MB of RAM once the kernel
is decompressed, but no space in the kernel image itself.

2. Runtime Cost
a. CR3 manipulation to switch between the page table copies
must be done at interrupt, syscall, and exception entry
and exit (it can be skipped when the kernel is interrupted,
though.) Moves to CR3 are on the order of a hundred
cycles, and are required at every entry and exit.
b. A "trampoline" must be used for SYSCALL entry. This
trampoline depends on a smaller set of resources than the
non-PTI SYSCALL entry code, so requires mapping fewer
things into the userspace page tables. The downside is
that stacks must be switched at entry time.
d. Global pages are disabled for all kernel structures not
mapped into both kernel and userspace page tables. This
feature of the MMU allows different processes to share TLB
entries mapping the kernel. Losing the feature means more
TLB misses after a context switch. The actual loss of
performance is very small, however, never exceeding 1%.
d. Process Context IDentifiers (PCID) is a CPU feature that
allows us to skip flushing the entire TLB when switching page
tables by setting a special bit in CR3 when the page tables
are changed. This makes switching the page tables (at context
switch, or kernel entry/exit) cheaper. But, on systems with
PCID support, the context switch code must flush both the user
and kernel entries out of the TLB. The user PCID TLB flush is
deferred until the exit to userspace, minimizing the cost.
See intel.com/sdm for the gory PCID/INVPCID details.
e. The userspace page tables must be populated for each new
process. Even without PTI, the shared kernel mappings
are created by copying top-level (PGD) entries into each
new process. But, with PTI, there are now *two* kernel
mappings: one in the kernel page tables that maps everything
and one for the entry/exit structures. At fork(), we need to
copy both.
f. In addition to the fork()-time copying, there must also
be an update to the userspace PGD any time a set_pgd() is done
on a PGD used to map userspace. This ensures that the kernel
and userspace copies always map the same userspace
memory.
g. On systems without PCID support, each CR3 write flushes
the entire TLB. That means that each syscall, interrupt
or exception flushes the TLB.
h. INVPCID is a TLB-flushing instruction which allows flushing
of TLB entries for non-current PCIDs. Some systems support
PCIDs, but do not support INVPCID. On these systems, addresses
can only be flushed from the TLB for the current PCID. When
flushing a kernel address, we need to flush all PCIDs, so a
single kernel address flush will require a TLB-flushing CR3
write upon the next use of every PCID.

Possible Future Work
====================
1. We can be more careful about not actually writing to CR3
unless its value is actually changed.
2. Allow PTI to be enabled/disabled at runtime in addition to the
boot-time switching.

Testing
========

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
(excluding MPX and protection_keys) in a loop on multiple CPUs for
several minutes. These tests frequently uncover corner cases in the
kernel entry code. In general, old kernels might cause these tests
themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
frequent performance monitoring non-maskable interrupts (see "NMI"
in /proc/interrupts). This exercises the NMI entry/exit code which
is known to trigger bugs in code paths that did not expect to be
interrupted, including nested NMIs. Using "-c" boosts the rate of
NMIs, and using two -c with separate counters encourages nested NMIs
and less deterministic behavior.

while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
This has been a lightly-tested code path and needs extra scrutiny.

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

* Failures of the selftests/x86 code. Usually a bug in one of the
more obscure corners of entry_64.S
* Crashes in early boot, especially around CPU bringup. Bugs
in the trampoline code or mappings cause these.
* Crashes at the first interrupt. Caused by bugs in entry_64.S,
like screwing up a page table switch. Also caused by
incorrectly mapping the IRQ handler entry code.
* Crashes at the first NMI. The NMI code is separate from main
interrupt handlers and can have bugs that do not affect
normal interrupts. Also caused by incorrectly mapping NMI
code. NMIs that interrupt the entry code must be very
careful and can be the cause of crashes that show up when
running perf.
* Kernel crashes at the first exit to userspace. entry_64.S
bugs, or failing to map some of the exit code.
* Crashes at first interrupt that interrupts userspace. The paths
in entry_64.S that return to userspace are sometimes separate
from the ones that return to the kernel.
* Double faults: overflowing the kernel stack because of page
faults upon page faults. Caused by touching non-pti-mapped
data in the entry code, or forgetting to switch to kernel
CR3 before calling into C functions which are not pti-mapped.
* Userspace segfaults early in boot, sometimes manifesting
as mount(8) failing to mount the rootfs. These have
tended to be TLB invalidation issues. Usually invalidating
the wrong PCID, or otherwise missing an invalidation.

1. https://gruss.cc/files/kaiser.pdf
2. https://meltdownattack.com/meltdown.pdf
14 changes: 14 additions & 0 deletions arch/x86/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,7 @@ config X86
select GENERIC_CLOCKEVENTS_MIN_ADJUST
select GENERIC_CMOS_UPDATE
select GENERIC_CPU_AUTOPROBE
select GENERIC_CPU_VULNERABILITIES
select GENERIC_EARLY_IOREMAP
select GENERIC_FIND_FIRST_BIT
select GENERIC_IOMAP
Expand Down Expand Up @@ -428,6 +429,19 @@ config GOLDFISH
def_bool y
depends on X86_GOLDFISH

config RETPOLINE
bool "Avoid speculative indirect branches in kernel"
default y
help
Compile kernel with the retpoline compiler options to guard against
kernel-to-user data leaks by avoiding speculative indirect
branches. Requires a compiler with -mindirect-branch=thunk-extern
support for full protection. The kernel may run slower.

Without compiler support, at least indirect branches in assembler
code are eliminated. Since this includes the syscall entry path,
it is not entirely pointless.

config INTEL_RDT
bool "Intel Resource Director Technology support"
default n
Expand Down
10 changes: 10 additions & 0 deletions arch/x86/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -230,6 +230,16 @@ KBUILD_CFLAGS += -Wno-sign-compare
#
KBUILD_CFLAGS += -fno-asynchronous-unwind-tables

# Avoid indirect branches in kernel to deal with Spectre
ifdef CONFIG_RETPOLINE
RETPOLINE_CFLAGS += $(call cc-option,-mindirect-branch=thunk-extern -mindirect-branch-register)
ifneq ($(RETPOLINE_CFLAGS),)
KBUILD_CFLAGS += $(RETPOLINE_CFLAGS) -DRETPOLINE
else
$(warning CONFIG_RETPOLINE=y, but not supported by the compiler. Toolchain update recommended.)
endif
endif

archscripts: scripts_basic
$(Q)$(MAKE) $(build)=arch/x86/tools relocs

Expand Down
5 changes: 3 additions & 2 deletions arch/x86/crypto/aesni-intel_asm.S
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@
#include <linux/linkage.h>
#include <asm/inst.h>
#include <asm/frame.h>
#include <asm/nospec-branch.h>

/*
* The following macros are used to move an (un)aligned 16 byte value to/from
Expand Down Expand Up @@ -2884,7 +2885,7 @@ ENTRY(aesni_xts_crypt8)
pxor INC, STATE4
movdqu IV, 0x30(OUTP)

call *%r11
CALL_NOSPEC %r11

movdqu 0x00(OUTP), INC
pxor INC, STATE1
Expand Down Expand Up @@ -2929,7 +2930,7 @@ ENTRY(aesni_xts_crypt8)
_aesni_gf128mul_x_ble()
movups IV, (IVP)

call *%r11
CALL_NOSPEC %r11

movdqu 0x40(OUTP), INC
pxor INC, STATE1
Expand Down
3 changes: 2 additions & 1 deletion arch/x86/crypto/camellia-aesni-avx-asm_64.S
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@

#include <linux/linkage.h>
#include <asm/frame.h>
#include <asm/nospec-branch.h>

#define CAMELLIA_TABLE_BYTE_LEN 272

Expand Down Expand Up @@ -1227,7 +1228,7 @@ camellia_xts_crypt_16way:
vpxor 14 * 16(%rax), %xmm15, %xmm14;
vpxor 15 * 16(%rax), %xmm15, %xmm15;

call *%r9;
CALL_NOSPEC %r9;

addq $(16 * 16), %rsp;

Expand Down
3 changes: 2 additions & 1 deletion arch/x86/crypto/camellia-aesni-avx2-asm_64.S
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@

#include <linux/linkage.h>
#include <asm/frame.h>
#include <asm/nospec-branch.h>

#define CAMELLIA_TABLE_BYTE_LEN 272

Expand Down Expand Up @@ -1343,7 +1344,7 @@ camellia_xts_crypt_32way:
vpxor 14 * 32(%rax), %ymm15, %ymm14;
vpxor 15 * 32(%rax), %ymm15, %ymm15;

call *%r9;
CALL_NOSPEC %r9;

addq $(16 * 32), %rsp;

Expand Down
Loading

0 comments on commit 40548c6

Please sign in to comment.