Skip to content

Commit

Permalink
rseq: Introduce restartable sequences system call
Browse files Browse the repository at this point in the history
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

Restartables sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.

The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.

Here are benchmarks of various rseq use-cases.

Test hardware:

arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 [email protected], 16-core, hyperthreading

The following benchmarks were all performed on a single thread.

* Per-CPU statistic counter increment

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                344.0                 31.4          11.0
x86-64:                15.3                  2.0           7.7

* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
             per-cpu buffer

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:               2502.0                 2250.0         1.1
x86-64:               117.4                   98.0         1.2

* liburcu percpu: lock-unlock pair, dereference, read/compare word

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                751.0                 128.5          5.8
x86-64:                53.4                  28.6          1.9

* jemalloc memory allocator adapted to use rseq

Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):

The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.

* Reading the current CPU number

Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up do date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.

Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared
  between kernel and user-space is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop):                                    8.4 ns
- Read CPU from rseq cpu_id:                               16.7 ns
- Read CPU from rseq cpu_id (lazy register):               19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
- getcpu system call:                                     234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                                    0.8 ns
- Read CPU from rseq cpu_id:                                0.8 ns
- Read CPU from rseq cpu_id (lazy register):                0.8 ns
- Read using gs segment selector:                           0.8 ns
- "lsl" inline assembly:                                   13.0 ns
- glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
- getcpu system call:                                      53.9 ns

- Speed (benchmark taken on v8 of patchset)

Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:

Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.:      41.37 s
std.dev.:   0.36 s

* CONFIG_RSEQ=y

avg.:      40.46 s
std.dev.:   0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: "H . Peter Anvin" <[email protected]>
Cc: Chris Lameter <[email protected]>
Cc: Russell King <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Maurer <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
  • Loading branch information
compudj authored and KAGA-KOKO committed Jun 6, 2018
1 parent b575e83 commit d7822b1
Show file tree
Hide file tree
Showing 13 changed files with 734 additions and 1 deletion.
11 changes: 11 additions & 0 deletions MAINTAINERS
Original file line number Diff line number Diff line change
Expand Up @@ -11976,6 +11976,17 @@ F: include/dt-bindings/reset/
F: include/linux/reset.h
F: include/linux/reset-controller.h

RESTARTABLE SEQUENCES SUPPORT
M: Mathieu Desnoyers <[email protected]>
M: Peter Zijlstra <[email protected]>
M: "Paul E. McKenney" <[email protected]>
M: Boqun Feng <[email protected]>
L: [email protected]
S: Supported
F: kernel/rseq.c
F: include/uapi/linux/rseq.h
F: include/trace/events/rseq.h

RFKILL
M: Johannes Berg <[email protected]>
L: [email protected]
Expand Down
7 changes: 7 additions & 0 deletions arch/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -272,6 +272,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API
declared in asm/ptrace.h
For example the kprobes-based event tracer needs this API.

config HAVE_RSEQ
bool
depends on HAVE_REGS_AND_STACK_ACCESS_API
help
This symbol should be selected by an architecture if it
supports an implementation of restartable sequences.

config HAVE_CLK
bool
help
Expand Down
1 change: 1 addition & 0 deletions fs/exec.c
Original file line number Diff line number Diff line change
Expand Up @@ -1822,6 +1822,7 @@ static int do_execveat_common(int fd, struct filename *filename,
current->fs->in_exec = 0;
current->in_execve = 0;
membarrier_execve(current);
rseq_execve(current);
acct_update_integrals(current);
task_numa_free(current);
free_bprm(bprm);
Expand Down
134 changes: 134 additions & 0 deletions include/linux/sched.h
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@
#include <linux/signal_types.h>
#include <linux/mm_types_task.h>
#include <linux/task_io_accounting.h>
#include <linux/rseq.h>

/* task_struct member predeclarations (sorted alphabetically): */
struct audit_context;
Expand Down Expand Up @@ -1047,6 +1048,17 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */

#ifdef CONFIG_RSEQ
struct rseq __user *rseq;
u32 rseq_len;
u32 rseq_sig;
/*
* RmW on rseq_event_mask must be performed atomically
* with respect to preemption.
*/
unsigned long rseq_event_mask;
#endif

struct tlbflush_unmap_batch tlb_ubc;

struct rcu_head rcu;
Expand Down Expand Up @@ -1757,4 +1769,126 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif

#ifdef CONFIG_RSEQ

/*
* Map the event mask on the user-space ABI enum rseq_cs_flags
* for direct mask checks.
*/
enum rseq_event_mask_bits {
RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
};

enum rseq_event_mask {
RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT),
RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
};

static inline void rseq_set_notify_resume(struct task_struct *t)
{
if (t->rseq)
set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
}

void __rseq_handle_notify_resume(struct pt_regs *regs);

static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
if (current->rseq)
__rseq_handle_notify_resume(regs);
}

static inline void rseq_signal_deliver(struct pt_regs *regs)
{
preempt_disable();
__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
preempt_enable();
rseq_handle_notify_resume(regs);
}

/* rseq_preempt() requires preemption to be disabled. */
static inline void rseq_preempt(struct task_struct *t)
{
__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
rseq_set_notify_resume(t);
}

/* rseq_migrate() requires preemption to be disabled. */
static inline void rseq_migrate(struct task_struct *t)
{
__set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
rseq_set_notify_resume(t);
}

/*
* If parent process has a registered restartable sequences area, the
* child inherits. Only applies when forking a process, not a thread. In
* case a parent fork() in the middle of a restartable sequence, set the
* resume notifier to force the child to retry.
*/
static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
{
if (clone_flags & CLONE_THREAD) {
t->rseq = NULL;
t->rseq_len = 0;
t->rseq_sig = 0;
t->rseq_event_mask = 0;
} else {
t->rseq = current->rseq;
t->rseq_len = current->rseq_len;
t->rseq_sig = current->rseq_sig;
t->rseq_event_mask = current->rseq_event_mask;
rseq_preempt(t);
}
}

static inline void rseq_execve(struct task_struct *t)
{
t->rseq = NULL;
t->rseq_len = 0;
t->rseq_sig = 0;
t->rseq_event_mask = 0;
}

#else

static inline void rseq_set_notify_resume(struct task_struct *t)
{
}
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
}
static inline void rseq_signal_deliver(struct pt_regs *regs)
{
}
static inline void rseq_preempt(struct task_struct *t)
{
}
static inline void rseq_migrate(struct task_struct *t)
{
}
static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
{
}
static inline void rseq_execve(struct task_struct *t)
{
}

#endif

#ifdef CONFIG_DEBUG_RSEQ

void rseq_syscall(struct pt_regs *regs);

#else

static inline void rseq_syscall(struct pt_regs *regs)
{
}

#endif

#endif
4 changes: 3 additions & 1 deletion include/linux/syscalls.h
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,7 @@ struct old_linux_dirent;
struct perf_event_attr;
struct file_handle;
struct sigaltstack;
struct rseq;
union bpf_attr;

#include <linux/types.h>
Expand Down Expand Up @@ -897,7 +898,8 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
asmlinkage long sys_pkey_free(int pkey);
asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);

asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);

/*
* Architecture-specific system calls
Expand Down
57 changes: 57 additions & 0 deletions include/trace/events/rseq.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
/* SPDX-License-Identifier: GPL-2.0+ */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM rseq

#if !defined(_TRACE_RSEQ_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_RSEQ_H

#include <linux/tracepoint.h>
#include <linux/types.h>

TRACE_EVENT(rseq_update,

TP_PROTO(struct task_struct *t),

TP_ARGS(t),

TP_STRUCT__entry(
__field(s32, cpu_id)
),

TP_fast_assign(
__entry->cpu_id = raw_smp_processor_id();
),

TP_printk("cpu_id=%d", __entry->cpu_id)
);

TRACE_EVENT(rseq_ip_fixup,

TP_PROTO(unsigned long regs_ip, unsigned long start_ip,
unsigned long post_commit_offset, unsigned long abort_ip),

TP_ARGS(regs_ip, start_ip, post_commit_offset, abort_ip),

TP_STRUCT__entry(
__field(unsigned long, regs_ip)
__field(unsigned long, start_ip)
__field(unsigned long, post_commit_offset)
__field(unsigned long, abort_ip)
),

TP_fast_assign(
__entry->regs_ip = regs_ip;
__entry->start_ip = start_ip;
__entry->post_commit_offset = post_commit_offset;
__entry->abort_ip = abort_ip;
),

TP_printk("regs_ip=0x%lx start_ip=0x%lx post_commit_offset=%lu abort_ip=0x%lx",
__entry->regs_ip, __entry->start_ip,
__entry->post_commit_offset, __entry->abort_ip)
);

#endif /* _TRACE_SOCK_H */

/* This part must be outside protection */
#include <trace/define_trace.h>
Loading

0 comments on commit d7822b1

Please sign in to comment.