Skip to content

Commit

Permalink
memcg: add memory.pressure_level events
Browse files Browse the repository at this point in the history
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications.  The levels are defined like this:

The "low" level means that the system is reclaiming memory for new
allocations.  Monitoring this reclaiming activity might be useful for
maintaining cache level.  Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc.  Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger.  Applications should do whatever they can to help the
system.  It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.

The events are propagated upward until the event is handled, i.e.  the
events are not pass-through.  Here is what this means: for example you
have three cgroups: A->B->C.  Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure.  In
this situation, only group C will receive the notification, i.e.  groups
A and B will not receive it.  This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing.  So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)

Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features.  Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off.  The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
option, so it will not require any changes on the userland side.

[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454

[[email protected]: coding-style fixes]
[[email protected]: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Leonid Moiseichuk <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Bartlomiej Zolnierkiewicz <[email protected]>
Cc: John Stultz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
  • Loading branch information
Anton Vorontsov authored and torvalds committed Apr 29, 2013
1 parent 84d96d8 commit 70ddf63
Show file tree
Hide file tree
Showing 6 changed files with 528 additions and 2 deletions.
70 changes: 69 additions & 1 deletion Documentation/cgroups/memory.txt
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ Features:
- soft limit
- moving (recharging) account at moving a task is selectable.
- usage threshold notifier
- memory pressure notifier
- oom-killer disable knob and oom-notifier
- Root cgroup has no limit controls.

Expand All @@ -65,6 +66,7 @@ Brief summary of control files.
memory.stat # show various statistics
memory.use_hierarchy # set/show hierarchical account enabled
memory.force_empty # trigger forced move charge to parent
memory.pressure_level # set memory pressure notifications
memory.swappiness # set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
Expand Down Expand Up @@ -762,7 +764,73 @@ At reading, current status of OOM is shown.
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)

11. TODO
11. Memory Pressure

The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement
different strategies of managing their memory resources. The pressure
levels are defined as following:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file caches,
etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.

The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: for example you have
three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
and C, and suppose group C experiences some pressure. In this situation,
only group C will receive the notification, i.e. groups A and B will not
receive it. This is done to avoid excessive "broadcasting" of messages,
which disturbs the system and which is especially bad if we are low on
memory or thrashing. So, organize the cgroups wisely, or propagate the
events manually (or, ask us to implement the pass-through events,
explaining why would you need them.)

The file memory.pressure_level is only used to setup an eventfd. To
register a notification, an application must:

- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write string like "<event_fd> <fd of memory.pressure_level> <level>"
to cgroup.event_control.

Application will be notified through eventfd when memory pressure is at
the specific level (or higher). Read/write operations to
memory.pressure_level are no implemented.

Test:

Here is a small script example that makes a new cgroup, sets up a
memory limit, sets up a notification in the cgroup and then makes child
cgroup experience a critical pressure:

# cd /sys/fs/cgroup/memory/
# mkdir foo
# cd foo
# cgroup_event_listener memory.pressure_level low &
# echo 8000000 > memory.limit_in_bytes
# echo 8000000 > memory.memsw.limit_in_bytes
# echo $$ > tasks
# dd if=/dev/zero | read x

(Expect a bunch of notifications, and eventually, the oom-killer will
trigger.)

12. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
Expand Down
47 changes: 47 additions & 0 deletions include/linux/vmpressure.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
#ifndef __LINUX_VMPRESSURE_H
#define __LINUX_VMPRESSURE_H

#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/workqueue.h>
#include <linux/gfp.h>
#include <linux/types.h>
#include <linux/cgroup.h>

struct vmpressure {
unsigned long scanned;
unsigned long reclaimed;
/* The lock is used to keep the scanned/reclaimed above in sync. */
struct mutex sr_lock;

/* The list of vmpressure_event structs. */
struct list_head events;
/* Have to grab the lock on events traversal or modifications. */
struct mutex events_lock;

struct work_struct work;
};

struct mem_cgroup;

#ifdef CONFIG_MEMCG
extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
unsigned long scanned, unsigned long reclaimed);
extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);

extern void vmpressure_init(struct vmpressure *vmpr);
extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg);
extern struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr);
extern struct vmpressure *css_to_vmpressure(struct cgroup_subsys_state *css);
extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
struct eventfd_ctx *eventfd,
const char *args);
extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
struct eventfd_ctx *eventfd);
#else
static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
unsigned long scanned, unsigned long reclaimed) {}
static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
int prio) {}
#endif /* CONFIG_MEMCG */
#endif /* __LINUX_VMPRESSURE_H */
2 changes: 1 addition & 1 deletion mm/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
Expand Down
29 changes: 29 additions & 0 deletions mm/memcontrol.c
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
#include <linux/vmpressure.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
Expand Down Expand Up @@ -261,6 +262,9 @@ struct mem_cgroup {
*/
struct res_counter res;

/* vmpressure notifications */
struct vmpressure vmpressure;

union {
/*
* the counter to account for mem+swap usage.
Expand Down Expand Up @@ -359,6 +363,7 @@ struct mem_cgroup {
atomic_t numainfo_events;
atomic_t numainfo_updating;
#endif

/*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
Expand Down Expand Up @@ -510,6 +515,24 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
return container_of(s, struct mem_cgroup, css);
}

/* Some nice accessors for the vmpressure. */
struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
{
if (!memcg)
memcg = root_mem_cgroup;
return &memcg->vmpressure;
}

struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
{
return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
}

struct vmpressure *css_to_vmpressure(struct cgroup_subsys_state *css)
{
return &mem_cgroup_from_css(css)->vmpressure;
}

static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
{
return (memcg == root_mem_cgroup);
Expand Down Expand Up @@ -5907,6 +5930,11 @@ static struct cftype mem_cgroup_files[] = {
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
{
.name = "pressure_level",
.register_event = vmpressure_register_event,
.unregister_event = vmpressure_unregister_event,
},
#ifdef CONFIG_NUMA
{
.name = "numa_stat",
Expand Down Expand Up @@ -6188,6 +6216,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
memcg->move_charge_at_immigrate = 0;
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
vmpressure_init(&memcg->vmpressure);

return &memcg->css;

Expand Down
Loading

0 comments on commit 70ddf63

Please sign in to comment.