Skip to content

Commit

Permalink
mempolicy: rework mempolicy Reference Counting [yet again]
Browse files Browse the repository at this point in the history
After further discussion with Christoph Lameter, it has become clear that my
earlier attempts to clean up the mempolicy reference counting were a bit of
overkill in some areas, resulting in superflous ref/unref in what are usually
fast paths.  In other areas, further inspection reveals that I botched the
unref for interleave policies.

A separate patch, suitable for upstream/stable trees, fixes up the known
errors in the previous attempt to fix reference counting.

This patch reworks the memory policy referencing counting and, one hopes,
simplifies the code.  Maybe I'll get it right this time.

See the update to the numa_memory_policy.txt document for a discussion of
memory policy reference counting that motivates this patch.

Summary:

Lookup of mempolicy, based on (vma, address) need only add a reference for
shared policy, and we need only unref the policy when finished for shared
policies.  So, this patch backs out all of the unneeded extra reference
counting added by my previous attempt.  It then unrefs only shared policies
when we're finished with them, using the mpol_cond_put() [conditional put]
helper function introduced by this patch.

Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
multiple times, so we can't let alloc_page_vma() unref the shared policy in
this case.  To avoid this, we make a copy of any non-null shared policy and
remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
a page [or multiple pages] from swap, so the overhead should not be an issue
here.

I introduced a new static inline function "mpol_cond_copy()" to copy the
shared policy to an on-stack policy and remove the flags that would require a
conditional free.  The current implementation of mpol_cond_copy() assumes that
the struct mempolicy contains no pointers to dynamically allocated structures
that must be duplicated or reference counted during copy.

Signed-off-by: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
  • Loading branch information
Lee Schermerhorn authored and torvalds committed Apr 28, 2008
1 parent a6020ed commit 52cd3b0
Show file tree
Hide file tree
Showing 6 changed files with 195 additions and 77 deletions.
68 changes: 68 additions & 0 deletions Documentation/vm/numa_memory_policy.txt
Original file line number Diff line number Diff line change
Expand Up @@ -311,6 +311,74 @@ Components of Memory Policies
MPOL_PREFERRED policies that were created with an empty nodemask
(local allocation).

MEMORY POLICY REFERENCE COUNTING

To resolve use/free races, struct mempolicy contains an atomic reference
count field. Internal interfaces, mpol_get()/mpol_put() increment and
decrement this reference count, respectively. mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, it's reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes. "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
API discussed below] or by another task using the /proc/<pid>/numa_maps
interface.

2) examination of the policy to determine the policy mode and associated node
or node lists, if any, for page allocation. This is considered a "hot
path". Note that for MPOL_BIND, the "usage" extends across the entire
allocation process, which may sleep during page reclaimation, because the
BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
target task's task policy nor vma policies because we always acquire the
task's mm's mmap_sem for read during the query. The set_mempolicy() and
mbind() APIs [see below] always acquire the mmap_sem for write when
installing or replacing task or vma policies. Thus, there is no possibility
of a task or thread freeing a policy while another task or thread is
querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
we hold them mmap_sem for read. Again, because replacing the task or vma
policy requires that the mmap_sem be held for write, the policy can't be
freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration. One task can replace a
shared memory policy while another task, with a distinct mmap_sem, is
querying or allocating a page based on the policy. To resolve this
potential race, the shared policy infrastructure adds an extra reference
to the shared policy during lookup while holding a spin lock on the shared
policy management structure. This requires that we drop this extra
reference when we're finished "using" the policy. We must drop the
extra reference on shared policies in the same query/allocation paths
used for non-shared policies. For this reason, shared policies are marked
as such, and the extra reference is dropped "conditionally"--i.e., only
for shared policies.

Because of this extra reference counting, and because we must lookup
shared policies in a tree structure under spinlock, shared policies are
more expensive to use in the page allocation path. This is expecially
true for shared policies on shared memory regions shared by tasks running
on different NUMA nodes. This extra overhead can be avoided by always
falling back to task or system default policy for shared memory regions,
or by prefaulting the entire shared memory region into memory and locking
it down. However, this might not be appropriate for all applications.

MEMORY POLICY APIs

Linux supports 3 system calls for controlling memory policy. These APIS
Expand Down
35 changes: 35 additions & 0 deletions include/linux/mempolicy.h
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,31 @@ static inline void mpol_put(struct mempolicy *pol)
__mpol_put(pol);
}

/*
* Does mempolicy pol need explicit unref after use?
* Currently only needed for shared policies.
*/
static inline int mpol_needs_cond_ref(struct mempolicy *pol)
{
return (pol && (pol->flags & MPOL_F_SHARED));
}

static inline void mpol_cond_put(struct mempolicy *pol)
{
if (mpol_needs_cond_ref(pol))
__mpol_put(pol);
}

extern struct mempolicy *__mpol_cond_copy(struct mempolicy *tompol,
struct mempolicy *frompol);
static inline struct mempolicy *mpol_cond_copy(struct mempolicy *tompol,
struct mempolicy *frompol)
{
if (!frompol)
return frompol;
return __mpol_cond_copy(tompol, frompol);
}

extern struct mempolicy *__mpol_dup(struct mempolicy *pol);
static inline struct mempolicy *mpol_dup(struct mempolicy *pol)
{
Expand Down Expand Up @@ -201,6 +226,16 @@ static inline void mpol_put(struct mempolicy *p)
{
}

static inline void mpol_cond_put(struct mempolicy *pol)
{
}

static inline struct mempolicy *mpol_cond_copy(struct mempolicy *to,
struct mempolicy *from)
{
return from;
}

static inline void mpol_get(struct mempolicy *pol)
{
}
Expand Down
5 changes: 2 additions & 3 deletions ipc/shm.c
Original file line number Diff line number Diff line change
Expand Up @@ -271,10 +271,9 @@ static struct mempolicy *shm_get_policy(struct vm_area_struct *vma,

if (sfd->vm_ops->get_policy)
pol = sfd->vm_ops->get_policy(vma, addr);
else if (vma->vm_policy) {
else if (vma->vm_policy)
pol = vma->vm_policy;
mpol_get(pol); /* get_vma_policy() expects this */
}

return pol;
}
#endif
Expand Down
2 changes: 1 addition & 1 deletion mm/hugetlb.c
Original file line number Diff line number Diff line change
Expand Up @@ -116,7 +116,7 @@ static struct page *dequeue_huge_page_vma(struct vm_area_struct *vma,
break;
}
}
mpol_put(mpol); /* unref if mpol !NULL */
mpol_cond_put(mpol);
return page;
}

Expand Down
Loading

0 comments on commit 52cd3b0

Please sign in to comment.