Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "More mm/ work, plenty more to come

  Subsystems affected by this patch series: slub, memcg, gup, kasan,
  pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
  thp, mmap, kconfig"

* akpm: (131 commits)
  arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  riscv: support DEBUG_WX
  mm: add DEBUG_WX support
  drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
  mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
  powerpc/mm: drop platform defined pmd_mknotpresent()
  mm: thp: don't need to drain lru cache when splitting and mlocking THP
  hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
  sparc32: register memory occupied by kernel as memblock.memory
  include/linux/memblock.h: fix minor typo and unclear comment
  mm, mempolicy: fix up gup usage in lookup_node
  tools/vm/page_owner_sort.c: filter out unneeded line
  mm: swap: memcg: fix memcg stats for huge pages
  mm: swap: fix vmstats for huge pages
  mm: vmscan: limit the range of LRU type balancing
  mm: vmscan: reclaim writepage is IO cost
  mm: vmscan: determine anon/file pressure balance at the reclaim root
  mm: balance LRU lists based on relative thrashing
  mm: only count actual rotations as LRU reclaim cost
  ...
torvalds committed Jun 4, 2020
2 parents c444eb5 + 09587a0 commit ee01c4d
Showing 147 changed files with 3,443 additions and 2,670 deletions.
19 changes: 7 additions & 12 deletions Documentation/admin-guide/cgroup-v1/memory.rst
@@ -199,11 +199,11 @@ An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is not accounted until it's mapped.
A swapped-in page is accounted after adding into swapcache.

Note: The kernel does swapin-readahead and reads multiple swaps at once.
This means swapped-in pages may contain pages for other tasks than a task
causing page fault. So, we avoid accounting at swap-in I/O.
Since page's memcg recorded into swap whatever memsw enabled, the page will
be accounted after swapin.

At page migration, accounting information is kept.

@@ -222,18 +222,13 @@ the cgroup that brought it in -- this will happen on memory pressure).
But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

Exception: If CONFIG_MEMCG_SWAP is not used.
When you do swapoff and make swapped-out pages of shmem(tmpfs) to
be backed into memory in force, charges for pages are accounted against the
caller of swapoff rather than the users of shmem.

2.4 Swap Extension (CONFIG_MEMCG_SWAP)
2.4 Swap Extension
--------------------------------------

Swap Extension allows you to record charge for swap. A swapped-in page is
charged back to original page allocator if possible.
Swap usage is always recorded for each of cgroup. Swap Extension allows you to
read and limit it.

When swap is accounted, following files are added.
When CONFIG_SWAP is enabled, following files are added.

- memory.memsw.usage_in_bytes.
- memory.memsw.limit_in_bytes.
40 changes: 27 additions & 13 deletions Documentation/admin-guide/kernel-parameters.txt
@@ -834,12 +834,15 @@
See also Documentation/networking/decnet.rst.

default_hugepagesz=
[same as hugepagesz=] The size of the default
HugeTLB page size. This is the size represented by
the legacy /proc/ hugepages APIs, used for SHM, and
default size when mounting hugetlbfs filesystems.
Defaults to the default architecture's huge page size
if not specified.
[HW] The size of the default HugeTLB page. This is
the size represented by the legacy /proc/ hugepages
APIs. In addition, this is the default hugetlb size
used for shmget(), mmap() and mounting hugetlbfs
filesystems. If not specified, defaults to the
architecture's default huge page size. Huge page
sizes are architecture dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]

deferred_probe_timeout=
[KNL] Debugging option to set a timeout in seconds for
@@ -1484,13 +1487,24 @@
hugepages using the cma allocator. If enabled, the
boot-time allocation of gigantic hugepages is skipped.

hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
On x86-64 and powerpc, this option can be specified
multiple times interleaved with hugepages= to reserve
huge pages of different sizes. Valid pages sizes on
x86-64 are 2M (when the CPU supports "pse") and 1G
(when the CPU supports the "pdpe1gb" cpuinfo flag).
hugepages= [HW] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
If this is the first HugeTLB parameter on the command
line, it specifies the number of pages to allocate for
the default huge page size. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: <integer>

hugepagesz=
[HW] The size of the HugeTLB pages. This is used in
conjunction with hugepages (above) to allocate huge
pages of a specific size at boot. The pair
hugepagesz=X hugepages=Y can be specified once for
each supported huge page size. Huge page sizes are
architecture dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]

hung_task_panic=
[KNL] Should the hung task detector generate panics.
35 changes: 35 additions & 0 deletions Documentation/admin-guide/mm/hugetlbpage.rst
@@ -100,6 +100,41 @@ with a huge page size selection parameter "hugepagesz=<size>". <size> must
be specified in bytes with optional scale suffix [kKmMgG]. The default huge
page size may be selected with the "default_hugepagesz=<size>" boot parameter.

Hugetlb boot command line parameter semantics
hugepagesz - Specify a huge page size. Used in conjunction with hugepages
parameter to preallocate a number of huge pages of the specified
size. Hence, hugepagesz and hugepages are typically specified in
pairs such as:
hugepagesz=2M hugepages=512
hugepagesz can only be specified once on the command line for a
specific huge page size. Valid huge page sizes are architecture
dependent.
hugepages - Specify the number of huge pages to preallocate. This typically
follows a valid hugepagesz or default_hugepagesz parameter. However,
if hugepages is the first or only hugetlb command line parameter it
implicitly specifies the number of huge pages of default size to
allocate. If the number of huge pages of default size is implicitly
specified, it can not be overwritten by a hugepagesz,hugepages
parameter pair for the default size.
For example, on an architecture with 2M default huge page size:
hugepages=256 hugepagesz=2M hugepages=512
will result in 256 2M huge pages being allocated and a warning message
indicating that the hugepages=512 parameter is ignored. If a hugepages
parameter is preceded by an invalid hugepagesz parameter, it will
be ignored.
default_hugepagesz - Specify the default huge page size. This parameter can
only be specified once on the command line. default_hugepagesz can
optionally be followed by the hugepages parameter to preallocate a
specific number of huge pages of default size. The number of default
sized huge pages to preallocate can also be implicitly specified as
mentioned in the hugepages section above. Therefore, on an
architecture with 2M default huge page size:
hugepages=256
default_hugepagesz=2M hugepages=256
hugepages=256 default_hugepagesz=2M
will all result in 256 2M huge pages being allocated. Valid default
huge page size is architecture dependent.

When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
Thus, one can use the following command to dynamically allocate/deallocate
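
The hugetlb boot parameters above only reserve pages; userspace still has to
ask for them via shmget(SHM_HUGETLB), a hugetlbfs mount, or mmap(MAP_HUGETLB).
A minimal userspace sketch of the mmap() path (illustrative only, not part of
this commit; it assumes a boot with hugepagesz=2M hugepages=512 on an
architecture that supports 2M huge pages)::

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_2MB
    #define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)     /* log2(2M) == 21 */
    #endif

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;             /* one 2M huge page */

        /* Anonymous mapping backed by the boot-time hugetlb pool. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                       -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");            /* pool empty or size unsupported */
            return 1;
        }

        memset(p, 0, len);                          /* fault the huge page in */
        printf("mapped a 2M huge page at %p\n", p);
        munmap(p, len);
        return 0;
    }

If the reserved pool cannot satisfy the request, mmap() fails instead of
falling back to base pages (unless surplus pages are permitted through
/proc/sys/vm/nr_overcommit_hugepages), which is why the boot-time hugepages=
count matters.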
7 changes: 7 additions & 0 deletions Documentation/admin-guide/mm/transhuge.rst
@@ -220,6 +220,13 @@ memory. A lower value can prevent THPs from being
collapsed, resulting fewer pages being collapsed into
THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. Exceeding the number would block the collapse::

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

Boot parameter
==============

23 changes: 18 additions & 5 deletions Documentation/admin-guide/sysctl/vm.rst
@@ -831,14 +831,27 @@ tooling to work, you can do::
swappiness
==========

This control is used to define how aggressive the kernel will swap
memory pages. Higher values will increase aggressiveness, lower values
decrease the amount of swap. A value of 0 instructs the kernel not to
initiate swap until the amount of free and file-backed pages is less
than the high water mark in a zone.
This control is used to define the rough relative IO cost of swapping
and filesystem paging, as a value between 0 and 200. At 100, the VM
assumes equal IO cost and will thus apply memory pressure to the page
cache and swap-backed pages equally; lower values signify more
expensive swap IO, higher values indicates cheaper.

Keep in mind that filesystem IO patterns under memory pressure tend to
be more efficient than swap's random IO. An optimal value will require
experimentation and will also be workload-dependent.

The default value is 60.

For in-memory swap, like zram or zswap, as well as hybrid setups that
have swap on faster devices than the filesystem, values beyond 100 can
be considered. For example, if the random IO against the swap device
is on average 2x faster than IO from the filesystem, swappiness should
be 133 (x + 2x = 200, 2x = 133.33).

At 0, the kernel will not initiate swap until the amount of free and
file-backed pages is less than the high watermark in a zone.
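
The 133 example in the swappiness text above is just one way of splitting the
fixed 200-point IO-cost budget between swap and filesystem paging. A small
illustrative calculation (the helper name is made up; this is not kernel
code)::

    #include <stdio.h>

    /*
     * Illustration of the swappiness arithmetic described above, not kernel
     * code: swappiness is the share of a fixed 200-point IO-cost budget given
     * to swap-backed pages, so equal costs land on 100.
     */
    static int swappiness_for(double file_io_cost, double swap_io_cost)
    {
        return (int)(200.0 * file_io_cost / (file_io_cost + swap_io_cost) + 0.5);
    }

    int main(void)
    {
        /* Equal IO cost: the neutral value of 100. */
        printf("equal cost     -> %d\n", swappiness_for(1.0, 1.0));

        /* Swap ~2x faster than filesystem IO: x + 2x = 200, so ~133. */
        printf("swap 2x faster -> %d\n", swappiness_for(2.0, 1.0));
        return 0;
    }

Built and run, this prints 100 for equal costs and 133 for the 2x-faster-swap
case from the text.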


unprivileged_userfaultfd
========================
41 changes: 31 additions & 10 deletions Documentation/core-api/padata.rst
@@ -4,23 +4,26 @@
The padata parallel execution mechanism
=======================================

:Date: December 2019
:Date: May 2020

Padata is a mechanism by which the kernel can farm jobs out to be done in
parallel on multiple CPUs while retaining their ordering. It was developed for
use with the IPsec code, which needs to be able to perform encryption and
decryption on large numbers of packets without reordering those packets. The
crypto developers made a point of writing padata in a sufficiently general
fashion that it could be put to other uses as well.
parallel on multiple CPUs while optionally retaining their ordering.

Usage
=====
It was originally developed for IPsec, which needs to perform encryption and
decryption on large numbers of packets without reordering those packets. This
is currently the sole consumer of padata's serialized job support.

Padata also supports multithreaded jobs, splitting up the job evenly while load
balancing and coordinating between threads.

Running Serialized Jobs
=======================

Initializing
------------

The first step in using padata is to set up a padata_instance structure for
overall control of how jobs are to be run::
The first step in using padata to run serialized jobs is to set up a
padata_instance structure for overall control of how jobs are to be run::

#include <linux/padata.h>

@@ -162,6 +165,24 @@ functions that correspond to the allocation in reverse::
It is the user's responsibility to ensure all outstanding jobs are complete
before any of the above are called.

Running Multithreaded Jobs
==========================

A multithreaded job has a main thread and zero or more helper threads, with the
main thread participating in the job and then waiting until all helpers have
finished. padata splits the job into units called chunks, where a chunk is a
piece of the job that one thread completes in one call to the thread function.

A user has to do three things to run a multithreaded job. First, describe the
job by defining a padata_mt_job structure, which is explained in the Interface
section. This includes a pointer to the thread function, which padata will
call each time it assigns a job chunk to a thread. Then, define the thread
function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
the first two delimit the range that the thread operates on and the last is a
pointer to the job's shared state, if any. Prepare the shared state, which is
typically allocated on the main thread's stack. Last, call
padata_do_multithreaded(), which will return once the job is finished.
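
To make the walkthrough above concrete, here is a hedged sketch of that
pattern. The padata_mt_job fields used (thread_fn, fn_arg, start, size, align,
min_chunk, max_threads) and all helper names are illustrative; check
include/linux/padata.h in this tree for the authoritative interface::

    /* Illustrative sketch of a multithreaded padata job, not commit code. */
    #include <linux/padata.h>

    struct init_state {
        struct zone *zone;                  /* shared, read-mostly job state */
    };

    /* Thread function: each call handles one chunk [start, end) of PFNs. */
    static void __init init_chunk(unsigned long start, unsigned long end,
                                  void *arg)
    {
        struct init_state *state = arg;
        unsigned long pfn;

        for (pfn = start; pfn < end; pfn++) {
            /* ... initialize the struct page for pfn in state->zone ... */
        }
    }

    static void __init init_zone_pages(struct zone *zone,
                                       unsigned long start_pfn,
                                       unsigned long nr_pages)
    {
        struct init_state state = { .zone = zone };  /* main thread's stack */
        struct padata_mt_job job = {
            .thread_fn   = init_chunk,
            .fn_arg      = &state,
            .start       = start_pfn,
            .size        = nr_pages,
            .align       = 1,
            .min_chunk   = 1024,            /* don't thread tiny jobs */
            .max_threads = 4,
        };

        /* Returns only after the main thread and all helpers are done. */
        padata_do_multithreaded(&job);
    }

padata splits the [start, start + size) range into chunks of at least
min_chunk units and runs the thread function from up to max_threads threads,
the main thread included.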

Interface
=========

34 changes: 0 additions & 34 deletions Documentation/features/vm/numa-memblock/arch-support.txt

This file was deleted.

9 changes: 4 additions & 5 deletions Documentation/vm/memory-model.rst
@@ -46,11 +46,10 @@ maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture specific setup code
should call :c:func:`free_area_init_node` function or its convenience
wrapper :c:func:`free_area_init`. Yet, the mappings array is not
usable until the call to :c:func:`memblock_free_all` that hands all
the memory to the page allocator.
To allocate the `mem_map` array, architecture specific setup code should
call :c:func:`free_area_init` function. Yet, the mappings array is not
usable until the call to :c:func:`memblock_free_all` that hands all the
memory to the page allocator.

If an architecture enables `CONFIG_ARCH_HAS_HOLES_MEMORYMODEL` option,
it may free parts of the `mem_map` array that do not cover the
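
The arch-side change implied by this text (and carried out for alpha just
below) reduces to filling a max_zone_pfn[] array and handing it to
free_area_init(); a condensed sketch rather than literal commit code::

    /* Sketch of the new arch pattern; compare with the alpha diff below. */
    void __init paging_init(void)
    {
        unsigned long max_zone_pfn[MAX_NR_ZONES] = { 0 };
        unsigned long dma_pfn;

        dma_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;

        /* Highest PFN reachable by each zone. */
        max_zone_pfn[ZONE_DMA]    = dma_pfn;
        max_zone_pfn[ZONE_NORMAL] = max_low_pfn;

        /* The core sizes every zone on every node from memblock data. */
        free_area_init(max_zone_pfn);
    }

The per-node free_area_init_node() calls and hand-built zones_size[] arrays go
away; zone extents now come from memblock registration plus these
max_zone_pfn[] limits.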
3 changes: 1 addition & 2 deletions Documentation/vm/page_owner.rst
@@ -83,8 +83,7 @@ Usage
4) Analyze information from page owner::

cat /sys/kernel/debug/page_owner > page_owner_full.txt
grep -v ^PFN page_owner_full.txt > page_owner.txt
./page_owner_sort page_owner.txt sorted_page_owner.txt
./page_owner_sort page_owner_full.txt sorted_page_owner.txt

See the result about who allocated each page
in the ``sorted_page_owner.txt``.
16 changes: 6 additions & 10 deletions arch/alpha/mm/init.c
@@ -243,21 +243,17 @@ callback_init(void * kernel_end)
*/
void __init paging_init(void)
{
unsigned long zones_size[MAX_NR_ZONES] = {0, };
unsigned long dma_pfn, high_pfn;
unsigned long max_zone_pfn[MAX_NR_ZONES] = {0, };
unsigned long dma_pfn;

dma_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
high_pfn = max_pfn = max_low_pfn;
max_pfn = max_low_pfn;

if (dma_pfn >= high_pfn)
zones_size[ZONE_DMA] = high_pfn;
else {
zones_size[ZONE_DMA] = dma_pfn;
zones_size[ZONE_NORMAL] = high_pfn - dma_pfn;
}
max_zone_pfn[ZONE_DMA] = dma_pfn;
max_zone_pfn[ZONE_NORMAL] = max_pfn;

/* Initialize mem_map[]. */
free_area_init(zones_size);
free_area_init(max_zone_pfn);

/* Initialize the kernel's ZERO_PGE. */
memset((void *)ZERO_PGE, 0, PAGE_SIZE);
22 changes: 6 additions & 16 deletions arch/alpha/mm/numa.c
@@ -144,8 +144,8 @@ setup_memory_node(int nid, void *kernel_end)
if (!nid && (node_max_pfn < end_kernel_pfn || node_min_pfn > start_kernel_pfn))
panic("kernel loaded out of ram");

memblock_add(PFN_PHYS(node_min_pfn),
(node_max_pfn - node_min_pfn) << PAGE_SHIFT);
memblock_add_node(PFN_PHYS(node_min_pfn),
(node_max_pfn - node_min_pfn) << PAGE_SHIFT, nid);

/* Zone start phys-addr must be 2^(MAX_ORDER-1) aligned.
Note that we round this down, not up - node memory
@@ -202,8 +202,7 @@ setup_memory(void *kernel_end)

void __init paging_init(void)
{
unsigned int nid;
unsigned long zones_size[MAX_NR_ZONES] = {0, };
unsigned long max_zone_pfn[MAX_NR_ZONES] = {0, };
unsigned long dma_local_pfn;

/*
@@ -215,19 +214,10 @@ void __init paging_init(void)
*/
dma_local_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;

for_each_online_node(nid) {
unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_present_pages;
max_zone_pfn[ZONE_DMA] = dma_local_pfn;
max_zone_pfn[ZONE_NORMAL] = max_pfn;

if (dma_local_pfn >= end_pfn - start_pfn)
zones_size[ZONE_DMA] = end_pfn - start_pfn;
else {
zones_size[ZONE_DMA] = dma_local_pfn;
zones_size[ZONE_NORMAL] = (end_pfn - start_pfn) - dma_local_pfn;
}
node_set_state(nid, N_NORMAL_MEMORY);
free_area_init_node(nid, zones_size, start_pfn, NULL);
}
free_area_init(max_zone_pfn);

/* Initialize the kernel's ZERO_PGE. */
memset((void *)ZERO_PGE, 0, PAGE_SIZE);
2 changes: 1 addition & 1 deletion arch/arc/include/asm/hugepage.h
@@ -26,7 +26,7 @@ static inline pmd_t pte_pmd(pte_t pte)
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mkhuge(pmd) pte_pmd(pte_mkhuge(pmd_pte(pmd)))
#define pmd_mknotpresent(pmd) pte_pmd(pte_mknotpresent(pmd_pte(pmd)))
#define pmd_mkinvalid(pmd) pte_pmd(pte_mknotpresent(pmd_pte(pmd)))
#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))

#define pmd_write(pmd) pte_write(pmd_pte(pmd))
