Skip to content

Commit

Permalink
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/sc…
Browse files Browse the repository at this point in the history
…m/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Kemeng Shi has contributed some compation maintenance work in the
     series 'Fixes and cleanups to compaction'

   - Joel Fernandes has a patchset ('Optimize mremap during mutual
     alignment within PMD') which fixes an obscure issue with mremap()'s
     pagetable handling during a subsequent exec(), based upon an
     implementation which Linus suggested

   - More DAMON/DAMOS maintenance and feature work from SeongJae Park i
     the following patch series:

	mm/damon: misc fixups for documents, comments and its tracepoint
	mm/damon: add a tracepoint for damos apply target regions
	mm/damon: provide pseudo-moving sum based access rate
	mm/damon: implement DAMOS apply intervals
	mm/damon/core-test: Fix memory leaks in core-test
	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

   - In the series 'Do not try to access unaccepted memory' Adrian
     Hunter provides some fixups for the recently-added 'unaccepted
     memory' feature. To increase the feature's checking coverage. 'Plug
     a few gaps where RAM is exposed without checking if it is
     unaccepted memory'

   - In the series 'cleanups for lockless slab shrink' Qi Zheng has done
     some maintenance work which is preparation for the lockless slab
     shrinking code

   - Qi Zheng has redone the earlier (and reverted) attempt to make slab
     shrinking lockless in the series 'use refcount+RCU method to
     implement lockless slab shrink'

   - David Hildenbrand contributes some maintenance work for the rmap
     code in the series 'Anon rmap cleanups'

   - Kefeng Wang does more folio conversions and some maintenance work
     in the migration code. Series 'mm: migrate: more folio conversion
     and unification'

   - Matthew Wilcox has fixed an issue in the buffer_head code which was
     causing long stalls under some heavy memory/IO loads. Some cleanups
     were added on the way. Series 'Add and use bdev_getblk()'

   - In the series 'Use nth_page() in place of direct struct page
     manipulation' Zi Yan has fixed a potential issue with the direct
     manipulation of hugetlb page frames

   - In the series 'mm: hugetlb: Skip initialization of gigantic tail
     struct pages if freed by HVO' has improved our handling of gigantic
     pages in the hugetlb vmmemmep optimizaton code. This provides
     significant boot time improvements when significant amounts of
     gigantic pages are in use

   - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
     rationalization and folio conversions in the hugetlb code

   - Yin Fengwei has improved mlock()'s handling of large folios in the
     series 'support large folio for mlock'

   - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
     added statistics for memcg v1 users which are available (and
     useful) under memcg v2

   - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
     prctl so that userspace may direct the kernel to not automatically
     propagate the denial to child processes. The series is named 'MDWE
     without inheritance'

   - Kefeng Wang has provided the series 'mm: convert numa balancing
     functions to use a folio' which does what it says

   - In the series 'mm/ksm: add fork-exec support for prctl' Stefan
     Roesch makes is possible for a process to propagate KSM treatment
     across exec()

   - Huang Ying has enhanced memory tiering's calculation of memory
     distances. This is used to permit the dax/kmem driver to use 'high
     bandwidth memory' in addition to Optane Data Center Persistent
     Memory Modules (DCPMM). The series is named 'memory tiering:
     calculate abstract distance based on ACPI HMAT'

   - In the series 'Smart scanning mode for KSM' Stefan Roesch has
     optimized KSM by teaching it to retain and use some historical
     information from previous scans

   - Yosry Ahmed has fixed some inconsistencies in memcg statistics in
     the series 'mm: memcg: fix tracking of pending stats updates
     values'

   - In the series 'Implement IOCTL to get and optionally clear info
     about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
     which permits us to atomically read-then-clear page softdirty
     state. This is mainly used by CRIU

   - Hugh Dickins contributed the series 'shmem,tmpfs: general
     maintenance', a bunch of relatively minor maintenance tweaks to
     this code

   - Matthew Wilcox has increased the use of the VMA lock over
     file-backed page faults in the series 'Handle more faults under the
     VMA lock'. Some rationalizations of the fault path became possible
     as a result

   - In the series 'mm/rmap: convert page_move_anon_rmap() to
     folio_move_anon_rmap()' David Hildenbrand has implemented some
     cleanups and folio conversions

   - In the series 'various improvements to the GUP interface' Lorenzo
     Stoakes has simplified and improved the GUP interface with an eye
     to providing groundwork for future improvements

   - Andrey Konovalov has sent along the series 'kasan: assorted fixes
     and improvements' which does those things

   - Some page allocator maintenance work from Kemeng Shi in the series
     'Two minor cleanups to break_down_buddy_pages'

   - In thes series 'New selftest for mm' Breno Leitao has developed
     another MM self test which tickles a race we had between madvise()
     and page faults

   - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
     and an optimization to the core pagecache code

   - Nhat Pham has added memcg accounting for hugetlb memory in the
     series 'hugetlb memcg accounting'

   - Cleanups and rationalizations to the pagemap code from Lorenzo
     Stoakes, in the series 'Abstract vma_merge() and split_vma()'

   - Audra Mitchell has fixed issues in the procfs page_owner code's new
     timestamping feature which was causing some misbehaviours. In the
     series 'Fix page_owner's use of free timestamps'

   - Lorenzo Stoakes has fixed the handling of new mappings of sealed
     files in the series 'permit write-sealed memfd read-only shared
     mappings'

   - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
     series 'Batch hugetlb vmemmap modification operations'

   - Some buffer_head folio conversions and cleanups from Matthew Wilcox
     in the series 'Finish the create_empty_buffers() transition'

   - As a page allocator performance optimization Huang Ying has added
     automatic tuning to the allocator's per-cpu-pages feature, in the
     series 'mm: PCP high auto-tuning'

   - Roman Gushchin has contributed the patchset 'mm: improve
     performance of accounted kernel memory allocations' which improves
     their performance by ~30% as measured by a micro-benchmark

   - folio conversions from Kefeng Wang in the series 'mm: convert page
     cpupid functions to folios'

   - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
     kmemleak'

   - Qi Zheng has improved our handling of memoryless nodes by keeping
     them off the allocation fallback list. This is done in the series
     'handle memoryless nodes more appropriately'

   - khugepaged conversions from Vishal Moola in the series 'Some
     khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been
  resolved as per Stephen Rothwell in

     https://lore.kernel.org/all/[email protected]/

  with help from Qi Zheng.

  The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...
  • Loading branch information
torvalds committed Nov 3, 2023
2 parents bc3012f + 9732336 commit ecae0bd
Show file tree
Hide file tree
Showing 281 changed files with 11,683 additions and 5,277 deletions.
7 changes: 7 additions & 0 deletions Documentation/ABI/testing/sysfs-kernel-mm-damon
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,13 @@ Contact: SeongJae Park <[email protected]>
Description: Writing to and reading from this file sets and gets the action
of the scheme.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/apply_interval_us
Date: Sep 2023
Contact: SeongJae Park <[email protected]>
Description: Writing a value to this file sets the action apply interval of
the scheme in microseconds. Reading this file returns the
value.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/sz/min
Date: Mar 2022
Contact: SeongJae Park <[email protected]>
Expand Down
1 change: 1 addition & 0 deletions Documentation/admin-guide/cgroup-v1/memory.rst
Original file line number Diff line number Diff line change
Expand Up @@ -551,6 +551,7 @@ memory.stat file includes following statistics:
event happens each time a page is unaccounted from the
cgroup.
swap # of bytes of swap usage
swapcached # of bytes of swap cached in memory
dirty # of bytes that are waiting to get written back to the disk.
writeback # of bytes of file/anon cache that are queued for syncing to
disk.
Expand Down
38 changes: 38 additions & 0 deletions Documentation/admin-guide/cgroup-v2.rst
Original file line number Diff line number Diff line change
Expand Up @@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options.
relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).

memory_hugetlb_accounting
Count HugeTLB memory usage towards the cgroup's overall
memory usage for the memory controller (for the purpose of
statistics reporting and memory protetion). This is a new
behavior that could regress existing setups, so it must be
explicitly opted in with this mount option.

A few caveats to keep in mind:

* There is no HugeTLB pool management involved in the memory
controller. The pre-allocated pool does not belong to anyone.
Specifically, when a new HugeTLB folio is allocated to
the pool, it is not accounted for from the perspective of the
memory controller. It is only charged to a cgroup when it is
actually used (for e.g at page fault time). Host memory
overcommit management has to consider this when configuring
hard limits. In general, HugeTLB pool management should be
done via other mechanisms (such as the HugeTLB controller).
* Failure to charge a HugeTLB folio to the memory controller
results in SIGBUS. This could happen even if the HugeTLB pool
still has pages available (but the cgroup limit is hit and
reclaim attempt fails).
* Charging HugeTLB memory towards the memory controller affects
memory protection and reclaim dynamics. Any userspace tuning
(of low, min limits for e.g) needs to take this into account.
* HugeTLB pages utilized while this option is not selected
will not be tracked by the memory controller (even if cgroup
v2 is remounted later on).


Organizing Processes and Threads
--------------------------------
Expand Down Expand Up @@ -1539,6 +1568,15 @@ PAGE_SIZE multiple when read back.
collapsing an existing range of pages. This counter is not
present when CONFIG_TRANSPARENT_HUGEPAGE is not set.

thp_swpout (npn)
Number of transparent hugepages which are swapout in one piece
without splitting.

thp_swpout_fallback (npn)
Number of transparent hugepages which were split before swapout.
Usually because failed to allocate some continuous swap space
for the huge page.

memory.numa_stat
A read-only nested-keyed file which exists on non-root cgroups.

Expand Down
124 changes: 81 additions & 43 deletions Documentation/admin-guide/mm/damon/usage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,18 +20,18 @@ DAMON provides below interfaces for different users.
you can write and use your personalized DAMON sysfs wrapper programs that
reads/writes the sysfs files instead of you. The `DAMON user space tool
<https://github.com/awslabs/damo>`_ is one example of such programs.
- *debugfs interface. (DEPRECATED!)*
:ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
<sysfs_interface>`. This is deprecated, so users should move to the
:ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
move, please report your usecase to [email protected] and
[email protected].
- *Kernel Space Programming Interface.*
:doc:`This </mm/damon/api>` is for kernel space programmers. Using this,
users can utilize every feature of DAMON most flexibly and efficiently by
writing kernel space DAMON application programs for you. You can even extend
DAMON for various address spaces. For detail, please refer to the interface
:doc:`document </mm/damon/api>`.
- *debugfs interface. (DEPRECATED!)*
:ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
<sysfs_interface>`. This is deprecated, so users should move to the
:ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
move, please report your usecase to [email protected] and
[email protected].

.. _sysfs_interface:

Expand Down Expand Up @@ -76,7 +76,7 @@ comma (","). ::
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
│ │ │ │ │ schemes/nr_schemes
│ │ │ │ │ │ 0/action
│ │ │ │ │ │ 0/action,apply_interval_us
│ │ │ │ │ │ │ access_pattern/
│ │ │ │ │ │ │ │ sz/min,max
│ │ │ │ │ │ │ │ nr_accesses/min,max
Expand Down Expand Up @@ -105,14 +105,12 @@ having the root permission could use this directory.
kdamonds/
---------

The monitoring-related information including request specifications and results
are called DAMON context. DAMON executes each context with a kernel thread
called kdamond, and multiple kdamonds could run in parallel.

Under the ``admin`` directory, one directory, ``kdamonds``, which has files for
controlling the kdamonds exist. In the beginning, this directory has only one
file, ``nr_kdamonds``. Writing a number (``N``) to the file creates the number
of child directories named ``0`` to ``N-1``. Each directory represents each
controlling the kdamonds (refer to
:ref:`design <damon_design_execution_model_and_data_structures>` for more
details) exists. In the beginning, this directory has only one file,
``nr_kdamonds``. Writing a number (``N``) to the file creates the number of
child directories named ``0`` to ``N-1``. Each directory represents each
kdamond.

kdamonds/<N>/
Expand Down Expand Up @@ -150,9 +148,10 @@ kdamonds/<N>/contexts/

In the beginning, this directory has only one file, ``nr_contexts``. Writing a
number (``N``) to the file creates the number of child directories named as
``0`` to ``N-1``. Each directory represents each monitoring context. At the
moment, only one context per kdamond is supported, so only ``0`` or ``1`` can
be written to the file.
``0`` to ``N-1``. Each directory represents each monitoring context (refer to
:ref:`design <damon_design_execution_model_and_data_structures>` for more
details). At the moment, only one context per kdamond is supported, so only
``0`` or ``1`` can be written to the file.

.. _sysfs_contexts:

Expand Down Expand Up @@ -270,8 +269,8 @@ schemes/<N>/
------------

In each scheme directory, five directories (``access_pattern``, ``quotas``,
``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and one file
(``action``) exist.
``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files
(``action`` and ``apply_interval``) exist.

The ``action`` file is for setting and getting the scheme's :ref:`action
<damon_design_damos_action>`. The keywords that can be written to and read
Expand All @@ -297,6 +296,9 @@ Note that support of each action depends on the running DAMON operations set
- ``stat``: Do nothing but count the statistics.
Supported by all operations sets.

The ``apply_interval_us`` file is for setting and getting the scheme's
:ref:`apply_interval <damon_design_damos>` in microseconds.

schemes/<N>/access_pattern/
---------------------------

Expand Down Expand Up @@ -392,7 +394,7 @@ pages of all memory cgroups except ``/having_care_already``.::
echo N > 1/matching

Note that ``anon`` and ``memcg`` filters are currently supported only when
``paddr`` `implementation <sysfs_contexts>` is being used.
``paddr`` :ref:`implementation <sysfs_contexts>` is being used.

Also, memory regions that are filtered out by ``addr`` or ``target`` filters
are not counted as the scheme has tried to those, while regions that filtered
Expand Down Expand Up @@ -430,9 +432,9 @@ that reading it returns the total size of the scheme tried regions, and creates
directories named integer starting from ``0`` under this directory. Each
directory contains files exposing detailed information about each of the memory
region that the corresponding scheme's ``action`` has tried to be applied under
this directory, during next :ref:`aggregation interval
<sysfs_monitoring_attrs>`. The information includes address range,
``nr_accesses``, and ``age`` of the region.
this directory, during next :ref:`apply interval <damon_design_damos>` of the
corresponding scheme. The information includes address range, ``nr_accesses``,
and ``age`` of the region.

Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
file will only update the ``total_bytes`` file, and will not create the
Expand Down Expand Up @@ -495,6 +497,62 @@ Please note that it's highly recommended to use user space tools like `damo
<https://github.com/awslabs/damo>`_ rather than manually reading and writing
the files as above. Above is only for an example.

.. _tracepoint:

Tracepoints for Monitoring Results
==================================

Users can get the monitoring results via the :ref:`tried_regions
<sysfs_schemes_tried_regions>`. The interface is useful for getting a
snapshot, but it could be inefficient for fully recording all the monitoring
results. For the purpose, two trace points, namely ``damon:damon_aggregated``
and ``damon:damos_before_apply``, are provided. ``damon:damon_aggregated``
provides the whole monitoring results, while ``damon:damos_before_apply``
provides the monitoring results for regions that each DAMON-based Operation
Scheme (:ref:`DAMOS <damon_design_damos>`) is gonna be applied. Hence,
``damon:damos_before_apply`` is more useful for recording internal behavior of
DAMOS, or DAMOS target access
:ref:`pattern <damon_design_damos_access_pattern>` based query-like efficient
monitoring results recording.

While the monitoring is turned on, you could record the tracepoint events and
show results using tracepoint supporting tools like ``perf``. For example::

# echo on > monitor_on
# perf record -e damon:damon_aggregated &
# sleep 5
# kill 9 $(pidof perf)
# echo off > monitor_on
# perf script
kdamond.0 46568 [027] 79357.842179: damon:damon_aggregated: target_id=0 nr_regions=11 122509119488-135708762112: 0 864
[...]

Each line of the perf script output represents each monitoring region. The
first five fields are as usual other tracepoint outputs. The sixth field
(``target_id=X``) shows the ide of the monitoring target of the region. The
seventh field (``nr_regions=X``) shows the total number of monitoring regions
for the target. The eighth field (``X-Y:``) shows the start (``X``) and end
(``Y``) addresses of the region in bytes. The ninth field (``X``) shows the
``nr_accesses`` of the region (refer to
:ref:`design <damon_design_region_based_sampling>` for more details of the
counter). Finally the tenth field (``X``) shows the ``age`` of the region
(refer to :ref:`design <damon_design_age_tracking>` for more details of the
counter).

If the event was ``damon:damos_beofre_apply``, the ``perf script`` output would
be somewhat like below::

kdamond.0 47293 [000] 80801.060214: damon:damos_before_apply: ctx_idx=0 scheme_idx=0 target_idx=0 nr_regions=11 121932607488-135128711168: 0 136
[...]

Each line of the output represents each monitoring region that each DAMON-based
Operation Scheme was about to be applied at the traced time. The first five
fields are as usual. It shows the index of the DAMON context (``ctx_idx=X``)
of the scheme in the list of the contexts of the context's kdamond, the index
of the scheme (``scheme_idx=X``) in the list of the schemes of the context, in
addition to the output of ``damon_aggregated`` tracepoint.


.. _debugfs_interface:

debugfs Interface (DEPRECATED!)
Expand Down Expand Up @@ -790,23 +848,3 @@ directory by putting the name of the context to the ``rm_contexts`` file. ::

Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the
root directory only.


.. _tracepoint:

Tracepoint for Monitoring Results
=================================

Users can get the monitoring results via the :ref:`tried_regions
<sysfs_schemes_tried_regions>` or a tracepoint, ``damon:damon_aggregated``.
While the tried regions directory is useful for getting a snapshot, the
tracepoint is useful for getting a full record of the results. While the
monitoring is turned on, you could record the tracepoint events and show
results using tracepoint supporting tools like ``perf``. For example::

# echo on > monitor_on
# perf record -e damon:damon_aggregated &
# sleep 5
# kill 9 $(pidof perf)
# echo off > monitor_on
# perf script
11 changes: 11 additions & 0 deletions Documentation/admin-guide/mm/ksm.rst
Original file line number Diff line number Diff line change
Expand Up @@ -155,6 +155,15 @@ stable_node_chains_prune_millisecs
scan. It's a noop if not a single KSM page hit the
``max_page_sharing`` yet.

smart_scan
Historically KSM checked every candidate page for each scan. It did
not take into account historic information. When smart scan is
enabled, pages that have previously not been de-duplicated get
skipped. How often these pages are skipped depends on how often
de-duplication has already been tried and failed. By default this
optimization is enabled. The ``pages_skipped`` metric shows how
effective the setting is.

The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:

general_profit
Expand All @@ -169,6 +178,8 @@ pages_unshared
how many pages unique but repeatedly checked for merging
pages_volatile
how many pages changing too fast to be placed in a tree
pages_skipped
how many pages did the "smart" page scanning algorithm skip
full_scans
how many times all mergeable areas have been scanned
stable_node_chains
Expand Down
89 changes: 89 additions & 0 deletions Documentation/admin-guide/mm/pagemap.rst
Original file line number Diff line number Diff line change
Expand Up @@ -227,3 +227,92 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
always 12 at most architectures). Since Linux 3.11 their meaning changes
after first clear of soft-dirty bits. Since Linux 4.2 they are used for
flags unconditionally.

Pagemap Scan IOCTL
==================

The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get or optionally
clear the info about page table entries. The following operations are supported
in this IOCTL:

- Scan the address range and get the memory ranges matching the provided criteria.
This is performed when the output buffer is specified.
- Write-protect the pages. The ``PM_SCAN_WP_MATCHING`` is used to write-protect
the pages of interest. The ``PM_SCAN_CHECK_WPASYNC`` aborts the operation if
non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING`` can be
used with or without ``PM_SCAN_CHECK_WPASYNC``.
- Both of those operations can be combined into one atomic operation where we can
get and write protect the pages as well.

Following flags about pages are currently supported:

- ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled
- ``PAGE_IS_WRITTEN`` - Page has been written to from the time it was write protected
- ``PAGE_IS_FILE`` - Page is file backed
- ``PAGE_IS_PRESENT`` - Page is present in the memory
- ``PAGE_IS_SWAPPED`` - Page is in swapped
- ``PAGE_IS_PFNZERO`` - Page has zero PFN
- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed

The ``struct pm_scan_arg`` is used as the argument of the IOCTL.

1. The size of the ``struct pm_scan_arg`` must be specified in the ``size``
field. This field will be helpful in recognizing the structure if extensions
are done later.
2. The flags can be specified in the ``flags`` field. The ``PM_SCAN_WP_MATCHING``
and ``PM_SCAN_CHECK_WPASYNC`` are the only added flags at this time. The get
operation is optionally performed depending upon if the output buffer is
provided or not.
3. The range is specified through ``start`` and ``end``.
4. The walk can abort before visiting the complete range such as the user buffer
can get full etc. The walk ending address is specified in``end_walk``.
5. The output buffer of ``struct page_region`` array and size is specified in
``vec`` and ``vec_len``.
6. The optional maximum requested pages are specified in the ``max_pages``.
7. The masks are specified in ``category_mask``, ``category_anyof_mask``,
``category_inverted`` and ``return_mask``.

Find pages which have been written and WP them as well::

struct pm_scan_arg arg = {
.size = sizeof(arg),
.flags = PM_SCAN_CHECK_WPASYNC | PM_SCAN_CHECK_WPASYNC,
..
.category_mask = PAGE_IS_WRITTEN,
.return_mask = PAGE_IS_WRITTEN,
};

Find pages which have been written, are file backed, not swapped and either
present or huge::

struct pm_scan_arg arg = {
.size = sizeof(arg),
.flags = 0,
..
.category_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED,
.category_inverted = PAGE_IS_SWAPPED,
.category_anyof_mask = PAGE_IS_PRESENT | PAGE_IS_HUGE,
.return_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED |
PAGE_IS_PRESENT | PAGE_IS_HUGE,
};

The ``PAGE_IS_WRITTEN`` flag can be considered as a better-performing alternative
of soft-dirty flag. It doesn't get affected by VMA merging of the kernel and hence
the user can find the true soft-dirty pages in case of normal pages. (There may
still be extra dirty pages reported for THP or Hugetlb pages.)

"PAGE_IS_WRITTEN" category is used with uffd write protect-enabled ranges to
implement memory dirty tracking in userspace:

1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
2. The ``UFFD_FEATURE_WP_UNPOPULATED`` and ``UFFD_FEATURE_WP_ASYNC`` features
are set by ``UFFDIO_API`` IOCTL.
3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
through ``UFFDIO_REGISTER`` IOCTL.
4. Then any part of the registered memory or the whole memory region must
be write protected using ``PAGEMAP_SCAN`` IOCTL with flag ``PM_SCAN_WP_MATCHING``
or the ``UFFDIO_WRITEPROTECT`` IOCTL can be used. Both of these perform the
same operation. The former is better in terms of performance.
5. Now the ``PAGEMAP_SCAN`` IOCTL can be used to either just find pages which
have been written to since they were last marked and/or optionally write protect
the pages as well.
Loading

0 comments on commit ecae0bd

Please sign in to comment.