hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
Patch series "hugetlb: Use new vma lock for huge pmd sharing
synchronization", v2.

hugetlb fault scalability regressions have recently been reported [1]. 
This is not the first such report, as regressions were also noted when
commit c0d0381 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

The regression and the benefit of this patch series are not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running:
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"

			48 sample Avg
next-20220913		next-20220913			next-20220913
unmodified	revert i_mmap_sema locking	vma sema locking, this series
-----------------------------------------------------------------------------
498150 KB/s		501934 KB/s			504793 KB/s

The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings.  To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
   Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
   Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory.  The shared
mapping size was 250GB.  A single instance of each program was run first
for comparison; then multiple instances were run in parallel to introduce
lock contention.  Changing the locking scheme results in a significant
performance benefit (a sketch of the fault-loop test follows the results
below).

test		instances	unmodified	revert		vma
--------------------------------------------------------------------------
faults per sec	1		393043		395680		389932
faults per sec  24		 71405		 81191		 79048
forks per sec   1		  2802		  2747		  2725
forks per sec   24		   439		   536		   500
Combined faults 24		  1621		 68070		 53662
Combined forks  24		   358		    67		   142

The combined figures are from running the faulting program and the forking
program simultaneously.
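
The test programs themselves are not part of this series.  A minimal
sketch of test (1), assuming the shared hugetlb area is obtained from a
hugetlb-backed memfd and using a scaled-down 1GB mapping with 2MB pages
(the runs above used a 250GB mapping; enough huge pages must be
preallocated via vm.nr_hugepages), could look like:

	/*
	 * Hypothetical sketch of test (1): map a shared hugetlb area, write
	 * fault all pages, unmap, in a continuous loop.
	 */
	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>
	#include <time.h>
	#include <unistd.h>

	#define MAP_SIZE	(1UL << 30)	/* 1GB here; the runs above used 250GB */
	#define HPAGE_SIZE	(2UL << 20)	/* assumes 2MB default huge pages */

	int main(void)
	{
		int fd = memfd_create("hugetlb-fault-test", MFD_HUGETLB);

		if (fd < 0 || ftruncate(fd, MAP_SIZE)) {
			perror("memfd_create/ftruncate");
			return 1;
		}

		for (;;) {
			struct timespec t0, t1;
			unsigned long off;
			char *area;

			clock_gettime(CLOCK_MONOTONIC, &t0);
			area = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				    MAP_SHARED, fd, 0);
			if (area == MAP_FAILED) {
				perror("mmap");
				return 1;
			}
			for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
				area[off] = 1;	/* write fault each huge page */
			munmap(area, MAP_SIZE);
			clock_gettime(CLOCK_MONOTONIC, &t1);

			printf("%.0f faults/sec\n",
			       (MAP_SIZE / HPAGE_SIZE) /
			       ((t1.tv_sec - t0.tv_sec) +
				(t1.tv_nsec - t0.tv_nsec) / 1e9));
		}
	}

Test (2) is the same idea, except that it write faults only a few pages,
then forks a child that immediately exits, counting forks per second.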

Patches 1 and 2 of this series revert c0d0381 and 87bf91d, the latter of
which depends on c0d0381.  Acquisition of i_mmap_rwsem is still required
in the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share.  With c0d0381 reverted, the following race is exposed:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

Reverting 87bf91d exposes races between page faults and file truncation.
Once the new vma lock is put to use in patch 8, it also handles these
fault/truncation races.  This is explained in patch 9, where code
associated with these races is cleaned up.

Patches 3 - 5 restructure existing code in preparation for using the new
vma lock (rw semaphore) for pmd sharing synchronization.  The idea is that
this semaphore will be held in read mode for the duration of fault
processing, and held in write mode for unmap operations which may call
huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required to
synchronize huge pmd sharing.  However it is only required in the fault
path when setting up sharing, and will be acquired in huge_pmd_share().
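
As a rough illustration of that split (a simplified placeholder sketch
only; the real lock and its helpers are introduced in patch 6 of this
series, and the structure below is not the code added there):

	#include <linux/rwsem.h>

	/* Simplified stand-in for the per-vma lock described above. */
	struct hugetlb_vma_lock {
		struct rw_semaphore rw_sema;
	};

	static void fault_path(struct hugetlb_vma_lock *vma_lock)
	{
		down_read(&vma_lock->rw_sema);	/* held for the whole fault */
		/* ... fault processing; huge_pmd_share() takes i_mmap_rwsem itself ... */
		up_read(&vma_lock->rw_sema);
	}

	static void unmap_path(struct hugetlb_vma_lock *vma_lock)
	{
		down_write(&vma_lock->rw_sema);	/* excludes faults on this vma */
		/* ... unmap; huge_pmd_unshare() may now be called safely ... */
		up_write(&vma_lock->rw_sema);
	}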

Patch 6 adds the new vma lock and all supporting routines, but does not
actually change code to use the new lock.

Patch 7 refactors code in preparation for using the new lock, and patch 8
finally adds code to make use of this new vma lock.  Unfortunately, the
fault code and the truncate/hole punch code would naturally take locks in
opposite orders, which could lead to deadlock.  Since page fault
performance is more important, the truncate/hole punch code is modified to
back out and take the locks in the correct order when necessary (see the
sketch after this paragraph).
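
A generic sketch of that back-out idea (placeholder sketch only, with the
lock ordering as described above; the actual truncate/hole punch changes
arrive in patch 8 and differ in detail):

	#include <linux/fs.h>
	#include <linux/rwsem.h>

	/*
	 * The truncate/hole punch side holds i_mmap_rwsem and wants the vma
	 * lock in write mode, while a faulting thread may hold the vma lock
	 * in read mode and block on i_mmap_rwsem inside huge_pmd_share().
	 * Trylock first; on failure, back out, sleep on the vma lock,
	 * reacquire, and let the caller revalidate the folio.
	 */
	static void truncate_side_lock(struct address_space *mapping,
				       struct rw_semaphore *vma_lock)
	{
		if (!down_write_trylock(vma_lock)) {
			i_mmap_unlock_write(mapping);	/* back out to avoid ABBA */
			down_write(vma_lock);		/* may sleep */
			i_mmap_lock_write(mapping);
			/* mappings may have changed while unlocked: revalidate */
		}
	}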

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/[email protected]/


This patch (of 9):

Commit c0d0381 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  The use of i_mmap_rwsem to prevent
fault/truncate races depends on this.  However, this has been shown to
cause performance/scaling issues.  As a result, that code will be
reverted.  Since the use i_mmap_rwsem to address page fault/truncate races
depends on this, it must also be reverted.

In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Kravetz <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: James Houghton <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Prakash Sangappa <[email protected]>
Cc: Sven Schnelle <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
mjkravetz authored and akpm00 committed Oct 3, 2022
1 parent 3259914 commit 188a397
Showing 2 changed files with 20 additions and 32 deletions.
30 changes: 9 additions & 21 deletions fs/hugetlbfs/inode.c
@@ -419,9 +419,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
  *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() holds i_mmap_rwsem and prevents
- *	page faults in the truncated range by checking i_size.  i_size is
- *	modified while holding i_mmap_rwsem.
+ *	in this routine.  hugetlb_no_page() prevents page faults in the
+ *	truncated range.  It checks i_size before allocation, and again after
+ *	with the page table lock for the page held.  The same lock must be
+ *	acquired to unmap a page.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserve map
@@ -451,16 +452,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 		u32 hash = 0;
 
 		index = folio->index;
-		if (!truncate_op) {
-			/*
-			 * Only need to hold the fault mutex in the
-			 * hole punch case.  This prevents races with
-			 * page faults.  Races are not possible in the
-			 * case of truncation.
-			 */
-			hash = hugetlb_fault_mutex_hash(mapping, index);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
-		}
+		hash = hugetlb_fault_mutex_hash(mapping, index);
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 		/*
 		 * If folio is mapped, it was faulted in after being
@@ -504,8 +497,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 		}
 
 		folio_unlock(folio);
-		if (!truncate_op)
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 	}
 	folio_batch_release(&fbatch);
 	cond_resched();
@@ -543,8 +535,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
-	i_mmap_lock_write(mapping);
 	i_size_write(inode, offset);
+	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
@@ -703,11 +695,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
 
-		/*
-		 * fault mutex taken here, protects against fault path
-		 * and hole punch.  inode_lock previously taken protects
-		 * against truncation.
-		 */
+		/* mutex taken here, fault path and hole punch */
 		hash = hugetlb_fault_mutex_hash(mapping, index);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
22 changes: 11 additions & 11 deletions mm/hugetlb.c
@@ -5560,17 +5560,15 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	/*
-	 * We can not race with truncation due to holding i_mmap_rwsem.
-	 * i_size is modified when holding i_mmap_rwsem, so check here
-	 * once for faults beyond end of file.
+	 * Use page lock to guard against racing truncation
+	 * before we get page_table_lock.
 	 */
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto out;
-
 	new_page = false;
 	page = find_lock_page(mapping, idx);
 	if (!page) {
+		size = i_size_read(mapping->host) >> huge_page_shift(h);
+		if (idx >= size)
+			goto out;
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
 			ret = hugetlb_handle_userfault(vma, mapping, idx,
@@ -5666,6 +5664,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
+	size = i_size_read(mapping->host) >> huge_page_shift(h);
+	if (idx >= size)
+		goto backout;
+
 	ret = 0;
 	/* If pte changed from under us, retry */
 	if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5774,10 +5776,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/*
 	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This serves two purposes:
-	 * 1) It prevents huge_pmd_unshare from being called elsewhere
-	 *    and making the ptep no longer valid.
-	 * 2) It synchronizes us with i_size modifications during truncation.
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
 	 *
 	 * ptep could have already be assigned via huge_pte_offset.  That
 	 * is OK, as huge_pte_alloc will return the same value unless
