There is a lot of code needs to set the range of vma in mmap.c, introduce
vma_set_range() to simplify the code.
Link: https://lkml.kernel.org/r/20240124035719.3685193-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We avoid allocating THP for temporary stack, even though
khugepaged_enter_vma() is called for stack VMAs, it actualy returns
false. So no need to call it in the first place at all.
Link: https://lkml.kernel.org/r/20231221065943.2803551-1-shy828301@gmail.com
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The file parameter in the __remove_shared_vm_struct is no longer used,
remove it.
These functions vma_link() and mmap_region() have some of the same code,
introduce vma_link_file() helper function to simplify the code.
Link: https://lkml.kernel.org/r/20240110084622.2425927-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The addition of commit efa7df3e3b ("mm: align larger anonymous mappings
on THP boundaries") caused the "virtual_address_range" mm selftest to
start failing on arm64. Let's fix that regression.
There were 2 visible problems when running the test; 1) it takes much
longer to execute, and 2) the test fails. Both are related:
The (first part of the) test allocates as many 1GB anonymous blocks as it
can in the low 256TB of address space, passing NULL as the addr hint to
mmap. Before the faulty patch, all allocations were abutted and contained
in a single, merged VMA. However, after this patch, each allocation is in
its own VMA, and there is a 2M gap between each VMA. This causes the 2
problems in the test: 1) mmap becomes MUCH slower because there are so
many VMAs to check to find a new 1G gap. 2) mmap fails once it hits the
VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then causes a
subsequent calloc() to fail, which causes the test to fail.
The problem is that arm64 (unlike x86) selects
ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area()
allocates len+2M then always aligns to the bottom of the discovered gap.
That causes the 2M hole.
Fix this by detecting cases where we can still achive the alignment goal
when moved to the top of the allocated area, if configured to prefer
top-down allocation.
While we are at it, fix thp_get_unmapped_area's use of pgoff, which should
always be zero for anonymous mappings. Prior to the faulty change, while
it was possible for user space to pass in pgoff!=0, the old
mm->get_unmapped_area() handler would not use it. thp_get_unmapped_area()
does use it, so let's explicitly zero it before calling the handler. This
should also be the correct behavior for arches that define their own
get_unmapped_area() handler.
Link: https://lkml.kernel.org/r/20240123171420.3970220-1-ryan.roberts@arm.com
Fixes: efa7df3e3b ("mm: align larger anonymous mappings on THP boundaries")
Closes: https://lore.kernel.org/linux-mm/1e8f5ac7-54ce-433a-ae53-81522b2320e1@arm.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Allow mmap_rnd_bits_max to be updated on architectures that
determine virtual address space size at runtime instead of relying
on Kconfig options by changing it from const to __ro_after_init.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/20230929211155.3910949-5-samitolvanen@google.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
are included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the
series
"maple_tree: add mt_free_one() and mt_attr() helpers"
"Some cleanups of maple tree"
- In the series "mm: use memmap_on_memory semantics for dax/kmem"
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few
fixes) in the patch series
"Add folio_zero_tail() and folio_fill_tail()"
"Make folio_start_writeback return void"
"Fix fault handler's handling of poisoned tail pages"
"Convert aops->error_remove_page to ->error_remove_folio"
"Finish two folio conversions"
"More swap folio conversions"
- Kefeng Wang has also contributed folio-related work in the series
"mm: cleanup and use more folio in page fault"
- Jim Cromie has improved the kmemleak reporting output in the
series "tweak kmemleak report format".
- In the series "stackdepot: allow evicting stack traces" Andrey
Konovalov to permits clients (in this case KASAN) to cause
eviction of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series "mm:
page_alloc: fixes for high atomic reserve caluculations".
- Dmitry Rokosov has added to the samples/ dorectory some sample
code for a userspace memcg event listener application. See the
series "samples: introduce cgroup events listeners".
- Some mapletree maintanance work from Liam Howlett in the series
"maple_tree: iterator state changes".
- Nhat Pham has improved zswap's approach to writeback in the
series "workload-specific and memory pressure-driven zswap
writeback".
- DAMON/DAMOS feature and maintenance work from SeongJae Park in
the series
"mm/damon: let users feed and tame/auto-tune DAMOS"
"selftests/damon: add Python-written DAMON functionality tests"
"mm/damon: misc updates for 6.8"
- Yosry Ahmed has improved memcg's stats flushing in the series
"mm: memcg: subtree stats flushing and thresholds".
- In the series "Multi-size THP for anonymous memory" Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series "More buffer_head
cleanups".
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
"userfaultfd move option". UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a "KSM Advisor", in the series
"mm/ksm: Add ksm advisor". This is a governor which tunes KSM's
scanning aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory
use in the series "mm/zswap: dstmem reuse optimizations and
cleanups".
- Matthew Wilcox has performed some maintenance work on the
writeback code, both code and within filesystems. The series is
"Clean up the writeback paths".
- Andrey Konovalov has optimized KASAN's handling of alloc and
free stack traces for secondary-level allocators, in the series
"kasan: save mempool stack traces".
- Andrey also performed some KASAN maintenance work in the series
"kasan: assorted clean-ups".
- David Hildenbrand has gone to town on the rmap code. Cleanups,
more pte batching, folio conversions and more. See the series
"mm/rmap: interface overhaul".
- Kinsey Ho has contributed some maintenance work on the MGLRU
code in the series "mm/mglru: Kconfig cleanup".
- Matthew Wilcox has contributed lruvec page accounting code
cleanups in the series "Remove some lruvec page accounting
functions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
=0NHs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the series
'maple_tree: add mt_free_one() and mt_attr() helpers'
'Some cleanups of maple tree'
- In the series 'mm: use memmap_on_memory semantics for dax/kmem'
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few fixes)
in the patch series
'Add folio_zero_tail() and folio_fill_tail()'
'Make folio_start_writeback return void'
'Fix fault handler's handling of poisoned tail pages'
'Convert aops->error_remove_page to ->error_remove_folio'
'Finish two folio conversions'
'More swap folio conversions'
- Kefeng Wang has also contributed folio-related work in the series
'mm: cleanup and use more folio in page fault'
- Jim Cromie has improved the kmemleak reporting output in the series
'tweak kmemleak report format'.
- In the series 'stackdepot: allow evicting stack traces' Andrey
Konovalov to permits clients (in this case KASAN) to cause eviction
of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series 'mm:
page_alloc: fixes for high atomic reserve caluculations'.
- Dmitry Rokosov has added to the samples/ dorectory some sample code
for a userspace memcg event listener application. See the series
'samples: introduce cgroup events listeners'.
- Some mapletree maintanance work from Liam Howlett in the series
'maple_tree: iterator state changes'.
- Nhat Pham has improved zswap's approach to writeback in the series
'workload-specific and memory pressure-driven zswap writeback'.
- DAMON/DAMOS feature and maintenance work from SeongJae Park in the
series
'mm/damon: let users feed and tame/auto-tune DAMOS'
'selftests/damon: add Python-written DAMON functionality tests'
'mm/damon: misc updates for 6.8'
- Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
memcg: subtree stats flushing and thresholds'.
- In the series 'Multi-size THP for anonymous memory' Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series 'More buffer_head
cleanups'.
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
Add ksm advisor'. This is a governor which tunes KSM's scanning
aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory use
in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.
- Matthew Wilcox has performed some maintenance work on the writeback
code, both code and within filesystems. The series is 'Clean up the
writeback paths'.
- Andrey Konovalov has optimized KASAN's handling of alloc and free
stack traces for secondary-level allocators, in the series 'kasan:
save mempool stack traces'.
- Andrey also performed some KASAN maintenance work in the series
'kasan: assorted clean-ups'.
- David Hildenbrand has gone to town on the rmap code. Cleanups, more
pte batching, folio conversions and more. See the series 'mm/rmap:
interface overhaul'.
- Kinsey Ho has contributed some maintenance work on the MGLRU code
in the series 'mm/mglru: Kconfig cleanup'.
- Matthew Wilcox has contributed lruvec page accounting code cleanups
in the series 'Remove some lruvec page accounting functions'"
* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
mm, treewide: introduce NR_PAGE_ORDERS
selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
selftests/mm: skip test if application doesn't has root privileges
selftests/mm: conform test to TAP format output
selftests: mm: hugepage-mmap: conform to TAP format output
selftests/mm: gup_test: conform test to TAP format output
mm/selftests: hugepage-mremap: conform test to TAP format output
mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
mm/memcontrol: remove __mod_lruvec_page_state()
mm/khugepaged: use a folio more in collapse_file()
slub: use a folio in __kmalloc_large_node
slub: use folio APIs in free_large_kmalloc()
slub: use alloc_pages_node() in alloc_slab_page()
mm: remove inc/dec lruvec page state functions
mm: ratelimit stat flush from workingset shrinker
kasan: stop leaking stack trace handles
mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
mm/mglru: add dummy pmd_dirty()
...
Align larger anonymous memory mappings on THP boundaries by going through
thp_get_unmapped_area if THPs are enabled for the current process.
With this patch, larger anonymous mappings are now THP aligned. When a
malloc library allocates a 2MB or larger arena, that arena can now be
mapped with THPs right from the start, which can result in better TLB hit
rates and execution time.
Link: https://lkml.kernel.org/r/20220809142457.4751229f@imladris.surriel.com
Link: https://lkml.kernel.org/r/20231214223423.1133074-1-yang@os.amperecomputing.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With commit cf8e865810 ("arch: Remove Itanium (IA-64) architecture"),
there is no need to keep the IA64-specific vma expansion.
Clean up the IA64-specific vma expansion implementation.
Link: https://lkml.kernel.org/r/20231113124728.3974-1-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
directly replacing the entries of VMAs in the new maple tree can result in
better performance. __mt_dup() uses DFS pre-order to duplicate the maple
tree, so it is efficient.
The average time complexity of __mt_dup() is O(n), where n is the number
of VMAs. The proof of the time complexity is provided in the commit log
that introduces __mt_dup(). After duplicating the maple tree, each
element is traversed and replaced (ignoring the cases of deletion, which
are rare). Since it is only a replacement operation for each element,
this process is also O(n).
Analyzing the exact time complexity of the previous algorithm is
challenging because each insertion can involve appending to a node,
pushing data to adjacent nodes, or even splitting nodes. The frequency of
each action is difficult to calculate. The worst-case scenario for a
single insertion is when the tree undergoes splitting at every level. If
we consider each insertion as the worst-case scenario, we can determine
that the upper bound of the time complexity is O(n*log(n)), although this
is a loose upper bound. However, based on the test data, it appears that
the actual time complexity is likely to be O(n).
As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
fails, there will be a portion of VMAs that have not been duplicated in
the maple tree. To handle this, we mark the failure point with
XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop
releasing VMAs that have not been duplicated after this point.
There is a "spawn" in byte-unixbench[1], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different number of VMAs.
Below are the test results. The first row shows the number of VMAs. The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and the this patchset, respectively. The
test results were obtained with CPU binding to avoid scheduler load
balancing that could cause unstable results. There are still some
fluctuations in the test results, but at least they are better than the
original performance.
21 121 221 421 821 1621 3221 6421 12821 25621 51221
112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
[1] https://github.com/kdlucas/byte-unixbench/tree/master
Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory' feature.
To increase the feature's checking coverage. "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes is possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In thes series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
FgeUPAD1oasg6CP+INZvCj34waNxwAc=
=E+Y4
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will
be maintained as an LTS kernel.
The architecture specific system call tables are updated for
the added map_shadow_stack() syscall and to remove references
to the long-gone sys_lookup_dcookie() syscall.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmVC40IACgkQYKtH/8kJ
Uidhmw/9EX+aWSXGoObJ3fngaNSMw+PmrEuP8qEKBHxfKHcCdX3hc451Oh4GlhaQ
tru91pPwgNvN2/rfoKusxT+V4PemGIzfNni/04rp+P0kvmdw5otQ2yNhsQNsfVmq
XGWvkxF4P2GO6bkjjfR/1dDq7GtlyXtwwPDKeLbYb6TnJOZjtx+EAN27kkfSn1Ms
R4Sa3zJ+DfHUmHL5S9g+7UD/CZ5GfKNmIskI4Mz5GsfoUz/0iiU+Bge/9sdcdSJQ
kmbLy5YnVzfooLZ3TQmBFsO3iAMWb0s/mDdtyhqhTVmTUshLolkPYyKnPFvdupyv
shXcpEST2XJNeaDRnL2K4zSCdxdbnCZHDpjfl9wfioBg7I8NfhXKpf1jYZHH1de4
LXq8ndEFEOVQw/zSpYWfQq1sux8Jiqr+UK/ukbVeFWiGGIUs91gEWtPAf8T0AZo9
ujkJvaWGl98O1g5wmBu0/dAR6QcFJMDfVwbmlIFpU8O+MEaz6X8mM+O5/T0IyTcD
eMbAUjj4uYcU7ihKzHEv/0SS9Of38kzff67CLN5k8wOP/9NlaGZ78o1bVle9b52A
BdhrsAefFiWHp1jT6Y9Rg4HOO/TguQ9e6EWSKOYFulsiLH9LEFaB9RwZLeLytV0W
vlAgY9rUW77g1OJcb7DoNv33nRFuxsKqsnz3DEIXtgozo9CzbYI=
=H1vH
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull ia64 removal and asm-generic updates from Arnd Bergmann:
- The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will be
maintained as an LTS kernel.
- The architecture specific system call tables are updated for the
added map_shadow_stack() syscall and to remove references to the
long-gone sys_lookup_dcookie() syscall.
* tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
hexagon: Remove unusable symbols from the ptrace.h uapi
asm-generic: Fix spelling of architecture
arch: Reserve map_shadow_stack() syscall number for all architectures
syscalls: Cleanup references to sys_lookup_dcookie()
Documentation: Drop or replace remaining mentions of IA64
lib/raid6: Drop IA64 support
Documentation: Drop IA64 from feature descriptions
kernel: Drop IA64 support from sig_fault handlers
arch: Remove Itanium (IA-64) architecture
- Support non-BSS ELF segments with 0 filesz (Eric W. Biederman, Kees Cook)
- Enable namespaced binfmt_misc (Christian Brauner)
- Remove struct tag 'dynamic' from ELF UAPI (Alejandro Colomar)
- Clean up binfmt_elf_fdpic debug output (Greg Ungerer)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmU/40AWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJh1rD/9g8iQ77KvC/l/GUjt9WNCsbwMR
4Ro15U6PP9TbafxEUTgYGrwpVPcWrOTz3zrLZ/NsR5GtgKolxLry94oeUPCpFRUP
v+4cQWpcQQtkAiRw+vc4/XfUivWmZuNGiLOt2egZUP6tQhJocmRNW7XbGF1ZDrSA
ASZknaA7qVLx/hnghm+bCXjNOx6hN/Md35QBPuyAclpD3sbUJDPSODZZb9Bcz+4w
0qD3acA3Nulug9k/5j7Ed0MzV8I/WfgZQQhGMl4K7yBQv06vcrRV6Eon4D9KvJVm
bjK3zFE/zILkY1BHIUZZT3h2DjdUwHrGr82u5y6u3buj88IcNyFfSaGyYYBqn3Ux
P7Y+dD9zZXQuMbqmhWbdK8UoSYiJ9isOB02lt0oHipONR5PqRocTsA6gseMsO9cv
TwvGL279WlfZIj+2pvn0VJv/7DOCKGjZfc2AXhgPSkjICSO9mEOlVcFv1v3ZuXAn
Cb/6/BMZyNqh/UIGWdPRyDVHEdswpJVcecewnJwmrG1vmvYyfyP8U+VoRE4ItELz
fMpZskAb7SKV+McHLDauV+9eCgqaF5DIM3/zgws5iayRcGZQfengXqIajL7Ujlwf
RKlnfhtRxkfgpF8vEmQDs0y5AVsU/l48dOSrb/0Vg9oXKBdBa9ozyhr1Ok5kwkiN
LfDZDjSMyERgO/UZHQ==
=hxk8
-----END PGP SIGNATURE-----
Merge tag 'execve-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- Support non-BSS ELF segments with zero filesz
Eric Biederman and I refactored ELF segment loading to handle the
case where a segment has a smaller filesz than memsz. Traditionally
linkers only did this for .bss and it was always the last segment. As
a result, the kernel only handled this case when it was the last
segment. We've had two recent cases where linkers were trying to use
these kinds of segments for other reasons, and the were in the middle
of the segment list. There was no good reason for the kernel not to
support this, and the refactor actually ends up making things more
readable too.
- Enable namespaced binfmt_misc
Christian Brauner has made it possible to use binfmt_misc with mount
namespaces. This means some traditionally root-only interfaces (for
adding/removing formats) are now more exposed (but believed to be
safe).
- Remove struct tag 'dynamic' from ELF UAPI
Alejandro Colomar noticed that the ELF UAPI has been polluting the
struct namespace with an unused and overly generic tag named
"dynamic" for no discernible reason for many many years. After
double-checking various distro source repositories, it has been
removed.
- Clean up binfmt_elf_fdpic debug output (Greg Ungerer)
* tag 'execve-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_misc: enable sandboxed mounts
binfmt_misc: cleanup on filesystem umount
binfmt_elf_fdpic: clean up debug warnings
mm: Remove unused vm_brk()
binfmt_elf: Only report padzero() errors when PROT_WRITE
binfmt_elf: Use elf_load() for library
binfmt_elf: Use elf_load() for interpreter
binfmt_elf: elf_bss no longer used by load_elf_binary()
binfmt_elf: Support segments with 0 filesz and misaligned starts
elf, uapi: Remove struct tag 'dynamic'
In order for a F_SEAL_WRITE sealed memfd mapping to have an opportunity to
clear VM_MAYWRITE, we must be able to invoke the appropriate
vm_ops->mmap() handler to do so. We would otherwise fail the
mapping_map_writable() check before we had the opportunity to avoid it.
This patch moves this check after the call_mmap() invocation. Only memfd
actively denies write access causing a potential failure here (in
memfd_add_seals()), so there should be no impact on non-memfd cases.
This patch makes the userland-visible change that MAP_SHARED, PROT_READ
mappings of an F_SEAL_WRITE sealed memfd mapping will now succeed.
There is a delicate situation with cleanup paths assuming that a writable
mapping must have occurred in circumstances where it may now not have. In
order to ensure we do not accidentally mark a writable file unwritable by
mistake, we explicitly track whether we have a writable mapping and unmap
only if we do.
[lstoakes@gmail.com: do not set writable_file_mapping in inappropriate case]
Link: https://lkml.kernel.org/r/c9eb4cc6-7db4-4c2b-838d-43a0b319a4f0@lucifer.local
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217238
Link: https://lkml.kernel.org/r/55e413d20678a1bb4c7cce889062bbb07b0df892.1697116581.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "permit write-sealed memfd read-only shared mappings", v4.
The man page for fcntl() describing memfd file seals states the following
about F_SEAL_WRITE:-
Furthermore, trying to create new shared, writable memory-mappings via
mmap(2) will also fail with EPERM.
With emphasis on 'writable'. In turns out in fact that currently the
kernel simply disallows all new shared memory mappings for a memfd with
F_SEAL_WRITE applied, rendering this documentation inaccurate.
This matters because users are therefore unable to obtain a shared mapping
to a memfd after write sealing altogether, which limits their usefulness.
This was reported in the discussion thread [1] originating from a bug
report [2].
This is a product of both using the struct address_space->i_mmap_writable
atomic counter to determine whether writing may be permitted, and the
kernel adjusting this counter when any VM_SHARED mapping is performed and
more generally implicitly assuming VM_SHARED implies writable.
It seems sensible that we should only update this mapping if VM_MAYWRITE
is specified, i.e. whether it is possible that this mapping could at any
point be written to.
If we do so then all we need to do to permit write seals to function as
documented is to clear VM_MAYWRITE when mapping read-only. It turns out
this functionality already exists for F_SEAL_FUTURE_WRITE - we can
therefore simply adapt this logic to do the same for F_SEAL_WRITE.
We then hit a chicken and egg situation in mmap_region() where the check
for VM_MAYWRITE occurs before we are able to clear this flag. To work
around this, perform this check after we invoke call_mmap(), with careful
consideration of error paths.
Thanks to Andy Lutomirski for the suggestion!
[1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/
[2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238
This patch (of 3):
There is a general assumption that VMAs with the VM_SHARED flag set are
writable. If the VM_MAYWRITE flag is not set, then this is simply not the
case.
Update those checks which affect the struct address_space->i_mmap_writable
field to explicitly test for this by introducing
[vma_]is_shared_maywrite() helper functions.
This remains entirely conservative, as the lack of VM_MAYWRITE guarantees
that the VMA cannot be written to.
Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mremap uses vma_merge() in the case where a VMA needs to be extended. This
can be significantly simplified and abstracted.
This makes it far easier to understand what the actual function is doing,
avoids future mistakes in use of the confusing vma_merge() function and
importantly allows us to make future changes to how vma_merge() is
implemented by knowing explicitly which merge cases each invocation uses.
Note that in the mremap() extend case, we perform this merge only when
old_len == vma->vm_end - addr. The extension_start, i.e. the start of the
extended portion of the VMA is equal to addr + old_len, i.e. vma->vm_end.
With this refactoring, vma_merge() is no longer required anywhere except
mm/mmap.c, so mark it static.
Link: https://lkml.kernel.org/r/f16cbdc2e72d37a1a097c39dc7d1fee8919a1c93.1697043508.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Only in mmap_region() and copy_vma() do we attempt to merge VMAs which
occupy entirely new regions of virtual memory.
We can abstract this logic and make the intent of this invocations of it
completely explicit, rather than invoking vma_merge() with an inscrutable
wall of parameters.
This also paves the way for a simplification of the core vma_merge()
implementation, as we seek to make it entirely an implementation detail.
The VMA merge call in mmap_region() occurs only for file-backed mappings,
where each of the parameters previously specified as NULL are defaulted to
NULL in vma_init() (called by vm_area_alloc()).
This matches the previous behaviour of specifying NULL for a number of
fields, however note that prior to this call we pass the VMA to the file
system driver via call_mmap(), which may in theory adjust fields that we
pass in to vma_merge_new_vma().
Therefore we actually resolve an oversight here by allowing for the fact
that the driver may have done this.
Link: https://lkml.kernel.org/r/3dc71d17e307756a54781d4a4ce7315cf8b18bea.1697043508.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now the common pattern of - attempting a merge via vma_merge() and should
this fail splitting VMAs via split_vma() - has been abstracted, the former
can be placed into mm/internal.h and the latter made static.
In addition, the split_vma() nommu variant also need not be exported.
Link: https://lkml.kernel.org/r/405f2be10e20c4e9fbcc9fe6b2dfea105f6642e0.1697043508.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mprotect() and other functions which change VMA parameters over a range
each employ a pattern of:-
1. Attempt to merge the range with adjacent VMAs.
2. If this fails, and the range spans a subset of the VMA, split it
accordingly.
This is open-coded and duplicated in each case. Also in each case most of
the parameters passed to vma_merge() remain the same.
Create a new function, vma_modify(), which abstracts this operation,
accepting only those parameters which can be changed.
To avoid the mess of invoking each function call with unnecessary
parameters, create inline wrapper functions for each of the modify
operations, parameterised only by what is required to perform the action.
We can also significantly simplify the logic - by returning the VMA if we
split (or merged VMA if we do not) we no longer need specific handling for
merge/split cases in any of the call sites.
Note that the userfaultfd_release() case works even though it does not
split VMAs - since start is set to vma->vm_start and end is set to
vma->vm_end, the split logic does not trigger.
In addition, since we calculate pgoff to be equal to vma->vm_pgoff + (start
- vma->vm_start) >> PAGE_SHIFT, and start - vma->vm_start will be 0 in this
instance, this invocation will remain unchanged.
We eliminate a VM_WARN_ON() in mprotect_fixup() as this simply asserts that
vma_merge() correctly ensures that flags remain the same, something that is
already checked in is_mergeable_vma() and elsewhere, and in any case is not
specific to mprotect().
Link: https://lkml.kernel.org/r/0dfa9368f37199a423674bf0ee312e8ea0619044.1697043508.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When tracing through the code in vma_merge(), it was not completely
clear why the error return to a dup_anon_vma() call would not overwrite
a previous attempt to the same function. This commit adds a comment
specifying why it is safe.
Link: https://lkml.kernel.org/r/20230929183041.2835469-4-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/linux-mm/CAG48ez3iDwFPR=Ed1BfrNuyUJPMK_=StjxhUsCkL6po1s7bONg@mail.gmail.com/
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the calling function fails after the dup_anon_vma(), the
duplication of the anon_vma is not being undone. Add the necessary
unlink_anon_vma() call to the error paths that are missing them.
This issue showed up during inspection of the error path in vma_merge()
for an unrelated vma iterator issue.
Users may experience increased memory usage, which may be problematic as
the failure would likely be caused by a low memory situation.
Link: https://lkml.kernel.org/r/20230929183041.2835469-3-Liam.Howlett@oracle.com
Fixes: d4af56c5c7 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Calling vm_brk_flags() with flags set other than VM_EXEC will exit the
function without releasing the mmap_write_lock.
Just do the sanity check before the lock is acquired. This doesn't fix an
actual issue since no caller sets a flag other than VM_EXEC.
Link: https://lkml.kernel.org/r/20230929171937.work.697-kees@kernel.org
Fixes: 2e7ce7d354 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
expand_upwards() and expand_downwards() will return -EFAULT if VM_GROWSUP
or VM_GROWSDOWN is not correctly set in vma->vm_flags, however in
!CONFIG_STACK_GROWSUP case, expand_stack_locked() returns -EINVAL first if
!(vma->vm_flags & VM_GROWSDOWN) before calling expand_downwards(), to keep
the consistency with CONFIG_STACK_GROWSUP case, remove this check.
The usages of this function are as below:
A:fs/exec.c
ret = expand_stack_locked(vma, stack_base);
if (ret)
ret = -EFAULT;
or
B:mm/memory.c mm/mmap.c
if (expand_stack_locked(vma, addr))
return NULL;
which means the return value will not propagate to other places, so I
believe there is no user-visible effects of this change, and it's
unnecessary to backport to earlier versions.
Link: https://lkml.kernel.org/r/20230906103312.645712-1-xiujianfeng@huaweicloud.com
Fixes: f440fa1ac9 ("mm: make find_extend_vma() fail if write lock not held")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fix insert_vm_struct() so that only accounted memory is unaccounted if
vma_link() fails.
Link: https://lkml.kernel.org/r/20230830004324.16101-1-anthony.yznaga@oracle.com
Fixes: d4af56c5c7 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With fs/binfmt_elf.c fully refactored to use the new elf_load() helper,
there are no more users of vm_brk(), so remove it.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Suggested-by: Eric Biederman <ebiederm@xmission.com>
Tested-by: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20230929032435.2391507-6-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
The Itanium architecture is obsolete, and an informal survey [0] reveals
that any residual use of Itanium hardware in production is mostly HP-UX
or OpenVMS based. The use of Linux on Itanium appears to be limited to
enthusiasts that occasionally boot a fresh Linux kernel to see whether
things are still working as intended, and perhaps to churn out some
distro packages that are rarely used in practice.
None of the original companies behind Itanium still produce or support
any hardware or software for the architecture, and it is listed as
'Orphaned' in the MAINTAINERS file, as apparently, none of the engineers
that contributed on behalf of those companies (nor anyone else, for that
matter) have been willing to support or maintain the architecture
upstream or even be responsible for applying the odd fix. The Intel
firmware team removed all IA-64 support from the Tianocore/EDK2
reference implementation of EFI in 2018. (Itanium is the original
architecture for which EFI was developed, and the way Linux supports it
deviates significantly from other architectures.) Some distros, such as
Debian and Gentoo, still maintain [unofficial] ia64 ports, but many have
dropped support years ago.
While the argument is being made [1] that there is a 'for the common
good' angle to being able to build and run existing projects such as the
Grid Community Toolkit [2] on Itanium for interoperability testing, the
fact remains that none of those projects are known to be deployed on
Linux/ia64, and very few people actually have access to such a system in
the first place. Even if there were ways imaginable in which Linux/ia64
could be put to good use today, what matters is whether anyone is
actually doing that, and this does not appear to be the case.
There are no emulators widely available, and so boot testing Itanium is
generally infeasible for ordinary contributors. GCC still supports IA-64
but its compile farm [3] no longer has any IA-64 machines. GLIBC would
like to get rid of IA-64 [4] too because it would permit some overdue
code cleanups. In summary, the benefits to the ecosystem of having IA-64
be part of it are mostly theoretical, whereas the maintenance overhead
of keeping it supported is real.
So let's rip off the band aid, and remove the IA-64 arch code entirely.
This follows the timeline proposed by the Debian/ia64 maintainer [5],
which removes support in a controlled manner, leaving IA-64 in a known
good state in the most recent LTS release. Other projects will follow
once the kernel support is removed.
[0] https://lore.kernel.org/all/CAMj1kXFCMh_578jniKpUtx_j8ByHnt=s7S+yQ+vGbKt9ud7+kQ@mail.gmail.com/
[1] https://lore.kernel.org/all/0075883c-7c51-00f5-2c2d-5119c1820410@web.de/
[2] https://gridcf.org/gct-docs/latest/index.html
[3] https://cfarm.tetaneutral.net/machines/list/
[4] https://lore.kernel.org/all/87bkiilpc4.fsf@mid.deneb.enyo.de/
[5] https://lore.kernel.org/all/ff58a3e76e5102c94bb5946d99187b358def688a.camel@physik.fu-berlin.de/
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Convert IBT selftest to asm to fix objtool warning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTv1QQACgkQaDWVMHDJ
krAUwhAAn6TOwHJK8BSkHeiQhON1nrlP3c5cv0AyZ2NP8RYDrZrSZvhpYBJ6wgKC
Cx5CGq5nn9twYsYS3KsktLKDfR3lRdsQ7K9qtyFtYiaeaVKo+7gEKl/K+klwai8/
gninQWHk0zmSCja8Vi77q52WOMkQKapT8+vaON9EVDO8dVEi+CvhAIfPwMafuiwO
Rk4X86SzoZu9FP79LcCg9XyGC/XbM2OG9eNUTSCKT40qTTKm5y4gix687NvAlaHR
ko5MTsdl0Wfp6Qk0ohT74LnoA2c1g/FluvZIM33ci/2rFpkf9Hw7ip3lUXqn6CPx
rKiZ+pVRc0xikVWkraMfIGMJfUd2rhelp8OyoozD7DB7UZw40Q4RW4N5tgq9Fhe9
MQs3p1v9N8xHdRKl365UcOczUxNAmv4u0nV5gY/4FMC6VjldCl2V9fmqYXyzFS4/
Ogg4FSd7c2JyGFKPs+5uXyi+RY2qOX4+nzHOoKD7SY616IYqtgKoz5usxETLwZ6s
VtJOmJL0h//z0A7tBliB0zd+SQ5UQQBDC2XouQH2fNX2isJMn0UDmWJGjaHgK6Hh
8jVp6LNqf+CEQS387UxckOyj7fu438hDky1Ggaw4YqowEOhQeqLVO4++x+HITrbp
AupXfbJw9h9cMN63Yc0gVxXQ9IMZ+M7UxLtZ3Cd8/PVztNy/clA=
=3UUm
-----END PGP SIGNATURE-----
Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 shadow stack support from Dave Hansen:
"This is the long awaited x86 shadow stack support, part of Intel's
Control-flow Enforcement Technology (CET).
CET consists of two related security features: shadow stacks and
indirect branch tracking. This series implements just the shadow stack
part of this feature, and just for userspace.
The main use case for shadow stack is providing protection against
return oriented programming attacks. It works by maintaining a
secondary (shadow) stack using a special memory type that has
protections against modification. When executing a CALL instruction,
the processor pushes the return address to both the normal stack and
to the special permission shadow stack. Upon RET, the processor pops
the shadow stack copy and compares it to the normal stack copy.
For more information, refer to the links below for the earlier
versions of this patch set"
Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/
* tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
x86/shstk: Change order of __user in type
x86/ibt: Convert IBT selftest to asm
x86/shstk: Don't retry vm_munmap() on -EINTR
x86/kbuild: Fix Documentation/ reference
x86/shstk: Move arch detail comment out of core mm
x86/shstk: Add ARCH_SHSTK_STATUS
x86/shstk: Add ARCH_SHSTK_UNLOCK
x86: Add PTRACE interface for shadow stack
selftests/x86: Add shadow stack test
x86/cpufeatures: Enable CET CR4 bit for shadow stack
x86/shstk: Wire in shadow stack interface
x86: Expose thread features in /proc/$PID/status
x86/shstk: Support WRSS for userspace
x86/shstk: Introduce map_shadow_stack syscall
x86/shstk: Check that signal frame is shadow stack mem
x86/shstk: Check that SSP is aligned on sigreturn
x86/shstk: Handle signals for shadow stack
x86/shstk: Introduce routines modifying shstk
x86/shstk: Handle thread shadow stack
x86/shstk: Add user-mode shadow stack support
...
vma_prepare() is currently the central place where vmas are being locked
before vma_complete() applies changes to them. While this is convenient,
it also obscures vma locking and makes it harder to follow the locking
rules. Move vma locking out of vma_prepare() and take vma locks
explicitly at the locations where vmas are being modified. Move vma
locking and replace it with an assertion inside dup_anon_vma() to further
clarify the locking pattern inside vma_merge().
Link: https://lkml.kernel.org/r/20230804152724.3090321-7-surenb@google.com
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
While it's not strictly necessary to lock a newly created vma before
adding it into the vma tree (as long as no further changes are performed
to it), it seems like a good policy to lock it and prevent accidental
changes after it becomes visible to the page faults. Lock the vma before
adding it into the vma tree.
[akpm@linux-foundation.org: fix reject fixing in vma_link(), per Jann]
Link: https://lkml.kernel.org/r/20230804152724.3090321-6-surenb@google.com
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use helper macro K() to improve code readability. No functional
modification involved.
Link: https://lkml.kernel.org/r/20230804012559.2617515-7-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
By delaying the setting of prev/next VMA until after the write of NULL,
the probability of the prev/next VMA already being in the CPU cache is
significantly increased, especially for larger munmap operations. It
also means that prev/next will be loaded closer to when they are used.
This requires changing the loop type when gathering the VMAs that will
be freed.
Since prev will be set later in the function, it is better to reverse
the splitting direction of the start VMA (modify the new_below argument
to __split_vma).
Using the vma_iter_prev_range() to walk back to the correct location in
the tree will, on the most part, mean walking within the CPU cache.
Usually, this is two steps vs a node reset and a tree re-walk.
Link: https://lkml.kernel.org/r/20230724183157.3939892-16-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Set the correct limits for vma_iter_prealloc() calls so that the maple
tree can be smarter about how many nodes are needed.
Link: https://lkml.kernel.org/r/20230724183157.3939892-11-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move the definition of vma_iter_clear_gfp() from mmap.c to internal.h so
it can be used in the nommu code. This will reduce node preallocations
in nommu.
Link: https://lkml.kernel.org/r/20230724183157.3939892-10-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current preallocation strategy is to preallocate the absolute
worst-case allocation for a tree modification. The entry (or NULL) is
needed to know how many nodes are needed to write to the tree. Start by
adding the argument to the mas_preallocate() definition.
Link: https://lkml.kernel.org/r/20230724183157.3939892-8-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Using vma_iter_set() will reset the tree and cause a re-walk. Use
vmi_iter_config() to set the write to a sub-set of the range. Change
the file case to also use vmi_iter_config() so that the end is correctly
set.
Link: https://lkml.kernel.org/r/20230724183157.3939892-7-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the prev does not exist, the vma iterator will be set to MAS_NONE,
which will be treated as a MAS_START when the mas_next or mas_find is
used. In this case, the next caller will be the vma iterator, which
uses mas_find() under the hood and will now do what the user expects.
Link: https://lkml.kernel.org/r/20230724183157.3939892-5-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The majority of the calls to munmap a vm range is within a single vma.
The maple tree is able to store a single entry at 0, with a size of 1 as
a pointer and avoid any allocations. Change do_vmi_align_munmap() to
store the VMAs being munmap()'ed into a tree indexed by the count. This
will leverage the ability to store the first entry without a node
allocation.
Storing the entries into a tree by the count and not the vma start and
end means changing the functions which iterate over the entries. Update
unmap_vmas() and free_pgtables() to take a maple state and a tree end
address to support this functionality.
Passing through the same maple state to unmap_vmas() and free_pgtables()
means the state needs to be reset between calls. This happens in the
static unmap_region() and exit_mmap().
Link: https://lkml.kernel.org/r/20230724183157.3939892-4-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Despite its name, mm_drop_all_locks() does not drop _all_ locks; the mmap
lock is held write-locked by the caller, and the caller is responsible for
dropping the mmap lock at a later point (which will also release the VMA
locks).
Calling vma_end_write_all() here is dangerous because the caller might
have write-locked a VMA with the expectation that it will stay
write-locked until the mmap_lock is released, as usual.
This _almost_ becomes a problem in the following scenario:
An anonymous VMA A and an SGX VMA B are mapped adjacent to each other.
Userspace calls munmap() on a range starting at the start address of A and
ending in the middle of B.
Hypothetical call graph with additional notes in brackets:
do_vmi_align_munmap
[begin first for_each_vma_range loop]
vma_start_write [on VMA A]
vma_mark_detached [on VMA A]
__split_vma [on VMA B]
sgx_vma_open [== new->vm_ops->open]
sgx_encl_mm_add
__mmu_notifier_register [luckily THIS CAN'T ACTUALLY HAPPEN]
mm_take_all_locks
mm_drop_all_locks
vma_end_write_all [drops VMA lock taken on VMA A before]
vma_start_write [on VMA B]
vma_mark_detached [on VMA B]
[end first for_each_vma_range loop]
vma_iter_clear_gfp [removes VMAs from maple tree]
mmap_write_downgrade
unmap_region
mmap_read_unlock
In this hypothetical scenario, while do_vmi_align_munmap() thinks it still
holds a VMA write lock on VMA A, the VMA write lock has actually been
invalidated inside __split_vma().
The call from sgx_encl_mm_add() to __mmu_notifier_register() can't
actually happen here, as far as I understand, because we are duplicating
an existing SGX VMA, but sgx_encl_mm_add() only calls
__mmu_notifier_register() for the first SGX VMA created in a given
process. So this could only happen in fork(), not on munmap(). But in my
view it is just pure luck that this can't happen.
Also, we wouldn't actually have any bad consequences from this in
do_vmi_align_munmap(), because by the time the bug drops the lock on VMA
A, we've already marked VMA A as detached, which makes it completely
ineligible for any VMA-locked page faults. But again, that's just pure
luck.
So remove the vma_end_write_all(), so that VMA write locks are only ever
released on mmap_write_unlock() or mmap_write_downgrade().
Also add comments to document the locking rules established by this patch.
Link: https://lkml.kernel.org/r/20230720193436.454247-1-jannh@google.com
Fixes: eeff9a5d47 ("mm/mmap: prevent pagefault handler from racing with mmu_notifier registration")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Don't set the lock to the mm lock so that the detached VMA tree does not
complain about being unlocked when the mmap_lock is dropped prior to
freeing the tree.
Introduce mt_on_stack() for setting the external lock to NULL only when
LOCKDEP is used.
Move the destroying of the detached tree outside the mmap lock all
together.
Link: https://lkml.kernel.org/r/20230719183142.ktgcmuj2pnlr3h3s@revolver
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "More strict maple tree lockdep", v2.
Linus asked for more strict maple tree lockdep checking [1] and for them
to resume the normal path through Andrews tree.
This series of patches adds checks to ensure the lock is held in write
mode during the write path of the maple tree instead of checking if it's
held at all.
It also reduces the validate_mm() calls by consolidating into commonly
used functions (patch 0001), and removes the necessity of holding the lock
on the detached tree during munmap() operations.
This patch (of 4):
validate_mm() calls are too spread out and duplicated in numerous
locations. Also, now that the stack write is done under the write lock,
it is not necessary to validate the mm prior to write operations.
Add a validate_mm() to the stack expansions, and to vma_complete() so
that numerous others may be dropped.
Note that vma_link() (and also insert_vm_struct() by call path) already
call validate_mm().
vma_merge() also had an unnecessary call to vma_iter_free() since the
logic change to abort earlier if no merging is necessary.
Drop extra validate_mm() calls at the start of functions and error paths
which won't write to the tree.
Relocate the validate_mm() call in the do_brk_flags() to avoid
re-running the same test when vma_complete() is used.
The call within the error path of mmap_region() is left intentionally
because of the complexity of the function and the potential of drivers
modifying the tree.
Link: https://lkml.kernel.org/r/20230714195551.894800-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20230714195551.894800-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do some short jobs. When running multiple parallel
tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
come from load_elf_binary through the call chain
"execl->do_execveat_common->bprm_execve->load_elf_binary".
In do_mmap,it will call mmap_region to create vma node, initialize it and
insert it to vma maintain structure in mm_struct and i_mmap tree of the
mapping file, then increase map_count to record the number of vma nodes
used. The hot osq_lock is to protect operations on file's i_mmap tree.
For the mm_struct member change like vma insertion and map_count update,
they do not affect i_mmap tree. Move those operations out of the lock's
critical section, to reduce hold time on the lock.
With this change, on Intel Sapphire Rapids 112C/224T platform, based on
v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
obvious performance gain on v6.5-rc1 due to regression of this benchmark
from this commit f1a7941243 (mm: convert
mm's rss stats into percpu_counter). Related discussion and conclusion
can be referred at the mail thread initiated by 0day as below: Link:
https://lore.kernel.org/linux-mm/a4aa2e13-7187-600b-c628-7e8fb108def0@intel.com/
Link: https://lkml.kernel.org/r/20230712145739.604215-1-yu.ma@intel.com
Signed-off-by: Yu Ma <yu.ma@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Zhu, Lipeng <lipeng.zhu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When VMAs are merged, dup_anon_vma() is called with `dst` pointing to the
VMA that is being expanded to cover the area previously occupied by
another VMA. This currently happens while `dst` is not write-locked.
This means that, in the `src->anon_vma && !dst->anon_vma` case, as soon as
the assignment `dst->anon_vma = src->anon_vma` has happened, concurrent
page faults can happen on `dst` under the per-VMA lock. This is already
icky in itself, since such page faults can now install pages into `dst`
that are attached to an `anon_vma` that is not yet tied back to the
`anon_vma` with an `anon_vma_chain`. But if `anon_vma_clone()` fails due
to an out-of-memory error, things get much worse: `anon_vma_clone()` then
reverts `dst->anon_vma` back to NULL, and `dst` remains completely
unconnected to the `anon_vma`, even though we can have pages in the area
covered by `dst` that point to the `anon_vma`.
This means the `anon_vma` of such pages can be freed while the pages are
still mapped into userspace, which leads to UAF when a helper like
folio_lock_anon_vma_read() tries to look up the anon_vma of such a page.
This theoretically is a security bug, but I believe it is really hard to
actually trigger as an unprivileged user because it requires that you can
make an order-0 GFP_KERNEL allocation fail, and the page allocator tries
pretty hard to prevent that.
I think doing the vma_start_write() call inside dup_anon_vma() is the most
straightforward fix for now.
For a kernel-assisted reproducer, see the notes section of the patch mail.
Link: https://lkml.kernel.org/r/20230721034643.616851-1-jannh@google.com
Fixes: 5e31275cc9 ("mm: add per-VMA lock and helper functions to control it")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.
The architecture of shadow stack constrains the ability of userspace to
move the shadow stack pointer (SSP) in order to prevent corrupting or
switching to other shadow stacks. The RSTORSSP instruction can move the
SSP to different shadow stacks, but it requires a specially placed token
in order to do this. However, the architecture does not prevent
incrementing the stack pointer to wander onto an adjacent shadow stack. To
prevent this in software, enforce guard pages at the beginning of shadow
stack VMAs, such that there will always be a gap between adjacent shadow
stacks.
Make the gap big enough so that no userspace SSP changing operations
(besides RSTORSSP), can move the SSP from one stack to the next. The
SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and
RET can move the SSP by a maximum of 8 bytes, at which point the shadow
stack would be accessed.
The INCSSP instruction can also increment the shadow stack pointer. It
is the shadow stack analog of an instruction like:
addq $0x80, %rsp
However, there is one important difference between an ADD on %rsp and
INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
of the first and last elements that were "popped". It can be thought of
as acting like this:
READ_ONCE(ssp); // read+discard top element on stack
ssp += nr_to_pop * 8; // move the shadow stack
READ_ONCE(ssp-8); // read+discard last popped stack element
The maximum distance INCSSP can move the SSP is 2040 bytes, before it
would read the memory. Therefore, a single page gap will be enough to
prevent any operation from shifting the SSP to an adjacent stack, since
it would have to land in the gap at least once, causing a fault.
This could be accomplished by using VM_GROWSDOWN, but this has a
downside. The behavior would allow shadow stacks to grow, which is
unneeded and adds a strange difference to how most regular stacks work.
In the maple tree code, there is some logic for retrying the unmapped
area search if a guard gap is violated. This retry should happen for
shadow stack guard gap violations as well. This logic currently only
checks for VM_GROWSDOWN for start gaps. Since shadow stacks also have
a start gap as well, create an new define VM_STARTGAP_FLAGS to hold
all the VM flag bits that have start gaps, and make mmap use it.
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-17-rick.p.edgecombe%40intel.com
There was no more caller passing vm_flags to do_mmap(), and vm_flags was
removed from the function's input by:
commit 45e55300f1 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").
There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to
do_mmap(). Thus, re-introduce vm_flags to do_mmap().
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-5-rick.p.edgecombe%40intel.com
Lockdep is certainly right to complain about
(&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f
but task is already holding lock:
(&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
Invert those to the usual ordering.
Fixes: 33313a747e ("mm: lock newly mapped VMA which can be modified after it becomes visible")
Cc: stable@vger.kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mmap_region adds a newly created VMA into VMA tree and might modify it
afterwards before dropping the mmap_lock. This poses a problem for page
faults handled under per-VMA locks because they don't take the mmap_lock
and can stumble on this VMA while it's still being modified. Currently
this does not pose a problem since post-addition modifications are done
only for file-backed VMAs, which are not handled under per-VMA lock.
However, once support for handling file-backed page faults with per-VMA
locks is added, this will become a race.
Fix this by write-locking the VMA before inserting it into the VMA tree.
Other places where a new VMA is added into VMA tree do not modify it
after the insertion, so do not need the same locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With recent changes necessitating mmap_lock to be held for write while
expanding a stack, per-VMA locks should follow the same rules and be
write-locked to prevent page faults into the VMA being expanded. Add
the necessary locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an addition to commit ae80b40419 ("mm: validate the mm before
dropping the mmap lock"), because it turns out there were two problems,
but lockdep just stopped complaining after finding the first one.
The do_vmi_align_munmap() function now drops the mmap lock after doing
the validate_mm() call, but it turns out that one of the callers then
immediately calls validate_mm() again.
That's both a bit silly, and now (again) happens without the mmap lock
held.
So just remove that validate_mm() call from the caller, but make sure to
not lose any coverage by doing that mm sanity checking in the error path
of do_vmi_align_munmap() too.
Reported-and-tested-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/lkml/ZKN6CdkKyxBShPHi@xsang-OptiPlex-9020/
Fixes: 408579cd62 ("mm: Update do_vmi_align_munmap() return semantics")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 408579cd62 ("mm: Update do_vmi_align_munmap() return
semantics") made the return value and locking semantics of
do_vmi_align_munmap() more straightforward, but in the process it ended
up unlocking the mmap lock just a tad too early: the debug code doing
the mmap layout validation still needs to run with the lock held, or
things might change under it while it's trying to validate things.
So just move the unlocking to after the validate_mm() call.
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/lkml/ZKIsoMOT71uwCIZX@xsang-OptiPlex-9020/
Fixes: 408579cd62 ("mm: Update do_vmi_align_munmap() return semantics")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since do_vmi_align_munmap() will always honor the downgrade request on
the success, the callers no longer have to deal with confusing return
codes. Since all callers that request downgrade actually want the lock
to be dropped, change the downgrade to an unlock request.
Note that the lock still needs to be held in read mode during the page
table clean up to avoid races with a map request.
Update do_vmi_align_munmap() to return 0 for success. Clean up the
callers and comments to always expect the unlock to be honored on the
success path. The error path will always leave the lock untouched.
As part of the cleanup, the wrapper function do_vmi_munmap() and callers
to the wrapper are also updated.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-mm/20230629191414.1215929-1-willy@infradead.org/
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that stack growth must always hold the mmap_lock for write, we can
always downgrade the mmap_lock to read and safely unmap pages from the
page table, even if we're next to a stack.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This modifies our user mode stack expansion code to always take the
mmap_lock for writing before modifying the VM layout.
It's actually something we always technically should have done, but
because we didn't strictly need it, we were being lazy ("opportunistic"
sounds so much better, doesn't it?) about things, and had this hack in
place where we would extend the stack vma in-place without doing the
proper locking.
And it worked fine. We just needed to change vm_start (or, in the case
of grow-up stacks, vm_end) and together with some special ad-hoc locking
using the anon_vma lock and the mm->page_table_lock, it all was fairly
straightforward.
That is, it was all fine until Ruihan Li pointed out that now that the
vma layout uses the maple tree code, we *really* don't just change
vm_start and vm_end any more, and the locking really is broken. Oops.
It's not actually all _that_ horrible to fix this once and for all, and
do proper locking, but it's a bit painful. We have basically three
different cases of stack expansion, and they all work just a bit
differently:
- the common and obvious case is the page fault handling. It's actually
fairly simple and straightforward, except for the fact that we have
something like 24 different versions of it, and you end up in a maze
of twisty little passages, all alike.
- the simplest case is the execve() code that creates a new stack.
There are no real locking concerns because it's all in a private new
VM that hasn't been exposed to anybody, but lockdep still can end up
unhappy if you get it wrong.
- and finally, we have GUP and page pinning, which shouldn't really be
expanding the stack in the first place, but in addition to execve()
we also use it for ptrace(). And debuggers do want to possibly access
memory under the stack pointer and thus need to be able to expand the
stack as a special case.
None of these cases are exactly complicated, but the page fault case in
particular is just repeated slightly differently many many times. And
ia64 in particular has a fairly complicated situation where you can have
both a regular grow-down stack _and_ a special grow-up stack for the
register backing store.
So to make this slightly more manageable, the bulk of this series is to
first create a helper function for the most common page fault case, and
convert all the straightforward architectures to it.
Thus the new 'lock_mm_and_find_vma()' helper function, which ends up
being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon,
loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more
than half the architectures, we now have more shared code and avoid some
of those twisty little passages.
And largely due to this common helper function, the full diffstat of
this series ends up deleting more lines than it adds.
That still leaves eight architectures (ia64, m68k, microblaze, openrisc,
parisc, s390, sparc64 and um) that end up doing 'expand_stack()'
manually because they are doing something slightly different from the
normal pattern. Along with the couple of special cases in execve() and
GUP.
So there's a couple of patches that first create 'locked' helper
versions of the stack expansion functions, so that there's a obvious
path forward in the conversion. The execve() case is then actually
pretty simple, and is a nice cleanup from our old "grow-up stackls are
special, because at execve time even they grow down".
The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because
it's just more straightforward to write out the stack expansion there
manually, instead od having get_user_pages_remote() do it for us in some
situations but not others and have to worry about locking rules for GUP.
And the final step is then to just convert the remaining odd cases to a
new world order where 'expand_stack()' is called with the mmap_lock held
for reading, but where it might drop it and upgrade it to a write, only
to return with it held for reading (in the success case) or with it
completely dropped (in the failure case).
In the process, we remove all the stack expansion from GUP (where
dropping the lock wouldn't be ok without special rules anyway), and add
it in manually to __access_remote_vm() for ptrace().
Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases.
Everything else here felt pretty straightforward, but the ia64 rules for
stack expansion are really quite odd and very different from everything
else. Also thanks to Vegard Nossum who caught me getting one of those
odd conditions entirely the wrong way around.
Anyway, I think I want to actually move all the stack expansion code to
a whole new file of its own, rather than have it split up between
mm/mmap.c and mm/memory.c, but since this will have to be backported to
the initial maple tree vma introduction anyway, I tried to keep the
patches _fairly_ minimal.
Also, while I don't think it's valid to expand the stack from GUP, the
final patch in here is a "warn if some crazy GUP user wants to try to
expand the stack" patch. That one will be reverted before the final
release, but it's left to catch any odd cases during the merge window
and release candidates.
Reported-by: Ruihan Li <lrh2000@pku.edu.cn>
* branch 'expand-stack':
gup: add warning if some caller would seem to want stack expansion
mm: always expand the stack with the mmap write lock held
execve: expand new process stack manually ahead of time
mm: make find_extend_vma() fail if write lock not held
powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
arm/mm: Convert to using lock_mm_and_find_vma()
riscv/mm: Convert to using lock_mm_and_find_vma()
mips/mm: Convert to using lock_mm_and_find_vma()
powerpc/mm: Convert to using lock_mm_and_find_vma()
arm64/mm: Convert to using lock_mm_and_find_vma()
mm: make the page fault mmap locking killable
mm: introduce new 'lock_mm_and_find_vma()' page fault helper
- Yosry has also eliminated cgroup's atomic rstat flushing.
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability.
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning.
- Lorenzo Stoakes has done some maintanance work on the get_user_pages()
interface.
- Liam Howlett continues with cleanups and maintenance work to the maple
tree code. Peng Zhang also does some work on maple tree.
- Johannes Weiner has done some cleanup work on the compaction code.
- David Hildenbrand has contributed additional selftests for
get_user_pages().
- Thomas Gleixner has contributed some maintenance and optimization work
for the vmalloc code.
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code.
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting.
- Christoph Hellwig has some cleanups for the filemap/directio code.
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the provided
APIs rather than open-coding accesses.
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings.
- John Hubbard has a series of fixes to the MM selftesting code.
- ZhangPeng continues the folio conversion campaign.
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock.
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
128 to 8.
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management.
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code.
- Vishal Moola also has done some folio conversion work.
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
=B7yQ
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
If mas_store_gfp() in the gather loop failed, the 'error' variable that
ultimately gets returned was not being set. In many cases, its original
value of -ENOMEM was still in place, and that was fine. But if VMAs had
been split at the start or end of the range, then 'error' could be zero.
Change to the 'error = foo(); if (error) goto …' idiom to fix the bug.
Also clean up a later case which avoided the same bug by *explicitly*
setting error = -ENOMEM right before calling the function that might
return -ENOMEM.
In a final cosmetic change, move the 'Point of no return' comment to
*after* the goto. That's been in the wrong place since the preallocation
was removed, and this new error path was added.
Fixes: 606c812eb1 ("mm/mmap: Fix error path in do_vmi_align_munmap()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.
It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make calls to extend_vma() and find_extend_vma() fail if the write lock
is required.
To avoid making this a flag-day event, this still allows the old
read-locking case for the trivial situations, and passes in a flag to
say "is it write-locked". That way write-lockers can say "yes, I'm
being careful", and legacy users will continue to work in all the common
cases until they have been fully converted to the new world order.
Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arm has an additional check for address < FIRST_USER_ADDRESS before
expanding the stack. Since FIRST_USER_ADDRESS is defined everywhere
(generally as 0), move that check to the generic expand_downwards().
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Android reported a performance regression in the userfaultfd unmap path.
A closer inspection on the userfaultfd_unmap_prep() change showed that a
second tree walk would be necessary in the reworked code.
Fix the regression by passing each VMA that will be unmapped through to
the userfaultfd_unmap_prep() function as they are added to the unmap list,
instead of re-walking the tree for the VMA.
Link: https://lkml.kernel.org/r/20230601015402.2819343-1-Liam.Howlett@oracle.com
Fixes: 69dbe6daf1 ("userfaultfd: use maple tree iterator to iterate VMAs")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The error unrolling was leaving the VMAs detached in many cases and
leaving the locked_vm statistic altered, and skipping the unrolling
entirely in the case of the vma tree write failing.
Fix the error path by re-attaching the detached VMAs and adding the
necessary goto for the failed vma tree write, and fix the locked_vm
statistic by only updating after the vma tree write succeeds.
Fixes: 763ecb0350 ("mm: remove the vma linked list")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/gup: disallow GUP writing to file-backed mappings by
default", v9.
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.
For example, consider the following scenario:-
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.
This patch (of 3):
vma_wants_writenotify() is specifically intended for setting PTE page
table flags, accounting for existing page table flag state and whether the
underlying filesystem performs dirty tracking for a file-backed mapping.
Everything is predicated firstly on whether the mapping is shared
writable, as this is the only instance where dirty tracking is pertinent -
MAP_PRIVATE mappings will always be CoW'd and unshared, and read-only
file-backed shared mappings cannot be written to, even with FOLL_FORCE.
All other checks are in line with existing logic, though now separated
into checks eplicitily for dirty tracking and those for determining how to
set page table flags.
We make this change so we can perform checks in the GUP logic to determine
which mappings might be problematic when written to.
Link: https://lkml.kernel.org/r/cover.1683235180.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/0f218370bd49b4e6bbfbb499f7c7b92c26ba1ceb.1683235180.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is felt that the name mlock_future_check() is vague - it doesn't
particularly convey the function's operation. mlock_future_ok() is a
clearer name for a predicate function.
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In all but one instance, mlock_future_check() is treated as a boolean
function despite returning an error code. In one instance, this error
code is ignored and replaced with -ENOMEM.
This is confusing, and the inversion of true -> failure, false -> success
is not warranted. Convert the function to a bool, lightly refactor and
return true if the check passes, false if not.
Link: https://lkml.kernel.org/r/20230522082412.56685-1-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the iterator has moved to the previous entry, then step forward one
range, back to the gap.
Link: https://lkml.kernel.org/r/20230518145544.1722059-36-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The maple tree iterator clean up is incompatible with the way
do_vmi_align_munmap() expects it to behave. Update the expected behaviour
to map now since the change will work currently.
Link: https://lkml.kernel.org/r/20230518145544.1722059-23-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the vma iterator in the validation code and combine the code to check
the maple tree into the main validate_mm() function.
Introduce a new function vma_iter_dump_tree() to dump the maple tree in
hex layout.
Replace all calls to validate_mm_mt() with validate_mm().
[Liam.Howlett@oracle.com: update validate_mm() to use vma iterator CONFIG flag]
Link: https://lkml.kernel.org/r/20230606183538.588190-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20230518145544.1722059-18-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Allow different formatting strings to be used when dumping the tree.
Currently supports hex and decimal.
Link: https://lkml.kernel.org/r/20230518145544.1722059-6-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We may still have inconsistent input parameters even if we choose not to
merge and the vma_merge() invariant checks are useful for checking this
with no production runtime cost (these are only relevant when
CONFIG_DEBUG_VM is specified).
Therefore, perform these checks regardless of whether we merge.
This is relevant, as a recent issue (addressed in commit "mm/mempolicy:
Correctly update prev when policy is equal on mbind") in the mbind logic
was only picked up in the 6.2.y stable branch where these assertions are
performed prior to determining mergeability.
Had this remained the same in mainline this issue may have been picked up
faster, so moving forward let's always check them.
Link: https://lkml.kernel.org/r/df548a6ae3fa135eec3b446eb3dae8eb4227da97.1682885809.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
After upgrading build guests to v6.3, rpm started segfaulting for
specific packages, which was bisected to commit 0503ea8f5b ("mm/mmap:
remove __vma_adjust()"). rpm is doing many mremap() operations with file
mappings of its db. The problem is that in vma_merge() case 3 (we merge
with the next vma, expanding it downwards) vm_pgoff is not adjusted as
it should when vm_start changes. As a result the rpm process most likely
sees data from the wrong offset of the file. Fix the vm_pgoff
calculation.
For case 8 this is a non-functional change as the resulting vm_pgoff is
the same.
Reported-and-bisected-by: Jiri Slaby <jirislaby@kernel.org>
Reported-and-tested-by: Fabian Vogt <fvogt@suse.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1210903
Fixes: 0503ea8f5b ("mm/mmap: remove __vma_adjust()")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of having callers care about the mmap_min_addr logic for the
lowest valid mapping address (and some of them getting it wrong), just
move the logic into vm_unmapped_area() itself. One less thing for various
architecture cases (and generic helpers) to worry about.
We should really try to make much more of this be common code, but baby
steps..
Without this, vm_unmapped_area() could return an address below
mmap_min_addr (because some caller forgot about that). That then causes
the mmap machinery to think it has found a workable address, but then
later security_mmap_addr(addr) is unhappy about it and the mmap() returns
with a nonsensical error (EPERM).
The proper action is to either return ENOMEM (if the virtual address space
is exhausted), or try to find another address (ie do a bottom-up search
for free addresses after the top-down one failed).
See commit 2afc745f3e ("mm: ensure get_unmapped_area() returns higher
address than mmap_min_addr"), which fixed this for one call site (the
generic arch_get_unmapped_area_topdown() fallback) but left other cases
alone.
Link: https://lkml.kernel.org/r/20230418214009.1142926-1-Liam.Howlett@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
Use case 1:
The madvise call is not available in the programming language. An
example for this are programs with forked workloads using a garbage
collected language without pointers. In such a language madvise cannot
be made available.
In addition the addresses of objects get moved around as they are
garbage collected. KSM sharing needs to be enabled "from the outside"
for these type of workloads.
Use case 2:
The same interpreter can also be used for workloads where KSM brings
no benefit or even has overhead. We'd like to be able to enable KSM on
a workload by workload basis.
Use case 3:
With the madvise call sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable number
of sharing opportunities may exist across multiple workloads or jobs (if
they are part of the same security domain). Only a higler level entity
like a job scheduler or container can know for certain if its running
one or more instances of a job. That job scheduler however doesn't have
the necessary internal workload knowledge to make targeted madvise
calls.
Security concerns:
In previous discussions security concerns have been brought up. The
problem is that an individual workload does not have the knowledge about
what else is running on a machine. Therefore it has to be very
conservative in what memory areas can be shared or not. However, if the
system is dedicated to running multiple jobs within the same security
domain, its the job scheduler that has the knowledge that sharing can be
safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
Here are the metrics from an instagram workload (taken from a machine
with 64GB main memory):
full_scans: 445
general_profit: 20158298048
max_page_sharing: 256
merge_across_nodes: 1
pages_shared: 129547
pages_sharing: 5119146
pages_to_scan: 4000
pages_unshared: 1760924
pages_volatile: 10761341
run: 1
sleep_millisecs: 20
stable_node_chains: 167
stable_node_chains_prune_millisecs: 2000
stable_node_dups: 2751
use_zero_pages: 0
zero_pages_sharing: 0
After the service is running for 30 minutes to an hour, 4 to 5 million
shared pages are common for this workload when using KSM.
Detailed changes:
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation,
but not calculated. This adds the general profit metric to
/sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests and ksm_functional_tests
This adds an option to specify the merge type to the ksm_tests.
This allows to test madvise and prctl KSM.
It also adds a two new tests to ksm_functional_tests: one to test
the new prctl options and the other one is a fork test to verify that
the KSM process setting is inherited by client processes.
This patch (of 3):
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a
cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag
is set, kernel samepage merging (ksm) gets enabled for all vma's of a
process.
2) Setting VM_MERGEABLE on VMA creation
When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
VM_MERGEABLE flag will be set for this VMA.
3) support disabling of ksm for a process
This adds the ability to disable ksm for a process if ksm has been
enabled for the process with prctl.
4) add new prctl option to get and set ksm for a process
This adds two new options to the prctl system call
- enable ksm for all vmas of a process (if the vmas support it).
- query if ksm has been enabled for a process.
3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
In the s390 architecture when storage keys are used, the
MMF_VM_MERGE_ANY will be disabled.
Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The maple tree limits the gap returned to a window that specifically fits
what was asked. This may not be optimal in the case of switching search
directions or a gap that does not satisfy the requested space for other
reasons. Fix the search by retrying the operation and limiting the search
window in the rare occasion that a conflict occurs.
Link: https://lkml.kernel.org/r/20230414185919.4175572-1-Liam.Howlett@oracle.com
Fixes: 3499a13168 ("mm/mmap: use maple tree for unmapped_area{_topdown}")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
call_rcu() can take a long time when callback offloading is enabled. Its
use in the vm_area_free can cause regressions in the exit path when
multiple VMAs are being freed.
Because exit_mmap() is called only after the last mm user drops its
refcount, the page fault handlers can't be racing with it. Any other
possible user like oom-reaper or process_mrelease are already synchronized
using mmap_lock. Therefore exit_mmap() can free VMAs directly, without
the use of call_rcu().
Expose __vm_area_free() and use it from exit_mmap() to avoid possible
call_rcu() floods and performance regressions caused by it.
Link: https://lkml.kernel.org/r/20230227173632.3292573-33-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Per-vma locking mechanism will search for VMA under RCU protection and
then after locking it, has to ensure it was not removed from the VMA tree
after we found it. To make this check efficient, introduce a
vma->detached flag to mark VMAs which were removed from the VMA tree.
Link: https://lkml.kernel.org/r/20230227173632.3292573-23-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Page fault handlers might need to fire MMU notifications while a new
notifier is being registered. Modify mm_take_all_locks to write-lock all
VMAs and prevent this race with page fault handlers that would hold VMA
locks. VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
locking order as in page fault handlers.
Link: https://lkml.kernel.org/r/20230227173632.3292573-22-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Normally free_pgtables needs to lock affected VMAs except for the case
when VMAs were isolated under VMA write-lock. munmap() does just that,
isolating while holding appropriate locks and then downgrading mmap_lock
and dropping per-VMA locks before freeing page tables. Add a parameter to
free_pgtables for such scenario.
Link: https://lkml.kernel.org/r/20230227173632.3292573-20-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Write-lock VMA as locked before copying it and when copy_vma produces a
new VMA.
Link: https://lkml.kernel.org/r/20230227173632.3292573-18-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Write-lock all VMAs which might be affected by a merge, split, expand or
shrink operations. All these operations use vma_prepare() before making
the modifications, therefore it provides a centralized place to perform
VMA locking.
[surenb@google.com: remove unnecessary vp->vma check in vma_prepare]
Link: https://lkml.kernel.org/r/20230301022720.1380780-1-surenb@google.com
Link: https://lore.kernel.org/r/202302281802.J93Nma7q-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230227173632.3292573-17-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Laurent Dufour <laurent.dufour@fr.ibm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
vma_prepare() acquires all locks required before VMA modifications. Move
vma_prepare() before vma_adjust_trans_huge() so that VMA is locked before
any modification.
Link: https://lkml.kernel.org/r/20230227173632.3292573-15-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rather than setting err = -1 and only resetting if we hit merge cases,
explicitly check the non-mergeable case to make it abundantly clear that
we only proceed with the rest if something is mergeable, default err to 0
and only update if an error might occur.
Move the merge_prev, merge_next cases closer to the logic determining
curr, next and reorder initial variables so they are more logically
grouped.
This has no functional impact.
Link: https://lkml.kernel.org/r/99259fbc6403e80e270e1cc4612abbc8620b121b.1679516210.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Previously, vma was an uninitialised variable which was only definitely
assigned as a result of the logic covering all possible input cases - for
it to have remained uninitialised, prev would have to be NULL, and next
would _have_ to be mergeable.
The value of res defaults to NULL, so we can neatly eliminate the
assignment to res and vma in the if (prev) block and ensure that both res
and vma are both explicitly assigned, by just setting both to prev.
In addition we add an explanation as to under what circumstances both
might change, and since we absolutely do rely on addr == curr->vm_start
should curr exist, assert that this is the case.
Link: https://lkml.kernel.org/r/83938bed24422cbe5954bbf491341674becfe567.1679516210.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use find_vma_intersection() and vma_lookup() to both simplify the logic
and to fold the end == next->vm_start condition into one block.
This groups all of the simple range checks together and establishes the
invariant that, if prev, curr or next are non-NULL then their positions
are as expected.
This has no functional impact.
Link: https://lkml.kernel.org/r/c6d960641b4ba58fa6ad3d07bf68c27d847963c8.1679516210.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "further cleanup of vma_merge()", v2.
Following on from Vlastimil Babka's patch series "cleanup vma_merge() and
improve mergeability tests" which was in turn based on Liam's prior
cleanups, this patch series introduces changes discussed in review of
Vlastimil's series and goes further in attempting to make the logic as
clear as possible.
Nearly all of this should have absolutely no functional impact, however it
does add a singular VM_WARN_ON() case.
With many thanks to Vernon for helping kick start the discussion around
simplification - abstract use of vma did indeed turn out not to be
necessary - and to Liam for his excellent suggestions which greatly
simplified things.
This patch (of 4):
Previously the ASCII diagram above vma_merge() and the accompanying
variable naming was rather confusing, however recent efforts by Liam
Howlett and Vlastimil Babka have significantly improved matters.
This patch goes a little further - replacing 'X' with 'N' which feels a
lot more natural and replacing what was 'N' with 'C' which stands for
'concurrent' VMA.
No word quite describes a VMA that has coincident start as the input span,
concurrent, abbreviated to 'curr' (and which can be thought of also as
'current') however fits intuitions well alongside prev and next.
This has no functional impact.
Link: https://lkml.kernel.org/r/cover.1679431180.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/6001e08fa7e119470cbb1d2b6275ad8d742ff9a7.1679431180.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since pre-git times, is_mergeable_vma() returns false for a vma with
vm_ops->close, so that no owner assumptions are violated in case the vma
is removed as part of the merge.
This check is currently very conservative and can prevent merging even
situations where vma can't be removed, such as simple expansion of
previous vma, as evidenced by commit d014cd7c1c ("mm, mremap: fix
mremap() expanding for vma's with vm_ops->close()")
In order to allow more merging when appropriate and simplify the code that
was made more complex by commit d014cd7c1c, start distinguishing cases
where the vma can be really removed, and allow merging with vm_ops->close
otherwise.
As a first step, add a may_remove_vma parameter to is_mergeable_vma().
can_vma_merge_before() sets it to true, because when called from
vma_merge(), a removal of the vma is possible.
In can_vma_merge_after(), pass the parameter as false, because no
removal can occur in each of its callers:
- vma_merge() calls it on the 'prev' vma, which is never removed
- mmap_region() and do_brk_flags() call it to determine if it can expand
a vma, which is not removed
As a result, vma's with vm_ops->close may now merge with compatible ranges
in more situations than previously. We can also revert commit
d014cd7c1c as the next step to simplify mremap code again.
[vbabka@suse.cz: adjust comment as suggested by Lorenzo]
Link: https://lkml.kernel.org/r/74f2ea6c-f1a9-6dd7-260c-25e660f42379@suse.cz
Link: https://lkml.kernel.org/r/20230309111258.24079-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The comments already mention returning 'true' so make the code match them.
Link: https://lkml.kernel.org/r/20230309111258.24079-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The variable 'adj_next' holds the value by which we adjust vm_start of a
vma in variable 'adjust', that's either 'next' or 'mid', so the current
name is inaccurate. Rename it to 'adj_start'.
Link: https://lkml.kernel.org/r/20230309111258.24079-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are several places where we test if 'mid' is really the area NNNN in
the diagram and the tests have two variants and are non-obvious to follow.
Instead, set 'mid' to NULL up-front if it's not the NNNN area, and
simplify the tests.
Also update the description in comment accordingly.
[vbabka@suse.cz: adjust/add comments as suggested by Lorenzo]
Link: https://lkml.kernel.org/r/def43190-53f7-a607-d1b0-b657565f4288@suse.cz
Link: https://lkml.kernel.org/r/20230309111258.24079-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is more intuitive to go from prev to mid and then next. No functional
change.
Link: https://lkml.kernel.org/r/20230309111258.24079-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Almost all cases now use the 'next' pointer for the vma following the
merged area, and the cases diagram shows it as XXXX. Case 4 is different
as it uses 'mid' and NNNN, so change it for consistency. No functional
change.
Link: https://lkml.kernel.org/r/20230309111258.24079-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Case 1 is now shown in the comment as next vma being merged with prev, so
use 'next' instead of 'mid'. In case 1 they both point to the same vma.
As a consequence, in case 6, the dup_anon_vma() is now tried first on
'next' and then on 'mid', before it was the opposite order. This is not a
functional change, as those two vma's cannnot have a different anon_vma,
as that would have prevented the merging in the first place.
Link: https://lkml.kernel.org/r/20230309111258.24079-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In case 3 we we use 'next' for everything but vma_pgoff. So use 'next'
for that as well, instead of 'mid', for consistency. Then in case 8 we
have to use 'mid' explicitly, which should also make the intent more
obvious.
Adjust the diagram for cases 1-3 in the comment to match the code - we are
using 'next' for case 3 so mark the range with XXXX instead of NNNN. For
case 2 that's a no-op as the code doesn't touch 'next' or 'mid'. For case
1 it's now wrong but that will be fixed next.
No functional change.
Link: https://lkml.kernel.org/r/20230309111258.24079-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "cleanup vma_merge() and improve mergeability tests".
My initial goal here was to try making the check for vm_ops->close in
is_mergeable_vma() only be applied for vma's that would be truly removed
as part of the merge (see Patch 9). This would then allow reverting the
quick fix d014cd7c1c ("mm, mremap: fix mremap() expanding for vma's with
vm_ops->close()"). This was successful enough to allow the revert (Patch
10). Checks using can_vma_merge_before() are still pessimistic about
possible vma removal, and making them precise would probably complicate
the vma_merge() code too much.
Liam's 6.3-rc1 simplification of vma_merge() and removal of __vma_adjust()
was very much helpful in understanding the vma_merge() implementation and
especially when vma removals can happen, which is now very obvious. While
studing the code, I've found ways to make it hopefully even more easy to
follow, so that's the patches 1-8. That made me also notice a bug that's
now already fixed in 6.3-rc1.
This patch (of 10):
In the merging preparation part of vma_merge(), some vma pointer variables
are assigned for later execution of the merge, but also read from in the
block itself. The code is easier follow and check against the cases
diagram in the comment if the code reads only from the "primary" vma
variables prev, mid, next instead. No functional change.
Link: https://lkml.kernel.org/r/20230309111258.24079-1-vbabka@suse.cz
Link: https://lkml.kernel.org/r/20230309111258.24079-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>]
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the maple tree in RCU mode for VMA tracking.
The maple tree tracks the stack and is able to update the pivot
(lower/upper boundary) in-place to allow the page fault handler to write
to the tree while holding just the mmap read lock. This is safe as the
writes to the stack have a guard VMA which ensures there will always be a
NULL in the direction of the growth and thus will only update a pivot.
It is possible, but not recommended, to have VMAs that grow up/down
without guard VMAs. syzbot has constructed a testcase which sets up a VMA
to grow and consume the empty space. Overwriting the entire NULL entry
causes the tree to be altered in a way that is not safe for concurrent
readers; the readers may see a node being rewritten or one that does not
match the maple state they are using.
Enabling RCU mode allows the concurrent readers to see a stable node and
will return the expected result.
[Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU[
Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
Fixes: d4af56c5c7 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>