mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
7094832b5a
1440 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
45fec1e595 |
mm: skip over all consecutive none ptes in do_zap_pte_range()
Skip over all consecutive none ptes in do_zap_pte_range(), which helps optimize away need_resched() + force_break + incremental pte/addr increments etc. Link: https://lkml.kernel.org/r/8ecffbf990afd1c8ccc195a2ec321d55f0923908.1733305182.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
117cdb05e3 |
mm: introduce do_zap_pte_range()
This commit introduces do_zap_pte_range() to actually zap the PTEs, which will help improve code readability and facilitate secondary checking of the processed PTEs in the future. No functional change. Link: https://lkml.kernel.org/r/c3fd16807f83bb7d7a376cc6de023a9f5ead17da.1733305182.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
fabc0e8dac |
mm: introduce zap_nonpresent_ptes()
Similar to zap_present_ptes(), let's introduce zap_nonpresent_ptes() to handle non-present ptes, which can improve code readability. No functional change. Link: https://lkml.kernel.org/r/009ca882036d9c7a9f815489cfeafe0bdb79d62d.1733305182.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
773fc6ab71 |
mm/memory: fix a comment typo in lock_mm_and_find_vma()
s/equivalend/equivalent/ Link: https://lkml.kernel.org/r/20241120105041.2394283-1-yikming2222@gmail.com Signed-off-by: Chin Yik Ming <yikming2222@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
bfc1d17829 |
mm: migrate: remove unused argument vma from migrate_misplaced_folio()
Commit
|
||
![]() |
c51a4f11e6 |
mm: use clear_user_(high)page() for arch with special user folio handling
Some architectures have special handling after clearing user folios:
architectures, which set cpu_dcache_is_aliasing() to true, require
flushing dcache; arc, which sets cpu_icache_is_aliasing() to true, changes
folio->flags to make icache coherent to dcache. So __GFP_ZERO using only
clear_page() is not enough to zero user folios and clear_user_(high)page()
must be used. Otherwise, user data will be corrupted.
Fix it by always clearing user folios with clear_user_(high)page() when
cpu_dcache_is_aliasing() is true or cpu_icache_is_aliasing() is true.
Rename alloc_zeroed() to user_alloc_needs_zeroing() and invert the logic
to clarify its intend.
Link: https://lkml.kernel.org/r/20241209182326.2955963-2-ziy@nvidia.com
Fixes:
|
||
![]() |
f5d09de9f1 |
mm: use aligned address in copy_user_gigantic_page()
In current kernel, hugetlb_wp() calls copy_user_large_folio() with the
fault address. Where the fault address may be not aligned with the huge
page size. Then, copy_user_large_folio() may call
copy_user_gigantic_page() with the address, while
copy_user_gigantic_page() requires the address to be huge page size
aligned. So, this may cause memory corruption or information leak,
addtional, use more obvious naming 'addr_hint' instead of 'addr' for
copy_user_gigantic_page().
Link: https://lkml.kernel.org/r/20241028145656.932941-2-wangkefeng.wang@huawei.com
Fixes:
|
||
![]() |
8aca2bc96c |
mm: use aligned address in clear_gigantic_page()
In current kernel, hugetlb_no_page() calls folio_zero_user() with the
fault address. Where the fault address may be not aligned with the huge
page size. Then, folio_zero_user() may call clear_gigantic_page() with
the address, while clear_gigantic_page() requires the address to be huge
page size aligned. So, this may cause memory corruption or information
leak, addtional, use more obvious naming 'addr_hint' instead of 'addr' for
clear_gigantic_page().
Link: https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com
Fixes:
|
||
![]() |
20bf82a898 |
mm: don't allow huge faults for files with pre content watches
There's nothing stopping us from supporting this, we could simply pass the order into the helper and emit the proper length. However currently there's no tests to validate this works properly, so disable it until there's a desire to support this along with the appropriate tests. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/9035b82cff08a3801cef3d06bbf2778b2e5a4dba.1731684329.git.josef@toxicpanda.com |
||
![]() |
7c53dfbdb0 |
mm: add PTE_MARKER_GUARD PTE marker
Add a new PTE marker that results in any access causing the accessing process to segfault. This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range. This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination. Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect. We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or MADV_FREE. We additionally add a WARN_ON_ONCE() in hugetlb logic should a guard marker be encountered there, as we explicitly do not support this operation and this should not occur. Link: https://lkml.kernel.org/r/f47f3d5acca2dcf9bbf655b6d33f3dc713e4a4a0.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabkba@suse.cz> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
722376934b |
mm/memory.c: simplify pfnmap_lockdep_assert
Use local `mapping' to reduce the pointer chasing.
akpm: extracted from a bugfix which Linus fixed with
|
||
![]() |
5708d96da2 |
mm: avoid zeroing user movable page twice with init_on_alloc=1
Commit
|
||
![]() |
6359c39c9d |
mm: remove unused hugepage for vma_alloc_folio()
The hugepage parameter was deprecated since commit
|
||
![]() |
077c7c1e09 |
mm/memory.c: remove stray newline at top of file
Fixes:
|
||
![]() |
24553a978b |
mm: copy_pte_range() use pte_offset_map_rw_nolock()
In copy_pte_range(), we may modify the src_pte entry after holding the src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we already hold the exclusive mmap_lock, and the copy_pte_range() and retract_page_tables() are using vma->anon_vma to be exclusive, so the PTE page is stable, there is no need to get pmdval and do pmd_same() check. Link: https://lkml.kernel.org/r/9166f6fad806efbca72e318ab6f0f8af458056a9.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
d9c1ddf37b |
mm: handle_pte_fault() use pte_offset_map_rw_nolock()
In handle_pte_fault(), we may modify the vmf->pte after acquiring the vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we will do the pte_same() check, so there is no need to get pmdval to do pmd_same() check, just pass a dummy variable to it. Link: https://lkml.kernel.org/r/af8d694853b44c5a6018403ae435440e275854c7.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
f2f484085e |
mm: move mm flags to mm_types.h
The types of mm flags are now far beyond the core dump related features. This patch moves mm flags from linux/sched/coredump.h to linux/mm_types.h. The linux/sched/coredump.h has include the mm_types.h, so the C files related to coredump does not need to change head file inclusion. In addition, the inclusion of sched/coredump.h now can be deleted from the C files that irrelevant to core dump. Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
01626a1823 |
mm: avoid unconditional one-tick sleep when swapcache_prepare fails
Commit |
||
![]() |
b1b4675167 |
mm: fix follow_pfnmap API lockdep assert
The lockdep asserts for the new follow_pfnmap() API "knows" that a
pfnmap always has a vma->vm_file, since that's the only way to create
such a mapping.
And that's actually true for all the normal cases. But not for the mmap
failure case, where the incomplete mapping is torn down and we have
cleared vma->vm_file because the failure occured before the file was
linked to the vma.
So this codepath does actually need to check for vm_file being NULL.
Reported-by: Jann Horn <jannh@google.com>
Fixes:
|
||
![]() |
2b0f922323 |
mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
We (or rather, readahead logic :) ) might be allocating a THP in the
pagecache and then try mapping it into a process that explicitly disabled
THP: we might end up installing PMD mappings.
This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before
starting the VM.
For example, starting a VM backed on a file system with large folios
supported makes the VM crash when the VM tries accessing such a mapping
using KVM.
Is it also a problem when the HW disabled THP using
TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case
without X86_FEATURE_PSE.
In the future, we might be able to do better on s390x and only disallow
PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
really wants. For now, fix it by essentially performing the same check as
would be done in __thp_vma_allowable_orders() or in shmem code, where this
works as expected, and disallow PMD mappings, making us fallback to PTE
mappings.
Link: https://lkml.kernel.org/r/20241011102445.934409-3-david@redhat.com
Fixes:
|
||
![]() |
a5e8eb2513 |
mm: remove unused stub for can_swapin_thp()
When can_swapin_thp() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: mm/memory.c:4184:20: error: unused function 'can_swapin_thp' [-Werror,-Wunused-function] Fix this by removing the unused stub. See also commit |
||
![]() |
617a814f14 |
ALong with the usual shower of singleton patches, notable patch series in
this pull request are: "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. "mm: split underused THPs" from Yu Zhao. Improve THP=always policy - this was overprovisioning THPs in sparsely accessed memory areas. "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7 AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo= =s0T+ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ... |
||
![]() |
839c4f596f |
12 hotfixes, 11 of which are cc:stable.
Four fixes for longstanding ocfs2 issues and the remainder address random MM things. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZuvTyAAKCRDdBJ7gKXxA jrc4AP95yg/8E50PLyVKFIsot3Vaodq908cz2vvS0n0915NO7AD+Psy11a2aR1E5 L/ZND8Zyv06qmz73WJ7BUIx0CHCniw4= =eX0p -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "12 hotfixes, 11 of which are cc:stable. Four fixes for longstanding ocfs2 issues and the remainder address random MM things" * tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/madvise: process_madvise() drop capability check if same mm mm/huge_memory: ensure huge_zero_folio won't have large_rmappable flag set mm/hugetlb.c: fix UAF of vma in hugetlb fault pathway mm: change vmf_anon_prepare() to __vmf_anon_prepare() resource: fix region_intersects() vs add_memory_driver_managed() zsmalloc: use unique zsmalloc caches names mm/damon/vaddr: protect vma traversal in __damon_va_thre_regions() with rcu read lock mm: vmscan.c: fix OOM on swap stress test ocfs2: cancel dqi_sync_work before freeing oinfo ocfs2: fix possible null-ptr-deref in ocfs2_set_buffer_uptodate ocfs2: remove unreasonable unlock in ocfs2_read_blocks ocfs2: fix null-ptr-deref when journal load failed. |
||
![]() |
242d12c981 |
mm: support large folios swap-in for sync io devices
Currently, we have mTHP features, but unfortunately, without support for
large folio swap-ins, once these large folios are swapped out, they are
lost because mTHP swap is a one-way process. The lack of mTHP swap-in
functionality prevents mTHP from being used on devices like Android that
heavily rely on swap.
This patch introduces mTHP swap-in support. It starts from sync devices
such as zRAM. This is probably the simplest and most common use case,
benefiting billions of Android phones and similar devices with minimal
implementation cost. In this straightforward scenario, large folios are
always exclusive, eliminating the need to handle complex rmap and
swapcache issues.
It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
swap-out and swap-in. Large folios in the buddy system are also
preserved as much as possible, rather than being fragmented due
to swap-in.
2. Eliminates fragmentation in swap slots and supports successful
THP_SWPOUT.
w/o this patch (Refer to the data from Chris's and Kairui's latest
swap allocator optimization while running ./thp_swap_allocator_test
w/o "-a" option [1]):
./thp_swap_allocator_test
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
w/ this patch (always 0%):
Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
...
3. With both mTHP swap-out and swap-in supported, we offer the option to enable
zsmalloc compression/decompression with larger granularity[2]. The upcoming
optimization in zsmalloc will significantly increase swap speed and improve
compression efficiency. Tested by running 100 iterations of swapping 100MiB
of anon memory, the swap speed improved dramatically:
time consumption of swapin(ms) time consumption of swapout(ms)
lz4 4k 45274 90540
lz4 64k 22942 55667
zstdn 4k 85035 186585
zstdn 64k 46558 118533
The compression ratio also improved, as evaluated with 1 GiB of data:
granularity orig_data_size compr_data_size
4KiB-zstd 1048576000 246876055
64KiB-zstd 1048576000 199763892
Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
realized.
4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
of nr_pages. Swapping in content filled with the same data 0x11, w/o
and w/ the patch for five rounds (Since the content is the same,
decompression will be very fast. This primarily assesses the impact of
reduced page faults):
swp in bandwidth(bytes/ms) w/o w/
round1 624152 1127501
round2 631672 1127501
round3 620459 1139756
round4 606113 1139756
round5 624152
|
||
![]() |
325efb16da |
mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
With large folios swap-in, we might need to uncharge multiple entries all together, add nr argument in mem_cgroup_swapin_uncharge_swap(). For the existing two users, just pass nr=1. Link: https://lkml.kernel.org/r/20240908232119.2157-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
658be46520 |
mm: support poison recovery from copy_present_page()
Similar to other poison recovery, use copy_mc_user_highpage() to avoid potentially kernel panic during copy page in copy_present_page() from fork, once copy failed due to hwpoison in source page, we need to break out of copy in copy_pte_range() and release prealloc folio, so copy_mc_user_highpage() is moved ahead before set *prealloc to NULL. Link: https://lkml.kernel.org/r/20240906024201.1214712-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
aa549f923f |
mm: support poison recovery from do_cow_fault()
Patch series "mm: hwpoison: two more poison recovery".
One more CoW path to support poison recorvery in do_cow_fault(), and the
last copy_user_highpage() user is replaced to copy_mc_user_highpage() from
copy_present_page() during fork to support poison recorvery too.
This patch (of 2):
Like commit
|
||
![]() |
b0a1c0d0ed |
mm: remove follow_pte()
follow_pte() users have been converted to follow_pfnmap*(). Remove the API. Link: https://lkml.kernel.org/r/20240826204353.2228736-17-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
b17269a51c |
mm/access_process_vm: use the new follow_pfnmap API
Use the new API that can understand huge pfn mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-16-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
6da8e9634b |
mm: new follow_pfnmap API
Introduce a pair of APIs to follow pfn mappings to get entry information. It's very similar to what follow_pte() does before, but different in that it recognizes huge pfn mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-10-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
10d83d7781 |
mm/pagewalk: check pfnmap for folio_walk_start()
Teach folio_walk_start() to recognize special pmd/pud mappings, and fail them properly as it means there's no folio backing them. [peterx@redhat.com: remove some stale comments, per David] Link: https://lkml.kernel.org/r/20240829202237.2640288-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240826204353.2228736-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
2a058ab328 |
mm: change vmf_anon_prepare() to __vmf_anon_prepare()
Some callers of vmf_anon_prepare() may not want us to release the per-VMA
lock ourselves. Rename vmf_anon_prepare() to __vmf_anon_prepare() and let
the callers drop the lock when desired.
Also, make vmf_anon_prepare() a wrapper that releases the per-VMA lock
itself for any callers that don't care.
This is in preparation to fix this bug reported by syzbot:
https://lore.kernel.org/linux-mm/00000000000067c20b06219fbc26@google.com/
Link: https://lkml.kernel.org/r/20240914194243.245-1-vishal.moola@gmail.com
Fixes:
|
||
![]() |
79a61cc3fc |
mm: avoid leaving partial pfn mappings around in error case
As Jann points out, PFN mappings are special, because unlike normal memory mappings, there is no lifetime information associated with the mapping - it is just a raw mapping of PFNs with no reference counting of a 'struct page'. That's all very much intentional, but it does mean that it's easy to mess up the cleanup in case of errors. Yes, a failed mmap() will always eventually clean up any partial mappings, but without any explicit lifetime in the page table mapping itself, it's very easy to do the error handling in the wrong order. In particular, it's easy to mistakenly free the physical backing store before the page tables are actually cleaned up and (temporarily) have stale dangling PTE entries. To make this situation less error-prone, just make sure that any partial pfn mapping is torn down early, before any other error handling. Reported-and-tested-by: Jann Horn <jannh@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
246d3aa3e5 |
mm: cleanup count_mthp_stat() definition
Patch series "Shmem mTHP controls and stats improvements", v3. This is a small series to tidy up the way the shmem controls and stats are exposed. These patches were previously part of the series at [2], but I decided to split them out since they can go in independently. This patch (of 2): Let's move count_mthp_stat() so that it's always defined, even when THP is disabled. Previously uses of the function in files such as shmem.c, which are compiled even when THP is disabled, required ugly THP ifdeferry. With this cleanup, we can remove those ifdefs and the function resolves to a nop when THP is disabled. I shortly plan to call count_mthp_stat() from more THP-invariant source files. Link: https://lkml.kernel.org/r/20240808111849.651867-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240808111849.651867-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Barry Song <baohua@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
f77f0c7514 |
mm,memcg: provide per-cgroup counters for NUMA balancing operations
The ability to observe the demotion and promotion decisions made by the kernel on a per-cgroup basis is important for monitoring and tuning containerized workloads on machines equipped with tiered memory. Different containers in the system may experience drastically different memory tiering actions that cannot be distinguished from the global counters alone. For example, a container running a workload that has a much hotter memory accesses will likely see more promotions and fewer demotions, potentially depriving a colocated container of top tier memory to such an extent that its performance degrades unacceptably. For another example, some containers may exhibit longer periods between data reuse, causing much more numa_hint_faults than numa_pages_migrated. In this case, tuning hot_threshold_ms may be appropriate, but the signal can easily be lost if only global counters are available. In the long term, we hope to introduce per-cgroup control of promotion and demotion actions to implement memory placement policies in tiering. This patch set adds seven counters to memory.stat in a cgroup: numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd, pgdemote_khugepaged, pgdemote_direct and pgpromote_success. pgdemote_* and pgpromote_success are also available in memory.numa_stat. count_memcg_events_mm() is added to count multiple event occurrences at once, and get_mem_cgroup_from_folio() is added because we need to get a reference to the memcg of a folio before it's migrated to track numa_pages_migrated. The accounting of PGDEMOTE_* is moved to shrink_inactive_list() before being changed to per-cgroup. [kaiyang2@cs.cmu.edu: add documentation of the memcg counters in cgroup-v2.rst] Link: https://lkml.kernel.org/r/20240814235122.252309-1-kaiyang2@cs.cmu.edu Link: https://lkml.kernel.org/r/20240814174227.30639-1-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
727d50a7e0 |
mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To reduce redundancy, move common code to numa_migrate_prep() and rename the function to numa_migrate_check() to reflect its functionality. Now do_huge_pmd_numa_page() also checks shared folios to set TNF_SHARED flag. Link: https://lkml.kernel.org/r/20240809145906.1513458-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
94dc8bffd8 |
mm: return the folio from swapin_readahead
The unuse_pte_range() caller only wants the folio while do_swap_page() wants both the page and the folio. Since do_swap_page() already has logic for handling both the folio and the page, move the folio-to-page logic there. This also lets us allocate larger folios in the SWP_SYNCHRONOUS_IO path in future. Link: https://lkml.kernel.org/r/20240807193734.1865400-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
17fe833b0d |
mm: fix (harmless) type confusion in lock_vma_under_rcu()
There is a (harmless) type confusion in lock_vma_under_rcu(): After
vma_start_read(), we have taken the VMA lock but don't know yet whether
the VMA has already been detached and scheduled for RCU freeing. At this
point, ->vm_start and ->vm_end are accessed.
vm_area_struct contains a union such that ->vm_rcu uses the same memory as
->vm_start and ->vm_end; so accessing ->vm_start and ->vm_end of a
detached VMA is illegal and leads to type confusion between union members.
Fix it by reordering the vma->detached check above the address checks, and
document the rules for RCU readers accessing VMAs.
This will probably change the number of observed VMA_LOCK_MISS events
(since previously, trying to access a detached VMA whose ->vm_rcu has been
scheduled would bail out when checking the fault address against the
rcu_head members reinterpreted as VMA bounds).
Link: https://lkml.kernel.org/r/20240805-fix-vma-lock-type-confusion-v1-1-9f25443a9a71@google.com
Fixes:
|
||
![]() |
3523a37e65 |
mm: provide vm_normal_(page|folio)_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES
Patch series "mm: replace follow_page() by folio_walk". Looking into a way of moving the last folio_likely_mapped_shared() call in add_folio_for_migration() under the PTL, I found myself removing follow_page(). This paves the way for cleaning up all the FOLL_, follow_* terminology to just be called "GUP" nowadays. The new page table walker will lookup a mapped folio and return to the caller with the PTL held, such that the folio cannot get unmapped concurrently. Callers can then conditionally decide whether they really want to take a short-term folio reference or whether the can simply unlock the PTL and be done with it. folio_walk is similar to page_vma_mapped_walk(), except that we don't know the folio we want to walk to and that we are only walking to exactly one PTE/PMD/PUD. folio_walk provides access to the pte/pmd/pud (and the referenced folio page because things like KSM need that), however, as part of this series no page table modifications are performed by users. We might be able to convert some other walk_page_range() users that really only walk to one address, such as DAMON with damon_mkold_ops/damon_young_ops. It might make sense to extend folio_walk in the future to optionally fault in a folio (if applicable), such that we can replace some get_user_pages() users that really only want to lookup a single page/folio under PTL without unconditionally grabbing a folio reference. I have plans to extend the approach to a range walker that will try batching various page table entries (not just folio pages) to be a better replace for walk_page_range() -- and users will be able to opt in which type of page table entries they want to process -- but that will require more work and more thoughts. KSM seems to work just fine (ksm_functional_tests selftests) and move_pages seems to work (migration selftest). I tested the leaf implementation excessively using various hugetlb sizes (64K, 2M, 32M, 1G) on arm64 using move_pages and did some more testing on x86-64. Cross compiled on a bunch of architectures. This patch (of 11): We want to make use of vm_normal_page_pmd() in generic page table walking code where we might walk hugetlb folios that are mapped by PMDs even without CONFIG_TRANSPARENT_HUGEPAGE. So let's expose vm_normal_page_pmd() + vm_normal_folio_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES. Link: https://lkml.kernel.org/r/20240802155524.517137-1-david@redhat.com Link: https://lkml.kernel.org/r/20240802155524.517137-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
9f101bef40 |
mm: swap: add nr argument in swapcache_prepare and swapcache_clear to support large folios
Right now, swapcache_prepare() and swapcache_clear() supports one entry only, to support large folios, we need to handle multiple swap entries. To optimize stack usage, we iterate twice in __swap_duplicate(): the first time to verify that all entries are valid, and the second time to apply the modifications to the entries. Currently, we're using nr=1 for the existing users. [v-songbaohua@oppo.com: clarify swap_count_continued and improve readability for __swap_duplicate] Link: https://lkml.kernel.org/r/20240802071817.47081-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240730071339.107447-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
394290cba9 |
mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PTE_PTLOCKS into Kconfig options
Patch series "mm: split PTE/PMD PT table Kconfig cleanups+clarifications". This series is a follow up to the fixes: "[PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking" When working on the fixes, I wondered why 8xx is fine (-> never uses split PT locks) and how PT locking even works properly with PMD page table sharing (-> always requires split PMD PT locks). Let's improve the split PT lock detection, make hugetlb properly depend on it and make 8xx bail out if it would ever get enabled by accident. As an alternative to patch #3 we could extend the Kconfig SPLIT_PTE_PTLOCKS option from patch #2 -- but enforcing it closer to the code that actually implements it feels a bit nicer for documentation purposes, and there is no need to actually disable it because it should always be disabled (!SMP). Did a bunch of cross-compilations to make sure that split PTE/PMD PT locks are still getting used where we would expect them. [1] https://lkml.kernel.org/r/20240725183955.2268884-1-david@redhat.com This patch (of 3): Let's clean that up a bit and prepare for depending on CONFIG_SPLIT_PMD_PTLOCKS in other Kconfig options. More cleanups would be reasonable (like the arch-specific "depends on" for CONFIG_SPLIT_PTE_PTLOCKS), but we'll leave that for another day. Link: https://lkml.kernel.org/r/20240726150728.3159964-1-david@redhat.com Link: https://lkml.kernel.org/r/20240726150728.3159964-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
2a28713a67 |
memory tiering: introduce folio_use_access_time() check
If memory tiering mode is on and a folio is not in the top tier memory, folio's cpupid field is repurposed to store page access time. Instead of an open coded check, use a function to encapsulate the check. Link: https://lkml.kernel.org/r/20240724130115.793641-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
40b760cfd4 |
mm/numa: no task_numa_fault() call if PTE is changed
When handling a numa page fault, task_numa_fault() should be called by a process that restores the page table of the faulted folio to avoid duplicated stats counting. Commit |
||
![]() |
4cd7ba16a0 |
mm: fix old/young bit handling in the faulting path
Commit |
||
![]() |
7a3fad30fd |
Random number generator updates for Linux 6.11-rc1.
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmaarzgACgkQSfxwEqXe A66ZWBAAlhXx8bve0uKlDRK8fffWHgruho/fOY4lZJ137AKwA9JCtmOyqdfL4Dmk VxFe7pEQJlQhcA/6kH54uO7SBXwfKlKZJth6SYnaCRMUIbFifHjjIQ0QqldjEKi0 rP90Hu4FVsbwQC7u9i9lQj9n2P36zb6pn83BzpZQ/2PtoVCSCrdSJUe0Rxa3H3GN 0+nNkDSXQt5otCByLaeE3x7KJgXLWL9+G2eFSFLTZ8rSVfMx1CdOIAG37WlLGdWm BaFYPDKMyBTVvVJBNgAe9YSqtrsZ5nlmLz+Z9wAe/hTL7RlL03kWUu34/Udcpull zzMDH0WMntiGK3eFQ2gOYSWqypvAjwHgn3BzqNmjUb69+89mZsdU1slcvnxWsUwU D3vphrscaqarF629tfsXti3jc5PoXwUTjROZVcCyeFPBhyAZgzK8xUvPpJO+RT+K EuUABob9cpA6FCpW/QeolDmMDhXlNT8QgsZu1juokZac2xP3Ly3REyEvT7HLbU2W ZJjbEqm1ppp3RmGELUOJbyhwsLrnbt+OMDO7iEWoG8aSFK4diBK/ZM6WvLMkr8Oi 7ioXGIsYkCy3c47wpZKTrAapOPJp5keqNAiHSEbXw8mozp6429QAEZxNOcczgHKC Ea2JzRkctqutcIT+Slw/uUe//i1iSsIHXbE81fp5udcQTJcUByo= =P8aI -----END PGP SIGNATURE----- Merge tag 'random-6.11-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: "This adds getrandom() support to the vDSO. First, it adds a new kind of mapping to mmap(2), MAP_DROPPABLE, which lets the kernel zero out pages anytime under memory pressure, which enables allocating memory that never gets swapped to disk but also doesn't count as being mlocked. Then, the vDSO implementation of getrandom() is introduced in a generic manner and hooked into random.c. Next, this is implemented on x86. (Also, though it's not ready for this pull, somebody has begun an arm64 implementation already) Finally, two vDSO selftests are added. There are also two housekeeping cleanup commits" * tag 'random-6.11-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: MAINTAINERS: add random.h headers to RNG subsection random: note that RNDGETPOOL was removed in 2.6.9-rc2 selftests/vDSO: add tests for vgetrandom x86: vdso: Wire up getrandom() vDSO implementation random: introduce generic vDSO getrandom() implementation mm: add MAP_DROPPABLE for designating always lazily freeable mappings |
||
![]() |
9651fcedf7 |
mm: add MAP_DROPPABLE for designating always lazily freeable mappings
The vDSO getrandom() implementation works with a buffer allocated with a new system call that has certain requirements: - It shouldn't be written to core dumps. * Easy: VM_DONTDUMP. - It should be zeroed on fork. * Easy: VM_WIPEONFORK. - It shouldn't be written to swap. * Uh-oh: mlock is rlimited. * Uh-oh: mlock isn't inherited by forks. - It shouldn't reserve actual memory, but it also shouldn't crash when page faulting in memory if none is available * Uh-oh: VM_NORESERVE means segfaults. It turns out that the vDSO getrandom() function has three really nice characteristics that we can exploit to solve this problem: 1) Due to being wiped during fork(), the vDSO code is already robust to having the contents of the pages it reads zeroed out midway through the function's execution. 2) In the absolute worst case of whatever contingency we're coding for, we have the option to fallback to the getrandom() syscall, and everything is fine. 3) The buffers the function uses are only ever useful for a maximum of 60 seconds -- a sort of cache, rather than a long term allocation. These characteristics mean that we can introduce VM_DROPPABLE, which has the following semantics: a) It never is written out to swap. b) Under memory pressure, mm can just drop the pages (so that they're zero when read back again). c) It is inherited by fork. d) It doesn't count against the mlock budget, since nothing is locked. e) If there's not enough memory to service a page fault, it's not fatal, and no signal is sent. This way, allocations used by vDSO getrandom() can use: VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE And there will be no problem with OOMing, crashing on overcommitment, using memory when not in use, not wiping on fork(), coredumps, or writing out to swap. In order to let vDSO getrandom() use this, expose these via mmap(2) as MAP_DROPPABLE. Note that this involves removing the MADV_FREE special case from sort_folio(), which according to Yu Zhao is unnecessary and will simply result in an extra call to shrink_folio_list() in the worst case. The chunk removed reenables the swapbacked flag, which we don't want for VM_DROPPABLE, and we can't conditionalize it here because there isn't a vma reference available. Finally, the provided self test ensures that this is working as desired. Cc: linux-mm@kvack.org Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
||
![]() |
cd1e0dac3a |
mm: unexport vmf_insert_mixed_mkwrite
vmf_insert_mixed_mkwrite is only used by the built-in DAX code. Link: https://lkml.kernel.org/r/20240702072327.1640911-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
28bdacbcb3 |
mm: move memory_failure_queue() into copy_mc_[user]_highpage()
Patch series "mm: migrate: support poison recover from migrate folio", v5. The folio migration is widely used in kernel, memory compaction, memory hotplug, soft offline page, numa balance, memory demote/promotion, etc, but once access a poisoned source folio when migrating, the kernel will panic. There is a mechanism in the kernel to recover from uncorrectable memory errors, ARCH_HAS_COPY_MC(eg, Machine Check Safe Memory Copy on x86), which is already used in NVDIMM or core-mm paths(eg, CoW, khugepaged, coredump, ksm copy), see copy_mc_to_{user,kernel}, copy_mc_{user_}highpage callers. This series of patches provide the recovery mechanism from folio copy for the widely used folio migration. Please note, because folio migration is no guarantee of success, so we could chose to make folio migration tolerant of memory failures, adding folio_mc_copy() which is a #MC versions of folio_copy(), once accessing a poisoned source folio, we could return error and make the folio migration fail, and this could avoid the similar panic shown below. CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0 pc : copy_page+0x10/0xc0 lr : copy_highpage+0x38/0x50 ... Call trace: copy_page+0x10/0xc0 folio_copy+0x78/0x90 migrate_folio_extra+0x54/0xa0 move_to_new_folio+0xd8/0x1f0 migrate_folio_move+0xb8/0x300 migrate_pages_batch+0x528/0x788 migrate_pages_sync+0x8c/0x258 migrate_pages+0x440/0x528 soft_offline_in_use_page+0x2ec/0x3c0 soft_offline_page+0x238/0x310 soft_offline_page_store+0x6c/0xc0 dev_attr_store+0x20/0x40 sysfs_kf_write+0x4c/0x68 kernfs_fop_write_iter+0x130/0x1c8 new_sync_write+0xa4/0x138 vfs_write+0x238/0x2d8 ksys_write+0x74/0x110 This patch (of 5): There is a memory_failure_queue() call after copy_mc_[user]_highpage(), see callers, eg, CoW/KSM page copy, it is used to mark the source page as h/w poisoned and unmap it from other tasks, and the upcomming poison recover from migrate folio will do the similar thing, so let's move the memory_failure_queue() into the copy_mc_[user]_highpage() instead of adding it into each user, this should also enhance the handling of poisoned page in khugepaged. Link: https://lkml.kernel.org/r/20240626085328.608006-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240626085328.608006-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
ee86814b05 |
mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
Currently we always take a folio reference even if migration will not even be tried or isolation failed, requiring us to grab+drop an additional reference. Further, we end up calling folio_likely_mapped_shared() while the folio might have already been unmapped, because after we dropped the PTL, that can easily happen. We want to stop touching mapcounts and friends from such context, and only call folio_likely_mapped_shared() while the folio is still mapped: mapcount information is pretty much stale and unreliable otherwise. So let's move checks into numamigrate_isolate_folio(), rename that function to migrate_misplaced_folio_prepare(), and call that function from callsites where we call migrate_misplaced_folio(), but still with the PTL held. We can now stop taking temporary folio references, and really only take a reference if folio isolation succeeded. Doing the folio_likely_mapped_shared() + folio isolation under PT lock is now similar to how we handle MADV_PAGEOUT. While at it, combine the folio_is_file_lru() checks. [david@redhat.com: fix list_del() corruption] Link: https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4b69@redhat.com Link: https://lkml.kernel.org/r/20240626191129.658CFC32782@smtp.kernel.org Link: https://lkml.kernel.org/r/20240620212935.656243-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Tested-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
4b88c23ab8 |
mm/migrate: make migrate_misplaced_folio() return 0 on success
Patch series "mm/migrate: move NUMA hinting fault folio isolation + checks under PTL". Let's just return 0 on success, which is less confusing. ... especially because we got it wrong in the migrate.h stub where we have "return -EAGAIN; /* can't migrate now */" instead of "return 0;". Likely this wrong return value doesn't currently matter, but it certainly adds confusion. We'll add migrate_misplaced_folio_prepare() next, where we want to use the same "return 0 on success" approach, so let's just clean this up. Link: https://lkml.kernel.org/r/20240620212935.656243-1-david@redhat.com Link: https://lkml.kernel.org/r/20240620212935.656243-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
2f9f085436 |
mm: memory: rename pages_per_huge_page to nr_pages
Since the callers are converted to use nr_pages naming, use it inside too. Link: https://lkml.kernel.org/r/20240618091242.2140164-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
530dd9926d |
mm: memory: improve copy_user_large_folio()
Use nr_pages instead of pages_per_huge_page and move the address alignment from copy_user_large_folio() into the callers since it is only needed when we don't know which address will be accessed. Link: https://lkml.kernel.org/r/20240618091242.2140164-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
5132633ee7 |
mm: memory: use folio in struct copy_subpage_arg
Directly use folio in struct copy_subpage_arg. Link: https://lkml.kernel.org/r/20240618091242.2140164-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
78fefd04c1 |
mm: memory: convert clear_huge_page() to folio_zero_user()
Patch series "mm: improve clear and copy user folio", v2. Some folio conversions. An improvement is to move address alignment into the caller as it is only needed if we don't know which address will be accessed when clearing/copying user folios. This patch (of 4): Replace clear_huge_page() with folio_zero_user(), and take a folio instead of a page. Directly get number of pages by folio_nr_pages() to remove pages_per_huge_page argument, furthermore, move the address alignment from folio_zero_user() to the callers since the alignment is only needed when we don't know which address will be accessed. Link: https://lkml.kernel.org/r/20240618091242.2140164-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240618091242.2140164-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
9ae2feaced |
mm: use folio_add_new_anon_rmap() if folio_test_anon(folio)==false
For the !folio_test_anon(folio) case, we can now invoke folio_add_new_anon_rmap() with the rmap flags set to either EXCLUSIVE or non-EXCLUSIVE. This action will suppress the VM_WARN_ON_FOLIO check within __folio_add_anon_rmap() while initiating the process of bringing up mTHP swapin. static __always_inline void __folio_add_anon_rmap(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *vma, unsigned long address, rmap_t flags, enum rmap_level level) { ... if (unlikely(!folio_test_anon(folio))) { VM_WARN_ON_FOLIO(folio_test_large(folio) && level != RMAP_LEVEL_PMD, folio); } ... } It also improves the code's readability. Currently, all new anonymous folios calling folio_add_anon_rmap_ptes() are order-0. This ensures that new folios cannot be partially exclusive; they are either entirely exclusive or entirely shared. A useful comment from Hugh's fix: : Commit "mm: use folio_add_new_anon_rmap() if folio_test_anon(folio)== : false" has extended folio_add_new_anon_rmap() to use on non-exclusive : folios, already visible to others in swap cache and on LRU. : : That renders its non-atomic __folio_set_swapbacked() unsafe: it risks : overwriting concurrent atomic operations on folio->flags, losing bits : added or restoring bits cleared. Since it's only used in this risky way : when folio_test_locked and !folio_test_anon, many such races are excluded; : but, for example, isolations by folio_test_clear_lru() are vulnerable, and : setting or clearing active. : : It could just use the atomic folio_set_swapbacked(); but this function : does try to avoid atomics where it can, so use a branch instead: just : avoid setting swapbacked when it is already set, that is good enough. : (Swapbacked is normally stable once set: lazyfree can undo it, but only : later, when found anon in a page table.) : : This fixes a lot of instability under compaction and swapping loads: : assorted "Bad page"s, VM_BUG_ON_FOLIO()s, apparently even page double : frees - though I've not worked out what races could lead to the latter. [akpm@linux-foundation.org: comment fixes, per David and akpm] [v-songbaohua@oppo.com: lock the folio to avoid race] Link: https://lkml.kernel.org/r/20240622032002.53033-1-21cnbao@gmail.com [hughd@google.com: folio_add_new_anon_rmap() careful __folio_set_swapbacked()] Link: https://lkml.kernel.org/r/f3599b1d-8323-0dc5-e9e0-fdb3cfc3dd5a@google.com Link: https://lkml.kernel.org/r/20240617231137.80726-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Hugh Dickins <hughd@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Tested-by: Shuai Yuan <yuanshuai@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
15bde4abab |
mm: extend rmap flags arguments for folio_add_new_anon_rmap
Patch series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()", v2. This patchset is preparatory work for mTHP swapin. folio_add_new_anon_rmap() assumes that new anon rmaps are always exclusive. However, this assumption doesn’t hold true for cases like do_swap_page(), where a new anon might be added to the swapcache and is not necessarily exclusive. The patchset extends the rmap flags to allow folio_add_new_anon_rmap() to handle both exclusive and non-exclusive new anon folios. The do_swap_page() function is updated to use this extended API with rmap flags. Consequently, all new anon folios now consistently use folio_add_new_anon_rmap(). The special case for !folio_test_anon() in __folio_add_anon_rmap() can be safely removed. In conclusion, new anon folios always use folio_add_new_anon_rmap(), regardless of exclusivity. Old anon folios continue to use __folio_add_anon_rmap() via folio_add_anon_rmap_pmd() and folio_add_anon_rmap_ptes(). This patch (of 3): In the case of a swap-in, a new anonymous folio is not necessarily exclusive. This patch updates the rmap flags to allow a new anonymous folio to be treated as either exclusive or non-exclusive. To maintain the existing behavior, we always use EXCLUSIVE as the default setting. [akpm@linux-foundation.org: cleanup and constifications per David and akpm] [v-songbaohua@oppo.com: fix missing doc for flags of folio_add_new_anon_rmap()] Link: https://lkml.kernel.org/r/20240619210641.62542-1-21cnbao@gmail.com [v-songbaohua@oppo.com: enhance doc for extend rmap flags arguments for folio_add_new_anon_rmap] Link: https://lkml.kernel.org/r/20240622030256.43775-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240617231137.80726-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240617231137.80726-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: David Hildenbrand <david@redhat.com> Tested-by: Shuai Yuan <yuanshuai@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
20dfa5b7ad |
mm: set pte writable while pte_soft_dirty() is true in do_swap_page()
This patch leverages the new pte_needs_soft_dirty_wp() helper to optimize a scenario where softdirty is enabled, but the softdirty flag has already been set in do_swap_page(). In this situation, we can use pte_mkwrite instead of applying write-protection since we don't depend on write faults. Link: https://lkml.kernel.org/r/20240607211358.4660-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
b2d1f38b52 |
mm: swap: remove 'synchronous' argument to swap_read_folio()
Commit [1] introduced IO polling support duding swapin to reduce swap read latency for block devices that can be polled. However later commit [2] removed polling support. Commit [3] removed the remnants of polling support from read_swap_cache_async() and __read_swap_cache_async(). However, it left behind some remnants in swap_read_folio(), the 'synchronous' argument. swap_read_folio() reads the folio synchronously if synchronous=true or if SWP_SYNCHRONOUS_IO is set in swap_info_struct. The only caller that passes synchronous=true is in do_swap_page() in the SWP_SYNCHRONOUS_IO case. Hence, the argument is redundant, it is only set to true when the swap read would have been synchronous anyway. Remove it. [1] Commit |
||
![]() |
c18160dba5 |
mm: swap: reuse exclusive folio directly instead of wp page faults
After swapping out, we perform a swap-in operation. If we first read and then write, we encounter a major fault in do_swap_page for reading, along with additional minor faults in do_wp_page for writing. However, the latter appears to be unnecessary and inefficient. Instead, we can directly reuse in do_swap_page and completely eliminate the need for do_wp_page. This patch achieves that optimization specifically for exclusive folios. The following microbenchmark demonstrates the significant reduction in minor faults. #define DATA_SIZE (2UL * 1024 * 1024) #define PAGE_SIZE (4UL * 1024) static void *read_write_data(char *addr) { char tmp; for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) { tmp = *(volatile char *)(addr + i); *(volatile char *)(addr + i) = tmp; } } int main(int argc, char **argv) { struct rusage ru; char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); memset(addr, 0x11, DATA_SIZE); do { long old_ru_minflt, old_ru_majflt; long new_ru_minflt, new_ru_majflt; madvise(addr, DATA_SIZE, MADV_PAGEOUT); getrusage(RUSAGE_SELF, &ru); old_ru_minflt = ru.ru_minflt; old_ru_majflt = ru.ru_majflt; read_write_data(addr); getrusage(RUSAGE_SELF, &ru); new_ru_minflt = ru.ru_minflt; new_ru_majflt = ru.ru_majflt; printf("minor faults:%ld major faults:%ld\n", new_ru_minflt - old_ru_minflt, new_ru_majflt - old_ru_majflt); } while(0); return 0; } w/o patch, / # ~/a.out minor faults:512 major faults:512 w/ patch, / # ~/a.out minor faults:0 major faults:512 Minor faults decrease to 0! Link: https://lkml.kernel.org/r/20240602004502.26895-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
43e027e414 |
mm: memory: extend finish_fault() to support large folio
Patch series "add mTHP support for anonymous shmem", v5.
Anonymous pages have already been supported for multi-size (mTHP)
allocation through commit
|
||
![]() |
508758960b |
mm: swap: entirely map large folios found in swapcache
When a large folio is found in the swapcache, the current implementation requires calling do_swap_page() nr_pages times, resulting in nr_pages page faults. This patch opts to map the entire large folio at once to minimize page faults. Additionally, redundant checks and early exits for ARM64 MTE restoring are removed. Link: https://lkml.kernel.org/r/20240529082824.150954-7-21cnbao@gmail.com Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
4c3f966436 |
mm: swap: make should_try_to_free_swap() support large-folio
The function should_try_to_free_swap() operates under the assumption that swap-in always occurs at the normal page granularity, i.e., folio_nr_pages() = 1. However, in reality, for large folios, add_to_swap_cache() will invoke folio_ref_add(folio, nr). To accommodate large folio swap-in, this patch eliminates this assumption. Link: https://lkml.kernel.org/r/20240529082824.150954-6-21cnbao@gmail.com Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
29f252cdc2 |
mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages
Should do_swap_page() have the capability to directly map a large folio, metadata restoration becomes necessary for a specified number of pages denoted as nr. It's important to highlight that metadata restoration is solely required by the SPARC platform, which, however, does not enable THP_SWAP. Consequently, in the present kernel configuration, there exists no practical scenario where users necessitate the restoration of nr metadata. Platforms implementing THP_SWAP might invoke this function with nr values exceeding 1, subsequent to do_swap_page() successfully mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr() functions remain empty. Link: https://lkml.kernel.org/r/20240529082824.150954-5-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
3577dbb192 |
mm: batch unlink_file_vma calls in free_pgd_range
Execs of dynamically linked binaries at 20-ish cores are bottlenecked on the i_mmap_rwsem semaphore, while the biggest singular contributor is free_pgd_range inducing the lock acquire back-to-back for all consecutive mappings of a given file. Tracing the count of said acquires while building the kernel shows: [1, 2) 799579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [2, 3) 0 | | [3, 4) 3009 | | [4, 5) 3009 | | [5, 6) 326442 |@@@@@@@@@@@@@@@@@@@@@ | So in particular there were 326442 opportunities to coalesce 5 acquires into 1. Doing so increases execs per second by 4% (~50k to ~52k) when running the benchmark linked below. The lock remains the main bottleneck, I have not looked at other spots yet. Bench can be found here: http://apollo.backplane.com/DFlyMisc/doexec.c $ cc -O2 -o shared-doexec doexec.c $ ./shared-doexec $(nproc) Note this particular test makes sure binaries are separate, but the loader is shared. Stats collected on the patched kernel (+ "noinline") with: bpftrace -e 'kprobe:unlink_file_vma_batch_process { @ = lhist(((struct unlink_vma_file_batch *)arg0)->count, 0, 8, 1); }' Link: https://lkml.kernel.org/r/20240521234321.359501-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
6faa49d1c4 |
mm: use update_mmu_tlb_range() to simplify code
Let us simplify the code by update_mmu_tlb_range(). Link: https://lkml.kernel.org/r/20240522061204.117421-4-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
fce831c920 |
mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()
For now we only get the (small) zeropage mapped to user space in four cases (excluding VM_PFNMAP mappings, such as /proc/vmstat): (1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON): do_anonymous_page() will not refcount it and map it pte_mkspecial() (2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem (MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and map it pte_mkspecial(). (3) KSM in mergeable VMA (anonymous VMA or COW mapping). cmp_and_merge_page() will not refcount it and map it pte_mkspecial(). (4) FSDAX as an optimization for holes. vmf_insert_mixed()->__vm_insert_mixed() might end up calling insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the zeropage and not mapping it pte_mkspecial(). With CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will not refcount it and map it pte_mkspecial(). In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets VM_MIXEDMAP, we removed it for ext4 fsdax in commit |
||
![]() |
11b914ee9e |
mm/memory: move page_count() check into validate_page_before_insert()
Patch series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()", v2. There is interest in mapping zeropages via vm_insert_pages() [1] into MAP_SHARED mappings. For now, we only get zeropages in MAP_SHARED mappings via vmf_insert_mixed() from FSDAX code, and I think it's a bit shaky in some cases because we refcount the zeropage when mapping it but not necessarily always when unmapping it ... and we should actually never refcount it. It's all a bit tricky, especially how zeropages in MAP_SHARED mappings interact with GUP (FOLL_LONGTERM), mprotect(), write-faults and s390x forbidding the shared zeropage (rewrite [2] s now upstream). This series tries to take the careful approach of only allowing the zeropage where it is likely safe to use (which should cover the existing FSDAX use case and [1]), preventing that it could accidentally get mapped writable during a write fault, mprotect() etc, and preventing issues with FOLL_LONGTERM in the future with other users. Tested with a patch from Vincent that uses the zeropage in context of [1]. [1] https://lkml.kernel.org/r/20240430111354.637356-1-vdonnefort@google.com [2] https://lkml.kernel.org/r/20240411161441.910170-1-david@redhat.com This patch (of 3): We'll now also cover the case where insert_page() is called from __vm_insert_mixed(), which sounds like the right thing to do. Link: https://lkml.kernel.org/r/20240522125713.775114-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
0ba5e806e1 |
mm/vmscan: update stale references to shrink_page_list
Commit
|
||
![]() |
ab1ffc86cb |
mm/memory: don't require head page for do_set_pmd()
The requirement that the head page be passed to do_set_pmd() was added in commit |
||
![]() |
cfdd12b482 |
mm: fix possible OOB in numa_rebuild_large_mapping()
The large folio is mapped with folio size(not greater PMD_SIZE) aligned
virtual address during the pagefault, ie, 'addr = ALIGN_DOWN(vmf->address,
nr_pages * PAGE_SIZE)' in do_anonymous_page(). But after the mremap(),
the virtual address only requires PAGE_SIZE alignment. Also pte is moved
to new in move_page_tables(), then traversal of the new pte in the
numa_rebuild_large_mapping() could hit the following issue,
Unable to handle kernel paging request at virtual address 00000a80c021a788
Mem abort info:
ESR = 0x0000000096000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
Data abort info:
ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=00002040341a6000
[00000a80c021a788] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 0000000096000004 [#1] SMP
...
CPU: 76 PID: 15187 Comm: git Kdump: loaded Tainted: G W 6.10.0-rc2+ #209
Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 1.79 08/21/2021
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : numa_rebuild_large_mapping+0x338/0x638
lr : numa_rebuild_large_mapping+0x320/0x638
sp : ffff8000b41c3b00
x29: ffff8000b41c3b30 x28: ffff8000812a0000 x27: 00000000000a8000
x26: 00000000000000a8 x25: 0010000000000001 x24: ffff20401c7170f0
x23: 0000ffff33a1e000 x22: 0000ffff33a76000 x21: ffff20400869eca0
x20: 0000ffff33976000 x19: 00000000000000a8 x18: ffffffffffffffff
x17: 0000000000000000 x16: 0000000000000020 x15: ffff8000b41c36a8
x14: 0000000000000000 x13: 205d373831353154 x12: 5b5d333331363732
x11: 000000000011ff78 x10: 000000000011ff10 x9 : ffff800080273f30
x8 : 000000320400869e x7 : c0000000ffffd87f x6 : 00000000001e6ba8
x5 : ffff206f3fb5af88 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : fffffdffc0000000 x0 : 00000a80c021a780
Call trace:
numa_rebuild_large_mapping+0x338/0x638
do_numa_page+0x3e4/0x4e0
handle_pte_fault+0x1bc/0x238
__handle_mm_fault+0x20c/0x400
handle_mm_fault+0xa8/0x288
do_page_fault+0x124/0x498
do_translation_fault+0x54/0x80
do_mem_abort+0x4c/0xa8
el0_da+0x40/0x110
el0t_64_sync_handler+0xe4/0x158
el0t_64_sync+0x188/0x190
Fix it by making the start and end not only within the vma range, but also
within the page table range.
Link: https://lkml.kernel.org/r/20240612122822.4033433-1-wangkefeng.wang@huawei.com
Fixes:
|
||
![]() |
384a746bb5 |
Revert "mm: init_mlocked_on_free_v3"
There was insufficient review and no agreement that this is the right
approach.
There are serious flaws with the implementation that make processes using
mlock() not even work with simple fork() [1] and we get reliable crashes
when rebooting.
Further, simply because we might be unmapping a single PTE of a large
mlocked folio, we shouldn't zero out the whole folio.
... especially because the code can also *corrupt* urelated memory because
kernel_init_pages(page, folio_nr_pages(folio));
Could end up writing outside of the actual folio if we work with a tail
page.
Let's revert it. Once there is agreement that this is the right approach,
the issues were fixed and there was reasonable review and proper testing,
we can consider it again.
[1] https://lkml.kernel.org/r/4da9da2f-73e4-45fd-b62f-a8a513314057@redhat.com
Link: https://lkml.kernel.org/r/20240605091710.38961-1-david@redhat.com
Fixes:
|
||
![]() |
de7e71ef8b |
mm: simplify and improve print_vma_addr() output
Use '%pD' to print out the filename, and print out the actual offset within the file too, rather than just what the virtual address of the mapping is (which doesn't tell you anything about any mapping offsets). Also, use the exact vma_lookup() instead of find_vma() - the latter looks up any vma _after_ the address, which is of questionable value (yes, maybe you fell off the beginning, but you'd be more likely to fall off the end). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
61307b7be4 |
The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB nvA4E0DcPrUAFy144FNM0NTCb7u9vAw= =V3R/ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ... |
||
![]() |
737019cf6a |
mm: optimise vmf_anon_prepare() for VMAs without an anon_vma
If the mmap_lock can be taken for read, we can call __anon_vma_prepare() while holding it, saving ourselves a trip back through the fault handler. Link: https://lkml.kernel.org/r/20240426144506.1290619-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
a373baed5a |
mm: delay the check for a NULL anon_vma
Instead of checking the anon_vma early in the fault path where all page faults pay the cost, delay it until we know we're going to need the anon_vma to be filled in. This will have a slight negative effect on the first fault in an anonymous VMA, but it shortens every other page fault. It also makes the code slightly cleaner as the anon and file backed fault handling look more similar. The Intel kernel test bot reports a 3x improvement in vm-scalability throughput with the small-allocs-mt test. This is clearly an extreme situation that won't be replicated in any real-world workload, but it's a nice win. https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/ Link: https://lkml.kernel.org/r/20240426144506.1290619-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
e0ffb29bc5 |
mm: simplify thp_vma_allowable_order
Combine the three boolean arguments into one flags argument for readability. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
6ed31ba392 |
mm: memory: check userfaultfd_wp() in vmf_orig_pte_uffd_wp()
Add userfaultfd_wp() check in vmf_orig_pte_uffd_wp() to avoid the unnecessary FAULT_FLAG_ORIG_PTE_VALID check/pte_marker_entry_uffd_wp() in most pagefault, note, the function vmf_orig_pte_uffd_wp() is not inlined in the two kernel versions, the difference is shown below, perf date, perf report -i perf.data.before | grep vmf 0.17% 0.13% lat_pagefault [kernel.kallsyms] [k] vmf_orig_pte_uffd_wp.part.0.isra.0 perf report -i perf.data.after | grep vmf lat_pagefault -W 5 -N 5 /tmp/XXX latency before after diff average(8 tests) 0.262675 0.2600375 -0.0026375 Although it's a small, but the uffd_wp is a new feature than previous kernel, when the vma is not registered with UFFD_WP, let's avoid to execute the new logical, also adding __always_inline attribute to vmf_orig_pte_uffd_wp(), which make set_pte_range() only check VM_UFFD_WP flags without the function call. In addition, directly call the vmf_orig_pte_uffd_wp() in do_anonymous_page() and set_pte_range() to save an uffd_wp variable. Link: https://lkml.kernel.org/r/20240422030039.3293568-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
96ebdb0320 |
mm/memory: add any_dirty optional pointer to folio_pte_batch()
This commit adds the any_dirty pointer as an optional parameter to folio_pte_batch() function. By using both the any_young and any_dirty pointers, madvise_free can make smarter decisions about whether to clear the PTEs when marking large folios as lazyfree. Link: https://lkml.kernel.org/r/20240418134435.6092-4-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
1f2d8b4421 |
mm: move mm counter updating out of set_pte_range()
Patch series "mm: batch mm counter updating in filemap_map_pages()", v3. Let's batch mm counter updating to accelerate filemap_map_pages(). This patch (of 2): In order to support batch mm counter updating in filemap_map_pages(), move mm counter updating out of set_pte_range(), the folios are file from filemap, and distinguish folios by vmf->flags and vma->vm_flags from another caller finish_fault(). Link: https://lkml.kernel.org/r/20240412064751.119015-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240412064751.119015-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
ec33687c67 |
mm: add per-order mTHP anon_fault_alloc and anon_fault_fallback counters
Patch series "mm: add per-order mTHP alloc and swpout counters", v6. The patchset introduces a framework to facilitate mTHP counters, starting with the allocation and swap-out counters. Currently, only four new nodes are appended to the stats directory for each mTHP size. /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats anon_fault_alloc anon_fault_fallback anon_fault_fallback_charge anon_swpout anon_swpout_fallback These nodes are crucial for us to monitor the fragmentation levels of both the buddy system and the swap partitions. In the future, we may consider adding additional nodes for further insights. This patch (of 4): Profiling a system blindly with mTHP has become challenging due to the lack of visibility into its operations. Presenting the success rate of mTHP allocations appears to be pressing need. Recently, I've been experiencing significant difficulty debugging performance improvements and regressions without these figures. It's crucial for us to understand the true effectiveness of mTHP in real-world scenarios, especially in systems with fragmented memory. This patch establishes the framework for per-order mTHP counters. It begins by introducing the anon_fault_alloc and anon_fault_fallback counters. Additionally, to maintain consistency with thp_fault_fallback_charge in /proc/vmstat, this patch also tracks anon_fault_fallback_charge when mem_cgroup_charge fails for mTHP. Incorporating additional counters should now be straightforward as well. Link: https://lkml.kernel.org/r/20240412114858.407208-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240412114858.407208-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
3aeea4fc83 |
mm/memory: use folio_mapcount() in zap_present_folio_ptes()
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. In zap_present_folio_ptes(), let's simply check the
folio mapcount(). If there is some issue, it will underflow at some point
either way when unmapping.
As indicated already in commit
|
||
![]() |
c5541ba378 |
mm: follow_pte() improvements
follow_pte() is now our main function to lookup PTEs in VM_PFNMAP/VM_IO VMAs. Let's perform some more sanity checks to make this exported function harder to abuse. Further, extend the doc a bit, it still focuses on the KVM use case with MMU notifiers. Drop the KVM+follow_pfn() comment, follow_pfn() is no more, and we have other users nowadays. Also extend the doc regarding refcounted pages and the interaction with MMU notifiers. KVM is one example that uses MMU notifiers and can deal with refcounted pages properly. VFIO is one example that doesn't use MMU notifiers, and to prevent use-after-free, rejects refcounted pages: pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)). Protection changes are less of a concern for users like VFIO: the behavior is similar to longterm-pinning a page, and getting the PTE protection changed afterwards. The primary concern with refcounted pages is use-after-free, which callers should be aware of. Link: https://lkml.kernel.org/r/20240410155527.474777-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
29ae7d96d1 |
mm: pass VMA instead of MM to follow_pte()
... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll now also perform these sanity checks for direct follow_pte() invocations. For generic_access_phys(), we might now check multiple times: nothing to worry about, really. Link: https://lkml.kernel.org/r/20240410155527.474777-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> [KVM] Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
3931b871c4 |
mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLD
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large folio that is fully and contiguously mapped in the pageout/cold vm range. This change means that large folios will be maintained all the way to swap storage. This both improves performance during swap-out, by eliding the cost of splitting the folio, and sets us up nicely for maintaining the large folio when it is swapped back in (to be covered in a separate series). Folios that are not fully mapped in the target range are still split, but note that behavior is changed so that if the split fails for any reason (folio locked, shared, etc) we now leave it as is and move to the next pte in the range and continue work on the proceeding folios. Previously any failure of this sort would cause the entire operation to give up and no folios mapped at higher addresses were paged out or made cold. Given large folios are becoming more common, this old behavior would have likely lead to wasted opportunities. While we are at it, change the code that clears young from the ptes to use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper function. This is more efficent than get_and_clear/modify/set, especially for contpte mappings on arm64, where the old approach would require unfolding/refolding and the new approach can be done in place. Link: https://lkml.kernel.org/r/20240408183946.2991168-8-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
a62fb92ac1 |
mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
Now that we no longer have a convenient flag in the cluster to determine if a folio is large, free_swap_and_cache() will take a reference and lock a large folio much more often, which could lead to contention and (e.g.) failure to split large folios, etc. Let's solve that problem by batch freeing swap and cache with a new function, free_swap_and_cache_nr(), to free a contiguous range of swap entries together. This allows us to first drop a reference to each swap slot before we try to release the cache folio. This means we only try to release the folio once, only taking the reference and lock once - much better than the previous 512 times for the 2M THP case. Contiguous swap entries are gathered in zap_pte_range() and madvise_free_pte_range() in a similar way to how present ptes are already gathered in zap_pte_range(). While we are at it, let's simplify by converting the return type of both functions to void. The return value was used only by zap_pte_range() to print a bad pte, and was ignored by everyone else, so the extra reporting wasn't exactly guaranteed. We will still get the warning with most of the information from get_swap_device(). With the batch version, we wouldn't know which pte was bad anyway so could print the wrong one. [ryan.roberts@arm.com: fix a build warning on parisc] Link: https://lkml.kernel.org/r/20240409111840.3173122-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
d2136d749d |
mm: support multi-size THP numa balancing
Now the anonymous page allocation already supports multi-size THP (mTHP), but the numa balancing still prohibits mTHP migration even though it is an exclusive mapping, which is unreasonable. Allow scanning mTHP: Commit |
||
![]() |
6b0ed7b3c7 |
mm: factor out the numa mapping rebuilding into a new helper
Patch series "support multi-size THP numa balancing", v2. This patchset tries to support mTHP numa balancing, as a simple solution to start, the NUMA balancing algorithm for mTHP will follow the THP strategy as the basic support. Please find details in each patch. This patch (of 2): To support large folio's numa balancing, factor out the numa mapping rebuilding into a new helper as a preparation. Link: https://lkml.kernel.org/r/cover.1712132950.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/cover.1711683069.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
68dbcf4899 |
mm: alloc_anon_folio: avoid doing vma_thp_gfp_mask in fallback cases
Fallback rates surpassing 90% have been observed on phones utilizing 64KiB CONT-PTE mTHP. In these scenarios, when one out of every 16 PTEs fails to allocate large folios, the remaining 15 PTEs fallback. Consequently, invoking vma_thp_gfp_mask seems redundant in such cases. Furthermore, abstaining from its use can also contribute to improved code readability. Link: https://lkml.kernel.org/r/20240329073750.20012-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Itaru Kitayama <itaru.kitayama@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
ba42b524a0 |
mm: init_mlocked_on_free_v3
Implements the "init_mlocked_on_free" boot option. When this boot option
is enabled, any mlock'ed pages are zeroed on free. If
the pages are munlock'ed beforehand, no initialization takes place.
This boot option is meant to combat the performance hit of
"init_on_free" as reported in commit
|
||
![]() |
239e9a90c8 |
mm: introduce vma_pgtable_walk_{begin|end}()
Introduce per-vma begin()/end() helpers for pgtable walks. This is a preparation work to merge hugetlb pgtable walkers with generic mm. The helpers need to be called before and after a pgtable walk, will start to be needed if the pgtable walker code supports hugetlb pages. It's a hook point for any type of VMA, but for now only hugetlb uses it to stablize the pgtable pages from getting away (due to possible pmd unsharing). Link: https://lkml.kernel.org/r/20240327152332.950956-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
5b34b76cb0 |
mm: move follow_phys to arch/x86/mm/pat/memtype.c
follow_phys is only used by two callers in arch/x86/mm/pat/memtype.c. Move it there and hardcode the two arguments that get the same values passed by both callers. [david@redhat.com: conflict resolutions] Link: https://lkml.kernel.org/r/20240403212131.929421-4-david@redhat.com Link: https://lkml.kernel.org/r/20240324234542.2038726-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fei Li <fei1.li@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
cb10c28ac8 |
mm: remove follow_pfn
Remove follow_pfn now that the last user is gone. Link: https://lkml.kernel.org/r/20240324234542.2038726-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fei Li <fei1.li@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
ebb34f78d7 |
mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()
Callers of folio_estimated_sharers() only care about "mapped shared vs. mapped exclusively", not the exact estimate of sharers. Let's consolidate and unify the condition users are checking. While at it clarify the semantics and extend the discussion on the fuzziness. Use the "likely mapped shared" terminology to better express what the (adjusted) function actually checks. Whether a partially-mappable folio is more likely to not be partially mapped than partially mapped is debatable. In the future, we might be able to improve our estimate for partially-mappable folios, though. Note that we will now consistently detect "mapped shared" only if the first subpage is actually mapped multiple times. When the first subpage is not mapped, we will consistently detect it as "mapped exclusively". This change should currently only affect the usage in madvise_free_pte_range() and queue_folios_pte_range() for large folios: if the first page was already unmapped, we would have skipped the folio. [david@redhat.com: folio_likely_mapped_shared() kerneldoc fixup] Link: https://lkml.kernel.org/r/dd0ad9f2-2d7a-45f3-9ba3-979488c7dd27@redhat.com Link: https://lkml.kernel.org/r/20240227201548.857831-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
f238b8c33c |
arm64: mm: swap: support THP_SWAP on hardware with MTE
Commit
|
||
![]() |
f8fd525ba3 |
mm/mempolicy: use numa_node_id() instead of cpu_to_node()
Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy:, v4. This patchset is to optimize the cross-socket memory access with MPOL_PREFERRED_MANY policy. To test this patch we ran the following test on a 3 node system. Node 0 - 2GB - Tier 1 Node 1 - 11GB - Tier 1 Node 6 - 10GB - Tier 2 Below changes are made to memcached to set the memory policy, It select Node0 and Node1 as preferred nodes. #include <numaif.h> #include <numa.h> unsigned long nodemask; int ret; nodemask = 0x03; ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING, &nodemask, 10); /* If MPOL_F_NUMA_BALANCING isn't supported, * fall back to MPOL_PREFERRED_MANY */ if (ret < 0 && errno == EINVAL){ printf("set mem policy normal\n"); ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10); } if (ret < 0) { perror("Failed to call set_mempolicy"); exit(-1); } Test Procedure: =============== 1. Make sure memory tiring and demotion are enabled. 2. Start memcached. # ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7 -d -s "/tmp/memcached.sock" 3. Run memtier_benchmark to store 3200000 keys. #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary --threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1 --key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024 4. Start a memory eater on node 0 and 1. This will demote all memcached pages to node 6. 5. Make sure all the memcached pages got demoted to lower tier by reading /proc/<memcaced PID>/numa_maps. # cat /proc/2771/numa_maps --- default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64 default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64 --- 6. Kill memory eater. 7. Read the pgpromote_success counter. 8. Start reading the keys by running memtier_benchmark. #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary --pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R --key-minimum=1 --key-maximum=3200000 -n allkeys --threads=64 -c 1 -R -x 6 9. Read the pgpromote_success counter. Test Results: ============= Without Patch ------------------ 1. pgpromote_success before test Node 0: pgpromote_success 11 Node 1: pgpromote_success 140974 pgpromote_success after test Node 0: pgpromote_success 11 Node 1: pgpromote_success 140974 2. Memtier-benchmark result. AGGREGATED AVERAGE RESULTS (6 runs) ================================================================== Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency ------------------------------------------------------------------ Sets 0.00 --- --- --- --- Gets 305792.03 305791.93 0.10 0.18949 0.16700 Waits 0.00 --- --- --- --- Totals 305792.03 305791.93 0.10 0.18949 0.16700 ====================================== p99 Latency p99.9 Latency KB/sec ------------------------------------- --- --- 0.00 0.44700 1.71100 11542.69 --- --- --- 0.44700 1.71100 11542.69 With Patch --------------- 1. pgpromote_success before test Node 0: pgpromote_success 5 Node 1: pgpromote_success 89386 pgpromote_success after test Node 0: pgpromote_success 57895 Node 1: pgpromote_success 141463 2. Memtier-benchmark result. AGGREGATED AVERAGE RESULTS (6 runs) ==================================================================== Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency -------------------------------------------------------------------- Sets 0.00 --- --- --- --- Gets 521942.24 521942.07 0.17 0.11459 0.10300 Waits 0.00 --- --- --- --- Totals 521942.24 521942.07 0.17 0.11459 0.10300 ======================================= p99 Latency p99.9 Latency KB/sec --------------------------------------- --- --- 0.00 0.23100 0.31900 19701.68 --- --- --- 0.23100 0.31900 19701.68 Test Result Analysis: ===================== 1. With patch we could observe pages are getting promoted. 2. Memtier-benchmark results shows that, with the patch, performance has increased more than 50%. Ops/sec without fix - 305792.03 Ops/sec with fix - 521942.24 This patch (of 2): Instead of using 'cpu_to_node()', we use 'numa_node_id()', which is quicker. smp_processor_id is guaranteed to be stable in the 'mpol_misplaced()' function because it is called with ptl held. lockdep_assert_held was added to ensure that. No functional change in this patch. [donettom@linux.ibm.com: add "* @vmf: structure describing the fault" comment] Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
1965e933dd |
mm/treewide: replace pXd_huge() with pXd_leaf()
Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(), reuse it. Luckily, pXd_huge() isn't widely used. Link: https://lkml.kernel.org/r/20240318200404.448346-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
a96cb3bf39 |
Merge x86 bugfixes from Linux 6.9-rc3
Pull fix for SEV-SNP late disable bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
![]() |
f7842747d1 |
mm: replace set_pte_at_notify() with just set_pte_at()
With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
![]() |
04c35ab3bd |
x86/mm/pat: fix VM_PAT handling in COW mappings
PAT handling won't do the right thing in COW mappings: the first PTE (or, in fact, all PTEs) can be replaced during write faults to point at anon folios. Reliably recovering the correct PFN and cachemode using follow_phys() from PTEs will not work in COW mappings. Using follow_phys(), we might just get the address+protection of the anon folio (which is very wrong), or fail on swap/nonswap entries, failing follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and track_pfn_copy(), not properly calling free_pfn_range(). In free_pfn_range(), we either wouldn't call memtype_free() or would call it with the wrong range, possibly leaking memory. To fix that, let's update follow_phys() to refuse returning anon folios, and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings if we run into that. We will now properly handle untrack_pfn() with COW mappings, where we don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if the first page was replaced by an anon folio, though: we'd have to store the cachemode in the VMA to make this work, likely growing the VMA size. For now, lets keep it simple and let track_pfn_copy() just fail in that case: it would have failed in the past with swap/nonswap entries already, and it would have done the wrong thing with anon folios. Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn(): <--- C reproducer ---> #include <stdio.h> #include <sys/mman.h> #include <unistd.h> #include <liburing.h> int main(void) { struct io_uring_params p = {}; int ring_fd; size_t size; char *map; ring_fd = io_uring_setup(1, &p); if (ring_fd < 0) { perror("io_uring_setup"); return 1; } size = p.sq_off.array + p.sq_entries * sizeof(unsigned); /* Map the submission queue ring MAP_PRIVATE */ map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, ring_fd, IORING_OFF_SQ_RING); if (map == MAP_FAILED) { perror("mmap"); return 1; } /* We have at least one page. Let's COW it. */ *map = 0; pause(); return 0; } <--- C reproducer ---> On a system with 16 GiB RAM and swap configured: # ./iouring & # memhog 16G # killall iouring [ 301.552930] ------------[ cut here ]------------ [ 301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100 [ 301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g [ 301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1 [ 301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4 [ 301.559569] RIP: 0010:untrack_pfn+0xf4/0x100 [ 301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000 [ 301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282 [ 301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047 [ 301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200 [ 301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000 [ 301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000 [ 301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000 [ 301.564186] FS: 0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000 [ 301.564773] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0 [ 301.565725] PKRU: 55555554 [ 301.565944] Call Trace: [ 301.566148] <TASK> [ 301.566325] ? untrack_pfn+0xf4/0x100 [ 301.566618] ? __warn+0x81/0x130 [ 301.566876] ? untrack_pfn+0xf4/0x100 [ 301.567163] ? report_bug+0x171/0x1a0 [ 301.567466] ? handle_bug+0x3c/0x80 [ 301.567743] ? exc_invalid_op+0x17/0x70 [ 301.568038] ? asm_exc_invalid_op+0x1a/0x20 [ 301.568363] ? untrack_pfn+0xf4/0x100 [ 301.568660] ? untrack_pfn+0x65/0x100 [ 301.568947] unmap_single_vma+0xa6/0xe0 [ 301.569247] unmap_vmas+0xb5/0x190 [ 301.569532] exit_mmap+0xec/0x340 [ 301.569801] __mmput+0x3e/0x130 [ 301.570051] do_exit+0x305/0xaf0 ... Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Wupeng Ma <mawupeng1@huawei.com> Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com Fixes: |
||
![]() |
f8572367ea |
mm/memory: fix missing pte marker for !page on pte zaps
Commit |