2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

24503 Commits

Author SHA1 Message Date
Linus Torvalds
27b9989b87 9 hotfixes. 3 are cc:stable and the remainder address post-6.15 issues
or aren't considered necessary for -stable kernels.  Only 4 are for MM.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaE0BIAAKCRDdBJ7gKXxA
 jkTMAQCTWhvZZdcEdyxo0HQbGo2pcqB4awXjire6GabBFcr1owD5AVV0OYiQNNEN
 tbOVsr+2aZBr/aXTkTy4VpOg1kin8Ak=
 =ThOY
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "9 hotfixes. 3 are cc:stable and the remainder address post-6.15 issues
  or aren't considered necessary for -stable kernels. Only 4 are for MM"

* tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: add mmap_prepare() compatibility layer for nested file systems
  init: fix build warnings about export.h
  MAINTAINERS: add Barry as a THP reviewer
  drivers/rapidio/rio_cm.c: prevent possible heap overwrite
  mm: close theoretical race where stale TLB entries could linger
  mm/vma: reset VMA iterator on commit_merge() OOM failure
  docs: proc: update VmFlags documentation in smaps
  scatterlist: fix extraneous '@'-sign kernel-doc notation
  selftests/mm: skip failed memfd setups in gup_longterm
2025-06-14 08:18:09 -07:00
Lorenzo Stoakes
bb666b7c27 mm: add mmap_prepare() compatibility layer for nested file systems
Nested file systems, that is those which invoke call_mmap() within their
own f_op->mmap() handlers, may encounter underlying file systems which
provide the f_op->mmap_prepare() hook introduced by commit c84bf6dd2b
("mm: introduce new .mmap_prepare() file callback").

We have a chicken-and-egg scenario here - until all file systems are
converted to using .mmap_prepare(), we cannot convert these nested
handlers, as we can't call f_op->mmap from an .mmap_prepare() hook.

So we have to do it the other way round - invoke the .mmap_prepare() hook
from an .mmap() one.

in order to do so, we need to convert VMA state into a struct vm_area_desc
descriptor, invoking the underlying file system's f_op->mmap_prepare()
callback passing a pointer to this, and then setting VMA state accordingly
and safely.

This patch achieves this via the compat_vma_mmap_prepare() function, which
we invoke from call_mmap() if f_op->mmap_prepare() is specified in the
passed in file pointer.

We place the fundamental logic into mm/vma.h where VMA manipulation
belongs.  We also update the VMA userland tests to accommodate the
changes.

The compat_vma_mmap_prepare() function and its associated machinery is
temporary, and will be removed once the conversion of file systems is
complete.

We carefully place this code so it can be used with CONFIG_MMU and also
with cutting edge nommu silicon.

[akpm@linux-foundation.org: export compat_vma_mmap_prepare tp fix build]
[lorenzo.stoakes@oracle.com: remove unused declarations]
  Link: https://lkml.kernel.org/r/ac3ae324-4c65-432a-8c6d-2af988b18ac8@lucifer.local
Link: https://lkml.kernel.org/r/20250609165749.344976-1-lorenzo.stoakes@oracle.com
Fixes: c84bf6dd2b ("mm: introduce new .mmap_prepare() file callback").
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/linux-mm/CAG48ez04yOEVx1ekzOChARDDBZzAKwet8PEoPM4Ln3_rk91AzQ@mail.gmail.com/
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-12 21:39:02 -07:00
Ryan Roberts
383c4613c6 mm: close theoretical race where stale TLB entries could linger
Commit 3ea277194d ("mm, mprotect: flush TLB if potentially racing with a
parallel reclaim leaving stale TLB entries") described a theoretical race
as such:


"""
Nadav Amit identified a theoretical race between page reclaim and mprotect
due to TLB flushes being batched outside of the PTL being held.

He described the race as follows:

	CPU0                            CPU1
	----                            ----
					user accesses memory using RW PTE
					[PTE now cached in TLB]
	try_to_unmap_one()
	==> ptep_get_and_clear()
	==> set_tlb_ubc_flush_pending()
					mprotect(addr, PROT_READ)
					==> change_pte_range()
					==> [ PTE non-present - no flush ]

					user writes using cached RW PTE
	...

	try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such as
munmap, mremap and madvise.
"""

The solution was to introduce flush_tlb_batched_pending() and call it
under the PTL from mprotect/madvise/munmap/mremap to complete any pending
tlb flushes.

However, while madvise_free_pte_range() and
madvise_cold_or_pageout_pte_range() were both retro-fitted to call
flush_tlb_batched_pending() immediately after initially acquiring the PTL,
they both temporarily release the PTL to split a large folio if they
stumble upon one.  In this case, where re-acquiring the PTL
flush_tlb_batched_pending() must be called again, but it previously was
not.  Let's fix that.

There are 2 Fixes: tags here: the first is the commit that fixed
madvise_free_pte_range().  The second is the commit that added
madvise_cold_or_pageout_pte_range(), which looks like it copy/pasted the
faulty pattern from madvise_free_pte_range().

This is a theoretical bug discovered during code review.

Link: https://lkml.kernel.org/r/20250606092809.4194056-1-ryan.roberts@arm.com
Fixes: 3ea277194d ("mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries")
Fixes: 9c276cc65a ("mm: introduce MADV_COLD")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-11 22:42:35 -07:00
Lorenzo Stoakes
0cf4b1687a mm/vma: reset VMA iterator on commit_merge() OOM failure
While an OOM failure in commit_merge() isn't really feasible due to the
allocation which might fail (a maple tree pre-allocation) being 'too small
to fail', we do need to handle this case correctly regardless.

In vma_merge_existing_range(), we can theoretically encounter failures
which result in an OOM error in two ways - firstly dup_anon_vma() might
fail with an OOM error, and secondly commit_merge() failing, ultimately,
to pre-allocate a maple tree node.

The abort logic for dup_anon_vma() resets the VMA iterator to the initial
range, ensuring that any logic looping on this iterator will correctly
proceed to the next VMA.

However the commit_merge() abort logic does not do the same thing.  This
resulted in a syzbot report occurring because mlockall() iterates through
VMAs, is tolerant of errors, but ended up with an incorrect previous VMA
being specified due to incorrect iterator state.

While making this change, it became apparent we are duplicating logic -
the logic introduced in commit 41e6ddcaa0 ("mm/vma: add give_up_on_oom
option on modify/merge, use in uffd release") duplicates the
vmg->give_up_on_oom check in both abort branches.

Additionally, we observe that we can perform the anon_dup check safely on
dup_anon_vma() failure, as this will not be modified should this call
fail.

Finally, we need to reset the iterator in both cases, so now we can simply
use the exact same code to abort for both.

We remove the VM_WARN_ON(err != -ENOMEM) as it would be silly for this to
be otherwise and it allows us to implement the abort check more neatly.

Link: https://lkml.kernel.org/r/20250606125032.164249-1-lorenzo.stoakes@oracle.com
Fixes: 47b16d0462 ("mm: abort vma_modify() on merge out of memory failure")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: syzbot+d16409ea9ecc16ed261a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/6842cc67.a00a0220.29ac89.003b.GAE@google.com/
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-11 22:42:35 -07:00
Linus Torvalds
aef17cb3d3 Revert "mm/damon/Kconfig: enable CONFIG_DAMON by default"
This reverts commit 28615e6eed.

No, we don't make random features default to being on.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-06-10 10:22:15 -07:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
b7191581a9 LoongArch changes for v6.16
1, Adjust the 'make install' operation;
 2, Support SCHED_MC (Multi-core scheduler);
 3, Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS;
 4, Enable HAVE_ARCH_STACKLEAK;
 5, Increase max supported CPUs up to 2048;
 6, Introduce the numa_memblks conversion;
 7, Add PWM controller nodes in dts;
 8, Some bug fixes and other small changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmhEEh8WHGNoZW5odWFj
 YWlAa2VybmVsLm9yZwAKCRAChivD8uImenIIEACT43zpQD6lApWZ8tk8dVYXK6Lb
 pKqXLIhAe9g9cSDC6QfwUTwOjv5RZxJBXgKkQfYSSd3ZEZ6hQIqb2JVYMMpNy6WA
 I5fJ75pr/E6RMtueg5laYxJ/zLJMM5F2qK/Moff7JGeUuDU5BJEG0R7j7KEhtgo0
 5Cm+ktS7waJhNuJposa/Eay6IDWH9Of2UoNboPWmMfLvuZbIp9f1HpjwpZF6xnZ/
 xeIhVOcJw8Xft/0xrcx1xZt/nYnJflNwLgaa7/CGa2q1X7yxVarfMUcY63x6cZXX
 7jLppejqptPqixf+JFRGKBr8Y82Vu/1TW7XlKaywKrjHPZ/FXJsU682x6SQy2uac
 nOiu38qCc5DA9pKB57zxL/5DqTn+yAPGpnC//HvF9/koWCOI6lA5bYzSbBwmkpm7
 DS7mVKtfPTicUs5fZ7DsHRjXuyqs6i1+s7wNaTkgG4Y/gBNAffHa36ON5jM5Wzzz
 WJGcMEIG0Or6tt/yUVrzY390aYzIOkyAoWJgUQ+tPCXecyjnor9i5P3ZC66dIm2r
 UTbC7mX4+bBa/MA1xKKMmCHdJtMzO2A+tduZRf/lO8iYs4c6IJBBaXhW9xHxf8tT
 Av/sXnShJWsrRW40g6Kp85ghDcLBS6SLBm5Ux5OLd2byWIeeF/R2BHdF6qb4V3j6
 Y2ZHk1ewvLgB0BQ/ww==
 =FH8x
 -----END PGP SIGNATURE-----

Merge tag 'loongarch-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch updates from Huacai Chen:

 - Adjust the 'make install' operation

 - Support SCHED_MC (Multi-core scheduler)

 - Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS

 - Enable HAVE_ARCH_STACKLEAK

 - Increase max supported CPUs up to 2048

 - Introduce the numa_memblks conversion

 - Add PWM controller nodes in dts

 - Some bug fixes and other small changes

* tag 'loongarch-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  platform/loongarch: laptop: Unregister generic_sub_drivers on exit
  platform/loongarch: laptop: Add backlight power control support
  platform/loongarch: laptop: Get brightness setting from EC on probe
  LoongArch: dts: Add PWM support to Loongson-2K2000
  LoongArch: dts: Add PWM support to Loongson-2K1000
  LoongArch: dts: Add PWM support to Loongson-2K0500
  LoongArch: vDSO: Correctly use asm parameters in syscall wrappers
  LoongArch: Fix panic caused by NULL-PMD in huge_pte_offset()
  LoongArch: Preserve firmware configuration when desired
  LoongArch: Avoid using $r0/$r1 as "mask" for csrxchg
  LoongArch: Introduce the numa_memblks conversion
  LoongArch: Increase max supported CPUs up to 2048
  LoongArch: Enable HAVE_ARCH_STACKLEAK
  LoongArch: Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
  LoongArch: Add SCHED_MC (Multi-core scheduler) support
  LoongArch: Add some annotations in archhelp
  LoongArch: Using generic scripts/install.sh in `make install`
  LoongArch: Add a default install.sh
2025-06-07 09:56:18 -07:00
Linus Torvalds
bdc7f8c5ad - The 4 patch series "Fix uprobe pte be overwritten when expanding vma"
fixes a longstanding and quite obscure bug related to the vma merging of
   the uprobe mmap page.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaEN1LAAKCRDdBJ7gKXxA
 jsfLAQCC+C8397X6lNKPI3bHGLGEHubP2uzb6bOFMAU6fIRobQEAqUnoUhfP+xsu
 tDbcQEBZ+vfyeT9Zr9vA+uBDcA3OGw0=
 =9oaG
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-06-06-16-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:
 "The series 'Fix uprobe pte be overwritten when expanding vma' fixes a
  longstanding and quite obscure bug related to the vma merging of the
  uprobe mmap page"

* tag 'mm-stable-2025-06-06-16-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  selftests/mm: add test about uprobe pte be orphan during vma merge
  selftests/mm: extract read_sysfs and write_sysfs into vm_util
  mm: expose abnormal new_pte during move_ptes
  mm: fix uprobe pte be overwritten when expanding vma
  mm/damon: s/primitives/code/ on comments
2025-06-06 22:06:57 -07:00
Linus Torvalds
d3c82f618a 13 hotfixes. 6 are cc:stable and the remainder address post-6.15 issues
or aren't considered necessary for -stable kernels.  11 are for MM.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaENzlAAKCRDdBJ7gKXxA
 joNYAP9n38QNDUoRR6ChFikzzY77q4alD2NL0aqXBZdcSRXoUgEAlQ8Ea+t6xnzp
 GnH+cnsA6FDp4F6lIoZBdENJyBYrkQE=
 =ud9O
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2025-06-06-16-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "13 hotfixes.

  6 are cc:stable and the remainder address post-6.15 issues or aren't
  considered necessary for -stable kernels. 11 are for MM"

* tag 'mm-hotfixes-stable-2025-06-06-16-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
  MAINTAINERS: add mm swap section
  kmsan: test: add module description
  MAINTAINERS: add tlb trace events to MMU GATHER AND TLB INVALIDATION
  mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  mm/hugetlb: unshare page tables during VMA split, not before
  MAINTAINERS: add Alistair as reviewer of mm memory policy
  iov_iter: use iov_offset for length calculation in iov_iter_aligned_bvec
  mm/mempolicy: fix incorrect freeing of wi_kobj
  alloc_tag: handle module codetag load errors as module load failures
  mm/madvise: handle madvise_lock() failure during race unwinding
  mm: fix vmstat after removing NR_BOUNCE
  KVM: s390: rename PROT_NONE to PROT_TYPE_DUMMY
2025-06-06 21:45:45 -07:00
Arnd Bergmann
ffa74d44f4 kmsan: test: add module description
Every module should have a description, and kbuild now warns for those
that don't.

WARNING: modpost: missing MODULE_DESCRIPTION() in mm/kmsan/kmsan_test.o

Link: https://lkml.kernel.org/r/20250603075323.1839608-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Macro Elver <elver@google.com>
Cc: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 22:02:25 -07:00
Jann Horn
1013af4f58 mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
huge_pmd_unshare() drops a reference on a page table that may have
previously been shared across processes, potentially turning it into a
normal page table used in another process in which unrelated VMAs can
afterwards be installed.

If this happens in the middle of a concurrent gup_fast(), gup_fast() could
end up walking the page tables of another process.  While I don't see any
way in which that immediately leads to kernel memory corruption, it is
really weird and unexpected.

Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one(),
just like we do in khugepaged when removing page tables for a THP
collapse.

Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-2-1329349bad1a@google.com
Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-2-f4136f5ec58a@google.com
Fixes: 39dde65c99 ("[PATCH] shared page table for hugetlb page")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 22:02:24 -07:00
Jann Horn
081056dc00 mm/hugetlb: unshare page tables during VMA split, not before
Currently, __split_vma() triggers hugetlb page table unsharing through
vm_ops->may_split().  This happens before the VMA lock and rmap locks are
taken - which is too early, it allows racing VMA-locked page faults in our
process and racing rmap walks from other processes to cause page tables to
be shared again before we actually perform the split.

Fix it by explicitly calling into the hugetlb unshare logic from
__split_vma() in the same place where THP splitting also happens.  At that
point, both the VMA and the rmap(s) are write-locked.

An annoying detail is that we can now call into the helper
hugetlb_unshare_pmds() from two different locking contexts:

1. from hugetlb_split(), holding:
    - mmap lock (exclusively)
    - VMA lock
    - file rmap lock (exclusively)
2. hugetlb_unshare_all_pmds(), which I think is designed to be able to
   call us with only the mmap lock held (in shared mode), but currently
   only runs while holding mmap lock (exclusively) and VMA lock

Backporting note:
This commit fixes a racy protection that was introduced in commit
b30c14cd61 ("hugetlb: unshare some PMDs when splitting VMAs"); that
commit claimed to fix an issue introduced in 5.13, but it should actually
also go all the way back.

[jannh@google.com: v2]
  Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-1-1329349bad1a@google.com
Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-0-1329349bad1a@google.com
Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-1-f4136f5ec58a@google.com
Fixes: 39dde65c99 ("[PATCH] shared page table for hugetlb page")
Signed-off-by: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>	[b30c14cd61: hugetlb: unshare some PMDs when splitting VMAs]
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 22:02:24 -07:00
Joshua Hahn
41ffaa0ea7 mm/mempolicy: fix incorrect freeing of wi_kobj
We should not free wi_group->wi_kobj here.  In the error path of
add_weighted_interleave_group() where this snippet is called from,
kobj_{del, put} is immediately called right after this section.  Thus, it
is not only unnecessary but also incorrect to free it here.

Link: https://lkml.kernel.org/r/20250602162345.2595696-1-joshua.hahnjy@gmail.com
Fixes: e341f9c3c8 ("mm/mempolicy: Weighted Interleave Auto-tuning")
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506011545.Fduxqxqj-lkp@intel.com/
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 22:02:23 -07:00
SeongJae Park
9c49e5d09f mm/madvise: handle madvise_lock() failure during race unwinding
When unwinding race on -ERESTARTNOINTR handling of process_madvise(),
madvise_lock() failure is ignored.  Check the failure and abort remaining
works in the case.

Link: https://lkml.kernel.org/r/20250602174926.1074-1-sj@kernel.org
Fixes: 4000e3d0a3 ("mm/madvise: remove redundant mmap_lock operations from process_madvise()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Barry Song <21cnbao@gmail.com>
Closes: https://lore.kernel.org/CAGsJ_4xJXXO0G+4BizhohSZ4yDteziPw43_uF8nPXPWxUVChzw@mail.gmail.com
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Barry Song <baohua@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 22:02:23 -07:00
Kirill A. Shutemov
684088173b mm: fix vmstat after removing NR_BOUNCE
Hongyu noticed that the nr_unaccepted counter kept growing even in the
absence of unaccepted memory on the machine.

This happens due to a commit that removed NR_BOUNCE: it removed the
counter from the enum zone_stat_item, but left it in the vmstat_text
array.

As a result, all counters below nr_bounce in /proc/vmstat are shifted by
one line, causing the numa_hit counter to be labeled as nr_unaccepted.

To fix this issue, remove nr_bounce from the vmstat_text array.

Link: https://lkml.kernel.org/r/20250529103832.2937460-1-kirill.shutemov@linux.intel.com
Fixes: 194df9f66d ("mm: remove NR_BOUNCE zone stat")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 22:02:22 -07:00
Pu Lehui
b36b701bbc mm: expose abnormal new_pte during move_ptes
When executing move_ptes, the new_pte must be NULL, otherwise it will be
overwritten by the old_pte, and cause the abnormal new_pte to be leaked. 
In order to make this problem to be more explicit, let's add WARN_ON_ONCE
when new_pte is not NULL.

[akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/]
Link: https://lkml.kernel.org/r/20250529155650.4017699-3-pulehui@huaweicloud.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 21:55:41 -07:00
Pu Lehui
2b12d06c37 mm: fix uprobe pte be overwritten when expanding vma
Patch series "Fix uprobe pte be overwritten when expanding vma".


This patch (of 4):

We encountered a BUG alert triggered by Syzkaller as follows:
   BUG: Bad rss-counter state mm:00000000b4a60fca type:MM_ANONPAGES val:1

And we can reproduce it with the following steps:
1. register uprobe on file at zero offset
2. mmap the file at zero offset:
   addr1 = mmap(NULL, 2 * 4096, PROT_NONE, MAP_PRIVATE, fd, 0);
3. mremap part of vma1 to new vma2:
   addr2 = mremap(addr1, 4096, 2 * 4096, MREMAP_MAYMOVE);
4. mremap back to orig addr1:
   mremap(addr2, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED, addr1);

In step 3, the vma1 range [addr1, addr1 + 4096] will be remap to new vma2
with range [addr2, addr2 + 8192], and remap uprobe anon page from the vma1
to vma2, then unmap the vma1 range [addr1, addr1 + 4096].

In step 4, the vma2 range [addr2, addr2 + 4096] will be remap back to the
addr range [addr1, addr1 + 4096].  Since the addr range [addr1 + 4096,
addr1 + 8192] still maps the file, it will take vma_merge_new_range to
expand the range, and then do uprobe_mmap in vma_complete.  Since the
merged vma pgoff is also zero offset, it will install uprobe anon page to
the merged vma.  However, the upcomming move_page_tables step, which use
set_pte_at to remap the vma2 uprobe pte to the merged vma, will overwrite
the newly uprobe pte in the merged vma, and lead that pte to be orphan.

Since the uprobe pte will be remapped to the merged vma, we can remove the
unnecessary uprobe_mmap upon merged vma.

This problem was first found in linux-6.6.y and also exists in the
community syzkaller:
https://lore.kernel.org/all/000000000000ada39605a5e71711@google.com/T/

Link: https://lkml.kernel.org/r/20250529155650.4017699-1-pulehui@huaweicloud.com
Link: https://lkml.kernel.org/r/20250529155650.4017699-2-pulehui@huaweicloud.com
Fixes: 2b14449835 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 21:55:41 -07:00
Enze Li
ea68ea9091 mm/damon: s/primitives/code/ on comments
The word 'primitive' is not explicit.  To make the code more easily
understood, this commit renames 'primitives' to 'code' in header comments
of some source files.

Link: https://lkml.kernel.org/r/20250530053115.153238-1-lienze@kylinos.cn
Signed-off-by: Enze Li <lienze@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 21:55:41 -07:00
Linus Torvalds
1af80d00e1 slab updates for 6.16
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmhAB1wACgkQu+CwddJF
 iJrbwAf/Z5u9E7FO+dsNczPwI4/eZYdbfr4KYpLjnIBt0tbXp9HMjkkx8aJNiWoA
 4x4kcIxkATnYP669GTeELU9sskflrwjAmpAZbU4wWPZ89aqmPwQCk4ZbWX7Y+O+/
 OITeQiUVOcpWekZ8ke+JGWa9jjxWnWV5bn0ucBdxBmm2NvdVHEkwNUZmiFq85tEJ
 1M37TaoywC0d//ww+LGEBfpIYWoev7n3ITm8YeGXJT6+34lfBOBQ8F7C65olb5qq
 oGaGQH8VJ9XBRPOmQpUej/s76ovqc6wQfGs0sLcVMhslz8Vwcxtw62CuPtEw5JAK
 FdZueTH4kEhD6kspYNU0oq+UzUlc+g==
 =fGZh
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:

 - Make kvmalloc() more suitable for callers that need it to succeed,
   but without unnecessary overhead by reclaim and compaction to get a
   physically contiguous allocation.

   Instead fall back to vmalloc() more easily by default, unless
   instructed by __GFP_RETRY_MAYFAIL to prefer kmalloc() harder. This
   should allow the removal of a xfs-specific workaround (Michal Hocko)

 - Remove potentially excessive warnings due to memory pressure when
   allocating structures for per-object allocation profiling metadata
   (Usama Arif)

* tag 'slab-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm: slub: only warn once when allocating slab obj extensions fails
  mm: kvmalloc: make kmalloc fast path real fast path
2025-06-04 08:59:59 -07:00
Linus Torvalds
fd1f847350 - The 2 patch series "zram: support algorithm-specific parameters" from
Sergey Senozhatsky adds infrastructure for passing algorithm-specific
   parameters into zram.  A single parameter `winbits' is implemented at
   this time.
 
 - The 5 patch series "memcg: nmi-safe kmem charging" from Shakeel Butt
   makes memcg charging nmi-safe, which is required by BFP, which can
   operate in NMI context.
 
 - The 5 patch series "Some random fixes and cleanup to shmem" from
   Kemeng Shi implements small fixes and cleanups in the shmem code.
 
 - The 2 patch series "Skip mm selftests instead when kernel features are
   not present" from Zi Yan fixes some issues in the MM selftest code.
 
 - The 2 patch series "mm/damon: build-enable essential DAMON components
   by default" from SeongJae Park reworks DAMON Kconfig to make it easier
   to enable CONFIG_DAMON.
 
 - The 2 patch series "sched/numa: add statistics of numa balance task
   migration" from Libo Chen adds more info into sysfs and procfs files to
   improve visibility into the NUMA balancer's task migration activity.
 
 - The 4 patch series "selftests/mm: cow and gup_longterm cleanups" from
   Mark Brown provides various updates to some of the MM selftests to make
   them play better with the overall containing framework.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDzA9wAKCRDdBJ7gKXxA
 js8sAP9V3COg+vzTmimzP3ocTkkbbIJzDfM6nXpE2EQ4BR3ejwD+NsIT2ZLtTF6O
 LqAZpgO7ju6wMjR/lM30ebCq5qFbZAw=
 =oruw
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - "zram: support algorithm-specific parameters" from Sergey Senozhatsky
   adds infrastructure for passing algorithm-specific parameters into
   zram. A single parameter `winbits' is implemented at this time.

 - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg
   charging nmi-safe, which is required by BFP, which can operate in NMI
   context.

 - "Some random fixes and cleanup to shmem" from Kemeng Shi implements
   small fixes and cleanups in the shmem code.

 - "Skip mm selftests instead when kernel features are not present" from
   Zi Yan fixes some issues in the MM selftest code.

 - "mm/damon: build-enable essential DAMON components by default" from
   SeongJae Park reworks DAMON Kconfig to make it easier to enable
   CONFIG_DAMON.

 - "sched/numa: add statistics of numa balance task migration" from Libo
   Chen adds more info into sysfs and procfs files to improve visibility
   into the NUMA balancer's task migration activity.

 - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown
   provides various updates to some of the MM selftests to make them
   play better with the overall containing framework.

* tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits)
  mm/khugepaged: clean up refcount check using folio_expected_ref_count()
  selftests/mm: fix test result reporting in gup_longterm
  selftests/mm: report unique test names for each cow test
  selftests/mm: add helper for logging test start and results
  selftests/mm: use standard ksft_finished() in cow and gup_longterm
  selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled
  sched/numa: add statistics of numa balance task
  sched/numa: fix task swap by skipping kernel threads
  tools/testing: check correct variable in open_procmap()
  tools/testing/vma: add missing function stub
  mm/gup: update comment explaining why gup_fast() disables IRQs
  selftests/mm: two fixes for the pfnmap test
  mm/khugepaged: fix race with folio split/free using temporary reference
  mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order
  mmu_notifiers: remove leftover stub macros
  selftests/mm: deduplicate test names in madv_populate
  kcov: rust: add flags for KCOV with Rust
  mm: rust: make CONFIG_MMU ifdefs more narrow
  mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()
  mm/damon/Kconfig: enable CONFIG_DAMON by default
  ...
2025-06-02 16:00:26 -07:00
Linus Torvalds
2619a6d413 fuse update for 6.16
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCaD2Y1wAKCRDh3BK/laaZ
 PFSHAP4q1+mOlQfZJPH/PFDwa+F0QW/uc3szXatS0888nxui/gEAsIeyyJlf+Mr8
 /1JPXxCqcapRFw9xsS0zioiK54Elfww=
 =2KxA
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Remove tmp page copying in writeback path (Joanne).

   This removes ~300 lines and with that a lot of complexity related to
   avoiding reclaim related deadlock. The old mechanism is replaced with
   a mapping flag that tells the MM not to block reclaim waiting for
   writeback to complete. The MM parts have been reviewed/acked by
   respective maintainers.

 - Convert more code to handle large folios (Joanne). This still just
   adds the code to deal with large folios and does not enable them yet.

 - Allow invalidating all cached lookups atomically (Luis Henriques).
   This feature is useful for CernVMFS, which currently does this
   iteratively.

 - Align write prefaulting in fuse with generic one (Dave Hansen)

 - Fix race causing invalid data to be cached when setting attributes on
   different nodes of a distributed fs (Guang Yuan Wu)

 - Update documentation for passthrough (Chen Linxuan)

 - Add fdinfo about the device number associated with an opened
   /dev/fuse instance (Chen Linxuan)

 - Increase readdir buffer size (Miklos). This depends on a patch to VFS
   readdir code that was already merged through Christians tree.

 - Optimize io-uring request expiration (Joanne)

 - Misc cleanups

* tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
  fuse: increase readdir buffer size
  readdir: supply dir_context.count as readdir buffer size hint
  fuse: don't allow signals to interrupt getdents copying
  fuse: support large folios for writeback
  fuse: support large folios for readahead
  fuse: support large folios for queued writes
  fuse: support large folios for stores
  fuse: support large folios for symlinks
  fuse: support large folios for folio reads
  fuse: support large folios for writethrough writes
  fuse: refactor fuse_fill_write_pages()
  fuse: support large folios for retrieves
  fuse: support copying large folios
  fs: fuse: add dev id to /dev/fuse fdinfo
  docs: filesystems: add fuse-passthrough.rst
  MAINTAINERS: update filter of FUSE documentation
  fuse: fix race between concurrent setattrs from multiple nodes
  fuse: remove tmp folio for writebacks and internal rb tree
  mm: skip folio reclaim in legacy memcg contexts for deadlockable mappings
  fuse: optimize over-io-uring request expiration check
  ...
2025-06-02 15:31:05 -07:00
Linus Torvalds
fcd0bb8e99 vfs-6.16-rc2.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaD1hUAAKCRCRxhvAZXjc
 ohIBAQCJQYP7bzg+A7C7K6cA+5LwC1u4aZ424B5puIZrLiEEDQEAxQli95/rTIeE
 m2DWBDl5rMrKlfmpqGvjbbJldU75swo=
 =2EhD
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Fix the AT_HANDLE_CONNECTABLE option so filesystems that don't know
   how to decode a connected non-dir dentry fail the request

 - Use repr(transparent) to ensure identical layout between the C and
   Rust implementation of struct file

 - Add a missing xas_pause() into the dax code employing
   wait_entry_unlocked_exclusive()

 - Fix FOP_DONTCACHE which we disabled for v6.15.

   A folio could get redirtied and/or scheduled for writeback after the
   initial dropbehind test. Change the test accordingly to handle these
   cases so we can re-enable FOP_DONTCACHE again

* tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  exportfs: require ->fh_to_parent() to encode connectable file handles
  rust: file: improve safety comments
  rust: file: mark `LocalFile` as `repr(transparent)`
  fs/dax: Fix "don't skip locked entries when scanning entries"
  iomap: don't lose folio dropbehind state for overwrites
  mm/filemap: unify dropbehind flag testing and clearing
  mm/filemap: unify read/write dropbehind naming
  Revert "Disable FOP_DONTCACHE for now due to bugs"
  mm/filemap: use filemap_end_dropbehind() for read invalidation
  mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
2025-06-02 12:49:16 -07:00
Shivank Garg
0b43b8bc8e mm/khugepaged: clean up refcount check using folio_expected_ref_count()
Use folio_expected_ref_count() instead of open-coded logic in
is_refcount_suitable().  This avoids code duplication and improves
clarity.

Drop is_refcount_suitable() as it is no longer needed.

Link: https://lkml.kernel.org/r/20250526182818.37978-2-shivankg@amd.com
Signed-off-by: Shivank Garg <shivankg@amd.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Fengwei Yin <fengwei.yin@intel.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:16 -07:00
Chen Yu
ad6b26b6a0 sched/numa: add statistics of numa balance task
On systems with NUMA balancing enabled, it has been found that tracking
task activities resulting from NUMA balancing is beneficial.  NUMA
balancing employs two mechanisms for task migration: one is to migrate
a task to an idle CPU within its preferred node, and the other is to
swap tasks located on different nodes when they are on each other's
preferred nodes.

The kernel already provides NUMA page migration statistics in
/sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.  However, it
lacks statistics regarding task migration and swapping.  Therefore,
relevant counts for task migration and swapping should be added.

The following two new fields:

numa_task_migrated
numa_task_swapped

will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
and /proc/vmstat.

Introducing both per-task and per-memory cgroup (memcg) NUMA balancing
statistics facilitates a rapid evaluation of the performance and
resource utilization of the target workload.  For instance, users can
first identify the container with high NUMA balancing activity and then
further pinpoint a specific task within that group, and subsequently
adjust the memory policy for that task.  In short, although it is
possible to iterate through /proc/$pid/sched to locate the problematic
task, the introduction of aggregated NUMA balancing activity for tasks
within each memcg can assist users in identifying the task more
efficiently through a divide-and-conquer approach.

As Libo Chen pointed out, the memcg event relies on the text names in
vmstat_text, and /proc/vmstat generates corresponding items based on
vmstat_text.  Thus, the relevant task migration and swapping events
introduced in vmstat_text also need to be populated by
count_vm_numa_event(), otherwise these values are zero in /proc/vmstat.

In theory, task migration and swap events are part of the scheduler's
activities.  The reason for exposing them through the
memory.stat/vmstat interface is that we already have NUMA balancing
statistics in memory.stat/vmstat, and these events are closely related
to each other.  Following Shakeel's suggestion, we describe the
end-to-end flow/story of all these events occurring on a timeline for
future reference:

The goal of NUMA balancing is to co-locate a task and its memory pages
on the same NUMA node.  There are two strategies: migrate the pages to
the task's node, or migrate the task to the node where its pages
reside.

Suppose a task p1 is running on Node 0, but its pages are located on
Node 1.  NUMA page fault statistics for p1 reveal its "page footprint"
across nodes.  If NUMA balancing detects that most of p1's pages are on
Node 1:

1.Page Migration Attempt:
The Numa balance first tries to migrate p1's pages to Node 0.
The numa_page_migrate counter increments.

2.Task Migration Strategies:
After the page migration finishes, Numa balance checks every
1 second to see if p1 can be migrated to Node 1.

Case 2.1: Idle CPU Available

  If Node 1 has an idle CPU, p1 is directly scheduled there.  This
  event is logged as numa_task_migrated.

Case 2.2: No Idle CPU (Task Swap)

  If all CPUs on Node1 are busy, direct migration could cause CPU
  contention or load imbalance.  Instead: The Numa balance selects a
  candidate task p2 on Node 1 that prefers Node 0 (e.g., due to its own
  page footprint).  p1 and p2 are swapped.  This cross-node swap is
  recorded as numa_task_swapped.

Link: https://lkml.kernel.org/r/d00edb12ba0f0de3c5222f61487e65f2ac58f5b1.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/7ef90a88602ed536be46eba7152ed0d33bad5790.1748002400.git.yu.c.chen@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ayush Jain <Ayush.jain3@amd.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:15 -07:00
Jann Horn
52084f258e mm/gup: update comment explaining why gup_fast() disables IRQs
The current comment in gup_fast() talks about "IPIs that come from THPs
splitting", which is outdated and refers to the old THP splitting
implementation that was removed in commit ad0bed24e9 ("thp: drop all
split_huge_page()-related code"), which landed in v4.5.  Before then, THP
splitting involved a pmdp_splitting_flush(), which sent an IPI to
serialize against gup_fast().

Nowadays, we use tlb_remove_table_sync_one() to send IPIs that serialize
against gup_fast(); this is used, for example, in THP *collapsing* to stop
gup_fast() walks of a page table before depositing it.

Link: https://lkml.kernel.org/r/20250528-gup-irq-comment-fix-v1-1-b9d83c345333@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:14 -07:00
Shivank Garg
595cf68351 mm/khugepaged: fix race with folio split/free using temporary reference
hpage_collapse_scan_file() calls is_refcount_suitable(), which in turn
calls folio_mapcount().  folio_mapcount() checks folio_test_large() before
proceeding to folio_large_mapcount(), but there is a race window where the
folio may get split/freed between these checks, triggering:

  VM_WARN_ON_FOLIO(!folio_test_large(folio), folio)

Take a temporary reference to the folio in hpage_collapse_scan_file(). 
This stabilizes the folio during refcount check and prevents incorrect
large folio detection due to concurrent split/free.  Use helper
folio_expected_ref_count() + 1 to compare with folio_ref_count() instead
of using is_refcount_suitable().

Link: https://lkml.kernel.org/r/20250526182818.37978-1-shivankg@amd.com
Fixes: 05c5323b2a ("mm: track mapcount of large folios in single value")
Signed-off-by: Shivank Garg <shivankg@amd.com>
Reported-by: syzbot+2b99589e33edbe9475ca@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6828470d.a70a0220.38f255.000c.GAE@google.com
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Fengwei Yin <fengwei.yin@intel.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:13 -07:00
Juan Yescas
e13e7922d0 mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order
Problem: On large page size configurations (16KiB, 64KiB), the CMA
alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
and this causes the CMA reservations to be larger than necessary.  This
means that system will have less available MIGRATE_UNMOVABLE and
MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them.

The CMA_MIN_ALIGNMENT_BYTES increases because it depends on MAX_PAGE_ORDER
which depends on ARCH_FORCE_MAX_ORDER.  The value of ARCH_FORCE_MAX_ORDER
increases on 16k and 64k kernels.

For example, in ARM, the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER default value is used
- CONFIG_TRANSPARENT_HUGEPAGE is set:

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
-----------------------------------------------------------------------
   4KiB   |      10        |       9         |  4KiB * (2 ^  9) =   2MiB
  16Kib   |      11        |      11         | 16KiB * (2 ^ 11) =  32MiB
  64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB

There are some extreme cases for the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
- CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
- CONFIG_HUGETLB_PAGE is NOT set

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order |  CMA_MIN_ALIGNMENT_BYTES
------------------------------------------------------------------------
   4KiB   |      15        |      15         |  4KiB * (2 ^ 15) = 128MiB
  16Kib   |      13        |      13         | 16KiB * (2 ^ 13) = 128MiB
  64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB

This affects the CMA reservations for the drivers. If a driver in a
4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
reservation has to be 32MiB due to the alignment requirements:

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x400000>; /* 4 MiB */
        ...
    };
};

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x2000000>; /* 32 MiB */
        ...
    };
};

Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that allows to set the
page block order in all the architectures.  The maximum page block order
will be given by ARCH_FORCE_MAX_ORDER.

By default, CONFIG_PAGE_BLOCK_ORDER will have the same value that
ARCH_FORCE_MAX_ORDER.  This will make sure that current kernel
configurations won't be affected by this change.  It is a opt-in change.

This patch will allow to have the same CMA alignment requirements for
large page sizes (16KiB, 64KiB) as that in 4kb kernels by setting a lower
pageblock_order.

Tests:

- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 on
  4k and 16k kernels.

- Verified that Transparent Huge Pages work when pageblock_order is 1,
  7, 10 on 4k and 16k kernels.

- Verified that dma-buf heaps allocations work when pageblock_order is
  1, 7, 10 on 4k and 16k kernels.

Benchmarks:

The benchmarks compare 16kb kernels with pageblock_order 10 and 7.  The
reason for the pageblock_order 7 is because this value makes the min CMA
alignment requirement the same as that in 4kb kernels (2MB).

- Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
  SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4.  Use simpleperf
  (https://developer.android.com/ndk/guides/simpleperf) to measure the #
  of instructions and page-faults on 16k kernels.  The benchmark was
  executed 10 times.  The averages are below:

           # instructions         |     #page-faults
    order 10     |  order 7       | order 10 | order 7
--------------------------------------------------------
 13,891,765,770	 | 11,425,777,314 |    220   |   217
 14,456,293,487	 | 12,660,819,302 |    224   |   219
 13,924,261,018	 | 13,243,970,736 |    217   |   221
 13,910,886,504	 | 13,845,519,630 |    217   |   221
 14,388,071,190	 | 13,498,583,098 |    223   |   224
 13,656,442,167	 | 12,915,831,681 |    216   |   218
 13,300,268,343	 | 12,930,484,776 |    222   |   218
 13,625,470,223	 | 14,234,092,777 |    219   |   218
 13,508,964,965	 | 13,432,689,094 |    225   |   219
 13,368,950,667	 | 13,683,587,37  |    219   |   225
-------------------------------------------------------------------
 13,803,137,433  | 13,131,974,268 |    220   |   220    Averages

There were 4.85% #instructions when order was 7, in comparison with order
10.

     13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%)

The number of page faults in order 7 and 10 were the same.

These results didn't show any significant regression when the
pageblock_order is set to 7 on 16kb kernels.

- Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
  on the 16k kernels with pageblock_order 7 and 10.

order 10 | order 7  | order 7 - order 10 | (order 7 - order 10) %
-------------------------------------------------------------------
  15.8	 |  16.4    |         0.6        |     3.80%
  16.4	 |  16.2    |        -0.2        |    -1.22%
  16.6	 |  16.3    |        -0.3        |    -1.81%
  16.8	 |  16.3    |        -0.5        |    -2.98%
  16.6	 |  16.8    |         0.2        |     1.20%
-------------------------------------------------------------------
  16.44     16.4            -0.04	          -0.24%   Averages

The results didn't show any significant regression when the
pageblock_order is set to 7 on 16kb kernels.

Link: https://lkml.kernel.org/r/20250521215807.1860663-1-jyescas@google.com
Signed-off-by: Juan Yescas <jyescas@google.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:13 -07:00
Roman Gushchin
bfe125f1b1 mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()
Commit b67fbebd4c ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") added a
forced tlbflush to tlb_vma_end(), which is required to avoid a race
between munmap() and unmap_mapping_range().  However it added some
overhead to other paths where tlb_vma_end() is used, but vmas are not
removed, e.g.  madvise(MADV_DONTNEED).

Fix this by moving the tlb flush out of tlb_end_vma() into new
tlb_flush_vmas() called from free_pgtables(), somewhat similar to the
stable version of the original commit: commit 895428ee124a ("mm: Force TLB
flush for PFNMAP mappings before unlink_file_vma()").

Note, that if tlb->fullmm is set, no flush is required, as the whole mm is
about to be destroyed.

Link: https://lkml.kernel.org/r/20250522012838.163876-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:12 -07:00
SeongJae Park
28615e6eed mm/damon/Kconfig: enable CONFIG_DAMON by default
As of this writing, multiple major distros including Alma, Amazon,
Android, CentOS, Debian, Fedora, and Oracle are build-enabling DAMON (set
CONFIG_DAMON[1]).  Enabling it by default will save configuration setup
time for the current and future DAMON users.

Build-enabling DAMON does not introduce a real risk since it makes no
behavioral change by default.  It requires explicit user requests to do
anything.  Only one potential risk is making the size of the kernel a
little bit larger.  On a production-purpose configuration, it increases
the resulting kernel package size by about 0.1 % of the final package
file.  I believe that's too small to be a real problem in common setups.

Hence, the benefit of enabling CONFIG_DAMON outweighs the potential risk. 
Set CONFIG_DAMON by default.

Link: https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=DAMON [1]
Link: https://lkml.kernel.org/r/20250521042755.39653-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Acked-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:12 -07:00
SeongJae Park
0acfc656df mm/damon/Kconfig: set DAMON_{VADDR,PADDR,SYSFS} default to DAMON
Patch series "mm/damon: build-enable essential DAMON components by
default".

As of this writing, multiple major distros including Alma, Amazon,
Android, CentOS, Debian, Fedora, and Oracle are build-enabling DAMON (set
CONFIG_DAMON[1]).  Configuring DAMON is not very easy, since it is
disabled by default, and there are multiple essential options that need to
be manually turned on, one by one.  Make it easier, by grouping essential
configurations to be enabled with one selection, and enabling build of the
essential parts of DAMON by default.

Note that build-enabling DAMON does not introduce any real risk, since it
makes no behavioral change by default.  It requires explicit user requests
to do anything.  Only one potential risk is making the size of the kernel
a little bit larger.  On a production-purpose configuration, it increases
the resulting kernel package binary size by about 0.1 % of the final
package file.  I believe that's too small to be a real problem in common
setups.

DAMON_{VADDR,PADDR,SYSFS} are de-facto essential parts of DAMON for normal
usages.  Because those need to be enabled one by one, however, and there
are other test-purpose or non-essential configurations, it is easy to be
confused and make mistakes at setup.  Make the essential configurations
default to CONFIG_DAMON, so that those can be enabled by default with a
single change.

Link: https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=DAMON [1]
Link: https://lkml.kernel.org/r/20250521042755.39653-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250521042755.39653-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Acked-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:11 -07:00
Wenjie Xu
3aefb1f069 hugetlb: show nr_huge_pages in report_hugepages()
The number of pre-allocated huge pages should be nr_huge_pages, not
free_huge_pages, although they are same during booting stage

Link: https://lkml.kernel.org/r/20250515114231.65824-1-xuwenjie04@baidu.com
Signed-off-by: Wenjie Xu <xuwenjie04@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:11 -07:00
Kemeng Shi
c5a9deace6 mm/shmem: remove unneeded xa_is_value() check in shmem_unuse_swap_entries()
As only value entry will be added to fbatch in shmem_find_swap_entries(),
there is no need to do xa_is_value() check in shmem_unuse_swap_entries().

Link: https://lkml.kernel.org/r/20250516170939.965736-6-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:11 -07:00
Kemeng Shi
a5cdbe9f37 mm: shmem: only remove inode from swaplist when it's swapped page count is 0
Even if we fail to allocate a swap entry, the inode might have previously
allocated entry and we might take inode containing swap entry off
swaplist.  As a result, try_to_unuse() may enter a potential dead loop to
repeatedly look for inode and clean it's swap entry.  Only take inode off
swaplist when it's swapped page count is 0 to fix the issue.

Link: https://lkml.kernel.org/r/20250516170939.965736-5-shikemeng@huaweicloud.com
Fixes: b487a2da35 ("mm, swap: simplify folio swap allocation")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202505161438.9009cf47-lkp@intel.com
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:10 -07:00
Kemeng Shi
3f778ab1b5 mm/shmem: fix potential dead loop in shmem_unuse()
If multi shmem_unuse() for different swap type is called concurrently, a
dead loop could occur as following:

shmem_unuse(typeA)               shmem_unuse(typeB)
 mutex_lock(&shmem_swaplist_mutex)
 list_for_each_entry_safe(info, next, ...)
  ...
  mutex_unlock(&shmem_swaplist_mutex)
  /* info->swapped may drop to 0 */
  shmem_unuse_inode(&info->vfs_inode, type)

                                  mutex_lock(&shmem_swaplist_mutex)
                                  list_for_each_entry(info, next, ...)
                                   if (!info->swapped)
                                    list_del_init(&info->swaplist)

                                  ...
                                  mutex_unlock(&shmem_swaplist_mutex)

  mutex_lock(&shmem_swaplist_mutex)
  /* iterate with offlist entry and encounter a dead loop */
  next = list_next_entry(info, swaplist);
  ...

Restart the iteration if the inode is already off shmem_swaplist list to
fix the issue.

Link: https://lkml.kernel.org/r/20250516170939.965736-4-shikemeng@huaweicloud.com
Fixes: b56a2d8af9 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:10 -07:00
Kemeng Shi
594ec2ab38 mm: shmem: add missing shmem_unacct_size() in __shmem_file_setup()
We will miss shmem_unacct_size() when is_idmapped_mnt() returns a failure.
Move is_idmapped_mnt() before shmem_acct_size() to fix the issue.

Link: https://lkml.kernel.org/r/20250516170939.965736-3-shikemeng@huaweicloud.com
Fixes: 7a80e5b8c6 ("shmem: support idmapped mounts for tmpfs")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:10 -07:00
Kemeng Shi
e08d5f5156 mm: shmem: avoid unpaired folio_unlock() in shmem_swapin_folio()
Patch series "Some random fixes and cleanup to shmem", v3.

This series contains some simple fixes and cleanup which are made during
learning shmem.  More details can be found in respective patches.


This patch (of 5):

If we get a folio from swap_cache_get_folio() successfully but encounter a
failure before the folio is locked, we will unlock the folio which was not
previously locked.

Put the folio and set it to NULL when a failure occurs before the folio is
locked to fix the issue.

Link: https://lkml.kernel.org/r/20250516170939.965736-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20250516170939.965736-2-shikemeng@huaweicloud.com
Fixes: 058313515d ("mm: shmem: fix potential data corruption during shmem swapin")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:10 -07:00
Akinobu Mita
8e1c4961f4 mm/damon/core: avoid destroyed target reference from DAMOS quota
When the number of the monitoring targets in running contexts is reduced,
there may be DAMOS quotas referencing the targets that will be destroyed.

Applying the scheme action for such DAMOS scheme will be skipped forever
looking for the starting part of the region for the destroyed monitoring
target.

To fix this issue, when the monitoring target is destroyed, reset the
starting part for all DAMOS quotas that reference the target.

Link: https://lkml.kernel.org/r/20250517141852.142802-1-akinobu.mita@gmail.com
Fixes: da87878010 ("mm/damon/sysfs: support online inputs update")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:09 -07:00
Shakeel Butt
3ac4638a73 memcg: make memcg_rstat_updated nmi safe
Currently kernel maintains memory related stats updates per-cgroup to
optimize stats flushing.  The stats_updates is defined as atomic64_t which
is not nmi-safe on some archs.  Actually we don't really need 64bit atomic
as the max value stats_updates can get should be less than nr_cpus *
MEMCG_CHARGE_BATCH.  A normal atomic_t should suffice.

Also the function cgroup_rstat_updated() is still not nmi-safe but there
is parallel effort to make it nmi-safe, so until then let's ignore it in
the nmi context.

Link: https://lkml.kernel.org/r/20250519063142.111219-6-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:09 -07:00
Shakeel Butt
15ca4fa904 memcg: nmi-safe slab stats updates
The objcg based kmem [un]charging can be called in nmi context and it may
need to update NR_SLAB_[UN]RECLAIMABLE_B stats.  So, let's correctly
handle the updates of these stats in the nmi context.

Link: https://lkml.kernel.org/r/20250519063142.111219-5-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:09 -07:00
Shakeel Butt
9d3edf96ce memcg: add nmi-safe update for MEMCG_KMEM
The objcg based kmem charging and uncharging code path needs to update
MEMCG_KMEM appropriately.  Let's add support to update MEMCG_KMEM in
nmi-safe way for those code paths.

Link: https://lkml.kernel.org/r/20250519063142.111219-4-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:09 -07:00
Shakeel Butt
940b01fc8d memcg: nmi safe memcg stats for specific archs
There are archs which have NMI but does not support this_cpu_* ops safely
in the nmi context but they support safe atomic ops in nmi context.  For
such archs, let's add infra to use atomic ops for the memcg stats which
can be updated in nmi.

At the moment, the memcg stats which get updated in the objcg charging
path are MEMCG_KMEM, NR_SLAB_RECLAIMABLE_B & NR_SLAB_UNRECLAIMABLE_B. 
Rather than adding support for all memcg stats to be nmi safe, let's just
add infra to make these three stats nmi safe which this patch is doing.

Link: https://lkml.kernel.org/r/20250519063142.111219-3-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:08 -07:00
Shakeel Butt
25352d2f2d memcg: disable kmem charging in nmi for unsupported arch
Patch series "memcg: nmi-safe kmem charging", v4.

Users can attached their BPF programs at arbitrary execution points in the
kernel and such BPF programs may run in nmi context.  In addition, these
programs can trigger memcg charged kernel allocations in the nmi context. 
However memcg charging infra for kernel memory is not equipped to handle
nmi context for all architectures.

This series removes the hurdles to enable kmem charging in the nmi context
for most of the archs.  For archs without CONFIG_HAVE_NMI, this series is
a noop.  For archs with NMI support and have
CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS, the previous work to make memcg
stats re-entrant is sufficient for allowing kmem charging in nmi context. 
For archs with NMI support but without
CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS and with ARCH_HAVE_NMI_SAFE_CMPXCHG,
this series added infra to support kmem charging in nmi context.  Lastly
those archs with NMI support but without
CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS and ARCH_HAVE_NMI_SAFE_CMPXCHG, kmem
charging in nmi context is not supported at all.

Mostly used archs have support for CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
and this series should be almost a noop (other than making
memcg_rstat_updated nmi safe) for such archs.  


This patch (of 5):

The memcg accounting and stats uses this_cpu* and atomic* ops.  There are
archs which define CONFIG_HAVE_NMI but does not define
CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS and ARCH_HAVE_NMI_SAFE_CMPXCHG, so
memcg accounting for such archs in nmi context is not possible to support.
Let's just disable memcg accounting in nmi context for such archs.

Link: https://lkml.kernel.org/r/20250519063142.111219-1-shakeel.butt@linux.dev
Link: https://lkml.kernel.org/r/20250519063142.111219-2-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:08 -07:00
Matthew Wilcox (Oracle)
acc53a0b4c mm: rename page->index to page->__folio_index
All users of page->index have been converted to not refer to it any more. 
Update a few pieces of documentation that were missed and prevent new
users from appearing (or at least make them easy to grep for).

Link: https://lkml.kernel.org/r/20250514181508.3019795-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:06 -07:00
Linus Torvalds
7d4e49a77d - The 3 patch series "hung_task: extend blocking task stacktrace dump to
semaphore" from Lance Yang enhances the hung task detector.  The
   detector presently dumps the blocking tasks's stack when it is blocked
   on a mutex.  Lance's series extends this to semaphores.
 
 - The 2 patch series "nilfs2: improve sanity checks in dirty state
   propagation" from Wentao Liang addresses a couple of minor flaws in
   nilfs2.
 
 - The 2 patch series "scripts/gdb: Fixes related to lx_per_cpu()" from
   Illia Ostapyshyn fixes a couple of issues in the gdb scripts.
 
 - The 9 patch series "Support kdump with LUKS encryption by reusing LUKS
   volume keys" from Coiby Xu addresses a usability problem with kdump.
   When the dump device is LUKS-encrypted, the kdump kernel may not have
   the keys to the encrypted filesystem.  A full writeup of this is in the
   series [0/N] cover letter.
 
 - The 2 patch series "sysfs: add counters for lockups and stalls" from
   Max Kellermann adds /sys/kernel/hardlockup_count and
   /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count.
 
 - The 3 patch series "fork: Page operation cleanups in the fork code"
   from Pasha Tatashin implements a number of code cleanups in fork.c.
 
 - The 3 patch series "scripts/gdb/symbols: determine KASLR offset on
   s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in
   the gdb scripts.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDuCvQAKCRDdBJ7gKXxA
 jrkxAQCnFAp/uK9ckkbN4nfpJ0+OMY36C+A+dawSDtuRsIkXBAEAq3e6MNAUdg5W
 Ca0cXdgSIq1Op7ZKEA+66Km6Rfvfow8=
 =g45L
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - "hung_task: extend blocking task stacktrace dump to semaphore" from
   Lance Yang enhances the hung task detector.

   The detector presently dumps the blocking tasks's stack when it is
   blocked on a mutex. Lance's series extends this to semaphores

 - "nilfs2: improve sanity checks in dirty state propagation" from
   Wentao Liang addresses a couple of minor flaws in nilfs2

 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn
   fixes a couple of issues in the gdb scripts

 - "Support kdump with LUKS encryption by reusing LUKS volume keys" from
   Coiby Xu addresses a usability problem with kdump.

   When the dump device is LUKS-encrypted, the kdump kernel may not have
   the keys to the encrypted filesystem. A full writeup of this is in
   the series [0/N] cover letter

 - "sysfs: add counters for lockups and stalls" from Max Kellermann adds
   /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and
   /sys/kernel/rcu_stall_count

 - "fork: Page operation cleanups in the fork code" from Pasha Tatashin
   implements a number of code cleanups in fork.c

 - "scripts/gdb/symbols: determine KASLR offset on s390 during early
   boot" from Ilya Leoshkevich fixes some s390 issues in the gdb
   scripts

* tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits)
  llist: make llist_add_batch() a static inline
  delayacct: remove redundant code and adjust indentation
  squashfs: add optional full compressed block caching
  crash_dump, nvme: select CONFIGFS_FS as built-in
  scripts/gdb/symbols: determine KASLR offset on s390 during early boot
  scripts/gdb/symbols: factor out pagination_off()
  scripts/gdb/symbols: factor out get_vmlinux()
  kernel/panic.c: format kernel-doc comments
  mailmap: update and consolidate Casey Connolly's name and email
  nilfs2: remove wbc->for_reclaim handling
  fork: define a local GFP_VMAP_STACK
  fork: check charging success before zeroing stack
  fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code
  fork: clean-up ifdef logic around stack allocation
  kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
  kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
  x86/crash: make the page that stores the dm crypt keys inaccessible
  x86/crash: pass dm crypt keys to kdump kernel
  Revert "x86/mm: Remove unused __set_memory_prot()"
  crash_dump: retrieve dm crypt keys in kdump kernel
  ...
2025-05-31 19:12:53 -07:00
Linus Torvalds
00c010e130 - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox
simplifies the act of creating a pte which addresses the first page in a
   folio and reduces the amount of plumbing which architecture must
   implement to provide this.
 
 - The 8 patch series "Misc folio patches for 6.16" from Matthew Wilcox
   is a shower of largely unrelated folio infrastructure changes which
   clean things up and better prepare us for future work.
 
 - The 3 patch series "memory,x86,acpi: hotplug memory alignment
   advisement" from Gregory Price adds early-init code to prevent x86 from
   leaving physical memory unused when physical address regions are not
   aligned to memory block size.
 
 - The 2 patch series "mm/compaction: allow more aggressive proactive
   compaction" from Michal Clapinski provides some tuning of the (sadly,
   hard-coded (more sadly, not auto-tuned)) thresholds for our invokation
   of proactive compaction.  In a simple test case, the reduction of a guest
   VM's memory consumption was dramatic.
 
 - The 8 patch series "Minor cleanups and improvements to swap freeing
   code" from Kemeng Shi provides some code cleaups and a small efficiency
   improvement to this part of our swap handling code.
 
 - The 6 patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API"
   from Dmitry Levin adds the ability for a ptracer to modify syscalls
   arguments.  At this time we can alter only "system call information that
   are used by strace system call tampering, namely, syscall number,
   syscall arguments, and syscall return value.
 
   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.
 
 - The 3 patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report
   guard regions" from Andrei Vagin extends the info returned by the
   PAGEMAP_SCAN ioctl against /proc/pid/pagemap.  This permits CRIU to more
   efficiently get at the info about guard regions.
 
 - The 2 patch series "Fix parameter passed to page_mapcount_is_type()"
   from Gavin Shan implements that fix.  No runtime effect is expected
   because validate_page_before_insert() happens to fix up this error.
 
 - The 3 patch series "kernel/events/uprobes: uprobe_write_opcode()
   rewrite" from David Hildenbrand basically brings uprobe text poking into
   the current decade.  Remove a bunch of hand-rolled implementation in
   favor of using more current facilities.
 
 - The 3 patch series "mm/ptdump: Drop assumption that pxd_val() is u64"
   from Anshuman Khandual provides enhancements and generalizations to the
   pte dumping code.  This might be needed when 128-bit Page Table
   Descriptors are enabled for ARM.
 
 - The 12 patch series "Always call constructor for kernel page tables"
   from Kevin Brodsky "ensures that the ctor/dtor is always called for
   kernel pgtables, as it already is for user pgtables".  This permits the
   addition of more functionality such as "insert hooks to protect page
   tables".  This change does result in various architectures performing
   unnecesary work, but this is fixed up where it is anticipated to occur.
 
 - The 9 patch series "Rust support for mm_struct, vm_area_struct, and
   mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM
   structures.
 
 - The 3 patch series "fix incorrectly disallowed anonymous VMA merges"
   from Lorenzo Stoakes takes advantage of some VMA merging opportunities
   which we've been missing for 15 years.
 
 - The 4 patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED
   and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB
   flushing.  Instead of flushing each address range in the provided iovec,
   we batch the flushing across all the iovec entries.  The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.
 
 - The 6 patch series "Track node vacancy to reduce worst case allocation
   counts" from Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.  stress-ng mmap performance increased by single-digit
   percentages and the amount of unnecessarily preallocated memory was
   dramaticelly reduced.
 
 - The 3 patch series "mm/gup: Minor fix, cleanup and improvements" from
   Baoquan He removes a few unnecessary things which Baoquan noted when
   reading the code.
 
 - The 3 patch series ""Enhance sysfs handling for memory hotplug in
   weighted interleave" from Rakie Kim "enhances the weighted interleave
   policy in the memory management subsystem by improving sysfs handling,
   fixing memory leaks, and introducing dynamic sysfs updates for memory
   hotplug support".  Fixes things on error paths which we are unlikely to
   hit.
 
 - The 7 patch series "mm/damon: auto-tune DAMOS for NUMA setups
   including tiered memory" from SeongJae Park introduces new DAMOS quota
   goal metrics which eliminate the manual tuning which is required when
   utilizing DAMON for memory tiering.
 
 - The 5 patch series "mm/vmalloc.c: code cleanup and improvements" from
   Baoquan He provides cleanups and small efficiency improvements which
   Baoquan found via code inspection.
 
 - The 2 patch series "vmscan: enforce mems_effective during demotion"
   from Gregory Price "changes reclaim to respect cpuset.mems_effective
   during demotion when possible".  because "presently, reclaim explicitly
   ignores cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated." "This is useful for isolating workloads on a
   multi-tenant system from certain classes of memory more consistently."
 
 - The 2 patch series ""Clean up split_huge_pmd_locked() and remove
   unnecessary folio pointers" from Gavin Guo provides minor cleanups and
   efficiency gains in in the huge page splitting and migrating code.
 
 - The 3 patch series "Use kmem_cache for memcg alloc" from Huan Yang
   creates a slab cache for `struct mem_cgroup', yielding improved memory
   utilization.
 
 - The 4 patch series "add max arg to swappiness in memory.reclaim and
   lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness="
   argument for memory.reclaim MGLRU's lru_gen.  This directs proactive
   reclaim to reclaim from only anon folios rather than file-backed folios.
 
 - The 17 patch series "kexec: introduce Kexec HandOver (KHO)" from Mike
   Rapoport is the first step on the path to permitting the kernel to
   maintain existing VMs while replacing the host kernel via file-based
   kexec.  At this time only memblock's reserve_mem is preserved.
 
 - The 7 patch series "mm: Introduce for_each_valid_pfn()" from David
   Woodhouse provides and uses a smarter way of looping over a pfn range.
   By skipping ranges of invalid pfns.
 
 - The 2 patch series "sched/numa: Skip VMA scanning on memory pinned to
   one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless
   VMA scanning when a task is pinned a single NUMA mode.  Dramatic
   performance benefits were seen in some real world cases.
 
 - The 2 patch series "JFS: Implement migrate_folio for
   jfs_metapage_aops" from Shivank Garg addresses a warning which occurs
   during memory compaction when using JFS.
 
 - The 4 patch series "move all VMA allocation, freeing and duplication
   logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c
   into the more appropriate mm/vma.c.
 
 - The 6 patch series "mm, swap: clean up swap cache mapping helper" from
   Kairui Song provides code consolidation and cleanups related to the
   folio_index() function.
 
 - The 2 patch series "mm/gup: Cleanup memfd_pin_folios()" from Vishal
   Moola does that.
 
 - The 8 patch series "memcg: Fix test_memcg_min/low test failures" from
   Waiman Long addresses some bogus failures which are being reported by
   the test_memcontrol selftest.
 
 - The 3 patch series "eliminate mmap() retry merge, add .mmap_prepare
   hook" from Lorenzo Stoakes commences the deprecation of
   file_operations.mmap() in favor of the new
   file_operations.mmap_prepare().  The latter is more restrictive and
   prevents drivers from messing with things in ways which, amongst other
   problems, may defeat VMA merging.
 
 - The 4 patch series "memcg: decouple memcg and objcg stocks"" from
   Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's
   one.  This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.
 
 - The 6 patch series "mm/damon: minor fixups and improvements for code,
   tests, and documents" from SeongJae Park is "yet another batch of
   miscellaneous DAMON changes.  Fix and improve minor problems in code,
   tests and documents."
 
 - The 7 patch series "memcg: make memcg stats irq safe" from Shakeel
   Butt converts memcg stats to be irq safe.  Another step along the way to
   making memcg charging and stats updates NMI-safe, a BPF requirement.
 
 - The 4 patch series "Let unmap_hugepage_range() and several related
   functions take folio instead of page" from Fan Ni provides folio
   conversions in the hugetlb code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDt5qgAKCRDdBJ7gKXxA
 ju6XAP9nTiSfRz8Cz1n5LJZpFKEGzLpSihCYyR6P3o1L9oe3mwEAlZ5+XAwk2I5x
 Qqb/UGMEpilyre1PayQqOnct3aSL9Ao=
 =tYYm
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
   creating a pte which addresses the first page in a folio and reduces
   the amount of plumbing which architecture must implement to provide
   this.

 - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
   largely unrelated folio infrastructure changes which clean things up
   and better prepare us for future work.

 - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
   Price adds early-init code to prevent x86 from leaving physical
   memory unused when physical address regions are not aligned to memory
   block size.

 - "mm/compaction: allow more aggressive proactive compaction" from
   Michal Clapinski provides some tuning of the (sadly, hard-coded (more
   sadly, not auto-tuned)) thresholds for our invokation of proactive
   compaction. In a simple test case, the reduction of a guest VM's
   memory consumption was dramatic.

 - "Minor cleanups and improvements to swap freeing code" from Kemeng
   Shi provides some code cleaups and a small efficiency improvement to
   this part of our swap handling code.

 - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
   adds the ability for a ptracer to modify syscalls arguments. At this
   time we can alter only "system call information that are used by
   strace system call tampering, namely, syscall number, syscall
   arguments, and syscall return value.

   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.

 - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
   Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
   against /proc/pid/pagemap. This permits CRIU to more efficiently get
   at the info about guard regions.

 - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
   implements that fix. No runtime effect is expected because
   validate_page_before_insert() happens to fix up this error.

 - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
   Hildenbrand basically brings uprobe text poking into the current
   decade. Remove a bunch of hand-rolled implementation in favor of
   using more current facilities.

 - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
   Khandual provides enhancements and generalizations to the pte dumping
   code. This might be needed when 128-bit Page Table Descriptors are
   enabled for ARM.

 - "Always call constructor for kernel page tables" from Kevin Brodsky
   ensures that the ctor/dtor is always called for kernel pgtables, as
   it already is for user pgtables.

   This permits the addition of more functionality such as "insert hooks
   to protect page tables". This change does result in various
   architectures performing unnecesary work, but this is fixed up where
   it is anticipated to occur.

 - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
   Ryhl adds plumbing to permit Rust access to core MM structures.

 - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
   Stoakes takes advantage of some VMA merging opportunities which we've
   been missing for 15 years.

 - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
   SeongJae Park optimizes process_madvise()'s TLB flushing.

   Instead of flushing each address range in the provided iovec, we
   batch the flushing across all the iovec entries. The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.

 - "Track node vacancy to reduce worst case allocation counts" from
   Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.

   stress-ng mmap performance increased by single-digit percentages and
   the amount of unnecessarily preallocated memory was dramaticelly
   reduced.

 - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
   a few unnecessary things which Baoquan noted when reading the code.

 - ""Enhance sysfs handling for memory hotplug in weighted interleave"
   from Rakie Kim "enhances the weighted interleave policy in the memory
   management subsystem by improving sysfs handling, fixing memory
   leaks, and introducing dynamic sysfs updates for memory hotplug
   support". Fixes things on error paths which we are unlikely to hit.

 - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
   from SeongJae Park introduces new DAMOS quota goal metrics which
   eliminate the manual tuning which is required when utilizing DAMON
   for memory tiering.

 - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
   provides cleanups and small efficiency improvements which Baoquan
   found via code inspection.

 - "vmscan: enforce mems_effective during demotion" from Gregory Price
   changes reclaim to respect cpuset.mems_effective during demotion when
   possible. because presently, reclaim explicitly ignores
   cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated.

   This is useful for isolating workloads on a multi-tenant system from
   certain classes of memory more consistently.

 - "Clean up split_huge_pmd_locked() and remove unnecessary folio
   pointers" from Gavin Guo provides minor cleanups and efficiency gains
   in in the huge page splitting and migrating code.

 - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
   for `struct mem_cgroup', yielding improved memory utilization.

 - "add max arg to swappiness in memory.reclaim and lru_gen" from
   Zhongkun He adds a new "max" argument to the "swappiness=" argument
   for memory.reclaim MGLRU's lru_gen.

   This directs proactive reclaim to reclaim from only anon folios
   rather than file-backed folios.

 - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
   first step on the path to permitting the kernel to maintain existing
   VMs while replacing the host kernel via file-based kexec. At this
   time only memblock's reserve_mem is preserved.

 - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
   and uses a smarter way of looping over a pfn range. By skipping
   ranges of invalid pfns.

 - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
   cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
   when a task is pinned a single NUMA mode.

   Dramatic performance benefits were seen in some real world cases.

 - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
   Garg addresses a warning which occurs during memory compaction when
   using JFS.

 - "move all VMA allocation, freeing and duplication logic to mm" from
   Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
   appropriate mm/vma.c.

 - "mm, swap: clean up swap cache mapping helper" from Kairui Song
   provides code consolidation and cleanups related to the folio_index()
   function.

 - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.

 - "memcg: Fix test_memcg_min/low test failures" from Waiman Long
   addresses some bogus failures which are being reported by the
   test_memcontrol selftest.

 - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
   Stoakes commences the deprecation of file_operations.mmap() in favor
   of the new file_operations.mmap_prepare().

   The latter is more restrictive and prevents drivers from messing with
   things in ways which, amongst other problems, may defeat VMA merging.

 - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
   the per-cpu memcg charge cache from the objcg's one.

   This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.

 - "mm/damon: minor fixups and improvements for code, tests, and
   documents" from SeongJae Park is yet another batch of miscellaneous
   DAMON changes. Fix and improve minor problems in code, tests and
   documents.

 - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
   stats to be irq safe. Another step along the way to making memcg
   charging and stats updates NMI-safe, a BPF requirement.

 - "Let unmap_hugepage_range() and several related functions take folio
   instead of page" from Fan Ni provides folio conversions in the
   hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
  mm: pcp: increase pcp->free_count threshold to trigger free_high
  mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
  mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: pass folio instead of page to unmap_ref_private()
  memcg: objcg stock trylock without irq disabling
  memcg: no stock lock for cpu hot-unplug
  memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  memcg: make count_memcg_events re-entrant safe against irqs
  memcg: make mod_memcg_state re-entrant safe against irqs
  memcg: move preempt disable to callers of memcg_rstat_updated
  memcg: memcg_rstat_updated re-entrant safe against irqs
  mm: khugepaged: decouple SHMEM and file folios' collapse
  selftests/eventfd: correct test name and improve messages
  alloc_tag: check mem_profiling_support in alloc_tag_init
  Docs/damon: update titles and brief introductions to explain DAMOS
  selftests/damon/_damon_sysfs: read tried regions directories in order
  mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
  mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
  mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
  ...
2025-05-31 15:44:16 -07:00
Linus Torvalds
dee264c16a require gcc-8 and binutils-2.30
x86 already uses gcc-8 as the minimum version, this changes all other
 architectures to the same version. gcc-8 is used is Debian 10 and Red
 Hat Enterprise Linux 8, both of which are still supported, and binutils
 2.30 is the oldest corresponding version on those. Ubuntu Pro 18.04 and
 SUSE Linux Enterprise Server 15 both use gcc-7 as the system compiler
 but additionally include toolchains that remain supported.
 
 With the new minimum toolchain versions, a number of workarounds for older
 versions can be dropped, in particular on x86_64 and arm64.  Importantly,
 the updated compiler version allows removing two of the five remaining
 gcc plugins, as support for sancov and structeak features is already
 included in modern compiler versions.
 
 I tried collecting the known changes that are possible based on the
 new toolchain version, but expect that more cleanups will be possible.
 Since this touches multiple architectures, I merged the patches through
 the asm-generic tree.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmg6vNMACgkQmmx57+YA
 GNkOmg/+LtR9B2P27GPBeG8HnLTZ8hKELiyYeSk6ZFgQv5hevE37HV35Yru7e7gu
 wcF6CgYr8ff4CVcHM7y0790oGew1thkqq5CklFIH0EwCDJx/mWfZR1SS2jfZIEWM
 HSDOlQQd1S8oWia14tSnQos3nW3CB9/ABVTHH+Wvl3xn48WMRvgK2LJgGLuxJrt8
 5DD9auHiLjchWB5tB4DU98IgWWgFUGMTsI6IayZ4dkF4CdWqd89h0Y3pjJYeBgHS
 mPxzR2q8WjEmG9hp7QuZQgn/pAYleJAwHvvkoLrkQ2ieqx3FjWiwFbQp4CG1Sc8L
 eBR1lnkqS2z/e7xJLfe86fOoKWWu4I0tZKhRan/0+UOGm5nXrGpqSxKS8ZDsRuAp
 3fvyhIp1cYSa7Xkok8BFhLEFR0tguXJXnXBc3tWE5VXIfFNd0Ohh1GUYhXDAqWKh
 i0jN9dSNhokM3AqBi6qZl5kmBnRA3UsIaOg3QRrqN8IlBPp+u7i5xsrJIUWvD95o
 TO06admmLcCJT8n6ZfNVfRjBgzu8+t54UVaDx9YYwxoNGOSFwqOb8CSPTWPxLmDr
 RKDUOvO8DBlP7uFz9neP+LxluA3DjurRZvb0z0AmCZ8/RXEmTMCyfP5a6esxquXt
 0Bqo6hM9q+TeXTHNS1CNvqLSWWikw+AzS/ZPPvriYFn5lxtbq6c=
 =pdDC
 -----END PGP SIGNATURE-----

Merge tag 'gcc-minimum-version-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull compiler version requirement update from Arnd Bergmann:
 "Require gcc-8 and binutils-2.30

  x86 already uses gcc-8 as the minimum version, this changes all other
  architectures to the same version. gcc-8 is used is Debian 10 and Red
  Hat Enterprise Linux 8, both of which are still supported, and
  binutils 2.30 is the oldest corresponding version on those.

  Ubuntu Pro 18.04 and SUSE Linux Enterprise Server 15 both use gcc-7 as
  the system compiler but additionally include toolchains that remain
  supported.

  With the new minimum toolchain versions, a number of workarounds for
  older versions can be dropped, in particular on x86_64 and arm64.
  Importantly, the updated compiler version allows removing two of the
  five remaining gcc plugins, as support for sancov and structeak
  features is already included in modern compiler versions.

  I tried collecting the known changes that are possible based on the
  new toolchain version, but expect that more cleanups will be possible.

  Since this touches multiple architectures, I merged the patches
  through the asm-generic tree."

* tag 'gcc-minimum-version-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  Makefile.kcov: apply needed compiler option unconditionally in CFLAGS_KCOV
  Documentation: update binutils-2.30 version reference
  gcc-plugins: remove SANCOV gcc plugin
  Kbuild: remove structleak gcc plugin
  arm64: drop binutils version checks
  raid6: skip avx512 checks
  kbuild: require gcc-8 and binutils-2.30
2025-05-31 08:16:52 -07:00
Linus Torvalds
dd91b5e1d6 RDMA v6.16 merge window pull request
Usual collection of driver fixes:
 
 - Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw
 
 - Further ODP functionality in rxe
 
 - Remote access MRs in mana, along with more page sizes
 
 - Improve CM scalability with a rwlock around the agent
 
 - More trace points for hns
 
 - ODP hmm conversion to the new two step dma API
 
 - Support the ethernet HW device in mana as well as the RNIC
 
 - Cleanups:
  * Use secs_to_jiffies() when appropriate
  * Use ERR_CAST() instead of naked casts
  * Don't use %pK in printk
  * Unusued functions removed
  * Allocation type matching
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaDm95gAKCRCFwuHvBreF
 YXJxAQCZ+p+mxt0rTeVI2j6YQ26thuvb/tH0Upu8epgdQ3T/ZgD/YOHBC6OrXWJa
 Uz6BTiyz/xiyMtJLTD4kEiG2o74J1gE=
 =DNQC
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual collection of driver fixes:

   - Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw

   - Further ODP functionality in rxe

   - Remote access MRs in mana, along with more page sizes

   - Improve CM scalability with a rwlock around the agent

   - More trace points for hns

   - ODP hmm conversion to the new two step dma API

   - Support the ethernet HW device in mana as well as the RNIC

   - Cleanups:
       - Use secs_to_jiffies() when appropriate
       - Use ERR_CAST() instead of naked casts
       - Don't use %pK in printk
       - Unusued functions removed
       - Allocation type matching"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (57 commits)
  RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work
  RDMA/bnxt_re: Support extended stats for Thor2 VF
  RDMA/hns: Fix endian issue in trace events
  RDMA/mlx5: Avoid flexible array warning
  IB/cm: Remove dead code and adjust naming
  RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices
  RDMA/rxe: Break endless pagefault loop for RO pages
  RDMA/bnxt_re: Fix return code of bnxt_re_configure_cc
  RDMA/bnxt_re: Fix missing error handling for tx_queue
  RDMA/bnxt_re: Fix incorrect display of inactivity_cp in debugfs output
  RDMA/mlx5: Add support for 200Gbps per lane speeds
  RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stage
  RDMA/iwcm: Fix use-after-free of work objects after cm_id destruction
  net: mana: Add support for auxiliary device servicing events
  RDMA/mana_ib: unify mana_ib functions to support any gdma device
  RDMA/mana_ib: Add support of mana_ib for RNIC and ETH nic
  net: mana: Probe rdma device in mana driver
  RDMA/siw: replace redundant ternary operator with just rv
  RDMA/umem: Separate implicit ODP initialization from explicit ODP
  RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
  ...
2025-05-30 10:18:56 -07:00
Huacai Chen
a24f2fb70c LoongArch: Introduce the numa_memblks conversion
Commit 8748270821 ("mm: introduce numa_memblks") has moved
numa_memblks from x86 to the generic code, but LoongArch was left out
of this conversion.

This patch introduces the generic numa_memblks for LoongArch.

In detail:
1. Enable NUMA_MEMBLKS (but disable NUMA_EMU) in Kconfig;
2. Use generic definition for numa_memblk and numa_meminfo;
3. Use generic implementation for numa_add_memblk() and its friends;
4. Use generic implementation for numa_set_distance() and its friends;
5. Use generic implementation for memory_add_physaddr_to_nid() and its
   friends.

Note: Disable NUMA_EMU because it needs more efforts and no obvious
demand now.

Tested-by: Binbin Zhou <zhoubinbin@loongson.cn>
Signed-off-by: Yuquan Wang <wangyuquan1236@phytium.com.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-05-30 21:45:43 +08:00
Linus Torvalds
1b98f357da Networking changes for 6.16.
Core
 ----
 
  - Implement the Device Memory TCP transmit path, allowing zero-copy
    data transmission on top of TCP from e.g. GPU memory to the wire.
 
  - Move all the IPv6 routing tables management outside the RTNL scope,
    under its own lock and RCU. The route control path is now 3x times
    faster.
 
  - Convert queue related netlink ops to instance lock, reducing
    again the scope of the RTNL lock. This improves the control plane
    scalability.
 
  - Refactor the software crc32c implementation, removing unneeded
    abstraction layers and improving significantly the related
    micro-benchmarks.
 
  - Optimize the GRO engine for UDP-tunneled traffic, for a 10%
    performance improvement in related stream tests.
 
  - Cover more per-CPU storage with local nested BH locking; this is a
    prep work to remove the current per-CPU lock in local_bh_disable()
    on PREMPT_RT.
 
  - Introduce and use nlmsg_payload helper, combining buffer bounds
    verification with accessing payload carried by netlink messages.
 
 Netfilter
 ---------
 
  - Rewrite the procfs conntrack table implementation, improving
    considerably the dump performance. A lot of user-space tools
    still use this interface.
 
  - Implement support for wildcard netdevice in netdev basechain
    and flowtables.
 
  - Integrate conntrack information into nft trace infrastructure.
 
  - Export set count and backend name to userspace, for better
    introspection.
 
 BPF
 ---
 
  - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
    programs and can be controlled in similar way to traditional qdiscs
    using the "tc qdisc" command.
 
  - Refactor the UDP socket iterator, addressing long standing issues
    WRT duplicate hits or missed sockets.
 
 Protocols
 ---------
 
  - Improve TCP receive buffer auto-tuning and increase the default
    upper bound for the receive buffer; overall this improves the single
    flow maximum thoughput on 200Gbs link by over 60%.
 
  - Add AFS GSSAPI security class to AF_RXRPC; it provides transport
    security for connections to the AFS fileserver and VL server.
 
  - Improve TCP multipath routing, so that the sources address always
    matches the nexthop device.
 
  - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
    and thus preventing DoS caused by passing around problematic FDs.
 
  - Retire DCCP socket. DCCP only receives updates for bugs, and major
    distros disable it by default. Its removal allows for better
    organisation of TCP fields to reduce the number of cache lines hit
    in the fast path.
 
  - Extend TCP drop-reason support to cover PAWS checks.
 
 Driver API
 ----------
 
  - Reorganize PTP ioctl flag support to require an explicit opt-in for
    the drivers, avoiding the problem of drivers not rejecting new
    unsupported flags.
 
  - Converted several device drivers to timestamping APIs.
 
  - Introduce per-PHY ethtool dump helpers, improving the support for
    dump operations targeting PHYs.
 
 Tests and tooling
 -----------------
 
  - Add support for classic netlink in user space C codegen, so that
    ynl-c can now read, create and modify links, routes addresses and
    qdisc layer configuration.
 
  - Add ynl sub-types for binary attributes, allowing ynl-c to output
    known struct instead of raw binary data, clarifying the classic
    netlink output.
 
  - Extend MPTCP selftests to improve the code-coverage.
 
  - Add tests for XDP tail adjustment in AF_XDP.
 
 New hardware / drivers
 ----------------------
 
  - OpenVPN virtual driver: offload OpenVPN data channels processing
    to the kernel-space, increasing the data transfer throughput WRT
    the user-space implementation.
 
  - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
 
  - Broadcom asp-v3.0 ethernet driver.
 
  - AMD Renoir ethernet device.
 
  - ReakTek MT9888 2.5G ethernet PHY driver.
 
  - Aeonsemi 10G C45 PHYs driver.
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox (mlx5):
      - refactor the stearing table handling to reduce significantly
        the amount of memory used
      - add support for complex matches in H/W flow steering
      - improve flow streeing error handling
      - convert to netdev instance locking
    - Intel (100G, ice, igb, ixgbe, idpf):
      - ice: add switchdev support for LLDP traffic over VF
      - ixgbe: add firmware manipulation and regions devlink support
      - igb: introduce support for frame transmission premption
      - igb: adds persistent NAPI configuration
      - idpf: introduce RDMA support
      - idpf: add initial PTP support
    - Meta (fbnic):
      - extend hardware stats coverage
      - add devlink dev flash support
    - Broadcom (bnxt):
      - add support for RX-side device memory TCP
    - Wangxun (txgbe):
      - implement support for udp tunnel offload
      - complete PTP and SRIOV support for AML 25G/10G devices
 
  - Ethernet NICs embedded and virtual:
    - Google (gve):
      - add device memory TCP TX support
    - Amazon (ena):
      - support persistent per-NAPI config
    - Airoha:
      - add H/W support for L2 traffic offload
      - add per flow stats for flow offloading
    - RealTek (rtl8211): add support for WoL magic packet
    - Synopsys (stmmac):
      - dwmac-socfpga 1000BaseX support
      - add Loongson-2K3000 support
      - introduce support for hardware-accelerated VLAN stripping
    - Broadcom (bcmgenet):
      - expose more H/W stats
    - Freescale (enetc, dpaa2-eth):
      - enetc: add MAC filter, VLAN filter RSS and loopback support
      - dpaa2-eth: convert to H/W timestamping APIs
    - vxlan: convert FDB table to rhashtable, for better scalabilty
    - veth: apply qdisc backpressure on full ring to reduce TX drops
 
  - Ethernet switches:
    - Microchip (kzZ88x3): add ETS scheduler support
 
  - Ethernet PHYs:
    - RealTek (rtl8211):
      - add support for WoL magic packet
      - add support for PHY LEDs
 
  - CAN:
    - Adds RZ/G3E CANFD support to the rcar_canfd driver.
    - Preparatory work for CAN-XL support.
    - Add self-tests framework with support for CAN physical interfaces.
 
  - WiFi:
    - mac80211:
      - scan improvements with multi-link operation (MLO)
    - Qualcomm (ath12k):
      - enable AHB support for IPQ5332
      - add monitor interface support to QCN9274
      - add multi-link operation support to WCN7850
      - add 802.11d scan offload support to WCN7850
      - monitor mode for WCN7850, better 6 GHz regulatory
    - Qualcomm (ath11k):
      - restore hibernation support
    - MediaTek (mt76):
      - WiFi-7 improvements
      - implement support for mt7990
    - Intel (iwlwifi):
      - enhanced multi-link single-radio (EMLSR) support on 5 GHz links
      - rework device configuration
    - RealTek (rtw88):
      - improve throughput for RTL8814AU
    - RealTek (rtw89):
      - add multi-link operation support
      - STA/P2P concurrency improvements
      - support different SAR configs by antenna
 
  - Bluetooth:
    - introduce HCI Driver protocol
    - btintel_pcie: do not generate coredump for diagnostic events
    - btusb: add HCI Drv commands for configuring altsetting
    - btusb: add RTL8851BE device 0x0bda:0xb850
    - btusb: add new VID/PID 13d3/3584 for MT7922
    - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
    - btnxpuart: implement host-wakeup feature
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmg3D64SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkcIsQAK2eEc+BxQer975wzvtMg6gF9eoex4a+
 rZ7jxfDzDtNvTauoQsrpehDZp0FnySaVGCU36lHGB2OvDnhCpPc5hXzKDWQpOuqQ
 SHrGG3/6FTbdTG/HfHUcbNyrUzIf53SADSObiQ3qg4gyEQ3sCpcOKtVtMcU8rvsY
 /HqMnsJWFaROUMjMtCcnUSgjmeY9kBvha3sTXUqgeRugEOCvZD7z4rpqFIcQqHw7
 e2Fi8dwIXEYNxqPp6MRq2qdyUTewCRruE8ZIMAFuhtfYeMElUZMPlqlMENX3AzTQ
 cr0EgwcFOUxRA7oZRxhoBNBsVXavtSpQr4ZDoWplxP4aQ37n5tc1E9Q72axpB/Og
 FbJRl6GvWYnCd8071BczgmfHlKaTAigPvt2Z4r6JjM5I/Bij/IZ3k+On1OTuOAj/
 EqfFkdZ0a5cfKrwUMP+oSGtSAywkMVUtnIKJlZeRbjSj2432sCfe2jVAlS8ELM43
 3LUgXYrAKtA87g171LlsRu5EEpI5QmqPb+i5LpPlEXe2TJEgPisyfecJ3NafF/2+
 j575lm+TFNm9NTNhGGjDPEvw0djI5wSGGMe9J4gC74eWi6s5t6C4cuUf84TKWdwR
 x+9H0IB7rfFncAwXHJuUUtzd+fPHaYzs5dDGbSgMQOXr1cr1wlubCK8mQ1r/Wt/a
 3GjFIOQKW2Q5
 =t/Tz
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Implement the Device Memory TCP transmit path, allowing zero-copy
     data transmission on top of TCP from e.g. GPU memory to the wire.

   - Move all the IPv6 routing tables management outside the RTNL scope,
     under its own lock and RCU. The route control path is now 3x times
     faster.

   - Convert queue related netlink ops to instance lock, reducing again
     the scope of the RTNL lock. This improves the control plane
     scalability.

   - Refactor the software crc32c implementation, removing unneeded
     abstraction layers and improving significantly the related
     micro-benchmarks.

   - Optimize the GRO engine for UDP-tunneled traffic, for a 10%
     performance improvement in related stream tests.

   - Cover more per-CPU storage with local nested BH locking; this is a
     prep work to remove the current per-CPU lock in local_bh_disable()
     on PREMPT_RT.

   - Introduce and use nlmsg_payload helper, combining buffer bounds
     verification with accessing payload carried by netlink messages.

  Netfilter:

   - Rewrite the procfs conntrack table implementation, improving
     considerably the dump performance. A lot of user-space tools still
     use this interface.

   - Implement support for wildcard netdevice in netdev basechain and
     flowtables.

   - Integrate conntrack information into nft trace infrastructure.

   - Export set count and backend name to userspace, for better
     introspection.

  BPF:

   - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
     programs and can be controlled in similar way to traditional qdiscs
     using the "tc qdisc" command.

   - Refactor the UDP socket iterator, addressing long standing issues
     WRT duplicate hits or missed sockets.

  Protocols:

   - Improve TCP receive buffer auto-tuning and increase the default
     upper bound for the receive buffer; overall this improves the
     single flow maximum thoughput on 200Gbs link by over 60%.

   - Add AFS GSSAPI security class to AF_RXRPC; it provides transport
     security for connections to the AFS fileserver and VL server.

   - Improve TCP multipath routing, so that the sources address always
     matches the nexthop device.

   - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
     and thus preventing DoS caused by passing around problematic FDs.

   - Retire DCCP socket. DCCP only receives updates for bugs, and major
     distros disable it by default. Its removal allows for better
     organisation of TCP fields to reduce the number of cache lines hit
     in the fast path.

   - Extend TCP drop-reason support to cover PAWS checks.

  Driver API:

   - Reorganize PTP ioctl flag support to require an explicit opt-in for
     the drivers, avoiding the problem of drivers not rejecting new
     unsupported flags.

   - Converted several device drivers to timestamping APIs.

   - Introduce per-PHY ethtool dump helpers, improving the support for
     dump operations targeting PHYs.

  Tests and tooling:

   - Add support for classic netlink in user space C codegen, so that
     ynl-c can now read, create and modify links, routes addresses and
     qdisc layer configuration.

   - Add ynl sub-types for binary attributes, allowing ynl-c to output
     known struct instead of raw binary data, clarifying the classic
     netlink output.

   - Extend MPTCP selftests to improve the code-coverage.

   - Add tests for XDP tail adjustment in AF_XDP.

  New hardware / drivers:

   - OpenVPN virtual driver: offload OpenVPN data channels processing to
     the kernel-space, increasing the data transfer throughput WRT the
     user-space implementation.

   - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.

   - Broadcom asp-v3.0 ethernet driver.

   - AMD Renoir ethernet device.

   - ReakTek MT9888 2.5G ethernet PHY driver.

   - Aeonsemi 10G C45 PHYs driver.

  Drivers:

   - Ethernet high-speed NICs:
       - nVidia/Mellanox (mlx5):
           - refactor the steering table handling to significantly
             reduce the amount of memory used
           - add support for complex matches in H/W flow steering
           - improve flow streeing error handling
           - convert to netdev instance locking
       - Intel (100G, ice, igb, ixgbe, idpf):
           - ice: add switchdev support for LLDP traffic over VF
           - ixgbe: add firmware manipulation and regions devlink support
           - igb: introduce support for frame transmission premption
           - igb: adds persistent NAPI configuration
           - idpf: introduce RDMA support
           - idpf: add initial PTP support
       - Meta (fbnic):
           - extend hardware stats coverage
           - add devlink dev flash support
       - Broadcom (bnxt):
           - add support for RX-side device memory TCP
       - Wangxun (txgbe):
           - implement support for udp tunnel offload
           - complete PTP and SRIOV support for AML 25G/10G devices

   - Ethernet NICs embedded and virtual:
       - Google (gve):
           - add device memory TCP TX support
       - Amazon (ena):
           - support persistent per-NAPI config
       - Airoha:
           - add H/W support for L2 traffic offload
           - add per flow stats for flow offloading
       - RealTek (rtl8211): add support for WoL magic packet
       - Synopsys (stmmac):
           - dwmac-socfpga 1000BaseX support
           - add Loongson-2K3000 support
           - introduce support for hardware-accelerated VLAN stripping
       - Broadcom (bcmgenet):
           - expose more H/W stats
       - Freescale (enetc, dpaa2-eth):
           - enetc: add MAC filter, VLAN filter RSS and loopback support
           - dpaa2-eth: convert to H/W timestamping APIs
       - vxlan: convert FDB table to rhashtable, for better scalabilty
       - veth: apply qdisc backpressure on full ring to reduce TX drops

   - Ethernet switches:
       - Microchip (kzZ88x3): add ETS scheduler support

   - Ethernet PHYs:
       - RealTek (rtl8211):
           - add support for WoL magic packet
           - add support for PHY LEDs

   - CAN:
       - Adds RZ/G3E CANFD support to the rcar_canfd driver.
       - Preparatory work for CAN-XL support.
       - Add self-tests framework with support for CAN physical interfaces.

   - WiFi:
       - mac80211:
           - scan improvements with multi-link operation (MLO)
       - Qualcomm (ath12k):
           - enable AHB support for IPQ5332
           - add monitor interface support to QCN9274
           - add multi-link operation support to WCN7850
           - add 802.11d scan offload support to WCN7850
           - monitor mode for WCN7850, better 6 GHz regulatory
       - Qualcomm (ath11k):
           - restore hibernation support
       - MediaTek (mt76):
           - WiFi-7 improvements
           - implement support for mt7990
       - Intel (iwlwifi):
           - enhanced multi-link single-radio (EMLSR) support on 5 GHz links
           - rework device configuration
       - RealTek (rtw88):
           - improve throughput for RTL8814AU
       - RealTek (rtw89):
           - add multi-link operation support
           - STA/P2P concurrency improvements
           - support different SAR configs by antenna

   - Bluetooth:
       - introduce HCI Driver protocol
       - btintel_pcie: do not generate coredump for diagnostic events
       - btusb: add HCI Drv commands for configuring altsetting
       - btusb: add RTL8851BE device 0x0bda:0xb850
       - btusb: add new VID/PID 13d3/3584 for MT7922
       - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
       - btnxpuart: implement host-wakeup feature"

* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
  selftests/bpf: Fix bpf selftest build warning
  selftests: netfilter: Fix skip of wildcard interface test
  net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
  net: openvswitch: Fix the dead loop of MPLS parse
  calipso: Don't call calipso functions for AF_INET sk.
  selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
  net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
  octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
  octeontx2-pf: QOS: Perform cache sync on send queue teardown
  net: mana: Add support for Multi Vports on Bare metal
  net: devmem: ncdevmem: remove unused variable
  net: devmem: ksft: upgrade rx test to send 1K data
  net: devmem: ksft: add 5 tuple FS support
  net: devmem: ksft: add exit_wait to make rx test pass
  net: devmem: ksft: add ipv4 support
  net: devmem: preserve sockc_err
  page_pool: fix ugly page_pool formatting
  net: devmem: move list_add to net_devmem_bind_dmabuf.
  selftests: netfilter: nft_queue.sh: include file transfer duration in log message
  net: phy: mscc: Fix memory leak when using one step timestamping
  ...
2025-05-28 15:24:36 -07:00
Linus Torvalds
47cf96fbe3 arm64 updates for 6.16
ACPI, EFI and PSCI:
  - Decouple Arm's "Software Delegated Exception Interface" (SDEI)
    support from the ACPI GHES code so that it can be used by platforms
    booted with device-tree.
 
  - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
    runtime calls.
 
  - Fix a node refcount imbalance in the PSCI device-tree code.
 
 CPU Features:
  - Ensure register sanitisation is applied to fields in ID_AA64MMFR4.
 
  - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests
    can reliably query the underlying CPU types from the VMM.
 
  - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
    to our context-switching, signal handling and ptrace code.
 
 Entry code:
  - Hook up TIF_NEED_RESCHED_LAZY so that CONFIG_PREEMPT_LAZY can be
    selected.
 
 Memory management:
  - Prevent BSS exports from being used by the early PI code.
 
  - Propagate level and stride information to the low-level TLB
    invalidation routines when operating on hugetlb entries.
 
  - Use the page-table contiguous hint for vmap() mappings with
    VM_ALLOW_HUGE_VMAP where possible.
 
  - Optimise vmalloc()/vmap() page-table updates to use "lazy MMU mode"
    and hook this up on arm64 so that the trailing DSB (used to publish
    the updates to the hardware walker) can be deferred until the end of
    the mapping operation.
 
  - Extend mmap() randomisation for 52-bit virtual addresses (on par with
    48-bit addressing) and remove limited support for randomisation of
    the linear map.
 
 Perf and PMUs:
  - Add support for probing the CMN-S3 driver using ACPI.
 
  - Minor driver fixes to the CMN, Arm-NI and amlogic PMU drivers.
 
 Selftests:
  - Fix FPSIMD and SME tests to align with the freshly re-enabled SME
    support.
 
  - Fix default setting of the OUTPUT variable so that tests are
    installed in the right location.
 
 vDSO:
  - Replace raw counter access from inline assembly code with a call to
    the the __arch_counter_get_cntvct() helper function.
 
 Miscellaneous:
  - Add some missing header inclusions to the CCA headers.
 
  - Rework rendering of /proc/cpuinfo to follow the x86-approach and
    avoid repeated buffer expansion (the user-visible format remains
    identical).
 
  - Remove redundant selection of CONFIG_CRC32
 
  - Extend early error message when failing to map the device-tree blob.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmg1uTgQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNFv2CAC9S5OW0btOAo7V/LFBpLhJM3hdIV6Sn6N1
 d/K5znuqPBG6VPfBrshaZltEl/C3U8KG4H8xrlX5cSo7CRuf3DgVBw3kiZ6ERZj6
 1gnKR54juA1oWhcroPl0s76ETWj3N4gO036u2qOhWNAYflDunh1+bCIGJkG4H/yP
 wqtWn974YUbad/zQJSbG3IMO1yvxZ/PsNpVF8HjyQ0/ZPWsYTscrhNQ0hWro17sR
 CTcUaGxH4GrXW24EGNgkLB9aq67X2rtGGtaIlp5oFl8FuLklc7TYbPwJp8cPCTNm
 0Sp0mpuR9M675pYIKoCI9m5twc46znRIKmbXi5LvPd77418y3jTf
 =03N4
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "The headline feature is the re-enablement of support for Arm's
  Scalable Matrix Extension (SME) thanks to a bumper crop of fixes
  from Mark Rutland.

  If matrices aren't your thing, then Ryan's page-table optimisation
  work is much more interesting.

  Summary:

  ACPI, EFI and PSCI:

   - Decouple Arm's "Software Delegated Exception Interface" (SDEI)
     support from the ACPI GHES code so that it can be used by platforms
     booted with device-tree

   - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
     runtime calls

   - Fix a node refcount imbalance in the PSCI device-tree code

  CPU Features:

   - Ensure register sanitisation is applied to fields in ID_AA64MMFR4

   - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM
     guests can reliably query the underlying CPU types from the VMM

   - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
     to our context-switching, signal handling and ptrace code

  Entry code:

   - Hook up TIF_NEED_RESCHED_LAZY so that CONFIG_PREEMPT_LAZY can be
     selected

  Memory management:

   - Prevent BSS exports from being used by the early PI code

   - Propagate level and stride information to the low-level TLB
     invalidation routines when operating on hugetlb entries

   - Use the page-table contiguous hint for vmap() mappings with
     VM_ALLOW_HUGE_VMAP where possible

   - Optimise vmalloc()/vmap() page-table updates to use "lazy MMU mode"
     and hook this up on arm64 so that the trailing DSB (used to publish
     the updates to the hardware walker) can be deferred until the end
     of the mapping operation

   - Extend mmap() randomisation for 52-bit virtual addresses (on par
     with 48-bit addressing) and remove limited support for
     randomisation of the linear map

  Perf and PMUs:

   - Add support for probing the CMN-S3 driver using ACPI

   - Minor driver fixes to the CMN, Arm-NI and amlogic PMU drivers

  Selftests:

   - Fix FPSIMD and SME tests to align with the freshly re-enabled SME
     support

   - Fix default setting of the OUTPUT variable so that tests are
     installed in the right location

  vDSO:

   - Replace raw counter access from inline assembly code with a call to
     the the __arch_counter_get_cntvct() helper function

  Miscellaneous:

   - Add some missing header inclusions to the CCA headers

   - Rework rendering of /proc/cpuinfo to follow the x86-approach and
     avoid repeated buffer expansion (the user-visible format remains
     identical)

   - Remove redundant selection of CONFIG_CRC32

   - Extend early error message when failing to map the device-tree
     blob"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (83 commits)
  arm64: cputype: Add cputype definition for HIP12
  arm64: el2_setup.h: Make __init_el2_fgt labels consistent, again
  perf/arm-cmn: Add CMN S3 ACPI binding
  arm64/boot: Disallow BSS exports to startup code
  arm64/boot: Move global CPU override variables out of BSS
  arm64/boot: Move init_pgdir[] and init_idmap_pgdir[] into __pi_ namespace
  perf/arm-cmn: Initialise cmn->cpu earlier
  kselftest/arm64: Set default OUTPUT path when undefined
  arm64: Update comment regarding values in __boot_cpu_mode
  arm64: mm: Drop redundant check in pmd_trans_huge()
  arm64/mm: Re-organise setting up FEAT_S1PIE registers PIRE0_EL1 and PIR_EL1
  arm64/mm: Permit lazy_mmu_mode to be nested
  arm64/mm: Disable barrier batching in interrupt contexts
  arm64/cpuinfo: only show one cpu's info in c_show()
  arm64/mm: Batch barriers when updating kernel mappings
  mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
  arm64/mm: Support huge pte-mapped pages in vmap
  mm/vmalloc: Gracefully unmap huge ptes
  mm/vmalloc: Warn on improper use of vunmap_range()
  arm64/mm: Hoist barriers out of set_ptes_anysz() loop
  ...
2025-05-28 14:55:35 -07:00