2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

1343 Commits

Author SHA1 Message Date
Linus Torvalds
d900c4ce63 execve updates for v6.17
- Introduce regular REGSET note macros arch-wide (Dave Martin)
 
 - Remove arbitrary 4K limitation of program header size (Yin Fengwei)
 
 - Reorder function qualifiers for copy_clone_args_from_user() (Dishank Jogi)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaIVKiAAKCRA2KwveOeQk
 u4zBAP4zUNj2+XyixVPXCzv+Hkle6zWs7yrzdA2yLxe8Qtwj5AD+N2I6MUGcCFGW
 W+uWxlWTtGLDqh1CplIUqTlxMi39Og4=
 =vYnE
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve updates from Kees Cook:

 - Introduce regular REGSET note macros arch-wide (Dave Martin)

 - Remove arbitrary 4K limitation of program header size (Yin Fengwei)

 - Reorder function qualifiers for copy_clone_args_from_user() (Dishank Jogi)

* tag 'execve-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (25 commits)
  fork: reorder function qualifiers for copy_clone_args_from_user
  binfmt_elf: remove the 4k limitation of program header size
  binfmt_elf: Warn on missing or suspicious regset note names
  xtensa: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  um: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  x86/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  sparc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  sh: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  s390/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  riscv: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  powerpc/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  parisc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  openrisc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  nios2: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  MIPS: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  m68k: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  LoongArch: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  hexagon: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  csky: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  arm64: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  ...
2025-07-28 17:11:40 -07:00
Dishank Jogi
7f71195c15 fork: reorder function qualifiers for copy_clone_args_from_user
Change the order of function qualifiers from 'noinline static' to 'static noinline'
in copy_clone_args_from_user for consistency with kernel coding style.

No functional change intended. The goal is to improve readability and
maintain consistent ordering of qualifiers across the codebase.

Signed-off-by: Dishank Jogi <dishank.jogi@siqol.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250716093525.449994-1-dishank.jogi@siqol.com
Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-17 16:37:05 -07:00
Al Viro
a683a5b2ba
fold fs_struct->{lock,seq} into a seqlock
The combination of spinlock_t lock and seqcount_spinlock_t seq
in struct fs_struct is an open-coded seqlock_t (see linux/seqlock_types.h).
	Combine and switch to equivalent seqlock_t primitives.  AFAICS,
that does end up with the same sequence of underlying operations in all
cases.
	While we are at it, get_fs_pwd() is open-coded verbatim in
get_path_from_fd(); rather than applying conversion to it, replace with
the call of get_fs_pwd() there.  Not worth splitting the commit for that,
IMO...

	A bit of historical background - conversion of seqlock_t to
use of seqcount_spinlock_t happened several months after the same
had been done to struct fs_struct; switching fs_struct to seqlock_t
could've been done immediately after that, but it looks like nobody
had gotten around to that until now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/20250702053437.GC1880847@ZenIV
Acked-by: Ahmed S. Darwish <darwi@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08 10:25:19 +02:00
Linus Torvalds
00c010e130 - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox
simplifies the act of creating a pte which addresses the first page in a
   folio and reduces the amount of plumbing which architecture must
   implement to provide this.
 
 - The 8 patch series "Misc folio patches for 6.16" from Matthew Wilcox
   is a shower of largely unrelated folio infrastructure changes which
   clean things up and better prepare us for future work.
 
 - The 3 patch series "memory,x86,acpi: hotplug memory alignment
   advisement" from Gregory Price adds early-init code to prevent x86 from
   leaving physical memory unused when physical address regions are not
   aligned to memory block size.
 
 - The 2 patch series "mm/compaction: allow more aggressive proactive
   compaction" from Michal Clapinski provides some tuning of the (sadly,
   hard-coded (more sadly, not auto-tuned)) thresholds for our invokation
   of proactive compaction.  In a simple test case, the reduction of a guest
   VM's memory consumption was dramatic.
 
 - The 8 patch series "Minor cleanups and improvements to swap freeing
   code" from Kemeng Shi provides some code cleaups and a small efficiency
   improvement to this part of our swap handling code.
 
 - The 6 patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API"
   from Dmitry Levin adds the ability for a ptracer to modify syscalls
   arguments.  At this time we can alter only "system call information that
   are used by strace system call tampering, namely, syscall number,
   syscall arguments, and syscall return value.
 
   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.
 
 - The 3 patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report
   guard regions" from Andrei Vagin extends the info returned by the
   PAGEMAP_SCAN ioctl against /proc/pid/pagemap.  This permits CRIU to more
   efficiently get at the info about guard regions.
 
 - The 2 patch series "Fix parameter passed to page_mapcount_is_type()"
   from Gavin Shan implements that fix.  No runtime effect is expected
   because validate_page_before_insert() happens to fix up this error.
 
 - The 3 patch series "kernel/events/uprobes: uprobe_write_opcode()
   rewrite" from David Hildenbrand basically brings uprobe text poking into
   the current decade.  Remove a bunch of hand-rolled implementation in
   favor of using more current facilities.
 
 - The 3 patch series "mm/ptdump: Drop assumption that pxd_val() is u64"
   from Anshuman Khandual provides enhancements and generalizations to the
   pte dumping code.  This might be needed when 128-bit Page Table
   Descriptors are enabled for ARM.
 
 - The 12 patch series "Always call constructor for kernel page tables"
   from Kevin Brodsky "ensures that the ctor/dtor is always called for
   kernel pgtables, as it already is for user pgtables".  This permits the
   addition of more functionality such as "insert hooks to protect page
   tables".  This change does result in various architectures performing
   unnecesary work, but this is fixed up where it is anticipated to occur.
 
 - The 9 patch series "Rust support for mm_struct, vm_area_struct, and
   mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM
   structures.
 
 - The 3 patch series "fix incorrectly disallowed anonymous VMA merges"
   from Lorenzo Stoakes takes advantage of some VMA merging opportunities
   which we've been missing for 15 years.
 
 - The 4 patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED
   and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB
   flushing.  Instead of flushing each address range in the provided iovec,
   we batch the flushing across all the iovec entries.  The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.
 
 - The 6 patch series "Track node vacancy to reduce worst case allocation
   counts" from Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.  stress-ng mmap performance increased by single-digit
   percentages and the amount of unnecessarily preallocated memory was
   dramaticelly reduced.
 
 - The 3 patch series "mm/gup: Minor fix, cleanup and improvements" from
   Baoquan He removes a few unnecessary things which Baoquan noted when
   reading the code.
 
 - The 3 patch series ""Enhance sysfs handling for memory hotplug in
   weighted interleave" from Rakie Kim "enhances the weighted interleave
   policy in the memory management subsystem by improving sysfs handling,
   fixing memory leaks, and introducing dynamic sysfs updates for memory
   hotplug support".  Fixes things on error paths which we are unlikely to
   hit.
 
 - The 7 patch series "mm/damon: auto-tune DAMOS for NUMA setups
   including tiered memory" from SeongJae Park introduces new DAMOS quota
   goal metrics which eliminate the manual tuning which is required when
   utilizing DAMON for memory tiering.
 
 - The 5 patch series "mm/vmalloc.c: code cleanup and improvements" from
   Baoquan He provides cleanups and small efficiency improvements which
   Baoquan found via code inspection.
 
 - The 2 patch series "vmscan: enforce mems_effective during demotion"
   from Gregory Price "changes reclaim to respect cpuset.mems_effective
   during demotion when possible".  because "presently, reclaim explicitly
   ignores cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated." "This is useful for isolating workloads on a
   multi-tenant system from certain classes of memory more consistently."
 
 - The 2 patch series ""Clean up split_huge_pmd_locked() and remove
   unnecessary folio pointers" from Gavin Guo provides minor cleanups and
   efficiency gains in in the huge page splitting and migrating code.
 
 - The 3 patch series "Use kmem_cache for memcg alloc" from Huan Yang
   creates a slab cache for `struct mem_cgroup', yielding improved memory
   utilization.
 
 - The 4 patch series "add max arg to swappiness in memory.reclaim and
   lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness="
   argument for memory.reclaim MGLRU's lru_gen.  This directs proactive
   reclaim to reclaim from only anon folios rather than file-backed folios.
 
 - The 17 patch series "kexec: introduce Kexec HandOver (KHO)" from Mike
   Rapoport is the first step on the path to permitting the kernel to
   maintain existing VMs while replacing the host kernel via file-based
   kexec.  At this time only memblock's reserve_mem is preserved.
 
 - The 7 patch series "mm: Introduce for_each_valid_pfn()" from David
   Woodhouse provides and uses a smarter way of looping over a pfn range.
   By skipping ranges of invalid pfns.
 
 - The 2 patch series "sched/numa: Skip VMA scanning on memory pinned to
   one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless
   VMA scanning when a task is pinned a single NUMA mode.  Dramatic
   performance benefits were seen in some real world cases.
 
 - The 2 patch series "JFS: Implement migrate_folio for
   jfs_metapage_aops" from Shivank Garg addresses a warning which occurs
   during memory compaction when using JFS.
 
 - The 4 patch series "move all VMA allocation, freeing and duplication
   logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c
   into the more appropriate mm/vma.c.
 
 - The 6 patch series "mm, swap: clean up swap cache mapping helper" from
   Kairui Song provides code consolidation and cleanups related to the
   folio_index() function.
 
 - The 2 patch series "mm/gup: Cleanup memfd_pin_folios()" from Vishal
   Moola does that.
 
 - The 8 patch series "memcg: Fix test_memcg_min/low test failures" from
   Waiman Long addresses some bogus failures which are being reported by
   the test_memcontrol selftest.
 
 - The 3 patch series "eliminate mmap() retry merge, add .mmap_prepare
   hook" from Lorenzo Stoakes commences the deprecation of
   file_operations.mmap() in favor of the new
   file_operations.mmap_prepare().  The latter is more restrictive and
   prevents drivers from messing with things in ways which, amongst other
   problems, may defeat VMA merging.
 
 - The 4 patch series "memcg: decouple memcg and objcg stocks"" from
   Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's
   one.  This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.
 
 - The 6 patch series "mm/damon: minor fixups and improvements for code,
   tests, and documents" from SeongJae Park is "yet another batch of
   miscellaneous DAMON changes.  Fix and improve minor problems in code,
   tests and documents."
 
 - The 7 patch series "memcg: make memcg stats irq safe" from Shakeel
   Butt converts memcg stats to be irq safe.  Another step along the way to
   making memcg charging and stats updates NMI-safe, a BPF requirement.
 
 - The 4 patch series "Let unmap_hugepage_range() and several related
   functions take folio instead of page" from Fan Ni provides folio
   conversions in the hugetlb code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDt5qgAKCRDdBJ7gKXxA
 ju6XAP9nTiSfRz8Cz1n5LJZpFKEGzLpSihCYyR6P3o1L9oe3mwEAlZ5+XAwk2I5x
 Qqb/UGMEpilyre1PayQqOnct3aSL9Ao=
 =tYYm
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
   creating a pte which addresses the first page in a folio and reduces
   the amount of plumbing which architecture must implement to provide
   this.

 - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
   largely unrelated folio infrastructure changes which clean things up
   and better prepare us for future work.

 - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
   Price adds early-init code to prevent x86 from leaving physical
   memory unused when physical address regions are not aligned to memory
   block size.

 - "mm/compaction: allow more aggressive proactive compaction" from
   Michal Clapinski provides some tuning of the (sadly, hard-coded (more
   sadly, not auto-tuned)) thresholds for our invokation of proactive
   compaction. In a simple test case, the reduction of a guest VM's
   memory consumption was dramatic.

 - "Minor cleanups and improvements to swap freeing code" from Kemeng
   Shi provides some code cleaups and a small efficiency improvement to
   this part of our swap handling code.

 - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
   adds the ability for a ptracer to modify syscalls arguments. At this
   time we can alter only "system call information that are used by
   strace system call tampering, namely, syscall number, syscall
   arguments, and syscall return value.

   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.

 - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
   Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
   against /proc/pid/pagemap. This permits CRIU to more efficiently get
   at the info about guard regions.

 - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
   implements that fix. No runtime effect is expected because
   validate_page_before_insert() happens to fix up this error.

 - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
   Hildenbrand basically brings uprobe text poking into the current
   decade. Remove a bunch of hand-rolled implementation in favor of
   using more current facilities.

 - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
   Khandual provides enhancements and generalizations to the pte dumping
   code. This might be needed when 128-bit Page Table Descriptors are
   enabled for ARM.

 - "Always call constructor for kernel page tables" from Kevin Brodsky
   ensures that the ctor/dtor is always called for kernel pgtables, as
   it already is for user pgtables.

   This permits the addition of more functionality such as "insert hooks
   to protect page tables". This change does result in various
   architectures performing unnecesary work, but this is fixed up where
   it is anticipated to occur.

 - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
   Ryhl adds plumbing to permit Rust access to core MM structures.

 - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
   Stoakes takes advantage of some VMA merging opportunities which we've
   been missing for 15 years.

 - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
   SeongJae Park optimizes process_madvise()'s TLB flushing.

   Instead of flushing each address range in the provided iovec, we
   batch the flushing across all the iovec entries. The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.

 - "Track node vacancy to reduce worst case allocation counts" from
   Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.

   stress-ng mmap performance increased by single-digit percentages and
   the amount of unnecessarily preallocated memory was dramaticelly
   reduced.

 - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
   a few unnecessary things which Baoquan noted when reading the code.

 - ""Enhance sysfs handling for memory hotplug in weighted interleave"
   from Rakie Kim "enhances the weighted interleave policy in the memory
   management subsystem by improving sysfs handling, fixing memory
   leaks, and introducing dynamic sysfs updates for memory hotplug
   support". Fixes things on error paths which we are unlikely to hit.

 - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
   from SeongJae Park introduces new DAMOS quota goal metrics which
   eliminate the manual tuning which is required when utilizing DAMON
   for memory tiering.

 - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
   provides cleanups and small efficiency improvements which Baoquan
   found via code inspection.

 - "vmscan: enforce mems_effective during demotion" from Gregory Price
   changes reclaim to respect cpuset.mems_effective during demotion when
   possible. because presently, reclaim explicitly ignores
   cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated.

   This is useful for isolating workloads on a multi-tenant system from
   certain classes of memory more consistently.

 - "Clean up split_huge_pmd_locked() and remove unnecessary folio
   pointers" from Gavin Guo provides minor cleanups and efficiency gains
   in in the huge page splitting and migrating code.

 - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
   for `struct mem_cgroup', yielding improved memory utilization.

 - "add max arg to swappiness in memory.reclaim and lru_gen" from
   Zhongkun He adds a new "max" argument to the "swappiness=" argument
   for memory.reclaim MGLRU's lru_gen.

   This directs proactive reclaim to reclaim from only anon folios
   rather than file-backed folios.

 - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
   first step on the path to permitting the kernel to maintain existing
   VMs while replacing the host kernel via file-based kexec. At this
   time only memblock's reserve_mem is preserved.

 - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
   and uses a smarter way of looping over a pfn range. By skipping
   ranges of invalid pfns.

 - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
   cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
   when a task is pinned a single NUMA mode.

   Dramatic performance benefits were seen in some real world cases.

 - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
   Garg addresses a warning which occurs during memory compaction when
   using JFS.

 - "move all VMA allocation, freeing and duplication logic to mm" from
   Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
   appropriate mm/vma.c.

 - "mm, swap: clean up swap cache mapping helper" from Kairui Song
   provides code consolidation and cleanups related to the folio_index()
   function.

 - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.

 - "memcg: Fix test_memcg_min/low test failures" from Waiman Long
   addresses some bogus failures which are being reported by the
   test_memcontrol selftest.

 - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
   Stoakes commences the deprecation of file_operations.mmap() in favor
   of the new file_operations.mmap_prepare().

   The latter is more restrictive and prevents drivers from messing with
   things in ways which, amongst other problems, may defeat VMA merging.

 - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
   the per-cpu memcg charge cache from the objcg's one.

   This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.

 - "mm/damon: minor fixups and improvements for code, tests, and
   documents" from SeongJae Park is yet another batch of miscellaneous
   DAMON changes. Fix and improve minor problems in code, tests and
   documents.

 - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
   stats to be irq safe. Another step along the way to making memcg
   charging and stats updates NMI-safe, a BPF requirement.

 - "Let unmap_hugepage_range() and several related functions take folio
   instead of page" from Fan Ni provides folio conversions in the
   hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
  mm: pcp: increase pcp->free_count threshold to trigger free_high
  mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
  mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: pass folio instead of page to unmap_ref_private()
  memcg: objcg stock trylock without irq disabling
  memcg: no stock lock for cpu hot-unplug
  memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  memcg: make count_memcg_events re-entrant safe against irqs
  memcg: make mod_memcg_state re-entrant safe against irqs
  memcg: move preempt disable to callers of memcg_rstat_updated
  memcg: memcg_rstat_updated re-entrant safe against irqs
  mm: khugepaged: decouple SHMEM and file folios' collapse
  selftests/eventfd: correct test name and improve messages
  alloc_tag: check mem_profiling_support in alloc_tag_init
  Docs/damon: update titles and brief introductions to explain DAMOS
  selftests/damon/_damon_sysfs: read tried regions directories in order
  mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
  mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
  mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
  ...
2025-05-31 15:44:16 -07:00
Linus Torvalds
b3570b00dc Locking changes for v6.16:
Futexes:
 
    - Add support for task local hash maps (Sebastian Andrzej Siewior,
      Peter Zijlstra)
 
    - Implement the FUTEX2_NUMA ABI, which feature extends the futex
      interface to be NUMA-aware. On NUMA-aware futexes a second u32
      word containing the NUMA node is added to after the u32 futex value
      word. (Peter Zijlstra)
 
    - Implement the FUTEX2_MPOL ABI, which feature extends the futex
      interface to be mempolicy-aware as well, to further refine futex
      node mappings and lookups. (Peter Zijlstra)
 
   Locking primitives:
 
    - Misc cleanups (Andy Shevchenko, Borislav Petkov, Colin Ian King,
                     Ingo Molnar, Nam Cao, Peter Zijlstra)
 
   Lockdep:
 
    - Prevent abuse of lockdep subclasses (Waiman Long)
    - Add number of dynamic keys to /proc/lockdep_stats (Waiman Long)
 
 Plus misc cleanups and fixes.
 
 Note that the tree includes the following dependent out-of-subsystem
 changes as well:
 
  - rcuref: Provide rcuref_is_dead()
  - mm: Add vmalloc_huge_node()
  - mm: Add the mmap_read_lock guard to <linux/mmap_lock.h>
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy3E8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1isNw/9FS6+ZReiV3NLHvhwIw8+6U2vV733wLY+
 mFzDk2CRwv2d6xg+QUrhLNI93i2fZnwNvK1f6LcRZMa1pNmwCcEghKgm0G+fRgbv
 skiGrlkUCoEqsDUxRW++/aTBcMo0vqG3NOObnUOrddG2W9tfrR8jq/EwlzB99dO7
 q8qaBNl9W1vLT3gh9/RPP5uKt0NKIf8ObvsyhWCGaywg81h2lC4AHf0Xlj3ZD95T
 TO5jhUhl/muhYtaqxeYPK0gDtCrgFz8NwZdjKx1nyP7Gbko6+L50AvOVXog0SIAU
 nncftvutGJg2ki7dbSYPDoHQrHO0JsF1vUfVZRjaKFebWpFo2yYdNMbITOeXVhSC
 QSpbH2qvyn21nT/YSj9dottHWBoNYBEgrcSf6DO4g0d8A0Jh7egXjQdA852RpeQ0
 LWGYx4rfiKhnjiXlKKQHrURZkcxxa40o+ls3RfFl2/kWA+7aUybvw6nAeDEkV0oL
 s2U0vZxsY37EPWDm40rTe9r4YpPqcB65i9YIesPzhtbcHJVmN0gts0o5l+x53GhR
 CeftFiiUi2nm6JaT+1wGvBDT3hQ8+NZ8GkPSeA6pEJWE3i4KquZlcBZLOSLZ3k/B
 df58zQi99Yun33is5f1kqDNspqvJOg/1nxUK68PgNSdCMKeuZkJYrcmh/rKNnXSC
 f7M1XHoWFb0=
 =La/x
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Futexes:

   - Add support for task local hash maps (Sebastian Andrzej Siewior,
     Peter Zijlstra)

   - Implement the FUTEX2_NUMA ABI, which feature extends the futex
     interface to be NUMA-aware. On NUMA-aware futexes a second u32 word
     containing the NUMA node is added to after the u32 futex value word
     (Peter Zijlstra)

   - Implement the FUTEX2_MPOL ABI, which feature extends the futex
     interface to be mempolicy-aware as well, to further refine futex
     node mappings and lookups (Peter Zijlstra)

  Locking primitives:

   - Misc cleanups (Andy Shevchenko, Borislav Petkov, Colin Ian King,
     Ingo Molnar, Nam Cao, Peter Zijlstra)

  Lockdep:

   - Prevent abuse of lockdep subclasses (Waiman Long)

   - Add number of dynamic keys to /proc/lockdep_stats (Waiman Long)

  Plus misc cleanups and fixes"

* tag 'locking-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  selftests/futex: Fix spelling mistake "unitiliazed" -> "uninitialized"
  futex: Correct the kernedoc return value for futex_wait_setup().
  tools headers: Synchronize prctl.h ABI header
  futex: Use RCU_INIT_POINTER() in futex_mm_init().
  selftests/futex: Use TAP output in futex_numa_mpol
  selftests/futex: Use TAP output in futex_priv_hash
  futex: Fix kernel-doc comments
  futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()
  futex: Fix outdated comment in struct restart_block
  locking/lockdep: Add number of dynamic keys to /proc/lockdep_stats
  locking/lockdep: Prevent abuse of lockdep subclass
  locking/lockdep: Move hlock_equal() to the respective #ifdeffery
  futex,selftests: Add another FUTEX2_NUMA selftest
  selftests/futex: Add futex_numa_mpol
  selftests/futex: Add futex_priv_hash
  selftests/futex: Build without headers nonsense
  tools/perf: Allow to select the number of hash buckets
  tools headers: Synchronize prctl.h ABI header
  futex: Implement FUTEX2_MPOL
  futex: Implement FUTEX2_NUMA
  ...
2025-05-26 14:42:07 -07:00
Linus Torvalds
7d7a103d29 vfs-6.16-rc1.pidfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
 ov4zAP4yfqKBAz6eMt9CzDgHCdVQJ9Nuur1EiRdot3maPzHTcQEA2hVkJrvVo1Y/
 jCVAf7gmGX1Uu6nCUF6Vjy35g8i20gA=
 =nzYS
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pidfs updates from Christian Brauner:
 "Features:

   - Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
     socket option

     SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
     the process that called connect() or listen(). This is heavily used
     to safely authenticate clients in userspace avoiding security bugs
     due to pid recycling races (dbus, polkit, systemd, etc.)

     SO_PEERPIDFD currently doesn't support handing out pidfds if the
     sk->sk_peer_pid thread-group leader has already been reaped. In
     this case it currently returns EINVAL. Userspace still wants to get
     a pidfd for a reaped process to have a stable handle it can pass
     on. This is especially useful now that it is possible to retrieve
     exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
     PIDFD_INFO_EXIT flag

     Another summary has been provided by David Rheinsberg:

      > A pidfd can outlive the task it refers to, and thus user-space
      > must already be prepared that the task underlying a pidfd is
      > gone at the time they get their hands on the pidfd. For
      > instance, resolving the pidfd to a PID via the fdinfo must be
      > prepared to read `-1`.
      >
      > Despite user-space knowing that a pidfd might be stale, several
      > kernel APIs currently add another layer that checks for this. In
      > particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
      > already reaped, but returns a stale pidfd if the task is reaped
      > immediately after the respective alive-check.
      >
      > This has the unfortunate effect that user-space now has two ways
      > to check for the exact same scenario: A syscall might return
      > EINVAL/ESRCH/... *or* the pidfd might be stale, even though
      > there is no particular reason to distinguish both cases. This
      > also propagates through user-space APIs, which pass on pidfds.
      > They must be prepared to pass on `-1` *or* the pidfd, because
      > there is no guaranteed way to get a stale pidfd from the kernel.
      >
      > Userspace must already deal with a pidfd referring to a reaped
      > task as the task may exit and get reaped at any time will there
      > are still many pidfds referring to it

     In order to allow handing out reaped pidfd SO_PEERPIDFD needs to
     ensure that PIDFD_INFO_EXIT information is available whenever a
     pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi
     promises that reaped pidfds are only handed out if it is guaranteed
     that the caller sees the exit information:

     TEST_F(pidfd_info, success_reaped)
     {
             struct pidfd_info info = {
                     .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
             };

             /*
              * Process has already been reaped and PIDFD_INFO_EXIT been set.
              * Verify that we can retrieve the exit status of the process.
              */
             ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
             ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
             ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
             ASSERT_TRUE(WIFEXITED(info.exit_code));
             ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
     }

     To hand out pidfds for reaped processes we thus allocate a pidfs
     entry for the relevant sk->sk_peer_pid at the time the
     sk->sk_peer_pid is stashed and drop it when the socket is
     destroyed. This guarantees that exit information will always be
     recorded for the sk->sk_peer_pid task and we can hand out pidfds
     for reaped processes

   - Hand a pidfd to the coredump usermode helper process

     Give userspace a way to instruct the kernel to install a pidfd for
     the crashing process into the process started as a usermode helper.
     There's still tricky race-windows that cannot be easily or
     sometimes not closed at all by userspace. There's various ways like
     looking at the start time of a process to make sure that the
     usermode helper process is started after the crashing process but
     it's all very very brittle and fraught with peril

     The crashed-but-not-reaped process can be killed by userspace
     before coredump processing programs like systemd-coredump have had
     time to manually open a PIDFD from the PID the kernel provides
     them, which means they can be tricked into reading from an
     arbitrary process, and they run with full privileges as they are
     usermode helper processes

     Even if that specific race-window wouldn't exist it's still the
     safest and cleanest way to let the kernel provide the pidfd
     directly instead of requiring userspace to do it manually. In
     parallel with this commit we already have systemd adding support
     for this in [1]

     When the usermode helper process is forked we install a pidfd file
     descriptor three into the usermode helper's file descriptor table
     so it's available to the exec'd program

     Since usermode helpers are either children of the system_unbound_wq
     workqueue or kthreadd we know that the file descriptor table is
     empty and can thus always use three as the file descriptor number

     Note, that we'll install a pidfd for the thread-group leader even
     if a subthread is calling do_coredump(). We know that task linkage
     hasn't been removed yet and even if this @current isn't the actual
     thread-group leader we know that the thread-group leader cannot be
     reaped until
     @current has exited

   - Allow telling when a task has not been found from finding the wrong
     task when creating a pidfd

     We currently report EINVAL whenever a struct pid has no tasked
     attached anymore thereby conflating two concepts:

      (1) The task has already been reaped

      (2) The caller requested a pidfd for a thread-group leader but the
          pid actually references a struct pid that isn't used as a
          thread-group leader

     This is causing issues for non-threaded workloads as in where they
     expect ESRCH to be reported, not EINVAL

     So allow userspace to reliably distinguish between (1) and (2)

   - Make it possible to detect when a pidfs entry would outlive the
     struct pid it pinned

   - Add a range of new selftests

  Cleanups:

   - Remove unneeded NULL check from pidfd_prepare() for passed struct
     pid

   - Avoid pointless reference count bump during release_task()

  Fixes:

   - Various fixes to the pidfd and coredump selftests

   - Fix error handling for replace_fd() when spawning coredump usermode
     helper"

* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  pidfs: detect refcount bugs
  coredump: hand a pidfd to the usermode coredump helper
  coredump: fix error handling for replace_fd()
  pidfs: move O_RDWR into pidfs_alloc_file()
  selftests: coredump: Raise timeout to 2 minutes
  selftests: coredump: Fix test failure for slow machines
  selftests: coredump: Properly initialize pointer
  net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
  pidfs: get rid of __pidfd_prepare()
  net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
  pidfs: register pid in pidfs
  net, pidfd: report EINVAL for ESRCH
  release_task: kill the no longer needed get/put_pid(thread_pid)
  pidfs: ensure consistent ENOENT/ESRCH reporting
  exit: move wake_up_all() pidfd waiters into __unhash_process()
  selftest/pidfd: add test for thread-group leader pidfd open for thread
  pidfd: improve uapi when task isn't found
  pidfd: remove unneeded NULL check from pidfd_prepare()
  selftests/pidfd: adapt to recent changes
2025-05-26 10:30:02 -07:00
Lorenzo Stoakes
3e43e260f1 mm: perform VMA allocation, freeing, duplication in mm
Right now these are performed in kernel/fork.c which is odd and a
violation of separation of concerns, as well as preventing us from
integrating this and related logic into userland VMA testing going
forward.

There is a fly in the ointment - nommu - mmap.c is not compiled if
CONFIG_MMU not set, and neither is vma.c.

To square the circle, let's add a new file - vma_init.c.  This will be
compiled for both CONFIG_MMU and nommu builds, and will also form part of
the VMA userland testing.

This allows us to de-duplicate code, while maintaining separation of
concerns and the ability for us to userland test this logic.

Update the VMA userland tests accordingly, additionally adding a
detach_free_vma() helper function to correctly detach VMAs before freeing
them in test code, as this change was triggering the assert for this.

[akpm@linux-foundation.org: remove stray newline, per Liam]
Link: https://lkml.kernel.org/r/f97b3a85a6da0196b28070df331b99e22b263be8.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:48 -07:00
Lorenzo Stoakes
26a8f57760 mm: move dup_mmap() to mm
This is a key step in our being able to abstract and isolate VMA
allocation and destruction logic.

This function is the last one where vm_area_free() and vm_area_dup() are
directly referenced outside of mmap, so having this in mm allows us to
isolate these.

We do the same for the nommu version which is substantially simpler.

We place the declaration for dup_mmap() in mm/internal.h and have
kernel/fork.c import this in order to prevent improper use of this
functionality elsewhere in the kernel.

While we're here, we remove the useless #ifdef CONFIG_MMU check around
mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if
CONFIG_MMU is set.

Link: https://lkml.kernel.org/r/e49aad3d00212f5539d9fa5769bfda4ce451db3e.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:48 -07:00
David Hildenbrand
e9f180d7cf kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
Not intuitive, but vm_area_dup() located in kernel/fork.c is not only used
for duplicating VMAs during fork(), but also for duplicating VMAs when
splitting VMAs or when mremap()'ing them.

VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
size) and apparently also shrunk during mremap(), which implies
duplicating the VMA in __split_vma() first.

In case of ordinary mremap() (no change in size), we first duplicate the
VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
the old VMA: we effectively move the VM_PAT reservation.  So the
untrack_pfn_clear() call on the new VMA duplicating is wrong in that
context.

Splitting of VMAs seems problematic, because we don't duplicate/adjust the
reservation when splitting the VMA.  Instead, in memtype_erase() -- called
during zapping/munmap -- we shrink a reservation in case only the end
address matches: Assume we split a VMA into A and B, both would share a
reservation until B is unmapped.

So when unmapping B, the reservation would be updated to cover only A. 
When unmapping A, we would properly remove the now-shrunk reservation. 
That scenario describes the mremap() shrinking (old_size > new_size),
where we split + unmap B, and the untrack_pfn_clear() on the new VMA when
is wrong.

What if we manage to split a VM_PFNMAP VMA into A and B and unmap A first?
It would be broken because we would never free the reservation.  Likely,
there are ways to trigger such a VMA split outside of mremap().

Affecting other VMA duplication was not intended, vm_area_dup() being used
outside of kernel/fork.c was an oversight.  So let's fix that for; how to
handle VMA splits better should be investigated separately.


With a simple reproducer that uses mprotect() to split such a VMA I can
trigger

x86/PAT: pat_mremap:26448 freeing invalid memtype [mem 0x00000000-0x00000fff]

Link: https://lkml.kernel.org/r/20250422144942.2871395-1-david@redhat.com
Fixes: dc84bc2aba ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11 17:26:06 -07:00
Sebastian Andrzej Siewior
7c4f75a21f futex: Allow automatic allocation of process wide futex hash
Allocate a private futex hash with 16 slots if a task forks its first
thread.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-14-bigeasy@linutronix.de
2025-05-03 12:02:08 +02:00
Sebastian Andrzej Siewior
80367ad01d futex: Add basic infrastructure for local task local hash
The futex hash is system wide and shared by all tasks. Each slot
is hashed based on futex address and the VMA of the thread. Due to
randomized VMAs (and memory allocations) the same logical lock (pointer)
can end up in a different hash bucket on each invocation of the
application. This in turn means that different applications may share a
hash bucket on the first invocation but not on the second and it is not
always clear which applications will be involved. This can result in
high latency's to acquire the futex_hash_bucket::lock especially if the
lock owner is limited to a CPU and can not be effectively PI boosted.

Introduce basic infrastructure for process local hash which is shared by
all threads of process. This hash will only be used for a
PROCESS_PRIVATE FUTEX operation.

The hashmap can be allocated via:

        prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num);

A `num' of 0 means that the global hash is used instead of a private
hash.
Other values for `num' specify the number of slots for the hash and the
number must be power of two, starting with two.
The prctl() returns zero on success. This function can only be used
before a thread is created.

The current status for the private hash can be queried via:

        num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);

which return the current number of slots. The value 0 means that the
global hash is used. Values greater than 0 indicate the number of slots
that are used. A negative number indicates an error.

For optimisation, for the private hash jhash2() uses only two arguments
the address and the offset. This omits the VMA which is always the same.

[peterz: Use 0 for global hash. A bit shuffling and renaming. ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-13-bigeasy@linutronix.de
2025-05-03 12:02:07 +02:00
Christian Brauner
a71f402acd
pidfs: get rid of __pidfd_prepare()
Fold it into pidfd_prepare() and rename PIDFD_CLONE to PIDFD_STALE to
indicate that the passed pid might not have task linkage and no explicit
check for that should be performed.

Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-3-450a19461e75@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: David Rheinsberg <david@readahead.eu>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-26 08:28:03 +02:00
Christian Brauner
17f1b08acf
pidfs: ensure consistent ENOENT/ESRCH reporting
In a prior patch series we tried to cleanly differentiate between:

(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
    actually references a struct pid that isn't used as a thread-group
    leader.

as this was causing issues for non-threaded workloads.

But there's cases where the current simple logic is wrong. Specifically,
if the pid was a leader pid and the check races with __unhash_process().
Stabilize this by using the pidfd waitqueue lock.

Link: https://lore.kernel.org/20250411-work-pidfs-enoent-v2-2-60b2d3bb545f@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-12 14:04:53 +02:00
Christian Brauner
8cf4b738f6
pidfd: improve uapi when task isn't found
We currently report EINVAL whenever a struct pid has no tasked attached
anymore thereby conflating two concepts:

(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
    actually references a struct pid that isn't used as a thread-group
    leader.

This is causing issues for non-threaded workloads as in [1].

This patch tries to allow userspace to distinguish between (1) and (2).
This is racy of course but that shouldn't matter.

Link: https://github.com/systemd/systemd/pull/36982 [1]
Link: https://lore.kernel.org/r/20250403-work-pidfd-fixes-v1-3-a123b6ed6716@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 09:38:24 +02:00
Christian Brauner
1b090949c9
pidfd: remove unneeded NULL check from pidfd_prepare()
None of the caller actually pass a NULL pid in there.

Link: https://lore.kernel.org/r/20250403-work-pidfd-fixes-v1-2-a123b6ed6716@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 09:38:24 +02:00
Linus Torvalds
d6b02199cd - The 7 patch series "powerpc/crash: use generic crashkernel
reservation" from Sourabh Jain changes powerpc's kexec code to use more
   of the generic layers.
 
 - The 2 patch series "get_maintainer: report subsystem status
   separately" from Vlastimil Babka makes some long-requested improvements
   to the get_maintainer output.
 
 - The 4 patch series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizing the refcounting in the ucount
   code.
 
 - The 12 patch series "reboot: support runtime configuration of
   emergency hw_protection action" from Ahmad Fatoum improves the ability
   for a driver to perform an emergency system shutdown or reboot.
 
 - The 16 patch series "Converge on using secs_to_jiffies() part two"
   from Easwar Hariharan performs further migrations from
   msecs_to_jiffies() to secs_to_jiffies().
 
 - The 7 patch series "lib/interval_tree: add some test cases and
   cleanup" from Wei Yang permits more userspace testing of kernel library
   code, adds some more tests and performs some cleanups.
 
 - The 2 patch series "hung_task: Dump the blocking task stacktrace" from
   Masami Hiramatsu arranges for the hung_task detector to dump the stack
   of the blocking task and not just that of the blocked task.
 
 - The 4 patch series "resource: Split and use DEFINE_RES*() macros" from
   Andy Shevchenko provides some cleanups to the resource definition
   macros.
 
 - Plus the usual shower of singleton patches - please see the individual
   changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nuqwAKCRDdBJ7gKXxA
 jtNqAQDxqJpjWkzn4yN9CNSs1ivVx3fr6SqazlYCrt3u89WQvwEA1oRrGpETzUGq
 r6khQUIcQImPPcjFqEFpuiSOU0MBZA0=
 =Kii8
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - The series "powerpc/crash: use generic crashkernel reservation" from
   Sourabh Jain changes powerpc's kexec code to use more of the generic
   layers.

 - The series "get_maintainer: report subsystem status separately" from
   Vlastimil Babka makes some long-requested improvements to the
   get_maintainer output.

 - The series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizing the refcounting in the
   ucount code.

 - The series "reboot: support runtime configuration of emergency
   hw_protection action" from Ahmad Fatoum improves the ability for a
   driver to perform an emergency system shutdown or reboot.

 - The series "Converge on using secs_to_jiffies() part two" from Easwar
   Hariharan performs further migrations from msecs_to_jiffies() to
   secs_to_jiffies().

 - The series "lib/interval_tree: add some test cases and cleanup" from
   Wei Yang permits more userspace testing of kernel library code, adds
   some more tests and performs some cleanups.

 - The series "hung_task: Dump the blocking task stacktrace" from Masami
   Hiramatsu arranges for the hung_task detector to dump the stack of
   the blocking task and not just that of the blocked task.

 - The series "resource: Split and use DEFINE_RES*() macros" from Andy
   Shevchenko provides some cleanups to the resource definition macros.

 - Plus the usual shower of singleton patches - please see the
   individual changelogs for details.

* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  mailmap: consolidate email addresses of Alexander Sverdlin
  fs/procfs: fix the comment above proc_pid_wchan()
  relay: use kasprintf() instead of fixed buffer formatting
  resource: replace open coded variant of DEFINE_RES()
  resource: replace open coded variants of DEFINE_RES_*_NAMED()
  resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
  resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
  samples: add hung_task detector mutex blocking sample
  hung_task: show the blocker task if the task is hung on mutex
  kexec_core: accept unaccepted kexec segments' destination addresses
  watchdog/perf: optimize bytes copied and remove manual NUL-termination
  lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
  lib/interval_tree: skip the check before go to the right subtree
  lib/interval_tree: add test case for span iteration
  lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
  lib/rbtree: add random seed
  lib/rbtree: split tests
  lib/rbtree: enable userland test suite for rbtree related data structure
  checkpatch: describe --min-conf-desc-length
  scripts/gdb/symbols: determine KASLR offset on s390
  ...
2025-04-01 10:06:52 -07:00
Linus Torvalds
eb0ece1602 - The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.
 
   This has caused a small amount of fallout - two or three issues were
   reported.  In all cases the calling code was founf to be incorrect.
 
 - The 4 patch series "Some cleanup for memcg" from Chen Ridong
   implements some relatively monir cleanups for the memcontrol code.
 
 - The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
   from David Hildenbrand fixes a boatload of issues which David found then
   using device-exclusive PTE entries when THP is enabled.  More work is
   needed, but this makes thins better - our own HMM selftests now succeed.
 
 - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
   Ahmed remove the z3fold and zbud implementations.  They have been
   deprecated for half a year and nobody has complained.
 
 - The 5 patch series "mm: further simplify VMA merge operation" from
   Lorenzo Stoakes implements numerous simplifications in this area.  No
   runtime effects are anticipated.
 
 - The 4 patch series "mm/madvise: remove redundant mmap_lock operations
   from process_madvise()" from SeongJae Park rationalizes the locking in
   the madvise() implementation.  Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.
 
 - The 12 patch series "Tiny cleanup and improvements about SWAP code"
   from Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.
 
 - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak user-visible
   output.
 
 - The 2 patch series "mm/damon/paddr: fix large folios access and
   schemes handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.
 
 - The 3 patch series "mm/damon/core: fix wrong and/or useless
   damos_walk() behaviors" from SeongJae Park fixes a few issues with the
   accuracy of kdamond's walking of DAMON regions.
 
 - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
   Lorenzo Stoakes changes the interaction between framebuffer deferred-io
   and core MM.  No functional changes are anticipated - this is
   preparatory work for the future removal of page structure fields.
 
 - The 4 patch series "mm/damon: add support for hugepage_size DAMOS
   filter" from Usama Arif adds a DAMOS filter which permits the filtering
   by huge page sizes.
 
 - The 4 patch series "mm: permit guard regions for file-backed/shmem
   mappings" from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state.  The feature now covers shmem and
   file-backed mappings.
 
 - The 4 patch series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping for
   pte-mapped large folios.
 
 - The 18 patch series "reimplement per-vma lock as a refcount" from
   Suren Baghdasaryan puts the vm_lock back into the vma.  Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy.  This patchset provides small (0-10%) improvements on one
   microbenchmark.
 
 - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
   fixes and improves" from SeongJae Park does some maintenance work on the
   DAMON docs.
 
 - The 27 patch series "hugetlb/CMA improvements for large systems" from
   Frank van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
   pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.
 
 - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.
 
 - The 12 patch series "selftests/mm: Some cleanups from trying to run
   them" from Brendan Jackman fixes a pile of unrelated issues which
   Brendan encountered while runnimg our selftests.
 
 - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
   from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.
 
 - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply wasn't
   being effective.
 
 - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
   from David Hildenbrand implements a number of unrelated cleanups in this
   code.
 
 - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
   Khandual implements a number of preparatoty cleanups to the
   GENERIC_PTDUMP Kconfig logic.
 
 - The 8 patch series "mm/damon: auto-tune aggregation interval" from
   SeongJae Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.
 
 - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
   issues in powerpc, sparc and x86 lazy MMU implementations.  Ryan did
   this in preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.
 
 - The 2 patch series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the code
   easier to follow.
 
 - The 3 patch series "page_counter cleanup and size reduction" from
   Shakeel Butt cleans up the page_counter code and fixes a size increase
   which we accidentally added late last year.
 
 - The 3 patch series "Add a command line option that enables control of
   how many threads should be used to allocate huge pages" from Thomas
   Prescher does that.  It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.
 
 - The 3 patch series "Fix calculations in trace_balance_dirty_pages()
   for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.
 
 - The 9 patch series "mm/damon: make allow filters after reject filters
   useful and intuitive" from SeongJae Park improves the handling of allow
   and reject filters.  Behaviour is made more consistent and the
   documention is updated accordingly.
 
 - The 5 patch series "Switch zswap to object read/write APIs" from Yosry
   Ahmed updates zswap to the new object read/write APIs and thus permits
   the removal of some legacy code from zpool and zsmalloc.
 
 - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
   does as it claims.
 
 - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
   from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.
 
 - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
   is a preparatoty refactoring and cleanup of the mremap() code.
 
 - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
   + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.
 
 - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
   filters based on handling layers" from SeongJae Park adds a couple of
   new sysfs directories to ease the management of DAMON/DAMOS filters.
 
 - The 13 patch series "arch, mm: reduce code duplication in mem_init()"
   from Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.
 
 - The 13 patch series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.
 
 - The 3 patch series "mm: page_ext: Introduce new iteration API" from
   Luiz Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.
 
 - The 8 patch series "Buddy allocator like (or non-uniform) folio split"
   from Zi Yan reworks the code to split a folio into smaller folios.  The
   main benefit is lessened memory consumption: fewer post-split folios are
   generated.
 
 - The 2 patch series "Minimize xa_node allocation during xarry split"
   from Zi Yan reduces the number of xarray xa_nodes which are generated
   during an xarray split.
 
 - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.
 
 - The 3 patch series "Add tracepoints for lowmem reserves, watermarks
   and totalreserve_pages" from Martin Liu adds some more tracepoints to
   the page allocator code.
 
 - The 4 patch series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which SeongJae
   observed during his earlier madvise work.
 
 - The 3 patch series "mm/hwpoison: Fix regressions in memory failure
   handling" from Shuai Xue addresses two quite serious regressions which
   Shuai has observed in the memory-failure implementation.
 
 - The 5 patch series "mm: reliable huge page allocator" from Johannes
   Weiner makes huge page allocations cheaper and more reliable by reducing
   fragmentation.
 
 - The 5 patch series "Minor memcg cleanups & prep for memdescs" from
   Matthew Wilcox is preparatory work for the future implementation of
   memdescs.
 
 - The 4 patch series "track memory used by balloon drivers" from Nico
   Pache introduces a way to track memory used by our various balloon
   drivers.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for active
   pages" from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.
 
 - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
   Hao Jia separates the proactive reclaim statistics from the direct
   reclaim statistics.
 
 - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
   from Jinjiang Tu fixes our handling of hwpoisoned pages within the
   reclaim code.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
 jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
 Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
 =Pn2K
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively monir cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found then using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes thins better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   remove the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while runnimg our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatoty cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documention
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatoty refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarry split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Linus Torvalds
7405c0f01a Miscellaneous x86 fixes and updates:
- Fix a large number of x86 Kconfig dependency and help text accuracy
    bugs/problems, by Mateusz Jończyk and David Heideberg.
 
  - Fix a VM_PAT interaction with fork() crash. This also touches
    core kernel code.
 
  - Fix an ORC unwinder bug for interrupt entries
 
  - Fixes and cleanups.
 
  - Fix an AMD microcode loader bug that can promote verification failures
    into success.
 
  - Add early-printk support for MMIO based UARTs on an x86 board that
    had no other serial debugging facility and also experienced early
    boot crashes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfnFBERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iVDxAAmiB4soT3/WbaWJJdeVyxEL7sOmUNOm04
 5kAVHJVK8QGdje0eWa6h7xmuQD3UOxafE2coCrOxHhZi2qpAAY6CPIIy6oIBRwZK
 gLgT5xn1CHojfm4UFC3YUOyecBRPUF2C5jfkajWdZHumyPP/sOObqvGanpQRAYd5
 bfPHEvrBpeEeS7WkATCdyF2j+I5xYflD4g/MDAsMmqasQHOnjBuFX5VBeVxxkysC
 dMsFkFpxqcA95MnnyOnxXzgOtRTY0UystX07D3Bk1pqhG9zor+mp8OynsTRCU87T
 ZPPbUr2qACNmCqEEXl+F1mAkgj5H66xE2gaJdYx0/jBAIbX8Nwih7mMxhJShVU07
 Lhc0tukmVrDoDaVIr2HsxqI8iokuYLszUjDAqEQmQDrgelL6usPYghN1b2bDSJ9r
 0hCO/s79024H/U9oMrC+CF52D5UH/fE98ipigrbKRIO/hOsoxiiniF3DG2NVWZM2
 n5nPnOdbperqjCEteN1nxQfr7XZkvP95Bwmuqqc90XH+tzKJdHruUkbm4ua7NEEz
 WKgsUIYFjeN5ZrHbJaNtHlQueTyvsyGmL1nlaLi/MaJbSXPsM/WfwvHsaKTh3NrE
 BFwEAhMZVLDHEfnFT0Ev7Mm1MGpW8MbHoRBR1+E5FWWNS4X0yGLKXWRp8diw25Tm
 W3ZVsn65E6U=
 =/qKX
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2025-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes and updates from Ingo Molnar:

 - Fix a large number of x86 Kconfig dependency and help text accuracy
   bugs/problems, by Mateusz Jończyk and David Heideberg

 - Fix a VM_PAT interaction with fork() crash. This also touches core
   kernel code

 - Fix an ORC unwinder bug for interrupt entries

 - Fixes and cleanups

 - Fix an AMD microcode loader bug that can promote verification
   failures into success

 - Add early-printk support for MMIO based UARTs on an x86 board that
   had no other serial debugging facility and also experienced early
   boot crashes

* tag 'x86-urgent-2025-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode/AMD: Fix __apply_microcode_amd()'s return value
  x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
  x86/fpu: Update the outdated comment above fpstate_init_user()
  x86/early_printk: Add support for MMIO-based UARTs
  x86/dumpstack: Fix inaccurate unwinding from exception stacks due to misplaced assignment
  x86/entry: Fix ORC unwinder for PUSH_REGS with save_ret=1
  x86/Kconfig: Fix lists in X86_EXTENDED_PLATFORM help text
  x86/Kconfig: Correct X86_X2APIC help text
  x86/speculation: Remove the extra #ifdef around CALL_NOSPEC
  x86/Kconfig: Document release year of glibc 2.3.3
  x86/Kconfig: Make CONFIG_PCI_CNB20LE_QUIRK depend on X86_32
  x86/Kconfig: Document CONFIG_PCI_MMCONFIG
  x86/Kconfig: Update lists in X86_EXTENDED_PLATFORM
  x86/Kconfig: Move all X86_EXTENDED_PLATFORM options together
  x86/Kconfig: Always enable ARCH_SPARSEMEM_ENABLE
  x86/Kconfig: Enable X86_X2APIC by default and improve help text
2025-03-30 15:25:15 -07:00
David Hildenbrand
dc84bc2aba x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
If track_pfn_copy() fails, we already added the dst VMA to the maple
tree. As fork() fails, we'll cleanup the maple tree, and stumble over
the dst VMA for which we neither performed any reservation nor copied
any page tables.

Consequently untrack_pfn() will see VM_PAT and try obtaining the
PAT information from the page table -- which fails because the page
table was not copied.

The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
if track_pfn_copy() fails. However, the whole thing is about "simply"
clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
and performed a reservation, but copying the page tables fails, we'll
simply clear the VM_PAT flag, not properly undoing the reservation ...
which is also wrong.

So let's fix it properly: set the VM_PAT flag only if the reservation
succeeded (leaving it clear initially), and undo the reservation if
anything goes wrong while copying the page tables: clearing the VM_PAT
flag after undoing the reservation.

Note that any copied page table entries will get zapped when the VMA will
get removed later, after copy_page_range() succeeded; as VM_PAT is not set
then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be
happy. Note that leaving these page tables in place without a reservation
is not a problem, as we are aborting fork(); this process will never run.

A reproducer can trigger this usually at the first try:

  https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c

  WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
  Modules linked in: ...
  CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:get_pat_info+0xf6/0x110
  ...
  Call Trace:
   <TASK>
   ...
   untrack_pfn+0x52/0x110
   unmap_single_vma+0xa6/0xe0
   unmap_vmas+0x105/0x1f0
   exit_mmap+0xf6/0x460
   __mmput+0x4b/0x120
   copy_process+0x1bf6/0x2aa0
   kernel_clone+0xab/0x440
   __do_sys_clone+0x66/0x90
   do_syscall_64+0x95/0x180

Likely this case was missed in:

  d155df53f3 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")

... and instead of undoing the reservation we simply cleared the VM_PAT flag.

Keep the documentation of these functions in include/linux/pgtable.h,
one place is more than sufficient -- we should clean that up for the other
functions like track_pfn_remap/untrack_pfn separately.

Fixes: d155df53f3 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
Fixes: 2ab640379a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3")
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yuxin wang <wang1315768607@163.com>
Reported-by: Marius Fleischer <fleischermarius@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20250321112323.153741-1-david@redhat.com
Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/
Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
2025-03-25 22:35:14 +01:00
Linus Torvalds
a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Uladzislau Rezki (Sony)
b25bcabb6c fork: use __vmalloc_node() for stack allocation
Replace __vmalloc_node_range() by __vmalloc_node().  The last variant
requires less parameters and it uses exactly the same arguments which are
partly now hidden inside __vmalloc_node().

This change does not change any functionality.  It makes the code a bit
simpler.

Link: https://lkml.kernel.org/r/20250317163614.166502-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21 22:03:14 -07:00
Christian Brauner
68db272741
pidfs: ensure that PIDFS_INFO_EXIT is available
When we currently create a pidfd we check that the task hasn't been
reaped right before we create the pidfd. But it is of course possible
that by the time we return the pidfd to userspace the task has already
been reaped since we don't check again after having created a dentry for
it.

This was fine until now because that race was meaningless. But now that
we provide PIDFD_INFO_EXIT it is a problem because it is possible that
the kernel returns a reaped pidfd and it depends on the race whether
PIDFD_INFO_EXIT information is available. This depends on if the task
gets reaped before or after a dentry has been attached to struct pid.

Make this consistent and only returned pidfds for reaped tasks if
PIDFD_INFO_EXIT information is available. This is done by performing
another check whether the task has been reaped right after we attached a
dentry to struct pid.

Since pidfs_exit() is called before struct pid's task linkage is removed
the case where the task got reaped but a dentry was already attached to
struct pid and exit information was recorded and published can be
handled correctly. In that case we do return a pidfd for a reaped task
like we would've before.

Link: https://lore.kernel.org/r/20250316-kabel-fehden-66bdb6a83436@brauner
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19 14:40:18 +01:00
David Hildenbrand
6af8cb80d3 mm/rmap: basic MM owner tracking for large folios (!hugetlb)
For small folios, we traditionally use the mapcount to decide whether it
was "certainly mapped exclusively" by a single MM (mapcount == 1) or
whether it "maybe mapped shared" by multiple MMs (mapcount > 1).  For
PMD-sized folios that were PMD-mapped, we were able to use a similar
mechanism (single PMD mapping), but for PTE-mapped folios and in the
future folios that span multiple PMDs, this does not work.

So we need a different mechanism to handle large folios.  Let's add a new
mechanism to detect whether a large folio is "certainly mapped
exclusively", or whether it is "maybe mapped shared".

We'll use this information next to optimize CoW reuse for PTE-mapped
anonymous THP, and to convert folio_likely_mapped_shared() to
folio_maybe_mapped_shared(), independent of per-page mapcounts.

For each large folio, we'll have two slots, whereby a slot stores:
 (1) an MM id: unique id assigned to each MM
 (2) a per-MM mapcount

If a slot is unoccupied, it can be taken by the next MM that maps folio
page.

In addition, we'll remember the current state -- "mapped exclusively" vs. 
"maybe mapped shared" -- and use a bit spinlock to sync on updates and to
reduce the total number of atomic accesses on updates.  In the future, it
might be possible to squeeze a proper spinlock into "struct folio".  For
now, keep it simple, as we require the whole thing with THP only, that is
incompatible with RT.

As we have to squeeze this information into the "struct folio" of even
folios of order-1 (2 pages), and we generally want to reduce the required
metadata, we'll assign each MM a unique ID that can fit into an int.  In
total, we can squeeze everything into 4x int (2x long) on 64bit.

32bit support is a bit challenging, because we only have 2x long == 2x int
in order-1 folios.  But we can make it work for now, because we neither
expect many MMs nor very large folios on 32bit.

We will reliably detect folios as "mapped exclusively" vs.  "mapped
shared" as long as only two MMs map pages of a folio at one point in time
-- for example with fork() and short-lived child processes, or with apps
that hand over state from one instance to another.

As soon as three MMs are involved at the same time, we might detect "maybe
mapped shared" although the folio is "mapped exclusively".

Example 1:

(1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0
(2) App2 faults in a folio page -> Tracked as MM1
(4) App1 unmaps all folio pages

 -> We will detect "mapped exclusively".

Example 2:

(1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0
(2) App2 faults in a folio page -> Tracked as MM1
(3) App3 faults in a folio page -> No slot available, tracked as "unknown"
(4) App1 and App2 unmap all folio pages

 -> We will detect "maybe mapped shared".

Make use of __always_inline to keep possible performance degradation when
(un)mapping large folios to a minimum.

Note: by squeezing the two flags into the "unsigned long" that stores the
MM ids, we can use non-atomic __bit_spin_unlock() and non-atomic
setting/clearing of the "maybe mapped shared" bit, effectively not adding
any new atomics on the hot path when updating the large mapcount + new
metadata, which further helps reduce the runtime overhead in
micro-benchmarks.

Link: https://lkml.kernel.org/r/20250303163014.1128035-13-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirks^H^Hski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michal Koutn <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: tejun heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 22:06:46 -07:00
Andrii Nakryiko
0c555a3c1b mm,procfs: allow read-only remote mm access under CAP_PERFMON
It's very common for various tracing and profiling toolis to need to
access /proc/PID/maps contents for stack symbolization needs to learn
which shared libraries are mapped in memory, at which file offset, etc. 
Currently, access to /proc/PID/maps requires CAP_SYS_PTRACE (unless we are
looking at data for our own process, which is a trivial case not too
relevant for profilers use cases).

Unfortunately, CAP_SYS_PTRACE implies way more than just ability to
discover memory layout of another process: it allows to fully control
arbitrary other processes.  This is problematic from security POV for
applications that only need read-only /proc/PID/maps (and other similar
read-only data) access, and in large production settings CAP_SYS_PTRACE is
frowned upon even for the system-wide profilers.

On the other hand, it's already possible to access similar kind of
information (and more) with just CAP_PERFMON capability.  E.g., setting up
PERF_RECORD_MMAP collection through perf_event_open() would give one
similar information to what /proc/PID/maps provides.

CAP_PERFMON, together with CAP_BPF, is already a very common combination
for system-wide profiling and observability application.  As such, it's
reasonable and convenient to be able to access /proc/PID/maps with
CAP_PERFMON capabilities instead of CAP_SYS_PTRACE.

For procfs, these permissions are checked through common mm_access()
helper, and so we augment that with cap_perfmon() check *only* if
requested mode is PTRACE_MODE_READ.  I.e., PTRACE_MODE_ATTACH wouldn't be
permitted by CAP_PERFMON.  So /proc/PID/mem, which uses
PTRACE_MODE_ATTACH, won't be permitted by CAP_PERFMON, but /proc/PID/maps,
/proc/PID/environ, and a bunch of other read-only contents will be
allowable under CAP_PERFMON.

Besides procfs itself, mm_access() is used by process_madvise() and
process_vm_{readv,writev}() syscalls.  The former one uses
PTRACE_MODE_READ to avoid leaking ASLR metadata, and as such CAP_PERFMON
seems like a meaningful allowable capability as well.

process_vm_{readv,writev} currently assume PTRACE_MODE_ATTACH level of
permissions (though for readv PTRACE_MODE_READ seems more reasonable, but
that's outside the scope of this change), and as such won't be affected by
this patch.

Link: https://lkml.kernel.org/r/20250127222114.1132392-1-andrii@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:30:46 -07:00
Suren Baghdasaryan
3104138517 mm: make vma cache SLAB_TYPESAFE_BY_RCU
To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
object reuse before RCU grace period is over will be detected by
lock_vma_under_rcu().

Current checks are sufficient as long as vma is detached before it is
freed.  The only place this is not currently happening is in exit_mmap(). 
Add the missing vma_mark_detached() in exit_mmap().

Another issue which might trick lock_vma_under_rcu() during vma reuse is
vm_area_dup(), which copies the entire content of the vma into a new one,
overriding new vma's vm_refcnt and temporarily making it appear as
attached.  This might trick a racing lock_vma_under_rcu() to operate on a
reused vma if it found the vma before it got reused.  To prevent this
situation, we should ensure that vm_refcnt stays at detached state (0)
when it is copied and advances to attached state only after it is added
into the vma tree.  Introduce vm_area_init_from() which preserves new
vma's vm_refcnt and use it in vm_area_dup().  Since all vmas are in
detached state with no current readers when they are freed,

lock_vma_under_rcu() will not be able to take vm_refcnt after vma got
detached even if vma is reused. vma_mark_attached() in modified to
include a release fence to ensure all stores to the vma happen before
vm_refcnt gets initialized.

Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
vm_area_struct reuse and will minimize the number of call_rcu() calls.

[surenb@google.com: remove atomic_set_release() usage in tools/]
  Link: https://lkml.kernel.org/r/20250217054351.2973666-1-surenb@google.com
Link: https://lkml.kernel.org/r/20250213224655.1680278-18-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Shivank Garg <shivankg@amd.com>
  Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:06:21 -07:00
Suren Baghdasaryan
f35ab95ca0 mm: replace vm_lock and detached flag with a reference count
rw_semaphore is a sizable structure of 40 bytes and consumes considerable
space for each vm_area_struct.  However vma_lock has two important
specifics which can be used to replace rw_semaphore with a simpler
structure:

1. Readers never wait.  They try to take the vma_lock and fall back to
   mmap_lock if that fails.

2. Only one writer at a time will ever try to write-lock a vma_lock
   because writers first take mmap_lock in write mode.  Because of these
   requirements, full rw_semaphore functionality is not needed and we can
   replace rw_semaphore and the vma->detached flag with a refcount
   (vm_refcnt).

When vma is in detached state, vm_refcnt is 0 and only a call to
vma_mark_attached() can take it out of this state.  Note that unlike
before, now we enforce both vma_mark_attached() and vma_mark_detached() to
be done only after vma has been write-locked.  vma_mark_attached() changes
vm_refcnt to 1 to indicate that it has been attached to the vma tree. 
When a reader takes read lock, it increments vm_refcnt, unless the top
usable bit of vm_refcnt (0x40000000) is set, indicating presence of a
writer.  When writer takes write lock, it sets the top usable bit to
indicate its presence.  If there are readers, writer will wait using newly
introduced mm->vma_writer_wait.  Since all writers take mmap_lock in write
mode first, there can be only one writer at a time.  The last reader to
release the lock will signal the writer to wake up.  refcount might
overflow if there are many competing readers, in which case read-locking
will fail.  Readers are expected to handle such failures.

In summary:
1. all readers increment the vm_refcnt;
2. writer sets top usable (writer) bit of vm_refcnt;
3. readers cannot increment the vm_refcnt if the writer bit is set;
4. in the presence of readers, writer must wait for the vm_refcnt to drop
to 1 (plus the VMA_LOCK_OFFSET writer bit), indicating an attached vma
with no readers;
5. vm_refcnt overflow is handled by the readers.

While this vm_lock replacement does not yet result in a smaller
vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
allows for further size optimization by structure member regrouping to
bring the size of vm_area_struct below 192 bytes.

[surenb@google.com: fix a crash due to vma_end_read() that should have been removed]
  Link: https://lkml.kernel.org/r/20250220200208.323769-1-surenb@google.com
Link: https://lkml.kernel.org/r/20250213224655.1680278-13-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Shivank Garg <shivankg@amd.com>
  Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:06:20 -07:00
Suren Baghdasaryan
ce08539660 mm: move mmap_init_lock() out of the header file
mmap_init_lock() is used only from mm_init() in fork.c, therefore it does
not have to reside in the header file.  This move lets us avoid including
additional headers in mmap_lock.h later, when mmap_init_lock() needs to
initialize rcuwait object.

Link: https://lkml.kernel.org/r/20250213224655.1680278-9-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Tested-by: Shivank Garg <shivankg@amd.com>
  Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:06:19 -07:00
Suren Baghdasaryan
8ef95d8f15 mm: mark vma as detached until it's added into vma tree
Current implementation does not set detached flag when a VMA is first
allocated.  This does not represent the real state of the VMA, which is
detached until it is added into mm's VMA tree.  Fix this by marking new
VMAs as detached and resetting detached flag only after VMA is added into
a tree.

Introduce vma_mark_attached() to make the API more readable and to
simplify possible future cleanup when vma->vm_mm might be used to indicate
detached vma and vma_mark_attached() will need an additional mm parameter.

Link: https://lkml.kernel.org/r/20250213224655.1680278-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Shivank Garg <shivankg@amd.com>
  Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:06:17 -07:00
Suren Baghdasaryan
7b6218ae12 mm: move per-vma lock into vm_area_struct
Back when per-vma locks were introduces, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing.  Recent investigation [2] revealed that the
regressions is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].

Splitting single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code.  When that split-away part is a lock, it complicates
things even further.  With no performance benefits, there are no reasons
for this split.  Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.  Move
vm_lock back into vm_area_struct, aligning it at the cacheline boundary
and changing the cache to be cacheline-aligned as well.  With kernel
compiled using defconfig, this causes VMA memory consumption to grow from
160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:

    slabinfo before:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    160   51    2 : ...

    slabinfo after moving vm_lock:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    256   32    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs.  Note that the size of this structure is
dependent on the kernel configuration and typically the original size is
higher than 160 bytes.  Therefore these calculations are close to the
worst case scenario.  A more realistic vm_area_struct usage before this
change is:

     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    176   46    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
which is 3.9MB per 100000 VMAs.  This memory consumption growth can be
addressed later by optimizing the vm_lock.

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/

Link: https://lkml.kernel.org/r/20250213224655.1680278-3-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Shivank Garg <shivankg@amd.com>
  Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:06:17 -07:00
Christian Brauner
3155a19407
pidfs: move setting flags into pidfs_alloc_file()
Instead od adding it into __pidfd_prepare() place it where the actual
file allocation happens and update the outdated comment.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-3-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05 13:26:05 +01:00
Christian Brauner
b573bf6f69
pidfd: rely on automatic cleanup in __pidfd_prepare()
Rely on scope-based cleanup for the allocated file descriptor.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-2-c8c3d8361705@kernel.org
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05 13:25:59 +01:00
Nam Cao
91b7be704d fork: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.

Patch was created by using Coccinelle.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/174111145b945391e48936d6debcd43caec3e406.1738746821.git.namcao@linutronix.de
2025-02-18 10:32:32 +01:00
Liam R. Howlett
64c37e134b kernel: be more careful about dup_mmap() failures and uprobe registering
If a memory allocation fails during dup_mmap(), the maple tree can be left
in an unsafe state for other iterators besides the exit path.  All the
locks are dropped before the exit_mmap() call (in mm/mmap.c), but the
incomplete mm_struct can be reached through (at least) the rmap finding
the vmas which have a pointer back to the mm_struct.

Up to this point, there have been no issues with being able to find an
mm_struct that was only partially initialised.  Syzbot was able to make
the incomplete mm_struct fail with recent forking changes, so it has been
proven unsafe to use the mm_struct that hasn't been initialised, as
referenced in the link below.

Although 8ac662f5da ("fork: avoid inappropriate uprobe access to
invalid mm") fixed the uprobe access, it does not completely remove the
race.

This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the
oom side (even though this is extremely unlikely to be selected as an oom
victim in the race window), and sets MMF_UNSTABLE to avoid other potential
users from using a partially initialised mm_struct.

When registering vmas for uprobe, skip the vmas in an mm that is marked
unstable.  Modifying a vma in an unstable mm may cause issues if the mm
isn't fully initialised.

Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com
Fixes: d240629148 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:25 -08:00
Linus Torvalds
c159dfbdd4 Mainly individually changelogged singleton patches. The patch series in
this pull are:
 
 - "lib min_heap: Improve min_heap safety, testing, and documentation"
   from Kuan-Wei Chiu provides various tightenings to the min_heap library
   code.
 
 - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some
   cleanup and Rust preparation in the xarray library code.
 
 - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes
   pathnames in some code comments.
 
 - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the
   new secs_to_jiffies() in various places where that is appropriate.
 
 - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
   switches two filesystems to the new mount API.
 
 - "Convert ocfs2 to use folios" from Matthew Wilcox does that.
 
 - "Remove get_task_comm() and print task comm directly" from Yafang Shao
   removes now-unneeded calls to get_task_comm() in various places.
 
 - "squashfs: reduce memory usage and update docs" from Phillip Lougher
   implements some memory savings in squashfs and performs some
   maintainability work.
 
 - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
   tightens the sort code's behaviour and adds some maintenance work.
 
 - "nilfs2: protect busy buffer heads from being force-cleared" from
   Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a
   corrupted image.
 
 - "nilfs2: fix kernel-doc comments for function return values" from
   Ryusuke Konishi fixes some nilfs kerneldoc.
 
 - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
   addresses some nilfs BUG_ONs which syzbot was able to trigger.
 
 - "minmax.h: Cleanups and minor optimisations" from David Laight
   does some maintenance work on the min/max library code.
 
 - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work
   on the xarray library code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA
 jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX
 mJnaZ1/ibpEpearrChCViApQtcyEGQI=
 =ti4E
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Mainly individually changelogged singleton patches. The patch series
  in this pull are:

   - "lib min_heap: Improve min_heap safety, testing, and documentation"
     from Kuan-Wei Chiu provides various tightenings to the min_heap
     library code

   - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
     some cleanup and Rust preparation in the xarray library code

   - "Update reference to include/asm-<arch>" from Geert Uytterhoeven
     fixes pathnames in some code comments

   - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
     the new secs_to_jiffies() in various places where that is
     appropriate

   - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
     switches two filesystems to the new mount API

   - "Convert ocfs2 to use folios" from Matthew Wilcox does that

   - "Remove get_task_comm() and print task comm directly" from Yafang
     Shao removes now-unneeded calls to get_task_comm() in various
     places

   - "squashfs: reduce memory usage and update docs" from Phillip
     Lougher implements some memory savings in squashfs and performs
     some maintainability work

   - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
     tightens the sort code's behaviour and adds some maintenance work

   - "nilfs2: protect busy buffer heads from being force-cleared" from
     Ryusuke Konishi fixes an issues in nlifs when the fs is presented
     with a corrupted image

   - "nilfs2: fix kernel-doc comments for function return values" from
     Ryusuke Konishi fixes some nilfs kerneldoc

   - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
     addresses some nilfs BUG_ONs which syzbot was able to trigger

   - "minmax.h: Cleanups and minor optimisations" from David Laight does
     some maintenance work on the min/max library code

   - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
     work on the xarray library code"

* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
  ocfs2: use str_yes_no() and str_no_yes() helper functions
  include/linux/lz4.h: add some missing macros
  Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
  Xarray: remove repeat check in xas_squash_marks()
  Xarray: distinguish large entries correctly in xas_split_alloc()
  Xarray: move forward index correctly in xas_pause()
  Xarray: do not return sibling entries from xas_find_marked()
  ipc/util.c: complete the kernel-doc function descriptions
  gcov: clang: use correct function param names
  latencytop: use correct kernel-doc format for func params
  minmax.h: remove some #defines that are only expanded once
  minmax.h: simplify the variants of clamp()
  minmax.h: move all the clamp() definitions after the min/max() ones
  minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
  minmax.h: reduce the #define expansion of min(), max() and clamp()
  minmax.h: update some comments
  minmax.h: add whitespace around operators and after commas
  nilfs2: do not update mtime of renamed directory that is not moved
  nilfs2: handle errors that nilfs_prepare_chunk() may return
  CREDITS: fix spelling mistake
  ...
2025-01-26 17:50:53 -08:00
Linus Torvalds
8883957b3c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmePs7oACgkQnJ2qBz9k
 QNmHuAf9GkLnY5u1/81xP5V9ukZ4N2yeMW0dydLS5cjWj/St5ELeMAza3jeqtJtD
 j36vbnmy2c5pPaGLAK8BJpMXT/R2TkmmKD004zcfqF2S3SgbGzdgO1zMZzq9KJpM
 woRKZtLuglDajedsDEBBcKotBhlN2+C/sQlFuL1mX4zitk9ajr0qYUB1+JqOeg5f
 qwPsDLT077ADpxd7lVIMcm+OqbduP5KWkBKYHpn7lJcLe1eqVMMzceJroW42zhVG
 Dq8Iln26bbU9Wx6FSPFCUcHEzHRHUfXmu07HN9U0X++0QgWjrmBQQLooGFB/bR4a
 edBrPpVas6xE4/brjgFX3gOKtv8xYg==
 =ewDV
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify pre-content notification support from Jan Kara:
 "This introduces a new fsnotify event (FS_PRE_ACCESS) that gets
  generated before a file contents is accessed.

  The event is synchronous so if there is listener for this event, the
  kernel waits for reply. On success the execution continues as usual,
  on failure we propagate the error to userspace. This allows userspace
  to fill in file content on demand from slow storage. The context in
  which the events are generated has been picked so that we don't hold
  any locks and thus there's no risk of a deadlock for the userspace
  handler.

  The new pre-content event is available only for users with global
  CAP_SYS_ADMIN capability (similarly to other parts of fanotify
  functionality) and it is an administrator responsibility to make sure
  the userspace event handler doesn't do stupid stuff that can DoS the
  system.

  Based on your feedback from the last submission, fsnotify code has
  been improved and now file->f_mode encodes whether pre-content event
  needs to be generated for the file so the fast path when nobody wants
  pre-content event for the file just grows the additional file->f_mode
  check. As a bonus this also removes the checks whether the old
  FS_ACCESS event needs to be generated from the fast path. Also the
  place where the event is generated during page fault has been moved so
  now filemap_fault() generates the event if and only if there is no
  uptodate folio in the page cache.

  Also we have dropped FS_PRE_MODIFY event as current real-world users
  of the pre-content functionality don't really use it so let's start
  with the minimal useful feature set"

* tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
  fanotify: Fix crash in fanotify_init(2)
  fs: don't block write during exec on pre-content watched files
  fs: enable pre-content events on supported file systems
  ext4: add pre-content fsnotify hook for DAX faults
  btrfs: disable defrag on pre-content watched files
  xfs: add pre-content fsnotify hook for DAX faults
  fsnotify: generate pre-content permission event on page fault
  mm: don't allow huge faults for files with pre content watches
  fanotify: disable readahead if we have pre-content watches
  fanotify: allow to set errno in FAN_DENY permission response
  fanotify: report file range info with pre-content events
  fanotify: introduce FAN_PRE_ACCESS permission event
  fsnotify: generate pre-content permission event on truncate
  fsnotify: pass optional file access range in pre-content event
  fsnotify: introduce pre-content permission events
  fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
  fanotify: rename a misnamed constant
  fanotify: don't skip extra event info if no info_mode is set
  fsnotify: check if file is actually being watched for pre-content events on open
  fsnotify: opt-in for permission events at file open time
  ...
2025-01-23 13:36:06 -08:00
Linus Torvalds
6c4aa896eb Performance events changes for v6.14:
- Seqlock optimizations that arose in a perf context and were
    merged into the perf tree:
 
    - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
    - mm: Convert mm_lock_seq to a proper seqcount ((Suren Baghdasaryan)
    - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren Baghdasaryan)
    - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)
 
  - Core perf enhancements:
 
    - Reduce 'struct page' footprint of perf by mapping pages
      in advance (Lorenzo Stoakes)
    - Save raw sample data conditionally based on sample type (Yabin Cui)
    - Reduce sampling overhead by checking sample_type in
      perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin Cui)
    - Export perf_exclude_event() (Namhyung Kim)
 
  - Uprobes scalability enhancements: (Andrii Nakryiko)
 
    - Simplify find_active_uprobe_rcu() VMA checks
    - Add speculative lockless VMA-to-inode-to-uprobe resolution
    - Simplify session consumer tracking
    - Decouple return_instance list traversal and freeing
    - Ensure return_instance is detached from the list before freeing
    - Reuse return_instances between multiple uretprobes within task
    - Guard against kmemdup() failing in dup_return_instance()
 
  - AMD core PMU driver enhancements:
 
    - Relax privilege filter restriction on AMD IBS (Namhyung Kim)
 
  - AMD RAPL energy counters support: (Dhananjay Ugwekar)
 
    - Introduce topology_logical_core_id() (K Prateek Nayak)
 
    - Remove the unused get_rapl_pmu_cpumask() function
    - Remove the cpu_to_rapl_pmu() function
    - Rename rapl_pmu variables
    - Make rapl_model struct global
    - Add arguments to the init and cleanup functions
    - Modify the generic variable names to *_pkg*
    - Remove the global variable rapl_msrs
    - Move the cntr_mask to rapl_pmus struct
    - Add core energy counter support for AMD CPUs
 
  - Intel core PMU driver enhancements:
 
    - Support RDPMC 'metrics clear mode' feature (Kan Liang)
    - Clarify adaptive PEBS processing (Kan Liang)
    - Factor out functions for PEBS records processing (Kan Liang)
    - Simplify the PEBS records processing for adaptive PEBS (Kan Liang)
 
  - Intel uncore driver enhancements: (Kan Liang)
 
    - Convert buggy pmu->func_id use to pmu->registered
    - Support more units on Granite Rapids
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeOJdQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i2yQ/+MXl7yfJOgdbwjBpgGGzH4burEO7ppak+
 ktzz+YjpNgjODe/xMAJGjjblouuYArCnRolc1UPvPm6M7jSY76wi42Y6c4dRtFoB
 2ReSrRqnreLOcrRS9nsTjvWRHfJHqJDVSd9TfHX6ILfzbaizCZOGYk558ZxAKRqu
 Lw7FOvLEe/Y3tg4z8dDg083jsasalKySP9wIPc0BkSqQTOfusd3KXju/Fux/9wkn
 hZcUgF4ds+0bH7xtO1/G9ILqGyeq97X1McIR9bAjln5Mxykclen4hSjRaWWHHo9O
 mzBKmd/blIATisfuuW+QLDQow3M1k3688cz7e9QOeWHHd/dJiMb9RLV90jdND/T/
 uLINC5vNemzyWEfnNiYQ31LjhG3SeuDiKWzRp36MbQcCh6EBdRXWLBgtmxq1L/3o
 ZCaCdtFu5+6epycdyOVZEpWDnjdx4GmLXMZi5WJfZ7fZ/IFjNkjk4OdzI1iRQ+i3
 Sbi75ep59ayTUhm5AB7gCJsP3R7EsZsiPHUenQdA2n9Sj6xE+IuhlS/QDQ9g5mdY
 Ijs0jHeVCGmhYoOD1xWnCZSzlnkEVU3zwfypAK+MC7pgtFMwDy5/Bu1USGxXXDy+
 aKsrJRSgHbtZ1gwoHstqkV+DeCTfElCLYkvigzI5Nmyib5Zp4vkwy2ZLWQjaNjm7
 mqRI7PugUkU=
 =c8XB
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
 "Seqlock optimizations that arose in a perf context and were merged
  into the perf tree:

   - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
   - mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan)
   - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren
     Baghdasaryan)
   - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)

  Core perf enhancements:

   - Reduce 'struct page' footprint of perf by mapping pages in advance
     (Lorenzo Stoakes)
   - Save raw sample data conditionally based on sample type (Yabin Cui)
   - Reduce sampling overhead by checking sample_type in
     perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin
     Cui)
   - Export perf_exclude_event() (Namhyung Kim)

  Uprobes scalability enhancements: (Andrii Nakryiko)

   - Simplify find_active_uprobe_rcu() VMA checks
   - Add speculative lockless VMA-to-inode-to-uprobe resolution
   - Simplify session consumer tracking
   - Decouple return_instance list traversal and freeing
   - Ensure return_instance is detached from the list before freeing
   - Reuse return_instances between multiple uretprobes within task
   - Guard against kmemdup() failing in dup_return_instance()

  AMD core PMU driver enhancements:

   - Relax privilege filter restriction on AMD IBS (Namhyung Kim)

  AMD RAPL energy counters support: (Dhananjay Ugwekar)

   - Introduce topology_logical_core_id() (K Prateek Nayak)
   - Remove the unused get_rapl_pmu_cpumask() function
   - Remove the cpu_to_rapl_pmu() function
   - Rename rapl_pmu variables
   - Make rapl_model struct global
   - Add arguments to the init and cleanup functions
   - Modify the generic variable names to *_pkg*
   - Remove the global variable rapl_msrs
   - Move the cntr_mask to rapl_pmus struct
   - Add core energy counter support for AMD CPUs

  Intel core PMU driver enhancements:

   - Support RDPMC 'metrics clear mode' feature (Kan Liang)
   - Clarify adaptive PEBS processing (Kan Liang)
   - Factor out functions for PEBS records processing (Kan Liang)
   - Simplify the PEBS records processing for adaptive PEBS (Kan Liang)

  Intel uncore driver enhancements: (Kan Liang)

   - Convert buggy pmu->func_id use to pmu->registered
   - Support more units on Granite Rapids"

* tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  perf: map pages in advance
  perf/x86/intel/uncore: Support more units on Granite Rapids
  perf/x86/intel/uncore: Clean up func_id
  perf/x86/intel: Support RDPMC metrics clear mode
  uprobes: Guard against kmemdup() failing in dup_return_instance()
  perf/x86: Relax privilege filter restriction on AMD IBS
  perf/core: Export perf_exclude_event()
  uprobes: Reuse return_instances between multiple uretprobes within task
  uprobes: Ensure return_instance is detached from the list before freeing
  uprobes: Decouple return_instance list traversal and freeing
  uprobes: Simplify session consumer tracking
  uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  uprobes: simplify find_active_uprobe_rcu() VMA checks
  mm: introduce mmap_lock_speculate_{try_begin|retry}
  mm: convert mm_lock_seq to a proper seqcount
  mm/gup: Use raw_seqcount_try_begin()
  seqlock: add raw_seqcount_try_begin
  perf/x86/rapl: Add core energy counter support for AMD CPUs
  perf/x86/rapl: Move the cntr_mask to rapl_pmus struct
  perf/x86/rapl: Remove the global variable rapl_msrs
  ...
2025-01-21 10:52:03 -08:00
Mateusz Guzik
51f8bd6db5 get_task_exe_file: check PF_KTHREAD locklessly
Same thing as 8ac5dc6659 ("get_task_mm: check PF_KTHREAD lockless")

Nowadays PF_KTHREAD is sticky and it was never protected by ->alloc_lock. 
Move the PF_KTHREAD check outside of task_lock() section to make this code
more understandable.

Link: https://lkml.kernel.org/r/20241119143526.704986-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12 20:21:06 -08:00
Lorenzo Stoakes
8ac662f5da fork: avoid inappropriate uprobe access to invalid mm
If dup_mmap() encounters an issue, currently uprobe is able to access the
relevant mm via the reverse mapping (in build_map_info()), and if we are
very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which
we establish as part of the fork error path.

This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which
in turn invokes find_mergeable_anon_vma() that uses a VMA iterator,
invoking vma_iter_load() which uses the advanced maple tree API and thus
is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit
d240629148 ("fork: use __mt_dup() to duplicate maple tree in
dup_mmap()").

This change was made on the assumption that only process tear-down code
would actually observe (and make use of) these values.  However this very
unlikely but still possible edge case with uprobes exists and
unfortunately does make these observable.

The uprobe operation prevents races against the dup_mmap() operation via
the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap()
and dropped via uprobe_end_dup_mmap(), and held across
register_for_each_vma() prior to invoking build_map_info() which does the
reverse mapping lookup.

Currently these are acquired and dropped within dup_mmap(), which exposes
the race window prior to error handling in the invoking dup_mm() which
tears down the mm.

We can avoid all this by just moving the invocation of
uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm()
and only release this lock once the dup_mmap() operation succeeds or clean
up is done.

This means that the uprobe code can never observe an incompletely
constructed mm and resolves the issue in this case.

Link: https://lkml.kernel.org/r/20241210172412.52995-1-lorenzo.stoakes@oracle.com
Fixes: d240629148 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:04:44 -08:00
Amir Goldstein
0357ef03c9 fs: don't block write during exec on pre-content watched files
Commit 2a010c4128 ("fs: don't block i_writecount during exec") removed
the legacy behavior of getting ETXTBSY on attempt to open and executable
file for write while it is being executed.

This commit was reverted because an application that depends on this
legacy behavior was broken by the change.

We need to allow HSM writing into executable files while executed to
fill their content on-the-fly.

To that end, disable the ETXTBSY legacy behavior for files that are
watched by pre-content events.

This change is not expected to cause regressions with existing systems
which do not have any pre-content event listeners.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241128142532.465176-1-amir73il@gmail.com
2024-12-11 17:45:18 +01:00
Suren Baghdasaryan
eb449bd969 mm: convert mm_lock_seq to a proper seqcount
Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock
variants to increment it, in-line with the usual seqcount usage pattern.
This lets us check whether the mmap_lock is write-locked by checking
mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be
used when implementing mmap_lock speculation functions.
As a result vm_lock_seq is also change to be unsigned to match the type
of mm_lock_seq.sequence.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Link: https://lkml.kernel.org/r/20241122174416.1367052-2-surenb@google.com
2024-12-02 12:01:38 +01:00
Christian Brauner
3b83203538
Revert "fs: don't block i_writecount during exec"
This reverts commit 2a010c4128.

Rui Ueyama <rui314@gmail.com> writes:

> I'm the creator and the maintainer of the mold linker
> (https://github.com/rui314/mold). Recently, we discovered that mold
> started causing process crashes in certain situations due to a change
> in the Linux kernel. Here are the details:
>
> - In general, overwriting an existing file is much faster than
> creating an empty file and writing to it on Linux, so mold attempts to
> reuse an existing executable file if it exists.
>
> - If a program is running, opening the executable file for writing
> previously failed with ETXTBSY. If that happens, mold falls back to
> creating a new file.
>
> - However, the Linux kernel recently changed the behavior so that
> writing to an executable file is now always permitted
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a010c412853).
>
> That caused mold to write to an executable file even if there's a
> process running that file. Since changes to mmap'ed files are
> immediately visible to other processes, any processes running that
> file would almost certainly crash in a very mysterious way.
> Identifying the cause of these random crashes took us a few days.
>
> Rejecting writes to an executable file that is currently running is a
> well-known behavior, and Linux had operated that way for a very long
> time. So, I don’t believe relying on this behavior was our mistake;
> rather, I see this as a regression in the Linux kernel.

Quoting myself from commit 2a010c4128 ("fs: don't block i_writecount during exec")

> Yes, someone in userspace could potentially be relying on this. It's not
> completely out of the realm of possibility but let's find out if that's
> actually the case and not guess.

It seems we found out that someone is relying on this obscure behavior.
So revert the change.

Link: https://github.com/rui314/mold/issues/1361
Link: https://lore.kernel.org/r/4a2bc207-76be-4715-8e12-7fc45a76a125@leemhuis.info
Cc: <stable@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-27 12:51:30 +01:00
Linus Torvalds
5c00ff742b - The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection algorithm.
   This leads to improved memory savings.
 
 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
 
 	- "refine mas_mab_cp()"
 	- "Reduce the space to be cleared for maple_big_node"
 	- "maple_tree: simplify mas_push_node()"
 	- "Following cleanup after introduce mas_wr_store_type()"
 	- "refine storing null"
 
 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.
 
 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping code.
 
 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of shadow
   entries.
 
 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.
 
 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in the
   hugetlb code.
 
 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page into
   small pages.  Instead we replace the whole thing with a THP.  More
   consistent cleaner and potentiall saves a large number of pagefaults.
 
 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.
 
 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to do.
 
 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio size
   rather than as individual pages.  A 20% speedup was observed.
 
 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting.
 
 - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt
   removes the long-deprecated memcgv2 charge moving feature.
 
 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.
 
 - The series "x86/module: use large ROX pages for text allocations" from
   Mike Rapoport teaches x86 to use large pages for read-only-execute
   module text.
 
 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.
 
 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/.  A slow march towards shrinking
   struct page.
 
 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.
 
 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression.  It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.
 
 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests
   over to the KUnit framework.
 
 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a single
   VMA, rather than requiring that multiple VMAs be created for this.
   Improved efficiencies for userspace memory allocators are expected.
 
 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.
 
 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.
 
 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP from
   the kernel boot command line.
 
 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.
 
 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep is
   enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzwFqgAKCRDdBJ7gKXxA
 jkeuAQCkl+BmeYHE6uG0hi3pRxkupseR6DEOAYIiTv0/l8/GggD/Z3jmEeqnZaNq
 xyyenpibWgUoShU2wZ/Ha8FE5WDINwg=
 =JfWR
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "zram: optimal post-processing target selection" from
   Sergey Senozhatsky improves zram's post-processing selection
   algorithm. This leads to improved memory savings.

 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
	- "refine mas_mab_cp()"
	- "Reduce the space to be cleared for maple_big_node"
	- "maple_tree: simplify mas_push_node()"
	- "Following cleanup after introduce mas_wr_store_type()"
	- "refine storing null"

 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.

 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping
   code.

 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of
   shadow entries.

 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.

 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in
   the hugetlb code.

 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page
   into small pages. Instead we replace the whole thing with a THP. More
   consistent cleaner and potentiall saves a large number of pagefaults.

 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.

 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to
   do.

 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio
   size rather than as individual pages. A 20% speedup was observed.

 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
   splitting.

 - The series "memcg-v1: fully deprecate charge moving" from Shakeel
   Butt removes the long-deprecated memcgv2 charge moving feature.

 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.

 - The series "x86/module: use large ROX pages for text allocations"
   from Mike Rapoport teaches x86 to use large pages for
   read-only-execute module text.

 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.

 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/. A slow march towards shrinking
   struct page.

 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.

 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression. It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.

 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
   tests over to the KUnit framework.

 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a
   single VMA, rather than requiring that multiple VMAs be created for
   this. Improved efficiencies for userspace memory allocators are
   expected.

 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.

 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.

 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP
   from the kernel boot command line.

 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.

 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep
   is enabled.

* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
  cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
  mm/kfence: add a new kunit test test_use_after_free_read_nofault()
  zram: fix NULL pointer in comp_algorithm_show()
  memcg/hugetlb: add hugeTLB counters to memcg
  vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
  mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
  zram: ZRAM_DEF_COMP should depend on ZRAM
  MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
  Docs/mm/damon: recommend academic papers to read and/or cite
  mm: define general function pXd_init()
  kmemleak: iommu/iova: fix transient kmemleak false positive
  mm/list_lru: simplify the list_lru walk callback function
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: code clean up for reparenting
  mm/list_lru: don't export list_lru_add
  mm/list_lru: don't pass unnecessary key parameters
  kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
  kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
  kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
  ...
2024-11-23 09:58:07 -08:00
Linus Torvalds
bf9aa14fc5 A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
 
     posix-timers are currently auto-rearmed by the kernel when the signal
     of the timer is ignored so that the timer signal can be delivered once
     the corresponding signal is unignored.
 
     This requires to throttle the timer to prevent a DoS by small intervals
     and keeps the system pointlessly out of low power states for no value.
     This is a long standing non-trivial problem due to the lock order of
     posix-timer lock and the sighand lock along with life time issues as
     the timer and the sigqueue have different life time rules.
 
     Cure this by:
 
      * Embedding the sigqueue into the timer struct to have the same life
        time rules. Aside of that this also avoids the lookup of the timer
        in the signal delivery and rearm path as it's just a always valid
        container_of() now.
 
      * Queuing ignored timer signals onto a seperate ignored list.
 
      * Moving queued timer signals onto the ignored list when the signal is
        switched to SIG_IGN before it could be delivered.
 
      * Walking the ignored list when SIG_IGN is lifted and requeue the
        signals to the actual signal lists. This allows the signal delivery
        code to rearm the timer.
 
     This also required to consolidate the signal delivery rules so they are
     consistent across all situations. With that all self test scenarios
     finally succeed.
 
   - Core infrastructure for VFS multigrain timestamping
 
     This is required to allow the kernel to use coarse grained time stamps
     by default and switch to fine grained time stamps when inode attributes
     are actively observed via getattr().
 
     These changes have been provided to the VFS tree as well, so that the
     VFS specific infrastructure could be built on top.
 
   - Cleanup and consolidation of the sleep() infrastructure
 
     * Move all sleep and timeout functions into one file
 
     * Rework udelay() and ndelay() into proper documented inline functions
       and replace the hardcoded magic numbers by proper defines.
 
     * Rework the fsleep() implementation to take the reality of the timer
       wheel granularity on different HZ values into account. Right now the
       boundaries are hard coded time ranges which fail to provide the
       requested accuracy on different HZ settings.
 
     * Update documentation for all sleep/timeout related functions and fix
       up stale documentation links all over the place
 
     * Fixup a few usage sites
 
   - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks
 
     A system can have multiple PTP clocks which are participating in
     seperate and independent PTP clock domains. So far the kernel only
     considers the PTP clock which is based on CLOCK TAI relevant as that's
     the clock which drives the timekeeping adjustments via the various user
     space daemons through adjtimex(2).
 
     The non TAI based clock domains are accessible via the file descriptor
     based posix clocks, but their usability is very limited. They can't be
     accessed fast as they always go all the way out to the hardware and
     they cannot be utilized in the kernel itself.
 
     As Time Sensitive Networking (TSN) gains traction it is required to
     provide fast user and kernel space access to these clocks.
 
     The approach taken is to utilize the timekeeping and adjtimex(2)
     infrastructure to provide this access in a similar way how the kernel
     provides access to clock MONOTONIC, REALTIME etc.
 
     Instead of creating a duplicated infrastructure this rework converts
     timekeeping and adjtimex(2) into generic functionality which operates
     on pointers to data structures instead of using static variables.
 
     This allows to provide time accessors and adjtimex(2) functionality for
     the independent PTP clocks in a subsequent step.
 
   - Consolidate hrtimer initialization
 
     hrtimers are set up by initializing the data structure and then
     seperately setting the callback function for historical reasons.
 
     That's an extra unnecessary step and makes Rust support less straight
     forward than it should be.
 
     Provide a new set of hrtimer_setup*() functions and convert the core
     code and a few usage sites of the less frequently used interfaces over.
 
     The bulk of the htimer_init() to hrtimer_setup() conversion is already
     prepared and scheduled for the next merge window.
 
   - Drivers:
 
     * Ensure that the global timekeeping clocksource is utilizing the
       cluster 0 timer on MIPS multi-cluster systems.
 
       Otherwise CPUs on different clusters use their cluster specific
       clocksource which is not guaranteed to be synchronized with other
       clusters.
 
     * Mostly boring cleanups, fixes, improvements and code movement
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kPITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZKkD/9OUL6fOJrDUmOYBa4QVeMyfTef4EaL
 tvwIMM/29XQFeiq3xxCIn+EMnHjXn2lvIhYGQ7GKsbKYwvJ7ZBDpQb+UMhZ2nKI9
 6D6BP6WomZohKeH2fZbJQAdqOi3KRYdvQdIsVZUexkqiaVPphRvOH9wOr45gHtZM
 EyMRSotPlQTDqcrbUejDMEO94GyjDCYXRsyATLxjmTzL/N4xD4NRIiotjM2vL/a9
 8MuCgIhrKUEyYlFoOxxeokBsF3kk3/ez2jlG9b/N8VLH3SYIc2zgL58FBgWxlmgG
 bY71nVG3nUgEjxBd2dcXAVVqvb+5widk8p6O7xxOAQKTLMcJ4H0tQDkMnzBtUzvB
 DGAJDHAmAr0g+ja9O35Pkhunkh4HYFIbq0Il4d1HMKObhJV0JumcKuQVxrXycdm3
 UZfq3seqHsZJQbPgCAhlFU0/2WWScocbee9bNebGT33KVwSp5FoVv89C/6Vjb+vV
 Gusc3thqrQuMAZW5zV8g4UcBAA/xH4PB0I+vHib+9XPZ4UQ7/6xKl2jE0kd5hX7n
 AAUeZvFNFqIsY+B6vz+Jx/yzyM7u5cuXq87pof5EHVFzv56lyTp4ToGcOGYRgKH5
 JXeYV1OxGziSDrd5vbf9CzdWMzqMvTefXrHbWrjkjhNOe8E1A8O88RZ5uRKZhmSw
 hZZ4hdM9+3T7cg==
 =2VC6
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "A rather large update for timekeeping and timers:

   - The final step to get rid of auto-rearming posix-timers

     posix-timers are currently auto-rearmed by the kernel when the
     signal of the timer is ignored so that the timer signal can be
     delivered once the corresponding signal is unignored.

     This requires to throttle the timer to prevent a DoS by small
     intervals and keeps the system pointlessly out of low power states
     for no value. This is a long standing non-trivial problem due to
     the lock order of posix-timer lock and the sighand lock along with
     life time issues as the timer and the sigqueue have different life
     time rules.

     Cure this by:

       - Embedding the sigqueue into the timer struct to have the same
         life time rules. Aside of that this also avoids the lookup of
         the timer in the signal delivery and rearm path as it's just a
         always valid container_of() now.

       - Queuing ignored timer signals onto a seperate ignored list.

       - Moving queued timer signals onto the ignored list when the
         signal is switched to SIG_IGN before it could be delivered.

       - Walking the ignored list when SIG_IGN is lifted and requeue the
         signals to the actual signal lists. This allows the signal
         delivery code to rearm the timer.

     This also required to consolidate the signal delivery rules so they
     are consistent across all situations. With that all self test
     scenarios finally succeed.

   - Core infrastructure for VFS multigrain timestamping

     This is required to allow the kernel to use coarse grained time
     stamps by default and switch to fine grained time stamps when inode
     attributes are actively observed via getattr().

     These changes have been provided to the VFS tree as well, so that
     the VFS specific infrastructure could be built on top.

   - Cleanup and consolidation of the sleep() infrastructure

       - Move all sleep and timeout functions into one file

       - Rework udelay() and ndelay() into proper documented inline
         functions and replace the hardcoded magic numbers by proper
         defines.

       - Rework the fsleep() implementation to take the reality of the
         timer wheel granularity on different HZ values into account.
         Right now the boundaries are hard coded time ranges which fail
         to provide the requested accuracy on different HZ settings.

       - Update documentation for all sleep/timeout related functions
         and fix up stale documentation links all over the place

       - Fixup a few usage sites

   - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
     clocks

     A system can have multiple PTP clocks which are participating in
     seperate and independent PTP clock domains. So far the kernel only
     considers the PTP clock which is based on CLOCK TAI relevant as
     that's the clock which drives the timekeeping adjustments via the
     various user space daemons through adjtimex(2).

     The non TAI based clock domains are accessible via the file
     descriptor based posix clocks, but their usability is very limited.
     They can't be accessed fast as they always go all the way out to
     the hardware and they cannot be utilized in the kernel itself.

     As Time Sensitive Networking (TSN) gains traction it is required to
     provide fast user and kernel space access to these clocks.

     The approach taken is to utilize the timekeeping and adjtimex(2)
     infrastructure to provide this access in a similar way how the
     kernel provides access to clock MONOTONIC, REALTIME etc.

     Instead of creating a duplicated infrastructure this rework
     converts timekeeping and adjtimex(2) into generic functionality
     which operates on pointers to data structures instead of using
     static variables.

     This allows to provide time accessors and adjtimex(2) functionality
     for the independent PTP clocks in a subsequent step.

   - Consolidate hrtimer initialization

     hrtimers are set up by initializing the data structure and then
     seperately setting the callback function for historical reasons.

     That's an extra unnecessary step and makes Rust support less
     straight forward than it should be.

     Provide a new set of hrtimer_setup*() functions and convert the
     core code and a few usage sites of the less frequently used
     interfaces over.

     The bulk of the htimer_init() to hrtimer_setup() conversion is
     already prepared and scheduled for the next merge window.

   - Drivers:

       - Ensure that the global timekeeping clocksource is utilizing the
         cluster 0 timer on MIPS multi-cluster systems.

         Otherwise CPUs on different clusters use their cluster specific
         clocksource which is not guaranteed to be synchronized with
         other clusters.

       - Mostly boring cleanups, fixes, improvements and code movement"

* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
  posix-timers: Fix spurious warning on double enqueue versus do_exit()
  clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
  clocksource/drivers/gpx: Remove redundant casts
  clocksource/drivers/timer-ti-dm: Fix child node refcount handling
  dt-bindings: timer: actions,owl-timer: convert to YAML
  clocksource/drivers/ralink: Add Ralink System Tick Counter driver
  clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
  clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
  clocksource/drivers:sp804: Make user selectable
  clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
  hrtimers: Delete hrtimer_init_on_stack()
  alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
  io_uring: Switch to use hrtimer_setup_on_stack()
  sched/idle: Switch to use hrtimer_setup_on_stack()
  hrtimers: Delete hrtimer_init_sleeper_on_stack()
  wait: Switch to use hrtimer_setup_sleeper_on_stack()
  timers: Switch to use hrtimer_setup_sleeper_on_stack()
  net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
  futex: Switch to use hrtimer_setup_sleeper_on_stack()
  fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
  ...
2024-11-19 16:35:06 -08:00
Linus Torvalds
0892d74213 x86/splitlock changes for v6.13:
- Move Split and Bus lock code to a dedicated file (Ravi Bangoria)
  - Add split/bus lock support for AMD (Ravi Bangoria)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7gMERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hEaQ//YRk2Dc3VkiwC+ZE44Bi4ZlztACzjvkL/
 sFjOqX4dSWJLMFDPfISGGEN4e20IFA46uYXwoZQOZEz5RY4tPaJYw+o1aBP5YYEN
 EEv4iRc20FIIYckkyCShP00dKoZlmb6FbxyUysRRwZW0XJuMVLyJnGNmZs0peVvt
 5c8+7erl0CPN9RaR66lULT4YenyvUZ7DChfeB3a1LbazC5+IrEumiIysLJUKj6zN
 075+FeQ084156sFR+LUSjblxLKzY/OqT/727osST2WlMo/HWLIJImCXodHMHG+LC
 dRI0NFFU9zn2G6rGcoltLNsU/TSJfaWoGS8pm6c96kItEZly/BFz5MF1IQIbCfDx
 YFJpil1zJQQeV3FUXldhKGoSio0fv0KWcqC0TLjj/DhqprjdktJGuGIX6ChmkytA
 TDLZPWZxInZdVnWVMBuaJ6defMRBLART02u9DRIoXYEX6aDLjJ1JFTRe5hU9vVab
 cq+GR3ZSeDM9gSGjfW6dGG5746KXX+Wwxv4stxSoygSxmrLPH38CrZ5m66edtKzq
 P+V2/utvhdHZSKawsIpM4Xz5u7fweySkVFQjJyEEeMWyXnfC+alP9OUsVTKS8mFa
 zKbX7mEgnBDcEE9w6O5itL4nIgB3Kooci5uEWDRTAYUee82Hqk09Ycyb5XQkJ7bs
 Cl65CoY+XAA=
 =QpKp
 -----END PGP SIGNATURE-----

Merge tag 'x86-splitlock-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 splitlock updates from Ingo Molnar:

 - Move Split and Bus lock code to a dedicated file (Ravi Bangoria)

 - Add split/bus lock support for AMD (Ravi Bangoria)

* tag 'x86-splitlock-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bus_lock: Add support for AMD
  x86/split_lock: Move Split and Bus lock code to a dedicated file
2024-11-19 14:34:02 -08:00
Linus Torvalds
3f020399e4 Scheduler changes for v6.13:
- Core facilities:
 
     - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes
       fair-class preemption by delaying preemption requests to the
       tick boundary, while working as full preemption for RR/FIFO/DEADLINE
       classes. (Peter Zijlstra)
 
         - x86: Enable Lazy preemption (Peter Zijlstra)
         - riscv: Enable Lazy preemption (Jisheng Zhang)
 
     - Initialize idle tasks only once (Thomas Gleixner)
 
     - sched/ext: Remove sched_fork() hack (Thomas Gleixner)
 
  - Fair scheduler:
     - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)
 
  - Idle loop:
       Optimize the generic idle loop by removing unnecessary
       memory barrier (Zhongqiu Han)
 
  - RSEQ:
     - Improve cache locality of RSEQ concurrency IDs for
       intermittent workloads (Mathieu Desnoyers)
 
  - Waitqueues:
     - Make wake_up_{bit,var} less fragile (Neil Brown)
 
  - PSI:
     - Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner)
 
  - Preparatory patches for proxy execution:
     - core: Add move_queued_task_locked helper (Connor O'Brien)
     - core: Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
     - core: Split out __schedule() deactivate task logic into a helper (John Stultz)
     - core: Split scheduler and execution contexts (Peter Zijlstra)
     - locking/mutex: Make mutex::wait_lock irq safe (Juri Lelli)
     - locking/mutex: Expose __mutex_owner() (Juri Lelli)
     - locking/mutex: Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
 
  - Misc fixes and cleanups:
     - core: Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp)
     - core: Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior)
     - wait: Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
     - fair: remove the DOUBLE_TICK feature (Huang Shijie)
     - fair: fix the comment for PREEMPT_SHORT (Huang Shijie)
     - uclamp: Fix unnused variable warning (Christian Loehle)
     - rt: No PREEMPT_RT=y for all{yes,mod}config
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7fnQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hZTBAAozVdWA2m51aNa67HvAZta/olmrIagVbW
 inwbTgqa8b+UfeWEuKOfrZr5khjEh6pLgR3dBTib1uH6xxYj/Okds+qbPWSBPVLh
 yzavlm/zJZM1U1XtxE3eyVfqWik4GrY7DoIMDQQr+YH7rNXonJeJkll38OI2E5MC
 q3Q01qyMo8RJJX8qkf3f8ObOoP/51NsVniTw0Zb2fzEhXz8FjezLlxk6cMfgSkJG
 lg9gfIwUZ7Xg5neRo4kJcc3Ht31KYOhWSiupBJzRD1hss/N/AybvMcTX/Cm8d07w
 HIAdDDAn84o46miFo/a0V/hsJZ72idWbqxVJUCtaezrpOUiFkG+uInRvG/ynr0lF
 5dEI9f+6PUw8Nc7L72IyHkobjPqS2IefSaxYYCBKmxMX2qrenfTor/pKiWzzhBIl
 rX3MZSuUJ8NjV4rNGD/qXRM1IsMJrsDwxDyv+sRec3XdH33x286ds6aAUEPDQ6N7
 96VS0sOKcNUJN8776ErNjlIxRl8HTlpkaO3nZlQIfXgTlXUpRvOuKbEWqP+606lo
 oANgJTKgUhgJPWZnvmdRxDjSiOp93QcImjus9i1tN81FGiEDleONsJUxu2Di1E5+
 s1nCiytjq+cdvzCqFyiOZUh+g6kSZ4yXxNgLg2UvbXzX1zOeUQT3WtyKUhMPXhU8
 esh1TgbUbpE=
 =Zcqj
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core facilities:

   - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which
     optimizes fair-class preemption by delaying preemption requests to
     the tick boundary, while working as full preemption for
     RR/FIFO/DEADLINE classes. (Peter Zijlstra)
        - x86: Enable Lazy preemption (Peter Zijlstra)
        - riscv: Enable Lazy preemption (Jisheng Zhang)

   - Initialize idle tasks only once (Thomas Gleixner)

   - sched/ext: Remove sched_fork() hack (Thomas Gleixner)

  Fair scheduler:

   - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)

  Idle loop:

   - Optimize the generic idle loop by removing unnecessary memory
     barrier (Zhongqiu Han)

  RSEQ:

   - Improve cache locality of RSEQ concurrency IDs for intermittent
     workloads (Mathieu Desnoyers)

  Waitqueues:

   - Make wake_up_{bit,var} less fragile (Neil Brown)

  PSI:

   - Pass enqueue/dequeue flags to psi callbacks directly (Johannes
     Weiner)

  Preparatory patches for proxy execution:

   - Add move_queued_task_locked helper (Connor O'Brien)

   - Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)

   - Split out __schedule() deactivate task logic into a helper (John
     Stultz)

   - Split scheduler and execution contexts (Peter Zijlstra)

   - Make mutex::wait_lock irq safe (Juri Lelli)

   - Expose __mutex_owner() (Juri Lelli)

   - Remove wakeups from under mutex::wait_lock (Peter Zijlstra)

  Misc fixes and cleanups:

   - Remove unused __HAVE_THREAD_FUNCTIONS hook support (David
     Disseldorp)

   - Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej
     Siewior)

   - Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)

   - remove the DOUBLE_TICK feature (Huang Shijie)

   - fix the comment for PREEMPT_SHORT (Huang Shijie)

   - Fix unnused variable warning (Christian Loehle)

   - No PREEMPT_RT=y for all{yes,mod}config"

* tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.
  sched: No PREEMPT_RT=y for all{yes,mod}config
  riscv: add PREEMPT_LAZY support
  sched, x86: Enable Lazy preemption
  sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
  sched: Add Lazy preemption model
  sched: Add TIF_NEED_RESCHED_LAZY infrastructure
  sched/ext: Remove sched_fork() hack
  sched: Initialize idle tasks only once
  sched: psi: pass enqueue/dequeue flags to psi callbacks directly
  sched/uclamp: Fix unnused variable warning
  sched: Split scheduler and execution contexts
  sched: Split out __schedule() deactivate task logic into a helper
  sched: Consolidate pick_*_task to task_is_pushable helper
  sched: Add move_queued_task_locked helper
  locking/mutex: Expose __mutex_owner()
  locking/mutex: Make mutex::wait_lock irq safe
  locking/mutex: Remove wakeups from under mutex::wait_lock
  sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
  sched: idle: Optimize the generic idle loop by removing needless memory barrier
  ...
2024-11-19 14:16:06 -08:00
Thomas Gleixner
69f032c92c signal: Provide ignored_posix_timers list
To prepare for handling posix timer signals on sigaction(SIG_IGN) properly,
add a list to task::signal.

This list will be used to queue posix timers so their signal can be
requeued when SIG_IGN is lifted later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.920101900@linutronix.de
2024-11-07 02:14:45 +01:00
Nanyong Sun
f2f484085e mm: move mm flags to mm_types.h
The types of mm flags are now far beyond the core dump related features. 
This patch moves mm flags from linux/sched/coredump.h to linux/mm_types.h.
The linux/sched/coredump.h has include the mm_types.h, so the C files
related to coredump does not need to change head file inclusion.  In
addition, the inclusion of sched/coredump.h now can be deleted from the C
files that irrelevant to core dump.

Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:26 -08:00
Lorenzo Stoakes
cd3f8467af mm: refactor mm_access() to not return NULL
mm_access() can return NULL if the mm is not found, but this is handled
the same as an error in all callers, with some translating this into an
-ESRCH error.

Only proc_mem_open() returns NULL if no mm is found, however in this case
it is clearer and makes more sense to explicitly handle the error. 
Additionally we take the opportunity to refactor the function to eliminate
unnecessary nesting.

Simplify things by simply returning -ESRCH if no mm is found - this both
eliminates confusing use of the IS_ERR_OR_NULL() macro, and simplifies
callers which would return -ESRCH by returning this error directly.

[lorenzo.stoakes@oracle.com: prefer neater pointer error comparison]
  Link: https://lkml.kernel.org/r/2fae1834-749a-45e1-8594-5e5979cf7103@lucifer.local
Link: https://lkml.kernel.org/r/20240924201023.193135-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:23 -08:00
Linus Torvalds
b019b4a670 A single fix for posix CPU timers
When a thread is cloned, the posix CPU timers are not inherited.
 
   If the parent has a CPU timer armed the corresponding tick dependency in
   the tasks tick_dep_mask is set and copied to the new thread, which means
   the new thread and all decendants will prevent the system to go into full
   NOHZ operation.
 
   Clear the tick dependency mask in copy_process() to fix this.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmcnT/oTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZw5D/0bqjjMYiE0ws8nuuXN1gL2T1wt6P2C
 2zzEKnk6nsxnGbMfFs7XifDkqSOHNVro7F6kCkz6cH4U/VSK6R2nNONbufz4mNWk
 6uDvdlZps30ekmxN+C3fB6S/4MNVOhXFFXjsQhT/PxX/CfibVP5fATLtcRLq9Lfx
 mV3nnxKTHPjDGz2/2QRyYpA3G1fzOj/l2QeJsVAIC8GxCo7drLVe0yn5Yt+13zhU
 JMjgt9ox4PKFsjaXrqvt1yKNTtb+evjYSIVXxIg60oxUkfva6XFLXJv3rjbARUNj
 aqBHfkZ1/d2Hwc0WexDZfvhNWeCqnfUA+db7ALSYICbNd37EVxWVZA2TwfkkWKSt
 RDq3xX6NJSd71h0lxDvzv7Ph3NUq23rq3LycAkjqfhiFjPQmE6axtnioXcR5mtVp
 q9tilB/3I4zj4BIYPfd9KowkdclRSK+B3Oo0DTyuVhKLggF0UD3poDxT4HxnBCFB
 uKDV8GDsoD8Ksjsl0/X/D4oorqLYAT0tG9gxMw1Kii16gijhhu5qeqTWY+qs9ieg
 2J+Ku83QLQgljvy2s7y0AnMZMxaeKN5YMs6zNV+yGAoTFyft3CnBMKalPHfOCI0A
 fdKHi2aKm+lLrp/UIG2Yw9N1xfcrGWA2moH+9dw6zTyAei+TInP+WYndI7VR5EDf
 3KvW3OoRf0sywA==
 =83r/
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
 "A single fix for posix CPU timers.

  When a thread is cloned, the posix CPU timers are not inherited.

  If the parent has a CPU timer armed the corresponding tick dependency
  in the tasks tick_dep_mask is set and copied to the new thread, which
  means the new thread and all decendants will prevent the system to go
  into full NOHZ operation.

  Clear the tick dependency mask in copy_process() to fix this"

* tag 'timers-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Clear TICK_DEP_BIT_POSIX_TIMER on clone
2024-11-03 08:22:21 -10:00
Lorenzo Stoakes
985da552a9 fork: only invoke khugepaged, ksm hooks if no error
There is no reason to invoke these hooks early against an mm that is in an
incomplete state.

The change in commit d240629148 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.

Their placement early in dup_mmap() only appears to have been meaningful
for early error checking, and since functionally it'd require a very small
allocation to fail (in practice 'too small to fail') that'd only occur in
the most dire circumstances, meaning the fork would fail or be OOM'd in
any case.

Since both khugepaged and KSM tracking are there to provide optimisations
to memory performance rather than critical functionality, it doesn't
really matter all that much if, under such dire memory pressure, we fail
to register an mm with these.

As a result, we follow the example of commit d2081b2bf8 ("mm:
khugepaged: make khugepaged_enter() void function") and make ksm_fork() a
void function also.

We only expose the mm to these functions once we are done with them and
only if no error occurred in the fork operation.

Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.1729014377.git.lorenzo.stoakes@oracle.com
Fixes: d240629148 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28 21:40:39 -07:00