mirror of
				git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
				synced 2025-09-04 20:19:47 +08:00 
			
		
		
		
	
			
				
					
						
					
					80c80164a5
				
			
			
		
	
	
		
			1224 Commits
		
	
	
	| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|  Peter Zijlstra | 3f4c8211d9 | x86/mm: Use mm_alloc() in poking_init() Instead of duplicating init_mm, allocate a fresh mm. The advantage is that mm_alloc() has much simpler dependencies. Additionally it makes more conceptual sense, init_mm has no (and must not have) user state to duplicate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.816175235@infradead.org | ||
|  Peter Zijlstra | af80602799 | mm: Move mm_cachep initialization to mm_init() In order to allow using mm_alloc() much earlier, move initializing mm_cachep into mm_init(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.751153381@infradead.org | ||
|  Linus Torvalds | e2ca6ba6ba | MM patches for 6.2-rc1. - More userfaultfs work from Peter Xu. - Several convert-to-folios series from Sidhartha Kumar and Huang Ying. - Some filemap cleanups from Vishal Moola. - David Hildenbrand added the ability to selftest anon memory COW handling. - Some cpuset simplifications from Liu Shixin. - Addition of vmalloc tracing support by Uladzislau Rezki. - Some pagecache folioifications and simplifications from Matthew Wilcox. - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it. - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series shold have been in the non-MM tree, my bad. - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages. - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages. - Peter Xu utilized the PTE marker code for handling swapin errors. - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient. - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand. - zram support for multiple compression streams from Sergey Senozhatsky. - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway. - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations. - Vishal Moola removed the try_to_release_page() wrapper. - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache. - David Hildenbrand did some cleanup and repair work on KSM COW breaking. - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend. - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range(). - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen. - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect. - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages(). - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting. - David Hildenbrand has fixed some of our MM selftests for 32-bit machines. - Many singleton patches, as usual. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml CodAgiA51qwzId3GRytIo/tfWZSezgA= =d19R -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - More userfaultfs work from Peter Xu - Several convert-to-folios series from Sidhartha Kumar and Huang Ying - Some filemap cleanups from Vishal Moola - David Hildenbrand added the ability to selftest anon memory COW handling - Some cpuset simplifications from Liu Shixin - Addition of vmalloc tracing support by Uladzislau Rezki - Some pagecache folioifications and simplifications from Matthew Wilcox - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages - Peter Xu utilized the PTE marker code for handling swapin errors - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand - zram support for multiple compression streams from Sergey Senozhatsky - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations - Vishal Moola removed the try_to_release_page() wrapper - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache - David Hildenbrand did some cleanup and repair work on KSM COW breaking - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range() - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages() - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting - David Hildenbrand has fixed some of our MM selftests for 32-bit machines - Many singleton patches, as usual * tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits) mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio mm: mmu_gather: allow more than one batch of delayed rmaps mm: fix typo in struct pglist_data code comment kmsan: fix memcpy tests mm: add cond_resched() in swapin_walk_pmd_entry() mm: do not show fs mm pc for VM_LOCKONFAULT pages selftests/vm: ksm_functional_tests: fixes for 32bit selftests/vm: cow: fix compile warning on 32bit selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem mm,thp,rmap: fix races between updates of subpages_mapcount mm: memcg: fix swapcached stat accounting mm: add nodes= arg to memory.reclaim mm: disable top-tier fallback to reclaim on proactive reclaim selftests: cgroup: make sure reclaim target memcg is unprotected selftests: cgroup: refactor proactive reclaim code to reclaim_until() mm: memcg: fix stale protection of reclaim target memcg mm/mmap: properly unaccount memory on mas_preallocate() failure omfs: remove ->writepage jfs: remove ->writepage ... | ||
|  Linus Torvalds | 268325bda5 | Random number generator updates for Linux 6.2-rc1. -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
 A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
 bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
 RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
 WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
 waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
 ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
 /ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
 PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
 RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
 CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
 APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
 =QRhK
 -----END PGP SIGNATURE-----
Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
 - Replace prandom_u32_max() and various open-coded variants of it,
   there is now a new family of functions that uses fast rejection
   sampling to choose properly uniformly random numbers within an
   interval:
       get_random_u32_below(ceil) - [0, ceil)
       get_random_u32_above(floor) - (floor, U32_MAX]
       get_random_u32_inclusive(floor, ceil) - [floor, ceil]
   Coccinelle was used to convert all current users of
   prandom_u32_max(), as well as many open-coded patterns, resulting in
   improvements throughout the tree.
   I'll have a "late" 6.1-rc1 pull for you that removes the now unused
   prandom_u32_max() function, just in case any other trees add a new
   use case of it that needs to converted. According to linux-next,
   there may be two trivial cases of prandom_u32_max() reintroductions
   that are fixable with a 's/.../.../'. So I'll have for you a final
   conversion patch doing that alongside the removal patch during the
   second week.
   This is a treewide change that touches many files throughout.
 - More consistent use of get_random_canary().
 - Updates to comments, documentation, tests, headers, and
   simplification in configuration.
 - The arch_get_random*_early() abstraction was only used by arm64 and
   wasn't entirely useful, so this has been replaced by code that works
   in all relevant contexts.
 - The kernel will use and manage random seeds in non-volatile EFI
   variables, refreshing a variable with a fresh seed when the RNG is
   initialized. The RNG GUID namespace is then hidden from efivarfs to
   prevent accidental leakage.
   These changes are split into random.c infrastructure code used in the
   EFI subsystem, in this pull request, and related support inside of
   EFISTUB, in Ard's EFI tree. These are co-dependent for full
   functionality, but the order of merging doesn't matter.
 - Part of the infrastructure added for the EFI support is also used for
   an improvement to the way vsprintf initializes its siphash key,
   replacing an sleep loop wart.
 - The hardware RNG framework now always calls its correct random.c
   input function, add_hwgenerator_randomness(), rather than sometimes
   going through helpers better suited for other cases.
 - The add_latent_entropy() function has long been called from the fork
   handler, but is a no-op when the latent entropy gcc plugin isn't
   used, which is fine for the purposes of latent entropy.
   But it was missing out on the cycle counter that was also being mixed
   in beside the latent entropy variable. So now, if the latent entropy
   gcc plugin isn't enabled, add_latent_entropy() will expand to a call
   to add_device_randomness(NULL, 0), which adds a cycle counter,
   without the absent latent entropy variable.
 - The RNG is now reseeded from a delayed worker, rather than on demand
   when used. Always running from a worker allows it to make use of the
   CPU RNG on platforms like S390x, whose instructions are too slow to
   do so from interrupts. It also has the effect of adding in new inputs
   more frequently with more regularity, amounting to a long term
   transcript of random values. Plus, it helps a bit with the upcoming
   vDSO implementation (which isn't yet ready for 6.2).
 - The jitter entropy algorithm now tries to execute on many different
   CPUs, round-robining, in hopes of hitting even more memory latencies
   and other unpredictable effects. It also will mix in a cycle counter
   when the entropy timer fires, in addition to being mixed in from the
   main loop, to account more explicitly for fluctuations in that timer
   firing. And the state it touches is now kept within the same cache
   line, so that it's assured that the different execution contexts will
   cause latencies.
* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
  random: include <linux/once.h> in the right header
  random: align entropy_timer_state to cache line
  random: mix in cycle counter when jitter timer fires
  random: spread out jitter callback to different CPUs
  random: remove extraneous period and add a missing one in comments
  efi: random: refresh non-volatile random seed when RNG is initialized
  vsprintf: initialize siphash key using notifier
  random: add back async readiness notifier
  random: reseed in delayed work rather than on-demand
  random: always mix cycle counter in add_latent_entropy()
  hw_random: use add_hwgenerator_randomness() for early entropy
  random: modernize documentation comment on get_random_bytes()
  random: adjust comment to account for removed function
  random: remove early archrandom abstraction
  random: use random.trust_{bootloader,cpu} command line option only
  stackprotector: actually use get_random_canary()
  stackprotector: move get_random_canary() into stackprotector.h
  treewide: use get_random_u32_inclusive() when possible
  treewide: use get_random_u32_{above,below}() instead of manual loop
  treewide: use get_random_u32_below() instead of deprecated function
  ... | ||
|  Linus Torvalds | 7fc035058e | execve updates for v6.2-rc1 - Add timens support (when switching mm). This version has survived
   in -next for the entire cycle (Andrei Vagin).
 
 - Various small bug fixes, refactoring, and readability improvements
   (Bernd Edlinger, Rolf Eike Beer, Bo Liu, Li Zetao Liu Shixin).
 
 - Remove FOLL_FORCE for stack setup (Kees Cook).
 
 - Whilespace cleanups (Rolf Eike Beer, Kees Cook).
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmOOjsgWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJqwED/9mjtKL2GwHOYKsfhtc0m4HVGBw
 gxTEKuyo5mRwaRLg2bfuWe1OQfeGWQd9+IZ83Kr2ijzm4R16Gslv9i69Iwdf2tce
 iFf2R+iR7On+zNokHxaNflRH9fMsZLobVFqzLvB73BUF82ybJlTR3WMnQhS6HZQB
 Gse8jRfueOnVgKldRLlgdxIucPVsXYSoBS4B0nvIUuQn3aNzDNuuctMe/5NFK0ud
 +TWMXtKzS3B9pcLTXy3e0bPk/Ptio18CBUEI+iLMAHswtNCoxx1ZCcuvnEcrd5Qr
 h2WGaRvYJ7oSUXeEsqPKuDdhqEJQH2AQoX8FzvD+hyIutQJCJzVYlHvwGCqn/Km6
 0Dalng9Pjb6z2LEie/N42LDXEQmLZO2WtJ4otpORJlsJ7ZkrLjB4u+hDU1JA/Q14
 YPWvth3fMA5vAFKvGCtpEc7YdHmghmXCW+YGXOBm625fPYnwFSXOarHfow1RKNE5
 MOM4l60WwzLIHgmr8AFUaLf8TbutXN+BKvbMRh2ToWzDYXEoywxAedHDyo4LVwEy
 mZEca/3izT1ynBcyZg1t8shf4htgLjcPHqM0B+Hq0iNMIrwtecqAcYL/Oj6XssPx
 OuQYv341KF9fV/hMy84GM2HMr0ygUmrP7b9x+PEvCwzWf/2Glaw6Z4rtCdYC+TjW
 8ZWqPqEY+LRsZsL18Q==
 =ZDYk
 -----END PGP SIGNATURE-----
Merge tag 'execve-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
 "Most are small refactorings and bug fixes, but three things stand out:
  switching timens (which got reverted before) looks solid now,
  FOLL_FORCE has been removed (no failures seen yet across several weeks
  in -next), and some whitespace cleanups (which are long overdue).
   - Add timens support (when switching mm). This version has survived
     in -next for the entire cycle (Andrei Vagin)
   - Various small bug fixes, refactoring, and readability improvements
     (Bernd Edlinger, Rolf Eike Beer, Bo Liu, Li Zetao Liu Shixin)
   - Remove FOLL_FORCE for stack setup (Kees Cook)
   - Whitespace cleanups (Rolf Eike Beer, Kees Cook)"
* tag 'execve-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt_misc: fix shift-out-of-bounds in check_special_flags
  binfmt: Fix error return code in load_elf_fdpic_binary()
  exec: Remove FOLL_FORCE for stack setup
  binfmt_elf: replace IS_ERR() with IS_ERR_VALUE()
  binfmt_elf: simplify error handling in load_elf_phdrs()
  binfmt_elf: fix documented return value for load_elf_phdrs()
  exec: simplify initial stack size expansion
  binfmt: Fix whitespace issues
  exec: Add comments on check_unsafe_exec() fs counting
  ELF uapi: add spaces before '{'
  selftests/timens: add a test for vfork+exit
  fs/exec: switch timens when a task gets a new mm | ||
|  Kuniyuki Iwashima | a1140cb215 | seccomp: Move copy_seccomp() to no failure path. Our syzbot instance reported memory leaks in do_seccomp() [0], similar to the report [1]. It shows that we miss freeing struct seccomp_filter and some objects included in it. We can reproduce the issue with the program below [2] which calls one seccomp() and two clone() syscalls. The first clone()d child exits earlier than its parent and sends a signal to kill it during the second clone(), more precisely before the fatal_signal_pending() test in copy_process(). When the parent receives the signal, it has to destroy the embryonic process and return -EINTR to user space. In the failure path, we have to call seccomp_filter_release() to decrement the filter's refcount. Initially, we called it in free_task() called from the failure path, but the commit | ||
|  Shakeel Butt | f689054aac | percpu_counter: add percpu_counter_sum_all interface The percpu_counter is used for scenarios where performance is more important than the accuracy. For percpu_counter users, who want more accurate information in their slowpath, percpu_counter_sum is provided which traverses all the online CPUs to accumulate the data. The reason it only needs to traverse online CPUs is because percpu_counter does implement CPU offline callback which syncs the local data of the offlined CPU. However there is a small race window between the online CPUs traversal of percpu_counter_sum and the CPU offline callback. The offline callback has to traverse all the percpu_counters on the system to flush the CPU local data which can be a lot. During that time, the CPU which is going offline has already been published as offline to all the readers. So, as the offline callback is running, percpu_counter_sum can be called for one counter which has some state on the CPU going offline. Since percpu_counter_sum only traverses online CPUs, it will skip that specific CPU and the offline callback might not have flushed the state for that specific percpu_counter on that offlined CPU. Normally this is not an issue because percpu_counter users can deal with some inaccuracy for small time window. However a new user i.e. mm_struct on the cleanup path wants to check the exact state of the percpu_counter through check_mm(). For such users, this patch introduces percpu_counter_sum_all() which traverses all possible CPUs and it is used in fork.c:check_mm() to avoid the potential race. This issue is exposed by the later patch "mm: convert mm's rss stats into percpu_counter". Link: https://lkml.kernel.org/r/20221109012011.881058-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Shakeel Butt | f1a7941243 | mm: convert mm's rss stats into percpu_counter Currently mm_struct maintains rss_stats which are updated on page fault and the unmapping codepaths. For page fault codepath the updates are cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. The reason for caching is performance for multithreaded applications otherwise the rss_stats updates may become hotspot for such applications. However this optimization comes with the cost of error margin in the rss stats. The rss_stats for applications with large number of threads can be very skewed. At worst the error margin is (nr_threads * 64) and we have a lot of applications with 100s of threads, so the error margin can be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. Recently we started seeing the unbounded errors for rss_stats for specific applications which use TCP rx0cp. It seems like vm_insert_pages() codepath does not sync rss_stats at all. This patch converts the rss_stats into percpu_counter to convert the error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). However this conversion enable us to get the accurate stats for situations where accuracy is more important than the cpu cost. This patch does not make such tradeoffs - we can just use percpu_counter_add_local() for the updates and percpu_counter_sum() (or percpu_counter_sync() + percpu_counter_read) for the readers. At the moment the readers are either procfs interface, oom_killer and memory reclaim which I think are not performance critical and should be ok with slow read. However I think we can make that change in a separate patch. Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Jason A. Donenfeld | b3883a9a1f | stackprotector: move get_random_canary() into stackprotector.h This has nothing to do with random.c and everything to do with stack protectors. Yes, it uses randomness. But many things use randomness. random.h and random.c are concerned with the generation of randomness, not with each and every use. So move this function into the more specific stackprotector.h file where it belongs. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> | ||
|  Andrei Vagin | 2b5f9dad32 | fs/exec: switch timens when a task gets a new mm Changing a time namespace requires remapping a vvar page, so we don't want
to allow doing that if any other tasks can use the same mm.
Currently, we install a time namespace when a task is created with a new
vm. exec() is another case when a task gets a new mm and so it can switch
a time namespace safely, but it isn't handled now.
One more issue of the current interface is that clone() with CLONE_VM isn't
allowed if the current task has unshared a time namespace
(timens_for_children doesn't match the current timens).
Both these issues make some inconvenience for users. For example, Alexey
and Florian reported that posix_spawn() uses vfork+exec and this pattern
doesn't work with time namespaces due to the both described issues.
LXC needed to workaround the exec() issue by calling setns.
In the commit  | ||
|  Linus Torvalds | 676cb49573 | - hfs and hfsplus kmap API modernization from Fabio Francesco - Valentin Schneider makes crash-kexec work properly when invoked from an NMI-time panic. - ntfs bugfixes from Hawkins Jiawei - Jiebin Sun improves IPC msg scalability by replacing atomic_t's with percpu counters. - nilfs2 cleanups from Minghao Chi - lots of other single patches all over the tree! -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0Yf0gAKCRDdBJ7gKXxA joapAQDT1d1zu7T8yf9cQXkYnZVuBKCjxKE/IsYvqaq1a42MjQD/SeWZg0wV05B8 DhJPj9nkEp6R3Rj3Mssip+3vNuceAQM= =lUQY -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - hfs and hfsplus kmap API modernization (Fabio Francesco) - make crash-kexec work properly when invoked from an NMI-time panic (Valentin Schneider) - ntfs bugfixes (Hawkins Jiawei) - improve IPC msg scalability by replacing atomic_t's with percpu counters (Jiebin Sun) - nilfs2 cleanups (Minghao Chi) - lots of other single patches all over the tree! * tag 'mm-nonmm-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) include/linux/entry-common.h: remove has_signal comment of arch_do_signal_or_restart() prototype proc: test how it holds up with mapping'less process mailmap: update Frank Rowand email address ia64: mca: use strscpy() is more robust and safer init/Kconfig: fix unmet direct dependencies ia64: update config files nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure fork: remove duplicate included header files init/main.c: remove unnecessary (void*) conversions proc: mark more files as permanent nilfs2: remove the unneeded result variable nilfs2: delete unnecessary checks before brelse() checkpatch: warn for non-standard fixes tag style usr/gen_init_cpio.c: remove unnecessary -1 values from int file ipc/msg: mitigate the lock contention with percpu counter percpu: add percpu_counter_add_local and percpu_counter_sub_local fs/ocfs2: fix repeated words in comments relay: use kvcalloc to alloc page array in relay_alloc_page_array proc: make config PROC_CHILDREN depend on PROC_FS fs: uninline inode_maybe_inc_iversion() ... | ||
|  Xu Panda | ef79361b26 | fork: remove duplicate included header files linux/sched/mm.h is included more than once. Link: https://lkml.kernel.org/r/20220912071556.16811-1-xu.panda@zte.com.cn Signed-off-by: Xu Panda <xu.panda@zte.com.cn> Reported-by: Zeal Robot <zealci@zte.com.cn> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Linus Torvalds | 27bc50fc90 | - Yu Zhao's Multi-Gen LRU patches are here.  They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam R. Howlett. An overlapping range-based tree for vmas. It it apparently slight more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com). This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU= =xfWx -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ... | ||
|  Linus Torvalds | 30c999937f | Scheduler changes for v6.1: - Debuggability:
 
      - Change most occurances of BUG_ON() to WARN_ON_ONCE()
 
      - Reorganize & fix TASK_ state comparisons, turn it into a bitmap
 
      - Update/fix misc scheduler debugging facilities
 
  - Load-balancing & regular scheduling:
 
      - Improve the behavior of the scheduler in presence of lot of
        SCHED_IDLE tasks - in particular they should not impact other
        scheduling classes.
 
      - Optimize task load tracking, cleanups & fixes
 
      - Clean up & simplify misc load-balancing code
 
  - Freezer:
 
      - Rewrite the core freezer to behave better wrt thawing and be simpler
        in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting
        all the fallout.
 
  - Deadline scheduler:
 
      - Fix the DL capacity-aware code
 
      - Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period()
 
      - Relax/optimize locking in task_non_contending()
 
  - Cleanups:
 
      - Factor out the update_current_exec_runtime() helper
 
      - Various cleanups, simplifications
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/01cRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1geZA/+PB4KC1T9aVxzaTHI36R03YgJYZmIdtxw
 wTf02MixePmz+gQCbepJbempGOh5ST28aOcI0xhdYOql5B63MaUBBMlB0HvGUyDG
 IU3zETqLMRtAbnSTdQFv8m++ECUtZYp8/x1FCel4WO7ya4ETkRu1NRfCoUepEhpZ
 aVAlae9LH3NBaF9t7s0PT2lTjf3pIzMFRkddJ0ywJhbFR3VnWat05fAK+J6fGY8+
 LS54coefNlJD4oDh5TY8uniL1j5SmWmmwbk9Cdj7bLU5P3dFSS0/+5FJNHJPVGDE
 srGT7wstRUcDrN0CnZo48VIUBiApJCCDqTfJYi9wNYd0NAHvwY6MIJJgEIY8mKsI
 L/qH26H81Wt+ezSZ/5JIlGlZ/LIeNaa6OO/fbWEYABBQogvvx3nxsRNUYKSQzumH
 CnSBasBjLnjWyLlK4qARM9cI7NFSEK6NUigrEx/7h8JFu/8T4DlSy6LsF1HUyKgq
 4+FJLAqG6cL0tcwB/fHYd0oRESN8dStnQhGxSojgufwLc7dlFULvCYF5JM/dX+/V
 IKwbOfIOeOn6ViMtSOXAEGdII+IQ2/ZFPwr+8Z5JC7NzvTVL6xlu/3JXkLZR3L7o
 yaXTSaz06h1vil7Z+GRf7RHc+wUeGkEpXh5vnarGZKXivhFdWsBdROIJANK+xR0i
 TeSLCxQxXlU=
 =KjMD
 -----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "Debuggability:
   - Change most occurances of BUG_ON() to WARN_ON_ONCE()
   - Reorganize & fix TASK_ state comparisons, turn it into a bitmap
   - Update/fix misc scheduler debugging facilities
  Load-balancing & regular scheduling:
   - Improve the behavior of the scheduler in presence of lot of
     SCHED_IDLE tasks - in particular they should not impact other
     scheduling classes.
   - Optimize task load tracking, cleanups & fixes
   - Clean up & simplify misc load-balancing code
  Freezer:
   - Rewrite the core freezer to behave better wrt thawing and be
     simpler in general, by replacing PF_FROZEN with TASK_FROZEN &
     fixing/adjusting all the fallout.
  Deadline scheduler:
   - Fix the DL capacity-aware code
   - Factor out dl_task_is_earliest_deadline() &
     replenish_dl_new_period()
   - Relax/optimize locking in task_non_contending()
  Cleanups:
   - Factor out the update_current_exec_runtime() helper
   - Various cleanups, simplifications"
* tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  sched: Fix more TASK_state comparisons
  sched: Fix TASK_state comparisons
  sched/fair: Move call to list_last_entry() in detach_tasks
  sched/fair: Cleanup loop_max and loop_break
  sched/fair: Make sure to try to detach at least one movable task
  sched: Show PF_flag holes
  freezer,sched: Rewrite core freezer logic
  sched: Widen TAKS_state literals
  sched/wait: Add wait_event_state()
  sched/completion: Add wait_for_completion_state()
  sched: Add TASK_ANY for wait_task_inactive()
  sched: Change wait_task_inactive()s match_state
  freezer,umh: Clean up freezer/initrd interaction
  freezer: Have {,un}lock_system_sleep() save/restore flags
  sched: Rename task_running() to task_on_cpu()
  sched/fair: Cleanup for SIS_PROP
  sched/fair: Default to false in test_idle_cores()
  sched/fair: Remove useless check in select_idle_core()
  sched/fair: Avoid double search on same cpu
  sched/fair: Remove redundant check in select_idle_smt()
  ... | ||
|  Linus Torvalds | 493ffd6605 | ucounts: Split rlimit and ucount values and max values After the ucount rlimit code was merged a bunch of small but
 siginificant bugs were found and fixed.  At the time it was realized
 that part of the problem was that while the ucount rlimits were very
 similar to the oridinary ucounts (in being nested counts with limits)
 the semantics were slightly different and the code would be less error
 prone if there was less sharing.  This is the long awaited cleanup
 that should hopefully keep things more comprehensible and less error
 prone for whoever needs to touch that code next.
 
 Alexey Gladkov (1):
       ucounts: Split rlimit and ucount values and max values
 
  fs/exec.c                      |  2 +-
  fs/proc/array.c                |  2 +-
  include/linux/user_namespace.h | 35 ++++++++++++++++++++++-------------
  kernel/fork.c                  | 12 ++++++------
  kernel/sys.c                   |  2 +-
  kernel/ucount.c                | 34 +++++++++++++++-------------------
  kernel/user_namespace.c        | 10 +++++-----
  7 files changed, 51 insertions(+), 46 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmM7U4cACgkQC/v6Eiaj
 j0AbRA//RVrGJ9n5iYyHM7WgeoTlFbaupEyLTq5dEpkOMD9CEB4OpMymGA/VXbeX
 cjgF5dqykfrdpYBwJdosl1fgq15ZFe9ChKhPGQkI5CGlwyRYTl2kq+FrZLC790s8
 c4TN3fKO1DyQPn5+UNzlBgLP8ofiUqeScZJDGa+LeMlUIv1OFS3m05jHuG/uzl6b
 bbbdcn61tFKOFCapbE72hWusEQssPOAN+dSY1/lwKO05WOKR0N2CR0EHyZhW2Owd
 GIQ27Zh5ed/9xRNlxa8VIa+JDfuATbPeoWcvRmiWSEoAxKtPBUf8lwcltlHBUcKK
 72MH+KU9AaIZ1prq9ng4xEaM+vXiSSNspYB8siwph7au1gWx1Yu2yYVavEPeFB9o
 C0JaD7kTh6Mhk6xdPhnmFUHFOLLGC5LdnBcIwwoMb1jlwP4QJRVucbjpqaOptoiE
 SeWhRRKUBwpcQdztQZCR+X0h1paHRJJXplHFmeEGcMviGWntgKUaxXJQ4BJrnRTO
 pagn7h181KVF7u9Toh0IWzrd322mXNqmcgwhzE/S9pa5EJMQHt7qYkDzQCgEwoap
 JmIld9tKkv/0fYMOHordjMb1OY37feI7FyDAuZuLP1ZWYgKhOq0LrD5x8PzKmoyM
 6oKAOfXZUVT/Pnw21nEzAtHsazV3mLRpW+gLLiLSiWoSaYT4x14=
 =kjVh
 -----END PGP SIGNATURE-----
Merge tag 'ucount-rlimits-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucounts update from Eric Biederman:
 "Split rlimit and ucount values and max values
  After the ucount rlimit code was merged a bunch of small but
  siginificant bugs were found and fixed. At the time it was realized
  that part of the problem was that while the ucount rlimits were very
  similar to the oridinary ucounts (in being nested counts with limits)
  the semantics were slightly different and the code would be less error
  prone if there was less sharing.
  This is the long awaited cleanup that should hopefully keep things
  more comprehensible and less error prone for whoever needs to touch
  that code next"
* tag 'ucount-rlimits-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ucounts: Split rlimit and ucount values and max values | ||
|  Linus Torvalds | e572410e47 | ptrace: Stop supporting SIGKILL for PTRACE_EVENT_EXIT Recently I had a conversation where it was pointed out to me that
 SIGKILL sent to a tracee stropped in PTRACE_EVENT_EXIT is quite
 difficult for a tracer to handle.
 
 Keeping SIGKILL working after the process has been killed is pain
 from an implementation point of view.
 
 So since the debuggers don't want this behavior let's see if we can
 remove this wart for the userspace API
 
 If a regression is detected it should only need to be the last change
 that is the reverted.  The other two are just general cleanups that
 make the last patch simpler.
 
 Eric W. Biederman (3):
       signal: Ensure SIGNAL_GROUP_EXIT gets set in do_group_exit
       signal: Guarantee that SIGNAL_GROUP_EXIT is set on process exit
       signal: Drop signals received after a fatal signal has been processed
 
  fs/coredump.c                |  2 +-
  include/linux/sched/signal.h |  1 +
  kernel/exit.c                | 20 +++++++++++++++++++-
  kernel/fork.c                |  2 ++
  kernel/signal.c              |  3 ++-
  5 files changed, 25 insertions(+), 3 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmM7U40ACgkQC/v6Eiaj
 j0B7eRAAst6EW4nkxiDBN/PRo43Kz7tkYCZGLAq73vZAaKkyTFSwghT85cZwTsuc
 px/iDPWZpSGdIeHhdt1giJeuFG0FuoMR9prQVx3z/fM6KLXEJb86OtW1c1uW00Bh
 TkhVPQiF/9bc+Eb/nRKF61NA4sP0OXwO1+t/zBL4clekO9vFLP1DpRBE9OrNlHq2
 NlDqoPqq6SsKYG8f+J2LJKKzRWLICvHCtz4uUt6O11Wt/PH2TZYdYnXf/2vvZyzS
 SOs8kQjw4QPSptcNH/LIZz0WNHagn1uU/2/T106LOgPgcr515T4MhqOK+Wp95BjA
 0RrhsfdMD+7HArqLmt9VKgKO9Gs+T/M6jQZgpUyzlw3qPorKUGu4s9UnUgS7l4uz
 oNV9no1Ei2fQ+YR6RBR74579a45FWkqRBsaia59KFCzZxRsQz8VB1cqcIgl9dFUA
 f81qt9FiX8duVYZIcoT79BvV1bQ3LGwyygrH+X3sDTQSdN/aeZ24JdGP2MNtYO/c
 jWN/mMVHc1xOsYmACVGETWhZf0Y7lGGBYSIJ92jT2HxIuEiqmFE4kngRjKcYgL/G
 1/o9z8VntMq5t+ZcQY5WfK/WpSeSPXa6TsP80PEm/hvdMwZLEO4IzCR+gPJB6oJS
 KPQ3EZfRCB2Jb1qBmcpbleA/RBA0iJUZtBvtIPI7QmMTR+VRdhc=
 =gX4y
 -----END PGP SIGNATURE-----
Merge tag 'signal-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace update from Eric Biederman:
 "ptrace: Stop supporting SIGKILL for PTRACE_EVENT_EXIT
  Recently I had a conversation where it was pointed out to me that
  SIGKILL sent to a tracee stropped in PTRACE_EVENT_EXIT is quite
  difficult for a tracer to handle.
  Keeping SIGKILL working after the process has been killed is pain from
  an implementation point of view.
  So since the debuggers don't want this behavior let's see if we can
  remove this wart for the userspace API
  If a regression is detected it should only need to be the last change
  that is the reverted. The other two are just general cleanups that
  make the last patch simpler"
* tag 'signal-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Drop signals received after a fatal signal has been processed
  signal: Guarantee that SIGNAL_GROUP_EXIT is set on process exit
  signal: Ensure SIGNAL_GROUP_EXIT gets set in do_group_exit | ||
|  Alexander Potapenko | 50b5e49ca6 | kmsan: handle task creation and exiting Tell KMSAN that a new task is created, so the tool creates a backing metadata structure for that task. Link: https://lkml.kernel.org/r/20220915150417.722975-17-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | 8d9bfb2608 | hugetlb: add vma based lock for pmd sharing Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for synchronization use by vmas that could be involved in pmd sharing. This data structure contains a rw semaphore that is the primary tool used for synchronization. This new structure is ref counted, so that it can exist when NOT attached to a vma. This is only helpful in resolving lock ordering issues where code may need to obtain the vma_lock while there are no guarantees the vma may go away. By obtaining a ref on the structure, it can be guaranteed that at least the rw semaphore will not go away. Only add infrastructure for the new lock here. Actual use will be added in subsequent patches. [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release] Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Liam R. Howlett | 763ecb0350 | mm: remove the vma linked list Replace any vm_next use with vma_find(). Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the maple tree. Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap(). At the same time, alter the loop to be more compact. Now that free_pgtables() and unmap_vmas() take a maple tree as an argument, rearrange do_mas_align_munmap() to use the new tree to hold the vmas to remove. Remove __vma_link_list() and __vma_unlink_list() as they are exclusively used to update the linked list. Drop linked list update from __insert_vm_struct(). Rework validation of tree as it was depending on the linked list. [yang.lee@linux.alibaba.com: fix one kernel-doc comment] Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1949 Link: https://lkml.kernel.org/r/20220824021918.94116-1-yang.lee@linux.alibaba.comLink: https://lkml.kernel.org/r/20220906194824.2110408-69-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Matthew Wilcox (Oracle) | fa5e587679 | fork: use VMA iterator The VMA iterator is faster than the linked list and removing the linked list will shrink the vm_area_struct. Link: https://lkml.kernel.org/r/20220906194824.2110408-50-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Liam R. Howlett | 7964cf8caa | mm: remove vmacache By using the maple tree and the maple tree state, the vmacache is no longer beneficial and is complicating the VMA code. Remove the vmacache to reduce the work in keeping it up to date and code complexity. Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Liam R. Howlett | 524e00b36e | mm: remove rb tree. Remove the RB tree and start using the maple tree for vm_area_struct tracking. Drop validate_mm() calls in expand_upwards() and expand_downwards() as the lock is not held. Link: https://lkml.kernel.org/r/20220906194824.2110408-18-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Liam R. Howlett | c9dbe82cb9 | kernel/fork: use maple tree for dup_mmap() during forking The maple tree was already tracking VMAs in this function by an earlier commit, but the rbtree iterator was being used to iterate the list. Change the iterator to use a maple tree native iterator and switch to the maple tree advanced API to avoid multiple walks of the tree during insert operations. Unexport the now-unused vma_store() function. For performance reasons we bulk allocate the maple tree nodes. The node calculations are done internally to the tree and use the VMA count and assume the worst-case node requirements. The VM_DONT_COPY flag does not allow for the most efficient copy method of the tree and so a bulk loading algorithm is used. Link: https://lkml.kernel.org/r/20220906194824.2110408-15-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Liam R. Howlett | d4af56c5c7 | mm: start tracking VMAs with maple tree Start tracking the VMAs with the new maple tree structure in parallel with the rb_tree. Add debug and trace events for maple tree operations and duplicate the rb_tree that is created on forks into the maple tree. The maple tree is added to the mm_struct including the mm_init struct, added support in required mm/mmap functions, added tracking in kernel/fork for process forking, and used to find the unmapped_area and checked against what the rbtree finds. This also moves the mmap_lock() in exit_mmap() since the oom reaper call does walk the VMAs. Otherwise lockdep will be unhappy if oom happens. When splitting a vma fails due to allocations of the maple tree nodes, the error path in __split_vma() calls new->vm_ops->close(new). The page accounting for hugetlb is actually in the close() operation, so it accounts for the removal of 1/2 of the VMA which was not adjusted. This results in a negative exit value. To avoid the negative charge, set vm_start = vm_end and vm_pgoff = 0. There is also a potential accounting issue in special mappings from insert_vm_struct() failing to allocate, so reverse the charge there in the failure scenario. Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Yu Zhao | bd74fdaea1 | mm: multi-gen LRU: support page table walks To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages.  A kill switch will be
added in the next patch to disable this behavior.  When disabled, the
aging relies on the rmap only.
NB: this behavior has nothing similar with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated.  Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.
When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently.  Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim.  Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
  Single workload:
    fio (buffered I/O): no change
  Single workload:
    memcached (anon): +[8, 10]%
                Ops/sec      KB/sec
      patch1-7: 1147696.57   44640.29
      patch1-8: 1245274.91   48435.66
  Configurations:
    no change
Client benchmark results:
  kswapd profiles:
    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset
    patch1-8
      49.44%  lzo1x_1_do_compress (real work)
       6.19%  page_vma_mapped_walk (overhead)
       5.97%  _raw_spin_unlock_irq
       3.13%  get_pfn_folio
       2.85%  ptep_clear_flush
       2.42%  __zram_bvec_write
       2.08%  do_raw_spin_lock
       1.92%  memmove
       1.44%  alloc_zspage
       1.36%  memset
  Configurations:
    no change
Thanks to the following developers for their efforts [3].
  kernel test robot <lkp@intel.com>
[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Linus Torvalds | f489921dba | execve reverts for v6.0-rc7 - Remove the recent "unshare time namespace on vfork+exec" feature (Andrei Vagin)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmMoxpIWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJpd/D/9V7iLUZoquMvXFonv//sRH21P+
 u7vH03q0X4lSov73jdjizq8znZl9RVO14IYi+6lQE8VHyOjzjBoTALRPnirNCyGa
 Ia8P+LPaOHDTDQmGqt+9xmPKp3z0qwrpWWyTrFHLo7GRzWtI0QjQsSlgUTIz7jCw
 dSwLRWN6n7d3hzNzFWt9VUOOlzpip8NTcnAbC9YA5dPFLO85+wZ4ZpMYYfFJMcQj
 N/Zm63lrqAU0wy7EhonkKJQDjgRP/zYUs6VJMejHqYl951SrZJ+DgXEGaAwR14Sz
 IZAUhSM5Fl8alhkrcmlkiy9A5P014iVRR6AaSyeT2616fac97wY1EWHxvBMqzNsB
 AJJqjPHoN+mc8cqt9lMyIhbmS8WkTuyTHziEcFyyTVsNYGYN6x9hVVZalqPrl8o3
 Y3zC6MfRK33JNVB2GZVUzsf5EZC3mjz9VJKKmLwYmG4X7/JOvIVCiW123b060T7z
 b49PzI+0rTG8SHTk1I/T8NpWuvLRTCglzZK06q971uyT80xPoGD/HmSpmm+86dHs
 k3WV2qBoz31Eaoewa3NJqn6pBxQLy9WAZP6rJb3aQSFwDRCuvKO4CUpHAXILt5U+
 SoarR5445zVzY3NYHaf/3BRsEnCQS06U67ma0lAmMWk4J3ehFOY0DrRqtLJ02iwd
 sKJD/KnKC+IEcLjrAA==
 =yFGx
 -----END PGP SIGNATURE-----
Merge tag 'execve-v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve reverts from Kees Cook:
 "The recent work to support time namespace unsharing turns out to have
  some undesirable corner cases, so rather than allowing the API to stay
  exposed for another release, it'd be best to remove it ASAP, with the
  replacement getting another cycle of testing. Nothing is known to use
  this yet, so no userspace breakage is expected.
  For more details, see:
    https://lore.kernel.org/lkml/ed418e43ad28b8688cfea2b7c90fce1c@ispras.ru
  Summary:
   - Remove the recent 'unshare time namespace on vfork+exec' feature
     (Andrei Vagin)"
* tag 'execve-v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  Revert "fs/exec: allow to unshare a time namespace on vfork+exec"
  Revert "selftests/timens: add a test for vfork+exit" | ||
|  Andrei Vagin | 33a2d6bc34 | Revert "fs/exec: allow to unshare a time namespace on vfork+exec" This reverts commit  | ||
|  Peter Zijlstra | f5d39b0208 | freezer,sched: Rewrite core freezer logic Rewrite the core freezer to behave better wrt thawing and be simpler
in general.
By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
ensured frozen tasks stay frozen until thawed and don't randomly wake
up early, as is currently possible.
As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
two PF_flags (yay!).
Specifically; the current scheme works a little like:
	freezer_do_not_count();
	schedule();
	freezer_count();
And either the task is blocked, or it lands in try_to_freezer()
through freezer_count(). Now, when it is blocked, the freezer
considers it frozen and continues.
However, on thawing, once pm_freezing is cleared, freezer_count()
stops working, and any random/spurious wakeup will let a task run
before its time.
That is, thawing tries to thaw things in explicit order; kernel
threads and workqueues before doing bringing SMP back before userspace
etc.. However due to the above mentioned races it is entirely possible
for userspace tasks to thaw (by accident) before SMP is back.
This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
where the userspace task requires a special CPU to run.
As said; replace this with a special task state TASK_FROZEN and add
the following state transitions:
	TASK_FREEZABLE	-> TASK_FROZEN
	__TASK_STOPPED	-> TASK_FROZEN
	__TASK_TRACED	-> TASK_FROZEN
The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
(IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
is already required to deal with spurious wakeups and the freezer
causes one such when thawing the task (since the original state is
lost).
The special __TASK_{STOPPED,TRACED} states *can* be restored since
their canonical state is in ->jobctl.
With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
free of undue (early / spurious) wakeups.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org | ||
|  Yishai Hadas | 85eaeb5058 | IB/core: Fix a nested dead lock as part of ODP flow Fix a nested dead lock as part of ODP flow by using mmput_async().
From the below call trace [1] can see that calling mmput() once we have
the umem_odp->umem_mutex locked as required by
ib_umem_odp_map_dma_and_lock() might trigger in the same task the
exit_mmap()->__mmu_notifier_release()->mlx5_ib_invalidate_range() which
may dead lock when trying to lock the same mutex.
Moving to use mmput_async() will solve the problem as the above
exit_mmap() flow will be called in other task and will be executed once
the lock will be available.
[1]
[64843.077665] task:kworker/u133:2  state:D stack:    0 pid:80906 ppid:
2 flags:0x00004000
[64843.077672] Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib]
[64843.077719] Call Trace:
[64843.077722]  <TASK>
[64843.077724]  __schedule+0x23d/0x590
[64843.077729]  schedule+0x4e/0xb0
[64843.077735]  schedule_preempt_disabled+0xe/0x10
[64843.077740]  __mutex_lock.constprop.0+0x263/0x490
[64843.077747]  __mutex_lock_slowpath+0x13/0x20
[64843.077752]  mutex_lock+0x34/0x40
[64843.077758]  mlx5_ib_invalidate_range+0x48/0x270 [mlx5_ib]
[64843.077808]  __mmu_notifier_release+0x1a4/0x200
[64843.077816]  exit_mmap+0x1bc/0x200
[64843.077822]  ? walk_page_range+0x9c/0x120
[64843.077828]  ? __cond_resched+0x1a/0x50
[64843.077833]  ? mutex_lock+0x13/0x40
[64843.077839]  ? uprobe_clear_state+0xac/0x120
[64843.077860]  mmput+0x5f/0x140
[64843.077867]  ib_umem_odp_map_dma_and_lock+0x21b/0x580 [ib_core]
[64843.077931]  pagefault_real_mr+0x9a/0x140 [mlx5_ib]
[64843.077962]  pagefault_mr+0xb4/0x550 [mlx5_ib]
[64843.077992]  pagefault_single_data_segment.constprop.0+0x2ac/0x560
[mlx5_ib]
[64843.078022]  mlx5_ib_eqe_pf_action+0x528/0x780 [mlx5_ib]
[64843.078051]  process_one_work+0x22b/0x3d0
[64843.078059]  worker_thread+0x53/0x410
[64843.078065]  ? process_one_work+0x3d0/0x3d0
[64843.078073]  kthread+0x12a/0x150
[64843.078079]  ? set_kthread_struct+0x50/0x50
[64843.078085]  ret_from_fork+0x22/0x30
[64843.078093]  </TASK>
Fixes:  | ||
|  Linus Torvalds | 965a9d75e3 | Tracing updates for 5.20 / 6.0 - Runtime verification infrastructure
   This is the biggest change for this pull request. It introduces the
   runtime verification that is necessary for running Linux on safety
   critical systems. It allows for deterministic automata models to be
   inserted into the kernel that will attach to tracepoints, where the
   information on these tracepoints will move the model from state to state.
   If a state is encountered that does not belong to the model, it will then
   activate a given reactor, that could just inform the user or even panic
   the kernel (for which safety critical systems will detect and can recover
   from).
 
 - Two monitor models are also added: Wakeup In Preemptive (WIP - not to be
   confused with "work in progress"), and Wakeup While Not Running (WWNR).
 
 - Added __vstring() helper to the TRACE_EVENT() macro to replace several
   vsnprintf() usages that were all doing it wrong.
 
 - eprobes now can have their event autogenerated when the event name is left
   off.
 
 - The rest is various cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYu0yzRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qj4HAP4tQtV55rjj4DQ5XIXmtI3/64PmyRSJ
 +y4DEXi1UvEUCQD/QAuQfWoT/7gh35ltkfeS4t3ockzy14rrkP5drZigiQA=
 =kEtM
 -----END PGP SIGNATURE-----
Merge tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
 - Runtime verification infrastructure
   This is the biggest change here. It introduces the runtime
   verification that is necessary for running Linux on safety critical
   systems.
   It allows for deterministic automata models to be inserted into the
   kernel that will attach to tracepoints, where the information on
   these tracepoints will move the model from state to state.
   If a state is encountered that does not belong to the model, it will
   then activate a given reactor, that could just inform the user or
   even panic the kernel (for which safety critical systems will detect
   and can recover from).
 - Two monitor models are also added: Wakeup In Preemptive (WIP - not to
   be confused with "work in progress"), and Wakeup While Not Running
   (WWNR).
 - Added __vstring() helper to the TRACE_EVENT() macro to replace
   several vsnprintf() usages that were all doing it wrong.
 - eprobes now can have their event autogenerated when the event name is
   left off.
 - The rest is various cleanups and fixes.
* tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits)
  rv: Unlock on error path in rv_unregister_reactor()
  tracing: Use alignof__(struct {type b;}) instead of offsetof()
  tracing/eprobe: Show syntax error logs in error_log file
  scripts/tracing: Fix typo 'the the' in comment
  tracepoints: It is CONFIG_TRACEPOINTS not CONFIG_TRACEPOINT
  tracing: Use free_trace_buffer() in allocate_trace_buffers()
  tracing: Use a struct alignof to determine trace event field alignment
  rv/reactor: Add the panic reactor
  rv/reactor: Add the printk reactor
  rv/monitor: Add the wwnr monitor
  rv/monitor: Add the wip monitor
  rv/monitor: Add the wip monitor skeleton created by dot2k
  Documentation/rv: Add deterministic automata instrumentation documentation
  Documentation/rv: Add deterministic automata monitor synthesis documentation
  tools/rv: Add dot2k
  Documentation/rv: Add deterministic automaton documentation
  tools/rv: Add dot2c
  Documentation/rv: Add a basic documentation
  rv/include: Add instrumentation helper functions
  rv/include: Add deterministic automata monitor definition via C macros
  ... | ||
|  Linus Torvalds | 7d9d077c78 | RCU pull request for v5.20 (or whatever) This pull request contains the following branches: doc.2022.06.21a: Documentation updates. fixes.2022.07.19a: Miscellaneous fixes. nocb.2022.07.19a: Callback-offload updates, perhaps most notably a new RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be offloaded at boot time, regardless of kernel boot parameters. This is useful to battery-powered systems such as ChromeOS and Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot parameter prevents offloaded callbacks from interfering with real-time workloads and with energy-efficiency mechanisms. poll.2022.07.21a: Polled grace-period updates, perhaps most notably making these APIs account for both normal and expedited grace periods. rcu-tasks.2022.06.21a: Tasks RCU updates, perhaps most notably reducing the CPU overhead of RCU tasks trace grace periods by more than a factor of two on a system with 15,000 tasks. The reduction is expected to increase with the number of tasks, so it seems reasonable to hypothesize that a system with 150,000 tasks might see a 20-fold reduction in CPU overhead. torture.2022.06.21a: Torture-test updates. ctxt.2022.07.05a: Updates that merge RCU's dyntick-idle tracking into context tracking, thus reducing the overhead of transitioning to kernel mode from either idle or nohz_full userspace execution for kernels that track context independently of RCU. This is expected to be helpful primarily for kernels built with CONFIG_NO_HZ_FULL=y. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLgMcgTHHBhdWxtY2tA a2VybmVsLm9yZwAKCRCevxLzctn7jArXD/0fjbCwqpRjHVTzjMY8jN4zDkqZZD6m g8Fx27hZ4ToNFwRptyHwNezrNj14skjAJEXfdjaVw32W62ivXvf0HINvSzsTLCSq k2kWyBdXLc9CwY5p5W4smnpn5VoAScjg5PoPL59INoZ/Zziji323C7Zepl/1DYJt 0T6bPCQjo1ZQoDUCyVpSjDmAqxnderWG0MeJVt74GkLqmnYLANg0GH8c7mH4+9LL kVGlLp5nlPgNJ4FEoFdMwNU8T/ETmaVld/m2dkiawjkXjJzB2XKtBigU91DDmXz5 7DIdV4ABrxiy4kGNqtIe/jFgnKyVD7xiDpyfjd6KTeDr/rDS8u2ZH7+1iHsyz3g0 Np/tS3vcd0KR+gI/d0eXxPbgm5sKlCmKw/nU2eArpW/+4LmVXBUfHTG9Jg+LJmBc JrUh6aEdIZJZHgv/nOQBNig7GJW43IG50rjuJxAuzcxiZNEG5lUSS23ysaA9CPCL PxRWKSxIEfK3kdmvVO5IIbKTQmIBGWlcWMTcYictFSVfBgcCXpPAksGvqA5JiUkc egW+xLFo/7K+E158vSKsVqlWZcEeUbsNJ88QOlpqnRgH++I2Yv/LhK41XfJfpH+Y ALxVaDd+mAq6v+qSHNVq9wT3ozXIPy/zK1hDlMIqx40h2YvaEsH4je+521oSoN9r vX60+QNxvUBLwA== =vUNm -----END PGP SIGNATURE----- Merge tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates - Miscellaneous fixes - Callback-offload updates, perhaps most notably a new RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be offloaded at boot time, regardless of kernel boot parameters. This is useful to battery-powered systems such as ChromeOS and Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot parameter prevents offloaded callbacks from interfering with real-time workloads and with energy-efficiency mechanisms - Polled grace-period updates, perhaps most notably making these APIs account for both normal and expedited grace periods - Tasks RCU updates, perhaps most notably reducing the CPU overhead of RCU tasks trace grace periods by more than a factor of two on a system with 15,000 tasks. The reduction is expected to increase with the number of tasks, so it seems reasonable to hypothesize that a system with 150,000 tasks might see a 20-fold reduction in CPU overhead - Torture-test updates - Updates that merge RCU's dyntick-idle tracking into context tracking, thus reducing the overhead of transitioning to kernel mode from either idle or nohz_full userspace execution for kernels that track context independently of RCU. This is expected to be helpful primarily for kernels built with CONFIG_NO_HZ_FULL=y * tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits) rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings rcu: Diagnose extended sync_rcu_do_polled_gp() loops rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings rcutorture: Test polled expedited grace-period primitives rcu: Add polled expedited grace-period primitives rcutorture: Verify that polled GP API sees synchronous grace periods rcu: Make Tiny RCU grace periods visible to polled APIs rcu: Make polled grace-period API account for expedited grace periods rcu: Switch polled grace-period APIs to ->gp_seq_polled rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty rcu/nocb: Add option to opt rcuo kthreads out of RT priority rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread() rcu/nocb: Add an option to offload all CPUs on boot rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order rcu/nocb: Add/del rdp to iterate from rcuog itself rcu/tree: Add comment to describe GP-done condition in fqs loop rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs() rcu/kvfree: Remove useless monitor_todo flag rcu: Cleanup RCU urgency state for offline CPU ... | ||
|  Daniel Bristot de Oliveira | 792575348f | rv/include: Add deterministic automata monitor definition via C macros In Linux terms, the runtime verification monitors are encapsulated
inside the "RV monitor" abstraction. The "RV monitor" includes a set
of instances of the monitor (per-cpu monitor, per-task monitor, and
so on), the helper functions that glue the monitor to the system
reference model, and the trace output as a reaction for event parsing
and exceptions, as depicted below:
Linux  +----- RV Monitor ----------------------------------+ Formal
 Realm |                                                   |  Realm
 +-------------------+     +----------------+     +-----------------+
 |   Linux kernel    |     |     Monitor    |     |     Reference   |
 |     Tracing       |  -> |   Instance(s)  | <-  |       Model     |
 | (instrumentation) |     | (verification) |     | (specification) |
 +-------------------+     +----------------+     +-----------------+
        |                          |                       |
        |                          V                       |
        |                     +----------+                 |
        |                     | Reaction |                 |
        |                     +--+--+--+-+                 |
        |                        |  |  |                   |
        |                        |  |  +-> trace output ?  |
        +------------------------|--|----------------------+
                                 |  +----> panic ?
                                 +-------> <user-specified>
Add the rv/da_monitor.h, enabling automatic code generation for the
*Monitor Instance(s)* using C macros, and code to support it.
The benefits of the usage of macro for monitor synthesis are 3-fold as it:
- Reduces the code duplication;
- Facilitates the bug fix/improvement;
- Avoids the case of developers changing the core of the monitor code
  to manipulate the model in a (let's say) non-standard way.
This initial implementation presents three different types of monitor
instances:
- DECLARE_DA_MON_GLOBAL(name, type)
- DECLARE_DA_MON_PER_CPU(name, type)
- DECLARE_DA_MON_PER_TASK(name, type)
The first declares the functions for a global deterministic automata monitor,
the second for monitors with per-cpu instances, and the third with per-task
instances.
Link: https://lkml.kernel.org/r/51b0bf425a281e226dfeba7401d2115d6091f84e.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> | ||
|  Eric W. Biederman | d80f7d7b2c | signal: Guarantee that SIGNAL_GROUP_EXIT is set on process exit Track how many threads have not started exiting and when the last thread starts exiting set SIGNAL_GROUP_EXIT. This guarantees that SIGNAL_GROUP_EXIT will get set when a process exits. In practice this achieves nothing as glibc's implementation of _exit calls sys_group_exit then sys_exit. While glibc's implemenation of pthread_exit calls exit (which cleansup and calls _exit) if it is the last thread and sys_exit if it is the last thread. This means the only way the kernel might observe a process that does not set call exit_group is if the language runtime does not use glibc. With more cleanups I hope to move the decrement of quick_threads earlier. Link: https://lkml.kernel.org/r/87bkukd4tc.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Paul E. McKenney | 434c9eefb9 | rcu-tasks: Add data structures for lightweight grace periods This commit adds fields to task_struct and to rcu_tasks_percpu that will be used to avoid the task-list scan for RCU Tasks Trace grace periods, and also initializes these fields. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <kafai@fb.com> Cc: KP Singh <kpsingh@kernel.org> | ||
|  Andrei Vagin | 133e2d3e81 | fs/exec: allow to unshare a time namespace on vfork+exec Right now, a new process can't be forked in another time namespace if it shares mm with its parent. It is prohibited, because each time namespace has its own vvar page that is mapped into a process address space. When a process calls exec, it gets a new mm and so it could be "legal" to switch time namespace in that case. This was not implemented and now if we want to do this, we need to add another clone flag to not break backward compatibility. We don't have any user requests to switch times on exec except the vfork+exec combination, so there is no reason to add a new clone flag. As for vfork+exec, this should be safe to allow switching timens with the current clone flag. Right now, vfork (CLONE_VFORK | CLONE_VM) fails if a child is forked into another time namespace. With this change, vfork creates a new process in parent's timens, and the following exec does the actual switch to the target time namespace. Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220613060723.197407-1-avagin@gmail.com | ||
|  Linus Torvalds | 1ec6574a3c | This set of changes updates init and user mode helper tasks to be ordinary user mode tasks. In commit | ||
|  Linus Torvalds | 98931dd95f | Yang Shi has improved the behaviour of khugepaged collapsing of readonly file-backed transparent hugepages.
 
 Johannes Weiner has arranged for zswap memory use to be tracked and
 managed on a per-cgroup basis.
 
 Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime
 enablement of the recent huge page vmemmap optimization feature.
 
 Baolin Wang contributes a series to fix some issues around hugetlb
 pagetable invalidation.
 
 Zhenwei Pi has fixed some interactions between hwpoisoned pages and
 virtualization.
 
 Tong Tiangen has enabled the use of the presently x86-only
 page_table_check debugging feature on arm64 and riscv.
 
 David Vernet has done some fixup work on the memcg selftests.
 
 Peter Xu has taught userfaultfd to handle write protection faults against
 shmem- and hugetlbfs-backed files.
 
 More DAMON development from SeongJae Park - adding online tuning of the
 feature and support for monitoring of fixed virtual address ranges.  Also
 easier discovery of which monitoring operations are available.
 
 Nadav Amit has done some optimization of TLB flushing during mprotect().
 
 Neil Brown continues to labor away at improving our swap-over-NFS support.
 
 David Hildenbrand has some fixes to anon page COWing versus
 get_user_pages().
 
 Peng Liu fixed some errors in the core hugetlb code.
 
 Joao Martins has reduced the amount of memory consumed by device-dax's
 compound devmaps.
 
 Some cleanups of the arch-specific pagemap code from Anshuman Khandual.
 
 Muchun Song has found and fixed some errors in the TLB flushing of
 transparent hugepages.
 
 Roman Gushchin has done more work on the memcg selftests.
 
 And, of course, many smaller fixes and cleanups.  Notably, the customary
 million cleanup serieses from Miaohe Lin.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo52xQAKCRDdBJ7gKXxA
 jtJFAQD238KoeI9z5SkPMaeBRYSRQmNll85mxs25KapcEgWgGQD9FAb7DJkqsIVk
 PzE+d9hEfirUGdL6cujatwJ6ejYR8Q8=
 =nFe6
 -----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
 "Almost all of MM here. A few things are still getting finished off,
  reviewed, etc.
   - Yang Shi has improved the behaviour of khugepaged collapsing of
     readonly file-backed transparent hugepages.
   - Johannes Weiner has arranged for zswap memory use to be tracked and
     managed on a per-cgroup basis.
   - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
     runtime enablement of the recent huge page vmemmap optimization
     feature.
   - Baolin Wang contributes a series to fix some issues around hugetlb
     pagetable invalidation.
   - Zhenwei Pi has fixed some interactions between hwpoisoned pages and
     virtualization.
   - Tong Tiangen has enabled the use of the presently x86-only
     page_table_check debugging feature on arm64 and riscv.
   - David Vernet has done some fixup work on the memcg selftests.
   - Peter Xu has taught userfaultfd to handle write protection faults
     against shmem- and hugetlbfs-backed files.
   - More DAMON development from SeongJae Park - adding online tuning of
     the feature and support for monitoring of fixed virtual address
     ranges. Also easier discovery of which monitoring operations are
     available.
   - Nadav Amit has done some optimization of TLB flushing during
     mprotect().
   - Neil Brown continues to labor away at improving our swap-over-NFS
     support.
   - David Hildenbrand has some fixes to anon page COWing versus
     get_user_pages().
   - Peng Liu fixed some errors in the core hugetlb code.
   - Joao Martins has reduced the amount of memory consumed by
     device-dax's compound devmaps.
   - Some cleanups of the arch-specific pagemap code from Anshuman
     Khandual.
   - Muchun Song has found and fixed some errors in the TLB flushing of
     transparent hugepages.
   - Roman Gushchin has done more work on the memcg selftests.
  ... and, of course, many smaller fixes and cleanups. Notably, the
  customary million cleanup serieses from Miaohe Lin"
* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
  mm: kfence: use PAGE_ALIGNED helper
  selftests: vm: add the "settings" file with timeout variable
  selftests: vm: add "test_hmm.sh" to TEST_FILES
  selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
  selftests: vm: add migration to the .gitignore
  selftests/vm/pkeys: fix typo in comment
  ksm: fix typo in comment
  selftests: vm: add process_mrelease tests
  Revert "mm/vmscan: never demote for memcg reclaim"
  mm/kfence: print disabling or re-enabling message
  include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
  include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
  mm: fix a potential infinite loop in start_isolate_page_range()
  MAINTAINERS: add Muchun as co-maintainer for HugeTLB
  zram: fix Kconfig dependency warning
  mm/shmem: fix shmem folio swapoff hang
  cgroup: fix an error handling path in alloc_pagecache_max_30M()
  mm: damon: use HPAGE_PMD_SIZE
  tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
  nodemask.h: fix compilation error with GCC12
  ... | ||
|  Linus Torvalds | 3e2cbc016b | - Add Raptor Lake to the set of CPU models which support splitlock - Make life miserable for apps using split locks by slowing them down considerably while the rest of the system remains responsive. The hope is it will hurt more and people will really fix their misaligned locks apps. As a result, free a TIF bit. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKL5PQACgkQEsHwGGHe VUrz1Q//QjAKyKsAwCzGSPergtnZp9drimSuNsZAz6/xL8wFnn2nfWJTxugNF5jg n0Hal2oUGC8lg13mliB7NuDNu4RUWpkFzTzcIbPT8K9h7CUBdQPzqS7E3/p4s/eG ZCHp8psBGNp8+/+/LFfu9yhzYsAH9ji5KWmOzTVx9UdP3ovgR8BuCI7FCVJSfRz7 cY690XgvcuKoXKckVNaCcoQXPJxykfk4Y1yt1TpITqivFbs2I0vLgzEhoRcTAhPA nX3pR3uy6oaA6rZAapRt8lbLWOwIEWoI0Tt1v+r5p28+nFiCRfm1XdPYK6CDBlox UuMBK4WyvSKjKHLu3wEdLCvYbs1kw2l9pXvS3hrqsKhbdeXKrxrNZ3zshwFMAYap MY/nSTsKSWUUgMgUbWI084csapGFB+hxwY8OVr6JXbxE8YYD/yCbPGOe1cLI7MMt /H3F6vNqSzdp1N3mAaaKVxiiT21lHIn6oJuSZcDE5sOvBwvpXsOp/w3FxhJCOX49 PXrZLZmSHkDQSbh1XnvT/a+rq3XX1TFXFz71HYZf1yDk+xTijECglNtGnGSdj2Za iOw6M8VduV5Wy3ED9ubonruuHEJn6njpx/MH1B9+mAZsuLBpmuYFBxOn6AHOkXSb MVJD4flHXj0ugYm4Q5Y3yi24iWLsRI9utTOU079VL6i6DmFXeZc= =svvI -----END PGP SIGNATURE----- Merge tag 'x86_splitlock_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 splitlock updates from Borislav Petkov: - Add Raptor Lake to the set of CPU models which support splitlock - Make life miserable for apps using split locks by slowing them down considerably while the rest of the system remains responsive. The hope is it will hurt more and people will really fix their misaligned locks apps. As a result, free a TIF bit. * tag 'x86_splitlock_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/split_lock: Enable the split lock feature on Raptor Lake x86/split-lock: Remove unused TIF_SLD bit x86/split_lock: Make life miserable for split lockers | ||
|  Yang Shi | d2081b2bf8 | mm: khugepaged: make khugepaged_enter() void function The most callers of khugepaged_enter() don't care about the return value. Only dup_mmap(), anonymous THP page fault and MADV_HUGEPAGE handle the error by returning -ENOMEM. Actually it is not harmful for them to ignore the error case either. It also sounds overkilling to fail fork() and page fault early due to khugepaged_enter() error, and MADV_HUGEPAGE does set VM_HUGEPAGE flag regardless of the error. Link: https://lkml.kernel.org/r/20220510203222.24246-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Alexey Gladkov | de399236e2 | ucounts: Split rlimit and ucount values and max values Since the semantics of maximum rlimit values are different, it would be better not to mix ucount and rlimit values. This will prevent the error of using inc_count/dec_ucount for rlimit parameters. This patch also renames the functions to emphasize the lack of connection between rlimit and ucount. v3: - Fix BUG:KASAN:use-after-free_in_dec_ucount. v2: - Fix the array-index-out-of-bounds that was found by the lkp project. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://lkml.kernel.org/r/20220518171730.l65lmnnjtnxnftpq@example.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> | ||
|  Eric W. Biederman | 753550eb0c | fork: Explicitly set PF_KTHREAD Instead of implicitly inheriting PF_KTHREAD from the parent process examine arguments in kernel_clone_args to see if PF_KTHREAD should be set. This makes knowledge of which new threads are kernel threads explicit. This also makes it so that init and the user mode helper processes no longer have PF_KTHREAD set. Link: https://lkml.kernel.org/r/20220506141512.516114-6-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Eric W. Biederman | 5bd2e97c86 | fork: Generalize PF_IO_WORKER handling Add fn and fn_arg members into struct kernel_clone_args and test for them in copy_thread (instead of testing for PF_KTHREAD | PF_IO_WORKER). This allows any task that wants to be a user space task that only runs in kernel mode to use this functionality. The code on x86 is an exception and still retains a PF_KTHREAD test because x86 unlikely everything else handles kthreads slightly differently than user space tasks that start with a function. The functions that created tasks that start with a function have been updated to set ".fn" and ".fn_arg" instead of ".stack" and ".stack_size". These functions are fork_idle(), create_io_thread(), kernel_thread(), and user_mode_thread(). Link: https://lkml.kernel.org/r/20220506141512.516114-4-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Eric W. Biederman | 36cb0e1cda | fork: Explicity test for idle tasks in copy_thread The architectures ia64 and parisc have special handling for the idle thread in copy_process. Add a flag named idle to kernel_clone_args and use it to explicity test if an idle process is being created. Fullfill the expectations of the rest of the copy_thread implemetations and pass a function pointer in .stack from fork_idle(). This makes what is happening in copy_thread better defined, and is useful to make idle threads less special. Link: https://lkml.kernel.org/r/20220506141512.516114-3-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Eric W. Biederman | c5febea095 | fork: Pass struct kernel_clone_args into copy_thread With io_uring we have started supporting tasks that are for most purposes user space tasks that exclusively run code in kernel mode. The kernel task that exec's init and tasks that exec user mode helpers are also user mode tasks that just run kernel code until they call kernel execve. Pass kernel_clone_args into copy_thread so these oddball tasks can be supported more cleanly and easily. v2: Fix spelling of kenrel_clone_args on h8300 Link: https://lkml.kernel.org/r/20220506141512.516114-2-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Eric W. Biederman | 343f4c49f2 | kthread: Don't allocate kthread_struct for init and umh If kthread_is_per_cpu runs concurrently with free_kthread_struct the kthread_struct that was just freed may be read from. This bug was introduced by commit | ||
|  Fenghua Yu | 2667ed10d9 | mm: Fix PASID use-after-free issue The PASID is being freed too early.  It needs to stay around until after
device drivers that might be using it have had a chance to clear it out
of the hardware.
The relevant refcounts are:
  mmget() /mmput()  refcount the mm's address space
  mmgrab()/mmdrop() refcount the mm itself
The PASID is currently tied to the life of the mm's address space and freed
in __mmput().  This makes logical sense because the PASID can't be used
once the address space is gone.
But, this misses an important point: even after the address space is gone,
the PASID will still be programmed into a device.  Device drivers might,
for instance, still need to flush operations that are outstanding and need
to use that PASID.  They do this at file->release() time.
Device drivers call the IOMMU driver to hold a reference on the mm itself
and drop it at file->release() time.  But, the IOMMU driver holds a
reference on the mm itself, not the address space.  The address space (and
the PASID) is long gone by the time the driver tries to clean up.  This is
effectively a use-after-free bug on the PASID.
To fix this, move the PASID free operation from __mmput() to __mmdrop().
This ensures that the IOMMU driver's existing mmgrab() keeps the PASID
allocated until it drops its mm reference.
Fixes:  | ||
|  Tony Luck | b041b525da | x86/split_lock: Make life miserable for split lockers In https://lore.kernel.org/all/87y22uujkm.ffs@tglx/ Thomas said: Its's simply wishful thinking that stuff gets fixed because of a WARN_ONCE(). This has never worked. The only thing which works is to make stuff fail hard or slow it down in a way which makes it annoying enough to users to complain. He was talking about WBINVD. But it made me think about how we use the split lock detection feature in Linux. Existing code has three options for applications: 1) Don't enable split lock detection (allow arbitrary split locks) 2) Warn once when a process uses split lock, but let the process keep running with split lock detection disabled 3) Kill process that use split locks Option 2 falls into the "wishful thinking" territory that Thomas warns does nothing. But option 3 might not be viable in a situation with legacy applications that need to run. Hence make option 2 much stricter to "slow it down in a way which makes it annoying". Primary reason for this change is to provide better quality of service to the rest of the applications running on the system. Internal testing shows that even with many processes splitting locks, performance for the rest of the system is much more responsive. The new "warn" mode operates like this. When an application tries to execute a bus lock the #AC handler. 1) Delays (interruptibly) 10 ms before moving to next step. 2) Blocks (interruptibly) until it can get the semaphore If interrupted, just return. Assume the signal will either kill the task, or direct execution away from the instruction that is trying to get the bus lock. 3) Disables split lock detection for the current core 4) Schedules a work queue to re-enable split lock detect in 2 jiffies 5) Returns The work queue that re-enables split lock detection also releases the semaphore. There is a corner case where a CPU may be taken offline while split lock detection is disabled. A CPU hotplug handler handles this case. Old behaviour was to only print the split lock warning on the first occurrence of a split lock from a task. Preserve that by adding a flag to the task structure that suppresses subsequent split lock messages from that task. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220310204854.31752-2-tony.luck@intel.com | ||
|  Andrey Konovalov | 51fb34de2a | kasan, arm64: reset pointer tags of vmapped stacks Once tag-based KASAN modes start tagging vmalloc() allocations, kernel stacks start getting tagged if CONFIG_VMAP_STACK is enabled. Reset the tag of kernel stack pointers after allocation in arch_alloc_vmap_stack(). For SW_TAGS KASAN, when CONFIG_KASAN_STACK is enabled, the instrumentation can't handle the SP register being tagged. For HW_TAGS KASAN, there's no instrumentation-related issues. However, the impact of having a tagged SP register needs to be properly evaluated, so keep it non-tagged for now. Note, that the memory for the stack allocation still gets tagged to catch vmalloc-into-stack out-of-bounds accesses. [andreyknvl@google.com: fix case when a stack is retrieved from cached_stacks] Link: https://lkml.kernel.org/r/f50c5f96ef896d7936192c888b0c0a7674e33184.1644943792.git.andreyknvl@google.com [dan.carpenter@oracle.com: remove unnecessary check in alloc_thread_stack_node()] Link: https://lkml.kernel.org/r/20220301080706.GB17208@kili Link: https://lkml.kernel.org/r/698c5ab21743c796d46c15d075b9481825973e34.1643047180.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Andrey Konovalov | c08e6a1206 | kasan, fork: reset pointer tags of vmapped stacks Once tag-based KASAN modes start tagging vmalloc() allocations, kernel stacks start getting tagged if CONFIG_VMAP_STACK is enabled. Reset the tag of kernel stack pointers after allocation in alloc_thread_stack_node(). For SW_TAGS KASAN, when CONFIG_KASAN_STACK is enabled, the instrumentation can't handle the SP register being tagged. For HW_TAGS KASAN, there's no instrumentation-related issues. However, the impact of having a tagged SP register needs to be properly evaluated, so keep it non-tagged for now. Note, that the memory for the stack allocation still gets tagged to catch vmalloc-into-stack out-of-bounds accesses. Link: https://lkml.kernel.org/r/c6c96f012371ecd80e1936509ebcd3b07a5956f7.1643047180.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | 169e77764a | Networking changes for 5.18. Core
 ----
 
  - Introduce XDP multi-buffer support, allowing the use of XDP with
    jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
 
  - Speed up netns dismantling (5x) and lower the memory cost a little.
    Remove unnecessary per-netns sockets. Scope some lists to a netns.
    Cut down RCU syncing. Use batch methods. Allow netdev registration
    to complete out of order.
 
  - Support distinguishing timestamp types (ingress vs egress) and
    maintaining them across packet scrubbing points (e.g. redirect).
 
  - Continue the work of annotating packet drop reasons throughout
    the stack.
 
  - Switch netdev error counters from an atomic to dynamically
    allocated per-CPU counters.
 
  - Rework a few preempt_disable(), local_irq_save() and busy waiting
    sections problematic on PREEMPT_RT.
 
  - Extend the ref_tracker to allow catching use-after-free bugs.
 
 BPF
 ---
 
  - Introduce "packing allocator" for BPF JIT images. JITed code is
    marked read only, and used to be allocated at page granularity.
    Custom allocator allows for more efficient memory use, lower
    iTLB pressure and prevents identity mapping huge pages from
    getting split.
 
  - Make use of BTF type annotations (e.g. __user, __percpu) to enforce
    the correct probe read access method, add appropriate helpers.
 
  - Convert the BPF preload to use light skeleton and drop
    the user-mode-driver dependency.
 
  - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
    its use as a packet generator.
 
  - Allow local storage memory to be allocated with GFP_KERNEL if called
    from a hook allowed to sleep.
 
  - Introduce fprobe (multi kprobe) to speed up mass attachment (arch
    bits to come later).
 
  - Add unstable conntrack lookup helpers for BPF by using the BPF
    kfunc infra.
 
  - Allow cgroup BPF progs to return custom errors to user space.
 
  - Add support for AF_UNIX iterator batching.
 
  - Allow iterator programs to use sleepable helpers.
 
  - Support JIT of add, and, or, xor and xchg atomic ops on arm64.
 
  - Add BTFGen support to bpftool which allows to use CO-RE in kernels
    without BTF info.
 
  - Large number of libbpf API improvements, cleanups and deprecations.
 
 Protocols
 ---------
 
  - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
 
  - Adjust TSO packet sizes based on min_rtt, allowing very low latency
    links (data centers) to always send full-sized TSO super-frames.
 
  - Make IPv6 flow label changes (AKA hash rethink) more configurable,
    via sysctl and setsockopt. Distinguish between server and client
    behavior.
 
  - VxLAN support to "collect metadata" devices to terminate only
    configured VNIs. This is similar to VLAN filtering in the bridge.
 
  - Support inserting IPv6 IOAM information to a fraction of frames.
 
  - Add protocol attribute to IP addresses to allow identifying where
    given address comes from (kernel-generated, DHCP etc.)
 
  - Support setting socket and IPv6 options via cmsg on ping6 sockets.
 
  - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
    Define dscp_t and stop taking ECN bits into account in fib-rules.
 
  - Add support for locked bridge ports (for 802.1X).
 
  - tun: support NAPI for packets received from batched XDP buffs,
    doubling the performance in some scenarios.
 
  - IPv6 extension header handling in Open vSwitch.
 
  - Support IPv6 control message load balancing in bonding, prevent
    neighbor solicitation and advertisement from using the wrong port.
    Support NS/NA monitor selection similar to existing ARP monitor.
 
  - SMC
    - improve performance with TCP_CORK and sendfile()
    - support auto-corking
    - support TCP_NODELAY
 
  - MCTP (Management Component Transport Protocol)
    - add user space tag control interface
    - I2C binding driver (as specified by DMTF DSP0237)
 
  - Multi-BSSID beacon handling in AP mode for WiFi.
 
  - Bluetooth:
    - handle MSFT Monitor Device Event
    - add MGMT Adv Monitor Device Found/Lost events
 
  - Multi-Path TCP:
    - add support for the SO_SNDTIMEO socket option
    - lots of selftest cleanups and improvements
 
  - Increase the max PDU size in CAN ISOTP to 64 kB.
 
 Driver API
 ----------
 
  - Add HW counters for SW netdevs, a mechanism for devices which
    offload packet forwarding to report packet statistics back to
    software interfaces such as tunnels.
 
  - Select the default NIC queue count as a fraction of number of
    physical CPU cores, instead of hard-coding to 8.
 
  - Expose devlink instance locks to drivers. Allow device layer of
    drivers to use that lock directly instead of creating their own
    which always runs into ordering issues in devlink callbacks.
 
  - Add header/data split indication to guide user space enabling
    of TCP zero-copy Rx.
 
  - Allow configuring completion queue event size.
 
  - Refactor page_pool to enable fragmenting after allocation.
 
  - Add allocation and page reuse statistics to page_pool.
 
  - Improve Multiple Spanning Trees support in the bridge to allow
    reuse of topologies across VLANs, saving HW resources in switches.
 
  - DSA (Distributed Switch Architecture):
    - replay and offload of host VLAN entries
    - offload of static and local FDB entries on LAG interfaces
    - FDB isolation and unicast filtering
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - LAN937x T1 PHYs
    - Davicom DM9051 SPI NIC driver
    - Realtek RTL8367S, RTL8367RB-VB switch and MDIO
    - Microchip ksz8563 switches
    - Netronome NFP3800 SmartNICs
    - Fungible SmartNICs
    - MediaTek MT8195 switches
 
  - WiFi:
    - mt76: MediaTek mt7916
    - mt76: MediaTek mt7921u USB adapters
    - brcmfmac: Broadcom BCM43454/6
 
  - Mobile:
    - iosm: Intel M.2 7360 WWAN card
 
 Drivers
 -------
 
  - Convert many drivers to the new phylink API built for split PCS
    designs but also simplifying other cases.
 
  - Intel Ethernet NICs:
    - add TTY for GNSS module for E810T device
    - improve AF_XDP performance
    - GTP-C and GTP-U filter offload
    - QinQ VLAN support
 
  - Mellanox Ethernet NICs (mlx5):
    - support xdp->data_meta
    - multi-buffer XDP
    - offload tc push_eth and pop_eth actions
 
  - Netronome Ethernet NICs (nfp):
    - flow-independent tc action hardware offload (police / meter)
    - AF_XDP
 
  - Other Ethernet NICs:
    - at803x: fiber and SFP support
    - xgmac: mdio: preamble suppression and custom MDC frequencies
    - r8169: enable ASPM L1.2 if system vendor flags it as safe
    - macb/gem: ZynqMP SGMII
    - hns3: add TX push mode
    - dpaa2-eth: software TSO
    - lan743x: multi-queue, mdio, SGMII, PTP
    - axienet: NAPI and GRO support
 
  - Mellanox Ethernet switches (mlxsw):
    - source and dest IP address rewrites
    - RJ45 ports
 
  - Marvell Ethernet switches (prestera):
    - basic routing offload
    - multi-chain TC ACL offload
 
  - NXP embedded Ethernet switches (ocelot & felix):
    - PTP over UDP with the ocelot-8021q DSA tagging protocol
    - basic QoS classification on Felix DSA switch using dcbnl
    - port mirroring for ocelot switches
 
  - Microchip high-speed industrial Ethernet (sparx5):
    - offloading of bridge port flooding flags
    - PTP Hardware Clock
 
  - Other embedded switches:
    - lan966x: PTP Hardward Clock
    - qca8k: mdio read/write operations via crafted Ethernet packets
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - add LDPC FEC type and 802.11ax High Efficiency data in radiotap
    - enable RX PPDU stats in monitor co-exist mode
 
  - Intel WiFi (iwlwifi):
    - UHB TAS enablement via BIOS
    - band disablement via BIOS
    - channel switch offload
    - 32 Rx AMPDU sessions in newer devices
 
  - MediaTek WiFi (mt76):
    - background radar detection
    - thermal management improvements on mt7915
    - SAR support for more mt76 platforms
    - MBSSID and 6 GHz band on mt7915
 
  - RealTek WiFi:
    - rtw89: AP mode
    - rtw89: 160 MHz channels and 6 GHz band
    - rtw89: hardware scan
 
  - Bluetooth:
    - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
 
  - Microchip CAN (mcp251xfd):
    - multiple RX-FIFOs and runtime configurable RX/TX rings
    - internal PLL, runtime PM handling simplification
    - improve chip detection and error handling after wakeup
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmI7YBcACgkQMUZtbf5S
 IrveSBAAmSNJlUK6vPsnNzs7IhsZnfI/AUjm2TCLZnlhKttbpI4A/4Pohk33V7RS
 FGX7f8kjEfhUwrIiLDgeCnztNHRECrCmk6aZc/jLEvecmTauJ+f6kjShkDY/wix+
 AkPHmrZnQeLPAEVuljDdV+sL6ik08+zQL7PazIYHsaSKKC0MGQptRwcri8PLRAKE
 KPBAhVhleq2rAZ/ntprSN52F4Af6rpFTrPIWuN8Bqdbc9dy5094LT0mpOOWYvgr3
 /DLvvAPuLemwyIQkjWknVKBRUAQcmNPC+BY3J8K3LRaiNhekGqOFan46BfqP+k2J
 6DWu0Qrp2yWt4BMOeEToZR5rA6v5suUAMIBu8PRZIDkINXQMlIxHfGjZyNm0rVfw
 7edNri966yus9OdzwPa32MIG3oC6PnVAwYCJAjjBMNS8sSIkp7wgHLkgWN4UFe2H
 K/e6z8TLF4UQ+zFM0aGI5WZ+9QqWkTWEDF3R3OhdFpGrznna0gxmkOeV2YvtsgxY
 cbS0vV9Zj73o+bYzgBKJsw/dAjyLdXoHUGvus26VLQ78S/VGunVKtItwoxBAYmZo
 krW964qcC89YofzSi8RSKLHuEWtNWZbVm8YXr75u6jpr5GhMBu0CYefLs+BuZcxy
 dw8c69cGneVbGZmY2J3rBhDkchbuICl8vdUPatGrOJAoaFdYKuw=
 =ELpe
 -----END PGP SIGNATURE-----
Merge tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "The sprinkling of SPI drivers is because we added a new one and Mark
  sent us a SPI driver interface conversion pull request.
  Core
  ----
   - Introduce XDP multi-buffer support, allowing the use of XDP with
     jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
   - Speed up netns dismantling (5x) and lower the memory cost a little.
     Remove unnecessary per-netns sockets. Scope some lists to a netns.
     Cut down RCU syncing. Use batch methods. Allow netdev registration
     to complete out of order.
   - Support distinguishing timestamp types (ingress vs egress) and
     maintaining them across packet scrubbing points (e.g. redirect).
   - Continue the work of annotating packet drop reasons throughout the
     stack.
   - Switch netdev error counters from an atomic to dynamically
     allocated per-CPU counters.
   - Rework a few preempt_disable(), local_irq_save() and busy waiting
     sections problematic on PREEMPT_RT.
   - Extend the ref_tracker to allow catching use-after-free bugs.
  BPF
  ---
   - Introduce "packing allocator" for BPF JIT images. JITed code is
     marked read only, and used to be allocated at page granularity.
     Custom allocator allows for more efficient memory use, lower iTLB
     pressure and prevents identity mapping huge pages from getting
     split.
   - Make use of BTF type annotations (e.g. __user, __percpu) to enforce
     the correct probe read access method, add appropriate helpers.
   - Convert the BPF preload to use light skeleton and drop the
     user-mode-driver dependency.
   - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
     its use as a packet generator.
   - Allow local storage memory to be allocated with GFP_KERNEL if
     called from a hook allowed to sleep.
   - Introduce fprobe (multi kprobe) to speed up mass attachment (arch
     bits to come later).
   - Add unstable conntrack lookup helpers for BPF by using the BPF
     kfunc infra.
   - Allow cgroup BPF progs to return custom errors to user space.
   - Add support for AF_UNIX iterator batching.
   - Allow iterator programs to use sleepable helpers.
   - Support JIT of add, and, or, xor and xchg atomic ops on arm64.
   - Add BTFGen support to bpftool which allows to use CO-RE in kernels
     without BTF info.
   - Large number of libbpf API improvements, cleanups and deprecations.
  Protocols
  ---------
   - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
   - Adjust TSO packet sizes based on min_rtt, allowing very low latency
     links (data centers) to always send full-sized TSO super-frames.
   - Make IPv6 flow label changes (AKA hash rethink) more configurable,
     via sysctl and setsockopt. Distinguish between server and client
     behavior.
   - VxLAN support to "collect metadata" devices to terminate only
     configured VNIs. This is similar to VLAN filtering in the bridge.
   - Support inserting IPv6 IOAM information to a fraction of frames.
   - Add protocol attribute to IP addresses to allow identifying where
     given address comes from (kernel-generated, DHCP etc.)
   - Support setting socket and IPv6 options via cmsg on ping6 sockets.
   - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
     Define dscp_t and stop taking ECN bits into account in fib-rules.
   - Add support for locked bridge ports (for 802.1X).
   - tun: support NAPI for packets received from batched XDP buffs,
     doubling the performance in some scenarios.
   - IPv6 extension header handling in Open vSwitch.
   - Support IPv6 control message load balancing in bonding, prevent
     neighbor solicitation and advertisement from using the wrong port.
     Support NS/NA monitor selection similar to existing ARP monitor.
   - SMC
      - improve performance with TCP_CORK and sendfile()
      - support auto-corking
      - support TCP_NODELAY
   - MCTP (Management Component Transport Protocol)
      - add user space tag control interface
      - I2C binding driver (as specified by DMTF DSP0237)
   - Multi-BSSID beacon handling in AP mode for WiFi.
   - Bluetooth:
      - handle MSFT Monitor Device Event
      - add MGMT Adv Monitor Device Found/Lost events
   - Multi-Path TCP:
      - add support for the SO_SNDTIMEO socket option
      - lots of selftest cleanups and improvements
   - Increase the max PDU size in CAN ISOTP to 64 kB.
  Driver API
  ----------
   - Add HW counters for SW netdevs, a mechanism for devices which
     offload packet forwarding to report packet statistics back to
     software interfaces such as tunnels.
   - Select the default NIC queue count as a fraction of number of
     physical CPU cores, instead of hard-coding to 8.
   - Expose devlink instance locks to drivers. Allow device layer of
     drivers to use that lock directly instead of creating their own
     which always runs into ordering issues in devlink callbacks.
   - Add header/data split indication to guide user space enabling of
     TCP zero-copy Rx.
   - Allow configuring completion queue event size.
   - Refactor page_pool to enable fragmenting after allocation.
   - Add allocation and page reuse statistics to page_pool.
   - Improve Multiple Spanning Trees support in the bridge to allow
     reuse of topologies across VLANs, saving HW resources in switches.
   - DSA (Distributed Switch Architecture):
      - replay and offload of host VLAN entries
      - offload of static and local FDB entries on LAG interfaces
      - FDB isolation and unicast filtering
  New hardware / drivers
  ----------------------
   - Ethernet:
      - LAN937x T1 PHYs
      - Davicom DM9051 SPI NIC driver
      - Realtek RTL8367S, RTL8367RB-VB switch and MDIO
      - Microchip ksz8563 switches
      - Netronome NFP3800 SmartNICs
      - Fungible SmartNICs
      - MediaTek MT8195 switches
   - WiFi:
      - mt76: MediaTek mt7916
      - mt76: MediaTek mt7921u USB adapters
      - brcmfmac: Broadcom BCM43454/6
   - Mobile:
      - iosm: Intel M.2 7360 WWAN card
  Drivers
  -------
   - Convert many drivers to the new phylink API built for split PCS
     designs but also simplifying other cases.
   - Intel Ethernet NICs:
      - add TTY for GNSS module for E810T device
      - improve AF_XDP performance
      - GTP-C and GTP-U filter offload
      - QinQ VLAN support
   - Mellanox Ethernet NICs (mlx5):
      - support xdp->data_meta
      - multi-buffer XDP
      - offload tc push_eth and pop_eth actions
   - Netronome Ethernet NICs (nfp):
      - flow-independent tc action hardware offload (police / meter)
      - AF_XDP
   - Other Ethernet NICs:
      - at803x: fiber and SFP support
      - xgmac: mdio: preamble suppression and custom MDC frequencies
      - r8169: enable ASPM L1.2 if system vendor flags it as safe
      - macb/gem: ZynqMP SGMII
      - hns3: add TX push mode
      - dpaa2-eth: software TSO
      - lan743x: multi-queue, mdio, SGMII, PTP
      - axienet: NAPI and GRO support
   - Mellanox Ethernet switches (mlxsw):
      - source and dest IP address rewrites
      - RJ45 ports
   - Marvell Ethernet switches (prestera):
      - basic routing offload
      - multi-chain TC ACL offload
   - NXP embedded Ethernet switches (ocelot & felix):
      - PTP over UDP with the ocelot-8021q DSA tagging protocol
      - basic QoS classification on Felix DSA switch using dcbnl
      - port mirroring for ocelot switches
   - Microchip high-speed industrial Ethernet (sparx5):
      - offloading of bridge port flooding flags
      - PTP Hardware Clock
   - Other embedded switches:
      - lan966x: PTP Hardward Clock
      - qca8k: mdio read/write operations via crafted Ethernet packets
   - Qualcomm 802.11ax WiFi (ath11k):
      - add LDPC FEC type and 802.11ax High Efficiency data in radiotap
      - enable RX PPDU stats in monitor co-exist mode
   - Intel WiFi (iwlwifi):
      - UHB TAS enablement via BIOS
      - band disablement via BIOS
      - channel switch offload
      - 32 Rx AMPDU sessions in newer devices
   - MediaTek WiFi (mt76):
      - background radar detection
      - thermal management improvements on mt7915
      - SAR support for more mt76 platforms
      - MBSSID and 6 GHz band on mt7915
   - RealTek WiFi:
      - rtw89: AP mode
      - rtw89: 160 MHz channels and 6 GHz band
      - rtw89: hardware scan
   - Bluetooth:
      - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
   - Microchip CAN (mcp251xfd):
      - multiple RX-FIFOs and runtime configurable RX/TX rings
      - internal PLL, runtime PM handling simplification
      - improve chip detection and error handling after wakeup"
* tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits)
  llc: fix netdevice reference leaks in llc_ui_bind()
  drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool
  ice: don't allow to run ice_send_event_to_aux() in atomic ctx
  ice: fix 'scheduling while atomic' on aux critical err interrupt
  net/sched: fix incorrect vlan_push_eth dest field
  net: bridge: mst: Restrict info size queries to bridge ports
  net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init()
  drivers: net: xgene: Fix regression in CRC stripping
  net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT
  net: dsa: fix missing host-filtered multicast addresses
  net/mlx5e: Fix build warning, detected write beyond size of field
  iwlwifi: mvm: Don't fail if PPAG isn't supported
  selftests/bpf: Fix kprobe_multi test.
  Revert "rethook: x86: Add rethook x86 implementation"
  Revert "arm64: rethook: Add arm64 rethook implementation"
  Revert "powerpc: Add rethook support"
  Revert "ARM: rethook: Add rethook arm implementation"
  netdevice: add missing dm_private kdoc
  net: bridge: mst: prevent NULL deref in br_mst_info_size()
  selftests: forwarding: Use same VRF for port and VLAN upper
  ... | ||
|  Jakub Kicinski | 0db8640df5 | Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-03-21 v2 We've added 137 non-merge commits during the last 17 day(s) which contain a total of 143 files changed, 7123 insertions(+), 1092 deletions(-). The main changes are: 1) Custom SEC() handling in libbpf, from Andrii. 2) subskeleton support, from Delyan. 3) Use btf_tag to recognize __percpu pointers in the verifier, from Hao. 4) Fix net.core.bpf_jit_harden race, from Hou. 5) Fix bpf_sk_lookup remote_port on big-endian, from Jakub. 6) Introduce fprobe (multi kprobe) _without_ arch bits, from Masami. The arch specific bits will come later. 7) Introduce multi_kprobe bpf programs on top of fprobe, from Jiri. 8) Enable non-atomic allocations in local storage, from Joanne. 9) Various var_off ptr_to_btf_id fixed, from Kumar. 10) bpf_ima_file_hash helper, from Roberto. 11) Add "live packet" mode for XDP in BPF_PROG_RUN, from Toke. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (137 commits) selftests/bpf: Fix kprobe_multi test. Revert "rethook: x86: Add rethook x86 implementation" Revert "arm64: rethook: Add arm64 rethook implementation" Revert "powerpc: Add rethook support" Revert "ARM: rethook: Add rethook arm implementation" bpftool: Fix a bug in subskeleton code generation bpf: Fix bpf_prog_pack when PMU_SIZE is not defined bpf: Fix bpf_prog_pack for multi-node setup bpf: Fix warning for cast from restricted gfp_t in verifier bpf, arm: Fix various typos in comments libbpf: Close fd in bpf_object__reuse_map bpftool: Fix print error when show bpf map bpf: Fix kprobe_multi return probe backtrace Revert "bpf: Add support to inline bpf_get_func_ip helper on x86" bpf: Simplify check in btf_parse_hdr() selftests/bpf/test_lirc_mode2.sh: Exit with proper code bpf: Check for NULL return from bpf_get_btf_vmlinux selftests/bpf: Test skipping stacktrace bpf: Adjust BPF stack helper functions to accommodate skip > 0 bpf: Select proper size for bpf_prog_pack ... ==================== Link: https://lore.kernel.org/r/20220322050159.5507-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> | ||
|  Linus Torvalds | bba90e0964 | Core code updates: - Reduce the amount of work to release a task stack in context
     switch. There is no real reason to do cgroup accounting and memory
     freeing in this performance sensitive context. Aside of this the
     invoked functions cannot be called from this preemption disabled
     context on PREEMPT_RT enabled kernels. Solve this by moving the
     accounting into do_exit() and delaying the freeing of the stack unless
     the vmap stack can be cached.
 
   - Provide a mechanism to delay raising signals from atomic context on
     PREEMPT_RT enabled kernels as sighand::lock cannot be acquired.  Store
     the information in the task struct and raise it in the exit path.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmI4U6gTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoSpkEACwgaaQUbqVrpw5yb6LbwzUPnjEdFNN
 uUQCv0XZD8LWbfhcQQVSPWGho7S/w2Mkpdhi0DkVb2K0dkB7EvITSNEC4KoS/yez
 8iQBpv6Lm00quHdNLjkQySSZ4NYB8M1GasBI7zSBjROK/+sRqioTPQsM0oDemGmD
 uMvw0dgDJRlB8X4LZv0xuJbYLdSzu2VOlWd5aJG9BUgHkd7PfUWMlHsa29FP0hkP
 A5yziOnr9kMsmCAsgmiyDW/GmefrEealby5M/jgnxTruF/OLnDsP+PYMlws47fPx
 g6xpHkT5H0zQJ/nMJtK2JAlxpnbIl4cLuUnpn7wX316yjBpP2s3Pw04AVdzPPoBa
 ufAoOLFtnrKN6enIqLWaJHGAsBHEULw6d3/7HoAEQOVWChnQSuWOob8z0QDbvM14
 kKtz+LTrO+P5a15fd4g5+9lFBXJUTnF74SYQNwxIm2cV9hxrf15NhAr8yg+RtUvF
 /ilNNAFtXkASLqs9moEi7U+GyBYwemG+gduVZ3Dw8FBxK/vHmDrhlItcZdKom+UJ
 k4VFDVhzd2GYRHMrcaLfkCYew6ou+LD/rjdPhIU9OQHgILIMLY5aLqxDuyPtHqDz
 TEyF5qsL4wYLIUdsWlqyHISqQQ6LfnpIyko5kb2Zt56sYtrcZr8swDy+yimiEOdL
 G4BzQu0nVbCLhw==
 =uGTc
 -----END PGP SIGNATURE-----
Merge tag 'core-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core process handling RT latency updates from Thomas Gleixner:
 - Reduce the amount of work to release a task stack in context switch.
   There is no real reason to do cgroup accounting and memory freeing in
   this performance sensitive context.
   Aside of this the invoked functions cannot be called from this
   preemption disabled context on PREEMPT_RT enabled kernels. Solve this
   by moving the accounting into do_exit() and delaying the freeing of
   the stack unless the vmap stack can be cached.
 - Provide a mechanism to delay raising signals from atomic context on
   PREEMPT_RT enabled kernels as sighand::lock cannot be acquired. Store
   the information in the task struct and raise it in the exit path.
* tag 'core-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  signal, x86: Delay calling signals in atomic on RT enabled kernels
  fork: Use IS_ENABLED() in account_kernel_stack()
  fork: Only cache the VMAP stack in finish_task_switch()
  fork: Move task stack accounting to do_exit()
  fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK
  fork: Don't assign the stack pointer in dup_task_struct()
  fork, IA64: Provide alloc_thread_stack_node() for IA64
  fork: Duplicate task_struct before stack allocation
  fork: Redo ifdefs around task stack handling | ||
|  Linus Torvalds | 3fd33273a4 | Reenable ENQCMD/PASID support: - Simplify the PASID handling to allocate the PASID once, associate it to
    the mm of a process and free it on mm_exit(). The previous attempt of
    refcounted PASIDs and dynamic alloc()/free() turned out to be error
    prone and too complex. The PASID space is 20bits, so the case of
    resource exhaustion is a pure academic concern.
 
  - Populate the PASID MSR on demand via #GP to avoid racy updates via IPIs.
 
  - Reenable ENQCMD and let objtool check for the forbidden usage of ENQCMD
    in the kernel.
 
  - Update the documentation for Shared Virtual Addressing accordingly.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmI4WpETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUfnD/0bY94rgEX4Uuy/mFQ1W8X8XlcyKrha
 0/cRATb+4QV/pwJgGr2nClKhGlFMYPdJLvKMC1TCUPCVrLD1RNmluIZoFzeqXwhm
 jDdCcFOuGZ2D4ujDPWwOOpKBT1ytovnQa7+lH6QJyKkEqdcC2ncOvGJQoiRxRQIG
 8wTVs/OUvQJ5ZhSZQMKQN4uMWMyHEjhbroYS30/uNi/598jTPgzlEoa14XocQ9Os
 nS6ALvjuc9MsJ34F61etMaJU1ZMI3Wx75u9QjEvX6hmJs87YdvgwE7lzJUKFDEuh
 gewM0wp2fTa8/azzP0eMiHTin56PqFdmllzRqXmilbZMEPOeI29dZVArCdpKcAn0
 r9p1kJUT3Xl2G3Oir/OdCaaQHcznD1Y5ZFOyh12wgEucZ/rdeSr7nq7n5HoOL5Bw
 Q2o6YvTkE9DOL0nTN1lSXGiPspou7fzX0uUcRBrbJUS3sBv4zGIlaJXUaTVnSdAt
 VZj4LeOK7v2BjyeiOY0iaaIQd3xjmLUF0UjozXS5M13SoVcToZRbyWqhDzPvNuKA
 imQb/dnFpXhABgmuqAiJLeqM0VtGMFNc780OURkcsBSPng+iSEdV4DzuhK0jpU8x
 Uk1RuGMd/vgmrlDFBrw+orQQiiKR1ixpI0LiHfcOBycfJhqTwcnrNZvAN5/do28Z
 E23+QzlUbZF0cw==
 =Dy8V
 -----END PGP SIGNATURE-----
Merge tag 'x86-pasid-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PASID support from Thomas Gleixner:
 "Reenable ENQCMD/PASID support:
   - Simplify the PASID handling to allocate the PASID once, associate
     it to the mm of a process and free it on mm_exit().
     The previous attempt of refcounted PASIDs and dynamic
     alloc()/free() turned out to be error prone and too complex. The
     PASID space is 20bits, so the case of resource exhaustion is a pure
     academic concern.
   - Populate the PASID MSR on demand via #GP to avoid racy updates via
     IPIs.
   - Reenable ENQCMD and let objtool check for the forbidden usage of
     ENQCMD in the kernel.
   - Update the documentation for Shared Virtual Addressing accordingly"
* tag 'x86-pasid-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation/x86: Update documentation for SVA (Shared Virtual Addressing)
  tools/objtool: Check for use of the ENQCMD instruction in the kernel
  x86/cpufeatures: Re-enable ENQCMD
  x86/traps: Demand-populate PASID MSR via #GP
  sched: Define and initialize a flag to identify valid PASID in the task
  x86/fpu: Clear PASID when copying fpstate
  iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit
  kernel/fork: Initialize mm's PASID
  iommu/ioasid: Introduce a helper to check for valid PASIDs
  mm: Change CONFIG option for mm->pasid field
  iommu/sva: Rename CONFIG_IOMMU_SVA_LIB to CONFIG_IOMMU_SVA | ||
|  Masami Hiramatsu | 54ecbe6f1e | rethook: Add a generic return hook Add a return hook framework which hooks the function return. Most of the logic came from the kretprobe, but this is independent from kretprobe. Note that this is expected to be used with other function entry hooking feature, like ftrace, fprobe, adn kprobes. Eventually this will replace the kretprobe (e.g. kprobe + rethook = kretprobe), but at this moment, this is just an additional hook. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/164735285066.1084943.9259661137330166643.stgit@devnote2 | ||
|  Suren Baghdasaryan | 5c26f6ac94 | mm: refactor vm_area_struct::anon_vma_name usage code Avoid mixing strings and their anon_vma_name referenced pointers by using struct anon_vma_name whenever possible. This simplifies the code and allows easier sharing of anon_vma_name structures when they represent the same name. [surenb@google.com: fix comment] Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Colin Cross <ccross@google.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Chris Hyser <chris.hyser@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Xiaofeng Cao <caoxiaofeng@yulong.com> Cc: David Hildenbrand <david@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Sebastian Andrzej Siewior | 0ce055f853 | fork: Use IS_ENABLED() in account_kernel_stack() Not strickly needed but checking CONFIG_VMAP_STACK instead of task_stack_vm_area()' result allows the compiler the remove the else path in the CONFIG_VMAP_STACK case where the pointer can't be NULL. Check for CONFIG_VMAP_STACK in order to use the proper path. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20220217102406.3697941-9-bigeasy@linutronix.de | ||
|  Sebastian Andrzej Siewior | e540bf3162 | fork: Only cache the VMAP stack in finish_task_switch() The task stack could be deallocated later, but for fork()/exec() kind of workloads (say a shell script executing several commands) it is important that the stack is released in finish_task_switch() so that in VMAP_STACK case it can be cached and reused in the new task. For PREEMPT_RT it would be good if the wake-up in vfree_atomic() could be avoided in the scheduling path. Far worse are the other free_thread_stack() implementations which invoke __free_pages()/ kmem_cache_free() with disabled preemption. Cache the stack in free_thread_stack() in the VMAP_STACK case and RCU-delay the free path otherwise. Free the stack in the RCU callback. In the VMAP_STACK case this is another opportunity to fill the cache. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20220217102406.3697941-8-bigeasy@linutronix.de | ||
|  Sebastian Andrzej Siewior | 1a03d3f13f | fork: Move task stack accounting to do_exit() There is no need to perform the stack accounting of the outgoing task in its final schedule() invocation which happens with preemption disabled. The task is leaving, the resources will be freed and the accounting can happen in do_exit() before the actual schedule invocation which frees the stack memory. Move the accounting of the stack memory from release_task_stack() to exit_task_stack_account() which then can be invoked from do_exit(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20220217102406.3697941-7-bigeasy@linutronix.de | ||
|  Sebastian Andrzej Siewior | f1c1a9ee00 | fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK memcg_charge_kernel_stack() is only used in the CONFIG_VMAP_STACK case. Move memcg_charge_kernel_stack() into the CONFIG_VMAP_STACK block and invoke it from within alloc_thread_stack_node(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20220217102406.3697941-6-bigeasy@linutronix.de | ||
|  Sebastian Andrzej Siewior | 7865aba3ad | fork: Don't assign the stack pointer in dup_task_struct() All four versions of alloc_thread_stack_node() assign now task_struct::stack in case the allocation was successful. Let alloc_thread_stack_node() return an error code instead of the stack pointer and remove the stack assignment in dup_task_struct(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20220217102406.3697941-5-bigeasy@linutronix.de | ||
|  Sebastian Andrzej Siewior | 2bb0529c0b | fork, IA64: Provide alloc_thread_stack_node() for IA64 Provide a generic alloc_thread_stack_node() for IA64 and CONFIG_ARCH_THREAD_STACK_ALLOCATOR which returns stack pointer and sets task_struct::stack so it behaves exactly like the other implementations. Rename IA64's alloc_thread_stack_node() and add the generic version to the fork code so it is in one place _and_ to drastically lower the chances of fat fingering the IA64 code. Do the same for free_thread_stack(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20220217102406.3697941-4-bigeasy@linutronix.de | ||
|  Sebastian Andrzej Siewior | 546c42b2c5 | fork: Duplicate task_struct before stack allocation alloc_thread_stack_node() already populates the task_struct::stack member except on IA64. The stack pointer is saved and populated again because IA64 needs it and arch_dup_task_struct() overwrites it. Allocate thread's stack after task_struct has been duplicated as a preparation for further changes. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20220217102406.3697941-3-bigeasy@linutronix.de | ||
|  Sebastian Andrzej Siewior | be9a2277ca | fork: Redo ifdefs around task stack handling The use of ifdef CONFIG_VMAP_STACK is confusing in terms what is actually happenning and what can happen. For instance from reading free_thread_stack() it appears that in the CONFIG_VMAP_STACK case it may receive a non-NULL vm pointer but it may also be NULL in which case __free_pages() is used to free the stack. This is however not the case because in the VMAP case a non-NULL pointer is always returned here. Since it looks like this might happen, the compiler creates the correct dead code with the invocation to __free_pages() and everything around it. Twice. Add spaces between the ifdef and the identifer to recognize the ifdef level which is currently in scope. Add the current identifer as a comment behind #else and #endif. Move the code within free_thread_stack() and alloc_thread_stack_node() into the relevant ifdef blocks. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20220217102406.3697941-2-bigeasy@linutronix.de | ||
|  Linus Torvalds | 0b0894ff78 | - Fix task exposure order when forking tasks -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmISLs0ACgkQEsHwGGHe VUrq+g/+NxLVl3QTmteNh4jEzA/3FDccRKbALNs09YX116nCOZ1sv1Z5FgtdOLUT ukIujAm1SKDXQrXoglmm3eH6ylWZ54DQo6qrP+vzavtIKSgbt+vPS63qDR4ehF8N Kh04qssWy+AzOSeZRQ6lXTiGei2nBKFbiwlwPnV5vbQmLBFvzv2dEErVFYkHwWN6 foygRUOhslv5aizwpDriAvDq9ZTVUPXqzIi5zWnTON8y8Vy32eZjSJKez8SQH57w vbEZkJTw33tCzG/5+J5mh3UmP9Mcj34W/GDCKHxjO1Y38SbLTMy2Agl5hbRKjVZ8 W4wElyXyXtLx1RsrZpOJ90mhqwADcxLgkq4ipl6uzikeQlQVnOOBsHZOi3NkTorg 1YhpxZhqzWWotdGUQwzfZYSuoqQAyUxeqZKwqUD+dYsEqQWhTWSa41IWcfz+r8UD uD+pO03EbAEzBB+BSJL9xh0xnchdOnHbpTSFo1AJeQ+LUKanXxO2tXk9/3SwliMj M751vPtQH6DiVukSfywxJym+aaLbjeju6RvA69Atap51roYXPSsaCHGDwLeNUS+w 5D6KC0IuCxsliLpAD4c4PXkRdTvwnG/0lUt+s83bGnyfh5nQzfXXbBp3I5+o0Fcq jIQ37uH2Mj5MKtmC8o4ekCtNTtWM8QNZvKzMzoju616Wqee0Ag4= =RagE -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: "Fix task exposure order when forking tasks" * tag 'sched_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix yet more sched_fork() races | ||
|  Linus Torvalds | c1034d249d | pidfd.v5.17-rc4 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYhDKMgAKCRCRxhvAZXjc
 opUwAPwORv7MD8rh5va7LFWUxX1UFpxVILWcC1umuhHAOKZ7YAEA3DTYFrQZhxA2
 nMBR6hBEDKRARIFv3zHZYflYK97FnQA=
 =NPXo
 -----END PGP SIGNATURE-----
Merge tag 'pidfd.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd fix from Christian Brauner:
 "This fixes a problem reported by lockdep when installing a pidfd via
  fd_install() with siglock and the tasklisk write lock held in
  copy_process() when calling clone()/clone3() with CLONE_PIDFD.
  Originally a pidfd was created prior to holding any of these locks but
  this required a call to ksys_close(). So quite some time ago in
   | ||
|  Peter Zijlstra | b1e8206582 | sched: Fix yet more sched_fork() races Where commit | ||
|  Eric W. Biederman | 8f2f9c4d82 | ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1 Michal Koutný <mkoutny@suse.com> wrote: > It was reported that v5.14 behaves differently when enforcing > RLIMIT_NPROC limit, namely, it allows one more task than previously. > This is consequence of the commit | ||
|  Peter Zijlstra | a3d29e8291 | sched: Define and initialize a flag to identify valid PASID in the task Add a new single bit field to the task structure to track whether this task has initialized the IA32_PASID MSR to the mm's PASID. Initialize the field to zero when creating a new task with fork/clone. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220207230254.3342514-8-fenghua.yu@intel.com | ||
|  Fenghua Yu | 701fac4038 | iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit PASIDs are process-wide. It was attempted to use refcounted PASIDs to free them when the last thread drops the refcount. This turned out to be complex and error prone. Given the fact that the PASID space is 20 bits, which allows up to 1M processes to have a PASID associated concurrently, PASID resource exhaustion is not a realistic concern. Therefore, it was decided to simplify the approach and stick with lazy on demand PASID allocation, but drop the eager free approach and make an allocated PASID's lifetime bound to the lifetime of the process. Get rid of the refcounting mechanisms and replace/rename the interfaces to reflect this new approach. [ bp: Massage commit message. ] Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Joerg Roedel <jroedel@suse.de> Link: https://lore.kernel.org/r/20220207230254.3342514-6-fenghua.yu@intel.com | ||
|  Fenghua Yu | a6cbd44093 | kernel/fork: Initialize mm's PASID A new mm doesn't have a PASID yet when it's created. Initialize the mm's PASID on fork() or for init_mm to INVALID_IOASID (-1). INIT_PASID (0) is reserved for kernel legacy DMA PASID. It cannot be allocated to a user process. Initializing the process's PASID to 0 may cause confusion that's why the process uses the reserved kernel legacy DMA PASID. Initializing the PASID to INVALID_IOASID (-1) explicitly tells the process doesn't have a valid PASID yet. Even though the only user of mm_pasid_init() is in fork.c, define it in <linux/sched/mm.h> as the first of three mm/pasid life cycle functions (init/set/drop) to keep these all together. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220207230254.3342514-5-fenghua.yu@intel.com | ||
|  Fenghua Yu | 7a853c2d59 | mm: Change CONFIG option for mm->pasid field This currently depends on CONFIG_IOMMU_SUPPORT. But it is only needed when CONFIG_IOMMU_SVA option is enabled. Change the CONFIG guards around definition and initialization of mm->pasid field. Suggested-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20220207230254.3342514-3-fenghua.yu@intel.com | ||
|  Waiman Long | ddc204b517 | copy_process(): Move fd_install() out of sighand->siglock critical section I was made aware of the following lockdep splat: [ 2516.308763] ===================================================== [ 2516.309085] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected [ 2516.309433] 5.14.0-51.el9.aarch64+debug #1 Not tainted [ 2516.309703] ----------------------------------------------------- [ 2516.310149] stress-ng/153663 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 2516.310512] ffff0000e422b198 (&newf->file_lock){+.+.}-{2:2}, at: fd_install+0x368/0x4f0 [ 2516.310944] and this task is already holding: [ 2516.311248] ffff0000c08140d8 (&sighand->siglock){-.-.}-{2:2}, at: copy_process+0x1e2c/0x3e80 [ 2516.311804] which would create a new lock dependency: [ 2516.312066] (&sighand->siglock){-.-.}-{2:2} -> (&newf->file_lock){+.+.}-{2:2} [ 2516.312446] but this new dependency connects a HARDIRQ-irq-safe lock: [ 2516.312983] (&sighand->siglock){-.-.}-{2:2} : [ 2516.330700] Possible interrupt unsafe locking scenario: [ 2516.331075] CPU0 CPU1 [ 2516.331328] ---- ---- [ 2516.331580] lock(&newf->file_lock); [ 2516.331790] local_irq_disable(); [ 2516.332231] lock(&sighand->siglock); [ 2516.332579] lock(&newf->file_lock); [ 2516.332922] <Interrupt> [ 2516.333069] lock(&sighand->siglock); [ 2516.333291] *** DEADLOCK *** [ 2516.389845] stack backtrace: [ 2516.390101] CPU: 3 PID: 153663 Comm: stress-ng Kdump: loaded Not tainted 5.14.0-51.el9.aarch64+debug #1 [ 2516.390756] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 2516.391155] Call trace: [ 2516.391302] dump_backtrace+0x0/0x3e0 [ 2516.391518] show_stack+0x24/0x30 [ 2516.391717] dump_stack_lvl+0x9c/0xd8 [ 2516.391938] dump_stack+0x1c/0x38 [ 2516.392247] print_bad_irq_dependency+0x620/0x710 [ 2516.392525] check_irq_usage+0x4fc/0x86c [ 2516.392756] check_prev_add+0x180/0x1d90 [ 2516.392988] validate_chain+0x8e0/0xee0 [ 2516.393215] __lock_acquire+0x97c/0x1e40 [ 2516.393449] lock_acquire.part.0+0x240/0x570 [ 2516.393814] lock_acquire+0x90/0xb4 [ 2516.394021] _raw_spin_lock+0xe8/0x154 [ 2516.394244] fd_install+0x368/0x4f0 [ 2516.394451] copy_process+0x1f5c/0x3e80 [ 2516.394678] kernel_clone+0x134/0x660 [ 2516.394895] __do_sys_clone3+0x130/0x1f4 [ 2516.395128] __arm64_sys_clone3+0x5c/0x7c [ 2516.395478] invoke_syscall.constprop.0+0x78/0x1f0 [ 2516.395762] el0_svc_common.constprop.0+0x22c/0x2c4 [ 2516.396050] do_el0_svc+0xb0/0x10c [ 2516.396252] el0_svc+0x24/0x34 [ 2516.396436] el0t_64_sync_handler+0xa4/0x12c [ 2516.396688] el0t_64_sync+0x198/0x19c [ 2517.491197] NET: Registered PF_ATMPVC protocol family [ 2517.491524] NET: Registered PF_ATMSVC protocol family [ 2591.991877] sched: RT throttling activated One way to solve this problem is to move the fd_install() call out of the sighand->siglock critical section. Before commit | ||
|  Linus Torvalds | 35ce8ae9ae | Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull signal/exit/ptrace updates from Eric Biederman:
 "This set of changes deletes some dead code, makes a lot of cleanups
  which hopefully make the code easier to follow, and fixes bugs found
  along the way.
  The end-game which I have not yet reached yet is for fatal signals
  that generate coredumps to be short-circuit deliverable from
  complete_signal, for force_siginfo_to_task not to require changing
  userspace configured signal delivery state, and for the ptrace stops
  to always happen in locations where we can guarantee on all
  architectures that the all of the registers are saved and available on
  the stack.
  Removal of profile_task_ext, profile_munmap, and profile_handoff_task
  are the big successes for dead code removal this round.
  A bunch of small bug fixes are included, as most of the issues
  reported were small enough that they would not affect bisection so I
  simply added the fixes and did not fold the fixes into the changes
  they were fixing.
  There was a bug that broke coredumps piped to systemd-coredump. I
  dropped the change that caused that bug and replaced it entirely with
  something much more restrained. Unfortunately that required some
  rebasing.
  Some successes after this set of changes: There are few enough calls
  to do_exit to audit in a reasonable amount of time. The lifetime of
  struct kthread now matches the lifetime of struct task, and the
  pointer to struct kthread is no longer stored in set_child_tid. The
  flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is
  removed. Issues where task->exit_code was examined with
  signal->group_exit_code should been examined were fixed.
  There are several loosely related changes included because I am
  cleaning up and if I don't include them they will probably get lost.
  The original postings of these changes can be found at:
     https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.org
     https://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.org
     https://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org
  I trimmed back the last set of changes to only the obviously correct
  once. Simply because there was less time for review than I had hoped"
* 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits)
  ptrace/m68k: Stop open coding ptrace_report_syscall
  ptrace: Remove unused regs argument from ptrace_report_syscall
  ptrace: Remove second setting of PT_SEIZED in ptrace_attach
  taskstats: Cleanup the use of task->exit_code
  exit: Use the correct exit_code in /proc/<pid>/stat
  exit: Fix the exit_code for wait_task_zombie
  exit: Coredumps reach do_group_exit
  exit: Remove profile_handoff_task
  exit: Remove profile_task_exit & profile_munmap
  signal: clean up kernel-doc comments
  signal: Remove the helper signal_group_exit
  signal: Rename group_exit_task group_exec_task
  coredump: Stop setting signal->group_exit_task
  signal: Remove SIGNAL_GROUP_COREDUMP
  signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
  signal: Make coredump handling explicit in complete_signal
  signal: Have prepare_signal detect coredumps using signal->core_state
  signal: Have the oom killer detect coredumps using signal->core_state
  exit: Move force_uaccess back into do_exit
  exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit
  ... | ||
|  Linus Torvalds | f56caedaf9 | Merge branch 'akpm' (patches from Andrew) Merge misc updates from Andrew Morton: "146 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak, dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap, memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb, userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp, ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits) mm/damon: hide kernel pointer from tracepoint event mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging mm/damon/dbgfs: remove an unnecessary variable mm/damon: move the implementation of damon_insert_region to damon.h mm/damon: add access checking for hugetlb pages Docs/admin-guide/mm/damon/usage: update for schemes statistics mm/damon/dbgfs: support all DAMOS stats Docs/admin-guide/mm/damon/reclaim: document statistics parameters mm/damon/reclaim: provide reclamation statistics mm/damon/schemes: account how many times quota limit has exceeded mm/damon/schemes: account scheme actions that successfully applied mm/damon: remove a mistakenly added comment for a future feature Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning Docs/admin-guide/mm/damon/usage: remove redundant information Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks mm/damon: convert macro functions to static inline functions mm/damon: modify damon_rand() macro to static inline function mm/damon: move damon_rand() definition into damon.h ... | ||
|  Arnd Bergmann | 17fca131ce | mm: move anon_vma declarations to linux/mm_inline.h The patch to add anonymous vma names causes a build failure in some
configurations:
  include/linux/mm_types.h: In function 'is_same_vma_anon_name':
  include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
    924 |         return name && vma_name && !strcmp(name, vma_name);
        |                                     ^~~~~~
  include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
This should not really be part of linux/mm_types.h in the first place,
as that header is meant to only contain structure defintions and need a
minimum set of indirect includes itself.
While the header clearly includes more than it should at this point,
let's not make it worse by including string.h as well, which would pull
in the expensive (compile-speed wise) fortify-string logic.
Move the new functions into a separate header that only needs to be
included in a couple of locations.
Link: https://lkml.kernel.org/r/20211207125710.2503446-1-arnd@kernel.org
Fixes: "mm: add a field to store names for private anonymous memory"
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Colin Cross <ccross@google.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Colin Cross | 9a10064f56 | mm: add a field to store names for private anonymous memory In many userspace applications, and especially in VM based applications like Android uses heavily, there are multiple different allocators in use. At a minimum there is libc malloc and the stack, and in many cases there are libc malloc, the stack, direct syscalls to mmap anonymous memory, and multiple VM heaps (one for small objects, one for big objects, etc.). Each of these layers usually has its own tools to inspect its usage; malloc by compiling a debug version, the VM through heap inspection tools, and for direct syscalls there is usually no way to track them. On Android we heavily use a set of tools that use an extended version of the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped in userspace and slice their usage by process, shared (COW) vs. unique mappings, backing, etc. This can account for real physical memory usage even in cases like fork without exec (which Android uses heavily to share as many private COW pages as possible between processes), Kernel SamePage Merging, and clean zero pages. It produces a measurement of the pages that only exist in that process (USS, for unique), and a measurement of the physical memory usage of that process with the cost of shared pages being evenly split between processes that share them (PSS). If all anonymous memory is indistinguishable then figuring out the real physical memory usage (PSS) of each heap requires either a pagemap walking tool that can understand the heap debugging of every layer, or for every layer's heap debugging tools to implement the pagemap walking logic, in which case it is hard to get a consistent view of memory across the whole system. Tracking the information in userspace leads to all sorts of problems. It either needs to be stored inside the process, which means every process has to have an API to export its current heap information upon request, or it has to be stored externally in a filesystem that somebody needs to clean up on crashes. It needs to be readable while the process is still running, so it has to have some sort of synchronization with every layer of userspace. Efficiently tracking the ranges requires reimplementing something like the kernel vma trees, and linking to it from every layer of userspace. It requires more memory, more syscalls, more runtime cost, and more complexity to separately track regions that the kernel is already tracking. This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a userspace-provided name for anonymous vmas. The names of named anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>]. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name) Setting the name to NULL clears it. The name length limit is 80 bytes including NUL-terminator and is checked to contain only printable ascii characters (including space), except '[',']','\','$' and '`'. Ascii strings are being used to have a descriptive identifiers for vmas, which can be understood by the users reading /proc/pid/maps or /proc/pid/smaps. Names can be standardized for a given system and they can include some variable parts such as the name of the allocator or a library, tid of the thread using it, etc. The name is stored in a pointer in the shared union in vm_area_struct that points to a null terminated string. Anonymous vmas with the same name (equivalent strings) and are otherwise mergeable will be merged. The name pointers are not shared between vmas even if they contain the same name. The name pointer is stored in a union with fields that are only used on file-backed mappings, so it does not increase memory usage. CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this feature. It keeps the feature disabled by default to prevent any additional memory overhead and to avoid confusing procfs parsers on systems which are not ready to support named anonymous vmas. The patch is based on the original patch developed by Colin Cross, more specifically on its latest version [1] posted upstream by Sumit Semwal. It used a userspace pointer to store vma names. In that design, name pointers could be shared between vmas. However during the last upstreaming attempt, Kees Cook raised concerns [2] about this approach and suggested to copy the name into kernel memory space, perform validity checks [3] and store as a string referenced from vm_area_struct. One big concern is about fork() performance which would need to strdup anonymous vma names. Dave Hansen suggested experimenting with worst-case scenario of forking a process with 64k vmas having longest possible names [4]. I ran this experiment on an ARM64 Android device and recorded a worst-case regression of almost 40% when forking such a process. This regression is addressed in the followup patch which replaces the pointer to a name with a refcounted structure that allows sharing the name pointer between vmas of the same name. Instead of duplicating the string during fork() or when splitting a vma it increments the refcount. [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/ [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/ [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/ [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/ Changes for prctl(2) manual page (in the options section): PR_SET_VMA Sets an attribute specified in arg2 for virtual memory areas starting from the address specified in arg3 and spanning the size specified in arg4. arg5 specifies the value of the attribute to be set. Note that assigning an attribute to a virtual memory area might prevent it from being merged with adjacent virtual memory areas due to the difference in that attribute's value. Currently, arg2 must be one of: PR_SET_VMA_ANON_NAME Set a name for anonymous virtual memory areas. arg5 should be a pointer to a null-terminated string containing the name. The name length including null byte cannot exceed 80 bytes. If arg5 is NULL, the name of the appropriate anonymous virtual memory areas will be reset. The name can contain only printable ascii characters (including space), except '[',']','\','$' and '`'. This feature is available only if the kernel is built with the CONFIG_ANON_VMA_NAME option enabled. [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table] Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy, added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the work here was done by Colin Cross, therefore, with his permission, keeping him as the author] Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com Signed-off-by: Colin Cross <ccross@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Glauber <jan.glauber@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Landley <rob@landley.net> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Shaohua Li <shli@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Eric W. Biederman | 2873cd31a2 | exit: Remove profile_handoff_task All profile_handoff_task does is notify the task_free_notifier chain. The helpers task_handoff_register and task_handoff_unregister are used to add and delete entries from that chain and are never called. So remove the dead code and make it much easier to read and reason about __put_task_struct. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/87fspyw6m0.fsf@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Eric W. Biederman | e32cf5dfbe | kthread: Generalize pf_io_worker so it can point to struct kthread The point of using set_child_tid to hold the kthread pointer was that it already did what is necessary. There are now restrictions on when set_child_tid can be initialized and when set_child_tid can be used in schedule_tail. Which indicates that continuing to use set_child_tid to hold the kthread pointer is a bad idea. Instead of continuing to use the set_child_tid field of task_struct generalize the pf_io_worker field of task_struct and use it to hold the kthread pointer. Rename pf_io_worker (which is a void * pointer) to worker_private so it can be used to store kthreads struct kthread pointer. Update the kthread code to store the kthread pointer in the worker_private field. Remove the places where set_child_tid had to be dealt with carefully because kthreads also used it. Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com Link: https://lkml.kernel.org/r/87a6grvqy8.fsf_-_@email.froward.int.ebiederm.org Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Eric W. Biederman | ff8288ff47 | fork: Rename bad_fork_cleanup_threadgroup_lock to bad_fork_cleanup_delayacct I just fixed a bug in copy_process when using the label
bad_fork_cleanup_threadgroup_lock.  While fixing the bug I looked
closer at the label and realized it has been misnamed since
 | ||
|  Eric W. Biederman | 6692c98c7d | fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA Mark Brown <broonie@kernel.org> reported:
> This is also causing further build errors including but not limited to:
>
> /tmp/next/build/kernel/fork.c: In function 'copy_process':
> /tmp/next/build/kernel/fork.c:2106:4: error: label 'bad_fork_cleanup_threadgroup_lock' used but not defined
>  2106 |    goto bad_fork_cleanup_threadgroup_lock;
>       |    ^~~~
It turns out that I messed up and was depending upon a label protected
by an ifdef.  Move the label out of the ifdef as the ifdef around the label
no longer makes sense (if it ever did).
Link: https://lkml.kernel.org/r/YbugCP144uxXvRsk@sirena.org.uk
Fixes:  | ||
|  Eric W. Biederman | 40966e316f | kthread: Ensure struct kthread is present for all kthreads Today the rules are a bit iffy and arbitrary about which kernel threads have struct kthread present. Both idle threads and thread started with create_kthread want struct kthread present so that is effectively all kernel threads. Make the rule that if PF_KTHREAD and the task is running then struct kthread is present. This will allow the kernel thread code to using tsk->exit_code with different semantics from ordinary processes. To make ensure that struct kthread is present for all kernel threads move it's allocation into copy_process. Add a deallocation of struct kthread in exec for processes that were kernel threads. Move the allocation of struct kthread for the initial thread earlier so that it is not repeated for each additional idle thread. Move the initialization of struct kthread into set_kthread_struct so that the structure is always and reliably initailized. Clear set_child_tid in free_kthread_struct to ensure the kthread struct is reliably freed during exec. The function free_kthread_struct does not need to clear vfork_done during exec as exec_mm_release called from exec_mmap has already cleared vfork_done. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Christoph Hellwig | 88c9a2ce52 | fork: move copy_io to block/blk-ioc.c Move the copying of the I/O context to the block layer as that is where we can use the proper low-level interfaces. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211126115817.2087431-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> | ||
|  Linus Torvalds | 622c72b651 | A single fix for POSIX CPU timers to address a problem where POSIX CPU timer delivery stops working for a new child task because copy_process() copies state information which is only valid for the parent task. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGRDVUTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYocOFD/42NOdli73N+Jdq7APHUIHXzu+6DVT6 CI5toLQw+0KPoF0s1wg4+J0YCDt2k0Pu4lOabF3Ze/+c6RlR5zfCXESqsXdHaCjh E91Vs57u0ataRMEHo6KB6eBIutuF8hyxfY6vVXfkTRNAreUIWiwWYrlB0G64JVOG +/l1W7adovjLcLwcW+ArrnLJwkBKtXunK6PVv2IrdRHwpMHbwoNRCCCFvzkqnWmQ 4Yy2/NaB/PEBK5kezP1/j9EMcGCTWk1JJIm+l/PEwCCcbIgIdUahpW3XHAaqms6R oukqCvE5ukfmVzBFYBhCamhF8heyEeBVRqGU+Yyk48+I+DQFBCqaqa1NKSuEUdNL Nycy6Rp1yn7CHVSB461shMS6NJGOSNDBjv7vxer3WjV3HPJu7y0RrN7jXbkSfQnm hVKjkmbDEYwylgzFE5+T857NqD5MEXeuIBtTO08hNRnpd61aB3x+qq+8ElE6ST8Y pm6rMzw0AZ5buPK8QdGVDk0dD4WKObj1LzmRZvBtYeWynO6sxyKUl6B2CgAxrvn5 D1Li2/arkJMCVeIuIL5uE6DPoxSh8J7OuEC4KeWX8M8xQSEDImqfZ+tDL2Esv6jv xDmymq584hiCBc1CJjCOA9kZYe6KNXC7lkVOns6GaKKzLhkrcvUR3dUGhMyzxAMO t9QIAinR6JwRRA== =EBbc -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix for POSIX CPU timers to address a problem where POSIX CPU timer delivery stops working for a new child task because copy_process() copies state information which is only valid for the parent task" * tag 'timers-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Clear task::posix_cputimers_work in copy_process() | ||
|  Linus Torvalds | 59a2ceeef6 | Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: "87 patches. Subsystems affected by this patch series: mm (pagecache and hugetlb), procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs, init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork, sysvfs, kcov, gdb, resource, selftests, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits) ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL ipc: check checkpoint_restore_ns_capable() to modify C/R proc files selftests/kselftest/runner/run_one(): allow running non-executable files virtio-mem: disallow mapping virtio-mem memory via /dev/mem kernel/resource: disallow access to exclusive system RAM regions kernel/resource: clean up and optimize iomem_is_exclusive() scripts/gdb: handle split debug for vmlinux kcov: replace local_irq_save() with a local_lock_t kcov: avoid enable+disable interrupts if !in_task() kcov: allocate per-CPU memory on the relevant node Documentation/kcov: define `ip' in the example Documentation/kcov: include types.h in the example sysv: use BUILD_BUG_ON instead of runtime check kernel/fork.c: unshare(): use swap() to make code cleaner seq_file: fix passing wrong private data seq_file: move seq_escape() to a header signal: remove duplicate include in signal.h crash_dump: remove duplicate include in crash_dump.h crash_dump: fix boolreturn.cocci warning hfs/hfsplus: use WARN_ON for sanity check ... | ||
|  Ran Xiaokai | ba1f70ddd1 | kernel/fork.c: unshare(): use swap() to make code cleaner Use swap() instead of reimplementing it. Link: https://lkml.kernel.org/r/20210909022046.8151-1-ran.xiaokai@zte.com.cn Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Gabriel Krisman Bertazi <krisman@collabora.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexey Gladkov <legion@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | a602285ac1 | Merge branch 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull per signal_struct coredumps from Eric Biederman: "Current coredumps are mixed up with the exit code, the signal handling code, and the ptrace code making coredumps much more complicated than necessary and difficult to follow. This series of changes starts with ptrace_stop and cleans it up, making it easier to follow what is happening in ptrace_stop. Then cleans up the exec interactions with coredumps. Then cleans up the coredump interactions with exit. Finally the coredump interactions with the signal handling code is cleaned up. The first and last changes are bug fixes for minor bugs. I believe the fact that vfork followed by execve can kill the process the called vfork if exec fails is sufficient justification to change the userspace visible behavior. In previous discussions some of these changes were organized differently and individually appeared to make the code base worse. As currently written I believe they all stand on their own as cleanups and bug fixes. Which means that even if the worst should happen and the last change needs to be reverted for some unimaginable reason, the code base will still be improved. If the worst does not happen there are a more cleanups that can be made. Signals that generate coredumps can easily become eligible for short circuit delivery in complete_signal. The entire rendezvous for generating a coredump can move into get_signal. The function force_sig_info_to_task be written in a way that does not modify the signal handling state of the target task (because coredumps are eligible for short circuit delivery). Many of these future cleanups can be done another way but nothing so cleanly as if coredumps become per signal_struct" * 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: coredump: Limit coredumps to a single thread group coredump: Don't perform any cleanups before dumping core exit: Factor coredump_exit_mm out of exit_mm exec: Check for a pending fatal signal instead of core_state ptrace: Remove the unnecessary arguments from arch_ptrace_stop signal: Remove the bogus sigkill_pending in ptrace_stop | ||
|  Michael Pratt | ca7752caea | posix-cpu-timers: Clear task::posix_cputimers_work in copy_process() copy_process currently copies task_struct.posix_cputimers_work as-is. If a
timer interrupt arrives while handling clone and before dup_task_struct
completes then the child task will have:
1. posix_cputimers_work.scheduled = true
2. posix_cputimers_work.work queued.
copy_process clears task_struct.task_works, so (2) will have no effect and
posix_cpu_timers_work will never run (not to mention it doesn't make sense
for two tasks to share a common linked list).
Since posix_cpu_timers_work never runs, posix_cputimers_work.scheduled is
never cleared. Since scheduled is set, future timer interrupts will skip
scheduling work, with the ultimate result that the task will never receive
timer expirations.
Together, the complete flow is:
1. Task 1 calls clone(), enters kernel.
2. Timer interrupt fires, schedules task work on Task 1.
   2a. task_struct.posix_cputimers_work.scheduled = true
   2b. task_struct.posix_cputimers_work.work added to
       task_struct.task_works.
3. dup_task_struct() copies Task 1 to Task 2.
4. copy_process() clears task_struct.task_works for Task 2.
5. Future timer interrupts on Task 2 see
   task_struct.posix_cputimers_work.scheduled = true and skip scheduling
   work.
Fix this by explicitly clearing contents of task_struct.posix_cputimers_work
in copy_process(). This was never meant to be shared or inherited across
tasks in the first place.
Fixes:  | ||
|  Linus Torvalds | 9a7e0a90a4 | Scheduler updates: - Revert the printk format based wchan() symbol resolution as it can leak
    the raw value in case that the symbol is not resolvable.
 
  - Make wchan() more robust and work with all kind of unwinders by
    enforcing that the task stays blocked while unwinding is in progress.
 
  - Prevent sched_fork() from accessing an invalid sched_task_group
 
  - Improve asymmetric packing logic
 
  - Extend scheduler statistics to RT and DL scheduling classes and add
    statistics for bandwith burst to the SCHED_FAIR class.
 
  - Properly account SCHED_IDLE entities
 
  - Prevent a potential deadlock when initial priority is assigned to a
    newly created kthread. A recent change to plug a race between cpuset and
    __sched_setscheduler() introduced a new lock dependency which is now
    triggered. Break the lock dependency chain by moving the priority
    assignment to the thread function.
 
  - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
 
  - Improve idle balancing in general and especially for NOHZ enabled
    systems.
 
  - Provide proper interfaces for live patching so it does not have to
    fiddle with scheduler internals.
 
  - Add cluster aware scheduling support.
 
  - A small set of tweaks for RT (irqwork, wait_task_inactive(), various
    scheduler options and delaying mmdrop)
 
  - The usual small tweaks and improvements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/OUkTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoR/5D/9ikdGNpKg9osNqJ3GjAmxsK6kVkB29
 iFe2k8pIpWDToWQf/wQRGih4Yj3Cl49QSnZcPIibh2/12EB1qrrW6iSPJkInz8Ec
 /1LS5/Vewn2OyoxyXZjdvGC5gTXEodSbIazASvX7nvdMeI4gsAsL5etzrMJirT/t
 aymqvr7zovvywrwMTQJrGjUMo9l4ewE8tafMNNhRu1BHU1U4ojM9yvThyRAAcmp7
 3Xy49A+Yq3IgrvYI4u8FMK5Zh08KaxSFjiLhePGm/bF+wSfYmWop2TP1jY05W2Uo
 ti8hfbJMUoFRYuMxAiEldkItnc0wV4M9PtWZZ/x+B71bs65Y4Zjt9cW+rxJv2+m1
 vzV31EsQwGnOti072dzWN4c/cZqngVXAjaNtErvDwJUr+Tw1ayv9KUvuodMQqZY6
 mu68bFUO2kV9EMe1CBOv51Uy1RGHyLj3rlNqrkw+Xp5ISE9Ad2vhUEiRp5bQx5Ci
 V/XFhGZkGUluh0vccrdFlNYZwhj8cZEzkOPCnPSeZ+bq8SyZE6xuHH/lTP1CJCOy
 s800rW1huM+kgV+zRN8adDkGXibAk9N3RtVGnQXmuEy8gB9LZmQg+JeM2wsc9B+6
 i0gdqZnsjNAfoK+BBAG4holxptSL8/eOJsFH8ZNIoxQ+iqooyPx9tFX7yXnRTBQj
 d2qWG7UvoseT+g==
 =fgtS
 -----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
 - Revert the printk format based wchan() symbol resolution as it can
   leak the raw value in case that the symbol is not resolvable.
 - Make wchan() more robust and work with all kind of unwinders by
   enforcing that the task stays blocked while unwinding is in progress.
 - Prevent sched_fork() from accessing an invalid sched_task_group
 - Improve asymmetric packing logic
 - Extend scheduler statistics to RT and DL scheduling classes and add
   statistics for bandwith burst to the SCHED_FAIR class.
 - Properly account SCHED_IDLE entities
 - Prevent a potential deadlock when initial priority is assigned to a
   newly created kthread. A recent change to plug a race between cpuset
   and __sched_setscheduler() introduced a new lock dependency which is
   now triggered. Break the lock dependency chain by moving the priority
   assignment to the thread function.
 - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
 - Improve idle balancing in general and especially for NOHZ enabled
   systems.
 - Provide proper interfaces for live patching so it does not have to
   fiddle with scheduler internals.
 - Add cluster aware scheduling support.
 - A small set of tweaks for RT (irqwork, wait_task_inactive(), various
   scheduler options and delaying mmdrop)
 - The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
  sched/fair: Cleanup newidle_balance
  sched/fair: Remove sysctl_sched_migration_cost condition
  sched/fair: Wait before decaying max_newidle_lb_cost
  sched/fair: Skip update_blocked_averages if we are defering load balance
  sched/fair: Account update_blocked_averages in newidle_balance cost
  x86: Fix __get_wchan() for !STACKTRACE
  sched,x86: Fix L2 cache mask
  sched/core: Remove rq_relock()
  sched: Improve wake_up_all_idle_cpus() take #2
  irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
  irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
  irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
  sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
  sched: Add cluster scheduler level for x86
  sched: Add cluster scheduler level in core and related Kconfig for ARM64
  topology: Represent clusters of CPUs within a die
  sched: Disable -Wunused-but-set-variable
  sched: Add wrapper for get_wchan() to keep task blocked
  x86: Fix get_wchan() to support the ORC unwinder
  proc: Use task_is_running() for wchan in /proc/$pid/stat
  ... | ||
|  Christoph Hellwig | 545c6647d2 | kernel: remove spurious blkdev.h includes Various files have acquired spurious includes of <linux/blkdev.h> over time. Remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> | ||
|  Zhang Qiao | 4ef0c5c6b5 | kernel/sched: Fix sched_fork() access an invalid sched_task_group There is a small race between copy_process() and sched_fork()
where child->sched_task_group point to an already freed pointer.
	parent doing fork()      | someone moving the parent
				 | to another cgroup
  -------------------------------+-------------------------------
  copy_process()
      + dup_task_struct()<1>
				  parent move to another cgroup,
				  and free the old cgroup. <2>
      + sched_fork()
	+ __set_task_cpu()<3>
	+ task_fork_fair()
	  + sched_slice()<4>
In the worst case, this bug can lead to "use-after-free" and
cause panic as shown above:
  (1) parent copy its sched_task_group to child at <1>;
  (2) someone move the parent to another cgroup and free the old
      cgroup at <2>;
  (3) the sched_task_group and cfs_rq that belong to the old cgroup
      will be accessed at <3> and <4>, which cause a panic:
  [] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
  [] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0
  [] Oops: 0000 [#1] SMP PTI
  [] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0.x86_64+ #1
  [] RIP: 0010:sched_slice+0x84/0xc0
  [] Call Trace:
  []  task_fork_fair+0x81/0x120
  []  sched_fork+0x132/0x240
  []  copy_process.part.5+0x675/0x20e0
  []  ? __handle_mm_fault+0x63f/0x690
  []  _do_fork+0xcd/0x3b0
  []  do_syscall_64+0x5d/0x1d0
  []  entry_SYSCALL_64_after_hwframe+0x65/0xca
  [] RIP: 0033:0x7f04418cd7e1
Between cgroup_can_fork() and cgroup_post_fork(), the cgroup
membership and thus sched_task_group can't change. So update child's
sched_task_group at sched_post_fork() and move task_fork() and
__set_task_cpu() (where accees the sched_task_group) from sched_fork()
to sched_post_fork().
Fixes:  | ||
|  Eric W. Biederman | 0258b5fd7c | coredump: Limit coredumps to a single thread group Today when a signal is delivered with a handler of SIG_DFL whose
default behavior is to generate a core dump not only that process but
every process that shares the mm is killed.
In the case of vfork this looks like a real world problem.  Consider
the following well defined sequence.
	if (vfork() == 0) {
		execve(...);
		_exit(EXIT_FAILURE);
	}
If a signal that generates a core dump is received after vfork but
before the execve changes the mm the process that called vfork will
also be killed (as the mm is shared).
Similarly if the execve fails after the point of no return the kernel
delivers SIGSEGV which will kill both the exec'ing process and because
the mm is shared the process that called vfork as well.
As far as I can tell this behavior is a violation of people's
reasonable expectations, POSIX, and is unnecessarily fragile when the
system is low on memory.
Solve this by making a userspace visible change to only kill a single
process/thread group.  This is possible because Jann Horn recently
modified[1] the coredump code so that the mm can safely be modified
while the coredump is happening.  With LinuxThreads long gone I don't
expect anyone to have a notice this behavior change in practice.
To accomplish this move the core_state pointer from mm_struct to
signal_struct, which allows different thread groups to coredump
simultatenously.
In zap_threads remove the work to kill anything except for the current
thread group.
v2: Remove core_state from the VM_BUG_ON_MM print to fix
    compile failure when CONFIG_DEBUG_VM is enabled.
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
[1]  | ||
|  Eric W. Biederman | 9230738308 | coredump:  Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace "may_ptrace_stop()" with a simple test of "current->ptrace". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> | ||
|  Liu Zixian | 13db8c5047 | mm/hugetlb: initialize hugetlb_usage in mm_init After fork, the child process will get incorrect (2x) hugetlb_usage.  If
a process uses 5 2MB hugetlb pages in an anonymous mapping,
	HugetlbPages:	   10240 kB
and then forks, the child will show,
	HugetlbPages:	   20480 kB
The reason for double the amount is because hugetlb_usage will be copied
from the parent and then increased when we copy page tables from parent
to child.  Child will have 2x actual usage.
Fix this by adding hugetlb_count_init in mm_init.
Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
Fixes:  | ||
|  Linus Torvalds | 2d338201d5 | Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton:
 "147 patches, based on  | ||
|  Christoph Hellwig | 05da8113c9 | kernel/fork.c: unexport get_{mm,task}_exe_file Only used by core code and the tomoyo which can't be a module either. Link: https://lkml.kernel.org/r/20210820095430.445242-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | 49624efa65 | Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux Pull MAP_DENYWRITE removal from David Hildenbrand:
 "Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
  VM_DENYWRITE.
  There are some (minor) user-visible changes:
   - We no longer deny write access to shared libaries loaded via legacy
     uselib(); this behavior matches modern user space e.g. dlopen().
   - We no longer deny write access to the elf interpreter after exec
     completed, treating it just like shared libraries (which it often
     is).
   - We always deny write access to the file linked via /proc/pid/exe:
     sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
     file cannot be denied, and write access to the file will remain
     denied until the link is effectivel gone (exec, termination,
     sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
  Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
  s390x, ...) and verified via ltp that especially the relevant tests
  (i.e., creat07 and execve04) continue working as expected"
* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
  fs: update documentation of get_write_access() and friends
  mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
  mm: remove VM_DENYWRITE
  binfmt: remove in-tree usage of MAP_DENYWRITE
  kernel/fork: always deny write access to current MM exe_file
  kernel/fork: factor out replacing the current MM exe_file
  binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib() | ||
|  David Hildenbrand | 8d0920bde5 | mm: remove VM_DENYWRITE All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be set from user space, so all users are gone; let's remove it. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com> | ||
|  David Hildenbrand | fe69d560b5 | kernel/fork: always deny write access to current MM exe_file We want to remove VM_DENYWRITE only currently only used when mapping the executable during exec. During exec, we already deny_write_access() the executable, however, after exec completes the VMAs mapped with VM_DENYWRITE effectively keeps write access denied via deny_write_access(). Let's deny write access when setting or replacing the MM exe_file. With this change, we can remove VM_DENYWRITE for mapping executables. Make set_mm_exe_file() return an error in case deny_write_access() fails; note that this should never happen, because exec code does a deny_write_access() early and keeps write access denied when calling set_mm_exe_file. However, it makes the code easier to read and makes set_mm_exe_file() and replace_mm_exe_file() look more similar. This represents a minor user space visible change: sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file cannot be opened writable. Note that we can already fail with -EACCES if the file doesn't have execute permissions. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com> | ||
|  David Hildenbrand | 35d7bdc860 | kernel/fork: factor out replacing the current MM exe_file Let's factor the main logic out into replace_mm_exe_file(), such that all mm->exe_file logic is contained in kernel/fork.c. While at it, perform some simple cleanups that are possible now that we're simplifying the individual functions. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com> | ||
|  Linus Torvalds | 9e9fb7655e | Core: - Enable memcg accounting for various networking objects.
 
 BPF:
 
  - Introduce bpf timers.
 
  - Add perf link and opaque bpf_cookie which the program can read
    out again, to be used in libbpf-based USDT library.
 
  - Add bpf_task_pt_regs() helper to access user space pt_regs
    in kprobes, to help user space stack unwinding.
 
  - Add support for UNIX sockets for BPF sockmap.
 
  - Extend BPF iterator support for UNIX domain sockets.
 
  - Allow BPF TCP congestion control progs and bpf iterators to call
    bpf_setsockopt(), e.g. to switch to another congestion control
    algorithm.
 
 Protocols:
 
  - Support IOAM Pre-allocated Trace with IPv6.
 
  - Support Management Component Transport Protocol.
 
  - bridge: multicast: add vlan support.
 
  - netfilter: add hooks for the SRv6 lightweight tunnel driver.
 
  - tcp:
     - enable mid-stream window clamping (by user space or BPF)
     - allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
     - more accurate DSACK processing for RACK-TLP
 
  - mptcp:
     - add full mesh path manager option
     - add partial support for MP_FAIL
     - improve use of backup subflows
     - optimize option processing
 
  - af_unix: add OOB notification support.
 
  - ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by
          the router.
 
  - mac80211: Target Wake Time support in AP mode.
 
  - can: j1939: extend UAPI to notify about RX status.
 
 Driver APIs:
 
  - Add page frag support in page pool API.
 
  - Many improvements to the DSA (distributed switch) APIs.
 
  - ethtool: extend IRQ coalesce uAPI with timer reset modes.
 
  - devlink: control which auxiliary devices are created.
 
  - Support CAN PHYs via the generic PHY subsystem.
 
  - Proper cross-chip support for tag_8021q.
 
  - Allow TX forwarding for the software bridge data path to be
    offloaded to capable devices.
 
 Drivers:
 
  - veth: more flexible channels number configuration.
 
  - openvswitch: introduce per-cpu upcall dispatch.
 
  - Add internet mix (IMIX) mode to pktgen.
 
  - Transparently handle XDP operations in the bonding driver.
 
  - Add LiteETH network driver.
 
  - Renesas (ravb):
    - support Gigabit Ethernet IP
 
  - NXP Ethernet switch (sja1105)
    - fast aging support
    - support for "H" switch topologies
    - traffic termination for ports under VLAN-aware bridge
 
  - Intel 1G Ethernet
     - support getcrosststamp() with PCIe PTM (Precision Time
       Measurement) for better time sync
     - support Credit-Based Shaper (CBS) offload, enabling HW traffic
       prioritization and bandwidth reservation
 
  - Broadcom Ethernet (bnxt)
     - support pulse-per-second output
     - support larger Rx rings
 
  - Mellanox Ethernet (mlx5)
     - support ethtool RSS contexts and MQPRIO channel mode
     - support LAG offload with bridging
     - support devlink rate limit API
     - support packet sampling on tunnels
 
  - Huawei Ethernet (hns3):
     - basic devlink support
     - add extended IRQ coalescing support
     - report extended link state
 
  - Netronome Ethernet (nfp):
     - add conntrack offload support
 
  - Broadcom WiFi (brcmfmac):
     - add WPA3 Personal with FT to supported cipher suites
     - support 43752 SDIO device
 
  - Intel WiFi (iwlwifi):
     - support scanning hidden 6GHz networks
     - support for a new hardware family (Bz)
 
  - Xen pv driver:
     - harden netfront against malicious backends
 
  - Qualcomm mobile
     - ipa: refactor power management and enable automatic suspend
     - mhi: move MBIM to WWAN subsystem interfaces
 
 Refactor:
 
  - Ambient BPF run context and cgroup storage cleanup.
 
  - Compat rework for ndo_ioctl.
 
 Old code removal:
 
  - prism54 remove the obsoleted driver, deprecated by the p54 driver.
 
  - wan: remove sbni/granch driver.
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmEukBYACgkQMUZtbf5S
 IrsyHA//TO8dw18NYts4n9LmlJT2naJ7yBUUSSXK/M+DtW0MQ9nnHhqzPm5uJdRl
 IgQTNJrW3dYzRwgqaWZqEwO1t5/FI+f87ND1Nsekg7x9tF66a6ov5WxU26TwwSba
 U+si/inQ/4chuQ+LxMQobqCDxaLE46I2dIoRl+YfndJ24DRzYSwAEYIPPbSdfyU+
 +/l+3s4GaxO4k/hLciPAiOniyxLoUNiGUTNh+2yqRBXelSRJRKVnl+V22ANFrxRW
 nTEiplfVKhlPU1e4iLuRtaxDDiePHhw9I3j/lMHhfeFU2P/gKJIvz4QpGV0CAZg2
 1VvDU32WEx1GQLXJbKm0KwoNRUq1QSjOyyFti+BO7ugGaYAR4gKhShOqlSYLzUtB
 tbtzQhSNLWOGqgmSJOztZb5kFDm2EdRSll5/lP2uyFlPkIsIp0QbscJVzNTnS74b
 Xz15ZOw41Z4TfWPEMWgfrx6Zkm7pPWkly+7WfUkPcHa1gftNz6tzXXxSXcXIBPdi
 yQ5JCzzxrM5573YHuk5YedwZpn6PiAt4A/muFGk9C6aXP60TQAOS/ppaUzZdnk4D
 NfOk9mj06WEULjYjPcKEuT3GGWE6kmjb8Pu0QZWKOchv7vr6oZly1EkVZqYlXELP
 AfhcrFeuufie8mqm0jdb4LnYaAnqyLzlb1J4Zxh9F+/IX7G3yoc=
 =JDGD
 -----END PGP SIGNATURE-----
Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "Core:
   - Enable memcg accounting for various networking objects.
  BPF:
   - Introduce bpf timers.
   - Add perf link and opaque bpf_cookie which the program can read out
     again, to be used in libbpf-based USDT library.
   - Add bpf_task_pt_regs() helper to access user space pt_regs in
     kprobes, to help user space stack unwinding.
   - Add support for UNIX sockets for BPF sockmap.
   - Extend BPF iterator support for UNIX domain sockets.
   - Allow BPF TCP congestion control progs and bpf iterators to call
     bpf_setsockopt(), e.g. to switch to another congestion control
     algorithm.
  Protocols:
   - Support IOAM Pre-allocated Trace with IPv6.
   - Support Management Component Transport Protocol.
   - bridge: multicast: add vlan support.
   - netfilter: add hooks for the SRv6 lightweight tunnel driver.
   - tcp:
       - enable mid-stream window clamping (by user space or BPF)
       - allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
       - more accurate DSACK processing for RACK-TLP
   - mptcp:
       - add full mesh path manager option
       - add partial support for MP_FAIL
       - improve use of backup subflows
       - optimize option processing
   - af_unix: add OOB notification support.
   - ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by the
     router.
   - mac80211: Target Wake Time support in AP mode.
   - can: j1939: extend UAPI to notify about RX status.
  Driver APIs:
   - Add page frag support in page pool API.
   - Many improvements to the DSA (distributed switch) APIs.
   - ethtool: extend IRQ coalesce uAPI with timer reset modes.
   - devlink: control which auxiliary devices are created.
   - Support CAN PHYs via the generic PHY subsystem.
   - Proper cross-chip support for tag_8021q.
   - Allow TX forwarding for the software bridge data path to be
     offloaded to capable devices.
  Drivers:
   - veth: more flexible channels number configuration.
   - openvswitch: introduce per-cpu upcall dispatch.
   - Add internet mix (IMIX) mode to pktgen.
   - Transparently handle XDP operations in the bonding driver.
   - Add LiteETH network driver.
   - Renesas (ravb):
       - support Gigabit Ethernet IP
   - NXP Ethernet switch (sja1105):
       - fast aging support
       - support for "H" switch topologies
       - traffic termination for ports under VLAN-aware bridge
   - Intel 1G Ethernet
       - support getcrosststamp() with PCIe PTM (Precision Time
         Measurement) for better time sync
       - support Credit-Based Shaper (CBS) offload, enabling HW traffic
         prioritization and bandwidth reservation
   - Broadcom Ethernet (bnxt)
       - support pulse-per-second output
       - support larger Rx rings
   - Mellanox Ethernet (mlx5)
       - support ethtool RSS contexts and MQPRIO channel mode
       - support LAG offload with bridging
       - support devlink rate limit API
       - support packet sampling on tunnels
   - Huawei Ethernet (hns3):
       - basic devlink support
       - add extended IRQ coalescing support
       - report extended link state
   - Netronome Ethernet (nfp):
       - add conntrack offload support
   - Broadcom WiFi (brcmfmac):
       - add WPA3 Personal with FT to supported cipher suites
       - support 43752 SDIO device
   - Intel WiFi (iwlwifi):
       - support scanning hidden 6GHz networks
       - support for a new hardware family (Bz)
   - Xen pv driver:
       - harden netfront against malicious backends
   - Qualcomm mobile
       - ipa: refactor power management and enable automatic suspend
       - mhi: move MBIM to WWAN subsystem interfaces
  Refactor:
   - Ambient BPF run context and cgroup storage cleanup.
   - Compat rework for ndo_ioctl.
  Old code removal:
   - prism54 remove the obsoleted driver, deprecated by the p54 driver.
   - wan: remove sbni/granch driver"
* tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1715 commits)
  net: Add depends on OF_NET for LiteX's LiteETH
  ipv6: seg6: remove duplicated include
  net: hns3: remove unnecessary spaces
  net: hns3: add some required spaces
  net: hns3: clean up a type mismatch warning
  net: hns3: refine function hns3_set_default_feature()
  ipv6: remove duplicated 'net/lwtunnel.h' include
  net: w5100: check return value after calling platform_get_resource()
  net/mlxbf_gige: Make use of devm_platform_ioremap_resourcexxx()
  net: mdio: mscc-miim: Make use of the helper function devm_platform_ioremap_resource()
  net: mdio-ipq4019: Make use of devm_platform_ioremap_resource()
  fou: remove sparse errors
  ipv4: fix endianness issue in inet_rtm_getroute_build_skb()
  octeontx2-af: Set proper errorcode for IPv4 checksum errors
  octeontx2-af: Fix static code analyzer reported issues
  octeontx2-af: Fix mailbox errors in nix_rss_flowkey_cfg
  octeontx2-af: Fix loop in free and unmap counter
  af_unix: fix potential NULL deref in unix_dgram_connect()
  dpaa2-eth: Replace strlcpy with strscpy
  octeontx2-af: Use NDC TX for transmit packet data
  ... |