mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
cedff4e3e2
1498 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
8cc01d43f8 |
RCU pull request for v6.3
This pull request contains the following branches: doc.2023.01.05a: Documentation updates. fixes.2023.01.23a: Miscellaneous fixes, perhaps most notably: o Throttling callback invocation based on the number of callbacks that are now ready to invoke instead of on the total number of callbacks. o Several patches that suppress false-positive boot-time diagnostics, for example, due to lockdep not yet being initialized. o Make expedited RCU CPU stall warnings dump stacks of any tasks that are blocking the stalled grace period. (Normal RCU CPU stall warnings have doen this for mnay years.) o Lazy-callback fixes to avoid delays during boot, suspend, and resume. (Note that lazy callbacks must be explicitly enabled, so this should not (yet) affect production use cases.) kvfree.2023.01.03a: Cause kfree_rcu() and friends to take advantage of polled grace periods, thus reducing memory footprint by almost two orders of magnitude, admittedly on a microbenchmark. This series also begins the transition from kfree_rcu(p) to kfree_rcu_mightsleep(p). This transition was motivated by bugs where kfree_rcu(p), which can block, was typed instead of the intended kfree_rcu(p, rh). srcu.2023.01.03a: SRCU updates, perhaps most notably fixing a bug that causes SRCU to fail when booted on a system with a non-zero boot CPU. This surprising situation actually happens for kdump kernels on the powerpc architecture. It also adds an srcu_down_read() and srcu_up_read(), which act like srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side critical section to be handed off from one task to another. srcu-always.2023.02.02a: Cleans up the now-useless SRCU Kconfig option. There are a few more commits that are not yet acked or pulled into maintainer trees, and these will be in a pull request for a later merge window. tasks.2023.01.03a: RCU-tasks updates, perhaps most notably these fixes: o A strange interaction between PID-namespace unshare and the RCU-tasks grace period that results in a low-probability but very real hang. o A race between an RCU tasks rude grace period on a single-CPU system and CPU-hotplug addition of the second CPU that can result in a too-short grace period. o A race between shrinking RCU tasks down to a single callback list and queuing a new callback to some other CPU, but where that queuing is delayed for more than an RCU grace period. This can result in that callback being stranded on the non-boot CPU. torture.2023.01.05a: Torture-test updates and fixes. torturescript.2023.01.03a: Torture-test scripting updates and fixes. stall.2023.01.09a: Provide additional RCU CPU stall-warning information in kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute timeout limit for expedited RCU CPU stall warnings. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmPq29UTHHBhdWxtY2tA a2VybmVsLm9yZwAKCRCevxLzctn7jAhVEACEAKJY1VJ9IUqz7CwzAYkzgRJfiygh oDUXmlqtm6ew9pr2GdLUVCVsUSldzBc0K7Djb/G1niv4JPs+v7YwupIV33+UbStU Qxt6ztTdxc4lKospLm1+2vF9ZdzVEmiP4wVCc4iDarv5FM3FpWSTNc8+L7qmlC+X myjv+GqMTxkXZBvYJOgJGFjDwN8noTd7Fr3mCCVLFm3PXMDa7tcwD6HRP5AqD2N8 qC5M6LEqepKVGmz0mYMLlSN1GPaqIsEcexIFEazRsPEivPh/iafyQCQ/cqxwhXmV vEt7u+dXGZT/oiDq9cJ+/XRDS2RyKIS6dUE14TiiHolDCn1ONESahfA/gXWKykC2 BaGPfjWXrWv/hwbeZ+8xEdkAvTIV92tGpXir9Fby1Z5PjP3balvrnn6hs5AnQBJb NdhRPLzy/dCnEF+CweAYYm1qvTo8cd5nyiNwBZHn7rEAIu3Axrecag1rhFl3AJ07 cpVMQXZtkQVa2X8aIRTUC+ijX6yIqNaHlu0HqNXgIUTDzL4nv5cMjOMzpNQP9/dZ FwAMZYNiOk9IlMiKJ8ZiVcxeiA8ouIBlkYM3k6vGrmiONZ7a/EV/mSHoJqI8bvqr AxUIJ2Ayhg3bxPboL5oKgCiLql0A7ZVvz6quX6McitWGMgaSvel1fDzT3TnZd41e 4AFBFd/+VedUGg== =bBYK -----END PGP SIGNATURE----- Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates - Miscellaneous fixes, perhaps most notably: - Throttling callback invocation based on the number of callbacks that are now ready to invoke instead of on the total number of callbacks - Several patches that suppress false-positive boot-time diagnostics, for example, due to lockdep not yet being initialized - Make expedited RCU CPU stall warnings dump stacks of any tasks that are blocking the stalled grace period. (Normal RCU CPU stall warnings have done this for many years) - Lazy-callback fixes to avoid delays during boot, suspend, and resume. (Note that lazy callbacks must be explicitly enabled, so this should not (yet) affect production use cases) - Make kfree_rcu() and friends take advantage of polled grace periods, thus reducing memory footprint by almost two orders of magnitude, admittedly on a microbenchmark This also begins the transition from kfree_rcu(p) to kfree_rcu_mightsleep(p). This transition was motivated by bugs where kfree_rcu(p), which can block, was typed instead of the intended kfree_rcu(p, rh) - SRCU updates, perhaps most notably fixing a bug that causes SRCU to fail when booted on a system with a non-zero boot CPU. This surprising situation actually happens for kdump kernels on the powerpc architecture This also adds an srcu_down_read() and srcu_up_read(), which act like srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side critical section to be handed off from one task to another - Clean up the now-useless SRCU Kconfig option There are a few more commits that are not yet acked or pulled into maintainer trees, and these will be in a pull request for a later merge window - RCU-tasks updates, perhaps most notably these fixes: - A strange interaction between PID-namespace unshare and the RCU-tasks grace period that results in a low-probability but very real hang - A race between an RCU tasks rude grace period on a single-CPU system and CPU-hotplug addition of the second CPU that can result in a too-short grace period - A race between shrinking RCU tasks down to a single callback list and queuing a new callback to some other CPU, but where that queuing is delayed for more than an RCU grace period. This can result in that callback being stranded on the non-boot CPU - Torture-test updates and fixes - Torture-test scripting updates and fixes - Provide additional RCU CPU stall-warning information in kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute timeout limit for expedited RCU CPU stall warnings * tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits) rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep() kernel/notifier: Remove CONFIG_SRCU init: Remove "select SRCU" fs/quota: Remove "select SRCU" fs/notify: Remove "select SRCU" fs/btrfs: Remove "select SRCU" fs: Remove CONFIG_SRCU drivers/pci/controller: Remove "select SRCU" drivers/net: Remove "select SRCU" drivers/md: Remove "select SRCU" drivers/hwtracing/stm: Remove "select SRCU" drivers/dax: Remove "select SRCU" drivers/base: Remove CONFIG_SRCU rcu: Disable laziness if lazy-tracking says so rcu: Track laziness during boot and suspend rcu: Remove redundant call to rcu_boost_kthread_setaffinity() rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts rcu: Align the output of RCU CPU stall warning messages rcu: Add RCU stall diagnosis information sched: Add helper nr_context_switches_cpu() ... |
||
![]() |
1f2d9ffc7a |
Scheduler updates in this cycle are:
- Improve the scalability of the CFS bandwidth unthrottling logic with large number of CPUs. - Fix & rework various cpuidle routines, simplify interaction with the generic scheduler code. Add __cpuidle methods as noinstr to objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks. - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query previously issued registrations. - Limit scheduler slice duration to the sysctl_sched_latency period, to improve scheduling granularity with a large number of SCHED_IDLE tasks. - Debuggability enhancement on sys_exit(): warn about disabled IRQs, but also enable them to prevent a cascade of followup problems and repeat warnings. - Fix the rescheduling logic in prio_changed_dl(). - Micro-optimize cpufreq and sched-util methods. - Micro-optimize ttwu_runnable() - Micro-optimize the idle-scanning in update_numa_stats(), select_idle_capacity() and steal_cookie_task(). - Update the RSEQ code & self-tests - Constify various scheduler methods - Remove unused methods - Refine __init tags - Documentation updates - ... Misc other cleanups, fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbJwRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iIvA//ZcEaB8Z6ChLRQjM+bsaudKJu3pdLQbPK iYbP8Da+LsAfxbEfYuGV3m+jIp0LlBOtsI/EezxQrXV+V7FvNyAX9Y00eEu/zlj8 7Jn3LMy/DBYTwH7LwVdcU0MyIVI8ZPc6WNnkx0LOtGZn8n+qfHPSDzcP3CW+a5AV UvllPYpYyEmsX0Eby7CF4Ue8mSmbViw/xR3rNr8ZSve0c25XzKabw8O9kE3jiHxP d/zERJoAYeDyYUEuZqhfn5dTlB4an4IjNEkAfRE5SQ09RA8Gkxsa5Ar8gob9e9M1 eQsdd4/bdhnrkM8L5qDZczqmgCTZ2bukQrxkBXhRDhLgoFxwAn77b+2ZjmIW3Lae AyGqRcDSg1q2oxaYm5ZiuO/t26aDOZu9vPHyHRDGt95EGbZlrp+GgeePyfCigJYz UmPdZAAcHdSymnnnlcvdG37WVvaVkpgWZzd8LbtBi23QR+Zc4WQ2IlgnUS5WKNNf VOBcAcP6E1IslDotZDQCc2dPFFQoQQEssVooyUc5oMytm7BsvxXLOeHG+Ncu/8uc H+U8Qn8jnqTxJbC5hkWQIJlhVKCq2FJrHxxySYTKROfUNcDgCmxboFeAcXTCIU1K T0S+sdoTS/CvtLklRkG0j6B8N4N98mOd9cFwUV3tX+/gMLMep3hCQs5L76JagvC5 skkQXoONNaM= =l1nN -----END PGP SIGNATURE----- Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Improve the scalability of the CFS bandwidth unthrottling logic with large number of CPUs. - Fix & rework various cpuidle routines, simplify interaction with the generic scheduler code. Add __cpuidle methods as noinstr to objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks. - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query previously issued registrations. - Limit scheduler slice duration to the sysctl_sched_latency period, to improve scheduling granularity with a large number of SCHED_IDLE tasks. - Debuggability enhancement on sys_exit(): warn about disabled IRQs, but also enable them to prevent a cascade of followup problems and repeat warnings. - Fix the rescheduling logic in prio_changed_dl(). - Micro-optimize cpufreq and sched-util methods. - Micro-optimize ttwu_runnable() - Micro-optimize the idle-scanning in update_numa_stats(), select_idle_capacity() and steal_cookie_task(). - Update the RSEQ code & self-tests - Constify various scheduler methods - Remove unused methods - Refine __init tags - Documentation updates - Misc other cleanups, fixes * tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits) sched/rt: pick_next_rt_entity(): check list_entry sched/deadline: Add more reschedule cases to prio_changed_dl() sched/fair: sanitize vruntime of entity being placed sched/fair: Remove capacity inversion detection sched/fair: unlink misfit task from cpu overutilized objtool: mem*() are not uaccess safe cpuidle: Fix poll_idle() noinstr annotation sched/clock: Make local_clock() noinstr sched/clock/x86: Mark sched_clock() noinstr x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read() x86/atomics: Always inline arch_atomic64*() cpuidle: tracing, preempt: Squash _rcuidle tracing cpuidle: tracing: Warn about !rcu_is_watching() cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG cpuidle: drivers: firmware: psci: Dont instrument suspend code KVM: selftests: Fix build of rseq test exit: Detect and fix irq disabled state in oops cpuidle, arm64: Fix the ARM64 cpuidle logic cpuidle: mvebu: Fix duplicate flags assignment sched/fair: Limit sched slice duration ... |
||
![]() |
df14b7f9ef |
sched/core: Fix a missed update of user_cpus_ptr
Since commit |
||
![]() |
57a30218fa |
Linux 6.2-rc6
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh SDc/Y/c= =zpaD -----END PGP SIGNATURE----- Merge tag 'v6.2-rc6' into sched/core, to pick up fixes Pick up fixes before merging another batch of cpuidle updates. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
5657c11678 |
sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
The kernel commit |
||
![]() |
9a5418bc48 |
sched/core: Use kfree_rcu() in do_set_cpus_allowed()
Commit |
||
![]() |
87ca4f9efb |
sched/core: Fix use-after-free bug in dup_user_cpus_ptr()
Since commit |
||
![]() |
7fb3ff22ad |
sched/core: Fix arch_scale_freq_tick() on tickless systems
In order for the scheduler to be frequency invariant we measure the
ratio between the maximum CPU frequency and the actual CPU frequency.
During long tickless periods of time the calculations that keep track
of that might overflow, in the function scale_freq_tick():
if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
goto error;
eventually forcing the kernel to disable the feature for all CPUs,
and show the warning message:
"Scheduler frequency invariance went wobbly, disabling!".
Let's avoid that by limiting the frequency invariant calculations
to CPUs with regular tick.
Fixes:
|
||
![]() |
160fb0d83f |
sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate()
ttwu_do_activate() is used for a complete wakeup, in which we will activate_task() and use ttwu_do_wakeup() to mark the task runnable and perform wakeup-preemption, also call class->task_woken() callback and update the rq->idle_stamp. Since ttwu_runnable() is not a complete wakeup, don't need all those done in ttwu_do_wakeup(), so we can move those to ttwu_do_activate() to simplify ttwu_do_wakeup(), making it only mark the task runnable to be reused in ttwu_runnable() and try_to_wake_up(). This patch should not have any functional changes. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20221223103257.4962-2-zhouchengming@bytedance.com |
||
![]() |
efe0938586 |
sched/core: Micro-optimize ttwu_runnable()
ttwu_runnable() is used as a fast wakeup path when the wakee task is running on CPU or runnable on RQ, in both cases we can just set its state to TASK_RUNNING to prevent a sleep. If the wakee task is on_cpu running, we don't need to update_rq_clock() or check_preempt_curr(). But if the wakee task is on_rq && !on_cpu (e.g. an IRQ hit before the task got to schedule() and the task been preempted), we should check_preempt_curr() to see if it can preempt the current running. This also removes the class->task_woken() callback from ttwu_runnable(), which wasn't required per the RT/DL implementations: any required push operation would have been queued during class->set_next_task() when p got preempted. ttwu_runnable() also loses the update to rq->idle_stamp, as by definition the rq cannot be idle in this scenario. Suggested-by: Valentin Schneider <vschneid@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20221223103257.4962-1-zhouchengming@bytedance.com |
||
![]() |
7c182722a0 |
sched: Add helper nr_context_switches_cpu()
Add a function nr_context_switches_cpu() that returns number of context switches since boot on the specified CPU. This information will be used to diagnose RCU CPU stalls. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> |
||
![]() |
bbd0b03150 |
sched/rseq: Fix concurrency ID handling of usermodehelper kthreads
sched_mm_cid_after_execve() does not expect NULL t->mm, but it may happen
if a usermodehelper kthread fails when attempting to execute a binary.
sched_mm_cid_fork() can be issued from a usermodehelper kthread, which
has t->flags PF_KTHREAD set.
Fixes:
|
||
![]() |
8589018acc |
sched/core: Adjusting the order of scanning CPU
When select_idle_capacity() starts scanning for an idle CPU, it starts with target CPU that has already been checked in select_idle_sibling(). So we start checking from the next CPU and try the target CPU at the end. Similarly for task_numa_assign(), we have just checked numa_migrate_on of dst_cpu, so start from the next CPU. This also works for steal_cookie_task(), the first scan must fail and start directly from the next one. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20221216062406.7812-3-jiahao.os@bytedance.com |
||
![]() |
904cbab71d |
sched: Make const-safe
With a modified container_of() that preserves constness, the compiler finds some pointers which should have been marked as const. task_of() also needs to become const-preserving for the !FAIR_GROUP_SCHED case so that cfs_rq_of() can take a const argument. No change to generated code. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221212144946.2657785-1-willy@infradead.org |
||
![]() |
af7f588d8f |
sched: Introduce per-memory-map concurrency ID
This feature allows the scheduler to expose a per-memory map concurrency ID to user-space. This concurrency ID is within the possible cpus range, and is temporarily (and uniquely) assigned while threads are actively running within a memory map. If a memory map has fewer threads than cores, or is limited to run on few cores concurrently through sched affinity or cgroup cpusets, the concurrency IDs will be values close to 0, thus allowing efficient use of user-space memory for per-cpu data structures. This feature is meant to be exposed by a new rseq thread area field. The primary purpose of this feature is to do the heavy-lifting needed by memory allocators to allow them to use per-cpu data structures efficiently in the following situations: - Single-threaded applications, - Multi-threaded applications on large systems (many cores) with limited cpu affinity mask, - Multi-threaded applications on large systems (many cores) with restricted cgroup cpuset per container. One of the key concern from scheduler maintainers is the overhead associated with additional spin locks or atomic operations in the scheduler fast-path. This is why the following optimization is implemented. On context switch between threads belonging to the same memory map, transfer the mm_cid from prev to next without any atomic ops. This takes care of use-cases involving frequent context switch between threads belonging to the same memory map. Additional optimizations can be done if the spin locks added when context switching between threads belonging to different memory maps end up being a performance bottleneck. Those are left out of this patch though. A performance impact would have to be clearly demonstrated to justify the added complexity. The credit goes to Paul Turner (Google) for the original virtual cpu id idea. This feature is implemented based on the discussions with Paul Turner and Peter Oskolkov (Google), but I took the liberty to implement scheduler fast-path optimizations and my own NUMA-awareness scheme. The rumor has it that Google have been running a rseq vcpu_id extension internally in production for a year. The tcmalloc source code indeed has comments hinting at a vcpu_id prototype extension to the rseq system call [1]. The following benchmarks do not show any significant overhead added to the scheduler context switch by this feature: * perf bench sched messaging (process) Baseline: 86.5±0.3 ms With mm_cid: 86.7±2.6 ms * perf bench sched messaging (threaded) Baseline: 84.3±3.0 ms With mm_cid: 84.7±2.6 ms * hackbench (process) Baseline: 82.9±2.7 ms With mm_cid: 82.9±2.9 ms * hackbench (threaded) Baseline: 85.2±2.6 ms With mm_cid: 84.4±2.9 ms [1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26 Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221122203932.231377-8-mathieu.desnoyers@efficios.com |
||
![]() |
48ea09cdda |
hardening updates for v6.2-rc1
- Convert flexible array members, fix -Wstringop-overflow warnings, and fix KCFI function type mismatches that went ignored by maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook). - Remove the remaining side-effect users of ksize() by converting dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add more __alloc_size attributes, and introduce full testing of all allocator functions. Finally remove the ksize() side-effect so that each allocation-aware checker can finally behave without exceptions. - Introduce oops_limit (default 10,000) and warn_limit (default off) to provide greater granularity of control for panic_on_oops and panic_on_warn (Jann Horn, Kees Cook). - Introduce overflows_type() and castable_to_type() helpers for cleaner overflow checking. - Improve code generation for strscpy() and update str*() kern-doc. - Convert strscpy and sigphash tests to KUnit, and expand memcpy tests. - Always use a non-NULL argument for prepare_kernel_cred(). - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell). - Adjust orphan linker section checking to respect CONFIG_WERROR (Xin Li). - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu). - Fix um vs FORTIFY warnings for always-NULL arguments. -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmOZSOoWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjAAD/0YkvpU7f03f8hcQMJK6wv//24K AW41hEaBikq9RcmkuvkLLrJRibGgZ5O2xUkUkxRs/HxhkhrZ0kEw8sbwZe8MoWls F4Y9+TDjsrdHmjhfcBZdLnVxwcKK5wlaEcpjZXtbsfcdhx3TbgcDA23YELl5t0K+ I11j4kYmf9SLl4CwIrSP5iACml8CBHARDh8oIMF7FT/LrjNbM8XkvBcVVT6hTbOV yjgA8WP2e9GXvj9GzKgqvd0uE/kwPkVAeXLNFWopPi4FQ8AWjlxbBZR0gamA6/EB d7TIs0ifpVU2JGQaTav4xO6SsFMj3ntoUI0qIrFaTxZAvV4KYGrPT/Kwz1O4SFaG rN5lcxseQbPQSBTFNG4zFjpywTkVCgD2tZqDwz5Rrmiraz0RyIokCN+i4CD9S0Ds oEd8JSyLBk1sRALczkuEKo0an5AyC9YWRcBXuRdIHpLo08PsbeUUSe//4pe303cw 0ApQxYOXnrIk26MLElTzSMImlSvlzW6/5XXzL9ME16leSHOIfDeerPnc9FU9Eb3z ODv22z6tJZ9H/apSUIHZbMciMbbVTZ8zgpkfydr08o87b342N/ncYHZ5cSvQ6DWb jS5YOIuvl46/IhMPT16qWC8p0bP5YhxoPv5l6Xr0zq0ooEj0E7keiD/SzoLvW+Qs AHXcibguPRQBPAdiPQ== =yaaN -----END PGP SIGNATURE----- Merge tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening updates from Kees Cook: - Convert flexible array members, fix -Wstringop-overflow warnings, and fix KCFI function type mismatches that went ignored by maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook) - Remove the remaining side-effect users of ksize() by converting dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add more __alloc_size attributes, and introduce full testing of all allocator functions. Finally remove the ksize() side-effect so that each allocation-aware checker can finally behave without exceptions - Introduce oops_limit (default 10,000) and warn_limit (default off) to provide greater granularity of control for panic_on_oops and panic_on_warn (Jann Horn, Kees Cook) - Introduce overflows_type() and castable_to_type() helpers for cleaner overflow checking - Improve code generation for strscpy() and update str*() kern-doc - Convert strscpy and sigphash tests to KUnit, and expand memcpy tests - Always use a non-NULL argument for prepare_kernel_cred() - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell) - Adjust orphan linker section checking to respect CONFIG_WERROR (Xin Li) - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu) - Fix um vs FORTIFY warnings for always-NULL arguments * tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (31 commits) ksmbd: replace one-element arrays with flexible-array members hpet: Replace one-element array with flexible-array member um: virt-pci: Avoid GCC non-NULL warning signal: Initialize the info in ksignal lib: fortify_kunit: build without structleak plugin panic: Expose "warn_count" to sysfs panic: Introduce warn_limit panic: Consolidate open-coded panic_on_warn checks exit: Allow oops_limit to be disabled exit: Expose "oops_count" to sysfs exit: Put an upper limit on how often we can oops panic: Separate sysctl logic from CONFIG_SMP mm/pgtable: Fix multiple -Wstringop-overflow warnings mm: Make ksize() a reporting-only function kunit/fortify: Validate __alloc_size attribute results drm/sti: Fix return type of sti_{dvo,hda,hdmi}_connector_mode_valid() drm/fsl-dcu: Fix return type of fsl_dcu_drm_connector_mode_valid() driver core: Add __alloc_size hint to devm allocators overflow: Introduce overflows_type() and castable_to_type() coredump: Proactively round up to kmalloc bucket size ... |
||
![]() |
8fa37a6835 |
sysctl changes for v6.2-rc1
Only step forward on the sysctl cleanups for this cycle. This has been on linux-next since September and this time it goes with a "Yeah, think so, it just moves stuff around a bit" from Peter Zijlstra. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmOYC3sSHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoinVEYQAL6/3nRt854jULd3zRwrWDyJZd5yxbnc R8jJBTt3q4CKwtMqd59uQqVYLpSqOCx/GsArfsXkmY4x7KYhlaSKcC4LHmFS8Z/u dofyVKumIFqtXMI+hYuTyqNqfGoK9UKXUftrqYb8pK+K3h73uqYbrDgSex4G9GJo Au0/WeDjTzLlgqt7RPN7n0PL2jMtfWVQkr3001OCQOWW9sdrOjtprn/3bDTUnW5q KukKB5saU0CvUzrTn2DaweQiRCJxQfCQfy3DZfhDRHVuWFYMV9b1okaGEoVmQlQT I9/urfdf3aLCdBBxCQG5W6uRxZwZ2Yb93M+rijZNWNFMC6WHrMCmSiADwz9LJzIK iQV7LoolGe1TFTEVJbsde5xKSF6BeId0IF5mmPQuokAx3TPE9279HNgluaB/38c8 p3P4+mP6qE12mMPyhpwDwNOzEWgUnLsGSIE5n/WPwxCiGNa7UsN2lzMDP1cJejp5 NlRg1hRKmgt30d9+t9sHeKMcWhrjxyPGsyUMwBJTuMCHbjqizGyBsB8DzyK95OoF aN66pyRqwsK0+IUivd8VfLgfriE2gDrQD5VqkJ8lfWBx9pq8RMEq7zQ1eE9IbCff hzbfG+7k9R3o4SPfJYmCBXtp6fcq+ovjbLYSvGGCJk0zfFe6SQE21rZ3hCQPq3v5 xKFh05xUfbRF =M48U -----END PGP SIGNATURE----- Merge tag 'sysctl-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "Only a small step forward on the sysctl cleanups for this cycle" * tag 'sysctl-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: sched: Move numa_balancing sysctls to its own file |
||
![]() |
bf57ae2165 |
Scheduler changes for v6.2:
- Implement persistent user-requested affinity: introduce affinity_context::user_mask and unconditionally preserve the user-requested CPU affinity masks, for long-lived tasks to better interact with cpusets & CPU hotplug events over longer timespans, without destroying the original affinity intent if the underlying topology changes. - Uclamp updates: fix relationship between uclamp and fits_capacity() - PSI fixes - Misc fixes & updates. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmOXkmgRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1j/dQ//WYW/JaBpydqnVxDu6C21z0w3+fHDdlsN nQ6jyLPlouFjI2Ink1E7i7Iq8C73sdewCgD7Jq3xGa1GhsPEJIrPAaBgacxYjOqc x9HHZoygSkAihTfrVvzq37YttD2t/gQQxc81tBziMBVP2A+gb9z44u+ezMlxjiGz irgE07qNfiLyTeD/dJhEU2EOsPJm/gestW3+Cd8uwYAe6pj0X4FE3n8ipmr0BzNZ 6nxFJaSspwAkREjpAIZVENEArq7XrkGqUFKgKpYqWn0HAnuTWgFcW8E2NrDw7Qbf 4aAdBuzimbWdbkqRoX9r7++r5wqc3KW+is8Y97aEUsc0zhrXHAW1Hn2w7en5XxiQ btaPi77Boi69sHvOrfMy3i6UZ895yh2sROIkYBDT485w57BR75HsMLkk2LNIm7qE mATrrZ65bbGAgAxZouqlnQk40BUlniIfDlfZyReyFtXkW8UH5tTNX6qtpWzzdwfy posrm+XvgDcP96/7DIczZwT6VEJE5GBZbPvk2Vw4GNq6/QeW7g9GPhYTaV6CXzzW lCk/MV1n+IWCUqjkGXXCTS53TIyC6WZh2ehegcsh1KYyWcVijEs42S6eqXZI9cO7 F4oU7sehg4vlhMm1uE5JgaABfYqqzzKlvZySdwXbne2Vjt4nsWlWoe6u6JAdA4EB PRwmUDRMyEE= =aao/ -----END PGP SIGNATURE----- Merge tag 'sched-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Implement persistent user-requested affinity: introduce affinity_context::user_mask and unconditionally preserve the user-requested CPU affinity masks, for long-lived tasks to better interact with cpusets & CPU hotplug events over longer timespans, without destroying the original affinity intent if the underlying topology changes. - Uclamp updates: fix relationship between uclamp and fits_capacity() - PSI fixes - Misc fixes & updates * tag 'sched-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Clear ttwu_pending after enqueue_task() sched/psi: Use task->psi_flags to clear in CPU migration sched/psi: Stop relying on timer_pending() for poll_work rescheduling sched/psi: Fix avgs_work re-arm in psi_avgs_work() sched/psi: Fix possible missing or delayed pending event sched: Always clear user_cpus_ptr in do_set_cpus_allowed() sched: Enforce user requested affinity sched: Always preserve the user requested cpumask sched: Introduce affinity_context sched: Add __releases annotations to affine_move_task() sched/fair: Check if prev_cpu has highest spare cap in feec() sched/fair: Consider capacity inversion in util_fits_cpu() sched/fair: Detect capacity inversion sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition sched/uclamp: Make cpu_overutilized() use util_fits_cpu() sched/uclamp: Make asym_fits_capacity() use util_fits_cpu() sched/uclamp: Make select_idle_capacity() use util_fits_cpu() sched/uclamp: Fix fits_capacity() check in feec() sched/uclamp: Make task_fits_capacity() use util_fits_cpu() sched/uclamp: Fix relationship between uclamp and migration margin |
||
![]() |
79cc1ba7ba |
panic: Consolidate open-coded panic_on_warn checks
Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll their own warnings, and each check "panic_on_warn". Consolidate this into a single function so that future instrumentation can be added in a single location. Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Gow <davidgow@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Jann Horn <jannh@google.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org |
||
![]() |
0dff89c448 |
sched: Move numa_balancing sysctls to its own file
The sysctl_numa_balancing_promote_rate_limit and sysctl_numa_balancing are part of sched, move them to its own file. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> |
||
![]() |
d6962c4fe8 |
sched: Clear ttwu_pending after enqueue_task()
We found a long tail latency in schbench whem m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, and idle_cpu() will return true until the wakee task enqueued.
This will mislead the waker when selecting idle cpu, and wake multiple
worker threads on the same wakee cpu. This situation is enlarged by
commit
|
||
![]() |
91dabf33ae |
sched: Fix race in task_call_func()
There is a very narrow race between schedule() and task_call_func().
CPU0 CPU1
__schedule()
rq_lock();
prev_state = READ_ONCE(prev->__state);
if (... && prev_state) {
deactivate_tasl(rq, prev, ...)
prev->on_rq = 0;
task_call_func()
raw_spin_lock_irqsave(p->pi_lock);
state = READ_ONCE(p->__state);
smp_rmb();
if (... || p->on_rq) // false!!!
rq = __task_rq_lock()
ret = func();
next = pick_next_task();
rq = context_switch(prev, next)
prepare_lock_switch()
spin_release(&__rq_lockp(rq)->dep_map...)
So while the task is on it's way out, it still holds rq->lock for a
little while, and right then task_call_func() comes in and figures it
doesn't need rq->lock anymore (because the task is already dequeued --
but still running there) and then the __set_task_frozen() thing observes
it's holding rq->lock and yells murder.
Avoid this by waiting for p->on_cpu to get cleared, which guarantees
the task is fully finished on the old CPU.
( While arguably the fixes tag is 'wrong' -- none of the previous
task_call_func() users appears to care for this case. )
Fixes:
|
||
![]() |
52b33d87b9 |
sched/psi: Use task->psi_flags to clear in CPU migration
The commit
|
||
![]() |
851a723e45 |
sched: Always clear user_cpus_ptr in do_set_cpus_allowed()
The do_set_cpus_allowed() function is used by either kthread_bind() or select_fallback_rq(). In both cases the user affinity (if any) should be destroyed too. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-6-longman@redhat.com |
||
![]() |
da01903281 |
sched: Enforce user requested affinity
It was found that the user requested affinity via sched_setaffinity() can be easily overwritten by other kernel subsystems without an easy way to reset it back to what the user requested. For example, any change to the current cpuset hierarchy may reset the cpumask of the tasks in the affected cpusets to the default cpuset value even if those tasks have pre-existing user requested affinity. That is especially easy to trigger under a cgroup v2 environment where writing "+cpuset" to the root cgroup's cgroup.subtree_control file will reset the cpus affinity of all the processes in the system. That is problematic in a nohz_full environment where the tasks running in the nohz_full CPUs usually have their cpus affinity explicitly set and will behave incorrectly if cpus affinity changes. Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr() and use it to restrcit the given cpumask unless there is no overlap. In that case, it will fallback to the given one. The SCA_USER flag is reused to indicate intent to set user_cpus_ptr and so user_cpus_ptr masking should be skipped. In addition, masking should also be skipped if any of the SCA_MIGRATE_* flag is set. All callers of set_cpus_allowed_ptr() will be affected by this change. A scratch cpumask is added to percpu runqueues structure for doing additional masking when user_cpus_ptr is set. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-4-longman@redhat.com |
||
![]() |
8f9ea86fdf |
sched: Always preserve the user requested cpumask
Unconditionally preserve the user requested cpumask on sched_setaffinity() calls. This allows using it outside of the fairly narrow restrict_cpus_allowed_ptr() use-case and fix some cpuset issues that currently suffer destruction of cpumasks. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-3-longman@redhat.com |
||
![]() |
713a2e21a5 |
sched: Introduce affinity_context
In order to prepare for passing through additional data through the affinity call-chains, convert the mask and flags argument into a structure. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-5-longman@redhat.com |
||
![]() |
5584e8ac2c |
sched: Add __releases annotations to affine_move_task()
affine_move_task() assumes task_rq_lock() has been called and it does an implicit task_rq_unlock() before returning. Add the appropriate __releases annotations to make this clear. A typo error in comment is also fixed. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-2-longman@redhat.com |
||
![]() |
244226035a |
sched/uclamp: Fix fits_capacity() check in feec()
As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024,
it'll always pick the previous CPU because fits_capacity() will always
return false in this case.
The new util_fits_cpu() logic should handle this correctly for us beside
more corner cases where similar failures could occur, like when using
UCLAMP_MAX.
We open code uclamp_rq_util_with() except for the clamp() part,
util_fits_cpu() needs the 'raw' values to be passed to it.
Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp
value for the rq. Makes the code more readable and ensures the right
rules (use READ_ONCE/WRITE_ONCE) are respected transparently.
[1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html
Fixes:
|
||
![]() |
8e5bad7dcc |
sched: Introduce struct balance_callback to avoid CFI mismatches
Introduce distinct struct balance_callback instead of performing function pointer casting which will trip CFI. Avoids warnings as found by Clang's future -Wcast-function-type-strict option: In file included from kernel/sched/core.c:84: kernel/sched/sched.h:1755:15: warning: cast from 'void (*)(struct rq *)' to 'void (*)(struct callback_head *)' converts to incompatible function type [-Wcast-function-type-strict] head->func = (void (*)(struct callback_head *))func; ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ No binary differences result from this change. This patch is a cleanup based on Brad Spengler/PaX Team's modifications to sched code in their last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Reported-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1724 Link: https://lkml.kernel.org/r/20221008000758.2957718-1-keescook@chromium.org |
||
![]() |
bd9a3dba18 |
PSI updates for v6.1:
- Various performance optimizations, resulting in a 4%-9% speedup in the mmtests/config-scheduler-perfpipe micro-benchmark. - New interface to turn PSI on/off on a per cgroup level. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmNJKPsRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iPmg//aovCitAQX2lLoHJDIgdQibU40oaEpKTX wM549EGz3Dr6qmwF8+qT1U2Ge6af/hHQc5G/ZqDpKbuTjUIc3RmBkqX80dNKFLuH uyi9UtfsSriw+ks8fWuDdjr+S4oppwW9ZoIXvK8v4bisd3F31DNGvKPTayNxt73m lExfzJiD1oJixDxGX8MGO9QpcoywmjWjzjrB2P+J8hnTpArouHx/HOKdQOpG6wXq ZRr9kZvju6ucDpXCTa1HJrfVRxNAh35tx/b4cDtXbBFifVAeKaPOrHapMTVsqfel Z7T+2DymhidNYK0hrRJoGUwa/vkz+2Sm1ZLG9LlgUCXVco/9S1zw1ZuQakVvzPen wriuxRaAkR+szCP0L8js5+/DAkGa43MjKsvQHmDVnetQtlsAD4eYnn+alQ837SXv MP3jwFqF+e4mcWdoQcfh0OWUgGec5XZzdgRYrFkBKyTWGLB2iPivcAMNf0X/h82Q xxv4DQJIIJ017GOQ/ho2saq+GbtFCvX8YnGYas9T47Bjjluhjo7jgTVtPTo+mhtN RfwMdG718Ap/gvnAX7wMe/t+L/4AP8AIgDRi5L35dTRqETwOjH+LAvOYjleQFYgu kMVtLMyzU+TGwHscuzPFRh7TnvSJ4sD48Ll1BPnyZsh3SS9u0gAs1bml7Cu7JbmW SIZD/S/hzdI= =91tB -----END PGP SIGNATURE----- Merge tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull PSI updates from Ingo Molnar: - Various performance optimizations, resulting in a 4%-9% speedup in the mmtests/config-scheduler-perfpipe micro-benchmark. - New interface to turn PSI on/off on a per cgroup level. * tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/psi: Per-cgroup PSI accounting disable/re-enable interface sched/psi: Cache parent psi_group to speed up group iteration sched/psi: Consolidate cgroup_psi() sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure sched/psi: Remove NR_ONCPU task accounting sched/psi: Optimize task switch inside shared cgroups again sched/psi: Move private helpers to sched/stats.h sched/psi: Save percpu memory when !psi_cgroups_enabled sched/psi: Don't create cgroup PSI files when psi_disabled sched/psi: Fix periodic aggregation shut off |
||
![]() |
27bc50fc90 |
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam R. Howlett. An overlapping range-based tree for vmas. It it apparently slight more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com). This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU= =xfWx -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ... |
||
![]() |
d4013bc4d4 |
bitmap patches for v6.1-rc1
From Phil Auld: drivers/base: Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES From me: cpumask: cleanup nr_cpu_ids vs nr_cpumask_bits mess This series cleans that mess and adds new config FORCE_NR_CPUS that allows to optimize cpumask subsystem if the number of CPUs is known at compile-time. From me: lib: optimize find_bit() functions Reworks find_bit() functions based on new FIND_{FIRST,NEXT}_BIT() macros. From me: lib/find: add find_nth_bit() Adds find_nth_bit(), which is ~70 times faster than bitcounting with for_each() loop: for_each_set_bit(bit, mask, size) if (n-- == 0) return bit; Also adds bitmap_weight_and() to let people replace this pattern: tmp = bitmap_alloc(nbits); bitmap_and(tmp, map1, map2, nbits); weight = bitmap_weight(tmp, nbits); bitmap_free(tmp); with a single bitmap_weight_and() call. From me: cpumask: repair cpumask_check() After switching cpumask to use nr_cpu_ids, cpumask_check() started generating many false-positive warnings. This series fixes it. From Valentin Schneider: bitmap,cpumask: Add for_each_cpu_andnot() and for_each_cpu_andnot() Extends the API with one more function and applies it in sched/core. -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmNBwmUACgkQsUSA/Tof vshPRwv+KlqnZlKtuSPgbo/Kgswworpi/7TqfnN9GWlb8AJ2uhjBKI3GFwv4TDow 7KV6wdKdXYLr4pktcIhWy3qLrT+bDDExfarHRo3QI1A1W42EJ+ZiUaGnQGcnVMzD 5q/K1YMJYq0oaesHEw5PVUh8mm6h9qRD8VbX1u+riW/VCWBj3bho9Dp4mffQ48Q6 hVy/SnMGgClQwNYp+sxkqYx38xUqUGYoU5MzeziUmoS6pZQh+4lF33MULnI3EKmc /ehXilPPtOV/Tm0RovDWFfm3rjNapV9FXHu8Ob2z/c+1A29EgXnE3pwrBDkAx001 TQrL9qbCANRDGPLzWQHw0dwFIaXvTdrSttCsfYYfU5hI4JbnJEe0Pqkaaohy7jqm r0dW/TlyOG5T+k8Kwdx9w9A+jKs8TbKKZ8HOaN8BpkXswVnpbzpQbj3TITZI4aeV 6YR4URBQ5UkrVLEXFXbrOzwjL2zqDdyNoBdTJmGLJ+5b/n0HHzmyMVkegNIwLLM3 GR7sMQae =Q/+F -----END PGP SIGNATURE----- Merge tag 'bitmap-6.1-rc1' of https://github.com/norov/linux Pull bitmap updates from Yury Norov: - Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES (Phil Auld) - cleanup nr_cpu_ids vs nr_cpumask_bits mess (me) This series cleans that mess and adds new config FORCE_NR_CPUS that allows to optimize cpumask subsystem if the number of CPUs is known at compile-time. - optimize find_bit() functions (me) Reworks find_bit() functions based on new FIND_{FIRST,NEXT}_BIT() macros. - add find_nth_bit() (me) Adds find_nth_bit(), which is ~70 times faster than bitcounting with for_each() loop: for_each_set_bit(bit, mask, size) if (n-- == 0) return bit; Also adds bitmap_weight_and() to let people replace this pattern: tmp = bitmap_alloc(nbits); bitmap_and(tmp, map1, map2, nbits); weight = bitmap_weight(tmp, nbits); bitmap_free(tmp); with a single bitmap_weight_and() call. - repair cpumask_check() (me) After switching cpumask to use nr_cpu_ids, cpumask_check() started generating many false-positive warnings. This series fixes it. - Add for_each_cpu_andnot() and for_each_cpu_andnot() (Valentin Schneider) Extends the API with one more function and applies it in sched/core. * tag 'bitmap-6.1-rc1' of https://github.com/norov/linux: (28 commits) sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() lib/test_cpumask: Add for_each_cpu_and(not) tests cpumask: Introduce for_each_cpu_andnot() lib/find_bit: Introduce find_next_andnot_bit() cpumask: fix checking valid cpu range lib/bitmap: add tests for for_each() loops lib/find: optimize for_each() macros lib/bitmap: introduce for_each_set_bit_wrap() macro lib/find_bit: add find_next{,_and}_bit_wrap cpumask: switch for_each_cpu{,_not} to use for_each_bit() net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and} cpumask: add cpumask_nth_{,and,andnot} lib/bitmap: remove bitmap_ord_to_pos lib/bitmap: add tests for find_nth_bit() lib: add find_nth{,_and,_andnot}_bit() lib/bitmap: add bitmap_weight_and() lib/bitmap: don't call __bitmap_weight() in kernel code tools: sync find_bit() implementation lib/find_bit: optimize find_next_bit() functions lib/find_bit: create find_first_zero_bit_le() ... |
||
![]() |
30c999937f |
Scheduler changes for v6.1:
- Debuggability: - Change most occurances of BUG_ON() to WARN_ON_ONCE() - Reorganize & fix TASK_ state comparisons, turn it into a bitmap - Update/fix misc scheduler debugging facilities - Load-balancing & regular scheduling: - Improve the behavior of the scheduler in presence of lot of SCHED_IDLE tasks - in particular they should not impact other scheduling classes. - Optimize task load tracking, cleanups & fixes - Clean up & simplify misc load-balancing code - Freezer: - Rewrite the core freezer to behave better wrt thawing and be simpler in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting all the fallout. - Deadline scheduler: - Fix the DL capacity-aware code - Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period() - Relax/optimize locking in task_non_contending() - Cleanups: - Factor out the update_current_exec_runtime() helper - Various cleanups, simplifications Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/01cRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1geZA/+PB4KC1T9aVxzaTHI36R03YgJYZmIdtxw wTf02MixePmz+gQCbepJbempGOh5ST28aOcI0xhdYOql5B63MaUBBMlB0HvGUyDG IU3zETqLMRtAbnSTdQFv8m++ECUtZYp8/x1FCel4WO7ya4ETkRu1NRfCoUepEhpZ aVAlae9LH3NBaF9t7s0PT2lTjf3pIzMFRkddJ0ywJhbFR3VnWat05fAK+J6fGY8+ LS54coefNlJD4oDh5TY8uniL1j5SmWmmwbk9Cdj7bLU5P3dFSS0/+5FJNHJPVGDE srGT7wstRUcDrN0CnZo48VIUBiApJCCDqTfJYi9wNYd0NAHvwY6MIJJgEIY8mKsI L/qH26H81Wt+ezSZ/5JIlGlZ/LIeNaa6OO/fbWEYABBQogvvx3nxsRNUYKSQzumH CnSBasBjLnjWyLlK4qARM9cI7NFSEK6NUigrEx/7h8JFu/8T4DlSy6LsF1HUyKgq 4+FJLAqG6cL0tcwB/fHYd0oRESN8dStnQhGxSojgufwLc7dlFULvCYF5JM/dX+/V IKwbOfIOeOn6ViMtSOXAEGdII+IQ2/ZFPwr+8Z5JC7NzvTVL6xlu/3JXkLZR3L7o yaXTSaz06h1vil7Z+GRf7RHc+wUeGkEpXh5vnarGZKXivhFdWsBdROIJANK+xR0i TeSLCxQxXlU= =KjMD -----END PGP SIGNATURE----- Merge tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Debuggability: - Change most occurances of BUG_ON() to WARN_ON_ONCE() - Reorganize & fix TASK_ state comparisons, turn it into a bitmap - Update/fix misc scheduler debugging facilities Load-balancing & regular scheduling: - Improve the behavior of the scheduler in presence of lot of SCHED_IDLE tasks - in particular they should not impact other scheduling classes. - Optimize task load tracking, cleanups & fixes - Clean up & simplify misc load-balancing code Freezer: - Rewrite the core freezer to behave better wrt thawing and be simpler in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting all the fallout. Deadline scheduler: - Fix the DL capacity-aware code - Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period() - Relax/optimize locking in task_non_contending() Cleanups: - Factor out the update_current_exec_runtime() helper - Various cleanups, simplifications" * tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits) sched: Fix more TASK_state comparisons sched: Fix TASK_state comparisons sched/fair: Move call to list_last_entry() in detach_tasks sched/fair: Cleanup loop_max and loop_break sched/fair: Make sure to try to detach at least one movable task sched: Show PF_flag holes freezer,sched: Rewrite core freezer logic sched: Widen TAKS_state literals sched/wait: Add wait_event_state() sched/completion: Add wait_for_completion_state() sched: Add TASK_ANY for wait_task_inactive() sched: Change wait_task_inactive()s match_state freezer,umh: Clean up freezer/initrd interaction freezer: Have {,un}lock_system_sleep() save/restore flags sched: Rename task_running() to task_on_cpu() sched/fair: Cleanup for SIS_PROP sched/fair: Default to false in test_idle_cores() sched/fair: Remove useless check in select_idle_core() sched/fair: Avoid double search on same cpu sched/fair: Remove redundant check in select_idle_smt() ... |
||
![]() |
585463f0d5 |
sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()
This removes the second use of the sched_core_mask temporary mask. Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> |
||
![]() |
5aec788aeb |
sched: Fix TASK_state comparisons
Task state is fundamentally a bitmask; direct comparisons are probably
not working as intended. Specifically the normal wait-state have
a number of possible modifiers:
TASK_UNINTERRUPTIBLE: TASK_WAKEKILL, TASK_NOLOAD, TASK_FREEZABLE
TASK_INTERRUPTIBLE: TASK_FREEZABLE
Specifically, the addition of TASK_FREEZABLE wrecked
__wait_is_interruptible(). This however led to an audit of direct
comparisons yielding the rest of the changes.
Fixes:
|
||
![]() |
bd74fdaea1 |
mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only. NB: this behavior has nothing similar with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim. This patch uses the following optimizations when walking page tables: 1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration. 2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages. 3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. 4. It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[8, 10]% Ops/sec KB/sec patch1-7: 1147696.57 44640.29 patch1-8: 1245274.91 48435.66 Configurations: no change Client benchmark results: kswapd profiles: patch1-7 48.16% lzo1x_1_do_compress (real work) 8.20% page_vma_mapped_walk (overhead) 7.06% _raw_spin_unlock_irq 2.92% ptep_clear_flush 2.53% __zram_bvec_write 2.11% do_raw_spin_lock 2.02% memmove 1.93% lru_gen_look_around 1.56% free_unref_page_list 1.40% memset patch1-8 49.44% lzo1x_1_do_compress (real work) 6.19% page_vma_mapped_walk (overhead) 5.97% _raw_spin_unlock_irq 3.13% get_pfn_folio 2.85% ptep_clear_flush 2.42% __zram_bvec_write 2.08% do_raw_spin_lock 1.92% memmove 1.44% alloc_zspage 1.36% memset Configurations: no change Thanks to the following developers for their efforts [3]. kernel test robot <lkp@intel.com> [1] https://lwn.net/Articles/23732/ [2] https://llvm.org/docs/ScudoHardenedAllocator.html [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/ Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
c59862f826 |
sched/fair: Cleanup loop_max and loop_break
sched_nr_migrate_break is set to a fix value and never changes so we can replace it by a define SCHED_NR_MIGRATE_BREAK. Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value of sysctl_sched_nr_migrate which can be init to different values. Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate. The behavior stays unchanged unless you modify sysctl_sched_nr_migrate trough debugfs. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-3-vincent.guittot@linaro.org |
||
![]() |
c959924b0d |
memory tiering: adjust hot threshold automatically
The promotion hot threshold is workload and system configuration dependent. So in this patch, a method to adjust the hot threshold automatically is implemented. The basic idea is to control the number of the candidate promotion pages to match the promotion rate limit. If the hint page fault latency of a page is less than the hot threshold, we will try to promote the page, and the page is called the candidate promotion page. If the number of the candidate promotion pages in the statistics interval is much more than the promotion rate limit, the hot threshold will be decreased to reduce the number of the candidate promotion pages. Otherwise, the hot threshold will be increased to increase the number of the candidate promotion pages. To make the above method works, in each statistics interval, the total number of the pages to check (on which the hint page faults occur) and the hot/cold distribution need to be stable. Because the page tables are scanned linearly in NUMA balancing, but the hot/cold distribution isn't uniform along the address usually, the statistics interval should be larger than the NUMA balancing scan period. So in the patch, the max scan period is used as statistics interval and it works well in our tests. Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: osalvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
52b1364ba0 |
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
Now PSI already tracked workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have obvious impact on some workload productivity, such as web service workload. When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to CPU curr task's cgroups as PSI_IRQ_FULL status. Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in the current task on the CPU, make nothing productive could run even if it were runnable, so we only use PSI_IRQ_FULL. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com |
||
![]() |
f5d39b0208 |
freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org |
||
![]() |
f9fc8cad97 |
sched: Add TASK_ANY for wait_task_inactive()
Now that wait_task_inactive()'s @match_state argument is a mask (like ttwu()) it is possible to replace the special !match_state case with an 'all-states' value such that any blocked state will match. Suggested-by: Ingo Molnar (mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net |
||
![]() |
9204a97f7a |
sched: Change wait_task_inactive()s match_state
Make wait_task_inactive()'s @match_state work like ttwu()'s @state. That is, instead of an equal comparison, use it as a mask. This allows matching multiple block conditions. (removes the unlikely; it doesn't make sense how it's only part of the condition) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org |
||
![]() |
0b9d46fc5e |
sched: Rename task_running() to task_on_cpu()
There is some ambiguity about task_running() in that it is unrelated to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing task_on_cpu(). Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net |
||
![]() |
bc1cca97e6 |
sched/debug: Show the registers of 'current' in dump_cpu_task()
The dump_cpu_task() function does not print registers on architectures that do not support NMIs. However, registers can be useful for debugging. Fortunately, in the case where dump_cpu_task() is invoked from an interrupt handler and is dumping the current CPU's stack, the get_irq_regs() function can be used to get the registers. Therefore, this commit makes dump_cpu_task() check to see if it is being asked to dump the current CPU's stack from within an interrupt handler, and, if so, it uses the get_irq_regs() function to obtain the registers. On systems that do support NMIs, this commit has the further advantage of avoiding a self-NMI in this case. This is an example of rcu self-detected stall on arm64, which does not support NMIs: [ 27.501721] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.502238] rcu: 0-....: (1250 ticks this GP) idle=4f7/1/0x4000000000000000 softirq=2594/2594 fqs=619 [ 27.502632] (t=1251 jiffies g=2989 q=29 ncpus=4) [ 27.503845] CPU: 0 PID: 306 Comm: test0 Not tainted 5.19.0-rc7-00009-g1c1a6c29ff99-dirty #46 [ 27.504732] Hardware name: linux,dummy-virt (DT) [ 27.504947] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 27.504998] pc : arch_counter_read+0x18/0x24 [ 27.505301] lr : arch_counter_read+0x18/0x24 [ 27.505328] sp : ffff80000b29bdf0 [ 27.505345] x29: ffff80000b29bdf0 x28: 0000000000000000 x27: 0000000000000000 [ 27.505475] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 27.505553] x23: 0000000000001f40 x22: ffff800009849c48 x21: 000000065f871ae0 [ 27.505627] x20: 00000000000025ec x19: ffff80000a6eb300 x18: ffffffffffffffff [ 27.505654] x17: 0000000000000001 x16: 0000000000000000 x15: ffff80000a6d0296 [ 27.505681] x14: ffffffffffffffff x13: ffff80000a29bc18 x12: 0000000000000426 [ 27.505709] x11: 0000000000000162 x10: ffff80000a2f3c18 x9 : ffff80000a29bc18 [ 27.505736] x8 : 00000000ffffefff x7 : ffff80000a2f3c18 x6 : 00000000759bd013 [ 27.505761] x5 : 01ffffffffffffff x4 : 0002dc6c00000000 x3 : 0000000000000017 [ 27.505787] x2 : 00000000000025ec x1 : ffff80000b29bdf0 x0 : 0000000075a30653 [ 27.505937] Call trace: [ 27.506002] arch_counter_read+0x18/0x24 [ 27.506171] ktime_get+0x48/0xa0 [ 27.506207] test_task+0x70/0xf0 [ 27.506227] kthread+0x10c/0x110 [ 27.506243] ret_from_fork+0x10/0x20 This is a marked improvement over the old output: [ 27.944550] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.944980] rcu: 0-....: (1249 ticks this GP) idle=cbb/1/0x4000000000000000 softirq=2610/2610 fqs=614 [ 27.945407] (t=1251 jiffies g=2681 q=28 ncpus=4) [ 27.945731] Task dump for CPU 0: [ 27.945844] task:test0 state:R running task stack: 0 pid: 306 ppid: 2 flags:0x0000000a [ 27.946073] Call trace: [ 27.946151] dump_backtrace.part.0+0xc8/0xd4 [ 27.946378] show_stack+0x18/0x70 [ 27.946405] sched_show_task+0x150/0x180 [ 27.946427] dump_cpu_task+0x44/0x54 [ 27.947193] rcu_dump_cpu_stacks+0xec/0x130 [ 27.947212] rcu_sched_clock_irq+0xb18/0xef0 [ 27.947231] update_process_times+0x68/0xac [ 27.947248] tick_sched_handle+0x34/0x60 [ 27.947266] tick_sched_timer+0x4c/0xa4 [ 27.947281] __hrtimer_run_queues+0x178/0x360 [ 27.947295] hrtimer_interrupt+0xe8/0x244 [ 27.947309] arch_timer_handler_virt+0x38/0x4c [ 27.947326] handle_percpu_devid_irq+0x88/0x230 [ 27.947342] generic_handle_domain_irq+0x2c/0x44 [ 27.947357] gic_handle_irq+0x44/0xc4 [ 27.947376] call_on_irq_stack+0x2c/0x54 [ 27.947415] do_interrupt_handler+0x80/0x94 [ 27.947431] el1_interrupt+0x34/0x70 [ 27.947447] el1h_64_irq_handler+0x18/0x24 [ 27.947462] el1h_64_irq+0x64/0x68 <--- the above backtrace is worthless [ 27.947474] arch_counter_read+0x18/0x24 [ 27.947487] ktime_get+0x48/0xa0 [ 27.947501] test_task+0x70/0xf0 [ 27.947520] kthread+0x10c/0x110 [ 27.947538] ret_from_fork+0x10/0x20 Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> |
||
![]() |
e73dfe3093 |
sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task()
The trigger_all_cpu_backtrace() function attempts to send an NMI to the target CPU, which usually provides much better stack traces than the dump_cpu_task() function's approach of dumping that stack from some other CPU. So much so that most calls to dump_cpu_task() only happen after a call to trigger_all_cpu_backtrace() has failed. And the exception to this rule really should attempt to use trigger_all_cpu_backtrace() first. Therefore, move the trigger_all_cpu_backtrace() invocation into dump_cpu_task(). Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> |
||
![]() |
53aa930dc4 |
Merge branch 'sched/warnings' into sched/core, to pick up WARN_ON_ONCE() conversion commit
Merge in the BUG_ON() => WARN_ON_ONCE() conversion commit. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
df16b71c68 |
sched/fair: Allow changing cgroup of new forked task
commit
|
||
![]() |
39c4261191 |
sched/fair: Remove redundant cpu_cgrp_subsys->fork()
We use cpu_cgrp_subsys->fork() to set task group for the new fair task
in cgroup_post_fork().
Since commit
|
||
![]() |
09348d75a6 |
sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either. By using a WARN_ON_ONCE() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs. None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON_ONCE(). There's one exception: WARN_ON_ONCE() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON_ONCE(), such as in: - BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON_ONCE(!lock_task_sighand(p, &flags))) + return; Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com |