mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
d8f26717c9
768 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
feacb1774b |
sched_ext: Changes for v6.16
Merge tag 'sched_ext-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

- More in-kernel idle CPU selection improvements. Expand topology awareness coverage and add scx_bpf_select_cpu_and() to allow more flexibility. The idle CPU selection kfuncs can now be called from unlocked contexts too.

- A bunch of reorganization changes to lay the foundation for multiple hierarchical scheduler support. This isn't ready yet and the included changes don't make meaningful behavioral differences. One notable change is replacing some static_key tests with dynamic tests, as the test results may differ depending on the scheduler instance. This isn't expected to cause a meaningful performance difference.

- Other minor and doc updates.

- There were multiple patches in for-6.15-fixes which conflicted with changes in for-6.16. for-6.15-fixes was pulled three times into for-6.16 to resolve the conflicts.

* tag 'sched_ext-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (49 commits)
  sched_ext: Call ops.update_idle() after updating builtin idle bits
  sched_ext, docs: convert mentions of "CFS" to "fair-class scheduler"
  selftests/sched_ext: Update test enq_select_cpu_fails
  sched_ext: idle: Consolidate default idle CPU selection kfuncs
  selftests/sched_ext: Add test for scx_bpf_select_cpu_and() via test_run
  sched_ext: idle: Allow scx_bpf_select_cpu_and() from unlocked context
  sched_ext: idle: Validate locking correctness in scx_bpf_select_cpu_and()
  sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c
  sched_ext, docs: add label
  sched_ext: Explain the temporary situation around scx_root dereferences
  sched_ext: Add @sch to SCX_CALL_OP*()
  sched_ext: Cleanup [__]scx_exit/error*()
  sched_ext: Add @sch to SCX_CALL_OP*()
  sched_ext: Clean up scx_root usages
  Documentation: scheduler: Changed lowercase acronyms to uppercase
  sched_ext: Avoid NULL scx_root deref in __scx_exit()
  sched_ext: Add RCU protection to scx_root in DSQ iterator
  sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn()
  sched_ext: Move disable machinery into scx_sched
  sched_ext: Move event_stats_cpu into scx_sched
  ...
c89756bcf4 |
Power management updates for 6.16-rc1
Merge tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:

"Once again, the changes are dominated by cpufreq updates, but this time the majority of them are cpufreq core changes, mostly related to the introduction of policy locking guards and __free() usage, and fixes related to boost handling.

Still, there is also a significant update of the intel_pstate driver making it register an energy model when running on a hybrid platform which is used for enabling energy-aware scheduling (EAS) if the driver operates in the passive mode (and schedutil is used as the cpufreq governor for all CPUs which is the passive mode default).

There are some amd-pstate driver updates too, for a good measure, including the "Requested CPU Min frequency" BIOS option support and new online/offline callbacks.

In the cpuidle space, the most significant change is the addition of a C1 demotion on/off sysfs knob to intel_idle which should help some users to configure their systems more precisely. There is also the conversion of the PSCI cpuidle driver to a faux device one and there are two small updates of cpuidle governors.

Device power management is also modified quite a bit, especially the handling of devices with asynchronous suspend and resume enabled during system transitions. They are now going to be handled more asynchronously during suspend transitions and somewhat less aggressively during resume transitions.

Apart from the above, the operating performance points (OPP) library is now going to use mutex locking guards and scope-based cleanup helpers and there is the usual bunch of assorted fixes and code cleanups.

Specifics:

- Fix potential division-by-zero error in em_compute_costs() (Yaxiong Tian)
- Fix typos in energy model documentation and example driver code (Moon Hee Lee, Atul Kumar Pant)
- Rearrange the energy model management code and add a new function for adjusting a CPU energy model after adjusting the capacity of the given CPU to it (Rafael Wysocki)
- Refactor cpufreq_online(), add and use cpufreq policy locking guards, use __free() in policy reference counting, and clean up core cpufreq code on top of that (Rafael Wysocki)
- Fix boost handling on CPU suspend/resume and sysfs updates (Viresh Kumar)
- Fix des_perf clamping with max_perf in amd_pstate_update() (Dhananjay Ugwekar)
- Add offline, online and suspend callbacks to the amd-pstate driver, rename and use the existing amd_pstate_epp callbacks in it (Dhananjay Ugwekar)
- Add support for the "Requested CPU Min frequency" BIOS option to the amd-pstate driver (Dhananjay Ugwekar)
- Reset amd-pstate driver mode after running selftests (Swapnil Sapkal)
- Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan Chancellor)
- Add helper for governor checks to the schedutil cpufreq governor and move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki)
- Populate the cpu_capacity sysfs entries from the intel_pstate driver after registering asym capacity support (Ricardo Neri)
- Add support for enabling Energy-aware scheduling (EAS) to the intel_pstate driver when operating in the passive mode on a hybrid platform (Rafael Wysocki)
- Drop redundant cpus_read_lock() from store_local_boost() in the cpufreq core (Seyediman Seyedarab)
- Replace sscanf() with kstrtouint() in the cpufreq code and use a symbol instead of a raw number in it (Bowen Yu)
- Add support for autonomous CPU performance state selection to the CPPC cpufreq driver (Lifeng Zheng)
- OPP: Add dev_pm_opp_set_level() (Praveen Talari)
- Introduce scope-based cleanup headers and mutex locking guards in OPP core (Viresh Kumar)
- Switch OPP to use kmemdup_array() (Zhang Enpei)
- Optimize bucket assignment when next_timer_ns equals KTIME_MAX in the menu cpuidle governor (Zhongqiu Han)
- Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla)
- Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem Bityutskiy)
- Fix typos in two comments in the teo cpuidle governor (Atul Kumar Pant)
- Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja Kalla)
- Move debug runtime PM attributes to runtime_attrs[] (Rafael Wysocki)
- Add new devm_ functions for enabling runtime PM and runtime PM reference counting (Bence Csókás)
- Remove size arguments from strscpy() calls in the hibernation core code (Thorsten Blum)
- Adjust the handling of devices with asynchronous suspend enabled during system suspend and resume to start resuming them immediately after resuming their parents and to start suspending such a device immediately after suspending its first child (Rafael Wysocki)
- Adjust messages printed during tasks freezing to avoid using pr_cont() (Andrew Sayers, Paul Menzel)
- Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan Zhang)
- Add missing wakeup source attribute relax_count to sysfs and remove the space character at the end of the string produced by pm_show_wakelocks() (Zijun Hu)
- Add configurable pm_test delay for hibernation (Zihuan Zhang)
- Disable asynchronous suspend in ucsi_ccg_probe() to prevent the cypd4226 device on Tegra boards from suspending prematurely (Jon Hunter)
- Unbreak printing PM debug messages during hibernation and clean up some related code (Rafael Wysocki)
- Add a systemd service to run cpupower and change cpupower binding's Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli)"

* tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits)
  cpufreq: CPPC: Add support for autonomous selection
  cpufreq: Update sscanf() to kstrtouint()
  cpufreq: Replace magic number
  OPP: switch to use kmemdup_array()
  PM: freezer: Rewrite restarting tasks log to remove stray *done.*
  PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn()
  cpufreq: drop redundant cpus_read_lock() from store_local_boost()
  cpupower: do not install files to /etc/default/
  cpupower: do not call systemctl at install time
  cpupower: do not write DESTDIR to cpupower.service
  PM: sleep: Introduce pm_sleep_transition_in_progress()
  cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
  cpufreq: intel_pstate: Document hybrid processor support
  cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
  cpufreq: intel_pstate: EAS support for hybrid platforms
  PM: EM: Introduce em_adjust_cpu_capacity()
  PM: EM: Move CPU capacity check to em_adjust_new_capacity()
  PM: EM: Documentation: Fix typos in example driver code
  cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
  PM: sleep: Introduce pm_suspend_in_progress()
  ...
f42c8556a0 |
cpufreq/sched: schedutil: Add helper for governor checks
Add a helper for checking if schedutil is the current governor for a given cpufreq policy and use it in sched_is_eas_possible() to avoid accessing cpufreq policy internals directly from there. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://patch.msgid.link/3365956.44csPzL39Z@rjwysocki.net |
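A minimal sketch of the kind of helper described above, assuming the usual cpufreq layout where struct cpufreq_policy carries a ->governor pointer and schedutil registers a schedutil_gov governor object; the helper name is illustrative, not necessarily the one the patch adds.

```c
#include <linux/cpufreq.h>

/* Governor object registered by kernel/sched/cpufreq_schedutil.c. */
extern struct cpufreq_governor schedutil_gov;

/*
 * Illustrative helper: report whether schedutil currently governs
 * @policy, so callers such as sched_is_eas_possible() do not have to
 * poke at cpufreq policy internals directly.
 */
static inline bool policy_is_schedutil(struct cpufreq_policy *policy)
{
	return policy->governor == &schedutil_gov;
}
```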
0ab94c3242 |
sched: Add annotations to RT_GROUP_SCHED fields
Update comments to ease RT throttling understanding. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-10-mkoutny@suse.com |
61d3164fec |
sched: Skip non-root task_groups with disabled RT_GROUP_SCHED
First, we want to prevent placement of RT tasks on non-root rt_rqs which we achieve in the task migration code that'd fall back to root_task_group's rt_rq. Second, we want to work with only root_task_group's rt_rq when iterating all "real" rt_rqs when RT_GROUP is disabled. To achieve this we keep root_task_group as the first one on the task_groups and break out quickly. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-6-mkoutny@suse.com |
e34e0131fe |
sched: Add command-line option for RT_GROUP_SCHED toggling
Only a simple implementation with a static key wrapper; it will be wired in later.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-5-mkoutny@suse.com
a5a25b32c0 |
sched: Always initialize rt_rq's task_group
rt_rq->tg may be NULL, which denotes the root task_group. Store the pointer to root_task_group directly so that callers may use rt_rq->tg homogeneously. root_task_group always exists with CONFIG_CGROUP_SCHED, and CONFIG_RT_GROUP_SCHED depends on that.

This changes the root-level rt_rq's default limit from infinity to the value of the (originally) global RT throttling.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-4-mkoutny@suse.com
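A small sketch of the invariant described above (initializer name illustrative): rt_rq->tg always points at a task_group, with &root_task_group standing in for "no group", so callers can dereference rt_rq->tg without NULL checks.

```c
/* Illustrative only: how an rt_rq's task_group pointer could be set up
 * so it is never NULL; the root group doubles as the default owner. */
static void rt_rq_set_tg(struct rt_rq *rt_rq, struct task_group *tg)
{
	rt_rq->tg = tg ? tg : &root_task_group;
}
```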
a50c365f99 |
sched_ext: Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> |
dd5bdaf2b7 |
sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
All the big Linux distros enable CONFIG_SCHED_DEBUG, because the various features it provides help not just with kernel development, but with system administration and user-space software development as well. Reflect this reality and enable this functionality unconditionally. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org |
57903f72f2 |
sched/debug: Make 'const_debug' tunables unconditional __read_mostly
With CONFIG_SCHED_DEBUG becoming unconditional, remove the extra 'const_debug' indirection towards __read_mostly. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-3-mingo@kernel.org |
f7d2728cc0 |
sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
The scheduler has this special SCHED_WARN() facility that depends on CONFIG_SCHED_DEBUG. Since CONFIG_SCHED_DEBUG is getting removed, convert SCHED_WARN() to WARN_ON_ONCE(). Note that the warning output isn't 100% equivalent: #define SCHED_WARN_ON(x) WARN_ONCE(x, #x) Because SCHED_WARN_ON() would output the 'x' condition as well, while WARN_ONCE() will only show a backtrace. Hopefully these are rare enough to not really matter. If it does, we should probably introduce a new WARN_ON() variant that outputs the condition in stringified form, or improve WARN_ON() itself. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-2-mingo@kernel.org |
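The variant the changelog hints at could look roughly like the sketch below (name illustrative): warn once and also print the stringified condition, as SCHED_WARN_ON() used to.

```c
#include <linux/bug.h>

/* Illustrative: a warn-once variant that keeps the #cond string. */
#define WARN_ON_ONCE_COND(cond) \
	WARN_ONCE((cond), "condition failed: %s\n", #cond)
```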
45007c6fb5 |
sched/deadline: Generalize unique visiting of root domains
Bandwidth checks and updates that work on root domains currently employ
a cookie mechanism for efficiency. This mechanism is very much tied to
when root domains are first created and initialized.
Generalize the cookie mechanism so that it can be used also later at
runtime while updating root domains. Additionally, guard it with
sched_domains_mutex, since domains need to be stable while updating them
(and it will be required for further dynamic changes).
Fixes:
|
8bdc5daaa0 |
sched: Add a generic function to return the preemption string
The individual architectures often add the preemption model to the beginning of the backtrace. This is the case on X86 or ARM64 for the "die" case but not for regular warnings. With the addition of PREEMPT_DYNAMIC for PREEMPT_RT we end up with CONFIG_PREEMPT and CONFIG_PREEMPT_RT set simultaneously. That means that everyone who tried to add that piece of information gets it wrong for PREEMPT_RT because PREEMPT is checked first.

Provide a generic function which returns the current scheduling model considering LAZY preempt and the current state of PREEMPT_DYNAMIC. The resulting strings are:

| Model | -RT -DYN | +RT -DYN | -RT +DYN | +RT +DYN |
|---|---|---|---|---|
| NONE | NONE | n/a | PREEMPT(none) | n/a |
| VOLUNTARY | VOLUNTARY | n/a | PREEMPT(voluntary) | n/a |
| FULL | PREEMPT | PREEMPT_RT | PREEMPT(full) | PREEMPT_{RT,full} |
| LAZY | PREEMPT_LAZY | PREEMPT_{RT,LAZY} | PREEMPT(lazy) | PREEMPT_{RT,lazy} |

[ The dynamic building of the string can lead to an empty string if the function is invoked simultaneously on two CPUs. ]

Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Co-developed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Signed-off-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20250314160810.2373416-2-bigeasy@linutronix.de
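A simplified sketch of the idea, covering only the static (non-PREEMPT_DYNAMIC) columns of the table above; the function name is illustrative. The point is that PREEMPT_RT is tested before CONFIG_PREEMPT, so RT kernels are no longer misreported as plain PREEMPT.

```c
/* Illustrative only: static-config subset of the table above. */
static const char *preempt_model_string(void)
{
	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		return IS_ENABLED(CONFIG_PREEMPT_LAZY) ?
			"PREEMPT_{RT,LAZY}" : "PREEMPT_RT";
	if (IS_ENABLED(CONFIG_PREEMPT_LAZY))
		return "PREEMPT_LAZY";
	if (IS_ENABLED(CONFIG_PREEMPT))
		return "PREEMPT";
	if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY))
		return "VOLUNTARY";
	return "NONE";
}
```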
4bc4582414 |
sched/uclamp: Optimize sched_uclamp_used static key enabling
Repeated calls of static_branch_enable() on an already enabled static key introduce overhead, because each call takes cpus_read_lock(). Users may frequently set the uclamp value of tasks, triggering repeated enabling of the sched_uclamp_used static key.

Optimize this and avoid repeated calls to static_branch_enable() by checking whether the key is enabled already.

[ mingo: Rewrote the changelog for legibility ]

Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250219093747.2612-2-xuewen.yan@unisoc.com
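Sketch of the pattern described above (wrapper name illustrative): only call static_branch_enable(), which takes cpus_read_lock(), when the key is not already on; uclamp_is_used() is the existing wrapper around static_branch_unlikely(&sched_uclamp_used).

```c
/* Illustrative wrapper: skip the expensive enable path when the
 * sched_uclamp_used static key has already been switched on. */
static void sched_uclamp_enable(void)
{
	if (uclamp_is_used())
		return;
	static_branch_enable(&sched_uclamp_used);
}
```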
5fca5a4cf9 |
sched/uclamp: Use the uclamp_is_used() helper instead of open-coding it
Don't open-code static_branch_unlikely(&sched_uclamp_used), we have the uclamp_is_used() wrapper around it. [ mingo: Clean up the changelog ] Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250219093747.2612-1-xuewen.yan@unisoc.com |
82354fce16 |
Merge branch 'sched/urgent' into sched/core, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
02d954c0fd |
sched: Compact RSEQ concurrency IDs with reduced threads and affinity
When a process reduces its number of threads or clears bits in its CPU affinity mask, the mm_cid allocation should eventually converge towards smaller values. However, the change introduced by: commit |
b9f2b29b94 |
sched: Don't define sched_clock_irqtime as static key
The sched_clock_irqtime was defined as a static key in commit |
d6f3e7d564 |
sched_ext: Fix incorrect autogroup migration detection
scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for racing migration to the root cgroup and an
autogroup migration.
This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:
WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
...
Call Trace:
<TASK>
cgroup_migrate_execute+0x5b1/0x700
cgroup_attach_task+0x296/0x400
__cgroup_procs_write+0x128/0x140
cgroup_procs_write+0x17/0x30
kernfs_fop_write_iter+0x141/0x1f0
vfs_write+0x31d/0x4a0
__x64_sys_write+0x72/0xf0
do_syscall_64+0x82/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().
Link: https://github.com/sched-ext/scx/issues/370
Fixes:
|
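A rough sketch of the interface change described above (parameter name illustrative, surrounding logic elided into comments): callers of sched_move_task() now state whether the move is an autogroup one, and only cgroup moves reach the BPF scheduler via scx_cgroup_move_task().

```c
/* Illustrative shape of the fix: autogroup moves no longer reach the
 * BPF scheduler's cgroup-migration hook. */
void sched_move_task(struct task_struct *tsk, bool for_autogroup)
{
	/* ... dequeue, switch task_group, requeue ... */
	if (!for_autogroup)
		scx_cgroup_move_task(tsk);	/* cgroup migration only */
}
```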
bc8198dc7e |
sched_ext: Changes for v6.14
- scx_bpf_now() added so that the BPF scheduler can access the cached timestamp in struct rq to avoid reading the TSC multiple times within a locked scheduling operation.
- Minor updates to the built-in idle CPU selection logic.
- tools/sched_ext updates and other misc changes.

Pulling sched_ext/for-6.14 into master causes a merge conflict between the following two commits (first commit in master, second in for-6.14):
62de6e1685 |
Scheduler enhancements for v6.14:
Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

"Fair scheduler (SCHED_FAIR) enhancements:

- Behavioral improvements:
  - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
  - Rename h_nr_running into h_nr_queued
  - Add new cfs_rq.h_nr_runnable
  - Use the new cfs_rq.h_nr_runnable
  - Removed unused cfs_rq.h_nr_delayed
  - Rename cfs_rq.idle_h_nr_running into h_nr_idle
  - Remove unused cfs_rq.idle_nr_running
  - Rename cfs_rq.nr_running into nr_queued
  - Do not try to migrate delayed dequeue task
  - Fix variable declaration position
  - Encapsulate set custom slice in a __setparam_fair() function
- Fixes:
  - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
  - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia)
- Cleanups:
  - Clean up in migrate_degrades_locality() to improve readability (Peter Zijlstra)
  - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
  - Update comments after sched_tick() rename (Sebastian Andrzej Siewior)
  - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used() (Valentin Schneider)

Deadline scheduler (SCHED_DL) enhancements:

- Restore dl_server bandwidth on non-destructive root domain changes (Juri Lelli)
- Correctly account for allocated bandwidth during hotplug (Juri Lelli)
- Check bandwidth overflow earlier for hotplug (Juri Lelli)
- Clean up goto label in pick_earliest_pushable_dl_task() (John Stultz)
- Consolidate timer cancellation (Wander Lairson Costa)

Load-balancer enhancements:

- Improve performance by prioritizing migrating eligible tasks in sched_balance_rq() (Hao Jia)
- Do not compute NUMA Balancing stats unnecessarily during load-balancing (K Prateek Nayak)
- Do not compute overloaded status unnecessarily during load-balancing (K Prateek Nayak)

Generic scheduling code enhancements:

- Use READ_ONCE() in task_on_rq_queued(), to consistently use the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)

Isolated CPUs support enhancements: (Waiman Long)

- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE

RSEQ enhancements:

- Validate read-only fields under DEBUG_RSEQ config (Mathieu Desnoyers)

PSI enhancements:

- Fix race when task wakes up before psi_sched_switch() adjusts flags (Chengming Zhou)

IRQ time accounting performance enhancements: (Yafang Shao)

- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled

Virtual machine scheduling enhancements:

- Don't try to catch up excess steal time (Suleiman Souhlal)

Heterogeneous x86 CPU scheduling enhancements: (K Prateek Nayak)

- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally

Debugging code & instrumentation enhancements:

- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)"

* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  rseq: Fix rseq unregistration regression
  psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched: Define sched_clock_irqtime as static key
  sched/fair: Do not compute overloaded status unnecessarily during lb
  sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
  x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
  x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
  x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
  x86/itmt: Use guard() for itmt_update_mutex
  x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
  sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
  sched/debug: Change need_resched warnings to pr_err
  sched/fair: Encapsulate set custom slice in a __setparam_fair() function
  sched: Fix race between yield_to() and try_to_wake_up()
  docs: Update Schedstat version to 17
  sched/stats: Print domain name in /proc/schedstat
  sched: Move sched domain name out of CONFIG_SCHED_DEBUG
  sched: Report the different kinds of imbalances in /proc/schedstat
  ...
8722903cbb |
sched: Define sched_clock_irqtime as static key
Since CPU time accounting is a performance-critical path, let's define sched_clock_irqtime as a static key to minimize potential overhead. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-2-laoar.shao@gmail.com |
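Sketch of the pattern, assuming the usual static-key idiom (helper names illustrative): the key replaces a plain flag so the hot accounting path costs only a patched-out branch when IRQ time accounting is disabled.

```c
#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);

/* Illustrative: hot paths test the key instead of loading a variable. */
static inline bool irqtime_enabled(void)
{
	return static_branch_likely(&sched_clock_irqtime);
}

static inline void enable_sched_clock_irqtime(void)
{
	static_branch_enable(&sched_clock_irqtime);
}
```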
2cf9ac4007 |
sched/fair: Encapsulate set custom slice in a __setparam_fair() function
Similarly to dl, create a __setparam_fair() function to set parameters related to fair class and move it in the fair.c file. No functional changes expected Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20250110144656.484601-1-vincent.guittot@linaro.org |
3a9910b590 |
sched_ext: Implement scx_bpf_now()
Returns a high-performance monotonically non-decreasing clock for the current CPU. The clock returned is in nanoseconds.

It provides the following properties:

1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently to account for execution time and track tasks' runtime properties. Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which eventually reads a hardware timestamp counter -- is neither performant nor scalable. scx_bpf_now() aims to provide a high-performance clock by using the rq clock in the scheduler core whenever possible.

2) High enough resolution for the BPF scheduler use cases: In most BPF scheduler use cases, the required clock resolution is lower than the most accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now() basically uses the rq clock in the scheduler core whenever it is valid. It considers that the rq clock is valid from the time the rq clock is updated (update_rq_clock) until the rq is unlocked (rq_unpin_lock).

3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now() guarantees the clock never goes backward when comparing them in the same CPU. On the other hand, when comparing clocks in different CPUs, there is no such guarantee -- the clock can go backward. It provides a monotonically *non-decreasing* clock so that it would provide the same clock values in two different scx_bpf_now() calls in the same CPU during the same period of when the rq clock is valid.

An rq clock becomes valid when it is updated using update_rq_clock() and invalidated when the rq is unlocked using rq_unpin_lock(). Let's suppose the following timeline in the scheduler core:

  T1. rq_lock(rq)
  T2. update_rq_clock(rq)
  T3. a sched_ext BPF operation
  T4. rq_unlock(rq)
  T5. a sched_ext BPF operation
  T6. rq_lock(rq)
  T7. update_rq_clock(rq)

For [T2, T4), we consider that rq clock is valid (SCX_RQ_CLK_VALID is set), so scx_bpf_now() calls during [T2, T4) (including T3) will return the rq clock updated at T2. For duration [T4, T7), when a BPF scheduler can still call scx_bpf_now() (T5), we consider the rq clock is invalid (SCX_RQ_CLK_VALID is unset at T4). So when calling scx_bpf_now() at T5, we will return a fresh clock value by calling sched_clock_cpu() internally. Also, to prevent getting outdated rq clocks from a previous scx scheduler, invalidate all the rq clocks when unloading a BPF scheduler.

One example of calling scx_bpf_now(), when the rq clock is invalid (like T5), is in scx_central [1]. The scx_central scheduler uses a BPF timer for preemptive scheduling. In every msec, the timer callback checks if the currently running tasks exceed their timeslice. At the beginning of the BPF timer callback (central_timerfn in scx_central.bpf.c), scx_central gets the current time. When the BPF timer callback runs, the rq clock could be invalid, the same as T5. In this case, scx_bpf_now() returns a fresh clock value rather than returning the old one (T2).

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_central.bpf.c

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
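A condensed sketch of the scx_central usage the changelog points to (abridged; the interval constant is illustrative): the BPF timer callback reads the clock once per invocation, and scx_bpf_now() hands back the cached rq clock when it is still valid, or a fresh sched_clock_cpu()-based value otherwise.

```c
#define TIMER_INTERVAL_NS	(1 * 1000 * 1000)	/* 1 msec, per the changelog */

/* Abridged from the scx_central pattern described above. */
static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
{
	u64 now = scx_bpf_now();	/* rq clock if valid, fresh clock if not */

	/* ... compare 'now' against each CPU's slice expiry and preempt ... */

	bpf_timer_start(timer, TIMER_INTERVAL_NS, 0);
	return 0;
}
```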
ea9b262627 |
sched_ext: Relocate scx_enabled() related code
scx_enabled() will be used in scx_rq_clock_update/invalidate() in the following patch, so relocate the scx_enabled() related code to the proper location. Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
b53127db1d |
sched/dlserver: Fix dlserver double enqueue
dlserver can get dequeued during a dlserver pick_task due to the delayed
dequeue feature and this can lead to issues with dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to double enqueue of
dlserver.
Double enqueue of dlserver could happen due to a couple of reasons:
Case 1
------
Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
__pick_next_task
pick_task_dl -> server_pick_task
pick_task_fair
pick_next_entity (if (sched_delayed))
dequeue_entities
dl_server_stop
server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.
Case 2
------
A race condition between a task dequeue on one cpu and the same task's enqueue
on this cpu by a remote cpu while the lock is released, causing a dlserver
double enqueue.
One cpu would be in the schedule() and releasing RQ-lock:
current->state = TASK_INTERRUPTIBLE();
schedule();
deactivate_task()
dl_stop_server();
pick_next_task()
pick_next_task_fair()
sched_balance_newidle()
rq_unlock(this_rq)
at which point another CPU can take our RQ-lock and do:
try_to_wake_up()
ttwu_queue()
rq_lock()
...
activate_task()
dl_server_start() --> first enqueue
wakeup_preempt() := check_preempt_wakeup_fair()
update_curr()
update_curr_task()
if (current->dl_server)
dl_server_update()
enqueue_dl_entity() --> second enqueue
This bug was not apparent as the enqueue in dl_server_start doesn't
usually happen because of the defer logic. But as a side effect of the
first case (dequeue during dlserver pick), dl_throttled and dl_yield will
be set and this causes the time accounting of dlserver to mess up and
then leads to an enqueue in dl_server_start.
Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dl_server_stop.
Fixes:
|
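A sketch of the fix described above (field name taken from the description, the enqueue/dequeue details elided into comments): an explicit "server active" flag makes the start/stop state unambiguous, so a second enqueue can be refused instead of being inferred from throttle/yield state.

```c
/* Illustrative shape of the fix: explicit liveness flag for the server. */
void dl_server_start(struct sched_dl_entity *dl_se)
{
	if (dl_se->dl_server_active)
		return;				/* already enqueued */
	dl_se->dl_server_active = 1;
	/* enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP); */
}

void dl_server_stop(struct sched_dl_entity *dl_se)
{
	/* dequeue_dl_entity(dl_se, DEQUEUE_SLEEP); */
	dl_se->dl_server_active = 0;
}
```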
736c55a02c |
sched/fair: Rename cfs_rq.nr_running into nr_queued
Rename cfs_rq.nr_running into cfs_rq.nr_queued which better reflects the reality as the value includes both the ready to run tasks and the delayed dequeue tasks. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-10-vincent.guittot@linaro.org |
43eef7c3a4 |
sched/fair: Remove unused cfs_rq.idle_nr_running
cfs_rq.idle_nr_running field is not used anywhere so we can remove the
useless associated computation. Last user went in commit
|
31898e7b87 |
sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
Use same naming convention as others starting with h_nr_* and rename idle_h_nr_running into h_nr_idle. The "running" is not correct anymore as it includes delayed dequeue tasks as well. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-8-vincent.guittot@linaro.org |
9216582b0b |
sched/fair: Removed unused cfs_rq.h_nr_delayed
h_nr_delayed is not used anymore. We now have:

- h_nr_runnable which tracks tasks ready to run
- h_nr_queued which tracks enqueued tasks either ready to run or delayed dequeue

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-7-vincent.guittot@linaro.org
1a49104496 |
sched/fair: Use the new cfs_rq.h_nr_runnable
Use the new h_nr_runnable that tracks only queued and runnable tasks in the statistics that are used to balance the system:

- PELT runnable_avg
- deciding if a group is overloaded or has spare capacity
- numa stats
- reduced capacity management
- load balance
- nohz kick

It should be noticed that the rq->nr_running still counts the delayed dequeued tasks as delayed dequeue is a fair feature that is meaningless at core level.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-6-vincent.guittot@linaro.org
c2a295bffe |
sched/fair: Add new cfs_rq.h_nr_runnable
With the delayed dequeue feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed. As a result, it also stays visible in the statistics that are used to balance the system, and in particular in the field cfs.h_nr_queued when the sched_entity is associated to a task.

Create a new h_nr_runnable that tracks only queued and runnable tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-5-vincent.guittot@linaro.org
7b8a702d94 |
sched/fair: Rename h_nr_running into h_nr_queued
With the delayed dequeue feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed but can't run. Rename h_nr_running into h_nr_queued to reflect this new behavior.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-4-vincent.guittot@linaro.org
40c3b94fbb |
Merge branch 'sched/urgent'
Sync with urgent bits as a base for further work. Signed-off-by: Peter Zijlstra <peterz@infradead.org> |
76f2f78329 |
sched/eevdf: More PELT vs DELAYED_DEQUEUE
Vincent and Dietmar noted that while commit |
d4742f6ed7 |
sched/deadline: Correctly account for allocated bandwidth during hotplug
For hotplug operations, DEADLINE needs to check that there is still enough bandwidth left after removing the CPU that is going offline. We however fail to do so currently. Restore the correct behavior by restructuring dl_bw_manage() a bit, so that overflow conditions (not enough bandwidth left) are properly checked. Also account for dl_server bandwidth, i.e. discount such bandwidth in the calculation since NORMAL tasks will be anyway moved away from the CPU as a result of the hotplug operation. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20241114142810.794657-3-juri.lelli@redhat.com |
59297e2093 |
sched: add READ_ONCE to task_on_rq_queued
task_on_rq_queued() reads p->on_rq without READ_ONCE, though p->on_rq is set with WRITE_ONCE in {activate|deactivate}_task and smp_store_release in __block_task, and also read with READ_ONCE in task_on_rq_migrating.

Make all of these accesses pair together by adding READ_ONCE in task_on_rq_queued.

Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20241114210812.1836587-1-jon@nutanix.com
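The resulting pairing, in miniature: the dequeue side publishes ->on_rq with WRITE_ONCE()/smp_store_release(), so the lockless read below uses READ_ONCE() to take part in the same access discipline.

```c
/* Reader side after the change: a marked load pairs with the marked
 * stores in {activate|deactivate}_task() and __block_task(). */
static inline int task_on_rq_queued(struct task_struct *p)
{
	return READ_ONCE(p->on_rq) == TASK_ON_RQ_QUEUED;
}
```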
3f020399e4 |
Scheduler changes for v6.13:
Merge tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

"Core facilities:

- Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes fair-class preemption by delaying preemption requests to the tick boundary, while working as full preemption for RR/FIFO/DEADLINE classes. (Peter Zijlstra)
- x86: Enable Lazy preemption (Peter Zijlstra)
- riscv: Enable Lazy preemption (Jisheng Zhang)
- Initialize idle tasks only once (Thomas Gleixner)
- sched/ext: Remove sched_fork() hack (Thomas Gleixner)

Fair scheduler:

- Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)

Idle loop:

- Optimize the generic idle loop by removing unnecessary memory barrier (Zhongqiu Han)

RSEQ:

- Improve cache locality of RSEQ concurrency IDs for intermittent workloads (Mathieu Desnoyers)

Waitqueues:

- Make wake_up_{bit,var} less fragile (Neil Brown)

PSI:

- Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner)

Preparatory patches for proxy execution:

- Add move_queued_task_locked helper (Connor O'Brien)
- Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
- Split out __schedule() deactivate task logic into a helper (John Stultz)
- Split scheduler and execution contexts (Peter Zijlstra)
- Make mutex::wait_lock irq safe (Juri Lelli)
- Expose __mutex_owner() (Juri Lelli)
- Remove wakeups from under mutex::wait_lock (Peter Zijlstra)

Misc fixes and cleanups:

- Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp)
- Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior)
- Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
- Remove the DOUBLE_TICK feature (Huang Shijie)
- Fix the comment for PREEMPT_SHORT (Huang Shijie)
- Fix unused variable warning (Christian Loehle)
- No PREEMPT_RT=y for all{yes,mod}config"

* tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.
  sched: No PREEMPT_RT=y for all{yes,mod}config
  riscv: add PREEMPT_LAZY support
  sched, x86: Enable Lazy preemption
  sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
  sched: Add Lazy preemption model
  sched: Add TIF_NEED_RESCHED_LAZY infrastructure
  sched/ext: Remove sched_fork() hack
  sched: Initialize idle tasks only once
  sched: psi: pass enqueue/dequeue flags to psi callbacks directly
  sched/uclamp: Fix unnused variable warning
  sched: Split scheduler and execution contexts
  sched: Split out __schedule() deactivate task logic into a helper
  sched: Consolidate pick_*_task to task_is_pushable helper
  sched: Add move_queued_task_locked helper
  locking/mutex: Expose __mutex_owner()
  locking/mutex: Make mutex::wait_lock irq safe
  locking/mutex: Remove wakeups from under mutex::wait_lock
  sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
  sched: idle: Optimize the generic idle loop by removing needless memory barrier
  ...
3022e9d00e |
sched_ext: Fixes for v6.12-rc7
Merge tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- The fair sched class currently has a bug where its balance() returns true telling the sched core that it has tasks to run but then NULL from pick_task(). This makes sched core call sched_ext's pick_task() without preceding balance() which can lead to stalls in partial mode. For now, work around by detecting the condition and forcing the CPU to go through another scheduling cycle.

- Add a missing newline to an error message and fix drgn introspection tool which went out of sync.

* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
  sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
  sched_ext: Add a missing newline at the end of an error message
a6250aa251 |
sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
sched_ext dispatches tasks from the BPF scheduler from balance_scx() and thus every pick_task_scx() call must be preceded by balance_scx(). While this usually holds, due to a bug, there are cases where the fair class's balance() returns true indicating that it has tasks to run on the CPU and thus terminating balance() calls but fails to actually find the next task to run when pick_task() is called. In such cases, pick_task_scx() can be called without preceding balance_scx(). Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep running the previous task if possible and avoid stalling from entering idle without balancing. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/Ztj_h5c2LYsdXYbA@slm.duckdns.org |
||
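For readers tracing the workaround above, here is a condensed sketch of the idea in C. It is not the verbatim kernel code: the SCX_RQ_BAL_PENDING flag name comes straight from the changelog, but the surrounding structure and the first_local_task() fallback are simplified for illustration.

```c
/*
 * Simplified sketch of the pick_task_scx() workaround (not verbatim kernel
 * code): if balance_scx() has not run for this pick, prefer to keep the
 * previously running sched_ext task instead of stalling into idle.
 */
static struct task_struct *pick_task_scx(struct rq *rq)
{
	struct task_struct *prev = rq->curr;
	bool prev_on_scx = prev->sched_class == &ext_sched_class;
	bool keep_prev = false;

	/*
	 * Normally balance_scx() has already dispatched tasks to the local
	 * DSQ.  If the fair class claimed runnable tasks but delivered none,
	 * we get here with balancing still pending.
	 */
	if (unlikely(rq->scx.flags & SCX_RQ_BAL_PENDING) && prev_on_scx)
		keep_prev = true;

	if (keep_prev && prev->scx.slice)
		return prev;		/* run @prev for another slice */

	return first_local_task(rq);	/* normal path once balancing ran */
}
```

The merged function also handles the case where the previous task is not on sched_ext, kicking the CPU into another scheduling cycle instead; that part is omitted here.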
![]() |
7c70cb94d2 |
sched: Add Lazy preemption model
Change fair to use resched_curr_lazy(), which, when the lazy preemption model is selected, will set TIF_NEED_RESCHED_LAZY.

This LAZY bit will be promoted to the full NEED_RESCHED bit on tick. As such, the average delay between setting LAZY and actually rescheduling will be TICK_NSEC/2.

In short, Lazy preemption will delay preemption for fair class but will function as Full preemption for all the other classes, most notably the realtime (RR/FIFO/DEADLINE) classes.

The goal is to bridge the performance gap with Voluntary, such that we might eventually remove that option entirely.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.331243614@infradead.org |
||
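A minimal sketch of the mechanism described above, under a few stated simplifications: the real kernel routes this through __resched_curr() and the preemption-model static keys, and the LAZY-to-NEED_RESCHED promotion lives in the tick handler rather than in a dedicated function, so sched_tick_promote_lazy() below is a hypothetical name used only for illustration.

```c
/*
 * Illustrative sketch, not the real implementation: under the Lazy model
 * the fair class only sets TIF_NEED_RESCHED_LAZY; the tick later promotes
 * it to a full NEED_RESCHED, bounding the extra delay by roughly one tick.
 */
static void resched_curr_lazy(struct rq *rq)
{
	if (preempt_model_lazy())
		set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
	else
		resched_curr(rq);	/* full preemption: reschedule now */
}

/* Hypothetical helper standing in for the promotion done in the tick path. */
static void sched_tick_promote_lazy(struct rq *rq)
{
	if (test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
		resched_curr(rq);
}
```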
![]() |
5db91545ef |
sched: Pass correct scheduling policy to __setscheduler_class
Commit |
||
![]() |
1a6151017e |
sched: psi: pass enqueue/dequeue flags to psi callbacks directly
What psi needs to do on each enqueue and dequeue has gotten more subtle, and the generic sched code trying to distill this into a bool for the callbacks is awkward.

Pass the flags directly and let psi parse them. For that to work, the #include "stats.h" (which has the psi callback implementations) needs to be below the flag definitions in "sched.h". Move that section further down, next to some of the other accounting stuff.

This also puts the ENQUEUE_SAVE/RESTORE branch behind the psi jump label, slightly reducing overhead when PSI=y but runtime disabled.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241014144358.GB1021@cmpxchg.org |
||
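As a rough illustration of what "pass the flags directly and let psi parse them" looks like, here is a condensed psi_enqueue(). The flag and state names are real, but the body is simplified relative to the actual kernel/sched/stats.h implementation and omits memstall, migration, and sleep accounting.

```c
/*
 * Condensed sketch of a flag-parsing psi_enqueue(); the real version
 * handles more cases (memstall, migration, delayed dequeue).
 */
static inline void psi_enqueue(struct task_struct *p, int flags)
{
	int clear = 0, set = TSK_RUNNING;

	if (static_branch_likely(&psi_disabled))
		return;

	/* SAVE/RESTORE pairs (e.g. priority changes) are not real transitions. */
	if (flags & ENQUEUE_RESTORE)
		return;

	/* A wakeup ends any iowait sleep the task was in. */
	if ((flags & ENQUEUE_WAKEUP) && p->in_iowait)
		clear |= TSK_IOWAIT;

	psi_task_change(p, clear, set);
}
```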
![]() |
b55945c500 |
sched: Fix pick_next_task_fair() vs try_to_wake_up() race
Syzkaller robot reported KCSAN tripping over the
ASSERT_EXCLUSIVE_WRITER(p->on_rq) in __block_task().
The report noted that both pick_next_task_fair() and try_to_wake_up()
were concurrently trying to write to the same p->on_rq, violating the
assertion -- even though both paths hold rq->__lock.
The logical consequence is that both code paths end up holding a
different rq->__lock. And looking through ttwu(), this is possible
when the __block_task() 'p->on_rq = 0' store is visible to the ttwu()
'p->on_rq' load, which then assumes the task is not queued and
continues to migrate it.
Rearrange things such that __block_task() releases @p with the store
and no code thereafter will use @p again.
Fixes:
|
||
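The ordering the fix above establishes can be summarized with a sketch of __block_task(). This is simplified and close to, but not guaranteed to match, the merged code: everything the blocking side does with @p happens before the final p->on_rq store, which acts as the release point after which a concurrent try_to_wake_up() may migrate the task.

```c
static inline void __block_task(struct rq *rq, struct task_struct *p)
{
	if (p->sched_contributes_to_load)
		rq->nr_uninterruptible++;

	if (p->in_iowait) {
		atomic_inc(&rq->nr_iowait);
		delayacct_blkio_start();
	}

	ASSERT_EXCLUSIVE_WRITER(p->on_rq);

	/*
	 * Release point: the moment this store is visible, try_to_wake_up()
	 * can swoop in and migrate @p, so nothing after it may touch @p.
	 */
	smp_store_release(&p->on_rq, 0);
}
```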
![]() |
d1fb8a78b2 |
Linux 6.12-rc4
-----BEGIN PGP SIGNATURE-----

iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmcVgfoeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhCYH/0Sdfp3cIq3JWLRv
HCkWhPkPbEvR5XQlYQsAvTPVrEc0ZG9PKlXCaYaa8Tvt8xQ7WT/VDTjKgaWEhr8s
qa6bNTx1zggiNBTP/3jYsNliOyAYfw5qjxA7fpEmueAeuT5y1XKZFKPHEXE/1qbR
8zeISKTkE0qwUmLqCdXe2qBWFnCC5i+78RcI6IN7uErnuNWk7ssapldgU4DB+dEl
DDRxi1FTvARGPQGl8T+jPkfJiugv87ksG7l4WsqcYgoW+045K76C7I6vQjkDOrsd
wqtPIow/yPmGQbbdRhWLxNU+wDmselYQ6xp7aMxppNF45HoHtzNm+X+T2ZU3bPoP
iT2Mkbg=
=+GXK
-----END PGP SIGNATURE-----

Merge tag 'v6.12-rc4' into sched/core, to resolve conflict

Overlapping fixes solving the same bug slightly differently: |
||
![]() |
be602cde65 |
Merge branch 'linus' into sched/urgent, to resolve conflict
Conflicts:
	kernel/sched/ext.c

There's a context conflict between this upstream commit: |
||
![]() |
af0c8b2bf6 |
sched: Split scheduler and execution contexts
Let's define the "scheduling context" as all the scheduler state in task_struct for the task chosen to run, which we'll call the donor task, and the "execution context" as all state required to actually run the task.

Currently both are intertwined in task_struct. We want to logically split these such that we can use the scheduling context of the donor task selected to be scheduled, but use the execution context of a different task to actually be run.

To this purpose, introduce the rq->donor field to point to the task_struct chosen from the runqueue by the scheduler; it will be used for scheduler state, while rq->curr is preserved to indicate the execution context of the task that will actually be run.

This patch introduces the donor field as a union with curr, so it doesn't cause the contexts to be split yet, but adds the logic to handle everything separately.

[add additional comments and update more sched_class code to use rq::proxy]
[jstultz: Rebased and resolved minor collisions, reworked to use accessors, tweaked update_curr_common to use rq_proxy fixing rt scheduling issues]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-8-jstultz@google.com |
||
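A rough picture of the union described above, with struct rq heavily elided. The field names come from the changelog; the rq_set_donor() accessor reflects the accessor approach mentioned in the rebase note and may differ from the merged code.

```c
struct rq {
	/* ... many fields elided ... */
	union {
		struct task_struct __rcu *donor;	/* scheduling context */
		struct task_struct __rcu *curr;		/* execution context  */
	};
	/* ... */
};

/*
 * While donor and curr alias each other through the union, selecting the
 * donor is implicit in assigning rq->curr, so the accessor can be a no-op
 * until the contexts are actually split.
 */
static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
{
	/* Do nothing: rq->donor aliases rq->curr for now. */
}
```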
![]() |
18adad1dac |
sched: Consolidate pick_*_task to task_is_pushable helper
This patch consolidates the rt and deadline pick_*_task functions into a task_is_pushable() helper.

This patch was broken out from a larger chain migration patch originally by Connor O'Brien.

[jstultz: split out from larger chain migration patch, renamed helper function]

Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-6-jstultz@google.com |
||
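For reference, a sketch of the consolidated helper; this is close to the shape the changelog describes, but should be read as illustrative rather than the exact merged code. The rt and deadline push paths can then both call it instead of duplicating the on-CPU and affinity checks.

```c
/*
 * Sketch of the consolidated helper: a task can be pushed to @cpu iff it
 * is not currently running, @cpu is allowed by its affinity mask, and the
 * task is permitted to run on more than one CPU.
 */
static inline bool task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
{
	if (!task_on_cpu(rq, p) &&
	    cpumask_test_cpu(cpu, &p->cpus_mask) &&
	    p->nr_cpus_allowed > 1)
		return true;

	return false;
}
```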
![]() |
2b05a0b4c0 |
sched: Add move_queued_task_locked helper
Switch logic that deactivates, sets the task cpu, and reactivates a task on a different rq to use a helper that will be later extended to push entire blocked task chains.

This patch was broken out from a larger chain migration patch originally by Connor O'Brien.

[jstultz: split out from larger chain migration patch]

Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-5-jstultz@google.com |
||
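A sketch of what such a helper looks like, assuming both runqueue locks are already held by the caller; it is simplified relative to the merged code but shows the deactivate / set-cpu / reactivate sequence the changelog refers to.

```c
/*
 * Move a queued task from @src_rq to @dst_rq.  Both rq locks must be held;
 * the helper replaces the open-coded sequence at each migration call site.
 */
static inline void
move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq,
			struct task_struct *task)
{
	lockdep_assert_rq_held(src_rq);
	lockdep_assert_rq_held(dst_rq);

	deactivate_task(src_rq, task, 0);
	set_task_cpu(task, cpu_of(dst_rq));
	activate_task(dst_rq, task, 0);
}
```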
![]() |
7e019dcc47 |
sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
commit
|