2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

253 Commits

Author SHA1 Message Date
Linus Torvalds
90b83efa67 bpf-next-6.16
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmg3NqgACgkQ6rmadz2v
 bTpNUQ/8DPeYtn3nskpsP2OwFy6O3hhfCe6gjOAmUVSk000xbG+AcI/h1DnGZWgk
 xlVcEs93ekzUzHd7k1+RJ2c5yDLXieLJAtb66rbFU1enkxs2cWlcWSKE6K/gaoh3
 G1BCARVlKwtrJhrVrsXtYP/eGZxKRSUZFK7xhtCk7lp7sRI3xkTLE+FJBcDkTJ6W
 HwF14i3zO+BkqNGdFwwlASCCqRItSNBBiM3KjW1DbETOTfAKlvCTrcgdUiODqxhF
 PNnULW+xmICABDFlKfDMlUAGNlSHKjiI3+g31LdblA5eyEhIqiCRgBGFYoCnsluk
 qUauRSie61KqC7fxN3qVpC3bXJfD1td7uIvoqSkDLtTv8a5+HAoiohzi1qBzCayl
 LAGkBYewAfDtdDDjNY38JLH2RCdyY6zG9DhqghPHdPlM7zj7L5zZgj34igEwesMM
 mfj9TuFFF99yfX5UUeSxKpDGR1eO4Ew0p7tg8CRs8Fqh6AIQSmboREZrsncVRCTS
 4SDHSI4KcO4LO2pEKzy+X4dewganN7aESnQG34iG0liyvDDwJOgUnDWLRwPLas7k
 3b/zIfBLxOJpA5R+0hhAMtjMA4NgyKJf4yFZwEieuasQjvzwTApi24YhZ/b3HSEB
 2Dp8kHEEbwezv0OFFz/fJ88dNQnrDmtJ+QByN/liA8kj4Yuh2+Q=
 =j3t8
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Fix and improve BTF deduplication of identical BTF types (Alan
   Maguire and Andrii Nakryiko)

 - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and
   Alexis Lothoré)

 - Support load-acquire and store-release instructions in BPF JIT on
   riscv64 (Andrea Parri)

 - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton
   Protopopov)

 - Streamline allowed helpers across program types (Feng Yang)

 - Support atomic update for hashtab of BPF maps (Hou Tao)

 - Implement json output for BPF helpers (Ihor Solodrai)

 - Several s390 JIT fixes (Ilya Leoshkevich)

 - Various sockmap fixes (Jiayuan Chen)

 - Support mmap of vmlinux BTF data (Lorenz Bauer)

 - Support BPF rbtree traversal and list peeking (Martin KaFai Lau)

 - Tests for sockmap/sockhash redirection (Michal Luczaj)

 - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)

 - Add support for dma-buf iterators in BPF (T.J. Mercier)

 - The verifier support for __bpf_trap() (Yonghong Song)

* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
  bpf, arm64: Remove unused-but-set function and variable.
  selftests/bpf: Add tests with stack ptr register in conditional jmp
  bpf: Do not include stack ptr register in precision backtracking bookkeeping
  selftests/bpf: enable many-args tests for arm64
  bpf, arm64: Support up to 12 function arguments
  bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
  bpf: Avoid __bpf_prog_ret0_warn when jit fails
  bpftool: Add support for custom BTF path in prog load/loadall
  selftests/bpf: Add unit tests with __bpf_trap() kfunc
  bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
  bpf: Remove special_kfunc_set from verifier
  selftests/bpf: Add test for open coded dmabuf_iter
  selftests/bpf: Add test for dmabuf_iter
  bpf: Add open coded dmabuf iterator
  bpf: Add dmabuf iterator
  dma-buf: Rename debugfs symbols
  bpf: Fix error return value in bpf_copy_from_user_dynptr
  libbpf: Use mmap to parse vmlinux BTF from sysfs
  selftests: bpf: Add a test for mmapable vmlinux BTF
  btf: Allow mmap of vmlinux btf
  ...
2025-05-28 15:52:42 -07:00
Andrea Righi
617a77018f sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c
Relocate the scx_kf_allowed_if_unlocked(), so it can be used from other
source files (e.g., ext_idle.c).

No functional change.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-20 10:23:50 -10:00
Tejun Heo
cb4ff91492 sched_ext: Explain the temporary situation around scx_root dereferences
Naked scx_root dereferences are being used as temporary markers to indicate
that they need to be updated to point to the right scheduler instance.
Explain the situation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-05-14 12:37:18 -04:00
Tejun Heo
a8433f7a26 sched_ext: Add @sch to SCX_CALL_OP*()
In preparation of hierarchical scheduling support, add @sch to scx_exit()
and friends:

- scx_exit/error() updated to take explicit @sch instead of assuming
  scx_root.

- scx_kf_exit/error() added. These are to be used from kfuncs, don't take
  @sch and internally determine the scx_sched instance to abort. Currently,
  it's always scx_root but once multiple scheduler support is in place, it
  will be the scx_sched instance that invoked the kfunc. This simplifies
  many callsites and defers scx_sched lookup until error is triggered.

- @sch is propagated to ops_cpu_valid() and ops_sanitize_err(). The CPU
  validity conditions in ops_cpu_valid() are factored into __cpu_valid() to
  implement kf_cpu_valid() which is the counterpart to scx_kf_exit/error().

- All users are converted. Most conversions are straightforward.
  check_rq_for_timeouts() and scx_softlockup() are updated to use explicit
  rcu_dereference*(scx_root) for safety as they may execute asynchronous to
  the exit path. scx_tick() is also updated to use rcu_dereference(). While
  not strictly necessary due to the preceding scx_enabled() test and IRQ
  disabled context, this removes the subtlety at no noticeable cost.

No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14 11:11:48 -04:00
Tejun Heo
c4c286d747 sched_ext: Cleanup [__]scx_exit/error*()
__scx_exit() is the base exit implementation and there are three wrappers on
top of it - scx_exit(), __scx_error() and scx_error(). This is more
confusing than helpful especially given that there are only a couple users
of scx_exit() and __scx_error(). To simplify the situation:

- Make __scx_exit() take va_list and rename it to scx_vexit(). This is to
  ease implementing more complex extensions on top.

- Make scx_exit() a varargs wrapper around __scx_exit(). scx_exit() now
  takes both @kind and @exit_code.

- Convert existing scx_exit() and __scx_error() users to use the new
  scx_exit().

- scx_error() remains unchanged.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14 11:11:48 -04:00
Tejun Heo
ab3f497ac1 sched_ext: Add @sch to SCX_CALL_OP*()
In preparation of hierarchical scheduling support, make SCX_CALL_OP*() take
explicit @sch instead of assuming scx_root. As scx_root is still the only
scheduler instance, this patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14 11:11:48 -04:00
Tejun Heo
d310fb4009 sched_ext: Clean up scx_root usages
- Always cache scx_root into local variable sch before using.

- Don't use scx_root if cached sch is available.

- Wrap !sch test with unlikely().

- Pass @scx into scx_cgroup_init/exit().

No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14 11:11:48 -04:00
Feng Yang
8c112a428b sched_ext: Remove bpf_scx_get_func_proto
task_storage_{get,delete} has been moved to bpf_base_func_proto.

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20250506061434.94277-3-yangfeng59949@163.com
2025-05-09 11:01:45 -07:00
Tejun Heo
9b30400ff6 Merge branch 'for-6.15-fixes' into for-6.16
To receive 428dc9fc08 ("sched_ext: bpf_iter_scx_dsq_new() should always
initialize iterator") which conflicts with cdf5a6faa8 ("sched_ext: Move
dsq_hash into scx_sched"). The conflict is a simple context conflict which
can be resolved by taking changes from both changes in the right order.
2025-05-07 06:25:39 -10:00
Tejun Heo
428dc9fc08 sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator
BPF programs may call next() and destroy() on BPF iterators even after new()
returns an error value (e.g. bpf_for_each() macro ignores error returns from
new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized
state after an error return causing bpf_iter_scx_dsq_next() to dereference
garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that
next() and destroy() become noops.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 650ba21b13 ("sched_ext: Implement DSQ iterator")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-05-07 06:24:07 -10:00
Andrea Righi
c8fafb3485 sched_ext: Avoid NULL scx_root deref in __scx_exit()
A sched_ext scheduler may trigger __scx_exit() from a BPF timer
callback, where scx_root may not be safely dereferenced.

This can lead to a NULL pointer dereference as shown below (triggered by
scx_tickless):

 BUG: kernel NULL pointer dereference, address: 0000000000000330
...
 CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full)
 RIP: 0010:__scx_exit+0x2b/0x190
...
 Call Trace:
  <IRQ>
  scx_bpf_get_idle_smtmask+0x59/0x80
  bpf_prog_8320d4217989178c_dispatch_all_cpus+0x35/0x1b6
...
  bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264
  bpf_timer_cb+0x7a/0x140
  __hrtimer_run_queues+0x1f9/0x3a0
  hrtimer_run_softirq+0x8c/0xd0
  handle_softirqs+0xd3/0x3d0
  __irq_exit_rcu+0x9a/0xc0
  irq_exit_rcu+0xe/0x20

Fix this by checking for a valid scx_root and adding proper RCU
protection.

Fixes: 48e1267773 ("sched_ext: Introduce scx_sched")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-30 12:47:17 -10:00
Andrea Righi
c01adf4097 sched_ext: Add RCU protection to scx_root in DSQ iterator
Using a DSQ iterators from a timer callback can trigger the following
lockdep splat when accessing scx_root:

 =============================
 WARNING: suspicious RCU usage
 6.14.0-virtme #1 Not tainted
 -----------------------------
 kernel/sched/ext.c:6907 suspicious rcu_dereference_check() usage!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 no locks held by swapper/0/0.

 stack backtrace:
 CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full)
 Sched_ext: tickless (enabled+all)
 Call Trace:
 <IRQ>
 dump_stack_lvl+0x6f/0xb0
 lockdep_rcu_suspicious.cold+0x4e/0xa3
 bpf_iter_scx_dsq_new+0xb1/0xd0
 bpf_prog_63f4fd1bccc101e7_dispatch_cpu+0x3e/0x156
 bpf_prog_8320d4217989178c_dispatch_all_cpus+0x153/0x1b6
 bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264
 ? hrtimer_run_softirq+0x4f/0xd0
 bpf_timer_cb+0x7a/0x140
 __hrtimer_run_queues+0x1f9/0x3a0
 hrtimer_run_softirq+0x8c/0xd0
 handle_softirqs+0xd3/0x3d0
 __irq_exit_rcu+0x9a/0xc0
 irq_exit_rcu+0xe/0x20
 sysvec_apic_timer_interrupt+0x73/0x80

Add a proper dereference check to explicitly validate RCU-safe access to
scx_root from rcu_read_lock() contexts and also from contexts that hold
rcu_read_lock_bh(), such as timer callbacks.

Fixes: cdf5a6faa8 ("sched_ext: Move dsq_hash into scx_sched")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-30 12:39:50 -10:00
Tejun Heo
9ba7f37e5b sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn()
With the global states and disable machinery moved into scx_sched,
scx_disable_workfn() can only be scheduled and run for the specific
scheduler instance. This makes it impossible for scx_disable_workfn() to see
SCX_EXIT_NONE. Turn that condition into WARN_ON_ONCE().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:11 -10:00
Tejun Heo
bff3b5aec1 sched_ext: Move disable machinery into scx_sched
Because disable can be triggered from any place and the scheduler cannot be
trusted, disable path uses an irq_work to bounce and a kthread_work which is
executed on an RT helper kthread to perform disable. These must be per
scheduler instance to guarantee forward progress. Move them into scx_sched.

- If an scx_sched is accessible, its helper kthread is always valid making
  the `helper` check in schedule_scx_disable_work() unnecessary. As the
  function becomes trivial after the removal of the test, inline it.

- scx_create_rt_helper() has only one user - creation of the disable helper
  kthread. Inline it into scx_alloc_and_add_sched().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
c201ea1578 sched_ext: Move event_stats_cpu into scx_sched
The event counters are going to become per scheduler instance. Move
event_stats_cpu into scx_sched.

- [__]scx_add_event() are updated to take @sch. While at it, add missing
  parentheses around @cnt expansions.

- scx_read_events() is updated to take @sch.

- scx_bpf_events() accesses scx_root under RCU read lock.

v2: - Replace stale scx_bpf_get_event_stat() reference in a comment with
      scx_bpf_events().

    - Trivial goto label rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
3a8facc424 sched_ext: Factor out scx_read_events()
In prepration of moving event_stats_cpu into scx_sched, factor out
scx_read_events() out of scx_bpf_events() and update the in-kernel users -
scx_attr_events_show() and scx_dump_state() - to use scx_read_events()
instead of scx_bpf_events(). No observable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
f97a79156a sched_ext: Relocate scx_event_stats definition
In prepration of moving event_stats_cpu into scx_sched, move scx_event_stats
definitions above scx_sched definition. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
8409b800a0 sched_ext: Move global_dsqs into scx_sched
Global DSQs are going to become per scheduler instance. Move global_dsqs
into scx_sched. find_global_dsq() already takes a task_struct pointer as an
argument and should later be able to determine the scx_sched to use from
that. For now, assume scx_root.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
cdf5a6faa8 sched_ext: Move dsq_hash into scx_sched
User DSQs are going to become per scheduler instance. Move dsq_hash into
scx_sched. This shifts the code that assumes scx_root to be the only
scx_sched instance up the call stack but doesn't remove them yet.

v2: Add missing rcu_read_lock() in scx_bpf_destroy_dsq() as per Andrea.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
d9f7546310 sched_ext: Factor out scx_alloc_and_add_sched()
More will be moved into scx_sched. Factor out the allocation and kobject
addition path into scx_alloc_and_add_sched().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
392b7e08de sched_ext: Inline create_dsq() into scx_bpf_create_dsq()
create_dsq() is only used by scx_bpf_create_dsq() and the separation gets in
the way of making dsq_hash per scx_sched. Inline it into
scx_bpf_create_dsq(). While at it, add unlikely() around
SCX_DSQ_FLAG_BUILTIN test.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
17108735b4 sched_ext: Use dynamic allocation for scx_sched
To prepare for supporting multiple schedulers, make scx_sched allocated
dynamically. scx_sched->kobj is now an embedded field and the kobj's
lifetime determines the lifetime of the containing scx_sched.

- Enable path is updated so that kobj init and addition are performed later.

- scx_sched freeing is initiated in scx_kobj_release() and also goes through
  an rcu_work so that scx_root can be accessed from an unsynchronized path -
  scx_disable().

- sched_ext_ops->priv is added and used to point to scx_sched instance
  created for the ops instance. This is used by bpf_scx_unreg() to determine
  the scx_sched instance to disable and put.

No behavior changes intended.

v2: Andrea reported kernel oops due to scx_bpf_unreg() trying to deref NULL
    scx_root after scheduler init failure. sched_ext_ops->priv added so that
    scx_bpf_unreg() can always find the scx_sched instance to unregister
    even if it failed early during init.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
a77d10d032 sched_ext: Avoid NULL scx_root deref through SCX_HAS_OP()
SCX_HAS_OP() tests scx_root->has_op bitmap. The bitmap is currently in a
statically allocated struct scx_sched and initialized while loading the BPF
scheduler and cleared while unloading, and thus can be tested anytime.
However, scx_root will be switched to dynamic allocation and thus won't
always be deferenceable.

Most usages of SCX_HAS_OP() are already protected by scx_enabled() either
directly or indirectly (e.g. through a task which is on SCX). However, there
are a couple places that could try to dereference NULL scx_root. Update them
so that scx_root is guaranteed to be valid before SCX_HAS_OP() is called.

- In handle_hotplug(), test whether scx_root is NULL before doing anything
  else. This is safe because scx_root updates will be protected by
  cpus_read_lock().

- In scx_tg_offline(), test scx_cgroup_enabled before invoking SCX_HAS_OP(),
  which should guarnatee that scx_root won't turn NULL. This is also in line
  with other cgroup operations. As the code path is synchronized against
  scx_cgroup_init/exit() through scx_cgroup_rwsem, this shouldn't cause any
  behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
48e1267773 sched_ext: Introduce scx_sched
To support multiple scheduler instances, collect some of the global
variables that should be specific to a scheduler instance into the new
struct scx_sched. scx_root is the root scheduler instance and points to a
static instance of struct scx_sched. Except for an extra dereference through
the scx_root pointer, this patch makes no functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
ce565f839c Merge branch 'for-6.15-fixes' into for-6.16
To receive e38be1c764 ("sched_ext: Fix rq lock state in hotplug ops") to
avoid conflicts with scx_sched related patches pending for for-6.16.
2025-04-29 08:24:58 -10:00
Andrea Righi
e38be1c764 sched_ext: Fix rq lock state in hotplug ops
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume
that the rq involved in the operation is locked, which is not the case
during hotplug, triggering the following warning:

  WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340

Fix by not tracking the target rq as locked in the context of
ops.cpu_online() and ops.cpu_offline().

Fixes: 18853ba782 ("sched_ext: Track currently locked rq")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-29 08:01:32 -10:00
Andrea Righi
e7dcd1304b sched_ext: Remove duplicate BTF_ID_FLAGS definitions
Some kfuncs specific to the idle CPU selection policy are registered in
both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though
they should only be defined in the latter.

Remove the duplicates from scx_kfunc_ids_any.

Fixes: 337d1b354a ("sched_ext: Move built-in idle CPU selection policy to a separate file")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-25 08:41:02 -10:00
Andrea Righi
069ac9e161 sched_ext: Clarify CPU context for running/stopping callbacks
The ops.running() and ops.stopping() callbacks can be invoked from a CPU
other than the one the task is assigned to, particularly when a task
property is changed, as both scx_next_task_scx() and dequeue_task_scx() may
run on CPUs different from the task's target CPU.

This behavior can lead to confusion or incorrect assumptions if not
properly clarified, potentially resulting in bugs (see [1]).

Therefore, update the documentation to clarify this aspect and advise
users to use scx_bpf_task_cpu() to determine the actual CPU the task
will run on or was running on.

[1] https://github.com/sched-ext/scx/pull/1728

Cc: Jake Hillion <jake@hillion.co.uk>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-23 13:34:32 -10:00
Tejun Heo
ac47c272b2 Merge branch 'for-6.15-fixes' into for-6.16
a11d6784d7 ("sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()")
added a call to scx_ops_error() which was renamed to scx_error() in
for-6.16. Fix it up.
2025-04-22 09:29:44 -10:00
Andrea Righi
a11d6784d7 sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
scx_bpf_cpuperf_set() can be used to set a performance target level on
any CPU. However, it doesn't correctly acquire the corresponding rq
lock, which may lead to unsafe behavior and trigger the following
warning, due to the lockdep_assert_rq_held() check:

[   51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
...
[   51.713836] Call trace:
[   51.713837]  scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
[   51.713839]  bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
[   51.713841]  bpf__sched_ext_ops_init+0x54/0x8c
[   51.713843]  scx_ops_enable.constprop.0+0x2c0/0x10f0
[   51.713845]  bpf_scx_reg+0x18/0x30
[   51.713847]  bpf_struct_ops_link_create+0x154/0x1b0
[   51.713849]  __sys_bpf+0x1934/0x22a0

Fix by properly acquiring the rq lock when possible or raising an error
if we try to operate on a CPU that is not the one currently locked.

Fixes: d86adb4fc0 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22 09:28:12 -10:00
Andrea Righi
18853ba782 sched_ext: Track currently locked rq
Some kfuncs provided by sched_ext may need to operate on a struct rq,
but they can be invoked from various contexts, specifically, different
scx callbacks.

While some of these callbacks are invoked with a particular rq already
locked, others are not. This makes it impossible for a kfunc to reliably
determine whether it's safe to access a given rq, triggering potential
bugs or unsafe behaviors, see for example [1].

To address this, track the currently locked rq whenever a sched_ext
callback is invoked via SCX_CALL_OP*().

This allows kfuncs that need to operate on an arbitrary rq to retrieve
the currently locked one and apply the appropriate action as needed.

[1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22 09:27:50 -10:00
Honglei Wang
69120f8228 sched_ext: add helper for refill task with default slice
Add helper for refilling task with default slice and event
statistics accordingly.

Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-18 17:27:00 -10:00
Honglei Wang
f203683c3e sched_ext: change the variable name for slice refill event
SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs
when the tasks were enqueued, which seems not accurate. What it actually
means is the refilling with defalt slice, and this can occur either when
enqueue or pick_task. Let's change the variable to
SCX_EV_REFILL_SLICE_DFL.

Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-18 17:25:39 -10:00
Tejun Heo
0b30461793 sched_ext: Make scx_has_op a bitmap
scx_has_op is used to encode which ops are implemented by the BPF scheduler
into an array of static_keys. While this saves a bit of branching overhead,
that is unlikely to be noticeable compared to the overall cost. As the
global static_keys can't work with the planned hierarchical multiple
scheduler support, replace the static_key array with a bitmap.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:06:00 -10:00
Tejun Heo
743354e3bb sched_ext: Remove scx_ops_allow_queued_wakeup static_key
scx_ops_allow_queued_wakeup is used to encode SCX_OPS_ALLOW_QUEUED_WAKEUP
into a static_key. The test is gated behind scx_enabled(), and, even when
sched_ext is enabled, is unlikely for the static_key usage to make any
meaningful difference. It is made to use a static_key mostly because there
was no reason not to. However, global static_keys can't work with the
planned hierarchical multiple scheduler support. Remove the static_key and
instead test SCX_OPS_ALLOW_QUEUED_WAKEUP directly.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:06:00 -10:00
Tejun Heo
54d2e717bc sched_ext: Remove scx_ops_cpu_preempt static_key
scx_ops_cpu_preempt is used to encode whether ops.cpu_acquire/release() are
implemented into a static_key. These tests aren't hot enough for static_key
usage to make any meaningful difference and are made to use a static_key
mostly because there was no reason not to. However, global static_keys can't
work with the planned hierarchical multiple scheduler support. Remove the
static_key and instead use an internal ops flag SCX_OPS_HAS_CPU_PREEMPT to
record and test whether ops.cpu_acquire/release() are implemented.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:06:00 -10:00
Tejun Heo
cc39454c34 sched_ext: Remove scx_ops_enq_* static_keys
scx_ops_enq_last/exiting/migration_disabled are used to encode the
corresponding SCX_OPS_ flags into static_keys. These flags aren't hot enough
for static_key usage to make any meaningful difference and are made
static_keys mostly because there was no reason not to. However, global
static_keys can't work with the planned hierarchical multiple scheduler
support. Remove the static_keys and test the ops flags directly.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:05:59 -10:00
Tejun Heo
d75ee2d678 sched_ext: Indentation updates
Purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:05:59 -10:00
Tejun Heo
294f5ff474 sched_ext: Merge branch 'for-6.15-fixes' into for-6.16
Pull for-6.15-fixes to receive:

 e776b26e37 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings")

which conflicts with:

 1a7ff7216c ("sched_ext: Drop "ops" from scx_ops_enable_state and friends")

The former removes code updated by the latter. Resolved by removing the
updated section.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08 08:56:57 -10:00
Tejun Heo
bc08b15b54 sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation
SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup
weight support warnings. Now that the warnings are removed, the flag doesn't
do anything. Mark it for deprecation and remove its usage from scx_flatcg.

v2: Actually include the scx_flatcg update.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-04-08 08:53:52 -10:00
Tejun Heo
e776b26e37 sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings
sched_ext generates warnings when cpu.weight / cpu.idle are set to
non-default values if the BPF scheduler doesn't implement weight support.
These warnings don't provide much value while adding constant annoyance. A
BPF scheduler may not implement any particular behavior and there's nothing
particularly special about missing cgroup weight support. Drop the warnings.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08 08:00:49 -10:00
Breno Leitao
47068309b5 sched_ext: Use kvzalloc for large exit_dump allocation
Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which
can require large contiguous memory depending on the implementation.
This change prevents allocation failures by allowing the system to fall
back to vmalloc when contiguous memory allocation fails.

Since this buffer is only used for debugging purposes, physical memory
contiguity is not required, making vmalloc a suitable alternative.

Cc: stable@vger.kernel.org
Fixes: 07814a9439 ("sched_ext: Print debug dump after an error exit")
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08 07:53:27 -10:00
Andrea Righi
683d2d0fab sched_ext: idle: Introduce scx_bpf_select_cpu_and()
Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply
the built-in idle CPU selection policy to a subset of allowed CPU.

This new helper is basically an extension of scx_bpf_select_cpu_dfl().
However, when an idle CPU can't be found, it returns a negative value
instead of @prev_cpu, aligning its behavior more closely with
scx_bpf_pick_idle_cpu().

It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce
strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to
request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying
the built-in selection logic.

With this helper, BPF schedulers can apply the built-in idle CPU
selection policy restricted to any arbitrary subset of CPUs.

Example usage
=============

Possible usage in ops.select_cpu():

s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
	s32 cpu;

	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
	if (cpu >= 0) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
		return cpu;
	}

	return prev_cpu;
}

Results
=======

Load distribution on a 4 sockets, 4 cores per socket system, simulated
using virtme-ng, running a modified version of scx_bpfland that uses
scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs:

 $ vng --cpu 16,sockets=4,cores=4,threads=1
 ...
 $ stress-ng -c 16
 ...
 $ htop
 ...
   0[                         0.0%]   8[||||||||||||||||||||||||100.0%]
   1[                         0.0%]   9[||||||||||||||||||||||||100.0%]
   2[                         0.0%]  10[||||||||||||||||||||||||100.0%]
   3[                         0.0%]  11[||||||||||||||||||||||||100.0%]
   4[                         0.0%]  12[||||||||||||||||||||||||100.0%]
   5[                         0.0%]  13[||||||||||||||||||||||||100.0%]
   6[                         0.0%]  14[||||||||||||||||||||||||100.0%]
   7[                         0.0%]  15[||||||||||||||||||||||||100.0%]

With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across
all the available CPUs.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07 07:13:52 -10:00
Andrea Righi
23c63a9652 sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()
Modify scx_select_cpu_dfl() to take the allowed cpumask as an explicit
argument, instead of implicitly using @p->cpus_ptr.

This prepares for future changes where arbitrary cpumasks may be passed
to the built-in idle CPU selection policy.

This is a pure refactoring with no functional changes.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07 07:13:52 -10:00
Tejun Heo
29b49be6c9 sched_ext: Drop "ops" from SCX_OPS_TASK_ITER_BATCH
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from SCX_OPS_TASK_ITER_BATCH.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 10:10:39 -10:00
Tejun Heo
1a2469403e sched_ext: Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:49 -10:00
Tejun Heo
c5f22258f5 sched_ext: Drop "ops" from scx_ops_exit(), scx_ops_error() and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_exit(), scx_ops_error() and friends.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:49 -10:00
Tejun Heo
8c6ee86246 sched_ext: Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends. Update
scx_show_state.py accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:49 -10:00
Tejun Heo
a50c365f99 sched_ext: Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled.
Update scx_show_state.py accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:48 -10:00
Tejun Heo
1a7ff7216c sched_ext: Drop "ops" from scx_ops_enable_state and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_enable_state and friends. Update scx_show_state.py
accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:48 -10:00