2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

47146 Commits

Author SHA1 Message Date
Steven Rostedt
ded9140622 fprobe: Always unregister fgraph function from ops
When the last fprobe is removed, it calls unregister_ftrace_graph() to
remove the graph_ops from function graph. The issue is when it does so, it
calls return before removing the function from its graph ops via
ftrace_set_filter_ips(). This leaves the last function lingering in the
fprobe's fgraph ops and if a probe is added it also enables that last
function (even though the callback will just drop it, it does add unneeded
overhead to make that call).

  # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

  # echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
schedule_timeout (1)           	tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

  # > /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions

  # echo "f:myevent3 kmem_cache_free" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kmem_cache_free (1)           	tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
schedule_timeout (1)           	tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

The above enabled a fprobe on kernel_clone, and then on schedule_timeout.
The content of the enabled_functions shows the functions that have a
callback attached to them. The fprobe attached to those functions
properly. Then the fprobes were cleared, and enabled_functions was empty
after that. But after adding a fprobe on kmem_cache_free, the
enabled_functions shows that the schedule_timeout was attached again. This
is because it was still left in the fprobe ops that is used to tell
function graph what functions it wants callbacks from.

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.393254452@goodmis.org
Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Steven Rostedt
8eb4b09e0b ftrace: Do not add duplicate entries in subops manager ops
Check if a function is already in the manager ops of a subops. A manager
ops contains multiple subops, and if two or more subops are tracing the
same function, the manager ops only needs a single entry in its hash.

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.226762894@goodmis.org
Fixes: 4f554e9556 ("ftrace: Add ftrace_set_filter_ips function")
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Steven Rostedt
38b1406194 ftrace: Fix accounting of adding subops to a manager ops
Function graph uses a subops and manager ops mechanism to attach to
ftrace.  The manager ops connects to ftrace and the functions it connects
to is defined by a list of subops that it manages.

The function hash that defines what the above ops attaches to limits the
functions to attach if the hash has any content. If the hash is empty, it
means to trace all functions.

The creation of the manager ops hash is done by iterating over all the
subops hashes. If any of the subops hashes is empty, it means that the
manager ops hash must trace all functions as well.

The issue is in the creation of the manager ops. When a second subops is
attached, a new hash is created by starting it as NULL and adding the
subops one at a time. But the NULL ops is mistaken as an empty hash, and
once an empty hash is found, it stops the loop of subops and just enables
all functions.

  # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

  # echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
trace_initcall_start_cb (1)             tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
run_init_process (1)            tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
try_to_run_init_process (1)             tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
x86_pmu_show_pmu_cap (1)                tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
cleanup_rapl_pmus (1)                   tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_free_pcibus_map (1)              tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_types_exit (1)                   tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_pci_exit.part.0 (1)              tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
kvm_shutdown (1)                tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
vmx_dump_msrs (1)               tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
vmx_cleanup_l1d_flush (1)               tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
[..]

Fix this by initializing the new hash to NULL and if the hash is NULL do
not treat it as an empty hash but instead allocate by copying the content
of the first sub ops. Then on subsequent iterations, the new hash will not
be NULL, but the content of the previous subops. If that first subops
attached to all functions, then new hash may assume that the manager ops
also needs to attach to all functions.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.060300046@goodmis.org
Fixes: 5fccc7552c ("ftrace: Add subops logic to allow one ops to manage many")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:35:44 -05:00
Michael Jeanson
dc0a241cea rseq: Fix rseq registration with CONFIG_DEBUG_RSEQ
With CONFIG_DEBUG_RSEQ=y, at rseq registration the read-only fields are
copied from user-space, if this copy fails the syscall returns -EFAULT
and the registration should not be activated - but it erroneously is.

Move the activation of the registration after the copy of the fields to
fix this bug.

Fixes: 7d5265ffcd ("rseq: Validate read-only fields under DEBUG_RSEQ config")
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/20250219205330.324770-1-mjeanson@efficios.com
2025-02-21 14:21:02 +01:00
Linus Torvalds
319fc77f8f BPF fixes:
- Fix a soft-lockup in BPF arena_map_free on 64k page size
   kernels (Alan Maguire)
 
 - Fix a missing allocation failure check in BPF verifier's
   acquire_lock_state (Kumar Kartikeya Dwivedi)
 
 - Fix a NULL-pointer dereference in trace_kfree_skb by adding
   kfree_skb to the raw_tp_null_args set (Kuniyuki Iwashima)
 
 - Fix a deadlock when freeing BPF cgroup storage (Abel Wu)
 
 - Fix a syzbot-reported deadlock when holding BPF map's
   freeze_mutex (Andrii Nakryiko)
 
 - Fix a use-after-free issue in bpf_test_init when
   eth_skb_pkt_type is accessing skb data not containing an
   Ethernet header (Shigeru Yoshida)
 
 - Fix skipping non-existing keys in generic_map_lookup_batch
   (Yan Zhai)
 
 - Several BPF sockmap fixes to address incorrect TCP copied_seq
   calculations, which prevented correct data reads from recv(2)
   in user space (Jiayuan Chen)
 
 - Two fixes for BPF map lookup nullness elision (Daniel Xu)
 
 - Fix a NULL-pointer dereference from vmlinux BTF lookup in
   bpf_sk_storage_tracing_allowed (Jared Kangas)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ7evlRUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISIPwHgD/dTvM00Ck4Q73fPivyT7tcqxeXJlD
 D6ggzWl/SG9LAbwA/2/cSgAM9Jm1g7ddvn/S9QaDYOs5GmFl6urq6krs+tYD
 =FCs9
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull BPF fixes from Daniel Borkmann:

 - Fix a soft-lockup in BPF arena_map_free on 64k page size kernels
   (Alan Maguire)

 - Fix a missing allocation failure check in BPF verifier's
   acquire_lock_state (Kumar Kartikeya Dwivedi)

 - Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb
   to the raw_tp_null_args set (Kuniyuki Iwashima)

 - Fix a deadlock when freeing BPF cgroup storage (Abel Wu)

 - Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex
   (Andrii Nakryiko)

 - Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is
   accessing skb data not containing an Ethernet header (Shigeru
   Yoshida)

 - Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai)

 - Several BPF sockmap fixes to address incorrect TCP copied_seq
   calculations, which prevented correct data reads from recv(2) in user
   space (Jiayuan Chen)

 - Two fixes for BPF map lookup nullness elision (Daniel Xu)

 - Fix a NULL-pointer dereference from vmlinux BTF lookup in
   bpf_sk_storage_tracing_allowed (Jared Kangas)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests: bpf: test batch lookup on array of maps with holes
  bpf: skip non exist keys in generic_map_lookup_batch
  bpf: Handle allocation failure in acquire_lock_state
  bpf: verifier: Disambiguate get_constant_map_key() errors
  bpf: selftests: Test constant key extraction on irrelevant maps
  bpf: verifier: Do not extract constant map keys for irrelevant maps
  bpf: Fix softlockup in arena_map_free on 64k page kernel
  net: Add rx_skb of kfree_skb to raw_tp_null_args[].
  bpf: Fix deadlock when freeing cgroup storage
  selftests/bpf: Add strparser test for bpf
  selftests/bpf: Fix invalid flag of recv()
  bpf: Disable non stream socket for strparser
  bpf: Fix wrong copied_seq calculation
  strparser: Add read_sock callback
  bpf: avoid holding freeze_mutex during mmap operation
  bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic
  selftests/bpf: Adjust data size to have ETH_HLEN
  bpf, test_run: Fix use-after-free issue in eth_skb_pkt_type()
  bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed
2025-02-20 15:37:17 -08:00
Dave Airlie
395436f3bd An reset signal polarity fix for the jd9365da-h3 panel, a folio handling
fix and config fix in nouveau, a dmem cgroup descendant pool handling
 fix, and a missing header for amdxdna.
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQTkHFbLp4ejekA/qfgnX84Zoj2+dgUCZ7bn6AAKCRAnX84Zoj2+
 djJnAX9XDHzW0CCJnF8UopQearYcn2DPKrXvLKWwwpSothyoOFiIHyifP7fHlQBX
 XA5+iIQBf1RZq3uTeqbaq3DeD0Pf9LrRUC41g3H7HO9Lt0/Bp2dnRqJJgZVxIg+h
 57yAKxQ5Cw==
 =PwGs
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-fixes-2025-02-20' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

An reset signal polarity fix for the jd9365da-h3 panel, a folio handling
fix and config fix in nouveau, a dmem cgroup descendant pool handling
fix, and a missing header for amdxdna.

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250220-glorious-cockle-of-might-5b35f7@houat
2025-02-21 09:16:35 +10:00
Jason Xing
5942246426 bpf: Support selective sampling for bpf timestamping
Add the bpf_sock_ops_enable_tx_tstamp kfunc to allow BPF programs to
selectively enable TX timestamping on a skb during tcp_sendmsg().

For example, BPF program will limit tracking X numbers of packets
and then will stop there instead of tracing all the sendmsgs of
matched flow all along. It would be helpful for users who cannot
afford to calculate latencies from every sendmsg call probably
due to the performance or storage space consideration.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-12-kerneljasonxing@gmail.com
2025-02-20 14:30:02 -08:00
Friedrich Vock
8821f36333 cgroup/dmem: Don't open-code css_for_each_descendant_pre
The current implementation has a bug: If the current css doesn't
contain any pool that is a descendant of the "pool" (i.e. when
found_descendant == false), then "pool" will point to some unrelated
pool. If the current css has a child, we'll overwrite parent_pool with
this unrelated pool on the next iteration.

Since we can just check whether a pool refers to the same region to
determine whether or not it's related, all the additional pool tracking
is unnecessary, so just switch to using css_for_each_descendant_pre for
traversal.

Fixes: b168ed458d ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250127152754.21325-1-friedrich.vock@gmx.de
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
2025-02-19 09:50:37 +01:00
Yan Zhai
5644c6b50f bpf: skip non exist keys in generic_map_lookup_batch
The generic_map_lookup_batch currently returns EINTR if it fails with
ENOENT and retries several times on bpf_map_copy_value. The next batch
would start from the same location, presuming it's a transient issue.
This is incorrect if a map can actually have "holes", i.e.
"get_next_key" can return a key that does not point to a valid value. At
least the array of maps type may contain such holes legitly. Right now
these holes show up, generic batch lookup cannot proceed any more. It
will always fail with EINTR errors.

Rather, do not retry in generic_map_lookup_batch. If it finds a non
existing element, skip to the next key. This simple solution comes with
a price that transient errors may not be recovered, and the iteration
might cycle back to the first key under parallel deletion. For example,
Hou Tao <houtao@huaweicloud.com> pointed out a following scenario:

For LPM trie map:
(1) ->map_get_next_key(map, prev_key, key) returns a valid key

(2) bpf_map_copy_value() return -ENOMENT
It means the key must be deleted concurrently.

(3) goto next_key
It swaps the prev_key and key

(4) ->map_get_next_key(map, prev_key, key) again
prev_key points to a non-existing key, for LPM trie it will treat just
like prev_key=NULL case, the returned key will be duplicated.

With the retry logic, the iteration can continue to the key next to the
deleted one. But if we directly skip to the next key, the iteration loop
would restart from the first key for the lpm_trie type.

However, not all races may be recovered. For example, if current key is
deleted after instead of before bpf_map_copy_value, or if the prev_key
also gets deleted, then the loop will still restart from the first key
for lpm_tire anyway. For generic lookup it might be better to stay
simple, i.e. just skip to the next key. To guarantee that the output
keys are not duplicated, it is better to implement map type specific
batch operations, which can properly lock the trie and synchronize with
concurrent mutators.

Fixes: cb4d03ab49 ("bpf: Add generic support for lookup batch op")
Closes: https://lore.kernel.org/bpf/Z6JXtA1M5jAZx8xD@debian.debian/
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/85618439eea75930630685c467ccefeac0942e2b.1739171594.git.yan@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18 17:27:37 -08:00
Thomas Weißschuh
ec5fd50aef uprobes: Don't use %pK through printk
Restricted pointers ("%pK") are not meant to be used through printk().
It can unintentionally expose security sensitive, raw pointer values.

Use regular pointer formatting instead.

For more background, see:

  https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250217-restricted-pointers-uprobes-v1-1-e8cbe5bb22a7@linutronix.de
2025-02-18 09:48:02 +01:00
Mathieu Desnoyers
02d954c0fd sched: Compact RSEQ concurrency IDs with reduced threads and affinity
When a process reduces its number of threads or clears bits in its CPU
affinity mask, the mm_cid allocation should eventually converge towards
smaller values.

However, the change introduced by:

commit 7e019dcc47 ("sched: Improve cache locality of RSEQ concurrency
IDs for intermittent workloads")

adds a per-mm/CPU recent_cid which is never unset unless a thread
migrates.

This is a tradeoff between:

A) Preserving cache locality after a transition from many threads to few
   threads, or after reducing the hamming weight of the allowed CPU mask.

B) Making the mm_cid upper bounds wrt nr threads and allowed CPU mask
   easy to document and understand.

C) Allowing applications to eventually react to mm_cid compaction after
   reduction of the nr threads or allowed CPU mask, making the tracking
   of mm_cid compaction easier by shrinking it back towards 0 or not.

D) Making sure applications that periodically reduce and then increase
   again the nr threads or allowed CPU mask still benefit from good
   cache locality with mm_cid.

Introduce the following changes:

* After shrinking the number of threads or reducing the number of
  allowed CPUs, reduce the value of max_nr_cid so expansion of CID
  allocation will preserve cache locality if the number of threads or
  allowed CPUs increase again.

* Only re-use a recent_cid if it is within the max_nr_cid upper bound,
  else find the first available CID.

Fixes: 7e019dcc47 ("sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lkml.kernel.org/r/20250210153253.460471-2-gmonaco@redhat.com
2025-02-18 08:50:36 +01:00
Linus Torvalds
2408a807bf vfs-6.14-rc4.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ7MONQAKCRCRxhvAZXjc
 ovw1AP4uB8c0hYfQHv/02XVTBad46zQm7uDh28EnEI8mrX7UBwEAnHw1PrrcX6ZH
 QFA47x5iGR+InXfQx4mmGqgvlD1XQgI=
 =x1hu
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.14-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "It was reported that the acct(2) system call can be used to trigger a
  NULL deref in cases where it is set to write to a file that triggers
  an internal lookup.

  This can e.g., happen when pointing acct(2) to /sys/power/resume. At
  the point the where the write to this file happens the calling task
  has already exited and called exit_fs() but an internal lookup might
  be triggered through lookup_bdev(). This may trigger a NULL-deref when
  accessing current->fs.

  Reorganize the code so that the the final write happens from the
  workqueue but with the caller's credentials. This preserves the
  (strange) permission model and has almost no regression risk.

  Also block access to kernel internal filesystems as well as procfs and
  sysfs in the first place.

  Various fixes for netfslib:

   - Fix a number of read-retry hangs, including:

      - Incorrect getting/putting of references on subreqs as we retry
        them

      - Failure to track whether a last old subrequest in a retried set
        is superfluous

      - Inconsistency in the usage of wait queues used for subrequests
        (ie. using clear_and_wake_up_bit() whilst waiting on a private
        waitqueue)

   - Add stats counters for retries and publish in /proc/fs/netfs/stats.
     This is not a fix per se, but is useful in debugging and shouldn't
     otherwise change the operation of the code

   - Fix the ordering of queuing subrequests with respect to setting the
     request flag that says we've now queued them all"

* tag 'vfs-6.14-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  netfs: Fix setting NETFS_RREQ_ALL_QUEUED to be after all subreqs queued
  netfs: Add retry stat counters
  netfs: Fix a number of read-retry hangs
  acct: block access to kernel internal filesystems
  acct: perform last write from workqueue
2025-02-17 10:38:25 -08:00
Linus Torvalds
ba643b6d84 - Remove an unused config item GENERIC_PENDING_IRQ_CHIPFLAGS
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmextVEACgkQEsHwGGHe
 VUpEew//bv6xE+9sUxtKpD4yQFSZmOOAYbSg6TNdxlmlIMXn2JHeOe166TiQHDIv
 w4QVZuzmmHx11hPC3YVgdMtzU0S/dFWd2LhoguHVALbZ39sxap2tuWIZacDvShyP
 WKrGAtvb/BOb0lL0yQY3dGwQ+YO7ye0EU+gJLCaWqNC7n0vWfP4gin17k08iNgQg
 qghwGUNCsYsijfpvimcJB9eqg00pOqnOVnWiE2xGSz5LKy+Yigtd8aNR50mS+Mgt
 Sl2Dzmesio0B464X+HSj34dhi4GswdjWmqKyuixmOuECp/1rCBakpuq760J0ijOg
 dliiBHZplavsLcis9Mh/vKzN4Pd5xjXMPsADHpTX/2fpF8yXj9sWpG+xp7UgsJi0
 LZASk8LTovstOGiDSF7ff+4iCPlkZTvE77fawfdcrpe4kgUc85VpRWyv7xalbGwj
 0Y/6dIkXg9dOwv+1z57i15ws2KDG35d2faF5UH1qfr/wwzp/TloIk5oX1C5d2puh
 Y5Fohu3S+D+wFgJB0xwkGTXzLTGVn/ElQMF08CvHHvqX3TrA9HNfJ7EahROx/T3i
 HNjj10YyKXbGwnk7OAQENOKG3SztlxrPk4SEEY4fZ1RvehYYu6qXXHA9vShjDQcN
 C2Elj42sgoTXr+DBGeVPHE1bD279kxacWZYD2xPx15t74Mu0zis=
 =tXlN
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq Kconfig cleanup from Borislav Petkov:

 - Remove an unused config item GENERIC_PENDING_IRQ_CHIPFLAGS

* tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Remove unused CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS
2025-02-16 10:55:17 -08:00
Linus Torvalds
ff3b373ecc - Clarify what happens when a task is woken up from the wake queue and make
clear its removal from that queue is atomic
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmexq1MACgkQEsHwGGHe
 VUq6ghAAoD872KGmQ3YioDZs+FKLpLWvo+6lC2rY+GFQ3oCu4TJfmlsiTGLCyzjv
 aYwL52diIORD9/Yfn6Sq/ZWkfncoVmwnht+tgVjXeRr3Pb4EnttgWxPRx/xYQizr
 jgRASpNsRUTr6zNzqEeeQYodIJaInOF5+r26oqYArcN5V9XB9Qaj0+f14UiyB6u+
 53qpEQqQopeLPyG4t59iUfefsaWm2ZIW3EnoWeyI8sRuaapGY/0LHhUAn6vcA++N
 kuUkliVsk+f/uTNZeJ4zv2uy8DpBXO4kTjmwPVwFz46sJ8RL8P7MOLax7e7fNssw
 tylwHt4qoLEoB2vg1yMvlUNFMeH85gj8hTyJMsgGtFTbwCH0kLFEoXUz0lfKMS7U
 A271E1Bumu3OT7FrAnxQahViv02YWG5fcg6R3OidQdSmgoQBIMJwDA2pyKLiq9FL
 7mWoNfEqqBWn4O/1qBlf3jCvfFlzXRSSgVzEruoNB93cgzTaQaN5yVgeekMzwJEj
 NDowmIZRxEN8+lJyMxIGSOGa44aTXu0/+dtehEDeSpsDOXULFc5fpYt4SSa0Jt/F
 LlgnPGkM1vF0ddG5vDJpGw6B9Dhb82i1oYy2IVwOCkLOXSp2kvLDEI3sqgk5mmlH
 zFtAV21Zm4jqBke/aF7r+RCYRvDyQkuW0d1+H9okGWLSk7jKauM=
 =J43o
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:

 - Clarify what happens when a task is woken up from the wake queue and
   make clear its removal from that queue is atomic

* tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Clarify wake_up_q()'s write to task->wake_q.next
2025-02-16 10:38:24 -08:00
Linus Torvalds
5784d8c93e Ring buffer fixes for v6.14:
- Enable resize on mmap() error
 
   When a process mmaps a ring buffer, its size is locked and resizing is
   disabled. But if the user passes in a wrong parameter, the mmap() can fail
   after the resize was disabled and the mmap() exits with error without
   reenabling the ring buffer resize. This prevents the ring buffer from ever
   being resized after that. Reenable resizing of the ring buffer on mmap()
   error.
 
 - Have resizing return proper error and not always -ENOMEM
 
   If the ring buffer is mmapped by one task and another task tries to resize
   the buffer it will error with -ENOMEM. This is confusing to the user as
   there may be plenty of memory available. Have it return the error that
   actually happens (in this case -EBUSY) where the user can understand why
   the resize failed.
 
 - Test the sub-buffer array to validate persistent memory buffer
 
   On boot up, the initialization of the persistent memory buffer will do a
   validation check to see if the content of the data is valid, and if so, it
   will use the memory as is, otherwise it re-initializes it. There's meta
   data in this persistent memory that keeps track of which sub-buffer is the
   reader page and an array that states the order of the sub-buffers. The
   values in this array are indexes into the sub-buffers. The validator
   checks to make sure that all the entries in the array are within the
   sub-buffer list index, but it does not check for duplications.
 
   While working on this code, the array got corrupted and had duplicates,
   where not all the sub-buffers were accounted for. This passed the
   validator as all entries were valid, but the link list was incorrect and
   could have caused a crash. The corruption only produced incorrect data,
   but it could have been more severe. To fix this, create a bitmask that
   covers all the sub-buffer indexes and set it to all zeros. While iterating
   the array checking the values of the array content, have it set a bit
   corresponding to the index in the array. If the bit was already set, then
   it is a duplicate and mark the buffer as invalid and reset it.
 
 - Prevent mmap()ing persistent ring buffer
 
   The persistent ring buffer uses vmap() to map the persistent memory.
   Currently, the mmap() logic only uses virt_to_page() to get the page
   from the ring buffer memory and use that to map to user space. This works
   because a normal ring buffer uses alloc_page() to allocate its memory.
   But because the persistent ring buffer use vmap() it causes a kernel
   crash.  Fixing this to work with vmap() is not hard, but since mmap() on
   persistent memory buffers never worked, just have the mmap() return
   -ENODEV (what was returned before mmap() for persistent memory ring
   buffers, as they never supported mmap. Normal buffers will still allow
   mmap(). Implementing mmap() for persistent memory ring buffers can wait
   till the next merge window.
 
 - Fix polling on persistent ring buffers
 
   There's a "buffer_percent" option (default set to 50), that is used to
   have reads of the ring buffer binary data block until the buffer fills to
   that percentage. The field "pages_touched" is incremented every time a
   new sub-buffer has content added to it. This field is used in the
   calculations to determine the amount of content is in the buffer and if it
   exceeds the "buffer_percent" then it will wake the task polling on the
   buffer.
 
   As persistent ring buffers can be created by the content from a previous
   boot, the "pages_touched" field was not updated. This means that if a task
   were to poll on the persistent buffer, it would block even if the buffer
   was completely full. It would block even if the "buffer_percent" was zero,
   because with "pages_touched" as zero, it would be calculated as the buffer
   having no content. Update pages_touched when initializing the persistent
   ring buffer from a previous boot.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ7DtcxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmTQAQD1W/xHfS8yLw7BQBjM+6kqExdrKI/D
 Z378Et0LSWjZBQD/VtPKiSjLhhNgLUBy5fAWS5t4X/DZ49GKhTA36AzGHwE=
 =1b+2
 -----END PGP SIGNATURE-----

Merge tag 'trace-ring-buffer-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull trace ring buffer fixes from Steven Rostedt:

 - Enable resize on mmap() error

   When a process mmaps a ring buffer, its size is locked and resizing
   is disabled. But if the user passes in a wrong parameter, the mmap()
   can fail after the resize was disabled and the mmap() exits with
   error without reenabling the ring buffer resize. This prevents the
   ring buffer from ever being resized after that. Reenable resizing of
   the ring buffer on mmap() error.

 - Have resizing return proper error and not always -ENOMEM

   If the ring buffer is mmapped by one task and another task tries to
   resize the buffer it will error with -ENOMEM. This is confusing to
   the user as there may be plenty of memory available. Have it return
   the error that actually happens (in this case -EBUSY) where the user
   can understand why the resize failed.

 - Test the sub-buffer array to validate persistent memory buffer

   On boot up, the initialization of the persistent memory buffer will
   do a validation check to see if the content of the data is valid, and
   if so, it will use the memory as is, otherwise it re-initializes it.
   There's meta data in this persistent memory that keeps track of which
   sub-buffer is the reader page and an array that states the order of
   the sub-buffers. The values in this array are indexes into the
   sub-buffers. The validator checks to make sure that all the entries
   in the array are within the sub-buffer list index, but it does not
   check for duplications.

   While working on this code, the array got corrupted and had
   duplicates, where not all the sub-buffers were accounted for. This
   passed the validator as all entries were valid, but the link list was
   incorrect and could have caused a crash. The corruption only produced
   incorrect data, but it could have been more severe. To fix this,
   create a bitmask that covers all the sub-buffer indexes and set it to
   all zeros. While iterating the array checking the values of the array
   content, have it set a bit corresponding to the index in the array.
   If the bit was already set, then it is a duplicate and mark the
   buffer as invalid and reset it.

 - Prevent mmap()ing persistent ring buffer

   The persistent ring buffer uses vmap() to map the persistent memory.
   Currently, the mmap() logic only uses virt_to_page() to get the page
   from the ring buffer memory and use that to map to user space. This
   works because a normal ring buffer uses alloc_page() to allocate its
   memory. But because the persistent ring buffer use vmap() it causes a
   kernel crash.

   Fixing this to work with vmap() is not hard, but since mmap() on
   persistent memory buffers never worked, just have the mmap() return
   -ENODEV (what was returned before mmap() for persistent memory ring
   buffers, as they never supported mmap. Normal buffers will still
   allow mmap(). Implementing mmap() for persistent memory ring buffers
   can wait till the next merge window.

 - Fix polling on persistent ring buffers

   There's a "buffer_percent" option (default set to 50), that is used
   to have reads of the ring buffer binary data block until the buffer
   fills to that percentage. The field "pages_touched" is incremented
   every time a new sub-buffer has content added to it. This field is
   used in the calculations to determine the amount of content is in the
   buffer and if it exceeds the "buffer_percent" then it will wake the
   task polling on the buffer.

   As persistent ring buffers can be created by the content from a
   previous boot, the "pages_touched" field was not updated. This means
   that if a task were to poll on the persistent buffer, it would block
   even if the buffer was completely full. It would block even if the
   "buffer_percent" was zero, because with "pages_touched" as zero, it
   would be calculated as the buffer having no content. Update
   pages_touched when initializing the persistent ring buffer from a
   previous boot.

* tag 'trace-ring-buffer-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Update pages_touched to reflect persistent buffer content
  tracing: Do not allow mmap() of persistent ring buffer
  ring-buffer: Validate the persistent meta data subbuf array
  tracing: Have the error of __tracing_resize_ring_buffer() passed to user
  ring-buffer: Unlock resize on mmap error
2025-02-15 16:34:41 -08:00
Steven Rostedt
97937834ae ring-buffer: Update pages_touched to reflect persistent buffer content
The pages_touched field represents the number of subbuffers in the ring
buffer that have content that can be read. This is used in accounting of
"dirty_pages" and "buffer_percent" to allow the user to wait for the
buffer to be filled to a certain amount before it reads the buffer in
blocking mode.

The persistent buffer never updated this value so it was set to zero, and
this accounting would take it as it had no content. This would cause user
space to wait for content even though there's enough content in the ring
buffer that satisfies the buffer_percent.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214123512.0631436e@gandalf.local.home
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-15 14:00:59 -05:00
Steven Rostedt
129fe71881 tracing: Do not allow mmap() of persistent ring buffer
When trying to mmap a trace instance buffer that is attached to
reserve_mem, it would crash:

 BUG: unable to handle page fault for address: ffffe97bd00025c8
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 2862f3067 P4D 2862f3067 PUD 0
 Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI
 CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:validate_page_before_insert+0x5/0xb0
 Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89
 RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246
 RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29
 RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08
 RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004
 R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000
 R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000
 FS:  00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x1f
  ? __die+0x2e/0x40
  ? page_fault_oops+0x157/0x2b0
  ? search_module_extables+0x53/0x80
  ? validate_page_before_insert+0x5/0xb0
  ? kernelmode_fixup_or_oops.isra.0+0x5f/0x70
  ? __bad_area_nosemaphore+0x16e/0x1b0
  ? bad_area_nosemaphore+0x16/0x20
  ? do_kern_addr_fault+0x77/0x90
  ? exc_page_fault+0x22b/0x230
  ? asm_exc_page_fault+0x2b/0x30
  ? validate_page_before_insert+0x5/0xb0
  ? vm_insert_pages+0x151/0x400
  __rb_map_vma+0x21f/0x3f0
  ring_buffer_map+0x21b/0x2f0
  tracing_buffers_mmap+0x70/0xd0
  __mmap_region+0x6f0/0xbd0
  mmap_region+0x7f/0x130
  do_mmap+0x475/0x610
  vm_mmap_pgoff+0xf2/0x1d0
  ksys_mmap_pgoff+0x166/0x200
  __x64_sys_mmap+0x37/0x50
  x64_sys_call+0x1670/0x1d70
  do_syscall_64+0xbb/0x1d0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

The reason was that the code that maps the ring buffer pages to user space
has:

	page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);

And uses that in:

	vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

But virt_to_page() does not work with vmap()'d memory which is what the
persistent ring buffer has. It is rather trivial to allow this, but for
now just disable mmap() of instances that have their ring buffer from the
reserve_mem option.

If an mmap() is performed on a persistent buffer it will return -ENODEV
just like it would if the .mmap field wasn't defined in the
file_operations structure.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214115547.0d7287d3@gandalf.local.home
Fixes: 9b7bdf6f6e ("tracing: Have trace_printk not use binary prints if boot buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-15 13:59:52 -05:00
Linus Torvalds
04f41cbf03 sched_ext: Fixes for v6.14-rc2
- Fix lock imbalance in a corner case of dispatch_to_local_dsq().
 
 - Migration disabled tasks were confusing some BPF schedulers and its
   handling had a bug. Fix it and simplify the default behavior by
   dispatching them automatically.
 
 - ops.tick(), ops.disable() and ops.exit_task() were incorrectly disallowing
   kfuncs that require the task argument to be the rq operation is currently
   operating on and thus is rq-locked. Allow them.
 
 - Fix autogroup migration handling bug which was occasionally triggering a
   warning in the cgroup migration path.
 
 - tools/sched_ext, selftest and other misc updates.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ695uA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGeCfAQDmUixMNJCIrRphYsWcYUzlGLZyyRpQEEYFtRMO
 UC266gD+PUV2UvuO5sAVH8HVnGdOqkXaE/IRG+TC7fQH3ruPlgI=
 =LFd1
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix lock imbalance in a corner case of dispatch_to_local_dsq()

 - Migration disabled tasks were confusing some BPF schedulers and its
   handling had a bug. Fix it and simplify the default behavior by
   dispatching them automatically

 - ops.tick(), ops.disable() and ops.exit_task() were incorrectly
   disallowing kfuncs that require the task argument to be the rq
   operation is currently operating on and thus is rq-locked.
   Allow them.

 - Fix autogroup migration handling bug which was occasionally
   triggering a warning in the cgroup migration path

 - tools/sched_ext, selftest and other misc updates

* tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
  sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h.
  sched_ext: selftests: Fix grammar in tests description
  sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
  sched_ext: Fix migration disabled handling in targeted dispatches
  sched_ext: Implement auto local dispatching of migration disabled tasks
  sched_ext: Fix incorrect time delta calculation in time_delta()
  sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
  sched_ext: selftests/dsp_local_on: Fix selftest on UP systems
  tools/sched_ext: Add helper to check task migration state
  sched_ext: Fix incorrect autogroup migration detection
  sched_ext: selftests/dsp_local_on: Fix sporadic failures
  selftests/sched_ext: Fix enum resolution
  sched_ext: Include task weight in the error state dump
  sched_ext: Fixes typos in comments
2025-02-14 11:14:24 -08:00
Linus Torvalds
80868f5d3d cgroup: Fixes for v6.14-rc2
- Fix a race window where a newly forked task could escape cgroup.kill.
 
 - Remove incorrectly included steal time from cpu.stat::usage_usec.
 
 - Minor update in selftest.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ6928A4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGYrbAPsEtoH5GFw7VtKIy4fS23QbtxUuwW0fERrPWGyt
 JtQv3gD5AboBUrGWdgiM5c2bIXT3f+Bn9w3HhLiPaB8ieN/0kA8=
 =3BBH
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Fix a race window where a newly forked task could escape cgroup.kill

 - Remove incorrectly included steal time from cpu.stat::usage_usec

 - Minor update in selftest

* tag 'cgroup-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Remove steal time from usage_usec
  selftests/cgroup: use bash in test_cpuset_v1_hp.sh
  cgroup: fix race between fork and cgroup.kill
2025-02-14 11:00:42 -08:00
Linus Torvalds
f4d4680965 workqueue: Fixes for v6.14-rc2
- Fix a regression where a worker pool can be freed before rescuer workers
   are done with it leading to user-after-free.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ691oQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGUJ9AP9+fR3J07+0TzAtQmDzBRsJeIjx7zgM9hE2OVgR
 L5jvDgEAmzfEmHTaDYI097T3yM6o1se+e9nRKgwMvru0ZXT48wo=
 =eAo4
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:

 - Fix a regression where a worker pool can be freed before rescuer
   workers are done with it leading to user-after-free

* tag 'wq-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Put the pwq after detaching the rescuer from the pool
2025-02-14 10:58:03 -08:00
Steven Rostedt
f5b95f1fa2 ring-buffer: Validate the persistent meta data subbuf array
The meta data for a mapped ring buffer contains an array of indexes of all
the subbuffers. The first entry is the reader page, and the rest of the
entries lay out the order of the subbuffers in how the ring buffer link
list is to be created.

The validator currently makes sure that all the entries are within the
range of 0 and nr_subbufs. But it does not check if there are any
duplicates.

While working on the ring buffer, I corrupted this array, where I added
duplicates. The validator did not catch it and created the ring buffer
link list on top of it. Luckily, the corruption was only that the reader
page was also in the writer path and only presented corrupted data but did
not crash the kernel. But if there were duplicates in the writer side,
then it could corrupt the ring buffer link list and cause a crash.

Create a bitmask array with the size of the number of subbuffers. Then
clear it. When walking through the subbuf array checking to see if the
entries are within the range, test if its bit is already set in the
subbuf_mask. If it is, then there is duplicates and fail the validation.
If not, set the corresponding bit and continue.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214102820.7509ddea@gandalf.local.home
Fixes: c76883f18e ("ring-buffer: Add test if range of boot buffer is valid")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14 12:50:51 -05:00
Steven Rostedt
60b8f71114 tracing: Have the error of __tracing_resize_ring_buffer() passed to user
Currently if __tracing_resize_ring_buffer() returns an error, the
tracing_resize_ringbuffer() returns -ENOMEM. But it may not be a memory
issue that caused the function to fail. If the ring buffer is memory
mapped, then the resizing of the ring buffer will be disabled. But if the
user tries to resize the buffer, it will get an -ENOMEM returned, which is
confusing because there is plenty of memory. The actual error returned was
-EBUSY, which would make much more sense to the user.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213134132.7e4505d7@gandalf.local.home
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-02-14 12:50:27 -05:00
Steven Rostedt
9ba0e1755a ring-buffer: Unlock resize on mmap error
Memory mapping the tracing ring buffer will disable resizing the buffer.
But if there's an error in the memory mapping like an invalid parameter,
the function exits out without re-enabling the resizing of the ring
buffer, preventing the ring buffer from being resized after that.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213131957.530ec3c5@gandalf.local.home
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14 12:50:05 -05:00
Will Deacon
8221fd1a73 workqueue: Log additional details when rejecting work
Syzbot regularly runs into the following warning on arm64:

  | WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 current_wq_worker kernel/workqueue_internal.h:69 [inline]
  | WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 is_chained_work kernel/workqueue.c:2199 [inline]
  | WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 __queue_work+0xe50/0x1308 kernel/workqueue.c:2256
  | Modules linked in:
  | CPU: 1 UID: 0 PID: 6023 Comm: klogd Not tainted 6.13.0-rc2-syzkaller-g2e7aff49b5da #0
  | Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
  | pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  | pc : __queue_work+0xe50/0x1308 kernel/workqueue_internal.h:69
  | lr : current_wq_worker kernel/workqueue_internal.h:69 [inline]
  | lr : is_chained_work kernel/workqueue.c:2199 [inline]
  | lr : __queue_work+0xe50/0x1308 kernel/workqueue.c:2256

  [...]

  |    __queue_work+0xe50/0x1308 kernel/workqueue.c:2256 (L)
  |  delayed_work_timer_fn+0x74/0x90 kernel/workqueue.c:2485
  |  call_timer_fn+0x1b4/0x8b8 kernel/time/timer.c:1793
  |  expire_timers kernel/time/timer.c:1839 [inline]
  |  __run_timers kernel/time/timer.c:2418 [inline]
  |  __run_timer_base+0x59c/0x7b4 kernel/time/timer.c:2430
  |  run_timer_base kernel/time/timer.c:2439 [inline]
  |  run_timer_softirq+0xcc/0x194 kernel/time/timer.c:2449

The warning is probably because we are trying to queue work into a
destroyed workqueue, but the softirq context makes it hard to pinpoint
the problematic caller.

Extend the warning diagnostics to print both the function we are trying
to queue as well as the name of the workqueue.

Cc: Tejun Heo <tj@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Link: https://syzkaller.appspot.com/bug?extid=e13e654d315d4da1277c
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14 07:16:33 -10:00
Chuyi Zhou
f5717c93a1 sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of
the current task, the following error will occur:

scx_foo[3795244] triggered exit kind 1024:
  runtime error (called on a task not being operated on)

The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK()
when calling ops.tick(), which triggers the error during the subsequent
scx_kf_allowed_on_arg_tasks() check.

SCX_CALL_OP_TASK() was first introduced in commit 36454023f5 ("sched_ext:
Track tasks that are subjects of the in-flight SCX operation") to ensure
task's rq lock is held when accessing task's sched_group. Since ops.tick()
is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq
lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly,
the same changes should be made for ops.disable() and ops.exit_task(), as
they are also protected by task_rq_lock() and it's safe to access the
task's task_group.

Fixes: 36454023f5 ("sched_ext: Track tasks that are subjects of the in-flight SCX operation")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13 06:57:33 -10:00
Anup Patel
4cf7d58620 genirq: Remove unused CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS
CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS is not used anymore, hence remove it.

Fixes: f94a18249b ("genirq: Remove IRQ_MOVE_PCNTXT and related code")
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250209041655.331470-7-apatel@ventanamicro.com
2025-02-13 13:18:54 +01:00
Christian Brauner
890ed45bde
acct: block access to kernel internal filesystems
There's no point in allowing anything kernel internal nor procfs or
sysfs.

Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com
Link: https://lore.kernel.org/r/20250211-work-acct-v1-2-1c16aecab8b3@kernel.org
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12 12:24:16 +01:00
Christian Brauner
56d5f3eba3
acct: perform last write from workqueue
In [1] it was reported that the acct(2) system call can be used to
trigger NULL deref in cases where it is set to write to a file that
triggers an internal lookup. This can e.g., happen when pointing acc(2)
to /sys/power/resume. At the point the where the write to this file
happens the calling task has already exited and called exit_fs(). A
lookup will thus trigger a NULL-deref when accessing current->fs.

Reorganize the code so that the the final write happens from the
workqueue but with the caller's credentials. This preserves the
(strange) permission model and has almost no regression risk.

This api should stop to exist though.

Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]
Link: https://lore.kernel.org/r/20250211-work-acct-v1-1-1c16aecab8b3@kernel.org
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12 12:24:16 +01:00
Tejun Heo
f3f08c3acf sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
While fixing migration disabled task handling, 3296682157 ("sched_ext: Fix
migration disabled handling in targeted dispatches") assumed that a
migration disabled task's ->cpus_ptr would only have the pinned CPU. While
this is eventually true for migration disabled tasks that are switched out,
->cpus_ptr update is performed by migrate_disable_switch() which is called
right before context_switch() in __scheduler(). However, the task is
enqueued earlier during pick_next_task() via put_prev_task_scx(), so there
is a race window where another CPU can see the task on a DSQ.

If the CPU tries to dispatch the migration disabled task while in that
window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq()
will subsequently trigger SCHED_WARN(is_migration_disabled()).

  WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
  Sched_ext: layered (enabled+all), task: runnable_at=-10ms
  RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
  ...
   <TASK>
   consume_dispatch_q+0xab/0x220
   scx_bpf_dsq_move_to_local+0x58/0xd0
   bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
   bpf__sched_ext_ops_dispatch+0x4b/0xab
   balance_one+0x1fe/0x3b0
   balance_scx+0x61/0x1d0
   prev_balance+0x46/0xc0
   __pick_next_task+0x73/0x1c0
   __schedule+0x206/0x1730
   schedule+0x3a/0x160
   __do_sys_sched_yield+0xe/0x20
   do_syscall_64+0xbb/0x1e0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fix it by converting the SCHED_WARN() back to a regular failure path. Also,
perform the migration disabled test before task_allowed_on_cpu() test so
that BPF schedulers which fail to handle migration disabled tasks can be
noticed easily.

While at it, adjust scx_ops_error() message for !task_allowed_on_cpu() case
for brevity and consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
2025-02-10 10:40:47 -10:00
Tejun Heo
3296682157 sched_ext: Fix migration disabled handling in targeted dispatches
A dispatch operation that can target a specific local DSQ -
scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task
can be migrated to the target CPU using task_can_run_on_remote_rq(). If the
task can't be migrated to the targeted CPU, it is bounced through a global
DSQ.

task_can_run_on_remote_rq() assumes that the task is on a CPU that's
different from the targeted CPU but the callers doesn't uphold the
assumption and may call the function when the task is already on the target
CPU. When such task has migration disabled, task_can_run_on_remote_rq() ends
up returning %false incorrectly unnecessarily bouncing the task to a global
DSQ.

Fix it by updating the callers to only call task_can_run_on_remote_rq() when
the task is on a different CPU than the target CPU. As this is a bit subtle,
for clarity and documentation:

- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on
  the same CPU as the target CPU.

- is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger
  if the task is on a different CPU than the target CPU as the preceding
  task_allowed_on_cpu() test should fail beforehand. Convert the test into
  SCHED_WARN_ON().

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e09 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
2025-02-08 20:33:25 -10:00
Tejun Heo
2fa0fbeb69 sched_ext: Implement auto local dispatching of migration disabled tasks
Migration disabled tasks are special and pinned to their previous CPUs. They
tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may
not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers
by automatically dispatching them to the pinned local DSQs by default. If a
BPF scheduler wants to handle migration disabled tasks explicitly, it can
set SCX_OPS_ENQ_MIGRATION_DISABLED.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-08 20:32:54 -10:00
Linus Torvalds
f4a45f14cf seccomp fix for v6.14-rc2
- Allow uretprobe on x86_64 to avoid behavioral complications (Eyal Birger)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCZ6e/FAAKCRA2KwveOeQk
 u/xCAP9gd2nRw9jXgpg/CCfxkX0Yj3/pnzQoCDlS2lWy43BWNgEAtcJFDvz2Lg09
 omRld7QHjbMhNqihYgXRyD0nzX42uwY=
 =0JOg
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp fix from Kees Cook:
 "This is really a work-around for x86_64 having grown a syscall to
  implement uretprobe, which has caused problems since v6.11.

  This may change in the future, but for now, this fixes the unintended
  seccomp filtering when uretprobe switched away from traps, and does so
  with something that should be easy to backport.

   - Allow uretprobe on x86_64 to avoid behavioral complications (Eyal
     Birger)"

* tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: validate uretprobe syscall passes through seccomp
  seccomp: passthrough uretprobe systemcall without filtering
2025-02-08 14:04:21 -08:00
Linus Torvalds
a0df483fe3 function graph fix of notrace functions:
When the function graph tracer was restructured to use the global section
 of the meta data in the shadow stack, the bit logic was changed. There's a
 TRACE_GRAPH_NOTRACE_BIT that is the bit number in the mask that tells if
 the function graph tracer is currently in the "notrace" mode. The
 TRACE_GRAPH_NOTRACE is the mask with that bit set. But when the code we
 restructured, the TRACE_GRAPH_NOTRACE_BIT was used when it should have
 been the TRACE_GRAPH_NOTRACE mask. This made notrace not work properly.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ6drQBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qodZAQDpQY27L2aooxKjxjTwXDmmFu5z7x0M
 PDscxSeAuWzM5QEA/7KeepW+SmZYE7E6SGwKaPCE+cVs4US62AVrRuAsnQA=
 =w/Ns
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace fix from Steven Rostedt:
 "Function graph fix of notrace functions.

  When the function graph tracer was restructured to use the global
  section of the meta data in the shadow stack, the bit logic was
  changed. There's a TRACE_GRAPH_NOTRACE_BIT that is the bit number in
  the mask that tells if the function graph tracer is currently in the
  "notrace" mode. The TRACE_GRAPH_NOTRACE is the mask with that bit set.

  But when the code we restructured, the TRACE_GRAPH_NOTRACE_BIT was
  used when it should have been the TRACE_GRAPH_NOTRACE mask. This made
  notrace not work properly"

* tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
2025-02-08 12:18:02 -08:00
Linus Torvalds
3a0562d733 Fix a PREEMPT_RT bug in the clocksource verification code that
caused false positive warnings.
 
 Also fix a timer migration setup bug when new CPUs are added.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmenI0MRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jUGA/+MfsjIC+WolYPCKwLXCRXOXc4Qx3kKdTP
 kcJeL59SDoaKRKmgyhCLxpAdDORhK5vA8u05328Cr5JCtPrlDY22pBgi984CLUBL
 AJdu5oBMPZlLiZ735PPhicCffrV33dKLyBbuqzhtlhs+9cYdEgcbn6FfNdWawYxA
 MjreFnAQGJ3/M6il2An58GfofrKd6y8QTufTOBSSVNmVAh/QABhYu1N0ytiwjvaX
 m9HxGy0l4xH/KF0pICWTJjLPbBpSWTNqIfK1WBConpQHesp6PXwakgWQj5/Np0ot
 wMkAUwPnLldvQTm664xlTAzoZv9N4jlXORvJ/xvPWgTDcYiDnsHE/44DAEc4wHh1
 2nvOrDu9EAhpTrMWRDct7h7BhShQUNFl+L2rF6kOgUZfCQ8OHL1U3IO9HxcO31Zg
 ZLnNfF6tz6D05y2EBJWS3st1CSZKfHTxlb8p4QFMZ9dyTMRDfTYSrEO2C6fmdJcg
 GMS/rL8MC4/N4kI3BkOv144ImcZIoiEzzPC8SnR73KeEg5LRM5IwJZ8cSP9ZUz9W
 P5VQIoBsHBbtROePRmurUqFgdmWzC0qyAQLPrWvNVUiweRcGF6Au7AqE4yjoVYAz
 Aa+z+pUu6EZLlVX3+yWa/fn2ExBWCApaVJS1ctoplNUjJY5EXVgaoWpS/9/B0du9
 KlNU3DhCaYA=
 =sKCk
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
 "Fix a PREEMPT_RT bug in the clocksource verification code that caused
  false positive warnings.

  Also fix a timer migration setup bug when new CPUs are added"

* tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix off-by-one root mis-connection
  clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context
2025-02-08 11:55:03 -08:00
Linus Torvalds
c7b92e8969 Fix a cfs_rq->h_nr_runnable accounting bug that trips up a
defensive SCHED_WARN_ON() on certain workloads. The bug is
 believed to be (accidentally) self-correcting, hence no
 behavioral side effects are expected.
 
 Also print se.slice in debug output, since this value can now be
 set via the syscall ABI and can be useful to track.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmenIcsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iyZA//ZpzGiiZ8coXBk4PQ77c0+BOgSGdkbWeR
 zicKvqWd+j9skOTVrIk8MPs7D3C5uNuDuVAeNWYalBHRO7ndLfCg36pHqR8tQ3xN
 2GwziJzKpi1r4WXBSCtFe/abKrMhIsMYiqKDQz3Ry2Zfn+TuY4t4bzEYgfm481mO
 FkLcXs97RI5sFsCBI0uhRu76A9jrNRZKW2zczHTb2MS4zjwGq1yL3vkx2M11d0Su
 BqrcPjz6mZFuzCrA6pW0QeSP4Mn5xL8c2seoMYnvFrPYzq47Thxm4nOWPURCN9uX
 6MmqNXisJjTe00noswhbcUia1n2cr37uu5d21tXhn7Rf2XTO4+WRE7zjzvxLKQCr
 DtcH1XqvnChPGPFTsSlso4vbygfEZIk14DWZ71mXQJto6t06A/yjFiAR82ZIKzLA
 ZlK6hnX2rQ8WST3XSixiRPU/5xDu3ptJ0aDRV/dScCMzb7c0Y2YSeqA4xBMcZDPq
 A1TrNUsnsvH0TJXXqeslDSxxEFUKHNiOI2itzMCxAteEZXPysa8Rvcl4zFOZ5A6F
 nb8hDN94dydNnPpf4g+ZIVEHs5ES3Yeexh1dGwCjQSp72qAJK3wkHQ6KfPN/T3Gb
 PbDVw93cYJI/fhB2BX9DcnN+uNLTrqr1kfY7uFBE1feMdnQNBHILdjbByILfQAQ+
 0r7+aYwn0F0=
 =5mot
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
  SCHED_WARN_ON() on certain workloads. The bug is believed to be
  (accidentally) self-correcting, hence no behavioral side effects are
  expected.

  Also print se.slice in debug output, since this value can now be set
  via the syscall ABI and can be useful to track"

* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Provide slice length for fair tasks
  sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
2025-02-08 11:16:22 -08:00
Linus Torvalds
fa76887bb7 Fix a dangling pointer bug in the futex code used by the
uring code, which isn't causing problems at the moment
 due to uring ABI limitations leaving it essentially
 unused in current usages, but is a good idea to fix
 nevertheless.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmenHrkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jnLQ//T+vNYeyQ5Nc3CuqsZfv5h77ijCLzazSh
 qu5LXyGHHIlLLPEzh53wRQQbGBQ6A2HdbVVphn8k/0v4eT1Ez5yN7AiTYuPkEP73
 m6MWQAWcGQ7M7vR7cvWIsIB1wS5PD2g3UdvS8x+OECZk4lnSx4Xh/TfbRIURwhe2
 SS6jgRGhaodsp8N2o8c/BgrvvHY9aedJQhx4iAh3PiuPomygr9kfIAaQstQNKx61
 w4NQBQhK93LD9duESc+ONDlRhzSvbdJfRby1hbHzvcnCGe5S2aZzOfY31CPJbOt6
 UvbfeStEGEHkfqbZOXEtwVPZ80+U2hWvD67wSXFB0pTc68zkuGN3/Ko88GCyZx5+
 mxDRYWLoExknEUuk/Mc+hOzu1uaCjpXxA8qRr7SW3ewH1QOGr+ZISQgSffRdujbH
 2E2cBh9/HOeVZ/7nAvfkSU+yyfvBwZBP/Q0PN5ODpk3S7ZfCC7h57oClWx4WUuTX
 0H9N2IvPG0hqmqljKkt/5Xc4Qgvh6RA+pmxK0uUngViuw+v81Ea7/m+kbetQRO07
 OPOH/UT4nlmwoCwch+nKr/MRmZADpXEZyeRKS0kBJQLRMkN9VT1+e2Zf3Yir2Ji4
 hveqiJKiIgCPPxz3w+N/XcSgOTQUN1PmOLjEXB+gRNRctsvGZOtuY2HZIydQAMbT
 EjJBwkEWIQo=
 =qhN4
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "Fix a dangling pointer bug in the futex code used by the uring code.

  It isn't causing problems at the moment due to uring ABI limitations
  leaving it essentially unused in current usages, but is a good idea to
  fix nevertheless"

* tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Pass in task to futex_queue()
2025-02-08 10:54:11 -08:00
Jann Horn
bcc6244e13 sched: Clarify wake_up_q()'s write to task->wake_q.next
Clarify that wake_up_q() does an atomic write to task->wake_q.next, after
which a concurrent __wake_q_add() can immediately overwrite
task->wake_q.next again.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250129-sched-wakeup-prettier-v1-1-2f51f5f663fa@google.com
2025-02-08 15:43:12 +01:00
Steven Rostedt
c8c9b1d2d5 fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
The code was restructured where the function graph notrace code, that
would not trace a function and all its children is done by setting a
NOTRACE flag when the function that is not to be traced is hit.

There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags and a
TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the
restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used
TRACE_GRAPH_NOTRACE.

For example:

 # cd /sys/kernel/tracing
 # echo set_track_prepare stack_trace_save  > set_graph_notrace
 # echo function_graph > current_tracer
 # cat trace
[..]
 0)               |                          __slab_free() {
 0)               |                            free_to_partial_list() {
 0)               |                                  arch_stack_walk() {
 0)               |                                    __unwind_start() {
 0)   0.501 us    |                                      get_stack_info();

Where a non filter trace looks like:

 # echo > set_graph_notrace
 # cat trace
 0)               |                            free_to_partial_list() {
 0)               |                              set_track_prepare() {
 0)               |                                stack_trace_save() {
 0)               |                                  arch_stack_walk() {
 0)               |                                    __unwind_start() {

Where the filter should look like:

 # cat trace
 0)               |                            free_to_partial_list() {
 0)               |                              _raw_spin_lock_irqsave() {
 0)   0.350 us    |                                preempt_count_add();
 0)   0.351 us    |                                do_raw_spin_lock();
 0)   2.440 us    |                              }

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250208001511.535be150@batman.local.home
Fixes: b84214890a ("function_graph: Move graph notrace bit to shadow stack global var")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-08 08:36:45 -05:00
Kumar Kartikeya Dwivedi
8784714d7f bpf: Handle allocation failure in acquire_lock_state
The acquire_lock_state function needs to handle possible NULL values
returned by acquire_reference_state, and return -ENOMEM.

Fixes: 769b0f1c82 ("bpf: Refactor {acquire,release}_reference_state")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250206105435.2159977-24-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-07 18:17:07 -08:00
Daniel Xu
7968c65815 bpf: verifier: Disambiguate get_constant_map_key() errors
Refactor get_constant_map_key() to disambiguate the constant key
value from potential error values. In the case that the key is
negative, it could be confused for an error.

It's not currently an issue, as the verifier seems to track s32 spills
as u32. So even if the program wrongly uses a negative value for an
arraymap key, the verifier just thinks it's an impossibly high value
which gets correctly discarded.

Refactor anyways to make things cleaner and prevent potential future
issues.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/dfe144259ae7cfc98aa63e1b388a14869a10632a.1738689872.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-07 15:45:44 -08:00
Daniel Xu
884c3a18da bpf: verifier: Do not extract constant map keys for irrelevant maps
Previously, we were trying to extract constant map keys for all
bpf_map_lookup_elem(), regardless of map type. This is an issue if the
map has a u64 key and the value is very high, as it can be interpreted
as a negative signed value. This in turn is treated as an error value by
check_func_arg() which causes a valid program to be incorrectly
rejected.

Fix by only extracting constant map keys for relevant maps. This fix
works because nullness elision is only allowed for {PERCPU_}ARRAY maps,
and keys for these are within u32 range. See next commit for an example
via selftest.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/aa868b642b026ff87ba6105ea151bc8693b35932.1738689872.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-07 15:45:43 -08:00
Muhammad Adeel
db5fd3cf8b cgroup: Remove steal time from usage_usec
The CPU usage time is the time when user, system or both are using the CPU.
Steal time is the time when CPU is waiting to be run by the Hypervisor. It
should not be added to the CPU usage time, hence removing it from the
usage_usec entry.

Fixes: 936f2a70f2 ("cgroup: add cpu.stat file to root cgroup")
Acked-by: Axel Busch <axel.busch@ibm.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Muhammad Adeel <muhammad.adeel@ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-07 11:02:17 -10:00
Frederic Weisbecker
868c9037df timers/migration: Fix off-by-one root mis-connection
Before attaching a new root to the old root, the children counter of the
new root is checked to verify that only the upcoming CPU's top group have
been connected to it. However since the recently added commit b729cc1ec2
("timers/migration: Fix another race between hotplug and idle entry/exit")
this check is not valid anymore because the old root is pre-accounted
as a child to the new root. Therefore after connecting the upcoming
CPU's top group to the new root, the children count to be expected must
be 2 and not 1 anymore.

This omission results in the old root to not be connected to the new
root. Then eventually the system may run with more than one top level,
which defeats the purpose of a single idle migrator.

Also the old root is pre-accounted but not connected upon the new root
creation. But it can be connected to the new root later on. Therefore
the old root may be accounted twice to the new root. The propagation of
such overcommit can end up creating a double final top-level root with a
groupmask incorrectly initialized. Although harmless given that the final
top level roots will never have a parent to walk up to, this oddity
opportunistically reported the core issue:

  WARNING: CPU: 8 PID: 0 at kernel/time/timer_migration.c:543 tmigr_requires_handle_remote
  CPU: 8 UID: 0 PID: 0 Comm: swapper/8
  RIP: 0010:tmigr_requires_handle_remote
  Call Trace:
   <IRQ>
   ? tmigr_requires_handle_remote
   ? hrtimer_run_queues
   update_process_times
   tick_periodic
   tick_handle_periodic
   __sysvec_apic_timer_interrupt
   sysvec_apic_timer_interrupt
  </IRQ>

Fix the problem by taking the old root into account in the children count
of the new root so the connection is not omitted.

Also warn when more than one top level group exists to better detect
similar issues in the future.

Fixes: b729cc1ec2 ("timers/migration: Fix another race between hotplug and idle entry/exit")
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250205160220.39467-1-frederic@kernel.org
2025-02-07 09:02:16 +01:00
Eyal Birger
cf6cb56ef2 seccomp: passthrough uretprobe systemcall without filtering
When attaching uretprobes to processes running inside docker, the attached
process is segfaulted when encountering the retprobe.

The reason is that now that uretprobe is a system call the default seccomp
filters in docker block it as they only allow a specific set of known
syscalls. This is true for other userspace applications which use seccomp
to control their syscall surface.

Since uretprobe is a "kernel implementation detail" system call which is
not used by userspace application code directly, it is impractical and
there's very little point in forcing all userspace applications to
explicitly allow it in order to avoid crashing tracked processes.

Pass this systemcall through seccomp without depending on configuration.

Note: uretprobe is currently only x86_64 and isn't expected to ever be
supported in i386.

Fixes: ff474a78ce ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Rafael Buchbinder <rafi@rbk.io>
Closes: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
Link: https://lore.kernel.org/lkml/20250121182939.33d05470@gandalf.local.home/T/#me2676c378eff2d6a33f3054fed4a5f3afa64e65b
Link: https://lore.kernel.org/lkml/20250128145806.1849977-1-eyal.birger@gmail.com/
Cc: stable@vger.kernel.org
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Link: https://lore.kernel.org/r/20250202162921.335813-2-eyal.birger@gmail.com
[kees: minimized changes for easier backporting, tweaked commit log]
Signed-off-by: Kees Cook <kees@kernel.org>
2025-02-06 12:48:21 -08:00
Alan Maguire
517e8a7835 bpf: Fix softlockup in arena_map_free on 64k page kernel
On an aarch64 kernel with CONFIG_PAGE_SIZE_64KB=y,
arena_htab tests cause a segmentation fault and soft lockup.
The same failure is not observed with 4k pages on aarch64.

It turns out arena_map_free() is calling
apply_to_existing_page_range() with the address returned by
bpf_arena_get_kern_vm_start().  If this address is not page-aligned
the code ends up calling apply_to_pte_range() with that unaligned
address causing soft lockup.

Fix it by round up GUARD_SZ to PAGE_SIZE << 1 so that the
division by 2 in bpf_arena_get_kern_vm_start() returns
a page-aligned value.

Fixes: 317460317a ("bpf: Introduce bpf_arena.")
Reported-by: Colm Harrington <colm.harrington@oracle.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/20250205170059.427458-1-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-06 03:41:08 -08:00
Kuniyuki Iwashima
5da7e15fb5 net: Add rx_skb of kfree_skb to raw_tp_null_args[].
Yan Zhai reported a BPF prog could trigger a null-ptr-deref [0]
in trace_kfree_skb if the prog does not check if rx_sk is NULL.

Commit c53795d48e ("net: add rx_sk to trace_kfree_skb") added
rx_sk to trace_kfree_skb, but rx_sk is optional and could be NULL.

Let's add kfree_skb to raw_tp_null_args[] to let the BPF verifier
validate such a prog and prevent the issue.

Now we fail to load such a prog:

  libbpf: prog 'drop': -- BEGIN PROG LOAD LOG --
  0: R1=ctx() R10=fp0
  ; int BPF_PROG(drop, struct sk_buff *skb, void *location, @ kfree_skb_sk_null.bpf.c:21
  0: (79) r3 = *(u64 *)(r1 +24)
  func 'kfree_skb' arg3 has btf_id 5253 type STRUCT 'sock'
  1: R1=ctx() R3_w=trusted_ptr_or_null_sock(id=1)
  ; bpf_printk("sk: %d, %d\n", sk, sk->__sk_common.skc_family); @ kfree_skb_sk_null.bpf.c:24
  1: (69) r4 = *(u16 *)(r3 +16)
  R3 invalid mem access 'trusted_ptr_or_null_'
  processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --

Note this fix requires commit 838a10bd2e ("bpf: Augment raw_tp
arguments with PTR_MAYBE_NULL").

[0]:
BUG: kernel NULL pointer dereference, address: 0000000000000010
 PF: supervisor read access in kernel mode
 PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
PREEMPT SMP
RIP: 0010:bpf_prog_5e21a6db8fcff1aa_drop+0x10/0x2d
Call Trace:
 <TASK>
 ? __die+0x1f/0x60
 ? page_fault_oops+0x148/0x420
 ? search_bpf_extables+0x5b/0x70
 ? fixup_exception+0x27/0x2c0
 ? exc_page_fault+0x75/0x170
 ? asm_exc_page_fault+0x22/0x30
 ? bpf_prog_5e21a6db8fcff1aa_drop+0x10/0x2d
 bpf_trace_run4+0x68/0xd0
 ? unix_stream_connect+0x1f4/0x6f0
 sk_skb_reason_drop+0x90/0x120
 unix_stream_connect+0x1f4/0x6f0
 __sys_connect+0x7f/0xb0
 __x64_sys_connect+0x14/0x20
 do_syscall_64+0x47/0xc30
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: c53795d48e ("net: add rx_sk to trace_kfree_skb")
Reported-by: Yan Zhai <yan@cloudflare.com>
Closes: https://lore.kernel.org/netdev/Z50zebTRzI962e6X@debian.debian/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250201030142.62703-1-kuniyu@amazon.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-06 02:46:00 -08:00
Linus Torvalds
5c8c229261 Fixes for kthreads
- Properly handle return value when allocation fails for the preferred
   affinity.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeiMlMACgkQhSRUR1CO
 jHdkNg//bFAleXDdw1FcDIzpsTjQa9fCG7GiyCdFGlNoJl1IbIwVQtvZjLZhR9Ep
 xzW1V/s2YGIG9rlwDbEq2DBE/PUc/jWjhlhpuqrLE66VkXx7TPqipVMwgubBZKJr
 HCO4L4ubT3ZE/e/EBrRMhxcWqbDdTmmbG1zRjg/lv8iVH4o26vTm2ZdNCLAET3UA
 bq1VQiTFzJDo+iW1hEJoe52wroZmfQUmx1SdtXL11iHCFpxx93ZJYYLNTZ/Iweyl
 lbuRoLRD2//627mFxWI2gOzBwM3u++DgaDCqv4tOG1wBv3tMceOIHbA7K3caCrE7
 SKXnJ1jxj3A0hiC4azc+5+QeFE++hYLIJZeujRyze4Q1iNvc1hzwijBRqc6h6j7N
 rPgyalKPfKt80fzLUbR0P7psI5tJMNNyZnloQ+m9hl33n9ogFw6vGWZ0uQLRNoP1
 l1aGXIc2YGENH3sIFrAwfhuYGG/+D5h+mHXqc/UhuKCplN8zFdeFLtRPMhJk4hKn
 qfaJrluSw2r04pzhX3OxYD/q9tOp356oRUNWMwYIBLojwIbRizLUOAmfeo0xuWje
 JmraK3gEsCZd0qMa6S7qGF7H/gGlHudG73Ti85POeXizaUGZF7cxaDq4+nM/4SWI
 uWhDcbTCpHZzwyNBASSrx0xYnWvYh49Au9XYY1veNrHQDD4ssQ0=
 =lIYQ
 -----END PGP SIGNATURE-----

Merge tag 'kthreads-fixes-2025-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthreads fix from Frederic Weisbecker:

 - Properly handle return value when allocation fails for the preferred
   affinity

* tag 'kthreads-fixes-2025-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  kthread: Fix return value on kzalloc() failure in kthread_affine_preferred()
2025-02-04 11:01:48 -08:00
Yu-Chun Lin
1b0332a426 kthread: Fix return value on kzalloc() failure in kthread_affine_preferred()
kthread_affine_preferred() incorrectly returns 0 instead of -ENOMEM
when kzalloc() fails. Return 'ret' to ensure the correct error code is
propagated.

Fixes: 4d13f4304f ("kthread: Implement preferred affinity")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501301528.t0cZVbnq-lkp@intel.com/
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-02-04 01:42:27 +01:00
Linus Torvalds
f286757b64 Updates for timers and timekeeping:
- Properly cast the input to secs_to_jiffies() to unsigned long as
    otherwise the result uses the data type of the input variable, which
    causes result range checks to fail if the input data type is signed and
    smaller than unsigned long.
 
  - Handle late armed hrtimers gracefully on CPU hotplug
 
    There are legitimate cases where a hrtimer is (re)armed on an outgoing
    CPU after the timers have been migrated away. This triggers warnings and
    caused people to implement horrible workarounds in RCU. But those work
    arounds are incomplete and do not cover e.g. the scheduler hrtimers.
 
    Stop this by force moving timer which are enqueued on the current CPU
    after timer migration to be queued on a remote online CPU.
 
    This allows to undo the workarounds in a seperate step.
 
  - Demote a warning level printk() to info level in the clocksource
    watchdog code as there is no point to emit a warning level message for a
    purely informational message.
 
  - Mark a helper function __always_inline and move it into the existing
    #ifdef block to avoid 'unused function' warnings from CLANG
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmegv8QTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWA7EADF7/GBufaTAYr1ZQKs2oK+xD+Vhs8M
 4CHgG0zlnl0HkPk1CE2VNBJ9PP8C5bKfMQJyYdtsxELVBFiJJEPEqbgpGFJQljD7
 lG/bJSc5MctOauSkbURZyFKtzOwre+q4tWqZ2xvth0LTtaY3SycsImIWCKr4cvKv
 95IQlXLMUkHZsTR4sXLSwaE1Kt9uyHOPa00pkvsQJ3CaWT7BAc+bdbZ83OdM7BTk
 2XnLvH3zlwijp/o4sS8HCpdX24HQlsKm7TF5igxGmwNophRwNzP3Imd3yh6onpL3
 9BrEYPyptKl7hB9N0y3mMu7JRljphfVBmfmzcGYLfkjuGmX7KkOOj5tpD9PwqIFl
 Mu8fDff1wTMxpDcoMzW2M540xYq3Pm9kheuwGQFH3XRoq28IKxo6MufXWgpgJkz4
 JxpQ8h+9wT4LodVthcaotqHxe1yTsOzot0ggejtCDMptVlLXudZu4J/QvBqOiygg
 +3ehX7G+AY+yTbqyncPY/jLKd2lnp3PArmJ0zqvYGiGsnr07F7zFZmCGhgUr96w7
 ZQWc9D9AvHREPoXiXb6AxbYAOImjVbYW2+Y1eB3eWK6GlhALoQ82YGldW4y4rfeu
 +9OaU7WgRTjlfBdaid7DmqBaLvpCSQKj/spzwKkq7eAiO9RSNdy91eaug99Ezjgn
 NySGwUM4t0Wkfw==
 =lUGt
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2025-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:

 - Properly cast the input to secs_to_jiffies() to unsigned long as
   otherwise the result uses the data type of the input variable, which
   causes result range checks to fail if the input data type is signed
   and smaller than unsigned long.

 - Handle late armed hrtimers gracefully on CPU hotplug

   There are legitimate cases where a hrtimer is (re)armed on an
   outgoing CPU after the timers have been migrated away. This triggers
   warnings and caused people to implement horrible workarounds in RCU.
   But those workarounds are incomplete and do not cover e.g. the
   scheduler hrtimers.

   Stop this by force moving timer which are enqueued on the current CPU
   after timer migration to be queued on a remote online CPU.

   This allows to undo the workarounds in a seperate step.

 - Demote a warning level printk() to info level in the clocksource
   watchdog code as there is no point to emit a warning level message
   for a purely informational message.

 - Mark a helper function __always_inline and move it into the existing
   #ifdef block to avoid 'unused function' warnings from CLANG

* tag 'timers-urgent-2025-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  jiffies: Cast to unsigned long in secs_to_jiffies() conversion
  clocksource: Use pr_info() for "Checking clocksource synchronization" message
  hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING
  hrtimers: Mark is_migration_base() with __always_inline
2025-02-03 09:10:56 -08:00
Waiman Long
6bb05a3333 clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context
The following bug report happened with a PREEMPT_RT kernel:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2012, name: kwatchdog
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  get_random_u32+0x4f/0x110
  clocksource_verify_choose_cpus+0xab/0x1a0
  clocksource_verify_percpu.part.0+0x6b/0x330
  clocksource_watchdog_kthread+0x193/0x1a0

It is due to the fact that clocksource_verify_choose_cpus() is invoked with
preemption disabled.  This function invokes get_random_u32() to obtain
random numbers for choosing CPUs.  The batched_entropy_32 local lock and/or
the base_crng.lock spinlock in driver/char/random.c will be acquired during
the call. In PREEMPT_RT kernel, they are both sleeping locks and so cannot
be acquired in atomic context.

Fix this problem by using migrate_disable() to allow smp_processor_id() to
be reliably used without introducing atomic context. preempt_disable() is
then called after clocksource_verify_choose_cpus() but before the
clocksource measurement is being run to avoid introducing unexpected
latency.

Fixes: 7560c02bdf ("clocksource: Check per-CPU clock synchronization when marked unstable")
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/all/20250131173323.891943-2-longman@redhat.com
2025-02-03 16:18:56 +01:00
Shakeel Butt
b69bb476de cgroup: fix race between fork and cgroup.kill
Tejun reported the following race between fork() and cgroup.kill at [1].

Tejun:
  I was looking at cgroup.kill implementation and wondering whether there
  could be a race window. So, __cgroup_kill() does the following:

   k1. Set CGRP_KILL.
   k2. Iterate tasks and deliver SIGKILL.
   k3. Clear CGRP_KILL.

  The copy_process() does the following:

   c1. Copy a bunch of stuff.
   c2. Grab siglock.
   c3. Check fatal_signal_pending().
   c4. Commit to forking.
   c5. Release siglock.
   c6. Call cgroup_post_fork() which puts the task on the css_set and tests
       CGRP_KILL.

  The intention seems to be that either a forking task gets SIGKILL and
  terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I
  don't see what guarantees that k3 can't happen before c6. ie. After a
  forking task passes c5, k2 can take place and then before the forking task
  reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child.
  What am I missing?

This is indeed a race. One way to fix this race is by taking
cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork()
side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork()
to cgroup_post_fork(). However that would be heavy handed as this adds
one more potential stall scenario for cgroup.kill which is usually
called under extreme situation like memory pressure.

To fix this race, let's maintain a sequence number per cgroup which gets
incremented on __cgroup_kill() call. On the fork() side, the
cgroup_can_fork() will cache the sequence number locally and recheck it
against the cgroup's sequence number at cgroup_post_fork() site. If the
sequence numbers mismatch, it means __cgroup_kill() can been called and
we should send SIGKILL to the newly created task.

Reported-by: Tejun Heo <tj@kernel.org>
Closes: https://lore.kernel.org/all/Z5QHE2Qn-QZ6M-KW@slm.duckdns.org/ [1]
Fixes: 661ee62809 ("cgroup: introduce cgroup.kill")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 06:54:51 -10:00
Linus Torvalds
03cc3579bc 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues.
13 are for MM and 8 are for non-MM.  All are singletons, please see the
 changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ54MEgAKCRDdBJ7gKXxA
 jlhdAP0evTQ9JX+22DDWSVdWFBbnQ74c5ddFXVQc1LO2G2FhFgD+OXhH8E65Nez5
 qGWjb4xgjoQTHS7AL4pYEFYqx/cpbAQ=
 =rN+l
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "21 hotfixes. 8 are cc:stable and the remainder address post-6.13
  issues. 13 are for MM and 8 are for non-MM.

  All are singletons, please see the changelogs for details"

* tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  MAINTAINERS: include linux-mm for xarray maintenance
  revert "xarray: port tests to kunit"
  MAINTAINERS: add lib/test_xarray.c
  mailmap, MAINTAINERS, docs: update Carlos's email address
  mm/hugetlb: fix hugepage allocation for interleaved memory nodes
  mm: gup: fix infinite loop within __get_longterm_locked
  mm, swap: fix reclaim offset calculation error during allocation
  .mailmap: update email address for Christopher Obbard
  kfence: skip __GFP_THISNODE allocations on NUMA systems
  nilfs2: fix possible int overflows in nilfs_fiemap()
  mm: compaction: use the proper flag to determine watermarks
  kernel: be more careful about dup_mmap() failures and uprobe registering
  mm/fake-numa: handle cases with no SRAT info
  mm: kmemleak: fix upper boundary check for physical address objects
  mailmap: add an entry for Hamza Mahfooz
  MAINTAINERS: mailmap: update Yosry Ahmed's email address
  scripts/gdb: fix aarch64 userspace detection in get_current_task
  mm/vmscan: accumulate nr_demoted for accurate demotion statistics
  ocfs2: fix incorrect CPU endianness conversion causing mount failure
  mm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc()
  ...
2025-02-01 09:49:20 -08:00
Liam R. Howlett
64c37e134b kernel: be more careful about dup_mmap() failures and uprobe registering
If a memory allocation fails during dup_mmap(), the maple tree can be left
in an unsafe state for other iterators besides the exit path.  All the
locks are dropped before the exit_mmap() call (in mm/mmap.c), but the
incomplete mm_struct can be reached through (at least) the rmap finding
the vmas which have a pointer back to the mm_struct.

Up to this point, there have been no issues with being able to find an
mm_struct that was only partially initialised.  Syzbot was able to make
the incomplete mm_struct fail with recent forking changes, so it has been
proven unsafe to use the mm_struct that hasn't been initialised, as
referenced in the link below.

Although 8ac662f5da ("fork: avoid inappropriate uprobe access to
invalid mm") fixed the uprobe access, it does not completely remove the
race.

This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the
oom side (even though this is extremely unlikely to be selected as an oom
victim in the race window), and sets MMF_UNSTABLE to avoid other potential
users from using a partially initialised mm_struct.

When registering vmas for uprobe, skip the vmas in an mm that is marked
unstable.  Modifying a vma in an unstable mm may cause issues if the mm
isn't fully initialised.

Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com
Fixes: d240629148 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:25 -08:00
Linus Torvalds
fd8c09ad0d Kbuild updates for v6.14
- Support multiple hook locations for maint scripts of Debian package
 
  - Remove 'cpio' from the build tool requirement
 
  - Introduce gendwarfksyms tool, which computes CRCs for export symbols
    based on the DWARF information
 
  - Support CONFIG_MODVERSIONS for Rust
 
  - Resolve all conflicts in the genksyms parser
 
  - Fix several syntax errors in genksyms
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmedJ7AVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGBs0P/1svrWxVRdD7XFtM+UykQqf2R/be
 SnZDeeviqdbGr1J0X5FM/pjaeOb1BPnXsr6O67QaWhyVnpWUviBwwYG2NrCDsx9k
 AaMVsROzpDIeoMhMNeYVs+/SYhA8Jndnd4BOH5CBbo52k4/BLqhU9LXLncLgjZbF
 nys30TilwOaylKq2FHFVI1GhTCQtKKbq+tIxE1SKkHZ1LsSaFphe/eCHMpcOvQb+
 zIFHMm2eIcSlS8TCONFeTnw7NdeCVldYbPCyNAV+nC7Ow7VM8Ws1fq5vsKeH3TGE
 +3qtgQS41KpccNRdp4cTvy6p9iBEvpAvk1BAAOZ347EtLXrNNhngg65CbjyCUt7H
 yBpgWZ6GAGq2yExX5bHbbFHb/n4I3HhkZKeaFDZ3VnMPnni4zdbBqD+sBSI3yHLC
 LUh1NI8gIHjD4bIazbjxWMAQlamQNVNMaHkHqGjro2yIbLL1i2mqMdXuzYhAVwrx
 al7hv357fVPwQ1Gfin7R23T4/NqBTB+1IJnYTBYnnAFnUIAdhsuu9YkX8I97i6yu
 mdTrGROpEYL0GZTE+1LGz6V9DZfQHP8GZ5fDAU1X/f8Js9xkn1lHFhFCg/xM5f9j
 RnwHdiRbvq6uL40/zkinlcpPnGWJkAXWgZqlc0ZbwW8v/vKI51Uo0qUxlVSeu2uH
 mQG30SibArJFPpUN
 =D553
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Support multiple hook locations for maint scripts of Debian package

 - Remove 'cpio' from the build tool requirement

 - Introduce gendwarfksyms tool, which computes CRCs for export symbols
   based on the DWARF information

 - Support CONFIG_MODVERSIONS for Rust

 - Resolve all conflicts in the genksyms parser

 - Fix several syntax errors in genksyms

* tag 'kbuild-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (64 commits)
  kbuild: fix Clang LTO with CONFIG_OBJTOOL=n
  kbuild: Strip runtime const RELA sections correctly
  kconfig: fix memory leak in sym_warn_unmet_dep()
  kconfig: fix file name in warnings when loading KCONFIG_DEFCONFIG_LIST
  genksyms: fix syntax error for attribute before init-declarator
  genksyms: fix syntax error for builtin (u)int*x*_t types
  genksyms: fix syntax error for attribute after 'union'
  genksyms: fix syntax error for attribute after 'struct'
  genksyms: fix syntax error for attribute after abstact_declarator
  genksyms: fix syntax error for attribute before nested_declarator
  genksyms: fix syntax error for attribute before abstract_declarator
  genksyms: decouple ATTRIBUTE_PHRASE from type-qualifier
  genksyms: record attributes consistently for init-declarator
  genksyms: restrict direct-declarator to take one parameter-type-list
  genksyms: restrict direct-abstract-declarator to take one parameter-type-list
  genksyms: remove Makefile hack
  genksyms: fix last 3 shift/reduce conflicts
  genksyms: fix 6 shift/reduce conflicts and 5 reduce/reduce conflicts
  genksyms: reduce type_qualifier directly to decl_specifier
  genksyms: rename cvar_qualifier to type_qualifier
  ...
2025-01-31 12:07:07 -08:00
Christian Loehle
9065ce6975 sched/debug: Provide slice length for fair tasks
Since commit:

  857b158dc5 ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion")

... we have the userspace per-task tunable slice length, which is
a key parameter that is otherwise difficult to obtain, so provide
it in /proc/$PID/sched.

[ mingo: Clarified the changelog. ]

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/453349b1-1637-42f5-a7b2-2385392b5956@arm.com
2025-01-31 10:45:33 +01:00
Linus Torvalds
e20f2bb8b7 audit-pr-20250130
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmeb6w4UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNnbxAAgyw1kFtFfIqQCs8OJ7mUo5W48GXU
 60Lxv8ce4kJrYsGtNyS57Z4EmzAIbmt/XGbdrRwAtTs+OFystfXXgFS9DUgNc9yp
 3ERn9W+1qLZajJUOE5KKKbTSpzKF8VYuycxwT8qN3YzbuvGZXS7Ssj+XUb5n6B68
 m6YlLJJ8Mcw5XnQiTo+o7AakTrr52R+xfL5pnW4+FnypdXX7KO63JgKxG2uoqeDA
 FSEMu0zgvVNGziqsEYdq5v47w6OvJHq2Zw4k2uEDX5tPU1YtZH2QMXQL8ED020we
 v51/1QmVVN09FZ9lsJSNVIWP5Mboms0cH/89KPxEwSMEW+Du80wm0bVTZloxyGyS
 UNjP+TdDZqySFGe5U6hgBFfOMVez4zfyfjw8VdUtQ5EJxGyJZPZ+3QznvCaykTP6
 ybEGcklybEHykPbFiMAjXTe46F7+0NAs27j5SPm6oWHCEuP7UQwvdT2Eae9N3Ldj
 MTX7M8nBJOx09SxKi4Vy2X4G/frlpptpoPFIvEOWK9ysDoxnwVJ/+A1NJ3ThaDj9
 Y42kNBDMdUo7Tnfs0yriMNlu6jdoVIlaTS06BqsjmzPerca7Hg/qToolKwrl9YBx
 bdr9Vm6xlqF58khKZIA8YehbgRsHQ6w4pkmD9hjeU3fNB55i8MUuUV3NcpRbnInf
 awOkSrXWT9bOKSI=
 =XdIy
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20250130' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "A minor audit patch to fix an unitialized variable problem"

* tag 'audit-pr-20250130' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: Initialize lsmctx to avoid memory allocation error
2025-01-30 17:36:14 -08:00
Linus Torvalds
f55b0671e3 More power management updates for 6.14-rc1
- Add missing error handling for syscore_suspend() to the hibernation
    core code (Wentao Liang).
 
  - Revert a commit that added unused macros (Andy Shevchenko).
 
  - Synchronize the runtime PM status of devices that were runtime-
    suspended before a system-wide suspend and need to be resumed during
    the subsequent system-wide resume transition (Rafael Wysocki).
 
  - Clean up the teo cpuidle governor and make the handling of short idle
    intervals in it consistent regardless of the properties of idle
    states supplied by the cpuidle driver (Rafael Wysocki).
 
  - Fix some boost-related issues in cpufreq (Lifeng Zheng).
 
  - Fix build issues in the s3c64xx and airoha cpufreq drivers (Viresh
    Kumar).
 
  - Remove unconditional binding of schedutil governor kthreads to the
    affected CPUs if the cpufreq driver indicates that updates can happen
    from any CPU (Christian Loehle).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmeb5xYSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxcFsP/2FIoEI2G6J7pk8zChWT225qkkaieh5P
 tHIkcFINlgzyjLnqmyWELUdt+sB7re6/dMmoLor+abudHimvBvUfAj6Oiz1F3p2F
 utE9TpfhOkXi1ci5zBl9h6+iDj2Z5op3Qe/qw/W3DTlcManAD+6r60A2tOEy0jhi
 GTbp2SEEU28+LU/2J59IfxEuRTTH4pbQGXi+iKv/k9bmtLvQofa1saXyQCBSZrvO
 z3MBdqnAxLeZCg/qILmEGsBvbv1wpugvp3yoMLVwGNyul12Augcs8PreQz7e5tFq
 spEuCfpBJwyJLAGlOnjOYgsPbJBXWRkIBeLH7JealfZr9TX0y4LZSAHi/xe0Asd3
 BBZLLDojxhYMLzmqSkuafHlQd5J7jKl++RS1A9Qm6aqglKjeSOC9Ca9fmrc9Ub9P
 Jpf1SVJ3kJsv1Z7wuUcaj6oLxD8wlAgCo5pigNgWTP2HllhP2bmc22M8JWxpAz+m
 nMeW8nr8bAu4XViZeb74YKGgUDngO/uKRwBpthSkE2fqM7q0Wr5E7G0u9M91mzG6
 nd/XDwta5TeznMQpSy339NgT61i4HHfyc/SDIdpkBxI0C5l6jknNawq79i9gy7/E
 In4MyooOlls/iX/JR0uxx2hXEByltF0IHqwsRYeJ6dmIajYgASXR1Hhh5Iy9fPJ/
 JTJ7vR5oZPB/
 =EgsM
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These are mostly fixes on top of the previously merged power
  management material with the addition of some teo cpuidle governor
  updates, some of which may also be regarded as fixes:

   - Add missing error handling for syscore_suspend() to the hibernation
     core code (Wentao Liang)

   - Revert a commit that added unused macros (Andy Shevchenko)

   - Synchronize the runtime PM status of devices that were runtime-
     suspended before a system-wide suspend and need to be resumed
     during the subsequent system-wide resume transition (Rafael
     Wysocki)

   - Clean up the teo cpuidle governor and make the handling of short
     idle intervals in it consistent regardless of the properties of
     idle states supplied by the cpuidle driver (Rafael Wysocki)

   - Fix some boost-related issues in cpufreq (Lifeng Zheng)

   - Fix build issues in the s3c64xx and airoha cpufreq drivers (Viresh
     Kumar)

   - Remove unconditional binding of schedutil governor kthreads to the
     affected CPUs if the cpufreq driver indicates that updates can
     happen from any CPU (Christian Loehle)"

* tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: core: Synchronize runtime PM status of parents and children
  cpufreq: airoha: Depends on OF
  PM: Revert "Add EXPORT macros for exporting PM functions"
  PM: hibernate: Add error handling for syscore_suspend()
  cpufreq/schedutil: Only bind threads if needed
  cpufreq: ACPI: Remove set_boost in acpi_cpufreq_cpu_init()
  cpufreq: CPPC: Fix wrong max_freq in policy initialization
  cpufreq: Introduce a more generic way to set default per-policy boost flag
  cpufreq: Fix re-boost issue after hotplugging a CPU
  cpufreq: s3c64xx: Fix compilation warning
  cpuidle: teo: Skip sleep length computation for low latency constraints
  cpuidle: teo: Replace time_span_ns with a flag
  cpuidle: teo: Simplify handling of total events count
  cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
  cpuidle: teo: Simplify counting events used for tick management
  cpuidle: teo: Clarify two code comments
  cpuidle: teo: Drop local variable prev_intercept_idx
  cpuidle: teo: Combine candidate state index checks against 0
  cpuidle: teo: Reorder candidate state index checks
  cpuidle: teo: Rearrange idle state lookup code
2025-01-30 15:10:34 -08:00
Rafael J. Wysocki
a01e0f47a7 Merge branch 'pm-sleep'
Merge fixes related to system sleep for 6.14-rc1:

 - Add missing error handling for syscore_suspend() to the hibernation
   core code (Wentao Liang).

 - Revert a commit that added unused macros (Andy Shevchenko).

 - Synchronize the runtime PM status of devices that were runtime-
   suspended before a system-wide suspend and need to be resumed during
   the subsequent syste-wide resume transition (Rafael Wysocki).

* pm-sleep:
  PM: sleep: core: Synchronize runtime PM status of parents and children
  PM: Revert "Add EXPORT macros for exporting PM functions"
  PM: hibernate: Add error handling for syscore_suspend()
2025-01-30 21:28:16 +01:00
Abel Wu
c78f4afbd9 bpf: Fix deadlock when freeing cgroup storage
The following commit
bc235cdb42 ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")
first introduced deadlock prevention for fentry/fexit programs attaching
on bpf_task_storage helpers. That commit also employed the logic in map
free path in its v6 version.

Later bpf_cgrp_storage was first introduced in
c4bcfb38a9 ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
which faces the same issue as bpf_task_storage, instead of its busy
counter, NULL was passed to bpf_local_storage_map_free() which opened
a window to cause deadlock:

	<TASK>
		(acquiring local_storage->lock)
	_raw_spin_lock_irqsave+0x3d/0x50
	bpf_local_storage_update+0xd1/0x460
	bpf_cgrp_storage_get+0x109/0x130
	bpf_prog_a4d4a370ba857314_cgrp_ptr+0x139/0x170
	? __bpf_prog_enter_recur+0x16/0x80
	bpf_trampoline_6442485186+0x43/0xa4
	cgroup_storage_ptr+0x9/0x20
		(holding local_storage->lock)
	bpf_selem_unlink_storage_nolock.constprop.0+0x135/0x160
	bpf_selem_unlink_storage+0x6f/0x110
	bpf_local_storage_map_free+0xa2/0x110
	bpf_map_free_deferred+0x5b/0x90
	process_one_work+0x17c/0x390
	worker_thread+0x251/0x360
	kthread+0xd2/0x100
	ret_from_fork+0x34/0x50
	ret_from_fork_asm+0x1a/0x30
	</TASK>

Progs:
 - A: SEC("fentry/cgroup_storage_ptr")
   - cgid (BPF_MAP_TYPE_HASH)
	Record the id of the cgroup the current task belonging
	to in this hash map, using the address of the cgroup
	as the map key.
   - cgrpa (BPF_MAP_TYPE_CGRP_STORAGE)
	If current task is a kworker, lookup the above hash
	map using function parameter @owner as the key to get
	its corresponding cgroup id which is then used to get
	a trusted pointer to the cgroup through
	bpf_cgroup_from_id(). This trusted pointer can then
	be passed to bpf_cgrp_storage_get() to finally trigger
	the deadlock issue.
 - B: SEC("tp_btf/sys_enter")
   - cgrpb (BPF_MAP_TYPE_CGRP_STORAGE)
	The only purpose of this prog is to fill Prog A's
	hash map by calling bpf_cgrp_storage_get() for as
	many userspace tasks as possible.

Steps to reproduce:
 - Run A;
 - while (true) { Run B; Destroy B; }

Fix this issue by passing its busy counter to the free procedure so
it can be properly incremented before storage/smap locking.

Fixes: c4bcfb38a9 ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241221061018.37717-1-wuyun.abel@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-29 18:38:19 -08:00
Huacai Chen
35fcac7a7c audit: Initialize lsmctx to avoid memory allocation error
When audit is enabled in a kernel build, and there are no LSMs active
that support LSM labeling, it is possible that local variable lsmctx
in the AUDIT_SIGNAL_INFO handler in audit_receive_msg() could be used
before it is properly initialize. Then kmalloc() will try to allocate
a large amount of memory with the uninitialized length.

This patch corrects this problem by initializing the lsmctx to a safe
value when it is declared, which avoid errors like:

 WARNING: CPU: 2 PID: 443 at mm/page_alloc.c:4727 __alloc_pages_noprof
        ...
    ra: 9000000003059644 ___kmalloc_large_node+0x84/0x1e0
   ERA: 900000000304d588 __alloc_pages_noprof+0x4c8/0x1040
  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
  PRMD: 00000004 (PPLV0 +PIE -PWE)
  EUEN: 00000007 (+FPE +SXE +ASXE -BTE)
  ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
 ESTAT: 000c0000 [BRK] (IS= ECode=12 EsubCode=0)
  PRID: 0014c010 (Loongson-64bit, Loongson-3A5000)
 CPU: 2 UID: 0 PID: 443 Comm: auditd Not tainted 6.13.0-rc1+ #1899
        ...
 Call Trace:
 [<9000000002def6a8>] show_stack+0x30/0x148
 [<9000000002debf58>] dump_stack_lvl+0x68/0xa0
 [<9000000002e0fe18>] __warn+0x80/0x108
 [<900000000407486c>] report_bug+0x154/0x268
 [<90000000040ad468>] do_bp+0x2a8/0x320
 [<9000000002dedda0>] handle_bp+0x120/0x1c0
 [<900000000304d588>] __alloc_pages_noprof+0x4c8/0x1040
 [<9000000003059640>] ___kmalloc_large_node+0x80/0x1e0
 [<9000000003061504>] __kmalloc_noprof+0x2c4/0x380
 [<9000000002f0f7ac>] audit_receive_msg+0x764/0x1530
 [<9000000002f1065c>] audit_receive+0xe4/0x1c0
 [<9000000003e5abe8>] netlink_unicast+0x340/0x450
 [<9000000003e5ae9c>] netlink_sendmsg+0x1a4/0x4a0
 [<9000000003d9ffd0>] __sock_sendmsg+0x48/0x58
 [<9000000003da32f0>] __sys_sendto+0x100/0x170
 [<9000000003da3374>] sys_sendto+0x14/0x28
 [<90000000040ad574>] do_syscall+0x94/0x138
 [<9000000002ded318>] handle_syscall+0xb8/0x158

Fixes: 6fba89813c ("lsm: ensure the correct LSM context releaser")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
[PM: resolved excessive line length in the backtrace]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-01-29 20:02:04 -05:00
Linus Torvalds
af13ff1c33 Summary:
All ctl_table declared outside of functions and that remain unmodified after
   initialization are const qualified. This prevents unintended modifications to
   proc_handler function pointers by placing them in the .rodata section. This is
   a continuation of the tree-wide effort started a few releases ago with the
   constification of the ctl_table struct arguments in the sysctl API done in
   78eb4ea25c ("sysctl: treewide: constify the ctl_table argument of
   proc_handlers")
 
 Testing:
 
   Testing was done on 0-day and sysctl selftests in x86_64. The linux-next
   branch was not used for such a big change in order to avoid unnecessary merge
   conflicts
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmeY6L0ACgkQupfNUreW
 QU/REwwAizeoFg3XyfwvGsjKUJKvZ8Ltnv3n4+tkd687UAQJnJHPE7/ODR8hKbpE
 E56G12jFlKQyiFR01wg+cbOy6+TTOT9o5qVmLZbo/zmI491Ygkxqen0Y0Z2mGXqR
 FMqcI8ZBmAAYfUKDjjUo+xUI70aNikWOOKRSmJp4cpgm5242d/UN7sOuKkOgt5DY
 GiyjPGlpKFkcYN4bOegKhlfZKdr9BMFxSgN0TZLtensj6cDrkZyLsrdgmVXy1mRT
 0xTnmonGehweog4XY4hSPt2l6uCUu1fiY/WUcghKdWxUty43x9J3LahfD9b7DiAA
 G+DxHStSH0S/czWsa8Z0peyt/2gW8KZcRgk9W4UyVhpyDknXtVxr2sI3nxbTEFGl
 x2h6C29VCqg9Tn9oljEgGbYUrwlLz5Mah65JLDwlPLTpJmfA4BNbNxaC1V+DiqrX
 eApet8vaqGPlG7F3DRlyRAn7DoG8rs/eX93qqjbSA/pUjKjQUwCk/VBxNr1JBuNG
 elX+8QZi
 =x7aW
 -----END PGP SIGNATURE-----

Merge tag 'constfy-sysctl-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl table constification from Joel Granados:
 "All ctl_table declared outside of functions and that remain unmodified
  after initialization are const qualified.

  This prevents unintended modifications to proc_handler function
  pointers by placing them in the .rodata section.

  This is a continuation of the tree-wide effort started a few releases
  ago with the constification of the ctl_table struct arguments in the
  sysctl API done in 78eb4ea25c ("sysctl: treewide: constify the
  ctl_table argument of proc_handlers")"

* tag 'constfy-sysctl-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  treewide: const qualify ctl_tables where applicable
2025-01-29 10:35:40 -08:00
Andrii Nakryiko
bc27c52eea bpf: avoid holding freeze_mutex during mmap operation
We use map->freeze_mutex to prevent races between map_freeze() and
memory mapping BPF map contents with writable permissions. The way we
naively do this means we'll hold freeze_mutex for entire duration of all
the mm and VMA manipulations, which is completely unnecessary. This can
potentially also lead to deadlocks, as reported by syzbot in [0].

So, instead, hold freeze_mutex only during writeability checks, bump
(proactively) "write active" count for the map, unlock the mutex and
proceed with mmap logic. And only if something went wrong during mmap
logic, then undo that "write active" counter increment.

  [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/

Fixes: fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Reported-by: syzbot+4dc041c686b7c816a71e@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250129012246.1515826-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-29 09:49:50 -08:00
Andrii Nakryiko
98671a0fd1 bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic
For all BPF maps we ensure that VM_MAYWRITE is cleared when
memory-mapping BPF map contents as initially read-only VMA. This is
because in some cases BPF verifier relies on the underlying data to not
be modified afterwards by user space, so once something is mapped
read-only, it shouldn't be re-mmap'ed as read-write.

As such, it's not necessary to check VM_MAYWRITE in bpf_map_mmap() and
map->ops->map_mmap() callbacks: VM_WRITE should be consistently set for
read-write mappings, and if VM_WRITE is not set, there is no way for
user space to upgrade read-only mapping to read-write one.

This patch cleans up this VM_WRITE vs VM_MAYWRITE handling within
bpf_map_mmap(), which is an entry point for any BPF map mmap()-ing
logic. We also drop unnecessary sanitization of VM_MAYWRITE in BPF
ringbuf's map_mmap() callback implementation, as it is already performed
by common code in bpf_map_mmap().

Note, though, that in bpf_map_mmap_{open,close}() callbacks we can't
drop VM_MAYWRITE use, because it's possible (and is outside of
subsystem's control) to have initially read-write memory mapping, which
is subsequently dropped to read-only by user space through mprotect().
In such case, from BPF verifier POV it's read-write data throughout the
lifetime of BPF map, and is counted as "active writer".

But its VMAs will start out as VM_WRITE|VM_MAYWRITE, then mprotect() can
change it to just VM_MAYWRITE (and no VM_WRITE), so when its finally
munmap()'ed and bpf_map_mmap_close() is called, vm_flags will be just
VM_MAYWRITE, but we still need to decrement active writer count with
bpf_map_write_active_dec() as it's still considered to be a read-write
mapping by the rest of BPF subsystem.

Similar reasoning applies to bpf_map_mmap_open(), which is called
whenever mmap(), munmap(), and/or mprotect() forces mm subsystem to
split original VMA into multiple discontiguous VMAs.

Memory-mapping handling is a bit tricky, yes.

Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250129012246.1515826-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-29 09:49:50 -08:00
Linus Torvalds
2ab002c755 Driver core and debugfs updates
Here is the big set of driver core and debugfs updates for 6.14-rc1.
 It's coming late in the merge cycle as there are a number of merge
 conflicts with your tree now, and I wanted to make sure they were
 working properly.  To resolve them, look in linux-next, and I will send
 the "fixup" patch as a response to the pull request.
 
 Included in here is a bunch of driver core, PCI, OF, and platform rust
 bindings (all acked by the different subsystem maintainers), hence the
 merge conflict with the rust tree, and some driver core api updates to
 mark things as const, which will also require some fixups due to new
 stuff coming in through other trees in this merge window.
 
 There are also a bunch of debugfs updates from Al, and there is at least
 one user that does have a regression with these, but Al is working on
 tracking down the fix for it.  In my use (and everyone else's linux-next
 use), it does not seem like a big issue at the moment.
 
 Here's a short list of the things in here:
   - driver core bindings for PCI, platform, OF, and some i/o functions.
     We are almost at the "write a real driver in rust" stage now,
     depending on what you want to do.
   - misc device rust bindings and a sample driver to show how to use
     them
   - debugfs cleanups in the fs as well as the users of the fs api for
     places where drivers got it wrong or were unnecessarily doing things
     in complex ways.
   - driver core const work, making more of the api take const * for
     different parameters to make the rust bindings easier overall.
   - other small fixes and updates
 
 All of these have been in linux-next with all of the aforementioned
 merge conflicts, and the one debugfs issue, which looks to be resolved
 "soon".
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ5koPA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymFHACfT5acDKf2Bov2Lc/5u3vBW/R6ChsAnj+LmgVI
 hcDSPodj4szR40RRnzBd
 =u5Ey
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core and debugfs updates from Greg KH:
 "Here is the big set of driver core and debugfs updates for 6.14-rc1.

  Included in here is a bunch of driver core, PCI, OF, and platform rust
  bindings (all acked by the different subsystem maintainers), hence the
  merge conflict with the rust tree, and some driver core api updates to
  mark things as const, which will also require some fixups due to new
  stuff coming in through other trees in this merge window.

  There are also a bunch of debugfs updates from Al, and there is at
  least one user that does have a regression with these, but Al is
  working on tracking down the fix for it. In my use (and everyone
  else's linux-next use), it does not seem like a big issue at the
  moment.

  Here's a short list of the things in here:

   - driver core rust bindings for PCI, platform, OF, and some i/o
     functions.

     We are almost at the "write a real driver in rust" stage now,
     depending on what you want to do.

   - misc device rust bindings and a sample driver to show how to use
     them

   - debugfs cleanups in the fs as well as the users of the fs api for
     places where drivers got it wrong or were unnecessarily doing
     things in complex ways.

   - driver core const work, making more of the api take const * for
     different parameters to make the rust bindings easier overall.

   - other small fixes and updates

  All of these have been in linux-next with all of the aforementioned
  merge conflicts, and the one debugfs issue, which looks to be resolved
  "soon""

* tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
  rust: device: Use as_char_ptr() to avoid explicit cast
  rust: device: Replace CString with CStr in property_present()
  devcoredump: Constify 'struct bin_attribute'
  devcoredump: Define 'struct bin_attribute' through macro
  rust: device: Add property_present()
  saner replacement for debugfs_rename()
  orangefs-debugfs: don't mess with ->d_name
  octeontx2: don't mess with ->d_parent or ->d_parent->d_name
  arm_scmi: don't mess with ->d_parent->d_name
  slub: don't mess with ->d_name
  sof-client-ipc-flood-test: don't mess with ->d_name
  qat: don't mess with ->d_name
  xhci: don't mess with ->d_iname
  mtu3: don't mess wiht ->d_iname
  greybus/camera - stop messing with ->d_iname
  mediatek: stop messing with ->d_iname
  netdevsim: don't embed file_operations into your structs
  b43legacy: make use of debugfs_get_aux()
  b43: stop embedding struct file_operations into their objects
  carl9170: stop embedding file_operations into their objects
  ...
2025-01-28 12:25:12 -08:00
Linus Torvalds
f785692ff5 stop_machine pull request for v6.14
Move a misplaced call to rcu_momentary_eqs() from multi_cpu_stop()
 to ensure that interrupts are disabled as required.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmeY+YcTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jAGmD/sGWoBndDhdJnYoxOygE3piYK4j3oZ2
 LhqpCOpsbVwSOm2Axg1mJufzKBUgQjWQDnIsQLPlq/cfURvaUPh1iUHl8Uslp3Dh
 ONo1VpXtbjuwDNiDi72DaoJNU3vGC6WJq43cX89F8fg2v3Z4PsLZfYZXUUo+lfQ9
 Vqag0Zrp7pOhkksmV1bMIpwChn9WBSzFTvavPQEwNumWIfigMIJ9pBhxBp0hcgSi
 8O2F2Yga9lpBNPuXqfKuANBQpk4vvNaRCLKBHXmmd1HwcsbOMjTgPUXbWYKJm0bX
 yS6SWwBVnok3Atl17BBQNdjYPZuNBYjoAupXjXnQVsc8nebPPeTlmbjoVSi24rAz
 2DX7Kb8hHkHhHDlKOy1PumDu3gVjgVGN94UnJlqAePY91XIxls2Xjo6C0CGMRNXS
 I53ORv1S06HZRg5i7V7NNatppSq8JXFa/7EJTW/5ZHzb61TmR3PqK3yj7glFHWzc
 +xBc8wNyoMbxduCGQEscSYu1MiycMFHOiSX2B+3v/ckrW79CcgoH7KwDGohIs5HB
 XYsrgd0K+ESejQm3z/fF6Y4lFjxU0EmfpW7mVHT+N2JOegRO31bnLLamZ4O6yokV
 ETU0edJh17Tl1I99Q2LczRydU3pw25hs5e3LSgZgYUA0y9HxVLoFWPlal11uG9js
 1rSyJpQ7LtqC/g==
 =cFJo
 -----END PGP SIGNATURE-----

Merge tag 'stop-machine.2025.01.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull stop_machine update from Paul McKenney:
 "Move a misplaced call to rcu_momentary_eqs() from multi_cpu_stop() to
  ensure that interrupts are disabled as required"

* tag 'stop-machine.2025.01.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  stop_machine: Fix rcu_momentary_eqs() call in multi_cpu_stop()
2025-01-28 11:35:58 -08:00
Linus Torvalds
b2b3379f4c CSD-lock pull request for v6.14
Allow runtime modification of the csd_lock_timeout and panic_on_ipistall
 module parameters.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmeY+dsTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jH+TD/417AAMonxfZhg1Rv1U61HQlg6hm3n3
 OMrwh1AL584Tv+g5UWxqwuBTmxdYNIRLtkuzuRKov7PwxPVgz7T7APXtGYE2D+k+
 Kg8mVOXXvb6UUKenA0ICwy0vKgFBbfaClO8EY/5nMrEvvLz43yWe3cRRhfcvjNXq
 sgSiG42caElvJrOBzpJ7wmPIlU5VScoSXx7gdOrLXNOC8Tc63O3Z1Z5/vlgZV48k
 NhBBGDwIlkLyaFig876XcE/X5c7yx2b6U9iyvcA4k+DwWw/Ku/pq/srbVLFKvCKj
 NU8ax6vGzBoaG2xD/NjF/+NFF86C6eSJW5x9eGrysbjKbwNA30TDYKTBC9P4N0Kv
 WLJto6j/j9JMeftKKOWSRYA9FjrscxEOMzOJ71OrOXNhLTXNkSwSBqFnGdLsfar5
 6B+TxRTaOqPUUo3NyNG6uNtv8TZ00p9kTEo+1SPDkBy5pROfCbIuoJjuUm+iqd6e
 Udx37tcWS6DQFsxcFRr+JiJSBE8NbfIDaSlPonjTDf39sgWfIj1u+mdJ7eb+NNE7
 YHJLhf6iRaSPRSsiaS7IN3tRlU5/aQVqwymmyOFX6fh+nyzTSk19A0jhVW7RAznE
 ta08he9RDUvVjW03wzZANpoRSj4z07Amrz8Rbvfc1j7faVhEBPgFNjfwH/Ej2A9b
 5ZoUL1B53EyDYQ==
 =+f0b
 -----END PGP SIGNATURE-----

Merge tag 'csd-lock.2025.01.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull CSD-lock update from Paul McKenney:
 "Allow runtime modification of the csd_lock_timeout and
  panic_on_ipistall module parameters"

* tag 'csd-lock.2025.01.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  locking/csd-lock: make CSD lock debug tunables writable in /sys
2025-01-28 11:34:03 -08:00
Joel Granados
1751f872cc treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-28 13:48:37 +01:00
Andrea Righi
1626e5ef0b sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (i.e., running a
`stress-ng --race-sched 0`):

[   13.413579] =====================================
[   13.413660] WARNING: bad unlock balance detected!
[   13.413729] 6.13.0-virtme #15 Not tainted
[   13.413792] -------------------------------------
[   13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[   13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[   13.414111] but there are no more locks to release!
[   13.414176]
[   13.414176] other info that might help us debug this:
[   13.414258] 1 lock held by kworker/1:1/80:
[   13.414318]  #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[   13.414612]
[   13.414612] stack backtrace:
[   13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[   13.415505] Workqueue:  0x0 (events)
[   13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[   13.415570] Call Trace:
[   13.415700]  <TASK>
[   13.415744]  dump_stack_lvl+0x78/0xe0
[   13.415806]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.415884]  print_unlock_imbalance_bug+0x11b/0x130
[   13.415965]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.416226]  lock_release+0x231/0x2c0
[   13.416326]  _raw_spin_unlock+0x1b/0x40
[   13.416422]  dispatch_to_local_dsq+0x108/0x1a0
[   13.416554]  flush_dispatch_buf+0x199/0x1d0
[   13.416652]  balance_one+0x194/0x370
[   13.416751]  balance_scx+0x61/0x1e0
[   13.416848]  prev_balance+0x43/0xb0
[   13.416947]  __pick_next_task+0x6b/0x1b0
[   13.417052]  __schedule+0x20d/0x1740

This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue() and, when the latter wins, we incorrectly assume that
the task has been moved to dst_rq.

Fix by properly tracking the currently locked rq.

Fixes: 4d3ca89bdd ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 12:41:12 -10:00
Tejun Heo
d6f3e7d564 sched_ext: Fix incorrect autogroup migration detection
scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for racing migration to the root cgroup and an
autogroup migration.

This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:

  WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
  ...
  Call Trace:
  <TASK>
    cgroup_migrate_execute+0x5b1/0x700
    cgroup_attach_task+0x296/0x400
    __cgroup_procs_write+0x128/0x140
    cgroup_procs_write+0x17/0x30
    kernfs_fop_write_iter+0x141/0x1f0
    vfs_write+0x31d/0x4a0
    __x64_sys_write+0x72/0xf0
    do_syscall_64+0x82/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().

Link: https://github.com/sched-ext/scx/issues/370
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 08:31:50 -10:00
Waiman Long
1f566840a8 clocksource: Use pr_info() for "Checking clocksource synchronization" message
The "Checking clocksource synchronization" message is normally printed
when clocksource_verify_percpu() is called for a given clocksource if
both the CLOCK_SOURCE_UNSTABLE and CLOCK_SOURCE_VERIFY_PERCPU flags
are set.

It is an informational message and so pr_info() is the correct choice.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250125015442.3740588-1-longman@redhat.com
2025-01-27 10:30:59 +01:00
Linus Torvalds
9c5968db9e The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
 
 - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
   page allocator so we end up with the ability to allocate and free
   zero-refcount pages.  So that callers (ie, slab) can avoid a refcount
   inc & dec.
 
 - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
   large folios other than PMD-sized ones.
 
 - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
   fixes for this small built-in kernel selftest.
 
 - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
   the mapletree code.
 
 - "mm: fix format issues and param types" from Keren Sun implements a
   few minor code cleanups.
 
 - "simplify split calculation" from Wei Yang provides a few fixes and a
   test for the mapletree code.
 
 - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
   continues the work of moving vma-related code into the (relatively) new
   mm/vma.c.
 
 - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
   Hildenbrand cleans up and rationalizes handling of gfp flags in the page
   allocator.
 
 - "readahead: Reintroduce fix for improper RA window sizing" from Jan
   Kara is a second attempt at fixing a readahead window sizing issue.  It
   should reduce the amount of unnecessary reading.
 
 - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
   addresses an issue where "huge" amounts of pte pagetables are
   accumulated
   (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
   Qi's series addresses this windup by synchronously freeing PTE memory
   within the context of madvise(MADV_DONTNEED).
 
 - "selftest/mm: Remove warnings found by adding compiler flags" from
   Muhammad Usama Anjum fixes some build warnings in the selftests code
   when optional compiler warnings are enabled.
 
 - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
   Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
 
 - "pkeys kselftests improvements" from Kevin Brodsky implements various
   fixes and cleanups in the MM selftests code, mainly pertaining to the
   pkeys tests.
 
 - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
   estimate application working set size.
 
 - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
   provides some cleanups to memcg's hugetlb charging logic.
 
 - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
   removes the global swap cgroup lock.  A speedup of 10% for a tmpfs-based
   kernel build was demonstrated.
 
 - "zram: split page type read/write handling" from Sergey Senozhatsky
   has several fixes and cleaups for zram in the area of zram_write_page().
   A watchdog softlockup warning was eliminated.
 
 - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
   cleans up the pagetable destructor implementations.  A rare
   use-after-free race is fixed.
 
 - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
   simplifies and cleans up the debugging code in the VMA merging logic.
 
 - "Account page tables at all levels" from Kevin Brodsky cleans up and
   regularizes the pagetable ctor/dtor handling.  This results in
   improvements in accounting accuracy.
 
 - "mm/damon: replace most damon_callback usages in sysfs with new core
   functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
   file interface logic.
 
 - "mm/damon: enable page level properties based monitoring" from
   SeongJae Park increases the amount of information which is presented in
   response to DAMOS actions.
 
 - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
   DAMON's long-deprecated debugfs interfaces.  Thus the migration to sysfs
   is completed.
 
 - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
   Xu cleans up and generalizes the hugetlb reservation accounting.
 
 - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
   removes a never-used feature of the alloc_pages_bulk() interface.
 
 - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
   extends DAMOS filters to support not only exclusion (rejecting), but
   also inclusion (allowing) behavior.
 
 - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
   "introduces a new memory descriptor for zswap.zpool that currently
   overlaps with struct page for now.  This is part of the effort to reduce
   the size of struct page and to enable dynamic allocation of memory
   descriptors."
 
 - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
   simplifies the swap allocator locking.  A speedup of 400% was
   demonstrated for one workload.  As was a 35% reduction for kernel build
   time with swap-on-zram.
 
 - "mm: update mips to use do_mmap(), make mmap_region() internal" from
   Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
   mmap_region() can be made MM-internal.
 
 - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
   regressions and otherwise improves MGLRU performance.
 
 - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
   updates DAMON documentation.
 
 - "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
 
 - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
   provides various cleanups in the areas of hugetlb folios, THP folios and
   migration.
 
 - "Uncached buffered IO" from Jens Axboe implements the new
   RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
   reading and writing.  To permite userspace to address issues with
   massive buildup of useless pagecache when reading/writing fast devices.
 
 - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
   Weißschuh fixes and optimizes some of the MM selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
 jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
 uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
 =Ib2s
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "The various patchsets are summarized below. Plus of course many
  indivudual patches which are described in their changelogs.

   - "Allocate and free frozen pages" from Matthew Wilcox reorganizes
     the page allocator so we end up with the ability to allocate and
     free zero-refcount pages. So that callers (ie, slab) can avoid a
     refcount inc & dec

   - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
     use large folios other than PMD-sized ones

   - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
     and fixes for this small built-in kernel selftest

   - "mas_anode_descend() related cleanup" from Wei Yang tidies up part
     of the mapletree code

   - "mm: fix format issues and param types" from Keren Sun implements a
     few minor code cleanups

   - "simplify split calculation" from Wei Yang provides a few fixes and
     a test for the mapletree code

   - "mm/vma: make more mmap logic userland testable" from Lorenzo
     Stoakes continues the work of moving vma-related code into the
     (relatively) new mm/vma.c

   - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
     Hildenbrand cleans up and rationalizes handling of gfp flags in the
     page allocator

   - "readahead: Reintroduce fix for improper RA window sizing" from Jan
     Kara is a second attempt at fixing a readahead window sizing issue.
     It should reduce the amount of unnecessary reading

   - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
     addresses an issue where "huge" amounts of pte pagetables are
     accumulated:

       https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

     Qi's series addresses this windup by synchronously freeing PTE
     memory within the context of madvise(MADV_DONTNEED)

   - "selftest/mm: Remove warnings found by adding compiler flags" from
     Muhammad Usama Anjum fixes some build warnings in the selftests
     code when optional compiler warnings are enabled

   - "mm: don't use __GFP_HARDWALL when migrating remote pages" from
     David Hildenbrand tightens the allocator's observance of
     __GFP_HARDWALL

   - "pkeys kselftests improvements" from Kevin Brodsky implements
     various fixes and cleanups in the MM selftests code, mainly
     pertaining to the pkeys tests

   - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
     estimate application working set size

   - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
     provides some cleanups to memcg's hugetlb charging logic

   - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
     removes the global swap cgroup lock. A speedup of 10% for a
     tmpfs-based kernel build was demonstrated

   - "zram: split page type read/write handling" from Sergey Senozhatsky
     has several fixes and cleaups for zram in the area of
     zram_write_page(). A watchdog softlockup warning was eliminated

   - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
     Brodsky cleans up the pagetable destructor implementations. A rare
     use-after-free race is fixed

   - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
     simplifies and cleans up the debugging code in the VMA merging
     logic

   - "Account page tables at all levels" from Kevin Brodsky cleans up
     and regularizes the pagetable ctor/dtor handling. This results in
     improvements in accounting accuracy

   - "mm/damon: replace most damon_callback usages in sysfs with new
     core functions" from SeongJae Park cleans up and generalizes
     DAMON's sysfs file interface logic

   - "mm/damon: enable page level properties based monitoring" from
     SeongJae Park increases the amount of information which is
     presented in response to DAMOS actions

   - "mm/damon: remove DAMON debugfs interface" from SeongJae Park
     removes DAMON's long-deprecated debugfs interfaces. Thus the
     migration to sysfs is completed

   - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
     Peter Xu cleans up and generalizes the hugetlb reservation
     accounting

   - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
     removes a never-used feature of the alloc_pages_bulk() interface

   - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
     extends DAMOS filters to support not only exclusion (rejecting),
     but also inclusion (allowing) behavior

   - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
     introduces a new memory descriptor for zswap.zpool that currently
     overlaps with struct page for now. This is part of the effort to
     reduce the size of struct page and to enable dynamic allocation of
     memory descriptors

   - "mm, swap: rework of swap allocator locks" from Kairui Song redoes
     and simplifies the swap allocator locking. A speedup of 400% was
     demonstrated for one workload. As was a 35% reduction for kernel
     build time with swap-on-zram

   - "mm: update mips to use do_mmap(), make mmap_region() internal"
     from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
     mmap_region() can be made MM-internal

   - "mm/mglru: performance optimizations" from Yu Zhao fixes a few
     MGLRU regressions and otherwise improves MGLRU performance

   - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
     Park updates DAMON documentation

   - "Cleanup for memfd_create()" from Isaac Manjarres does that thing

   - "mm: hugetlb+THP folio and migration cleanups" from David
     Hildenbrand provides various cleanups in the areas of hugetlb
     folios, THP folios and migration

   - "Uncached buffered IO" from Jens Axboe implements the new
     RWF_DONTCACHE flag which provides synchronous dropbehind for
     pagecache reading and writing. To permite userspace to address
     issues with massive buildup of useless pagecache when
     reading/writing fast devices

   - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
     Weißschuh fixes and optimizes some of the MM selftests"

* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm/compaction: fix UBSAN shift-out-of-bounds warning
  s390/mm: add missing ctor/dtor on page table upgrade
  kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
  tools: add VM_WARN_ON_VMG definition
  mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
  seqlock: add missing parameter documentation for raw_seqcount_try_begin()
  mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
  mm/page_alloc: remove the incorrect and misleading comment
  zram: remove zcomp_stream_put() from write_incompressible_page()
  mm: separate move/undo parts from migrate_pages_batch()
  mm/kfence: use str_write_read() helper in get_access_type()
  selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
  kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
  selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
  selftests/mm: vm_util: split up /proc/self/smaps parsing
  selftests/mm: virtual_address_range: unmap chunks after validation
  selftests/mm: virtual_address_range: mmap() without PROT_WRITE
  selftests/memfd/memfd_test: fix possible NULL pointer dereference
  mm: add FGP_DONTCACHE folio creation flag
  mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
  ...
2025-01-26 18:36:23 -08:00
Linus Torvalds
c159dfbdd4 Mainly individually changelogged singleton patches. The patch series in
this pull are:
 
 - "lib min_heap: Improve min_heap safety, testing, and documentation"
   from Kuan-Wei Chiu provides various tightenings to the min_heap library
   code.
 
 - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some
   cleanup and Rust preparation in the xarray library code.
 
 - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes
   pathnames in some code comments.
 
 - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the
   new secs_to_jiffies() in various places where that is appropriate.
 
 - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
   switches two filesystems to the new mount API.
 
 - "Convert ocfs2 to use folios" from Matthew Wilcox does that.
 
 - "Remove get_task_comm() and print task comm directly" from Yafang Shao
   removes now-unneeded calls to get_task_comm() in various places.
 
 - "squashfs: reduce memory usage and update docs" from Phillip Lougher
   implements some memory savings in squashfs and performs some
   maintainability work.
 
 - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
   tightens the sort code's behaviour and adds some maintenance work.
 
 - "nilfs2: protect busy buffer heads from being force-cleared" from
   Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a
   corrupted image.
 
 - "nilfs2: fix kernel-doc comments for function return values" from
   Ryusuke Konishi fixes some nilfs kerneldoc.
 
 - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
   addresses some nilfs BUG_ONs which syzbot was able to trigger.
 
 - "minmax.h: Cleanups and minor optimisations" from David Laight
   does some maintenance work on the min/max library code.
 
 - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work
   on the xarray library code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA
 jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX
 mJnaZ1/ibpEpearrChCViApQtcyEGQI=
 =ti4E
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Mainly individually changelogged singleton patches. The patch series
  in this pull are:

   - "lib min_heap: Improve min_heap safety, testing, and documentation"
     from Kuan-Wei Chiu provides various tightenings to the min_heap
     library code

   - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
     some cleanup and Rust preparation in the xarray library code

   - "Update reference to include/asm-<arch>" from Geert Uytterhoeven
     fixes pathnames in some code comments

   - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
     the new secs_to_jiffies() in various places where that is
     appropriate

   - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
     switches two filesystems to the new mount API

   - "Convert ocfs2 to use folios" from Matthew Wilcox does that

   - "Remove get_task_comm() and print task comm directly" from Yafang
     Shao removes now-unneeded calls to get_task_comm() in various
     places

   - "squashfs: reduce memory usage and update docs" from Phillip
     Lougher implements some memory savings in squashfs and performs
     some maintainability work

   - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
     tightens the sort code's behaviour and adds some maintenance work

   - "nilfs2: protect busy buffer heads from being force-cleared" from
     Ryusuke Konishi fixes an issues in nlifs when the fs is presented
     with a corrupted image

   - "nilfs2: fix kernel-doc comments for function return values" from
     Ryusuke Konishi fixes some nilfs kerneldoc

   - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
     addresses some nilfs BUG_ONs which syzbot was able to trigger

   - "minmax.h: Cleanups and minor optimisations" from David Laight does
     some maintenance work on the min/max library code

   - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
     work on the xarray library code"

* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
  ocfs2: use str_yes_no() and str_no_yes() helper functions
  include/linux/lz4.h: add some missing macros
  Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
  Xarray: remove repeat check in xas_squash_marks()
  Xarray: distinguish large entries correctly in xas_split_alloc()
  Xarray: move forward index correctly in xas_pause()
  Xarray: do not return sibling entries from xas_find_marked()
  ipc/util.c: complete the kernel-doc function descriptions
  gcov: clang: use correct function param names
  latencytop: use correct kernel-doc format for func params
  minmax.h: remove some #defines that are only expanded once
  minmax.h: simplify the variants of clamp()
  minmax.h: move all the clamp() definitions after the min/max() ones
  minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
  minmax.h: reduce the #define expansion of min(), max() and clamp()
  minmax.h: update some comments
  minmax.h: add whitespace around operators and after commas
  nilfs2: do not update mtime of renamed directory that is not moved
  nilfs2: handle errors that nilfs_prepare_chunk() may return
  CREDITS: fix spelling mistake
  ...
2025-01-26 17:50:53 -08:00
Linus Torvalds
41bfad507c Modules changes for 6.14-rc1
Several fixes and small improvements are present:
 
 - Sign modules with sha512 instead of sha1 by default
 
 - Don't fail module loading when setting ro_after_init section RO failed
 
 - Constify 'struct module_attribute'
 
 - Cleanups and preparation for const struct bin_attribute
 
 - Put known GPL offenders in an array
 
 - Extend the preempt disabled section in dereference_symbol_descriptor()
 
 This has been all on linux-next for at least 2 weeks with no issues.
 
 A small merge conflict between the changes here and a pull from the
 driver-core tree might appear in kernel/module/sysfs.c, function
 add_notes_attrs(). The code has been cleaned up here and the driver-core
 additionally changes nattr->read to nattr->read_new.
 
 Related to the modules, an important new tool gendwarfksyms to calculate
 symbols CRCs from DWARF data and thereby enable the modversion support for
 Rust should come through the kbuild tree.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEEIduBR9MnFA82q/jtumpXJwqY6poFAmeWOxwUHHBldHIucGF2
 bHVAc3VzZS5jb20ACgkQumpXJwqY6ppY0gf/aPhM6fLVArOIcyW1BvO8P4rl/nNp
 cBvN/WnNB3M1hCASAyCT56u8dDN8pxZYdMvaMKd2rCDGVIdcnXdhBEXmYLbZ4ih2
 w56nEJHbtqH2MZcBtrfZgJ4V70poNuKNl0JpTUzya+uGtl4noTfxybjcJXDvlNoO
 2Qj6SgblZ35GbEWBW0TrfjwiiHVTADaj2s0V9Pnsmpk472qc4fEMc96U3AGvsGsp
 b3KJZND2SW1bdTnTyO1SOJBcMeYCKlXb7f3/X7tGqqn7jhjfx1nrfeVG+NLYXaZo
 XxNi4vzmJCpu3wkOryvxj8WQkkHIqVkldtpM/fmu+GYVRr282LRdOdrY5Q==
 =GgGQ
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull modules updates from Petr Pavlu:

 - Sign modules with sha512 instead of sha1 by default

 - Don't fail module loading when failing to set the
   ro_after_init section read-only

 - Constify 'struct module_attribute'

 - Cleanups and preparation for const struct bin_attribute

 - Put known GPL offenders in an array

 - Extend the preempt disabled section in
   dereference_symbol_descriptor()

* tag 'modules-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  module: sign with sha512 instead of sha1 by default
  module: Don't fail module loading when setting ro_after_init section RO failed
  module: Split module_enable_rodata_ro()
  module: sysfs: Use const 'struct bin_attribute'
  module: sysfs: Add notes attributes through attribute_group
  module: sysfs: Simplify section attribute allocation
  module: sysfs: Drop 'struct module_sect_attr'
  module: sysfs: Drop member 'module_sect_attr::address'
  module: sysfs: Drop member 'module_sect_attrs::nsections'
  module: Constify 'struct module_attribute'
  module: Handle 'struct module_version_attribute' as const
  params: Prepare for 'const struct module_attribute *'
  module: Put known GPL offenders in an array
  module: Extend the preempt disabled section in dereference_symbol_descriptor().
2025-01-26 14:30:43 -08:00
Linus Torvalds
40648d246f rv: tools/rtla: Updates for 6.14
- Add a test suite to test the tool
 
   Add a small test suite that can be used to test rtla's basic features to
   at least have something to test when applying changes.
 
 - Automate manual steps in monitor creation
 
   While creating a new monitor in RV, besides generating code from dot2k,
   there are a few manual steps which can be tedious and error prone, like
   adding the tracepoints, makefile lines and kconfig, or selecting events
   that start the monitor in the initial state.
 
   Updates were made to try and automate as much as possible among those steps to
   make creating a new RV monitor much quicker. It is still requires to
   select proper tracepoints, this step is harder to automate in a general
   way and, in several cases, would still need user intervention.
 
 - Have rtla timerlat hist and top set OSNOISE_WORKLOAD flag
 
   Have both rtla-timerlat-hist and rtla-timerlat-top set OSNOISE_WORKLOAD to
   the proper value ("on" when running with -k, "off" when running with -u)
   every time the option is available instead of setting it only when running
   with -u.
 
   This prevents rtla timerlat -k from giving no results when
   NO_OSNOISE_WORKLOAD is set, either manually or by an abnormally exited earlier
   run of rtla timerlat -u.
 
 -  Stop rtla timerlat on signal properly when overloaded
 
   There is an issue where if rtla is run on machines with a high number of
   CPUs (100+), timerlat can generate more samples than rtla is able to process
   via tracefs_iterate_raw_events. This is especially common when the interval
   is set to 100us (rteval and cyclictest default) as opposed to the rtla
   default of 1000us, but also happens with the rtla default.
 
   Currently, this leads to rtla hanging and having to be terminated with
   SIGTERM. SIGINT setting stop_tracing is not enough, since more and more
   events are coming and tracefs_iterate_raw_events never exits.
 
   To fix this: Stop the timerlat tracer on SIGINT/SIGALRM to ensure no more
   events are generated when rtla is supposed to exit.
 
   Also on receiving SIGINT/SIGALRM twice, abort iteration immediately with
   tracefs_iterate_stop, making rtla exit right away instead of waiting for all
   events to be processed.
 
 - Account for missed events
 
   Due to tracefs buffer overflow, it can happen that rtla misses events,
   making the tracing results inaccurate.
 
   Count both the number of missed events and the total number of processed
   events, and display missed events as well as their percentage. The numbers
   are displayed for both osnoise and timerlat, even though for the earlier,
   missed events are generally not expected.
 
   For hist, the number is displayed at the end of the run; for top, it is
   displayed on each printing of the top table.
 
 - Changes to make osnoise more robust
 
   There was a dependency in the code that the first field of the
   osnoise_tool structure was the trace field. If that that ever changed,
   then the code work break. Change the code to encapsulate this dependency
   where the code that uses the structure does not have this dependency.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5UQ4BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qktFAQD2px6MyoOVTssB5Iw3aTWGUfTFoDEc
 bfng5JsBxlVJkQEA+2UUvP8FJlLTOQvVEwJiscX7CCJxl5bYkV6GWuGRxQU=
 =h//9
 -----END PGP SIGNATURE-----

Merge tag 'trace-tools-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull rv and tools/rtla updates from Steven Rostedt:

 - Add a test suite to test the tool

   Add a small test suite that can be used to test rtla's basic features
   to at least have something to test when applying changes.

 - Automate manual steps in monitor creation

   While creating a new monitor in RV, besides generating code from
   dot2k, there are a few manual steps which can be tedious and error
   prone, like adding the tracepoints, makefile lines and kconfig, or
   selecting events that start the monitor in the initial state.

   Updates were made to try and automate as much as possible among those
   steps to make creating a new RV monitor much quicker. It is still
   requires to select proper tracepoints, this step is harder to
   automate in a general way and, in several cases, would still need
   user intervention.

 - Have rtla timerlat hist and top set OSNOISE_WORKLOAD flag

   Have both rtla-timerlat-hist and rtla-timerlat-top set
   OSNOISE_WORKLOAD to the proper value ("on" when running with -k,
   "off" when running with -u) every time the option is available
   instead of setting it only when running with -u.

   This prevents rtla timerlat -k from giving no results when
   NO_OSNOISE_WORKLOAD is set, either manually or by an abnormally
   exited earlier run of rtla timerlat -u.

 - Stop rtla timerlat on signal properly when overloaded

   There is an issue where if rtla is run on machines with a high number
   of CPUs (100+), timerlat can generate more samples than rtla is able
   to process via tracefs_iterate_raw_events. This is especially common
   when the interval is set to 100us (rteval and cyclictest default) as
   opposed to the rtla default of 1000us, but also happens with the rtla
   default.

   Currently, this leads to rtla hanging and having to be terminated
   with SIGTERM. SIGINT setting stop_tracing is not enough, since more
   and more events are coming and tracefs_iterate_raw_events never
   exits.

   To fix this: Stop the timerlat tracer on SIGINT/SIGALRM to ensure no
   more events are generated when rtla is supposed to exit.

   Also on receiving SIGINT/SIGALRM twice, abort iteration immediately
   with tracefs_iterate_stop, making rtla exit right away instead of
   waiting for all events to be processed.

 - Account for missed events

   Due to tracefs buffer overflow, it can happen that rtla misses
   events, making the tracing results inaccurate.

   Count both the number of missed events and the total number of
   processed events, and display missed events as well as their
   percentage. The numbers are displayed for both osnoise and timerlat,
   even though for the earlier, missed events are generally not
   expected.

   For hist, the number is displayed at the end of the run; for top, it
   is displayed on each printing of the top table.

 - Changes to make osnoise more robust

   There was a dependency in the code that the first field of the
   osnoise_tool structure was the trace field. If that that ever
   changed, then the code work break. Change the code to encapsulate
   this dependency where the code that uses the structure does not have
   this dependency.

* tag 'trace-tools-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (22 commits)
  rtla: Report missed event count
  rtla: Add function to report missed events
  rtla: Count all processed events
  rtla: Count missed trace events
  tools/rtla: Add osnoise_trace_is_off()
  rtla/timerlat_top: Set OSNOISE_WORKLOAD for kernel threads
  rtla/timerlat_hist: Set OSNOISE_WORKLOAD for kernel threads
  rtla/osnoise: Distinguish missing workload option
  rtla/timerlat_top: Abort event processing on second signal
  rtla/timerlat_hist: Abort event processing on second signal
  rtla/timerlat_top: Stop timerlat tracer on signal
  rtla/timerlat_hist: Stop timerlat tracer on signal
  rtla: Add trace_instance_stop
  tools/rtla: Add basic test suite
  verification/dot2k: Implement event type detection
  verification/dot2k: Auto patch current kernel source
  verification/dot2k: Simplify manual steps in monitor creation
  rv: Simplify manual steps in monitor creation
  verification/dot2k: Add support for name and description options
  verification/dot2k: More robust template variables
  ...
2025-01-26 14:25:58 -08:00
Linus Torvalds
90ab2117f4 Runtime verifier and osnoise fixes for 6.14:
- Reset idle tasks on reset for runtime verifier
 
   When the runtime verifier is reset, it resets the task's data that is being
   monitored. But it only iterates for_each_process() which does not include
   the idle tasks. As the idle tasks can be monitored, they need to be reset
   as well.
 
 - Fix the enabling and disabling of tracepoints in osnoise
 
   If timerlat is enabled and the WORKLOAD flag is not set, then the osnoise
   tracer will enable the migrate task tracepoint to monitor it for its own
   workload. The test to enable the tracepoint is done against user space
   modifiable parameters. On disabling of the tracer, those same parameters
   are used to determine if the tracepoint should be disabled. The problem is
   if user space were to modify the parameters after it enables the tracer
   then it may not disable the tracepoint.
 
   Instead, a static variable is used to keep track if the tracepoint was
   enabled or not. Then when the tracer shuts down, it will use this variable
   to decide to disable the tracepoint or not, instead of looking at the user
   space parameters.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5UJ1BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qo46AQDjr1yVTyUmzvxe1bMLDoDOq1xeRMRe
 4f8CdpOJRxbi0QEAwnl5Ey9Rvcuy8GFpt/USESr4VYWAN10fvsPx7pkphAs=
 =RYyn
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verifier and osnoise fixes from Steven Rostedt:

 - Reset idle tasks on reset for runtime verifier

   When the runtime verifier is reset, it resets the task's data that is
   being monitored. But it only iterates for_each_process() which does
   not include the idle tasks. As the idle tasks can be monitored, they
   need to be reset as well.

 - Fix the enabling and disabling of tracepoints in osnoise

   If timerlat is enabled and the WORKLOAD flag is not set, then the
   osnoise tracer will enable the migrate task tracepoint to monitor it
   for its own workload. The test to enable the tracepoint is done
   against user space modifiable parameters. On disabling of the tracer,
   those same parameters are used to determine if the tracepoint should
   be disabled. The problem is if user space were to modify the
   parameters after it enables the tracer then it may not disable the
   tracepoint.

   Instead, a static variable is used to keep track if the tracepoint
   was enabled or not. Then when the tracer shuts down, it will use this
   variable to decide to disable the tracepoint or not, instead of
   looking at the user space parameters.

* tag 'trace-rv-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/osnoise: Fix resetting of tracepoints
  rv: Reset per-task monitors also for idle tasks
2025-01-26 14:19:45 -08:00
Linus Torvalds
5fb4088624 bitmap patches for v6.14.
Hi Linus,
 
 Please pull bitmap patches for v6.14.
 
 This includes const_true() series from Vincent Mailhol, another
 __always_inline rework from Nathan Chancellor for RISCV, and a
 couple random fixes from Dr. David Alan Gilbert and I Hsin Cheng.
 
 Thanks,
 Yury
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmeUMpEACgkQsUSA/Tof
 vsiqEAv/VvtTD6I3Ms3kIl2G1pBP/EFcYQGbwS1PCsX3RX16rinZ2XUDRtjvRy1Y
 FA+OsZ2yPtH8G1WRM8YauZsh2cZSCA4xTFadLSZkT8leSWERKaTyJI+PXe2A43IU
 d+FV4zYH5JYqV2u9aLHWMO8Voq9nNHZXOYHRu0q53TBFn7V294Lma9oDlK3Wjfur
 vSZZU9SKKlMV8Oy6/hZ3tDemUDM1jAGlqrxFb8aXRsTsCpsmlqE1bQdZ+AadjevZ
 cVeplB8OCCnqcYV28szIwsJpSzmd5/WBP6jLpeMgBYFGS0JT2USdZ8gsw+Yq/On5
 hjxek3cHBKdv0CINk0Ejf4aV0IvoX5S/VRlTjhttzyX68no1DoibDuWJWB42PRWS
 frllVOmdkm2DqA0G9mgxtwzBl5UqMFVe5LuVU9E9BZZeDmRZmS3obrUiMzpiqUOs
 zkdDvA0uaKgjx2qZADDEFqg1+XdX0A0iPebEv9vLaULXv0+D/PbkClNqIf8p7778
 2GWuBLJe
 =1hW6
 -----END PGP SIGNATURE-----

Merge tag 'bitmap-for-6.14' of https://github.com:/norov/linux

Pull bitmap updates from Yury Norov:
 "This includes const_true() series from Vincent Mailhol, another
  __always_inline rework from Nathan Chancellor for RISCV, and a couple
  of random fixes from Dr. David Alan Gilbert and I Hsin Cheng"

* tag 'bitmap-for-6.14' of https://github.com:/norov/linux:
  cpumask: Rephrase comments for cpumask_any*() APIs
  cpu: Remove unused init_cpu_online
  riscv: Always inline bitops
  linux/bits.h: simplify GENMASK_INPUT_CHECK()
  compiler.h: add const_true()
2025-01-26 14:03:44 -08:00
Thorsten Leemhuis
f3b93547b9 module: sign with sha512 instead of sha1 by default
Switch away from using sha1 for module signing by default and use the
more modern sha512 instead, which is what among others Arch, Fedora,
RHEL, and Ubuntu are currently using for their kernels.

Sha1 has not been considered secure against well-funded opponents since
2005[1]; since 2011 the NIST and other organizations furthermore
recommended its replacement[2]. This is why OpenSSL on RHEL9, Fedora
Linux 41+[3], and likely some other current and future distributions
reject the creation of sha1 signatures, which leads to a build error of
allmodconfig configurations:

  80A20474797F0000:error:03000098:digital envelope routines:do_sigver_init:invalid digest:crypto/evp/m_sigver.c:342:
  make[4]: *** [.../certs/Makefile:53: certs/signing_key.pem] Error 1
  make[4]: *** Deleting file 'certs/signing_key.pem'
  make[4]: *** Waiting for unfinished jobs....
  make[3]: *** [.../scripts/Makefile.build:478: certs] Error 2
  make[2]: *** [.../Makefile:1936: .] Error 2
  make[1]: *** [.../Makefile:224: __sub-make] Error 2
  make[1]: Leaving directory '...'
  make: *** [Makefile:224: __sub-make] Error 2

This change makes allmodconfig work again and sets a default that is
more appropriate for current and future users, too.

Link: https://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html [1]
Link: https://csrc.nist.gov/projects/hash-functions [2]
Link: https://fedoraproject.org/wiki/Changes/OpenSSLDistrustsha1SigVer [3]
Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: kdevops <kdevops@lists.linux.dev> [0]
Link: https://github.com/linux-kdevops/linux-modules-kpd/actions/runs/11420092929/job/31775404330 [0]
Link: https://lore.kernel.org/r/52ee32c0c92afc4d3263cea1f8a1cdc809728aff.1729088288.git.linux@leemhuis.info
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:24 +01:00
Christophe Leroy
110b1e070f module: Don't fail module loading when setting ro_after_init section RO failed
Once module init has succeded it is too late to cancel loading.
If setting ro_after_init data section to read-only fails, all we
can do is to inform the user through a warning.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://lore.kernel.org/all/20230915082126.4187913-1-ruanjinjie@huawei.com/
Fixes: d1909c0221 ("module: Don't ignore errors from set_memory_XX()")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/d6c81f38da76092de8aacc8c93c4c65cb0fe48b8.1733427536.git.christophe.leroy@csgroup.eu
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:24 +01:00
Christophe Leroy
097fd001e1 module: Split module_enable_rodata_ro()
module_enable_rodata_ro() is called twice, once before module init
to set rodata sections readonly and once after module init to set
rodata_after_init section readonly.

The second time, only the rodata_after_init section needs to be
set to read-only, no need to re-apply it to already set rodata.

Split module_enable_rodata_ro() in two.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/e3b6ff0df7eac281c58bb02cecaeb377215daff3.1733427536.git.christophe.leroy@csgroup.eu
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:24 +01:00
Thomas Weißschuh
b83815afae module: sysfs: Use const 'struct bin_attribute'
The sysfs core is switching to 'const struct bin_attribute's.
Prepare for that.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-6-e267275f0f37@weissschuh.net
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:24 +01:00
Thomas Weißschuh
4723f16de6 module: sysfs: Add notes attributes through attribute_group
A kobject is meant to manage the lifecycle of some resource.
However the module sysfs code only creates a kobject to get a
"notes" subdirectory in sysfs.
This can be achieved easier and cheaper by using a sysfs group.
Switch the notes attribute code to such a group, similar to how the
section allocation in the same file already works.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-5-e267275f0f37@weissschuh.net
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:24 +01:00
Thomas Weißschuh
f47c0bebed module: sysfs: Simplify section attribute allocation
The existing allocation logic manually stuffs two allocations into one.
This is hard to understand and of limited value, given that all the
section names are allocated on their own anyways.
Une one allocation per datastructure.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-4-e267275f0f37@weissschuh.net
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:24 +01:00
Thomas Weißschuh
34f5ec0f82 module: sysfs: Drop 'struct module_sect_attr'
This is now an otherwise empty wrapper around a 'struct bin_attribute',
not providing any functionality. Remove it.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-3-e267275f0f37@weissschuh.net
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:24 +01:00
Thomas Weißschuh
4b2c11e4aa module: sysfs: Drop member 'module_sect_attr::address'
'struct bin_attribute' already contains the member 'private' to pass
custom data to the attribute handlers.
Use that instead of the custom 'address' member.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-2-e267275f0f37@weissschuh.net
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:24 +01:00
Thomas Weißschuh
d8959b947a module: sysfs: Drop member 'module_sect_attrs::nsections'
The member is only used to iterate over all attributes in
free_sect_attrs(). However the attribute group can already be used for
that. Use the group and drop 'nsections'.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-1-e267275f0f37@weissschuh.net
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:24 +01:00
Thomas Weißschuh
f3227ffda0 module: Constify 'struct module_attribute'
These structs are never modified, move them to read-only memory.
This makes the API clearer and also prepares for the constification of
'struct attribute' itself.

While at it, also constify 'modinfo_attrs_count'.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Link: https://lore.kernel.org/r/20241216-sysfs-const-attr-module-v1-3-3790b53e0abf@weissschuh.net
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:23 +01:00
Thomas Weißschuh
38e3fe6595 module: Handle 'struct module_version_attribute' as const
The structure is always read-only due to its placement in the read-only
section __modver. Reflect this at its usage sites.
Also prepare for the const handling of 'struct module_attribute' itself.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Link: https://lore.kernel.org/r/20241216-sysfs-const-attr-module-v1-2-3790b53e0abf@weissschuh.net
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:23 +01:00
Thomas Weißschuh
30d4460888 params: Prepare for 'const struct module_attribute *'
The 'struct module_attribute' sysfs callbacks are about to change to
receive a 'const struct module_attribute *' parameter.
Prepare for that by avoid casting away the constness through
container_of() and using const pointers to 'struct param_attribute'.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Link: https://lore.kernel.org/r/20241216-sysfs-const-attr-module-v1-1-3790b53e0abf@weissschuh.net
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:23 +01:00
Uwe Kleine-König
c8e0bd579e module: Put known GPL offenders in an array
Instead of repeating the add_taint_module() call for each offender, create
an array and loop over that one. This simplifies adding new entries
considerably.

Signed-off-by: Uwe Kleine-König <ukleinek@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Link: https://lore.kernel.org/r/20241115185253.1299264-2-wse@tuxedocomputers.com
[ppavlu: make the array const]
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26 13:05:23 +01:00
Guo Weikang
c6f239796b mm/memblock: add memblock_alloc_or_panic interface
Before SLUB initialization, various subsystems used memblock_alloc to
allocate memory.  In most cases, when memory allocation fails, an
immediate panic is required.  To simplify this behavior and reduce
repetitive checks, introduce `memblock_alloc_or_panic`.  This function
ensures that memory allocation failures result in a panic automatically,
improving code readability and consistency across subsystems that require
this behavior.

[guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic]
  Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com
  Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/
Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>	[s390]
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:38 -08:00
Luiz Capitulino
6bf9b5b40a mm: alloc_pages_bulk: rename API
The previous commit removed the page_list argument from
alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function.

Now that only the *_array() flavour of the API remains, we can do the
following renaming (along with the _noprof() ones):

  alloc_pages_bulk_array -> alloc_pages_bulk
  alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy
  alloc_pages_bulk_array_node -> alloc_pages_bulk_node

Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:31 -08:00
Linus Torvalds
405057718a kgdb patches for 6.14
A very small set of changes this kernel cycle. It consists of two cleanups,
 one switches to kmap_local_page() (from kmap_atomic() ) and the other
 removes a bit of dead code.
 
 Signed-off-by: Daniel Thompson (RISCstar) <danielt@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmeVJ0EACgkQfOMlXTn3
 iKEtAQ/6A8eMJBhqKT+c51mFsrZiofit7i22/nw5G9yg7u6KRKudVk9Hu54NBZ9j
 p824xOS/rwKuqd1cd0LwBbUd+V1jQNT5x4FjwHbI3eA7e+ek2tA8OAVk6QYSHWmg
 pu2scuKtZYldrfZVVd7hUwizxQSGScGewVIvDTTTKKOun55r8tIcznrjFXUmw+RR
 Tg45EfZO77IP/AYfMRmztYoUExR9TQ54BrezCnYHHB9NlldjtfD4IxfFC1yUa6Ph
 fmkG27L8UzGJATqLIYyG25tyk3ZkNplFqLXjCJgR66xzAFRlaPYn9SA+qaqw2SQh
 MCR2jOTCMmG9zGtbcs/YbGpUDkOz8gtfuGQOm91jLzTH3FQ2sM11rokKxdx+Add1
 Sx56snhukRKD0aZGZgIPjqeBGdUYtdclWAbRsm4/2Q83ipRclP5jezW9N1EIWMZL
 HLVXZJSXpLl7gAEYNorDiyUTqna7IXvGOAu4HHj0rMKCj57wA90yIiml/sIl3kN8
 fRGpqK2tc14/8MqeugqctiAGfBYYvNRupdnI0KvbgMDDP4s3d/QPYgnjSYQW2sjc
 8ZFMO84x39SPtGNNx2mPVmVP3Wuf4Njy+v2wmGzcd7RbM1G6nYt2rh7cZs32/PFt
 zeo5iYjYIxPA8oSOmJh+uQtlACyWaNW7pIaTwyoutgaUDdgspIg=
 =n8Nr
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "A very small set of changes this kernel cycle.

  Two cleanups, one switches to kmap_local_page() (from kmap_atomic())
  and the other removes a bit of dead code"

* tag 'kgdb-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Remove unused flags stack
  kdb: use kmap_local_page()
2025-01-25 10:21:13 -08:00
Dr. David Alan Gilbert
6beaa75cd2 kdb: Remove unused flags stack
kdb_restore_flags() and kdb_save_flags() were added in 2010 by
commit 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
but have remained unused.

Remove them, and their associated storage.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20250112012049.319515-1-linux@treblig.org
Signed-off-by: Daniel Thompson (RISCstar) <danielt@kernel.org>
2025-01-25 08:22:26 +00:00
Zhang Heng
36975ec3a2 kdb: use kmap_local_page()
Use kmap_local_page() instead of kmap_atomic() which has been deprecated.

Signed-off-by: Zhang Heng <zhangheng@kylinos.cn>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20241223085420.1815930-1-zhangheng@kylinos.cn
Signed-off-by: Daniel Thompson (RISCstar) <danielt@kernel.org>
2025-01-25 08:22:11 +00:00
Randy Dunlap
9c9ce355b1 gcov: clang: use correct function param names
Fix the function parameter names to match the function so that
the kernel-doc warnings disappear.

clang.c:273: warning: Function parameter or struct member 'dst' not described in 'gcov_info_add'
clang.c:273: warning: Function parameter or struct member 'src' not described in 'gcov_info_add'
clang.c:273: warning: Excess function parameter 'dest' description in 'gcov_info_add'
clang.c:273: warning: Excess function parameter 'source' description in 'gcov_info_add'

Link: https://lkml.kernel.org/r/20250111062944.910638-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-24 22:47:27 -08:00
Randy Dunlap
690794430a latencytop: use correct kernel-doc format for func params
Use a ':' instead of a '-' after function parameters to eliminate
kernel-doc warnings.

kernel/latencytop.c:177: warning: Function parameter or struct member 'tsk' not described in '__account_scheduler_latency'
../kernel/latencytop.c:177: warning: Function parameter or struct member 'usecs' not described in '__account_scheduler_latency'
../kernel/latencytop.c:177: warning: Function parameter or struct member 'inter' not described in '__account_scheduler_latency'

Link: https://lkml.kernel.org/r/20250111063019.910730-1-rdunlap@infradead.org
Fixes: ad0b0fd554 ("sched, latencytop: incorporate review feedback from Andrew Morton")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-24 22:47:27 -08:00
Oxana Kharitonova
65ef17aa07 hung_task: add task->flags, blocked by coredump to log
Resending this patch as I haven't received feedback on my initial
submission https://lore.kernel.org/all/20241204182953.10854-1-oxana@cloudflare.com/

For the processes which are terminated abnormally the kernel can provide
a coredump if enabled. When the coredump is performed, the process and
all its threads are put into the D state
(TASK_UNINTERRUPTIBLE | TASK_FREEZABLE).

On the other hand, we have kernel thread khungtaskd which monitors the
processes in the D state. If the task stuck in the D state more than
kernel.hung_task_timeout_secs, the hung_task alert appears in the kernel
log.

The higher memory usage of a process, the longer it takes to create
coredump, the longer tasks are in the D state. We have hung_task alerts
for the processes with memory usage above 10Gb. Although, our
kernel.hung_task_timeout_secs is 10 sec when the default is 120 sec.

Adding additional information to the log that the task is blocked by
coredump will help with monitoring. Another approach might be to
completely filter out alerts for such tasks, but in that case we would
lose transparency about what is putting pressure on some system
resources, e.g. we saw an increase in I/O when coredump occurs due its
writing to disk.

Additionally, it would be helpful to have task_struct->flags in the log
from the function sched_show_task(). Currently it prints
task_struct->thread_info->flags, this seems misleading as the line
starts with "task:xxxx".

[akpm@linux-foundation.org: fix printk control string]
Link: https://lkml.kernel.org/r/20250110160328.64947-1-oxana@cloudflare.com
Signed-off-by: Oxana Kharitonova <oxana@cloudflare.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Segall <bsegall@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-24 22:47:24 -08:00
Tio Zhang
97549ce684 kthread: correct comments before kthread_queue_work()
s/kthread_worker_create/kthread_create_worker/ to avoid confusion when
reading comments before kthread_queue_work().

Link: https://lkml.kernel.org/r/20241224095344.GA7587@didi-ThinkCentre-M930t-N000
Signed-off-by: Tio Zhang <tiozhang@didiglobal.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-24 22:47:22 -08:00
Steven Rostedt
e3ff424592 tracing/osnoise: Fix resetting of tracepoints
If a timerlat tracer is started with the osnoise option OSNOISE_WORKLOAD
disabled, but then that option is enabled and timerlat is removed, the
tracepoints that were enabled on timerlat registration do not get
disabled. If the option is disabled again and timelat is started, then it
triggers a warning in the tracepoint code due to registering the
tracepoint again without ever disabling it.

Do not use the same user space defined options to know to disable the
tracepoints when timerlat is removed. Instead, set a global flag when it
is enabled and use that flag to know to disable the events.

 ~# echo NO_OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
 ~# echo timerlat > /sys/kernel/tracing/current_tracer
 ~# echo OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
 ~# echo nop > /sys/kernel/tracing/current_tracer
 ~# echo NO_OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
 ~# echo timerlat > /sys/kernel/tracing/current_tracer

Triggers:

 ------------[ cut here ]------------
 WARNING: CPU: 6 PID: 1337 at kernel/tracepoint.c:294 tracepoint_add_func+0x3b6/0x3f0
 Modules linked in:
 CPU: 6 UID: 0 PID: 1337 Comm: rtla Not tainted 6.13.0-rc4-test-00018-ga867c441128e-dirty #73
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:tracepoint_add_func+0x3b6/0x3f0
 Code: 48 8b 53 28 48 8b 73 20 4c 89 04 24 e8 23 59 11 00 4c 8b 04 24 e9 36 fe ff ff 0f 0b b8 ea ff ff ff 45 84 e4 0f 84 68 fe ff ff <0f> 0b e9 61 fe ff ff 48 8b 7b 18 48 85 ff 0f 84 4f ff ff ff 49 8b
 RSP: 0018:ffffb9b003a87ca0 EFLAGS: 00010202
 RAX: 00000000ffffffef RBX: ffffffff92f30860 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff9bf59e91ccd0 RDI: ffffffff913b6410
 RBP: 000000000000000a R08: 00000000000005c7 R09: 0000000000000002
 R10: ffffb9b003a87ce0 R11: 0000000000000002 R12: 0000000000000001
 R13: ffffb9b003a87ce0 R14: ffffffffffffffef R15: 0000000000000008
 FS:  00007fce81209240(0000) GS:ffff9bf6fdd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055e99b728000 CR3: 00000001277c0002 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __warn.cold+0xb7/0x14d
  ? tracepoint_add_func+0x3b6/0x3f0
  ? report_bug+0xea/0x170
  ? handle_bug+0x58/0x90
  ? exc_invalid_op+0x17/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  ? tracepoint_add_func+0x3b6/0x3f0
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  tracepoint_probe_register+0x78/0xb0
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  osnoise_workload_start+0x2b5/0x370
  timerlat_tracer_init+0x76/0x1b0
  tracing_set_tracer+0x244/0x400
  tracing_set_trace_write+0xa0/0xe0
  vfs_write+0xfc/0x570
  ? do_sys_openat2+0x9c/0xe0
  ksys_write+0x72/0xf0
  do_syscall_64+0x79/0x1c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250123204159.4450c88e@gandalf.local.home
Fixes: e88ed227f6 ("tracing/timerlat: Add user-space interface")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-24 15:20:44 -05:00
Lai Jiangshan
e769461101 workqueue: Put the pwq after detaching the rescuer from the pool
The commit 68f83057b913("workqueue: Reap workers via kthread_stop() and
remove detach_completion") adds code to reap the normal workers but
mistakenly does not handle the rescuer and also removes the code waiting
for the rescuer in put_unbound_pool(), which caused a use-after-free bug
reported by Cheung Wall.

To avoid the use-after-free bug, the pool’s reference must be held until
the detachment is complete. Therefore, move the code that puts the pwq
after detaching the rescuer from the pool.

Reported-by: cheung wall <zzqq0103.hey@gmail.com>
Cc: cheung wall <zzqq0103.hey@gmail.com>
Link: https://lore.kernel.org/lkml/CAKHoSAvP3iQW+GwmKzWjEAOoPvzeWeoMO0Gz7Pp3_4kxt-RMoA@mail.gmail.com/
Fixes: 68f83057b913("workqueue: Reap workers via kthread_stop() and remove detach_completion")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24 09:29:46 -10:00