Commit Graph

Adrian Huang
2fa6a01345 tracing: Fix memory leak when reading set_event file
kmemleak reports the following memory leak after reading set_event file:

  # cat /sys/kernel/tracing/set_event

  # cat /sys/kernel/debug/kmemleak
  unreferenced object 0xff110001234449e0 (size 16):
  comm "cat", pid 13645, jiffies 4294981880
  hex dump (first 16 bytes):
    01 00 00 00 00 00 00 00 a8 71 e7 84 ff ff ff ff  .........q......
  backtrace (crc c43abbc):
    __kmalloc_cache_noprof+0x3ca/0x4b0
    s_start+0x72/0x2d0
    seq_read_iter+0x265/0x1080
    seq_read+0x2c9/0x420
    vfs_read+0x166/0xc30
    ksys_read+0xf4/0x1d0
    do_syscall_64+0x79/0x150
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

The issue can be reproduced regardless of whether set_event is empty or
not. Here is an example of valid set_event content:

  # cat /sys/kernel/tracing/set_event
  sched:sched_process_fork
  sched:sched_switch
  sched:sched_wakeup
  *:*:mod:trace_events_sample

The root cause is that s_next() returns NULL when nothing is found, so
s_stop() is handed a NULL pointer instead of the iterator that was
allocated in s_start(). Freeing that NULL parameter does nothing, and the
allocation leaks.

Fix the issue by freeing the memory appropriately when s_next() fails
to find anything.
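
As a rough sketch of the seq_file pattern involved (names and layout are
illustrative, not the literal fix):

  static void *s_next(struct seq_file *m, void *v, loff_t *pos)
  {
      struct set_event_iter *iter = v;    /* hypothetical type */

      (*pos)++;
      if (iter_has_next(iter))            /* hypothetical helper */
          return iter;

      /* Nothing left: s_stop() will only ever see the NULL returned
       * here, so the iterator must be freed on this path. */
      kfree(iter);
      return NULL;
  }

  static void s_stop(struct seq_file *m, void *v)
  {
      /* v is NULL when s_next() exhausted the list (and freed the
       * iterator there); kfree(NULL) is a harmless no-op. */
      kfree(v);
      mutex_unlock(&event_mutex);
  }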

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250220031528.7373-1-ahuang12@lenovo.com
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Sebastian Andrzej Siewior
57b76bedc5 ftrace: Correct preemption accounting for function tracing.
The function tracer should record the preemption level at the point when
the function is invoked. If the tracing subsystem decrements the
preemption counter, it needs to correct for this before feeding the data
into the trace buffer. This was broken by the commit cited below when the
preempt-disabled section was shifted.

Use tracing_gen_ctx_dec() which properly subtracts one from the
preemption counter on a preemptible kernel.
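
For illustration, a hedged sketch of a function trace callback using it
(the callback name and body are simplified, not the exact upstream code):

  static void
  my_trace_call(unsigned long ip, unsigned long parent_ip,
                struct ftrace_ops *op, struct ftrace_regs *fregs)
  {
      unsigned int trace_ctx;

      /* Preemption was disabled by the tracing infrastructure on the
       * way in, so subtract one to record the preemption count the
       * traced function actually ran with. */
      trace_ctx = tracing_gen_ctx_dec();

      /* ... write the event into the ring buffer with trace_ctx ... */
  }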

Cc: stable@vger.kernel.org
Cc: Wander Lairson Costa <wander@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20250220140749.pfw8qoNZ@linutronix.de
Fixes: ce5e48036c ("ftrace: disable preemption when recursion locked")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Steven Rostedt
ca26554a14 fprobe: Fix accounting of when to unregister from function graph
When adding a new fprobe, it will update the function hash to the
functions the fprobe is attached to and register with function graph to
have it call the registered functions. The fprobe_graph_active variable
keeps track of the number of fprobes that are using function graph.

If two fprobes attach to the same function, fprobe_graph_active is
incremented for each of them. But when they are removed, the first fprobe
to be removed sees that the function it is attached to is also used by
another fprobe, so it does not remove that function from function_graph.
That same logic then skips decrementing the fprobe_graph_active
variable.

This causes the fprobe_graph_active variable to not go to zero when all
fprobes are removed, and in doing so it does not unregister from
function graph. As the fgraph ops hash will now be empty, and an empty
filter hash means all functions are enabled, this triggers function graph
to add a callback to the fprobe infrastructure for every function!

 # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
 # echo "f:myevent2 kernel_clone%return" >> /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc0024000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

 # > /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
trace_initcall_start_cb (1)             tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
run_init_process (1)            tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
try_to_run_init_process (1)             tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
x86_pmu_show_pmu_cap (1)                tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
cleanup_rapl_pmus (1)                   tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
uncore_free_pcibus_map (1)              tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
uncore_types_exit (1)                   tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
uncore_pci_exit.part.0 (1)              tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
kvm_shutdown (1)                tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
vmx_dump_msrs (1)               tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
[..]

 # cat /sys/kernel/tracing/enabled_functions | wc -l
54702

If a fprobe is being removed and all its functions are also traced by
other fprobes, still decrement the fprobe_graph_active counter.
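
The shape of the fix, as a hedged sketch (simplified from
kernel/trace/fprobe.c): the decrement must happen unconditionally on
removal, even when the functions themselves stay registered for another
fprobe:

  /* Always account for this fprobe going away, even if all of its
   * functions are still used by other fprobes. */
  fprobe_graph_active--;
  if (!fprobe_graph_active) {
      /* Last fprobe: detach from function graph entirely. */
      unregister_ftrace_graph(&fprobe_graph_ops);
  }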

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.565129766@goodmis.org
Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Closes: https://lore.kernel.org/all/20250217114918.10397-A-hca@linux.ibm.com/
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Steven Rostedt
ded9140622 fprobe: Always unregister fgraph function from ops
When the last fprobe is removed, it calls unregister_ftrace_graph() to
remove the graph_ops from function graph. The issue is that when it does
so, it returns before removing the function from its graph ops via
ftrace_set_filter_ips(). This leaves the last function lingering in the
fprobe's fgraph ops, and if a probe is added later it also enables that
last function (even though the callback will just drop it, it does add
unneeded overhead to make that call).

  # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

  # echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
schedule_timeout (1)           	tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

  # > /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions

  # echo "f:myevent3 kmem_cache_free" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kmem_cache_free (1)           	tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
schedule_timeout (1)           	tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

The above enabled a fprobe on kernel_clone, and then one on
schedule_timeout. The content of enabled_functions shows the functions
that have a callback attached to them; the fprobes attached to those
functions properly. Then the fprobes were cleared, and enabled_functions
was empty after that. But after adding a fprobe on kmem_cache_free,
enabled_functions shows that schedule_timeout was attached again. This is
because it was still left in the fprobe ops that is used to tell function
graph which functions it wants callbacks from.
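
A hedged sketch of the ordering the fix needs (simplified; the real code
is in kernel/trace/fprobe.c):

  if (!fprobe_graph_active)
      unregister_ftrace_graph(&fprobe_graph_ops);

  /* Even when the fgraph ops was just unregistered, the IPs must
   * still be removed from its filter hash, otherwise the last
   * function lingers and is re-enabled on the next registration. */
  if (num)
      ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, num, 1, 0);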

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.393254452@goodmis.org
Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Steven Rostedt
8eb4b09e0b ftrace: Do not add duplicate entries in subops manager ops
Check if a function is already in the manager ops of a subops. A manager
ops contains multiple subops, and if two or more subops are tracing the
same function, the manager ops only needs a single entry in its hash.

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.226762894@goodmis.org
Fixes: 4f554e9556 ("ftrace: Add ftrace_set_filter_ips function")
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Steven Rostedt
38b1406194 ftrace: Fix accounting of adding subops to a manager ops
Function graph uses a subops and manager ops mechanism to attach to
ftrace.  The manager ops connects to ftrace and the functions it connects
to is defined by a list of subops that it manages.

The function hash that defines what the above ops attaches to limits the
functions to attach if the hash has any content. If the hash is empty, it
means to trace all functions.

The creation of the manager ops hash is done by iterating over all the
subops hashes. If any of the subops hashes is empty, it means that the
manager ops hash must trace all functions as well.

The issue is in the creation of the manager ops hash. When a second
subops is attached, a new hash is created by starting it as NULL and
adding the subops one at a time. But the NULL hash is mistaken for an
empty hash, and once an empty hash is found, the loop over the subops
stops and all functions are simply enabled.

  # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

  # echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
trace_initcall_start_cb (1)             tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
run_init_process (1)            tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
try_to_run_init_process (1)             tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
x86_pmu_show_pmu_cap (1)                tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
cleanup_rapl_pmus (1)                   tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_free_pcibus_map (1)              tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_types_exit (1)                   tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_pci_exit.part.0 (1)              tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
kvm_shutdown (1)                tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
vmx_dump_msrs (1)               tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
vmx_cleanup_l1d_flush (1)               tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
[..]

Fix this by initializing the new hash to NULL, and when the hash is NULL,
do not treat it as an empty hash; instead, allocate it by copying the
content of the first subops. On subsequent iterations the new hash will
then not be NULL, but will hold the content of the previous subops. If
that first subops is attached to all functions, the new hash may then
assume that the manager ops also needs to attach to all functions.
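
A hedged sketch of the loop (names loosely follow kernel/trace/ftrace.c;
details simplified):

  struct ftrace_hash *new_hash = NULL;
  struct ftrace_ops *subops;

  list_for_each_entry(subops, &ops->subop_list, list) {
      struct ftrace_hash *h = subops->func_hash->filter_hash;

      if (!new_hash) {
          /* First subops: copy its hash instead of mistaking the
           * initial NULL for "empty == trace everything". */
          new_hash = alloc_and_copy_ftrace_hash(size_bits, h);
          if (!new_hash)
              return -ENOMEM;
          continue;
      }
      /* ... merge h into new_hash for subsequent subops ... */
  }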

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.060300046@goodmis.org
Fixes: 5fccc7552c ("ftrace: Add subops logic to allow one ops to manage many")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:35:44 -05:00
Steven Rostedt
97937834ae ring-buffer: Update pages_touched to reflect persistent buffer content
The pages_touched field represents the number of subbuffers in the ring
buffer that have content that can be read. This is used in accounting of
"dirty_pages" and "buffer_percent" to allow the user to wait for the
buffer to be filled to a certain amount before it reads the buffer in
blocking mode.

The persistent buffer never updated this value, so it stayed at zero and
the accounting treated the buffer as having no content. This would cause
user space to wait for content even though there was already enough
content in the ring buffer to satisfy the buffer_percent.
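
A hedged sketch of the idea (the counter is real, but the placement and
helpers here are illustrative):

  /* While walking the subbuffers restored from the persistent
   * (boot-mapped) buffer, count each one that already holds data so
   * that dirty_pages/buffer_percent accounting sees the content. */
  if (rb_page_commit(bpage))
      local_inc(&cpu_buffer->pages_touched);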

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214123512.0631436e@gandalf.local.home
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-15 14:00:59 -05:00
Steven Rostedt
129fe71881 tracing: Do not allow mmap() of persistent ring buffer
When trying to mmap a trace instance buffer that is attached to
reserve_mem, it would crash:

 BUG: unable to handle page fault for address: ffffe97bd00025c8
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 2862f3067 P4D 2862f3067 PUD 0
 Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI
 CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:validate_page_before_insert+0x5/0xb0
 Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89
 RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246
 RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29
 RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08
 RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004
 R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000
 R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000
 FS:  00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x1f
  ? __die+0x2e/0x40
  ? page_fault_oops+0x157/0x2b0
  ? search_module_extables+0x53/0x80
  ? validate_page_before_insert+0x5/0xb0
  ? kernelmode_fixup_or_oops.isra.0+0x5f/0x70
  ? __bad_area_nosemaphore+0x16e/0x1b0
  ? bad_area_nosemaphore+0x16/0x20
  ? do_kern_addr_fault+0x77/0x90
  ? exc_page_fault+0x22b/0x230
  ? asm_exc_page_fault+0x2b/0x30
  ? validate_page_before_insert+0x5/0xb0
  ? vm_insert_pages+0x151/0x400
  __rb_map_vma+0x21f/0x3f0
  ring_buffer_map+0x21b/0x2f0
  tracing_buffers_mmap+0x70/0xd0
  __mmap_region+0x6f0/0xbd0
  mmap_region+0x7f/0x130
  do_mmap+0x475/0x610
  vm_mmap_pgoff+0xf2/0x1d0
  ksys_mmap_pgoff+0x166/0x200
  __x64_sys_mmap+0x37/0x50
  x64_sys_call+0x1670/0x1d70
  do_syscall_64+0xbb/0x1d0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

The reason was that the code that maps the ring buffer pages to user space
has:

	page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);

And uses that in:

	vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

But virt_to_page() does not work with vmap()'d memory, which is what the
persistent ring buffer uses. It is rather trivial to allow this, but for
now just disable mmap() of instances that take their ring buffer from the
reserve_mem option.

If an mmap() is performed on a persistent buffer it will return -ENODEV
just like it would if the .mmap field wasn't defined in the
file_operations structure.
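
A hedged sketch of the guard (TRACE_ARRAY_FL_BOOT is how boot-mapped
instances are flagged; the exact placement may differ from the patch):

  static int tracing_buffers_mmap(struct file *filp,
                                  struct vm_area_struct *vma)
  {
      struct ftrace_buffer_info *info = filp->private_data;
      struct trace_iterator *iter = &info->iter;

      /* The reserve_mem buffer is vmap()'d, which __rb_map_vma()
       * cannot handle yet, so refuse to map it. */
      if (iter->tr->flags & TRACE_ARRAY_FL_BOOT)
          return -ENODEV;

      return ring_buffer_map(iter->array_buffer->buffer,
                             iter->cpu_file, vma);
  }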

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214115547.0d7287d3@gandalf.local.home
Fixes: 9b7bdf6f6e ("tracing: Have trace_printk not use binary prints if boot buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-15 13:59:52 -05:00
Steven Rostedt
f5b95f1fa2 ring-buffer: Validate the persistent meta data subbuf array
The meta data for a mapped ring buffer contains an array of indexes of all
the subbuffers. The first entry is the reader page, and the rest of the
entries lay out the order of the subbuffers in how the ring buffer link
list is to be created.

The validator currently makes sure that all the entries are within the
range of 0 and nr_subbufs. But it does not check if there are any
duplicates.

While working on the ring buffer, I corrupted this array by adding
duplicates. The validator did not catch it and created the ring buffer
link list on top of it. Luckily, the only corruption was that the reader
page was also in the writer path, which presented corrupted data but did
not crash the kernel. But if there were duplicates on the writer side,
then it could corrupt the ring buffer link list and cause a crash.

Create a bitmask array with the size of the number of subbuffers, then
clear it. When walking through the subbuf array checking that the entries
are within range, test if each entry's bit is already set in the
subbuf_mask. If it is, then there are duplicates, so fail the validation.
If not, set the corresponding bit and continue.
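
A hedged sketch of the validation pass (the bitmap helpers are the
standard linux/bitmap.h API; the meta layout is illustrative):

  unsigned long *subbuf_mask;
  int i;

  subbuf_mask = bitmap_alloc(nr_subbufs, GFP_KERNEL);
  if (!subbuf_mask)
      return false;
  bitmap_zero(subbuf_mask, nr_subbufs);

  for (i = 0; i < nr_subbufs; i++) {
      unsigned int idx = meta->buffers[i];    /* hypothetical field */

      if (idx >= nr_subbufs)                  /* out of range */
          goto invalid;
      if (test_bit(idx, subbuf_mask))         /* duplicate entry */
          goto invalid;
      set_bit(idx, subbuf_mask);
  }
  bitmap_free(subbuf_mask);
  return true;

  invalid:
      bitmap_free(subbuf_mask);
      return false;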

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214102820.7509ddea@gandalf.local.home
Fixes: c76883f18e ("ring-buffer: Add test if range of boot buffer is valid")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14 12:50:51 -05:00
Steven Rostedt
60b8f71114 tracing: Have the error of __tracing_resize_ring_buffer() passed to user
Currently if __tracing_resize_ring_buffer() returns an error,
tracing_resize_ringbuffer() returns -ENOMEM. But it may not be a memory
issue that caused the function to fail. If the ring buffer is memory
mapped, then resizing of the ring buffer is disabled. But if the user
tries to resize the buffer anyway, it will get -ENOMEM returned, which is
confusing because there is plenty of memory. The actual error returned
was -EBUSY, which would make much more sense to the user.
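
The shape of the change, as a sketch (only the return path matters):

  ret = __tracing_resize_ring_buffer(tr, size, cpu_id);

  /* Before: any failure was reported as -ENOMEM.
   *     if (ret < 0)
   *         return -ENOMEM;
   * After: pass the real error (e.g. -EBUSY while mapped) through. */
  if (ret < 0)
      return ret;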

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213134132.7e4505d7@gandalf.local.home
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-02-14 12:50:27 -05:00
Steven Rostedt
9ba0e1755a ring-buffer: Unlock resize on mmap error
Memory mapping the tracing ring buffer will disable resizing the buffer.
But if there's an error in the memory mapping, like an invalid parameter,
the function exits without re-enabling resizing of the ring buffer,
preventing the ring buffer from being resized after that.
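
A hedged sketch of the error path (resize_disabled is the real counter in
kernel/trace/ring_buffer.c; the surrounding code is simplified):

  atomic_inc(&cpu_buffer->resize_disabled);

  err = __rb_map_vma(cpu_buffer, vma);
  if (err) {
      /* Undo the resize block before bailing out, otherwise the
       * buffer can never be resized again. */
      atomic_dec(&cpu_buffer->resize_disabled);
      goto unlock;
  }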

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213131957.530ec3c5@gandalf.local.home
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14 12:50:05 -05:00
Steven Rostedt
c8c9b1d2d5 fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
The code was restructured so that the function graph notrace feature,
which suppresses tracing of a function and all of its children, works by
setting a NOTRACE flag when a function that is not to be traced is hit.

There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags and a
TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the
restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used
TRACE_GRAPH_NOTRACE.
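
In other words (the bit value shown is an example, not the real enum):

  #define TRACE_GRAPH_NOTRACE_BIT  2    /* bit position (example) */
  #define TRACE_GRAPH_NOTRACE      (1 << TRACE_GRAPH_NOTRACE_BIT)

  /* Buggy: tests the bit position (the value 2), which also matches
   * unrelated flag combinations: */
  if (flags & TRACE_GRAPH_NOTRACE_BIT)
      notrace = true;

  /* Fixed: tests the mask with that bit actually set: */
  if (flags & TRACE_GRAPH_NOTRACE)
      notrace = true;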

For example:

 # cd /sys/kernel/tracing
 # echo set_track_prepare stack_trace_save  > set_graph_notrace
 # echo function_graph > current_tracer
 # cat trace
[..]
 0)               |                          __slab_free() {
 0)               |                            free_to_partial_list() {
 0)               |                                  arch_stack_walk() {
 0)               |                                    __unwind_start() {
 0)   0.501 us    |                                      get_stack_info();

Where a non filter trace looks like:

 # echo > set_graph_notrace
 # cat trace
 0)               |                            free_to_partial_list() {
 0)               |                              set_track_prepare() {
 0)               |                                stack_trace_save() {
 0)               |                                  arch_stack_walk() {
 0)               |                                    __unwind_start() {

Where the filter should look like:

 # cat trace
 0)               |                            free_to_partial_list() {
 0)               |                              _raw_spin_lock_irqsave() {
 0)   0.350 us    |                                preempt_count_add();
 0)   0.351 us    |                                do_raw_spin_lock();
 0)   2.440 us    |                              }

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250208001511.535be150@batman.local.home
Fixes: b84214890a ("function_graph: Move graph notrace bit to shadow stack global var")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-08 08:36:45 -05:00
Joel Granados
1751f872cc treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/infiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.
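
For illustration, what a constified table looks like (a made-up example,
not taken from the patch):

  static int example_value;    /* illustrative */

  static const struct ctl_table example_table[] = {
      {
          .procname     = "example",
          .data         = &example_value,
          .maxlen       = sizeof(int),
          .mode         = 0644,
          .proc_handler = proc_dointvec,
      },
  };

The array now lands in .rodata, so its proc_handler pointer cannot be
overwritten at runtime.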

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-28 13:48:37 +01:00
Linus Torvalds
40648d246f rv: tools/rtla: Updates for 6.14
- Add a test suite to test the tool
 
   Add a small test suite that can be used to test rtla's basic features to
   at least have something to test when applying changes.
 
 - Automate manual steps in monitor creation
 
   While creating a new monitor in RV, besides generating code from dot2k,
   there are a few manual steps which can be tedious and error prone, like
   adding the tracepoints, makefile lines and kconfig, or selecting events
   that start the monitor in the initial state.
 
   Updates were made to automate as many of those steps as possible and
   make creating a new RV monitor much quicker. It is still required to
   select the proper tracepoints; this step is harder to automate in a
   general way and, in several cases, would still need user intervention.
 
 - Have rtla timerlat hist and top set OSNOISE_WORKLOAD flag
 
   Have both rtla-timerlat-hist and rtla-timerlat-top set OSNOISE_WORKLOAD to
   the proper value ("on" when running with -k, "off" when running with -u)
   every time the option is available instead of setting it only when running
   with -u.
 
   This prevents rtla timerlat -k from giving no results when
   NO_OSNOISE_WORKLOAD is set, either manually or by an abnormally exited earlier
   run of rtla timerlat -u.
 
 - Stop rtla timerlat on signal properly when overloaded
 
   There is an issue where if rtla is run on machines with a high number of
   CPUs (100+), timerlat can generate more samples than rtla is able to process
   via tracefs_iterate_raw_events. This is especially common when the interval
   is set to 100us (rteval and cyclictest default) as opposed to the rtla
   default of 1000us, but also happens with the rtla default.
 
   Currently, this leads to rtla hanging and having to be terminated with
   SIGTERM. SIGINT setting stop_tracing is not enough, since more and more
   events are coming and tracefs_iterate_raw_events never exits.
 
   To fix this: Stop the timerlat tracer on SIGINT/SIGALRM to ensure no more
   events are generated when rtla is supposed to exit.
 
   Also on receiving SIGINT/SIGALRM twice, abort iteration immediately with
   tracefs_iterate_stop, making rtla exit right away instead of waiting for all
   events to be processed.
 
 - Account for missed events
 
   Due to tracefs buffer overflow, it can happen that rtla misses events,
   making the tracing results inaccurate.
 
   Count both the number of missed events and the total number of processed
   events, and display missed events as well as their percentage. The numbers
   are displayed for both osnoise and timerlat, even though for the former,
   missed events are generally not expected.
 
   For hist, the number is displayed at the end of the run; for top, it is
   displayed on each printing of the top table.
 
 - Changes to make osnoise more robust
 
   There was a dependency in the code that the first field of the
   osnoise_tool structure was the trace field. If that ever changed, the
   code would break. Change the code to encapsulate this dependency so
   that the code using the structure no longer relies on it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5UQ4BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qktFAQD2px6MyoOVTssB5Iw3aTWGUfTFoDEc
 bfng5JsBxlVJkQEA+2UUvP8FJlLTOQvVEwJiscX7CCJxl5bYkV6GWuGRxQU=
 =h//9
 -----END PGP SIGNATURE-----

Merge tag 'trace-tools-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull rv and tools/rtla updates from Steven Rostedt:

 - Add a test suite to test the tool

   Add a small test suite that can be used to test rtla's basic features
   to at least have something to test when applying changes.

 - Automate manual steps in monitor creation

   While creating a new monitor in RV, besides generating code from
   dot2k, there are a few manual steps which can be tedious and error
   prone, like adding the tracepoints, makefile lines and kconfig, or
   selecting events that start the monitor in the initial state.

   Updates were made to automate as many of those steps as possible and
   make creating a new RV monitor much quicker. It is still required to
   select the proper tracepoints; this step is harder to automate in a
   general way and, in several cases, would still need user intervention.

 - Have rtla timerlat hist and top set OSNOISE_WORKLOAD flag

   Have both rtla-timerlat-hist and rtla-timerlat-top set
   OSNOISE_WORKLOAD to the proper value ("on" when running with -k,
   "off" when running with -u) every time the option is available
   instead of setting it only when running with -u.

   This prevents rtla timerlat -k from giving no results when
   NO_OSNOISE_WORKLOAD is set, either manually or by an abnormally
   exited earlier run of rtla timerlat -u.

 - Stop rtla timerlat on signal properly when overloaded

   There is an issue where if rtla is run on machines with a high number
   of CPUs (100+), timerlat can generate more samples than rtla is able
   to process via tracefs_iterate_raw_events. This is especially common
   when the interval is set to 100us (rteval and cyclictest default) as
   opposed to the rtla default of 1000us, but also happens with the rtla
   default.

   Currently, this leads to rtla hanging and having to be terminated
   with SIGTERM. SIGINT setting stop_tracing is not enough, since more
   and more events are coming and tracefs_iterate_raw_events never
   exits.

   To fix this: Stop the timerlat tracer on SIGINT/SIGALRM to ensure no
   more events are generated when rtla is supposed to exit.

   Also, on receiving SIGINT/SIGALRM twice, abort iteration immediately
   with tracefs_iterate_stop, making rtla exit right away instead of
   waiting for all events to be processed (a sketch of this handler
   shape follows this list).

 - Account for missed events

   Due to tracefs buffer overflow, it can happen that rtla misses
   events, making the tracing results inaccurate.

   Count both the number of missed events and the total number of
   processed events, and display missed events as well as their
   percentage. The numbers are displayed for both osnoise and timerlat,
   even though for the former, missed events are generally not
   expected.

   For hist, the number is displayed at the end of the run; for top, it
   is displayed on each printing of the top table.

 - Changes to make osnoise more robust

   There was a dependency in the code that the first field of the
   osnoise_tool structure was the trace field. If that ever changed,
   the code would break. Change the code to encapsulate this dependency
   so that the code using the structure no longer relies on it.
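
As referenced in the signal-handling item above, a hedged sketch of the
two-stage handler (the struct field and the trace_instance_stop() helper
are assumptions based on the shortlog below):

  #include <signal.h>
  #include <tracefs.h>

  static struct trace_instance *ti;    /* set up elsewhere (assumed) */
  static volatile sig_atomic_t stops;

  static void stop_handler(int sig)
  {
      if (++stops == 1) {
          /* First signal: stop the tracer so that no new events
           * keep arriving while the buffer is drained. */
          trace_instance_stop(ti);    /* helper added by this series */
      } else {
          /* Second signal: abort event iteration right away. */
          tracefs_iterate_stop(ti->inst);    /* field name assumed */
      }
  }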

* tag 'trace-tools-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (22 commits)
  rtla: Report missed event count
  rtla: Add function to report missed events
  rtla: Count all processed events
  rtla: Count missed trace events
  tools/rtla: Add osnoise_trace_is_off()
  rtla/timerlat_top: Set OSNOISE_WORKLOAD for kernel threads
  rtla/timerlat_hist: Set OSNOISE_WORKLOAD for kernel threads
  rtla/osnoise: Distinguish missing workload option
  rtla/timerlat_top: Abort event processing on second signal
  rtla/timerlat_hist: Abort event processing on second signal
  rtla/timerlat_top: Stop timerlat tracer on signal
  rtla/timerlat_hist: Stop timerlat tracer on signal
  rtla: Add trace_instance_stop
  tools/rtla: Add basic test suite
  verification/dot2k: Implement event type detection
  verification/dot2k: Auto patch current kernel source
  verification/dot2k: Simplify manual steps in monitor creation
  rv: Simplify manual steps in monitor creation
  verification/dot2k: Add support for name and description options
  verification/dot2k: More robust template variables
  ...
2025-01-26 14:25:58 -08:00
Linus Torvalds
90ab2117f4 Runtime verifier and osnoise fixes for 6.14:
- Reset idle tasks on reset for runtime verifier
 
   When the runtime verifier is reset, it resets the task's data that is being
   monitored. But it only iterates for_each_process() which does not include
   the idle tasks. As the idle tasks can be monitored, they need to be reset
   as well.
 
 - Fix the enabling and disabling of tracepoints in osnoise
 
   If timerlat is enabled and the WORKLOAD flag is not set, then the osnoise
   tracer will enable the migrate task tracepoint to monitor it for its own
   workload. The test to enable the tracepoint is done against user space
   modifiable parameters. On disabling of the tracer, those same parameters
   are used to determine if the tracepoint should be disabled. The problem
   is that if user space modifies the parameters after the tracer has been
   enabled, the tracepoint may never be disabled.
 
   Instead, a static variable is used to keep track if the tracepoint was
   enabled or not. Then when the tracer shuts down, it will use this variable
   to decide to disable the tracepoint or not, instead of looking at the user
   space parameters.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5UJ1BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qo46AQDjr1yVTyUmzvxe1bMLDoDOq1xeRMRe
 4f8CdpOJRxbi0QEAwnl5Ey9Rvcuy8GFpt/USESr4VYWAN10fvsPx7pkphAs=
 =RYyn
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verifier and osnoise fixes from Steven Rostedt:

 - Reset idle tasks on reset for runtime verifier

   When the runtime verifier is reset, it resets the task's data that is
   being monitored. But it only iterates for_each_process() which does
   not include the idle tasks. As the idle tasks can be monitored, they
   need to be reset as well.

 - Fix the enabling and disabling of tracepoints in osnoise

   If timerlat is enabled and the WORKLOAD flag is not set, then the
   osnoise tracer will enable the migrate task tracepoint to monitor it
   for its own workload. The test to enable the tracepoint is done
   against user space modifiable parameters. On disabling of the tracer,
   those same parameters are used to determine if the tracepoint should
   be disabled. The problem is if user space were to modify the
   parameters after it enables the tracer then it may not disable the
   tracepoint.

   Instead, a static variable is used to keep track if the tracepoint
   was enabled or not. Then when the tracer shuts down, it will use this
   variable to decide to disable the tracepoint or not, instead of
   looking at the user space parameters.

* tag 'trace-rv-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/osnoise: Fix resetting of tracepoints
  rv: Reset per-task monitors also for idle tasks
2025-01-26 14:19:45 -08:00
Steven Rostedt
e3ff424592 tracing/osnoise: Fix resetting of tracepoints
If a timerlat tracer is started with the osnoise option OSNOISE_WORKLOAD
disabled, but then that option is enabled and timerlat is removed, the
tracepoints that were enabled on timerlat registration do not get
disabled. If the option is disabled again and timerlat is started, then
it triggers a warning in the tracepoint code due to registering the
tracepoint again without ever having disabled it.

Do not use the same user space defined options to know to disable the
tracepoints when timerlat is removed. Instead, set a global flag when it
is enabled and use that flag to know to disable the events.
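
A hedged sketch of the flag (all names here are illustrative; the real
code lives in kernel/trace/trace_osnoise.c):

  static bool migration_tp_hooked;    /* illustrative name */

  static void workload_start(void)    /* illustrative */
  {
      if (!test_bit(OSN_WORKLOAD, &osnoise_options)) {
          hook_migration_tracepoint();    /* illustrative */
          migration_tp_hooked = true;
      }
  }

  static void workload_stop(void)
  {
      /* Trust the flag, not the user-modifiable option bit. */
      if (migration_tp_hooked) {
          unhook_migration_tracepoint();
          migration_tp_hooked = false;
      }
  }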

 ~# echo NO_OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
 ~# echo timerlat > /sys/kernel/tracing/current_tracer
 ~# echo OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
 ~# echo nop > /sys/kernel/tracing/current_tracer
 ~# echo NO_OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
 ~# echo timerlat > /sys/kernel/tracing/current_tracer

Triggers:

 ------------[ cut here ]------------
 WARNING: CPU: 6 PID: 1337 at kernel/tracepoint.c:294 tracepoint_add_func+0x3b6/0x3f0
 Modules linked in:
 CPU: 6 UID: 0 PID: 1337 Comm: rtla Not tainted 6.13.0-rc4-test-00018-ga867c441128e-dirty #73
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:tracepoint_add_func+0x3b6/0x3f0
 Code: 48 8b 53 28 48 8b 73 20 4c 89 04 24 e8 23 59 11 00 4c 8b 04 24 e9 36 fe ff ff 0f 0b b8 ea ff ff ff 45 84 e4 0f 84 68 fe ff ff <0f> 0b e9 61 fe ff ff 48 8b 7b 18 48 85 ff 0f 84 4f ff ff ff 49 8b
 RSP: 0018:ffffb9b003a87ca0 EFLAGS: 00010202
 RAX: 00000000ffffffef RBX: ffffffff92f30860 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff9bf59e91ccd0 RDI: ffffffff913b6410
 RBP: 000000000000000a R08: 00000000000005c7 R09: 0000000000000002
 R10: ffffb9b003a87ce0 R11: 0000000000000002 R12: 0000000000000001
 R13: ffffb9b003a87ce0 R14: ffffffffffffffef R15: 0000000000000008
 FS:  00007fce81209240(0000) GS:ffff9bf6fdd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055e99b728000 CR3: 00000001277c0002 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __warn.cold+0xb7/0x14d
  ? tracepoint_add_func+0x3b6/0x3f0
  ? report_bug+0xea/0x170
  ? handle_bug+0x58/0x90
  ? exc_invalid_op+0x17/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  ? tracepoint_add_func+0x3b6/0x3f0
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  tracepoint_probe_register+0x78/0xb0
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  osnoise_workload_start+0x2b5/0x370
  timerlat_tracer_init+0x76/0x1b0
  tracing_set_tracer+0x244/0x400
  tracing_set_trace_write+0xa0/0xe0
  vfs_write+0xfc/0x570
  ? do_sys_openat2+0x9c/0xe0
  ksys_write+0x72/0xf0
  do_syscall_64+0x79/0x1c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250123204159.4450c88e@gandalf.local.home
Fixes: e88ed227f6 ("tracing/timerlat: Add user-space interface")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-24 15:20:44 -05:00
Linus Torvalds
606489dbfa Fix atomic64 operations on some architectures for the tracing ring buffer:
 - Have the atomic64 emulation use arch_spin_locks instead of raw_spin_locks
 
   The tracing ring buffer events have a small timestamp that holds the
   delta between itself and the event before it. But this can be tricky
   to update when interrupts come in. It originally just set the deltas
   to zero for events that interrupted the adding of another event which
   made all the events in the interrupt have the same timestamp as the
   event it interrupted. This was not suitable for many tools, so it
   was eventually fixed. But that fix required adding an atomic64 cmpxchg
   on the timestamp in cases where an event was added while another
   event was in the process of being added.
 
   Originally, for 32 bit architectures, the manipulation of the 64 bit
   timestamp was done by a structure that held multiple 32bit words to hold
   parts of the timestamp and a counter. But as updates to the ring buffer
   were done, maintaining this became too complex and was replaced by the
   atomic64 generic operations which are now used by both 64bit and 32bit
   architectures.  Shortly after that, it was reported that riscv32 and
   other 32 bit architectures that just used the generic atomic64 were
   locking up. This was because the generic atomic64 operations defined in
   lib/atomic64.c use a raw_spin_lock() to emulate an atomic64 operation.
   The problem here was that raw_spin_lock() can also be traced by the
   function tracer (which is commonly used for debugging raw spin locks).
   Since the function tracer uses the tracing ring buffer, which now is being
   traced internally, this was triggering a recursion and setting off a
   warning that the spin locks were recursing.
 
   There's no reason for the code that emulates atomic64 operations to be
   using raw_spin_locks which have a lot of debugging infrastructure attached
   to them (depending on the config options). Instead it should be using
   arch_spin_lock(), which does not have any such infrastructure attached
   to it and is used by low level infrastructure like RCU locks, lockdep
   and of course tracing. Using arch_spin_lock()s fixes this issue.
 
 - Do not trace in NMI if the architecture uses emulated atomic64 operations
 
   Another issue with using the emulated atomic64 operations that uses
   spin locks to emulate the atomic64 operations is that they cannot be
   used in NMI context. As an NMI can trigger while holding the atomic64
   spin locks it can try to take the same lock and cause a deadlock.
 
   Have the ring buffer fail recording events if in NMI context and the
   architecture uses the emulated atomic64 operations.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5Jr7RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qg7cAPoD/H4BRsFa3UUDnxofTlBuj4A7neJd
 rk9ddD9HXH8KywEAhBn1Oujiw81Ayjx7E6s4ednAQX4rldTXBXDyFNuuGgU=
 =b13F
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull trace ring buffer fix from Steven Rostedt:
 "Fix atomic64 operations on some architectures for the tracing ring
  buffer:

   - Have the atomic64 emulation use arch_spin_locks instead of
     raw_spin_locks

     The tracing ring buffer events have a small timestamp that holds
     the delta between itself and the event before it. But this can be
     tricky to update when interrupts come in. It originally just set
     the deltas to zero for events that interrupted the adding of
     another event which made all the events in the interrupt have the
     same timestamp as the event it interrupted. This was not suitable
     for many tools, so it was eventually fixed. But that fix required
     adding an atomic64 cmpxchg on the timestamp in cases where an event
     was added while another event was in the process of being added.

     Originally, for 32 bit architectures, the manipulation of the 64
     bit timestamp was done by a structure that held multiple 32bit
     words to hold parts of the timestamp and a counter. But as updates
     to the ring buffer were done, maintaining this became too complex
     and was replaced by the atomic64 generic operations which are now
     used by both 64bit and 32bit architectures. Shortly after that, it
     was reported that riscv32 and other 32 bit architectures that just
     used the generic atomic64 were locking up. This was because the
      generic atomic64 operations defined in lib/atomic64.c use a
      raw_spin_lock() to emulate an atomic64 operation. The problem here
     was that raw_spin_lock() can also be traced by the function tracer
     (which is commonly used for debugging raw spin locks). Since the
     function tracer uses the tracing ring buffer, which now is being
     traced internally, this was triggering a recursion and setting off
      a warning that the spin locks were recursing.

     There's no reason for the code that emulates atomic64 operations to
     be using raw_spin_locks which have a lot of debugging
     infrastructure attached to them (depending on the config options).
      Instead it should be using arch_spin_lock(), which does not have
      any such infrastructure attached to it and is used by low level
      infrastructure like RCU locks, lockdep and of course tracing. Using
      arch_spin_lock()s fixes this issue (a simplified sketch follows
      below).

   - Do not trace in NMI if the architecture uses emulated atomic64
     operations

     Another issue with using the emulated atomic64 operations that uses
     spin locks to emulate the atomic64 operations is that they cannot
     be used in NMI context. As an NMI can trigger while holding the
     atomic64 spin locks it can try to take the same lock and cause a
     deadlock.

     Have the ring buffer fail recording events if in NMI context and
     the architecture uses the emulated atomic64 operations"
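
To make the arch_spin_lock() point concrete, a simplified sketch of
lib/atomic64.c-style emulation (the real code hashes across an array of
locks; this single-lock version is illustrative):

  #include <linux/spinlock.h>
  #include <linux/types.h>

  static arch_spinlock_t a64_lock = __ARCH_SPIN_LOCK_UNLOCKED;

  s64 generic_atomic64_add_return(s64 a, atomic64_t *v)
  {
      unsigned long flags;
      s64 val;

      /* arch_spin_lock() carries no lockdep/tracing hooks, so the
       * function tracer cannot recurse through it back into the
       * ring buffer. */
      local_irq_save(flags);
      arch_spin_lock(&a64_lock);
      val = (v->counter += a);
      arch_spin_unlock(&a64_lock);
      local_irq_restore(flags);
      return val;
  }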

* tag 'trace-ringbuffer-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  atomic64: Use arch_spin_locks instead of raw_spin_locks
  ring-buffer: Do not allow events in NMI with generic atomic64 cmpxchg()
2025-01-23 18:02:55 -08:00
Linus Torvalds
7c1badb2a9 Remove calltime and rettime from fgraph infrastructure
The calltime and rettime were used by the function graph tracer to
 calculate the timings of functions where it traced their entry and exit.
 The calltime and rettime were stored in the generic structures that
 were used for the mechanisms to add an entry and exit callback.
 
 Now that function graph infrastructure is used by other subsystems than
 just the tracer, the calltime and rettime are not needed for them.
 Remove the calltime and rettime from the generic fgraph infrastructure
 and have the callers that require them handle them.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5Jk2BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qr3GAQCtzSX6PYTliZKm/Wa5zhebopwhGKEb
 Mv0VZejaxYu0hgEA5O57J676FAOaQ7BLzoMojKFhR8RZ6LadKR5r7A0Qzgc=
 =Bb/r
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull fgraph updates from Steven Rostedt:
 "Remove calltime and rettime from fgraph infrastructure

  The calltime and rettime were used by the function graph tracer to
  calculate the timings of functions where it traced their entry and
  exit. The calltime and rettime were stored in the generic structures
  that were used for the mechanisms to add an entry and exit callback.

  Now that function graph infrastructure is used by other subsystems
  than just the tracer, the calltime and rettime are not needed for
  them. Remove the calltime and rettime from the generic fgraph
  infrastructure and have the callers that require them handle them"

* tag 'ftrace-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Remove calltime and rettime from generic operations
2025-01-23 17:59:25 -08:00
Linus Torvalds
e8744fbc83 tracing updates for v6.14:
- Cleanup with guard() and free() helpers
 
   There were several places in the code that had a lot of "goto out" in the
   error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.
 
 - Update the Rust tracepoint code to use the C code too
 
   There was some duplication of the tracepoint code for Rust that did the
   same logic as the C code. Add a helper that makes it possible for both
   algorithms to use the same logic in one place.
 
 - Add poll to trace event hist files
 
   It is useful to know when an event is triggered, possibly with some
   filtering applied. Since the hist files of events get updated when the
   event is active and triggered, allow applications to poll the hist file
   and wake up when an event is triggered. This lets the application know
   that the event it is waiting for happened.
 
 - Add :mod: command to enable events for current or future modules
 
   The function tracer already has a way to enable functions to be traced in
   modules by writing ":mod:<module>" into set_ftrace_filter. That will
   enable either all the functions for the module if it is loaded, or if it
   is not, it will cache that command, and when the module is loaded that
   matches <module>, its functions will be enabled. This also allows init
   functions to be traced. But currently events do not have that feature.
 
   Add the same command for events: if ':mod:<module>' is written into
   set_event, then either all of the module's events are enabled if it is
   loaded, or the command is cached so that the module's events are
   enabled when it is loaded. This also works from the kernel command
   line: with "trace_event=:mod:<module>", the module's events will be
   enabled when it is loaded at boot up.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5EbMxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qkZsAP9Amgx9frSbR1pn1t0I3wVnQx7khgOu
 s/b8Ro+vjTx1/QD/RN2AA7f+HK4F27w3Aqfrs0nKXAPtXWsJ9Epp8raG5w8=
 =Pg+4
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Cleanup with guard() and free() helpers

   There were several places in the code that had a lot of "goto out" in
   the error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.

 - Update the Rust tracepoint code to use the C code too

   There was some duplication of the tracepoint code for Rust that did
   the same logic as the C code. Add a helper that makes it possible for
   both algorithms to use the same logic in one place.

 - Add poll to trace event hist files

   It is useful to know when an event is triggered, possibly with some
   filtering applied. Since the hist files of events get updated when the
   event is active and triggered, allow applications to poll the hist
   file and wake up when an event is triggered. This lets the application
   know that the event it is waiting for happened (a usage sketch follows
   this list).

 - Add :mod: command to enable events for current or future modules

   The function tracer already has a way to enable functions to be
   traced in modules by writing ":mod:<module>" into set_ftrace_filter.
   That will enable either all the functions for the module if it is
   loaded, or if it is not, it will cache that command, and when the
   module is loaded that matches <module>, its functions will be
   enabled. This also allows init functions to be traced. But currently
   events do not have that feature.

   Add the same command for events: if ':mod:<module>' is written into
   set_event, then either all of the module's events are enabled if it
   is loaded, or the command is cached so that the module's events are
   enabled when it is loaded. This also works from the kernel command
   line: with "trace_event=:mod:<module>", the module's events will be
   enabled when it is loaded at boot up.
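
As referenced in the hist-poll item above, a minimal userspace sketch of
the new behavior (the event path is just an example):

  #include <fcntl.h>
  #include <poll.h>
  #include <stdio.h>

  int main(void)
  {
      struct pollfd pfd = { .events = POLLIN };

      pfd.fd = open("/sys/kernel/tracing/events/sched/"
                    "sched_switch/hist", O_RDONLY);
      if (pfd.fd < 0)
          return 1;

      /* Blocks until the histogram is updated, i.e. the event
       * (with any configured filter) has triggered. */
      poll(&pfd, 1, -1);
      printf("event triggered\n");
      return 0;
  }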

* tag 'trace-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits)
  tracing: Fix output of set_event for some cached module events
  tracing: Fix allocation of printing set_event file content
  tracing: Rename update_cache() to update_mod_cache()
  tracing: Fix #if CONFIG_MODULES to #ifdef CONFIG_MODULES
  selftests/ftrace: Add test that tests event :mod: commands
  tracing: Cache ":mod:" events for modules not loaded yet
  tracing: Add :mod: command to enabled module events
  selftests/tracing: Add hist poll() support test
  tracing/hist: Support POLLPRI event for poll on histogram
  tracing/hist: Add poll(POLLIN) support on hist file
  tracing: Fix using ret variable in tracing_set_tracer()
  tracepoint: Reduce duplication of __DO_TRACE_CALL
  tracing/string: Create and use __free(argv_free) in trace_dynevent.c
  tracing: Switch trace_stat.c code over to use guard()
  tracing: Switch trace_stack.c code over to use guard()
  tracing: Switch trace_osnoise.c code over to use guard() and __free()
  tracing: Switch trace_events_synth.c code over to use guard()
  tracing: Switch trace_events_filter.c code over to use guard()
  tracing: Switch trace_events_trigger.c code over to use guard()
  tracing: Switch trace_events_hist.c code over to use guard()
  ...
2025-01-23 17:51:16 -08:00
Linus Torvalds
544521d621 Probes updates for v6.14:
- kprobes: Cleanups using guard() and __free(): Use cleanup.h macros
   to clean up code and remove all gotos from kprobes code. This work
   includes the changes below.
   . kprobes: Adopt guard() and scoped_guard()
   . jump_label: Define guard() for jump_label_lock
   . kprobes: Use guard() for external locks
   . kprobes: Use guard for rcu_read_lock
   . kprobes: Remove unneeded goto
   . kprobes: Remove remaining gotos
 
 - tracing/probes: Also cleanups tracing/*probe events code with guard()
   and __free(). These patches are just to simplify the parser codes.
   This work includes below changes.
   . tracing/kprobe: Adopt guard() and scoped_guard()
   . tracing/uprobe: Adopt guard() and scoped_guard()
   . tracing/eprobe: Adopt guard() and scoped_guard()
   . tracing: Use __free() in trace_probe for cleanup
   . tracing: Use __free() for kprobe events to cleanup
   . tracing/kprobes: Simplify __trace_kprobe_create() by removing gotos
 
 - kprobes: Reduce preempt disable scope in check_kprobe_access_safe()
   This reduces preempt disable time to only when getting the module
   refcount in check_kprobe_access_safe(). Previously it disabled
   preempt needlessly for other checks, including
   jump_label_text_reserved(), which took a long time because of
   the linear search.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmeNy8kbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b/UoIAMYX8pYtLMZjokKSlsnN
 Y3rMZecFJNKfXeA62YazpESO9n+BYyqj/Zj25Ws4YzdfnoT4kGtMuoUCS7d5U3lo
 RZO/a7i3TgrzQdWbRwBz7l75h+FnCAsrSoLGoH/sQxXUP+p/2C6GJG3O3A/0qMaS
 TPp7/nsR78P7MSf6Yr/ODZj0UWMSX6a6QNrzwloiMUjY6aPxz7K1on9cdzMF2QPE
 fgtJk+JbCn9nLOJi/SVlO+wu4EfdKrfT7sqz9taNX/eQYmx2O9fE3R92Qvhxh79r
 MrqcrMfcwfWtX+yD1kb9mf+jMNYy8Dhcf595B0TGSzCZq64ir9UukntR6YD3raAO
 Wm0=
 =qmrY
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:

 - kprobes: Cleanups using guard() and __free(): Use cleanup.h macros to
   clean up code and remove all gotos from kprobes code (a sketch of the
   pattern follows this list).

 - tracing/probes: Also cleanups tracing/*probe events code with guard()
   and __free(). These patches are just to simplify the parser codes.

 - kprobes: Reduce preempt disable scope in check_kprobe_access_safe()

   This reduces preempt disable time to only when getting the module
   refcount in check_kprobe_access_safe().

   Previously it disabled preempt needlessly for other checks including
   jump_label_text_reserved(), which took a long time because of the
   linear search.
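
As referenced in the first item above, a hedged sketch of the guard()
pattern from <linux/cleanup.h> that these cleanups adopt (the function
and lock names are illustrative):

  #include <linux/cleanup.h>
  #include <linux/mutex.h>

  static DEFINE_MUTEX(example_mutex);    /* illustrative lock */

  static int example(void *data)
  {
      /* Unlocked automatically when the function returns, on every
       * path, so no "goto out" unlock label is needed. */
      guard(mutex)(&example_mutex);

      if (!data)
          return -EINVAL;

      return 0;
  }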

* tag 'probes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/kprobes: Simplify __trace_kprobe_create() by removing gotos
  tracing: Use __free() for kprobe events to cleanup
  tracing: Use __free() in trace_probe for cleanup
  kprobes: Remove remaining gotos
  kprobes: Remove unneeded goto
  kprobes: Use guard for rcu_read_lock
  kprobes: Use guard() for external locks
  jump_label: Define guard() for jump_label_lock
  tracing/eprobe: Adopt guard() and scoped_guard()
  tracing/uprobe: Adopt guard() and scoped_guard()
  tracing/kprobe: Adopt guard() and scoped_guard()
  kprobes: Adopt guard() and scoped_guard()
  kprobes: Reduce preempt disable scope in check_kprobe_access_safe()
2025-01-23 17:24:20 -08:00
Linus Torvalds
d0d106a2bd bpf-next-6.14
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmeOu1YACgkQ6rmadz2v
 bTrrHxAAn6eqEsluWnDlzhI0OGsPjvgS00sf+MOeqiXYeS2eJ8yJuKifp38+nIQZ
 lIplsWU2ReUY20eizPqLPnQ7TXZGvLgp08E8yHUoZ0siWanqr9iDRfbZCCNrDMNm
 lMqeR1SLapMws2R/UX9JbvPn2ajIJ6Lb4wxenTfdlW6q+0hAGM6Dt0k/jBod+quq
 /oo+xwG3L0q4APBovJfiAFN2z6IYN03b+zLiOrpIJtMACGewEXnl3m4mkL8ZM/FV
 nZGPIxIUPXCpKTGEkNqxfkrnHN2wZQ4ZSKEJ6lhEEp4jrgCVITaGZ/E7jlx6fZoj
 bbd4YMonIPo9Nhim8p1dt8yYBhKKiE5IXIq0GqlMv5+MvAN8ylrlydpsouW1fu66
 hZ1W1BxbxmrgyF0Bwo9JPOMhBHwMrmD6iH9LgiMpZf0ASeF+q9cJpoSOU5j5E9XB
 LpLIRf5jYTd4wZjhDmrQREReLo+Bng9DlCBu+jjh2+YTz6l6Qed+ETpENcd7lL5i
 IHZVbgD2RVPNJoUfdrd763HfYfDTk+50MF5FIMEyfKHz11if0E/LhBMzto22hm6b
 2f8ruj/8yvg8s2dxEP3ySQgcnynlwEnGxLenUVv7uEOYKeWri1rq+fvTK5ne1OLK
 oHnTlkViwQb74c0r8cFW+nkyfUYTfhhBAql14rl/fMjGDO2KZ10=
 =f2CA
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:
 "A smaller than usual release cycle.

  The main changes are:

   - Prepare selftest to run with GCC-BPF backend (Ihor Solodrai)

     In addition to the LLVM-BPF runs, the BPF CI now runs GCC-BPF in
     compile-only mode. Half of the tests are failing, since support for
     btf_decl_tag is still WIP, but this is a great milestone.

   - Convert various samples/bpf to selftests/bpf/test_progs format
     (Alexis Lothoré and Bastien Curutchet)

   - Teach the verifier to recognize that an array lookup with a constant
     in-range index will always succeed (Daniel Xu); see the sketch after
     this list

   - Cleanup migrate disable scope in BPF maps (Hou Tao)

   - Fix bpf_timer destroy path in PREEMPT_RT (Hou Tao)

   - Always use bpf_mem_alloc in bpf_local_storage in PREEMPT_RT (Martin
     KaFai Lau)

   - Refactor verifier lock support (Kumar Kartikeya Dwivedi)

     This is a prerequisite for upcoming resilient spin lock.

   - Remove excessive 'may_goto +0' instructions in the verifier that
     LLVM leaves behind when it unrolls loops (Yonghong Song)

   - Remove unhelpful bpf_probe_write_user() warning message (Marco
     Elver)

   - Add fd_array_cnt attribute for prog_load command (Anton Protopopov)

     This is a prerequisite for upcoming support for static_branch"
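
For the constant in-range lookup change, a sketch of the kind of program
that can now verify without an explicit NULL check (map layout and section
name are illustrative):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 16);
          __type(key, __u32);
          __type(value, __u64);
  } counts SEC(".maps");

  SEC("tracepoint/syscalls/sys_enter_read")
  int count_reads(void *ctx)
  {
          __u32 key = 3;          /* constant and < max_entries */
          __u64 *val = bpf_map_lookup_elem(&counts, &key);

          /* with nullness elision the verifier can prove val != NULL
           * here for array maps; the check is kept for clarity */
          if (val)
                  __sync_fetch_and_add(val, 1);
          return 0;
  }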

* tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (125 commits)
  selftests/bpf: Add some tests related to 'may_goto 0' insns
  bpf: Remove 'may_goto 0' instruction in opt_remove_nops()
  bpf: Allow 'may_goto 0' instruction in verifier
  selftests/bpf: Add test case for the freeing of bpf_timer
  bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT
  bpf: Free element after unlock in __htab_map_lookup_and_delete_elem()
  bpf: Bail out early in __htab_map_lookup_and_delete_elem()
  bpf: Free special fields after unlock in htab_lru_map_delete_node()
  tools: Sync if_xdp.h uapi tooling header
  libbpf: Work around kernel inconsistently stripping '.llvm.' suffix
  bpf: selftests: verifier: Add nullness elision tests
  bpf: verifier: Support eliding map lookup nullness
  bpf: verifier: Refactor helper access type tracking
  bpf: tcp: Mark bpf_load_hdr_opt() arg2 as read-write
  bpf: verifier: Add missing newline on verbose() call
  selftests/bpf: Add distilled BTF test about marking BTF_IS_EMBEDDED
  libbpf: Fix incorrect traversal end type ID when marking BTF_IS_EMBEDDED
  libbpf: Fix return zero when elf_begin failed
  selftests/bpf: Fix btf leak on new btf alloc failure in btf_distill test
  veristat: Load struct_ops programs only once
  ...
2025-01-23 08:04:07 -08:00
Steven Rostedt
66611c0475 fgraph: Remove calltime and rettime from generic operations
The function graph infrastructure is now generic so that kretprobes,
fprobes and BPF can use it. But there is still some leftover logic that
only the function graph tracer itself uses. This is the calculation of the
calltime and return time of the functions. The calculation of the calltime
has been moved into the function graph tracer and the users that need it,
so that it doesn't add overhead for the other users. But the return
timestamp was still being taken.

Instead of just moving the taking of the timestamp into the function graph
tracer, remove calltime and rettime completely from the ftrace_graph_ret
structure. Instead, move it into the function graph return entry event
structure and this also moves all the calltime and rettime logic out of
the generic fgraph.c code and into the tracing code that uses it.

This has been reported to decrease the overhead by ~27%.

Link: https://lore.kernel.org/all/Z3aSuql3fnXMVMoM@krava/
Link: https://lore.kernel.org/all/173665959558.1629214.16724136597211810729.stgit@devnote2/

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250121194436.15bdf71a@gandalf.local.home
Reported-by: Jiri Olsa <olsajiri@gmail.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-21 21:55:49 -05:00
Linus Torvalds
2e04247f7c ftrace updates for v6.14:
- Have fprobes built on top of function graph infrastructure
 
   The fprobe logic is an optimized kprobe that uses ftrace to attach to
   functions when a probe is needed at the start or end of the function. The
   fprobe and kretprobe logic implements a similar method as the function
   graph tracer to trace the end of the function. That is to hijack the
   return address and jump to a trampoline to do the trace when the function
   exits. To do this, a shadow stack needs to be created to store the
   original return address. Fprobes and function graph do this slightly
   differently. Fprobes (and kretprobes) have slots per callsite that are
   reserved to save the return address. This is fine when just a few points
   are traced. But users of fprobes, such as BPF programs, are starting to add
   many more locations, and this method does not scale.
 
   The function graph tracer was created to trace all functions in the
   kernel. In order to do this, when function graph tracing is started, every
   task gets its own shadow stack to hold the return address that is going to
   be traced. The function graph tracer has been updated to allow multiple
   users to use its infrastructure. Now fprobes are one of those users.
   This will also allow the fprobe and kretprobe methods of tracing the
   return address to become obsolete. With new technologies like CFI that
   need to know about these methods of hijacking the return address, going
   toward a solution that has only one method of doing this will make the
   kernel less complex.
 
 - Cleanup with guard() and free() helpers
 
   There were several places in the code that had a lot of "goto out" in the
   error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.
 
 - Remove disabling of interrupts in the function graph tracer
 
   When function graph tracer was first introduced, it could race with
   interrupts and NMIs. To prevent that race, it would disable interrupts and
   not trace NMIs. But the code has changed to allow NMIs and also
   interrupts. This change was done a long time ago, but the disabling of
   interrupts was never removed. Remove the disabling of interrupts in the
   function graph tracer is it is not needed. This greatly improves its
   performance.
 
 - Allow the :mod: command to enable tracing module functions on the kernel
   command line.
 
   The function tracer already has a way to enable functions to be traced in
   modules by writing ":mod:<module>" into set_ftrace_filter. That will
   enable either all the functions for the module if it is loaded, or if it
   is not, it will cache that command, and when the module is loaded that
   matches <module>, its functions will be enabled. This also allows init
   functions to be traced. But currently events do not have that feature.
 
   Because enabling function tracing can be done very early at boot up
   (before scheduling is enabled), the commands that can be done when
   function tracing is started is limited. Having the ":mod:" command to
   trace module functions as they are loaded is very useful. Update the
   kernel command line function filtering to allow it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ42E2RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqXSAPwOMxuhye8tb1GYG62QD9+w7e6nOmlC
 2GCPj4detnEM2QD/ciivkhespVKhHpZHRewAuSnJgHPSM45NQ3EVESzjWQ4=
 =snbx
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace updates from Steven Rostedt:

 - Have fprobes built on top of function graph infrastructure

   The fprobe logic is an optimized kprobe that uses ftrace to attach to
   functions when a probe is needed at the start or end of the function.
   The fprobe and kretprobe logic implements a similar method as the
   function graph tracer to trace the end of the function. That is to
   hijack the return address and jump to a trampoline to do the trace
   when the function exits. To do this, a shadow stack needs to be
   created to store the original return address. Fprobes and function
   graph do this slightly differently. Fprobes (and kretprobes) have
   slots per callsite that are reserved to save the return address. This
   is fine when just a few points are traced. But users of fprobes, such
   as BPF programs, are starting to add many more locations, and this
   method does not scale.

   The function graph tracer was created to trace all functions in the
   kernel. In order to do this, when function graph tracing is started,
   every task gets its own shadow stack to hold the return address that
   is going to be traced. The function graph tracer has been updated to
   allow multiple users to use its infrastructure. Now fprobes are one
   of those users. This will also allow the fprobe and kretprobe methods
   of tracing the return address to become obsolete. With new
   technologies like CFI that need to know about these methods of
   hijacking the return address, going toward a solution that has only
   one method of doing this will make the kernel less complex (see the
   registration sketch after this list).

 - Cleanup with guard() and free() helpers

   There were several places in the code that had a lot of "goto out" in
   the error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.

 - Remove disabling of interrupts in the function graph tracer

   When function graph tracer was first introduced, it could race with
   interrupts and NMIs. To prevent that race, it would disable
   interrupts and not trace NMIs. But the code has changed to allow NMIs
   and also interrupts. This change was done a long time ago, but the
   disabling of interrupts was never removed. Remove the disabling of
   interrupts in the function graph tracer as it is not needed. This
   greatly improves its performance.

 - Allow the :mod: command to enable tracing module functions on the
   kernel command line.

   The function tracer already has a way to enable functions to be
   traced in modules by writing ":mod:<module>" into set_ftrace_filter.
   That will enable either all the functions for the module if it is
   loaded, or if it is not, it will cache that command, and when the
   module is loaded that matches <module>, its functions will be
   enabled. This also allows init functions to be traced. But currently
   events do not have that feature.

   Because enabling function tracing can be done very early at boot up
   (before scheduling is enabled), the commands that can be done when
   function tracing is started is limited. Having the ":mod:" command to
   trace module functions as they are loaded is very useful. Update the
   kernel command line function filtering to allow it.
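
As a registration sketch for the fprobe-on-fgraph rework mentioned above
(handler signatures follow my reading of the new ftrace_regs-based API and
should be treated as illustrative, not as the exact interface):

  #include <linux/fprobe.h>
  #include <linux/printk.h>

  static int demo_entry(struct fprobe *fp, unsigned long entry_ip,
                        unsigned long ret_ip, struct ftrace_regs *fregs,
                        void *data)
  {
          pr_info("enter %pS\n", (void *)entry_ip);
          return 0;       /* non-zero cancels the exit handler */
  }

  static void demo_exit(struct fprobe *fp, unsigned long entry_ip,
                        unsigned long ret_ip, struct ftrace_regs *fregs,
                        void *data)
  {
          pr_info("exit  %pS\n", (void *)entry_ip);
  }

  static struct fprobe demo_fp = {
          .entry_handler = demo_entry,
          .exit_handler  = demo_exit,
  };

  /* e.g. from module init; the symbol filter is just an example */
  /* ret = register_fprobe(&demo_fp, "kernel_clone", NULL); */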

* tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits)
  ftrace: Implement :mod: cache filtering on kernel command line
  tracing: Adopt __free() and guard() for trace_fprobe.c
  bpf: Use ftrace_get_symaddr() for kprobe_multi probes
  ftrace: Add ftrace_get_symaddr to convert fentry_ip to symaddr
  Documentation: probes: Update fprobe on function-graph tracer
  selftests/ftrace: Add a test case for repeating register/unregister fprobe
  selftests: ftrace: Remove obsolate maxactive syntax check
  tracing/fprobe: Remove nr_maxactive from fprobe
  fprobe: Add fprobe_header encoding feature
  fprobe: Rewrite fprobe on function-graph tracer
  s390/tracing: Enable HAVE_FTRACE_GRAPH_FUNC
  ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNC
  bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled
  tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
  tracing: Add ftrace_fill_perf_regs() for perf event
  tracing: Add ftrace_partial_regs() for converting ftrace_regs to pt_regs
  fprobe: Use ftrace_regs in fprobe exit handler
  fprobe: Use ftrace_regs in fprobe entry handler
  fgraph: Pass ftrace_regs to retfunc
  fgraph: Replace fgraph_ret_regs with ftrace_regs
  ...
2025-01-21 15:15:28 -08:00
Linus Torvalds
0074adea39 ring-buffer changes for v6.14
- Clean up the __rb_map_vma() logic
 
   The logic of __rb_map_vma() has an error check with WARN_ON() that makes
   sure that the index does not go past the end of the array of buffers. The
   test in the loop pretty much guarantees that it will never happen, but
   since the relation of the variables used is a little complex, the
   WARN_ON() check was added. It was noticed that the array was dereferenced
   before this check, and if the logic does break and goes past the
   array for some reason, there will be an out-of-bounds access here.
   Move the access to after the WARN_ON().
 
 - Consolidate how the ring buffer is determined to be empty
 
   Currently there are two ways used to determine whether the ring buffer
   is empty. One relies on the status of the commit and reader pages and what
   was read, and the other is on what was written vs what was read. By using
   the number of entries (written) method, it can be used for reading events
   that are out of the kernel's control (what pKVM will use). Move to this
   method to make it easier to implement a pKVM ring buffer that the kernel
   can read.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ42XuRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qn3GAQCOQ94vr88FSXb/azC9281iDGYC/KbJ
 7J4dGv2rXHpoVAEAtXRXSXpG0mTIJ6TtgVKgMrIFAuT/AVo4EIUr2q/CsgA=
 =2G7c
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull trace ring-buffer updates from Steven Rostedt:

 - Clean up the __rb_map_vma() logic

   The logic of __rb_map_vma() has an error check with WARN_ON() that
   makes sure that the index does not go past the end of the array of
   buffers. The test in the loop pretty much guarantees that it will
   never happen, but since the relation of the variables used is a
   little complex, the WARN_ON() check was added. It was noticed that
   the array was dereferenced before this check, and if the logic does
   break and goes past the array for some reason, there will be an
   out-of-bounds access here. Move the access to after the
   WARN_ON().

 - Consolidate how the ring buffer is determined to be empty

   Currently there are two ways used to determine whether the ring
   buffer is empty. One relies on the status of the commit and reader
   pages and what was read, and the other is on what was written vs what
   was read. By using the number of entries (written) method, it can be
   used for reading events that are out of the kernel's control (what
   pKVM will use). Move to this method to make it easier to implement a
   pKVM ring buffer that the kernel can read.

* tag 'trace-ringbuffer-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Make reading page consistent with the code logic
  ring-buffer: Check for empty ring-buffer with rb_num_of_entries()
2025-01-21 15:11:54 -08:00
Steven Rostedt
8f21943e10 tracing: Fix output of set_event for some cached module events
The following works fine:

 ~# echo ':mod:trace_events_sample' > /sys/kernel/tracing/set_event
 ~# cat /sys/kernel/tracing/set_event
 *:*:mod:trace_events_sample
 ~#

But if a name is given without a ':' where it can match an event name or
system name, the output of the cached events does not include a newline:

 ~# echo 'foo_bar:mod:trace_events_sample' > /sys/kernel/tracing/set_event
 ~# cat /sys/kernel/tracing/set_event
 foo_bar:mod:trace_events_sample~#

Add the '\n' to that as well.
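
Illustrative shape of the fix (names are hypothetical, not the literal
code):

  /* print the cached match with a trailing newline, as the
   * event/system branch already does */
  seq_printf(m, "%s:mod:%s\n", match_str, mod_name);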

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250121151336.6c491844@gandalf.local.home
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-21 15:23:07 -05:00
Steven Rostedt
f95ee54294 tracing: Fix allocation of printing set_event file content
The adding of cached events for modules not loaded yet required a
descriptor to separate the iteration of events from the iteration of
cached events for a module. But the allocation used the size of the
pointer and not the size of the contents, and this caused a
slab-out-of-bounds access.
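
The underlying pattern (illustrative, not the actual diff) is the classic
sizeof-of-the-pointer mistake:

  struct set_event_iter *iter;    /* hypothetical struct name */

  iter = kmalloc(sizeof(iter), GFP_KERNEL);   /* BUG: pointer size only */
  iter = kmalloc(sizeof(*iter), GFP_KERNEL);  /* correct: struct size */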

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250121151236.47fcf433@gandalf.local.home
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/Z4_OHKESRSiJcr-b@lappy/
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-21 15:22:41 -05:00
Steven Rostedt
cd2375a356 ring-buffer: Do not allow events in NMI with generic atomic64 cmpxchg()
Some architectures can not safely do atomic64 operations in NMI context.
Since the ring buffer relies on atomic64 operations to do its time
keeping, if an event is requested in NMI context, reject it for these
architectures.
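
A sketch of the kind of guard this implies in the event-reserve path (the
config symbol and placement are my assumption):

  /* architectures using the generic, lock-based atomic64 fallback
   * cannot take that lock again safely from NMI context */
  if (IS_ENABLED(CONFIG_GENERIC_ATOMIC64) && in_nmi())
          return NULL;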

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Link: https://lore.kernel.org/20250120235721.407068250@goodmis.org
Fixes: c84897c0ff ("ring-buffer: Remove 32bit timestamp logic")
Closes: https://lore.kernel.org/all/86fb4f86-a0e4-45a2-a2df-3154acc4f086@gaisler.com/
Reported-by: Ludwig Rydberg <ludwig.rydberg@gaisler.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-21 15:19:00 -05:00
Linus Torvalds
6c4aa896eb Performance events changes for v6.14:
- Seqlock optimizations that arose in a perf context and were
    merged into the perf tree:
 
    - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
    - mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan)
    - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren Baghdasaryan)
    - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)
 
  - Core perf enhancements:
 
    - Reduce 'struct page' footprint of perf by mapping pages
      in advance (Lorenzo Stoakes)
    - Save raw sample data conditionally based on sample type (Yabin Cui)
    - Reduce sampling overhead by checking sample_type in
      perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin Cui)
    - Export perf_exclude_event() (Namhyung Kim)
 
  - Uprobes scalability enhancements: (Andrii Nakryiko)
 
    - Simplify find_active_uprobe_rcu() VMA checks
    - Add speculative lockless VMA-to-inode-to-uprobe resolution
    - Simplify session consumer tracking
    - Decouple return_instance list traversal and freeing
    - Ensure return_instance is detached from the list before freeing
    - Reuse return_instances between multiple uretprobes within task
    - Guard against kmemdup() failing in dup_return_instance()
 
  - AMD core PMU driver enhancements:
 
    - Relax privilege filter restriction on AMD IBS (Namhyung Kim)
 
  - AMD RAPL energy counters support: (Dhananjay Ugwekar)
 
    - Introduce topology_logical_core_id() (K Prateek Nayak)
 
    - Remove the unused get_rapl_pmu_cpumask() function
    - Remove the cpu_to_rapl_pmu() function
    - Rename rapl_pmu variables
    - Make rapl_model struct global
    - Add arguments to the init and cleanup functions
    - Modify the generic variable names to *_pkg*
    - Remove the global variable rapl_msrs
    - Move the cntr_mask to rapl_pmus struct
    - Add core energy counter support for AMD CPUs
 
  - Intel core PMU driver enhancements:
 
    - Support RDPMC 'metrics clear mode' feature (Kan Liang)
    - Clarify adaptive PEBS processing (Kan Liang)
    - Factor out functions for PEBS records processing (Kan Liang)
    - Simplify the PEBS records processing for adaptive PEBS (Kan Liang)
 
  - Intel uncore driver enhancements: (Kan Liang)
 
    - Convert buggy pmu->func_id use to pmu->registered
    - Support more units on Granite Rapids
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeOJdQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i2yQ/+MXl7yfJOgdbwjBpgGGzH4burEO7ppak+
 ktzz+YjpNgjODe/xMAJGjjblouuYArCnRolc1UPvPm6M7jSY76wi42Y6c4dRtFoB
 2ReSrRqnreLOcrRS9nsTjvWRHfJHqJDVSd9TfHX6ILfzbaizCZOGYk558ZxAKRqu
 Lw7FOvLEe/Y3tg4z8dDg083jsasalKySP9wIPc0BkSqQTOfusd3KXju/Fux/9wkn
 hZcUgF4ds+0bH7xtO1/G9ILqGyeq97X1McIR9bAjln5Mxykclen4hSjRaWWHHo9O
 mzBKmd/blIATisfuuW+QLDQow3M1k3688cz7e9QOeWHHd/dJiMb9RLV90jdND/T/
 uLINC5vNemzyWEfnNiYQ31LjhG3SeuDiKWzRp36MbQcCh6EBdRXWLBgtmxq1L/3o
 ZCaCdtFu5+6epycdyOVZEpWDnjdx4GmLXMZi5WJfZ7fZ/IFjNkjk4OdzI1iRQ+i3
 Sbi75ep59ayTUhm5AB7gCJsP3R7EsZsiPHUenQdA2n9Sj6xE+IuhlS/QDQ9g5mdY
 Ijs0jHeVCGmhYoOD1xWnCZSzlnkEVU3zwfypAK+MC7pgtFMwDy5/Bu1USGxXXDy+
 aKsrJRSgHbtZ1gwoHstqkV+DeCTfElCLYkvigzI5Nmyib5Zp4vkwy2ZLWQjaNjm7
 mqRI7PugUkU=
 =c8XB
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
 "Seqlock optimizations that arose in a perf context and were merged
  into the perf tree:

   - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
   - mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan)
   - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren
     Baghdasaryan)
   - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)

  Core perf enhancements:

   - Reduce 'struct page' footprint of perf by mapping pages in advance
     (Lorenzo Stoakes)
   - Save raw sample data conditionally based on sample type (Yabin Cui)
   - Reduce sampling overhead by checking sample_type in
     perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin
     Cui)
   - Export perf_exclude_event() (Namhyung Kim)

  Uprobes scalability enhancements: (Andrii Nakryiko)

   - Simplify find_active_uprobe_rcu() VMA checks
   - Add speculative lockless VMA-to-inode-to-uprobe resolution
   - Simplify session consumer tracking
   - Decouple return_instance list traversal and freeing
   - Ensure return_instance is detached from the list before freeing
   - Reuse return_instances between multiple uretprobes within task
   - Guard against kmemdup() failing in dup_return_instance()

  AMD core PMU driver enhancements:

   - Relax privilege filter restriction on AMD IBS (Namhyung Kim)

  AMD RAPL energy counters support: (Dhananjay Ugwekar)

   - Introduce topology_logical_core_id() (K Prateek Nayak)
   - Remove the unused get_rapl_pmu_cpumask() function
   - Remove the cpu_to_rapl_pmu() function
   - Rename rapl_pmu variables
   - Make rapl_model struct global
   - Add arguments to the init and cleanup functions
   - Modify the generic variable names to *_pkg*
   - Remove the global variable rapl_msrs
   - Move the cntr_mask to rapl_pmus struct
   - Add core energy counter support for AMD CPUs

  Intel core PMU driver enhancements:

   - Support RDPMC 'metrics clear mode' feature (Kan Liang)
   - Clarify adaptive PEBS processing (Kan Liang)
   - Factor out functions for PEBS records processing (Kan Liang)
   - Simplify the PEBS records processing for adaptive PEBS (Kan Liang)

  Intel uncore driver enhancements: (Kan Liang)

   - Convert buggy pmu->func_id use to pmu->registered
   - Support more units on Granite Rapids"

* tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  perf: map pages in advance
  perf/x86/intel/uncore: Support more units on Granite Rapids
  perf/x86/intel/uncore: Clean up func_id
  perf/x86/intel: Support RDPMC metrics clear mode
  uprobes: Guard against kmemdup() failing in dup_return_instance()
  perf/x86: Relax privilege filter restriction on AMD IBS
  perf/core: Export perf_exclude_event()
  uprobes: Reuse return_instances between multiple uretprobes within task
  uprobes: Ensure return_instance is detached from the list before freeing
  uprobes: Decouple return_instance list traversal and freeing
  uprobes: Simplify session consumer tracking
  uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  uprobes: simplify find_active_uprobe_rcu() VMA checks
  mm: introduce mmap_lock_speculate_{try_begin|retry}
  mm: convert mm_lock_seq to a proper seqcount
  mm/gup: Use raw_seqcount_try_begin()
  seqlock: add raw_seqcount_try_begin
  perf/x86/rapl: Add core energy counter support for AMD CPUs
  perf/x86/rapl: Move the cntr_mask to rapl_pmus struct
  perf/x86/rapl: Remove the global variable rapl_msrs
  ...
2025-01-21 10:52:03 -08:00
Linus Torvalds
1cbfb828e0 for-6.14/block-20250118
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB
 Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27
 hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K
 uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze
 QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi
 Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu
 YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje
 ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD
 bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR
 rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/
 CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA
 h3Rtbh+Pgg==
 =EXYs
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Keith:
      - Target support for PCI-Endpoint transport (Damien)
      - TCP IO queue spreading fixes (Sagi, Chaitanya)
      - Target handling for "limited retry" flags (Guixen)
      - Poll type fix (Yongsoo)
      - Xarray storage error handling (Keisuke)
      - Host memory buffer free size fix on error (Francis)

 - MD pull requests via Song:
      - Reintroduce md-linear (Yu Kuai)
      - md-bitmap refactor and fix (Yu Kuai)
      - Replace kmap_atomic with kmap_local_page (David Reaver)

 - Quite a few queue freeze and debugfs deadlock fixes

   Ming introduced lockdep support for this in the 6.13 kernel, and it
   has (unsurprisingly) uncovered quite a few issues

 - Use const attributes for IO schedulers

 - Remove bio ioprio wrappers

 - Fixes for stacked device atomic write support

 - Refactor queue affinity helpers, in preparation for better supporting
   isolated CPUs

 - Cleanups of loop O_DIRECT handling

 - Cleanup of BLK_MQ_F_* flags

 - Add rotational support for null_blk

 - Various fixes and cleanups

* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
  block: Don't trim an atomic write
  block: Add common atomic writes enable flag
  md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
  block: limit disk max sectors to (LLONG_MAX >> 9)
  block: Change blk_stack_atomic_writes_limits() unit_min check
  block: Ensure start sector is aligned for stacking atomic writes
  blk-mq: Move more error handling into blk_mq_submit_bio()
  block: Reorder the request allocation code in blk_mq_submit_bio()
  nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
  md/md-bitmap: move bitmap_{start, end}write to md upper layer
  md/raid5: implement pers->bitmap_sector()
  md: add a new callback pers->bitmap_sector()
  md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
  md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
  md: Replace deprecated kmap_atomic() with kmap_local_page()
  md: reintroduce md-linear
  partitions: ldm: remove the initial kernel-doc notation
  blk-cgroup: rwstat: fix kernel-doc warnings in header file
  blk-cgroup: fix kernel-doc warnings in header file
  nbd: fix partial sending
  ...
2025-01-20 19:38:46 -08:00
Steven Rostedt
22412b72ca tracing: Rename update_cache() to update_mod_cache()
The static function in trace_events.c called update_cache() is too generic
and conflicts with the function defined in arch/openrisc/include/asm/pgtable.h

Rename it to update_mod_cache() to make it less generic.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250120172756.4ecfb43f@batman.local.home
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501210550.Ufrj5CRn-lkp@intel.com/
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-20 19:09:16 -05:00
Linus Torvalds
1a89a6924b kernel-6.14-rc1.pid
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pR0wAKCRCRxhvAZXjc
 ojb2AQD5QfpTEX/ju1TkenTvoNl+JfnIjaVSY40Lm9DWYzmCMAEAuRvf5WRIV713
 00/RVOrUvsLobzhmnk0yw53EQ5A+pA0=
 =2NDA
 -----END PGP SIGNATURE-----

Merge tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pid_max namespacing update from Christian Brauner:
 "The pid_max sysctl is a global value. For a long time the default
  value has been 65535, and during the pidfd discussions Linus proposed to
  bump pid_max by default. Based on this discussion systemd started
  bumping pid_max to 2^22. So all new systems now run with a very high
  pid_max limit, with some distros having also backported that change.

  The decision to bump pid_max is obviously correct. It just doesn't
  make a lot of sense nowadays to enforce such a low pid number. There's
  sufficient tooling available to select specific processes without
  typing really large pid numbers.

  In any case, there are workloads that have expectations about how
  large a pid number they will accept, either for historical or
  architectural reasons. One concrete example is the 32-bit version of
  Android's bionic libc which requires pid numbers less than 65536.
  There are workloads where it is run in a 32-bit container on a 64-bit
  kernel. If the host has a pid_max value greater than 65535 the libc
  will abort thread creation because of size assumptions of
  pthread_mutex_t.

  That's a fairly specific use-case however, in general specific
  workloads that are moved into containers running on a host with a new
  kernel and a new systemd can run into issues with large pid_max
  values. Obviously making assumptions about the size of the allocated
  pid is suboptimal but we have userspace that does it.

  Of course, giving containers the ability to restrict the number of
  processes in their respective pid namespace independent of the global
  limit through pid_max is something desirable in itself and comes in
  handy in general.

  Independent of motivating use cases, the existence of pid namespaces
  also makes this a good semantic extension, and there have been prior
  proposals pushing in a similar direction. The trick here is to
  minimize the risk of regressions which I think is doable. The fact
  that pid namespaces are hierarchical will help us here.

  What we mostly care about is that when the host sets a low pid_max
  limit, say (crazy number) 100, that no descendant pid namespace can
  allocate a higher pid number in its namespace. Since pid allocation is
  hierarchical, this can be ensured by checking each pid allocation
  against the pid namespace's pid_max limit. This means if the
  allocation in the descendant pid namespace succeeds, the ancestor pid
  namespace can reject it. If the ancestor pid namespace has a higher
  limit than the descendant pid namespace the descendant pid namespace
  will reject the pid allocation. The ancestor pid namespace will
  obviously not care about this.

  All in all this means pid_max continues to enforce a system wide limit
  on the number of processes but allows pid namespaces sufficient leeway
  in handling workloads with assumptions about pid values and allows
  containers to restrict the number of processes in a pid namespace
  through the pid_max interface"
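
A quick illustration of the semantics described above, using the regular
pid_max sysctl path (the values are arbitrary):

  # on the host
  cat /proc/sys/kernel/pid_max          # e.g. 4194304

  # in a new pid namespace, lower the limit locally
  unshare --pid --fork --mount-proc sh -c \
      'echo 10000 > /proc/sys/kernel/pid_max; cat /proc/sys/kernel/pid_max'

  # a descendant namespace can tighten the limit for itself, but pid
  # allocation is still checked against every ancestor's pid_max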

* tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  tests/pid_namespace: add pid_max tests
  pid: allow pid_max to be set per pid namespace
2025-01-20 10:29:11 -08:00
Steven Rostedt
a925df6f50 tracing: Fix #if CONFIG_MODULES to #ifdef CONFIG_MODULES
A typo was introduced when adding the ":mod:" command that did
a "#if CONFIG_MODULES" instead of a "#ifdef CONFIG_MODULES".
Fix it.
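
The distinction, for reference:

  /* before: relies on the macro's value and trips -Wundef when unset */
  #if CONFIG_MODULES
  /* ... */
  #endif

  /* after: tests only whether the symbol is defined */
  #ifdef CONFIG_MODULES
  /* ... */
  #endif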

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250120125745.4ac90ca6@gandalf.local.home
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501190121.E2CIJuUj-lkp@intel.com/
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-20 13:03:23 -05:00
Steven Rostedt
31f505dc70 ftrace: Implement :mod: cache filtering on kernel command line
Module functions can be added to set_ftrace_filter before the module is
loaded.

  # echo :mod:snd_hda_intel > set_ftrace_filter

This will enable all the functions for the module snd_hda_intel. If that
module is not loaded, the command is "cached" in the trace array so that
when the module is loaded, its functions will be traced.

But this is not implemented in the kernel command line. That's because the
kernel command line filtering is added very early in boot up, as it needs
to be done before boot-time function tracing can start, which is
also available very early in boot up. The code used by the
"set_ftrace_filter" file can not be used that early as it depends on some
other initialization to occur first. But some of the functions can.

Implement the ":mod:" feature of "set_ftrace_filter" in the kernel command
line parsing. Now function tracing on just a single module that is loaded
at boot up can be done.

Adding:

 ftrace=function ftrace_filter=:mod:snd_hda_intel

to the kernel command line will only enable the snd_hda_intel module
functions when the module is loaded, and it will start tracing.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250116175832.34e39779@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16 21:27:10 -05:00
Masami Hiramatsu (Google)
8275637215 tracing: Adopt __free() and guard() for trace_fprobe.c
Adopt __free() and guard() for trace_fprobe.c to remove gotos.

Link: https://lore.kernel.org/173708043449.319651.12242878905778792182.stgit@mhiramat.roam.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16 21:27:07 -05:00
Steven Rostedt
b355247df1 tracing: Cache ":mod:" events for modules not loaded yet
When the :mod: command is written into /sys/kernel/tracing/set_event (or
that file within an instance), if the module specified after the ":mod:"
is not yet loaded, it will store that string internally. When the module
is loaded, it will enable the events as if the module was loaded when the
string was written into the set_event file.

This can also be useful to enable events that are in the init section of
the module, as the events are enabled before the init section is executed.

This also works on the kernel command line:

 trace_event=:mod:<module>

Will enable the events for <module> when it is loaded.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250116143533.514730995@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16 09:41:08 -05:00
Steven Rostedt
4c86bc531e tracing: Add :mod: command to enable module events
Add a :mod: command to enable only events from a given module from the
set_event file.

  echo '*:mod:<module>' > set_event

Or

  echo ':mod:<module>' > set_event

Either will enable all events for that module. Specific events can also be
enabled via:

  echo '<event>:mod:<module>' > set_event

Or

  echo '<system>:<event>:mod:<module>' > set_event

Or

  echo '*:<event>:mod:<module>' > set_event

The ":mod:" keyword is consistent with the function tracing filter used to
enable functions from a given module.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250116143533.214496360@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16 09:41:07 -05:00
Puranjay Mohan
87c544108b bpf: Send signals asynchronously if !preemptible
BPF programs can execute in all kinds of contexts and when a program
running in a non-preemptible context uses the bpf_send_signal() kfunc,
it will cause issues because this kfunc can sleep.
Change `irqs_disabled()` to `!preemptible()`.
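
In outline, the change amounts to widening the deferral condition (a
sketch with a hypothetical helper, not the literal hunk):

  /* before: only the hard-IRQ-disabled case was deferred */
  if (irqs_disabled())
          queue_signal_work();    /* hypothetical async-delivery helper */

  /* after: any non-preemptible context defers the signal */
  if (!preemptible())
          queue_signal_work();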

Reported-by: syzbot+97da3d7e0112d59971de@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67486b09.050a0220.253251.0084.GAE@google.com/
Fixes: 1bc7896e9e ("bpf: Fix deadlock with rq_lock in bpf_send_signal()")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250115103647.38487-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-15 13:44:08 -08:00
Shrikanth Hegde
24e0e61040 tracing: Print lazy preemption model
Print lazy preemption model in ftrace header when latency-format=1.

 # cat /sys/kernel/debug/sched/preempt
 none voluntary full (lazy)

Without patch:
  latency: 0 us, #232946/232946, CPU#40 | (M:unknown VP:0, KP:0, SP:0 HP:0 #P:80)
                                             ^^^^^^^

With Patch:
  latency: 0 us, #1897938/25566788, CPU#16 | (M:lazy VP:0, KP:0, SP:0 HP:0 #P:80)
                                                ^^^^

Now that lazy preemption is part of the kernel, make sure the tracing
infrastructure reflects that.
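
A sketch of how the model string can be derived (the preempt_model_*()
accessors are assumed, with preempt_model_lazy() being my guess for the
lazy case):

  static const char *demo_preempt_model_str(void)
  {
          if (preempt_model_lazy())       /* assumed accessor */
                  return "lazy";
          if (preempt_model_full())
                  return "full";
          if (preempt_model_voluntary())
                  return "voluntary";
          return "none";
  }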

Link: https://lore.kernel.org/20250103093647.575919-1-sshegde@linux.ibm.com
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-14 09:44:33 -05:00
Steven Rostedt
a485ea9e3e tracing: Fix irqsoff and wakeup latency tracers when using function graph
The function graph tracer has become generic so that kretprobes and BPF
can use it along with function graph tracing itself. Some of the
infrastructure was specific for function graph tracing such as recording
the calltime and return time of the functions. Calling the clock code on a
high volume function does add overhead. The calculation of the calltime
was removed from the generic code and placed into the function graph
tracer itself so that the other users did not incur this overhead as they
did not need that timestamp.

The calltime field was still kept in the generic return entry structure
and the function graph return entry callback filled it as that structure
was passed to other code.

But this broke both the irqsoff and wakeup latency tracers, as they still
depended on the trace structure containing the calltime when the option
display-graph is set, since they used some of the same functions that the
function graph tracer used. But now the calltime was not set and was just
zero. This caused the calculated function time to be the absolute
value of the return timestamp and not the length of the function.

 # cd /sys/kernel/tracing
 # echo 1 > options/display-graph
 # echo irqsoff > current_tracer

The tracers went from:

 #   REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
 #      |          |     |    |        ||||      |   |                     |   |   |   |
        0 us |   4)    <idle>-0    |  d..1. |   0.000 us    |  irqentry_enter();
        3 us |   4)    <idle>-0    |  d..2. |               |  irq_enter_rcu() {
        4 us |   4)    <idle>-0    |  d..2. |   0.431 us    |    preempt_count_add();
        5 us |   4)    <idle>-0    |  d.h2. |               |    tick_irq_enter() {
        5 us |   4)    <idle>-0    |  d.h2. |   0.433 us    |      tick_check_oneshot_broadcast_this_cpu();
        6 us |   4)    <idle>-0    |  d.h2. |   2.426 us    |      ktime_get();
        9 us |   4)    <idle>-0    |  d.h2. |               |      tick_nohz_stop_idle() {
       10 us |   4)    <idle>-0    |  d.h2. |   0.398 us    |        nr_iowait_cpu();
       11 us |   4)    <idle>-0    |  d.h1. |   1.903 us    |      }
       11 us |   4)    <idle>-0    |  d.h2. |               |      tick_do_update_jiffies64() {
       12 us |   4)    <idle>-0    |  d.h2. |               |        _raw_spin_lock() {
       12 us |   4)    <idle>-0    |  d.h2. |   0.360 us    |          preempt_count_add();
       13 us |   4)    <idle>-0    |  d.h3. |   0.354 us    |          do_raw_spin_lock();
       14 us |   4)    <idle>-0    |  d.h2. |   2.207 us    |        }
       15 us |   4)    <idle>-0    |  d.h3. |   0.428 us    |        calc_global_load();
       16 us |   4)    <idle>-0    |  d.h3. |               |        _raw_spin_unlock() {
       16 us |   4)    <idle>-0    |  d.h3. |   0.380 us    |          do_raw_spin_unlock();
       17 us |   4)    <idle>-0    |  d.h3. |   0.334 us    |          preempt_count_sub();
       18 us |   4)    <idle>-0    |  d.h1. |   1.768 us    |        }
       18 us |   4)    <idle>-0    |  d.h2. |               |        update_wall_time() {
      [..]

To:

 #   REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
 #      |          |     |    |        ||||      |   |                     |   |   |   |
        0 us |   5)    <idle>-0    |  d.s2. |   0.000 us    |  _raw_spin_lock_irqsave();
        0 us |   5)    <idle>-0    |  d.s3. |   312159583 us |      preempt_count_add();
        2 us |   5)    <idle>-0    |  d.s4. |   312159585 us |      do_raw_spin_lock();
        3 us |   5)    <idle>-0    |  d.s4. |               |      _raw_spin_unlock() {
        3 us |   5)    <idle>-0    |  d.s4. |   312159586 us |        do_raw_spin_unlock();
        4 us |   5)    <idle>-0    |  d.s4. |   312159587 us |        preempt_count_sub();
        4 us |   5)    <idle>-0    |  d.s2. |   312159587 us |      }
        5 us |   5)    <idle>-0    |  d.s3. |               |      _raw_spin_lock() {
        5 us |   5)    <idle>-0    |  d.s3. |   312159588 us |        preempt_count_add();
        6 us |   5)    <idle>-0    |  d.s4. |   312159589 us |        do_raw_spin_lock();
        7 us |   5)    <idle>-0    |  d.s3. |   312159590 us |      }
        8 us |   5)    <idle>-0    |  d.s4. |   312159591 us |      calc_wheel_index();
        9 us |   5)    <idle>-0    |  d.s4. |               |      enqueue_timer() {
        9 us |   5)    <idle>-0    |  d.s4. |               |        wake_up_nohz_cpu() {
       11 us |   5)    <idle>-0    |  d.s4. |               |          native_smp_send_reschedule() {
       11 us |   5)    <idle>-0    |  d.s4. |   312171987 us |            default_send_IPI_single_phys();
    12408 us |   5)    <idle>-0    |  d.s3. |   312171990 us |          }
    12408 us |   5)    <idle>-0    |  d.s3. |   312171991 us |        }
    12409 us |   5)    <idle>-0    |  d.s3. |   312171991 us |      }

Where the calculation of the time for each function was the return time
minus zero and not the time of when the function returned.

Have these tracers also save the calltime in the fgraph data section and
retrieve it again on the return to get the correct timings again.
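
A sketch of the save/retrieve pattern using the fgraph storage helpers
(call sites and variable names are illustrative):

  /* in the graph-entry callback: stash the timestamp */
  u64 *calltime = fgraph_reserve_data(gops->idx, sizeof(*calltime));
  if (calltime)
          *calltime = trace_clock_local();

  /* in the graph-return callback: pull it back out */
  int size;
  u64 *start = fgraph_retrieve_data(gops->idx, &size);
  if (start)
          duration = trace_clock_local() - *start;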

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250113183124.61767419@gandalf.local.home
Fixes: f1f36e22be ("ftrace: Have calltime be saved in the fgraph storage")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-14 09:38:09 -05:00
Jeongjun Park
6e31b759b0 ring-buffer: Make reading page consistent with the code logic
In the loop of __rb_map_vma(), the 's' variable is calculated with the
same logic as nr_pages, and they both come from nr_subbufs. But the
relationship is not obvious, and there's a WARN_ON_ONCE() check on the 's'
variable to make sure it never becomes equal to nr_subbufs within the
loop. If that happens, then the code is buggy and needs to be fixed.

The 'page' variable is calculated from cpu_buffer->subbuf_ids[s] which is
an array of 'nr_subbufs' entries. If the code becomes buggy and 's'
becomes equal to or greater than 'nr_subbufs', then this will be an
out-of-bounds access before the WARN_ON() triggers and the code can exit
safely.

Make the 'page' initialization consistent with the code logic and assign
it after the out of bounds check.
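
In outline, the fix moves the dereference below the check (a sketch of the
shape, not the exact code):

  if (WARN_ON_ONCE(s >= nr_subbufs))
          break;
  /* 's' is now known to be in range for the subbuf_ids[] array */
  page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);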

Link: https://lore.kernel.org/20250110162612.13983-1-aha310510@gmail.com
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
[ sdr: rewrote change log ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-13 16:05:43 -05:00
Vincent Donnefort
0568c6ebf0 ring-buffer: Check for empty ring-buffer with rb_num_of_entries()
Currently there are two ways of identifying an empty ring-buffer. One
relying on the current status of the commit / reader page
(rb_per_cpu_empty()) and the other on the write and read counters
(rb_num_of_entries() used in rb_get_reader_page()).

Consolidate the empty ring-buffer check with rb_num_of_entries(). This
intends to ease the later introduction of ring-buffer writers which are
out of the kernel's control and for whom the only information available
is through the meta-page counters.
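
The consolidated check then reduces to the counters (sketch):

  static bool rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
  {
          /* empty means every written entry has already been read */
          return !rb_num_of_entries(cpu_buffer);
  }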

Link: https://lore.kernel.org/20250108114536.627715-2-vdonnefort@google.com
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-13 15:39:50 -05:00
Masami Hiramatsu (Google)
4f7caaa2f9 bpf: Use ftrace_get_symaddr() for kprobe_multi probes
Add ftrace_get_entry_ip() which is only for ftrace based probes, and use
it for kprobe multi probes because they are based on fprobe which uses
ftrace instead of kprobes.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/173566081414.878879.10631096557346094362.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-13 15:04:30 -05:00
Masami Hiramatsu (Google)
927054606d tracing/kprobes: Simplify __trace_kprobe_create() by removing gotos
Simplify __trace_kprobe_create() by removing gotos.

Link: https://lore.kernel.org/all/173643301102.1514810.6149004416601259466.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-10 09:01:14 +09:00
Masami Hiramatsu (Google)
7dcc352078 tracing: Use __free() for kprobe events to cleanup
Use __free() in trace_kprobe.c to clean up the code.

Link: https://lore.kernel.org/all/173643299989.1514810.2924926552980462072.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-10 09:01:01 +09:00
Masami Hiramatsu (Google)
4af0532a0f tracing: Use __free() in trace_probe for cleanup
Use __free() in trace_probe to clean up some gotos.

Link: https://lore.kernel.org/all/173643298860.1514810.7267350121047606213.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-10 09:00:13 +09:00
Masami Hiramatsu (Google)
4e83017e4c tracing/eprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in eprobe events for critical sections
rather than discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289890996.73724.17421347964110362029.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-01-10 09:00:12 +09:00
Masami Hiramatsu (Google)
f8821732dc tracing/uprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in uprobe events for critical sections
rather than discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289889911.73724.12457932738419630525.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-01-10 09:00:12 +09:00
Masami Hiramatsu (Google)
2cba0070cd tracing/kprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in kprobe events for critical sections
rather than discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289888883.73724.6586200652276577583.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-01-10 09:00:12 +09:00
Masami Hiramatsu (Google)
30c8fd31c5 tracing/kprobes: Fix to free objects when failed to copy a symbol
In __trace_kprobe_create(), if something fails it must goto the error
block to free objects. But when strdup() of a symbol fails, it returns
without doing so. Fix it to goto the error block so the objects are freed
correctly.

Link: https://lore.kernel.org/all/173643297743.1514810.2408159540454241947.stgit@devnote2/

Fixes: 6212dd2968 ("tracing/kprobes: Use dyn_event framework for kprobe events")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-10 08:57:18 +09:00
Jiri Olsa
2ebadb60cb bpf: Return error for missed kprobe multi bpf program execution
When a kprobe multi bpf program can't be executed due to the recursion
check, we currently return 0 (success) to the fprobe layer, where it's
ignored for standard kprobe multi probes.

For a kprobe session, the success return value will make the fprobe layer
install the return probe and try to execute it as well.

But the return session probe should not get executed, because the entry
part did not run. FWIW the return probe bpf program most likely won't get
executed, because its recursion check will likely fail as well, but we
don't need to run it in the first place, and we can make this clear
and obvious.

It also affects missed counts for kprobe session program execution, which
are now doubled (an extra count for the unexecuted return probe).
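
For context, a kprobe-session program's entry return value is what gates
the exit part (section name and kfunc declaration per the BPF session API
as I understand it; illustrative):

  #include <bpf/bpf_helpers.h>

  extern bool bpf_session_is_return(void) __ksym;

  SEC("kprobe.session/do_sys_openat2")
  int handle(void *ctx)
  {
          if (bpf_session_is_return())
                  return 0;       /* exit part: only runs if entry ran */

          /* entry part: returning non-zero tells fprobe not to install
           * the return probe at all */
          return 0;
  }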

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250106175048.1443905-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 09:39:58 -08:00