2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

2860 Commits

Author SHA1 Message Date
James Clark
8988c4b919 perf tools: Fix in-source libperf build
When libperf is built alone in-source, $(OUTPUT) isn't set. This causes
the generated uapi path to resolve to '/../arch' which results in a
permissions error:

  mkdir: cannot create directory '/../arch': Permission denied

Fix it by removing the preceding '/..' which means that it gets
generated either in the tools/lib/perf part of the tree or the OUTPUT
folder. Some other rules that rely on OUTPUT further refine this
conditionally depending on whether it's an in-source or out-of-source
build, but I don't think we need the extra complexity here. And this
rule is slightly different to others because the header is needed by
both libperf and Perf. This is further complicated by the fact that Perf
always passes O=... to libperf even for in source builds, meaning that
OUTPUT isn't set consistently between projects.

Because we're no longer going one level up to try to generate the file
in the tools/ folder, Perf's include rule needs to descend into libperf.
Also fix the clean rule while we're here.

Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Closes: https://lore.kernel.org/linux-perf-users/7703f88e-ccb7-4c98-9da4-8aad224e780f@leemhuis.info/
Fixes: bfb713ea53 ("perf tools: Fix arm64 build by generating unistd_64.h")
Signed-off-by: James Clark <james.clark@linaro.org>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Link: https://lore.kernel.org/r/20250429-james-perf-fix-libperf-in-source-build-v1-1-a1a827ac15e5@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-04-29 12:32:31 -07:00
James Clark
bfb713ea53 perf tools: Fix arm64 build by generating unistd_64.h
Since pulling in the kernel changes in commit 22f72088ff ("tools
headers: Update the syscall table with the kernel sources"), arm64 is
no longer using a generic syscall header and generates one from the
syscall table. Therefore we must also generate the syscall header for
arm64 before building Perf.

Add it as a dependency to libperf which uses one syscall number. Perf
uses more, but as libperf is a dependency of Perf it will be generated
for both.

Future platforms that need this will have to add their own syscall-y
targets in libperf manually. Unfortunately the arch specific files that
do this (e.g. arch/arm64/include/asm/Kbuild) can't easily be imported
into the Perf build. But Perf only needs a subset of the generated files
anyway, so redefining them is probably the correct thing to do.

Fixes: 22f72088ff ("tools headers: Update the syscall table with the kernel sources")
Signed-off-by: James Clark <james.clark@linaro.org>
Tested-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Link: https://lore.kernel.org/r/20250417-james-perf-fix-gen-syscall-v1-1-1d268c923901@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-04-23 08:57:12 -07:00
Linus Torvalds
d6b02199cd - The 7 patch series "powerpc/crash: use generic crashkernel
reservation" from Sourabh Jain changes powerpc's kexec code to use more
   of the generic layers.
 
 - The 2 patch series "get_maintainer: report subsystem status
   separately" from Vlastimil Babka makes some long-requested improvements
   to the get_maintainer output.
 
 - The 4 patch series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizing the refcounting in the ucount
   code.
 
 - The 12 patch series "reboot: support runtime configuration of
   emergency hw_protection action" from Ahmad Fatoum improves the ability
   for a driver to perform an emergency system shutdown or reboot.
 
 - The 16 patch series "Converge on using secs_to_jiffies() part two"
   from Easwar Hariharan performs further migrations from
   msecs_to_jiffies() to secs_to_jiffies().
 
 - The 7 patch series "lib/interval_tree: add some test cases and
   cleanup" from Wei Yang permits more userspace testing of kernel library
   code, adds some more tests and performs some cleanups.
 
 - The 2 patch series "hung_task: Dump the blocking task stacktrace" from
   Masami Hiramatsu arranges for the hung_task detector to dump the stack
   of the blocking task and not just that of the blocked task.
 
 - The 4 patch series "resource: Split and use DEFINE_RES*() macros" from
   Andy Shevchenko provides some cleanups to the resource definition
   macros.
 
 - Plus the usual shower of singleton patches - please see the individual
   changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nuqwAKCRDdBJ7gKXxA
 jtNqAQDxqJpjWkzn4yN9CNSs1ivVx3fr6SqazlYCrt3u89WQvwEA1oRrGpETzUGq
 r6khQUIcQImPPcjFqEFpuiSOU0MBZA0=
 =Kii8
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - The series "powerpc/crash: use generic crashkernel reservation" from
   Sourabh Jain changes powerpc's kexec code to use more of the generic
   layers.

 - The series "get_maintainer: report subsystem status separately" from
   Vlastimil Babka makes some long-requested improvements to the
   get_maintainer output.

 - The series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizing the refcounting in the
   ucount code.

 - The series "reboot: support runtime configuration of emergency
   hw_protection action" from Ahmad Fatoum improves the ability for a
   driver to perform an emergency system shutdown or reboot.

 - The series "Converge on using secs_to_jiffies() part two" from Easwar
   Hariharan performs further migrations from msecs_to_jiffies() to
   secs_to_jiffies().

 - The series "lib/interval_tree: add some test cases and cleanup" from
   Wei Yang permits more userspace testing of kernel library code, adds
   some more tests and performs some cleanups.

 - The series "hung_task: Dump the blocking task stacktrace" from Masami
   Hiramatsu arranges for the hung_task detector to dump the stack of
   the blocking task and not just that of the blocked task.

 - The series "resource: Split and use DEFINE_RES*() macros" from Andy
   Shevchenko provides some cleanups to the resource definition macros.

 - Plus the usual shower of singleton patches - please see the
   individual changelogs for details.

* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  mailmap: consolidate email addresses of Alexander Sverdlin
  fs/procfs: fix the comment above proc_pid_wchan()
  relay: use kasprintf() instead of fixed buffer formatting
  resource: replace open coded variant of DEFINE_RES()
  resource: replace open coded variants of DEFINE_RES_*_NAMED()
  resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
  resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
  samples: add hung_task detector mutex blocking sample
  hung_task: show the blocker task if the task is hung on mutex
  kexec_core: accept unaccepted kexec segments' destination addresses
  watchdog/perf: optimize bytes copied and remove manual NUL-termination
  lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
  lib/interval_tree: skip the check before go to the right subtree
  lib/interval_tree: add test case for span iteration
  lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
  lib/rbtree: add random seed
  lib/rbtree: split tests
  lib/rbtree: enable userland test suite for rbtree related data structure
  checkpatch: describe --min-conf-desc-length
  scripts/gdb/symbols: determine KASLR offset on s390
  ...
2025-04-01 10:06:52 -07:00
Linus Torvalds
802f0d58d5 perf tools changes for v6.15
perf record
 -----------
 * Introduce latency profiling using scheduler information.  The latency
   profiling is to show impacts on wall-time rather than cpu-time.  By
   tracking context switches, it can weight samples and find which part
   of the code contributed more to the execution latency.
 
   The value (period) of the sample is weighted by dividing it by the
   number of parallel execution at the moment.  The parallelism is
   tracked in perf report with sched-switch records.  This will reduce
   the portion that are run in parallel and in turn increase the portion
   of serial executions.
 
   For now, it's limited to profile processes, IOW system-wide profiling
   is not supported.  You can add --latency option to enable this.
 
     $ perf record --latency -- make -C tools/perf
 
   I've run the above command for perf build which adds -j option to
   make with the number of CPUs in the system internally.  Normally
   it'd show something like below:
 
     $ perf report -F overhead,comm
     ...
     #
     # Overhead  Command
     # ........  ...............
     #
         78.97%  cc1
          6.54%  python3
          4.21%  shellcheck
          3.28%  ld
          1.80%  as
          1.37%  cc1plus
          0.80%  sh
          0.62%  clang
          0.56%  gcc
          0.44%  perl
          0.39%  make
 	 ...
 
   The cc1 takes around 80% of the overhead as it's the actual compiler.
   However it runs in parallel so its contribution to latency may be less
   than that.  Now, perf report will show both overhead and latency (if
   --latency was given at record time) like below:
 
     $ perf report -s comm
     ...
     #
     # Overhead   Latency  Command
     # ........  ........  ...............
     #
         78.97%    48.66%  cc1
          6.54%    25.68%  python3
          4.21%     0.39%  shellcheck
          3.28%    13.70%  ld
          1.80%     2.56%  as
          1.37%     3.08%  cc1plus
          0.80%     0.98%  sh
          0.62%     0.61%  clang
          0.56%     0.33%  gcc
          0.44%     1.71%  perl
          0.39%     0.83%  make
 	 ...
 
   You can see latency of cc1 goes down to around 50% and python3 and ld
   contribute a lot more than their overhead.  You can use --latency
   option in perf report to get the same result but ordered by latency.
 
     $ perf report --latency -s comm
 
 perf report
 -----------
 * As a side effect of the latency profiling work, it adds a new output
   field 'latency' and a sort key 'parallelism'.  The below is a result
   from my system with 64 CPUs.  The build was well-parallelized but
   contained some serial portions.
 
     $ perf report -s parallelism
     ...
     #
     # Overhead   Latency  Parallelism
     # ........  ........  ...........
     #
         16.95%     1.54%           62
         13.38%     1.24%           61
         12.50%    70.47%            1
         11.81%     1.06%           63
          7.59%     0.71%           60
          4.33%    12.20%            2
          3.41%     0.33%           59
          2.05%     0.18%           64
          1.75%     1.09%            9
          1.64%     1.85%            5
          ...
 
 * Support Feodra mini-debuginfo which is a LZMA compressed symbol table
   inside ".gnu_debugdata" ELF section.
 
 perf annotate
 -------------
 * Add --code-with-type option to enable data-type profiling with the
   usual annotate output.  Instead of focusing on data structure, it
   shows code annotation together with data type it accesses in case the
   instruction refers to a memory location (and it was able to resolve
   the target data type).  Currently it only works with --stdio.
 
     $ perf annotate --stdio --code-with-type
     ...
      Percent |      Source code & Disassembly of vmlinux for cpu/mem-loads,ldlat=30/pp (18 samples, percent: local period)
     ----------------------------------------------------------------------------------------------------------------------
              : 0                0xffffffff81050610 <__fdget>:
         0.00 :   ffffffff81050610:        callq   0xffffffff81c01b80 <__fentry__>           # data-type: (stack operation)
         0.00 :   ffffffff81050615:        pushq   %rbp              # data-type: (stack operation)
         0.00 :   ffffffff81050616:        movq    %rsp, %rbp
         0.00 :   ffffffff81050619:        pushq   %r15              # data-type: (stack operation)
         0.00 :   ffffffff8105061b:        pushq   %r14              # data-type: (stack operation)
         0.00 :   ffffffff8105061d:        pushq   %rbx              # data-type: (stack operation)
         0.00 :   ffffffff8105061e:        subq    $0x10, %rsp
         0.00 :   ffffffff81050622:        movl    %edi, %ebx
         0.00 :   ffffffff81050624:        movq    %gs:0x7efc4814(%rip), %rax  # 0x14e40 <current_task>              # data-type: struct task_struct* +0
         0.00 :   ffffffff8105062c:        movq    0x8d0(%rax), %r14         # data-type: struct task_struct +0x8d0 (files)
         0.00 :   ffffffff81050633:        movl    (%r14), %eax              # data-type: struct files_struct +0 (count.counter)
         0.00 :   ffffffff81050636:        cmpl    $0x1, %eax
         0.00 :   ffffffff81050639:        je      0xffffffff810506a9 <__fdget+0x99>
         0.00 :   ffffffff8105063b:        movq    0x20(%r14), %rcx          # data-type: struct files_struct +0x20 (fdt)
         0.00 :   ffffffff8105063f:        movl    (%rcx), %eax              # data-type: struct fdtable +0 (max_fds)
         0.00 :   ffffffff81050641:        cmpl    %ebx, %eax
         0.00 :   ffffffff81050643:        jbe     0xffffffff810506ef <__fdget+0xdf>
         0.00 :   ffffffff81050649:        movl    %ebx, %r15d
         5.56 :   ffffffff8105064c:        movq    0x8(%rcx), %rdx           # data-type: struct fdtable +0x8 (fd)
 	...
 
   The "# data-type:" part was added with this change.  The first few
   entries are not very interesting.  But later you can it accesses
   a couple of fields in the task_struct, files_struct and fdtable.
 
 perf trace
 ----------
 * Support syscall tracing for different ABI.  For example it can trace
   system calls for 32-bit applications on 64-bit kernel transparently.
 
 * Add --summary-mode=total option to show global syscall summary.  The
   default is 'thread' to show per-thread syscall summary.
 
 Python support
 --------------
 * Add more interfaces to 'perf' module to parse events, and config,
   enable or disable the event list properly so that it can implement
   basic functionalities purely in Python.  There is an example code
   for these new interfaces in python/tracepoint.py.
 
 * Add mypy and pylint support to enable build time checking.  Fix
   some code based on the findings from these tools.
 
 Internals
 ---------
 * Introduce io_dir__readdir() API to make directory traveral (usually
   for proc or sysfs) efficient with less memory footprint.
 
 JSON vendor events
 ------------------
 * Add events and metrics for ARM Neoverse N3 and V3
 * Update events and metrics on various Intel CPUs
 * Add/update events for a number of SiFive processors
 
 Signed-off-by: Namhyung Kim <namhyung@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSo2x5BnqMqsoHtzsmMstVUGiXMgwUCZ+Y/YQAKCRCMstVUGiXM
 g64CAP9sdUgKCe66JilZRAEy9d3Cp835v7KjHSlCXLXkRSoy6wD+KH96Y0OqVtGw
 GYON4s9ZoXAoH+J0lKIFSffVL9MhYgY=
 =7L7B
 -----END PGP SIGNATURE-----

Merge tag 'perf-tools-for-v6.15-2025-03-27' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools updates from Namhyung Kim:
 "perf record:

   - Introduce latency profiling using scheduler information.

     The latency profiling is to show impacts on wall-time rather than
     cpu-time. By tracking context switches, it can weight samples and
     find which part of the code contributed more to the execution
     latency.

     The value (period) of the sample is weighted by dividing it by the
     number of parallel execution at the moment. The parallelism is
     tracked in perf report with sched-switch records. This will reduce
     the portion that are run in parallel and in turn increase the
     portion of serial executions.

     For now, it's limited to profile processes, IOW system-wide
     profiling is not supported. You can add --latency option to enable
     this.

       $ perf record --latency -- make -C tools/perf

     I've run the above command for perf build which adds -j option to
     make with the number of CPUs in the system internally. Normally
     it'd show something like below:

       $ perf report -F overhead,comm
       ...
       #
       # Overhead  Command
       # ........  ...............
       #
           78.97%  cc1
            6.54%  python3
            4.21%  shellcheck
            3.28%  ld
            1.80%  as
            1.37%  cc1plus
            0.80%  sh
            0.62%  clang
            0.56%  gcc
            0.44%  perl
            0.39%  make
  	 ...

     The cc1 takes around 80% of the overhead as it's the actual
     compiler. However it runs in parallel so its contribution to
     latency may be less than that. Now, perf report will show both
     overhead and latency (if --latency was given at record time) like
     below:

       $ perf report -s comm
       ...
       #
       # Overhead   Latency  Command
       # ........  ........  ...............
       #
           78.97%    48.66%  cc1
            6.54%    25.68%  python3
            4.21%     0.39%  shellcheck
            3.28%    13.70%  ld
            1.80%     2.56%  as
            1.37%     3.08%  cc1plus
            0.80%     0.98%  sh
            0.62%     0.61%  clang
            0.56%     0.33%  gcc
            0.44%     1.71%  perl
            0.39%     0.83%  make
  	 ...

     You can see latency of cc1 goes down to around 50% and python3 and
     ld contribute a lot more than their overhead. You can use --latency
     option in perf report to get the same result but ordered by
     latency.

       $ perf report --latency -s comm

  perf report:

   - As a side effect of the latency profiling work, it adds a new
     output field 'latency' and a sort key 'parallelism'. The below is a
     result from my system with 64 CPUs. The build was well-parallelized
     but contained some serial portions.

       $ perf report -s parallelism
       ...
       #
       # Overhead   Latency  Parallelism
       # ........  ........  ...........
       #
           16.95%     1.54%           62
           13.38%     1.24%           61
           12.50%    70.47%            1
           11.81%     1.06%           63
            7.59%     0.71%           60
            4.33%    12.20%            2
            3.41%     0.33%           59
            2.05%     0.18%           64
            1.75%     1.09%            9
            1.64%     1.85%            5
            ...

   - Support Feodra mini-debuginfo which is a LZMA compressed symbol
     table inside ".gnu_debugdata" ELF section.

  perf annotate:

   - Add --code-with-type option to enable data-type profiling with the
     usual annotate output.

     Instead of focusing on data structure, it shows code annotation
     together with data type it accesses in case the instruction refers
     to a memory location (and it was able to resolve the target data
     type). Currently it only works with --stdio.

       $ perf annotate --stdio --code-with-type
       ...
        Percent |      Source code & Disassembly of vmlinux for cpu/mem-loads,ldlat=30/pp (18 samples, percent: local period)
       ----------------------------------------------------------------------------------------------------------------------
                : 0                0xffffffff81050610 <__fdget>:
           0.00 :   ffffffff81050610:        callq   0xffffffff81c01b80 <__fentry__>           # data-type: (stack operation)
           0.00 :   ffffffff81050615:        pushq   %rbp              # data-type: (stack operation)
           0.00 :   ffffffff81050616:        movq    %rsp, %rbp
           0.00 :   ffffffff81050619:        pushq   %r15              # data-type: (stack operation)
           0.00 :   ffffffff8105061b:        pushq   %r14              # data-type: (stack operation)
           0.00 :   ffffffff8105061d:        pushq   %rbx              # data-type: (stack operation)
           0.00 :   ffffffff8105061e:        subq    $0x10, %rsp
           0.00 :   ffffffff81050622:        movl    %edi, %ebx
           0.00 :   ffffffff81050624:        movq    %gs:0x7efc4814(%rip), %rax  # 0x14e40 <current_task>              # data-type: struct task_struct* +0
           0.00 :   ffffffff8105062c:        movq    0x8d0(%rax), %r14         # data-type: struct task_struct +0x8d0 (files)
           0.00 :   ffffffff81050633:        movl    (%r14), %eax              # data-type: struct files_struct +0 (count.counter)
           0.00 :   ffffffff81050636:        cmpl    $0x1, %eax
           0.00 :   ffffffff81050639:        je      0xffffffff810506a9 <__fdget+0x99>
           0.00 :   ffffffff8105063b:        movq    0x20(%r14), %rcx          # data-type: struct files_struct +0x20 (fdt)
           0.00 :   ffffffff8105063f:        movl    (%rcx), %eax              # data-type: struct fdtable +0 (max_fds)
           0.00 :   ffffffff81050641:        cmpl    %ebx, %eax
           0.00 :   ffffffff81050643:        jbe     0xffffffff810506ef <__fdget+0xdf>
           0.00 :   ffffffff81050649:        movl    %ebx, %r15d
           5.56 :   ffffffff8105064c:        movq    0x8(%rcx), %rdx           # data-type: struct fdtable +0x8 (fd)
  	...

     The "# data-type:" part was added with this change. The first few
     entries are not very interesting. But later you can it accesses a
     couple of fields in the task_struct, files_struct and fdtable.

  perf trace:

   - Support syscall tracing for different ABI. For example it can trace
     system calls for 32-bit applications on 64-bit kernel
     transparently.

   - Add --summary-mode=total option to show global syscall summary. The
     default is 'thread' to show per-thread syscall summary.

  Python support:

   - Add more interfaces to 'perf' module to parse events, and config,
     enable or disable the event list properly so that it can implement
     basic functionalities purely in Python. There is an example code
     for these new interfaces in python/tracepoint.py.

   - Add mypy and pylint support to enable build time checking. Fix some
     code based on the findings from these tools.

  Internals:

   - Introduce io_dir__readdir() API to make directory traveral (usually
     for proc or sysfs) efficient with less memory footprint.

  JSON vendor events:

   - Add events and metrics for ARM Neoverse N3 and V3

   - Update events and metrics on various Intel CPUs

   - Add/update events for a number of SiFive processors"

* tag 'perf-tools-for-v6.15-2025-03-27' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (229 commits)
  perf bpf-filter: Fix a parsing error with comma
  perf report: Fix a memory leak for perf_env on AMD
  perf trace: Fix wrong size to bpf_map__update_elem call
  perf tools: annotate asm_pure_loop.S
  perf python: Fix setup.py mypy errors
  perf test: Address attr.py mypy error
  perf build: Add pylint build tests
  perf build: Add mypy build tests
  perf build: Rename TEST_LOGS to SHELL_TEST_LOGS
  tools/build: Don't pass test log files to linker
  perf bench sched pipe: fix enforced blocking reads in worker_thread
  perf tools: Fix is_compat_mode build break in ppc64
  perf build: filter all combinations of -flto for libperl
  perf vendor events arm64 AmpereOneX: Fix frontend_bound calculation
  perf vendor events arm64: AmpereOne/AmpereOneX: Mark LD_RETIRED impacted by errata
  perf trace: Fix evlist memory leak
  perf trace: Fix BTF memory leak
  perf trace: Make syscall table stable
  perf syscalltbl: Mask off ABI type for MIPS system calls
  perf build: Remove Makefile.syscalls
  ...
2025-03-31 08:52:33 -07:00
Linus Torvalds
fa593d0f96 bpf-next-6.15
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfi6ZAACgkQ6rmadz2v
 bTpLOg/+J7xUddPMhlpFAUlifQEadE5hmw6v1tXpM3zyKHzUWJiv/qsx3j8/ckgD
 D+d4P8bqIbI9SSuIS4oZ0+D9pr/g7GYztnoYZmPiYJ7v2AijPuof5dsagFQE8E2y
 rhfbt9KHTMzzkdkTvaAZaITS/HWAoJ2YVRB6gfLex2ghcXYHcgmtKRZniQrbBiFZ
 MIXBN8Rg6HP+pUdIVllSXFcQCb3XIgjPONRAos4hr5tIm+3Ku7Jvkgk2H/9vUcoF
 bdXAcg8xygyH7eY+1l3e7nEPQlG0jUZEsL+tq+vpdoLRLqlIpAUYmwUvqcmq4dPS
 QGFjiUcpDbXlxsUFpzjXHIFto7fXCfND7HEICQPwAncdflIIfYaATSQUfkEexn0a
 wBCFlAChrEzAmg2vFl4EeEr0fdSe/3jswrgKx0m6ctKieMjgloBUeeH4fXOpfkhS
 9tvhuduVFuronlebM8ew4w9T/mBgbyxkE5KkvP4hNeB3ni3N0K6Mary5/u2HyN1e
 lqTlnZxRA4p6lrvxce/mDrR4VSwlKLcSeQVjxAL1afD5KRkuZJnUv7bUhS361vkG
 IjNrQX30EisDAz+X7tMn3ndBf9vVatwFT4+c3yaxlQRor1WofhDfT88HPiyB4QqQ
 Kdx2EHgbQxJp4vkzhp4/OXlTfkihsMEn8egzZuphdPEQ9Y+Jdwg=
 =aN/V
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:
 "For this merge window we're splitting BPF pull request into three for
  higher visibility: main changes, res_spin_lock, try_alloc_pages.

  These are the main BPF changes:

   - Add DFA-based live registers analysis to improve verification of
     programs with loops (Eduard Zingerman)

   - Introduce load_acquire and store_release BPF instructions and add
     x86, arm64 JIT support (Peilin Ye)

   - Fix loop detection logic in the verifier (Eduard Zingerman)

   - Drop unnecesary lock in bpf_map_inc_not_zero() (Eric Dumazet)

   - Add kfunc for populating cpumask bits (Emil Tsalapatis)

   - Convert various shell based tests to selftests/bpf/test_progs
     format (Bastien Curutchet)

   - Allow passing referenced kptrs into struct_ops callbacks (Amery
     Hung)

   - Add a flag to LSM bpf hook to facilitate bpf program signing
     (Blaise Boscaccy)

   - Track arena arguments in kfuncs (Ihor Solodrai)

   - Add copy_remote_vm_str() helper for reading strings from remote VM
     and bpf_copy_from_user_task_str() kfunc (Jordan Rome)

   - Add support for timed may_goto instruction (Kumar Kartikeya
     Dwivedi)

   - Allow bpf_get_netns_cookie() int cgroup_skb programs (Mahe Tardy)

   - Reduce bpf_cgrp_storage_busy false positives when accessing cgroup
     local storage (Martin KaFai Lau)

   - Introduce bpf_dynptr_copy() kfunc (Mykyta Yatsenko)

   - Allow retrieving BTF data with BTF token (Mykyta Yatsenko)

   - Add BPF kfuncs to set and get xattrs with 'security.bpf.' prefix
     (Song Liu)

   - Reject attaching programs to noreturn functions (Yafang Shao)

   - Introduce pre-order traversal of cgroup bpf programs (Yonghong
     Song)"

* tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (186 commits)
  selftests/bpf: Add selftests for load-acquire/store-release when register number is invalid
  bpf: Fix out-of-bounds read in check_atomic_load/store()
  libbpf: Add namespace for errstr making it libbpf_errstr
  bpf: Add struct_ops context information to struct bpf_prog_aux
  selftests/bpf: Sanitize pointer prior fclose()
  selftests/bpf: Migrate test_xdp_vlan.sh into test_progs
  selftests/bpf: test_xdp_vlan: Rename BPF sections
  bpf: clarify a misleading verifier error message
  selftests/bpf: Add selftest for attaching fexit to __noreturn functions
  bpf: Reject attaching fexit/fmod_ret to __noreturn functions
  bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates storage
  bpf: Make perf_event_read_output accessible in all program types.
  bpftool: Using the right format specifiers
  bpftool: Add -Wformat-signedness flag to detect format errors
  selftests/bpf: Test freplace from user namespace
  libbpf: Pass BPF token from find_prog_btf_id to BPF_BTF_GET_FD_BY_ID
  bpf: Return prog btf_id without capable check
  bpf: BPF token support for BPF_BTF_GET_FD_BY_ID
  bpf, x86: Fix objtool warning for timed may_goto
  bpf: Check map->record at the beginning of check_and_free_fields()
  ...
2025-03-30 12:43:03 -07:00
Ian Rogers
307ef667e9 libbpf: Add namespace for errstr making it libbpf_errstr
When statically linking symbols can be replaced with those from other
statically linked libraries depending on the link order and the hoped
for "multiple definition" error may not appear. To avoid conflicts it
is good practice to namespace symbols, this change renames errstr to
libbpf_errstr. To avoid churn a #define is used to turn use of
errstr(err) to libbpf_errstr(err).

Fixes: 1633a83bf9 ("libbpf: Introduce errstr() for stringifying errno")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250320222439.1350187-1-irogers@google.com
2025-03-21 13:44:54 -07:00
James Clark
f5b07010c1 libperf: Don't remove -g when EXTRA_CFLAGS are used
When using EXTRA_CFLAGS, for example "EXTRA_CFLAGS=-DREFCNT_CHECKING=1",
this construct stops setting -g which you'd expect would not be affected
by adding extra flags. Additionally, EXTRA_CFLAGS should be the last
thing to be appended so that it can be used to undo any defaults. And no
condition is required, just += appends to any existing CFLAGS and also
appends or doesn't append EXTRA_CFLAGS if they are or aren't set.

It's not clear why DEBUG=1 is required for -g in Perf when in libperf
it's always on, but I don't think we need to change that behavior now
because someone may be depending on it.

Signed-off-by: James Clark <james.clark@linaro.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250319114009.417865-1-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-03-19 17:00:39 -07:00
Mykyta Yatsenko
974ef9f0d2 libbpf: Pass BPF token from find_prog_btf_id to BPF_BTF_GET_FD_BY_ID
Pass BPF token from bpf_program__set_attach_target to
BPF_BTF_GET_FD_BY_ID bpf command.
When freplace program attaches to target program, it needs to look up
for BTF of the target, this may require BPF token, if, for example,
running from user namespace.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250317174039.161275-4-mykyta.yatsenko5@gmail.com
2025-03-17 13:45:12 -07:00
Wei Yang
82114e4513 lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
Verify interval_tree_iter_xxx() helpers could find intersection ranges
as expected.

[sfr@canb.auug.org.au: some of tools/ uses -Wno-unused-parameter]
  Link: https://lkml.kernel.org/r/20250312113612.31ac808e@canb.auug.org.au
Link: https://lkml.kernel.org/r/20250310074938.26756-5-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 12:17:00 -07:00
Wei Yang
4164e1525d lib/rbtree: enable userland test suite for rbtree related data structure
Patch series "lib/interval_tree: add some test cases and cleanup", v2.

Since rbtree/augmented tree/interval tree share similar data structure,
besides new cases for interval tree, this patch set also does cleanup for
others.


This patch (of 7):

Currently we have some tests for rbtree related data structure, e.g. 
rbtree, augmented rbtree, interval tree, in lib/ as kernel module.

To facilitate the test and debug for those fundamental data structure,
this patch enable those tests in userland.

Link: https://lkml.kernel.org/r/20250310074938.26756-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20250310074938.26756-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 12:17:00 -07:00
Mykyta Yatsenko
1315c28ed8 libbpf: Split bpf object load into prepare/load
Introduce bpf_object__prepare API: additional intermediate preparation
step that performs ELF processing, relocations, prepares final state of
BPF program instructions (accessible with bpf_program__insns()), creates
and (potentially) pins maps, and stops short of loading BPF programs.

We anticipate few use cases for this API, such as:
* Use prepare to initialize bpf_token, without loading freplace
programs, unlocking possibility to lookup BTF of other programs.
* Execute prepare to obtain finalized BPF program instructions without
loading programs, enabling tools like veristat to process one program at
a time, without incurring cost of ELF parsing and processing.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250303135752.158343-4-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:28 -07:00
Mykyta Yatsenko
9a9e347835 libbpf: Introduce more granular state for bpf_object
We are going to split bpf_object loading into 2 stages: preparation and
loading. This will increase flexibility when working with bpf_object
and unlock some optimizations and use cases.
This patch substitutes a boolean flag (loaded) by more finely-grained
state for bpf_object.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250303135752.158343-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:28 -07:00
Mykyta Yatsenko
6ef78c4191 libbpf: Use map_is_created helper in map setters
Refactoring: use map_is_created helper in map setters that need to check
the state of the map. This helps to reduce the number of the places that
depend explicitly on the loaded flag, simplifying refactoring in the
next patch of this set.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250303135752.158343-2-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15 11:48:27 -07:00
Arnaldo Carvalho de Melo
0c9f3a8597 libapi: Add missing header with NAME_MAX define to io_dir.h
Most systems get this indirectly, but some odd cases (some musl libc
systems) can't find it, so just add the header where NAME_MAX is defined
to avoid that.

Fixes: d118b08f7e ("tools lib api: Add io_dir an allocation free readdir alternative")
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250310194534.265487-2-acme@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-03-13 00:29:36 -07:00
Ian Rogers
c760174401 perf cpumap: Reduce cpu size from int to int16_t
Fewer than 32k logical CPUs are currently supported by perf. A cpumap
is indexed by an integer (see perf_cpu_map__cpu) yielding a perf_cpu
that wraps a 4-byte int for the logical CPU - the wrapping is done
deliberately to avoid confusing a logical CPU with an index into a
cpumap. Using a 4-byte int within the perf_cpu is larger than required
so this patch reduces it to the 2-byte int16_t. For a cpumap
containing 16 entries this will reduce the array size from 64 to 32
bytes. For very large servers with lots of logical CPUs the size
savings will be greater.

Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250210191231.156294-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-27 08:47:25 -08:00
Ihor Solodrai
b62dff1440 libbpf: Implement bpf_usdt_arg_size BPF function
Information about USDT argument size is implicitly stored in
__bpf_usdt_arg_spec, but currently it's not accessbile to BPF programs
that use USDT.

Implement bpf_sdt_arg_size() that returns the size of an USDT argument
in bytes.

v1->v2:
  * do not add __bpf_usdt_arg_spec() helper

v1: https://lore.kernel.org/bpf/20250220215904.3362709-1-ihor.solodrai@linux.dev/

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250224235756.2612606-1-ihor.solodrai@linux.dev
2025-02-26 08:59:44 -08:00
Linus Torvalds
9f5270d758 perf tools fixes for v6.14: 2nd batch
- Fix tools/ quiet build Makefile infrastructure that was broken when
   working on tools/perf/ without testing on other tools/ living
   utilities.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCZ74OJwAKCRCyPKLppCJ+
 J30mAPsHCA8A+CNq/5yW2VhFLV1GgCSL5oWqxXRn7QjhSrCQBQEAot2u4O5zXs7M
 sg+mPlYiS1oT+zmvTLlXrN+bVyWP9A4=
 =jH1N
 -----END PGP SIGNATURE-----

Merge tag 'perf-tools-fixes-for-v6.14-2-2025-02-25' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools fixes from Arnaldo Carvalho de Melo:

 - Fix tools/ quiet build Makefile infrastructure that was broken when
   working on tools/perf/ without testing on other tools/ living
   utilities.

* tag 'perf-tools-fixes-for-v6.14-2-2025-02-25' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
  tools: Remove redundant quiet setup
  tools: Unify top-level quiet infrastructure
2025-02-25 13:32:32 -08:00
Ian Rogers
d118b08f7e tools lib api: Add io_dir an allocation free readdir alternative
glibc's opendir allocates a minimum of 32kb, when called recursively
for a directory tree the memory consumption can add up - nearly 300kb
during perf start-up when processing modules. Add a stack allocated
variant of readdir sized a little more than 1kb.

As getdents64 may be missing from libc, add support using syscall. As
the system call number maybe missing, add #defines for those.

Note, an earlier version of this patch had a feature test for
getdents64 but there were problems on certains distros where
getdents64 would be #define renamed to getdents breaking the code. The
syscall use was made uncondtional to work around this. There is
context in:
https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250222061015.303622-2-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-24 15:46:33 -08:00
Nandakumar Edamana
236d391011 libbpf: Fix out-of-bound read
In `set_kcfg_value_str`, an untrusted string is accessed with the assumption
that it will be at least two characters long due to the presence of checks for
opening and closing quotes. But the check for the closing quote
(value[len - 1] != '"') misses the fact that it could be checking the opening
quote itself in case of an invalid input that consists of just the opening
quote.

This commit adds an explicit check to make sure the string is at least two
characters long.

Signed-off-by: Nandakumar Edamana <nandakumar@nandakumar.co.in>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250221210110.3182084-1-nandakumar@nandakumar.co.in
2025-02-24 13:58:41 -08:00
Andrii Nakryiko
e0525cd72b libbpf: Fix hypothetical STT_SECTION extern NULL deref case
Fix theoretical NULL dereference in linker when resolving *extern*
STT_SECTION symbol against not-yet-existing ELF section. Not sure if
it's possible in practice for valid ELF object files (this would require
embedded assembly manipulations, at which point BTF will be missing),
but fix the s/dst_sym/dst_sec/ typo guarding this condition anyways.

Fixes: faf6ed321c ("libbpf: Add BPF static linker APIs")
Fixes: a46349227c ("libbpf: Add linker extern resolution support for functions and global variables")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250220002821.834400-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-20 18:42:16 -08:00
Tao Chen
e8af068239 libbpf: Wrap libbpf API direct err with libbpf_err
Just wrap the direct err with libbpf_err, keep consistency
with other APIs.

Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250219153711.29651-1-chen.dylane@linux.dev
2025-02-19 14:56:30 -08:00
Charlie Jenkins
42367eca76 tools: Remove redundant quiet setup
Q is exported from Makefile.include so it is not necessary to manually
set it.

Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <qmo@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Benjamin Tissoires <bentiss@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Mykola Lysenko <mykolal@fb.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20250213-quiet_tools-v3-2-07de4482a581@rivosinc.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-02-18 16:27:43 -03:00
Andrii Nakryiko
06096d19ee libbpf: fix LDX/STX/ST CO-RE relocation size adjustment logic
Libbpf has a somewhat obscure feature of automatically adjusting the
"size" of LDX/STX/ST instruction (memory store and load instructions),
based on originally recorded access size (u8, u16, u32, or u64) and the
actual size of the field on target kernel. This is meant to facilitate
using BPF CO-RE on 32-bit architectures (pointers are always 64-bit in
BPF, but host kernel's BTF will have it as 32-bit type), as well as
generally supporting safe type changes (unsigned integer type changes
can be transparently "relocated").

One issue that surfaced only now, 5 years after this logic was
implemented, is how this all works when dealing with fields that are
arrays. This isn't all that easy and straightforward to hit (see
selftests that reproduce this condition), but one of sched_ext BPF
programs did hit it with innocent looking loop.

Long story short, libbpf used to calculate entire array size, instead of
making sure to only calculate array's element size. But it's the element
that is loaded by LDX/STX/ST instructions (1, 2, 4, or 8 bytes), so
that's what libbpf should check. This patch adjusts the logic for
arrays and fixed the issue.

Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250207014809.1573841-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-14 19:58:05 -08:00
Ihor Solodrai
2019c58318 libbpf: Check the kflag of type tags in btf_dump
If the kflag is set for a BTF type tag, then the tag represents an
arbitrary __attribute__. Change btf_dump accordingly.

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250130201239.1429648-4-ihor.solodrai@linux.dev
2025-02-05 16:17:59 -08:00
Ihor Solodrai
51d1b1d428 libbpf: Introduce kflag for type_tags and decl_tags in BTF
Add the following functions to libbpf API:
  * btf__add_type_attr()
  * btf__add_decl_attr()

These functions allow to add to BTF the type tags and decl tags with
info->kflag set to 1. The kflag indicates that the tag directly
encodes an __attribute__ and not a normal tag.

See Documentation/bpf/btf.rst changes in the subsequent patch for
details on the semantics.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250130201239.1429648-2-ihor.solodrai@linux.dev
2025-02-05 16:17:59 -08:00
Tony Ambardar
0a7c2a8435 libbpf: Fix accessing BTF.ext core_relo header
Update btf_ext_parse_info() to ensure the core_relo header is present
before reading its fields. This avoids a potential buffer read overflow
reported by the OSS Fuzz project.

Fixes: cf579164e9 ("libbpf: Support BTF.ext loading and output in either endianness")
Signed-off-by: Tony Ambardar <tony.ambardar@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://issues.oss-fuzz.com/issues/388905046
Link: https://lore.kernel.org/bpf/20250125065236.2603346-1-itugrok@yahoo.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-03 03:33:51 -08:00
Linus Torvalds
7685b334d1 perf-tools changes for v6.14
There are a lot of changes in the perf tools in this cycle.
 
 build
 -----
 * Use generic syscall table to generate syscall numbers on supported archs.
 * This also enables to get rid of libaudit which was used for syscall numbers.
 * Remove python2 support as it's deprecated for years.
 * Fix issues on static build with libzstd.
 
 perf record
 -----------
 * Intel-PT supports "aux-action" config term to pause or resume tracing in
   the aux-buffer.  Users can start the intel_pt event as "started-paused" and
   configure other events to control the Intel-PT tracing.
 
     # perf record --kcore -e intel_pt/aux-action=start-paused/   \
         -e syscalls:sys_enter_newuname/aux-action=resume/        \
         -e syscalls:sys_exit_newuname/aux-action=pause/ -- uname
 
   This requires the kernel support (which was added in v6.13).
 
 perf lock
 ---------
 * 'perf lock contention' command has an ability to symbolize locks in
   dynamically allocated objects using slab cache name when it runs with BPF.
   Those dynamic locks would have "&" prefix in the name to distinguish them
   from ordinary (static) locks.
 
     # perf lock con -abl -E 5 sleep 1
        contended   total wait     max wait     avg wait            address   symbol
 
                2      1.95 us      1.77 us       975 ns   ffff9d5e852d3498   &task_struct (mutex)
                1      1.18 us      1.18 us      1.18 us   ffff9d5e852d3538   &task_struct (mutex)
                4      1.12 us       354 ns       279 ns   ffff9d5e841ca800   &kmalloc-cg-512 (mutex)
                2       859 ns       617 ns       429 ns   ffffffffa41c3620   delayed_uprobe_lock (mutex)
                3       691 ns       388 ns       230 ns   ffffffffa41c0940   pack_mutex (mutex)
 
   This also requires the kernel/BPF support (which was added in v6.13).
 
 perf ftrace
 -----------
 * 'perf ftrace latency' command gets a couple of options to support linear
   buckets instead of exponential.  Also it's possible to specify max and
   min latency for the linear buckets.
 
     # perf ftrace latency -abn -T switch_mm_irqs_off --bucket-range=100   \
         --min-latency=200 --max-latency=800 -- sleep 1
     #   DURATION     |      COUNT | GRAPH                                  |
          0 -  200 ns |        186 | ###                                    |
        200 -  300 ns |        256 | #####                                  |
        300 -  400 ns |        364 | #######                                |
        400 -  500 ns |        223 | ####                                   |
        500 -  600 ns |        111 | ##                                     |
        600 -  700 ns |         41 |                                        |
        700 -  800 ns |        141 | ##                                     |
        800 -  ... ns |        169 | ###                                    |
 
     # statistics  (in nsec)
       total time:              2162212
         avg time:                  967
         max time:                16817
         min time:                  132
            count:                 2236
 
 * As you can see in the above example, it nows shows the statistics at the
   end so that users can see the avg/max/min latencies easily.
 
 * 'perf ftrace profile' command has --graph-opts option like 'perf ftrace
   trace' so that it can control the tracing behaviors in the same way.
   For example, it can limit the function call depth or threshold.
 
 perf script
 -----------
 * Improve physical memory resolution in 'mem-phys-addr' script by parsing
   /proc/iomem file.
 
     # perf script mem-phys-addr -- find /
     ...
     Event: mem_inst_retired.all_loads:P
     Memory type                                    count  percentage
     ----------------------------------------  ----------  ----------
     100000000-85f7fffff : System RAM                8929        69.7
       547600000-54785d23f : Kernel data             1240         9.7
       546a00000-5474bdfff : Kernel rodata            490         3.8
       5480ce000-5485fffff : Kernel bss               121         0.9
     0-fff : Reserved                                3860        30.1
     100000-89c01fff : System RAM                      18         0.1
     8a22c000-8df6efff : System RAM                     5         0.0
 
 Others
 ------
 * 'perf test' gets --runs-per-test option to run the test cases repeatedly.
   This would be helpful to see if it's flaky.
 
 * Add 'parse_events' method to Python perf extension module, so that users
   can use the same event parsing logic in the python code.  One more step
   towards implementing perf tools in Python. :)
 
 * Support opening tracepoint events without libtraceevent.  This will be
   helpful if it won't use the tracing data like in 'perf stat'.
 
 * Update ARM Neoverse N2/V2 JSON events and metrics
 
 Signed-off-by: Namhyung Kim <namhyung@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSo2x5BnqMqsoHtzsmMstVUGiXMgwUCZ5AgiQAKCRCMstVUGiXM
 g0WhAP43Dpfatrm1jicTyAogk5D/JrIMOgjGtrJJi5RXG/r0gwD8DSWFzLppS9xy
 KGtjLHrN6v6BqR4DCubdlZmRfh9Qjgg=
 =M0Kz
 -----END PGP SIGNATURE-----

Merge tag 'perf-tools-for-v6.14-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf-tools updates from Namhyung Kim:
 "There are a lot of changes in the perf tools in this cycle.

  build:

   - Use generic syscall table to generate syscall numbers on supported
     archs

   - This also enables to get rid of libaudit which was used for syscall
     numbers

   - Remove python2 support as it's deprecated for years

   - Fix issues on static build with libzstd

  perf record:

   - Intel-PT supports "aux-action" config term to pause or resume
     tracing in the aux-buffer. Users can start the intel_pt event as
     "started-paused" and configure other events to control the Intel-PT
     tracing:

         # perf record --kcore -e intel_pt/aux-action=start-paused/   \
             -e syscalls:sys_enter_newuname/aux-action=resume/        \
             -e syscalls:sys_exit_newuname/aux-action=pause/ -- uname

     This requires kernel support (which was added in v6.13)

  perf lock:

   - 'perf lock contention' command has an ability to symbolize locks in
     dynamically allocated objects using slab cache name when it runs
     with BPF. Those dynamic locks would have "&" prefix in the name to
     distinguish them from ordinary (static) locks

        # perf lock con -abl -E 5 sleep 1
           contended   total wait     max wait     avg wait            address   symbol

                   2      1.95 us      1.77 us       975 ns   ffff9d5e852d3498   &task_struct (mutex)
                   1      1.18 us      1.18 us      1.18 us   ffff9d5e852d3538   &task_struct (mutex)
                   4      1.12 us       354 ns       279 ns   ffff9d5e841ca800   &kmalloc-cg-512 (mutex)
                   2       859 ns       617 ns       429 ns   ffffffffa41c3620   delayed_uprobe_lock (mutex)
                   3       691 ns       388 ns       230 ns   ffffffffa41c0940   pack_mutex (mutex)

     This also requires kernel/BPF support (which was added in v6.13)

  perf ftrace:

   - 'perf ftrace latency' command gets a couple of options to support
     linear buckets instead of exponential. Also it's possible to
     specify max and min latency for the linear buckets:

        # perf ftrace latency -abn -T switch_mm_irqs_off --bucket-range=100   \
            --min-latency=200 --max-latency=800 -- sleep 1
        #   DURATION     |      COUNT | GRAPH                                  |
             0 -  200 ns |        186 | ###                                    |
           200 -  300 ns |        256 | #####                                  |
           300 -  400 ns |        364 | #######                                |
           400 -  500 ns |        223 | ####                                   |
           500 -  600 ns |        111 | ##                                     |
           600 -  700 ns |         41 |                                        |
           700 -  800 ns |        141 | ##                                     |
           800 -  ... ns |        169 | ###                                    |

        # statistics  (in nsec)
          total time:              2162212
            avg time:                  967
            max time:                16817
            min time:                  132
               count:                 2236

   - As you can see in the above example, it nows shows the statistics
     at the end so that users can see the avg/max/min latencies easily

   - 'perf ftrace profile' command has --graph-opts option like 'perf
     ftrace trace' so that it can control the tracing behaviors in the
     same way. For example, it can limit the function call depth or
     threshold

  perf script:

   - Improve physical memory resolution in 'mem-phys-addr' script by
     parsing /proc/iomem file

        # perf script mem-phys-addr -- find /
        ...
        Event: mem_inst_retired.all_loads:P
        Memory type                                    count  percentage
        ----------------------------------------  ----------  ----------
        100000000-85f7fffff : System RAM                8929        69.7
          547600000-54785d23f : Kernel data             1240         9.7
          546a00000-5474bdfff : Kernel rodata            490         3.8
          5480ce000-5485fffff : Kernel bss               121         0.9
        0-fff : Reserved                                3860        30.1
        100000-89c01fff : System RAM                      18         0.1
        8a22c000-8df6efff : System RAM                     5         0.0

  Others:

   - 'perf test' gets --runs-per-test option to run the test cases
     repeatedly. This would be helpful to see if it's flaky

   - Add 'parse_events' method to Python perf extension module, so that
     users can use the same event parsing logic in the python code. One
     more step towards implementing perf tools in Python. :)

   - Support opening tracepoint events without libtraceevent. This will
     be helpful if it won't use the tracing data like in 'perf stat'

   - Update ARM Neoverse N2/V2 JSON events and metrics"

* tag 'perf-tools-for-v6.14-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (176 commits)
  perf test: Update event_groups test to use instructions
  perf bench: Fix undefined behavior in cmpworker()
  perf annotate: Prefer passing evsel to evsel->core.idx
  perf lock: Rename fields in lock_type_table
  perf lock: Add percpu-rwsem for type filter
  perf lock: Fix parse_lock_type which only retrieve one lock flag
  perf lock: Fix return code for functions in __cmd_contention
  perf hist: Fix width calculation in hpp__fmt()
  perf hist: Fix bogus profiles when filters are enabled
  perf hist: Deduplicate cmp/sort/collapse code
  perf test: Improve verbose documentation
  perf test: Add a runs-per-test flag
  perf test: Fix parallel/sequential option documentation
  perf test: Send list output to stdout rather than stderr
  perf test: Rename functions and variables for better clarity
  perf tools: Expose quiet/verbose variables in Makefile.perf
  perf config: Add a function to set one variable in .perfconfig
  perf test perftool_testsuite: Return correct value for skipping
  perf test perftool_testsuite: Add missing description
  perf test record+probe_libc_inet_pton: Make test resilient
  ...
2025-01-24 05:45:40 -08:00
Andrii Nakryiko
f8a05692de libbpf: Work around kernel inconsistently stripping '.llvm.' suffix
Some versions of kernel were stripping out '.llvm.<hash>' suffix from
kerne symbols (produced by Clang LTO compilation) from function names
reported in available_filter_functions, while kallsyms reported full
original name. This confuses libbpf's multi-kprobe logic of finding all
matching kernel functions for specified user glob pattern by joining
available_filter_functions and kallsyms contents, because joining by
full symbol name won't work for symbols containing '.llvm.<hash>' suffix.

This was eventually fixed by [0] in the kernel, but we'd like to not
regress multi-kprobe experience and add a work around for this bug on
libbpf side, stripping kallsym's name if it matches user pattern and
contains '.llvm.' suffix.

  [0] fb6a421fb6 ("kallsyms: Match symbols exactly with CONFIG_LTO_CLANG")

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250117003957.179331-1-andrii@kernel.org
2025-01-17 15:16:56 +01:00
Pu Lehui
5ca681a86e libbpf: Fix incorrect traversal end type ID when marking BTF_IS_EMBEDDED
When redirecting the split BTF to the vmlinux base BTF, we need to mark
the distilled base struct/union members of split BTF structs/unions in
id_map with BTF_IS_EMBEDDED. This indicates that these types must match
both name and size later. Therefore, we need to traverse the entire
split BTF, which involves traversing type IDs from nr_dist_base_types to
nr_types. However, the current implementation uses an incorrect
traversal end type ID, so let's correct it.

Fixes: 19e00c897d ("libbpf: Split BTF relocation")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250115100241.4171581-3-pulehui@huaweicloud.com
2025-01-16 15:34:18 -08:00
Pu Lehui
5436a54332 libbpf: Fix return zero when elf_begin failed
The error number of elf_begin is omitted when encapsulating the
btf_find_elf_sections function.

Fixes: c86f180ffc ("libbpf: Make btf_parse_elf process .BTF.base transparently")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250115100241.4171581-2-pulehui@huaweicloud.com
2025-01-16 15:34:18 -08:00
Yonghong Song
e2b0bda62d libbpf: Add unique_match option for multi kprobe
Jordan reported an issue in Meta production environment where func
try_to_wake_up() is renamed to try_to_wake_up.llvm.<hash>() by clang
compiler at lto mode. The original 'kprobe/try_to_wake_up' does not
work any more since try_to_wake_up() does not match the actual func
name in /proc/kallsyms.

There are a couple of ways to resolve this issue. For example, in
attach_kprobe(), we could do lookup in /proc/kallsyms so try_to_wake_up()
can be replaced by try_to_wake_up.llvm.<hach>(). Or we can force users
to use bpf_program__attach_kprobe() where they need to lookup
/proc/kallsyms to find out try_to_wake_up.llvm.<hach>(). But these two
approaches requires extra work by either libbpf or user.

Luckily, suggested by Andrii, multi kprobe already supports wildcard ('*')
for symbol matching. In the above example, 'try_to_wake_up*' can match
to try_to_wake_up() or try_to_wake_up.llvm.<hash>() and this allows
bpf prog works for different kernels as some kernels may have
try_to_wake_up() and some others may have try_to_wake_up.llvm.<hash>().

The original intention is to kprobe try_to_wake_up() only, so an optional
field unique_match is added to struct bpf_kprobe_multi_opts. If the
field is set to true, the number of matched functions must be one.
Otherwise, the attachment will fail. In the above case, multi kprobe
with 'try_to_wake_up*' and unique_match preserves user functionality.

Reported-by: Jordan Rome <linux@jordanrome.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250109174023.3368432-1-yonghong.song@linux.dev
2025-01-10 13:11:42 -08:00
Daniel Xu
1846dd8e3a libbpf: Set MFD_NOEXEC_SEAL when creating memfd
Starting from 105ff5339f ("mm/memfd: add MFD_NOEXEC_SEAL and
MFD_EXEC") and until 1717449b44 ("memfd: drop warning for missing
exec-related flags"), the kernel would print a warning if neither
MFD_NOEXEC_SEAL nor MFD_EXEC is set in memfd_create().

If libbpf runs on on a kernel between these two commits (eg. on an
improperly backported system), it'll trigger this warning.

To avoid this warning (and also be more secure), explicitly set
MFD_NOEXEC_SEAL. But since libbpf can be run on potentially very old
kernels, leave a fallback for kernels without MFD_NOEXEC_SEAL support.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/6e62c2421ad7eb1da49cbf16da95aaaa7f94d394.1735594195.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-30 14:48:15 -08:00
Alexei Starovoitov
06103dccbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR.

No conflicts.

Adjacent changes in:
Auto-merging include/linux/bpf.h
Auto-merging include/linux/bpf_verifier.h
Auto-merging kernel/bpf/btf.c
Auto-merging kernel/bpf/verifier.c
Auto-merging kernel/trace/bpf_trace.c
Auto-merging tools/testing/selftests/bpf/progs/test_tp_btf_nullable.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-16 08:53:59 -08:00
Anton Protopopov
f9933acda3 libbpf: prog load: Allow to use fd_array_cnt
Add new fd_array_cnt field to bpf_prog_load_opts
and pass it in bpf_attr, if set.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213130934.1087929-6-aspsk@isovalent.com
2024-12-13 14:48:39 -08:00
Arnaldo Carvalho de Melo
aec95d7ce1 Merge remote-tracking branch 'torvalds/master' into perf-tools-next
To get the fixes that went thru perf-tools for v6.13.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-12-13 11:53:27 -03:00
Alastair Robertson
6d5e5e5d7c libbpf: Extend linker API to support in-memory ELF files
The new_fd and add_fd functions correspond to the original new and
add_file functions, but accept an FD instead of a file name. This
gives API consumers the option of using anonymous files/memfds to
avoid writing ELFs to disk.

This new API will be useful for performing linking as part of
bpftrace's JIT compilation.

The add_buf function is a convenience wrapper that does the work of
creating a memfd for the caller.

Signed-off-by: Alastair Robertson <ajor@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241211164030.573042-3-ajor@meta.com
2024-12-12 15:16:53 -08:00
Alastair Robertson
b641712925 libbpf: Pull file-opening logic up to top-level functions
Move the filename arguments and file-descriptor handling from
init_output_elf() and linker_load_obj_file() and instead handle them
at the top-level in bpf_linker__new() and bpf_linker__add_file().

This will allow the inner functions to be shared with a new,
non-filename-based, API in the next commit.

Signed-off-by: Alastair Robertson <ajor@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241211164030.573042-2-ajor@meta.com
2024-12-12 15:09:21 -08:00
James Clark
f7e36d02d7 libperf: evlist: Fix --cpu argument on hybrid platform
Since the linked fixes: commit, specifying a CPU on hybrid platforms
results in an error because Perf tries to open an extended type event
on "any" CPU which isn't valid. Extended type events can only be opened
on CPUs that match the type.

Before (working):

  $ perf record --cpu 1 -- true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 2.385 MB perf.data (7 samples) ]

After (not working):

  $ perf record -C 1 -- true
  WARNING: A requested CPU in '1' is not supported by PMU 'cpu_atom' (CPUs 16-27) for event 'cycles:P'
  Error:
  The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu_atom/cycles:P/).
  /bin/dmesg | grep -i perf may provide additional information.

(Ignore the warning message, that's expected and not particularly
relevant to this issue).

This is because perf_cpu_map__intersect() of the user specified CPU (1)
and one of the PMU's CPUs (16-27) correctly results in an empty (NULL)
CPU map. However for the purposes of opening an event, libperf converts
empty CPU maps into an any CPU (-1) which the kernel rejects.

Fix it by deleting evsels with empty CPU maps in the specific case where
user requested CPU maps are evaluated.

Fixes: 251aa04024 ("perf parse-events: Wildcard most "numeric" events")
Reviewed-by: Ian Rogers <irogers@google.com>
Tested-by: Thomas Falcon <thomas.falcon@intel.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20241114160450.295844-2-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-12-11 09:19:44 -08:00
Ian Rogers
05be17eed7 tool api fs: Correctly encode errno for read/write open failures
Switch from returning -1 to -errno so that callers can determine types
of failure.

Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Paran Lee <p4ranlee@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steinar H. Gunderson <sesse@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Falcon <thomas.falcon@intel.com>
Cc: Weilin Wang <weilin.wang@intel.com>
Cc: Yang Jihong <yangjihong@bytedance.com>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Cc: Ze Gao <zegao2021@gmail.com>
Cc: Zixian Cai <fzczx123@gmail.com>
Cc: zhaimingbing <zhaimingbing@cmss.chinamobile.com>
Link: https://lore.kernel.org/r/20241118225345.889810-3-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-12-09 17:52:42 -03:00
Ian Rogers
bfb9467535 libperf cpumap: Grow array of read CPUs in smaller increments
Instead of growing the array by 2048, grow by the larger of the current
range or 16.

As ranges are typical for things like the online CPUs this will mean a
single allocation happens.

While uncore CPU maps will grow 16 at a time which is a value that is
generous except say on large servers.

Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kyle Meyer <kyle.meyer@hpe.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241206044035.1062032-9-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-12-09 17:52:41 -03:00
Ian Rogers
e9ca57d711 libperf cpumap: Remove perf_cpu_map__read()
Function is no longer used and duplicates the parsing logic from
perf_cpu_map__new().

Remove to allow simplification.

Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kyle Meyer <kyle.meyer@hpe.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241206044035.1062032-8-irogers@google.com
[ Applied manually to cope with "libperf cpumap: Refactor perf_cpu_map__merge()" ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-12-09 17:52:41 -03:00
Ian Rogers
9d9a83c51a libperf cpumap: Remove use of perf_cpu_map__read()
Remove use of a FILE and switch to reading a string that is then
passed to perf_cpu_map__new().

Being able to remove perf_cpu_map__read() avoids duplicated parsing
logic.

Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kyle Meyer <kyle.meyer@hpe.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241206044035.1062032-7-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-12-09 17:52:41 -03:00
Ian Rogers
5d2fd516bb libperf cpumap: Be tolerant of newline at the end of a cpumask
File cpumasks often have a newline that shouldn't trigger the invalid
parsing case in perf_cpu_map__new().

Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kyle Meyer <kyle.meyer@hpe.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241206044035.1062032-5-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-12-09 17:52:41 -03:00
Ian Rogers
e8399d34d5 libperf cpumap: Hide/reduce scope of MAX_NR_CPUS
Avoid redefinition of MAX_NR_CPUS as a global constant, the original
definition is tools/perf/perf.h.

Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kyle Meyer <kyle.meyer@hpe.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241206044035.1062032-4-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-12-09 17:52:41 -03:00
Kyle Meyer
9a1e106550 perf: Increase MAX_NR_CPUS to 4096
Systems have surpassed 2048 CPUs. Increase MAX_NR_CPUS to 4096.

Bitmaps declared with MAX_NR_CPUS bits will increase from 256B to 512B,
cpus_runtime will increase from 81960B to 163880B, and max_entries will
increase from 8192B to 16384B.

Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241206044035.1062032-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-12-09 17:52:41 -03:00
Leo Yan
a9d2217556 libperf cpumap: Refactor perf_cpu_map__merge()
The perf_cpu_map__merge() function has two arguments, 'orig' and
'other'.  The function definition might cause confusion as it could give
the impression that the CPU maps in the two arguments are copied into a
new allocated structure, which is then returned as the result.

The purpose of the function is to merge the CPU map 'other' into the CPU
map 'orig'.  This commit changes the 'orig' argument to a pointer to
pointer, so the new result will be updated into 'orig'.

The return value is changed to an int type, as an error number or 0 for
success.

Update callers and tests for the new function definition.

Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Leo Yan <leo.yan@arm.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20241107125308.41226-2-leo.yan@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-12-09 17:52:41 -03:00
Quentin Monnet
e10500b69c libbpf: Fix segfault due to libelf functions not setting errno
Libelf functions do not set errno on failure. Instead, it relies on its
internal _elf_errno value, that can be retrieved via elf_errno (or the
corresponding message via elf_errmsg()). From "man libelf":

    If a libelf function encounters an error it will set an internal
    error code that can be retrieved with elf_errno. Each thread
    maintains its own separate error code. The meaning of each error
    code can be determined with elf_errmsg, which returns a string
    describing the error.

As a consequence, libbpf should not return -errno when a function from
libelf fails, because an empty value will not be interpreted as an error
and won't prevent the program to stop. This is visible in
bpf_linker__add_file(), for example, where we call a succession of
functions that rely on libelf:

    err = err ?: linker_load_obj_file(linker, filename, opts, &obj);
    err = err ?: linker_append_sec_data(linker, &obj);
    err = err ?: linker_append_elf_syms(linker, &obj);
    err = err ?: linker_append_elf_relos(linker, &obj);
    err = err ?: linker_append_btf(linker, &obj);
    err = err ?: linker_append_btf_ext(linker, &obj);

If the object file that we try to process is not, in fact, a correct
object file, linker_load_obj_file() may fail with errno not being set,
and return 0. In this case we attempt to run linker_append_elf_sysms()
and may segfault.

This can happen (and was discovered) with bpftool:

    $ bpftool gen object output.o sample_ret0.bpf.c
    libbpf: failed to get ELF header for sample_ret0.bpf.c: invalid `Elf' handle
    zsh: segmentation fault (core dumped)  bpftool gen object output.o sample_ret0.bpf.c

Fix the issue by returning a non-null error code (-EINVAL) when libelf
functions fail.

Fixes: faf6ed321c ("libbpf: Add BPF static linker APIs")
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241205135942.65262-1-qmo@kernel.org
2024-12-05 15:19:00 -08:00
Ben Olson
9a17db586d libbpf: Improve debug message when the base BTF cannot be found
When running `bpftool` on a kernel module installed in `/lib/modules...`,
this error is encountered if the user does not specify `--base-btf` to
point to a valid base BTF (e.g. usually in `/sys/kernel/btf/vmlinux`).
However, looking at the debug output to determine the cause of the error
simply says `Invalid BTF string section`, which does not point to the
actual source of the error. This just improves that debug message to tell
users what happened.

Signed-off-by: Ben Olson <matthew.olson@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/Z0YqzQ5lNz7obQG7@bolson-desk
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02 08:41:51 -08:00
Andrii Nakryiko
98ebe5ef6f libbpf: don't adjust USDT semaphore address if .stapsdt.base addr is missing
USDT ELF note optionally can record an offset of .stapsdt.base, which is
used to make adjustments to USDT target attach address. Currently,
libbpf will do this address adjustment unconditionally if it finds
.stapsdt.base ELF section in target binary. But there is a corner case
where .stapsdt.base ELF section is present, but specific USDT note
doesn't reference it. In such case, libbpf will basically just add base
address and end up with absolutely incorrect USDT target address.

This adjustment has to be done only if both .stapsdt.sema section is
present and USDT note is recording a reference to it.

Fixes: 74cc6311ce ("libbpf: Add USDT notes parsing and resolution logic")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20241121224558.796110-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02 08:41:17 -08:00
Linus Torvalds
b50ecc5aca perf tools changes for v6.13
perf record
 -----------
 * Enable leader sampling for inherited task events.  It was supported
   only for system-wide events but the kernel started to support such a
   setup since v6.12.
 
   This is to reduce the number of PMU interrupts.  The samples of the
   leader event will contain counts of other events and no samples will
   be generated for the other member events.
 
     $ perf record -e '{cycles,instructions}:S'  ${MYPROG}
 
 perf report
 -----------
 * Fix --branch-history option to display more branch-related information
   like prediction, abort and cycles which is available on Intel machines.
 
     $ perf record -bg -- perf test -w brstack
 
     $ perf report --branch-history
     ...
     #
     # Overhead  Source:Line               Symbol          Shared Object         Predicted  Abort  Cycles  IPC   [IPC Coverage]
     # ........  ........................  ..............  ....................  .........  .....  ......  ....................
     #
          8.17%  copy_page_64.S:19         [k] copy_page   [kernel.kallsyms]     50.0%      0      5       -      -
                 |
                 ---xas_load xarray.h:171
                    |
                    |--5.68%--xas_load xarray.c:245 (cycles:1)
                    |          xas_load xarray.c:242
                    |          xas_load xarray.h:1260 (cycles:1)
                    |          xas_descend xarray.c:146
                    |          xas_load xarray.c:244 (cycles:2)
                    |          xas_load xarray.c:245
                    |          xas_descend xarray.c:218 (cycles:10)
     ...
 
 perf stat
 ---------
 * Add HWMON PMU support.  The HWMON provides various system information
   like CPU/GPU temperature, fan speed and so on.  Expose them as PMU
   events so that users can see the values using perf stat commands.
 
     $ perf stat -e temp_cpu,fan1 true
 
      Performance counter stats for 'true':
 
                  60.00 'C   temp_cpu
                      0 rpm  fan1
 
            0.000745382 seconds time elapsed
 
            0.000883000 seconds user
            0.000000000 seconds sys
 
 * Display metric threshold in JSON output.  Some metrics define
   thresholds to classify value ranges.  It used to be in a different
   color but it won't work for JSON.  Add "metric-threshold" field to
   the JSON that can be one of "good", "less good", "nearly bad" and
   "bad".
 
     # perf stat -a -M TopdownL1 -j true
     {"counter-value" : "18693525.000000", "unit" : "", "event" : "TOPDOWN.SLOTS", "event-runtime" : 5552708, "pcnt-running" : 100.00, "metric-value" : "43.226002", "metric-unit" : "%  tma_backend_bound", "metric-threshold" : "bad"}
     {"metric-value" : "29.212267", "metric-unit" : "%  tma_frontend_bound", "metric-threshold" : "bad"}
     {"metric-value" : "7.138972", "metric-unit" : "%  tma_bad_speculation", "metric-threshold" : "good"}
     {"metric-value" : "20.422759", "metric-unit" : "%  tma_retiring", "metric-threshold" : "good"}
     {"counter-value" : "3817732.000000", "unit" : "", "event" : "topdown-retiring", "event-runtime" : 5552708, "pcnt-running" : 100.00, }
     {"counter-value" : "5472824.000000", "unit" : "", "event" : "topdown-fe-bound", "event-runtime" : 5552708, "pcnt-running" : 100.00, }
     {"counter-value" : "7984780.000000", "unit" : "", "event" : "topdown-be-bound", "event-runtime" : 5552708, "pcnt-running" : 100.00, }
     {"counter-value" : "1418181.000000", "unit" : "", "event" : "topdown-bad-spec", "event-runtime" : 5552708, "pcnt-running" : 100.00, }
     ...
 
 perf sched
 ----------
 * Add -P/--pre-migrations option for 'timehist' sub-command to track
   time a task waited on a run-queue before migrating to a different CPU.
 
     $ perf sched timehist -P
                time    cpu  task name                       wait time  sch delay   run time  pre-mig time
                             [tid/pid]                          (msec)     (msec)     (msec)     (msec)
     --------------- ------  ------------------------------  ---------  ---------  ---------  ---------
       585940.535527 [0000]  perf[584885]                        0.000      0.000      0.000      0.000
       585940.535535 [0000]  migration/0[20]                     0.000      0.002      0.008      0.000
       585940.535559 [0001]  perf[584885]                        0.000      0.000      0.000      0.000
       585940.535563 [0001]  migration/1[25]                     0.000      0.001      0.004      0.000
       585940.535678 [0002]  perf[584885]                        0.000      0.000      0.000      0.000
       585940.535686 [0002]  migration/2[31]                     0.000      0.002      0.008      0.000
       585940.535905 [0001]  <idle>                              0.000      0.000      0.342      0.000
       585940.535938 [0003]  perf[584885]                        0.000      0.000      0.000      0.000
       585940.537048 [0001]  sleep[584886]                       0.000      0.019      1.142      0.001
       585940.537749 [0002]  <idle>                              0.000      0.000      2.062      0.000
     ...
 
 Build
 -----
 * Make libunwind opt-in (LIBUNWIND=1) rather than opt-out.  The perf
   tools are generally built with libelf and libdw which has unwinder
   functionality.  The libunwind support predates it and no need to
   have duplicate unwinders by default.
 
 * Rename NO_DWARF=1 build option to NO_LIBDW=1 in order to clarify it's
   using libdw for handling DWARF information.
 
 Internals
 ---------
 * Do not set exclude_guest bit in the perf_event_attr by default.  This
   was causing a trouble in AMD IBS PMU as it doesn't support the bit.
   The bit will be set when it's needed later by the fallback logic.
   Also update the missing feature detection logic to make sure not clear
   supported bits unnecessarily.
 
 * Run perf test in parallel by default and mark flaky tests "exclusive"
   to run them serially at the end.  Some test numbers are changed but
   the test can complete in less than half the time.
 
 JSON vendor events
 ------------------
 * Add AMD Zen 5 events and metrics.
 
 * Add i.MX91 and i.MX95 DDR metrics
 
 * Fix HiSilicon HIP08 Topdown metric name.
 
 * Support compat events on PowerPC.
 
 Signed-off-by: Namhyung Kim <namhyung@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSo2x5BnqMqsoHtzsmMstVUGiXMgwUCZ0Qi3gAKCRCMstVUGiXM
 g6NIAP49eoSmQF40u55sJN0J7RpYd+bTgXZkahv0IUCBX98TLwEA2NrK0oUcB84C
 xeanq28/3JxNM/oBpsEvvB8mb/0lGwI=
 =FAVF
 -----END PGP SIGNATURE-----

Merge tag 'perf-tools-for-v6.13-2024-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools updates from Namhyung Kim:
 "perf record:

   - Enable leader sampling for inherited task events. It was supported
     only for system-wide events but the kernel started to support such
     a setup since v6.12.

     This is to reduce the number of PMU interrupts. The samples of the
     leader event will contain counts of other events and no samples
     will be generated for the other member events.

       $ perf record -e '{cycles,instructions}:S'  ${MYPROG}

  perf report:

   - Fix --branch-history option to display more branch-related
     information like prediction, abort and cycles which is available
     on Intel machines.

       $ perf record -bg -- perf test -w brstack

       $ perf report --branch-history
       ...
       #
       # Overhead  Source:Line               Symbol          Shared Object         Predicted  Abort  Cycles  IPC   [IPC Coverage]
       # ........  ........................  ..............  ....................  .........  .....  ......  ....................
       #
            8.17%  copy_page_64.S:19         [k] copy_page   [kernel.kallsyms]     50.0%      0      5       -      -
                   |
                   ---xas_load xarray.h:171
                      |
                      |--5.68%--xas_load xarray.c:245 (cycles:1)
                      |          xas_load xarray.c:242
                      |          xas_load xarray.h:1260 (cycles:1)
                      |          xas_descend xarray.c:146
                      |          xas_load xarray.c:244 (cycles:2)
                      |          xas_load xarray.c:245
                      |          xas_descend xarray.c:218 (cycles:10)
       ...

  perf stat:

   - Add HWMON PMU support.

     The HWMON provides various system information like CPU/GPU
     temperature, fan speed and so on. Expose them as PMU events so that
     users can see the values using perf stat commands.

       $ perf stat -e temp_cpu,fan1 true

        Performance counter stats for 'true':

                    60.00 'C   temp_cpu
                        0 rpm  fan1

              0.000745382 seconds time elapsed

              0.000883000 seconds user
              0.000000000 seconds sys

   - Display metric threshold in JSON output.

     Some metrics define thresholds to classify value ranges. It used to
     be in a different color but it won't work for JSON.

     Add "metric-threshold" field to the JSON that can be one of "good",
     "less good", "nearly bad" and "bad".

       # perf stat -a -M TopdownL1 -j true
       {"counter-value" : "18693525.000000", "unit" : "", "event" : "TOPDOWN.SLOTS", "event-runtime" : 5552708, "pcnt-running" : 100.00, "metric-value" : "43.226002", "metric-unit" : "%  tma_backend_bound", "metric-threshold" : "bad"}
       {"metric-value" : "29.212267", "metric-unit" : "%  tma_frontend_bound", "metric-threshold" : "bad"}
       {"metric-value" : "7.138972", "metric-unit" : "%  tma_bad_speculation", "metric-threshold" : "good"}
       {"metric-value" : "20.422759", "metric-unit" : "%  tma_retiring", "metric-threshold" : "good"}
       {"counter-value" : "3817732.000000", "unit" : "", "event" : "topdown-retiring", "event-runtime" : 5552708, "pcnt-running" : 100.00, }
       {"counter-value" : "5472824.000000", "unit" : "", "event" : "topdown-fe-bound", "event-runtime" : 5552708, "pcnt-running" : 100.00, }
       {"counter-value" : "7984780.000000", "unit" : "", "event" : "topdown-be-bound", "event-runtime" : 5552708, "pcnt-running" : 100.00, }
       {"counter-value" : "1418181.000000", "unit" : "", "event" : "topdown-bad-spec", "event-runtime" : 5552708, "pcnt-running" : 100.00, }
       ...

  perf sched:

   - Add -P/--pre-migrations option for 'timehist' sub-command to track
     time a task waited on a run-queue before migrating to a different
     CPU.

       $ perf sched timehist -P
                  time    cpu  task name                       wait time  sch delay   run time  pre-mig time
                               [tid/pid]                          (msec)     (msec)     (msec)     (msec)
       --------------- ------  ------------------------------  ---------  ---------  ---------  ---------
         585940.535527 [0000]  perf[584885]                        0.000      0.000      0.000      0.000
         585940.535535 [0000]  migration/0[20]                     0.000      0.002      0.008      0.000
         585940.535559 [0001]  perf[584885]                        0.000      0.000      0.000      0.000
         585940.535563 [0001]  migration/1[25]                     0.000      0.001      0.004      0.000
         585940.535678 [0002]  perf[584885]                        0.000      0.000      0.000      0.000
         585940.535686 [0002]  migration/2[31]                     0.000      0.002      0.008      0.000
         585940.535905 [0001]  <idle>                              0.000      0.000      0.342      0.000
         585940.535938 [0003]  perf[584885]                        0.000      0.000      0.000      0.000
         585940.537048 [0001]  sleep[584886]                       0.000      0.019      1.142      0.001
         585940.537749 [0002]  <idle>                              0.000      0.000      2.062      0.000
       ...

  Build:

   - Make libunwind opt-in (LIBUNWIND=1) rather than opt-out.

     The perf tools are generally built with libelf and libdw which has
     unwinder functionality. The libunwind support predates it and no
     need to have duplicate unwinders by default.

   - Rename NO_DWARF=1 build option to NO_LIBDW=1 in order to clarify
     it's using libdw for handling DWARF information.

  Internals:

   - Do not set exclude_guest bit in the perf_event_attr by default.

     This was causing a trouble in AMD IBS PMU as it doesn't support the
     bit. The bit will be set when it's needed later by the fallback
     logic. Also update the missing feature detection logic to make sure
     not clear supported bits unnecessarily.

   - Run perf test in parallel by default and mark flaky tests
     "exclusive" to run them serially at the end. Some test numbers are
     changed but the test can complete in less than half the time.

  JSON vendor events:

   - Add AMD Zen 5 events and metrics.

   - Add i.MX91 and i.MX95 DDR metrics

   - Fix HiSilicon HIP08 Topdown metric name.

   - Support compat events on PowerPC"

* tag 'perf-tools-for-v6.13-2024-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (232 commits)
  perf tests: Fix hwmon parsing with PMU name test
  perf hwmon_pmu: Ensure hwmon key union is zeroed before use
  perf tests hwmon_pmu: Remove double evlist__delete()
  perf/test: fix perf ftrace test on s390
  perf bpf-filter: Return -ENOMEM directly when pfi allocation fails
  perf test: Correct hwmon test PMU detection
  perf: Remove unused del_perf_probe_events()
  perf pmu: Move pmu_metrics_table__find and remove ARM override
  perf jevents: Add map_for_cpu()
  perf header: Pass a perf_cpu rather than a PMU to get_cpuid_str
  perf header: Avoid transitive PMU includes
  perf arm64 header: Use cpu argument in get_cpuid
  perf header: Refactor get_cpuid to take a CPU for ARM
  perf header: Move is_cpu_online to numa bench
  perf jevents: fix breakage when do perf stat on system metric
  perf test: Add missing __exit calls in tool/hwmon tests
  perf tests: Make leader sampling test work without branch event
  perf util: Remove kernel version deadcode
  perf test shell trace_exit_race: Use --no-comm to avoid cases where COMM isn't resolved
  perf test shell trace_exit_race: Show what went wrong in verbose mode
  ...
2024-11-26 14:54:00 -08:00