mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
04e317ba72
15 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
c164b8b404 |
selftests/bpf: Remove explicit setrlimit(RLIMIT_MEMLOCK) in main selftests
As libbpf now is able to automatically take care of RLIMIT_MEMLOCK increase (or skip it altogether on recent enough kernels), remove explicit setrlimit() invocations in bench, test_maps, test_verifier, and test_progs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211214195904.1785155-3-andrii@kernel.org |
||
![]() |
9c42652f8b |
selftests/bpf: Add benchmark for bpf_strncmp() helper
Add benchmark to compare the performance between home-made strncmp() in bpf program and bpf_strncmp() helper. In summary, the performance win of bpf_strncmp() under x86-64 is greater than 18% when the compared string length is greater than 64, and is 179% when the length is 4095. Under arm64 the performance win is even bigger: 33% when the length is greater than 64 and 600% when the length is 4095. The following is the details: no-helper-X: use home-made strncmp() to compare X-sized string helper-Y: use bpf_strncmp() to compare Y-sized string Under x86-64: no-helper-1 3.504 ± 0.000M/s (drops 0.000 ± 0.000M/s) helper-1 3.347 ± 0.001M/s (drops 0.000 ± 0.000M/s) no-helper-8 3.357 ± 0.001M/s (drops 0.000 ± 0.000M/s) helper-8 3.307 ± 0.001M/s (drops 0.000 ± 0.000M/s) no-helper-32 3.064 ± 0.000M/s (drops 0.000 ± 0.000M/s) helper-32 3.253 ± 0.001M/s (drops 0.000 ± 0.000M/s) no-helper-64 2.563 ± 0.001M/s (drops 0.000 ± 0.000M/s) helper-64 3.040 ± 0.001M/s (drops 0.000 ± 0.000M/s) no-helper-128 1.975 ± 0.000M/s (drops 0.000 ± 0.000M/s) helper-128 2.641 ± 0.000M/s (drops 0.000 ± 0.000M/s) no-helper-512 0.759 ± 0.000M/s (drops 0.000 ± 0.000M/s) helper-512 1.574 ± 0.000M/s (drops 0.000 ± 0.000M/s) no-helper-2048 0.329 ± 0.000M/s (drops 0.000 ± 0.000M/s) helper-2048 0.602 ± 0.000M/s (drops 0.000 ± 0.000M/s) no-helper-4095 0.117 ± 0.000M/s (drops 0.000 ± 0.000M/s) helper-4095 0.327 ± 0.000M/s (drops 0.000 ± 0.000M/s) Under arm64: no-helper-1 2.806 ± 0.004M/s (drops 0.000 ± 0.000M/s) helper-1 2.819 ± 0.002M/s (drops 0.000 ± 0.000M/s) no-helper-8 2.797 ± 0.109M/s (drops 0.000 ± 0.000M/s) helper-8 2.786 ± 0.025M/s (drops 0.000 ± 0.000M/s) no-helper-32 2.399 ± 0.011M/s (drops 0.000 ± 0.000M/s) helper-32 2.703 ± 0.002M/s (drops 0.000 ± 0.000M/s) no-helper-64 2.020 ± 0.015M/s (drops 0.000 ± 0.000M/s) helper-64 2.702 ± 0.073M/s (drops 0.000 ± 0.000M/s) no-helper-128 1.604 ± 0.001M/s (drops 0.000 ± 0.000M/s) helper-128 2.516 ± 0.002M/s (drops 0.000 ± 0.000M/s) no-helper-512 0.699 ± 0.000M/s (drops 0.000 ± 0.000M/s) helper-512 2.106 ± 0.003M/s (drops 0.000 ± 0.000M/s) no-helper-2048 0.215 ± 0.000M/s (drops 0.000 ± 0.000M/s) helper-2048 1.223 ± 0.003M/s (drops 0.000 ± 0.000M/s) no-helper-4095 0.112 ± 0.000M/s (drops 0.000 ± 0.000M/s) helper-4095 0.796 ± 0.000M/s (drops 0.000 ± 0.000M/s) Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211210141652.877186-4-houtao1@huawei.com |
||
![]() |
9a93bf3fda |
selftests/bpf: Fix checkpatch error on empty function parameter
Fix checkpatch error: "ERROR: Bad function definition - void foo() should probably be void foo(void)". Most replacements are done by the following command: sed -i 's#\([a-z]\)()$#\1(void)#g' testing/selftests/bpf/benchs/*.c Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211210141652.877186-3-houtao1@huawei.com |
||
![]() |
ec151037af |
selftest/bpf/benchs: Add bpf_loop benchmark
Add benchmark to measure the throughput and latency of the bpf_loop call. Testing this on my dev machine on 1 thread, the data is as follows: nr_loops: 10 bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op nr_loops: 100 bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op nr_loops: 500 bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op nr_loops: 1000 bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op nr_loops: 5000 bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op nr_loops: 10000 bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op nr_loops: 50000 bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op nr_loops: 100000 bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op nr_loops: 500000 bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op nr_loops: 1000000 bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op >From this data, we can see that the latency per loop decreases as the number of loops increases. On this particular machine, each loop had an overhead of about ~4 ns, and we were able to run ~250 million loops per second. Signed-off-by: Joanne Koong <joannekoong@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211130030622.4131246-5-joannekoong@fb.com |
||
![]() |
d41bc48bfa |
selftests/bpf: Add uprobe triggering overhead benchmarks
Add benchmark to measure overhead of uprobes and uretprobes. Also have a baseline (no uprobe attached) benchmark. On my dev machine, baseline benchmark can trigger 130M user_target() invocations. When uprobe is attached, this falls to just 700K. With uretprobe, we get down to 520K: $ sudo ./bench trig-uprobe-base -a Summary: hits 131.289 ± 2.872M/s # UPROBE $ sudo ./bench -a trig-uprobe-without-nop Summary: hits 0.729 ± 0.007M/s $ sudo ./bench -a trig-uprobe-with-nop Summary: hits 1.798 ± 0.017M/s # URETPROBE $ sudo ./bench -a trig-uretprobe-without-nop Summary: hits 0.508 ± 0.012M/s $ sudo ./bench -a trig-uretprobe-with-nop Summary: hits 0.883 ± 0.008M/s So there is almost 2.5x performance difference between probing nop vs non-nop instruction for entry uprobe. And 1.7x difference for uretprobe. This means that non-nop uprobe overhead is around 1.4 microseconds for uprobe and 2 microseconds for non-nop uretprobe. For nop variants, uprobe and uretprobe overhead is down to 0.556 and 1.13 microseconds, respectively. For comparison, just doing a very low-overhead syscall (with no BPF programs attached anywhere) gives: $ sudo ./bench trig-base -a Summary: hits 4.830 ± 0.036M/s So uprobes are about 2.67x slower than pure context switch. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211116013041.4072571-1-andrii@kernel.org |
||
![]() |
f44bc543a0 |
bpf/benchs: Add benchmarks for comparing hashmap lookups w/ vs. w/out bloom filter
This patch adds benchmark tests for comparing the performance of hashmap lookups without the bloom filter vs. hashmap lookups with the bloom filter. Checking the bloom filter first for whether the element exists should overall enable a higher throughput for hashmap lookups, since if the element does not exist in the bloom filter, we can avoid a costly lookup in the hashmap. On average, using 5 hash functions in the bloom filter tended to perform the best across the widest range of different entry sizes. The benchmark results using 5 hash functions (running on 8 threads on a machine with one numa node, and taking the average of 3 runs) were roughly as follows: value_size = 4 bytes - 10k entries: 30% faster 50k entries: 40% faster 100k entries: 40% faster 500k entres: 70% faster 1 million entries: 90% faster 5 million entries: 140% faster value_size = 8 bytes - 10k entries: 30% faster 50k entries: 40% faster 100k entries: 50% faster 500k entres: 80% faster 1 million entries: 100% faster 5 million entries: 150% faster value_size = 16 bytes - 10k entries: 20% faster 50k entries: 30% faster 100k entries: 35% faster 500k entres: 65% faster 1 million entries: 85% faster 5 million entries: 110% faster value_size = 40 bytes - 10k entries: 5% faster 50k entries: 15% faster 100k entries: 20% faster 500k entres: 65% faster 1 million entries: 75% faster 5 million entries: 120% faster Signed-off-by: Joanne Koong <joannekoong@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211027234504.30744-6-joannekoong@fb.com |
||
![]() |
57fd1c63c9 |
bpf/benchs: Add benchmark tests for bloom filter throughput + false positive
This patch adds benchmark tests for the throughput (for lookups + updates) and the false positive rate of bloom filter lookups, as well as some minor refactoring of the bash script for running the benchmarks. These benchmarks show that as the number of hash functions increases, the throughput and the false positive rate of the bloom filter decreases. >From the benchmark data, the approximate average false-positive rates are roughly as follows: 1 hash function = ~30% 2 hash functions = ~15% 3 hash functions = ~5% 4 hash functions = ~2.5% 5 hash functions = ~1% 6 hash functions = ~0.5% 7 hash functions = ~0.35% 8 hash functions = ~0.15% 9 hash functions = ~0.1% 10 hash functions = ~0% For reference data, the benchmarks run on one thread on a machine with one numa node for 1 to 5 hash functions for 8-byte and 64-byte values are as follows: 1 hash function: 50k entries 8-byte value Lookups - 51.1 M/s operations Updates - 33.6 M/s operations False positive rate: 24.15% 64-byte value Lookups - 15.7 M/s operations Updates - 15.1 M/s operations False positive rate: 24.2% 100k entries 8-byte value Lookups - 51.0 M/s operations Updates - 33.4 M/s operations False positive rate: 24.04% 64-byte value Lookups - 15.6 M/s operations Updates - 14.6 M/s operations False positive rate: 24.06% 500k entries 8-byte value Lookups - 50.5 M/s operations Updates - 33.1 M/s operations False positive rate: 27.45% 64-byte value Lookups - 15.6 M/s operations Updates - 14.2 M/s operations False positive rate: 27.42% 1 mil entries 8-byte value Lookups - 49.7 M/s operations Updates - 32.9 M/s operations False positive rate: 27.45% 64-byte value Lookups - 15.4 M/s operations Updates - 13.7 M/s operations False positive rate: 27.58% 2.5 mil entries 8-byte value Lookups - 47.2 M/s operations Updates - 31.8 M/s operations False positive rate: 30.94% 64-byte value Lookups - 15.3 M/s operations Updates - 13.2 M/s operations False positive rate: 30.95% 5 mil entries 8-byte value Lookups - 41.1 M/s operations Updates - 28.1 M/s operations False positive rate: 31.01% 64-byte value Lookups - 13.3 M/s operations Updates - 11.4 M/s operations False positive rate: 30.98% 2 hash functions: 50k entries 8-byte value Lookups - 34.1 M/s operations Updates - 20.1 M/s operations False positive rate: 9.13% 64-byte value Lookups - 8.4 M/s operations Updates - 7.9 M/s operations False positive rate: 9.21% 100k entries 8-byte value Lookups - 33.7 M/s operations Updates - 18.9 M/s operations False positive rate: 9.13% 64-byte value Lookups - 8.4 M/s operations Updates - 7.7 M/s operations False positive rate: 9.19% 500k entries 8-byte value Lookups - 32.7 M/s operations Updates - 18.1 M/s operations False positive rate: 12.61% 64-byte value Lookups - 8.4 M/s operations Updates - 7.5 M/s operations False positive rate: 12.61% 1 mil entries 8-byte value Lookups - 30.6 M/s operations Updates - 18.9 M/s operations False positive rate: 12.54% 64-byte value Lookups - 8.0 M/s operations Updates - 7.0 M/s operations False positive rate: 12.52% 2.5 mil entries 8-byte value Lookups - 25.3 M/s operations Updates - 16.7 M/s operations False positive rate: 16.77% 64-byte value Lookups - 7.9 M/s operations Updates - 6.5 M/s operations False positive rate: 16.88% 5 mil entries 8-byte value Lookups - 20.8 M/s operations Updates - 14.7 M/s operations False positive rate: 16.78% 64-byte value Lookups - 7.0 M/s operations Updates - 6.0 M/s operations False positive rate: 16.78% 3 hash functions: 50k entries 8-byte value Lookups - 25.1 M/s operations Updates - 14.6 M/s operations False positive rate: 7.65% 64-byte value Lookups - 5.8 M/s operations Updates - 5.5 M/s operations False positive rate: 7.58% 100k entries 8-byte value Lookups - 24.7 M/s operations Updates - 14.1 M/s operations False positive rate: 7.71% 64-byte value Lookups - 5.8 M/s operations Updates - 5.3 M/s operations False positive rate: 7.62% 500k entries 8-byte value Lookups - 22.9 M/s operations Updates - 13.9 M/s operations False positive rate: 2.62% 64-byte value Lookups - 5.6 M/s operations Updates - 4.8 M/s operations False positive rate: 2.7% 1 mil entries 8-byte value Lookups - 19.8 M/s operations Updates - 12.6 M/s operations False positive rate: 2.60% 64-byte value Lookups - 5.3 M/s operations Updates - 4.4 M/s operations False positive rate: 2.69% 2.5 mil entries 8-byte value Lookups - 16.2 M/s operations Updates - 10.7 M/s operations False positive rate: 4.49% 64-byte value Lookups - 4.9 M/s operations Updates - 4.1 M/s operations False positive rate: 4.41% 5 mil entries 8-byte value Lookups - 18.8 M/s operations Updates - 9.2 M/s operations False positive rate: 4.45% 64-byte value Lookups - 5.2 M/s operations Updates - 3.9 M/s operations False positive rate: 4.54% 4 hash functions: 50k entries 8-byte value Lookups - 19.7 M/s operations Updates - 11.1 M/s operations False positive rate: 1.01% 64-byte value Lookups - 4.4 M/s operations Updates - 4.0 M/s operations False positive rate: 1.00% 100k entries 8-byte value Lookups - 19.5 M/s operations Updates - 10.9 M/s operations False positive rate: 1.00% 64-byte value Lookups - 4.3 M/s operations Updates - 3.9 M/s operations False positive rate: 0.97% 500k entries 8-byte value Lookups - 18.2 M/s operations Updates - 10.6 M/s operations False positive rate: 2.05% 64-byte value Lookups - 4.3 M/s operations Updates - 3.7 M/s operations False positive rate: 2.05% 1 mil entries 8-byte value Lookups - 15.5 M/s operations Updates - 9.6 M/s operations False positive rate: 1.99% 64-byte value Lookups - 4.0 M/s operations Updates - 3.4 M/s operations False positive rate: 1.99% 2.5 mil entries 8-byte value Lookups - 13.8 M/s operations Updates - 7.7 M/s operations False positive rate: 3.91% 64-byte value Lookups - 3.7 M/s operations Updates - 3.6 M/s operations False positive rate: 3.78% 5 mil entries 8-byte value Lookups - 13.0 M/s operations Updates - 6.9 M/s operations False positive rate: 3.93% 64-byte value Lookups - 3.5 M/s operations Updates - 3.7 M/s operations False positive rate: 3.39% 5 hash functions: 50k entries 8-byte value Lookups - 16.4 M/s operations Updates - 9.1 M/s operations False positive rate: 0.78% 64-byte value Lookups - 3.5 M/s operations Updates - 3.2 M/s operations False positive rate: 0.77% 100k entries 8-byte value Lookups - 16.3 M/s operations Updates - 9.0 M/s operations False positive rate: 0.79% 64-byte value Lookups - 3.5 M/s operations Updates - 3.2 M/s operations False positive rate: 0.78% 500k entries 8-byte value Lookups - 15.1 M/s operations Updates - 8.8 M/s operations False positive rate: 1.82% 64-byte value Lookups - 3.4 M/s operations Updates - 3.0 M/s operations False positive rate: 1.78% 1 mil entries 8-byte value Lookups - 13.2 M/s operations Updates - 7.8 M/s operations False positive rate: 1.81% 64-byte value Lookups - 3.2 M/s operations Updates - 2.8 M/s operations False positive rate: 1.80% 2.5 mil entries 8-byte value Lookups - 10.5 M/s operations Updates - 5.9 M/s operations False positive rate: 0.29% 64-byte value Lookups - 3.2 M/s operations Updates - 2.4 M/s operations False positive rate: 0.28% 5 mil entries 8-byte value Lookups - 9.6 M/s operations Updates - 5.7 M/s operations False positive rate: 0.30% 64-byte value Lookups - 3.2 M/s operations Updates - 2.7 M/s operations False positive rate: 0.30% Signed-off-by: Joanne Koong <joannekoong@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211027234504.30744-5-joannekoong@fb.com |
||
![]() |
bad2e478af |
selftests/bpf: Turn on libbpf 1.0 mode and fix all IS_ERR checks
Turn ony libbpf 1.0 mode. Fix all the explicit IS_ERR checks that now will be broken because libbpf returns NULL on error (and sets errno). Fix ASSERT_OK_PTR and ASSERT_ERR_PTR to work for both old mode and new modes and use them throughout selftests. This is trivial to do by using libbpf_get_error() API that all libbpf users are supposed to use, instead of IS_ERR checks. A bunch of checks also did explicit -1 comparison for various fd-returning APIs. Such checks are replaced with >= 0 or < 0 cases. There were also few misuses of bpf_object__find_map_by_name() in test_maps. Those are fixed in this patch as well. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210525035935.1461796-3-andrii@kernel.org |
||
![]() |
b000def2e0 |
selftests: Remove fmod_ret from test_overhead
The test_overhead prog_test included an fmod_ret program that attached to
__set_task_comm() in the kernel. However, this function was never listed as
allowed for return modification, so this only worked because of the
verifier skipping tests when a trampoline already existed for the attach
point. Now that the verifier checks have been fixed, remove fmod_ret from
the test so it works again.
Fixes:
|
||
![]() |
e68a144547 |
selftests/bpf: Add sleepable tests
Modify few tests to sanity test sleepable bpf functionality. Running 'bench trig-fentry-sleep' vs 'bench trig-fentry' and 'perf report': sleepable with SRCU: 3.86% bench [k] __srcu_read_unlock 3.22% bench [k] __srcu_read_lock 0.92% bench [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry_sleep 0.50% bench [k] bpf_trampoline_10297 0.26% bench [k] __bpf_prog_exit_sleepable 0.21% bench [k] __bpf_prog_enter_sleepable sleepable with RCU_TRACE: 0.79% bench [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry_sleep 0.72% bench [k] bpf_trampoline_10381 0.31% bench [k] __bpf_prog_exit_sleepable 0.29% bench [k] __bpf_prog_enter_sleepable non-sleepable with RCU: 0.88% bench [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry 0.84% bench [k] bpf_trampoline_10297 0.13% bench [k] __bpf_prog_enter 0.12% bench [k] __bpf_prog_exit Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: KP Singh <kpsingh@google.com> Link: https://lore.kernel.org/bpf/20200827220114.69225-6-alexei.starovoitov@gmail.com |
||
![]() |
c97099b0f2 |
bpf: Add BPF ringbuf and perf buffer benchmarks
Extend bench framework with ability to have benchmark-provided child argument parser for custom benchmark-specific parameters. This makes bench generic code modular and independent from any specific benchmark. Also implement a set of benchmarks for new BPF ring buffer and existing perf buffer. 4 benchmarks were implemented: 2 variations for each of BPF ringbuf and perfbuf:, - rb-libbpf utilizes stock libbpf ring_buffer manager for reading data; - rb-custom implements custom ring buffer setup and reading code, to eliminate overheads inherent in generic libbpf code due to callback functions and the need to update consumer position after each consumed record, instead of batching updates (due to pessimistic assumption that user callback might take long time and thus could unnecessarily hold ring buffer space for too long); - pb-libbpf uses stock libbpf perf_buffer code with all the default settings, though uses higher-performance raw event callback to minimize unnecessary overhead; - pb-custom implements its own custom consumer code to minimize any possible overhead of generic libbpf implementation and indirect function calls. All of the test support default, no data notification skipped, mode, as well as sampled mode (with --rb-sampled flag), which allows to trigger epoll notification less frequently and reduce overhead. As will be shown, this mode is especially critical for perf buffer, which suffers from high overhead of wakeups in kernel. Otherwise, all benchamrks implement similar way to generate a batch of records by using fentry/sys_getpgid BPF program, which pushes a bunch of records in a tight loop and records number of successful and dropped samples. Each record is a small 8-byte integer, to minimize the effect of memory copying with bpf_perf_event_output() and bpf_ringbuf_output(). Benchmarks that have only one producer implement optional back-to-back mode, in which record production and consumption is alternating on the same CPU. This is the highest-throughput happy case, showing ultimate performance achievable with either BPF ringbuf or perfbuf. All the below scenarios are implemented in a script in benchs/run_bench_ringbufs.sh. Tests were performed on 28-core/56-thread Intel Xeon CPU E5-2680 v4 @ 2.40GHz CPU. Single-producer, parallel producer ================================== rb-libbpf 12.054 ± 0.320M/s (drops 0.000 ± 0.000M/s) rb-custom 8.158 ± 0.118M/s (drops 0.001 ± 0.003M/s) pb-libbpf 0.931 ± 0.007M/s (drops 0.000 ± 0.000M/s) pb-custom 0.965 ± 0.003M/s (drops 0.000 ± 0.000M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 11.563 ± 0.067M/s (drops 0.000 ± 0.000M/s) rb-custom 15.895 ± 0.076M/s (drops 0.000 ± 0.000M/s) pb-libbpf 9.889 ± 0.032M/s (drops 0.000 ± 0.000M/s) pb-custom 9.866 ± 0.028M/s (drops 0.000 ± 0.000M/s) Single producer on one CPU, consumer on another one, both running at full speed. Curiously, rb-libbpf has higher throughput than objectively faster (due to more lightweight consumer code path) rb-custom. It appears that faster consumer causes kernel to send notifications more frequently, because consumer appears to be caught up more frequently. Performance of perfbuf suffers from default "no sampling" policy and huge overhead that causes. In sampled mode, rb-custom is winning very significantly eliminating too frequent in-kernel wakeups, the gain appears to be more than 2x. Perf buffer achieves even more impressive wins, compared to stock perfbuf settings, with 10x improvements in throughput with 1:500 sampling rate. The trade-off is that with sampling, application might not get next X events until X+1st arrives, which is not always acceptable. With steady influx of events, though, this shouldn't be a problem. Overall, single-producer performance of ring buffers seems to be better no matter the sampled/non-sampled modes, but it especially beats ring buffer without sampling due to its adaptive notification approach. Single-producer, back-to-back mode ================================== rb-libbpf 15.507 ± 0.247M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 14.692 ± 0.195M/s (drops 0.000 ± 0.000M/s) rb-custom 21.449 ± 0.157M/s (drops 0.000 ± 0.000M/s) rb-custom-sampled 20.024 ± 0.386M/s (drops 0.000 ± 0.000M/s) pb-libbpf 1.601 ± 0.015M/s (drops 0.000 ± 0.000M/s) pb-libbpf-sampled 8.545 ± 0.064M/s (drops 0.000 ± 0.000M/s) pb-custom 1.607 ± 0.022M/s (drops 0.000 ± 0.000M/s) pb-custom-sampled 8.988 ± 0.144M/s (drops 0.000 ± 0.000M/s) Here we test a back-to-back mode, which is arguably best-case scenario both for BPF ringbuf and perfbuf, because there is no contention and for ringbuf also no excessive notification, because consumer appears to be behind after the first record. For ringbuf, custom consumer code clearly wins with 21.5 vs 16 million records per second exchanged between producer and consumer. Sampled mode actually hurts a bit due to slightly slower producer logic (it needs to fetch amount of data available to decide whether to skip or force notification). Perfbuf with wakeup sampling gets 5.5x throughput increase, compared to no-sampling version. There also doesn't seem to be noticeable overhead from generic libbpf handling code. Perfbuf back-to-back, effect of sample rate =========================================== pb-sampled-1 1.035 ± 0.012M/s (drops 0.000 ± 0.000M/s) pb-sampled-5 3.476 ± 0.087M/s (drops 0.000 ± 0.000M/s) pb-sampled-10 5.094 ± 0.136M/s (drops 0.000 ± 0.000M/s) pb-sampled-25 7.118 ± 0.153M/s (drops 0.000 ± 0.000M/s) pb-sampled-50 8.169 ± 0.156M/s (drops 0.000 ± 0.000M/s) pb-sampled-100 8.887 ± 0.136M/s (drops 0.000 ± 0.000M/s) pb-sampled-250 9.180 ± 0.209M/s (drops 0.000 ± 0.000M/s) pb-sampled-500 9.353 ± 0.281M/s (drops 0.000 ± 0.000M/s) pb-sampled-1000 9.411 ± 0.217M/s (drops 0.000 ± 0.000M/s) pb-sampled-2000 9.464 ± 0.167M/s (drops 0.000 ± 0.000M/s) pb-sampled-3000 9.575 ± 0.273M/s (drops 0.000 ± 0.000M/s) This benchmark shows the effect of event sampling for perfbuf. Back-to-back mode for highest throughput. Just doing every 5th record notification gives 3.5x speed up. 250-500 appears to be the point of diminishing return, with almost 9x speed up. Most benchmarks use 500 as the default sampling for pb-raw and pb-custom. Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.106 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 4.746 ± 0.149M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 7.706 ± 0.164M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 12.893 ± 0.273M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 15.961 ± 0.361M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 18.203 ± 0.445M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 19.962 ± 0.786M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 20.881 ± 0.551M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 21.317 ± 0.532M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 21.331 ± 0.535M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 21.688 ± 0.392M/s (drops 0.000 ± 0.000M/s) Similar benchmark for ring buffer also shows a great advantage (in terms of throughput) of skipping notifications. Skipping every 5th one gives 4x boost. Also similar to perfbuf case, 250-500 seems to be the point of diminishing returns, giving roughly 20x better results. Keep in mind, for this test, notifications are controlled manually with BPF_RB_NO_WAKEUP and BPF_RB_FORCE_WAKEUP. As can be seen from previous benchmarks, adaptive notifications based on consumer's positions provides same (or even slightly better due to simpler load generator on BPF side) benefits in favorable back-to-back scenario. Over zealous and fast consumer, which is almost always caught up, will make thoughput numbers smaller. That's the case when manual notification control might prove to be extremely beneficial. Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 22.819 ± 0.503M/s (drops 0.000 ± 0.000M/s) output 18.906 ± 0.433M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 15.350 ± 0.132M/s (drops 0.000 ± 0.000M/s) output-sampled 14.195 ± 0.144M/s (drops 0.000 ± 0.000M/s) BPF ringbuf supports two sets of APIs with various usability and performance tradeoffs: bpf_ringbuf_reserve()+bpf_ringbuf_commit() vs bpf_ringbuf_output(). This benchmark clearly shows superiority of reserve+commit approach, despite using a small 8-byte record size. Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 3.045 ± 0.020M/s (drops 3.536 ± 0.148M/s) rb-custom 3.055 ± 0.022M/s (drops 3.893 ± 0.066M/s) pb-libbpf 1.393 ± 0.024M/s (drops 0.000 ± 0.000M/s) pb-custom 1.407 ± 0.016M/s (drops 0.000 ± 0.000M/s) This benchmark shows one of the worst-case scenarios, in which producer and consumer do not coordinate *and* fight for the same CPU. No batch count and sampling settings were able to eliminate drops for ringbuffer, producer is just too fast for consumer to keep up. But ringbuf and perfbuf still able to pass through quite a lot of messages, which is more than enough for a lot of applications. Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 10.916 ± 0.399M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 4.931 ± 0.030M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 4.880 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 3.926 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 4.011 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.967 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 2.604 ± 0.030M/s (drops 0.001 ± 0.002M/s) rb-libbpf nr_prod 20 2.233 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.085 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.055 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.962 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.089 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.118 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.105 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.120 ± 0.058M/s (drops 0.000 ± 0.001M/s) rb-libbpf nr_prod 52 2.074 ± 0.024M/s (drops 0.007 ± 0.014M/s) Ringbuf uses a very short-duration spinlock during reservation phase, to check few invariants, increment producer count and set record header. This is the biggest point of contention for ringbuf implementation. This benchmark evaluates the effect of multiple competing writers on overall throughput of a single shared ringbuffer. Overall throughput drops almost 2x when going from single to two highly-contended producers, gradually dropping with additional competing producers. Performance drop stabilizes at around 20 producers and hovers around 2mln even with 50+ fighting producers, which is a 5x drop compared to non-contended case. Good kernel implementation in kernel helps maintain decent performance here. Note, that in the intended real-world scenarios, it's not expected to get even close to such a high levels of contention. But if contention will become a problem, there is always an option of sharding few ring buffers across a set of CPUs. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200529075424.3139988-5-andriin@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
5b0004d92b |
selftest/bpf: Fix spelling mistake "SIGALARM" -> "SIGALRM"
There is a spelling mistake in an error message, fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200514121529.259668-1-colin.king@canonical.com |
||
![]() |
c5d420c32c |
selftest/bpf: Add BPF triggering benchmark
It is sometimes desirable to be able to trigger BPF program from user-space with minimal overhead. sys_enter would seem to be a good candidate, yet in a lot of cases there will be a lot of noise from syscalls triggered by other processes on the system. So while searching for low-overhead alternative, I've stumbled upon getpgid() syscall, which seems to be specific enough to not suffer from accidental syscall by other apps. This set of benchmarks compares tp, raw_tp w/ filtering by syscall ID, kprobe, fentry and fmod_ret with returning error (so that syscall would not be executed), to determine the lowest-overhead way. Here are results on my machine (using benchs/run_bench_trigger.sh script): base : 9.200 ± 0.319M/s tp : 6.690 ± 0.125M/s rawtp : 8.571 ± 0.214M/s kprobe : 6.431 ± 0.048M/s fentry : 8.955 ± 0.241M/s fmodret : 8.903 ± 0.135M/s So it seems like fmodret doesn't give much benefit for such lightweight syscall. Raw tracepoint is pretty decent despite additional filtering logic, but it will be called for any other syscall in the system, which rules it out. Fentry, though, seems to be adding the least amoung of overhead and achieves 97.3% of performance of baseline no-BPF-attached syscall. Using getpgid() seems to be preferable to set_task_comm() approach from test_overhead, as it's about 2.35x faster in a baseline performance. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200512192445.2351848-5-andriin@fb.com |
||
![]() |
4eaf0b5c5e |
selftest/bpf: Fmod_ret prog and implement test_overhead as part of bench
Add fmod_ret BPF program to existing test_overhead selftest. Also re-implement user-space benchmarking part into benchmark runner to compare results. Results with ./bench are consistently somewhat lower than test_overhead's, but relative performance of various types of BPF programs stay consisten (e.g., kretprobe is noticeably slower). This slowdown seems to be coming from the fact that test_overhead is single-threaded, while benchmark always spins off at least one thread for producer. This has been confirmed by hacking multi-threaded test_overhead variant and also single-threaded bench variant. Resutls are below. run_bench_rename.sh script from benchs/ subdirectory was used to produce results for ./bench. Single-threaded implementations =============================== /* bench: single-threaded, atomics */ base : 4.622 ± 0.049M/s kprobe : 3.673 ± 0.052M/s kretprobe : 2.625 ± 0.052M/s rawtp : 4.369 ± 0.089M/s fentry : 4.201 ± 0.558M/s fexit : 4.309 ± 0.148M/s fmodret : 4.314 ± 0.203M/s /* selftest: single-threaded, no atomics */ task_rename base 4555K events per sec task_rename kprobe 3643K events per sec task_rename kretprobe 2506K events per sec task_rename raw_tp 4303K events per sec task_rename fentry 4307K events per sec task_rename fexit 4010K events per sec task_rename fmod_ret 3984K events per sec Multi-threaded implementations ============================== /* bench: multi-threaded w/ atomics */ base : 3.910 ± 0.023M/s kprobe : 3.048 ± 0.037M/s kretprobe : 2.300 ± 0.015M/s rawtp : 3.687 ± 0.034M/s fentry : 3.740 ± 0.087M/s fexit : 3.510 ± 0.009M/s fmodret : 3.485 ± 0.050M/s /* selftest: multi-threaded w/ atomics */ task_rename base 3872K events per sec task_rename kprobe 3068K events per sec task_rename kretprobe 2350K events per sec task_rename raw_tp 3731K events per sec task_rename fentry 3639K events per sec task_rename fexit 3558K events per sec task_rename fmod_ret 3511K events per sec /* selftest: multi-threaded, no atomics */ task_rename base 3945K events per sec task_rename kprobe 3298K events per sec task_rename kretprobe 2451K events per sec task_rename raw_tp 3718K events per sec task_rename fentry 3782K events per sec task_rename fexit 3543K events per sec task_rename fmod_ret 3526K events per sec Note that the fact that ./bench benchmark always uses atomic increments for counting, while test_overhead doesn't, doesn't influence test results all that much. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200512192445.2351848-4-andriin@fb.com |
||
![]() |
8e7c2a023a |
selftests/bpf: Add benchmark runner infrastructure
While working on BPF ringbuf implementation, testing, and benchmarking, I've developed a pretty generic and modular benchmark runner, which seems to be generically useful, as I've already used it for one more purpose (testing fastest way to trigger BPF program, to minimize overhead of in-kernel code). This patch adds generic part of benchmark runner and sets up Makefile for extending it with more sets of benchmarks. Benchmarker itself operates by spinning up specified number of producer and consumer threads, setting up interval timer sending SIGALARM signal to application once a second. Every second, current snapshot with hits/drops counters are collected and stored in an array. Drops are useful for producer/consumer benchmarks in which producer might overwhelm consumers. Once test finishes after given amount of warm-up and testing seconds, mean and stddev are calculated (ignoring warm-up results) and is printed out to stdout. This setup seems to give consistent and accurate results. To validate behavior, I added two atomic counting tests: global and local. For global one, all the producer threads are atomically incrementing same counter as fast as possible. This, of course, leads to huge drop of performance once there is more than one producer thread due to CPUs fighting for the same memory location. Local counting, on the other hand, maintains one counter per each producer thread, incremented independently. Once per second, all counters are read and added together to form final "counting throughput" measurement. As expected, such setup demonstrates linear scalability with number of producers (as long as there are enough physical CPU cores, of course). See example output below. Also, this setup can nicely demonstrate disastrous effects of false sharing, if care is not taken to take those per-producer counters apart into independent cache lines. Demo output shows global counter first with 1 producer, then with 4. Both total and per-producer performance significantly drop. The last run is local counter with 4 producers, demonstrating near-perfect scalability. $ ./bench -a -w1 -d2 -p1 count-global Setting up benchmark 'count-global'... Benchmark 'count-global' started. Iter 0 ( 24.822us): hits 148.179M/s (148.179M/prod), drops 0.000M/s Iter 1 ( 37.939us): hits 149.308M/s (149.308M/prod), drops 0.000M/s Iter 2 (-10.774us): hits 150.717M/s (150.717M/prod), drops 0.000M/s Iter 3 ( 3.807us): hits 151.435M/s (151.435M/prod), drops 0.000M/s Summary: hits 150.488 ± 1.079M/s (150.488M/prod), drops 0.000 ± 0.000M/s $ ./bench -a -w1 -d2 -p4 count-global Setting up benchmark 'count-global'... Benchmark 'count-global' started. Iter 0 ( 60.659us): hits 53.910M/s ( 13.477M/prod), drops 0.000M/s Iter 1 (-17.658us): hits 53.722M/s ( 13.431M/prod), drops 0.000M/s Iter 2 ( 5.865us): hits 53.495M/s ( 13.374M/prod), drops 0.000M/s Iter 3 ( 0.104us): hits 53.606M/s ( 13.402M/prod), drops 0.000M/s Summary: hits 53.608 ± 0.113M/s ( 13.402M/prod), drops 0.000 ± 0.000M/s $ ./bench -a -w1 -d2 -p4 count-local Setting up benchmark 'count-local'... Benchmark 'count-local' started. Iter 0 ( 23.388us): hits 640.450M/s (160.113M/prod), drops 0.000M/s Iter 1 ( 2.291us): hits 605.661M/s (151.415M/prod), drops 0.000M/s Iter 2 ( -6.415us): hits 607.092M/s (151.773M/prod), drops 0.000M/s Iter 3 ( -1.361us): hits 601.796M/s (150.449M/prod), drops 0.000M/s Summary: hits 604.849 ± 2.739M/s (151.212M/prod), drops 0.000 ± 0.000M/s Benchmark runner supports setting thread affinity for producer and consumer threads. You can use -a flag for default CPU selection scheme, where first consumer gets CPU #0, next one gets CPU #1, and so on. Then producer threads pick up next CPU and increment one-by-one as well. But user can also specify a set of CPUs independently for producers and consumers with --prod-affinity 1,2-10,15 and --cons-affinity <set-of-cpus>. The latter allows to force producers and consumers to share same set of CPUs, if necessary. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200512192445.2351848-3-andriin@fb.com |