linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-07 06:59:25 +08:00

Author	SHA1	Message	Date
Andrii Nakryiko	8e432e6197	bpf: Ensure precise is reset to false in __mark_reg_const_zero() It is safe to always start with imprecise SCALAR_VALUE register. Previously __mark_reg_const_zero() relied on caller to reset precise mark, but it's very error prone and we already missed it in a few places. So instead make __mark_reg_const_zero() reset precision always, as it's a safe default for SCALAR_VALUE. Explanation is basically the same as for why we are resetting (or rather not setting) precision in current state. If necessary, precision propagation will set it to precise correctly. As such, also remove a big comment about forward precision propagation in mark_reg_stack_read() and avoid unnecessarily setting precision to true after reading from STACK_ZERO stack. Again, precision propagation will correctly handle this, if that SCALAR_VALUE register will ever be needed to be precise. Reported-by: Maxim Mikityanskiy <maxtram95@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Maxim Mikityanskiy <maxtram95@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231218173601.53047-1-andrii@kernel.org	2023-12-18 23:54:21 +01:00
Andrii Nakryiko	6079ae6376	Merge branch 'bpf-add-check-for-negative-uprobe-multi-offset' Jiri Olsa says: ==================== bpf: Add check for negative uprobe multi offset hi, adding the check for negative offset for uprobe multi link. v2 changes: - add more failure checks [Alan] - move the offset retrieval/check up in the loop to be done earlier [Song] thanks, jirka --- ==================== Link: https://lore.kernel.org/r/20231217215538.3361991-1-jolsa@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2023-12-18 09:52:17 -08:00
Jiri Olsa	f17d1a18a3	selftests/bpf: Add more uprobe multi fail tests We fail to create uprobe if we pass negative offset. Add more tests validating kernel-side error checking code. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20231217215538.3361991-3-jolsa@kernel.org	2023-12-18 09:51:50 -08:00
Jiri Olsa	3983c00281	bpf: Fail uprobe multi link with negative offset Currently the __uprobe_register will return 0 (success) when called with negative offset. The reason is that the call to register_for_each_vma and then build_map_info won't return error for negative offset. They just won't do anything - no matching vma is found so there's no registered breakpoint for the uprobe. I don't think we can change the behaviour of __uprobe_register and fail for negative uprobe offset, because apps might depend on that already. But I think we can still make the change and check for it on bpf multi link syscall level. Also moving the __get_user call and check for the offsets to the top of loop, to fail early without extra __get_user calls for ref_ctr_offset and cookie arrays. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20231217215538.3361991-2-jolsa@kernel.org	2023-12-18 09:51:30 -08:00
Hou Tao	e58aac1a9a	selftests/bpf: Test the release of map btf When there is bpf_list_head or bpf_rb_root field in map value, the free of map btf and the free of map value may run concurrently and there may be use-after-free problem, so add two test cases to demonstrate it. And the use-after-free problem can be easily reproduced by using bpf_next tree and a KASAN-enabled kernel. The first test case tests the racing between the free of map btf and the free of array map. It constructs the racing by releasing the array map in the end after other ref-counter of map btf has been released. To delay the free of array map and make it be invoked after btf_free_rcu() is invoked, it stresses system_unbound_wq by closing multiple percpu array maps before it closes the array map. The second case tests the racing between the free of map btf and the free of inner map. Beside using the similar method as the first one does, it uses bpf_map_delete_elem() to delete the inner map and to defer the release of inner map after one RCU grace period. The reason for using two skeletons is to prevent the release of outer map and inner map in map_in_map_btf.c interfering the release of bpf map in normal_map_btf.c. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20231216035510.4030605-1-houtao@huaweicloud.com	2023-12-18 18:15:49 +01:00
Alexei Starovoitov	0c970ed2f8	s390/bpf: Fix indirect trampoline generation The func_addr used to be NULL for indirect trampolines used by struct_ops. Now func_addr is a valid function pointer. Hence use BPF_TRAMP_F_INDIRECT flag to detect such condition. Fixes: `2cd3e3772e` ("x86/cfi,bpf: Fix bpf_struct_ops CFI") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/bpf/20231216004549.78355-1-alexei.starovoitov@gmail.com	2023-12-18 12:00:37 +01:00
Alexei Starovoitov	42d45c4562	selftests/bpf: Temporarily disable dummy_struct_ops test on s390 Temporarily disable dummy_struct_ops test on s390. The breakage is likely due to commit `2cd3e3772e` ("x86/cfi,bpf: Fix bpf_struct_ops CFI"). Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:28:25 -08:00
Alexei Starovoitov	3c302e14bd	Merge branch 'x86-cfi-bpf-fix-cfi-vs-ebpf' Peter Zijlstra says: ==================== x86/cfi,bpf: Fix CFI vs eBPF Hi! What started with the simple observation that bpf_dispatcher__func() was broken for calling CFI functions with a __nocfi calling context for FineIBT ended up with a complete BPF wide CFI fixup. With these changes on the BPF selftest suite passes without crashing -- there's still a few failures, but Alexei has graciously offered to look into those. (Alexei, I have presumed your SoB on the very last patch, please update as you see fit) Changes since v2 are numerous but include: - cfi_get_offset() -- as a means to communicate the offset (ast) - 5 new patches fixing various BPF internals to be CFI clean Note: it *might* be possible to merge the bpf_bpf_tcp_ca.c:unsupported_ops[] thing into the CFI stubs, as is get_info will have a NULL stub, unlike the others. --- arch/riscv/include/asm/cfi.h | 3 +- arch/riscv/kernel/cfi.c | 2 +- arch/x86/include/asm/cfi.h | 126 +++++++++++++++++++++++++++++++++++++- arch/x86/kernel/alternative.c | 87 +++++++++++++++++++++++--- arch/x86/kernel/cfi.c | 4 +- arch/x86/net/bpf_jit_comp.c | 134 +++++++++++++++++++++++++++++++++++------ include/asm-generic/Kbuild | 1 + include/linux/bpf.h | 27 ++++++++- include/linux/cfi.h | 12 ++++ kernel/bpf/bpf_struct_ops.c | 16 ++--- kernel/bpf/core.c | 25 ++++++++ kernel/bpf/cpumask.c | 8 ++- kernel/bpf/helpers.c | 18 +++++- net/bpf/bpf_dummy_struct_ops.c | 31 +++++++++- net/bpf/test_run.c | 15 ++++- net/ipv4/bpf_tcp_ca.c | 69 +++++++++++++++++++++ 16 files changed, 528 insertions(+), 50 deletions(-) ==================== Link: https://lore.kernel.org/r/20231215091216.135791411@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:56 -08:00
Alexei Starovoitov	852486b35f	x86/cfi,bpf: Fix bpf_exception_cb() signature As per the earlier patches, BPF sub-programs have bpf_callback_t signature and CFI expects callers to have matching signature. This is violated by bpf_prog_aux::bpf_exception_cb(). [peterz: Changelog] Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAADnVQ+Z7UcXXBBhMubhcMM=R-dExk-uHtfOLtoLxQ1XxEpqEA@mail.gmail.com Link: https://lore.kernel.org/r/20231215092707.910319166@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Peter Zijlstra	e4c0033989	bpf: Fix dtor CFI Ensure the various dtor functions match their prototype and retain their CFI signatures, since they don't have their address taken, they are prone to not getting CFI, making them impossible to call indirectly. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.799451071@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Peter Zijlstra	e9d13b9d2f	cfi: Add CFI_NOSEAL() Add a CFI_NOSEAL() helper to mark functions that need to retain their CFI information, despite not otherwise leaking their address. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.669401084@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Peter Zijlstra	2cd3e3772e	x86/cfi,bpf: Fix bpf_struct_ops CFI BPF struct_ops uses __arch_prepare_bpf_trampoline() to write trampolines for indirect function calls. These trampolines must have matching CFI. In order to obtain the correct CFI hash for the various methods, add a matching structure that contains stub functions, the compiler will generate correct CFI which we can pilfer for the trampolines. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Peter Zijlstra	e72d88d18d	x86/cfi,bpf: Fix bpf_callback_t CFI Where the main BPF program is expected to match bpf_func_t, sub-programs are expected to match bpf_callback_t. This fixes things like: tools/testing/selftests/bpf/progs/bloom_filter_bench.c: bpf_for_each_map_elem(&array_map, bloom_callback, &data, 0); Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.451956710@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Peter Zijlstra	4f9087f166	x86/cfi,bpf: Fix BPF JIT call The current BPF call convention is __nocfi, except when it calls !JIT things, then it calls regular C functions. It so happens that with FineIBT the __nocfi and C calling conventions are incompatible. Specifically __nocfi will call at func+0, while FineIBT will have endbr-poison there, which is not a valid indirect target. Causing #CP. Notably this only triggers on IBT enabled hardware, which is probably why this hasn't been reported (also, most people will have JIT on anyway). Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for x86. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Peter Zijlstra	4382159696	cfi: Flip headers Normal include order is that linux/foo.h should include asm/foo.h, CFI has it the wrong way around. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/r/20231215092707.231038174@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-15 16:25:55 -08:00
Hou Tao	1467affd16	selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment If an abnormally huge cnt is used for multi-kprobes attachment, the following warning will be reported: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 392 at mm/util.c:632 kvmalloc_node+0xd9/0xe0 Modules linked in: bpf_testmod(O) CPU: 1 PID: 392 Comm: test_progs Tainted: G ...... 6.7.0-rc3+ #32 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...... RIP: 0010:kvmalloc_node+0xd9/0xe0 ? __warn+0x89/0x150 ? kvmalloc_node+0xd9/0xe0 bpf_kprobe_multi_link_attach+0x87/0x670 __sys_bpf+0x2a28/0x2bc0 __x64_sys_bpf+0x1a/0x30 do_syscall_64+0x36/0xb0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7fbe067f0e0d ...... </TASK> ---[ end trace 0000000000000000 ]--- So add a test to ensure the warning is fixed. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231215100708.2265609-6-houtao@huaweicloud.com	2023-12-15 22:54:55 +01:00
Hou Tao	00cdcd2900	selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test Since libbpf v1.0, libbpf doesn't return error code embedded into the pointer itself, libbpf_get_error() is deprecated and it is basically the same as using -errno directly. So replace the invocations of libbpf_get_error() by -errno in kprobe_multi_test. For libbpf_get_error() in test_attach_api_fails(), saving -errno before invoking ASSERT_xx() macros just in case that errno is overwritten by these macros. However, the invocation of libbpf_get_error() in get_syms() should be kept intact, because hashmap__new() still returns a pointer with embedded error code. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231215100708.2265609-5-houtao@huaweicloud.com	2023-12-15 22:54:55 +01:00
Hou Tao	0d83786f56	selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment If an abnormally huge cnt is used for multi-uprobes attachment, the following warning will be reported: ------------[ cut here ]------------ WARNING: CPU: 7 PID: 406 at mm/util.c:632 kvmalloc_node+0xd9/0xe0 Modules linked in: bpf_testmod(O) CPU: 7 PID: 406 Comm: test_progs Tainted: G ...... 6.7.0-rc3+ #32 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...... RIP: 0010:kvmalloc_node+0xd9/0xe0 ...... Call Trace: <TASK> ? __warn+0x89/0x150 ? kvmalloc_node+0xd9/0xe0 bpf_uprobe_multi_link_attach+0x14a/0x480 __sys_bpf+0x14a9/0x2bc0 do_syscall_64+0x36/0xb0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 ...... </TASK> ---[ end trace 0000000000000000 ]--- So add a test to ensure the warning is fixed. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231215100708.2265609-4-houtao@huaweicloud.com	2023-12-15 22:54:55 +01:00
Hou Tao	d6d1e6c17c	bpf: Limit the number of kprobes when attaching program to multiple kprobes An abnormally big cnt may also be assigned to kprobe_multi.cnt when attaching multiple kprobes. It will trigger the following warning in kvmalloc_node(): if (unlikely(size > INT_MAX)) { WARN_ON_ONCE(!(flags & __GFP_NOWARN)); return NULL; } Fix the warning by limiting the maximal number of kprobes in bpf_kprobe_multi_link_attach(). If the number of kprobes is greater than MAX_KPROBE_MULTI_CNT, the attachment will fail and return -E2BIG. Fixes: `0dcac27254` ("bpf: Add multi kprobe link") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231215100708.2265609-3-houtao@huaweicloud.com	2023-12-15 22:54:55 +01:00
Hou Tao	8b2efe51ba	bpf: Limit the number of uprobes when attaching program to multiple uprobes An abnormally big cnt may be passed to link_create.uprobe_multi.cnt, and it will trigger the following warning in kvmalloc_node(): if (unlikely(size > INT_MAX)) { WARN_ON_ONCE(!(flags & __GFP_NOWARN)); return NULL; } Fix the warning by limiting the maximal number of uprobes in bpf_uprobe_multi_link_attach(). If the number of uprobes is greater than MAX_UPROBE_MULTI_CNT, the attachment will return -E2BIG. Fixes: `89ae89f53d` ("bpf: Add multi uprobe link") Reported-by: Xingwei Lee <xrivendell7@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Closes: https://lore.kernel.org/bpf/CABOYnLwwJY=yFAGie59LFsUsBAgHfroVqbzZ5edAXbFE3YiNVA@mail.gmail.com Link: https://lore.kernel.org/bpf/20231215100708.2265609-2-houtao@huaweicloud.com	2023-12-15 22:54:46 +01:00
Daniel Xu	7489723c2e	bpf: xdp: Register generic_kfunc_set with XDP programs Registering generic_kfunc_set with XDP programs enables some of the newer BPF features inside XDP -- namely tree based data structures and BPF exceptions. The current motivation for this commit is to enable assertions inside XDP bpf progs. Assertions are a standard and useful tool to encode intent. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/d07d4614b81ca6aada44fcb89bb6b618fb66e4ca.1702594357.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 19:12:16 -08:00
Alexei Starovoitov	0f5d5454c7	Merge branch 'bpf-fs-mount-options-parsing-follow-ups' Andrii Nakryiko says: ==================== BPF FS mount options parsing follow ups Original BPF token patch set ([0]) added delegate_xxx mount options which supported only special "any" value and hexadecimal bitmask. This patch set attempts to make specifying and inspecting these mount options more human-friendly by supporting string constants matching corresponding bpf_cmd, bpf_map_type, bpf_prog_type, and bpf_attach_type enumerators. This implementation relies on BTF information to find all supported symbolic names. If kernel wasn't built with BTF, BPF FS will still support "any" and hex-based mask. [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=805707&state=* v1->v2: - strip BPF_, BPF_MAP_TYPE_, and BPF_PROG_TYPE_ prefixes, do case-insensitive comparison, normalize to lower case (Alexei). ==================== Link: https://lore.kernel.org/r/20231214225016.1209867-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:30:27 -08:00
Andrii Nakryiko	f2d0ffee1f	selftests/bpf: utilize string values for delegate_xxx mount options Use both hex-based and string-based way to specify delegate mount options for BPF FS. Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231214225016.1209867-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:30:27 -08:00
Andrii Nakryiko	c5707b2146	bpf: support symbolic BPF FS delegation mount options Besides already supported special "any" value and hex bit mask, support string-based parsing of delegation masks based on exact enumerator names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`, `enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported symbolic names (ignoring __MAX_xxx guard values and stripping repetitive prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is normalized to lower case in mount option output. So "PROG_LOAD", "prog_load", and "MAP_create" are all valid values to specify for delegate_cmds options, "array" is among supported for map types, etc. Besides supporting string values, we also support multiple values specified at the same time, using colon (':') separator. There are corresponding changes on bpf_show_options side to use known values to print them in human-readable format, falling back to hex mask printing, if there are any unrecognized bits. This shouldn't be necessary when enum BTF information is present, but in general we should always be able to fall back to this even if kernel was built without BTF. As mentioned, emitted symbolic names are normalized to be all lower case. 
Example below shows various ways to specify delegate_cmds options through mount command and how mount options are printed back: 12/14 14:39:07.604 vmuser@archvm:~/local/linux/tools/testing/selftests/bpf $ mount | rg token $ sudo mkdir -p /sys/fs/bpf/token $ sudo mount -t bpf bpffs /sys/fs/bpf/token \ -o delegate_cmds=prog_load:MAP_CREATE \ -o delegate_progs=kprobe \ -o delegate_attachs=xdp $ mount | grep token bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp) Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231214225016.1209867-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:30:27 -08:00
Alexei Starovoitov	403f3e8fda	Merge branch 'add-bpf_xdp_get_xfrm_state-kfunc' Daniel Xu says: ==================== Add bpf_xdp_get_xfrm_state() kfunc This patchset adds two kfunc helpers, bpf_xdp_get_xfrm_state() and bpf_xdp_xfrm_state_release() that wrap xfrm_state_lookup() and xfrm_state_put(). The intent is to support software RSS (via XDP) for the ongoing/upcoming ipsec pcpu work [0]. Recent experiments performed on (hopefully) reproducible AWS testbeds indicate that single tunnel pcpu ipsec can reach line rate on 100G ENA nics. Note this patchset only tests/shows generic xfrm_state access. The "secret sauce" (if you can really even call it that) involves accessing a soon-to-be-upstreamed pcpu_num field in xfrm_state. Early example is available here [1]. [0]: https://datatracker.ietf.org/doc/draft-ietf-ipsecme-multi-sa-performance/03/ [1]: `e89a1c617a/xdp-bench/xdp_redirect_cpumap.bpf.c (L385-L406)` Changes from v5: * Improve kfunc doc comments * Remove extraneous replay-window setting on selftest reverse path * Squash two kfunc commits into one * Rebase to bpf-next to pick up bitfield write patches * Remove testing of opts.error in selftest prog Changes from v4: * Fixup commit message for selftest * Set opts->error -ENOENT for !x * Revert single file xfrm + bpf Changes from v3: * Place all xfrm bpf integrations in xfrm_bpf.c * Avoid using nval as a temporary * Rebase to bpf-next * Remove extraneous __failure_unpriv annotation for verifier tests Changes from v2: * Fix/simplify BPF_CORE_WRITE_BITFIELD() algorithm * Added verifier tests for bitfield writes * Fix state leakage across test_tunnel subtests Changes from v1: * Move xfrm tunnel tests to test_progs * Fix writing to opts->error when opts is invalid * Use __bpf_kfunc_start_defs() * Remove unused vxlanhdr definition * Add and use BPF_CORE_WRITE_BITFIELD() macro * Make series bisect clean Changes from RFCv2: * Rebased to ipsec-next * Fix netns leak Changes from RFCv1: * Add Antony's commit tags * Add 
KF_ACQUIRE and KF_RELEASE semantics ==================== Reviewed-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/cover.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:12:56 -08:00
Daniel Xu	2cd07b0eb0	bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state() This commit extends test_tunnel selftest to test the new XDP xfrm state lookup kfunc. Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/e704e9a4332e3eac7b458e4bfdec8fcc6984cdb6.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:12:49 -08:00
Daniel Xu	e7adc8291a	bpf: selftests: Move xfrm tunnel test to test_progs test_progs is better than a shell script b/c C is a bit easier to maintain than shell. Also it's easier to use new infra like memory mapped global variables from C via bpf skeleton. Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/a350db9e08520c64544562d88ec005a039124d9b.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:12:49 -08:00
Daniel Xu	02b4e126e6	bpf: selftests: test_tunnel: Use vmlinux.h declarations vmlinux.h declarations are more ergonomic, especially when working with kfuncs. The uapi headers are often incomplete for kfunc definitions. This commit also switches bitfield accesses to use CO-RE helpers. Switching to vmlinux.h definitions makes the verifier very unhappy with raw bitfield accesses. The error is: ; md.u.md2.dir = direction; 33: (69) r1 = *(u16 *)(r2 +11) misaligned stack access off (0x0; 0x0)+-64+11 size 2 Fix by using CO-RE-aware bitfield reads and writes. Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/884bde1d9a351d126a3923886b945ea6b1b0776b.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:12:49 -08:00
Daniel Xu	77a7a8220f	bpf: selftests: test_tunnel: Setup fresh topology for each subtest This helps with determinism b/c individual setup/teardown prevents leaking state between different subtests. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/0fb59fa16fb58cca7def5239df606005a3e8dd0e.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:12:49 -08:00
Daniel Xu	8f0ec8c681	bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc This commit adds an unstable kfunc helper to access internal xfrm_state associated with an SA. This is intended to be used for the upcoming IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other words: for custom software RSS. That being said, the function that this kfunc wraps is fairly generic and used for a lot of xfrm tasks. I'm sure people will find uses elsewhere over time. This commit also adds a corresponding bpf_xdp_xfrm_state_release() kfunc to release the refcnt acquired by bpf_xdp_get_xfrm_state(). The verifier will require that all acquired xfrm_state's are released. Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/a29699c42f5fad456b875c98dd11c6afc3ffb707.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:12:49 -08:00
Yonghong Song	56925f389e	selftests/bpf: Remove flaky test_btf_id test With previous patch, one of subtests in test_btf_id becomes flaky and may fail. The following is a failing example: Error: #26 btf Error: #26/174 btf/BTF ID Error: #26/174 btf/BTF ID btf_raw_create:PASS:check 0 nsec btf_raw_create:PASS:check 0 nsec test_btf_id:PASS:check 0 nsec ... test_btf_id:PASS:check 0 nsec test_btf_id:FAIL:check BTF lingers do_test_get_info:FAIL:check failed: -1 The test tries to prove a btf_id not available after the map is closed. But btf_id is freed only after workqueue and a rcu grace period, compared to previous case just after a rcu grace period. Depending on system workload, workqueue could take quite some time to execute function bpf_map_free_deferred() which may cause the test failure. Instead of adding arbitrary delays, let us remove the logic to check btf_id availability after map is closed. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231214203820.1469402-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:10:32 -08:00
Yonghong Song	59e5791f59	bpf: Fix a race condition between btf_put() and map_free() When running `./test_progs -j` in my local vm with latest kernel, I once hit a kasan error like below: [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0 [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830 [ 1887.186498] [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G OEL 6.7.0-rc3-00699-g90679706d486-dirty #494 [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred [ 1887.190341] Call Trace: [ 1887.190666] <TASK> [ 1887.190949] dump_stack_lvl+0xac/0xe0 [ 1887.191423] ? nf_tcp_handle_invalid+0x1b0/0x1b0 [ 1887.192019] ? panic+0x3c0/0x3c0 [ 1887.192449] print_report+0x14f/0x720 [ 1887.192930] ? preempt_count_sub+0x1c/0xd0 [ 1887.193459] ? __virt_addr_valid+0xac/0x120 [ 1887.194004] ? bpf_rb_root_free+0x1f8/0x2b0 [ 1887.194572] kasan_report+0xc3/0x100 [ 1887.195085] ? bpf_rb_root_free+0x1f8/0x2b0 [ 1887.195668] bpf_rb_root_free+0x1f8/0x2b0 [ 1887.196183] ? __bpf_obj_drop_impl+0xb0/0xb0 [ 1887.196736] ? preempt_count_sub+0x1c/0xd0 [ 1887.197270] ? preempt_count_sub+0x1c/0xd0 [ 1887.197802] ? _raw_spin_unlock+0x1f/0x40 [ 1887.198319] bpf_obj_free_fields+0x1d4/0x260 [ 1887.198883] array_map_free+0x1a3/0x260 [ 1887.199380] bpf_map_free_deferred+0x7b/0xe0 [ 1887.199943] process_scheduled_works+0x3a2/0x6c0 [ 1887.200549] worker_thread+0x633/0x890 [ 1887.201047] ? __kthread_parkme+0xd7/0xf0 [ 1887.201574] ? kthread+0x102/0x1d0 [ 1887.202020] kthread+0x1ab/0x1d0 [ 1887.202447] ? pr_cont_work+0x270/0x270 [ 1887.202954] ? kthread_blkcg+0x50/0x50 [ 1887.203444] ret_from_fork+0x34/0x50 [ 1887.203914] ? 
kthread_blkcg+0x50/0x50 [ 1887.204397] ret_from_fork_asm+0x11/0x20 [ 1887.204913] </TASK> [ 1887.205209] [ 1887.205416] Allocated by task 2197: [ 1887.205881] kasan_set_track+0x3f/0x60 [ 1887.206366] __kasan_kmalloc+0x6e/0x80 [ 1887.206856] __kmalloc+0xac/0x1a0 [ 1887.207293] btf_parse_fields+0xa15/0x1480 [ 1887.207836] btf_parse_struct_metas+0x566/0x670 [ 1887.208387] btf_new_fd+0x294/0x4d0 [ 1887.208851] __sys_bpf+0x4ba/0x600 [ 1887.209292] __x64_sys_bpf+0x41/0x50 [ 1887.209762] do_syscall_64+0x4c/0xf0 [ 1887.210222] entry_SYSCALL_64_after_hwframe+0x63/0x6b [ 1887.210868] [ 1887.211074] Freed by task 36: [ 1887.211460] kasan_set_track+0x3f/0x60 [ 1887.211951] kasan_save_free_info+0x28/0x40 [ 1887.212485] ____kasan_slab_free+0x101/0x180 [ 1887.213027] __kmem_cache_free+0xe4/0x210 [ 1887.213514] btf_free+0x5b/0x130 [ 1887.213918] rcu_core+0x638/0xcc0 [ 1887.214347] __do_softirq+0x114/0x37e The error happens at bpf_rb_root_free+0x1f8/0x2b0: 00000000000034c0 <bpf_rb_root_free>: ; { 34c0: f3 0f 1e fa endbr64 34c4: e8 00 00 00 00 callq 0x34c9 <bpf_rb_root_free+0x9> 34c9: 55 pushq %rbp 34ca: 48 89 e5 movq %rsp, %rbp ... ; if (rec && rec->refcount_off >= 0 && 36aa: 4d 85 ed testq %r13, %r13 36ad: 74 a9 je 0x3658 <bpf_rb_root_free+0x198> 36af: 49 8d 7d 10 leaq 0x10(%r13), %rdi 36b3: e8 00 00 00 00 callq 0x36b8 <bpf_rb_root_free+0x1f8> <==== kasan function 36b8: 45 8b 7d 10 movl 0x10(%r13), %r15d <==== use-after-free load 36bc: 45 85 ff testl %r15d, %r15d 36bf: 78 8c js 0x364d <bpf_rb_root_free+0x18d> So the problem is at rec->refcount_off in the above. I did some source code analysis and find the reason. CPU A CPU B bpf_map_put: ... btf_put with rcu callback ... bpf_map_free_deferred with system_unbound_wq ... ... ... ... btf_free_rcu: ... ... ... bpf_map_free_deferred: ... ... ... ---------> btf_struct_metas_free() ... \| race condition ... ... ---------> map->ops->map_free() ... ... 
btf->struct_meta_tab = NULL In the above, map_free() corresponds to array_map_free() and eventually calling bpf_rb_root_free() which calls: ... __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false); ... Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with following code: meta = btf_find_struct_meta(btf, btf_id); if (!meta) return -EFAULT; rec->fields[i].graph_root.value_rec = meta->record; So basically, 'value_rec' is a pointer to the record in struct_metas_tab. And it is possible that that particular record has been freed by btf_struct_metas_free() and hence we have a kasan error here. Actually it is very hard to reproduce the failure with current bpf/bpf-next code, I only got the above error once. To increase reproducibility, I added a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which significantly increased reproducibility. diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 5e43ddd1b83f..aae5b5213e93 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work) struct bpf_map *map = container_of(work, struct bpf_map, work); struct btf_record *rec = map->record; + mdelay(100); security_bpf_map_free(map); bpf_map_release_memcg(map); /* implementation dependent freeing */ Hou also provided test cases ([1]) for easily reproducing the above issue. There are two ways to fix the issue, the v1 of the patch ([2]) moving btf_put() after map_free callback, and the v5 of the patch ([3]) using a kptr style fix which tries to get a btf reference during map_check_btf(). Each approach has its pros and cons. The first approach delays freeing btf while the second approach needs to acquire reference depending on context which makes logic not very elegant and may complicate things with future new data structures. Alexei suggested in [4] going back to v1 which is what this patch tries to do. 
I reran './test_progs -j' with the above mdelay() hack a couple of times and did not observe the error for the above rb_root test cases. Running Hou's test ([1]) is also successful. [1] https://lore.kernel.org/bpf/20231207141500.917136-1-houtao@huaweicloud.com/ [2] v1: https://lore.kernel.org/bpf/20231204173946.3066377-1-yonghong.song@linux.dev/ [3] v5: https://lore.kernel.org/bpf/20231208041621.2968241-1-yonghong.song@linux.dev/ [4] v4: https://lore.kernel.org/bpf/CAADnVQJ3FiXUhZJwX_81sjZvSYYKCFB3BT6P8D59RS2Gu+0Z7g@mail.gmail.com/ Cc: Hou Tao <houtao@huaweicloud.com> Fixes: `958cf2e273` ("bpf: Introduce bpf_obj_new") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231214203815.1469107-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-14 17:10:32 -08:00
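The ordering fix described in the commit above (drop the btf reference only after the map_free callback has run) can be illustrated with a small userspace sketch. Everything here is hypothetical: the `toy_*` types and functions are simplified stand-ins, not the kernel's `struct btf` or `struct bpf_map`, and the real bug is an asynchronous race between an RCU callback and a workqueue item, which this sketch models as a plain sequential ordering choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a btf record; not the kernel's struct btf_record. */
struct toy_record {
    int refcount_off;
};

/* The btf owns the record, much like entries hanging off struct_meta_tab. */
struct toy_btf {
    int refcnt;
    struct toy_record *record;
};

/* The map only borrows a pointer into btf-owned memory (like value_rec). */
struct toy_map {
    struct toy_btf *btf;
    const struct toy_record *value_rec;
};

/* Drop one reference; the last put frees the btf-owned record. */
static void toy_btf_put(struct toy_btf *btf)
{
    if (--btf->refcnt == 0) {
        free(btf->record);
        btf->record = NULL;
    }
}

/* Map teardown needs the record. Returns false if it is already gone,
 * which is the point where the kernel would hit the use-after-free. */
static bool toy_map_free(const struct toy_map *map)
{
    if (map->btf->record == NULL)
        return false;
    return map->value_rec->refcount_off >= 0;
}

/* Run one teardown with either the buggy or the fixed ordering. */
static bool toy_run(bool put_btf_first)
{
    struct toy_btf btf = { .refcnt = 1, .record = malloc(sizeof(struct toy_record)) };
    struct toy_map map;
    bool ok;

    assert(btf.record != NULL);
    btf.record->refcount_off = 16;
    map.btf = &btf;
    map.value_rec = btf.record;

    if (put_btf_first) {
        toy_btf_put(&btf);       /* buggy order: record freed too early */
        ok = toy_map_free(&map);
    } else {
        ok = toy_map_free(&map); /* fixed order: map teardown first */
        toy_btf_put(&btf);
    }
    return ok;
}
```

Calling `toy_run(false)` models the fixed ordering and reports success, while `toy_run(true)` models the premature put and reports that the record was gone by the time map teardown needed it.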
Randy Dunlap	04d25ccea2	net, xdp: Correct grammar Use the correct verb form in 2 places in the XDP rx-queue comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/bpf/20231213043735.30208-1-rdunlap@infradead.org	2023-12-14 16:38:59 +01:00
Tushar Vyavahare	2e1d6a0411	selftests/xsk: Fix for SEND_RECEIVE_UNALIGNED test Fix test broken by shared umem test and framework enhancement commit. Correct the current implementation of pkt_stream_replace_half() by ensuring that nb_valid_entries are not set to half, as this is not true for all the tests. Ensure that the expected value for valid_entries for the SEND_RECEIVE_UNALIGNED test equals the total number of packets sent, which is 4096. Create a new function called pkt_stream_pkt_set() that allows for packet modification to meet specific requirements while ensuring the accurate maintenance of the valid packet count to prevent inconsistencies in packet tracking. Fixes: `6d198a89c0` ("selftests/xsk: Add a test for shared umem feature") Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20231214130007.33281-1-tushar.vyavahare@intel.com	2023-12-14 16:11:13 +01:00
Alexei Starovoitov	c838fe1282	Merge branch 'bpf-use-gfp_kernel-in-bpf_event_entry_gen' Hou Tao says: ==================== The simple patch set aims to replace GFP_ATOMIC by GFP_KERNEL in bpf_event_entry_gen(). These two patches in the patch set were preparatory patches in "Fix the release of inner map" patchset [1] and are not needed for v2, so re-post them to the bpf-next tree. Patch #1 reduces the scope of rcu_read_lock when updating fd map and patch #2 replaces GFP_ATOMIC by GFP_KERNEL. Please see individual patches for more details. Change Log: v3: * patch #1: fall back to patch #1 in v1. Update comments in bpf_fd_htab_map_update_elem() to explain the reason for rcu_read_lock() (Alexei) v2: https://lore.kernel.org/bpf/20231211073843.1888058-1-houtao@huaweicloud.com/ * patch #1: add rcu_read_lock/unlock() for bpf_fd_array_map_update_elem as well to make it consistent with bpf_fd_htab_map_update_elem and update commit message accordingly (Alexei) * patch #1/#2: collects ack tags from Yonghong v1: https://lore.kernel.org/bpf/20231208103357.2637299-1-houtao@huaweicloud.com/ [1]: https://lore.kernel.org/bpf/20231107140702.1891778-1-houtao@huaweicloud.com/ ==================== Link: https://lore.kernel.org/r/20231214043010.3458072-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 21:02:40 -08:00
Hou Tao	dc68540913	bpf: Use GFP_KERNEL in bpf_event_entry_gen() rcu_read_lock() is no longer held when invoking bpf_event_entry_gen(), which is called by perf_event_fd_array_get_ptr(), so use GFP_KERNEL instead of GFP_ATOMIC to reduce the possibility of failures due to out-of-memory. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231214043010.3458072-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 20:49:11 -08:00
Hou Tao	8f82583f95	bpf: Reduce the scope of rcu_read_lock when updating fd map There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or ops->map_fd_put_ptr(), so don't use rcu-read-lock for these two callbacks. For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need rcu-read-lock because array->ptrs must still be allocated. For bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use rcu_read_lock() during the invocation of htab_map_update_elem(). Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231214043010.3458072-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 20:49:11 -08:00
Hou Tao	2a0c6b41ee	bpf: Update the comments in maybe_wait_bpf_programs() Since commit `638e4b825d` ("bpf: Allows per-cpu maps and map-in-map in sleepable programs"), sleepable BPF programs can also use map-in-map, but maybe_wait_bpf_programs() doesn't handle it accordingly. The main reason is that using synchronize_rcu_tasks_trace() to wait for the completion of these sleepable BPF programs may incur a very long delay and userspace may think it is hung, so the wait for sleepable BPF programs is skipped. Update the comments in maybe_wait_bpf_programs() to reflect the reason. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20231211083447.1921178-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 17:01:42 -08:00
Matt Bobrowski	b13cddf633	bpf: add small subset of SECURITY_PATH hooks to BPF sleepable_lsm_hooks list security_path_* based LSM hooks appear to be generally missing from the sleepable_lsm_hooks list. Initially add a small subset of them to the preexisting sleepable_lsm_hooks list so that sleepable BPF helpers like bpf_d_path() can be used from sleepable BPF LSM based programs. The security_path_* hooks added in this patch are similar to the security_inode_* counterparts that already exist in the sleepable_lsm_hooks list, and are called in roughly similar points and contexts, which presumably makes it OK to annotate them as sleepable too. Building a kernel with DEBUG_ATOMIC_SLEEP options enabled and running reasonable workloads stimulating activity that would be intercepted by such security hooks didn't show any splats. Notably, I haven't added all the security_path_* LSM hooks that are available as I don't need them at this point in time. Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/ZXM3IHHXpNY9y82a@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:56:19 -08:00
Alexei Starovoitov	ec14325c73	Merge branch 'xdp-metadata-via-kfuncs-for-ice-vlan-hint' Larysa Zaremba says: ==================== XDP metadata via kfuncs for ice + VLAN hint This series introduces XDP hints via kfuncs [0] to the ice driver. Series brings the following existing hints to the ice driver: - HW timestamp - RX hash with type Series also introduces VLAN tag with protocol XDP hint, it can now be accessed by XDP and userspace (AF_XDP) programs. They can also be checked with xdp_metadata test and xdp_hw_metadata program. Impact of these patches on ice performance: ZC: * Full hints implementation decreases pps in ZC mode by less than 3% (64B, rxdrop) skb (packets with invalid IP, dropped by stack): * Overall, patchset improves peak performance in skb mode by about 0.5% [0] https://patchwork.kernel.org/project/netdevbpf/cover/20230119221536.3349901-1-sdf@google.com/ v7: https://lore.kernel.org/bpf/20231115175301.534113-1-larysa.zaremba@intel.com/ v6: https://lore.kernel.org/bpf/20231012170524.21085-1-larysa.zaremba@intel.com/ Intermediate RFC v2: https://lore.kernel.org/bpf/20230927075124.23941-1-larysa.zaremba@intel.com/ Intermediate RFC v1: https://lore.kernel.org/bpf/20230824192703.712881-1-larysa.zaremba@intel.com/ v5: https://lore.kernel.org/bpf/20230811161509.19722-1-larysa.zaremba@intel.com/ v4: https://lore.kernel.org/bpf/20230728173923.1318596-1-larysa.zaremba@intel.com/ v3: https://lore.kernel.org/bpf/20230719183734.21681-1-larysa.zaremba@intel.com/ v2: https://lore.kernel.org/bpf/20230703181226.19380-1-larysa.zaremba@intel.com/ v1: https://lore.kernel.org/all/20230512152607.992209-1-larysa.zaremba@intel.com/ Changes since v7: * shorten timestamp assignment in ice * change first argument of ice_fill_rx_descs back to xsk_buff_pool * fix kernel-doc for ice_run_xdp_zc * add missing XSK_CHECK_PRIV_TYPE() in ice * resolved selftests merge conflicts with TX hints * AF_INET patch adds new packet generation, not replaces AF_XDP one * fix destination port in 
xdp_metadata Changes since v6: * add ability to fill cb of all xdp_buffs in xsk_buff_pool * place just pointer to packet context in ice_xdp_buff * add const qualifiers in veth implementation * generate uapi for VLAN hint Changes since v5: * drop checksum hint from the patchset entirely * Alex's patch that lifts the data_meta size limitation is no longer required in this patchset, so will be sent separately * new patch: hide some ice hints code behind a static key * fix several bugs in ZC mode (ice) * change argument order in VLAN hint kfunc (tci, proto -> proto, tci) * cosmetic changes * analyze performance impact Changes since v4: * Drop the concept of partial checksum from the hint design * Drop the concept of checksum level from the hint design Changes since v3: * use XDP_CHECKSUM_VALID_LVL0 + csum_level instead of csum_level + 1 * fix spelling mistakes * read XDP timestamp unconditionally * add TO_STR() macro Changes since v2: * redesign checksum hint, so now it gives full status * rename vlan_tag -> vlan_tci, where applicable * use open_netns() and close_netns() in xdp_metadata * improve VLAN hint documentation * replace CFI with DEI * use VLAN_VID_MASK in xdp_metadata * make vlan_get_tag() return -ENODATA * remove unused rx_ptype in ice_xsk.c * fix ice timestamp code division between patches Changes since v1: * directly return RX hash, RX timestamp and RX checksum status in skb-common functions * use intermediate enum value for checksum status in ice * get rid of ring structure dependency in ice kfunc implementation * make variables const, when possible, in ice implementation * use -ENODATA instead of -EOPNOTSUPP for driver implementation * instead of having 2 separate functions for c-tag and s-tag, use 1 function that outputs both VLAN tag and protocol ID * improve documentation for introduced hints * update xdp_metadata selftest to test new hints * implement new hints in veth, so they can be tested in xdp_metadata * parse VLAN tag in xdp_hw_metadata 
==================== Link: https://lore.kernel.org/r/20231205210847.28460-1-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:42 -08:00
Larysa Zaremba	4c6612f610	selftests/bpf: Check VLAN tag and proto in xdp_metadata Verify whether VLAN tag and proto are set correctly. To simulate "stripped" VLAN tag on veth, send test packet from VLAN interface. Also, add TO_STR() macro for convenience. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-19-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
Larysa Zaremba	a3850af4ea	selftests/bpf: Add AF_INET packet generation to xdp_metadata The easiest way to simulate stripped VLAN tag in veth is to send a packet from VLAN interface, attached to veth. Unfortunately, this approach is incompatible with AF_XDP on TX side, because VLAN interfaces do not have such a feature. Check both packets sent via AF_XDP TX and regular socket. AF_INET packet will also have a filled-in hash type (XDP_RSS_TYPE_L4), unlike AF_XDP packet, so more values can be checked. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20231205210847.28460-18-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
Larysa Zaremba	8e68a4beba	selftests/bpf: Add flags and VLAN hint to xdp_hw_metadata Add VLAN hint to the xdp_hw_metadata program. Also, to make metadata layout more straightforward, add a flags field that reports the validity of each hint separately. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-17-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
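The idea of a flags field carrying per-hint validity, as described in the commit above, might look roughly like the following. This is a minimal sketch under stated assumptions: the struct layout, the `toy_*` names, and the flag values are all invented for illustration and are not taken from the xdp_hw_metadata selftest.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical validity bits, one per hint. */
enum toy_hint_flag {
    TOY_HINT_TS   = 1 << 0,
    TOY_HINT_HASH = 1 << 1,
    TOY_HINT_VLAN = 1 << 2,
};

/* Hypothetical metadata area: every hint has a fixed slot, and 'valid'
 * says which slots actually hold data for this packet. */
struct toy_meta {
    uint64_t rx_timestamp;
    uint32_t rx_hash;
    uint16_t vlan_proto;
    uint16_t vlan_tci;
    uint32_t valid;
};

/* Producer side: store a hint and mark its slot as valid. */
static void toy_meta_set_hash(struct toy_meta *m, uint32_t hash)
{
    m->rx_hash = hash;
    m->valid |= TOY_HINT_HASH;
}

/* Consumer side: check a single hint without touching the others. */
static bool toy_meta_has(const struct toy_meta *m, enum toy_hint_flag flag)
{
    return (m->valid & (uint32_t)flag) != 0;
}
```

With a layout like this, a consumer does not need sentinel values such as a zero timestamp to detect a missing hint; it tests the corresponding bit before reading the slot.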
Larysa Zaremba	e71a9fa7fd	selftests/bpf: Allow VLAN packets in xdp_hw_metadata Make VLAN c-tag and s-tag XDP hint testing more convenient by not skipping VLAN-ed packets. Allow both 802.1ad and 802.1Q headers. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-16-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
Larysa Zaremba	7978bad4b6	mlx5: implement VLAN tag XDP hint Implement the newly added .xmo_rx_vlan_tag() hint function. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20231205210847.28460-15-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
Larysa Zaremba	537fec0733	net: make vlan_get_tag() return -ENODATA instead of -EINVAL __vlan_hwaccel_get_tag() is used in veth XDP hints implementation, its return value (-EINVAL if skb is not VLAN tagged) is passed to bpf code, but XDP hints specification requires drivers to return -ENODATA, if a hint cannot be provided for a particular packet. Solve this inconsistency by changing error return value of __vlan_hwaccel_get_tag() from -EINVAL to -ENODATA, do the same thing to __vlan_get_tag(), because this function is supposed to follow the same convention. This, in turn, makes -ENODATA the only non-zero value vlan_get_tag() can return. We can do this with no side effects, because none of the users of the 3 above-mentioned functions rely on the exact value. Suggested-by: Jesper Dangaard Brouer <jbrouer@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-14-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
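The return-value convention from the commit above (0 on success, -ENODATA when the packet carries no VLAN tag) can be sketched in userspace C. The `toy_pkt` type and `toy_vlan_get_tag()` are hypothetical stand-ins and do not reproduce the kernel's skb-based helpers; only the error-code convention is taken from the commit message.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packet descriptor standing in for an skb. */
struct toy_pkt {
    bool vlan_present;
    uint16_t vlan_tci;
};

/* Returns 0 and fills *tci when a tag is present; returns -ENODATA when
 * the hint cannot be provided for this particular packet. */
static int toy_vlan_get_tag(const struct toy_pkt *pkt, uint16_t *tci)
{
    if (!pkt->vlan_present)
        return -ENODATA;

    *tci = pkt->vlan_tci;
    return 0;
}
```

Because -ENODATA is the only non-zero result in this sketch, a caller can treat any failure as "no hint available" without distinguishing error codes.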
Larysa Zaremba	fca783799f	veth: Implement VLAN tag XDP hint In order to test VLAN tag hint in hardware-independent selftests, implement newly added hint in veth driver. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-13-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
Larysa Zaremba	b591137c4e	ice: use VLAN proto from ring packet context in skb path VLAN proto, used in the ice XDP hints implementation, is stored in the ring packet context. Utilize this value in skb VLAN processing too instead of checking netdev features. At the same time, use vlan_tci instead of vlan_tag in touched code, because VLAN tag often refers to VLAN proto and VLAN TCI combined, while in the code we clearly store only VLAN TCI. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-12-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
Larysa Zaremba	714ed949c6	ice: Implement VLAN tag hint Implement .xmo_rx_vlan_tag callback to allow XDP code to read packet's VLAN tag. At the same time, use vlan_tci instead of vlan_tag in touched code, because VLAN tag often refers to VLAN proto and VLAN TCI combined, while in the code we clearly store only VLAN TCI. Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-11-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:41 -08:00
Larysa Zaremba	e6795330f8	xdp: Add VLAN tag hint Implement functionality that enables drivers to expose VLAN tag to XDP code. VLAN tag is represented by 2 variables: - protocol ID, which is passed to bpf code in BE - VLAN TCI, in host byte order Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20231205210847.28460-10-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-13 16:16:40 -08:00
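The commit above says the hint is delivered as a protocol ID in big-endian byte order plus a VLAN TCI in host byte order. A consumer might decode those two values as in the sketch below. The `toy_vlan` struct and `toy_vlan_decode()` are hypothetical names; the bit layout of the TCI (3-bit priority, 1-bit DEI, 12-bit VLAN ID) follows IEEE 802.1Q.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical decoded view of the two values the hint provides. */
struct toy_vlan {
    uint16_t proto; /* EtherType in host order, e.g. 0x8100 or 0x88a8 */
    uint16_t vid;   /* VLAN ID, low 12 bits of the TCI */
    uint8_t dei;    /* drop eligible indicator, bit 12 of the TCI */
    uint8_t pcp;    /* priority code point, top 3 bits of the TCI */
};

/* proto_be arrives in network (big-endian) order; tci is already in
 * host order, so only the protocol needs a byte-order conversion. */
static struct toy_vlan toy_vlan_decode(uint16_t proto_be, uint16_t tci)
{
    struct toy_vlan v;

    v.proto = ntohs(proto_be);
    v.vid = tci & 0x0fff;
    v.dei = (uint8_t)((tci >> 12) & 0x1);
    v.pcp = (uint8_t)(tci >> 13);
    return v;
}
```

For example, a TCI of 0xa00a paired with the 802.1Q EtherType decodes to priority 5, DEI 0, and VLAN ID 10.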


1234950 Commits