Add ability to specify extra compiler flags with SAN_CFLAGS for compilation of
all user-space C files. This allows to build all of selftest programs with,
e.g., custom sanitizer flags, without requiring support for such sanitizers
from anyone compiling selftest/bpf.
As an example, to compile everything with AddressSanitizer, one would do:
$ make clean && make SAN_CFLAGS="-fsanitize=address"
For AddressSanitizer to work, one needs appropriate libasan shared library
installed in the system, with version of libasan matching what GCC links
against. E.g., GCC8 needs libasan5, while GCC7 uses libasan4.
For CentOS 7, to build everything successfully one would need to:
$ sudo yum install devtoolset-8-gcc devtoolset-libasan-devel
$ scl enable devtoolset-8 bash # set up environment
For Arch Linux to run selftests, one would need to install gcc-libs package to
get libasan.so.5:
$ sudo pacman -S gcc-libs
N.B. EXTRA_CFLAGS name wasn't used, because it's also used by libbpf's
Makefile and this causes few issues:
1. default "-g -Wall" flags are overriden;
2. compiling shared library with AddressSanitizer generates a bunch of symbols
like: "_GLOBAL__sub_D_00099_0_btf_dump.c", "_GLOBAL__sub_D_00099_0_bpf.c",
etc, which screws up versioned symbols check.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Julia Kartseva <hex@fb.com>
Link: https://lore.kernel.org/bpf/20200429012111.277390-3-andriin@fb.com
Ensure that test runner flavors include their own skeletons from <flavor>/
directory. Previously, skeletons generated for no-flavor test_progs were used.
Apart from fixing correctness, this also makes it possible to compile only
flavors individually:
$ make clean && make test_progs-no_alu32
... now succeeds ...
Fixes: 74b5a5968f ("selftests/bpf: Replace test_progs and test_maps w/ general rule")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429012111.277390-2-andriin@fb.com
As discussed at LPC 2019 ([0]), this patch brings (a quite belated) support
for declarative BTF-defined map-in-map support in libbpf. It allows to define
ARRAY_OF_MAPS and HASH_OF_MAPS BPF maps without any user-space initialization
code involved.
Additionally, it allows to initialize outer map's slots with references to
respective inner maps at load time, also completely declaratively.
Despite a weak type system of C, the way BTF-defined map-in-map definition
works, it's actually quite hard to accidentally initialize outer map with
incompatible inner maps. This being C, of course, it's still possible, but
even that would be caught at load time and error returned with helpful debug
log pointing exactly to the slot that failed to be initialized.
As an example, here's a rather advanced HASH_OF_MAPS declaration and
initialization example, filling slots #0 and #4 with two inner maps:
#include <bpf/bpf_helpers.h>
struct inner_map {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 1);
__type(key, int);
__type(value, int);
} inner_map1 SEC(".maps"),
inner_map2 SEC(".maps");
struct outer_hash {
__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
__uint(max_entries, 5);
__uint(key_size, sizeof(int));
__array(values, struct inner_map);
} outer_hash SEC(".maps") = {
.values = {
[0] = &inner_map2,
[4] = &inner_map1,
},
};
Here's the relevant part of libbpf debug log showing pretty clearly of what's
going on with map-in-map initialization:
libbpf: .maps relo #0: for 6 value 0 rel.r_offset 96 name 260 ('inner_map1')
libbpf: .maps relo #0: map 'outer_arr' slot [0] points to map 'inner_map1'
libbpf: .maps relo #1: for 7 value 32 rel.r_offset 112 name 249 ('inner_map2')
libbpf: .maps relo #1: map 'outer_arr' slot [2] points to map 'inner_map2'
libbpf: .maps relo #2: for 7 value 32 rel.r_offset 144 name 249 ('inner_map2')
libbpf: .maps relo #2: map 'outer_hash' slot [0] points to map 'inner_map2'
libbpf: .maps relo #3: for 6 value 0 rel.r_offset 176 name 260 ('inner_map1')
libbpf: .maps relo #3: map 'outer_hash' slot [4] points to map 'inner_map1'
libbpf: map 'inner_map1': created successfully, fd=4
libbpf: map 'inner_map2': created successfully, fd=5
libbpf: map 'outer_hash': created successfully, fd=7
libbpf: map 'outer_hash': slot [0] set to map 'inner_map2' fd=5
libbpf: map 'outer_hash': slot [4] set to map 'inner_map1' fd=4
Notice from the log above that fd=6 (not logged explicitly) is used for inner
"prototype" map, necessary for creation of outer map. It is destroyed
immediately after outer map is created.
See also included selftest with some extra comments explaining extra details
of usage. Additionally, similar initialization syntax and libbpf functionality
can be used to do initialization of BPF_PROG_ARRAY with references to BPF
sub-programs. This can be done in follow up patches, if there will be a demand
for this.
[0] https://linuxplumbersconf.org/event/4/contributions/448/
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200429002739.48006-4-andriin@fb.com
Similar to commit b7a0d65d80 ("bpf, testing: Workaround a verifier failure for test_progs")
fix test_sysctl_prog.c as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
$(OUTPUT)/runqslower makefile target doesn't actually create runqslower
binary in the $(OUTPUT) directory. As lib.mk expects all
TEST_GEN_PROGS_EXTENDED (which runqslower is a part of) to be present in
the OUTPUT directory, this results in an error when running e.g. `make
install`:
rsync: link_stat "tools/testing/selftests/bpf/runqslower" failed: No
such file or directory (2)
Copy the binary into the OUTPUT directory after building it to fix the
error.
Fixes: 3a0d3092a4 ("selftests/bpf: Build runqslower from selftests")
Signed-off-by: Veronika Kabatova <vkabatov@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200428173742.2988395-1-vkabatov@redhat.com
cls_redirect is a TC clsact based replacement for the glb-redirect iptables
module available at [1]. It enables what GitHub calls "second chance"
flows [2], similarly proposed by the Beamer paper [3]. In contrast to
glb-redirect, it also supports migrating UDP flows as long as connected
sockets are used. cls_redirect is in production at Cloudflare, as part of
our own L4 load balancer.
We have modified the encapsulation format slightly from glb-redirect:
glbgue_chained_routing.private_data_type has been repurposed to form a
version field and several flags. Both have been arranged in a way that
a private_data_type value of zero matches the current glb-redirect
behaviour. This means that cls_redirect will understand packets in
glb-redirect format, but not vice versa.
The test suite only covers basic features. For example, cls_redirect will
correctly forward path MTU discovery packets, but this is not exercised.
It is also possible to switch the encapsulation format to GRE on the last
hop, which is also not tested.
There are two major distinctions from glb-redirect: first, cls_redirect
relies on receiving encapsulated packets directly from a router. This is
because we don't have access to the neighbour tables from BPF, yet. See
forward_to_next_hop for details. Second, cls_redirect performs decapsulation
instead of using separate ipip and sit tunnel devices. This
avoids issues with the sit tunnel [4] and makes deploying the classifier
easier: decapsulated packets appear on the same interface, so existing
firewall rules continue to work as expected.
The code base started it's life on v4.19, so there are most likely still
hold overs from old workarounds. In no particular order:
- The function buf_off is required to defeat a clang optimization
that leads to the verifier rejecting the program due to pointer
arithmetic in the wrong order.
- The function pkt_parse_ipv6 is force inlined, because it would
otherwise be rejected due to returning a pointer to stack memory.
- The functions fill_tuple and classify_tcp contain kludges, because
we've run out of function arguments.
- The logic in general is rather nested, due to verifier restrictions.
I think this is either because the verifier loses track of constants
on the stack, or because it can't track enum like variables.
1: https://github.com/github/glb-director/tree/master/src/glb-redirect
2: https://github.com/github/glb-director/blob/master/docs/development/second-chance-design.md
3: https://www.usenix.org/conference/nsdi18/presentation/olteanu
4: https://github.com/github/glb-director/issues/64
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200424185556.7358-2-lmb@cloudflare.com
To make BPF verifier verbose log more releavant and easier to use to debug
verification failures, "pop" parts of log that were successfully verified.
This has effect of leaving only verifier logs that correspond to code branches
that lead to verification failure, which in practice should result in much
shorter and more relevant verifier log dumps. This behavior is made the
default behavior and can be overriden to do exhaustive logging by specifying
BPF_LOG_LEVEL2 log level.
Using BPF_LOG_LEVEL2 to disable this behavior is not ideal, because in some
cases it's good to have BPF_LOG_LEVEL2 per-instruction register dump
verbosity, but still have only relevant verifier branches logged. But for this
patch, I didn't want to add any new flags. It might be worth-while to just
rethink how BPF verifier logging is performed and requested and streamline it
a bit. But this trimming of successfully verified branches seems to be useful
and a good default behavior.
To test this, I modified runqslower slightly to introduce read of
uninitialized stack variable. Log (**truncated in the middle** to save many
lines out of this commit message) BEFORE this change:
; int handle__sched_switch(u64 *ctx)
0: (bf) r6 = r1
; struct task_struct *prev = (struct task_struct *)ctx[1];
1: (79) r1 = *(u64 *)(r6 +8)
func 'sched_switch' arg1 has btf_id 151 type STRUCT 'task_struct'
2: (b7) r2 = 0
; struct event event = {};
3: (7b) *(u64 *)(r10 -24) = r2
last_idx 3 first_idx 0
regs=4 stack=0 before 2: (b7) r2 = 0
4: (7b) *(u64 *)(r10 -32) = r2
5: (7b) *(u64 *)(r10 -40) = r2
6: (7b) *(u64 *)(r10 -48) = r2
; if (prev->state == TASK_RUNNING)
[ ... instruction dump from insn #7 through #50 are cut out ... ]
51: (b7) r2 = 16
52: (85) call bpf_get_current_comm#16
last_idx 52 first_idx 42
regs=4 stack=0 before 51: (b7) r2 = 16
; bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
53: (bf) r1 = r6
54: (18) r2 = 0xffff8881f3868800
56: (18) r3 = 0xffffffff
58: (bf) r4 = r7
59: (b7) r5 = 32
60: (85) call bpf_perf_event_output#25
last_idx 60 first_idx 53
regs=20 stack=0 before 59: (b7) r5 = 32
61: (bf) r2 = r10
; event.pid = pid;
62: (07) r2 += -16
; bpf_map_delete_elem(&start, &pid);
63: (18) r1 = 0xffff8881f3868000
65: (85) call bpf_map_delete_elem#3
; }
66: (b7) r0 = 0
67: (95) exit
from 44 to 66: safe
from 34 to 66: safe
from 11 to 28: R1_w=inv0 R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-8=mmmm???? fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
; bpf_map_update_elem(&start, &pid, &ts, 0);
28: (bf) r2 = r10
;
29: (07) r2 += -16
; tsp = bpf_map_lookup_elem(&start, &pid);
30: (18) r1 = 0xffff8881f3868000
32: (85) call bpf_map_lookup_elem#1
invalid indirect read from stack off -16+0 size 4
processed 65 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 4
Notice how there is a successful code path from instruction 0 through 67, few
successfully verified jumps (44->66, 34->66), and only after that 11->28 jump
plus error on instruction #32.
AFTER this change (full verifier log, **no truncation**):
; int handle__sched_switch(u64 *ctx)
0: (bf) r6 = r1
; struct task_struct *prev = (struct task_struct *)ctx[1];
1: (79) r1 = *(u64 *)(r6 +8)
func 'sched_switch' arg1 has btf_id 151 type STRUCT 'task_struct'
2: (b7) r2 = 0
; struct event event = {};
3: (7b) *(u64 *)(r10 -24) = r2
last_idx 3 first_idx 0
regs=4 stack=0 before 2: (b7) r2 = 0
4: (7b) *(u64 *)(r10 -32) = r2
5: (7b) *(u64 *)(r10 -40) = r2
6: (7b) *(u64 *)(r10 -48) = r2
; if (prev->state == TASK_RUNNING)
7: (79) r2 = *(u64 *)(r1 +16)
; if (prev->state == TASK_RUNNING)
8: (55) if r2 != 0x0 goto pc+19
R1_w=ptr_task_struct(id=0,off=0,imm=0) R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
; trace_enqueue(prev->tgid, prev->pid);
9: (61) r1 = *(u32 *)(r1 +1184)
10: (63) *(u32 *)(r10 -4) = r1
; if (!pid || (targ_pid && targ_pid != pid))
11: (15) if r1 == 0x0 goto pc+16
from 11 to 28: R1_w=inv0 R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-8=mmmm???? fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
; bpf_map_update_elem(&start, &pid, &ts, 0);
28: (bf) r2 = r10
;
29: (07) r2 += -16
; tsp = bpf_map_lookup_elem(&start, &pid);
30: (18) r1 = 0xffff8881db3ce800
32: (85) call bpf_map_lookup_elem#1
invalid indirect read from stack off -16+0 size 4
processed 65 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 4
Notice how in this case, there are 0-11 instructions + jump from 11 to
28 is recorded + 28-32 instructions with error on insn #32.
test_verifier test runner was updated to specify BPF_LOG_LEVEL2 for
VERBOSE_ACCEPT expected result due to potentially "incomplete" success verbose
log at BPF_LOG_LEVEL1.
On success, verbose log will only have a summary of number of processed
instructions, etc, but no branch tracing log. Having just a last succesful
branch tracing seemed weird and confusing. Having small and clean summary log
in success case seems quite logical and nice, though.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200423195850.1259827-1-andriin@fb.com
Currently the following prog types don't fall back to bpf_base_func_proto()
(instead they have cgroup_base_func_proto which has a limited set of
helpers from bpf_base_func_proto):
* BPF_PROG_TYPE_CGROUP_DEVICE
* BPF_PROG_TYPE_CGROUP_SYSCTL
* BPF_PROG_TYPE_CGROUP_SOCKOPT
I don't see any specific reason why we shouldn't use bpf_base_func_proto(),
every other type of program (except bpf-lirc and, understandably, tracing)
use it, so let's fall back to bpf_base_func_proto for those prog types
as well.
This basically boils down to adding access to the following helpers:
* BPF_FUNC_get_prandom_u32
* BPF_FUNC_get_smp_processor_id
* BPF_FUNC_get_numa_node_id
* BPF_FUNC_tail_call
* BPF_FUNC_ktime_get_ns
* BPF_FUNC_spin_lock (CAP_SYS_ADMIN)
* BPF_FUNC_spin_unlock (CAP_SYS_ADMIN)
* BPF_FUNC_jiffies64 (CAP_SYS_ADMIN)
I've also added bpf_perf_event_output() because it's really handy for
logging and debugging.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200420174610.77494-1-sdf@google.com
Pull networking fixes from David Miller:
1) Fix memory leak in netfilter flowtable, from Roi Dayan.
2) Ref-count leaks in netrom and tipc, from Xiyu Yang.
3) Fix warning when mptcp socket is never accepted before close, from
Florian Westphal.
4) Missed locking in ovs_ct_exit(), from Tonghao Zhang.
5) Fix large delays during PTP synchornization in cxgb4, from Rahul
Lakkireddy.
6) team_mode_get() can hang, from Taehee Yoo.
7) Need to use kvzalloc() when allocating fw tracer in mlx5 driver,
from Niklas Schnelle.
8) Fix handling of bpf XADD on BTF memory, from Jann Horn.
9) Fix BPF_STX/BPF_B encoding in x86 bpf jit, from Luke Nelson.
10) Missing queue memory release in iwlwifi pcie code, from Johannes
Berg.
11) Fix NULL deref in macvlan device event, from Taehee Yoo.
12) Initialize lan87xx phy correctly, from Yuiko Oshino.
13) Fix looping between VRF and XFRM lookups, from David Ahern.
14) etf packet scheduler assumes all sockets are full sockets, which is
not necessarily true. From Eric Dumazet.
15) Fix mptcp data_fin handling in RX path, from Paolo Abeni.
16) fib_select_default() needs to handle nexthop objects, from David
Ahern.
17) Use GFP_ATOMIC under spinlock in mac80211_hwsim, from Wei Yongjun.
18) vxlan and geneve use wrong nlattr array, from Sabrina Dubroca.
19) Correct rx/tx stats in bcmgenet driver, from Doug Berger.
20) BPF_LDX zero-extension is encoded improperly in x86_32 bpf jit, fix
from Luke Nelson.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (100 commits)
selftests/bpf: Fix a couple of broken test_btf cases
tools/runqslower: Ensure own vmlinux.h is picked up first
bpf: Make bpf_link_fops static
bpftool: Respect the -d option in struct_ops cmd
selftests/bpf: Add test for freplace program with expected_attach_type
bpf: Propagate expected_attach_type when verifying freplace programs
bpf: Fix leak in LINK_UPDATE and enforce empty old_prog_fd
bpf, x86_32: Fix logic error in BPF_LDX zero-extension
bpf, x86_32: Fix clobbering of dst for BPF_JSET
bpf, x86_32: Fix incorrect encoding in BPF_LDX zero-extension
bpf: Fix reStructuredText markup
net: systemport: suppress warnings on failed Rx SKB allocations
net: bcmgenet: suppress warnings on failed Rx SKB allocations
macsec: avoid to set wrong mtu
mac80211: sta_info: Add lockdep condition for RCU list usage
mac80211: populate debugfs only after cfg80211 init
net: bcmgenet: correct per TX/RX ring statistics
net: meth: remove spurious copyright text
net: phy: bcm84881: clear settings on link down
chcr: Fix CPU hard lockup
...
Alexei Starovoitov says:
====================
pull-request: bpf 2020-04-24
The following pull-request contains BPF updates for your *net* tree.
We've added 17 non-merge commits during the last 5 day(s) which contain
a total of 19 files changed, 203 insertions(+), 85 deletions(-).
The main changes are:
1) link_update fix, from Andrii.
2) libbpf get_xdp_id fix, from David.
3) xadd verifier fix, from Jann.
4) x86-32 JIT fixes, from Luke and Wang.
5) test_btf fix, from Stanislav.
6) freplace verifier fix, from Toke.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add nodad when adding IPv6 addresses and remove the sleep.
A recent change to iproute2 moved the 'pref medium' to the prefix
(where it belongs). Change the expected route check to strip
'pref medium' to be compatible with old and new iproute2.
Add IPv4 runtime test with an IPv6 address as the gateway in
the default route.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a self-test for the IPv6 dsfield munge that iproute2 will support.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the pedit_dsfield forwarding selftest with coverage of "pedit ex
munge ip6 dsfield set".
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add tests for vrf and xfrms with a second round after adding a
qdisc. There are a few known problems documented with the test
cases that fail. The fix is non-trivial; will come back to it
when time allows.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_tests is spewing errors:
...
Cannot open network namespace "ns1": No such file or directory
Cannot open network namespace "ns1": No such file or directory
Cannot open network namespace "ns1": No such file or directory
Cannot open network namespace "ns1": No such file or directory
ping: connect: Network is unreachable
Cannot open network namespace "ns1": No such file or directory
Cannot open network namespace "ns1": No such file or directory
...
Each test entry in fib_tests is supposed to do its own setup and
cleanup. Right now the $IP commands in fib_suppress_test are
failing because there is no ns1. Add the setup/cleanup and logging
expected for each test.
Fixes: ca7a03c417 ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule")
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull kselftest fixes from Shuah Khan:
"This consists of fixes to runner scripts and individual test run-time
bugs. Includes fixes to tpm2 and memfd test run-time regressions"
* tag 'linux-kselftest-5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/ipc: Fix test failure seen after initial test run
Revert "Kernel selftests: tpm2: check for tpm support"
selftests/ftrace: Add CONFIG_SAMPLE_FTRACE_DIRECT=m kconfig
selftests/seccomp: allow clock_nanosleep instead of nanosleep
kselftest/runner: allow to properly deliver signals to tests
selftests/harness: fix spelling mistake "SIGARLM" -> "SIGALRM"
selftests: Fix memfd test run-time regression
selftests: vm: Fix 64-bit test builds for powerpc64le
selftests: vm: Do not override definition of ARCH
This patch adds a test to test_verifier that writes the lower 8 bits of
R10 (aka FP) using BPF_B to an array map and reads the result back. The
expected behavior is that the result should be the same as first copying
R10 to R9, and then storing / loading the lower 8 bits of R9.
This test catches a bug that was present in the x86-64 JIT that caused
an incorrect encoding for BPF_STX BPF_B when the source operand is R10.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200418232655.23870-2-luke.r.nels@gmail.com
When check_xadd() verifies an XADD operation on a pointer to a stack slot
containing a spilled pointer, check_stack_read() verifies that the read,
which is part of XADD, is valid. However, since the placeholder value -1 is
passed as `value_regno`, check_stack_read() can only return a binary
decision and can't return the type of the value that was read. The intent
here is to verify whether the value read from the stack slot may be used as
a SCALAR_VALUE; but since check_stack_read() doesn't check the type, and
the type information is lost when check_stack_read() returns, this is not
enforced, and a malicious user can abuse XADD to leak spilled kernel
pointers.
Fix it by letting check_stack_read() verify that the value is usable as a
SCALAR_VALUE if no type information is passed to the caller.
To be able to use __is_pointer_value() in check_stack_read(), move it up.
Fix up the expected unprivileged error message for a BPF selftest that,
until now, assumed that unprivileged users can use XADD on stack-spilled
pointers. This also gives us a test for the behavior introduced in this
patch for free.
In theory, this could also be fixed by forbidding XADD on stack spills
entirely, since XADD is a locked operation (for operations on memory with
concurrency) and there can't be any concurrency on the BPF stack; but
Alexei has said that he wants to keep XADD on stack slots working to avoid
changes to the test suite [1].
The following BPF program demonstrates how to leak a BPF map pointer as an
unprivileged user using this bug:
// r7 = map_pointer
BPF_LD_MAP_FD(BPF_REG_7, small_map),
// r8 = launder(map_pointer)
BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_7, -8),
BPF_MOV64_IMM(BPF_REG_1, 0),
((struct bpf_insn) {
.code = BPF_STX | BPF_DW | BPF_XADD,
.dst_reg = BPF_REG_FP,
.src_reg = BPF_REG_1,
.off = -8
}),
BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_FP, -8),
// store r8 into map
BPF_MOV64_REG(BPF_REG_ARG1, BPF_REG_7),
BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
BPF_ST_MEM(BPF_W, BPF_REG_ARG2, 0, 0),
BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),
BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_8, 0),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN()
[1] https://lore.kernel.org/bpf/20200416211116.qxqcza5vo2ddnkdq@ast-mbp.dhcp.thefacebook.com/
Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200417000007.10734-1-jannh@google.com
Add PMTU discovery tests for these encapsulations:
- IPIP
- SIT, mode ip6ip
- ip6tnl, modes ip6ip6 and ipip6
Signed-off-by: Lourdes Pedrajas <lu@pplo.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Disable RISCV BPF JIT builds when !MMU, from Björn Töpel.
2) nf_tables leaves dangling pointer after free, fix from Eric Dumazet.
3) Out of boundary write in __xsk_rcv_memcpy(), fix from Li RongQing.
4) Adjust icmp6 message source address selection when routes have a
preferred source address set, from Tim Stallard.
5) Be sure to validate HSR protocol version when creating new links,
from Taehee Yoo.
6) CAP_NET_ADMIN should be sufficient to manage l2tp tunnels even in
non-initial namespaces, from Michael Weiß.
7) Missing release firmware call in mlx5, from Eran Ben Elisha.
8) Fix variable type in macsec_changelink(), caught by KASAN. Fix from
Taehee Yoo.
9) Fix pause frame negotiation in marvell phy driver, from Clemens
Gruber.
10) Record RX queue early enough in tun packet paths such that XDP
programs will see the correct RX queue index, from Gilberto Bertin.
11) Fix double unlock in mptcp, from Florian Westphal.
12) Fix offset overflow in ARM bpf JIT, from Luke Nelson.
13) marvell10g needs to soft reset PHY when coming out of low power
mode, from Russell King.
14) Fix MTU setting regression in stmmac for some chip types, from
Florian Fainelli.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
amd-xgbe: Use __napi_schedule() in BH context
mISDN: make dmril and dmrim static
net: stmmac: dwmac-sunxi: Provide TX and RX fifo sizes
net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode
tipc: fix incorrect increasing of link window
Documentation: Fix tcp_challenge_ack_limit default value
net: tulip: make early_486_chipsets static
dt-bindings: net: ethernet-phy: add desciption for ethernet-phy-id1234.d400
ipv6: remove redundant assignment to variable err
net/rds: Use ERR_PTR for rds_message_alloc_sgs()
net: mscc: ocelot: fix untagged packet drops when enslaving to vlan aware bridge
selftests/bpf: Check for correct program attach/detach in xdp_attach test
libbpf: Fix type of old_fd in bpf_xdp_set_link_opts
libbpf: Always specify expected_attach_type on program load if supported
xsk: Add missing check on user supplied headroom size
mac80211: fix channel switch trigger from unknown mesh peer
mac80211: fix race in ieee80211_register_hw()
net: marvell10g: soft-reset the PHY when coming out of low power
net: marvell10g: report firmware version
net/cxgb4: Check the return from t4_query_params properly
...
David Ahern noticed that there was a bug in the EXPECTED_FD code so
programs did not get detached properly when that parameter was supplied.
This case was not included in the xdp_attach tests; so let's add it to be
sure that such a bug does not sneak back in down.
Fixes: 87854a0b57 ("selftests/bpf: Add tests for attaching XDP programs")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200414145025.182163-2-toke@redhat.com
For some types of BPF programs that utilize expected_attach_type, libbpf won't
set load_attr.expected_attach_type, even if expected_attach_type is known from
section definition. This was done to preserve backwards compatibility with old
kernels that didn't recognize expected_attach_type attribute yet (which was
added in 5e43f899b0 ("bpf: Check attach type at prog load time"). But this
is problematic for some BPF programs that utilize newer features that require
kernel to know specific expected_attach_type (e.g., extended set of return
codes for cgroup_skb/egress programs).
This patch makes libbpf specify expected_attach_type by default, but also
detect support for this field in kernel and not set it during program load.
This allows to have a good metadata for bpf_program
(e.g., bpf_program__get_extected_attach_type()), but still work with old
kernels (for cases where it can work at all).
Additionally, due to expected_attach_type being always set for recognized
program types, bpf_program__attach_cgroup doesn't have to do extra checks to
determine correct attach type, so remove that additional logic.
Also adjust section_names selftest to account for this change.
More detailed discussion can be found in [0].
[0] https://lore.kernel.org/bpf/20200412003604.GA15986@rdna-mbp.dhcp.thefacebook.com/
Fixes: 5cf1e91456 ("bpf: cgroup inet skb programs can return 0 to 3")
Fixes: 5e43f899b0 ("bpf: Check attach type at prog load time")
Reported-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20200414182645.1368174-1-andriin@fb.com
Test that frozen and mmap()'ed BPF map can't be mprotect()'ed as writable or
executable memory. Also validate that "downgrading" from writable to read-only
doesn't screw up internal writable count accounting for the purposes of map
freezing.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200410202613.3679837-2-andriin@fb.com
After successfully running the IPC msgque test once, subsequent runs
result in a test failure:
$ sudo ./run_kselftest.sh
TAP version 13
1..1
# selftests: ipc: msgque
# Failed to get stats for IPC queue with id 0
# Failed to dump queue: -22
# Bail out!
# # Pass 0 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0
not ok 1 selftests: ipc: msgque # exit=1
The dump_queue() function loops through the possible message queue index
values using calls to msgctl(kern_id, MSG_STAT, ...) where kern_id
represents the index value. The first time the test is ran, the initial
index value of 0 is valid and the test is able to complete. The index
value of 0 is not valid in subsequent test runs and the loop attempts to
try index values of 1, 2, 3, and so on until a valid index value is
found that corresponds to the message queue created earlier in the test.
The msgctl() syscall returns -1 and sets errno to EINVAL when invalid
index values are used. The test failure is caused by incorrectly
comparing errno to -EINVAL when cycling through possible index values.
Fix invalid test failures on subsequent runs of the msgque test by
correctly comparing errno values to a non-negated EINVAL.
Fixes: 3a665531a3 ("selftests: IPC message queue copy feature test")
Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
ftrace-direct.tc and kprobe-direct.tc require CONFIG_SAMPLE_FTRACE_DIRECT=m
so add it to config file which is used by merge_config.sh.
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
glibc 2.31 calls clock_nanosleep when its nanosleep function is used. So
the restart_syscall fails after that. In order to deal with it, we trace
clock_nanosleep and nanosleep. Then we check for either.
This works just fine on systems with both glibc 2.30 and glibc 2.31,
whereas it failed before on a system with glibc 2.31.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
While running seccomp_bpf, kill_after_ptrace() gets stuck if we run it
via /usr/bin/timeout (that is the default), until the timeout expires.
This is because /usr/bin/timeout is preventing to properly deliver
signals to ptrace'd children (SIGSYS in this case).
This problem can be easily reproduced by running:
$ sudo make TARGETS=seccomp kselftest
...
# [ RUN ] TRACE_syscall.skip_a#
not ok 1 selftests: seccomp: seccomp_bpf # TIMEOUT
The test is hanging at this point until the timeout expires and then it
reports the timeout error.
Prevent this problem by passing --foreground to /usr/bin/timeout,
allowing to properly deliver signals to children processes.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Commit d3fd949abd ("selftests: Fix memfd to support relocatable
build (O=objdir)") introduced regression run-time regression with
a change to include programs that should be run from shell scripts
to list of programs that run as independent tests. This fix restores
the original designation.
Fixes: d3fd949abd ("selftests: Fix memfd to support relocatable build (O=objdir)")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Some tests are built only for 64-bit systems. This makes
sure that these tests are built for both big and little
endian variants of powerpc64.
Fixes: 7549b33642 ("selftests: vm: Build/Run 64bit tests only on 64bit arch")
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Independent builds of the vm selftests is currently broken because
commit 7549b33642 ("selftests: vm: Build/Run 64bit tests only on
64bit arch") overrides the value of ARCH with the machine name from
uname. This does not always match the architecture names used for
tasks like header installation.
E.g. for building tests on powerpc64, we need ARCH=powerpc
and not ARCH=ppc64 or ARCH=ppc64le. Otherwise, the build
fails as shown below.
$ uname -m
ppc64le
$ make -C tools/testing/selftests/vm
make: Entering directory '/home/sandipan/linux/tools/testing/selftests/vm'
make --no-builtin-rules ARCH=ppc64le -C ../../../.. headers_install
make[1]: Entering directory '/home/sandipan/linux'
Makefile:653: arch/ppc64le/Makefile: No such file or directory
make[1]: *** No rule to make target 'arch/ppc64le/Makefile'. Stop.
make[1]: Leaving directory '/home/sandipan/linux'
../lib.mk:50: recipe for target 'khdr' failed
make: *** [khdr] Error 2
make: Leaving directory '/home/sandipan/linux/tools/testing/selftests/vm'
Fixes: 7549b33642 ("selftests: vm: Build/Run 64bit tests only on 64bit arch")
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Daniel Borkmann says:
====================
pull-request: bpf 2020-04-10
The following pull-request contains BPF updates for your *net* tree.
We've added 13 non-merge commits during the last 7 day(s) which contain
a total of 13 files changed, 137 insertions(+), 43 deletions(-).
The main changes are:
1) JIT code emission fixes for riscv and arm32, from Luke Nelson and Xi Wang.
2) Disable vmlinux BTF info if GCC_PLUGIN_RANDSTRUCT is used, from Slava Bacherikov.
3) Fix oob write in AF_XDP when meta data is used, from Li RongQing.
4) Fix bpf_get_link_xdp_id() handling on single prog when flags are specified,
from Andrey Ignatov.
5) Fix sk_assign() BPF helper for request sockets that can have sk_reuseport
field uninitialized, from Joe Stringer.
6) Fix mprotect() test case for the BPF LSM, from KP Singh.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more powerpc updates from Michael Ellerman:
"The bulk of this is the series to make CONFIG_COMPAT user-selectable,
it's been around for a long time but was blocked behind the
syscall-in-C series.
Plus there's also a few fixes and other minor things.
Summary:
- A fix for a crash in machine check handling on pseries (ie. guests)
- A small series to make it possible to disable CONFIG_COMPAT, and
turn it off by default for ppc64le where it's not used.
- A few other miscellaneous fixes and small improvements.
Thanks to: Alexey Kardashevskiy, Anju T Sudhakar, Arnd Bergmann,
Christophe Leroy, Dan Carpenter, Ganesh Goudar, Geert Uytterhoeven,
Geoff Levand, Mahesh Salgaonkar, Markus Elfring, Michal Suchanek,
Nicholas Piggin, Stephen Boyd, Wen Xiong"
* tag 'powerpc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Always build the tm-poison test 64-bit
powerpc: Improve ppc_save_regs()
Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled"
powerpc/time: Replace <linux/clk-provider.h> by <linux/of_clk.h>
powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory
powerpc/perf: split callchain.c by bitness
powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.
powerpc/64: make buildable without CONFIG_COMPAT
powerpc/perf: consolidate valid_user_sp -> invalid_user_sp
powerpc/perf: consolidate read_user_stack_32
powerpc: move common register copy functions from signal_32.c to signal.c
powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro
powerpc/ps3: Set CONFIG_UEVENT_HELPER=y in ps3_defconfig
powerpc/ps3: Remove an unneeded NULL check
powerpc/ps3: Remove duplicate error message
powerpc/powernv: Re-enable imc trace-mode in kernel
powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.
powerpc/pseries: Fix MCE handling on pseries
selftests/eeh: Skip ahci adapters
powerpc/64s: Fix doorbell wakeup msgclr optimisation
Pull libnvdimm and dax updates from Dan Williams:
"There were multiple touches outside of drivers/nvdimm/ this round to
add cross arch compatibility to the devm_memremap_pages() interface,
enhance numa information for persistent memory ranges, and add a
zero_page_range() dax operation.
This cycle I switched from the patchwork api to Konstantin's b4 script
for collecting tags (from x86, PowerPC, filesystem, and device-mapper
folks), and everything looks to have gone ok there. This has all
appeared in -next with no reported issues.
Summary:
- Add support for region alignment configuration and enforcement to
fix compatibility across architectures and PowerPC page size
configurations.
- Introduce 'zero_page_range' as a dax operation. This facilitates
filesystem-dax operation without a block-device.
- Introduce phys_to_target_node() to facilitate drivers that want to
know resulting numa node if a given reserved address range was
onlined.
- Advertise a persistence-domain for of_pmem and papr_scm. The
persistence domain indicates where cpu-store cycles need to reach
in the platform-memory subsystem before the platform will consider
them power-fail protected.
- Promote numa_map_to_online_node() to a cross-kernel generic
facility.
- Save x86 numa information to allow for node-id lookups for reserved
memory ranges, deploy that capability for the e820-pmem driver.
- Pick up some miscellaneous minor fixes, that missed v5.6-final,
including a some smatch reports in the ioctl path and some unit
test compilation fixups.
- Fixup some flexible-array declarations"
* tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
dax: Move mandatory ->zero_page_range() check in alloc_dax()
dax,iomap: Add helper dax_iomap_zero() to zero a range
dax: Use new dax zero page method for zeroing a page
dm,dax: Add dax zero_page_range operation
s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
dax, pmem: Add a dax operation zero_page_range
pmem: Add functions for reading/writing page to/from pmem
libnvdimm: Update persistence domain value for of_pmem and papr_scm device
tools/test/nvdimm: Fix out of tree build
libnvdimm/region: Fix build error
libnvdimm/region: Replace zero-length array with flexible-array member
libnvdimm/label: Replace zero-length array with flexible-array member
ACPI: NFIT: Replace zero-length array with flexible-array member
libnvdimm/region: Introduce an 'align' attribute
libnvdimm/region: Introduce NDD_LABELING
libnvdimm/namespace: Enforce memremap_compat_align()
libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
libnvdimm: Out of bounds read in __nd_ioctl()
acpi/nfit: improve bounds checking for 'func'
mm/memremap_pages: Introduce memremap_compat_align()
...
In set_operation_mode() function remove duplicated check for args.list
parameter, which is already done one line before.
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add uffd tests for write protection.
Instead of introducing new tests for it, let's simply squashing uffd-wp
tests into existing uffd-missing test cases. Changes are:
(1) Bouncing tests
We do the write-protection in two ways during the bouncing test:
- By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: then
we'll make sure for each bounce process every single page will be
at least fault twice: once for MISSING, once for WP.
- By direct call UFFDIO_WRITEPROTECT on existing faulted memories:
To further torture the explicit page protection procedures of
uffd-wp, we split each bounce procedure into two halves (in the
background thread): the first half will be MISSING+WP for each
page as explained above. After the first half, we write protect
the faulted region in the background thread to make sure at least
half of the pages will be write protected again which is the first
half to test the new UFFDIO_WRITEPROTECT call. Then we continue
with the 2nd half, which will contain both MISSING and WP faulting
tests for the 2nd half and WP-only faults from the 1st half.
(2) Event/Signal test
Mostly previous tests but will do MISSING+WP for each page. For
sigbus-mode test we'll need to provide standalone path to handle the
write protection faults.
For all tests, do statistics as well for uffd-wp pages.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-20-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>