2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

511 Commits

Author SHA1 Message Date
Andrey Ignatov
7b146cebe3 bpf: Sysctl hook
Containerized applications may run as root and it may create problems
for whole host. Specifically such applications may change a sysctl and
affect applications in other containers.

Furthermore in existing infrastructure it may not be possible to just
completely disable writing to sysctl, instead such a process should be
gradual with ability to log what sysctl are being changed by a
container, investigate, limit the set of writable sysctl to currently
used ones (so that new ones can not be changed) and eventually reduce
this set to zero.

The patch introduces new program type BPF_PROG_TYPE_CGROUP_SYSCTL and
attach type BPF_CGROUP_SYSCTL to solve these problems on cgroup basis.

New program type has access to following minimal context:
	struct bpf_sysctl {
		__u32	write;
	};

Where @write indicates whether sysctl is being read (= 0) or written (=
1).

Helpers to access sysctl name and value will be introduced separately.

BPF_CGROUP_SYSCTL attach point is added to sysctl code right before
passing control to ctl_table->proc_handler so that BPF program can
either allow or deny access to sysctl.

Suggested-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 13:54:58 -07:00
David S. Miller
bb23581b9b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-04-12

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Improve BPF verifier scalability for large programs through two
   optimizations: i) remove verifier states that are not useful in pruning,
   ii) stop walking parentage chain once first LIVE_READ is seen. Combined
   gives approx 20x speedup. Increase limits for accepting large programs
   under root, and add various stress tests, from Alexei.

2) Implement global data support in BPF. This enables static global variables
   for .data, .rodata and .bss sections to be properly handled which allows
   for more natural program development. This also opens up the possibility
   to optimize program workflow by compiling ELFs only once and later only
   rewriting section data before reload, from Daniel and with test cases and
   libbpf refactoring from Joe.

3) Add config option to generate BTF type info for vmlinux as part of the
   kernel build process. DWARF debug info is converted via pahole to BTF.
   Latter relies on libbpf and makes use of BTF deduplication algorithm which
   results in 100x savings compared to DWARF data. Resulting .BTF section is
   typically about 2MB in size, from Andrii.

4) Add BPF verifier support for stack access with variable offset from
   helpers and add various test cases along with it, from Andrey.

5) Extend bpf_skb_adjust_room() growth BPF helper to mark inner MAC header
   so that L2 encapsulation can be used for tc tunnels, from Alan.

6) Add support for input __sk_buff context in BPF_PROG_TEST_RUN so that
   users can define a subset of allowed __sk_buff fields that get fed into
   the test program, from Stanislav.

7) Add bpf fs multi-dimensional array tests for BTF test suite and fix up
   various UBSAN warnings in bpftool, from Yonghong.

8) Generate a pkg-config file for libbpf, from Luca.

9) Dump program's BTF id in bpftool, from Prashant.

10) libbpf fix to use smaller BPF log buffer size for AF_XDP's XDP
    program, from Magnus.

11) kallsyms related fixes for the case when symbols are not present in
    BPF selftests and samples, from Daniel
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 17:00:05 -07:00
Daniel Borkmann
591fe9888d bpf: add program side {rd, wr}only support for maps
This work adds two new map creation flags BPF_F_RDONLY_PROG
and BPF_F_WRONLY_PROG in order to allow for read-only or
write-only BPF maps from a BPF program side.

Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
applies to system call side, meaning the BPF program has full
read/write access to the map as usual while bpf(2) calls with
map fd can either only read or write into the map depending
on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allows
for the exact opposite such that verifier is going to reject
program loads if write into a read-only map or a read into a
write-only map is detected. For read-only map case also some
helpers are forbidden for programs that would alter the map
state such as map deletion, update, etc. As opposed to the two
BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
as BPF_F_WRONLY_PROG really do correspond to the map lifetime.

We've enabled this generic map extension to various non-special
maps holding normal user data: array, hash, lru, lpm, local
storage, queue and stack. Further generic map types could be
followed up in future depending on use-case. Main use case
here is to forbid writes into .rodata map values from verifier
side.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 17:05:46 -07:00
Daniel Borkmann
d8eca5bbb2 bpf: implement lookup-free direct value access for maps
This generic extension to BPF maps allows for directly loading
an address residing inside a BPF map value as a single BPF
ldimm64 instruction!

The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
is a special src_reg flag for ldimm64 instruction that indicates
that inside the first part of the double insns's imm field is a
file descriptor which the verifier then replaces as a full 64bit
address of the map into both imm parts. For the newly added
BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
the first part of the double insns's imm field is again a file
descriptor corresponding to the map, and the second part of the
imm field is an offset into the value. The verifier will then
replace both imm parts with an address that points into the BPF
map value at the given value offset for maps that support this
operation. Currently supported is array map with single entry.
It is possible to support more than just single map element by
reusing both 16bit off fields of the insns as a map index, so
full array map lookup could be expressed that way. It hasn't
been implemented here due to lack of concrete use case, but
could easily be done so in future in a compatible way, since
both off fields right now have to be 0 and would correctly
denote a map index 0.

The BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with
BPF_PSEUDO_MAP_FD we could not differ offset 0 between load of
map pointer versus load of map's value at offset 0, and changing
BPF_PSEUDO_MAP_FD's encoding into off by one to differ between
regular map pointer and map value pointer would add unnecessary
complexity and increases barrier for debugability thus less
suitable. Using the second part of the imm field as an offset
into the value does /not/ come with limitations since maximum
possible value size is in u32 universe anyway.

This optimization allows for efficiently retrieving an address
to a map value memory area without having to issue a helper call
which needs to prepare registers according to calling convention,
etc, without needing the extra NULL test, and without having to
add the offset in an additional instruction to the value base
pointer. The verifier then treats the destination register as
PTR_TO_MAP_VALUE with constant reg->off from the user passed
offset from the second imm field, and guarantees that this is
within bounds of the map value. Any subsequent operations are
normally treated as typical map value handling without anything
extra needed from verification side.

The two map operations for direct value access have been added to
array map for now. In future other types could be supported as
well depending on the use case. The main use case for this commit
is to allow for BPF loader support for global variables that
reside in .data/.rodata/.bss sections such that we can directly
load the address of them with minimal additional infrastructure
required. Loader support has been added in subsequent commits for
libbpf library.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 17:05:46 -07:00
David S. Miller
f83f715195 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor comment merge conflict in mlx5.

Staging driver has a fixup due to the skb->xmit_more changes
in 'net-next', but was removed in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-05 14:14:19 -07:00
Andrey Ignatov
1fbd20f8b7 bpf: Add missed newline in verifier verbose log
check_stack_access() that prints verbose log is used in
adjust_ptr_min_max_vals() that prints its own verbose log and now they
stick together, e.g.:

  variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16
  size=1R2 stack pointer arithmetic goes out of range, prohibited for
  !root

Add missing newline so that log is more readable:
  variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1
  R2 stack pointer arithmetic goes out of range, prohibited for !root

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05 16:50:08 +02:00
Andrey Ignatov
107c26a70c bpf: Sanity check max value for var_off stack access
As discussed in [1] max value of variable offset has to be checked for
overflow on stack access otherwise verifier would accept code like this:

  0: (b7) r2 = 6
  1: (b7) r3 = 28
  2: (7a) *(u64 *)(r10 -16) = 0
  3: (7a) *(u64 *)(r10 -8) = 0
  4: (79) r4 = *(u64 *)(r1 +168)
  5: (c5) if r4 s< 0x0 goto pc+4
   R1=ctx(id=0,off=0,imm=0) R2=inv6 R3=inv28
   R4=inv(id=0,umax_value=9223372036854775807,var_off=(0x0;
   0x7fffffffffffffff)) R10=fp0,call_-1 fp-8=mmmmmmmm fp-16=mmmmmmmm
  6: (17) r4 -= 16
  7: (0f) r4 += r10
  8: (b7) r5 = 8
  9: (85) call bpf_getsockopt#57
  10: (b7) r0 = 0
  11: (95) exit

, where R4 obviosly has unbounded max value.

Fix it by checking that reg->smax_value is inside (-BPF_MAX_VAR_OFF;
BPF_MAX_VAR_OFF) range.

reg->smax_value is used instead of reg->umax_value because stack
pointers are calculated using negative offset from fp. This is opposite
to e.g. map access where offset must be non-negative and where
umax_value is used.

Also dedicated verbose logs are added for both min and max bound check
failures to have diagnostics consistent with variable offset handling in
check_map_access().

[1] https://marc.info/?l=linux-netdev&m=155433357510597&w=2

Fixes: 2011fccfb6 ("bpf: Support variable offset stack access from helpers")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05 16:50:08 +02:00
Andrey Ignatov
088ec26d9c bpf: Reject indirect var_off stack access in unpriv mode
Proper support of indirect stack access with variable offset in
unprivileged mode (!root) requires corresponding support in Spectre
masking for stack ALU in retrieve_ptr_limit().

There are no use-case for variable offset in unprivileged mode though so
make verifier reject such accesses for simplicity.

Pointer arithmetics is one (and only?) way to cause variable offset and
it's already rejected in unpriv mode so that verifier won't even get to
helper function whose argument contains variable offset, e.g.:

  0: (7a) *(u64 *)(r10 -16) = 0
  1: (7a) *(u64 *)(r10 -8) = 0
  2: (61) r2 = *(u32 *)(r1 +0)
  3: (57) r2 &= 4
  4: (17) r2 -= 16
  5: (0f) r2 += r10
  variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1R2
  stack pointer arithmetic goes out of range, prohibited for !root

Still it looks like a good idea to reject variable offset indirect stack
access for unprivileged mode in check_stack_boundary() explicitly.

Fixes: 2011fccfb6 ("bpf: Support variable offset stack access from helpers")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05 16:50:07 +02:00
Andrey Ignatov
f2bcd05ec7 bpf: Reject indirect var_off stack access in raw mode
It's hard to guarantee that whole memory is marked as initialized on
helper return if uninitialized stack is accessed with variable offset
since specific bounds are unknown to verifier. This may cause
uninitialized stack leaking.

Reject such an access in check_stack_boundary to prevent possible
leaking.

There are no known use-cases for indirect uninitialized stack access
with variable offset so it shouldn't break anything.

Fixes: 2011fccfb6 ("bpf: Support variable offset stack access from helpers")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05 16:50:07 +02:00
Alexei Starovoitov
7a9f5c65ab bpf: increase verifier log limit
The existing 16Mbyte verifier log limit is not enough for log_level=2
even for small programs. Increase it to 1Gbyte.
Note it's not a kernel memory limit.
It's an amount of memory user space provides to store
the verifier log. The kernel populates it 1k at a time.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:38 +02:00
Alexei Starovoitov
c04c0d2b96 bpf: increase complexity limit and maximum program size
Large verifier speed improvements allow to increase
verifier complexity limit.
Now regardless of the program composition and its size it takes
little time for the verifier to hit insn_processed limit.
On typical x86 machine non-debug kernel processes 1M instructions
in 1/10 of a second.
(before these speed improvements specially crafted programs
could be hitting multi-second verification times)
Full kasan kernel with debug takes ~1 second for the same 1M insns.
Hence bump the BPF_COMPLEXITY_LIMIT_INSNS limit to 1M.
Also increase the number of instructions per program
from 4k to internal BPF_COMPLEXITY_LIMIT_INSNS limit.
4k limit was confusing to users, since small programs with hundreds
of insns could be hitting BPF_COMPLEXITY_LIMIT_INSNS limit.
Sometimes adding more insns and bpf_trace_printk debug statements
would make the verifier accept the program while removing
code would make the verifier reject it.
Some user space application started to add #define MAX_FOO to
their programs and do:
  MAX_FOO=100;
again:
  compile with MAX_FOO;
  try to load;
  if (fails_to_load) { reduce MAX_FOO; goto again; }
to be able to fit maximum amount of processing into single program.
Other users artificially split their single program into a set of programs
and use all 32 iterations of tail_calls to increase compute limits.
And the most advanced folks used unlimited tc-bpf filter list
to execute many bpf programs.
Essentially the users managed to workaround 4k insn limit.
This patch removes the limit for root programs from uapi.
BPF_COMPLEXITY_LIMIT_INSNS is the kernel internal limit
and success to load the program no longer depends on program size,
but on 'smartness' of the verifier only.
The verifier will continue to get smarter with every kernel release.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:38 +02:00
Alexei Starovoitov
4f73379ec5 bpf: verbose jump offset overflow check
Larger programs may trigger 16-bit jump offset overflow check
during instruction patching. Make this error verbose otherwise
users cannot decipher error code without printks in the verifier.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:38 +02:00
Alexei Starovoitov
71dde681a8 bpf: convert temp arrays to kvcalloc
Temporary arrays used during program verification need to be vmalloc-ed
to support large bpf programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:38 +02:00
Alexei Starovoitov
25af32dad8 bpf: improve verification speed by not remarking live_read
With large verifier speed improvement brought by the previous patch
mark_reg_read() becomes the hottest function during verification.
On a typical program it consumes 40% of cpu.
mark_reg_read() walks parentage chain of registers to mark parents as LIVE_READ.
Once the register is marked there is no need to remark it again in the future.
Hence stop walking the chain once first LIVE_READ is seen.
This optimization drops mark_reg_read() time from 40% of cpu to <1%
and overall 2x improvement of verification speed.
For some programs the longest_mark_read_walk counter improves from ~500 to ~5

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:37 +02:00
Alexei Starovoitov
9f4686c41b bpf: improve verification speed by droping states
Branch instructions, branch targets and calls in a bpf program are
the places where the verifier remembers states that led to successful
verification of the program.
These states are used to prune brute force program analysis.
For unprivileged programs there is a limit of 64 states per such
'branching' instructions (maximum length is tracked by max_states_per_insn
counter introduced in the previous patch).
Simply reducing this threshold to 32 or lower increases insn_processed
metric to the point that small valid programs get rejected.
For root programs there is no limit and cilium programs can have
max_states_per_insn to be 100 or higher.
Walking 100+ states multiplied by number of 'branching' insns during
verification consumes significant amount of cpu time.
Turned out simple LRU-like mechanism can be used to remove states
that unlikely will be helpful in future search pruning.
This patch introduces hit_cnt and miss_cnt counters:
hit_cnt - this many times this state successfully pruned the search
miss_cnt - this many times this state was not equivalent to other states
(and that other states were added to state list)

The heuristic introduced in this patch is:
if (sl->miss_cnt > sl->hit_cnt * 3 + 3)
  /* drop this state from future considerations */

Higher numbers increase max_states_per_insn (allow more states to be
considered for pruning) and slow verification speed, but do not meaningfully
reduce insn_processed metric.
Lower numbers drop too many states and insn_processed increases too much.
Many different formulas were considered.
This one is simple and works well enough in practice.
(the analysis was done on selftests/progs/* and on cilium programs)

The end result is this heuristic improves verification speed by 10 times.
Large synthetic programs that used to take a second more now take
1/10 of a second.
In cases where max_states_per_insn used to be 100 or more, now it's ~10.

There is a slight increase in insn_processed for cilium progs:
                       before   after
bpf_lb-DLB_L3.o 	1831	1838
bpf_lb-DLB_L4.o 	3029	3218
bpf_lb-DUNKNOWN.o 	1064	1064
bpf_lxc-DDROP_ALL.o	26309	26935
bpf_lxc-DUNKNOWN.o	33517	34439
bpf_netdev.o		9713	9721
bpf_overlay.o		6184	6184
bpf_lcx_jit.o		37335	39389
And 2-3 times improvement in the verification speed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:37 +02:00
Alexei Starovoitov
06ee7115b0 bpf: add verifier stats and log_level bit 2
In order to understand the verifier bottlenecks add various stats
and extend log_level:
log_level 1 and 2 are kept as-is:
bit 0 - level=1 - print every insn and verifier state at branch points
bit 1 - level=2 - print every insn and verifier state at every insn
bit 2 - level=4 - print verifier error and stats at the end of verification

When verifier rejects the program the libbpf is trying to load the program twice.
Once with log_level=0 (no messages, only error code is reported to user space)
and second time with log_level=1 to tell the user why the verifier rejected it.

With introduction of bit 2 - level=4 the libbpf can choose to always use that
level and load programs once, since the verification speed is not affected and
in case of error the verbose message will be available.

Note that the verifier stats are not part of uapi just like all other
verbose messages. They're expected to change in the future.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:37 +02:00
Andrey Ignatov
2011fccfb6 bpf: Support variable offset stack access from helpers
Currently there is a difference in how verifier checks memory access for
helper arguments for PTR_TO_MAP_VALUE and PTR_TO_STACK with regard to
variable part of offset.

check_map_access, that is used for PTR_TO_MAP_VALUE, can handle variable
offsets just fine, so that BPF program can call a helper like this:

  some_helper(map_value_ptr + off, size);

, where offset is unknown at load time, but is checked by program to be
in a safe rage (off >= 0 && off + size < map_value_size).

But it's not the case for check_stack_boundary, that is used for
PTR_TO_STACK, and same code with pointer to stack is rejected by
verifier:

  some_helper(stack_value_ptr + off, size);

For example:
  0: (7a) *(u64 *)(r10 -16) = 0
  1: (7a) *(u64 *)(r10 -8) = 0
  2: (61) r2 = *(u32 *)(r1 +0)
  3: (57) r2 &= 4
  4: (17) r2 -= 16
  5: (0f) r2 += r10
  6: (18) r1 = 0xffff888111343a80
  8: (85) call bpf_map_lookup_elem#1
  invalid variable stack read R2 var_off=(0xfffffffffffffff0; 0x4)

Add support for variable offset access to check_stack_boundary so that
if offset is checked by program to be in a safe range it's accepted by
verifier.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-29 12:05:35 -07:00
David S. Miller
356d71e00d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-03-27 17:37:58 -07:00
Paul Chaignon
927cb78177 bpf: remove incorrect 'verifier bug' warning
The BPF verifier checks the maximum number of call stack frames twice,
first in the main CFG traversal (do_check) and then in a subsequent
traversal (check_max_stack_depth).  If the second check fails, it logs a
'verifier bug' warning and errors out, as the number of call stack frames
should have been verified already.

However, the second check may fail without indicating a verifier bug: if
the excessive function calls reside in dead code, the main CFG traversal
may not visit them; the subsequent traversal visits all instructions,
including dead code.

This case raises the question of how invalid dead code should be treated.
This patch implements the conservative option and rejects such code.

Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
Tested-by: Xiao Han <xiao.han@orange.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-26 13:02:16 -07:00
Jakub Kicinski
83d163124c bpf: verifier: propagate liveness on all frames
Commit 7640ead939 ("bpf: verifier: make sure callees don't prune
with caller differences") connected up parentage chains of all
frames of the stack.  It didn't, however, ensure propagate_liveness()
propagates all liveness information along those chains.

This means pruning happening in the callee may generate explored
states with incomplete liveness for the chains in lower frames
of the stack.

The included selftest is similar to the prior one from commit
7640ead939 ("bpf: verifier: make sure callees don't prune with
caller differences"), where callee would prune regardless of the
difference in r8 state.

Now we also initialize r9 to 0 or 1 based on a result from get_random().
r9 is never read so the walk with r9 = 0 gets pruned (correctly) after
the walk with r9 = 1 completes.

The selftest is so arranged that the pruning will happen in the
callee.  Since callee does not propagate read marks of r8, the
explored state at the pruning point prior to the callee will
now ignore r8.

Propagate liveness on all frames of the stack when pruning.

Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-21 19:57:02 -07:00
Lorenz Bauer
edbf8c01de bpf: add skc_lookup_tcp helper
Allow looking up a sock_common. This gives eBPF programs
access to timewait and request sockets.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-21 18:59:10 -07:00
Lorenz Bauer
85a51f8c28 bpf: allow helpers to return PTR_TO_SOCK_COMMON
It's currently not possible to access timewait or request sockets
from eBPF, since there is no way to return a PTR_TO_SOCK_COMMON
from a helper. Introduce RET_PTR_TO_SOCK_COMMON to enable this
behaviour.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-21 18:59:10 -07:00
Lorenz Bauer
0f3adc288d bpf: track references based on is_acquire_func
So far, the verifier only acquires reference tracking state for
RET_PTR_TO_SOCKET_OR_NULL. Instead of extending this for every
new return type which desires these semantics, acquire reference
tracking state iff the called helper is an acquire function.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-21 18:59:10 -07:00
Xu Yu
0803278b0b bpf: do not restore dst_reg when cur_state is freed
Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.

Call trace:

  dump_stack+0xbf/0x12e
  print_address_description+0x6a/0x280
  kasan_report+0x237/0x360
  sanitize_ptr_alu+0x85a/0x8d0
  adjust_ptr_min_max_vals+0x8f2/0x1ca0
  adjust_reg_min_max_vals+0x8ed/0x22e0
  do_check+0x1ca6/0x5d00
  bpf_check+0x9ca/0x2570
  bpf_prog_load+0xc91/0x1030
  __se_sys_bpf+0x61e/0x1f00
  do_syscall_64+0xc8/0x550
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fault injection trace:

  kfree+0xea/0x290
  free_func_state+0x4a/0x60
  free_verifier_state+0x61/0xe0
  push_stack+0x216/0x2f0	          <- inject failslab
  sanitize_ptr_alu+0x2b1/0x8d0
  adjust_ptr_min_max_vals+0x8f2/0x1ca0
  adjust_reg_min_max_vals+0x8ed/0x22e0
  do_check+0x1ca6/0x5d00
  bpf_check+0x9ca/0x2570
  bpf_prog_load+0xc91/0x1030
  __se_sys_bpf+0x61e/0x1f00
  do_syscall_64+0xc8/0x550
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

When kzalloc() fails in push_stack(), free_verifier_state() will free
current verifier state. As push_stack() returns, dst_reg was restored
if ptr_is_dst_reg is false. However, as member of the cur_state,
dst_reg is also freed, and error occurs when dereferencing dst_reg.
Simply fix it by testing ret of push_stack() before restoring dst_reg.

Fixes: 979d63d50c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-21 12:18:18 +01:00
Martin KaFai Lau
cba368c1f0 bpf: Only print ref_obj_id for refcounted reg
Naresh reported that test_align fails because of the mismatch at the
verbose printout of the register states.  The reason is due to the newly
added ref_obj_id.

ref_obj_id is only useful for refcounted reg.  Thus, this patch fixes it
by only printing ref_obj_id for refcounted reg.  While at it, it also uses
comma instead of space to separate between "id" and "ref_obj_id".

Fixes: 1b98658968 ("bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-20 18:24:35 -07:00
Martin KaFai Lau
1b98658968 bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release
Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
can still be accessed after bpf_sk_release(sk).
Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
This patch addresses them together.

A simple reproducer looks like this:

	sk = bpf_sk_lookup_tcp();
	/* if (!sk) ... */
	tp = bpf_tcp_sock(sk);
	/* if (!tp) ... */
	bpf_sk_release(sk);
	snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */

The problem is the verifier did not scrub the register's states of
the tcp_sock ptr (tp) after bpf_sk_release(sk).

[ Note that when calling bpf_tcp_sock(sk), the sk is not always
  refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
  fine for this case. ]

Currently, the verifier does not track if a helper's return ptr (in REG_0)
is "carry"-ing one of its argument's refcount status. To carry this info,
the reg1->id needs to be stored in reg0.

One approach was tried, like "reg0->id = reg1->id", when calling
"bpf_tcp_sock()".  The main idea was to avoid adding another "ref_obj_id"
for the same reg.  However, overlapping the NULL marking and ref
tracking purpose in one "id" does not work well:

	ref_sk = bpf_sk_lookup_tcp();
	fullsock = bpf_sk_fullsock(ref_sk);
	tp = bpf_tcp_sock(ref_sk);
	if (!fullsock) {
	     bpf_sk_release(ref_sk);
	     return 0;
	}
	/* fullsock_reg->id is marked for NOT-NULL.
	 * Same for tp_reg->id because they have the same id.
	 */

	/* oops. verifier did not complain about the missing !tp check */
	snd_cwnd = tp->snd_cwnd;

Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
scrub all reg states which has a ref_obj_id match.  It is done with the
changes in release_reg_references() in this patch.

While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
bpf_sk_fullsock() to avoid these helpers from returning
another ptr. It will make bpf_sk_release(tp) possible:

	sk = bpf_sk_lookup_tcp();
	/* if (!sk) ... */
	tp = bpf_tcp_sock(sk);
	/* if (!tp) ... */
	bpf_sk_release(tp);

A separate helper "bpf_get_listener_sock()" will be added in a later
patch to do sk_to_full_sk().

Misc change notes:
- To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
  from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON.  ARG_PTR_TO_SOCKET
  is removed from bpf.h since no helper is using it.

- arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
  because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not
  refcounted.  All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
  take ARG_PTR_TO_SOCK_COMMON.

- check_refcount_ok() ensures is_acquire_function() cannot take
  arg_type_may_be_refcounted() as its argument.

- The check_func_arg() can only allow one refcount-ed arg.  It is
  guaranteed by check_refcount_ok() which ensures at most one arg can be
  refcounted.  Hence, it is a verifier internal error if >1 refcount arg
  found in check_func_arg().

- In release_reference(), release_reference_state() is called
  first to ensure a match on "reg->ref_obj_id" can be found before
  scrubbing the reg states with release_reg_references().

- reg_is_refcounted() is no longer needed.
  1. In mark_ptr_or_null_regs(), its usage is replaced by
     "ref_obj_id && ref_obj_id == id" because,
     when is_null == true, release_reference_state() should only be
     called on the ref_obj_id obtained by a acquire helper (i.e.
     is_acquire_function() == true).  Otherwise, the following
     would happen:

	sk = bpf_sk_lookup_tcp();
	/* if (!sk) { ... } */
	fullsock = bpf_sk_fullsock(sk);
	if (!fullsock) {
		/*
		 * release_reference_state(fullsock_reg->ref_obj_id)
		 * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id.
		 *
		 * Hence, the following bpf_sk_release(sk) will fail
		 * because the ref state has already been released in the
		 * earlier release_reference_state(fullsock_reg->ref_obj_id).
		 */
		bpf_sk_release(sk);
	}

  2. In release_reg_references(), the current reg_is_refcounted() call
     is unnecessary because the id check is enough.

- The type_is_refcounted() and type_is_refcounted_or_null()
  are no longer needed also because reg_is_refcounted() is removed.

Fixes: 655a51e536 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-13 12:04:35 -07:00
Daniel Borkmann
20182390c4 bpf: fix replace_map_fd_with_map_ptr's ldimm64 second imm field
Non-zero imm value in the second part of the ldimm64 instruction for
BPF_PSEUDO_MAP_FD is invalid, and thus must be rejected. The map fd
only ever sits in the first instructions' imm field. None of the BPF
loaders known to us are using it, so risk of regression is minimal.
For clarity and consistency, the few insn->{src_reg,imm} occurrences
are rewritten into insn[0].{src_reg,imm}. Add a test case to the BPF
selftest suite as well.

Fixes: 0246e64d9a ("bpf: handle pseudo BPF_LD_IMM64 insn")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-07 08:47:13 -08:00
David S. Miller
f7fb7c1a1c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-03-04

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add AF_XDP support to libbpf. Rationale is to facilitate writing
   AF_XDP applications by offering higher-level APIs that hide many
   of the details of the AF_XDP uapi. Sample programs are converted
   over to this new interface as well, from Magnus.

2) Introduce a new cant_sleep() macro for annotation of functions
   that cannot sleep and use it in BPF_PROG_RUN() to assert that
   BPF programs run under preemption disabled context, from Peter.

3) Introduce per BPF prog stats in order to monitor the usage
   of BPF; this is controlled by kernel.bpf_stats_enabled sysctl
   knob where monitoring tools can make use of this to efficiently
   determine the average cost of programs, from Alexei.

4) Split up BPF selftest's test_progs similarly as we already
   did with test_verifier. This allows to further reduce merge
   conflicts in future and to get more structure into our
   quickly growing BPF selftest suite, from Stanislav.

5) Fix a bug in BTF's dedup algorithm which can cause an infinite
   loop in some circumstances; also various BPF doc fixes and
   improvements, from Andrii.

6) Various BPF sample cleanups and migration to libbpf in order
   to further isolate the old sample loader code (so we can get
   rid of it at some point), from Jakub.

7) Add a new BPF helper for BPF cgroup skb progs that allows
   to set ECN CE code point and a Host Bandwidth Manager (HBM)
   sample program for limiting the bandwidth used by v2 cgroups,
   from Lawrence.

8) Enable write access to skb->queue_mapping from tc BPF egress
   programs in order to let BPF pick TX queue, from Jesper.

9) Fix a bug in BPF spinlock handling for map-in-map which did
   not propagate spin_lock_off to the meta map, from Yonghong.

10) Fix a bug in the new per-CPU BPF prog counters to properly
    initialize stats for each CPU, from Eric.

11) Add various BPF helper prototypes to selftest's bpf_helpers.h,
    from Willem.

12) Fix various BPF samples bugs in XDP and tracing progs,
    from Toke, Daniel and Yonghong.

13) Silence preemption splat in test_bpf after BPF_PROG_RUN()
    enforces it now everywhere, from Anders.

14) Fix a signedness bug in libbpf's btf_dedup_ref_type() to
    get error handling working, from Dan.

15) Fix bpftool documentation and auto-completion with regards
    to stream_{verdict,parser} attach types, from Alban.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04 10:14:31 -08:00
David S. Miller
9eb359140c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-03-02 12:54:35 -08:00
Daniel Borkmann
3612af783c bpf: fix sanitation rewrite in case of non-pointers
Marek reported that he saw an issue with the below snippet in that
timing measurements where off when loaded as unpriv while results
were reasonable when loaded as privileged:

    [...]
    uint64_t a = bpf_ktime_get_ns();
    uint64_t b = bpf_ktime_get_ns();
    uint64_t delta = b - a;
    if ((int64_t)delta > 0) {
    [...]

Turns out there is a bug where a corner case is missing in the fix
d3bd7413e0 ("bpf: fix sanitation of alu op with pointer / scalar
type from different paths"), namely fixup_bpf_calls() only checks
whether aux has a non-zero alu_state, but it also needs to test for
the case of BPF_ALU_NON_POINTER since in both occasions we need to
skip the masking rewrite (as there is nothing to mask).

Fixes: d3bd7413e0 ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
Reported-by: Marek Majkowski <marek@cloudflare.com>
Reported-by: Arthur Fabre <afabre@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-01 21:24:08 -08:00
Alexei Starovoitov
492ecee892 bpf: enable program stats
JITed BPF programs are indistinguishable from kernel functions, but unlike
kernel code BPF code can be changed often.
Typical approach of "perf record" + "perf report" profiling and tuning of
kernel code works just as well for BPF programs, but kernel code doesn't
need to be monitored whereas BPF programs do.
Users load and run large amount of BPF programs.
These BPF stats allow tools monitor the usage of BPF on the server.
The monitoring tools will turn sysctl kernel.bpf_stats_enabled
on and off for few seconds to sample average cost of the programs.
Aggregated data over hours and days will provide an insight into cost of BPF
and alarms can trigger in case given program suddenly gets more expensive.

The cost of two sched_clock() per program invocation adds ~20 nsec.
Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down
from ~10 nsec to ~30 nsec.
static_key minimizes the cost of the stats collection.
There is no measurable difference before/after this patch
with kernel.bpf_stats_enabled=0

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-27 17:22:50 +01:00
Martin KaFai Lau
655a51e536 bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock
This patch adds a helper function BPF_FUNC_tcp_sock and it
is currently available for cg_skb and sched_(cls|act):

struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_tcp_sock *tp;
	struct bpf_sock *sk;
	__u32 snd_cwnd;

	sk = skb->sk;
	if (!sk)
		return 1;

	tp = bpf_tcp_sock(sk);
	if (!tp)
		return 1;

	snd_cwnd = tp->snd_cwnd;
	/* ... */

	return 1;
}

A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
read-only access.  bpf_tcp_sock has all the existing tcp_sock's fields
that has already been exposed by the bpf_sock_ops.
i.e. no new tcp_sock's fields are exposed in bpf.h.

This helper returns a pointer to the tcp_sock.  If it is not a tcp_sock
or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it
returns NULL.  Hence, the caller needs to check for NULL before
accessing it.

The current use case is to expose members from tcp_sock
to allow a cg_skb_bpf_prog to provide per cgroup traffic
policing/shaping.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10 19:46:17 -08:00
Martin KaFai Lau
46f8bc9275 bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
before accessing the fields in sock.  For example, in __netdev_pick_tx:

static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
			    struct net_device *sb_dev)
{
	/* ... */

	struct sock *sk = skb->sk;

		if (queue_index != new_index && sk &&
		    sk_fullsock(sk) &&
		    rcu_access_pointer(sk->sk_dst_cache))
			sk_tx_queue_set(sk, new_index);

	/* ... */

	return queue_index;
}

This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
where a few of the convert_ctx_access() in filter.c has already been
accessing the skb->sk sock_common's fields,
e.g. sock_ops_convert_ctx_access().

"__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
Some of the fileds in "bpf_sock" will not be directly
accessible through the "__sk_buff->sk" pointer.  It is limited
by the new "bpf_sock_common_is_valid_access()".
e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
     are not allowed.

The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
can be used to get a sk with all accessible fields in "bpf_sock".
This helper is added to both cg_skb and sched_(cls|act).

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_sock *sk;

	sk = skb->sk;
	if (!sk)
		return 1;

	sk = bpf_sk_fullsock(sk);
	if (!sk)
		return 1;

	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
		return 1;

	/* some_traffic_shaping(); */

	return 1;
}

(1) The sk is read only

(2) There is no new "struct bpf_sock_common" introduced.

(3) Future kernel sock's members could be added to bpf_sock only
    instead of repeatedly adding at multiple places like currently
    in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.

(4) After "sk = skb->sk", the reg holding sk is in type
    PTR_TO_SOCK_COMMON_OR_NULL.

(5) After bpf_sk_fullsock(), the return type will be in type
    PTR_TO_SOCKET_OR_NULL which is the same as the return type of
    bpf_sk_lookup_xxx().

    However, bpf_sk_fullsock() does not take refcnt.  The
    acquire_reference_state() is only depending on the return type now.
    To avoid it, a new is_acquire_function() is checked before calling
    acquire_reference_state().

(6) The WARN_ON in "release_reference_state()" is no longer an
    internal verifier bug.

    When reg->id is not found in state->refs[], it means the
    bpf_prog does something wrong like
    "bpf_sk_release(bpf_sk_fullsock(skb->sk))" where reference has
    never been acquired by calling "bpf_sk_fullsock(skb->sk)".

    A -EINVAL and a verbose are done instead of WARN_ON.  A test is
    added to the test_verifier in a later patch.

    Since the WARN_ON in "release_reference_state()" is no longer
    needed, "__release_reference_state()" is folded into
    "release_reference_state()" also.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10 19:46:17 -08:00
Martin KaFai Lau
5f4566498d bpf: Fix narrow load on a bpf_sock returned from sk_lookup()
By adding this test to test_verifier:
{
	"reference tracking: access sk->src_ip4 (narrow load)",
	.insns = {
	BPF_SK_LOOKUP,
	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
	BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0, offsetof(struct bpf_sock, src_ip4) + 2),
	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
	BPF_EMIT_CALL(BPF_FUNC_sk_release),
	BPF_EXIT_INSN(),
	},
	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
	.result = ACCEPT,
},

The above test loads 2 bytes from sk->src_ip4 where
sk is obtained by bpf_sk_lookup_tcp().

It hits an internal verifier error from convert_ctx_accesses():
[root@arch-fb-vm1 bpf]# ./test_verifier 665 665
Failed to load prog 'Invalid argument'!
0: (b7) r2 = 0
1: (63) *(u32 *)(r10 -8) = r2
2: (7b) *(u64 *)(r10 -16) = r2
3: (7b) *(u64 *)(r10 -24) = r2
4: (7b) *(u64 *)(r10 -32) = r2
5: (7b) *(u64 *)(r10 -40) = r2
6: (7b) *(u64 *)(r10 -48) = r2
7: (bf) r2 = r10
8: (07) r2 += -48
9: (b7) r3 = 36
10: (b7) r4 = 0
11: (b7) r5 = 0
12: (85) call bpf_sk_lookup_tcp#84
13: (bf) r6 = r0
14: (15) if r0 == 0x0 goto pc+3
 R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1
15: (69) r2 = *(u16 *)(r0 +26)
16: (bf) r1 = r6
17: (85) call bpf_sk_release#86
18: (95) exit

from 14 to 18: safe
processed 20 insns (limit 131072), stack depth 48
bpf verifier is misconfigured
Summary: 0 PASSED, 0 SKIPPED, 1 FAILED

The bpf_sock_is_valid_access() is expecting src_ip4 can be narrowly
loaded (meaning load any 1 or 2 bytes of the src_ip4) by
marking info->ctx_field_size.  However, this marked
ctx_field_size is not used.  This patch fixes it.

Due to the recent refactoring in test_verifier,
this new test will be added to the bpf-next branch
(together with the bpf_tcp_sock patchset)
to avoid merge conflict.

Fixes: c64b798328 ("bpf: Add PTR_TO_SOCKET verifier type")
Cc: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10 19:37:41 -08:00
Martin KaFai Lau
d623876646 bpf: Fix narrow load on a bpf_sock returned from sk_lookup()
By adding this test to test_verifier:
{
	"reference tracking: access sk->src_ip4 (narrow load)",
	.insns = {
	BPF_SK_LOOKUP,
	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
	BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0, offsetof(struct bpf_sock, src_ip4) + 2),
	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
	BPF_EMIT_CALL(BPF_FUNC_sk_release),
	BPF_EXIT_INSN(),
	},
	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
	.result = ACCEPT,
},

The above test loads 2 bytes from sk->src_ip4 where
sk is obtained by bpf_sk_lookup_tcp().

It hits an internal verifier error from convert_ctx_accesses():
[root@arch-fb-vm1 bpf]# ./test_verifier 665 665
Failed to load prog 'Invalid argument'!
0: (b7) r2 = 0
1: (63) *(u32 *)(r10 -8) = r2
2: (7b) *(u64 *)(r10 -16) = r2
3: (7b) *(u64 *)(r10 -24) = r2
4: (7b) *(u64 *)(r10 -32) = r2
5: (7b) *(u64 *)(r10 -40) = r2
6: (7b) *(u64 *)(r10 -48) = r2
7: (bf) r2 = r10
8: (07) r2 += -48
9: (b7) r3 = 36
10: (b7) r4 = 0
11: (b7) r5 = 0
12: (85) call bpf_sk_lookup_tcp#84
13: (bf) r6 = r0
14: (15) if r0 == 0x0 goto pc+3
 R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1
15: (69) r2 = *(u16 *)(r0 +26)
16: (bf) r1 = r6
17: (85) call bpf_sk_release#86
18: (95) exit

from 14 to 18: safe
processed 20 insns (limit 131072), stack depth 48
bpf verifier is misconfigured
Summary: 0 PASSED, 0 SKIPPED, 1 FAILED

The bpf_sock_is_valid_access() is expecting src_ip4 can be narrowly
loaded (meaning load any 1 or 2 bytes of the src_ip4) by
marking info->ctx_field_size.  However, this marked
ctx_field_size is not used.  This patch fixes it.

Due to the recent refactoring in test_verifier,
this new test will be added to the bpf-next branch
(together with the bpf_tcp_sock patchset)
to avoid merge conflict.

Fixes: c64b798328 ("bpf: Add PTR_TO_SOCKET verifier type")
Cc: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-09 19:57:22 -08:00
Alexei Starovoitov
e16d2f1ab9 bpf: add support for bpf_spin_lock to cgroup local storage
Allow 'struct bpf_spin_lock' to reside inside cgroup local storage.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01 20:55:38 +01:00
Alexei Starovoitov
d83525ca62 bpf: introduce bpf_spin_lock
Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
bpf program serialize access to other variables.

Example:
struct hash_elem {
    int cnt;
    struct bpf_spin_lock lock;
};
struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
    bpf_spin_lock(&val->lock);
    val->cnt++;
    bpf_spin_unlock(&val->lock);
}

Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can
  cause dead locks.
- only one 'struct bpf_spin_lock' is allowed per map element.
  It drastically simplifies implementation yet allows bpf program to use
  any number of bpf_spin_locks.
- when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
- bpf program must bpf_spin_unlock() before return.
- bpf program can access 'struct bpf_spin_lock' only via
  bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be
  a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
  Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet.
  bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to
  insufficient preemption checks

Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs.
  Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
  Other solutions to avoid nested bpf_spin_lock are possible.
  Like making sure that all networking progs run with softirq disabled.
  spin_lock_irqsave is the simplest and doesn't add overhead to the
  programs that don't use it.
- arch_spinlock_t is used when its implemented as queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and
  sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via
  extra flag during map_create, but specifying it via BTF is cleaner.
  It provides introspection for map key/value and reduces user mistakes.

Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
  to request kernel to grab bpf_spin_lock before rewriting the value.
  That will serialize access to map elements.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01 20:55:38 +01:00
Jiong Wang
092ed0968b bpf: verifier support JMP32
This patch teach verifier about the new BPF_JMP32 instruction class.
Verifier need to treat it similar as the existing BPF_JMP class.
A BPF_JMP32 insn needs to go through all checks that have been done on
BPF_JMP.

Also, verifier is doing runtime optimizations based on the extra info
conditional jump instruction could offer, especially when the comparison is
between constant and register that the value range of the register could be
improved based on the comparison results. These code are updated
accordingly.

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-26 13:33:01 -08:00
Jiong Wang
a72dafafbd bpf: refactor verifier min/max code for condition jump
The current min/max code does both signed and unsigned comparisons against
the input argument "val" which is "u64" and there is explicit type casting
when the comparison is signed.

As we will need slightly more complexer type casting when JMP32 introduced,
it is better to host the signed type casting. This makes the code more
clean with ignorable runtime overhead.

Also, code for J*GE/GT/LT/LE and JEQ/JNE are very similar, this patch
combine them.

The main purpose for this refactor is to make sure the min/max code will
still be readable and with minimum code duplication after JMP32 introduced.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-26 13:33:01 -08:00
Jakub Kicinski
08ca90afba bpf: notify offload JITs about optimizations
Let offload JITs know when instructions are replaced and optimized
out, so they can update their state appropriately.  The optimizations
are best effort, if JIT returns an error from any callback verifier
will stop notifying it as state may now be out of sync, but the
verifier continues making progress.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-23 17:35:32 -08:00
Jakub Kicinski
9e4c24e7ee bpf: verifier: record original instruction index
The communication between the verifier and advanced JITs is based
on instruction indexes.  We have to keep them stable throughout
the optimizations otherwise referring to a particular instruction
gets messy quickly.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-23 17:35:32 -08:00
Jakub Kicinski
a1b14abc00 bpf: verifier: remove unconditional branches by 0
Unconditional branches by 0 instructions are basically noops
but they can result from earlier optimizations, e.g. a conditional
jumps which would never be taken or a conditional jump around
dead code.

Remove those branches.

v0.2:
 - s/opt_remove_dead_branches/opt_remove_nops/ (Jiong).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-23 17:35:32 -08:00
Jakub Kicinski
52875a04f4 bpf: verifier: remove dead code
Instead of overwriting dead code with jmp -1 instructions
remove it completely for root.  Adjust verifier state and
line info appropriately.

v2:
 - adjust func_info (Alexei);
 - make sure first instruction retains line info (Alexei).
v4: (Yonghong)
 - remove unnecessary if (!insn to remove) checks;
 - always keep last line info if first live instruction lacks one.
v5: (Martin Lau)
 - improve and clarify comments.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-23 17:35:31 -08:00
Jakub Kicinski
e2ae4ca266 bpf: verifier: hard wire branches to dead code
Loading programs with dead code becomes more and more
common, as people begin to patch constants at load time.
Turn conditional jumps to unconditional ones, to avoid
potential branch misprediction penalty.

This optimization is enabled for privileged users only.

For branches which just fall through we could just mark
them as not seen and have dead code removal take care of
them, but that seems less clean.

v0.2:
 - don't call capable(CAP_SYS_ADMIN) twice (Jiong).
v3:
 - fix GCC warning;

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-23 17:35:31 -08:00
Peter Oskolkov
d0b2818efb bpf: fix a (false) compiler warning
An older GCC compiler complains:

kernel/bpf/verifier.c: In function 'bpf_check':
kernel/bpf/verifier.c:4***:13: error: 'prev_offset' may be used uninitialized
      in this function [-Werror=maybe-uninitialized]
   } else if (krecord[i].insn_offset <= prev_offset) {
             ^
kernel/bpf/verifier.c:4***:38: note: 'prev_offset' was declared here
  u32 i, nfuncs, urec_size, min_size, prev_offset;

Although the compiler is wrong here, the patch makes sure
that prev_offset is always initialized, just to silence the warning.

v2: fix a spelling error in the commit message.

Signed-off-by: Peter Oskolkov <posk@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-17 10:40:16 +01:00
Daniel Borkmann
d3bd7413e0 bpf: fix sanitation of alu op with pointer / scalar type from different paths
While 979d63d50c ("bpf: prevent out of bounds speculation on pointer
arithmetic") took care of rejecting alu op on pointer when e.g. pointer
came from two different map values with different map properties such as
value size, Jann reported that a case was not covered yet when a given
alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
different branches where we would incorrectly try to sanitize based
on the pointer's limit. Catch this corner case and reject the program
instead.

Fixes: 979d63d50c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-05 21:32:38 -08:00
Daniel Borkmann
979d63d50c bpf: prevent out of bounds speculation on pointer arithmetic
Jann reported that the original commit back in b2157399cc
("bpf: prevent out-of-bounds speculation") was not sufficient
to stop CPU from speculating out of bounds memory access:
While b2157399cc only focussed on masking array map access
for unprivileged users for tail calls and data access such
that the user provided index gets sanitized from BPF program
and syscall side, there is still a more generic form affected
from BPF programs that applies to most maps that hold user
data in relation to dynamic map access when dealing with
unknown scalars or "slow" known scalars as access offset, for
example:

  - Load a map value pointer into R6
  - Load an index into R7
  - Do a slow computation (e.g. with a memory dependency) that
    loads a limit into R8 (e.g. load the limit from a map for
    high latency, then mask it to make the verifier happy)
  - Exit if R7 >= R8 (mispredicted branch)
  - Load R0 = R6[R7]
  - Load R0 = R6[R0]

For unknown scalars there are two options in the BPF verifier
where we could derive knowledge from in order to guarantee
safe access to the memory: i) While </>/<=/>= variants won't
allow to derive any lower or upper bounds from the unknown
scalar where it would be safe to add it to the map value
pointer, it is possible through ==/!= test however. ii) another
option is to transform the unknown scalar into a known scalar,
for example, through ALU ops combination such as R &= <imm>
followed by R |= <imm> or any similar combination where the
original information from the unknown scalar would be destroyed
entirely leaving R with a constant. The initial slow load still
precedes the latter ALU ops on that register, so the CPU
executes speculatively from that point. Once we have the known
scalar, any compare operation would work then. A third option
only involving registers with known scalars could be crafted
as described in [0] where a CPU port (e.g. Slow Int unit)
would be filled with many dependent computations such that
the subsequent condition depending on its outcome has to wait
for evaluation on its execution port and thereby executing
speculatively if the speculated code can be scheduled on a
different execution port, or any other form of mistraining
as described in [1], for example. Given this is not limited
to only unknown scalars, not only map but also stack access
is affected since both is accessible for unprivileged users
and could potentially be used for out of bounds access under
speculation.

In order to prevent any of these cases, the verifier is now
sanitizing pointer arithmetic on the offset such that any
out of bounds speculation would be masked in a way where the
pointer arithmetic result in the destination register will
stay unchanged, meaning offset masked into zero similar as
in array_index_nospec() case. With regards to implementation,
there are three options that were considered: i) new insn
for sanitation, ii) push/pop insn and sanitation as inlined
BPF, iii) reuse of ax register and sanitation as inlined BPF.

Option i) has the downside that we end up using from reserved
bits in the opcode space, but also that we would require
each JIT to emit masking as native arch opcodes meaning
mitigation would have slow adoption till everyone implements
it eventually which is counter-productive. Option ii) and iii)
have both in common that a temporary register is needed in
order to implement the sanitation as inlined BPF since we
are not allowed to modify the source register. While a push /
pop insn in ii) would be useful to have in any case, it
requires once again that every JIT needs to implement it
first. While possible, amount of changes needed would also
be unsuitable for a -stable patch. Therefore, the path which
has fewer changes, less BPF instructions for the mitigation
and does not require anything to be changed in the JITs is
option iii) which this work is pursuing. The ax register is
already mapped to a register in all JITs (modulo arm32 where
it's mapped to stack as various other BPF registers there)
and used in constant blinding for JITs-only so far. It can
be reused for verifier rewrites under certain constraints.
The interpreter's tmp "register" has therefore been remapped
into extending the register set with hidden ax register and
reusing that for a number of instructions that needed the
prior temporary variable internally (e.g. div, mod). This
allows for zero increase in stack space usage in the interpreter,
and enables (restricted) generic use in rewrites otherwise as
long as such a patchlet does not make use of these instructions.
The sanitation mask is dynamic and relative to the offset the
map value or stack pointer currently holds.

There are various cases that need to be taken under consideration
for the masking, e.g. such operation could look as follows:
ptr += val or val += ptr or ptr -= val. Thus, the value to be
sanitized could reside either in source or in destination
register, and the limit is different depending on whether
the ALU op is addition or subtraction and depending on the
current known and bounded offset. The limit is derived as
follows: limit := max_value_size - (smin_value + off). For
subtraction: limit := umax_value + off. This holds because
we do not allow any pointer arithmetic that would
temporarily go out of bounds or would have an unknown
value with mixed signed bounds where it is unclear at
verification time whether the actual runtime value would
be either negative or positive. For example, we have a
derived map pointer value with constant offset and bounded
one, so limit based on smin_value works because the verifier
requires that statically analyzed arithmetic on the pointer
must be in bounds, and thus it checks if resulting
smin_value + off and umax_value + off is still within map
value bounds at time of arithmetic in addition to time of
access. Similarly, for the case of stack access we derive
the limit as follows: MAX_BPF_STACK + off for subtraction
and -off for the case of addition where off := ptr_reg->off +
ptr_reg->var_off.value. Subtraction is a special case for
the masking which can be in form of ptr += -val, ptr -= -val,
or ptr -= val. In the first two cases where we know that
the value is negative, we need to temporarily negate the
value in order to do the sanitation on a positive value
where we later swap the ALU op, and restore original source
register if the value was in source.

The sanitation of pointer arithmetic alone is still not fully
sufficient as is, since a scenario like the following could
happen ...

  PTR += 0x1000 (e.g. K-based imm)
  PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
  PTR += 0x1000
  PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
  [...]

... which under speculation could end up as ...

  PTR += 0x1000
  PTR -= 0 [ truncated by mitigation ]
  PTR += 0x1000
  PTR -= 0 [ truncated by mitigation ]
  [...]

... and therefore still access out of bounds. To prevent such
case, the verifier is also analyzing safety for potential out
of bounds access under speculative execution. Meaning, it is
also simulating pointer access under truncation. We therefore
"branch off" and push the current verification state after the
ALU operation with known 0 to the verification stack for later
analysis. Given the current path analysis succeeded it is
likely that the one under speculation can be pruned. In any
case, it is also subject to existing complexity limits and
therefore anything beyond this point will be rejected. In
terms of pruning, it needs to be ensured that the verification
state from speculative execution simulation must never prune
a non-speculative execution path, therefore, we mark verifier
state accordingly at the time of push_stack(). If verifier
detects out of bounds access under speculative execution from
one of the possible paths that includes a truncation, it will
reject such program.

Given we mask every reg-based pointer arithmetic for
unprivileged programs, we've been looking into how it could
affect real-world programs in terms of size increase. As the
majority of programs are targeted for privileged-only use
case, we've unconditionally enabled masking (with its alu
restrictions on top of it) for privileged programs for the
sake of testing in order to check i) whether they get rejected
in its current form, and ii) by how much the number of
instructions and size will increase. We've tested this by
using Katran, Cilium and test_l4lb from the kernel selftests.
For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
we've used test_l4lb.o as well as test_l4lb_noinline.o. We
found that none of the programs got rejected by the verifier
with this change, and that impact is rather minimal to none.
balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
7,797 bytes JITed before and after the change. Most complex
program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
and 18,538 bytes JITed before and after and none of the other
tail call programs in bpf_lxc.o had any changes either. For
the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
after the change. Other programs from that object file had
similar small increase. Both test_l4lb.o had no change and
remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
JITed and for test_l4lb_noinline.o constant at 5,080 bytes
(634 insns) xlated and 3,313 bytes JITed. This can be explained
in that LLVM typically optimizes stack based pointer arithmetic
by using K-based operations and that use of dynamic map access
is not overly frequent. However, in future we may decide to
optimize the algorithm further under known guarantees from
branch and value speculation. Latter seems also unclear in
terms of prediction heuristics that today's CPUs apply as well
as whether there could be collisions in e.g. the predictor's
Value History/Pattern Table for triggering out of bounds access,
thus masking is performed unconditionally at this point but could
be subject to relaxation later on. We were generally also
brainstorming various other approaches for mitigation, but the
blocker was always lack of available registers at runtime and/or
overhead for runtime tracking of limits belonging to a specific
pointer. Thus, we found this to be minimally intrusive under
given constraints.

With that in place, a simple example with sanitized access on
unprivileged load at post-verification time looks as follows:

  # bpftool prog dump xlated id 282
  [...]
  28: (79) r1 = *(u64 *)(r7 +0)
  29: (79) r2 = *(u64 *)(r7 +8)
  30: (57) r1 &= 15
  31: (79) r3 = *(u64 *)(r0 +4608)
  32: (57) r3 &= 1
  33: (47) r3 |= 1
  34: (2d) if r2 > r3 goto pc+19
  35: (b4) (u32) r11 = (u32) 20479  |
  36: (1f) r11 -= r2                | Dynamic sanitation for pointer
  37: (4f) r11 |= r2                | arithmetic with registers
  38: (87) r11 = -r11               | containing bounded or known
  39: (c7) r11 s>>= 63              | scalars in order to prevent
  40: (5f) r11 &= r2                | out of bounds speculation.
  41: (0f) r4 += r11                |
  42: (71) r4 = *(u8 *)(r4 +0)
  43: (6f) r4 <<= r1
  [...]

For the case where the scalar sits in the destination register
as opposed to the source register, the following code is emitted
for the above example:

  [...]
  16: (b4) (u32) r11 = (u32) 20479
  17: (1f) r11 -= r2
  18: (4f) r11 |= r2
  19: (87) r11 = -r11
  20: (c7) r11 s>>= 63
  21: (5f) r2 &= r11
  22: (0f) r2 += r0
  23: (61) r0 = *(u32 *)(r2 +0)
  [...]

JIT blinding example with non-conflicting use of r10:

  [...]
   d5:	je     0x0000000000000106    _
   d7:	mov    0x0(%rax),%edi       |
   da:	mov    $0xf153246,%r10d     | Index load from map value and
   e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
   e7:	and    %r10,%rdi            |_
   ea:	mov    $0x2f,%r10d          |
   f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
   f3:	or     %rdi,%r10            | but do not interfere with each
   f6:	neg    %r10                 | other. (Neither do these instructions
   f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
   fd:	and    %r10,%rdi            | in interpreter.)
  100:	add    %rax,%rdi            |_
  103:	mov    0x0(%rdi),%eax
 [...]

Tested that it fixes Jann's reproducer, and also checked that test_verifier
and test_progs suite with interpreter, JIT and JIT with hardening enabled
on x86-64 and arm64 runs successfully.

  [0] Speculose: Analyzing the Security Implications of Speculative
      Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
      https://arxiv.org/pdf/1801.04084.pdf

  [1] A Systematic Evaluation of Transient Execution Attacks and
      Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
      Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
      Dmitry Evtyushkin, Daniel Gruss,
      https://arxiv.org/pdf/1811.05441.pdf

Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann
b7137c4eab bpf: fix check_map_access smin_value test when pointer contains offset
In check_map_access() we probe actual bounds through __check_map_access()
with offset of reg->smin_value + off for lower bound and offset of
reg->umax_value + off for the upper bound. However, even though the
reg->smin_value could have a negative value, the final result of the
sum with off could be positive when pointer arithmetic with known and
unknown scalars is combined. In this case we reject the program with
an error such as "R<x> min value is negative, either use unsigned index
or do a if (index >=0) check." even though the access itself would be
fine. Therefore extend the check to probe whether the actual resulting
reg->smin_value + off is less than zero.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann
9d7eceede7 bpf: restrict unknown scalars of mixed signed bounds for unprivileged
For unknown scalars of mixed signed bounds, meaning their smin_value is
negative and their smax_value is positive, we need to reject arithmetic
with pointer to map value. For unprivileged the goal is to mask every
map pointer arithmetic and this cannot reliably be done when it is
unknown at verification time whether the scalar value is negative or
positive. Given this is a corner case, the likelihood of breaking should
be very small.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann
e4298d2583 bpf: restrict stack pointer arithmetic for unprivileged
Restrict stack pointer arithmetic for unprivileged users in that
arithmetic itself must not go out of bounds as opposed to the actual
access later on. Therefore after each adjust_ptr_min_max_vals() with
a stack pointer as a destination we simulate a check_stack_access()
of 1 byte on the destination and once that fails the program is
rejected for unprivileged program loads. This is analog to map
value pointer arithmetic and needed for masking later on.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann
0d6303db79 bpf: restrict map value pointer arithmetic for unprivileged
Restrict map value pointer arithmetic for unprivileged users in that
arithmetic itself must not go out of bounds as opposed to the actual
access later on. Therefore after each adjust_ptr_min_max_vals() with a
map value pointer as a destination it will simulate a check_map_access()
of 1 byte on the destination and once that fails the program is rejected
for unprivileged program loads. We use this later on for masking any
pointer arithmetic with the remainder of the map value space. The
likelihood of breaking any existing real-world unprivileged eBPF
program is very small for this corner case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann
c08435ec7f bpf: move {prev_,}insn_idx into verifier env
Move prev_insn_idx and insn_idx from the do_check() function into
the verifier environment, so they can be read inside the various
helper functions for handling the instructions. It's easier to put
this into the environment rather than changing all call-sites only
to pass it along. insn_idx is useful in particular since this later
on allows to hold state in env->insn_aux_data[env->insn_idx].

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
David S. Miller
339bbff2d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-12-21

The following pull-request contains BPF updates for your *net-next* tree.

There is a merge conflict in test_verifier.c. Result looks as follows:

        [...]
        },
        {
                "calls: cross frame pruning",
                .insns = {
                [...]
                .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                .errstr_unpriv = "function calls to other bpf functions are allowed for root only",
                .result_unpriv = REJECT,
                .errstr = "!read_ok",
                .result = REJECT,
	},
        {
                "jset: functional",
                .insns = {
        [...]
        {
                "jset: unknown const compare not taken",
                .insns = {
                        BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
                                     BPF_FUNC_get_prandom_u32),
                        BPF_JMP_IMM(BPF_JSET, BPF_REG_0, 1, 1),
                        BPF_LDX_MEM(BPF_B, BPF_REG_8, BPF_REG_9, 0),
                        BPF_EXIT_INSN(),
                },
                .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                .errstr_unpriv = "!read_ok",
                .result_unpriv = REJECT,
                .errstr = "!read_ok",
                .result = REJECT,
        },
        [...]
        {
                "jset: range",
                .insns = {
                [...]
                },
                .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                .result_unpriv = ACCEPT,
                .result = ACCEPT,
        },

The main changes are:

1) Various BTF related improvements in order to get line info
   working. Meaning, verifier will now annotate the corresponding
   BPF C code to the error log, from Martin and Yonghong.

2) Implement support for raw BPF tracepoints in modules, from Matt.

3) Add several improvements to verifier state logic, namely speeding
   up stacksafe check, optimizations for stack state equivalence
   test and safety checks for liveness analysis, from Alexei.

4) Teach verifier to make use of BPF_JSET instruction, add several
   test cases to kselftests and remove nfp specific JSET optimization
   now that verifier has awareness, from Jakub.

5) Improve BPF verifier's slot_type marking logic in order to
   allow more stack slot sharing, from Jiong.

6) Add sk_msg->size member for context access and add set of fixes
   and improvements to make sock_map with kTLS usable with openssl
   based applications, from John.

7) Several cleanups and documentation updates in bpftool as well as
   auto-mount of tracefs for "bpftool prog tracelog" command,
   from Quentin.

8) Include sub-program tags from now on in bpf_prog_info in order to
   have a reliable way for user space to get all tags of the program
   e.g. needed for kallsyms correlation, from Song.

9) Add BTF annotations for cgroup_local_storage BPF maps and
   implement bpf fs pretty print support, from Roman.

10) Fix bpftool in order to allow for cross-compilation, from Ivan.

11) Update of bpftool license to GPLv2-only + BSD-2-Clause in order
    to be compatible with libbfd and allow for Debian packaging,
    from Jakub.

12) Remove an obsolete prog->aux sanitation in dump and get rid of
    version check for prog load, from Daniel.

13) Fix a memory leak in libbpf's line info handling, from Prashant.

14) Fix cpumap's frame alignment for build_skb() so that skb_shared_info
    does not get unaligned, from Jesper.

15) Fix test_progs kselftest to work with older compilers which are less
    smart in optimizing (and thus throwing build error), from Stanislav.

16) Cleanup and simplify AF_XDP socket teardown, from Björn.

17) Fix sk lookup in BPF kselftest's test_sock_addr with regards
    to netns_id argument, from Andrey.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 17:31:36 -08:00
David S. Miller
2be09de7d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.

Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 11:53:36 -08:00
Jakub Kicinski
9b38c4056b bpf: verifier: reorder stack size check with dead code sanitization
Reorder the calls to check_max_stack_depth() and sanitize_dead_code()
to separate functions which can rewrite instructions from pure checks.

No functional changes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-20 17:28:28 +01:00
Jakub Kicinski
960ea05656 bpf: verifier: teach the verifier to reason about the BPF_JSET instruction
Some JITs (nfp) try to optimize code on their own.  It could make
sense in case of BPF_JSET instruction which is currently not interpreted
by the verifier, meaning for instance that dead could would not be
detected if it was under BPF_JSET branch.

Teach the verifier basics of BPF_JSET, JIT optimizations will be
removed shortly.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-20 17:28:28 +01:00
Martin KaFai Lau
fdbaa0beb7 bpf: Ensure line_info.insn_off cannot point to insn with zero code
This patch rejects a line_info if the bpf insn code referred by
line_info.insn_off is 0. F.e. a broken userspace tool might generate
a line_info.insn_off that points to the second 8 bytes of a BPF_LD_IMM64.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-19 15:42:55 -08:00
Jiong Wang
0bae2d4d62 bpf: correct slot_type marking logic to allow more stack slot sharing
Verifier is supposed to support sharing stack slot allocated to ptr with
SCALAR_VALUE for privileged program. However this doesn't happen for some
cases.

The reason is verifier is not clearing slot_type STACK_SPILL for all bytes,
it only clears part of them, while verifier is using:

  slot_type[0] == STACK_SPILL

as a convention to check one slot is ptr type.

So, the consequence of partial clearing slot_type is verifier could treat a
partially overridden ptr slot, which should now be a SCALAR_VALUE slot,
still as ptr slot, and rejects some valid programs.

Before this patch, test_xdp_noinline.o under bpf selftests, bpf_lxc.o and
bpf_netdev.o under Cilium bpf repo, when built with -mattr=+alu32 are
rejected due to this issue. After this patch, they all accepted.

There is no processed insn number change before and after this patch on
Cilium bpf programs.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-18 14:45:01 -08:00
Alexei Starovoitov
9242b5f561 bpf: add self-check logic to liveness analysis
Introduce REG_LIVE_DONE to check the liveness propagation
and prepare the states for merging.
See algorithm description in clean_live_states().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-15 01:28:32 +01:00
Alexei Starovoitov
19e2dbb7dd bpf: improve stacksafe state comparison
"if (old->allocated_stack > cur->allocated_stack)" check is too conservative.
In some cases explored stack could have allocated more space,
but that stack space was not live.
The test case improves from 19 to 15 processed insns
and improvement on real programs is significant as well:

                       before    after
bpf_lb-DLB_L3.o        1940      1831
bpf_lb-DLB_L4.o        3089      3029
bpf_lb-DUNKNOWN.o      1065      1064
bpf_lxc-DDROP_ALL.o    28052     26309
bpf_lxc-DUNKNOWN.o     35487     33517
bpf_netdev.o           10864     9713
bpf_overlay.o          6643      6184
bpf_lcx_jit.o          38437     37335

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree@solarflare.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-15 01:28:32 +01:00
Alexei Starovoitov
b233920c97 bpf: speed up stacksafe check
Don't check the same stack liveness condition 8 times.
once is enough.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree@solarflare.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-15 01:28:32 +01:00
Martin KaFai Lau
d9762e84ed bpf: verbose log bpf_line_info in verifier
This patch adds bpf_line_info during the verifier's verbose.
It can give error context for debug purpose.

~~~~~~~~~~
Here is the verbose log for backedge:
	while (a) {
		a += bpf_get_smp_processor_id();
		bpf_trace_printk(fmt, sizeof(fmt), a);
	}

~> bpftool prog load ./test_loop.o /sys/fs/bpf/test_loop type tracepoint
13: while (a) {
3: a += bpf_get_smp_processor_id();
back-edge from insn 13 to 3

~~~~~~~~~~
Here is the verbose log for invalid pkt access:
Modification to test_xdp_noinline.c:

	data = (void *)(long)xdp->data;
	data_end = (void *)(long)xdp->data_end;
/*
	if (data + 4 > data_end)
		return XDP_DROP;
*/
	*(u32 *)data = dst->dst;

~> bpftool prog load ./test_xdp_noinline.o /sys/fs/bpf/test_xdp_noinline type xdp
; data = (void *)(long)xdp->data;
224: (79) r2 = *(u64 *)(r10 -112)
225: (61) r2 = *(u32 *)(r2 +0)
; *(u32 *)data = dst->dst;
226: (63) *(u32 *)(r2 +0) = r1
invalid access to packet, off=0 size=4, R2(id=0,off=0,r=0)
R2 offset is outside of the packet

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-14 14:17:34 -08:00
Martin KaFai Lau
23127b33ec bpf: Create a new btf_name_by_offset() for non type name use case
The current btf_name_by_offset() is returning "(anon)" type name for
the offset == 0 case and "(invalid-name-offset)" for the out-of-bound
offset case.

It fits well for the internal BTF verbose log purpose which
is focusing on type.  For example,
offset == 0 => "(anon)" => anonymous type/name.
Returning non-NULL for the bad offset case is needed
during the BTF verification process because the BTF verifier may
complain about another field first before discovering the name_off
is invalid.

However, it may not be ideal for the newer use case which does not
necessary mean type name.  For example, when logging line_info
in the BPF verifier in the next patch, it is better to log an
empty src line instead of logging "(anon)".

The existing bpf_name_by_offset() is renamed to __bpf_name_by_offset()
and static to btf.c.

A new bpf_name_by_offset() is added for generic context usage.  It
returns "\0" for name_off == 0 (note that btf->strings[0] is "\0")
and NULL for invalid offset.  It allows the caller to decide
what is the best output in its context.

The new btf_name_by_offset() is overlapped with btf_name_offset_valid().
Hence, btf_name_offset_valid() is removed from btf.h to keep the btf.h API
minimal.  The existing btf_name_offset_valid() usage in btf.c could also be
replaced later.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-14 14:17:34 -08:00
Jakub Kicinski
7640ead939 bpf: verifier: make sure callees don't prune with caller differences
Currently for liveness and state pruning the register parentage
chains don't include states of the callee.  This makes some sense
as the callee can't access those registers.  However, this means
that READs done after the callee returns will not propagate into
the states of the callee.  Callee will then perform pruning
disregarding differences in caller state.

Example:

   0: (85) call bpf_user_rnd_u32
   1: (b7) r8 = 0
   2: (55) if r0 != 0x0 goto pc+1
   3: (b7) r8 = 1
   4: (bf) r1 = r8
   5: (85) call pc+4
   6: (15) if r8 == 0x1 goto pc+1
   7: (05) *(u64 *)(r9 - 8) = r3
   8: (b7) r0 = 0
   9: (95) exit

   10: (15) if r1 == 0x0 goto pc+0
   11: (95) exit

Here we acquire unknown state with call to get_random() [1].  Then
we store this random state in r8 (either 0 or 1) [1 - 3], and make
a call on line 5.  Callee does nothing but a trivial conditional
jump (to create a pruning point).  Upon return caller checks the
state of r8 and either performs an unsafe read or not.

Verifier will first explore the path with r8 == 1, creating a pruning
point at [11].  The parentage chain for r8 will include only callers
states so once verifier reaches [6] it will mark liveness only on states
in the caller, and not [11].  Now when verifier walks the paths with
r8 == 0 it will reach [11] and since REG_LIVE_READ on r8 was not
propagated there it will prune the walk entirely (stop walking
the entire program, not just the callee).  Since [6] was never walked
with r8 == 0, [7] will be considered dead and replaced with "goto -1"
causing hang at runtime.

This patch weaves the callee's explored states onto the callers
parentage chain.  Rough parentage for r8 would have looked like this
before:

[0] [1] [2] [3] [4] [5]   [10]      [11]      [6]      [7]
     |           |      ,---|----.    |        |        |
  sl0:         sl0:    / sl0:     \ sl0:      sl0:     sl0:
  fr0: r8 <-- fr0: r8<+--fr0: r8   `fr0: r8  ,fr0: r8<-fr0: r8
                       \ fr1: r8 <- fr1: r8 /
                        \__________________/

after:

[0] [1] [2] [3] [4] [5]   [10]      [11]      [6]      [7]
     |           |          |         |        |        |
   sl0:         sl0:      sl0:       sl0:      sl0:     sl0:
   fr0: r8 <-- fr0: r8 <- fr0: r8 <- fr0: r8 <-fr0: r8<-fr0: r8
                          fr1: r8 <- fr1: r8

Now the mark from instruction 6 will travel through callees states.

Note that we don't have to connect r0 because its overwritten by
callees state on return and r1 - r5 because those are not alive
any more once a call is made.

v2:
 - don't connect the callees registers twice (Alexei: suggestion & code)
 - add more details to the comment (Ed & Alexei)
v1: don't unnecessarily link caller saved regs (Jiong)

Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-13 10:35:40 -08:00
David S. Miller
addb067983 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-12-11

The following pull-request contains BPF updates for your *net-next* tree.

It has three minor merge conflicts, resolutions:

1) tools/testing/selftests/bpf/test_verifier.c

 Take first chunk with alignment_prevented_execution.

2) net/core/filter.c

  [...]
  case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
  case bpf_ctx_range(struct __sk_buff, wire_len):
        return false;
  [...]

3) include/uapi/linux/bpf.h

  Take the second chunk for the two cases each.

The main changes are:

1) Add support for BPF line info via BTF and extend libbpf as well
   as bpftool's program dump to annotate output with BPF C code to
   facilitate debugging and introspection, from Martin.

2) Add support for BPF_ALU | BPF_ARSH | BPF_{K,X} in interpreter
   and all JIT backends, from Jiong.

3) Improve BPF test coverage on archs with no efficient unaligned
   access by adding an "any alignment" flag to the BPF program load
   to forcefully disable verifier alignment checks, from David.

4) Add a new bpf_prog_test_run_xattr() API to libbpf which allows for
   proper use of BPF_PROG_TEST_RUN with data_out, from Lorenz.

5) Extend tc BPF programs to use a new __sk_buff field called wire_len
   for more accurate accounting of packets going to wire, from Petar.

6) Improve bpftool to allow dumping the trace pipe from it and add
   several improvements in bash completion and map/prog dump,
   from Quentin.

7) Optimize arm64 BPF JIT to always emit movn/movk/movk sequence for
   kernel addresses and add a dedicated BPF JIT backend allocator,
   from Ard.

8) Add a BPF helper function for IR remotes to report mouse movements,
   from Sean.

9) Various cleanups in BPF prog dump e.g. to make UAPI bpf_prog_info
   member naming consistent with existing conventions, from Yonghong
   and Song.

10) Misc cleanups and improvements in allowing to pass interface name
    via cmdline for xdp1 BPF example, from Matteo.

11) Fix a potential segfault in BPF sample loader's kprobes handling,
    from Daniel T.

12) Fix SPDX license in libbpf's README.rst, from Andrey.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-10 18:00:43 -08:00
Jiong Wang
e434b8cdf7 bpf: relax verifier restriction on BPF_MOV | BPF_ALU
Currently, the destination register is marked as unknown for 32-bit
sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is
SCALAR_VALUE.

This is too conservative that some valid cases will be rejected.
Especially, this may turn a constant scalar value into unknown value that
could break some assumptions of verifier.

For example, test_l4lb_noinline.c has the following C code:

    struct real_definition *dst

1:  if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
2:    return TC_ACT_SHOT;
3:
4:  if (dst->flags & F_IPV6) {

get_packet_dst is responsible for initializing "dst" into valid pointer and
return true (1), otherwise return false (0). The compiled instruction
sequence using alu32 will be:

  412: (54) (u32) r7 &= (u32) 1
  413: (bc) (u32) r0 = (u32) r7
  414: (95) exit

insn 413, a BPF_MOV | BPF_ALU, however will turn r0 into unknown value even
r7 contains SCALAR_VALUE 1.

This causes trouble when verifier is walking the code path that hasn't
initialized "dst" inside get_packet_dst, for which case 0 is returned and
we would then expect verifier concluding line 1 in the above C code pass
the "if" check, therefore would skip fall through path starting at line 4.
Now, because r0 returned from callee has became unknown value, so verifier
won't skip analyzing path starting at line 4 and "dst->flags" requires
dereferencing the pointer "dst" which actually hasn't be initialized for
this path.

This patch relaxed the code marking sub-register move destination. For a
SCALAR_VALUE, it is safe to just copy the value from source then truncate
it into 32-bit.

A unit test also included to demonstrate this issue. This test will fail
before this patch.

This relaxation could let verifier skipping more paths for conditional
comparison against immediate. It also let verifier recording a more
accurate/strict value for one register at one state, if this state end up
with going through exit without rejection and it is used for state
comparison later, then it is possible an inaccurate/permissive value is
better. So the real impact on verifier processed insn number is complex.
But in all, without this fix, valid program could be rejected.

>From real benchmarking on kernel selftests and Cilium bpf tests, there is
no impact on processed instruction number when tests ares compiled with
default compilation options. There is slightly improvements when they are
compiled with -mattr=+alu32 after this patch.

Also, test_xdp_noinline/-mattr=+alu32 now passed verification. It is
rejected before this fix.

Insn processed before/after this patch:

                        default     -mattr=+alu32

Kernel selftest

===
test_xdp.o              371/371      369/369
test_l4lb.o             6345/6345    5623/5623
test_xdp_noinline.o     2971/2971    rejected/2727
test_tcp_estates.o      429/429      430/430

Cilium bpf
===
bpf_lb-DLB_L3.o:        2085/2085     1685/1687
bpf_lb-DLB_L4.o:        2287/2287     1986/1982
bpf_lb-DUNKNOWN.o:      690/690       622/622
bpf_lxc.o:              95033/95033   N/A
bpf_netdev.o:           7245/7245     N/A
bpf_overlay.o:          2898/2898     3085/2947

NOTE:
  - bpf_lxc.o and bpf_netdev.o compiled by -mattr=+alu32 are rejected by
    verifier due to another issue inside verifier on supporting alu32
    binary.
  - Each cilium bpf program could generate several processed insn number,
    above number is sum of them.

v1->v2:
 - Restrict the change on SCALAR_VALUE.
 - Update benchmark numbers on Cilium bpf tests.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-10 09:23:33 -08:00
David S. Miller
4cc1feeb6f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several conflicts, seemingly all over the place.

I used Stephen Rothwell's sample resolutions for many of these, if not
just to double check my own work, so definitely the credit largely
goes to him.

The NFP conflict consisted of a bug fix (moving operations
past the rhashtable operation) while chaning the initial
argument in the function call in the moved code.

The net/dsa/master.c conflict had to do with a bug fix intermixing of
making dsa_master_set_mtu() static with the fixing of the tagging
attribute location.

cls_flower had a conflict because the dup reject fix from Or
overlapped with the addition of port range classifiction.

__set_phy_supported()'s conflict was relatively easy to resolve
because Andrew fixed it in both trees, so it was just a matter
of taking the net-next copy.  Or at least I think it was :-)

Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup()
intermixed with changes on how the sdif and caller_net are calculated
in these code paths in net-next.

The remaining BPF conflicts were largely about the addition of the
__bpf_md_ptr stuff in 'net' overlapping with adjustments and additions
to the relevant data structure where the MD pointer macros are used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09 21:43:31 -08:00
Martin KaFai Lau
c454a46b5e bpf: Add bpf_line_info support
This patch adds bpf_line_info support.

It accepts an array of bpf_line_info objects during BPF_PROG_LOAD.
The "line_info", "line_info_cnt" and "line_info_rec_size" are added
to the "union bpf_attr".  The "line_info_rec_size" makes
bpf_line_info extensible in the future.

The new "check_btf_line()" ensures the userspace line_info is valid
for the kernel to use.

When the verifier is translating/patching the bpf_prog (through
"bpf_patch_insn_single()"), the line_infos' insn_off is also
adjusted by the newly added "bpf_adj_linfo()".

If the bpf_prog is jited, this patch also provides the jited addrs (in
aux->jited_linfo) for the corresponding line_info.insn_off.
"bpf_prog_fill_jited_linfo()" is added to fill the aux->jited_linfo.
It is currently called by the x86 jit.  Other jits can also use
"bpf_prog_fill_jited_linfo()" and it will be done in the followup patches.
In the future, if it deemed necessary, a particular jit could also provide
its own "bpf_prog_fill_jited_linfo()" implementation.

A few "*line_info*" fields are added to the bpf_prog_info such
that the user can get the xlated line_info back (i.e. the line_info
with its insn_off reflecting the translated prog).  The jited_line_info
is available if the prog is jited.  It is an array of __u64.
If the prog is not jited, jited_line_info_cnt is 0.

The verifier's verbose log with line_info will be done in
a follow up patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-09 13:54:38 -08:00
Jiong Wang
c49f7dbd4f bpf: verifier remove the rejection on BPF_ALU | BPF_ARSH
This patch remove the rejection on BPF_ALU | BPF_ARSH as we have supported
them on interpreter and all JIT back-ends

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-07 13:30:48 -08:00
Martin KaFai Lau
d30d42e08c bpf: Change insn_offset to insn_off in bpf_func_info
The later patch will introduce "struct bpf_line_info" which
has member "line_off" and "file_off" referring back to the
string section in btf.  The line_"off" and file_"off"
are more consistent to the naming convention in btf.h that
means "offset" (e.g. name_off in "struct btf_type").

The to-be-added "struct bpf_line_info" also has another
member, "insn_off" which is the same as the "insn_offset"
in "struct bpf_func_info".  Hence, this patch renames "insn_offset"
to "insn_off" for "struct bpf_func_info".

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-05 18:48:40 -08:00
Alexei Starovoitov
ceefbc96fa bpf: add per-insn complexity limit
malicious bpf program may try to force the verifier to remember
a lot of distinct verifier states.
Put a limit to number of per-insn 'struct bpf_verifier_state'.
Note that hitting the limit doesn't reject the program.
It potentially makes the verifier do more steps to analyze the program.
It means that malicious programs will hit BPF_COMPLEXITY_LIMIT_INSNS sooner
instead of spending cpu time walking long link list.

The limit of BPF_COMPLEXITY_LIMIT_STATES==64 affects cilium progs
with slight increase in number of "steps" it takes to successfully verify
the programs:
                       before    after
bpf_lb-DLB_L3.o         1940      1940
bpf_lb-DLB_L4.o         3089      3089
bpf_lb-DUNKNOWN.o       1065      1065
bpf_lxc-DDROP_ALL.o     28052  |  28162
bpf_lxc-DUNKNOWN.o      35487  |  35541
bpf_netdev.o            10864     10864
bpf_overlay.o           6643      6643
bpf_lcx_jit.o           38437     38437

But it also makes malicious program to be rejected in 0.4 seconds vs 6.5
Hence apply this limit to unprivileged programs only.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-04 17:22:02 +01:00
Alexei Starovoitov
4f7b3e8258 bpf: improve verifier branch analysis
pathological bpf programs may try to force verifier to explode in
the number of branch states:
  20: (d5) if r1 s<= 0x24000028 goto pc+0
  21: (b5) if r0 <= 0xe1fa20 goto pc+2
  22: (d5) if r1 s<= 0x7e goto pc+0
  23: (b5) if r0 <= 0xe880e000 goto pc+0
  24: (c5) if r0 s< 0x2100ecf4 goto pc+0
  25: (d5) if r1 s<= 0xe880e000 goto pc+1
  26: (c5) if r0 s< 0xf4041810 goto pc+0
  27: (d5) if r1 s<= 0x1e007e goto pc+0
  28: (b5) if r0 <= 0xe86be000 goto pc+0
  29: (07) r0 += 16614
  30: (c5) if r0 s< 0x6d0020da goto pc+0
  31: (35) if r0 >= 0x2100ecf4 goto pc+0

Teach verifier to recognize always taken and always not taken branches.
This analysis is already done for == and != comparison.
Expand it to all other branches.

It also helps real bpf programs to be verified faster:
                       before  after
bpf_lb-DLB_L3.o         2003    1940
bpf_lb-DLB_L4.o         3173    3089
bpf_lb-DUNKNOWN.o       1080    1065
bpf_lxc-DDROP_ALL.o     29584   28052
bpf_lxc-DUNKNOWN.o      36916   35487
bpf_netdev.o            11188   10864
bpf_overlay.o           6679    6643
bpf_lcx_jit.o           39555   38437

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-04 17:22:02 +01:00
Alexei Starovoitov
c3494801cd bpf: check pending signals while verifying programs
Malicious user space may try to force the verifier to use as much cpu
time and memory as possible. Hence check for pending signals
while verifying the program.
Note that suspend of sys_bpf(PROG_LOAD) syscall will lead to EAGAIN,
since the kernel has to release the resources used for program verification.

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-04 17:22:02 +01:00
David Miller
e9ee9efc0d bpf: Add BPF_F_ANY_ALIGNMENT.
Often we want to write tests cases that check things like bad context
offset accesses.  And one way to do this is to use an odd offset on,
for example, a 32-bit load.

This unfortunately triggers the alignment checks first on platforms
that do not set CONFIG_EFFICIENT_UNALIGNED_ACCESS.  So the test
case see the alignment failure rather than what it was testing for.

It is often not completely possible to respect the original intention
of the test, or even test the same exact thing, while solving the
alignment issue.

Another option could have been to check the alignment after the
context and other validations are performed by the verifier, but
that is a non-trivial change to the verifier.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-30 21:38:48 -08:00
David S. Miller
93029d7d40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
bpf-next 2018-11-30

The following pull-request contains BPF updates for your *net-next* tree.

(Getting out bit earlier this time to pull in a dependency from bpf.)

The main changes are:

1) Add libbpf ABI versioning and document API naming conventions
   as well as ABI versioning process, from Andrey.

2) Add a new sk_msg_pop_data() helper for sk_msg based BPF
   programs that is used in conjunction with sk_msg_push_data()
   for adding / removing meta data to the msg data, from John.

3) Optimize convert_bpf_ld_abs() for 0 offset and fix various
   lib and testsuite build failures on 32 bit, from David.

4) Make BPF prog dump for !JIT identical to how we dump subprogs
   when JIT is in use, from Yonghong.

5) Rename btf_get_from_id() to make it more conform with libbpf
   API naming conventions, from Martin.

6) Add a missing BPF kselftest config item, from Naresh.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-29 18:15:07 -08:00
David S. Miller
e561bb29b6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Trivial conflict in net/core/filter.c, a locally computed
'sdif' is now an argument to the function.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-28 22:10:54 -08:00
Yonghong Song
ba64e7d852 bpf: btf: support proper non-jit func info
Commit 838e96904f ("bpf: Introduce bpf_func_info")
added bpf func info support. The userspace is able
to get better ksym's for bpf programs with jit, and
is able to print out func prototypes.

For a program containing func-to-func calls, the existing
implementation returns user specified number of function
calls and BTF types if jit is enabled. If the jit is not
enabled, it only returns the type for the main function.

This is undesirable. Interpreter may still be used
and we should keep feature identical regardless of
whether jit is enabled or not.
This patch fixed this discrepancy.

Fixes: 838e96904f ("bpf: Introduce bpf_func_info")
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-26 17:57:10 -08:00
Yonghong Song
838e96904f bpf: Introduce bpf_func_info
This patch added interface to load a program with the following
additional information:
   . prog_btf_fd
   . func_info, func_info_rec_size and func_info_cnt
where func_info will provide function range and type_id
corresponding to each function.

The func_info_rec_size is introduced in the UAPI to specify
struct bpf_func_info size passed from user space. This
intends to make bpf_func_info structure growable in the future.
If the kernel gets a different bpf_func_info size from userspace,
it will try to handle user request with part of bpf_func_info
it can understand. In this patch, kernel can understand
  struct bpf_func_info {
       __u32   insn_offset;
       __u32   type_id;
  };
If user passed a bpf func_info record size of 16 bytes, the
kernel can still handle part of records with the above definition.

If verifier agrees with function range provided by the user,
the bpf_prog ksym for each function will use the func name
provided in the type_id, which is supposed to provide better
encoding as it is not limited by 16 bytes program name
limitation and this is better for bpf program which contains
multiple subprograms.

The bpf_prog_info interface is also extended to
return btf_id, func_info, func_info_rec_size and func_info_cnt
to userspace, so userspace can print out the function prototype
for each xlated function. The insn_offset in the returned
func_info corresponds to the insn offset for xlated functions.
With other jit related fields in bpf_prog_info, userspace can also
print out function prototypes for each jited function.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-20 10:54:39 -08:00
Edward Cree
afd5942408 bpf: fix off-by-one error in adjust_subprog_starts
When patching in a new sequence for the first insn of a subprog, the start
 of that subprog does not change (it's the first insn of the sequence), so
 adjust_subprog_starts should check start <= off (rather than < off).
Also added a test to test_verifier.c (it's essentially the syz reproducer).

Fixes: cc8b0b92a1 ("bpf: introduce function calls (function boundaries)")
Reported-by: syzbot+4fc427c7af994b0948be@syzkaller.appspotmail.com
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-16 21:10:00 -08:00
Andrey Ignatov
46f53a65d2 bpf: Allow narrow loads with offset > 0
Currently BPF verifier allows narrow loads for a context field only with
offset zero. E.g. if there is a __u32 field then only the following
loads are permitted:
  * off=0, size=1 (narrow);
  * off=0, size=2 (narrow);
  * off=0, size=4 (full).

On the other hand LLVM can generate a load with offset different than
zero that make sense from program logic point of view, but verifier
doesn't accept it.

E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:

  #define DST_IP4			0xC0A801FEU /* 192.168.1.254 */
  ...
  	if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&

where ctx is struct bpf_sock_addr.

Some versions of LLVM can produce the following byte code for it:

       8:       71 12 07 00 00 00 00 00         r2 = *(u8 *)(r1 + 7)
       9:       67 02 00 00 18 00 00 00         r2 <<= 24
      10:       18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00         r3 = 4261412864 ll
      12:       5d 32 07 00 00 00 00 00         if r2 != r3 goto +7 <LBB0_6>

where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1
and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently
rejected by verifier.

Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok()
what means any is_valid_access implementation, that uses the function,
works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or
sock_addr_is_valid_access() for bpf_sock_addr.

The patch makes such loads supported. Offset can be in [0; size_default)
but has to be multiple of load size. E.g. for __u32 field the following
loads are supported now:
  * off=0, size=1 (narrow);
  * off=1, size=1 (narrow);
  * off=2, size=1 (narrow);
  * off=3, size=1 (narrow);
  * off=0, size=2 (narrow);
  * off=2, size=2 (narrow);
  * off=0, size=4 (full).

Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10 22:29:59 -08:00
Quentin Monnet
a40a26322a bpf: pass prog instead of env to bpf_prog_offload_verifier_prep()
Function bpf_prog_offload_verifier_prep(), called from the kernel BPF
verifier to run a driver-specific callback for preparing for the
verification step for offloaded programs, takes a pointer to a struct
bpf_verifier_env object. However, no driver callback needs the whole
structure at this time: the two drivers supporting this, nfp and
netdevsim, only need a pointer to the struct bpf_prog instance held by
env.

Update the callback accordingly, on kernel side and in these two
drivers.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10 15:39:54 -08:00
Jiong Wang
e647815a4d bpf: let verifier to calculate and record max_pkt_offset
In check_packet_access, update max_pkt_offset after the offset has passed
__check_packet_access.

It should be safe to use u32 for max_pkt_offset as explained in code
comment.

Also, when there is tail call, the max_pkt_offset of the called program is
unknown, so conservatively set max_pkt_offset to MAX_PACKET_OFF for such
case.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-11-09 09:16:31 +01:00
Daniel Borkmann
4d31f30148 bpf: don't set id on after map lookup with ptr_to_map_val return
In the verifier there is no such semantics where registers with
PTR_TO_MAP_VALUE type have an id assigned to them. This is only
used in PTR_TO_MAP_VALUE_OR_NULL and later on nullified once the
test against NULL has been pattern matched and type transformed
into PTR_TO_MAP_VALUE.

Fixes: 3e6a4b3e02 ("bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Roman Gushchin <guro@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-31 16:53:17 -07:00
Daniel Borkmann
0962590e55 bpf: fix partial copy of map_ptr when dst is scalar
ALU operations on pointers such as scalar_reg += map_value_ptr are
handled in adjust_ptr_min_max_vals(). Problem is however that map_ptr
and range in the register state share a union, so transferring state
through dst_reg->range = ptr_reg->range is just buggy as any new
map_ptr in the dst_reg is then truncated (or null) for subsequent
checks. Fix this by adding a raw member and use it for copying state
over to dst_reg.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-31 16:53:17 -07:00
Daniel Borkmann
b09928b976 bpf: make direct packet write unclone more robust
Given this seems to be quite fragile and can easily slip through the
cracks, lets make direct packet write more robust by requiring that
future program types which allow for such write must provide a prologue
callback. In case of XDP and sk_msg it's noop, thus add a generic noop
handler there. The latter starts out with NULL data/data_end unconditionally
when sg pages are shared.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-25 17:02:06 -07:00
Daniel Borkmann
d5563d367c bpf: fix cg_skb types to hint access type in may_access_direct_pkt_data
Commit b39b5f411d ("bpf: add cg_skb_is_valid_access for
BPF_PROG_TYPE_CGROUP_SKB") added direct packet access for skbs in
cg_skb program types, however allowed access type was not added to
the may_access_direct_pkt_data() helper. Therefore the latter always
returns false. This is not directly an issue, it just means writes
are unconditionally disabled (which is correct) but also reads.
Latter is relevant in this function when BPF helpers may read direct
packet data which is unconditionally disabled then. Fix it by properly
adding BPF_PROG_TYPE_CGROUP_SKB to may_access_direct_pkt_data().

Fixes: b39b5f411d ("bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-25 17:02:06 -07:00
Daniel Borkmann
5d66fa7d9e bpf: fix direct packet access for flow dissector progs
Commit d58e468b11 ("flow_dissector: implements flow dissector BPF
hook") added direct packet access for skbs in may_access_direct_pkt_data()
function where this enables read and write access to the skb->data. This
is buggy because without a prologue generator such as bpf_unclone_prologue()
we would allow for writing into cloned skbs. Original intention might have
been to only allow read access where this is not needed (similar as the
flow_dissector_func_proto() indicates which enables only bpf_skb_load_bytes()
as well), therefore this patch fixes it to restrict to read-only.

Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-25 17:02:06 -07:00
Daniel Borkmann
84430d4232 bpf, verifier: avoid retpoline for map push/pop/peek operation
Extend prior work from 09772d92cd ("bpf: avoid retpoline for
lookup/update/delete calls on maps") to also apply to the recently
added map helpers that perform push/pop/peek operations so that
the indirect call can be avoided.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-20 23:13:32 -07:00
Daniel Borkmann
ad38911dcd bpf, verifier: remove unneeded flow key in check_helper_mem_access
They PTR_TO_FLOW_KEYS is not used today to be passed into a helper
as memory, so it can be removed from check_helper_mem_access().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-20 23:13:32 -07:00
Daniel Borkmann
4b5defdec3 bpf, verifier: reject xadd on flow key memory
We should not enable xadd operation for flow key memory if not
needed there anyway. There is no such issue as described in the
commit f37a8cb84c ("bpf: reject stores into ctx via st and xadd")
since there's no context rewriter for flow keys today, but it
also shouldn't become part of the user facing behavior to allow
for it. After patch:

  0: (79) r7 = *(u64 *)(r1 +144)
  1: (b7) r3 = 4096
  2: (db) lock *(u64 *)(r7 +0) += r3
  BPF_XADD stores into R7 flow_keys is not allowed

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-20 23:13:32 -07:00
Daniel Borkmann
2a159c6f82 bpf, verifier: fix register type dump in xadd and st
Using reg_type_str[insn->dst_reg] is incorrect since insn->dst_reg
contains the register number but not the actual register type. Add
a small reg_state() helper and use it to get to the type. Also fix
up the test_verifier test cases that have an incorrect errstr.

Fixes: 9d2be44a7f ("bpf: Reuse canonical string formatter for ctx errs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-20 23:13:32 -07:00
Mauricio Vasquez B
f1a2e44a3a bpf: add queue and stack maps
Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
These maps support peek, pop and push operations that are exposed to eBPF
programs through the new bpf_map[peek/pop/push] helpers.  Those operations
are exposed to userspace applications through the already existing
syscalls in the following way:

BPF_MAP_LOOKUP_ELEM            -> peek
BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
BPF_MAP_UPDATE_ELEM            -> push

Queue/stack maps are implemented using a buffer, tail and head indexes,
hence BPF_F_NO_PREALLOC is not supported.

As opposite to other maps, queue and stack do not use RCU for protecting
maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
argument that is a pointer to a memory zone where to save the value of a
map.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
be passed as an extra argument.

Our main motivation for implementing queue/stack maps was to keep track
of a pool of elements, like network ports in a SNAT, however we forsee
other use cases, like for exampling saving last N kernel events in a map
and then analysing from userspace.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:31 -07:00
Mauricio Vasquez B
2ea864c58f bpf/verifier: add ARG_PTR_TO_UNINIT_MAP_VALUE
ARG_PTR_TO_UNINIT_MAP_VALUE argument is a pointer to a memory zone
used to save the value of a map.  Basically the same as
ARG_PTR_TO_UNINIT_MEM, but the size has not be passed as an extra
argument.

This will be used in the following patch that implements some new
helpers that receive a pointer to be filled with a map value.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:31 -07:00
David S. Miller
071a234ad7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2018-10-08

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer which allow
BPF programs to perform lookups for sockets in a network namespace. This would
allow programs to determine early on in processing whether the stack is
expecting to receive the packet, and perform some action (eg drop,
forward somewhere) based on this information.

2) per-cpu cgroup local storage from Roman Gushchin.
Per-cpu cgroup local storage is very similar to simple cgroup storage
except all the data is per-cpu. The main goal of per-cpu variant is to
implement super fast counters (e.g. packet counters), which don't require
neither lookups, neither atomic operations in a fast path.
The example of these hybrid counters is in selftests/bpf/netcnt_prog.c

3) allow HW offload of programs with BPF-to-BPF function calls from Quentin Monnet

4) support more than 64-byte key/value in HW offloaded BPF maps from Jakub Kicinski

5) rename of libbpf interfaces from Andrey Ignatov.
libbpf is maturing as a library and should follow good practices in
library design and implementation to play well with other libraries.
This patch set brings consistent naming convention to global symbols.

6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause from Alexei Starovoitov
to let Apache2 projects use libbpf

7) various AF_XDP fixes from Björn and Magnus
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-08 23:42:44 -07:00
Quentin Monnet
e4052d06a5 bpf: allow offload of programs with BPF-to-BPF function calls
Now that there is at least one driver supporting BPF-to-BPF function
calls, lift the restriction, in the verifier, on hardware offload of
eBPF programs containing such calls. But prevent jit_subprogs(), still
in the verifier, from being run for offloaded programs.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-08 10:24:13 +02:00
Quentin Monnet
c941ce9c28 bpf: add verifier callback to get stack usage info for offloaded progs
In preparation for BPF-to-BPF calls in offloaded programs, add a new
function attribute to the struct bpf_prog_offload_ops so that drivers
supporting eBPF offload can hook at the end of program verification, and
potentially extract information collected by the verifier.

Implement a minimal callback (returning 0) in the drivers providing the
structs, namely netdevsim and nfp.

This will be useful in the nfp driver, in later commits, to extract the
number of subprograms as well as the stack depth for those subprograms.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-08 10:24:12 +02:00
David S. Miller
72438f8cef Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-10-06 14:43:42 -07:00
Jann Horn
b799207e1e bpf: 32-bit RSH verification must truncate input before the ALU op
When I wrote commit 468f6eafa6 ("bpf: fix 32-bit ALU op verification"), I
assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it
is sufficient to just truncate the output to 32 bits; and so I just moved
the register size coercion that used to be at the start of the function to
the end of the function.

That assumption is true for almost every op, but not for 32-bit right
shifts, because those can propagate information towards the least
significant bit. Fix it by always truncating inputs for 32-bit ops to 32
bits.

Also get rid of the coerce_reg_to_size() after the ALU op, since that has
no effect.

Fixes: 468f6eafa6 ("bpf: fix 32-bit ALU op verification")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-05 18:41:45 +02:00
Joe Stringer
6acc9b432e bpf: Add helper to retrieve socket in BPF
This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
socket listening on this host, and returns a socket pointer which the
BPF program can then access to determine, for instance, whether to
forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
socket, so when a BPF program makes use of this function, it must
subsequently pass the returned pointer into the newly added sk_release()
to return the reference.

By way of example, the following pseudocode would filter inbound
connections at XDP if there is no corresponding service listening for
the traffic:

  struct bpf_sock_tuple tuple;
  struct bpf_sock_ops *sk;

  populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
  if (!sk) {
    // Couldn't find a socket listening for this traffic. Drop.
    return TC_ACT_SHOT;
  }
  bpf_sk_release(sk, 0);
  return TC_ACT_OK;

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-03 02:53:47 +02:00
Joe Stringer
fd978bf7fd bpf: Add reference tracking to verifier
Allow helper functions to acquire a reference and return it into a
register. Specific pointer types such as the PTR_TO_SOCKET will
implicitly represent such a reference. The verifier must ensure that
these references are released exactly once in each path through the
program.

To achieve this, this commit assigns an id to the pointer and tracks it
in the 'bpf_func_state', then when the function or program exits,
verifies that all of the acquired references have been freed. When the
pointer is passed to a function that frees the reference, it is removed
from the 'bpf_func_state` and all existing copies of the pointer in
registers are marked invalid.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-03 02:53:47 +02:00
Joe Stringer
84dbf35073 bpf: Macrofy stack state copy
An upcoming commit will need very similar copy/realloc boilerplate, so
refactor the existing stack copy/realloc functions into macros to
simplify it.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-03 02:53:47 +02:00
Joe Stringer
c64b798328 bpf: Add PTR_TO_SOCKET verifier type
Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-03 02:53:47 +02:00
Joe Stringer
840b9615d6 bpf: Generalize ptr_or_null regs check
This check will be reused by an upcoming commit for conditional jump
checks for sockets. Refactor it a bit to simplify the later commit.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-03 02:53:47 +02:00
Joe Stringer
9d2be44a7f bpf: Reuse canonical string formatter for ctx errs
The array "reg_type_str" provides canonical formatting of register
types, however a couple of places would previously check whether a
register represented the context and write the name "context" directly.
An upcoming commit will add another pointer type to these statements, so
to provide more accurate error messages in the verifier, update these
error messages to use "reg_type_str" instead.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-03 02:53:47 +02:00
Joe Stringer
aad2eeaf46 bpf: Simplify ptr_min_max_vals adjustment
An upcoming commit will add another two pointer types that need very
similar behaviour, so generalise this function now.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-03 02:53:47 +02:00
Joe Stringer
f3709f69b7 bpf: Add iterator for spilled registers
Add this iterator for spilled registers, it concentrates the details of
how to get the current frame's spilled registers into a single macro
while clarifying the intention of the code which is calling the macro.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-03 02:53:46 +02:00
Roman Gushchin
b741f16303 bpf: introduce per-cpu cgroup local storage
This commit introduced per-cpu cgroup local storage.

Per-cpu cgroup local storage is very similar to simple cgroup storage
(let's call it shared), except all the data is per-cpu.

The main goal of per-cpu variant is to implement super fast
counters (e.g. packet counters), which don't require neither
lookups, neither atomic operations.

>From userspace's point of view, accessing a per-cpu cgroup storage
is similar to other per-cpu map types (e.g. per-cpu hashmaps and
arrays).

Writing to a per-cpu cgroup storage is not atomic, but is performed
by copying longs, so some minimal atomicity is here, exactly
as with other per-cpu maps.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-01 16:18:32 +02:00
Roman Gushchin
8bad74f984 bpf: extend cgroup bpf core to allow multiple cgroup storage types
In order to introduce per-cpu cgroup storage, let's generalize
bpf cgroup core to support multiple cgroup storage types.
Potentially, per-node cgroup storage can be added later.

This commit is mostly a formal change that replaces
cgroup_storage pointer with a array of cgroup_storage pointers.
It doesn't actually introduce a new storage type,
it will be done later.

Each bpf program is now able to have one cgroup storage of each type.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-01 16:18:32 +02:00
David S. Miller
105bc1306e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-09-25

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Allow for RX stack hardening by implementing the kernel's flow
   dissector in BPF. Idea was originally presented at netconf 2017 [0].
   Quote from merge commit:

     [...] Because of the rigorous checks of the BPF verifier, this
     provides significant security guarantees. In particular, the BPF
     flow dissector cannot get inside of an infinite loop, as with
     CVE-2013-4348, because BPF programs are guaranteed to terminate.
     It cannot read outside of packet bounds, because all memory accesses
     are checked. Also, with BPF the administrator can decide which
     protocols to support, reducing potential attack surface. Rarely
     encountered protocols can be excluded from dissection and the
     program can be updated without kernel recompile or reboot if a
     bug is discovered. [...]

   Also, a sample flow dissector has been implemented in BPF as part
   of this work, from Petar and Willem.

   [0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

2) Add support for bpftool to list currently active attachment
   points of BPF networking programs providing a quick overview
   similar to bpftool's perf subcommand, from Yonghong.

3) Fix a verifier pruning instability bug where a union member
   from the register state was not cleared properly leading to
   branches not being pruned despite them being valid candidates,
   from Alexei.

4) Various smaller fast-path optimizations in XDP's map redirect
   code, from Jesper.

5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps
   in bpftool, from Roman.

6) Remove a duplicate check in libbpf that probes for function
   storage, from Taeung.

7) Fix an issue in test_progs by avoid checking for errno since
   on success its value should not be checked, from Mauricio.

8) Fix unused variable warning in bpf_getsockopt() helper when
   CONFIG_INET is not configured, from Anders.

9) Fix a compilation failure in the BPF sample code's use of
   bpf_flow_keys, from Prashant.

10) Minor cleanups in BPF code, from Yue and Zhong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25 20:29:38 -07:00
David S. Miller
e366fa4350 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
Two new tls tests added in parallel in both net and net-next.

Used Stephen Rothwell's linux-next resolution.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 09:33:27 -07:00
Petar Penkov
d58e468b11 flow_dissector: implements flow dissector BPF hook
Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-14 12:04:33 -07:00
Alexei Starovoitov
dd066823db bpf/verifier: disallow pointer subtraction
Subtraction of pointers was accidentally allowed for unpriv programs
by commit 82abbf8d2f. Revert that part of commit.

Fixes: 82abbf8d2f ("bpf: do not allow root to mangle valid pointers")
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-12 23:30:02 +02:00
Alexei Starovoitov
a9c676bc8f bpf/verifier: fix verifier instability
Edward Cree says:
In check_mem_access(), for the PTR_TO_CTX case, after check_ctx_access()
has supplied a reg_type, the other members of the register state are set
appropriately.  Previously reg.range was set to 0, but as it is in a
union with reg.map_ptr, which is larger, upper bytes of the latter were
left in place.  This then caused the memcmp() in regsafe() to fail,
preventing some branches from being pruned (and occasionally causing the
same program to take a varying number of processed insns on repeated
verifier runs).

Fix the instability by clearing bpf_reg_state in __mark_reg_[un]known()

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Debugged-by: Edward Cree <ecree@solarflare.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-05 22:21:00 -07:00
Edward Cree
8efea21d33 bpf/verifier: display non-spill stack slot types in print_verifier_state
If a stack slot does not hold a spilled register (STACK_SPILL), then each
 of its eight bytes could potentially have a different slot_type.  This
 information can be important for debugging, and previously we either did
 not print anything for the stack slot, or just printed fp-X=0 in the case
 where its first byte was STACK_ZERO.
Instead, print eight characters with either 0 (STACK_ZERO), m (STACK_MISC)
 or ? (STACK_INVALID) for any stack slot which is neither STACK_SPILL nor
 entirely STACK_INVALID.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-29 18:52:12 -07:00
Edward Cree
679c782de1 bpf/verifier: per-register parent pointers
By giving each register its own liveness chain, we elide the skip_callee()
 logic.  Instead, each register's parent is the state it inherits from;
 both check_func_call() and prepare_func_exit() automatically connect
 reg states to the correct chain since when they copy the reg state across
 (r1-r5 into the callee as args, and r0 out as the return value) they also
 copy the parent pointer.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-29 18:52:12 -07:00
Daniel Borkmann
f6069b9aa9 bpf: fix redirect to map under tail calls
Commits 109980b894 ("bpf: don't select potentially stale ri->map
from buggy xdp progs") and 7c30013133 ("bpf: fix ri->map_owner
pointer on bpf_prog_realloc") tried to mitigate that buggy programs
using bpf_redirect_map() helper call do not leave stale maps behind.
Idea was to add a map_owner cookie into the per CPU struct redirect_info
which was set to prog->aux by the prog making the helper call as a
proof that the map is not stale since the prog is implicitly holding
a reference to it. This owner cookie could later on get compared with
the program calling into BPF whether they match and therefore the
redirect could proceed with processing the map safely.

In (obvious) hindsight, this approach breaks down when tail calls are
involved since the original caller's prog->aux pointer does not have
to match the one from one of the progs out of the tail call chain,
and therefore the xdp buffer will be dropped instead of redirected.
A way around that would be to fix the issue differently (which also
allows to remove related work in fast path at the same time): once
the life-time of a redirect map has come to its end we use it's map
free callback where we need to wait on synchronize_rcu() for current
outstanding xdp buffers and remove such a map pointer from the
redirect info if found to be present. At that time no program is
using this map anymore so we simply invalidate the map pointers to
NULL iff they previously pointed to that instance while making sure
that the redirect path only reads out the map once.

Fixes: 97f91a7cf0 ("bpf: add bpf_redirect_map helper routine")
Fixes: 109980b894 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-17 15:56:23 -07:00
Martin KaFai Lau
2dbb9b9e6d bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.

BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].

At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp).  At this
point,  it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.

For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb().  Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.

Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.

The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).

In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages.  Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.

The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").

The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()".  Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case).  The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want.  This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match).  The bpf prog can decide if it wants to do SK_DROP as its
discretion.

When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk.  If not, the kernel
will select one from "reuse->socks[]" (as before this patch).

The SK_DROP and SK_PASS handling logic will be in the next patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:46 +02:00
Roman Gushchin
cd33943176 bpf: introduce the bpf_get_local_storage() helper function
The bpf_get_local_storage() helper function is used
to get a pointer to the bpf local storage from a bpf program.

It takes a pointer to a storage map and flags as arguments.
Right now it accepts only cgroup storage maps, and flags
argument has to be 0. Further it can be extended to support
other types of local storage: e.g. thread local storage etc.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
3e6a4b3e02 bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE
BPF_MAP_TYPE_CGROUP_STORAGE maps are special in a way
that the access from the bpf program side is lookup-free.
That means the result is guaranteed to be a valid
pointer to the cgroup storage; no NULL-check is required.

This patch introduces BPF_PTR_TO_MAP_VALUE return type,
which is required to cause the verifier accept programs,
which are not checking the map value pointer for being NULL.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
de9cbbaadb bpf: introduce cgroup storage maps
This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps:
a special type of maps which are implementing the cgroup storage.

>From the userspace point of view it's almost a generic
hash map with the (cgroup inode id, attachment type) pair
used as a key.

The only difference is that some operations are restricted:
  1) a user can't create new entries,
  2) a user can't remove existing entries.

The lookup from userspace is o(log(n)).

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Arthur Fabre
fbeb1603bf bpf: verifier: MOV64 don't mark dst reg unbounded
When check_alu_op() handles a BPF_MOV64 between two registers,
it calls check_reg_arg(DST_OP) on the dst register, marking it
as unbounded. If the src and dst register are the same, this
marks the src as unbounded, which can lead to unexpected errors
for further checks that rely on bounds info. For example:

	BPF_MOV64_IMM(BPF_REG_2, 0),
	BPF_MOV64_REG(BPF_REG_2, BPF_REG_2),
	BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_2),
	BPF_MOV64_IMM(BPF_REG_0, 0),
	BPF_EXIT_INSN(),

Results in:

	"math between ctx pointer and register with unbounded
	min value is not allowed"

check_alu_op() now uses check_reg_arg(DST_OP_NO_MARK), and MOVs
that need to mark the dst register (MOVIMM, MOV32) do so.

Added a test case for MOV64 dst == src, and dst != src.

Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-31 22:09:33 +02:00
David S. Miller
eae249b27f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-07-20

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add sharing of BPF objects within one ASIC: this allows for reuse of
   the same program on multiple ports of a device, and therefore gains
   better code store utilization. On top of that, this now also enables
   sharing of maps between programs attached to different ports of a
   device, from Jakub.

2) Cleanup in libbpf and bpftool's Makefile to reduce unneeded feature
   detections and unused variable exports, also from Jakub.

3) First batch of RCU annotation fixes in prog array handling, i.e.
   there are several __rcu markers which are not correct as well as
   some of the RCU handling, from Roman.

4) Two fixes in BPF sample files related to checking of the prog_cnt
   upper limit from sample loader, from Dan.

5) Minor cleanup in sockmap to remove a set but not used variable,
   from Colin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 23:58:30 -07:00
Jakub Kicinski
09728266b6 bpf: offload: rename bpf_offload_dev_match() to bpf_offload_prog_map_match()
A set of new API functions exported for the drivers will soon use
'bpf_offload_dev_' as a prefix.  Rename the bpf_offload_dev_match()
which is internal to the core (used by the verifier) to avoid any
confusion.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-18 15:10:34 +02:00
Daniel Borkmann
c7a8978432 bpf: don't leave partial mangled prog in jit_subprogs error path
syzkaller managed to trigger the following bug through fault injection:

  [...]
  [  141.043668] verifier bug. No program starts at insn 3
  [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                 get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
  [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                 fixup_call_args kernel/bpf/verifier.c:5587 [inline]
  [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                 bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
  [  141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51
  [  141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014
  [  141.049877] Call Trace:
  [  141.050324]  __dump_stack lib/dump_stack.c:77 [inline]
  [  141.050324]  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
  [  141.050950]  ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60
  [  141.051837]  panic+0x238/0x4e7 kernel/panic.c:184
  [  141.052386]  ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385
  [  141.053101]  ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537
  [  141.053814]  ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530
  [  141.054506]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
  [  141.054506]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
  [  141.054506]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
  [  141.055163]  __warn.cold.8+0x163/0x1ba kernel/panic.c:538
  [  141.055820]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
  [  141.055820]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
  [  141.055820]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
  [...]

What happens in jit_subprogs() is that kcalloc() for the subprog func
buffer is failing with NULL where we then bail out. Latter is a plain
return -ENOMEM, and this is definitely not okay since earlier in the
loop we are walking all subprogs and temporarily rewrite insn->off to
remember the subprog id as well as insn->imm to temporarily point the
call to __bpf_call_base + 1 for the initial JIT pass. Thus, bailing
out in such state and handing this over to the interpreter is troublesome
since later/subsequent e.g. find_subprog() lookups are based on wrong
insn->imm.

Therefore, once we hit this point, we need to jump to out_free path
where we undo all changes from earlier loop, so that interpreter can
work on unmodified insn->{off,imm}.

Another point is that should find_subprog() fail in jit_subprogs() due
to a verifier bug, then we also should not simply defer the program to
the interpreter since also here we did partial modifications. Instead
we should just bail out entirely and return an error to the user who is
trying to load the program.

Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-12 14:00:54 -07:00
Kees Cook
fad953ce0b treewide: Use array_size() in vzalloc()
The vzalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vzalloc(a * b)

with:
        vzalloc(array_size(a, b))

as well as handling cases of:

        vzalloc(a * b * c)

with:

        vzalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vzalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vzalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vzalloc(C1 * C2 * C3, ...)
|
  vzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vzalloc(C1 * C2, ...)
|
  vzalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
6396bb2215 treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

        kzalloc(a * b, gfp)

with:
        kcalloc(a * b, gfp)

as well as handling cases of:

        kzalloc(a * b * c, gfp)

with:

        kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kzalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc(sizeof(THING) * C2, ...)
|
  kzalloc(sizeof(TYPE) * C2, ...)
|
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Daniel Borkmann
58990d1ff3 bpf: reject passing modified ctx to helper functions
As commit 28e33f9d78 ("bpf: disallow arithmetic operations on
context pointer") already describes, f1174f77b5 ("bpf/verifier:
rework value tracking") removed the specific white-listed cases
we had previously where we would allow for pointer arithmetic in
order to further generalize it, and allow e.g. context access via
modified registers. While the dereferencing of modified context
pointers had been forbidden through 28e33f9d78, syzkaller did
recently manage to trigger several KASAN splats for slab out of
bounds access and use after frees by simply passing a modified
context pointer to a helper function which would then do the bad
access since verifier allowed it in adjust_ptr_min_max_vals().

Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals()
generally could break existing programs as there's a valid use
case in tracing in combination with passing the ctx to helpers as
bpf_probe_read(), where the register then becomes unknown at
verification time due to adding a non-constant offset to it. An
access sequence may look like the following:

  offset = args->filename;  /* field __data_loc filename */
  bpf_probe_read(&dst, len, (char *)args + offset); // args is ctx

There are two options: i) we could special case the ctx and as
soon as we add a constant or bounded offset to it (hence ctx type
wouldn't change) we could turn the ctx into an unknown scalar, or
ii) we generalize the sanity test for ctx member access into a
small helper and assert it on the ctx register that was passed
as a function argument. Fwiw, latter is more obvious and less
complex at the same time, and one case that may potentially be
legitimate in future for ctx member access at least would be for
ctx to carry a const offset. Therefore, fix follows approach
from ii) and adds test cases to BPF kselftests.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Reported-by: syzbot+3d0b2441dbb71751615e@syzkaller.appspotmail.com
Reported-by: syzbot+c8504affd4fdd0c1b626@syzkaller.appspotmail.com
Reported-by: syzbot+e5190cb881d8660fb1a3@syzkaller.appspotmail.com
Reported-by: syzbot+efae31b384d5badbd620@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-07 12:37:17 -07:00
Daniel Borkmann
bc23105ca0 bpf: fix context access in tracing progs on 32 bit archs
Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT
program type in test_verifier report the following errors on x86_32:

  172/p unpriv: spill/fill of different pointers ldx FAIL
  Unexpected error message!
  0: (bf) r6 = r10
  1: (07) r6 += -8
  2: (15) if r1 == 0x0 goto pc+3
  R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
  3: (bf) r2 = r10
  4: (07) r2 += -76
  5: (7b) *(u64 *)(r6 +0) = r2
  6: (55) if r1 != 0x0 goto pc+1
  R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp
  7: (7b) *(u64 *)(r6 +0) = r1
  8: (79) r1 = *(u64 *)(r6 +0)
  9: (79) r1 = *(u64 *)(r1 +68)
  invalid bpf_context access off=68 size=8

  378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (71) r0 = *(u8 *)(r1 +68)
  invalid bpf_context access off=68 size=1

  379/p check bpf_perf_event_data->sample_period half load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (69) r0 = *(u16 *)(r1 +68)
  invalid bpf_context access off=68 size=2

  380/p check bpf_perf_event_data->sample_period word load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (61) r0 = *(u32 *)(r1 +68)
  invalid bpf_context access off=68 size=4

  381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (79) r0 = *(u64 *)(r1 +68)
  invalid bpf_context access off=68 size=8

Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte
boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok()
will then bail out saying that off & (size_default - 1) which is 68 & 7
doesn't cleanly align in the case of sample_period access from struct
bpf_perf_event_data, hence verifier wrongly thinks we might be doing an
unaligned access here though underlying arch can handle it just fine.
Therefore adjust this down to machine size and check and rewrite the
offset for narrow access on that basis. We also need to fix corresponding
pe_prog_is_valid_access(), since we hit the check for off % size != 0
(e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs
for tracing work on x86_32.

Reported-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03 07:46:56 -07:00
Daniel Borkmann
09772d92cd bpf: avoid retpoline for lookup/update/delete calls on maps
While some of the BPF map lookup helpers provide a ->map_gen_lookup()
callback for inlining the map lookup altogether it is not available
for every map, so the remaining ones have to call bpf_map_lookup_elem()
helper which does a dispatch to map->ops->map_lookup_elem(). In
times of retpolines, this will control and trap speculative execution
rather than letting it do its work for the indirect call and will
therefore cause a slowdown. Likewise, bpf_map_update_elem() and
bpf_map_delete_elem() do not have an inlined version and need to call
into their map->ops->map_update_elem() resp. map->ops->map_delete_elem()
handlers.

Before:

  # bpftool prog dump xlated id 1
    0: (bf) r2 = r10
    1: (07) r2 += -8
    2: (7a) *(u64 *)(r2 +0) = 0
    3: (18) r1 = map[id:1]
    5: (85) call __htab_map_lookup_elem#232656
    6: (15) if r0 == 0x0 goto pc+4
    7: (71) r1 = *(u8 *)(r0 +35)
    8: (55) if r1 != 0x0 goto pc+1
    9: (72) *(u8 *)(r0 +35) = 1
   10: (07) r0 += 56
   11: (15) if r0 == 0x0 goto pc+4
   12: (bf) r2 = r0
   13: (18) r1 = map[id:1]
   15: (85) call bpf_map_delete_elem#215008  <-- indirect call via
   16: (95) exit                                 helper

After:

  # bpftool prog dump xlated id 1
    0: (bf) r2 = r10
    1: (07) r2 += -8
    2: (7a) *(u64 *)(r2 +0) = 0
    3: (18) r1 = map[id:1]
    5: (85) call __htab_map_lookup_elem#233328
    6: (15) if r0 == 0x0 goto pc+4
    7: (71) r1 = *(u8 *)(r0 +35)
    8: (55) if r1 != 0x0 goto pc+1
    9: (72) *(u8 *)(r0 +35) = 1
   10: (07) r0 += 56
   11: (15) if r0 == 0x0 goto pc+4
   12: (bf) r2 = r0
   13: (18) r1 = map[id:1]
   15: (85) call htab_lru_map_delete_elem#238240  <-- direct call
   16: (95) exit

In all three lookup/update/delete cases however we can use the actual
address of the map callback directly if we find that there's only a
single path with a map pointer leading to the helper call, meaning
when the map pointer has not been poisoned from verifier side.
Example code can be seen above for the delete case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03 07:45:37 -07:00
Daniel Borkmann
3fe2867cdf bpf: fixup error message from gpl helpers on license mismatch
Stating 'proprietary program' in the error is just silly since it
can also be a different open source license than that which is just
not compatible.

Reference: https://twitter.com/majek04/status/998531268039102465
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03 07:42:06 -07:00
David S. Miller
5b79c2af66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of easy overlapping changes in the confict
resolutions here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-26 19:46:15 -04:00
Linus Torvalds
03250e1028 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Let's begin the holiday weekend with some networking fixes:

   1) Whoops need to restrict cfg80211 wiphy names even more to 64
      bytes. From Eric Biggers.

   2) Fix flags being ignored when using kernel_connect() with SCTP,
      from Xin Long.

   3) Use after free in DCCP, from Alexey Kodanev.

   4) Need to check rhltable_init() return value in ipmr code, from Eric
      Dumazet.

   5) XDP handling fixes in virtio_net from Jason Wang.

   6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu.

   7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack
      Morgenstein.

   8) Prevent out-of-bounds speculation using indexes in BPF, from
      Daniel Borkmann.

   9) Fix regression added by AF_PACKET link layer cure, from Willem de
      Bruijn.

  10) Correct ENIC dma mask, from Govindarajulu Varadarajan.

  11) Missing config options for PMTU tests, from Stefano Brivio"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
  ibmvnic: Fix partial success login retries
  selftests/net: Add missing config options for PMTU tests
  mlx4_core: allocate ICM memory in page size chunks
  enic: set DMA mask to 47 bit
  ppp: remove the PPPIOCDETACH ioctl
  ipv4: remove warning in ip_recv_error
  net : sched: cls_api: deal with egdev path only if needed
  vhost: synchronize IOTLB message with dev cleanup
  packet: fix reserve calculation
  net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands
  net/mlx5e: When RXFCS is set, add FCS data into checksum calculation
  bpf: properly enforce index mask to prevent out-of-bounds speculation
  net/mlx4: Fix irq-unsafe spinlock usage
  net: phy: broadcom: Fix bcm_write_exp()
  net: phy: broadcom: Fix auxiliary control register reads
  net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy
  net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message
  ibmvnic: Only do H_EOI for mobility events
  tuntap: correctly set SOCKWQ_ASYNC_NOSPACE
  virtio-net: fix leaking page for gso packet during mergeable XDP
  ...
2018-05-25 19:54:42 -07:00
Daniel Borkmann
c93552c443 bpf: properly enforce index mask to prevent out-of-bounds speculation
While reviewing the verifier code, I recently noticed that the
following two program variants in relation to tail calls can be
loaded.

Variant 1:

  # bpftool p d x i 15
    0: (15) if r1 == 0x0 goto pc+3
    1: (18) r2 = map[id:5]
    3: (05) goto pc+2
    4: (18) r2 = map[id:6]
    6: (b7) r3 = 7
    7: (35) if r3 >= 0xa0 goto pc+2
    8: (54) (u32) r3 &= (u32) 255
    9: (85) call bpf_tail_call#12
   10: (b7) r0 = 1
   11: (95) exit

  # bpftool m s i 5
    5: prog_array  flags 0x0
        key 4B  value 4B  max_entries 4  memlock 4096B
  # bpftool m s i 6
    6: prog_array  flags 0x0
        key 4B  value 4B  max_entries 160  memlock 4096B

Variant 2:

  # bpftool p d x i 20
    0: (15) if r1 == 0x0 goto pc+3
    1: (18) r2 = map[id:8]
    3: (05) goto pc+2
    4: (18) r2 = map[id:7]
    6: (b7) r3 = 7
    7: (35) if r3 >= 0x4 goto pc+2
    8: (54) (u32) r3 &= (u32) 3
    9: (85) call bpf_tail_call#12
   10: (b7) r0 = 1
   11: (95) exit

  # bpftool m s i 8
    8: prog_array  flags 0x0
        key 4B  value 4B  max_entries 160  memlock 4096B
  # bpftool m s i 7
    7: prog_array  flags 0x0
        key 4B  value 4B  max_entries 4  memlock 4096B

In both cases the index masking inserted by the verifier in order
to control out of bounds speculation from a CPU via b2157399cc
("bpf: prevent out-of-bounds speculation") seems to be incorrect
in what it is enforcing. In the 1st variant, the mask is applied
from the map with the significantly larger number of entries where
we would allow to a certain degree out of bounds speculation for
the smaller map, and in the 2nd variant where the mask is applied
from the map with the smaller number of entries, we get buggy
behavior since we truncate the index of the larger map.

The original intent from commit b2157399cc is to reject such
occasions where two or more different tail call maps are used
in the same tail call helper invocation. However, the check on
the BPF_MAP_PTR_POISON is never hit since we never poisoned the
saved pointer in the first place! We do this explicitly for map
lookups but in case of tail calls we basically used the tail
call map in insn_aux_data that was processed in the most recent
path which the verifier walked. Thus any prior path that stored
a pointer in insn_aux_data at the helper location was always
overridden.

Fix it by moving the map pointer poison logic into a small helper
that covers both BPF helpers with the same logic. After that in
fixup_bpf_calls() the poison check is then hit for tail calls
and the program rejected. Latter only happens in unprivileged
case since this is the *only* occasion where a rewrite needs to
happen, and where such rewrite is specific to the map (max_entries,
index_mask). In the privileged case the rewrite is generic for
the insn->imm / insn->code update so multiple maps from different
paths can be handled just fine since all the remaining logic
happens in the instruction processing itself. This is similar
to the case of map lookups: in case there is a collision of
maps in fixup_bpf_calls() we must skip the inlined rewrite since
this will turn the generic instruction sequence into a non-
generic one. Thus the patch_call_imm will simply update the
insn->imm location where the bpf_map_lookup_elem() will later
take care of the dispatch. Given we need this 'poison' state
as a check, the information of whether a map is an unpriv_array
gets lost, so enforcing it prior to that needs an additional
state. In general this check is needed since there are some
complex and tail call intensive BPF programs out there where
LLVM tends to generate such code occasionally. We therefore
convert the map_ptr rather into map_state to store all this
w/o extra memory overhead, and the bit whether one of the maps
involved in the collision was from an unpriv_array thus needs
to be retained as well there.

Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24 08:15:43 -07:00
Mathieu Xhonneux
004d4b274e ipv6: sr: Add seg6local action End.BPF
This patch adds the End.BPF action to the LWT seg6local infrastructure.
This action works like any other seg6local End action, meaning that an IPv6
header with SRH is needed, whose DA has to be equal to the SID of the
action. It will also advance the SRH to the next segment, the BPF program
does not have to take care of this.

Since the BPF program may not be a source of instability in the kernel, it
is important to ensure that the integrity of the packet is maintained
before yielding it back to the IPv6 layer. The hook hence keeps track if
the SRH has been altered through the helpers, and re-validates its
content if needed with seg6_validate_srh. The state kept for validation is
stored in a per-CPU buffer. The BPF program is not allowed to directly
write into the packet, and only some fields of the SRH can be altered
through the helper bpf_lwt_seg6_store_bytes.

Performances profiling has shown that the SRH re-validation does not induce
a significant overhead. If the altered SRH is deemed as invalid, the packet
is dropped.

This validation is also done before executing any action through
bpf_lwt_seg6_action, and will not be performed again if the SRH is not
modified after calling the action.

The BPF program may return 3 types of return codes:
    - BPF_OK: the End.BPF action will look up the next destination through
             seg6_lookup_nexthop.
    - BPF_REDIRECT: if an action has been executed through the
          bpf_lwt_seg6_action helper, the BPF program should return this
          value, as the skb's destination is already set and the default
          lookup should not be performed.
    - BPF_DROP : the packet will be dropped.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 11:57:36 +02:00
Sandipan Das
dbecd73884 bpf: get kernel symbol addresses via syscall
This adds new two new fields to struct bpf_prog_info. For
multi-function programs, these fields can be used to pass
a list of kernel symbol addresses for all functions in a
given program to userspace using the bpf system call with
the BPF_OBJ_GET_INFO_BY_FD command.

When bpf_jit_kallsyms is enabled, we can get the address
of the corresponding kernel symbol for a callee function
and resolve the symbol's name. The address is determined
by adding the value of the call instruction's imm field
to __bpf_call_base. This offset gets assigned to the imm
field by the verifier.

For some architectures, such as powerpc64, the imm field
is not large enough to hold this offset.

We resolve this by:

[1] Assigning the subprog id to the imm field of a call
    instruction in the verifier instead of the offset of
    the callee's symbol's address from __bpf_call_base.

[2] Determining the address of a callee's corresponding
    symbol by using the imm field as an index for the
    list of kernel symbol addresses now available from
    the program info.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 09:20:49 +02:00
Sandipan Das
2162fed49f bpf: support 64-bit offsets for bpf function calls
The imm field of a bpf instruction is a signed 32-bit integer.
For JITed bpf-to-bpf function calls, it holds the offset of the
start address of the callee's JITed image from __bpf_call_base.

For some architectures, such as powerpc64, this offset may be
as large as 64 bits and cannot be accomodated in the imm field
without truncation.

We resolve this by:

[1] Additionally using the auxiliary data of each function to
    keep a list of start addresses of the JITed images for all
    functions determined by the verifier.

[2] Retaining the subprog id inside the off field of the call
    instructions and using it to index into the list mentioned
    above and lookup the callee's address.

To make sure that the existing JIT compilers continue to work
without requiring changes, we keep the imm field as it is.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 09:20:48 +02:00
Alexei Starovoitov
af86ca4e30 bpf: Prevent memory disambiguation attack
Detect code patterns where malicious 'speculative store bypass' can be used
and sanitize such patterns.

 39: (bf) r3 = r10
 40: (07) r3 += -216
 41: (79) r8 = *(u64 *)(r7 +0)   // slow read
 42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
 43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
 44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
 45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                 // is now sanitized

Above code after x86 JIT becomes:
 e5: mov    %rbp,%rdx
 e8: add    $0xffffffffffffff28,%rdx
 ef: mov    0x0(%r13),%r14
 f3: movq   $0x0,-0x48(%rbp)
 fb: mov    %rdx,0x0(%r14)
 ff: mov    0x0(%rbx),%rdi
103: movzbq 0x0(%rdi),%rsi

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-19 20:44:24 +02:00
John Fastabend
8111038444 bpf: sockmap, add hash map support
Sockmap is currently backed by an array and enforces keys to be
four bytes. This works well for many use cases and was originally
modeled after devmap which also uses four bytes keys. However,
this has become limiting in larger use cases where a hash would
be more appropriate. For example users may want to use the 5-tuple
of the socket as the lookup key.

To support this add hash support.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-15 20:41:03 +02:00
Jakub Kicinski
0d83003256 bpf: xdp: allow offloads to store into rx_queue_index
It's fairly easy for offloaded XDP programs to select the RX queue
packets go to.  We need a way of expressing this in the software.
Allow write to the rx_queue_index field of struct xdp_md for
device-bound programs.

Skip convert_ctx_access callback entirely for offloads.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-09 18:04:36 +02:00
Jakub Kicinski
ab7f5bf092 bpf: fix references to free_bpf_prog_info() in comments
Comments in the verifier refer to free_bpf_prog_info() which
seems to have never existed in tree.  Replace it with
free_used_maps().

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 23:41:04 +02:00
Jakub Kicinski
f4e3ec0d57 bpf: replace map pointer loads before calling into offloads
Offloads may find host map pointers more useful than map fds.
Map pointers can be used to identify the map, while fds are
only valid within the context of loading process.

Jump to skip_full_check on error in case verifier log overflow
has to be handled (replace_map_fd_with_map_ptr() prints to the
log, driver prep may do that too in the future).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 23:41:03 +02:00
Jiong Wang
4cb3d99c84 bpf: add faked "ending" subprog
There are quite a few code snippet like the following in verifier:

       subprog_start = 0;
       if (env->subprog_cnt == cur_subprog + 1)
               subprog_end = insn_cnt;
       else
               subprog_end = env->subprog_info[cur_subprog + 1].start;

The reason is there is no marker in subprog_info array to tell the end of
it.

We could resolve this issue by introducing a faked "ending" subprog.
The special "ending" subprog is with "insn_cnt" as start offset, so it is
serving as the end mark whenever we iterate over all subprogs.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 11:58:36 +02:00
Jiong Wang
9c8105bd44 bpf: centre subprog information fields
It is better to centre all subprog information fields into one structure.
This structure could later serve as function node in call graph.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 11:58:36 +02:00
Jiong Wang
f910cefa32 bpf: unify main prog and subprog
Currently, verifier treat main prog and subprog differently. All subprogs
detected are kept in env->subprog_starts while main prog is not kept there.
Instead, main prog is implicitly defined as the prog start at 0.

There is actually no difference between main prog and subprog, it is better
to unify them, and register all progs detected into env->subprog_starts.

This could also help simplifying some code logic.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 11:58:35 +02:00
Daniel Borkmann
e0cea7ce98 bpf: implement ld_abs/ld_ind in native bpf
The main part of this work is to finally allow removal of LD_ABS
and LD_IND from the BPF core by reimplementing them through native
eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
keeping them around in native eBPF caused way more trouble than
actually worth it. To just list some of the security issues in
the past:

  * fdfaf64e75 ("x86: bpf_jit: support negative offsets")
  * 35607b02db ("sparc: bpf_jit: fix loads from negative offsets")
  * e0ee9c1215 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
  * 07aee94394 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
  * 6d59b7dbf7 ("bpf, s390x: do not reload skb pointers in non-skb context")
  * 87338c8e2c ("bpf, ppc64: do not reload skb pointers in non-skb context")

For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
these days due to their limitations and more efficient/flexible
alternatives that have been developed over time such as direct
packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
register, the load happens in host endianness and its exception
handling can yield unexpected behavior. The latter is explained
in depth in f6b1b3bf0d ("bpf: fix subprog verifier bypass by
div/mod by 0 exception") with similar cases of exceptions we had.
In native eBPF more recent program types will disable LD_ABS/LD_IND
altogether through may_access_skb() in verifier, and given the
limitations in terms of exception handling, it's also disabled
in programs that use BPF to BPF calls.

In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
to access packet data. It is not used in seccomp-BPF but programs
that use it for socket filtering or reuseport for demuxing with
cBPF. This is mostly relevant for applications that have not yet
migrated to native eBPF.

The main complexity and source of bugs in LD_ABS/LD_IND is coming
from their implementation in the various JITs. Most of them keep
the model around from cBPF times by implementing a fastpath written
in asm. They use typically two from the BPF program hidden CPU
registers for caching the skb's headlen (skb->len - skb->data_len)
and skb->data. Throughout the JIT phase this requires to keep track
whether LD_ABS/LD_IND are used and if so, the two registers need
to be recached each time a BPF helper would change the underlying
packet data in native eBPF case. At least in eBPF case, available
CPU registers are rare and the additional exit path out of the
asm written JIT helper makes it also inflexible since not all
parts of the JITer are in control from plain C. A LD_ABS/LD_IND
implementation in eBPF therefore allows to significantly reduce
the complexity in JITs with comparable performance results for
them, e.g.:

test_bpf             tcpdump port 22             tcpdump complex
x64      - before    15 21 10                    14 19  18
         - after      7 10 10                     7 10  15
arm64    - before    40 91 92                    40 91 151
         - after     51 64 73                    51 62 113

For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
and cache the skb's headlen and data in the cBPF prologue. The
BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
used as a local temporary variable. This allows to shrink the
image on x86_64 also for seccomp programs slightly since mapping
to %rsi is not an ereg. In callee-saved R8 and R9 we now track
skb data and headlen, respectively. For normal prologue emission
in the JITs this does not add any extra instructions since R8, R9
are pushed to stack in any case from eBPF side. cBPF uses the
convert_bpf_ld_abs() emitter which probes the fast path inline
already and falls back to bpf_skb_load_helper_{8,16,32}() helper
relying on the cached skb data and headlen as well. R8 and R9
never need to be reloaded due to bpf_helper_changes_pkt_data()
since all skb access in cBPF is read-only. Then, for the case
of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
does neither cache skb data and headlen nor has an inlined fast
path. The reason for the latter is that native eBPF does not have
any extra registers available anyway, but even if there were, it
avoids any reload of skb data and headlen in the first place.
Additionally, for the negative offsets, we provide an alternative
bpf_skb_load_bytes_relative() helper in eBPF which operates
similarly as bpf_skb_load_bytes() and allows for more flexibility.
Tested myself on x64, arm64, s390x, from Sandipan on ppc64.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 16:49:19 -07:00
Björn Töpel
fbfc504a24 bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
The xskmap is yet another BPF map, very much inspired by
dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
adds AF_XDP sockets into the map, and by using the bpf_redirect_map
helper, an XDP program can redirect XDP frames to an AF_XDP socket.

Note that a socket that is bound to certain ifindex/queue index will
*only* accept XDP frames from that netdev/queue index. If an XDP
program tries to redirect from a netdev/queue index other than what
the socket is bound to, the frame will not be received on the socket.

A socket can reside in multiple maps.

v3: Fixed race and simplified code.
v2: Removed one indirection in map lookup.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 15:55:24 -07:00
Yonghong Song
9cbe1f5a32 bpf/verifier: improve register value range tracking with ARSH
When helpers like bpf_get_stack returns an int value
and later on used for arithmetic computation, the LSH and ARSH
operations are often required to get proper sign extension into
64-bit. For example, without this patch:
    54: R0=inv(id=0,umax_value=800)
    54: (bf) r8 = r0
    55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
    55: (67) r8 <<= 32
    56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
    56: (c7) r8 s>>= 32
    57: R8=inv(id=0)
With this patch:
    54: R0=inv(id=0,umax_value=800)
    54: (bf) r8 = r0
    55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
    55: (67) r8 <<= 32
    56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
    56: (c7) r8 s>>= 32
    57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With better range of "R8", later on when "R8" is added to other register,
e.g., a map pointer or scalar-value register, the better register
range can be derived and verifier failure may be avoided.

In our later example,
    ......
    usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
    if (usize < 0)
        return 0;
    ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
    ......
Without improving ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and will be
rejected by verifier.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29 08:45:53 -07:00
Yonghong Song
afbe1a5b79 bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals
In verifier function adjust_scalar_min_max_vals,
when src_known is false and the opcode is BPF_LSH/BPF_RSH,
early return will happen in the function. So remove
the branch in handling BPF_LSH/BPF_RSH when src_known is false.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29 08:45:53 -07:00
Yonghong Song
849fa50662 bpf/verifier: refine retval R0 state for bpf_get_stack helper
The special property of return values for helpers bpf_get_stack
and bpf_probe_read_str are captured in verifier.
Both helpers return a negative error code or
a length, which is equal to or smaller than the buffer
size argument. This additional information in the
verifier can avoid the condition such as "retval > bufsize"
in the bpf program. For example, for the code blow,
    usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
    if (usize < 0 || usize > max_len)
        return 0;
The verifier may have the following errors:
    52: (85) call bpf_get_stack#65
     R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
     R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R9_w=inv800 R10=fp0,call_-1
    53: (bf) r8 = r0
    54: (bf) r1 = r8
    55: (67) r1 <<= 32
    56: (bf) r2 = r1
    57: (77) r2 >>= 32
    58: (25) if r2 > 0x31f goto pc+33
     R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
                         umax_value=18446744069414584320,
                         var_off=(0x0; 0xffffffff00000000))
     R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R8=inv(id=0) R9=inv800 R10=fp0,call_-1
    59: (1f) r9 -= r8
    60: (c7) r1 s>>= 32
    61: (bf) r2 = r7
    62: (0f) r2 += r1
    math between map_value pointer and register with unbounded
    min value is not allowed
The failure is due to llvm compiler optimization where register "r2",
which is a copy of "r1", is tested for condition while later on "r1"
is used for map_ptr operation. The verifier is not able to track such
inst sequence effectively.

Without the "usize > max_len" condition, there is no llvm optimization
and the below generated code passed verifier:
    52: (85) call bpf_get_stack#65
     R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
     R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R9_w=inv800 R10=fp0,call_-1
    53: (b7) r1 = 0
    54: (bf) r8 = r0
    55: (67) r8 <<= 32
    56: (c7) r8 s>>= 32
    57: (6d) if r1 s> r8 goto pc+24
     R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
     R1=inv0 R6=ctx(id=0,off=0,imm=0)
     R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
     R10=fp0,call_-1
    58: (bf) r2 = r7
    59: (0f) r2 += r8
    60: (1f) r9 -= r8
    61: (bf) r1 = r6

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29 08:45:53 -07:00
Yonghong Song
c195651e56 bpf: add bpf_get_stack helper
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.

This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.

Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29 08:45:53 -07:00