linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-07 23:19:12 +08:00

Author	SHA1	Message	Date
Harishankar Vishwanathan	7a998a7316	bpf, verifier: Improve precision for BPF_ADD and BPF_SUB This patch improves the precison of the scalar(32)_min_max_add and scalar(32)_min_max_sub functions, which update the u(32)min/u(32)_max ranges for the BPF_ADD and BPF_SUB instructions. We discovered this more precise operator using a technique we are developing for automatically synthesizing functions for updating tnums and ranges. According to the BPF ISA [1], "Underflow and overflow are allowed during arithmetic operations, meaning the 64-bit or 32-bit value will wrap". Our patch leverages the wrap-around semantics of unsigned overflow and underflow to improve precision. Below is an example of our patch for scalar_min_max_add; the idea is analogous for all four functions. There are three cases to consider when adding two u64 ranges [dst_umin, dst_umax] and [src_umin, src_umax]. Consider a value x in the range [dst_umin, dst_umax] and another value y in the range [src_umin, src_umax]. (a) No overflow: No addition x + y overflows. This occurs when even the largest possible sum, i.e., dst_umax + src_umax <= U64_MAX. (b) Partial overflow: Some additions x + y overflow. This occurs when the largest possible sum overflows (dst_umax + src_umax > U64_MAX), but the smallest possible sum does not overflow (dst_umin + src_umin <= U64_MAX). (c) Full overflow: All additions x + y overflow. This occurs when both the smallest possible sum and the largest possible sum overflow, i.e., both (dst_umin + src_umin) and (dst_umax + src_umax) are > U64_MAX. The current implementation conservatively sets the output bounds to unbounded, i.e, [umin=0, umax=U64_MAX], whenever there is any possibility of overflow, i.e, in cases (b) and (c). Otherwise it computes tight bounds as [dst_umin + src_umin, dst_umax + src_umax]: if (check_add_overflow(dst_umin, src_reg->umin_value, dst_umin) \|\| check_add_overflow(dst_umax, src_reg->umax_value, dst_umax)) { dst_umin = 0; dst_umax = U64_MAX; } Our synthesis-based technique discovered a more precise operator. Particularly, in case (c), all possible additions x + y overflow and wrap around according to eBPF semantics, and the computation of the output range as [dst_umin + src_umin, dst_umax + src_umax] continues to work. Only in case (b), do we need to set the output bounds to unbounded, i.e., [0, U64_MAX]. Case (b) can be checked by seeing if the minimum possible sum does not overflow and the maximum possible sum does overflow, and when that happens, we set the output to unbounded: min_overflow = check_add_overflow(dst_umin, src_reg->umin_value, dst_umin); max_overflow = check_add_overflow(dst_umax, src_reg->umax_value, dst_umax); if (!min_overflow && max_overflow) { dst_umin = 0; dst_umax = U64_MAX; } Below is an example eBPF program and the corresponding log from the verifier. The current implementation of scalar_min_max_add() sets r3's bounds to [0, U64_MAX] at instruction 5: (0f) r3 += r3, due to conservative overflow handling. 0: R1=ctx() R10=fp0 0: (b7) r4 = 0 ; R4_w=0 1: (87) r4 = -r4 ; R4_w=scalar() 2: (18) r3 = 0xa000000000000000 ; R3_w=0xa000000000000000 4: (4f) r3 \|= r4 ; R3_w=scalar(smin=0xa000000000000000,smax=-1,umin=0xa000000000000000,var_off=(0xa000000000000000; 0x5fffffffffffffff)) R4_w=scalar() 5: (0f) r3 += r3 ; R3_w=scalar() 6: (b7) r0 = 1 ; R0_w=1 7: (95) exit With our patch, r3's bounds after instruction 5 are set to a much more precise [0x4000000000000000,0xfffffffffffffffe]. ... 5: (0f) r3 += r3 ; R3_w=scalar(umin=0x4000000000000000,umax=0xfffffffffffffffe) 6: (b7) r0 = 1 ; R0_w=1 7: (95) exit The logic for scalar32_min_max_add is analogous. For the scalar(32)_min_max_sub functions, the reasoning is similar but applied to detecting underflow instead of overflow. We verified the correctness of the new implementations using Agni [3,4]. We since also discovered that a similar technique has been used to calculate output ranges for unsigned interval addition and subtraction in Hacker's Delight [2]. [1] https://docs.kernel.org/bpf/standardization/instruction-set.html [2] Hacker's Delight Ch.4-2, Propagating Bounds through Add’s and Subtract’s [3] https://github.com/bpfverif/agni [4] https://people.cs.rutgers.edu/~sn349/papers/sas24-preprint.pdf Co-developed-by: Matan Shachnai <m.shachnai@rutgers.edu> Signed-off-by: Matan Shachnai <m.shachnai@rutgers.edu> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250623040359.343235-2-harishankar.vishwanathan@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-24 18:37:22 -07:00
Menglong Dong	c11f34e300	bpf: Make update_prog_stats() always_inline The function update_prog_stats() will be called in the bpf trampoline. In most cases, it will be optimized by the compiler by making it inline. However, we can't rely on the compiler all the time, and just make it __always_inline to reduce the possible overhead. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20250621045501.101187-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-23 09:21:07 -07:00
James Bottomley	bd07bd12f2	bpf: Fix key serial argument of bpf_lookup_user_key() The underlying lookup_user_key() function uses a signed 32 bit integer for key serial numbers because legitimate serial numbers are positive (and > 3) and keyrings are negative. Using a u32 for the keyring in the bpf function doesn't currently cause any conversion problems but will start to trip the signed to unsigned conversion warnings when the kernel enables them, so convert the argument to signed (and update the tests accordingly) before it acquires more users. Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com> Link: https://lore.kernel.org/r/84cdb0775254d297d75e21f577089f64abdfbd28.camel@HansenPartnership.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-17 18:15:27 -07:00
Al Viro	f5527f0171	bpf: Get rid of redundant 3rd argument of prepare_seq_file() Remove 3rd argument in prepare_seq_file() to clean up the code a bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250615004719.GE3011112@ZenIV Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-17 17:19:41 -07:00
Luis Gerhorst	f66b4aaff2	bpf: Remove redundant free_verifier_state()/pop_stack() This patch removes duplicated code. Eduard points out [1]: Same cleanup cycles are done in push_stack() and push_async_cb(), both functions are only reachable from do_check_common() via do_check() -> do_check_insn(). Hence, I think that cur state should not be freed in push_() functions and pop_stack() loop there is not needed. This would also fix the 'symptom' for [2], but the issue also has a simpler fix which was sent separately. This fix also makes sure the push_() callers always return an error for which error_recoverable_with_nospec(err) is false. This is required because otherwise we try to recover and access the stale `state`. Moving free_verifier_state() and pop_stack(..., pop_log=false) to happen after the bpf_vlog_reset() call in do_check_common() is fine because the pop_stack() call that is moved does not call bpf_vlog_reset() with the pop_log=false parameter. [1] https://lore.kernel.org/all/b6931bd0dd72327c55287862f821ca6c4c3eb69a.camel@gmail.com/ [2] https://lore.kernel.org/all/68497853.050a0220.33aa0e.036a.GAE@google.com/ Reported-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/all/b6931bd0dd72327c55287862f821ca6c4c3eb69a.camel@gmail.com/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Link: https://lore.kernel.org/r/20250613090157.568349-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-13 14:59:30 -07:00
Eduard Zingerman	3157f7e299	bpf: handle jset (if a & b ...) as a jump in CFG computation BPF_JSET is a conditional jump and currently verifier.c:can_jump() does not know about that. This can lead to incorrect live registers and SCC computation. E.g. in the following example: 1: r0 = 1; 2: r2 = 2; 3: if r1 & 0x7 goto +1; 4: exit; 5: r0 = r2; 6: exit; W/o this fix insn_successors(3) will return only (4), a jump to (5) would be missed and r2 won't be marked as alive at (3). Fixes: `14c8552db6` ("bpf: simple DFA-based live registers analysis") Reported-by: syzbot+a36aac327960ff474804@syzkaller.appspotmail.com Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250613175331.3238739-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-13 11:51:19 -07:00
Eduard Zingerman	43736ec3e0	bpf: Include verifier memory allocations in memcg statistics This commit adds __GFP_ACCOUNT flag to verifier induced memory allocations. The intent is to account for all allocations reachable from BPF_PROG_LOAD command, which is needed to track verifier memory consumption in veristat. This includes allocations done in verifier.c, and some allocations in btf.c, functions in log.c do not allocate. There is also a utility function bpf_memcg_flags() which selectively adds GFP_ACCOUNT flag depending on the `cgroup.memory=nobpf` option. As far as I understand [1], the idea is to remove bpf_prog instances and maps from memcg accounting as these objects do not strictly belong to cgroup, hence it should not apply here. (btf_parse_fields() is reachable from both program load and map creation, but allocated record is not persistent as is freed as soon as map_check_btf() exits). [1] https://lore.kernel.org/all/20230210154734.4416-1-laoar.shao@gmail.com/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250613072147.3938139-2-eddyz87@gmail.com	2025-06-13 10:29:45 -07:00
Song Liu	fa6932577c	bpf: Initialize used but uninit variable in propagate_liveness() With input changed == NULL, a local variable is used for "changed". Initialize tmp properly, so that it can be used in the following: *changed \|= err > 0; Otherwise, UBSAN will complain: UBSAN: invalid-load in kernel/bpf/verifier.c:18924:4 load of value <some random value> is not a valid value for type '_Bool' Fixes: `dfb2d4c64b` ("bpf: set 'changed' status if propagate_liveness() did any updates") Signed-off-by: Song Liu <song@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250612221100.2153401-1-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:53:40 -07:00
Luis Gerhorst	3d71b8b9ab	bpf: Fix state use-after-free on push_stack() err Without this, `state->speculative` is used after the cleanup cycles in push_stack() or push_async_cb() freed `env->cur_state` (i.e., `state`). Avoid this by relying on the short-circuit logic to only access `state` if the error is recoverable (and make sure it never is after push_() failed). push_() callers must always return an error for which error_recoverable_with_nospec(err) is false if push_() returns NULL, otherwise we try to recover and access the stale `state`. This is only violated by sanitize_ptr_alu(), thus also fix this case to return -ENOMEM. state->speculative does not make sense if the error path of push_() ran. In that case, `state->speculative && error_recoverable_with_nospec(err)` as a whole should already never evaluate to true (because all cases where push_stack() fails must return -ENOMEM/-EFAULT). As mentioned, this is only violated by the push_stack() call in sanitize_speculative_path() which returns -EACCES without [1] (through REASON_STACK in sanitize_err() after sanitize_ptr_alu()). To fix this, return -ENOMEM for REASON_STACK (which is also the behavior we will have after [1]). Checked that it fixes the syzbot reproducer as expected. [1] https://lore.kernel.org/all/20250603213232.339242-1-luis.gerhorst@fau.de/ Fixes: `d6f1c85f22` ("bpf: Fall back to nospec for Spectre v1") Reported-by: syzbot+b5eb72a560b8149a1885@syzkaller.appspotmail.com Reported-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/all/38862a832b91382cddb083dddd92643bed0723b8.camel@gmail.com/ Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611210728.266563-1-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	0f54ff5470	bpf: include backedges in peak_states stat Count states accumulated in bpf_scc_visit->backedges in env->peak_states. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-10-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	0e0da5f901	bpf: remove {update,get}_loop_entry functions The previous patch switched read and precision tracking for iterator-based loops from state-graph-based loop tracking to control-flow-graph-based loop tracking. This patch removes the now-unused `update_loop_entry()` and `get_loop_entry()` functions, which were part of the state-graph-based logic. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-9-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	c9e31900b5	bpf: propagate read/precision marks over state graph backedges Current loop_entry-based exact states comparison logic does not handle the following case: .-> A --. Assume the states are visited in the order A, B, C. \| \| \| Assume that state B reaches a state equivalent to state A. \| v v At this point, state C is not processed yet, so state A '-- B C has not received any read or precision marks from C. As a result, these marks won't be propagated to B. If B has incomplete marks, it is unsafe to use it in states_equal() checks. This commit replaces the existing logic with the following: - Strongly connected components (SCCs) are computed over the program's control flow graph (intraprocedurally). - When a verifier state enters an SCC, that state is recorded as the SCC entry point. - When a verifier state is found equivalent to another (e.g., B to A in the example), it is recorded as a states graph backedge. Backedges are accumulated per SCC. - When an SCC entry state reaches `branches == 0`, read and precision marks are propagated through the backedges (e.g., from A to B, from C to A, and then again from A to B). To support nested subprogram calls, the entry state and backedge list are associated not with the SCC itself but with an object called `bpf_scc_callchain`. A callchain is a tuple `(callsite*, scc_id)`, where `callsite` is the index of a call instruction for each frame except the last. See the comments added in `is_state_visited()` and `compute_scc_callchain()` for more details. Fixes: `2a0992829e` ("bpf: correct loop detection for iterators convergence") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-8-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	b5c677d8d9	bpf: move REG_LIVE_DONE check to clean_live_states() The next patch would add some relatively heavy-weight operation to clean_live_states(), this operation can be skipped if REG_LIVE_DONE is set. Move the check from clean_verifier_state() to clean_verifier_state() as a small refactoring commit. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-7-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	dfb2d4c64b	bpf: set 'changed' status if propagate_liveness() did any updates Add an out parameter to `propagate_liveness()` to record whether any new liveness bits were set during its execution. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	23b37d6165	bpf: set 'changed' status if propagate_precision() did any updates Add an out parameter to `propagate_precision()` to record whether any new precision bits were set during its execution. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	9a2a0d7924	bpf: starting_state parameter for __mark_chain_precision() Allow `mark_chain_precision()` to run from an arbitrary starting state by replacing direct references to `env->cur_state` with a parameter. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:42 -07:00
Eduard Zingerman	13f843c017	bpf: frame_insn_idx() utility function A function to return IP for a given frame in a call stack of a state. Will be used by a next patch. The `state->insn_idx = env->insn_idx;` assignment in the do_check() allows to use frame_insn_idx with env->cur_state. At the moment bpf_verifier_state->insn_idx is set when new cached state is added in is_state_visited() and accessed only in the contexts when the state is already in the cache. Hence this assignment does not change verifier behaviour. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:42 -07:00
Eduard Zingerman	96c6aa4c63	bpf: compute SCCs in program control flow graph Compute strongly connected components in the program CFG. Assign an SCC number to each instruction, recorded in env->insn_aux[*].scc. Use Tarjan's algorithm for SCC computation adapted to run non-recursively. For debug purposes print out computed SCCs as a part of full program dump in compute_live_registers() at log level 2, e.g.: func#0 @0 Live regs before insn: 0: .......... (b4) w6 = 10 2 1: ......6... (18) r1 = 0xffff88810bbb5565 2 3: .1....6... (b4) w2 = 2 2 4: .12...6... (85) call bpf_trace_printk#6 2 5: ......6... (04) w6 += -1 2 6: ......6... (56) if w6 != 0x0 goto pc-6 7: .......... (b4) w6 = 5 1 8: ......6... (18) r1 = 0xffff88810bbb5567 1 10: .1....6... (b4) w2 = 2 1 11: .12...6... (85) call bpf_trace_printk#6 1 12: ......6... (04) w6 += -1 1 13: ......6... (56) if w6 != 0x0 goto pc-6 14: .......... (b4) w0 = 0 15: 0......... (95) exit ^^^ SCC number for the instruction Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:42 -07:00
Eduard Zingerman	baaebe0928	Revert "bpf: use common instruction history across all states" This reverts commit `96a30e469c`. Next patches in the series modify propagate_precision() to allow arbitrary starting state. Precision propagation requires access to jump history, and arbitrary states represent history not belonging to `env->cur_state`. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:42 -07:00
Luis Gerhorst	d6f1c85f22	bpf: Fall back to nospec for Spectre v1 This implements the core of the series and causes the verifier to fall back to mitigating Spectre v1 using speculation barriers. The approach was presented at LPC'24 [1] and RAID'24 [2]. If we find any forbidden behavior on a speculative path, we insert a nospec (e.g., lfence speculation barrier on x86) before the instruction and stop verifying the path. While verifying a speculative path, we can furthermore stop verification of that path whenever we encounter a nospec instruction. A minimal example program would look as follows: A = true B = true if A goto e f() if B goto e unsafe() e: exit There are the following speculative and non-speculative paths (`cur->speculative` and `speculative` referring to the value of the push_stack() parameters): - A = true - B = true - if A goto e - A && !cur->speculative && !speculative - exit - !A && !cur->speculative && speculative - f() - if B goto e - B && cur->speculative && !speculative - exit - !B && cur->speculative && speculative - unsafe() If f() contains any unsafe behavior under Spectre v1 and the unsafe behavior matches `state->speculative && error_recoverable_with_nospec(err)`, do_check() will now add a nospec before f() instead of rejecting the program: A = true B = true if A goto e nospec f() if B goto e unsafe() e: exit Alternatively, the algorithm also takes advantage of nospec instructions inserted for other reasons (e.g., Spectre v4). Taking the program above as an example, speculative path exploration can stop before f() if a nospec was inserted there because of Spectre v4 sanitization. In this example, all instructions after the nospec are dead code (and with the nospec they are also dead code speculatively). For this, it relies on the fact that speculation barriers generally prevent all later instructions from executing if the speculation was not correct: * On Intel x86_64, lfence acts as full speculation barrier, not only as a load fence [3]: An LFENCE instruction or a serializing instruction will ensure that no later instructions execute, even speculatively, until all prior instructions complete locally. [...] Inserting an LFENCE instruction after a bounds check prevents later operations from executing before the bound check completes. This was experimentally confirmed in [4]. * On AMD x86_64, lfence is dispatch-serializing [5] (requires MSR C001_1029[1] to be set if the MSR is supported, this happens in init_amd()). AMD further specifies "A dispatch serializing instruction forces the processor to retire the serializing instruction and all previous instructions before the next instruction is executed" [8]. As dispatch is not specific to memory loads or branches, lfence therefore also affects all instructions there. Also, if retiring a branch means it's PC change becomes architectural (should be), this means any "wrong" speculation is aborted as required for this series. * ARM's SB speculation barrier instruction also affects "any instruction that appears later in the program order than the barrier" [6]. * PowerPC's barrier also affects all subsequent instructions [7]: [...] executing an ori R31,R31,0 instruction ensures that all instructions preceding the ori R31,R31,0 instruction have completed before the ori R31,R31,0 instruction completes, and that no subsequent instructions are initiated, even out-of-order, until after the ori R31,R31,0 instruction completes. The ori R31,R31,0 instruction may complete before storage accesses associated with instructions preceding the ori R31,R31,0 instruction have been performed Regarding the example, this implies that `if B goto e` will not execute before `if A goto e` completes. Once `if A goto e` completes, the CPU should find that the speculation was wrong and continue with `exit`. If there is any other path that leads to `if B goto e` (and therefore `unsafe()`) without going through `if A goto e`, then a nospec will still be needed there. However, this patch assumes this other path will be explored separately and therefore be discovered by the verifier even if the exploration discussed here stops at the nospec. This patch furthermore has the unfortunate consequence that Spectre v1 mitigations now only support architectures which implement BPF_NOSPEC. Before this commit, Spectre v1 mitigations prevented exploits by rejecting the programs on all architectures. Because some JITs do not implement BPF_NOSPEC, this patch therefore may regress unpriv BPF's security to a limited extent: * The regression is limited to systems vulnerable to Spectre v1, have unprivileged BPF enabled, and do NOT emit insns for BPF_NOSPEC. The latter is not the case for x86 64- and 32-bit, arm64, and powerpc 64-bit and they are therefore not affected by the regression. According to commit `a6f6a95f25` ("LoongArch, bpf: Fix jit to skip speculation barrier opcode"), LoongArch is not vulnerable to Spectre v1 and therefore also not affected by the regression. * To the best of my knowledge this regression may therefore only affect MIPS. This is deemed acceptable because unpriv BPF is still disabled there by default. As stated in a previous commit, BPF_NOSPEC could be implemented for MIPS based on GCC's speculation_barrier implementation. * It is unclear which other architectures (besides x86 64- and 32-bit, ARM64, PowerPC 64-bit, LoongArch, and MIPS) supported by the kernel are vulnerable to Spectre v1. Also, it is not clear if barriers are available on these architectures. Implementing BPF_NOSPEC on these architectures therefore is non-trivial. Searching GCC and the kernel for speculation barrier implementations for these architectures yielded no result. * If any of those regressed systems is also vulnerable to Spectre v4, the system was already vulnerable to Spectre v4 attacks based on unpriv BPF before this patch and the impact is therefore further limited. As an alternative to regressing security, one could still reject programs if the architecture does not emit BPF_NOSPEC (e.g., by removing the empty BPF_NOSPEC-case from all JITs except for LoongArch where it appears justified). However, this will cause rejections on these archs that are likely unfounded in the vast majority of cases. In the tests, some are now successful where we previously had a false-positive (i.e., rejection). Change them to reflect where the nospec should be inserted (using __xlated_unpriv) and modify the error message if the nospec is able to mitigate a problem that previously shadowed another problem (in that case __xlated_unpriv does not work, therefore just add a comment). Define SPEC_V1 to avoid duplicating this ifdef whenever we check for nospec insns using __xlated_unpriv, define it here once. This also improves readability. PowerPC can probably also be added here. However, omit it for now because the BPF CI currently does not include a test. Limit it to EPERM, EACCES, and EINVAL (and not everything except for EFAULT and ENOMEM) as it already has the desired effect for most real-world programs. Briefly went through all the occurrences of EPERM, EINVAL, and EACCESS in verifier.c to validate that catching them like this makes sense. Thanks to Dustin for their help in checking the vendor documentation. [1] https://lpc.events/event/18/contributions/1954/ ("Mitigating Spectre-PHT using Speculation Barriers in Linux eBPF") [2] https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and Precise Spectre Defenses for Untrusted Linux Kernel Extensions") [3] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/runtime-speculative-side-channel-mitigations.html ("Managed Runtime Speculative Execution Side Channel Mitigations") [4] https://dl.acm.org/doi/pdf/10.1145/3359789.3359837 ("Speculator: a tool to analyze speculative execution attacks and mitigations" - Section 4.6 "Stopping Speculative Execution") [5] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/software-techniques-for-managing-speculation.pdf ("White Paper - SOFTWARE TECHNIQUES FOR MANAGING SPECULATION ON AMD PROCESSORS - REVISION 5.09.23") [6] https://developer.arm.com/documentation/ddi0597/2020-12/Base-Instructions/SB--Speculation-Barrier- ("SB - Speculation Barrier - Arm Armv8-A A32/T32 Instruction Set Architecture (2020-12)") [7] https://wiki.raptorcs.com/w/images/5/5f/OPF_PowerISA_v3.1C.pdf ("Power ISA™ - Version 3.1C - May 26, 2024 - Section 9.2.1 of Book III") [8] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf ("AMD64 Architecture Programmer’s Manual Volumes 1–5 - Revision 4.08 - April 2024 - 7.6.4 Serializing Instructions") Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Henriette Herzog <henriette.herzog@rub.de> Cc: Dustin Nguyen <nguyen@cs.fau.de> Cc: Maximilian Ott <ott@cs.fau.de> Cc: Milan Stephan <milan.stephan@fau.de> Link: https://lore.kernel.org/r/20250603212428.338473-1-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-09 20:11:10 -07:00
Luis Gerhorst	9124a45080	bpf: Rename sanitize_stack_spill to nospec_result This is made to clarify that this flag will cause a nospec to be added after this insn and can therefore be relied upon to reduce speculative path analysis. Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Henriette Herzog <henriette.herzog@rub.de> Cc: Maximilian Ott <ott@cs.fau.de> Cc: Milan Stephan <milan.stephan@fau.de> Link: https://lore.kernel.org/r/20250603212024.338154-1-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-09 20:11:10 -07:00
Luis Gerhorst	dff883d9e9	bpf, arm64, powerpc: Change nospec to include v1 barrier This changes the semantics of BPF_NOSPEC (previously a v4-only barrier) to always emit a speculation barrier that works against both Spectre v1 AND v4. If mitigation is not needed on an architecture, the backend should set bpf_jit_bypass_spec_v4/v1(). As of now, this commit only has the user-visible implication that unpriv BPF's performance on PowerPC is reduced. This is the case because we have to emit additional v1 barrier instructions for BPF_NOSPEC now. This commit is required for a future commit to allow us to rely on BPF_NOSPEC for Spectre v1 mitigation. As of this commit, the feature that nospec acts as a v1 barrier is unused. Commit `f5e81d1117` ("bpf: Introduce BPF nospec instruction for mitigating Spectre v4") noted that mitigation instructions for v1 and v4 might be different on some archs. While this would potentially offer improved performance on PowerPC, it was dismissed after the following considerations: * Only having one barrier simplifies the verifier and allows us to easily rely on v4-induced barriers for reducing the complexity of v1-induced speculative path verification. * For the architectures that implemented BPF_NOSPEC, only PowerPC has distinct instructions for v1 and v4. Even there, some insns may be shared between the barriers for v1 and v4 (e.g., 'ori 31,31,0' and 'sync'). If this is still found to impact performance in an unacceptable way, BPF_NOSPEC can be split into BPF_NOSPEC_V1 and BPF_NOSPEC_V4 later. As an optimization, we can already skip v1/v4 insns from being emitted for PowerPC with this setup if bypass_spec_v1/v4 is set. Vulnerability-status for BPF_NOSPEC-based Spectre mitigations (v4 as of this commit, v1 in the future) is therefore: * x86 (32-bit and 64-bit), ARM64, and PowerPC (64-bit): Mitigated - This patch implements BPF_NOSPEC for these architectures. The previous v4-only version was supported since commit `f5e81d1117` ("bpf: Introduce BPF nospec instruction for mitigating Spectre v4") and commit `b7540d6250` ("powerpc/bpf: Emit stf barrier instruction sequences for BPF_NOSPEC"). * LoongArch: Not Vulnerable - Commit `a6f6a95f25` ("LoongArch, bpf: Fix jit to skip speculation barrier opcode") is the only other past commit related to BPF_NOSPEC and indicates that the insn is not required there. * MIPS: Vulnerable (if unprivileged BPF is enabled) - Commit a6f6a95f2580 ("LoongArch, bpf: Fix jit to skip speculation barrier opcode") indicates that it is not vulnerable, but this contradicts the kernel and Debian documentation. Therefore, I assume that there exist vulnerable MIPS CPUs (but maybe not from Loongson?). In the future, BPF_NOSPEC could be implemented for MIPS based on the GCC speculation_barrier [1]. For now, we rely on unprivileged BPF being disabled by default. * Other: Unknown - To the best of my knowledge there is no definitive information available that indicates that any other arch is vulnerable. They are therefore left untouched (BPF_NOSPEC is not implemented, but bypass_spec_v1/v4 is also not set). I did the following testing to ensure the insn encoding is correct: * ARM64: * 'dsb nsh; isb' was successfully tested with the BPF CI in [2] * 'sb' locally using QEMU v7.2.15 -cpu max (emitted sb insn is executed for example with './test_progs -t verifier_array_access') * PowerPC: The following configs were tested locally with ppc64le QEMU v8.2 '-machine pseries -cpu POWER9': * STF_BARRIER_EIEIO + CONFIG_PPC_BOOK32_64 * STF_BARRIER_SYNC_ORI (forced on) + CONFIG_PPC_BOOK32_64 * STF_BARRIER_FALLBACK (forced on) + CONFIG_PPC_BOOK32_64 * CONFIG_PPC_E500 (forced on) + STF_BARRIER_EIEIO * CONFIG_PPC_E500 (forced on) + STF_BARRIER_SYNC_ORI (forced on) * CONFIG_PPC_E500 (forced on) + STF_BARRIER_FALLBACK (forced on) * CONFIG_PPC_E500 (forced on) + STF_BARRIER_NONE (forced on) Most of those cobinations should not occur in practice, but I was not able to get an PPC e6500 rootfs (for testing PPC_E500 without forcing it on). In any case, this should ensure that there are no unexpected conflicts between the insns when combined like this. Individual v1/v4 barriers were already emitted elsewhere. Hari's ack is for the PowerPC changes only. [1] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=29b74545531f6afbee9fc38c267524326dbfbedf ("MIPS: Add speculation_barrier support") [2] https://github.com/kernel-patches/bpf/pull/8576 Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Henriette Herzog <henriette.herzog@rub.de> Cc: Maximilian Ott <ott@cs.fau.de> Cc: Milan Stephan <milan.stephan@fau.de> Link: https://lore.kernel.org/r/20250603211703.337860-1-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-09 20:11:09 -07:00
Luis Gerhorst	03c68a0f8c	bpf, arm64, powerpc: Add bpf_jit_bypass_spec_v1/v4() JITs can set bpf_jit_bypass_spec_v1/v4() if they want the verifier to skip analysis/patching for the respective vulnerability. For v4, this will reduce the number of barriers the verifier inserts. For v1, it allows more programs to be accepted. The primary motivation for this is to not regress unpriv BPF's performance on ARM64 in a future commit where BPF_NOSPEC is also used against Spectre v1. This has the user-visible change that v1-induced rejections on non-vulnerable PowerPC CPUs are avoided. For now, this does not change the semantics of BPF_NOSPEC. It is still a v4-only barrier and must not be implemented if bypass_spec_v4 is always true for the arch. Changing it to a v1 AND v4-barrier is done in a future commit. As an alternative to bypass_spec_v1/v4, one could introduce NOSPEC_V1 AND NOSPEC_V4 instructions and allow backends to skip their lowering as suggested by commit `f5e81d1117` ("bpf: Introduce BPF nospec instruction for mitigating Spectre v4"). Adding bpf_jit_bypass_spec_v1/v4() was found to be preferable for the following reason: * bypass_spec_v1/v4 benefits non-vulnerable CPUs: Always performing the same analysis (not taking into account whether the current CPU is vulnerable), needlessly restricts users of CPUs that are not vulnerable. The only use case for this would be portability-testing, but this can later be added easily when needed by allowing users to force bypass_spec_v1/v4 to false. * Portability is still acceptable: Directly disabling the analysis instead of skipping the lowering of BPF_NOSPEC(_V1/V4) might allow programs on non-vulnerable CPUs to be accepted while the program will be rejected on vulnerable CPUs. With the fallback to speculation barriers for Spectre v1 implemented in a future commit, this will only affect programs that do variable stack-accesses or are very complex. For PowerPC, the SEC_FTR checking in bpf_jit_bypass_spec_v4() is based on the check that was previously located in the BPF_NOSPEC case. For LoongArch, it would likely be safe to set both bpf_jit_bypass_spec_v1() and _v4() according to commit a6f6a95f2580 ("LoongArch, bpf: Fix jit to skip speculation barrier opcode"). This is omitted here as I am unable to do any testing for LoongArch. Hari's ack concerns the PowerPC part only. Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Henriette Herzog <henriette.herzog@rub.de> Cc: Maximilian Ott <ott@cs.fau.de> Cc: Milan Stephan <milan.stephan@fau.de> Link: https://lore.kernel.org/r/20250603211318.337474-1-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-09 20:11:09 -07:00
Luis Gerhorst	6b84d7895d	bpf: Return -EFAULT on internal errors This prevents us from trying to recover from these on speculative paths in the future. Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Henriette Herzog <henriette.herzog@rub.de> Cc: Maximilian Ott <ott@cs.fau.de> Cc: Milan Stephan <milan.stephan@fau.de> Link: https://lore.kernel.org/r/20250603205800.334980-4-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-09 20:11:09 -07:00
Luis Gerhorst	fd508bde5d	bpf: Return -EFAULT on misconfigurations Mark these cases as non-recoverable to later prevent them from being caught when they occur during speculative path verification. Eduard writes [1]: The only pace I'm aware of that might act upon specific error code from verifier syscall is libbpf. Looking through libbpf code, it seems that this change does not interfere with libbpf. [1] https://lore.kernel.org/all/785b4531ce3b44a84059a4feb4ba458c68fce719.camel@gmail.com/ Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Henriette Herzog <henriette.herzog@rub.de> Cc: Maximilian Ott <ott@cs.fau.de> Cc: Milan Stephan <milan.stephan@fau.de> Link: https://lore.kernel.org/r/20250603205800.334980-3-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-09 20:11:09 -07:00
Luis Gerhorst	8b7df50fd4	bpf: Move insn if/else into do_check_insn() This is required to catch the errors later and fall back to a nospec if on a speculative path. Eliminate the regs variable as it is only used once and insn_idx is not modified in-between the definition and usage. Do not pass insn but compute it in the function itself. As Eduard points out [1], insn is assumed to correspond to env->insn_idx in many places (e.g, __check_reg_arg()). Move code into do_check_insn(), replace * "continue" with "return 0" after modifying insn_idx * "goto process_bpf_exit" with "return PROCESS_BPF_EXIT" * "goto process_bpf_exit_full" with "return process_bpf_exit_full()" * "do_print_state = " with "*do_print_state = " [1] https://lore.kernel.org/all/293dbe3950a782b8eb3b87b71d7a967e120191fd.camel@gmail.com/ Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Henriette Herzog <henriette.herzog@rub.de> Cc: Maximilian Ott <ott@cs.fau.de> Cc: Milan Stephan <milan.stephan@fau.de> Link: https://lore.kernel.org/r/20250603205800.334980-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-09 20:11:09 -07:00
Tao Chen	2bc0575fec	bpf: Add cookie in fdinfo for raw_tp Add cookie in fdinfo for raw_tp, the info as follows: link_type: raw_tracepoint link_id: 31 prog_tag: 9dfdf8ef453843bf prog_id: 32 tp_name: sys_enter cookie: 23925373020405760 Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606165818.3394397-5-chen.dylane@linux.dev	2025-06-09 16:45:17 -07:00
Tao Chen	380cb6dfa2	bpf: Add cookie in fdinfo for tracing Add cookie in fdinfo for tracing, the info as follows: link_type: tracing link_id: 6 prog_tag: 9dfdf8ef453843bf prog_id: 35 attach_type: 25 target_obj_id: 1 target_btf_id: 60355 cookie: 9007199254740992 Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606165818.3394397-4-chen.dylane@linux.dev	2025-06-09 16:45:17 -07:00
Tao Chen	c7beb48344	bpf: Add cookie to tracing bpf_link_info bpf_tramp_link includes cookie info, we can add it in bpf_link_info. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606165818.3394397-1-chen.dylane@linux.dev	2025-06-09 16:45:17 -07:00
Ihor Solodrai	5534e58f2e	bpf: Make reg_not_null() true for CONST_PTR_TO_MAP When reg->type is CONST_PTR_TO_MAP, it can not be null. However the verifier explores the branches under rX == 0 in check_cond_jmp_op() even if reg->type is CONST_PTR_TO_MAP, because it was not checked for in reg_not_null(). Fix this by adding CONST_PTR_TO_MAP to the set of types that are considered non nullable in reg_not_null(). An old "unpriv: cmp map pointer with zero" selftest fails with this change, because now early out correctly triggers in check_cond_jmp_op(), making the verification to pass. In practice verifier may allow pointer to null comparison in unpriv, since in many cases the relevant branch and comparison op are removed as dead code. So change the expected test result to __success_unpriv. Signed-off-by: Ihor Solodrai <isolodrai@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250609183024.359974-2-isolodrai@meta.com	2025-06-09 16:42:04 -07:00
Tao Chen	97ebac5886	bpf: Add show_fdinfo for perf_event After commit `1b715e1b0e` ("bpf: Support ->fill_link_info for perf_event") add perf_event info, we can also show the info with the method of cat /proc/[fd]/fdinfo. kprobe fdinfo: link_type: perf link_id: 10 prog_tag: bcf7977d3b93787c prog_id: 20 name: bpf_fentry_test1 offset: 0x0 missed: 0 addr: 0xffffffffa28a2904 event_type: kprobe cookie: 3735928559 uprobe fdinfo: link_type: perf link_id: 13 prog_tag: bcf7977d3b93787c prog_id: 21 name: /proc/self/exe offset: 0x63dce4 ref_ctr_offset: 0x33eee2a event_type: uprobe cookie: 3735928559 tracepoint fdinfo: link_type: perf link_id: 11 prog_tag: bcf7977d3b93787c prog_id: 22 tp_name: sched_switch event_type: tracepoint cookie: 3735928559 perf_event fdinfo: link_type: perf link_id: 12 prog_tag: bcf7977d3b93787c prog_id: 23 type: 1 config: 2 event_type: event cookie: 3735928559 Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606150258.3385166-1-chen.dylane@linux.dev	2025-06-09 16:40:13 -07:00
Yonghong Song	1209339844	bpf: Implement mprog API on top of existing cgroup progs Current cgroup prog ordering is appending at attachment time. This is not ideal. In some cases, users want specific ordering at a particular cgroup level. To address this, the existing mprog API seems an ideal solution with supporting BPF_F_BEFORE and BPF_F_AFTER flags. But there are a few obstacles to directly use kernel mprog interface. Currently cgroup bpf progs already support prog attach/detach/replace and link-based attach/detach/replace. For example, in struct bpf_prog_array_item, the cgroup_storage field needs to be together with bpf prog. But the mprog API struct bpf_mprog_fp only has bpf_prog as the member, which makes it difficult to use kernel mprog interface. In another case, the current cgroup prog detach tries to use the same flag as in attach. This is different from mprog kernel interface which uses flags passed from user space. So to avoid modifying existing behavior, I made the following changes to support mprog API for cgroup progs: - The support is for prog list at cgroup level. Cross-level prog list (a.k.a. effective prog list) is not supported. - Previously, BPF_F_PREORDER is supported only for prog attach, now BPF_F_PREORDER is also supported by link-based attach. - For attach, BPF_F_BEFORE/BPF_F_AFTER/BPF_F_ID/BPF_F_LINK is supported similar to kernel mprog but with different implementation. - For detach and replace, use the existing implementation. - For attach, detach and replace, the revision for a particular prog list, associated with a particular attach type, will be updated by increasing count by 1. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606163141.2428937-1-yonghong.song@linux.dev	2025-06-09 16:28:28 -07:00
Yonghong Song	9b8367b604	cgroup: Add bpf prog revisions to struct cgroup_bpf One of key items in mprog API is revision for prog list. The revision number will be increased if the prog list changed, e.g., attach, detach or replace. Add 'revisions' field to struct cgroup_bpf, representing revisions for all cgroup related attachment types. The initial revision value is set to 1, the same as kernel mprog implementations. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606163136.2428732-1-yonghong.song@linux.dev	2025-06-09 16:17:11 -07:00
Luis Gerhorst	97744b4971	bpf: Clarify sanitize_check_bounds() As is, it appears as if pointer arithmetic is allowed for everything except PTR_TO_{STACK,MAP_VALUE} if one only looks at sanitize_check_bounds(). However, this is misleading as the function only works together with retrieve_ptr_limit() and the two must be kept in sync. This patch documents the interdependency and adds a check to ensure they stay in sync. adjust_ptr_min_max_vals(): Because the preceding switch returns -EACCES for every opcode except for ADD/SUB, the sanitize_needed() following the sanitize_check_bounds() call is always true if reached. This means, unless sanitize_check_bounds() detected that the pointer goes OOB because of the ADD/SUB and returns -EACCES, sanitize_ptr_alu() always executes after sanitize_check_bounds(). The following shows that this also implies that retrieve_ptr_limit() runs in all relevant cases. Note that there are two calls to sanitize_ptr_alu(), these are simply needed to easily calculate the correct alu_limit as explained in commit 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask"). The truncation-simulation is already performed on the first call. In the second sanitize_ptr_alu(commit_window = true), we always run retrieve_ptr_limit(), unless: * can_skip_alu_sanititation() is true, notably `BPF_SRC(insn->code) == BPF_K`. BPF_K is fine because it means that there is no scalar register (which could be subject to speculative scalar confusion due to Spectre v4) that goes into the ALU operation. The pointer register can not be subject to v4-based value confusion due to the nospec added. Thus, in this case it would have been fine to also skip sanitize_check_bounds(). * If we are on a speculative path (`vstate->speculative`) and in the second "commit" phase, sanitize_ptr_alu() always just returns 0. This makes sense because there are no ALU sanitization limits to be learned from speculative paths. Furthermore, because the sanitization will ensure that pointer arithmetic stays in (architectural) bounds, the sanitize_check_bounds() on the speculative path could also be skipped. The second case needs more attention: Assume we have some ALU operation that is used with scalars architecturally, but with a non-PTR_TO_{STACK,MAP_VALUE} pointer (e.g., PTR_TO_PACKET) speculatively. It might appear as if this would allow an unsanitized pointer ALU operations, but this can not happen because one of the following two always holds: * The type mismatch stems from Spectre v4, then it is prevented by a nospec after the possibly-bypassed store involving the pointer. There is no speculative path simulated for this case thus it never happens. * The type mismatch stems from a Spectre v1 gadget like the following: r1 = slow(0) r4 = fast(0) r3 = SCALAR // Spectre v4 scalar confusion if (r1) { r2 = PTR_TO_PACKET } else { r2 = 42 } if (r4) { r2 += r3 r2 } If `r2 = PTR_TO_PACKET` is indeed dead code, it will be sanitized to `goto -1` (as is the case for the r4-if block). If it is not (e.g., if `r1 = r4 = 1` is possible), it will also be explored on an architectural path and retrieve_ptr_limit() will reject it. To summarize, the exception for `vstate->speculative` is safe. Back to retrieve_ptr_limit(): It only allows the ALU operation if the involved pointer register (can be either source or destination for ADD) is PTR_TO_STACK or PTR_TO_MAP_VALUE. Otherwise, it returns -EOPNOTSUPP. Therefore, sanitize_check_bounds() returning 0 for non-PTR_TO_{STACK,MAP_VALUE} is fine because retrieve_ptr_limit() also runs for all relevant cases and prevents unsafe operations. To summarize, we allow unsanitized pointer arithmetic with 64-bit ADD/SUB for the following instructions if the requirements from retrieve_ptr_limit() AND sanitize_check_bounds() hold: ptr -=/+= imm32 (i.e. `BPF_SRC(insn->code) == BPF_K`) * PTR_TO_{STACK,MAP_VALUE} -= scalar * PTR_TO_{STACK,MAP_VALUE} += scalar * scalar += PTR_TO_{STACK,MAP_VALUE} To document the interdependency between sanitize_check_bounds() and retrieve_ptr_limit(), add a verifier_bug_if() to make sure they stay in sync. Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/CAP01T76HZ+s5h+_REqRFkRjjoKwnZZn9YswpSVinGicah1pGJw@mail.gmail.com/ Link: https://lore.kernel.org/bpf/CAP01T75oU0zfZCiymEcH3r-GQ5A6GOc6GmYzJEnMa3=53XuUQQ@mail.gmail.com/ Link: https://lore.kernel.org/r/20250603204557.332447-1-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-05 13:20:07 -07:00
Tao Chen	2fe1c59347	bpf: Add cookie to raw_tp bpf_link_info After commit `68ca5d4eeb` ("bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs"), we can show the cookie in bpf_link_info like kprobe etc. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250603154309.3063644-1-chen.dylane@linux.dev	2025-06-05 11:44:52 -07:00
Linus Torvalds	16b70698aa	Merge tag 'sched_ext-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "Two fixes in the built-in idle selection helpers: - Fix prev_cpu handling to guarantee that idle selection never returns a CPU that's not allowed - Skip cross-node search when !NUMA which could lead to infinite looping due to a bug in NUMA iterator" * tag 'sched_ext-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: idle: Skip cross-node search with !CONFIG_NUMA sched_ext: idle: Properly handle invalid prev_cpu during idle selection	2025-06-04 12:07:16 -07:00
Linus Torvalds	70087d2200	Merge tag 'trace-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix UAF in module unload in ftrace when there's a bug in the module If a module is buggy and triggers ftrace_disable which is set when an anomaly is detected, when it gets unloaded it doesn't free the hooks into kallsyms, and when a kallsyms lookup is performed it may access the mod->modname field and crash via UAF. Fix this by still freeing the mod_maps that are attached to kallsyms on module unload regardless if ftrace_disable is set or not. - Do not bother allocating mod_maps for kallsyms if ftrace_disable is set - Remove unused trace events When a trace event or tracepoint is created but not used, it still creates the code and data structures needed for that trace event. This just wastes memory. Remove the trace events that are created but not used. This does not remove trace events that are created but are not used due configs not being set. That will be handled later. This only removes events that have no user under any config. * tag 'trace-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fsdax: Remove unused trace events for dax insert mapping genirq/matrix: Remove unused irq_matrix_alloc_reserved tracepoint xdp: Remove unused mem_return_failed event ftrace: Don't allocate ftrace module map if ftrace is disabled ftrace: Fix UAF when lookup kallsym after ftrace disabled	2025-06-03 14:21:31 -07:00
Linus Torvalds	def5b099a6	Merge tag 'cgroup-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "The rstat per-subsystem split change skipped per-cpu allocation on UP configs; however even on UP, depending on config options, the size of the percpu struct may not be zero leading to crashes. Fix it by conditionalizing the per-cpu area allocation and usage on the size of the per-cpu struct" * tag 'cgroup-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: adjust criteria for rstat subsystem cpu lock access	2025-06-03 14:12:36 -07:00
Andrea Righi	9960be72a5	sched_ext: idle: Skip cross-node search with !CONFIG_NUMA In the idle CPU selection logic, attempting cross-node searches adds unnecessary complexity when CONFIG_NUMA is disabled. Since there's no meaningful concept of nodes in this case, simplify the logic by restricting the idle CPU search to the current node only. Fixes: `48849271e6` ("sched_ext: idle: Per-node idle cpumasks") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-03 08:22:27 -10:00
Linus Torvalds	c8be542408	Merge tag 'modules-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux Pull module updates from Petr Pavlu: - Make .static_call_sites in modules read-only after init The .static_call_sites sections in modules have been made read-only after init to avoid any (non-)accidental modifications, similarly to how they are read-only after init in vmlinux - The rest are minor cleanups * tag 'modules-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: module: Remove outdated comment about text_size module: Make .static_call_sites read-only after init module: Add a separate function to mark sections as read-only after init module: Constify parameters of module_enforce_rwx_sections()	2025-06-02 17:35:06 -07:00
Linus Torvalds	fd1f847350	Merge tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "zram: support algorithm-specific parameters" from Sergey Senozhatsky adds infrastructure for passing algorithm-specific parameters into zram. A single parameter `winbits' is implemented at this time. - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg charging nmi-safe, which is required by BFP, which can operate in NMI context. - "Some random fixes and cleanup to shmem" from Kemeng Shi implements small fixes and cleanups in the shmem code. - "Skip mm selftests instead when kernel features are not present" from Zi Yan fixes some issues in the MM selftest code. - "mm/damon: build-enable essential DAMON components by default" from SeongJae Park reworks DAMON Kconfig to make it easier to enable CONFIG_DAMON. - "sched/numa: add statistics of numa balance task migration" from Libo Chen adds more info into sysfs and procfs files to improve visibility into the NUMA balancer's task migration activity. - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown provides various updates to some of the MM selftests to make them play better with the overall containing framework. * tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits) mm/khugepaged: clean up refcount check using folio_expected_ref_count() selftests/mm: fix test result reporting in gup_longterm selftests/mm: report unique test names for each cow test selftests/mm: add helper for logging test start and results selftests/mm: use standard ksft_finished() in cow and gup_longterm selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled sched/numa: add statistics of numa balance task sched/numa: fix task swap by skipping kernel threads tools/testing: check correct variable in open_procmap() tools/testing/vma: add missing function stub mm/gup: update comment explaining why gup_fast() disables IRQs selftests/mm: two fixes for the pfnmap test mm/khugepaged: fix race with folio split/free using temporary reference mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order mmu_notifiers: remove leftover stub macros selftests/mm: deduplicate test names in madv_populate kcov: rust: add flags for KCOV with Rust mm: rust: make CONFIG_MMU ifdefs more narrow mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables() mm/damon/Kconfig: enable CONFIG_DAMON by default ...	2025-06-02 16:00:26 -07:00
Linus Torvalds	7f9039c524	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull more kvm updates from Paolo Bonzini: Generic: - Clean up locking of all vCPUs for a VM by using the _nest_lock() family of functions, and move duplicated code to virt/kvm/. kernel/ patches acked by Peter Zijlstra - Add MGLRU support to the access tracking perf test ARM fixes: - Make the irqbypass hooks resilient to changes in the GSI<->MSI routing, avoiding behind stale vLPI mappings being left behind. The fix is to resolve the VGIC IRQ using the host IRQ (which is stable) and nuking the vLPI mapping upon a routing change - Close another VGIC race where vCPU creation races with VGIC creation, leading to in-flight vCPUs entering the kernel w/o private IRQs allocated - Fix a build issue triggered by the recently added workaround for Ampere's AC04_CPU_23 erratum - Correctly sign-extend the VA when emulating a TLBI instruction potentially targeting a VNCR mapping - Avoid dereferencing a NULL pointer in the VGIC debug code, which can happen if the device doesn't have any mapping yet s390: - Fix interaction between some filesystems and Secure Execution - Some cleanups and refactorings, preparing for an upcoming big series x86: - Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to fix a race between AP destroy and VMRUN - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM - Refine and harden handling of spurious faults - Add support for ALLOWED_SEV_FEATURES - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features that utilize those bits - Don't account temporary allocations in sev_send_update_data() - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between SVM and VMX - Advertise support to userspace for WRMSRNS and PREFETCHI - Rescan I/O APIC routes after handling EOI that needed to be intercepted due to the old/previous routing, but not the new/current routing - Add a module param to control and enumerate support for device posted interrupts - Fix a potential overflow with nested virt on Intel systems running 32-bit kernels - Flush shadow VMCSes on emergency reboot - Add support for SNP to the various SEV selftests - Add a selftest to verify fastops instructions via forced emulation - Refine and optimize KVM's software processing of the posted interrupt bitmap, and share the harvesting code between KVM and the kernel's Posted MSI handler" tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits) rtmutex_api: provide correct extern functions KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race KVM: arm64: Unmap vLPIs affected by changes to GSI routing information KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding() KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock KVM: arm64: Use lock guard in vgic_v4_set_forwarding() KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue KVM: s390: Simplify and move pv code KVM: s390: Refactor and split some gmap helpers KVM: s390: Remove unneeded srcu lock s390: Remove unneeded includes s390/uv: Improve splitting of large folios that cannot be split while dirty s390/uv: Always return 0 from s390_wiggle_split_folio() if successful s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful rust: add helper for mutex_trylock RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation ...	2025-06-02 12:24:58 -07:00
Ye Bin	5834a59738	ftrace: Don't allocate ftrace module map if ftrace is disabled If ftrace is disabled, it is meaningless to allocate a module map. Add a check in allocate_ftrace_mod_map() to not allocate if ftrace is disabled. Link: https://lore.kernel.org/20250529111955.2349189-3-yebin@huaweicloud.com Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-06-02 13:12:26 -04:00
Ye Bin	f914b52c37	ftrace: Fix UAF when lookup kallsym after ftrace disabled The following issue happens with a buggy module: BUG: unable to handle page fault for address: ffffffffc05d0218 PGD 1bd66f067 P4D 1bd66f067 PUD 1bd671067 PMD 101808067 PTE 0 Oops: Oops: 0000 [#1] SMP KASAN PTI Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS RIP: 0010:sized_strscpy+0x81/0x2f0 RSP: 0018:ffff88812d76fa08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffffc0601010 RCX: dffffc0000000000 RDX: 0000000000000038 RSI: dffffc0000000000 RDI: ffff88812608da2d RBP: 8080808080808080 R08: ffff88812608da2d R09: ffff88812608da68 R10: ffff88812608d82d R11: ffff88812608d810 R12: 0000000000000038 R13: ffff88812608da2d R14: ffffffffc05d0218 R15: fefefefefefefeff FS: 00007fef552de740(0000) GS:ffff8884251c7000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffc05d0218 CR3: 00000001146f0000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ftrace_mod_get_kallsym+0x1ac/0x590 update_iter_mod+0x239/0x5b0 s_next+0x5b/0xa0 seq_read_iter+0x8c9/0x1070 seq_read+0x249/0x3b0 proc_reg_read+0x1b0/0x280 vfs_read+0x17f/0x920 ksys_read+0xf3/0x1c0 do_syscall_64+0x5f/0x2e0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The above issue may happen as follows: (1) Add kprobe tracepoint; (2) insmod test.ko; (3) Module triggers ftrace disabled; (4) rmmod test.ko; (5) cat /proc/kallsyms; --> Will trigger UAF as test.ko already removed; ftrace_mod_get_kallsym() ... strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN); ... The problem is when a module triggers an issue with ftrace and sets ftrace_disable. The ftrace_disable is set when an anomaly is discovered and to prevent any more damage, ftrace stops all text modification. The issue that happened was that the ftrace_disable stops more than just the text modification. When a module is loaded, its init functions can also be traced. Because kallsyms deletes the init functions after a module has loaded, ftrace saves them when the module is loaded and function tracing is enabled. This allows the output of the function trace to show the init function names instead of just their raw memory addresses. When a module is removed, ftrace_release_mod() is called, and if ftrace_disable is set, it just returns without doing anything more. The problem here is that it leaves the mod_list still around and if kallsyms is called, it will call into this code and access the module memory that has already been freed as it will return: strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN); Where the "mod" no longer exists and triggers a UAF bug. Link: https://lore.kernel.org/all/20250523135452.626d8dcd@gandalf.local.home/ Cc: stable@vger.kernel.org Fixes: `aba4b5c22c` ("ftrace: Save module init functions kallsyms symbols for tracing") Link: https://lore.kernel.org/20250529111955.2349189-2-yebin@huaweicloud.com Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-06-02 13:09:48 -04:00
Paolo Bonzini	438e22801b	rtmutex_api: provide correct extern functions Commit `fb49f07ba1` ("locking/mutex: implement mutex_lock_killable_nest_lock") changed the set of functions that mutex.c defines when CONFIG_DEBUG_LOCK_ALLOC is set. - it removed the "extern" declaration of mutex_lock_killable_nested from include/linux/mutex.h, and replaced it with a macro since it could be treated as a special case of _mutex_lock_killable. It also removed a definition of the function in kernel/locking/mutex.c. - likewise, it replaced mutex_trylock() with the more generic mutex_trylock_nest_lock() and replaced mutex_trylock() with a macro. However, it left the old definitions in place in kernel/locking/rtmutex_api.c, which causes failures when building with CONFIG_RT_MUTEXES=y. Bring over the changes. Fixes: `fb49f07ba1` ("locking/mutex: implement mutex_lock_killable_nest_lock") Reported-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-06-02 03:05:09 -04:00
Chen Yu	ad6b26b6a0	sched/numa: add statistics of numa balance task On systems with NUMA balancing enabled, it has been found that tracking task activities resulting from NUMA balancing is beneficial. NUMA balancing employs two mechanisms for task migration: one is to migrate a task to an idle CPU within its preferred node, and the other is to swap tasks located on different nodes when they are on each other's preferred nodes. The kernel already provides NUMA page migration statistics in /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However, it lacks statistics regarding task migration and swapping. Therefore, relevant counts for task migration and swapping should be added. The following two new fields: numa_task_migrated numa_task_swapped will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched and /proc/vmstat. Introducing both per-task and per-memory cgroup (memcg) NUMA balancing statistics facilitates a rapid evaluation of the performance and resource utilization of the target workload. For instance, users can first identify the container with high NUMA balancing activity and then further pinpoint a specific task within that group, and subsequently adjust the memory policy for that task. In short, although it is possible to iterate through /proc/$pid/sched to locate the problematic task, the introduction of aggregated NUMA balancing activity for tasks within each memcg can assist users in identifying the task more efficiently through a divide-and-conquer approach. As Libo Chen pointed out, the memcg event relies on the text names in vmstat_text, and /proc/vmstat generates corresponding items based on vmstat_text. Thus, the relevant task migration and swapping events introduced in vmstat_text also need to be populated by count_vm_numa_event(), otherwise these values are zero in /proc/vmstat. In theory, task migration and swap events are part of the scheduler's activities. The reason for exposing them through the memory.stat/vmstat interface is that we already have NUMA balancing statistics in memory.stat/vmstat, and these events are closely related to each other. Following Shakeel's suggestion, we describe the end-to-end flow/story of all these events occurring on a timeline for future reference: The goal of NUMA balancing is to co-locate a task and its memory pages on the same NUMA node. There are two strategies: migrate the pages to the task's node, or migrate the task to the node where its pages reside. Suppose a task p1 is running on Node 0, but its pages are located on Node 1. NUMA page fault statistics for p1 reveal its "page footprint" across nodes. If NUMA balancing detects that most of p1's pages are on Node 1: 1.Page Migration Attempt: The Numa balance first tries to migrate p1's pages to Node 0. The numa_page_migrate counter increments. 2.Task Migration Strategies: After the page migration finishes, Numa balance checks every 1 second to see if p1 can be migrated to Node 1. Case 2.1: Idle CPU Available If Node 1 has an idle CPU, p1 is directly scheduled there. This event is logged as numa_task_migrated. Case 2.2: No Idle CPU (Task Swap) If all CPUs on Node1 are busy, direct migration could cause CPU contention or load imbalance. Instead: The Numa balance selects a candidate task p2 on Node 1 that prefers Node 0 (e.g., due to its own page footprint). p1 and p2 are swapped. This cross-node swap is recorded as numa_task_swapped. Link: https://lkml.kernel.org/r/d00edb12ba0f0de3c5222f61487e65f2ac58f5b1.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/7ef90a88602ed536be46eba7152ed0d33bad5790.1748002400.git.yu.c.chen@intel.com Signed-off-by: Chen Yu <yu.c.chen@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Ayush Jain <Ayush.jain3@amd.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Libo Chen <libo.chen@oracle.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-31 22:46:15 -07:00
Libo Chen	9709eb0f84	sched/numa: fix task swap by skipping kernel threads Patch series "sched/numa: add statistics of numa balance task migration", v6. Introduce task migration and swap statistics in the following places: /sys/fs/cgroup/{GROUP}/memory.stat /proc/{PID}/sched /proc/vmstat These statistics facilitate a rapid evaluation of the performance and resource utilization of the target workload. This patch (of 2): Task swapping is triggered when there are no idle CPUs in task A's preferred node. In this case, the NUMA load balancer chooses a task B on A's preferred node and swaps B with A. This helps improve NUMA locality without introducing load imbalance between nodes. In the current implementation, B's NUMA node preference is not mandatory. That is to say, a kernel thread might be incorrectly chosen as B. However, kernel thread and user space thread that does not have mm are not supposed to be covered by NUMA balancing because NUMA balancing only considers user pages via VMAs. According to Peter's suggestion for fixing this issue, we use PF_KTHREAD to skip the kernel thread. curr->mm is also checked because it is possible that user_mode_thread() might create a user thread without an mm. As per Prateek's analysis, after adding the PF_KTHREAD check, there is no need to further check the PF_IDLE flag: : - play_idle_precise() already ensures PF_KTHREAD is set before adding : PF_IDLE : : - cpu_startup_entry() is only called from the startup thread which : should be marked with PF_KTHREAD (based on my understanding looking at : commit `cff9b2332a` ("kernel/sched: Modify initial boot task idle : setup")) In summary, the check in task_numa_compare() now aligns with task_tick_numa(). Link: https://lkml.kernel.org/r/cover.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/43d68b356b25d124f0d222ebedf3859e86eefb9f.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/cover.1748002400.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/eaacc9c9bd37bac92d43a671867d85b2fdad3b06.1748002400.git.yu.c.chen@intel.com Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Libo Chen <libo.chen@oracle.com> Suggested-by: Michal Koutný <mkoutny@suse.com> Tested-by: Ayush Jain <Ayush.jain3@amd.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Aubrey Li <aubrey.li@intel.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-31 22:46:15 -07:00
Matthew Wilcox (Oracle)	acc53a0b4c	mm: rename page->index to page->__folio_index All users of page->index have been converted to not refer to it any more. Update a few pieces of documentation that were missed and prevent new users from appearing (or at least make them easy to grep for). Link: https://lkml.kernel.org/r/20250514181508.3019795-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-31 22:46:06 -07:00
Linus Torvalds	7d4e49a77d	Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking tasks's stack when it is blocked on a mutex. Lance's series extends this to semaphores - "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts - "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter - "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count - "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c - "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts * tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits) llist: make llist_add_batch() a static inline delayacct: remove redundant code and adjust indentation squashfs: add optional full compressed block caching crash_dump, nvme: select CONFIGFS_FS as built-in scripts/gdb/symbols: determine KASLR offset on s390 during early boot scripts/gdb/symbols: factor out pagination_off() scripts/gdb/symbols: factor out get_vmlinux() kernel/panic.c: format kernel-doc comments mailmap: update and consolidate Casey Connolly's name and email nilfs2: remove wbc->for_reclaim handling fork: define a local GFP_VMAP_STACK fork: check charging success before zeroing stack fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code fork: clean-up ifdef logic around stack allocation kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count x86/crash: make the page that stores the dm crypt keys inaccessible x86/crash: pass dm crypt keys to kdump kernel Revert "x86/mm: Remove unused __set_memory_prot()" crash_dump: retrieve dm crypt keys in kdump kernel ...	2025-05-31 19:12:53 -07:00
Linus Torvalds	00c010e130	Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...	2025-05-31 15:44:16 -07:00

1 2 3 4 5 ...

48376 Commits