linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-14 02:09:07 +08:00

Author	SHA1	Message	Date
Hou Tao	6704b1e8cf	bpf: Don't allocate per-cpu extra_elems for fd htab The update of element in fd htab is in-place now, therefore, there is no need to allocate per-cpu extra_elems, just remove it. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:12:54 -07:00
Hou Tao	e8a65856c7	bpf: Add is_fd_htab() helper Add is_fd_htab() helper to check whether the map is htab of maps. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:12:53 -07:00
Hou Tao	2c304172e0	bpf: Support atomic update for htab of maps As reported by Cody Haas [1], when there is concurrent map lookup and map update operation in an existing element for htab of maps, the map lookup procedure may return -ENOENT unexpectedly. The root cause is twofold: 1) the update of existing element involves two separated list operation In htab_map_update_elem(), it first inserts the new element at the head of list, then it deletes the old element. Therefore, it is possible a lookup operation has already iterated to the middle of the list when a concurrent update operation begins, and the lookup operation will fail to find the target element. 2) the immediate reuse of htab element. It is more subtle. Even through the lookup operation finds the old element, it is possible that the target element has been removed by a concurrent update operation, and the element has been reused immediately by other update operation which runs on the same CPU as the previous update operation, and the element is inserted into the same bucket list. After these steps above, when the lookup operation tries to compare the key in the old element with the expected key, the match will fail because the key in the old element have been overwritten by other update operation. The two-step update process is relatively straightforward to address. The more challenging aspect is the immediate reuse. As Alexei pointed out: So since 2022 both prealloc and no_prealloc reuse elements. We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP that will use _rcu() flavor of freeing into bpf_ma, but it has to have a strong reason. Given that htab of maps doesn't support special field in value and directly stores the inner map pointer in htab_element, just do in-place update for htab of maps instead of attempting to address the immediate reuse issue. [1]: https://lore.kernel.org/xdp-newbies/CAH7f-ULFTwKdoH_t2SFc5rWCVYLEg-14d1fBYWH2eekudsnTRg@mail.gmail.com/ Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:12:53 -07:00
Hou Tao	5771e306b6	bpf: Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place, and add a new percpu argument for the helper to support in-place update for both per-cpu htab and htab of maps. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:12:53 -07:00
Hou Tao	ba2b31b0f3	bpf: Factor out htab_elem_value helper() All hash maps store map key and map value together. The relative offset of the map value compared to the map key is round_up(key_size, 8). Therefore, factor out a common helper htab_elem_value() to calculate the address of the map value instead of duplicating the logic. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:12:53 -07:00
Paul Chaignon	5a15a050df	bpf: Clarify the meaning of BPF_F_PSEUDO_HDR In the bpf_l4_csum_replace helper, the BPF_F_PSEUDO_HDR flag should only be set if the modified header field is part of the pseudo-header. If you modify for example the UDP ports and pass BPF_F_PSEUDO_HDR, inet_proto_csum_replace4 will update skb->csum even though it shouldn't (the port and the UDP checksum updates null each other). Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/5126ef84ba75425b689482cbc98bffe75e5d8ab0.1744102490.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:07:31 -07:00
Paul Chaignon	b412fd6bcc	bpf: Clarify role of BPF_F_RECOMPUTE_CSUM BPF_F_RECOMPUTE_CSUM doesn't update the actual L3 and L4 checksums in the packet, but simply updates skb->csum (according to skb->ip_summed). This patch clarifies that to avoid confusions. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/ff6895d42936f03dbb82334d8bcfd50e00c79086.1744102490.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:07:31 -07:00
Alexei Starovoitov	690d43d3b3	Merge branch 'bpf-sockmap-fix-data-loss-and-panic-issues' Jiayuan Chen says: ==================== bpf, sockmap: Fix data loss and panic issues I was writing a benchmark based on sockmap + TCP and discovered several issues: 1. When EAGAIN occurs, the direction of skb is incorrect, causing data loss when retry. 2. When sending partial data, the offset is not recorded, leading to duplicate data being sent when retry. 3. An unexpected BUG_ON() judgment in skb_linearize is triggered. 4. The memory of psock->ingress_skb is not limited by the socket buffer and memcg. Issues 1, 2, and 3 are described in each patch's commit message. Regarding issue 4, this patchset does not cover it as it is difficult to handle in practice, and I am still working on it. Here is a brief description of the issue: When using sockmap to skb/stream redirect, if the receiving end does not perform read operations, all data will be buffered in ingress_skb. For example: ''' // set memory limit to 50G cgcreate -g memory:myGroup cgset -r memory.max="5000M" myGroup // start benchmark and disable consumer from reading cgexec -g "memory:myGroup" ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --delay-consumer=-1 -d 100 Iter 0 ( 29.179us): Send Speed 2668.548 MB/s (20360.406 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 1 ( -7.237us): Send Speed 2694.467 MB/s (20557.149 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 2 ( -1.918us): Send Speed 2693.404 MB/s (20548.039 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 3 ( -0.684us): Send Speed 2693.138 MB/s (20548.014 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 4 ( 7.879us): Send Speed 2698.620 MB/s (20588.838 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 5 ( -3.224us): Send Speed 2696.553 MB/s (20573.066 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 6 ( -5.409us): Send Speed 2699.705 MB/s (20597.111 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 7 ( -0.439us): Send Speed 2699.691 MB/s (20597.009 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) ... // memory usage are not limited cat /proc/slabinfo \| grep skb skbuff_small_head 11824024 11824024 704 46 8 : tunables 0 0 0 : slabdata 257044 257044 0 skbuff_fclone_cache 11822080 11822080 512 32 4 : tunables 0 0 0 : slabdata 369440 369440 0 ''' Thus, a simple socket in a large file upload/download model can eat the entire OS memory. We must charge the skb memory to psock->sk, and if we do not want losing skb, we need to feedback the error info to read_sock/read_skb when the enqueue operation of psock->ingress_skb fails. --- My another patch related to stability also requires maintainers to spare some time from their busy schedules for review. https://lore.kernel.org/bpf/20250317092257.68760-1-jiayuan.chen@linux.dev/T/#t ==================== Link: https://patch.msgid.link/20250407142234.47591-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:59:00 -07:00
Jiayuan Chen	7b2fa44de5	selftest/bpf/benchs: Add benchmark for sockmap usage Add TCP+sockmap-based benchmark. Since sockmap's own update and delete operations are generally less critical, the performance of the fast forwarding framework built upon it is the key aspect. Also with cgset/cgexec, we can observe the behavior of sockmap under memory pressure. The benchmark can be run with: ''' ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress ''' In the future, we plan to move socket_helpers.h out of the prog_tests directory to make it accessible for the benchmark. This will enable better support for various socket types. Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-5-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:59:00 -07:00
Jiayuan Chen	5ca2e29f68	bpf, sockmap: Fix panic when calling skb_linearize The panic can be reproduced by executing the command: ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --rx-strp 100000 Then a kernel panic was captured: ''' [ 657.460555] kernel BUG at net/core/skbuff.c:2178! [ 657.462680] Tainted: [W]=WARN [ 657.463287] Workqueue: events sk_psock_backlog ... [ 657.469610] <TASK> [ 657.469738] ? die+0x36/0x90 [ 657.469916] ? do_trap+0x1d0/0x270 [ 657.470118] ? pskb_expand_head+0x612/0xf40 [ 657.470376] ? pskb_expand_head+0x612/0xf40 [ 657.470620] ? do_error_trap+0xa3/0x170 [ 657.470846] ? pskb_expand_head+0x612/0xf40 [ 657.471092] ? handle_invalid_op+0x2c/0x40 [ 657.471335] ? pskb_expand_head+0x612/0xf40 [ 657.471579] ? exc_invalid_op+0x2d/0x40 [ 657.471805] ? asm_exc_invalid_op+0x1a/0x20 [ 657.472052] ? pskb_expand_head+0xd1/0xf40 [ 657.472292] ? pskb_expand_head+0x612/0xf40 [ 657.472540] ? lock_acquire+0x18f/0x4e0 [ 657.472766] ? find_held_lock+0x2d/0x110 [ 657.472999] ? __pfx_pskb_expand_head+0x10/0x10 [ 657.473263] ? __kmalloc_cache_noprof+0x5b/0x470 [ 657.473537] ? __pfx___lock_release.isra.0+0x10/0x10 [ 657.473826] __pskb_pull_tail+0xfd/0x1d20 [ 657.474062] ? __kasan_slab_alloc+0x4e/0x90 [ 657.474707] sk_psock_skb_ingress_enqueue+0x3bf/0x510 [ 657.475392] ? __kasan_kmalloc+0xaa/0xb0 [ 657.476010] sk_psock_backlog+0x5cf/0xd70 [ 657.476637] process_one_work+0x858/0x1a20 ''' The panic originates from the assertion BUG_ON(skb_shared(skb)) in skb_linearize(). A previous commit(see Fixes tag) introduced skb_get() to avoid race conditions between skb operations in the backlog and skb release in the recvmsg path. However, this caused the panic to always occur when skb_linearize is executed. The "--rx-strp 100000" parameter forces the RX path to use the strparser module which aggregates data until it reaches 100KB before calling sockmap logic. The 100KB payload exceeds MAX_MSG_FRAGS, triggering skb_linearize. To fix this issue, just move skb_get into sk_psock_skb_ingress_enqueue. ''' sk_psock_backlog: sk_psock_handle_skb skb_get(skb) <== we move it into 'sk_psock_skb_ingress_enqueue' sk_psock_skb_ingress____________ ↓ \| \| → sk_psock_skb_ingress_self \| sk_psock_skb_ingress_enqueue sk_psock_verdict_apply_________________↑ skb_linearize ''' Note that for verdict_apply path, the skb_get operation is unnecessary so we add 'take_ref' param to control it's behavior. Fixes: `a454d84ee2` ("bpf, sockmap: Fix skb refcnt race after locking changes") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-4-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:59:00 -07:00
Jiayuan Chen	3b4f14b794	bpf, sockmap: fix duplicated data transmission In the !ingress path under sk_psock_handle_skb(), when sending data to the remote under snd_buf limitations, partial skb data might be transmitted. Although we preserved the partial transmission state (offset/length), the state wasn't properly consumed during retries. This caused the retry path to resend the entire skb data instead of continuing from the previous offset, resulting in data overlap at the receiver side. Fixes: `405df89dd5` ("bpf, sockmap: Improved check for empty queue") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:58:59 -07:00
Jiayuan Chen	7683167196	bpf, sockmap: Fix data lost during EAGAIN retries We call skb_bpf_redirect_clear() to clean _sk_redir before handling skb in backlog, but when sk_psock_handle_skb() return EAGAIN due to sk_rcvbuf limit, the redirect info in _sk_redir is not recovered. Fix skb redir loss during EAGAIN retries by restoring _sk_redir information using skb_bpf_set_redir(). Before this patch: ''' ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress Setting up benchmark 'sockmap'... create socket fd c1:13 p1:14 c2:15 p2:16 Benchmark 'sockmap' started. Send Speed 1343.172 MB/s, BPF Speed 1343.238 MB/s, Rcv Speed 65.271 MB/s Send Speed 1352.022 MB/s, BPF Speed 1352.088 MB/s, Rcv Speed 0 MB/s Send Speed 1354.105 MB/s, BPF Speed 1354.105 MB/s, Rcv Speed 0 MB/s Send Speed 1355.018 MB/s, BPF Speed 1354.887 MB/s, Rcv Speed 0 MB/s ''' Due to the high send rate, the RX processing path may frequently hit the sk_rcvbuf limit. Once triggered, incorrect _sk_redir will cause the flow to mistakenly enter the "!ingress" path, leading to send failures. (The Rcv speed depends on tcp_rmem). After this patch: ''' ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress Setting up benchmark 'sockmap'... create socket fd c1:13 p1:14 c2:15 p2:16 Benchmark 'sockmap' started. Send Speed 1347.236 MB/s, BPF Speed 1347.367 MB/s, Rcv Speed 65.402 MB/s Send Speed 1353.320 MB/s, BPF Speed 1353.320 MB/s, Rcv Speed 65.536 MB/s Send Speed 1353.186 MB/s, BPF Speed 1353.121 MB/s, Rcv Speed 65.536 MB/s ''' Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:58:59 -07:00
Alexei Starovoitov	727d00a51f	Merge branch 'bpf-fix-ktls-panic-with-sockmap-and-add-tests' Jiayuan Chen says: ==================== bpf: fix ktls panic with sockmap and add tests We can reproduce the issue using the existing test program: './test_sockmap --ktls' Or use the selftest I provided, which will cause a panic: ------------[ cut here ]------------ kernel BUG at lib/iov_iter.c:629! PKRU: 55555554 Call Trace: <TASK> ? die+0x36/0x90 ? do_trap+0xdd/0x100 ? iov_iter_revert+0x178/0x180 ? iov_iter_revert+0x178/0x180 ? do_error_trap+0x7d/0x110 ? iov_iter_revert+0x178/0x180 ? exc_invalid_op+0x50/0x70 ? iov_iter_revert+0x178/0x180 ? asm_exc_invalid_op+0x1a/0x20 ? iov_iter_revert+0x178/0x180 ? iov_iter_revert+0x5c/0x180 tls_sw_sendmsg_locked.isra.0+0x794/0x840 tls_sw_sendmsg+0x52/0x80 ? inet_sendmsg+0x1f/0x70 __sys_sendto+0x1cd/0x200 ? find_held_lock+0x2b/0x80 ? syscall_trace_enter+0x140/0x270 ? __lock_release.isra.0+0x5e/0x170 ? find_held_lock+0x2b/0x80 ? syscall_trace_enter+0x140/0x270 ? lockdep_hardirqs_on_prepare+0xda/0x190 ? ktime_get_coarse_real_ts64+0xc2/0xd0 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x90/0x170 1. It looks like the issue started occurring after bpf being introduced to ktls and later the addition of assertions to iov_iter has caused a panic. If my fix tag is incorrect, please assist me in correcting the fix tag. 2. I make minimal changes for now, it's enough to make ktls work correctly. --- v1->v2: Added more content to the commit message https://lore.kernel.org/all/20250123171552.57345-1-mrpre@163.com/#r --- ==================== Link: https://patch.msgid.link/20250219052015.274405-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:57:14 -07:00
Jiayuan Chen	05ebde1bcb	selftests/bpf: add ktls selftest add ktls selftest for sockmap Test results: sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK sockmap_ktls/tls simple offload:OK sockmap_ktls/tls tx cork:OK sockmap_ktls/tls tx cork with push:OK sockmap_ktls/tls simple offload:OK sockmap_ktls/tls tx cork:OK sockmap_ktls/tls tx cork with push:OK sockmap_ktls:OK Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250219052015.274405-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:57:14 -07:00
Jiayuan Chen	54a3ecaeee	bpf: fix ktls panic with sockmap [ 2172.936997] ------------[ cut here ]------------ [ 2172.936999] kernel BUG at lib/iov_iter.c:629! ...... [ 2172.944996] PKRU: 55555554 [ 2172.945155] Call Trace: [ 2172.945299] <TASK> [ 2172.945428] ? die+0x36/0x90 [ 2172.945601] ? do_trap+0xdd/0x100 [ 2172.945795] ? iov_iter_revert+0x178/0x180 [ 2172.946031] ? iov_iter_revert+0x178/0x180 [ 2172.946267] ? do_error_trap+0x7d/0x110 [ 2172.946499] ? iov_iter_revert+0x178/0x180 [ 2172.946736] ? exc_invalid_op+0x50/0x70 [ 2172.946961] ? iov_iter_revert+0x178/0x180 [ 2172.947197] ? asm_exc_invalid_op+0x1a/0x20 [ 2172.947446] ? iov_iter_revert+0x178/0x180 [ 2172.947683] ? iov_iter_revert+0x5c/0x180 [ 2172.947913] tls_sw_sendmsg_locked.isra.0+0x794/0x840 [ 2172.948206] tls_sw_sendmsg+0x52/0x80 [ 2172.948420] ? inet_sendmsg+0x1f/0x70 [ 2172.948634] __sys_sendto+0x1cd/0x200 [ 2172.948848] ? find_held_lock+0x2b/0x80 [ 2172.949072] ? syscall_trace_enter+0x140/0x270 [ 2172.949330] ? __lock_release.isra.0+0x5e/0x170 [ 2172.949595] ? find_held_lock+0x2b/0x80 [ 2172.949817] ? syscall_trace_enter+0x140/0x270 [ 2172.950211] ? lockdep_hardirqs_on_prepare+0xda/0x190 [ 2172.950632] ? ktime_get_coarse_real_ts64+0xc2/0xd0 [ 2172.951036] __x64_sys_sendto+0x24/0x30 [ 2172.951382] do_syscall_64+0x90/0x170 ...... After calling bpf_exec_tx_verdict(), the size of msg_pl->sg may increase, e.g., when the BPF program executes bpf_msg_push_data(). If the BPF program sets cork_bytes and sg.size is smaller than cork_bytes, it will return -ENOSPC and attempt to roll back to the non-zero copy logic. However, during rollback, msg->msg_iter is reset, but since msg_pl->sg.size has been increased, subsequent executions will exceed the actual size of msg_iter. ''' iov_iter_revert(&msg->msg_iter, msg_pl->sg.size - orig_size); ''' The changes in this commit are based on the following considerations: 1. When cork_bytes is set, rolling back to non-zero copy logic is pointless and can directly go to zero-copy logic. 2. We can not calculate the correct number of bytes to revert msg_iter. Assume the original data is "abcdefgh" (8 bytes), and after 3 pushes by the BPF program, it becomes 11-byte data: "abc?de?fgh?". Then, we set cork_bytes to 6, which means the first 6 bytes have been processed, and the remaining 5 bytes "?fgh?" will be cached until the length meets the cork_bytes requirement. However, some data in "?fgh?" is not within 'sg->msg_iter' (but in msg_pl instead), especially the data "?" we pushed. So it doesn't seem as simple as just reverting through an offset of msg_iter. 3. For non-TLS sockets in tcp_bpf_sendmsg, when a "cork" situation occurs, the user-space send() doesn't return an error, and the returned length is the same as the input length parameter, even if some data is cached. Additionally, I saw that the current non-zero-copy logic for handling corking is written as: ''' line 1177 else if (ret != -EAGAIN) { if (ret == -ENOSPC) ret = 0; goto send_end; ''' So it's ok to just return 'copied' without error when a "cork" situation occurs. Fixes: `fcb14cb1bd` ("new iov_iter flavour - ITER_UBUF") Fixes: `d3b18ad31f` ("tls: add bpf support to sk_msg handling") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250219052015.274405-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:57:13 -07:00
Saket Kumar Bhaskar	967e8def11	selftests/bpf: Fix bpf_nf selftest failure For systems with missing iptables-legacy tool this selftest fails. Add check to find if iptables-legacy tool is available and skip the test if the tool is missing. Fixes: `de9c8d848d` ("selftests/bpf: S/iptables/iptables-legacy/ in the bpf_nf and xdp_synproxy test") Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250409095633.33653-1-skb99@linux.ibm.com	2025-04-09 16:29:39 -07:00
Tao Chen	a76116f422	bpf: Check link_create.flags parameter for multi_uprobe The link_create.flags are currently not used for multi-uprobes, so return -EINVAL if it is set, same as for other attach APIs. We allow target_fd to have an arbitrary value for multi-uprobe, though, as there are existing users (libbpf) relying on this. Fixes: `89ae89f53d` ("bpf: Add multi uprobe link") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250407035752.1108927-2-chen.dylane@linux.dev	2025-04-09 16:28:51 -07:00
Tao Chen	243911982a	bpf: Check link_create.flags parameter for multi_kprobe The link_create.flags are currently not used for multi-kprobes, so return -EINVAL if it is set, same as for other attach APIs. We allow target_fd, on the other hand, to have an arbitrary value for multi-kprobe, as there are existing users (libbpf) relying on this. Fixes: `0dcac27254` ("bpf: Add multi kprobe link") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250407035752.1108927-1-chen.dylane@linux.dev	2025-04-09 16:28:22 -07:00
Andrii Nakryiko	527b33dda1	Merge branch 'libbpf-introduce-line_info-and-func_info-getters' Mykyta Yatsenko says: ==================== libbpf: introduce line_info and func_info getters From: Mykyta Yatsenko <yatsenko@meta.com> This patchset introduces new libbpf API getters that enable the retrieval of .BTF.ext line and func info. This change enables users to load bpf_program directly using bpf_prog_load, bypassing the higher-level bpf_object__load API. Providing line and function info is essential for BPF program verification in some cases. v3 -> v5 * Fix tests on s390x, nits. v2 -> v3 * Return ENOTSUPP if func or line info struct size differs from the one in uapi linux headers. * Add selftests. v1 -> v2 Move bpf_line_info_min and bpf_func_info_min from libbpf_internal.h to btf.h. Did not remove _min suffix, because there already are bpf_line_info and bpf_func_info structs in uapi/../bpf.h. ==================== Link: https://patch.msgid.link/20250408234417.452565-1-mykyta.yatsenko5@gmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2025-04-09 16:16:57 -07:00
Mykyta Yatsenko	b8390dd1e0	selftests/bpf: Add BTF.ext line/func info getter tests Add selftests checking that line and func info retrieved by newly added libbpf APIs are the same as returned by kernel via bpf_prog_get_info_by_fd. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250408234417.452565-3-mykyta.yatsenko5@gmail.com	2025-04-09 16:16:56 -07:00
Mykyta Yatsenko	243d720e2e	libbpf: Add getters for BTF.ext func and line info Introducing new libbpf API getters for BTF.ext func and line info, namely: bpf_program__func_info bpf_program__func_info_cnt bpf_program__line_info bpf_program__line_info_cnt This change enables scenarios, when user needs to load bpf_program directly using `bpf_prog_load`, instead of higher-level `bpf_object__load`. Line and func info are required for checking BTF info in verifier; verification may fail without these fields if, for example, program calls `bpf_obj_new`. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250408234417.452565-2-mykyta.yatsenko5@gmail.com	2025-04-09 16:16:56 -07:00
Mykyta Yatsenko	37b1b3ed20	selftests/bpf: Support struct/union presets in veristat Extend commit `e3c9abd0d1` ("selftests/bpf: Implement setting global variables in veristat") to support applying presets to members of the global structs or unions in veristat. For example: ``` ./veristat set_global_vars.bpf.o -G "union1.struct3.var_u8_h = 0xBB" ``` Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250408104544.140317-1-mykyta.yatsenko5@gmail.com	2025-04-09 16:16:12 -07:00
Chen Ni	c966139485	selftests/bpf: Convert comma to semicolon Replace comma between expressions with semicolons. Using a ',' in place of a ';' can have unintended side effects. Although that is not the case here, it is seems best to use ';' unless ',' is intended. Found by inspection. No functional change intended. Compile tested only. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/bpf/20250401061546.1990156-1-nichen@iscas.ac.cn	2025-04-04 08:53:57 -07:00
Andrii Nakryiko	893c3938ab	Merge branch 'likely-unlikely-for-bpf_helpers-and-a-small-comment-fix' Anton Protopopov says: ==================== likely/unlikely for bpf_helpers and a small comment fix These two commits fix a comment describing bpf_attr in <linux/bpf.h> and add likely/unlikely macros to <bph/bpf_helpers.h> to be consumed by selftests and, later, by the static_branch_likely/unlikely macros. v1 -> v2: * squash libbpf and selftests fixes into one patch (Andrii) ==================== Link: https://patch.msgid.link/20250331203618.1973691-1-a.s.protopopov@gmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2025-04-04 08:53:25 -07:00
Anton Protopopov	dafae1ae2a	libbpf: Add likely/unlikely macros and use them in selftests A few selftests and, more importantly, consequent changes to the bpf_helpers.h file, use likely/unlikely macros, so define them here and remove duplicate definitions from existing selftests. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250331203618.1973691-3-a.s.protopopov@gmail.com	2025-04-04 08:53:24 -07:00
Anton Protopopov	62aa5790ce	bpf: Fix a comment describing bpf_attr The map_fd field of the bpf_attr union is used in the BPF_MAP_FREEZE syscall. Explicitly mention this in the comments. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250331203618.1973691-2-a.s.protopopov@gmail.com	2025-04-04 08:53:24 -07:00
Carlos Llamas	75011ad69b	libbpf: Fix implicit memfd_create() for bionic Since memfd_create() is not consistently available across different bionic libc implementations, using memfd_create() directly can break some Android builds: tools/lib/bpf/linker.c:576:7: error: implicit declaration of function 'memfd_create' [-Werror,-Wimplicit-function-declaration] 576 \| fd = memfd_create(filename, 0); \| ^ To fix this, relocate and inline the sys_memfd_create() helper so that it can be used in "linker.c". Similar issues were previously fixed by commit `9fa5e1a180` ("libbpf: Call memfd_create() syscall directly"). Fixes: `6d5e5e5d7c` ("libbpf: Extend linker API to support in-memory ELF files") Signed-off-by: Carlos Llamas <cmllamas@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250330211325.530677-1-cmllamas@google.com	2025-04-04 08:52:37 -07:00
Linus Torvalds	06a22366d6	Merge tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd Pull smb server fixes from Steve French: "Four ksmbd SMB3 server fixes, all also for stable" * tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: fix null pointer dereference in alloc_preauth_hash() ksmbd: validate zero num_subauth before sub_auth is accessed ksmbd: fix overflow in dacloffset bounds check ksmbd: fix session use-after-free in multichannel connection	2025-04-03 16:18:06 -07:00
Linus Torvalds	6cb0bd94c0	Merge tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer updates from Steven Rostedt: "Persistent buffer cleanups and simplifications. It was mistaken that the physical memory returned from "reserve_mem" had to be vmap()'d to get to it from a virtual address. But reserve_mem already maps the memory to the virtual address of the kernel so a simple phys_to_virt() can be used to get to the virtual address from the physical memory returned by "reserve_mem". With this new found knowledge, the code can be cleaned up and simplified. - Enforce that the persistent memory is page aligned As the buffers using the persistent memory are all going to be mapped via pages, make sure that the memory given to the tracing infrastructure is page aligned. If it is not, it will print a warning and fail to map the buffer. - Use phys_to_virt() to get the virtual address from reserve_mem Instead of calling vmap() on the physical memory returned from "reserve_mem", use phys_to_virt() instead. As the memory returned by "memmap" or any other means where a physical address is given to the tracing infrastructure, it still needs to be vmap(). Since this memory can never be returned back to the buddy allocator nor should it ever be memmory mapped to user space, flag this buffer and up the ref count. The ref count will keep it from ever being freed, and the flag will prevent it from ever being memory mapped to user space. - Use vmap_page_range() for memmap virtual address mapping For the memmap buffer, instead of allocating an array of struct pages, assigning them to the contiguous phsycial memory and then passing that to vmap(), use vmap_page_range() instead - Replace flush_dcache_folio() with flush_kernel_vmap_range() Instead of calling virt_to_folio() and passing that to flush_dcache_folio(), just call flush_kernel_vmap_range() directly. This also fixes a bug where if a subbuffer was bigger than PAGE_SIZE only the PAGE_SIZE portion would be flushed" * tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio() tracing: Use vmap_page_range() to map memmap ring buffer tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer tracing: Enforce the persistent ring buffer to be page aligned	2025-04-03 16:09:29 -07:00
Linus Torvalds	949dd321de	Merge tag 'block-6.15-20250403' of git://git.kernel.dk/linux Pull more block updates from Jens Axboe: - NVMe pull request via Keith: - PCI endpoint target cleanup (Damien) - Early import for uring_cmd fixed buffer (Caleb) - Multipath documentation and notification improvements (John) - Invalid pci sq doorbell write fix (Maurizio) - Queue init locking fix - Remove dead nsegs parameter from blk_mq_get_new_requests() * tag 'block-6.15-20250403' of git://git.kernel.dk/linux: block: don't grab elevator lock during queue initialization nvme-pci: skip nvme_write_sq_db on empty rqlist nvme-multipath: change the NVME_MULTIPATH config option nvme: update the multipath warning in nvme_init_ns_head nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io() nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request() nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer nvmet: pci-epf: Keep completion queues mapped block: remove unused nseg parameter	2025-04-03 16:04:38 -07:00
Linus Torvalds	7930edcc3a	Merge tag 'io_uring-6.15-20250403' of git://git.kernel.dk/linux Pull more io_uring updates from Jens Axboe: "Set of fixes/updates for io_uring that should go into this release. The ublk bits could've gone via either tree - usually I put them in block, but they got a bit mixed this series with the zero-copy supported that ended up dipping into both trees. This contains: - Fix for sendmsg zc, include in pinned pages accounting like we do for the other zc types - Series for ublk fixing request aborting, doing various little cleanups, fixing some zc issues, and adding queue_rqs support - Another ublk series doing some code cleanups - Series cleaning up the io_uring send path, mostly in preparation for registered buffers - Series doing little MSG_RING cleanups - Fix for the newly added zc rx, fixing len being 0 for the last invocation of the callback - Add vectored registered buffer support for ublk. With that, then ublk also supports this feature in the kernel revision where it could generically introduced for rw/net - A bunch of selftest additions for ublk. This is the majority of the diffstat - Silence a KCSAN data race warning for io-wq - Various little cleanups and fixes" * tag 'io_uring-6.15-20250403' of git://git.kernel.dk/linux: (44 commits) io_uring: always do atomic put from iowq selftests: ublk: enable zero copy for stripe target io_uring: support vectored kernel fixed buffer block: add for_each_mp_bvec() io_uring: add validate_fixed_range() for validate fixed buffer selftests: ublk: kublk: fix an error log line selftests: ublk: kublk: use ioctl-encoded opcodes io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0 io_uring/net: avoid import_ubuf for regvec send io_uring/rsrc: check size when importing reg buffer io_uring: cleanup {g,s]etsockopt sqe reading io_uring: hide caches sqes from drivers io_uring: make zcrx depend on CONFIG_IO_URING io_uring: add req flag invariant build assertion Documentation: ublk: remove dead footnote selftests: ublk: specify io_cmd_buf pointer type ublk: specify io_cmd_buf pointer type io_uring: don't pass ctx to tw add remote helper io_uring/msg: initialise msg request opcode io_uring/msg: rename io_double_lock_ctx() ...	2025-04-03 15:48:58 -07:00
Christian Brauner	c0dbd11ada	fs: actually hold the namespace semaphore Don't use a scoped guard that only protects the next statement. Use a regular guard to make sure that the namespace semaphore is held across the whole function. Signed-off-by: Christian Brauner <brauner@kernel.org> Reported-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/all/20250401170715.GA112019@unreal/ Fixes: `db04662e2f` ("fs: allow detached mounts in clone_private_mount()") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-04-03 15:45:35 -07:00
Linus Torvalds	56770e24f6	Merge tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs Pull more bcachefs updates from Kent Overstreet: "More notable fixes: - Fix for striping behaviour on tiering filesystems where replicas exceeds durability on destination target - Fix a race in device removal where deleting alloc info races with the discard worker - Some small stack usage improvements: this is just enough for KMSAN builds to not blow the stack, more is queued up for 6.16" * tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs: bcachefs: Fix "journal stuck" during recovery bcachefs: backpointer_get_key: check for null from peek_slot() bcachefs: Fix null ptr deref in invalidate_one_bucket() bcachefs: Fix check_snapshot_exists() restart handling bcachefs: use nonblocking variant of print_string_as_lines in error path bcachefs: Fix scheduling while atomic from logging changes bcachefs: Add error handling for zlib_deflateInit2() bcachefs: add missing selection of XARRAY_MULTI bcachefs: bch_dev_usage_full bcachefs: Kill btree_iter.trans bcachefs: do_trace_key_cache_fill() bcachefs: Split up bch_dev.io_ref bcachefs: fix ref leak in btree_node_read_all_replicas bcachefs: Fix null ptr deref in bch2_write_endio() bcachefs: Fix field spanning write warning bcachefs: Fix striping behaviour	2025-04-03 15:39:47 -07:00
Linus Torvalds	bdafff62ae	Merge tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux Pull 9p updates from Dominique Martinet: - fix handling of bogus (negative/too long) replies - fix crash on mkdir with ACLs (... looks like nobody is using ACLs with semi-recent kernels...) - ipv6 support for trans=tcp - minor concurrency fix to make syzbot happy - minor cleanup * tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux: docs: fs/9p: Add missing "not" in cache documentation 9p: Use hashtable.h for hash_errmap Documentation/fs/9p: fix broken link 9p/trans_fd: mark concurrent read and writes to p9_conn->err 9p/net: return error on bogus (longer than requested) replies 9p/net: fix improper handling of bogus negative read/write replies fs/9p: fix NULL pointer dereference on mkdir net/9p/fd: support ipv6 for trans=tcp	2025-04-03 15:35:46 -07:00
Linus Torvalds	5916a6fbc0	Merge tag 'rtc-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC updates from Alexandre Belloni: "We see a net reduction of the number of lines of code thanks to the removal of a now unused driver and a testing tool that is not used anymore. Apart from this, the max31335 driver gets support for a new part number and pm8xxx gets UEFI support. Core: - setdate is removed as it has better replacements - skip alarms with a second resolution when we know the RTC doesn't support those. Subsystem: - remove unnecessary private struct members - use devm_pm_set_wake_irq were relevant Drivers: - ds1307: stop disabling alarms on probe for DS1337, DS1339, DS1341 and DS3231 - max31335: add max31331 support - pcf50633 is removed as support for the related SoC has been removed - pcf85063: properly handle POR failures" * tag 'rtc-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (50 commits) rtc: remove 'setdate' test program selftest: rtc: skip some tests if the alarm only supports minutes rtc: mt6397: drop unused defines rtc: pcf85063: replace dev_err+return with return dev_err_probe rtc: pcf85063: do a SW reset if POR failed rtc: max31335: Add driver support for max31331 dt-bindings: rtc: max31335: Add max31331 support rtc: cros-ec: Avoid a couple of -Wflex-array-member-not-at-end warnings dt-bindings: rtc: pcf2127: Reference spi-peripheral-props.yaml rtc: rzn1: implement one-second accuracy for alarms rtc: pcf50633: Remove rtc: pm8xxx: implement qcom,no-alarm flag for non-HLOS owned alarm rtc: pm8xxx: mitigate flash wear rtc: pm8xxx: add support for uefi offset dt-bindings: rtc: qcom-pm8xxx: document qcom,no-alarm flag rtc: rv3032: drop WADA rtc: rv3032: fix EERD location rtc: pm8xxx: switch to devm_device_init_wakeup rtc: pm8xxx: fix possible race condition rtc: mpfs: switch to devm_device_init_wakeup ...	2025-04-03 15:31:14 -07:00
Linus Torvalds	e8b4712852	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux Pull ARM and clkdev updates from Russell King: - Simplify ARM_MMU_KEEP usage - Add Rust support for ARM architecture version 7 - Align IPIs reported in /proc/interrupts - require linker to support KEEP within OVERLAY - add KEEP() for ARM vectors - add __printf() attribute for clkdev functions * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux: ARM: 9445/1: clkdev: Mark some functions with __printf() attribute ARM: 9444/1: add KEEP() keyword to ARM_VECTORS ARM: 9443/1: Require linker to support KEEP within OVERLAY for DCE ARM: 9442/1: smp: Fix IPI alignment in /proc/interrupts ARM: 9441/1: rust: Enable Rust support for ARMv7 ARM: 9439/1: arm32: simplify ARM_MMU_KEEP usage	2025-04-03 12:21:44 -07:00
Linus Torvalds	aa18761a44	Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Catalin Marinas: - Fix max_pfn calculation when hotplugging memory so that it never decreases - Fix dereference of unused source register in the MOPS SET operation fault handling - Fix NULL calling in do_compat_alignment_fixup() when the 32-bit user space does an unaligned LDREX/STREX - Add the HiSilicon HIP09 processor to the Spectre-BHB affected CPUs - Drop unused code pud accessors (special/mkspecial) * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: Don't call NULL in do_compat_alignment_fixup() arm64: Add support for HIP09 Spectre-BHB mitigation arm64: mm: Drop dead code for pud special bit handling arm64: mops: Do not dereference src reg for a set operation arm64: mm: Correct the update of max_pfn	2025-04-03 12:07:01 -07:00
Linus Torvalds	531a62f223	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix BPF selftests expectations of assembler output and struct layout (Song Liu and Yonghong Song) - Fix XSK error code when queue is full (Wang Liang) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Fix verifier_private_stack test failure selftests/bpf: Fix verifier_bpf_fastcall test selftests/bpf: Fix tests after fields reorder in struct file xsk: Fix __xsk_generic_xmit() error code when cq is full	2025-04-03 11:55:41 -07:00
Linus Torvalds	5a2b5cb76c	Merge tag 'mm-nonmm-stable-2025-04-02-22-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more non-MM updates from Andrew Morton: "One bugfix and a couple of small late-arriving updates" * tag 'mm-nonmm-stable-2025-04-02-22-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib: scatterlist: fix sg_split_phys to preserve original scatterlist offsets lib/sort.c: add _nonatomic() variants with cond_resched() mailmap: add an entry for Nicolas Schier	2025-04-03 11:16:57 -07:00
Linus Torvalds	8c7c1b5506	Merge tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - The series "mm: fixes for fallouts from mem_init() cleanup" from Mike Rapoport fixes a couple of issues with the just-merged "arch, mm: reduce code duplication in mem_init()" series - The series "MAINTAINERS: add my isub-entries to MM part." from Mike Rapoport does some maintenance on MAINTAINERS - The series "remove tlb_remove_page_ptdesc()" from Qi Zheng does some cleanup work to the page mapping code - The series "mseal system mappings" from Jeff Xu permits sealing of "system mappings", such as vdso, vvar, vvar_vclock, vectors (arm compat-mode), sigpage (arm compat-mode) - Plus the usual shower of singleton patches * tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits) mseal sysmap: add arch-support txt mseal sysmap: enable s390 selftest: test system mappings are sealed mseal sysmap: update mseal.rst mseal sysmap: uprobe mapping mseal sysmap: enable arm64 mseal sysmap: enable x86-64 mseal sysmap: generic vdso vvar mapping selftests: x86: test_mremap_vdso: skip if vdso is msealed mseal sysmap: kernel config and header change mm: pgtable: remove tlb_remove_page_ptdesc() x86: pgtable: convert to use tlb_remove_ptdesc() riscv: pgtable: unconditionally use tlb_remove_ptdesc() mm: pgtable: convert some architectures to use tlb_remove_ptdesc() mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc* mm: pgtable: make generic tlb_remove_table() use struct ptdesc microblaze/mm: put mm_cmdline_setup() in .init.text section mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range MAINTAINERS: mm: add entry for secretmem MAINTAINERS: mm: add entry for numa memblocks and numa emulation ...	2025-04-03 11:10:00 -07:00
Linus Torvalds	204e9a18f1	Merge tag 'mm-hotfixes-stable-2025-04-02-21-57' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM hotfixes from Andrew Morton: "Five hotfixes. Three are cc:stable and the remainder address post-6.14 issues or aren't considered necessary for -stable kernels. All patches are for MM" * tag 'mm-hotfixes-stable-2025-04-02-21-57' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: zswap: fix crypto_free_acomp() deadlock in zswap_cpu_comp_dead() mm/hugetlb: move hugetlb_sysctl_init() to the __init section mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock mm/hugetlb_vmemmap: fix memory loads ordering mm/userfaultfd: fix release hang over concurrent GUP	2025-04-03 10:47:47 -07:00
Linus Torvalds	ea59cb7423	Merge tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Calling scx_bpf_create_dsq() with the same ID would succeed creating duplicate DSQs. Fix it to return -EEXIST. - scx_select_cpu_dfl() fixes and cleanups. - Synchronize tool/sched_ext with external scheduler repo. While this isn't a fix. There's no risk to the kernel and it's better if they stay synced closer. * tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: tools/sched_ext: Sync with scx repo sched_ext: initialize built-in idle state before ops.init() sched_ext: create_dsq: Return -EEXIST on duplicate request sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl() sched_ext: idle: Fix return code of scx_select_cpu_dfl()	2025-04-03 10:03:38 -07:00
Linus Torvalds	41677970ad	Merge tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix build error when CONFIG_PROBE_EVENTS_BTF_ARGS is not enabled The tracing of arguments in the function tracer depends on some functions that are only defined when PROBE_EVENTS_BTF_ARGS is enabled. In fact, PROBE_EVENTS_BTF_ARGS also depends on all the same configs as the function argument tracing requires. Just have the function argument tracing depend on PROBE_EVENTS_BTF_ARGS. - Free module_delta for persistent ring buffer instance When an instance holds the persistent ring buffer, it allocates a helper array to hold the deltas between where modules are loaded on the last boot and the current boot. This array needs to be freed when the instance is freed. - Add cond_resched() to loop in ftrace_graph_set_hash() The hash functions in ftrace loop over every function that can be enabled by ftrace. This can be 50,000 functions or more. This loop is known to trigger soft lockup warnings and requires a cond_resched(). The loop in ftrace_graph_set_hash() was missing it. - Fix the event format verifier to include "%p.." arguments To prevent events from dereferencing stale pointers that can happen if a trace event uses a dereferece pointer to something that was not copied into the ring buffer and can be freed by the time the trace is read, a verifier is called. At boot or module load, the verifier scans the print format string for pointers that can be dereferenced and it checks the arguments to make sure they do not contain something that can be freed. The "%p" was not handled, which would add another argument and cause the verifier to not only not verify this pointer, but it will look at the wrong argument for every pointer after that. - Fix mcount sorttable building for different endian type target When modifying the ELF file to sort the mcount_loc table in the sorttable.c code, the endianess of the file and the host is used to determine if the bytes need to be swapped when calculations are done. A change was made to the sorting of the mcount_loc that read the values from the ELF file into an array and the swap happened on the filling of the array. But one of the calculations of the array still did the swap when it did not need to. This caused building on a little endian machine for a big endian target to not find the mcount function in the 'nm' table and it zeroed it out, causing there to be no functions available to trace. - Add goto out_unlock jump to rv_register_monitor() on error path One of the error paths in rv_register_monitor() just returned the error when it should have jumped to the out_unlock label to release the mutex. * tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Fix missing unlock on double nested monitors return path scripts/sorttable: Fix endianness handling in build-time mcount sort tracing: Verify event formats that have "%*p.." ftrace: Add cond_resched() to ftrace_graph_set_hash() tracing: Free module_delta on freeing of persistent ring buffer ftrace: Have tracing function args depend on PROBE_EVENTS_BTF_ARGS	2025-04-03 09:52:44 -07:00
Kent Overstreet	77ad1df82b	bcachefs: Fix "journal stuck" during recovery If we crash when the journal pin fifo is completely full - i.e. we're at the maximum number of dirty journal entries - that may put us in a sticky situation in recovery, as journal replay will need to be able to open new journal entries in order to get going. bch2_fs_journal_start() already had provisions for resizing the journal pin fifo if needed, but it needs a fudge factor to ensure there's room for journal replay. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:43 -04:00
Kent Overstreet	2581f89ac8	bcachefs: backpointer_get_key: check for null from peek_slot() peek_slot() doesn't normally return bkey_s_c_null - except when we ask for a key at a btree level that doesn't exist, which can happen here. We might want to revisit this, but we'll have to look over all the places where we use peek_slot() on interior nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:43 -04:00
Kent Overstreet	39ebd74864	bcachefs: Fix null ptr deref in invalidate_one_bucket() bch2_backpointer_get_key() returns bkey_s_c_null when the target isn't found. backpointer_get_key() flags the error, so there's nothing else to do here - just skip it and move on. Link: https://github.com/koverstreet/bcachefs/issues/847 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:43 -04:00
Kent Overstreet	83d539b1b0	bcachefs: Fix check_snapshot_exists() restart handling Codepaths that create entries in the snapshots btree currently call bch2_mark_snapshot(), which updates the in-memory snapshot table, before transaction commit. This is because bch2_mark_snapshot() is an atomic trigger, run with btree write locks held, and isn't allowed to fail - but it might need to reallocate the table, hence we call it early when we're still allowed to fail. This is generally harmless - if we fail, we'll have left an entry in the snapshots table around, but nothing will reference it and it'll get overwritten if reused by another transaction. But check_snapshot_exists(), which reconstructs snapshots when the snapshots btree has been corrupted or lost, was erronously rechecking if the snapshot exists inside the transaction commit loop - so on transaction restart (in this case mem_realloced), the second iteration would return without repairing. This code needs some cleanup: splitting out a "maybe realloc snapshots table" helper would have avoided this, that will be in the next patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:43 -04:00
Bharadwaj Raju	570f5050bb	bcachefs: use nonblocking variant of print_string_as_lines in error path The inconsistency error path calls print_string_as_lines, which calls console_lock, which is a potentially-sleeping function and so can't be called in an atomic context. Replace calls to it with the nonblocking variant which is safe to call. Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:42 -04:00
Kent Overstreet	b2ffadcc7f	bcachefs: Fix scheduling while atomic from logging changes Two fixes from the recent logging changes: bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt context, or with rcu_read_lock() held. The one syzbot found is in bch2_bkey_pick_read_device bch2_dev_rcu bch2_fs_inconsistent We're starting to switch to lift the printbufs up to higher levels so we can emit better log messages and print them all in one go (avoid garbling), so that conversion will help with spotting these in the future; when we declare a printbuf it must be flagged if we're in an atomic context. Secondly, in btree_node_write_endio: 00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321 00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6 00085 preempt_count: 10001, expected: 0 00085 RCU nest depth: 0, expected: 0 00085 4 locks held by bch-reclaim/fa6/618: 00085 #0: ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198 00085 #1: ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440 00085 #2: ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440 00085 #3: ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130 00085 irq event stamp: 328 00085 hardirqs last enabled at (327): [<ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0 00085 hardirqs last disabled at (328): [<ffffffc080971a10>] el1_interrupt+0x20/0x60 00085 softirqs last enabled at (0): [<ffffffc08002f920>] copy_process+0x7c8/0x2118 00085 softirqs last disabled at (0): [<0000000000000000>] 0x0 00085 Preemption disabled at: 00085 [<ffffffc08003ada0>] irq_enter_rcu+0x18/0x90 00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted 6.14.0-rc6-ktest-g04630bde23e8 #18798 00085 Hardware name: linux,dummy-virt (DT) 00085 Call trace: 00085 show_stack+0x1c/0x30 (C) 00085 dump_stack_lvl+0x84/0xc0 00085 dump_stack+0x14/0x20 00085 __might_resched+0x180/0x288 00085 __might_sleep+0x4c/0x88 00085 __kmalloc_node_track_caller_noprof+0x34c/0x3e0 00085 krealloc_noprof+0x1a0/0x2d8 00085 bch2_printbuf_make_room+0x9c/0x120 00085 bch2_prt_printf+0x60/0x1b8 00085 btree_node_write_endio+0x1b0/0x2d8 00085 bio_endio+0x138/0x1f0 00085 btree_node_write_endio+0xe8/0x2d8 00085 bio_endio+0x138/0x1f0 00085 blk_update_request+0x220/0x4c0 00085 blk_mq_end_request+0x28/0x148 00085 virtblk_request_done+0x64/0xe8 00085 blk_mq_complete_request+0x34/0x40 00085 virtblk_done+0x78/0x130 00085 vring_interrupt+0x6c/0xb0 00085 __handle_irq_event_percpu+0x8c/0x2e0 00085 handle_irq_event+0x50/0xb0 00085 handle_fasteoi_irq+0xc4/0x250 00085 handle_irq_desc+0x44/0x60 00085 generic_handle_domain_irq+0x20/0x30 00085 gic_handle_irq+0x54/0xc8 00085 call_on_irq_stack+0x24/0x40 Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:42 -04:00
Wentao Liang	9364f17ba4	bcachefs: Add error handling for zlib_deflateInit2() In attempt_compress(), the return value of zlib_deflateInit2() needs to be checked. A proper implementation can be found in pstore_compress(). Add an error check and return 0 immediately if the initialzation fails. Fixes: `986e9842fb` ("bcachefs: Compression levels") Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:42 -04:00

1 2 3 4 5 ...

1350748 Commits