mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
901b3290bd
12854 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
1e15510b71 |
Including fixes from bluetooth. We didn't get netfilter or wireless PRs
this week, so next week's PR is probably going to be bigger. A healthy dose of fixes for bugs introduced in the current release nonetheless. Current release - regressions: - Bluetooth: always allow SCO packets for user channel - af_unix: fix memory leak in unix_dgram_sendmsg() - rxrpc: - remove redundant peer->mtu_lock causing lockdep splats - fix spinlock flavor issues with the peer record hash - eth: iavf: fix circular lock dependency with netdev_lock - net: use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net() RDMA driver register notifier after the device Current release - new code bugs: - ethtool: fix ioctl confusing drivers about desired HDS user config - eth: ixgbe: fix media cage present detection for E610 device Previous releases - regressions: - loopback: avoid sending IP packets without an Ethernet header - mptcp: reset connection when MPTCP opts are dropped after join Previous releases - always broken: - net: better track kernel sockets lifetime - ipv6: fix dst ref loop on input in seg6 and rpl lw tunnels - phy: qca807x: use right value from DTS for DAC_DSP_BIAS_CURRENT - eth: enetc: number of error handling fixes - dsa: rtl8366rb: reshuffle the code to fix config / build issue with LED support Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfAj8MACgkQMUZtbf5S IrtoTRAAj0XNWXGWZdOuVub0xhtjsPLoZktux4AzsELqaynextkJW6w9pG5qVrWu UZt3a3bC7u6+JoTgb+GQVhyjuuVjv6NOSuLK3FS+NePW8ijhLP5oTg6eD0MQS60Z wa9yQx3yL1Kvb6b80Go/3WgRX9V6Rx8zlROAl/gOlZ9NKB0rSVqnueZGPjGZJf1a ayyXsmzRykshbr5Ic0e+b74hFP3DGxVgHjIob1C4kk/Q+WOfQKnm3C3fnZ/R2QcS 7B7kSk9WokvNwk3hJc7ZtFxJbrQKSSuRI8nCD93hBjTn76yJjlPicJ9b6HJoGhE/ Pwt7fBnDCCA00x6ejD3OrurR+/80PbPtyvNtgMMTD49wSwxQpQ6YpTMInnodCzAV NvIhkkXBprI0kiTT4dDpNoeFMKD3i07etKpvMfEoDzZR7vgUsj6aClSmuxILeU9a crFC4Vp5SgyU1/lUPDiG4dfbd8s4hfM4bZ+d0zAtth3/rQA7/EA6dLqbRXXWX7h5 Gl6egKWPsSl+WUgFjpBjYfhqrQsc06hxaCh0SQYH6SnS3i+PlMU2uRJYZMLQ66rX QsSQOyqCEHwd1qnrLedg9rCniv+DzOJf+qh+H0eY9WhuOay+8T52OHLxpRjSHxBo SCP+qQxSX0qhH5DtUiOV50Fwg19UhJJyWd0COfv5SIGm/I1dUOY= =+Ci7 -----END PGP SIGNATURE----- Merge tag 'net-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth. We didn't get netfilter or wireless PRs this week, so next week's PR is probably going to be bigger. A healthy dose of fixes for bugs introduced in the current release nonetheless. Current release - regressions: - Bluetooth: always allow SCO packets for user channel - af_unix: fix memory leak in unix_dgram_sendmsg() - rxrpc: - remove redundant peer->mtu_lock causing lockdep splats - fix spinlock flavor issues with the peer record hash - eth: iavf: fix circular lock dependency with netdev_lock - net: use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net() RDMA driver register notifier after the device Current release - new code bugs: - ethtool: fix ioctl confusing drivers about desired HDS user config - eth: ixgbe: fix media cage present detection for E610 device Previous releases - regressions: - loopback: avoid sending IP packets without an Ethernet header - mptcp: reset connection when MPTCP opts are dropped after join Previous releases - always broken: - net: better track kernel sockets lifetime - ipv6: fix dst ref loop on input in seg6 and rpl lw tunnels - phy: qca807x: use right value from DTS for DAC_DSP_BIAS_CURRENT - eth: enetc: number of error handling fixes - dsa: rtl8366rb: reshuffle the code to fix config / build issue with LED support" * tag 'net-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits) net: ti: icss-iep: Reject perout generation request idpf: fix checksums set in idpf_rx_rsc() selftests: drv-net: Check if combined-count exists net: ipv6: fix dst ref loop on input in rpl lwt net: ipv6: fix dst ref loop on input in seg6 lwt usbnet: gl620a: fix endpoint checking in genelink_bind() net/mlx5: IRQ, Fix null string in debug print net/mlx5: Restore missing trace event when enabling vport QoS net/mlx5: Fix vport QoS cleanup on error net: mvpp2: cls: Fixed Non IP flow, with vlan tag flow defination. af_unix: Fix memory leak in unix_dgram_sendmsg() net: Handle napi_schedule() calls from non-interrupt net: Clear old fragment checksum value in napi_reuse_skb gve: unlink old napi when stopping a queue using queue API net: Use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net(). tcp: Defer ts_recent changes until req is owned net: enetc: fix the off-by-one issue in enetc_map_tx_tso_buffs() net: enetc: remove the mm_lock from the ENETC v4 driver net: enetc: add missing enetc4_link_deinit() net: enetc: update UDP checksum when updating originTimestamp field ... |
||
![]() |
8d52da23b6 |
tcp: Defer ts_recent changes until req is owned
Recently a bug was discovered where the server had entered TCP_ESTABLISHED
state, but the upper layers were not notified.
The same 5-tuple packet may be processed by different CPUSs, so two
CPUs may receive different ack packets at the same time when the
state is TCP_NEW_SYN_RECV.
In that case, req->ts_recent in tcp_check_req may be changed concurrently,
which will probably cause the newsk's ts_recent to be incorrectly large.
So that tcp_validate_incoming will fail. At this point, newsk will not be
able to enter the TCP_ESTABLISHED.
cpu1 cpu2
tcp_check_req
tcp_check_req
req->ts_recent = rcv_tsval = t1
req->ts_recent = rcv_tsval = t2
syn_recv_sock
tcp_sk(child)->rx_opt.ts_recent = req->ts_recent = t2 // t1 < t2
tcp_child_process
tcp_rcv_state_process
tcp_validate_incoming
tcp_paws_check
if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win)
// t2 - t1 > paws_win, failed
tcp_v4_do_rcv
tcp_rcv_state_process
// TCP_ESTABLISHED
The cpu2's skb or a newly received skb will call tcp_v4_do_rcv to get
the newsk into the TCP_ESTABLISHED state, but at this point it is no
longer possible to notify the upper layer application. A notification
mechanism could be added here, but the fix is more complex, so the
current fix is used.
In tcp_check_req, req->ts_recent is used to assign a value to
tcp_sk(child)->rx_opt.ts_recent, so removing the change in req->ts_recent
and changing tcp_sk(child)->rx_opt.ts_recent directly after owning the
req fixes this bug.
Fixes:
|
||
![]() |
18912c5206 |
tcp: devmem: don't write truncated dmabuf CMSGs to userspace
Currently, we report -ETOOSMALL (err) only on the first iteration
(!sent). When we get put_cmsg error after a bunch of successful
put_cmsg calls, we don't signal the error at all. This might be
confusing on the userspace side which will see truncated CMSGs
but no MSG_CTRUNC signal.
Consider the following case:
- sizeof(struct cmsghdr) = 16
- sizeof(struct dmabuf_cmsg) = 24
- total cmsg size (CMSG_LEN) = 40 (16+24)
When calling recvmsg with msg_controllen=60, the userspace
will receive two(!) dmabuf_cmsg(s), the first one will
be a valid one and the second one will be silently truncated. There is no
easy way to discover the truncation besides doing something like
"cm->cmsg_len != CMSG_LEN(sizeof(dmabuf_cmsg))".
Introduce new put_devmem_cmsg wrapper that reports an error instead
of doing the truncation. Mina suggests that it's the intended way
this API should work.
Note that we might now report MSG_CTRUNC when the users (incorrectly)
call us with msg_control == NULL.
Fixes:
|
||
![]() |
319fc77f8f |
BPF fixes:
- Fix a soft-lockup in BPF arena_map_free on 64k page size kernels (Alan Maguire) - Fix a missing allocation failure check in BPF verifier's acquire_lock_state (Kumar Kartikeya Dwivedi) - Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb to the raw_tp_null_args set (Kuniyuki Iwashima) - Fix a deadlock when freeing BPF cgroup storage (Abel Wu) - Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex (Andrii Nakryiko) - Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is accessing skb data not containing an Ethernet header (Shigeru Yoshida) - Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai) - Several BPF sockmap fixes to address incorrect TCP copied_seq calculations, which prevented correct data reads from recv(2) in user space (Jiayuan Chen) - Two fixes for BPF map lookup nullness elision (Daniel Xu) - Fix a NULL-pointer dereference from vmlinux BTF lookup in bpf_sk_storage_tracing_allowed (Jared Kangas) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -----BEGIN PGP SIGNATURE----- iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ7evlRUcZGFuaWVsQGlv Z2VhcmJveC5uZXQACgkQ2yufC7HISIPwHgD/dTvM00Ck4Q73fPivyT7tcqxeXJlD D6ggzWl/SG9LAbwA/2/cSgAM9Jm1g7ddvn/S9QaDYOs5GmFl6urq6krs+tYD =FCs9 -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull BPF fixes from Daniel Borkmann: - Fix a soft-lockup in BPF arena_map_free on 64k page size kernels (Alan Maguire) - Fix a missing allocation failure check in BPF verifier's acquire_lock_state (Kumar Kartikeya Dwivedi) - Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb to the raw_tp_null_args set (Kuniyuki Iwashima) - Fix a deadlock when freeing BPF cgroup storage (Abel Wu) - Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex (Andrii Nakryiko) - Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is accessing skb data not containing an Ethernet header (Shigeru Yoshida) - Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai) - Several BPF sockmap fixes to address incorrect TCP copied_seq calculations, which prevented correct data reads from recv(2) in user space (Jiayuan Chen) - Two fixes for BPF map lookup nullness elision (Daniel Xu) - Fix a NULL-pointer dereference from vmlinux BTF lookup in bpf_sk_storage_tracing_allowed (Jared Kangas) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests: bpf: test batch lookup on array of maps with holes bpf: skip non exist keys in generic_map_lookup_batch bpf: Handle allocation failure in acquire_lock_state bpf: verifier: Disambiguate get_constant_map_key() errors bpf: selftests: Test constant key extraction on irrelevant maps bpf: verifier: Do not extract constant map keys for irrelevant maps bpf: Fix softlockup in arena_map_free on 64k page kernel net: Add rx_skb of kfree_skb to raw_tp_null_args[]. bpf: Fix deadlock when freeing cgroup storage selftests/bpf: Add strparser test for bpf selftests/bpf: Fix invalid flag of recv() bpf: Disable non stream socket for strparser bpf: Fix wrong copied_seq calculation strparser: Add read_sock callback bpf: avoid holding freeze_mutex during mmap operation bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic selftests/bpf: Adjust data size to have ETH_HLEN bpf, test_run: Fix use-after-free issue in eth_skb_pkt_type() bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed |
||
![]() |
9b6412e697 |
tcp: drop secpath at the same time as we currently drop dst
Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
running tests that boil down to:
- create a pair of netns
- run a basic TCP test over ipcomp6
- delete the pair of netns
The xfrm_state found on spi_byaddr was not deleted at the time we
delete the netns, because we still have a reference on it. This
lingering reference comes from a secpath (which holds a ref on the
xfrm_state), which is still attached to an skb. This skb is not
leaked, it ends up on sk_receive_queue and then gets defer-free'd by
skb_attempt_defer_free.
The problem happens when we defer freeing an skb (push it on one CPU's
defer_list), and don't flush that list before the netns is deleted. In
that case, we still have a reference on the xfrm_state that we don't
expect at this point.
We already drop the skb's dst in the TCP receive path when it's no
longer needed, so let's also drop the secpath. At this point,
tcp_filter has already called into the LSM hooks that may require the
secpath, so it should not be needed anymore. However, in some of those
places, the MPTCP extension has just been attached to the skb, so we
cannot simply drop all extensions.
Fixes:
|
||
![]() |
4eae0ee0f1 |
arp: switch to dev_getbyhwaddr() in arp_req_set_public()
The arp_req_set_public() function is called with the rtnl lock held,
which provides enough synchronization protection. This makes the RCU
variant of dev_getbyhwaddr() unnecessary. Switch to using the simpler
dev_getbyhwaddr() function since we already have the required rtnl
locking.
This change helps maintain consistency in the networking code by using
the appropriate helper function for the existing locking context.
Since we're not holding the RCU read lock in arp_req_set_public()
existing code could trigger false positive locking warnings.
Fixes:
|
||
![]() |
f5da7c4518 |
tcp: adjust rcvq_space after updating scaling ratio
Since commit under Fixes we set the window clamp in accordance to newly measured rcvbuf scaling_ratio. If the scaling_ratio decreased significantly we may put ourselves in a situation where windows become smaller than rcvq_space, preventing tcp_rcv_space_adjust() from increasing rcvbuf. The significant decrease of scaling_ratio is far more likely since commit |
||
![]() |
a42b69f692 |
arp: use RCU protection in arp_xmit()
arp_xmit() can be called without RTNL or RCU protection.
Use RCU protection to avoid potential UAF.
Fixes:
|
||
![]() |
4b8474a095 |
ipv4: icmp: convert to dev_net_rcu()
__icmp_send() must ensure rcu_read_lock() is held, as spotted
by Jakub.
Other ICMP uses of dev_net() seem safe, change them to dev_net_rcu()
to get LOCKDEP support.
Fixes:
|
||
![]() |
139512191b |
ipv4: use RCU protection in __ip_rt_update_pmtu()
__ip_rt_update_pmtu() must use RCU protection to make sure the net structure it reads does not disappear. Fixes: |
||
![]() |
719817cd29 |
ipv4: use RCU protection in inet_select_addr()
inet_select_addr() must use RCU protection to make
sure the net structure it reads does not disappear.
Fixes:
|
||
![]() |
dd205fcc33 |
ipv4: use RCU protection in rt_is_expired()
rt_is_expired() must use RCU protection to make
sure the net structure it reads does not disappear.
Fixes:
|
||
![]() |
71b8471c93 |
ipv4: use RCU protection in ipv4_default_advmss()
ipv4_default_advmss() must use RCU protection to make
sure the net structure it reads does not disappear.
Fixes:
|
||
![]() |
235174b2be |
udp: gso: do not drop small packets when PMTU reduces
Commit |
||
![]() |
8c670bdfa5 |
tcp: correct handling of extreme memory squeeze
Testing with iperf3 using the "pasta" protocol splicer has revealed
a problem in the way tcp handles window advertising in extreme memory
squeeze situations.
Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.
However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
This means that this side's notion of the current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the sitution.
The problem occurs on the iperf3 server side, and the socket in question
is a completely regular socket with the default settings for the
fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
The following excerpt of a logging session, with own comments added,
shows more in detail what is happening:
// tcp_v4_rcv(->)
// tcp_rcv_established(->)
[5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]: tcp_data_queue(->)
[5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
[OFO queue: gap: 65480, len: 0]
[5201<->39222]: tcp_data_queue(<-)
[5201<->39222]: __tcp_transmit_skb(->)
[tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]: tcp_select_window(->)
[5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
[tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
returning 0
[5201<->39222]: tcp_select_window(<-)
[5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]: [__tcp_transmit_skb(<-)
[5201<->39222]: tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)
// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]: __tcp_cleanup_rbuf(<-)
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
returning 6104 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)
// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]: __tcp_cleanup_rbuf(<-)
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
returning 131072 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)
// The above pattern repeats again and again, since nothing changes
// between the reads.
[...]
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]: __tcp_cleanup_rbuf(<-)
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
returning 54672 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)
// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.
Note that well behaved endpoints will send win0 probes, so the problem
will not occur.
Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.
We fix this by updating the socket state correctly when a packet has
been dropped because of memory exhaustion and we have to advertize
a zero window.
Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.
Fixes:
|
||
![]() |
36b62df568 |
bpf: Fix wrong copied_seq calculation
'sk->copied_seq' was updated in the tcp_eat_skb() function when the action
of a BPF program was SK_REDIRECT. For other actions, like SK_PASS, the
update logic for 'sk->copied_seq' was moved to tcp_bpf_recvmsg_parser()
to ensure the accuracy of the 'fionread' feature.
It works for a single stream_verdict scenario, as it also modified
sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb
to remove updating 'sk->copied_seq'.
However, for programs where both stream_parser and stream_verdict are
active (strparser purpose), tcp_read_sock() was used instead of
tcp_read_skb() (sk_data_ready->strp_data_ready->tcp_read_sock).
tcp_read_sock() now still updates 'sk->copied_seq', leading to duplicate
updates.
In summary, for strparser + SK_PASS, copied_seq is redundantly calculated
in both tcp_read_sock() and tcp_bpf_recvmsg_parser().
The issue causes incorrect copied_seq calculations, which prevent
correct data reads from the recv() interface in user-land.
We do not want to add new proto_ops to implement a new version of
tcp_read_sock, as this would introduce code complexity [1].
We could have added noack and copied_seq to desc, and then called
ops->read_sock. However, unfortunately, other modules didn’t fully
initialize desc to zero. So, for now, we are directly calling
tcp_read_sock_noack() in tcp_bpf.c.
[1]: https://lore.kernel.org/bpf/20241218053408.437295-1-mrpre@163.com
Fixes:
|
||
![]() |
463ec95a16 |
ipsec-2025-01-27
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmeXIF4ACgkQrB3Eaf9P W7fRMA/+Js2x0HNA3+6SMb5nJzY6lywi1BIRAzstyfd6EsxbHlgfdWYCCpixboA0 /ZfDe7yPND/ewPIQLT9eO6hk9YzuAVhYUkdIcDC5jdFDNbh9dDqBdyu5P/5spsi9 9SdFEucoOsKBP4ejmSvtwGsVNIf/1vB8hFqYxB+vh8+d/g8PHrI3xxk+2b7KkIGS ms+IyDCoVdCGQUOp4BGtEQbzXtx67diH5dcfwg8/DJpSMbfqO3ZFRG7gPu8C5Igt cxVSCW67rv/zzPkGPv8B+nczAdVUZ3OFXgEWxdDCN/mUbFKwxUcIDxZVJMfBBAUP lcjsbzmNfj2PNMLZFe/5LuU6o+sFEZdxmTPmvbb+lSYrRHx2oz2/Jb871gEj8rTC vNZ+1Lu1k7QRjEPiO1fe85vWdmU4G81+WAzC88nD0KYLDUN4c+MmxUFQkKbAxf6p e6VCihcKqi5Sa6R73Ohm87iyiSuv8WvkyVSM0XgQrkXWDFy5Jp2Bo25pW0QgVxK+ l/aHhDA+YHFEOZTcjZsh/EdKlQRIxBNJ3ualITkjd2T+A1WyWm0A3S+kYZQCKqiM WGGWM3oVNXkUAaRxvURNvmXqO+hPeKfIElDeVrOUjG8zQ+EktKcg4KpDQb2BGJCj s9ksFj0pplR4GHxUrFmkEPxJWYKpFqUYCZMJDnBnHFm1ykC7QGM= =pg+h -----END PGP SIGNATURE----- Merge tag 'ipsec-2025-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2025-01-27 1) Fix incrementing the upper 32 bit sequence numbers for GSO skbs. From Jianbo Liu. 2) Fix an out-of-bounds read on xfrm state lookup. From Florian Westphal. 3) Fix secpath handling on packet offload mode. From Alexandre Cassen. 4) Fix the usage of skb->sk in the xfrm layer. 5) Don't disable preemption while looking up cache state to fix PREEMPT_RT. From Sebastian Sewior. * tag 'ipsec-2025-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Don't disable preemption while looking up cache state. xfrm: Fix the usage of skb->sk xfrm: delete intermediate secpath entry in packet offload mode xfrm: state: fix out-of-bounds read during lookup xfrm: replay: Fix the update of replay_esn->oseq_hi for GSO ==================== Link: https://patch.msgid.link/20250127060757.3946314-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
15a901361e |
ipmr: do not call mr_mfc_uses_dev() for unres entries
syzbot found that calling mr_mfc_uses_dev() for unres entries
would crash [1], because c->mfc_un.res.minvif / c->mfc_un.res.maxvif
alias to "struct sk_buff_head unresolved", which contain two pointers.
This code never worked, lets remove it.
[1]
Unable to handle kernel paging request at virtual address ffff5fff2d536613
KASAN: maybe wild-memory-access in range [0xfffefff96a9b3098-0xfffefff96a9b309f]
Modules linked in:
CPU: 1 UID: 0 PID: 7321 Comm: syz.0.16 Not tainted 6.13.0-rc7-syzkaller-g1950a0af2d55 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mr_mfc_uses_dev net/ipv4/ipmr_base.c:290 [inline]
pc : mr_table_dump+0x5a4/0x8b0 net/ipv4/ipmr_base.c:334
lr : mr_mfc_uses_dev net/ipv4/ipmr_base.c:289 [inline]
lr : mr_table_dump+0x694/0x8b0 net/ipv4/ipmr_base.c:334
Call trace:
mr_mfc_uses_dev net/ipv4/ipmr_base.c:290 [inline] (P)
mr_table_dump+0x5a4/0x8b0 net/ipv4/ipmr_base.c:334 (P)
mr_rtm_dumproute+0x254/0x454 net/ipv4/ipmr_base.c:382
ipmr_rtm_dumproute+0x248/0x4b4 net/ipv4/ipmr.c:2648
rtnl_dump_all+0x2e4/0x4e8 net/core/rtnetlink.c:4327
rtnl_dumpit+0x98/0x1d0 net/core/rtnetlink.c:6791
netlink_dump+0x4f0/0xbc0 net/netlink/af_netlink.c:2317
netlink_recvmsg+0x56c/0xe64 net/netlink/af_netlink.c:1973
sock_recvmsg_nosec net/socket.c:1033 [inline]
sock_recvmsg net/socket.c:1055 [inline]
sock_read_iter+0x2d8/0x40c net/socket.c:1125
new_sync_read fs/read_write.c:484 [inline]
vfs_read+0x740/0x970 fs/read_write.c:565
ksys_read+0x15c/0x26c fs/read_write.c:708
Fixes:
|
||
![]() |
0ad9617c78 |
Networking changes for 6.14.
Core ---- - More core refactoring to reduce the RTNL lock contention, including preparatory work for the per-network namespace RTNL lock, replacing RTNL lock with a per device-one to protect NAPI-related net device data and moving synchronize_net() calls outside such lock. - Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge and more specific TCP coverage. - Reduce network namespace tear-down time by removing per-subsystems synchronize_net() in tipc and sched. - Add flow label selector support for fib rules, allowing traffic redirection based on such header field. Netfilter --------- - Do not remove netdev basechain when last device is gone, allowing netdev basechains without devices. - Revisit the flowtable teardown strategy, dealing better with fin, reset and re-open events. - Scale-up IP-vs connection dumping by avoiding linear search on each restart. Protocols --------- - A significant XDP socket refactor, consolidating and optimizing several helpers into the core - Better scaling of ICMP rate-limiting, by removing false-sharing in inet peers handling. - Introduces netlink notifications for multicast IPv4 and IPv6 address changes. - Add ipsec support for IP-TFS/AggFrag encapsulation, allowing aggregation and fragmentation of the inner IP. - Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to avoid local port exhaustion issues when the average connection lifetime is very short. - Support updating keys (re-keying) for connections using kernel TLS (for TLS 1.3 only). - Support ipv4-mapped ipv6 address clients in smc-r v2. - Add support for jumbo data packet transmission in RxRPC sockets, gluing multiple data packets in a single UDP packet. - Support RxRPC RACK-TLP to manage packet loss and retransmission in conjunction with the congestion control algorithm. Driver API ---------- - Introduce a unified and structured interface for reporting PHY statistics, exposing consistent data across different H/W via ethtool. - Make timestamping selectable, allow the user to select the desired hwtstamp provider (PHY or MAC) administratively. - Add support for configuring a header-data-split threshold (HDS) value via ethtool, to deal with partial or buggy H/W implementation. - Consolidate DSA drivers Energy Efficiency Ethernet support. - Add EEE management to phylink, making use of the phylib implementation. - Add phylib support for in-band capabilities negotiation. - Simplify how phylib-enabled mac drivers expose the supported interfaces. Tests and tooling ----------------- - Make the YNL tool package-friendly to make it easier to deploy it separately from the kernel. - Increase TCP selftest coverage importing several packetdrill test-cases. - Regenerate the ethtool uapi header from the YNL spec, to ease maintenance and future development. - Add YNL support for decoding the link types used in net self-tests, allowing a single build to run both net and drivers/net. Drivers ------- - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - add cross E-Switch QoS support - add SW Steering support for ConnectX-8 - implement support for HW-Managed Flow Steering, improving the rule deletion/insertion rate - support for multi-host LAG - Intel (ixgbe, ice, igb): - ice: add support for devlink health events - ixgbe: add initial support for E610 chipset variant - igb: add support for AF_XDP zero-copy - Meta: - add support for basic RSS config - allow changing the number of channels - add hardware monitoring support - Broadcom (bnxt): - implement TCP data split and HDS threshold ethtool support, enabling Device Memory TCP. - Marvell Octeon: - implement egress ipsec offload support for the cn10k family - Hisilicon (HIBMC): - implement unicast MAC filtering - Ethernet NICs embedded and virtual: - Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding contented atomic operations for drop counters - Freescale: - quicc: phylink conversion - enetc: support Tx and Rx checksum offload and improve TSO performances - MediaTek: - airoha: introduce support for ETS and HTB Qdisc offload - Microchip: - lan78XX USB: preparation work for phylink conversion - Synopsys (stmmac): - support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45 - refactor EEE support to leverage the new driver API - optimize DMA and cache access to increase raw RX performances by 40% - TI: - icssg-prueth: add multicast filtering support for VLAN interface - netkit: - add ability to configure head/tailroom - VXLAN: - accepts packets with user-defined reserved bit - Ethernet switches: - Microchip: - lan969x: add RGMII support - lan969x: improve TX and RX performance using the FDMA engine - nVidia/Mellanox: - move Tx header handling to PCI driver, to ease XDP support - Ethernet PHYs: - Texas Instruments DP83822: - add support for GPIO2 clock output - Realtek: - 8169: add support for RTL8125D rev.b - rtl822x: add hwmon support for the temperature sensor - Microchip: - add support for RDS PTP hardware - consolidate periodic output signal generation - CAN: - several DT-bindings to DT schema conversions - tcan4x5x: - add HW standby support - support nWKRQ voltage selection - kvaser: - allowing Bus Error Reporting runtime configuration - WiFi: - the on-going Multi-Link Operation (MLO) effort continues, affecting both the stack and in drivers - mac80211/cfg80211: - Emergency Preparedness Communication Services (EPCS) station mode support - support for adding and removing station links for MLO - add support for WiFi 7/EHT mesh over 320 MHz channels - report Tx power info for each link - RealTek (rtw88): - enable USB Rx aggregation and USB 3 to improve performance - LED support - RealTek (rtw89): - refactor power save to support Multi-Link Operations - add support for RTL8922AE-VS variant - MediaTek (mt76): - single wiphy multiband support (preparation for MLO) - p2p device support - add TP-Link TXE50UH USB adapter support - Qualcomm (ath10k): - support for the QCA6698AQ IP core - Qualcomm (ath12k): - enable MLO for QCN9274 - Bluetooth: - Allow sysfs to trigger hdev reset, to allow recovering devices not responsive from user-space - MediaTek: add support for MT7922, MT7925, MT7921e devices - Realtek: add support for RTL8851BE devices - Qualcomm: add support for WCN785x devices - ISO: allow BIG re-sync Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmePf5YSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkUcMQALblhkGTxurnfT+yK+Bsuhn2LoHl2RPN 4u2Kjkzm+2FYgcw6lS17cFXsnfAPlRIpmhnmKk1EBgsBdkuL29c+jtqnljA2bboD tIMhMgWiaLS3xgEMrLeKnseIo0G9mviQRphGeZPFTaLb4Ww/bd5LAp4ZGc5oij76 tURatC3b6MuO4Lt5U+jWKnRwviXku8udHkVHXlvPdirawHCVinmx3tvce/BI/MaD eUOp6ZeJCPCOLtk7b8WEyxxvdY0f6D9ed82qfPDHjb94SJv+Vxb38RZtNuApIjn9 S0KdlNih/4flDy17LDxGYSyFps78lUFRbpqmsUlnZkyLXpsph7/WTvAmMAFcrX0K UgQ/F/q5GAvcP5WZcCj5+tZaRmfKQraQirXMtYU/Uj50qCnSU7ssyACASt23GLZ8 OF8tCLlm9lLOU1B6Ofkul1Dbo5f0Xpaghga4dFb0kzSfbm78fTUnqBNsJ7jIkWfi fD6dO+fg+p2ZMD0CACGo3CNxQuJmaQWg6BIDeno6God8kZ6qBMxY/sFr4qozrvFH x/FgQq8dgc8WLmaPejKiNIPkdQepXrIiv3T9jgMVyEjJnWB/LBfyWKSQOdTfnLs+ rgr4YMV6XW4bx0fYqTI8B9jZ+FCWbG6sn4UtRTHITKcd3FSvd8Y+PHa5YyCUWvJM l8pePMGF0XVF =hrsp -----END PGP SIGNATURE----- Merge tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "This is slightly smaller than usual, with the most interesting work being still around RTNL scope reduction. Core: - More core refactoring to reduce the RTNL lock contention, including preparatory work for the per-network namespace RTNL lock, replacing RTNL lock with a per device-one to protect NAPI-related net device data and moving synchronize_net() calls outside such lock. - Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge and more specific TCP coverage. - Reduce network namespace tear-down time by removing per-subsystems synchronize_net() in tipc and sched. - Add flow label selector support for fib rules, allowing traffic redirection based on such header field. Netfilter: - Do not remove netdev basechain when last device is gone, allowing netdev basechains without devices. - Revisit the flowtable teardown strategy, dealing better with fin, reset and re-open events. - Scale-up IP-vs connection dumping by avoiding linear search on each restart. Protocols: - A significant XDP socket refactor, consolidating and optimizing several helpers into the core - Better scaling of ICMP rate-limiting, by removing false-sharing in inet peers handling. - Introduces netlink notifications for multicast IPv4 and IPv6 address changes. - Add ipsec support for IP-TFS/AggFrag encapsulation, allowing aggregation and fragmentation of the inner IP. - Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to avoid local port exhaustion issues when the average connection lifetime is very short. - Support updating keys (re-keying) for connections using kernel TLS (for TLS 1.3 only). - Support ipv4-mapped ipv6 address clients in smc-r v2. - Add support for jumbo data packet transmission in RxRPC sockets, gluing multiple data packets in a single UDP packet. - Support RxRPC RACK-TLP to manage packet loss and retransmission in conjunction with the congestion control algorithm. Driver API: - Introduce a unified and structured interface for reporting PHY statistics, exposing consistent data across different H/W via ethtool. - Make timestamping selectable, allow the user to select the desired hwtstamp provider (PHY or MAC) administratively. - Add support for configuring a header-data-split threshold (HDS) value via ethtool, to deal with partial or buggy H/W implementation. - Consolidate DSA drivers Energy Efficiency Ethernet support. - Add EEE management to phylink, making use of the phylib implementation. - Add phylib support for in-band capabilities negotiation. - Simplify how phylib-enabled mac drivers expose the supported interfaces. Tests and tooling: - Make the YNL tool package-friendly to make it easier to deploy it separately from the kernel. - Increase TCP selftest coverage importing several packetdrill test-cases. - Regenerate the ethtool uapi header from the YNL spec, to ease maintenance and future development. - Add YNL support for decoding the link types used in net self-tests, allowing a single build to run both net and drivers/net. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - add cross E-Switch QoS support - add SW Steering support for ConnectX-8 - implement support for HW-Managed Flow Steering, improving the rule deletion/insertion rate - support for multi-host LAG - Intel (ixgbe, ice, igb): - ice: add support for devlink health events - ixgbe: add initial support for E610 chipset variant - igb: add support for AF_XDP zero-copy - Meta: - add support for basic RSS config - allow changing the number of channels - add hardware monitoring support - Broadcom (bnxt): - implement TCP data split and HDS threshold ethtool support, enabling Device Memory TCP. - Marvell Octeon: - implement egress ipsec offload support for the cn10k family - Hisilicon (HIBMC): - implement unicast MAC filtering - Ethernet NICs embedded and virtual: - Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding contented atomic operations for drop counters - Freescale: - quicc: phylink conversion - enetc: support Tx and Rx checksum offload and improve TSO performances - MediaTek: - airoha: introduce support for ETS and HTB Qdisc offload - Microchip: - lan78XX USB: preparation work for phylink conversion - Synopsys (stmmac): - support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45 - refactor EEE support to leverage the new driver API - optimize DMA and cache access to increase raw RX performances by 40% - TI: - icssg-prueth: add multicast filtering support for VLAN interface - netkit: - add ability to configure head/tailroom - VXLAN: - accepts packets with user-defined reserved bit - Ethernet switches: - Microchip: - lan969x: add RGMII support - lan969x: improve TX and RX performance using the FDMA engine - nVidia/Mellanox: - move Tx header handling to PCI driver, to ease XDP support - Ethernet PHYs: - Texas Instruments DP83822: - add support for GPIO2 clock output - Realtek: - 8169: add support for RTL8125D rev.b - rtl822x: add hwmon support for the temperature sensor - Microchip: - add support for RDS PTP hardware - consolidate periodic output signal generation - CAN: - several DT-bindings to DT schema conversions - tcan4x5x: - add HW standby support - support nWKRQ voltage selection - kvaser: - allowing Bus Error Reporting runtime configuration - WiFi: - the on-going Multi-Link Operation (MLO) effort continues, affecting both the stack and in drivers - mac80211/cfg80211: - Emergency Preparedness Communication Services (EPCS) station mode support - support for adding and removing station links for MLO - add support for WiFi 7/EHT mesh over 320 MHz channels - report Tx power info for each link - RealTek (rtw88): - enable USB Rx aggregation and USB 3 to improve performance - LED support - RealTek (rtw89): - refactor power save to support Multi-Link Operations - add support for RTL8922AE-VS variant - MediaTek (mt76): - single wiphy multiband support (preparation for MLO) - p2p device support - add TP-Link TXE50UH USB adapter support - Qualcomm (ath10k): - support for the QCA6698AQ IP core - Qualcomm (ath12k): - enable MLO for QCN9274 - Bluetooth: - Allow sysfs to trigger hdev reset, to allow recovering devices not responsive from user-space - MediaTek: add support for MT7922, MT7925, MT7921e devices - Realtek: add support for RTL8851BE devices - Qualcomm: add support for WCN785x devices - ISO: allow BIG re-sync" * tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1386 commits) net/rose: prevent integer overflows in rose_setsockopt() net: phylink: fix regression when binding a PHY net: ethernet: ti: am65-cpsw: streamline TX queue creation and cleanup net: ethernet: ti: am65-cpsw: streamline RX queue creation and cleanup net: ethernet: ti: am65-cpsw: ensure proper channel cleanup in error path ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL. ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL. ipv6: Move lifetime validation to inet6_rtm_newaddr(). ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr(). ipv6: Pass dev to inet6_addr_add(). ipv6: Convert inet6_ioctl() to per-netns RTNL. ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup(). ipv6: Hold rtnl_net_lock() in addrconf_dad_work(). ipv6: Hold rtnl_net_lock() in addrconf_verify_work(). ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL. ipv6: Add __in6_dev_get_rtnl_net(). net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags net: mii: Fix the Speed display when the network cable is not connected sysctl net: Remove macro checks for CONFIG_SYSCTL eth: bnxt: update header sizing defaults ... |
||
![]() |
f96a974170 |
lsm/stable-6.14 PR 20250121
-----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmeQFBoUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPvcA//XCdwMz0bGtWKv58nuyP8vkQx08n6 //olz/O8te3uWK5O3kRiarzFLwH8qsHQ6A7GYalwwix34hatR4ndJE0Y/guVRWa1 +aBmJxJ7Jm/q3fvpAEfqiSgreuE6kBoztlDOWEq+hUQGu4qfnQGm2EnvbvfFrAmN VheOfIQSU2KCL/Scc3FGnF6uru4WrqN0JJ9RbvrEpfdQgmcyTGLnQsZLljutWSIq kDWkteIr7cj3O9J45zpxZsTftvYSgVn/y1iKeXbHI4DBA1eheK12vsHB9AADKI1J GwHxOrnLpZtv+ICUKqcfFTmWTl+NmfJJurAT5KXKdBjL3xM5MoJlBvK1A5qE9CMo LaHVG/TZR2MmBaoM3EN+gvWhDgWlvT02Q/0cYaafTlVLMez3HtfctxN6OnCvTXTB Y8dqYClhhlBm/mHQwYfMoeKw4MftUpzEqBd1Nj7Qe8dbP0f/62Ca3K2B3D6Rf8QV pj3ryMlSWYV9mdTerruLNQexTGoN7l66jPwzdWpTbFeL3WmNtfCako8OZGbXgPIu Iahm3P+jnSVx8ZQro2c9zwdKXI5xiI335pCBbDZ8aX+JAsfj0OofHsFx5Q5diber M7tAEhxDqRisbpz7Ei+/LOAEGg2Z619XKg8ks4z6Y4P5PF7zEgeWTkZJk2iLbxXe 6LLOjmF7LLw+G4M= =fgyr -----END PGP SIGNATURE----- Merge tag 'lsm-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Improved handling of LSM "secctx" strings through lsm_context struct The LSM secctx string interface is from an older time when only one LSM was supported, migrate over to the lsm_context struct to better support the different LSMs we now have and make it easier to support new LSMs in the future. These changes explain the Rust, VFS, and networking changes in the diffstat. - Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are enabled Small tweak to be a bit smarter about when we build the LSM's common audit helpers. - Check for absurdly large policies from userspace in SafeSetID SafeSetID policies rules are fairly small, basically just "UID:UID", it easy to impose a limit of KMALLOC_MAX_SIZE on policy writes which helps quiet a number of syzbot related issues. While work is being done to address the syzbot issues through other mechanisms, this is a trivial and relatively safe fix that we can do now. - Various minor improvements and cleanups A collection of improvements to the kernel selftests, constification of some function parameters, removing redundant assignments, and local variable renames to improve readability. * tag 'lsm-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: lockdown: initialize local array before use to quiet static analysis safesetid: check size of policy writes net: corrections for security_secid_to_secctx returns lsm: rename variable to avoid shadowing lsm: constify function parameters security: remove redundant assignment to return variable lsm: Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are set selftests: refactor the lsm `flags_overset_lsm_set_self_attr` test binder: initialize lsm_context structure rust: replace lsm context+len with lsm_context lsm: secctx provider check on release lsm: lsm_context in security_dentry_init_security lsm: use lsm_context in security_inode_getsecctx lsm: replace context+len with lsm_context lsm: ensure the correct LSM context releaser |
||
![]() |
cf33d96f50 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts and no adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
![]() |
25c1a9ca53 |
tcp_cubic: fix incorrect HyStart round start detection
I noticed that HyStart incorrectly marks the start of rounds, leading to inaccurate measurements of ACK train lengths and resetting the `ca->sample_cnt` variable. This inaccuracy can impact HyStart's functionality in terminating exponential cwnd growth during Slow-Start, potentially degrading TCP performance. The issue arises because the changes introduced in commit |
||
![]() |
1620c88887 |
xfrm: Fix the usage of skb->sk
xfrm assumed to always have a full socket at skb->sk. This is not always true, so fix it by converting to a full socket before it is used. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Eric Dumazet <edumazet@google.com> |
||
![]() |
65a55aa7e6 |
ipv4: Prepare inet_rtm_getroute() to .flowi4_tos conversion.
Store rtm->rtm_tos in a dscp_t variable, which can then be used for setting fl4.flowi4_tos and also be passed as parameter of ip_route_input_rcu(). The .flowi4_tos field is going to be converted to dscp_t to ensure ECN bits aren't erroneously taken into account during route lookups. Having a dscp_t variable available will simplify that conversion, as we'll just have to drop the inet_dscp_to_dsfield() call. Note that we can't just convert rtm->rtm_tos to dscp_t because this structure is exported to user space. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/7bc1c7dc47ad1393569095d334521fae59af5bc7.1736944951.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
2c77bcb344 |
gre: Prepare ipgre_open() to .flowi4_tos conversion.
Use ip4h_dscp() to get the tunnel DSCP option as dscp_t, instead of manually masking the raw tos field with INET_DSCP_MASK. This will ease the conversion of fl4->flowi4_tos to dscp_t, which just becomes a matter of dropping the inet_dscp_to_dsfield() call. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/6c05a11afdc61530f1a4505147e0909ad51feb15.1736941806.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
2ee738e90e |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc8). Conflicts: drivers/net/ethernet/realtek/r8169_main.c |
||
![]() |
3440fa34ad |
inet: ipmr: fix data-races
Following fields of 'struct mr_mfc' can be updated
concurrently (no lock protection) from ip_mr_forward()
and ip6_mr_forward()
- bytes
- pkt
- wrong_if
- lastuse
They also can be read from other functions.
Convert bytes, pkt and wrong_if to atomic_long_t,
and use READ_ONCE()/WRITE_ONCE() for lastuse.
Fixes:
|
||
![]() |
d16b344790 |
tcp: add LINUX_MIB_PAWS_OLD_ACK SNMP counter
Prior patch in the series added TCP_RFC7323_PAWS_ACK drop reason. This patch adds the corresponding SNMP counter, for folks using nstat instead of tracing for TCP diagnostics. nstat -az | grep PAWSOldAck Suggested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Tested-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250113135558.3180360-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
124c4c32e9 |
tcp: add TCP_RFC7323_PAWS_ACK drop reason
XPS can cause reorders because of the relaxed OOO conditions for pure ACK packets. For hosts not using RFS, what can happpen is that ACK packets are sent on behalf of the cpu processing NIC interrupts, selecting TX queue A for ACK packet P1. Then a subsequent sendmsg() can run on another cpu. TX queue selection uses the socket hash and can choose another queue B for packets P2 (with payload). If queue A is more congested than queue B, the ACK packet P1 could be sent on the wire after P2. A linux receiver when processing P1 (after P2) currently increments LINUX_MIB_PAWSESTABREJECTED (TcpExtPAWSEstab) and use TCP_RFC7323_PAWS drop reason. It might also send a DUPACK if not rate limited. In order to better understand this pattern, this patch adds a new drop_reason : TCP_RFC7323_PAWS_ACK. For old ACKS like these, we no longer increment LINUX_MIB_PAWSESTABREJECTED and no longer sends a DUPACK, keeping credit for other more interesting DUPACK. perf record -e skb:kfree_skb -a perf script ... swapper 0 [148] 27475.438637: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK swapper 0 [208] 27475.438706: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK swapper 0 [208] 27475.438908: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK swapper 0 [148] 27475.439010: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK swapper 0 [148] 27475.439214: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK swapper 0 [208] 27475.439286: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK ... Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250113135558.3180360-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
ea98b61bdd |
tcp: add drop_reason support to tcp_disordered_ack()
Following patch is adding a new drop_reason to tcp_validate_incoming(). Change tcp_disordered_ack() to not return a boolean anymore, but a drop reason. Change its name to tcp_disordered_ack_check() Refactor tcp_validate_incoming() to ease the code review of the following patch, and reduce indentation level. This patch is a refactor, with no functional change. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250113135558.3180360-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
644f9108f3 |
udp: Make rehash4 independent in udp_lib_rehash()
As discussed in [0], rehash4 could be missed in udp_lib_rehash() when
udp hash4 changes while hash2 doesn't change. This patch fixes this by
moving rehash4 codes out of rehash2 checking, and then rehash2 and
rehash4 are done separately.
By doing this, we no longer need to call rehash4 explicitly in
udp_lib_hash4(), as the rehash callback in __ip4_datagram_connect takes
it. Thus, now udp_lib_hash4() returns directly if the sk is already
hashed.
Note that uhash4 may fail to work under consecutive connect(<dst
address>) calls because rehash() is not called with every connect(). To
overcome this, connect(<AF_UNSPEC>) needs to be called after the next
connect to a new destination.
[0]
https://lore.kernel.org/all/4761e466ab9f7542c68cdc95f248987d127044d2.1733499715.git.pabeni@redhat.com/
Fixes:
|
||
![]() |
7b24f164cf |
ipsec-next-2025-01-09
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmd/mNkACgkQrB3Eaf9P W7cQww/9Hnv7+wosuBxFW2o2ptgZThiwj/Ovz0oiWGWvctpN7VlpwmrDOsXQ+XMn xF6JBFEjnJanYoDBb78D0dMffJcMcUKZopTUU/ZMitSNr8aIHYiuB4SWPG1tqxl4 Ete7Mr2m3tS96YePQNnAaRZzEuGsx3BQb28VLTWl9So81MByD2OK4fsAbYz22Gg8 7A6tDHn1mUd9b2VG+LeeBZaDDFG8C0O2x4E/8Z3DX3z1N8y3LABPwZ38jcgTviKO 1ZldGrJT+PownBydu23bWDassKE2TuVvGH9e/SOPeQj8DJ4Lmd0bafMZTY6xwfNT RJCwhlzZUpYRXFzvcf3+U3egsqEWEemV7/LzAapdT0V9OqfLWMUh3b1jMA4KcblZ qmlm/MhZyXutDoeuASwtM4jgM3wGwovOofrKKsb13hD9VLBs5jFZmFSw5TlbmwE3 sjv7V4pFwNyfJnwQtyMmfuuHiy8w+fzqAA2GCg8mF3OosHABH/FOvEBP8xg1Vqu1 iKlLkByfyaCFD+GxTPzqSDvSB8nDzeZBgBM/ILGuwH0OWr+0gxBzl1sxrhAkw9hC Gf+4J3wg7EknTxfrJk4LyqfyS50GvUIzpLSkHYxDtQRd4zv75bxRzMvoZvuapTtN GGpWYftKO8n2kgLORQ0dTtFee3c4w/No+KXWsFdtRVNpyk3N0Yw= =U79q -----END PGP SIGNATURE----- Merge tag 'ipsec-next-2025-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== ipsec-next-2025-01-09 1) Implement the AGGFRAG protocol and basic IP-TFS (RFC9347) functionality. From Christian Hopps. 2) Support ESN context update to hardware for TX. From Jianbo Liu. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
8c7a6efc01 |
ipv4: route: fix drop reason being overridden in ip_route_input_slow
When jumping to 'martian_destination' a drop reason is always set but
that label falls-through the 'e_nobufs' one, overriding the value.
The behavior was introduced by the mentioned commit. The logic went
from,
goto martian_destination;
...
martian_destination:
...
e_inval:
err = -EINVAL;
goto out;
e_nobufs:
err = -ENOBUFS;
goto out;
to,
reason = ...;
goto martian_destination;
...
martian_destination:
...
e_nobufs:
reason = SKB_DROP_REASON_NOMEM;
goto out;
A 'goto out' is clearly missing now after 'martian_destination' to avoid
overriding the drop reason.
Fixes:
|
||
![]() |
14ea4cd1b1 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc7). Conflicts: |
||
![]() |
80fb40baba |
tcp: Annotate data-race around sk->sk_mark in tcp_v4_send_reset
This is a follow-up to |
||
![]() |
385f186aba |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc6). No conflicts. Adjacent changes: include/linux/if_vlan.h |
||
![]() |
aba74e639f |
Nothing major here. Over the last two weeks we gathered only around
two-thirds of our normal weekly fix count, but delaying sending these until -rc7 seemed like a really bad idea. AFAIK we have no bugs under investigation. One or two reverts for stuff for which we haven't gotten a proper fix will likely come in the next PR. Including fixes from wireles and netfilter. Current release - fix to a fix: - netfilter: nft_set_hash: unaligned atomic read on struct nft_set_ext - eth: gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup Previous releases - regressions: - net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets - mptcp: - fix sleeping rcvmsg sleeping forever after bad recvbuffer adjust - fix TCP options overflow - prevent excessive coalescing on receive, fix throughput - net: fix memory leak in tcp_conn_request() if map insertion fails - wifi: cw1200: fix potential NULL dereference after conversion to GPIO descriptors - phy: micrel: dynamically control external clock of KSZ PHY, fix suspend behavior Previous releases - always broken: - af_packet: fix VLAN handling with MSG_PEEK - net: restrict SO_REUSEPORT to inet sockets - netdev-genl: avoid empty messages in NAPI get - dsa: microchip: fix set_ageing_time function on KSZ9477 and LAN937X - eth: gve: XDP fixes around transmit, queue wakeup etc. - eth: ti: icssg-prueth: fix firmware load sequence to prevent time jump which breaks timesync related operations Misc: - netlink: specs: mptcp: add missing attr and improve documentation Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmd4FCgACgkQMUZtbf5S IruB7hAAkppr4+4kywDq/PIqgkisjYzpNIQIrphebUeIHIcAizWbJEL9CiHXnCQD w95+nVhzDmsWXdzE2AvwM/2YAByLP6CpgKEa+7rI3K5BQZ86lbzm+ftOOFYiF3eV 2v5yLfjgMVbzwqzuP3ivkNrtG95IGQFPobLl7tI9Wpgm3yD0+H8BCqtZNAwA4N6g mEdZD1jMOzkdINoJBg35O70B2GDq/WSS1N8dgxtj1F7EPDuQdO1kijJmqFjYT3fw 30/NdjlfGtbFro6zp+8ZY6P9RG3CLNhYXu+nUqdOHJOJw65aXUxOPjWCRp3o6m12 HIIp22imDTvNTSCfn0wlbGJfC+v9HdIgB9gqApAaxNOS02bQgEy5tuxxXNTo2B1I bqH7GJ+6a7axsQ6BnpcOcDhu2fnbtguj1W/zC/MlGQmZZj2g+UNCf4QqPir/XkcG r43/iy2/L8cgZWYXzH1skfUMgbymI5uLytO+n+CLsb6P/N3WLa7wxHNV5Lni+Xng qDyd0TwUccCr2KBfZqjgFeGbvQ7qub+eeQCYx9kUAaFhKUsQB/eZ5i9C7SsfutMa xzwITqCkcvmdyU3gG3v/oABtQkxdTcRqWCfYDGcXm0zb1jzMf5ZGN4E/gABYM7p9 EewL1kCAt4ULmBc8Hqrj+2Erlzh+xsodvvfpgpJandMVXaJW44k= =YKS4 -----END PGP SIGNATURE----- Merge tag 'net-6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from wireles and netfilter. Nothing major here. Over the last two weeks we gathered only around two-thirds of our normal weekly fix count, but delaying sending these until -rc7 seemed like a really bad idea. AFAIK we have no bugs under investigation. One or two reverts for stuff for which we haven't gotten a proper fix will likely come in the next PR. Current release - fix to a fix: - netfilter: nft_set_hash: unaligned atomic read on struct nft_set_ext - eth: gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup Previous releases - regressions: - net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets - mptcp: - fix sleeping rcvmsg sleeping forever after bad recvbuffer adjust - fix TCP options overflow - prevent excessive coalescing on receive, fix throughput - net: fix memory leak in tcp_conn_request() if map insertion fails - wifi: cw1200: fix potential NULL dereference after conversion to GPIO descriptors - phy: micrel: dynamically control external clock of KSZ PHY, fix suspend behavior Previous releases - always broken: - af_packet: fix VLAN handling with MSG_PEEK - net: restrict SO_REUSEPORT to inet sockets - netdev-genl: avoid empty messages in NAPI get - dsa: microchip: fix set_ageing_time function on KSZ9477 and LAN937X - eth: - gve: XDP fixes around transmit, queue wakeup etc. - ti: icssg-prueth: fix firmware load sequence to prevent time jump which breaks timesync related operations Misc: - netlink: specs: mptcp: add missing attr and improve documentation" * tag 'net-6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits) net: ti: icssg-prueth: Fix clearing of IEP_CMP_CFG registers during iep_init net: ti: icssg-prueth: Fix firmware load sequence. mptcp: prevent excessive coalescing on receive mptcp: don't always assume copied data in mptcp_cleanup_rbuf() mptcp: fix recvbuffer adjust on sleeping rcvmsg ila: serialize calls to nf_register_net_hooks() af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK af_packet: fix vlan_get_tci() vs MSG_PEEK net: wwan: iosm: Properly check for valid exec stage in ipc_mmio_init() net: restrict SO_REUSEPORT to inet sockets net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets net: sfc: Correct key_len for efx_tc_ct_zone_ht_params net: wwan: t7xx: Fix FSM command timeout issue sky2: Add device ID 11ab:4373 for Marvell 88E8075 mptcp: fix TCP options overflow. net: mv643xx_eth: fix an OF node reference leak gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup eth: bcmsysport: fix call balance of priv->clk handling routines net: llc: reset skb->transport_header netlink: specs: mptcp: fix missing doc ... |
||
![]() |
5df7ca0b82 |
ipv4: remove useless arg
The "struct sock *sk" parameter in ip_rcv_finish_core is unused, which leads the compiler to optimize it out. As a result, the "struct sk_buff *skb" parameter is passed using x1. And this make kprobe hard to use. Signed-off-by: Yu Tian <tianyu2@kernelsoft.com> Link: https://patch.msgid.link/20241231023610.1657926-1-tianyu2@kernelsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
aa4ad7c3f2 |
netlink: correct nlmsg size for multicast notifications
Corrected the netlink message size calculation for multicast group
join/leave notifications. The previous calculation did not account for
the inclusion of both IPv4/IPv6 addresses and ifa_cacheinfo in the
payload. This fix ensures that the allocated message size is
sufficient to hold all necessary information.
This patch also includes the following improvements:
* Uses GFP_KERNEL instead of GFP_ATOMIC when holding the RTNL mutex.
* Uses nla_total_size(sizeof(struct in6_addr)) instead of
nla_total_size(16).
* Removes unnecessary EXPORT_SYMBOL().
Fixes:
|
||
![]() |
b5a7b661a0 |
net: Fix netns for ip_tunnel_init_flow()
The device denoted by tunnel->parms.link resides in the underlay net
namespace. Therefore pass tunnel->net to ip_tunnel_init_flow().
Fixes:
|
||
![]() |
4f4aa4aa28 |
net: fix memory leak in tcp_conn_request()
If inet_csk_reqsk_queue_hash_add() return false, tcp_conn_request() will
return without free the dst memory, which allocated in af_ops->route_req.
Here is the kmemleak stack:
unreferenced object 0xffff8881198631c0 (size 240):
comm "softirq", pid 0, jiffies 4299266571 (age 1802.392s)
hex dump (first 32 bytes):
00 10 9b 03 81 88 ff ff 80 98 da bc ff ff ff ff ................
81 55 18 bb ff ff ff ff 00 00 00 00 00 00 00 00 .U..............
backtrace:
[<ffffffffb93e8d4c>] kmem_cache_alloc+0x60c/0xa80
[<ffffffffba11b4c5>] dst_alloc+0x55/0x250
[<ffffffffba227bf6>] rt_dst_alloc+0x46/0x1d0
[<ffffffffba23050a>] __mkroute_output+0x29a/0xa50
[<ffffffffba23456b>] ip_route_output_key_hash+0x10b/0x240
[<ffffffffba2346bd>] ip_route_output_flow+0x1d/0x90
[<ffffffffba254855>] inet_csk_route_req+0x2c5/0x500
[<ffffffffba26b331>] tcp_conn_request+0x691/0x12c0
[<ffffffffba27bd08>] tcp_rcv_state_process+0x3c8/0x11b0
[<ffffffffba2965c6>] tcp_v4_do_rcv+0x156/0x3b0
[<ffffffffba299c98>] tcp_v4_rcv+0x1cf8/0x1d80
[<ffffffffba239656>] ip_protocol_deliver_rcu+0xf6/0x360
[<ffffffffba2399a6>] ip_local_deliver_finish+0xe6/0x1e0
[<ffffffffba239b8e>] ip_local_deliver+0xee/0x360
[<ffffffffba239ead>] ip_rcv+0xad/0x2f0
[<ffffffffba110943>] __netif_receive_skb_one_core+0x123/0x140
Call dst_release() to free the dst memory when
inet_csk_reqsk_queue_hash_add() return false in tcp_conn_request().
Fixes:
|
||
![]() |
a502ea6fa9 |
udp: Deal with race between UDP socket address change and rehash
If a UDP socket changes its local address while it's receiving datagrams, as a result of connect(), there is a period during which a lookup operation might fail to find it, after the address is changed but before the secondary hash (port and address) and the four-tuple hash (local and remote ports and addresses) are updated. Secondary hash chains were introduced by commit |
||
![]() |
5153a75ef3 |
tcp_bpf: Fix copied value in tcp_bpf_sendmsg
bpf kselftest sockhash::test_txmsg_cork_hangs in test_sockmap.c triggers a kernel NULL pointer dereference: BUG: kernel NULL pointer dereference, address: 0000000000000008 ? __die_body+0x6e/0xb0 ? __die+0x8b/0xa0 ? page_fault_oops+0x358/0x3c0 ? local_clock+0x19/0x30 ? lock_release+0x11b/0x440 ? kernelmode_fixup_or_oops+0x54/0x60 ? __bad_area_nosemaphore+0x4f/0x210 ? mmap_read_unlock+0x13/0x30 ? bad_area_nosemaphore+0x16/0x20 ? do_user_addr_fault+0x6fd/0x740 ? prb_read_valid+0x1d/0x30 ? exc_page_fault+0x55/0xd0 ? asm_exc_page_fault+0x2b/0x30 ? splice_to_socket+0x52e/0x630 ? shmem_file_splice_read+0x2b1/0x310 direct_splice_actor+0x47/0x70 splice_direct_to_actor+0x133/0x300 ? do_splice_direct+0x90/0x90 do_splice_direct+0x64/0x90 ? __ia32_sys_tee+0x30/0x30 do_sendfile+0x214/0x300 __se_sys_sendfile64+0x8e/0xb0 __x64_sys_sendfile64+0x25/0x30 x64_sys_call+0xb82/0x2840 do_syscall_64+0x75/0x110 entry_SYSCALL_64_after_hwframe+0x4b/0x53 This is caused by tcp_bpf_sendmsg() returning a larger value(12289) than size (8192), which causes the while loop in splice_to_socket() to release an uninitialized pipe buf. The underlying cause is that this code assumes sk_msg_memcopy_from_iter() will copy all bytes upon success but it actually might only copy part of it. This commit changes it to use the real copied bytes. Signed-off-by: Levi Zim <rsworktech@outlook.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Björn Töpel <bjorn@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20241130-tcp-bpf-sendmsg-v1-2-bae583d014f3@outlook.com |
||
![]() |
148721f8e0 |
ipv4: Use inet_sk_init_flowi4() in __ip_queue_xmit().
Use inet_sk_init_flowi4() to automatically initialise the flowi4 structure in __ip_queue_xmit() instead of passing parameters manually to ip_route_output_ports(). Override ->flowi4_tos with the value passed as parameter since that's required by SCTP. Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/37e64ffbd9adac187b14aa9097b095f5c86e85be.1734357769.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
42e5ffc385 |
ipv4: Use inet_sk_init_flowi4() in inet_csk_rebuild_route().
Use inet_sk_init_flowi4() to automatically initialise the flowi4 structure in inet_csk_rebuild_route() instead of passing parameters manually to ip_route_output_ports(). Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/b270931636effa1095508e0f0a3e8c3a0e6d357f.1734357769.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
5be1323b50 |
ipv4: Use inet_sk_init_flowi4() in ip4_datagram_release_cb().
Use inet_sk_init_flowi4() to automatically initialise the flowi4 structure in ip4_datagram_release_cb() instead of passing parameters manually to ip_route_output_ports(). Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/9c326b8d9e919478f7952b21473d31da07eba2dd.1734357769.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
1dbdce30f0 |
ipv4: Define inet_sk_init_flowi4() and use it in inet_sk_rebuild_header().
IPv4 code commonly has to initialise a flowi4 structure from an IPv4 socket. This requires looking at potential IPv4 options to set the proper destination address, call flowi4_init_output() with the correct set of parameters and run the sk_classify_flow security hook. Instead of reimplementing these operations in different parts of the stack, let's define inet_sk_init_flowi4() which does all these operations. The first user is inet_sk_rebuild_header(), where inet_sk_init_flowi4() replaces ip_route_output_ports(). Unlike ip_route_output_ports(), which sets the flowi4 structure and performs the route lookup in one go, inet_sk_init_flowi4() only initialises the flow. The route lookup is then done by ip_route_output_flow(). Decoupling flow initialisation from route lookup makes this new interface applicable more broadly as it will allow some users to overwrite specific struct flowi4 members before the route lookup. Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/fd416275262b1f518d5abfcef740ce4f4a1a6522.1734357769.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
05dd04b218 |
inetpeer: avoid false sharing in inet_peer_xrlim_allow()
Under DOS, inet_peer_xrlim_allow() might be called millions of times per second from different cpus. Make sure to write over peer->rate_tokens and peer->rate_last only when really needed. Note the inherent races of this function are still there, we do not care of precise ICMP rate limiting. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20241219150330.3159027-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
d888b7af7c |
tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection
When we do sk_psock_verdict_apply->sk_psock_skb_ingress, an sk_msg will
be created out of the skb, and the rmem accounting of the sk_msg will be
handled by the skb.
For skmsgs in __SK_REDIRECT case of tcp_bpf_send_verdict, when redirecting
to the ingress of a socket, although we sk_rmem_schedule and add sk_msg to
the ingress_msg of sk_redir, we do not update sk_rmem_alloc. As a result,
except for the global memory limit, the rmem of sk_redir is nearly
unlimited. Thus, add sk_rmem_alloc related logic to limit the recv buffer.
Since the function sk_msg_recvmsg and __sk_psock_purge_ingress_msg are
used in these two paths. We use "msg->skb" to test whether the sk_msg is
skb backed up. If it's not, we shall do the memory accounting explicitly.
Fixes:
|
||
![]() |
54f89b3178 |
tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()
When bpf_tcp_ingress() is called, the skmsg is being redirected to the
ingress of the destination socket. Therefore, we should charge its
receive socket buffer, instead of sending socket buffer.
Because sk_rmem_schedule() tests pfmemalloc of skb, we need to
introduce a wrapper and call it for skmsg.
Fixes:
|