linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-03-22 07:27:12 +08:00

Author	SHA1	Message	Date
Hangbin Liu	c211f5d7cb	net: vlan: sync VLAN features with lower device After registering a VLAN device and setting its feature flags, we need to synchronize the VLAN features with the lower device. For example, the VLAN device does not have the NETIF_F_LRO flag, it should be synchronized with the lower device based on the NETIF_F_UPPER_DISABLES definition. As the dev->vlan_features has changed, we need to call netdev_update_features(). The caller must run after netdev_upper_dev_link() links the lower devices, so this patch adds the netdev_update_features() call in register_vlan_dev(). Fixes: `fd867d51f8` ("net/core: generic support for disabling netdev features down stack") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20251030073539.133779-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 17:42:35 -07:00
Wang Liang	d01f8136d4	selftests: netdevsim: Fix ethtool-coalesce.sh fail by installing ethtool-common.sh The script "ethtool-common.sh" is not installed in INSTALL_PATH, and triggers some errors when I try to run the test 'drivers/net/netdevsim/ethtool-coalesce.sh': TAP version 13 1..1 # timeout set to 600 # selftests: drivers/net/netdevsim: ethtool-coalesce.sh # ./ethtool-coalesce.sh: line 4: ethtool-common.sh: No such file or directory # ./ethtool-coalesce.sh: line 25: make_netdev: command not found # ethtool: bad command line argument(s) # ./ethtool-coalesce.sh: line 124: check: command not found # ./ethtool-coalesce.sh: line 126: [: -eq: unary operator expected # FAILED /0 checks not ok 1 selftests: drivers/net/netdevsim: ethtool-coalesce.sh # exit=1 Install this file to avoid this error. After this patch: TAP version 13 1..1 # timeout set to 600 # selftests: drivers/net/netdevsim: ethtool-coalesce.sh # PASSED all 22 checks ok 1 selftests: drivers/net/netdevsim: ethtool-coalesce.sh Fixes: `fbb8531e58` ("selftests: extract common functions in ethtool-common.sh") Signed-off-by: Wang Liang <wangliang74@huawei.com> Link: https://patch.msgid.link/20251030040340.3258110-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 17:41:54 -07:00
Abdun Nihaal	3f978e3f15	isdn: mISDN: hfcsusb: fix memory leak in hfcsusb_probe() In hfcsusb_probe(), the memory allocated for ctrl_urb gets leaked when setup_instance() fails with an error code. Fix that by freeing the urb before freeing the hw structure. Also change the error paths to use the goto ladder style. Compile tested only. Issue found using a prototype static analysis tool. Fixes: `69f52adb2d` ("mISDN: Add HFC USB driver") Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in> Link: https://patch.msgid.link/20251030042524.194812-1-nihaal@cse.iitm.ac.in Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 17:39:14 -07:00
Anubhav Singh	f8e8486702	selftests/net: use destination options instead of hop-by-hop The GRO self-test, gro.c, currently constructs IPv6 packets containing a Hop-by-Hop Options header (IPPROTO_HOPOPTS) to ensure the GRO path correctly handles IPv6 extension headers. However, network elements may be configured to drop packets with the Hop-by-Hop Options header (HBH). This causes the self-test to fail in environments where such network elements are present. To improve the robustness and reliability of this test in diverse network environments, switch from using IPPROTO_HOPOPTS to IPPROTO_DSTOPTS (Destination Options). The Destination Options header is less likely to be dropped by intermediate routers and still serves the core purpose of the test: validating GRO's handling of an IPv6 extension header. This change ensures the test can execute successfully without being incorrectly failed by network policies outside the kernel's control. Fixes: `7d1575014a` ("selftests/net: GRO coalesce test") Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Anubhav Singh <anubhavsinggh@google.com> Link: https://patch.msgid.link/20251030060436.1556664-1-anubhavsinggh@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 17:33:17 -07:00
Anubhav Singh	02d064de05	selftests/net: fix out-of-order delivery of FIN in gro:tcp test Due to the gro_sender sending data packets and FIN packets in very quick succession, these are received almost simultaneously by the gro_receiver. FIN packets are sometimes processed before the data packets leading to intermittent (~1/100) test failures. This change adds a delay of 100ms before sending FIN packets in gro:tcp test to avoid the out-of-order delivery. The same mitigation already exists for the gro:ip test. Fixes: `7d1575014a` ("selftests/net: GRO coalesce test") Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Anubhav Singh <anubhavsinggh@google.com> Link: https://patch.msgid.link/20251030062818.1562228-1-anubhavsinggh@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 17:32:24 -07:00
Jonas Gorski	3d18a84edd	net: dsa: tag_brcm: legacy: fix untagged rx on unbridged ports for bcm63xx The internal switch on BCM63XX SoCs will unconditionally add 802.1Q VLAN tags on egress to CPU when 802.1Q mode is enabled. We do this unconditionally since commit `ed409f3bba` ("net: dsa: b53: Configure VLANs while not filtering"). This is fine for VLAN aware bridges, but for standalone ports and vlan unaware bridges this means all packets are tagged with the default VID, which is 0. While the kernel will treat that like untagged, this can break userspace applications processing raw packets, expecting untagged traffic, like STP daemons. This also breaks several bridge tests, where the tcpdump output then does not match the expected output anymore. Since 0 isn't a valid VID, just strip out the VLAN tag if we encounter it, unless the priority field is set, since that would be a valid tag again. Fixes: `964dbf186e` ("net: dsa: tag_brcm: add support for legacy tags") Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://patch.msgid.link/20251027194621.133301-1-jonas.gorski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 16:28:10 -07:00
Carolina Jubran	5a89b27afd	ptp: Allow exposing cycles only for clocks with free-running counter The PTP core falls back to gettimex64 and getcrosststamp when getcycles64 or getcyclesx64 are not implemented. This causes the CYCLES ioctls to retrieve PHC real time instead of free-running cycles. Reject PTP_SYS_OFFSET_{PRECISE,EXTENDED}_CYCLES for clocks without free-running counter support since the result would represent PHC real time and system time rather than cycles and system time. Fixes: `faf23f54d3` ("ptp: Add ioctl commands to expose raw cycle counter values") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20251029083813.2276997-1-cjubran@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 16:27:40 -07:00
Jakub Kicinski	01534d73c5	Merge branch 'gve-fix-null-dereferencing-with-ptp-clock' Tim Hostetler says: ==================== gve: Fix NULL dereferencing with PTP clock This patch series fixes NULL dereferences that are possible with gve's PTP clock due to not stubbing certain ptp_clock_info callbacks. ==================== Link: https://patch.msgid.link/20251029184555.3852952-1-joshwash@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 15:55:43 -07:00
Tim Hostetler	329d050bbe	gve: Implement settime64 with -EOPNOTSUPP ptp_clock_settime() assumes every ptp_clock has implemented settime64(). Stub it with -EOPNOTSUPP to prevent a NULL dereference. Fixes: `acd1638052` ("gve: Add initial PTP device support") Reported-by: syzbot+a546141ca6d53b90aba3@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=a546141ca6d53b90aba3 Signed-off-by: Tim Hostetler <thostet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Link: https://patch.msgid.link/20251029184555.3852952-3-joshwash@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 15:55:32 -07:00
Tim Hostetler	6ab753b5d8	gve: Implement gettimex64 with -EOPNOTSUPP gve implemented a ptp_clock for sole use of do_aux_work at this time. ptp_clock_gettime() and ptp_sys_offset() assume every ptp_clock has implemented either gettimex64 or gettime64. Stub gettimex64 and return -EOPNOTSUPP to prevent NULL dereferencing. Fixes: `acd1638052` ("gve: Add initial PTP device support") Reported-by: syzbot+c8c0e7ccabd456541612@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c8c0e7ccabd456541612 Signed-off-by: Tim Hostetler <thostet@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Link: https://patch.msgid.link/20251029184555.3852952-2-joshwash@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 15:55:32 -07:00
Jakub Kicinski	284987ab6c	Merge tag 'for-net-2025-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - btrtl: Fix memory leak in rtlbt_parse_firmware_v2() - MGMT: Fix OOB access in parse_adv_monitor_pattern() - hci_event: validate skb length for unknown CC opcode * tag 'for-net-2025-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: MGMT: Fix OOB access in parse_adv_monitor_pattern() Bluetooth: btrtl: Fix memory leak in rtlbt_parse_firmware_v2() Bluetooth: hci_event: validate skb length for unknown CC opcode ==================== Link: https://patch.msgid.link/20251031170959.590470-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 12:33:08 -07:00
Jakub Kicinski	b7904323e7	Merge tag 'wireless-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Couple of new fixes: - ath10k: revert a patch that had caused issues on some devices - cfg80211/mac80211: use hrtimers for some things where the precise timing matters - zd1211rw: fix a long-standing potential leak * tag 'wireless-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: zd1211rw: fix potential memory leak in __zd_usb_enable_rx() wifi: mac80211: use wiphy_hrtimer_work for csa.switch_work wifi: mac80211: use wiphy_hrtimer_work for ml_reconf_work wifi: mac80211: use wiphy_hrtimer_work for ttlm_work wifi: cfg80211: add an hrtimer based delayed work item Revert "wifi: ath10k: avoid unnecessary wait for service ready message" ==================== Link: https://patch.msgid.link/20251030104919.12871-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 12:30:33 -07:00
Ilia Gavrilov	8d59fba493	Bluetooth: MGMT: Fix OOB access in parse_adv_monitor_pattern() In the parse_adv_monitor_pattern() function, the value of the 'length' variable is currently limited to HCI_MAX_EXT_AD_LENGTH(251). The size of the 'value' array in the mgmt_adv_pattern structure is 31. If the value of 'pattern[i].length' is set in the user space and exceeds 31, the 'patterns[i].value' array can be accessed out of bound when copied. Increasing the size of the 'value' array in the 'mgmt_adv_pattern' structure will break the userspace. Considering this, and to avoid OOB access revert the limits for 'offset' and 'length' back to the value of HCI_MAX_AD_LENGTH. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Fixes: `db08722fc7` ("Bluetooth: hci_core: Fix missing instances using HCI_MAX_AD_LENGTH") Cc: stable@vger.kernel.org Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov@infotecs.ru> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-10-31 12:43:05 -04:00
Abdun Nihaal	1c21cf89a6	Bluetooth: btrtl: Fix memory leak in rtlbt_parse_firmware_v2() The memory allocated for ptr using kvmalloc() is not freed on the last error path. Fix that by freeing it on that error path. Fixes: `9a24ce5e29` ("Bluetooth: btrtl: Firmware format v2 support") Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-10-31 12:42:47 -04:00
Raphael Pinsonneault-Thibeault	5c5f1f6468	Bluetooth: hci_event: validate skb length for unknown CC opcode In hci_cmd_complete_evt(), if the command complete event has an unknown opcode, we assume the first byte of the remaining skb->data contains the return status. However, parameter data has previously been pulled in hci_event_func(), which may leave the skb empty. If so, using skb->data[0] for the return status uses un-init memory. The fix is to check skb->len before using skb->data. Reported-by: syzbot+a9a4bedfca6aa9d7fa24@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=a9a4bedfca6aa9d7fa24 Tested-by: syzbot+a9a4bedfca6aa9d7fa24@syzkaller.appspotmail.com Fixes: `afcb3369f4` ("Bluetooth: hci_event: Fix vendor (unknown) opcode status handling") Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-10-31 12:41:01 -04:00
Linus Torvalds	e576349123	Merge tag 'net-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from wireless, Bluetooth and netfilter. Current release - regressions: - tcp: fix too slow tcp_rcvbuf_grow() action - bluetooth: fix corruption in h4_recv_buf() after cleanup Previous releases - regressions: - mptcp: restore window probe - bluetooth: - fix connection cleanup with BIG with 2 or more BIS - fix crash in set_mesh_sync and set_mesh_complete - batman-adv: release references to inactive interfaces - nic: - ice: fix usage of logical PF id - sfc: fix potential memory leak in efx_mae_process_mport() Previous releases - always broken: - devmem: refresh devmem TX dst in case of route invalidation - netfilter: add seqadj extension for natted connections - wifi: - iwlwifi: fix potential use after free in iwl_mld_remove_link() - brcmfmac: fix crash while sending action frames in standalone AP Mode - eth: - mlx5e: cancel tls RX async resync request in error flows - ixgbe: fix memory leak and use-after-free in ixgbe_recovery_probe() - hibmcge: fix rx buf avl irq is not re-enabled in irq_handle issue - cxgb4: fix potential use-after-free in ipsec callback - nfp: fix memory leak in nfp_net_alloc()" * tag 'net-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (75 commits) net: sctp: fix KMSAN uninit-value in sctp_inq_pop net: devmem: refresh devmem TX dst in case of route invalidation net: stmmac: est: Fix GCL bounds checks net: stmmac: Consider Tx VLAN offload tag length for maxSDU net: stmmac: vlan: Disable 802.1AD tag insertion offload net/mlx5e: kTLS, Cancel RX async resync request in error flows net: tls: Cancel RX async resync request on rcd_delta overflow net: tls: Change async resync helpers argument net: phy: dp83869: fix STRAP_OPMODE bitmask selftests: net: use BASH for bareudp testing net: mctp: Fix tx queue stall net/mlx5: Don't zero user_count when destroying FDB tables net: usb: asix_devices: Check return value of usbnet_get_endpoints mptcp: zero window probe mib mptcp: restore window probe mptcp: fix MSG_PEEK stream corruption mptcp: drop bogus optimization in __mptcp_check_push() netconsole: Fix race condition in between reader and writer of userdata Documentation: netconsole: Remove obsolete contact people nfp: xsk: fix memory leak in nfp_net_alloc() ...	2025-10-30 18:35:35 -07:00
Ranganath V N	51e5ad549c	net: sctp: fix KMSAN uninit-value in sctp_inq_pop Fix an issue detected by syzbot: KMSAN reported an uninitialized-value access in sctp_inq_pop BUG: KMSAN: uninit-value in sctp_inq_pop The issue is actually caused by skb trimming via sk_filter() in sctp_rcv(). In the reproducer, skb->len becomes 1 after sk_filter(), which bypassed the original check: if (skb->len < sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr) + skb_transport_offset(skb)) To handle this safely, a new check should be performed after sk_filter(). Reported-by: syzbot+d101e12bccd4095460e7@syzkaller.appspotmail.com Tested-by: syzbot+d101e12bccd4095460e7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d101e12bccd4095460e7 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Suggested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Ranganath V N <vnranganath.20@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251026-kmsan_fix-v3-1-2634a409fa5f@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-30 11:21:05 +01:00
Abdun Nihaal	70e8335485	wifi: zd1211rw: fix potential memory leak in __zd_usb_enable_rx() The memory allocated for urbs with kcalloc() is not freed on any error path. Fix that by freeing it in the error path. Fixes: `e85d0918b5` ("[PATCH] ZyDAS ZD1211 USB-WLAN driver") Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in> Link: https://patch.msgid.link/20251028174341.139134-1-nihaal@cse.iitm.ac.in Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-10-30 08:41:10 +01:00
Shivaji Kant	6a2108c780	net: devmem: refresh devmem TX dst in case of route invalidation The zero-copy Device Memory (Devmem) transmit path relies on the socket's route cache (`dst_entry`) to validate that the packet is being sent via the network device to which the DMA buffer was bound. However, this check incorrectly fails and returns `-ENODEV` if the socket's route cache entry (`dst`) is merely missing or expired (`dst == NULL`). This scenario is observed during network events, such as when flow steering rules are deleted, leading to a temporary route cache invalidation. This patch fixes -ENODEV error for `net_devmem_get_binding()` by doing the following: 1. It attempts to rebuild the route via `rebuild_header()` if the route is initially missing (`dst == NULL`). This allows the TCP/IP stack to recover from transient route cache misses. 2. It uses `rcu_read_lock()` and `dst_dev_rcu()` to safely access the network device pointer (`dst_dev`) from the route, preventing use-after-free conditions if the device is concurrently removed. 3. It maintains the critical safety check by validating that the retrieved destination device (`dst_dev`) is exactly the device registered in the Devmem binding (`binding->dev`). These changes prevent unnecessary ENODEV failures while maintaining the critical safety requirement that the Devmem resources are only used on the bound network device. Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: Vedant Mathur <vedantmathur@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: `bd61848900` ("net: devmem: Implement TX path") Signed-off-by: Shivaji Kant <shivajikant@google.com> Link: https://patch.msgid.link/20251029065420.3489943-1-shivajikant@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 19:23:21 -07:00
Jakub Kicinski	a38eeecfe3	Merge branch 'net-stmmac-fixes-for-stmmac-tx-vlan-insert-and-est' Rohan G Thomas says: ==================== net: stmmac: Fixes for stmmac Tx VLAN insert and EST This patchset includes following fixes for stmmac Tx VLAN insert and EST implementations: 1. Disable STAG insertion offloading, as DWMAC IPs doesn't support offload of STAG for double VLAN packets and CTAG for single VLAN packets when using the same register configuration. The current configuration in the driver is undocumented and is adding an additional 802.1Q tag with VLAN ID 0 for double VLAN packets. 2. Consider Tx VLAN offload tag length for maxSDU estimation. 3. Fix GCL bounds check ==================== Link: https://patch.msgid.link/20251028-qbv-fixes-v4-0-26481c7634e3@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:49:27 -07:00
Rohan G Thomas	48b2e323c0	net: stmmac: est: Fix GCL bounds checks Fix the bounds checks for the hw supported maximum GCL entry count and gate interval time. Fixes: `b60189e039` ("net: stmmac: Integrate EST with TAPRIO scheduler API") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com> Link: https://patch.msgid.link/20251028-qbv-fixes-v4-3-26481c7634e3@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:49:24 -07:00
Rohan G Thomas	ded9813d17	net: stmmac: Consider Tx VLAN offload tag length for maxSDU Queue maxSDU requirement of 802.1 Qbv standard requires mac to drop packets that exceeds maxSDU length and maxSDU doesn't include preamble, destination and source address, or FCS but includes ethernet type and VLAN header. On hardware with Tx VLAN offload enabled, VLAN header length is not included in the skb->len, when Tx VLAN offload is requested. This leads to incorrect length checks and allows transmission of oversized packets. Add the VLAN_HLEN to the skb->len before checking the Qbv maxSDU if Tx VLAN offload is requested for the packet. Fixes: `c5c3e1bfc9` ("net: stmmac: Offload queueMaxSDU from tc-taprio") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com> Link: https://patch.msgid.link/20251028-qbv-fixes-v4-2-26481c7634e3@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:49:24 -07:00
Rohan G Thomas	c657f86106	net: stmmac: vlan: Disable 802.1AD tag insertion offload The DWMAC IP's VLAN tag insertion offload does not support inserting STAG (802.1AD) and CTAG (802.1Q) types in bytes 13 and 14 using the same MAC_VLAN_Incl and MAC_VLAN_Inner_Incl register configurations. Currently, MAC_VLAN_Incl is configured to offload only STAG type insertion. However, the DWMAC IP inserts a CTAG type when the inner VLAN ID field of the descriptor is not configured, and a STAG type when it is configured. This behavior is not documented and leads to inconsistent double VLAN tagging. Additionally, an unexpected CTAG with VLAN ID 0 is inserted, resulting in frames like: Frame 1: 110 bytes on wire (880 bits), 110 bytes captured (880 bits) Ethernet II, Src: <src> (<src>), Dst: <dst> (<dst>) IEEE 802.1ad, ID: 100 802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 0 (unexpected) 802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 200 Internet Protocol Version 4, Src: 192.168.4.10, Dst: 192.168.4.11 Internet Control Message Protocol To avoid this undocumented and incorrect behavior, disable 802.1AD tag insertion offload. Also, don't set CSVL bit. As per the data book, when this bit is set, S-VLAN type (0x88A8) is inserted in the 13th and 14th bytes of transmitted packets and when this bit is reset, C-VLAN type (0x8100) is inserted in the 13th and 14th bytes of transmitted packets. Fixes: `30d932279d` ("net: stmmac: Add support for VLAN Insertion Offload") Fixes: `e94e3f3b51` ("net: stmmac: Add support for VLAN Insertion Offload in GMAC4+") Fixes: `1d2c7a5fee` ("net: stmmac: Refactor VLAN implementation") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Boon Khai Ng <boon.khai.ng@altera.com> Link: https://patch.msgid.link/20251028-qbv-fixes-v4-1-26481c7634e3@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:49:24 -07:00
Jakub Kicinski	0dd1be4fe0	Merge branch 'tls-introduce-and-use-rx-async-resync-request-cancel-function' Tariq Toukan says: ==================== tls: Introduce and use RX async resync request cancel function This series by Shahar introduces RX async resync request cancel function in tls module, and uses it in mlx5e driver. For a device-offloaded TLS RX connection, the TLS module increments rcd_delta each time a new TLS record is received, tracking the distance from the original resync request. In the meanwhile, the device is queried and is expected to respond, asynchronously. However, if the device response is delayed or fails (e.g due to unstable connection and device getting out of tracking, hardware errors, resource exhaustion etc.), the TLS module keeps logging and incrementing rcd_delta, which can lead to a WARN() when rcd_delta exceeds the threshold. This series improves this code area by canceling the resync request when spotting an issue with the device response. ==================== Link: https://patch.msgid.link/1761508983-937977-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:32:24 -07:00
Shahar Shitrit	426e9da3b2	net/mlx5e: kTLS, Cancel RX async resync request in error flows When device loses track of TLS records, it attempts to resync by monitoring records and requests an asynchronous resynchronization from software for this TLS connection. The TLS module handles such device RX resync requests by logging record headers and comparing them with the record tcp_sn when provided by the device. It also increments rcd_delta to track how far the current record tcp_sn is from the tcp_sn of the original resync request. If the device later responds with a matching tcp_sn, the TLS module approves the tcp_sn for resync. However, the device response may be delayed or never arrive, particularly due to traffic-related issues such as packet drops or reordering. In such cases, the TLS module remains unaware that resync will not complete, and continues performing unnecessary work by logging headers and incrementing rcd_delta, which can eventually exceed the threshold and trigger a WARN(). For example, this was observed when the device got out of tracking, causing mlx5e_ktls_handle_get_psv_completion() to fail and ultimately leading to the rcd_delta warning. To address this, call tls_offload_rx_resync_async_request_cancel() to cancel the resync request and stop resync tracking in such error cases. Also, increment the tls_resync_req_skip counter to track these cancellations. Fixes: `0419d8c9d8` ("net/mlx5e: kTLS, Add kTLS RX resync support") Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1761508983-937977-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:32:18 -07:00
Shahar Shitrit	c15d5c62ab	net: tls: Cancel RX async resync request on rcd_delta overflow When a netdev issues a RX async resync request for a TLS connection, the TLS module handles it by logging record headers and attempting to match them to the tcp_sn provided by the device. If a match is found, the TLS module approves the tcp_sn for resynchronization. While waiting for a device response, the TLS module also increments rcd_delta each time a new TLS record is received, tracking the distance from the original resync request. However, if the device response is delayed or fails (e.g due to unstable connection and device getting out of tracking, hardware errors, resource exhaustion etc.), the TLS module keeps logging and incrementing, which can lead to a WARN() when rcd_delta exceeds the threshold. To address this, introduce tls_offload_rx_resync_async_request_cancel() to explicitly cancel resync requests when a device response failure is detected. Call this helper also as a final safeguard when rcd_delta crosses its threshold, as reaching this point implies that earlier cancellation did not occur. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1761508983-937977-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:32:18 -07:00
Shahar Shitrit	34892cfec0	net: tls: Change async resync helpers argument Update tls_offload_rx_resync_async_request_start() and tls_offload_rx_resync_async_request_end() to get a struct tls_offload_resync_async parameter directly, rather than extracting it from struct sock. This change aligns the function signatures with the upcoming tls_offload_rx_resync_async_request_cancel() helper, which will be introduced in a subsequent patch. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1761508983-937977-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:32:17 -07:00
Jakub Kicinski	e98cda764a	Merge tag 'nf-25-10-29' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter: updates for net 1) its not possible to attach conntrack labels via ctnetlink unless one creates a dummy 'ct labels set' rule in nftables. This is an oversight, the 'ruleset tests presence, userspace (netlink) sets' use-case is valid and should 'just work'. Always broken since this got added in Linux 4.7. 2) nft_connlimit reads count value without holding the relevant lock, add a READ_ONCE annotation. From Fernando Fernandez Mancera. 3) There is a long-standing bug (since 4.12) in nftables helper infra when NAT is in use: if the helper gets assigned after the nat binding was set up, we fail to initialise the 'seqadj' extension, which is needed in case NAT payload rewrites need to add (or remove) from the packet payload. Fix from Andrii Melnychenko. * tag 'nf-25-10-29' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_ct: add seqadj extension for natted connections netfilter: nft_connlimit: fix possible data race on connection count netfilter: nft_ct: enable labels for get case too ==================== Link: https://patch.msgid.link/20251029135617.18274-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:25:12 -07:00
Thanh Quan	298574936a	net: phy: dp83869: fix STRAP_OPMODE bitmask According to the TI DP83869HM datasheet Revision D (June 2025), section 7.6.1.41 STRAP_STS Register, the STRAP_OPMODE bitmask is bit [11:9]. Fix this. In case the PHY is auto-detected via PHY ID registers, or not described in DT, or, in case the PHY is described in DT but the optional DT property "ti,op-mode" is not present, then the driver reads out the PHY functional mode (RGMII, SGMII, ...) from hardware straps. Currently, all upstream users of this PHY specify both DT compatible string "ethernet-phy-id2000.a0f1" and ti,op-mode = <DP83869_RGMII_COPPER_ETHERNET> property, therefore it seems no upstream users are affected by this bug. The driver currently interprets bits [2:0] of STRAP_STS register as PHY functional mode. Those bits are controlled by ANEG_DIS, ANEGSEL_0 straps and an always-zero reserved bit. Systems that use RGMII-to-Copper functional mode are unlikely to disable auto-negotiation via ANEG_DIS strap, or change auto-negotiation behavior via ANEGSEL_0 strap. Therefore, even with this bug in place, the STRAP_STS register content is likely going to be interpreted by the driver as RGMII-to-Copper mode. However, for a system with PHY functional mode strapping set to other mode than RGMII-to-Copper, the driver is likely to misinterpret the strapping as RGMII-to-Copper and misconfigure the PHY. For example, on a system with SGMII-to-Copper strapping, the STRAP_STS register reads as 0x0c20, but the PHY ends up being configured for incompatible RGMII-to-Copper mode. Fixes: `0eaf8ccf20` ("net: phy: dp83869: Set opmode from straps") Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Thanh Quan <thanh.quan.xn@renesas.com> Signed-off-by: Hai Pham <hai.pham.ud@renesas.com> Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org> # Port from U-Boot to Linux Link: https://patch.msgid.link/20251027140320.8996-1-marek.vasut+renesas@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:59:09 -07:00
Po-Hsu Lin	9311e9540a	selftests: net: use BASH for bareudp testing In bareudp.sh, this script uses /bin/sh and it will load another lib.sh BASH script at the very beginning. But on some operating systems like Ubuntu, /bin/sh is actually pointed to DASH, thus it will try to run BASH commands with DASH and consequently leads to syntax issues: # ./bareudp.sh: 4: ./lib.sh: Bad substitution # ./bareudp.sh: 5: ./lib.sh: source: not found # ./bareudp.sh: 24: ./lib.sh: Syntax error: "(" unexpected Fix this by explicitly using BASH for bareudp.sh. This fixes test execution failures on systems where /bin/sh is not BASH. Reported-by: Edoardo Canepa <edoardo.canepa@canonical.com> Link: https://bugs.launchpad.net/bugs/2129812 Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/20251027095710.2036108-2-po-hsu.lin@canonical.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:56:03 -07:00
Jinliang Wang	da2522df3f	net: mctp: Fix tx queue stall The tx queue can become permanently stuck in a stopped state due to a race condition between the URB submission path and its completion callback. The URB completion callback can run immediately after usb_submit_urb() returns, before the submitting function calls netif_stop_queue(). If this occurs, the queue state management becomes desynchronized, leading to a stall where the queue is never woken. Fix this by moving the netif_stop_queue() call to before submitting the URB. This closes the race window by ensuring the network stack is aware the queue is stopped before the URB completion can possibly run. Fixes: `0791c0327a` ("net: mctp: Add MCTP USB transport driver") Signed-off-by: Jinliang Wang <jinliangw@google.com> Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20251027065530.2045724-1-jinliangw@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:55:14 -07:00
Cosmin Ratiu	53110232c9	net/mlx5: Don't zero user_count when destroying FDB tables esw->user_count tracks how many TC rules are added on an esw via mlx5e_configure_flower -> mlx5_esw_get -> atomic64_inc(&esw->user_count) esw.user_count was unconditionally set to 0 in esw_destroy_legacy_fdb_table and esw_destroy_offloads_fdb_tables. These two together can lead to the following sequence of events: 1. echo 1 > /sys/class/net/eth2/device/sriov_numvfs - mlx5_core_sriov_configure -...-> esw_create_legacy_table -> atomic64_set(&esw->user_count, 0) 2. tc qdisc add dev eth2 ingress && \ tc filter replace dev eth2 pref 1 protocol ip chain 0 ingress \ handle 1 flower action ct nat zone 64000 pipe - mlx5e_configure_flower -> mlx5_esw_get -> atomic64_inc(&esw->user_count) 3. echo 0 > /sys/class/net/eth2/device/sriov_numvfs - mlx5_core_sriov_configure -..-> esw_destroy_legacy_fdb_table -> atomic64_set(&esw->user_count, 0) 4. devlink dev eswitch set pci/0000:08:00.0 mode switchdev - mlx5_devlink_eswitch_mode_set -> mlx5_esw_try_lock -> atomic64_read(&esw->user_count) == 0 - then proceed to a WARN_ON in: esw_offloads_start -> mlx5_eswitch_enable_locke -> esw_offloads_enable -> mlx5_esw_offloads_rep_load -> mlx5e_vport_rep_load -> mlx5e_netdev_change_profile -> mlx5e_detach_netdev -> mlx5e_cleanup_nic_rx -> mlx5e_tc_nic_cleanup -> mlx5e_mod_hdr_tbl_destroy Fix this by not clearing out the user_count when destroying FDB tables, so that the check in mlx5_esw_try_lock can prevent the mode change when there are TC rules configured, as originally intended. Fixes: `2318b8bb94` ("net/mlx5: E-switch, Destroy legacy fdb table when needed") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1761510019-938772-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:53:37 -07:00
Miaoqian Lin	dc89548c69	net: usb: asix_devices: Check return value of usbnet_get_endpoints The code did not check the return value of usbnet_get_endpoints. Add checks and return the error if it fails to transfer the error. Found via static anlaysis and this is similar to commit `07161b2416` ("sr9800: Add check for usbnet_get_endpoints"). Fixes: `933a27d39e` ("USB: asix - Add AX88178 support and many other changes") Fixes: `2e55cc7210` ("[PATCH] USB: usbnet (3/9) module for ASIX Ethernet adapters") Cc: stable@vger.kernel.org Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Link: https://patch.msgid.link/20251026164318.57624-1-linmq006@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:51:48 -07:00
Jakub Kicinski	ac345c5fff	Merge branch 'mptcp-various-rare-sending-issues' Matthieu Baerts says: ==================== mptcp: various rare sending issues Here are various fixes from Paolo, addressing very occasional issues on the sending side: - Patch 1: drop an optimisation that could lead to timeout in case of race conditions. A fix for up to v5.11. - Patch 2: fix stream corruption under very specific conditions. A fix for up to v5.13. - Patch 3: restore MPTCP-level zero window probe after a recent fix. A fix for up to v5.16. - Patch 4: new MIB counter to track MPTCP-level zero windows probe to help catching issues similar to the one fixed by the previous patch. ==================== Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-0-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:44:30 -07:00
Paolo Abeni	fe11dfa109	mptcp: zero window probe mib Explicitly account for MPTCP-level zero windows probe, to catch hopefully earlier issues alike the one addressed by the previous patch. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-4-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:44:28 -07:00
Paolo Abeni	a824084b98	mptcp: restore window probe Since commit `72377ab2d6` ("mptcp: more conservative check for zero probes") the MPTCP-level zero window probe check is always disabled, as the TCP-level write queue always contains at least the newly allocated skb. Refine the relevant check tacking in account that the above condition and that such skb can have zero length. Fixes: `72377ab2d6` ("mptcp: more conservative check for zero probes") Cc: stable@vger.kernel.org Reported-by: Geliang Tang <geliang@kernel.org> Closes: https://lore.kernel.org/d0a814c364e744ca6b836ccd5b6e9146882e8d42.camel@kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-3-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:44:28 -07:00
Paolo Abeni	8e04ce45a8	mptcp: fix MSG_PEEK stream corruption If a MSG_PEEK \| MSG_WAITALL read operation consumes all the bytes in the receive queue and recvmsg() need to waits for more data - i.e. it's a blocking one - upon arrival of the next packet the MPTCP protocol will start again copying the oldest data present in the receive queue, corrupting the data stream. Address the issue explicitly tracking the peeked sequence number, restarting from the last peeked byte. Fixes: `ca4fb89257` ("mptcp: add MSG_PEEK support") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-2-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:44:28 -07:00
Paolo Abeni	27b0e701d3	mptcp: drop bogus optimization in __mptcp_check_push() Accessing the transmit queue without owning the msk socket lock is inherently racy, hence __mptcp_check_push() could actually quit early even when there is pending data. That in turn could cause unexpected tx lock and timeout. Dropping the early check avoids the race, implicitly relaying on later tests under the relevant lock. With such change, all the other mptcp_send_head() call sites are now under the msk socket lock and we can additionally drop the now unneeded annotation on the transmit head pointer accesses. Fixes: `6e628cd3a8` ("mptcp: use mptcp release_cb for delayed tasks") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-1-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:44:28 -07:00
Gustavo Luiz Duarte	00764aa5c9	netconsole: Fix race condition in between reader and writer of userdata The update_userdata() function constructs the complete userdata string in nt->extradata_complete and updates nt->userdata_length. This data is then read by write_msg() and write_ext_msg() when sending netconsole messages. However, update_userdata() was not holding target_list_lock during this process, allowing concurrent message transmission to read partially updated userdata. This race condition could result in netconsole messages containing incomplete or inconsistent userdata - for example, reading the old userdata_length with new extradata_complete content, or vice versa, leading to truncated or corrupted output. Fix this by acquiring target_list_lock with spin_lock_irqsave() before updating extradata_complete and userdata_length, and releasing it after both fields are fully updated. This ensures that readers see a consistent view of the userdata, preventing corruption during concurrent access. The fix aligns with the existing locking pattern used throughout the netconsole code, where target_list_lock protects access to target fields including buf[] and msgcounter that are accessed during message transmission. Also get rid of the unnecessary variable complete_idx, which makes it easier to bail out of update_userdata(). Fixes: `df03f830d0` ("net: netconsole: cache userdata formatted string in netconsole_target") Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com> Link: https://patch.msgid.link/20251028-netconsole-fix-race-v4-1-63560b0ae1a0@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:41:00 -07:00
Bagas Sanjaya	a433038098	Documentation: netconsole: Remove obsolete contact people Breno Leitao has been listed in MAINTAINERS as netconsole maintainer since `7c938e438c` ("MAINTAINERS: make Breno the netconsole maintainer"), but the documentation says otherwise that bug reports should be sent to original netconsole authors. Remove obsolate contact info. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20251028132027.48102-1-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:40:19 -07:00
Abdun Nihaal	a4384d786e	nfp: xsk: fix memory leak in nfp_net_alloc() In nfp_net_alloc(), the memory allocated for xsk_pools is not freed in the subsequent error paths, leading to a memory leak. Fix that by freeing it in the error path. Fixes: `6402528b7a` ("nfp: xsk: add AF_XDP zero-copy Rx and Tx support") Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in> Link: https://patch.msgid.link/20251028160845.126919-1-nihaal@cse.iitm.ac.in Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:38:23 -07:00
Jakub Kicinski	bcc843bb0e	Merge branch 'tcp-fix-receive-autotune-again' Matthieu Baerts says: ==================== tcp: fix receive autotune again Neal Cardwell found that recent kernels were having RWIN limited issues, even when net.ipv4.tcp_rmem[2] was set to a very big value like 512MB. He suspected that tcp_stream default buffer size (64KB) was triggering heuristic added in `ea33537d82` ("tcp: add receive queue awareness in tcp_rcv_space_adjust()"). After more testing, it turns out the bug was added earlier with commit `65c5287892` ("tcp: fix sk_rcvbuf overshoot"). I forgot once again that DRS has one RTT latency. MPTCP also got the same issue. This series : - Prevents calling tcp_rcvbuf_grow() on some MPTCP subflows. - adds rcv_ssthresh, window_clamp and rcv_wnd to trace_tcp_rcvbuf_grow(). - Refactors code in a patch with no functional changes. - Fixes the issue in the final patch. ==================== Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-0-74b43ba4c84c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:30:45 -07:00
Eric Dumazet	aa251c8463	tcp: fix too slow tcp_rcvbuf_grow() action While the blamed commits apparently avoided an overshoot, they also limited how fast a sender can increase BDP at each RTT. This is not exactly a revert, we do not add the 16 * tp->advmss cushion we had, and we are keeping the out_of_order_queue contribution. Do the same in mptcp_rcvbuf_grow(). Tested: emulated 50ms rtt (tcp_stream --tcp-tx-delay 50000), cubic 20 second flow. net.ipv4.tcp_rmem set to "4096 131072 67000000" perf record -a -e tcp:tcp_rcvbuf_grow sleep 20 perf script Before: We can see we fail to roughly double RWIN at each RTT. Sender is RWIN limited while CWND is ramping up (before getting tcp_wmem limited). tcp_stream 33793 [010] 825.717525: tcp:tcp_rcvbuf_grow: time=100869 rtt_us=50428 copied=49152 inq=0 space=40960 ooo=0 scaling_ratio=219 rcvbuf=131072 rcv_ssthresh=103970 window_clamp=112128 rcv_wnd=106496 tcp_stream 33793 [010] 825.768966: tcp:tcp_rcvbuf_grow: time=51447 rtt_us=50362 copied=86016 inq=0 space=49152 ooo=0 scaling_ratio=219 rcvbuf=131072 rcv_ssthresh=107474 window_clamp=112128 rcv_wnd=106496 tcp_stream 33793 [010] 825.821539: tcp:tcp_rcvbuf_grow: time=52577 rtt_us=50243 copied=114688 inq=0 space=86016 ooo=0 scaling_ratio=219 rcvbuf=201096 rcv_ssthresh=167377 window_clamp=172031 rcv_wnd=167936 tcp_stream 33793 [010] 825.871781: tcp:tcp_rcvbuf_grow: time=50248 rtt_us=50237 copied=167936 inq=0 space=114688 ooo=0 scaling_ratio=219 rcvbuf=268129 rcv_ssthresh=224722 window_clamp=229375 rcv_wnd=225280 tcp_stream 33793 [010] 825.922475: tcp:tcp_rcvbuf_grow: time=50698 rtt_us=50183 copied=241664 inq=0 space=167936 ooo=0 scaling_ratio=219 rcvbuf=392617 rcv_ssthresh=331217 window_clamp=335871 rcv_wnd=323584 tcp_stream 33793 [010] 825.973326: tcp:tcp_rcvbuf_grow: time=50855 rtt_us=50213 copied=339968 inq=0 space=241664 ooo=0 scaling_ratio=219 rcvbuf=564986 rcv_ssthresh=478674 window_clamp=483327 rcv_wnd=462848 tcp_stream 33793 [010] 826.023970: tcp:tcp_rcvbuf_grow: time=50647 rtt_us=50248 copied=491520 inq=0 space=339968 ooo=0 scaling_ratio=219 rcvbuf=794811 rcv_ssthresh=671778 window_clamp=679935 rcv_wnd=651264 tcp_stream 33793 [010] 826.074612: tcp:tcp_rcvbuf_grow: time=50648 rtt_us=50227 copied=700416 inq=0 space=491520 ooo=0 scaling_ratio=219 rcvbuf=1149124 rcv_ssthresh=974881 window_clamp=983039 rcv_wnd=942080 tcp_stream 33793 [010] 826.125452: tcp:tcp_rcvbuf_grow: time=50845 rtt_us=50225 copied=987136 inq=8192 space=700416 ooo=0 scaling_ratio=219 rcvbuf=1637502 rcv_ssthresh=1392674 window_clamp=1400831 rcv_wnd=1339392 tcp_stream 33793 [010] 826.175698: tcp:tcp_rcvbuf_grow: time=50250 rtt_us=50198 copied=1347584 inq=0 space=978944 ooo=0 scaling_ratio=219 rcvbuf=2288672 rcv_ssthresh=1949729 window_clamp=1957887 rcv_wnd=1945600 tcp_stream 33793 [010] 826.225947: tcp:tcp_rcvbuf_grow: time=50252 rtt_us=50240 copied=1945600 inq=0 space=1347584 ooo=0 scaling_ratio=219 rcvbuf=3150516 rcv_ssthresh=2687010 window_clamp=2695167 rcv_wnd=2691072 tcp_stream 33793 [010] 826.276175: tcp:tcp_rcvbuf_grow: time=50233 rtt_us=50224 copied=2691072 inq=0 space=1945600 ooo=0 scaling_ratio=219 rcvbuf=4548617 rcv_ssthresh=3883041 window_clamp=3891199 rcv_wnd=3887104 tcp_stream 33793 [010] 826.326403: tcp:tcp_rcvbuf_grow: time=50233 rtt_us=50229 copied=3887104 inq=0 space=2691072 ooo=0 scaling_ratio=219 rcvbuf=6291456 rcv_ssthresh=5370482 window_clamp=5382144 rcv_wnd=5373952 tcp_stream 33793 [010] 826.376723: tcp:tcp_rcvbuf_grow: time=50323 rtt_us=50218 copied=5373952 inq=0 space=3887104 ooo=0 scaling_ratio=219 rcvbuf=9087658 rcv_ssthresh=7755537 window_clamp=7774207 rcv_wnd=7757824 tcp_stream 33793 [010] 826.426991: tcp:tcp_rcvbuf_grow: time=50274 rtt_us=50196 copied=7757824 inq=180224 space=5373952 ooo=0 scaling_ratio=219 rcvbuf=12563759 rcv_ssthresh=10729233 window_clamp=10747903 rcv_wnd=10575872 tcp_stream 33793 [010] 826.477229: tcp:tcp_rcvbuf_grow: time=50241 rtt_us=50078 copied=10731520 inq=180224 space=7577600 ooo=0 scaling_ratio=219 rcvbuf=17715667 rcv_ssthresh=15136529 window_clamp=15155199 rcv_wnd=14983168 tcp_stream 33793 [010] 826.527482: tcp:tcp_rcvbuf_grow: time=50258 rtt_us=50153 copied=15138816 inq=360448 space=10551296 ooo=0 scaling_ratio=219 rcvbuf=24667870 rcv_ssthresh=21073410 window_clamp=21102591 rcv_wnd=20766720 tcp_stream 33793 [010] 826.577712: tcp:tcp_rcvbuf_grow: time=50234 rtt_us=50228 copied=21073920 inq=0 space=14778368 ooo=0 scaling_ratio=219 rcvbuf=34550339 rcv_ssthresh=29517041 window_clamp=29556735 rcv_wnd=29519872 tcp_stream 33793 [010] 826.627982: tcp:tcp_rcvbuf_grow: time=50275 rtt_us=50220 copied=29519872 inq=540672 space=21073920 ooo=0 scaling_ratio=219 rcvbuf=49268707 rcv_ssthresh=42090625 window_clamp=42147839 rcv_wnd=41627648 tcp_stream 33793 [010] 826.678274: tcp:tcp_rcvbuf_grow: time=50296 rtt_us=50185 copied=42053632 inq=761856 space=28979200 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57238168 window_clamp=57316406 rcv_wnd=56606720 tcp_stream 33793 [010] 826.728627: tcp:tcp_rcvbuf_grow: time=50357 rtt_us=50128 copied=43913216 inq=851968 space=41291776 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=56524800 tcp_stream 33793 [010] 827.131364: tcp:tcp_rcvbuf_grow: time=50239 rtt_us=50127 copied=43843584 inq=655360 space=43061248 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=56696832 tcp_stream 33793 [010] 827.181613: tcp:tcp_rcvbuf_grow: time=50254 rtt_us=50115 copied=43843584 inq=524288 space=43188224 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=56807424 tcp_stream 33793 [010] 828.339635: tcp:tcp_rcvbuf_grow: time=50283 rtt_us=50110 copied=43843584 inq=458752 space=43319296 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=56864768 tcp_stream 33793 [010] 828.440350: tcp:tcp_rcvbuf_grow: time=50404 rtt_us=50099 copied=43843584 inq=393216 space=43384832 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=56922112 tcp_stream 33793 [010] 829.195106: tcp:tcp_rcvbuf_grow: time=50154 rtt_us=50077 copied=43843584 inq=196608 space=43450368 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=57090048 After: It takes few steps to increase RWIN. Sender is no longer RWIN limited. tcp_stream 50826 [010] 935.634212: tcp:tcp_rcvbuf_grow: time=100788 rtt_us=50315 copied=49152 inq=0 space=40960 ooo=0 scaling_ratio=219 rcvbuf=131072 rcv_ssthresh=103970 window_clamp=112128 rcv_wnd=106496 tcp_stream 50826 [010] 935.685642: tcp:tcp_rcvbuf_grow: time=51437 rtt_us=50361 copied=86016 inq=0 space=49152 ooo=0 scaling_ratio=219 rcvbuf=160875 rcv_ssthresh=132969 window_clamp=137623 rcv_wnd=131072 tcp_stream 50826 [010] 935.738299: tcp:tcp_rcvbuf_grow: time=52660 rtt_us=50256 copied=139264 inq=0 space=86016 ooo=0 scaling_ratio=219 rcvbuf=502741 rcv_ssthresh=411497 window_clamp=430079 rcv_wnd=413696 tcp_stream 50826 [010] 935.788544: tcp:tcp_rcvbuf_grow: time=50249 rtt_us=50233 copied=307200 inq=0 space=139264 ooo=0 scaling_ratio=219 rcvbuf=728690 rcv_ssthresh=618717 window_clamp=623371 rcv_wnd=618496 tcp_stream 50826 [010] 935.838796: tcp:tcp_rcvbuf_grow: time=50258 rtt_us=50202 copied=618496 inq=0 space=307200 ooo=0 scaling_ratio=219 rcvbuf=2450338 rcv_ssthresh=1855709 window_clamp=2096187 rcv_wnd=1859584 tcp_stream 50826 [010] 935.889140: tcp:tcp_rcvbuf_grow: time=50347 rtt_us=50166 copied=1261568 inq=0 space=618496 ooo=0 scaling_ratio=219 rcvbuf=4376503 rcv_ssthresh=3725291 window_clamp=3743961 rcv_wnd=3706880 tcp_stream 50826 [010] 935.939435: tcp:tcp_rcvbuf_grow: time=50300 rtt_us=50185 copied=2478080 inq=24576 space=1261568 ooo=0 scaling_ratio=219 rcvbuf=9082648 rcv_ssthresh=7733731 window_clamp=7769921 rcv_wnd=7692288 tcp_stream 50826 [010] 935.989681: tcp:tcp_rcvbuf_grow: time=50251 rtt_us=50221 copied=4915200 inq=114688 space=2453504 ooo=0 scaling_ratio=219 rcvbuf=16574936 rcv_ssthresh=14108110 window_clamp=14179339 rcv_wnd=14024704 tcp_stream 50826 [010] 936.039967: tcp:tcp_rcvbuf_grow: time=50289 rtt_us=50279 copied=9830400 inq=114688 space=4800512 ooo=0 scaling_ratio=219 rcvbuf=32695050 rcv_ssthresh=27896187 window_clamp=27969593 rcv_wnd=27815936 tcp_stream 50826 [010] 936.090172: tcp:tcp_rcvbuf_grow: time=50211 rtt_us=50200 copied=19841024 inq=114688 space=9715712 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57245176 window_clamp=57316406 rcv_wnd=57163776 tcp_stream 50826 [010] 936.140430: tcp:tcp_rcvbuf_grow: time=50262 rtt_us=50197 copied=39501824 inq=114688 space=19726336 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57245176 window_clamp=57316406 rcv_wnd=57163776 tcp_stream 50826 [010] 936.190527: tcp:tcp_rcvbuf_grow: time=50101 rtt_us=50071 copied=43655168 inq=262144 space=39387136 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57259192 window_clamp=57316406 rcv_wnd=57032704 tcp_stream 50826 [010] 936.240719: tcp:tcp_rcvbuf_grow: time=50197 rtt_us=50057 copied=43843584 inq=262144 space=43393024 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57259192 window_clamp=57316406 rcv_wnd=57032704 tcp_stream 50826 [010] 936.341271: tcp:tcp_rcvbuf_grow: time=50297 rtt_us=50123 copied=43843584 inq=131072 space=43581440 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57259192 window_clamp=57316406 rcv_wnd=57147392 tcp_stream 50826 [010] 936.642503: tcp:tcp_rcvbuf_grow: time=50131 rtt_us=50084 copied=43843584 inq=0 space=43712512 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57259192 window_clamp=57316406 rcv_wnd=57262080 Fixes: `65c5287892` ("tcp: fix sk_rcvbuf overshoot") Fixes: `e118cdc34d` ("mptcp: rcvbuf auto-tuning improvement") Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/589 Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-4-74b43ba4c84c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:30:19 -07:00
Eric Dumazet	b1e014a1f3	tcp: add newval parameter to tcp_rcvbuf_grow() This patch has no functional change, and prepares the following one. tcp_rcvbuf_grow() will need to have access to tp->rcvq_space.space old and new values. Change mptcp_rcvbuf_grow() in a similar way. Signed-off-by: Eric Dumazet <edumazet@google.com> [ Moved 'oldval' declaration to the next patch to avoid warnings at build time. ] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-3-74b43ba4c84c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:30:19 -07:00
Eric Dumazet	24990d89c2	trace: tcp: add three metrics to trace_tcp_rcvbuf_grow() While chasing yet another receive autotuning bug, I found useful to add rcv_ssthresh, window_clamp and rcv_wnd. tcp_stream 40597 [068] 2172.978198: tcp:tcp_rcvbuf_grow: time=50307 rtt_us=50179 copied=77824 inq=0 space=40960 ooo=0 scaling_ratio=219 rcvbuf=131072 rcv_ssthresh=107474 window_clamp=112128 rcv_wnd=110592 tcp_stream 40597 [068] 2173.028528: tcp:tcp_rcvbuf_grow: time=50336 rtt_us=50206 copied=110592 inq=0 space=77824 ooo=0 scaling_ratio=219 rcvbuf=509444 rcv_ssthresh=328658 window_clamp=435813 rcv_wnd=331776 tcp_stream 40597 [068] 2173.078830: tcp:tcp_rcvbuf_grow: time=50305 rtt_us=50070 copied=270336 inq=0 space=110592 ooo=0 scaling_ratio=219 rcvbuf=509444 rcv_ssthresh=431159 window_clamp=435813 rcv_wnd=434176 tcp_stream 40597 [068] 2173.129137: tcp:tcp_rcvbuf_grow: time=50313 rtt_us=50118 copied=434176 inq=0 space=270336 ooo=0 scaling_ratio=219 rcvbuf=2457847 rcv_ssthresh=1299511 window_clamp=2102611 rcv_wnd=1302528 tcp_stream 40597 [068] 2173.179451: tcp:tcp_rcvbuf_grow: time=50318 rtt_us=50041 copied=1019904 inq=0 space=434176 ooo=0 scaling_ratio=219 rcvbuf=2457847 rcv_ssthresh=2087445 window_clamp=2102611 rcv_wnd=2088960 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-2-74b43ba4c84c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:30:18 -07:00
Paolo Abeni	a6f0459aad	mptcp: fix subflow rcvbuf adjust The mptcp PM can add subflow to the conn_list before tcp_init_transfer(). Calling tcp_rcvbuf_grow() on such subflow is not correct as later init will overwrite the update. Fix the issue calling tcp_rcvbuf_grow() only after init buffer initialization. Fixes: `e118cdc34d` ("mptcp: rcvbuf auto-tuning improvement") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-1-74b43ba4c84c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:30:18 -07:00
Jakub Kicinski	f99c579211	Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-10-28 (ice, ixgbe, igb, igc) For ice, Grzegorz fixes setting of PHY lane number and logical PF ID for E82x devices. He also corrects access of CGU (Clock Generation Unit) on dual complex devices. Kohei Enju resolves issues with error path cleanup for probe when in recovery mode on ixgbe and ensures PHY is powered on for link testing on igc. Lastly, he converts incorrect use of -ENOTSUPP to -EOPNOTSUPP on igb, igc, and ixgbe. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ixgbe: use EOPNOTSUPP instead of ENOTSUPP in ixgbe_ptp_feature_enable() igc: use EOPNOTSUPP instead of ENOTSUPP in igc_ethtool_get_sset_count() igb: use EOPNOTSUPP instead of ENOTSUPP in igb_get_sset_count() igc: power up the PHY before the link test ixgbe: fix memory leak and use-after-free in ixgbe_recovery_probe() ice: fix usage of logical PF id ice: fix destination CGU for dual complex E825 ice: fix lane number calculation ==================== Link: https://patch.msgid.link/20251028202515.675129-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:23:11 -07:00
Andrii Melnychenko	90918e3b64	netfilter: nft_ct: add seqadj extension for natted connections Sequence adjustment may be required for FTP traffic with PASV/EPSV modes. due to need to re-write packet payload (IP, port) on the ftp control connection. This can require changes to the TCP length and expected seq / ack_seq. The easiest way to reproduce this issue is with PASV mode. Example ruleset: table inet ftp_nat { ct helper ftp_helper { type "ftp" protocol tcp l3proto inet } chain prerouting { type filter hook prerouting priority 0; policy accept; tcp dport 21 ct state new ct helper set "ftp_helper" } } table ip nat { chain prerouting { type nat hook prerouting priority -100; policy accept; tcp dport 21 dnat ip prefix to ip daddr map { 192.168.100.1 : 192.168.13.2/32 } } chain postrouting { type nat hook postrouting priority 100 ; policy accept; tcp sport 21 snat ip prefix to ip saddr map { 192.168.13.2 : 192.168.100.1/32 } } } Note that the ftp helper gets assigned after the dnat setup. The inverse (nat after helper assign) is handled by an existing check in nf_nat_setup_info() and will not show the problem. Topoloy: +-------------------+ +----------------------------------+ \| FTP: 192.168.13.2 \| <-> \| NAT: 192.168.13.3, 192.168.100.1 \| +-------------------+ +----------------------------------+ \| +-----------------------+ \| Client: 192.168.100.2 \| +-----------------------+ ftp nat changes do not work as expected in this case: Connected to 192.168.100.1. [..] ftp> epsv EPSV/EPRT on IPv4 off. ftp> ls 227 Entering passive mode (192,168,100,1,209,129). 421 Service not available, remote server has closed connection. Kernel logs: Missing nfct_seqadj_ext_add() setup call WARNING: CPU: 1 PID: 0 at net/netfilter/nf_conntrack_seqadj.c:41 [..] __nf_nat_mangle_tcp_packet+0x100/0x160 [nf_nat] nf_nat_ftp+0x142/0x280 [nf_nat_ftp] help+0x4d1/0x880 [nf_conntrack_ftp] nf_confirm+0x122/0x2e0 [nf_conntrack] nf_hook_slow+0x3c/0xb0 .. Fix this by adding the required extension when a conntrack helper is assigned to a connection that has a nat binding. Fixes: `1a64edf54f` ("netfilter: nft_ct: add helper set support") Signed-off-by: Andrii Melnychenko <a.melnychenko@vyos.io> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-10-29 14:47:59 +01:00
Fernando Fernandez Mancera	8d96dfdcab	netfilter: nft_connlimit: fix possible data race on connection count nft_connlimit_eval() reads priv->list->count to check if the connection limit has been exceeded. This value is being read without a lock and can be modified by a different process. Use READ_ONCE() for correctness. Fixes: `df4a902509` ("netfilter: nf_conncount: merge lookup and add functions") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-10-29 14:47:59 +01:00
Florian Westphal	514f1dc8f2	netfilter: nft_ct: enable labels for get case too conntrack labels can only be set when the conntrack has been created with the "ctlabel" extension. For older iptables (connlabel match), adding an "-m connlabel" rule turns on the ctlabel extension allocation for all future conntrack entries. For nftables, its only enabled for 'ct label set foo', but not for 'ct label foo' (i.e. check). But users could have a ruleset that only checks for presence, and rely on userspace to set a label bit via ctnetlink infrastructure. This doesn't work without adding a dummy 'ct label set' rule. We could also enable extension infra for the first (failing) ctnetlink request, but unlike ruleset we would not be able to disable the extension again. Therefore turn on ctlabel extension allocation if an nftables ruleset checks for a connlabel too. Fixes: `1ad8f48df6` ("netfilter: nftables: add connlabel set support") Reported-by: Antonio Ojea <aojea@google.com> Closes: https://lore.kernel.org/netfilter-devel/aPi_VdZpVjWujZ29@strlen.de/ Signed-off-by: Florian Westphal <fw@strlen.de>	2025-10-29 14:47:59 +01:00

1 2 3 4 5 ...

1397466 Commits