mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
901b3290bd
79742 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
b4f82f9ed4 |
Bluetooth: L2CAP: Fix slab-use-after-free Read in l2cap_send_cmd
After the hci sync command releases l2cap_conn, the hci receive data work queue references the released l2cap_conn when sending to the upper layer. Add hci dev lock to the hci receive data work queue to synchronize the two. [1] BUG: KASAN: slab-use-after-free in l2cap_send_cmd+0x187/0x8d0 net/bluetooth/l2cap_core.c:954 Read of size 8 at addr ffff8880271a4000 by task kworker/u9:2/5837 CPU: 0 UID: 0 PID: 5837 Comm: kworker/u9:2 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Workqueue: hci1 hci_rx_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0x169/0x550 mm/kasan/report.c:489 kasan_report+0x143/0x180 mm/kasan/report.c:602 l2cap_build_cmd net/bluetooth/l2cap_core.c:2964 [inline] l2cap_send_cmd+0x187/0x8d0 net/bluetooth/l2cap_core.c:954 l2cap_sig_send_rej net/bluetooth/l2cap_core.c:5502 [inline] l2cap_sig_channel net/bluetooth/l2cap_core.c:5538 [inline] l2cap_recv_frame+0x221f/0x10db0 net/bluetooth/l2cap_core.c:6817 hci_acldata_packet net/bluetooth/hci_core.c:3797 [inline] hci_rx_work+0x508/0xdb0 net/bluetooth/hci_core.c:4040 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310 worker_thread+0x870/0xd30 kernel/workqueue.c:3391 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Allocated by task 5837: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __kmalloc_cache_noprof+0x243/0x390 mm/slub.c:4329 kmalloc_noprof include/linux/slab.h:901 [inline] kzalloc_noprof include/linux/slab.h:1037 [inline] l2cap_conn_add+0xa9/0x8e0 net/bluetooth/l2cap_core.c:6860 l2cap_connect_cfm+0x115/0x1090 net/bluetooth/l2cap_core.c:7239 hci_connect_cfm include/net/bluetooth/hci_core.h:2057 [inline] hci_remote_features_evt+0x68e/0xac0 net/bluetooth/hci_event.c:3726 hci_event_func net/bluetooth/hci_event.c:7473 [inline] hci_event_packet+0xac2/0x1540 net/bluetooth/hci_event.c:7525 hci_rx_work+0x3f3/0xdb0 net/bluetooth/hci_core.c:4035 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310 worker_thread+0x870/0xd30 kernel/workqueue.c:3391 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Freed by task 54: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2353 [inline] slab_free mm/slub.c:4613 [inline] kfree+0x196/0x430 mm/slub.c:4761 l2cap_connect_cfm+0xcc/0x1090 net/bluetooth/l2cap_core.c:7235 hci_connect_cfm include/net/bluetooth/hci_core.h:2057 [inline] hci_conn_failed+0x287/0x400 net/bluetooth/hci_conn.c:1266 hci_abort_conn_sync+0x56c/0x11f0 net/bluetooth/hci_sync.c:5603 hci_cmd_sync_work+0x22b/0x400 net/bluetooth/hci_sync.c:332 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310 worker_thread+0x870/0xd30 kernel/workqueue.c:3391 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Reported-by: syzbot+31c2f641b850a348a734@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=31c2f641b850a348a734 Tested-by: syzbot+31c2f641b850a348a734@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
![]() |
78dafe1cf3 |
vsock: Orphan socket after transport release
During socket release, sock_orphan() is called without considering that it
sets sk->sk_wq to NULL. Later, if SO_LINGER is enabled, this leads to a
null pointer dereferenced in virtio_transport_wait_close().
Orphan the socket only after transport release.
Partially reverts the 'Fixes:' commit.
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
lock_acquire+0x19e/0x500
_raw_spin_lock_irqsave+0x47/0x70
add_wait_queue+0x46/0x230
virtio_transport_release+0x4e7/0x7f0
__vsock_release+0xfd/0x490
vsock_release+0x90/0x120
__sock_release+0xa3/0x250
sock_close+0x14/0x20
__fput+0x35e/0xa90
__x64_sys_close+0x78/0xd0
do_syscall_64+0x93/0x1b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Reported-by: syzbot+9d55b199192a4be7d02c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9d55b199192a4be7d02c
Fixes:
|
||
![]() |
cf56aa8dd2 |
Revert "netfilter: flowtable: teardown flow if cached mtu is stale"
This reverts commit |
||
![]() |
06ea2c9c41 |
rxrpc: Fix alteration of headers whilst zerocopy pending
rxrpc: Fix alteration of headers whilst zerocopy pending
AF_RXRPC now uses MSG_SPLICE_PAGES to do zerocopy of the DATA packets when
it transmits them, but to reduce the number of descriptors required in the
DMA ring, it allocates a space for the protocol header in the memory
immediately before the data content so that it can include both in a single
descriptor. This is used for either the main RX header or the smaller
jumbo subpacket header as appropriate:
+----+------+
| RX | |
+-+--+DATA |
|JH| |
+--+------+
Now, when it stitches a large jumbo packet together from a number of
individual DATA packets (each of which is 1412 bytes of data), it uses the
full RX header from the first and then the jumbo subpacket header for the
rest of the components:
+---+--+------+--+------+--+------+--+------+--+------+--+------+
|UDP|RX|DATA |JH|DATA |JH|DATA |JH|DATA |JH|DATA |JH|DATA |
+---+--+------+--+------+--+------+--+------+--+------+--+------+
As mentioned, the main RX header and the jumbo header overlay one another
in memory and the formats don't match, so switching from one to the other
means rearranging the fields and adjusting the flags.
However, now that TLP has been included, it wants to retransmit the last
subpacket as a new data packet on its own, which means switching between
the header formats... and if the transmission is still pending, because of
the MSG_SPLICE_PAGES, we end up corrupting the jumbo subheader.
This has a variety of effects, with the RX service number overwriting the
jumbo checksum/key number field and the RX checksum overwriting the jumbo
flags - resulting in, at the very least, a confused connection-level abort
from the peer.
Fix this by leaving the jumbo header in the allocation with the data, but
allocating the RX header from the page frag allocator and concocting it on
the fly at the point of transmission as it does for ACK packets.
Fixes:
|
||
![]() |
44ce3511c2 |
Here are some batman-adv bugfixes:
- Fix panic during interface removal in BATMAN V, by Andy Strohman - Cleanup BATMAN V/ELP metric handling, by Sven Eckelmann (2 patches) - Fix incorrect offset in batadv_tt_tvlv_ogm_handler_v1(), by Remi Pommarel -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmel2OQWHHN3QHNpbW9u d3VuZGVybGljaC5kZQAKCRChK+OYQpKeodbBEADDhvbRjp7TfTYZa8m7Kw3GUO6h 3TEsYcz8the91rhHJ9e3j1q4D2W/xND6KJ9t784jR6IESUHVmpch3y5ofM68znFk lo9urmeNZhGqjekF96vR7jUHtTUbdjh5nKiBcHbcZmEVBB8rVXyoPZHxOhdzJ2am oBi1jDYewqKQWWiGlkTGNVEKOKV3ijXLj7/Jj5LvCQqHoiN/jowH8Nck+4ji43ue hKqS26Ma150d8SFsABIFEVOIcmOXADAEELnLHXj6L1QFMi+lR2/0WtAfgOTJabSh sEje1V+XCFat1ocCOyooUnKBXXMogig8ZDX89wZA+jT2BktNIkMMOvdmwti1cHf4 bS04ltMWm+IHAF7dJkRDvhUm23b4ZEmyvR+WIcjOg68Dau2NCu8fkJOGPv/fpXck WVYbc1H53E0d6ykQGLX++NgeCAhgDx0aOMVf2pFa/CsU23ZM/fpSsRYl+apUfIxs vk+KTFd1JCuQpkiAtr0QRG9vvz4u/Jd6q9haweFnKnCcIrkq7i2L48vp+dCGP1E2 8ujnoYsoKCS29QzyY1hUytQ9D55sRa7jr6raxr0QHSJ9d9DWLvtSdX+4nXRefgl5 4ywqGuhLUqQC9Bp7N0fcZYcQOqb7CcKBJCWv3wr2BR762wlnhZXoCDai9XtzkqsO XWlNGNrIphy4SH55lQ== =yESc -----END PGP SIGNATURE----- Merge tag 'batadv-net-pullrequest-20250207' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here are some batman-adv bugfixes: - Fix panic during interface removal in BATMAN V, by Andy Strohman - Cleanup BATMAN V/ELP metric handling, by Sven Eckelmann (2 patches) - Fix incorrect offset in batadv_tt_tvlv_ogm_handler_v1(), by Remi Pommarel * tag 'batadv-net-pullrequest-20250207' of git://git.open-mesh.org/linux-merge: batman-adv: Fix incorrect offset in batadv_tt_tvlv_ogm_handler_v1() batman-adv: Drop unmanaged ELP metric worker batman-adv: Ignore neighbor throughput metrics in error case batman-adv: fix panic during interface removal ==================== Link: https://patch.msgid.link/20250207095823.26043-1-sw@simonwunderlich.de Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
![]() |
8e248f2dbb |
linux-can-fixes-for-6.14-20250208
-----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEn/sM2K9nqF/8FWzzDHRl3/mQkZwFAmenQ2MTHG1rbEBwZW5n dXRyb25peC5kZQAKCRAMdGXf+ZCRnM1SCACKSffnuZgemuq/Gl/2q30SSkDyvC7s D0C1VO32y2SmoBhfRmQqMKbGH6zbTTgPaDygUxfPEUBZ6P4wN7Cj8VsKk0LKRRZF pV2+VPRdIJjqDjsHHJY7FZSE7GBtEhS6KjOi4S4dOIYM3RdXCYMrIROqPYYnQMAX 97Gvn3eaY4pmXG7Uq6oavxMAiTgjdKHecsM1qF0/TtTTcbufSEy1PronZmej7mHQ vv8SDNIG8nxB8lE8s3pKgeNeHxgHMMLrYSUtvCkBOsScDWP6FVn4bIJen2lZ4NJD 5VHt6hwseAtZv9Zt+swnHFUiIuXPQNFmAZHE57IpFWSh9b4oUQWImUZz =Sq6N -----END PGP SIGNATURE----- Merge tag 'linux-can-fixes-for-6.14-20250208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2025-02-08 The first patch is by Reyders Morales and fixes a code example in the CAN ISO15765-2 documentation. The next patch is contributed by Alexander Hölzl and fixes sending of J1939 messages with zero data length. Fedor Pchelkin's patch for the ctucanfd driver adds a missing handling for an skb allocation error. Krzysztof Kozlowski contributes a patch for the c_can driver to fix unbalanced runtime PM disable in error path. The next patch is by Vincent Mailhol and fixes a NULL pointer dereference on udev->serial in the etas_es58x driver. The patch is by Robin van der Gracht and fixes the handling for an skb allocation error. * tag 'linux-can-fixes-for-6.14-20250208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: can: rockchip: rkcanfd_handle_rx_fifo_overflow_int(): bail out if skb cannot be allocated can: etas_es58x: fix potential NULL pointer dereference on udev->serial can: c_can: fix unbalanced runtime PM disable in error path can: ctucanfd: handle skb allocation failure can: j1939: j1939_sk_send_loop(): fix unable to send messages with data length zero Documentation/networking: fix basic node example document ISO 15765-2 ==================== Link: https://patch.msgid.link/20250208115120.237274-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
087c1faa59 |
ipv6: mcast: extend RCU protection in igmp6_send()
igmp6_send() can be called without RTNL or RCU being held.
Extend RCU protection so that we can safely fetch the net pointer
and avoid a potential UAF.
Note that we no longer can use sock_alloc_send_skb() because
ipv6.igmp_sk uses GFP_KERNEL allocations which can sleep.
Instead use alloc_skb() and charge the net->ipv6.igmp_sk
socket under RCU protection.
Fixes:
|
||
![]() |
ed6ae1f325 |
ndisc: extend RCU protection in ndisc_send_skb()
ndisc_send_skb() can be called without RTNL or RCU held.
Acquire rcu_read_lock() earlier, so that we can use dev_net_rcu()
and avoid a potential UAF.
Fixes:
|
||
![]() |
90b2f49a50 |
openvswitch: use RCU protection in ovs_vport_cmd_fill_info()
ovs_vport_cmd_fill_info() can be called without RTNL or RCU.
Use RCU protection and dev_net_rcu() to avoid potential UAF.
Fixes:
|
||
![]() |
a42b69f692 |
arp: use RCU protection in arp_xmit()
arp_xmit() can be called without RTNL or RCU protection.
Use RCU protection to avoid potential UAF.
Fixes:
|
||
![]() |
becbd5850c |
neighbour: use RCU protection in __neigh_notify()
__neigh_notify() can be called without RTNL or RCU protection.
Use RCU protection to avoid potential UAF.
Fixes:
|
||
![]() |
628e6d1893 |
ndisc: use RCU protection in ndisc_alloc_skb()
ndisc_alloc_skb() can be called without RTNL or RCU being held.
Add RCU protection to avoid possible UAF.
Fixes:
|
||
![]() |
48145a57d4 |
ndisc: ndisc_send_redirect() must use dev_get_by_index_rcu()
ndisc_send_redirect() is called under RCU protection, not RTNL.
It must use dev_get_by_index_rcu() instead of __dev_get_by_index()
Fixes:
|
||
![]() |
58c9bf3363 |
hid-for-linus-2025021001
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEL65usyKPHcrRDEicpmLzj2vtYEkFAmepsoQACgkQpmLzj2vt YEmb2g//c9lAemKMzfKuAvm7X3wpuE+eOm98WqgPchWStqYy2yVR/gziIn5GtfV6 0FtOGUyR8qAgozruc+kHOUvuV6rrxWNgc4I+06//k+JhM8uHxC7pKdBSrJAURwsd 9DnZdAIHwu8gQBJ3b2zTtJZC/EEJdjTUOZiSqGL2YszvqjZCRGKXvDzPRwBUGcQq uJAL/RrRWtc0vRmN3DfmCtTA1A+hIOiE8KikYChYFKZdXSTDOKprQANWpfw7zAr0 8m9wv3c0wBX1Na+MdUG4RnxYJbJ/ojcVMtk1u67PmrC6netO/n0YnxFooCelP7BM WQgNvmp/KzMsMzSF98MJd4aiIkf8aeJZv67WJDxKH/pNdpY0y3d57y5U+LNE3bCB 8gfp9YGpkKgBOpv+sMMwSP2vl9OSroDCPitIcF9gJqM6ldw+WpQ7VXgsjHyp96LD lgUYyaUxni/nbp2cVwIUjAX9dgFNagAq0iAsCG0+PaFqsdtRtD4bx7hp8oP650KX KksdABkajP7AF7FtZ5qE4ODjvjtrIuWN+jqL0QKigbXLAlnL2M8ID9iFNB1gvAQK FXGBDNcY3m1/NWiQopmUlGWCYUwZiIxwjhykVlkqHHJLdhlRoVsTVFUbky1W6D4c SewJqrvzTwq+k5kUnvI+yUGM6E0i8rWlvNQwKlhZtR95S0H27kU= =s9Ex -----END PGP SIGNATURE----- Merge tag 'hid-for-linus-2025021001' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Jiri Kosina: - build/dependency fixes for hid-lenovo and hid-intel-thc (Arnd Bergmann) - functional fixes for hid-corsair-void (Stuart Hayhurst) - workqueue handling and ordering fix for hid-steam (Vicki Pfau) - Gamepad mode vs. Lizard mode fix for hid-steam (Vicki Pfau) - OOB read fix for hid-thrustmaster (Tulio Fernandes) - fix for very long timeout on certain firmware in intel-ish-hid (Zhang Lixu) - other assorted small code fixes and device ID additions * tag 'hid-for-linus-2025021001' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: HID: hid-steam: Don't use cancel_delayed_work_sync in IRQ context HID: hid-steam: Move hidraw input (un)registering to work HID: hid-thrustmaster: fix stack-out-of-bounds read in usb_check_int_endpoints() HID: apple: fix up the F6 key on the Omoton KB066 keyboard HID: hid-apple: Apple Magic Keyboard a3203 USB-C support samples/hid: fix broken vmlinux path for VMLINUX_BTF samples/hid: remove unnecessary -I flags from libbpf EXTRA_CFLAGS HID: topre: Fix n-key rollover on Realforce R3S TKL boards HID: intel-ish-hid: ipc: Add Panther Lake PCI device IDs HID: multitouch: Add NULL check in mt_input_configured HID: winwing: Add NULL check in winwing_init_led() HID: hid-steam: Fix issues with disabling both gamepad mode and lizard mode HID: ignore non-functional sensor in HP 5MP Camera HID: intel-thc: fix CONFIG_HID dependency HID: lenovo: select CONFIG_ACPI_PLATFORM_PROFILE HID: intel-ish-hid: Send clock sync message immediately after reset HID: intel-ish-hid: fix the length of MNG_SYNC_FW_CLOCK in doorbell HID: corsair-void: Initialise memory for psy_cfg HID: corsair-void: Add missing delayed work cancel for headset status |
||
![]() |
44de577e61 |
can: j1939: j1939_sk_send_loop(): fix unable to send messages with data length zero
The J1939 standard requires the transmission of messages of length 0.
For example proprietary messages are specified with a data length of 0
to 1785. The transmission of such messages is not possible. Sending
results in no error being returned but no corresponding can frame
being generated.
Enable the transmission of zero length J1939 messages. In order to
facilitate this two changes are necessary:
1) If the transmission of a new message is requested from user space
the message is segmented in j1939_sk_send_loop(). Let the segmentation
take into account zero length messages, do not terminate immediately,
queue the corresponding skb.
2) j1939_session_skb_get_by_offset() selects the next skb to transmit
for a session. Take into account that there might be zero length skbs
in the queue.
Signed-off-by: Alexander Hölzl <alexander.hoelzl@gmx.net>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250205174651.103238-1-alexander.hoelzl@gmx.net
Fixes:
|
||
![]() |
011b033590 |
Revert "net: skb: introduce and use a single page frag cache"
This reverts commit |
||
![]() |
cb827db50a |
net: fib_rules: annotate data-races around rule->[io]ifindex
rule->iifindex and rule->oifindex can be read without holding RTNL.
Add READ_ONCE()/WRITE_ONCE() annotations where needed.
Fixes:
|
||
![]() |
8c67da5bc1 |
vfs-6.14-rc2.fixes
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ6XhrQAKCRCRxhvAZXjc oujrAQCpGmhvh2jGIKcSmEigNHOGCUXDG+1QsVpnCeP9OaUrkAEA+dMo4Ai4hz4J nYeeAgpjGuu+XLMmi7EiGxpI0fQL3gc= =oN/E -----END PGP SIGNATURE----- Merge tag 'vfs-6.14-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix fsnotify FMODE_NONOTIFY* handling. This also disables fsnotify on all pseudo files by default apart from very select exceptions. This carries a regression risk so we need to watch out and adapt accordingly. However, it is overall a significant improvement over the current status quo where every rando file can get fsnotify enabled. - Cleanup and simplify lockref_init() after recent lockref changes. - Fix vboxfs build with gcc-15. - Add an assert into inode_set_cached_link() to catch corrupt links. - Allow users to also use an empty string check to detect whether a given mount option string was empty or not. - Fix how security options were appended to statmount()'s ->mnt_opt field. - Fix statmount() selftests to always check the returned mask. - Fix uninitialized value in vfs_statx_path(). - Fix pidfs_ioctl() sanity checks to guard against ioctl() overloading and preserve extensibility. * tag 'vfs-6.14-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: sanity check the length passed to inode_set_cached_link() pidfs: improve ioctl handling fsnotify: disable pre-content and permission events by default selftests: always check mask returned by statmount(2) fsnotify: disable notification by default for all pseudo files fs: fix adding security options to statmount.mnt_opt fsnotify: use accessor to set FMODE_NONOTIFY_* lockref: remove count argument of lockref_init gfs2: switch to lockref_init(..., 1) gfs2: use lockref_init for gl_lockref statmount: let unset strings be empty vboxsf: fix building with GCC 15 fs/stat.c: avoid harmless garbage value problem in vfs_statx_path() |
||
![]() |
2a42754b31
|
fsnotify: disable notification by default for all pseudo files
Most pseudo files are not applicable for fsnotify events at all,
let alone to the new pre-content events.
Disable notifications to all files allocated with alloc_file_pseudo()
and enable legacy inotify events for the specific cases of pipe and
socket, which have known users of inotify events.
Pre-content events are also kept disabled for sockets and pipes.
Fixes:
|
||
![]() |
1438f5d07b |
rtnetlink: fix netns leak with rtnl_setlink()
A call to rtnl_nets_destroy() is needed to release references taken on
netns put in rtnl_nets.
CC: stable@vger.kernel.org
Fixes:
|
||
![]() |
bca0902e61 |
ax25: Fix refcount leak caused by setting SO_BINDTODEVICE sockopt
If an AX25 device is bound to a socket by setting the SO_BINDTODEVICE socket option, a refcount leak will occur in ax25_release(). Commit |
||
![]() |
6a774228e8 |
net: ethtool: tsconfig: Fix netlink type of hwtstamp flags
Fix the netlink type for hardware timestamp flags, which are represented
as a bitset of flags. Although only one flag is supported currently, the
correct netlink bitset type should be used instead of u32 to keep
consistency with other fields. Address this by adding a new named string
set description for the hwtstamp flag structure.
The code has been introduced in the current release so the uAPI change is
still okay.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Fixes:
|
||
![]() |
b768294d44 |
ipv6: Use RCU in ip6_input()
Instead of grabbing rcu_read_lock() from ip6_input_finish(), do it earlier in is caller, so that ip6_input() access to dev_net() can be validated by LOCKDEP. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250205155120.1676781-13-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
34aef2b0ce |
ipv6: icmp: convert to dev_net_rcu()
icmp6_send() must acquire rcu_read_lock() sooner to ensure
the dev_net() call done from a safe context.
Other ICMPv6 uses of dev_net() seem safe, change them to
dev_net_rcu() to get LOCKDEP support to catch bugs.
Fixes:
|
||
![]() |
3c8ffcd248 |
ipv6: use RCU protection in ip6_default_advmss()
ip6_default_advmss() needs rcu protection to make
sure the net structure it reads does not disappear.
Fixes:
|
||
![]() |
afec62cd0a |
flow_dissector: use RCU protection to fetch dev_net()
__skb_flow_dissect() can be called from arbitrary contexts.
It must extend its RCU protection section to include
the call to dev_net(), which can become dev_net_rcu().
This makes sure the net structure can not disappear under us.
Fixes:
|
||
![]() |
4b8474a095 |
ipv4: icmp: convert to dev_net_rcu()
__icmp_send() must ensure rcu_read_lock() is held, as spotted
by Jakub.
Other ICMP uses of dev_net() seem safe, change them to dev_net_rcu()
to get LOCKDEP support.
Fixes:
|
||
![]() |
139512191b |
ipv4: use RCU protection in __ip_rt_update_pmtu()
__ip_rt_update_pmtu() must use RCU protection to make sure the net structure it reads does not disappear. Fixes: |
||
![]() |
719817cd29 |
ipv4: use RCU protection in inet_select_addr()
inet_select_addr() must use RCU protection to make
sure the net structure it reads does not disappear.
Fixes:
|
||
![]() |
dd205fcc33 |
ipv4: use RCU protection in rt_is_expired()
rt_is_expired() must use RCU protection to make
sure the net structure it reads does not disappear.
Fixes:
|
||
![]() |
71b8471c93 |
ipv4: use RCU protection in ipv4_default_advmss()
ipv4_default_advmss() must use RCU protection to make
sure the net structure it reads does not disappear.
Fixes:
|
||
![]() |
3cf0a98fea |
Current release - regressions:
- core: harmonize tstats and dstats - ipv6: fix dst refleaks in rpl, seg6 and ioam6 lwtunnels - eth: tun: revert fix group permission check - eth: stmmac: revert "specify hardware capability value when FIFO size isn't specified" Previous releases - regressions: - udp: gso: do not drop small packets when PMTU reduces - rxrpc: fix race in call state changing vs recvmsg() - eth: ice: fix Rx data path for heavy 9k MTU traffic - eth: vmxnet3: fix tx queue race condition with XDP Previous releases - always broken: - sched: pfifo_tail_enqueue: drop new packet when sch->limit == 0 - ethtool: ntuple: fix rss + ring_cookie check - rxrpc: fix the rxrpc_connection attend queue handling Misc: - recognize Kuniyuki Iwashima as a maintainer Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmekqVQSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkLgcP/3sewa114G/vERakXPptWLnCRtXLMaMk E8ihlBj9qI0hD51Qi+NRYKI/IJqA5ZB/p4I5cBz7xI8d5VOQYhSuCdCnwMEZPAfG IQS8VpcFjcdj1e7cjlFu/s5PzF6cRRvinhWQpE5YpuE2TFCl+SJ8XAupLvi8zCe4 wo1Pet4dhKOKtrrhqiCeSNUer0/MSeLrCIB7mHha279l0D8Wx+m+h8OJMQahXI0t IJWNkxPyrIqx0ksR8Uy9SVAIrNlR5iEeqcwztLUtqxzs+b/s7OJcWmgqXgSMTOb5 RUr8Bthm3DQyXuFoR5bFA/oaoJb+iZvOjcEqQaYU2dF5zcRfdY/omhQmNq6s/7/w 7KV7C/RIrnhwLjkEd2IX5pzvgM9VuO9ewk65+sZ82o0wukhKn6RIyJYHzvwbc0rH rQMGAoz/uF4eiuqfS+uh86+VJfjBkG/sgdw0JSAt7dN14Fadg14+Ocz6apjwNuzJ 0QUWABxKSpvktufycyoo5s/bPqLMBxI/W3XZwIzbJ+dupRavPxkWpT/AwI//ZTDe rUTl2SEd4Ruliy1w41TXgubRxuW07M7bm4AxUrLzAzTx+aX7AdBeIyZp83x3oW1p nNUSdJ0m7A0Z1k28tdAP/gjD/vjcQ0J5YQ6MvZ1RBQl7MI+GLYD2irWbbrlI+g+6 HKQcXhv/7pG4 =t5WM -----END PGP SIGNATURE----- Merge tag 'net-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Interestingly the recent kmemleak improvements allowed our CI to catch a couple of percpu leaks addressed here. We (mostly Jakub, to be accurate) are working to increase review coverage over the net code-base tweaking the MAINTAINER entries. Current release - regressions: - core: harmonize tstats and dstats - ipv6: fix dst refleaks in rpl, seg6 and ioam6 lwtunnels - eth: tun: revert fix group permission check - eth: stmmac: revert "specify hardware capability value when FIFO size isn't specified" Previous releases - regressions: - udp: gso: do not drop small packets when PMTU reduces - rxrpc: fix race in call state changing vs recvmsg() - eth: ice: fix Rx data path for heavy 9k MTU traffic - eth: vmxnet3: fix tx queue race condition with XDP Previous releases - always broken: - sched: pfifo_tail_enqueue: drop new packet when sch->limit == 0 - ethtool: ntuple: fix rss + ring_cookie check - rxrpc: fix the rxrpc_connection attend queue handling Misc: - recognize Kuniyuki Iwashima as a maintainer" * tag 'net-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits) Revert "net: stmmac: Specify hardware capability value when FIFO size isn't specified" MAINTAINERS: add a sample ethtool section entry MAINTAINERS: add entry for ethtool rxrpc: Fix race in call state changing vs recvmsg() rxrpc: Fix call state set to not include the SERVER_SECURING state net: sched: Fix truncation of offloaded action statistics tun: revert fix group permission check selftests/tc-testing: Add a test case for qdisc_tree_reduce_backlog() netem: Update sch->q.qlen before qdisc_tree_reduce_backlog() selftests/tc-testing: Add a test case for pfifo_head_drop qdisc when limit==0 pfifo_tail_enqueue: Drop new packet when sch->limit == 0 selftests: mptcp: connect: -f: no reconnect net: rose: lock the socket in rose_bind() net: atlantic: fix warning during hot unplug rxrpc: Fix the rxrpc_connection attend queue handling net: harmonize tstats and dstats selftests: drv-net: rss_ctx: don't fail reconfigure test if queue offset not supported selftests: drv-net: rss_ctx: add missing cleanup in queue reconfigure ethtool: ntuple: fix rss + ring_cookie check ethtool: rss: fix hiding unsupported fields in dumps ... |
||
![]() |
2d7b30aef3 |
rxrpc: Fix race in call state changing vs recvmsg()
There's a race in between the rxrpc I/O thread recording the end of the
receive phase of a call and recvmsg() examining the state of the call to
determine whether it has completed.
The problem is that call->_state records the I/O thread's view of the call,
not the application's view (which may lag), so that alone is not
sufficient. To this end, the application also checks whether there is
anything left in call->recvmsg_queue for it to pick up. The call must be
in state RXRPC_CALL_COMPLETE and the recvmsg_queue empty for the call to be
considered fully complete.
In rxrpc_input_queue_data(), the latest skbuff is added to the queue and
then, if it was marked as LAST_PACKET, the state is advanced... But this
is two separate operations with no locking around them.
As a consequence, the lack of locking means that sendmsg() can jump into
the gap on a service call and attempt to send the reply - but then get
rejected because the I/O thread hasn't advanced the state yet.
Simply flipping the order in which things are done isn't an option as that
impacts the client side, causing the checks in rxrpc_kernel_check_life() as
to whether the call is still alive to race instead.
Fix this by moving the update of call->_state inside the skb queue
spinlocked section where the packet is queued on the I/O thread side.
rxrpc's recvmsg() will then automatically sync against this because it has
to take the call->recvmsg_queue spinlock in order to dequeue the last
packet.
rxrpc's sendmsg() doesn't need amending as the app shouldn't be calling it
to send a reply until recvmsg() indicates it has returned all of the
request.
Fixes:
|
||
![]() |
41b996ce83 |
rxrpc: Fix call state set to not include the SERVER_SECURING state
The RXRPC_CALL_SERVER_SECURING state doesn't really belong with the other
states in the call's state set as the other states govern the call's Rx/Tx
phase transition and govern when packets can and can't be received or
transmitted. The "Securing" state doesn't actually govern the reception of
packets and would need to be split depending on whether or not we've
received the last packet yet (to mirror RECV_REQUEST/ACK_REQUEST).
The "Securing" state is more about whether or not we can start forwarding
packets to the application as recvmsg will need to decode them and the
decoding can't take place until the challenge/response exchange has
completed.
Fix this by removing the RXRPC_CALL_SERVER_SECURING state from the state
set and, instead, using a flag, RXRPC_CALL_CONN_CHALLENGING, to track
whether or not we can queue the call for reception by recvmsg() or notify
the kernel app that data is ready. In the event that we've already
received all the packets, the connection event handler will poke the app
layer in the appropriate manner.
Also there's a race whereby the app layer sees the last packet before rxrpc
has managed to end the rx phase and change the state to one amenable to
allowing a reply. Fix this by queuing the packet after calling
rxrpc_end_rx_phase().
Fixes:
|
||
![]() |
638ba50893 |
netem: Update sch->q.qlen before qdisc_tree_reduce_backlog()
qdisc_tree_reduce_backlog() notifies parent qdisc only if child
qdisc becomes empty, therefore we need to reduce the backlog of the
child qdisc before calling it. Otherwise it would miss the opportunity
to call cops->qlen_notify(), in the case of DRR, it resulted in UAF
since DRR uses ->qlen_notify() to maintain its active list.
Fixes:
|
||
![]() |
647cef20e6 |
pfifo_tail_enqueue: Drop new packet when sch->limit == 0
Expected behaviour:
In case we reach scheduler's limit, pfifo_tail_enqueue() will drop a
packet in scheduler's queue and decrease scheduler's qlen by one.
Then, pfifo_tail_enqueue() enqueue new packet and increase
scheduler's qlen by one. Finally, pfifo_tail_enqueue() return
`NET_XMIT_CN` status code.
Weird behaviour:
In case we set `sch->limit == 0` and trigger pfifo_tail_enqueue() on a
scheduler that has no packet, the 'drop a packet' step will do nothing.
This means the scheduler's qlen still has value equal 0.
Then, we continue to enqueue new packet and increase scheduler's qlen by
one. In summary, we can leverage pfifo_tail_enqueue() to increase qlen by
one and return `NET_XMIT_CN` status code.
The problem is:
Let's say we have two qdiscs: Qdisc_A and Qdisc_B.
- Qdisc_A's type must have '->graft()' function to create parent/child relationship.
Let's say Qdisc_A's type is `hfsc`. Enqueue packet to this qdisc will trigger `hfsc_enqueue`.
- Qdisc_B's type is pfifo_head_drop. Enqueue packet to this qdisc will trigger `pfifo_tail_enqueue`.
- Qdisc_B is configured to have `sch->limit == 0`.
- Qdisc_A is configured to route the enqueued's packet to Qdisc_B.
Enqueue packet through Qdisc_A will lead to:
- hfsc_enqueue(Qdisc_A) -> pfifo_tail_enqueue(Qdisc_B)
- Qdisc_B->q.qlen += 1
- pfifo_tail_enqueue() return `NET_XMIT_CN`
- hfsc_enqueue() check for `NET_XMIT_SUCCESS` and see `NET_XMIT_CN` => hfsc_enqueue() don't increase qlen of Qdisc_A.
The whole process lead to a situation where Qdisc_A->q.qlen == 0 and Qdisc_B->q.qlen == 1.
Replace 'hfsc' with other type (for example: 'drr') still lead to the same problem.
This violate the design where parent's qlen should equal to the sum of its childrens'qlen.
Bug impact: This issue can be used for user->kernel privilege escalation when it is reachable.
Fixes:
|
||
![]() |
a1300691ae |
net: rose: lock the socket in rose_bind()
syzbot reported a soft lockup in rose_loopback_timer(),
with a repro calling bind() from multiple threads.
rose_bind() must lock the socket to avoid this issue.
Fixes:
|
||
![]() |
4241a702e0 |
rxrpc: Fix the rxrpc_connection attend queue handling
The rxrpc_connection attend queue is never used because conn::attend_link
is never initialised and so is always NULL'd out and thus always appears to
be busy. This requires the following fix:
(1) Fix this the attend queue problem by initialising conn::attend_link.
And, consequently, two further fixes for things masked by the above bug:
(2) Fix rxrpc_input_conn_event() to handle being invoked with a NULL
sk_buff pointer - something that can now happen with the above change.
(3) Fix the RXRPC_SKB_MARK_SERVICE_CONN_SECURED message to carry a pointer
to the connection and a ref on it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Fixes:
|
||
![]() |
d3ed6dee73 |
net: harmonize tstats and dstats
After the blamed commits below, some UDP tunnel use dstats for accounting. On the xmit path, all the UDP-base tunnels ends up using iptunnel_xmit_stats() for stats accounting, and the latter assumes the relevant (tunnel) network device uses tstats. The end result is some 'funny' stat report for the mentioned UDP tunnel, e.g. when no packet is actually dropped and a bunch of packets are transmitted: gnv2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue \ state UNKNOWN mode DEFAULT group default qlen 1000 link/ether ee:7d:09:87:90:ea brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped missed mcast 14916 23 0 15 0 0 TX: bytes packets errors dropped carrier collsns 0 1566 0 0 0 0 Address the issue ensuring the same binary layout for the overlapping fields of dstats and tstats. While this solution is a bit hackish, is smaller and with no performance pitfall compared to other alternatives i.e. supporting both dstat and tstat in iptunnel_xmit_stats() or reverting the blamed commit. With time we should possibly move all the IP-based tunnel (and virtual devices) to dstats. Fixes: |
||
![]() |
2b91cc1214 |
ethtool: ntuple: fix rss + ring_cookie check
The info.flow_type is for RXFH commands, ntuple flow_type is inside
the flow spec. The check currently does nothing, as info.flow_type
is 0 (or even uninitialized by user space) for ETHTOOL_SRXCLSRLINS.
Fixes:
|
||
![]() |
244f8aa46f |
ethtool: rss: fix hiding unsupported fields in dumps
Commit |
||
![]() |
235174b2be |
udp: gso: do not drop small packets when PMTU reduces
Commit |
||
![]() |
a5a056c8d2 |
HID: intel-thc: fix CONFIG_HID dependency
In drivers/hid/, most drivers depend on CONFIG_HID, while a couple of the
drivers in subdirectories instead depend on CONFIG_HID_SUPPORT and use
'select HID'. With the newly added INTEL_THC_HID, this causes a build
warning for a circular dependency:
WARNING: unmet direct dependencies detected for HID
Depends on [m]: HID_SUPPORT [=y] && INPUT [=m]
Selected by [y]:
- INTEL_THC_HID [=y] && HID_SUPPORT [=y] && X86_64 [=y] && PCI [=y] && ACPI [=y]
WARNING: unmet direct dependencies detected for INPUT_FF_MEMLESS
Depends on [m]: INPUT [=m]
Selected by [y]:
- HID_MICROSOFT [=y] && HID_SUPPORT [=y] && HID [=y]
- GREENASIA_FF [=y] && HID_SUPPORT [=y] && HID [=y] && HID_GREENASIA [=y]
- HID_WIIMOTE [=y] && HID_SUPPORT [=y] && HID [=y] && LEDS_CLASS [=y]
- ZEROPLUS_FF [=y] && HID_SUPPORT [=y] && HID [=y] && HID_ZEROPLUS [=y]
Selected by [m]:
- HID_ACRUX_FF [=y] && HID_SUPPORT [=y] && HID [=y] && HID_ACRUX [=m]
- HID_EMS_FF [=m] && HID_SUPPORT [=y] && HID [=y]
- HID_GOOGLE_STADIA_FF [=m] && HID_SUPPORT [=y] && HID [=y]
- PANTHERLORD_FF [=y] && HID_SUPPORT [=y] && HID [=y] && HID_PANTHERLORD [=m]
It's better to be consistent and always use 'depends on HID' for HID
drivers. The notable exception here is USB_KBD/USB_MOUSE, which are
alternative implementations that do not depend on the HID subsystem.
Do this by extending the "if HID" section below, which means that a few
of the duplicate "depends on HID" and "depends on INPUT" statements
can be removed in the process.
Fixes:
|
||
![]() |
92191dd107 |
net: ipv6: fix dst ref loops in rpl, seg6 and ioam6 lwtunnels
Some lwtunnels have a dst cache for post-transformation dst. If the packet destination did not change we may end up recording a reference to the lwtunnel in its own cache, and the lwtunnel state will never be freed. Discovered by the ioam6.sh test, kmemleak was recently fixed to catch per-cpu memory leaks. I'm not sure if rpl and seg6 can actually hit this, but in principle I don't see why not. Fixes: |
||
![]() |
c71a192976 |
net: ipv6: fix dst refleaks in rpl, seg6 and ioam6 lwtunnels
dst_cache_get() gives us a reference, we need to release it. Discovered by the ioam6.sh test, kmemleak was recently fixed to catch per-cpu memory leaks. Fixes: |
||
![]() |
a86bf2283d |
assorted stuff for this merge window
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZ5yJdgAKCRBZ7Krx/gZQ 69W4AQDwgxceiQ6icx3rFhCWQigne4jdMO84kd8tNaa+xHGe1AD/WnkeChc5DqjQ wZWZxAAzml9SS01IcSiHWaF5fgrjlA0= =rXOq -----END PGP SIGNATURE----- Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs cleanups from Al Viro: "Two unrelated patches - one is a removal of long-obsolete include in overlayfs (it used to need fs/internal.h, but the extern it wanted has been moved back to include/linux/namei.h) and another introduces convenience helper constructing struct qstr by a NUL-terminated string" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add a string-to-qstr constructor fs/overlayfs/namei.c: get rid of include ../internal.h |
||
![]() |
c2933b2bef |
First batch of fixes for 6.14. Nothing really stands out,
but as usual there's a slight concentration of fixes for issues added in the last two weeks before the MW, and driver bugs from 6.13 which tend to get discovered upon wider distribution. Including fixes from IPSec, netfilter and Bluetooth. Current release - regressions: - net: revert RTNL changes in unregister_netdevice_many_notify() - Bluetooth: fix possible infinite recursion of btusb_reset - eth: adjust locking in some old drivers which protect their state with spinlocks to avoid sleeping in atomic; core protects netdev state with a mutex now Previous releases - regressions: - eth: mlx5e: make sure we pass node ID, not CPU ID to kvzalloc_node() - eth: bgmac: reduce max frame size to support just 1500 bytes; the jumbo frame support would previously cause OOB writes, but now fails outright - mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted, avoid false detection of MPTCP blackholing Previous releases - always broken: - mptcp: handle fastopen disconnect correctly - xfrm: make sure skb->sk is a full sock before accessing its fields - xfrm: fix taking a lock with preempt disabled for RT kernels - usb: ipheth: improve safety of packet metadata parsing; prevent potential OOB accesses - eth: renesas: fix missing rtnl lock in suspend/resume path Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmebzXsACgkQMUZtbf5S IrvGBQ//auOF2yY1sg40fBvc6Hr1jpZBcr+uqTL6Qka1uVOvTFY51hAN54lBt32+ ixmcHsD0xdcHrr7VrqSXqurQLiGsdwUpnxZFCj/FymQuMunVysEqudvPeKDVHpsw JW5c4nJOexEA2viByK9iB23Qq0P3uBoPEnKrbSTVSDvYaXUj6y8Cvt3/vXc+H/tc T7GaxHH55NNNPkRz34YU3OWcaZsgkQEcdVpZf4tODPmg7J5VQj8SQeMhk/HI0sdO WKjWB0woZkiQECtamqAOXnv47PXd6igv8NALRPlJcKjs0EszUvuYhD/9MEOeghjI sjcQn9JnPpG+ca/qFVCSpEEOo2zGVn5dkJT5x26udH+5XHf7Pq+zpJwB6LHo98yF bGMpIrF6gi2EnBtS/tRjMyBU9Ut9KiUtjXMvn9EsD1U1FGIbz6wyQLlT0pG0hwJb rEdKfegrcyhWKHOD4vH9ciEg/7lgGfsGyfJDktIMdyailZc6tBcxwbdlc+5jRDA9 0RqGASIXaiX7AC3WOSeQzMgbV+WXdhaX/yrMJL5KBfBTzxG2audnJt1tPN3mbh3z NM6M2cnMsoX4QLSiaukJaCL7LWHSTlVttZVg8FGXHj1PejMQQBVjGVvzk2UF55UR gV7X9/VkXhmIDAZgThWtOdPLz+ItksfSKiruhUsXust6JgqRuqc= =GjBE -----END PGP SIGNATURE----- Merge tag 'net-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from IPSec, netfilter and Bluetooth. Nothing really stands out, but as usual there's a slight concentration of fixes for issues added in the last two weeks before the merge window, and driver bugs from 6.13 which tend to get discovered upon wider distribution. Current release - regressions: - net: revert RTNL changes in unregister_netdevice_many_notify() - Bluetooth: fix possible infinite recursion of btusb_reset - eth: adjust locking in some old drivers which protect their state with spinlocks to avoid sleeping in atomic; core protects netdev state with a mutex now Previous releases - regressions: - eth: - mlx5e: make sure we pass node ID, not CPU ID to kvzalloc_node() - bgmac: reduce max frame size to support just 1500 bytes; the jumbo frame support would previously cause OOB writes, but now fails outright - mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted, avoid false detection of MPTCP blackholing Previous releases - always broken: - mptcp: handle fastopen disconnect correctly - xfrm: - make sure skb->sk is a full sock before accessing its fields - fix taking a lock with preempt disabled for RT kernels - usb: ipheth: improve safety of packet metadata parsing; prevent potential OOB accesses - eth: renesas: fix missing rtnl lock in suspend/resume path" * tag 'net-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) MAINTAINERS: add Neal to TCP maintainers net: revert RTNL changes in unregister_netdevice_many_notify() net: hsr: fix fill_frame_info() regression vs VLAN packets doc: mptcp: sysctl: blackhole_timeout is per-netns mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted netfilter: nf_tables: reject mismatching sum of field_len with set key length net: sh_eth: Fix missing rtnl lock in suspend/resume path net: ravb: Fix missing rtnl lock in suspend/resume path selftests/net: Add test for loading devbound XDP program in generic mode net: xdp: Disallow attaching device-bound programs in generic mode tcp: correct handling of extreme memory squeeze bgmac: reduce max frame size to support just MTU 1500 vsock/test: Add test for connect() retries vsock/test: Add test for UAF due to socket unbinding vsock/test: Introduce vsock_connect_fd() vsock/test: Introduce vsock_bind() vsock: Allow retrying on connect() failure vsock: Keep the binding until socket destruction Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection Bluetooth: btnxpuart: Fix glitches seen in dual A2DP streaming ... |
||
![]() |
dfffaccffc |
netfilter pull request 25-01-30
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmebYkAACgkQ1w0aZmrP KyH3kBAAvj0QSXPd3iwuFA0E9gmlZKWMrI/TmmSjI9nthmPAOXdTKOPjG5TmvESn bWie6PHVNQjtQ+E3fUoGBFZz5P8F6X1hKx4VM0r75fDeHPxEkGuA0dPyTYG/CEjZ 4WwnfZ1YmSO7rXPRqRQa+X50XgJD0PiVYhb5QPAelNbb+pEyE5URnqNkKWBC7xQC YizcZE5Zvm0/yBf+HegQL9eenwLHJbnC1F+d6LrOnDKOd3qXY41jpG0CJDnkseKv qQRO1GRdMuivqrd9jicqUFbmhnCK8pVMRVm1NtENJ3I44KAAJlbihKztV7QL2BRt H4DySZpaAm6CFw5Gut2OOSKNCw8c3bejBEeTaw6JWwA+gG/xqgAh2npyrY2Qsf+s pLw+pNtfUxOmRnd3HDhjy++AUPfBBNfLREjaUXOUAM8I475ijccMuhdxnhLK/Wg1 F7AST3YRhaRByXTD13OG9S1GwK8ejQ1EKbTV7MnbGa6XzA8jMh+Uhr7gcI7ke20q nDuxp9/N6iWHF4ncu9YHGyFiVTt7mQ4TyUgsxqJkH1rp3VhgKl3rpemY7G8EoYaW QQRO4/l39XK32E3fBQ12Z0iunUTtdthkWO2CiCSY1GvXeek13wcu0tunkJ6TZMSx P0IWoVEWv4dssluExToMQli66M4lL1WfF5VILGncIUlWWkMVoYM= =81dC -----END PGP SIGNATURE----- Merge tag 'nf-25-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains one Netfilter fix: 1) Reject mismatching sum of field_len with set key length which allows to create a set without inconsistent pipapo rule width and set key length. * tag 'nf-25-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: reject mismatching sum of field_len with set key length ==================== Link: https://patch.msgid.link/20250130113307.2327470-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
e759e1e4a4 |
net: revert RTNL changes in unregister_netdevice_many_notify()
This patch reverts following changes: |
||
![]() |
0f5697f1a3 |
net: hsr: fix fill_frame_info() regression vs VLAN packets
Stephan Wurm reported that my recent patch broke VLAN support.
Apparently skb->mac_len is not correct for VLAN traffic as
shown by debug traces [1].
Use instead pskb_may_pull() to make sure the expected header
is present in skb->head.
Many thanks to Stephan for his help.
[1]
kernel: skb len=170 headroom=2 headlen=170 tailroom=20
mac=(2,14) mac_len=14 net=(16,-1) trans=-1
shinfo(txflags=0 nr_frags=0 gso(size=0 type=0 segs=0))
csum(0x0 start=0 offset=0 ip_summed=0 complete_sw=0 valid=0 level=0)
hash(0x0 sw=0 l4=0) proto=0x0000 pkttype=0 iif=0
priority=0x0 mark=0x0 alloc_cpu=0 vlan_all=0x0
encapsulation=0 inner(proto=0x0000, mac=0, net=0, trans=0)
kernel: dev name=prp0 feat=0x0000000000007000
kernel: sk family=17 type=3 proto=0
kernel: skb headroom: 00000000: 74 00
kernel: skb linear: 00000000: 01 0c cd 01 00 01 00 d0 93 53 9c cb 81 00 80 00
kernel: skb linear: 00000010: 88 b8 00 01 00 98 00 00 00 00 61 81 8d 80 16 52
kernel: skb linear: 00000020: 45 47 44 4e 43 54 52 4c 2f 4c 4c 4e 30 24 47 4f
kernel: skb linear: 00000030: 24 47 6f 43 62 81 01 14 82 16 52 45 47 44 4e 43
kernel: skb linear: 00000040: 54 52 4c 2f 4c 4c 4e 30 24 44 73 47 6f 6f 73 65
kernel: skb linear: 00000050: 83 07 47 6f 49 64 65 6e 74 84 08 67 8d f5 93 7e
kernel: skb linear: 00000060: 76 c8 00 85 01 01 86 01 00 87 01 00 88 01 01 89
kernel: skb linear: 00000070: 01 00 8a 01 02 ab 33 a2 15 83 01 00 84 03 03 00
kernel: skb linear: 00000080: 00 91 08 67 8d f5 92 77 4b c6 1f 83 01 00 a2 1a
kernel: skb linear: 00000090: a2 06 85 01 00 83 01 00 84 03 03 00 00 91 08 67
kernel: skb linear: 000000a0: 8d f5 92 77 4b c6 1f 83 01 00
kernel: skb tailroom: 00000000: 80 18 02 00 fe 4e 00 00 01 01 08 0a 4f fd 5e d1
kernel: skb tailroom: 00000010: 4f fd 5e cd
Fixes:
|
||
![]() |
e598d8981f |
mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted
The Fixes commit mentioned this:
> An MPTCP firewall blackhole can be detected if the following SYN
> retransmission after a fallback to "plain" TCP is accepted.
But in fact, this blackhole was detected if any following SYN
retransmissions after a fallback to TCP was accepted.
That's because 'mptcp_subflow_early_fallback()' will set 'request_mptcp'
to 0, and 'mpc_drop' will never be reset to 0 after.
This is an issue, because some not so unusual situations might cause the
kernel to detect a false-positive blackhole, e.g. a client trying to
connect to a server while the network is not ready yet, causing a few
SYN retransmissions, before reaching the end server.
Fixes:
|
||
![]() |
1b9335a800 |
netfilter: nf_tables: reject mismatching sum of field_len with set key length
The field length description provides the length of each separated key
field in the concatenation, each field gets rounded up to 32-bits to
calculate the pipapo rule width from pipapo_init(). The set key length
provides the total size of the key aligned to 32-bits.
Register-based arithmetics still allows for combining mismatching set
key length and field length description, eg. set key length 10 and field
description [ 5, 4 ] leading to pipapo width of 12.
Cc: stable@vger.kernel.org
Fixes:
|
||
![]() |
da5ca229b6 |
bluetooth pull request for net:
- btusb: mediatek: Add locks for usb_driver_claim_interface() - L2CAP: accept zero as a special value for MTU auto-selection - btusb: Fix possible infinite recursion of btusb_reset - Add ABI doc for sysfs reset - btnxpuart: Fix glitches seen in dual A2DP streaming -----BEGIN PGP SIGNATURE----- iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmealqcZHGx1aXoudm9u LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKUn6D/wMzyaazNsfa+/WwzqlyDPZ 17mQWJUrRY7AhH2/vFA7JrfWeVgmihg90qaSocb8LnGnlUvEiX53xhRNspBt//lt 26ZHpDDNq+pKftG3zbR5U76BhpjvjRgzMJO9U7YJAFlNvNiL5pL+UTufIhFU60mR x53e3ixNAl+62vwT2Re1S2NueMT8a2c8VPnbNI2W2xw59Am+WqQKWBJrthBkK9G6 nt5frOkFkva8J+zI4Rxu7nEzSCdiSXrO1OGae41v+orNzIl9ffyxW69XEuGsleMV NJ8/ZJZrhWh/lEWqXlZu20+c119pmwXQPM5ONZKv4rSrFtuc1bEPNp6foRoHwfsh qbtjPg323IJeR4TdIQUdo2VGM8w08GLEP5azYHjJ9MYvOdtjAUZ6YO5nhXmzE6Vt cnF/OYMnggx202TnUzCT2HUghA3F4SNDRR6NT82xf+vnzbnF5W8ZIBUc6vbtqIxu dGlkGtWF8waodNvh7F1+2X0El3joi6HOzw07+HLmnpa78Mn5Di3KUknotZ/gpYtn M/FPEqJvmbMom1XlMYg5g5y+YvSdygllPCI5ewNR/LD/1S3jt6ES7+/pt9/ZAWrl 1WnCN1gJ6aaS7UQNteQNjB90iXxsTkPXqJvRhcr+WZauxqX7msFKuxmjhK++fplj o/79A7oQ+XZTNw78zcrH4A== =q2T2 -----END PGP SIGNATURE----- Merge tag 'for-net-2025-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - btusb: mediatek: Add locks for usb_driver_claim_interface() - L2CAP: accept zero as a special value for MTU auto-selection - btusb: Fix possible infinite recursion of btusb_reset - Add ABI doc for sysfs reset - btnxpuart: Fix glitches seen in dual A2DP streaming * tag 'for-net-2025-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection Bluetooth: btnxpuart: Fix glitches seen in dual A2DP streaming Bluetooth: Add ABI doc for sysfs reset Bluetooth: Fix possible infinite recursion of btusb_reset Bluetooth: btusb: mediatek: Add locks for usb_driver_claim_interface() ==================== Link: https://patch.msgid.link/20250129210057.1318963-1-luiz.dentz@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
![]() |
3595599fa8 |
net: xdp: Disallow attaching device-bound programs in generic mode
Device-bound programs are used to support RX metadata kfuncs. These
kfuncs are driver-specific and rely on the driver context to read the
metadata. This means they can't work in generic XDP mode. However, there
is no check to disallow such programs from being attached in generic
mode, in which case the metadata kfuncs will be called in an invalid
context, leading to crashes.
Fix this by adding a check to disallow attaching device-bound programs
in generic mode.
Fixes:
|
||
![]() |
8c670bdfa5 |
tcp: correct handling of extreme memory squeeze
Testing with iperf3 using the "pasta" protocol splicer has revealed
a problem in the way tcp handles window advertising in extreme memory
squeeze situations.
Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.
However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
This means that this side's notion of the current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the sitution.
The problem occurs on the iperf3 server side, and the socket in question
is a completely regular socket with the default settings for the
fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
The following excerpt of a logging session, with own comments added,
shows more in detail what is happening:
// tcp_v4_rcv(->)
// tcp_rcv_established(->)
[5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]: tcp_data_queue(->)
[5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
[OFO queue: gap: 65480, len: 0]
[5201<->39222]: tcp_data_queue(<-)
[5201<->39222]: __tcp_transmit_skb(->)
[tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]: tcp_select_window(->)
[5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
[tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
returning 0
[5201<->39222]: tcp_select_window(<-)
[5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]: [__tcp_transmit_skb(<-)
[5201<->39222]: tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)
// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]: __tcp_cleanup_rbuf(<-)
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
returning 6104 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)
// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]: __tcp_cleanup_rbuf(<-)
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
returning 131072 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)
// The above pattern repeats again and again, since nothing changes
// between the reads.
[...]
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]: __tcp_cleanup_rbuf(<-)
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
returning 54672 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)
// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.
Note that well behaved endpoints will send win0 probes, so the problem
will not occur.
Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.
We fix this by updating the socket state correctly when a packet has
been dropped because of memory exhaustion and we have to advertize
a zero window.
Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.
Fixes:
|
||
![]() |
aa388c7211 |
vsock: Allow retrying on connect() failure
sk_err is set when a (connectible) connect() fails. Effectively, this makes
an otherwise still healthy SS_UNCONNECTED socket impossible to use for any
subsequent connection attempts.
Clear sk_err upon trying to establish a connection.
Fixes:
|
||
![]() |
fcdd2242c0 |
vsock: Keep the binding until socket destruction
Preserve sockets bindings; this includes both resulting from an explicit
bind() and those implicitly bound through autobind during connect().
Prevents socket unbinding during a transport reassignment, which fixes a
use-after-free:
1. vsock_create() (refcnt=1) calls vsock_insert_unbound() (refcnt=2)
2. transport->release() calls vsock_remove_bound() without checking if
sk was bound and moved to bound list (refcnt=1)
3. vsock_bind() assumes sk is in unbound list and before
__vsock_insert_bound(vsock_bound_sockets()) calls
__vsock_remove_bound() which does:
list_del_init(&vsk->bound_table); // nop
sock_put(&vsk->sk); // refcnt=0
BUG: KASAN: slab-use-after-free in __vsock_bind+0x62e/0x730
Read of size 4 at addr ffff88816b46a74c by task a.out/2057
dump_stack_lvl+0x68/0x90
print_report+0x174/0x4f6
kasan_report+0xb9/0x190
__vsock_bind+0x62e/0x730
vsock_bind+0x97/0xe0
__sys_bind+0x154/0x1f0
__x64_sys_bind+0x6e/0xb0
do_syscall_64+0x93/0x1b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Allocated by task 2057:
kasan_save_stack+0x1e/0x40
kasan_save_track+0x10/0x30
__kasan_slab_alloc+0x85/0x90
kmem_cache_alloc_noprof+0x131/0x450
sk_prot_alloc+0x5b/0x220
sk_alloc+0x2c/0x870
__vsock_create.constprop.0+0x2e/0xb60
vsock_create+0xe4/0x420
__sock_create+0x241/0x650
__sys_socket+0xf2/0x1a0
__x64_sys_socket+0x6e/0xb0
do_syscall_64+0x93/0x1b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 2057:
kasan_save_stack+0x1e/0x40
kasan_save_track+0x10/0x30
kasan_save_free_info+0x37/0x60
__kasan_slab_free+0x4b/0x70
kmem_cache_free+0x1a1/0x590
__sk_destruct+0x388/0x5a0
__vsock_bind+0x5e1/0x730
vsock_bind+0x97/0xe0
__sys_bind+0x154/0x1f0
__x64_sys_bind+0x6e/0xb0
do_syscall_64+0x93/0x1b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 7 PID: 2057 at lib/refcount.c:25 refcount_warn_saturate+0xce/0x150
RIP: 0010:refcount_warn_saturate+0xce/0x150
__vsock_bind+0x66d/0x730
vsock_bind+0x97/0xe0
__sys_bind+0x154/0x1f0
__x64_sys_bind+0x6e/0xb0
do_syscall_64+0x93/0x1b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
refcount_t: underflow; use-after-free.
WARNING: CPU: 7 PID: 2057 at lib/refcount.c:28 refcount_warn_saturate+0xee/0x150
RIP: 0010:refcount_warn_saturate+0xee/0x150
vsock_remove_bound+0x187/0x1e0
__vsock_release+0x383/0x4a0
vsock_release+0x90/0x120
__sock_release+0xa3/0x250
sock_close+0x14/0x20
__fput+0x359/0xa80
task_work_run+0x107/0x1d0
do_exit+0x847/0x2560
do_group_exit+0xb8/0x250
__x64_sys_exit_group+0x3a/0x50
x64_sys_call+0xfec/0x14f0
do_syscall_64+0x93/0x1b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes:
|
||
![]() |
5459cce6bf |
bpf: Disable non stream socket for strparser
Currently, only TCP supports strparser, but sockmap doesn't intercept
non-TCP connections to attach strparser. For example, with UDP, although
the read/write handlers are replaced, strparser is not executed due to
the lack of a read_sock operation.
Furthermore, in udp_bpf_recvmsg(), it checks whether the psock has data,
and if not, it falls back to the native UDP read interface, making
UDP + strparser appear to read correctly. According to its commit history,
this behavior is unexpected.
Moreover, since UDP lacks the concept of streams, we intercept it directly.
Fixes:
|
||
![]() |
36b62df568 |
bpf: Fix wrong copied_seq calculation
'sk->copied_seq' was updated in the tcp_eat_skb() function when the action
of a BPF program was SK_REDIRECT. For other actions, like SK_PASS, the
update logic for 'sk->copied_seq' was moved to tcp_bpf_recvmsg_parser()
to ensure the accuracy of the 'fionread' feature.
It works for a single stream_verdict scenario, as it also modified
sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb
to remove updating 'sk->copied_seq'.
However, for programs where both stream_parser and stream_verdict are
active (strparser purpose), tcp_read_sock() was used instead of
tcp_read_skb() (sk_data_ready->strp_data_ready->tcp_read_sock).
tcp_read_sock() now still updates 'sk->copied_seq', leading to duplicate
updates.
In summary, for strparser + SK_PASS, copied_seq is redundantly calculated
in both tcp_read_sock() and tcp_bpf_recvmsg_parser().
The issue causes incorrect copied_seq calculations, which prevent
correct data reads from the recv() interface in user-land.
We do not want to add new proto_ops to implement a new version of
tcp_read_sock, as this would introduce code complexity [1].
We could have added noack and copied_seq to desc, and then called
ops->read_sock. However, unfortunately, other modules didn’t fully
initialize desc to zero. So, for now, we are directly calling
tcp_read_sock_noack() in tcp_bpf.c.
[1]: https://lore.kernel.org/bpf/20241218053408.437295-1-mrpre@163.com
Fixes:
|
||
![]() |
0532a79efd |
strparser: Add read_sock callback
Added a new read_sock handler, allowing users to customize read operations instead of relying on the native socket's read_sock. Signed-off-by: Jiayuan Chen <mrpre@163.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://patch.msgid.link/20250122100917.49845-2-mrpre@163.com |
||
![]() |
5c61419e02 |
Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection
One of the possible ways to enable the input MTU auto-selection for L2CAP connections is supposed to be through passing a special "0" value for it as a socket option. Commit [1] added one of those into avdtp. However, it simply wouldn't work because the kernel still treats the specified value as invalid and denies the setting attempt. Recorded BlueZ logs include the following: bluetoothd[496]: profiles/audio/avdtp.c:l2cap_connect() setsockopt(L2CAP_OPTIONS): Invalid argument (22) [1]: |
||
![]() |
6b3d638ca8 |
bpf, test_run: Fix use-after-free issue in eth_skb_pkt_type()
KMSAN reported a use-after-free issue in eth_skb_pkt_type()[1]. The
cause of the issue was that eth_skb_pkt_type() accessed skb's data
that didn't contain an Ethernet header. This occurs when
bpf_prog_test_run_xdp() passes an invalid value as the user_data
argument to bpf_test_init().
Fix this by returning an error when user_data is less than ETH_HLEN in
bpf_test_init(). Additionally, remove the check for "if (user_size >
size)" as it is unnecessary.
[1]
BUG: KMSAN: use-after-free in eth_skb_pkt_type include/linux/etherdevice.h:627 [inline]
BUG: KMSAN: use-after-free in eth_type_trans+0x4ee/0x980 net/ethernet/eth.c:165
eth_skb_pkt_type include/linux/etherdevice.h:627 [inline]
eth_type_trans+0x4ee/0x980 net/ethernet/eth.c:165
__xdp_build_skb_from_frame+0x5a8/0xa50 net/core/xdp.c:635
xdp_recv_frames net/bpf/test_run.c:272 [inline]
xdp_test_run_batch net/bpf/test_run.c:361 [inline]
bpf_test_run_xdp_live+0x2954/0x3330 net/bpf/test_run.c:390
bpf_prog_test_run_xdp+0x148e/0x1b10 net/bpf/test_run.c:1318
bpf_prog_test_run+0x5b7/0xa30 kernel/bpf/syscall.c:4371
__sys_bpf+0x6a6/0xe20 kernel/bpf/syscall.c:5777
__do_sys_bpf kernel/bpf/syscall.c:5866 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5864 [inline]
__x64_sys_bpf+0xa4/0xf0 kernel/bpf/syscall.c:5864
x64_sys_call+0x2ea0/0x3d90 arch/x86/include/generated/asm/syscalls_64.h:322
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xd9/0x1d0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Uninit was created at:
free_pages_prepare mm/page_alloc.c:1056 [inline]
free_unref_page+0x156/0x1320 mm/page_alloc.c:2657
__free_pages+0xa3/0x1b0 mm/page_alloc.c:4838
bpf_ringbuf_free kernel/bpf/ringbuf.c:226 [inline]
ringbuf_map_free+0xff/0x1e0 kernel/bpf/ringbuf.c:235
bpf_map_free kernel/bpf/syscall.c:838 [inline]
bpf_map_free_deferred+0x17c/0x310 kernel/bpf/syscall.c:862
process_one_work kernel/workqueue.c:3229 [inline]
process_scheduled_works+0xa2b/0x1b60 kernel/workqueue.c:3310
worker_thread+0xedf/0x1550 kernel/workqueue.c:3391
kthread+0x535/0x6b0 kernel/kthread.c:389
ret_from_fork+0x6e/0x90 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
CPU: 1 UID: 0 PID: 17276 Comm: syz.1.16450 Not tainted 6.12.0-05490-g9bb88c659673 #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
Fixes:
|
||
![]() |
7332537962 |
bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed
When loading BPF programs, bpf_sk_storage_tracing_allowed() does a
series of lookups to get a type name from the program's attach_btf_id,
making the assumption that the type is present in the vmlinux BTF along
the way. However, this results in btf_type_by_id() returning a null
pointer if a non-vmlinux kernel BTF is attached to. Proof-of-concept on
a kernel with CONFIG_IPV6=m:
$ cat bpfcrash.c
#include <unistd.h>
#include <linux/bpf.h>
#include <sys/syscall.h>
static int bpf(enum bpf_cmd cmd, union bpf_attr *attr)
{
return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}
int main(void)
{
const int btf_fd = bpf(BPF_BTF_GET_FD_BY_ID, &(union bpf_attr) {
.btf_id = BTF_ID,
});
if (btf_fd < 0)
return 1;
const int bpf_sk_storage_get = 107;
const struct bpf_insn insns[] = {
{ .code = BPF_JMP | BPF_CALL, .imm = bpf_sk_storage_get},
{ .code = BPF_JMP | BPF_EXIT },
};
return bpf(BPF_PROG_LOAD, &(union bpf_attr) {
.prog_type = BPF_PROG_TYPE_TRACING,
.expected_attach_type = BPF_TRACE_FENTRY,
.license = (unsigned long)"GPL",
.insns = (unsigned long)&insns,
.insn_cnt = sizeof(insns) / sizeof(insns[0]),
.attach_btf_obj_fd = btf_fd,
.attach_btf_id = TYPE_ID,
});
}
$ sudo bpftool btf list | grep ipv6
2: name [ipv6] size 928200B
$ sudo bpftool btf dump id 2 | awk '$3 ~ /inet6_sock_destruct/'
[130689] FUNC 'inet6_sock_destruct' type_id=130677 linkage=static
$ gcc -D_DEFAULT_SOURCE -DBTF_ID=2 -DTYPE_ID=130689 \
bpfcrash.c -o bpfcrash
$ sudo ./bpfcrash
This causes a null pointer dereference:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Call trace:
bpf_sk_storage_tracing_allowed+0x8c/0xb0 P
check_helper_call.isra.0+0xa8/0x1730
do_check+0xa18/0xb40
do_check_common+0x140/0x640
bpf_check+0xb74/0xcb8
bpf_prog_load+0x598/0x9a8
__sys_bpf+0x580/0x980
__arm64_sys_bpf+0x28/0x40
invoke_syscall.constprop.0+0x54/0xe8
do_el0_svc+0xb4/0xd0
el0_svc+0x44/0x1f8
el0t_64_sync_handler+0x13c/0x160
el0t_64_sync+0x184/0x188
Resolve this by using prog->aux->attach_func_name and removing the
lookups.
Fixes:
|
||
![]() |
b88fe2b5dd |
NFS Client Updates for Linux 6.14
New Features: * Enable using direct IO with localio * Added localio related tracepoints Bugfixes: * Sunrpc fixes for working with a very large cl_tasks list * Fix a possible buffer overflow in nfs_sysfs_link_rpc_client() * Fixes for handling reconnections with localio * Fix how the NFS_FSCACHE kconfig option interacts with NETFS_SUPPORT * Fix COPY_NOTIFY xdr_buf size calculations * pNFS/Flexfiles fix for retrying requesting a layout segment for reads * Sunrpc fix for retrying on EKEYEXPIRED error when the TGT is expired Cleanups: * Various other nfs & nfsd localio cleanups * Prepratory patches for async copy improvements that are under development * Make OFFLOAD_CANCEL, LAYOUTSTATS, and LAYOUTERR moveable to other xprts * Add netns inum and srcaddr to debugfs rpc_xprt info -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmeZUzUACgkQ18tUv7Cl QOvArw/9HltIlcJHbi7tApGJ4dFpuJCa/fHbA1n5bHvKrCR5aElmFZoiFDdsM1JX kFAlMED9n1dW9VmzLJcepxmrLo/t7KXueiZNharHynWTxcszSl6jS+tOFBW6OflG Rrrjq/SrsWI2Fu8X4e/7ZV7pqRLGGn5SSMwgbuMbcyzBvVgN8mZM/BneIp1J59AI 5NOsif5KWetVhQc43zlRlbVWR5cvNGcUK4i58LIaPFzPMt0xq/XJI+QWffj6kv4g cHabCNYTdQYMkhiPQC+LLYkw6sMbw2NatajTTYNMWfR/I+7wz9k5ej6CHKPIFCSr xjmscypySTLfMFQjrDFZkpX2CwSp/VIbV6go36DJwAlcCRzqz+I7cajlrRK4zvyr DyrcaZHvClEczP9QqdPj2wqRXbmIOsDMksOu4ACTUImd4o3f2v1K6DcwRj9oUIhV AGR31OEMt2A+RaVvVZYR4PpixJ01vH9LcmsaOu5KkHX8X4q2osQ7eMy+FV4kV09S pMnxDMAyszJU8IuzUG1/HfkonNlDMivIbqpgG4ZaVW08Nq4mCxJll1vTAa9FTLz2 z+9eocqKwf724q1RAgOB7vj4AwOwL4Ul6d18UBtyUitZz3ndLRZ8Yy6r/AhrpCsC 3co0Y3znZbKeRjmReNl0GLG4qiKE+E7Xh23Lf3IqXg8GE2Mu+Ls= =srvH -----END PGP SIGNATURE----- Merge tag 'nfs-for-6.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client updates from Anna Schumaker: "New Features: - Enable using direct IO with localio - Added localio related tracepoints Bugfixes: - Sunrpc fixes for working with a very large cl_tasks list - Fix a possible buffer overflow in nfs_sysfs_link_rpc_client() - Fixes for handling reconnections with localio - Fix how the NFS_FSCACHE kconfig option interacts with NETFS_SUPPORT - Fix COPY_NOTIFY xdr_buf size calculations - pNFS/Flexfiles fix for retrying requesting a layout segment for reads - Sunrpc fix for retrying on EKEYEXPIRED error when the TGT is expired Cleanups: - Various other nfs & nfsd localio cleanups - Prepratory patches for async copy improvements that are under development - Make OFFLOAD_CANCEL, LAYOUTSTATS, and LAYOUTERR moveable to other xprts - Add netns inum and srcaddr to debugfs rpc_xprt info" * tag 'nfs-for-6.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (28 commits) SUNRPC: do not retry on EKEYEXPIRED when user TGT ticket expired sunrpc: add netns inum and srcaddr to debugfs rpc_xprt info pnfs/flexfiles: retry getting layout segment for reads NFSv4.2: make LAYOUTSTATS and LAYOUTERROR MOVEABLE NFSv4.2: mark OFFLOAD_CANCEL MOVEABLE NFSv4.2: fix COPY_NOTIFY xdr buf size calculation NFS: Rename struct nfs4_offloadcancel_data NFS: Fix typo in OFFLOAD_CANCEL comment NFS: CB_OFFLOAD can return NFS4ERR_DELAY nfs: Make NFS_FSCACHE select NETFS_SUPPORT instead of depending on it nfs: fix incorrect error handling in LOCALIO nfs: probe for LOCALIO when v3 client reconnects to server nfs: probe for LOCALIO when v4 client reconnects to server nfs/localio: remove redundant code and simplify LOCALIO enablement nfs_common: add nfs_localio trace events nfs_common: track all open nfsd_files per LOCALIO nfs_client nfs_common: rename nfslocalio nfs_uuid_lock to nfs_uuids_lock nfsd: nfsd_file_acquire_local no longer returns GC'd nfsd_file nfsd: rename nfsd_serv_ prefixed methods and variables with nfsd_net_ nfsd: update percpu_ref to manage references on nfsd_net ... |
||
![]() |
f4c9c2cc82 |
batman-adv: Fix incorrect offset in batadv_tt_tvlv_ogm_handler_v1()
Since commit |
||
![]() |
2ab002c755 |
Driver core and debugfs updates
Here is the big set of driver core and debugfs updates for 6.14-rc1. It's coming late in the merge cycle as there are a number of merge conflicts with your tree now, and I wanted to make sure they were working properly. To resolve them, look in linux-next, and I will send the "fixup" patch as a response to the pull request. Included in here is a bunch of driver core, PCI, OF, and platform rust bindings (all acked by the different subsystem maintainers), hence the merge conflict with the rust tree, and some driver core api updates to mark things as const, which will also require some fixups due to new stuff coming in through other trees in this merge window. There are also a bunch of debugfs updates from Al, and there is at least one user that does have a regression with these, but Al is working on tracking down the fix for it. In my use (and everyone else's linux-next use), it does not seem like a big issue at the moment. Here's a short list of the things in here: - driver core bindings for PCI, platform, OF, and some i/o functions. We are almost at the "write a real driver in rust" stage now, depending on what you want to do. - misc device rust bindings and a sample driver to show how to use them - debugfs cleanups in the fs as well as the users of the fs api for places where drivers got it wrong or were unnecessarily doing things in complex ways. - driver core const work, making more of the api take const * for different parameters to make the rust bindings easier overall. - other small fixes and updates All of these have been in linux-next with all of the aforementioned merge conflicts, and the one debugfs issue, which looks to be resolved "soon". Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ5koPA8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ymFHACfT5acDKf2Bov2Lc/5u3vBW/R6ChsAnj+LmgVI hcDSPodj4szR40RRnzBd =u5Ey -----END PGP SIGNATURE----- Merge tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core and debugfs updates from Greg KH: "Here is the big set of driver core and debugfs updates for 6.14-rc1. Included in here is a bunch of driver core, PCI, OF, and platform rust bindings (all acked by the different subsystem maintainers), hence the merge conflict with the rust tree, and some driver core api updates to mark things as const, which will also require some fixups due to new stuff coming in through other trees in this merge window. There are also a bunch of debugfs updates from Al, and there is at least one user that does have a regression with these, but Al is working on tracking down the fix for it. In my use (and everyone else's linux-next use), it does not seem like a big issue at the moment. Here's a short list of the things in here: - driver core rust bindings for PCI, platform, OF, and some i/o functions. We are almost at the "write a real driver in rust" stage now, depending on what you want to do. - misc device rust bindings and a sample driver to show how to use them - debugfs cleanups in the fs as well as the users of the fs api for places where drivers got it wrong or were unnecessarily doing things in complex ways. - driver core const work, making more of the api take const * for different parameters to make the rust bindings easier overall. - other small fixes and updates All of these have been in linux-next with all of the aforementioned merge conflicts, and the one debugfs issue, which looks to be resolved "soon"" * tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits) rust: device: Use as_char_ptr() to avoid explicit cast rust: device: Replace CString with CStr in property_present() devcoredump: Constify 'struct bin_attribute' devcoredump: Define 'struct bin_attribute' through macro rust: device: Add property_present() saner replacement for debugfs_rename() orangefs-debugfs: don't mess with ->d_name octeontx2: don't mess with ->d_parent or ->d_parent->d_name arm_scmi: don't mess with ->d_parent->d_name slub: don't mess with ->d_name sof-client-ipc-flood-test: don't mess with ->d_name qat: don't mess with ->d_name xhci: don't mess with ->d_iname mtu3: don't mess wiht ->d_iname greybus/camera - stop messing with ->d_iname mediatek: stop messing with ->d_iname netdevsim: don't embed file_operations into your structs b43legacy: make use of debugfs_get_aux() b43: stop embedding struct file_operations into their objects carl9170: stop embedding file_operations into their objects ... |
||
![]() |
4f5a52adeb |
ethtool: Fix set RXNFC command with symmetric RSS hash
The sanity check that both source and destination are set when symmetric
RSS hash is requested is only relevant for ETHTOOL_SRXFH (rx-flow-hash),
it should not be performed on any other commands (e.g.
ETHTOOL_SRXCLSRLINS/ETHTOOL_SRXCLSRLDEL).
This resolves accessing uninitialized 'info.data' field, and fixes false
errors in rule insertion:
# ethtool --config-ntuple eth2 flow-type ip4 dst-ip 255.255.255.255 action -1 loc 0
rmgr: Cannot insert RX class rule: Invalid argument
Cannot insert classification rule
Fixes:
|
||
![]() |
f34b580514 |
NFSD 6.14 Release Notes
Jeff Layton contributed an implementation of NFSv4.2+ attribute delegation, as described here: https://www.ietf.org/archive/id/draft-ietf-nfsv4-delstid-08.html This interoperates with similar functionality introduced into the Linux NFS client in v6.11. An attribute delegation permits an NFS client to manage a file's mtime, rather than flushing dirty data to the NFS server so that the file's mtime reflects the last write, which is considerably slower. Neil Brown contributed dynamic NFSv4.1 session slot table resizing. This facility enables NFSD to increase or decrease the number of slots per NFS session depending on server memory availability. More session slots means greater parallelism. Chuck Lever fixed a long-standing latent bug where NFSv4 COMPOUND encoding screws up when crossing a page boundary in the encoding buffer. This is a zero-day bug, but hitting it is rare and depends on the NFS client implementation. The Linux NFS client does not happen to trigger this issue. A variety of bug fixes and other incremental improvements fill out the list of commits in this release. Great thanks to all contributors, reviewers, testers, and bug reporters who participated during this development cycle. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmeVPBUACgkQM2qzM29m f5dClxAAmW4O2bOJaR8neJ54fzeFYXtFYEXRF/XbIh2KdnCy2LoywT9ux8ndzE0k 1tsjtv0g4y84IcYfrxXPRTuhh2GO2pHw5L1kGVRezJg2ODSFbpdzGtcVK7SIrs2S eijTViqdi9xRgfj1jPqRrvxC99RL3/fmCztqorAPLDFsqYAd/6ZRxZ7+IcZ2h4+J cJ0Z6Wx6eh10roacZPXweH13XJ7xWO/ublYZvQFQpK2BAKyO98aXGgLDraNt8k60 X3DZLSKkGB/eBlNeAlTtcrXec/ot6XGJPKr3b/7zhwfMi8B13RGdSmCR8SxMdRQM vCQO4G2YadU4YFS6FFIw9Wc1XDYUuYh2YgcveafjzjXbgi7NnY7rYOxtnTgBi0Xv XGjtGqpvD676gPm+8b3DcwqmWI3c/WUdtQIZ1uYRCFZdFqkVP91bySPT2aHtx2GO 4j3uEyTlypC00kyvu1oL3+tVUG/EFlJCpYvIbOwqDG2m7KWPStzpfJTD5Q9cdlEl fdgs6l82EVqe1YyjLTqajDuOcRrLYK2hlR/5STc03hQV+GpKSo5UypRejzE9WtRV zT/tyelqhj4+0EZJz4ay/8q9s2Jp+5JGVxoVvjujSuH7+Ulb3T+IDkldtMamO8Fm www2y0/fLfU2xIapMJdCoJ+ZKgel2i8RZMPIc0cIfO5ITXm+dOs= =hwXx -----END PGP SIGNATURE----- Merge tag 'nfsd-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd updates from Chuck Lever: "Jeff Layton contributed an implementation of NFSv4.2+ attribute delegation, as described here: https://www.ietf.org/archive/id/draft-ietf-nfsv4-delstid-08.html This interoperates with similar functionality introduced into the Linux NFS client in v6.11. An attribute delegation permits an NFS client to manage a file's mtime, rather than flushing dirty data to the NFS server so that the file's mtime reflects the last write, which is considerably slower. Neil Brown contributed dynamic NFSv4.1 session slot table resizing. This facility enables NFSD to increase or decrease the number of slots per NFS session depending on server memory availability. More session slots means greater parallelism. Chuck Lever fixed a long-standing latent bug where NFSv4 COMPOUND encoding screws up when crossing a page boundary in the encoding buffer. This is a zero-day bug, but hitting it is rare and depends on the NFS client implementation. The Linux NFS client does not happen to trigger this issue. A variety of bug fixes and other incremental improvements fill out the list of commits in this release. Great thanks to all contributors, reviewers, testers, and bug reporters who participated during this development cycle" * tag 'nfsd-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (42 commits) sunrpc: Remove gss_{de,en}crypt_xdr_buf deadcode sunrpc: Remove gss_generic_token deadcode sunrpc: Remove unused xprt_iter_get_xprt Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION nfsd: handle delegated timestamps in SETATTR nfsd: add support for delegated timestamps nfsd: rework NFS4_SHARE_WANT_* flag handling nfsd: add support for FATTR4_OPEN_ARGUMENTS nfsd: prepare delegation code for handing out *_ATTRS_DELEG delegations nfsd: rename NFS4_SHARE_WANT_* constants to OPEN4_SHARE_ACCESS_WANT_* nfsd: switch to autogenerated definitions for open_delegation_type4 nfs_common: make include/linux/nfs4.h include generated nfs4_1.h nfsd: fix handling of delegated change attr in CB_GETATTR SUNRPC: Document validity guarantees of the pointer returned by reserve_space NFSD: Insulate nfsd4_encode_fattr4() from page boundaries in the encode buffer NFSD: Insulate nfsd4_encode_secinfo() from page boundaries in the encode buffer NFSD: Refactor nfsd4_do_encode_secinfo() again NFSD: Insulate nfsd4_encode_readlink() from page boundaries in the encode buffer NFSD: Insulate nfsd4_encode_read_plus_data() from page boundaries in the encode buffer ... |
||
![]() |
c1feab95e0 |
add a string-to-qstr constructor
Quite a few places want to build a struct qstr by given string; it would be convenient to have a primitive doing that, rather than open-coding it via QSTR_INIT(). The closest approximation was in bcachefs, but that expands to initializer list - {.len = strlen(string), .name = string}. It would be more useful to have it as compound literal - (struct qstr){.len = strlen(string), .name = string}. Unlike initializer list it's a valid expression. What's more, it's a valid lvalue - it's an equivalent of anonymous local variable with such initializer, so the things like path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name)); are valid. It can also be used as initializer, with identical effect - struct qstr x = (struct qstr){.name = s, .len = strlen(s)}; is equivalent to struct qstr anon_variable = {.name = s, .len = strlen(s)}; struct qstr x = anon_variable; // anon_variable is never used after that point and any even remotely sane compiler will manage to collapse that into struct qstr x = {.name = s, .len = strlen(s)}; What compound literals can't be used for is initialization of global variables, but those are covered by QSTR_INIT(). This commit lifts definition(s) of QSTR() into linux/dcache.h, converts it to compound literal (all bcachefs users are fine with that) and converts assorted open-coded instances to using that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
![]() |
463ec95a16 |
ipsec-2025-01-27
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmeXIF4ACgkQrB3Eaf9P W7fRMA/+Js2x0HNA3+6SMb5nJzY6lywi1BIRAzstyfd6EsxbHlgfdWYCCpixboA0 /ZfDe7yPND/ewPIQLT9eO6hk9YzuAVhYUkdIcDC5jdFDNbh9dDqBdyu5P/5spsi9 9SdFEucoOsKBP4ejmSvtwGsVNIf/1vB8hFqYxB+vh8+d/g8PHrI3xxk+2b7KkIGS ms+IyDCoVdCGQUOp4BGtEQbzXtx67diH5dcfwg8/DJpSMbfqO3ZFRG7gPu8C5Igt cxVSCW67rv/zzPkGPv8B+nczAdVUZ3OFXgEWxdDCN/mUbFKwxUcIDxZVJMfBBAUP lcjsbzmNfj2PNMLZFe/5LuU6o+sFEZdxmTPmvbb+lSYrRHx2oz2/Jb871gEj8rTC vNZ+1Lu1k7QRjEPiO1fe85vWdmU4G81+WAzC88nD0KYLDUN4c+MmxUFQkKbAxf6p e6VCihcKqi5Sa6R73Ohm87iyiSuv8WvkyVSM0XgQrkXWDFy5Jp2Bo25pW0QgVxK+ l/aHhDA+YHFEOZTcjZsh/EdKlQRIxBNJ3ualITkjd2T+A1WyWm0A3S+kYZQCKqiM WGGWM3oVNXkUAaRxvURNvmXqO+hPeKfIElDeVrOUjG8zQ+EktKcg4KpDQb2BGJCj s9ksFj0pplR4GHxUrFmkEPxJWYKpFqUYCZMJDnBnHFm1ykC7QGM= =pg+h -----END PGP SIGNATURE----- Merge tag 'ipsec-2025-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2025-01-27 1) Fix incrementing the upper 32 bit sequence numbers for GSO skbs. From Jianbo Liu. 2) Fix an out-of-bounds read on xfrm state lookup. From Florian Westphal. 3) Fix secpath handling on packet offload mode. From Alexandre Cassen. 4) Fix the usage of skb->sk in the xfrm layer. 5) Don't disable preemption while looking up cache state to fix PREEMPT_RT. From Sebastian Sewior. * tag 'ipsec-2025-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Don't disable preemption while looking up cache state. xfrm: Fix the usage of skb->sk xfrm: delete intermediate secpath entry in packet offload mode xfrm: state: fix out-of-bounds read during lookup xfrm: replay: Fix the update of replay_esn->oseq_hi for GSO ==================== Link: https://patch.msgid.link/20250127060757.3946314-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
619af16b3b |
mptcp: handle fastopen disconnect correctly
Syzbot was able to trigger a data stream corruption:
WARNING: CPU: 0 PID: 9846 at net/mptcp/protocol.c:1024 __mptcp_clean_una+0xddb/0xff0 net/mptcp/protocol.c:1024
Modules linked in:
CPU: 0 UID: 0 PID: 9846 Comm: syz-executor351 Not tainted 6.13.0-rc2-syzkaller-00059-g00a5acdbf398 #0
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 11/25/2024
RIP: 0010:__mptcp_clean_una+0xddb/0xff0 net/mptcp/protocol.c:1024
Code: fa ff ff 48 8b 4c 24 18 80 e1 07 fe c1 38 c1 0f 8c 8e fa ff ff 48 8b 7c 24 18 e8 e0 db 54 f6 e9 7f fa ff ff e8 e6 80 ee f5 90 <0f> 0b 90 4c 8b 6c 24 40 4d 89 f4 e9 04 f5 ff ff 44 89 f1 80 e1 07
RSP: 0018:ffffc9000c0cf400 EFLAGS: 00010293
RAX: ffffffff8bb0dd5a RBX: ffff888033f5d230 RCX: ffff888059ce8000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc9000c0cf518 R08: ffffffff8bb0d1dd R09: 1ffff110170c8928
R10: dffffc0000000000 R11: ffffed10170c8929 R12: 0000000000000000
R13: ffff888033f5d220 R14: dffffc0000000000 R15: ffff8880592b8000
FS: 00007f6e866496c0(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6e86f491a0 CR3: 00000000310e6000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__mptcp_clean_una_wakeup+0x7f/0x2d0 net/mptcp/protocol.c:1074
mptcp_release_cb+0x7cb/0xb30 net/mptcp/protocol.c:3493
release_sock+0x1aa/0x1f0 net/core/sock.c:3640
inet_wait_for_connect net/ipv4/af_inet.c:609 [inline]
__inet_stream_connect+0x8bd/0xf30 net/ipv4/af_inet.c:703
mptcp_sendmsg_fastopen+0x2a2/0x530 net/mptcp/protocol.c:1755
mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1830
sock_sendmsg_nosec net/socket.c:711 [inline]
__sock_sendmsg+0x1a6/0x270 net/socket.c:726
____sys_sendmsg+0x52a/0x7e0 net/socket.c:2583
___sys_sendmsg net/socket.c:2637 [inline]
__sys_sendmsg+0x269/0x350 net/socket.c:2669
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f6e86ebfe69
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 1f 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6e86649168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f6e86f491b8 RCX: 00007f6e86ebfe69
RDX: 0000000030004001 RSI: 0000000020000080 RDI: 0000000000000003
RBP: 00007f6e86f491b0 R08: 00007f6e866496c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6e86f491bc
R13: 000000000000006e R14: 00007ffe445d9420 R15: 00007ffe445d9508
</TASK>
The root cause is the bad handling of disconnect() generated internally
by the MPTCP protocol in case of connect FASTOPEN errors.
Address the issue increasing the socket disconnect counter even on such
a case, to allow other threads waiting on the same socket lock to
properly error out.
Fixes:
|
||
![]() |
1bb0d13485 |
mptcp: pm: only set fullmesh for subflow endp
With the in-kernel path-manager, it is possible to change the 'fullmesh'
flag. The code in mptcp_pm_nl_fullmesh() expects to change it only on
'subflow' endpoints, to recreate more or less subflows using the linked
address.
Unfortunately, the set_flags() hook was a bit more permissive, and
allowed 'implicit' endpoints to get the 'fullmesh' flag while it is not
allowed before.
That's what syzbot found, triggering the following warning:
WARNING: CPU: 0 PID: 6499 at net/mptcp/pm_netlink.c:1496 __mark_subflow_endp_available net/mptcp/pm_netlink.c:1496 [inline]
WARNING: CPU: 0 PID: 6499 at net/mptcp/pm_netlink.c:1496 mptcp_pm_nl_fullmesh net/mptcp/pm_netlink.c:1980 [inline]
WARNING: CPU: 0 PID: 6499 at net/mptcp/pm_netlink.c:1496 mptcp_nl_set_flags net/mptcp/pm_netlink.c:2003 [inline]
WARNING: CPU: 0 PID: 6499 at net/mptcp/pm_netlink.c:1496 mptcp_pm_nl_set_flags+0x974/0xdc0 net/mptcp/pm_netlink.c:2064
Modules linked in:
CPU: 0 UID: 0 PID: 6499 Comm: syz.1.413 Not tainted 6.13.0-rc5-syzkaller-00172-gd1bf27c4e176 #0
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
RIP: 0010:__mark_subflow_endp_available net/mptcp/pm_netlink.c:1496 [inline]
RIP: 0010:mptcp_pm_nl_fullmesh net/mptcp/pm_netlink.c:1980 [inline]
RIP: 0010:mptcp_nl_set_flags net/mptcp/pm_netlink.c:2003 [inline]
RIP: 0010:mptcp_pm_nl_set_flags+0x974/0xdc0 net/mptcp/pm_netlink.c:2064
Code: 01 00 00 49 89 c5 e8 fb 45 e8 f5 e9 b8 fc ff ff e8 f1 45 e8 f5 4c 89 f7 be 03 00 00 00 e8 44 1d 0b f9 eb a0 e8 dd 45 e8 f5 90 <0f> 0b 90 e9 17 ff ff ff 89 d9 80 e1 07 38 c1 0f 8c c9 fc ff ff 48
RSP: 0018:ffffc9000d307240 EFLAGS: 00010293
RAX: ffffffff8bb72e03 RBX: 0000000000000000 RCX: ffff88807da88000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc9000d307430 R08: ffffffff8bb72cf0 R09: 1ffff1100b842a5e
R10: dffffc0000000000 R11: ffffed100b842a5f R12: ffff88801e2e5ac0
R13: ffff88805c214800 R14: ffff88805c2152e8 R15: 1ffff1100b842a5d
FS: 00005555619f6500(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020002840 CR3: 00000000247e6000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0xb14/0xec0 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2542
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1321 [inline]
netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1347
netlink_sendmsg+0x8e4/0xcb0 net/netlink/af_netlink.c:1891
sock_sendmsg_nosec net/socket.c:711 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:726
____sys_sendmsg+0x52a/0x7e0 net/socket.c:2583
___sys_sendmsg net/socket.c:2637 [inline]
__sys_sendmsg+0x269/0x350 net/socket.c:2669
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5fe8785d29
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff571f5558 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f5fe8975fa0 RCX: 00007f5fe8785d29
RDX: 0000000000000000 RSI: 0000000020000480 RDI: 0000000000000007
RBP: 00007f5fe8801b08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5fe8975fa0 R14: 00007f5fe8975fa0 R15: 00000000000011f4
</TASK>
Here, syzbot managed to set the 'fullmesh' flag on an 'implicit' and
used -- according to 'id_avail_bitmap' -- endpoint, causing the PM to
try decrement the local_addr_used counter which is only incremented for
the 'subflow' endpoint.
Note that 'no type' endpoints -- not 'subflow', 'signal', 'implicit' --
are fine, because their ID will not be marked as used in the 'id_avail'
bitmap, and setting 'fullmesh' can help forcing the creation of subflow
when receiving an ADD_ADDR.
Fixes:
|
||
![]() |
c86b000782 |
mptcp: consolidate suboption status
MPTCP maintains the received sub-options status is the bitmask carrying
the received suboptions and in several bitfields carrying per suboption
additional info.
Zeroing the bitmask before parsing is not enough to ensure a consistent
status, and the MPTCP code has to additionally clear some bitfiled
depending on the actually parsed suboption.
The above schema is fragile, and syzbot managed to trigger a path where
a relevant bitfield is not cleared/initialized:
BUG: KMSAN: uninit-value in __mptcp_expand_seq net/mptcp/options.c:1030 [inline]
BUG: KMSAN: uninit-value in mptcp_expand_seq net/mptcp/protocol.h:864 [inline]
BUG: KMSAN: uninit-value in ack_update_msk net/mptcp/options.c:1060 [inline]
BUG: KMSAN: uninit-value in mptcp_incoming_options+0x2036/0x3d30 net/mptcp/options.c:1209
__mptcp_expand_seq net/mptcp/options.c:1030 [inline]
mptcp_expand_seq net/mptcp/protocol.h:864 [inline]
ack_update_msk net/mptcp/options.c:1060 [inline]
mptcp_incoming_options+0x2036/0x3d30 net/mptcp/options.c:1209
tcp_data_queue+0xb4/0x7be0 net/ipv4/tcp_input.c:5233
tcp_rcv_established+0x1061/0x2510 net/ipv4/tcp_input.c:6264
tcp_v4_do_rcv+0x7f3/0x11a0 net/ipv4/tcp_ipv4.c:1916
tcp_v4_rcv+0x51df/0x5750 net/ipv4/tcp_ipv4.c:2351
ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x336/0x500 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:460 [inline]
ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:447
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:567
__netif_receive_skb_one_core net/core/dev.c:5704 [inline]
__netif_receive_skb+0x319/0xa00 net/core/dev.c:5817
process_backlog+0x4ad/0xa50 net/core/dev.c:6149
__napi_poll+0xe7/0x980 net/core/dev.c:6902
napi_poll net/core/dev.c:6971 [inline]
net_rx_action+0xa5a/0x19b0 net/core/dev.c:7093
handle_softirqs+0x1a0/0x7c0 kernel/softirq.c:561
__do_softirq+0x14/0x1a kernel/softirq.c:595
do_softirq+0x9a/0x100 kernel/softirq.c:462
__local_bh_enable_ip+0x9f/0xb0 kernel/softirq.c:389
local_bh_enable include/linux/bottom_half.h:33 [inline]
rcu_read_unlock_bh include/linux/rcupdate.h:919 [inline]
__dev_queue_xmit+0x2758/0x57d0 net/core/dev.c:4493
dev_queue_xmit include/linux/netdevice.h:3168 [inline]
neigh_hh_output include/net/neighbour.h:523 [inline]
neigh_output include/net/neighbour.h:537 [inline]
ip_finish_output2+0x187c/0x1b70 net/ipv4/ip_output.c:236
__ip_finish_output+0x287/0x810
ip_finish_output+0x4b/0x600 net/ipv4/ip_output.c:324
NF_HOOK_COND include/linux/netfilter.h:303 [inline]
ip_output+0x15f/0x3f0 net/ipv4/ip_output.c:434
dst_output include/net/dst.h:450 [inline]
ip_local_out net/ipv4/ip_output.c:130 [inline]
__ip_queue_xmit+0x1f2a/0x20d0 net/ipv4/ip_output.c:536
ip_queue_xmit+0x60/0x80 net/ipv4/ip_output.c:550
__tcp_transmit_skb+0x3cea/0x4900 net/ipv4/tcp_output.c:1468
tcp_transmit_skb net/ipv4/tcp_output.c:1486 [inline]
tcp_write_xmit+0x3b90/0x9070 net/ipv4/tcp_output.c:2829
__tcp_push_pending_frames+0xc4/0x380 net/ipv4/tcp_output.c:3012
tcp_send_fin+0x9f6/0xf50 net/ipv4/tcp_output.c:3618
__tcp_close+0x140c/0x1550 net/ipv4/tcp.c:3130
__mptcp_close_ssk+0x74e/0x16f0 net/mptcp/protocol.c:2496
mptcp_close_ssk+0x26b/0x2c0 net/mptcp/protocol.c:2550
mptcp_pm_nl_rm_addr_or_subflow+0x635/0xd10 net/mptcp/pm_netlink.c:889
mptcp_pm_nl_rm_subflow_received net/mptcp/pm_netlink.c:924 [inline]
mptcp_pm_flush_addrs_and_subflows net/mptcp/pm_netlink.c:1688 [inline]
mptcp_nl_flush_addrs_list net/mptcp/pm_netlink.c:1709 [inline]
mptcp_pm_nl_flush_addrs_doit+0xe10/0x1630 net/mptcp/pm_netlink.c:1750
genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0x1214/0x12c0 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2542
genl_rcv+0x40/0x60 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1321 [inline]
netlink_unicast+0xf52/0x1260 net/netlink/af_netlink.c:1347
netlink_sendmsg+0x10da/0x11e0 net/netlink/af_netlink.c:1891
sock_sendmsg_nosec net/socket.c:711 [inline]
__sock_sendmsg+0x30f/0x380 net/socket.c:726
____sys_sendmsg+0x877/0xb60 net/socket.c:2583
___sys_sendmsg+0x28d/0x3c0 net/socket.c:2637
__sys_sendmsg net/socket.c:2669 [inline]
__do_sys_sendmsg net/socket.c:2674 [inline]
__se_sys_sendmsg net/socket.c:2672 [inline]
__x64_sys_sendmsg+0x212/0x3c0 net/socket.c:2672
x64_sys_call+0x2ed6/0x3c30 arch/x86/include/generated/asm/syscalls_64.h:47
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Uninit was stored to memory at:
mptcp_get_options+0x2c0f/0x2f20 net/mptcp/options.c:397
mptcp_incoming_options+0x19a/0x3d30 net/mptcp/options.c:1150
tcp_data_queue+0xb4/0x7be0 net/ipv4/tcp_input.c:5233
tcp_rcv_established+0x1061/0x2510 net/ipv4/tcp_input.c:6264
tcp_v4_do_rcv+0x7f3/0x11a0 net/ipv4/tcp_ipv4.c:1916
tcp_v4_rcv+0x51df/0x5750 net/ipv4/tcp_ipv4.c:2351
ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x336/0x500 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:460 [inline]
ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:447
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:567
__netif_receive_skb_one_core net/core/dev.c:5704 [inline]
__netif_receive_skb+0x319/0xa00 net/core/dev.c:5817
process_backlog+0x4ad/0xa50 net/core/dev.c:6149
__napi_poll+0xe7/0x980 net/core/dev.c:6902
napi_poll net/core/dev.c:6971 [inline]
net_rx_action+0xa5a/0x19b0 net/core/dev.c:7093
handle_softirqs+0x1a0/0x7c0 kernel/softirq.c:561
__do_softirq+0x14/0x1a kernel/softirq.c:595
Uninit was stored to memory at:
put_unaligned_be32 include/linux/unaligned.h:68 [inline]
mptcp_write_options+0x17f9/0x3100 net/mptcp/options.c:1417
mptcp_options_write net/ipv4/tcp_output.c:465 [inline]
tcp_options_write+0x6d9/0xe90 net/ipv4/tcp_output.c:759
__tcp_transmit_skb+0x294b/0x4900 net/ipv4/tcp_output.c:1414
tcp_transmit_skb net/ipv4/tcp_output.c:1486 [inline]
tcp_write_xmit+0x3b90/0x9070 net/ipv4/tcp_output.c:2829
__tcp_push_pending_frames+0xc4/0x380 net/ipv4/tcp_output.c:3012
tcp_send_fin+0x9f6/0xf50 net/ipv4/tcp_output.c:3618
__tcp_close+0x140c/0x1550 net/ipv4/tcp.c:3130
__mptcp_close_ssk+0x74e/0x16f0 net/mptcp/protocol.c:2496
mptcp_close_ssk+0x26b/0x2c0 net/mptcp/protocol.c:2550
mptcp_pm_nl_rm_addr_or_subflow+0x635/0xd10 net/mptcp/pm_netlink.c:889
mptcp_pm_nl_rm_subflow_received net/mptcp/pm_netlink.c:924 [inline]
mptcp_pm_flush_addrs_and_subflows net/mptcp/pm_netlink.c:1688 [inline]
mptcp_nl_flush_addrs_list net/mptcp/pm_netlink.c:1709 [inline]
mptcp_pm_nl_flush_addrs_doit+0xe10/0x1630 net/mptcp/pm_netlink.c:1750
genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0x1214/0x12c0 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2542
genl_rcv+0x40/0x60 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1321 [inline]
netlink_unicast+0xf52/0x1260 net/netlink/af_netlink.c:1347
netlink_sendmsg+0x10da/0x11e0 net/netlink/af_netlink.c:1891
sock_sendmsg_nosec net/socket.c:711 [inline]
__sock_sendmsg+0x30f/0x380 net/socket.c:726
____sys_sendmsg+0x877/0xb60 net/socket.c:2583
___sys_sendmsg+0x28d/0x3c0 net/socket.c:2637
__sys_sendmsg net/socket.c:2669 [inline]
__do_sys_sendmsg net/socket.c:2674 [inline]
__se_sys_sendmsg net/socket.c:2672 [inline]
__x64_sys_sendmsg+0x212/0x3c0 net/socket.c:2672
x64_sys_call+0x2ed6/0x3c30 arch/x86/include/generated/asm/syscalls_64.h:47
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Uninit was stored to memory at:
mptcp_pm_add_addr_signal+0x3d7/0x4c0
mptcp_established_options_add_addr net/mptcp/options.c:666 [inline]
mptcp_established_options+0x1b9b/0x3a00 net/mptcp/options.c:884
tcp_established_options+0x2c4/0x7d0 net/ipv4/tcp_output.c:1012
__tcp_transmit_skb+0x5b7/0x4900 net/ipv4/tcp_output.c:1333
tcp_transmit_skb net/ipv4/tcp_output.c:1486 [inline]
tcp_write_xmit+0x3b90/0x9070 net/ipv4/tcp_output.c:2829
__tcp_push_pending_frames+0xc4/0x380 net/ipv4/tcp_output.c:3012
tcp_send_fin+0x9f6/0xf50 net/ipv4/tcp_output.c:3618
__tcp_close+0x140c/0x1550 net/ipv4/tcp.c:3130
__mptcp_close_ssk+0x74e/0x16f0 net/mptcp/protocol.c:2496
mptcp_close_ssk+0x26b/0x2c0 net/mptcp/protocol.c:2550
mptcp_pm_nl_rm_addr_or_subflow+0x635/0xd10 net/mptcp/pm_netlink.c:889
mptcp_pm_nl_rm_subflow_received net/mptcp/pm_netlink.c:924 [inline]
mptcp_pm_flush_addrs_and_subflows net/mptcp/pm_netlink.c:1688 [inline]
mptcp_nl_flush_addrs_list net/mptcp/pm_netlink.c:1709 [inline]
mptcp_pm_nl_flush_addrs_doit+0xe10/0x1630 net/mptcp/pm_netlink.c:1750
genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0x1214/0x12c0 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2542
genl_rcv+0x40/0x60 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1321 [inline]
netlink_unicast+0xf52/0x1260 net/netlink/af_netlink.c:1347
netlink_sendmsg+0x10da/0x11e0 net/netlink/af_netlink.c:1891
sock_sendmsg_nosec net/socket.c:711 [inline]
__sock_sendmsg+0x30f/0x380 net/socket.c:726
____sys_sendmsg+0x877/0xb60 net/socket.c:2583
___sys_sendmsg+0x28d/0x3c0 net/socket.c:2637
__sys_sendmsg net/socket.c:2669 [inline]
__do_sys_sendmsg net/socket.c:2674 [inline]
__se_sys_sendmsg net/socket.c:2672 [inline]
__x64_sys_sendmsg+0x212/0x3c0 net/socket.c:2672
x64_sys_call+0x2ed6/0x3c30 arch/x86/include/generated/asm/syscalls_64.h:47
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Uninit was stored to memory at:
mptcp_pm_add_addr_received+0x95f/0xdd0 net/mptcp/pm.c:235
mptcp_incoming_options+0x2983/0x3d30 net/mptcp/options.c:1169
tcp_data_queue+0xb4/0x7be0 net/ipv4/tcp_input.c:5233
tcp_rcv_state_process+0x2a38/0x49d0 net/ipv4/tcp_input.c:6972
tcp_v4_do_rcv+0xbf9/0x11a0 net/ipv4/tcp_ipv4.c:1939
tcp_v4_rcv+0x51df/0x5750 net/ipv4/tcp_ipv4.c:2351
ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x336/0x500 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:460 [inline]
ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:447
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:567
__netif_receive_skb_one_core net/core/dev.c:5704 [inline]
__netif_receive_skb+0x319/0xa00 net/core/dev.c:5817
process_backlog+0x4ad/0xa50 net/core/dev.c:6149
__napi_poll+0xe7/0x980 net/core/dev.c:6902
napi_poll net/core/dev.c:6971 [inline]
net_rx_action+0xa5a/0x19b0 net/core/dev.c:7093
handle_softirqs+0x1a0/0x7c0 kernel/softirq.c:561
__do_softirq+0x14/0x1a kernel/softirq.c:595
Local variable mp_opt created at:
mptcp_incoming_options+0x119/0x3d30 net/mptcp/options.c:1127
tcp_data_queue+0xb4/0x7be0 net/ipv4/tcp_input.c:5233
The current schema is too fragile; address the issue grouping all the
state-related data together and clearing the whole group instead of
just the bitmask. This also cleans-up the code a bit, as there is no
need to individually clear "random" bitfield in a couple of places
any more.
Fixes:
|
||
![]() |
79d458c130 |
rxrpc, afs: Fix peer hash locking vs RCU callback
In its address list, afs now retains pointers to and refs on one or more
rxrpc_peer objects. The address list is freed under RCU and at this time,
it puts the refs on those peers.
Now, when an rxrpc_peer object runs out of refs, it gets removed from the
peer hash table and, for that, rxrpc has to take a spinlock. However, it
is now being called from afs's RCU cleanup, which takes place in BH
context - but it is just taking an ordinary spinlock.
The put may also be called from non-BH context, and so there exists the
possibility of deadlock if the BH-based RCU cleanup happens whilst the hash
spinlock is held. This led to the attached lockdep complaint.
Fix this by changing spinlocks of rxnet->peer_hash_lock back to
BH-disabling locks.
================================
WARNING: inconsistent lock state
6.13.0-rc5-build2+ #1223 Tainted: G E
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
ffff88810babe228 (&rxnet->peer_hash_lock){+.?.}-{3:3}, at: rxrpc_put_peer+0xcb/0x180
{SOFTIRQ-ON-W} state was registered at:
mark_usage+0x164/0x180
__lock_acquire+0x544/0x990
lock_acquire.part.0+0x103/0x280
_raw_spin_lock+0x2f/0x40
rxrpc_peer_keepalive_worker+0x144/0x440
process_one_work+0x486/0x7c0
process_scheduled_works+0x73/0x90
worker_thread+0x1c8/0x2a0
kthread+0x19b/0x1b0
ret_from_fork+0x24/0x40
ret_from_fork_asm+0x1a/0x30
irq event stamp: 972402
hardirqs last enabled at (972402): [<ffffffff8244360e>] _raw_spin_unlock_irqrestore+0x2e/0x50
hardirqs last disabled at (972401): [<ffffffff82443328>] _raw_spin_lock_irqsave+0x18/0x60
softirqs last enabled at (972300): [<ffffffff810ffbbe>] handle_softirqs+0x3ee/0x430
softirqs last disabled at (972313): [<ffffffff810ffc54>] __irq_exit_rcu+0x44/0x110
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&rxnet->peer_hash_lock);
<Interrupt>
lock(&rxnet->peer_hash_lock);
*** DEADLOCK ***
1 lock held by swapper/1/0:
#0: ffffffff83576be0 (rcu_callback){....}-{0:0}, at: rcu_lock_acquire+0x7/0x30
stack backtrace:
CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G E 6.13.0-rc5-build2+ #1223
Tainted: [E]=UNSIGNED_MODULE
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x57/0x80
print_usage_bug.part.0+0x227/0x240
valid_state+0x53/0x70
mark_lock_irq+0xa5/0x2f0
mark_lock+0xf7/0x170
mark_usage+0xe1/0x180
__lock_acquire+0x544/0x990
lock_acquire.part.0+0x103/0x280
_raw_spin_lock+0x2f/0x40
rxrpc_put_peer+0xcb/0x180
afs_free_addrlist+0x46/0x90 [kafs]
rcu_do_batch+0x2d2/0x640
rcu_core+0x2f7/0x350
handle_softirqs+0x1ee/0x430
__irq_exit_rcu+0x44/0x110
irq_exit_rcu+0xa/0x30
sysvec_apic_timer_interrupt+0x7f/0xa0
</IRQ>
Fixes:
|
||
![]() |
67e4bb2ced |
net: page_pool: don't try to stash the napi id
Page ppol tried to cache the NAPI ID in page pool info to avoid
having a dependency on the life cycle of the NAPI instance.
Since commit under Fixes the NAPI ID is not populated until
napi_enable() and there's a good chance that page pool is
created before NAPI gets enabled.
Protect the NAPI pointer with the existing page pool mutex,
the reading path already holds it. napi_id itself we need
to READ_ONCE(), it's protected by netdev_lock() which are
not holding in page pool.
Before this patch napi IDs were missing for mlx5:
# ./cli.py --spec netlink/specs/netdev.yaml --dump page-pool-get
[{'id': 144, 'ifindex': 2, 'inflight': 3072, 'inflight-mem': 12582912},
{'id': 143, 'ifindex': 2, 'inflight': 5568, 'inflight-mem': 22806528},
{'id': 142, 'ifindex': 2, 'inflight': 5120, 'inflight-mem': 20971520},
{'id': 141, 'ifindex': 2, 'inflight': 4992, 'inflight-mem': 20447232},
...
After:
[{'id': 144, 'ifindex': 2, 'inflight': 3072, 'inflight-mem': 12582912,
'napi-id': 565},
{'id': 143, 'ifindex': 2, 'inflight': 4224, 'inflight-mem': 17301504,
'napi-id': 525},
{'id': 142, 'ifindex': 2, 'inflight': 4288, 'inflight-mem': 17563648,
'napi-id': 524},
...
Fixes:
|
||
![]() |
5de7665e0a |
net: rose: fix timer races against user threads
Rose timers only acquire the socket spinlock, without
checking if the socket is owned by one user thread.
Add a check and rearm the timers if needed.
BUG: KASAN: slab-use-after-free in rose_timer_expiry+0x31d/0x360 net/rose/rose_timer.c:174
Read of size 2 at addr ffff88802f09b82a by task swapper/0/0
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.13.0-rc5-syzkaller-00172-gd1bf27c4e176 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x169/0x550 mm/kasan/report.c:489
kasan_report+0x143/0x180 mm/kasan/report.c:602
rose_timer_expiry+0x31d/0x360 net/rose/rose_timer.c:174
call_timer_fn+0x187/0x650 kernel/time/timer.c:1793
expire_timers kernel/time/timer.c:1844 [inline]
__run_timers kernel/time/timer.c:2418 [inline]
__run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2430
run_timer_base kernel/time/timer.c:2439 [inline]
run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2449
handle_softirqs+0x2d4/0x9b0 kernel/softirq.c:561
__do_softirq kernel/softirq.c:595 [inline]
invoke_softirq kernel/softirq.c:435 [inline]
__irq_exit_rcu+0xf7/0x220 kernel/softirq.c:662
irq_exit_rcu+0x9/0x30 kernel/softirq.c:678
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1049 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1049
</IRQ>
Fixes:
|
||
![]() |
05d91cdb1f |
net/ncsi: use dev_set_mac_address() for Get MC MAC Address handling
Copy of the rationale from |
||
![]() |
9c5968db9e |
The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec. - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones. - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest. - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code. - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups. - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code. - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c. - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator. - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading. - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/). Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED). - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled. - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL. - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests. - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size. - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic. - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated. - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated. - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed. - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic. - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy. - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic. - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions. - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed. - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting. - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface. - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior. - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi "introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors." - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram. - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal. - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance. - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation. - "Cleanup for memfd_create()" from Isaac Manjarres does that thing. - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration. - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices. - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE= =Ib2s -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ... |
||
![]() |
c159dfbdd4 |
Mainly individually changelogged singleton patches. The patch series in
this pull are: - "lib min_heap: Improve min_heap safety, testing, and documentation" from Kuan-Wei Chiu provides various tightenings to the min_heap library code. - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some cleanup and Rust preparation in the xarray library code. - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes pathnames in some code comments. - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the new secs_to_jiffies() in various places where that is appropriate. - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen switches two filesystems to the new mount API. - "Convert ocfs2 to use folios" from Matthew Wilcox does that. - "Remove get_task_comm() and print task comm directly" from Yafang Shao removes now-unneeded calls to get_task_comm() in various places. - "squashfs: reduce memory usage and update docs" from Phillip Lougher implements some memory savings in squashfs and performs some maintainability work. - "lib: clarify comparison function requirements" from Kuan-Wei Chiu tightens the sort code's behaviour and adds some maintenance work. - "nilfs2: protect busy buffer heads from being force-cleared" from Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a corrupted image. - "nilfs2: fix kernel-doc comments for function return values" from Ryusuke Konishi fixes some nilfs kerneldoc. - "nilfs2: fix issues with rename operations" from Ryusuke Konishi addresses some nilfs BUG_ONs which syzbot was able to trigger. - "minmax.h: Cleanups and minor optimisations" from David Laight does some maintenance work on the min/max library code. - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work on the xarray library code. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX mJnaZ1/ibpEpearrChCViApQtcyEGQI= =ti4E -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Mainly individually changelogged singleton patches. The patch series in this pull are: - "lib min_heap: Improve min_heap safety, testing, and documentation" from Kuan-Wei Chiu provides various tightenings to the min_heap library code - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some cleanup and Rust preparation in the xarray library code - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes pathnames in some code comments - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the new secs_to_jiffies() in various places where that is appropriate - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen switches two filesystems to the new mount API - "Convert ocfs2 to use folios" from Matthew Wilcox does that - "Remove get_task_comm() and print task comm directly" from Yafang Shao removes now-unneeded calls to get_task_comm() in various places - "squashfs: reduce memory usage and update docs" from Phillip Lougher implements some memory savings in squashfs and performs some maintainability work - "lib: clarify comparison function requirements" from Kuan-Wei Chiu tightens the sort code's behaviour and adds some maintenance work - "nilfs2: protect busy buffer heads from being force-cleared" from Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a corrupted image - "nilfs2: fix kernel-doc comments for function return values" from Ryusuke Konishi fixes some nilfs kerneldoc - "nilfs2: fix issues with rename operations" from Ryusuke Konishi addresses some nilfs BUG_ONs which syzbot was able to trigger - "minmax.h: Cleanups and minor optimisations" from David Laight does some maintenance work on the min/max library code - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work on the xarray library code" * tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits) ocfs2: use str_yes_no() and str_no_yes() helper functions include/linux/lz4.h: add some missing macros Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent Xarray: remove repeat check in xas_squash_marks() Xarray: distinguish large entries correctly in xas_split_alloc() Xarray: move forward index correctly in xas_pause() Xarray: do not return sibling entries from xas_find_marked() ipc/util.c: complete the kernel-doc function descriptions gcov: clang: use correct function param names latencytop: use correct kernel-doc format for func params minmax.h: remove some #defines that are only expanded once minmax.h: simplify the variants of clamp() minmax.h: move all the clamp() definitions after the min/max() ones minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp() minmax.h: reduce the #define expansion of min(), max() and clamp() minmax.h: update some comments minmax.h: add whitespace around operators and after commas nilfs2: do not update mtime of renamed directory that is not moved nilfs2: handle errors that nilfs_prepare_chunk() may return CREDITS: fix spelling mistake ... |
||
![]() |
6bf9b5b40a |
mm: alloc_pages_bulk: rename API
The previous commit removed the page_list argument from alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function. Now that only the *_array() flavour of the API remains, we can do the following renaming (along with the _noprof() ones): alloc_pages_bulk_array -> alloc_pages_bulk alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy alloc_pages_bulk_array_node -> alloc_pages_bulk_node Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
382e391365 |
hyperv-next for v6.14
-----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmeTFQ4THHdlaS5saXVA a2VybmVsLm9yZwAKCRB2FHBfkEGgXqMWB/4uHjnu50u+m00OwXAKQr6i92zh50BZ RQragd9s9C8tuUNwPDmS/ct2BNAhoy43KJ0ClegdZjKxT1Ys8cLv4Wr5CaGckqWq +WCHqTgt+cPe0vUofqahB5wiAZMsnBgzFkV/OfFwBx0wkub9y5T3qVq5KapYlaDI 7Gftb+wg1AAsrdZ/HuLRy5ZVvkM/73rU2uoi8WXjr/T14E1krCFR/qirLd1OXo6Q Jb97qhnCt/N9JPwIq5/VnYWde5Mpqz6UgtA2rFLDXgNGz+h9/ND6ecWFHjZWNVdc AKWZTO5t+fRVBOSyahoyRoYSntPw3wlxyL7A2/54h6j4Dex7wLt6NQBj =empO -----END PGP SIGNATURE----- Merge tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Introduce a new set of Hyper-V headers in include/hyperv and replace the old hyperv-tlfs.h with the new headers (Nuno Das Neves) - Fixes for the Hyper-V VTL mode (Roman Kisel) - Fixes for cpu mask usage in Hyper-V code (Michael Kelley) - Document the guest VM hibernation behaviour (Michael Kelley) - Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain) * tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Documentation: hyperv: Add overview of guest VM hibernation hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id() hyperv: Do not overlap the hvcall IO areas in get_vtl() hyperv: Enable the hypercall output page for the VTL mode hv_balloon: Fallback to generic_online_page() for non-HV hot added mem Drivers: hv: vmbus: Log on missing offers if any Drivers: hv: vmbus: Wait for boot-time offers during boot and resume uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup iommu/hyper-v: Don't assume cpu_possible_mask is dense Drivers: hv: Don't assume cpu_possible_mask is dense x86/hyperv: Don't assume cpu_possible_mask is dense hyperv: Remove the now unused hyperv-tlfs.h files hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h hyperv: Add new Hyper-V headers in include/hyperv hyperv: Clean up unnecessary #includes hyperv: Move hv_connection_id to hyperv-tlfs.h |
||
![]() |
6c9b7db96d |
xfrm: Don't disable preemption while looking up cache state.
For the state cache lookup xfrm_input_state_lookup() first disables
preemption, to remain on the CPU and then retrieves a per-CPU pointer.
Within the preempt-disable section it also acquires
netns_xfrm::xfrm_state_lock, a spinlock_t. This lock must not be
acquired with explicit disabled preemption (such as by get_cpu())
because this lock becomes a sleeping lock on PREEMPT_RT.
To remain on the same CPU is just an optimisation for the CPU local
lookup. The actual modification of the per-CPU variable happens with
netns_xfrm::xfrm_state_lock acquired.
Remove get_cpu() and use the state_cache_input on the current CPU.
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Closes: https://lore.kernel.org/all/CAADnVQKkCLaj=roayH=Mjiiqz_svdf1tsC3OE4EC0E=mAD+L1A@mail.gmail.com/
Fixes:
|
||
![]() |
d0d106a2bd |
bpf-next-6.14
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmeOu1YACgkQ6rmadz2v bTrrHxAAn6eqEsluWnDlzhI0OGsPjvgS00sf+MOeqiXYeS2eJ8yJuKifp38+nIQZ lIplsWU2ReUY20eizPqLPnQ7TXZGvLgp08E8yHUoZ0siWanqr9iDRfbZCCNrDMNm lMqeR1SLapMws2R/UX9JbvPn2ajIJ6Lb4wxenTfdlW6q+0hAGM6Dt0k/jBod+quq /oo+xwG3L0q4APBovJfiAFN2z6IYN03b+zLiOrpIJtMACGewEXnl3m4mkL8ZM/FV nZGPIxIUPXCpKTGEkNqxfkrnHN2wZQ4ZSKEJ6lhEEp4jrgCVITaGZ/E7jlx6fZoj bbd4YMonIPo9Nhim8p1dt8yYBhKKiE5IXIq0GqlMv5+MvAN8ylrlydpsouW1fu66 hZ1W1BxbxmrgyF0Bwo9JPOMhBHwMrmD6iH9LgiMpZf0ASeF+q9cJpoSOU5j5E9XB LpLIRf5jYTd4wZjhDmrQREReLo+Bng9DlCBu+jjh2+YTz6l6Qed+ETpENcd7lL5i IHZVbgD2RVPNJoUfdrd763HfYfDTk+50MF5FIMEyfKHz11if0E/LhBMzto22hm6b 2f8ruj/8yvg8s2dxEP3ySQgcnynlwEnGxLenUVv7uEOYKeWri1rq+fvTK5ne1OLK oHnTlkViwQb74c0r8cFW+nkyfUYTfhhBAql14rl/fMjGDO2KZ10= =f2CA -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: "A smaller than usual release cycle. The main changes are: - Prepare selftest to run with GCC-BPF backend (Ihor Solodrai) In addition to LLVM-BPF runs the BPF CI now runs GCC-BPF in compile only mode. Half of the tests are failing, since support for btf_decl_tag is still WIP, but this is a great milestone. - Convert various samples/bpf to selftests/bpf/test_progs format (Alexis Lothoré and Bastien Curutchet) - Teach verifier to recognize that array lookup with constant in-range index will always succeed (Daniel Xu) - Cleanup migrate disable scope in BPF maps (Hou Tao) - Fix bpf_timer destroy path in PREEMPT_RT (Hou Tao) - Always use bpf_mem_alloc in bpf_local_storage in PREEMPT_RT (Martin KaFai Lau) - Refactor verifier lock support (Kumar Kartikeya Dwivedi) This is a prerequisite for upcoming resilient spin lock. - Remove excessive 'may_goto +0' instructions in the verifier that LLVM leaves when unrolls the loops (Yonghong Song) - Remove unhelpful bpf_probe_write_user() warning message (Marco Elver) - Add fd_array_cnt attribute for prog_load command (Anton Protopopov) This is a prerequisite for upcoming support for static_branch" * tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (125 commits) selftests/bpf: Add some tests related to 'may_goto 0' insns bpf: Remove 'may_goto 0' instruction in opt_remove_nops() bpf: Allow 'may_goto 0' instruction in verifier selftests/bpf: Add test case for the freeing of bpf_timer bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT bpf: Free element after unlock in __htab_map_lookup_and_delete_elem() bpf: Bail out early in __htab_map_lookup_and_delete_elem() bpf: Free special fields after unlock in htab_lru_map_delete_node() tools: Sync if_xdp.h uapi tooling header libbpf: Work around kernel inconsistently stripping '.llvm.' suffix bpf: selftests: verifier: Add nullness elision tests bpf: verifier: Support eliding map lookup nullness bpf: verifier: Refactor helper access type tracking bpf: tcp: Mark bpf_load_hdr_opt() arg2 as read-write bpf: verifier: Add missing newline on verbose() call selftests/bpf: Add distilled BTF test about marking BTF_IS_EMBEDDED libbpf: Fix incorrect traversal end type ID when marking BTF_IS_EMBEDDED libbpf: Fix return zero when elf_begin failed selftests/bpf: Fix btf leak on new btf alloc failure in btf_distill test veristat: Load struct_ops programs only once ... |
||
![]() |
15a901361e |
ipmr: do not call mr_mfc_uses_dev() for unres entries
syzbot found that calling mr_mfc_uses_dev() for unres entries
would crash [1], because c->mfc_un.res.minvif / c->mfc_un.res.maxvif
alias to "struct sk_buff_head unresolved", which contain two pointers.
This code never worked, lets remove it.
[1]
Unable to handle kernel paging request at virtual address ffff5fff2d536613
KASAN: maybe wild-memory-access in range [0xfffefff96a9b3098-0xfffefff96a9b309f]
Modules linked in:
CPU: 1 UID: 0 PID: 7321 Comm: syz.0.16 Not tainted 6.13.0-rc7-syzkaller-g1950a0af2d55 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mr_mfc_uses_dev net/ipv4/ipmr_base.c:290 [inline]
pc : mr_table_dump+0x5a4/0x8b0 net/ipv4/ipmr_base.c:334
lr : mr_mfc_uses_dev net/ipv4/ipmr_base.c:289 [inline]
lr : mr_table_dump+0x694/0x8b0 net/ipv4/ipmr_base.c:334
Call trace:
mr_mfc_uses_dev net/ipv4/ipmr_base.c:290 [inline] (P)
mr_table_dump+0x5a4/0x8b0 net/ipv4/ipmr_base.c:334 (P)
mr_rtm_dumproute+0x254/0x454 net/ipv4/ipmr_base.c:382
ipmr_rtm_dumproute+0x248/0x4b4 net/ipv4/ipmr.c:2648
rtnl_dump_all+0x2e4/0x4e8 net/core/rtnetlink.c:4327
rtnl_dumpit+0x98/0x1d0 net/core/rtnetlink.c:6791
netlink_dump+0x4f0/0xbc0 net/netlink/af_netlink.c:2317
netlink_recvmsg+0x56c/0xe64 net/netlink/af_netlink.c:1973
sock_recvmsg_nosec net/socket.c:1033 [inline]
sock_recvmsg net/socket.c:1055 [inline]
sock_read_iter+0x2d8/0x40c net/socket.c:1125
new_sync_read fs/read_write.c:484 [inline]
vfs_read+0x740/0x970 fs/read_write.c:565
ksys_read+0x15c/0x26c fs/read_write.c:708
Fixes:
|
||
![]() |
6bb194d036 |
net/ncsi: wait for the last response to Deselect Package before configuring channel
The NCSI state machine as it's currently implemented assumes that
transition to the next logical state is performed either explicitly by
calling `schedule_work(&ndp->work)` to re-queue itself or implicitly
after processing the predefined (ndp->pending_req_num) number of
replies. Thus to avoid the configuration FSM from advancing prematurely
and getting out of sync with the process it's essential to not skip
waiting for a reply.
This patch makes the code wait for reception of the Deselect Package
response for the last package probed before proceeding to channel
configuration.
Thanks go to Potin Lai and Cosmo Chou for the initial investigation and
testing.
Fixes:
|
||
![]() |
110b43ef05 |
NFC: nci: Add bounds checking in nci_hci_create_pipe()
The "pipe" variable is a u8 which comes from the network. If it's more
than 127, then it results in memory corruption in the caller,
nci_hci_connect_gate().
Cc: stable@vger.kernel.org
Fixes:
|
||
![]() |
d62b04fca4 |
net: sched: fix ets qdisc OOB Indexing
Haowei Yan <g1042620637@gmail.com> found that ets_class_from_arg() can
index an Out-Of-Bound class in ets_class_from_arg() when passed clid of
0. The overflow may cause local privilege escalation.
[ 18.852298] ------------[ cut here ]------------
[ 18.853271] UBSAN: array-index-out-of-bounds in net/sched/sch_ets.c:93:20
[ 18.853743] index 18446744073709551615 is out of range for type 'ets_class [16]'
[ 18.854254] CPU: 0 UID: 0 PID: 1275 Comm: poc Not tainted 6.12.6-dirty #17
[ 18.854821] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 18.856532] Call Trace:
[ 18.857441] <TASK>
[ 18.858227] dump_stack_lvl+0xc2/0xf0
[ 18.859607] dump_stack+0x10/0x20
[ 18.860908] __ubsan_handle_out_of_bounds+0xa7/0xf0
[ 18.864022] ets_class_change+0x3d6/0x3f0
[ 18.864322] tc_ctl_tclass+0x251/0x910
[ 18.864587] ? lock_acquire+0x5e/0x140
[ 18.865113] ? __mutex_lock+0x9c/0xe70
[ 18.866009] ? __mutex_lock+0xa34/0xe70
[ 18.866401] rtnetlink_rcv_msg+0x170/0x6f0
[ 18.866806] ? __lock_acquire+0x578/0xc10
[ 18.867184] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 18.867503] netlink_rcv_skb+0x59/0x110
[ 18.867776] rtnetlink_rcv+0x15/0x30
[ 18.868159] netlink_unicast+0x1c3/0x2b0
[ 18.868440] netlink_sendmsg+0x239/0x4b0
[ 18.868721] ____sys_sendmsg+0x3e2/0x410
[ 18.869012] ___sys_sendmsg+0x88/0xe0
[ 18.869276] ? rseq_ip_fixup+0x198/0x260
[ 18.869563] ? rseq_update_cpu_node_id+0x10a/0x190
[ 18.869900] ? trace_hardirqs_off+0x5a/0xd0
[ 18.870196] ? syscall_exit_to_user_mode+0xcc/0x220
[ 18.870547] ? do_syscall_64+0x93/0x150
[ 18.870821] ? __memcg_slab_free_hook+0x69/0x290
[ 18.871157] __sys_sendmsg+0x69/0xd0
[ 18.871416] __x64_sys_sendmsg+0x1d/0x30
[ 18.871699] x64_sys_call+0x9e2/0x2670
[ 18.871979] do_syscall_64+0x87/0x150
[ 18.873280] ? do_syscall_64+0x93/0x150
[ 18.874742] ? lock_release+0x7b/0x160
[ 18.876157] ? do_user_addr_fault+0x5ce/0x8f0
[ 18.877833] ? irqentry_exit_to_user_mode+0xc2/0x210
[ 18.879608] ? irqentry_exit+0x77/0xb0
[ 18.879808] ? clear_bhb_loop+0x15/0x70
[ 18.880023] ? clear_bhb_loop+0x15/0x70
[ 18.880223] ? clear_bhb_loop+0x15/0x70
[ 18.880426] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 18.880683] RIP: 0033:0x44a957
[ 18.880851] Code: ff ff e8 fc 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 8974 24 10
[ 18.881766] RSP: 002b:00007ffcdd00fad8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 18.882149] RAX: ffffffffffffffda RBX: 00007ffcdd010db8 RCX: 000000000044a957
[ 18.882507] RDX: 0000000000000000 RSI: 00007ffcdd00fb70 RDI: 0000000000000003
[ 18.885037] RBP: 00007ffcdd010bc0 R08: 000000000703c770 R09: 000000000703c7c0
[ 18.887203] R10: 0000000000000080 R11: 0000000000000246 R12: 0000000000000001
[ 18.888026] R13: 00007ffcdd010da8 R14: 00000000004ca7d0 R15: 0000000000000001
[ 18.888395] </TASK>
[ 18.888610] ---[ end trace ]---
Fixes:
|
||
![]() |
6f56971841 |
SUNRPC: do not retry on EKEYEXPIRED when user TGT ticket expired
When a user TGT ticket expired, gssd returns EKEYEXPIRED to the RPC layer for the upcall to create the security context. The RPC layer then retries the upcall twice before returning the EKEYEXPIRED to the NFS layer. This results in three separate TCP connections to the NFS server being created by gssd for each RPC request. These connections are not used and left in TIME_WAIT state. Note that for RPC call that uses machine credential, gssd automatically renews the ticket. But for a regular user the ticket needs to be renewed by the user before access to the krb5 share is allowed. This patch removes the retries by RPC on EKEYEXPIRED so that these unused TCP connections are not created. Reproducer: $ kinit -l 1m $ sleep 65 $ cd /mnt/krb5share $ netstat -na |grep TIME_WAIT Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> |
||
![]() |
918b8e3b3f |
sunrpc: add netns inum and srcaddr to debugfs rpc_xprt info
The output format should provide a value that matches the one in the /proc/<pid>/ns/net symlink. This makes it simpler to match the rpc_xprt and rpc_clnt to a particular container. Also, when the xprt defines the get_srcaddr operation, use that to display the source address as well. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> |
||
![]() |
8c8ecc98f5 |
batman-adv: Drop unmanaged ELP metric worker
The ELP worker needs to calculate new metric values for all neighbors
"reachable" over an interface. Some of the used metric sources require
locks which might need to sleep. This sleep is incompatible with the RCU
list iterator used for the recorded neighbors. The initial approach to work
around of this problem was to queue another work item per neighbor and then
run this in a new context.
Even when this solved the RCU vs might_sleep() conflict, it has a major
problems: Nothing was stopping the work item in case it is not needed
anymore - for example because one of the related interfaces was removed or
the batman-adv module was unloaded - resulting in potential invalid memory
accesses.
Directly canceling the metric worker also has various problems:
* cancel_work_sync for a to-be-deactivated interface is called with
rtnl_lock held. But the code in the ELP metric worker also tries to use
rtnl_lock() - which will never return in this case. This also means that
cancel_work_sync would never return because it is waiting for the worker
to finish.
* iterating over the neighbor list for the to-be-deactivated interface is
currently done using the RCU specific methods. Which means that it is
possible to miss items when iterating over it without the associated
spinlock - a behaviour which is acceptable for a periodic metric check
but not for a cleanup routine (which must "stop" all still running
workers)
The better approch is to get rid of the per interface neighbor metric
worker and handle everything in the interface worker. The original problems
are solved by:
* creating a list of neighbors which require new metric information inside
the RCU protected context, gathering the metric according to the new list
outside the RCU protected context
* only use rcu_trylock inside metric gathering code to avoid a deadlock
when the cancel_delayed_work_sync is called in the interface removal code
(which is called with the rtnl_lock held)
Cc: stable@vger.kernel.org
Fixes:
|
||
![]() |
e7e34ffc97 |
batman-adv: Ignore neighbor throughput metrics in error case
If a temporary error happened in the evaluation of the neighbor throughput information, then the invalid throughput result should not be stored in the throughtput EWMA. Cc: stable@vger.kernel.org Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> |
||
![]() |
0ad9617c78 |
Networking changes for 6.14.
Core ---- - More core refactoring to reduce the RTNL lock contention, including preparatory work for the per-network namespace RTNL lock, replacing RTNL lock with a per device-one to protect NAPI-related net device data and moving synchronize_net() calls outside such lock. - Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge and more specific TCP coverage. - Reduce network namespace tear-down time by removing per-subsystems synchronize_net() in tipc and sched. - Add flow label selector support for fib rules, allowing traffic redirection based on such header field. Netfilter --------- - Do not remove netdev basechain when last device is gone, allowing netdev basechains without devices. - Revisit the flowtable teardown strategy, dealing better with fin, reset and re-open events. - Scale-up IP-vs connection dumping by avoiding linear search on each restart. Protocols --------- - A significant XDP socket refactor, consolidating and optimizing several helpers into the core - Better scaling of ICMP rate-limiting, by removing false-sharing in inet peers handling. - Introduces netlink notifications for multicast IPv4 and IPv6 address changes. - Add ipsec support for IP-TFS/AggFrag encapsulation, allowing aggregation and fragmentation of the inner IP. - Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to avoid local port exhaustion issues when the average connection lifetime is very short. - Support updating keys (re-keying) for connections using kernel TLS (for TLS 1.3 only). - Support ipv4-mapped ipv6 address clients in smc-r v2. - Add support for jumbo data packet transmission in RxRPC sockets, gluing multiple data packets in a single UDP packet. - Support RxRPC RACK-TLP to manage packet loss and retransmission in conjunction with the congestion control algorithm. Driver API ---------- - Introduce a unified and structured interface for reporting PHY statistics, exposing consistent data across different H/W via ethtool. - Make timestamping selectable, allow the user to select the desired hwtstamp provider (PHY or MAC) administratively. - Add support for configuring a header-data-split threshold (HDS) value via ethtool, to deal with partial or buggy H/W implementation. - Consolidate DSA drivers Energy Efficiency Ethernet support. - Add EEE management to phylink, making use of the phylib implementation. - Add phylib support for in-band capabilities negotiation. - Simplify how phylib-enabled mac drivers expose the supported interfaces. Tests and tooling ----------------- - Make the YNL tool package-friendly to make it easier to deploy it separately from the kernel. - Increase TCP selftest coverage importing several packetdrill test-cases. - Regenerate the ethtool uapi header from the YNL spec, to ease maintenance and future development. - Add YNL support for decoding the link types used in net self-tests, allowing a single build to run both net and drivers/net. Drivers ------- - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - add cross E-Switch QoS support - add SW Steering support for ConnectX-8 - implement support for HW-Managed Flow Steering, improving the rule deletion/insertion rate - support for multi-host LAG - Intel (ixgbe, ice, igb): - ice: add support for devlink health events - ixgbe: add initial support for E610 chipset variant - igb: add support for AF_XDP zero-copy - Meta: - add support for basic RSS config - allow changing the number of channels - add hardware monitoring support - Broadcom (bnxt): - implement TCP data split and HDS threshold ethtool support, enabling Device Memory TCP. - Marvell Octeon: - implement egress ipsec offload support for the cn10k family - Hisilicon (HIBMC): - implement unicast MAC filtering - Ethernet NICs embedded and virtual: - Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding contented atomic operations for drop counters - Freescale: - quicc: phylink conversion - enetc: support Tx and Rx checksum offload and improve TSO performances - MediaTek: - airoha: introduce support for ETS and HTB Qdisc offload - Microchip: - lan78XX USB: preparation work for phylink conversion - Synopsys (stmmac): - support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45 - refactor EEE support to leverage the new driver API - optimize DMA and cache access to increase raw RX performances by 40% - TI: - icssg-prueth: add multicast filtering support for VLAN interface - netkit: - add ability to configure head/tailroom - VXLAN: - accepts packets with user-defined reserved bit - Ethernet switches: - Microchip: - lan969x: add RGMII support - lan969x: improve TX and RX performance using the FDMA engine - nVidia/Mellanox: - move Tx header handling to PCI driver, to ease XDP support - Ethernet PHYs: - Texas Instruments DP83822: - add support for GPIO2 clock output - Realtek: - 8169: add support for RTL8125D rev.b - rtl822x: add hwmon support for the temperature sensor - Microchip: - add support for RDS PTP hardware - consolidate periodic output signal generation - CAN: - several DT-bindings to DT schema conversions - tcan4x5x: - add HW standby support - support nWKRQ voltage selection - kvaser: - allowing Bus Error Reporting runtime configuration - WiFi: - the on-going Multi-Link Operation (MLO) effort continues, affecting both the stack and in drivers - mac80211/cfg80211: - Emergency Preparedness Communication Services (EPCS) station mode support - support for adding and removing station links for MLO - add support for WiFi 7/EHT mesh over 320 MHz channels - report Tx power info for each link - RealTek (rtw88): - enable USB Rx aggregation and USB 3 to improve performance - LED support - RealTek (rtw89): - refactor power save to support Multi-Link Operations - add support for RTL8922AE-VS variant - MediaTek (mt76): - single wiphy multiband support (preparation for MLO) - p2p device support - add TP-Link TXE50UH USB adapter support - Qualcomm (ath10k): - support for the QCA6698AQ IP core - Qualcomm (ath12k): - enable MLO for QCN9274 - Bluetooth: - Allow sysfs to trigger hdev reset, to allow recovering devices not responsive from user-space - MediaTek: add support for MT7922, MT7925, MT7921e devices - Realtek: add support for RTL8851BE devices - Qualcomm: add support for WCN785x devices - ISO: allow BIG re-sync Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmePf5YSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkUcMQALblhkGTxurnfT+yK+Bsuhn2LoHl2RPN 4u2Kjkzm+2FYgcw6lS17cFXsnfAPlRIpmhnmKk1EBgsBdkuL29c+jtqnljA2bboD tIMhMgWiaLS3xgEMrLeKnseIo0G9mviQRphGeZPFTaLb4Ww/bd5LAp4ZGc5oij76 tURatC3b6MuO4Lt5U+jWKnRwviXku8udHkVHXlvPdirawHCVinmx3tvce/BI/MaD eUOp6ZeJCPCOLtk7b8WEyxxvdY0f6D9ed82qfPDHjb94SJv+Vxb38RZtNuApIjn9 S0KdlNih/4flDy17LDxGYSyFps78lUFRbpqmsUlnZkyLXpsph7/WTvAmMAFcrX0K UgQ/F/q5GAvcP5WZcCj5+tZaRmfKQraQirXMtYU/Uj50qCnSU7ssyACASt23GLZ8 OF8tCLlm9lLOU1B6Ofkul1Dbo5f0Xpaghga4dFb0kzSfbm78fTUnqBNsJ7jIkWfi fD6dO+fg+p2ZMD0CACGo3CNxQuJmaQWg6BIDeno6God8kZ6qBMxY/sFr4qozrvFH x/FgQq8dgc8WLmaPejKiNIPkdQepXrIiv3T9jgMVyEjJnWB/LBfyWKSQOdTfnLs+ rgr4YMV6XW4bx0fYqTI8B9jZ+FCWbG6sn4UtRTHITKcd3FSvd8Y+PHa5YyCUWvJM l8pePMGF0XVF =hrsp -----END PGP SIGNATURE----- Merge tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "This is slightly smaller than usual, with the most interesting work being still around RTNL scope reduction. Core: - More core refactoring to reduce the RTNL lock contention, including preparatory work for the per-network namespace RTNL lock, replacing RTNL lock with a per device-one to protect NAPI-related net device data and moving synchronize_net() calls outside such lock. - Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge and more specific TCP coverage. - Reduce network namespace tear-down time by removing per-subsystems synchronize_net() in tipc and sched. - Add flow label selector support for fib rules, allowing traffic redirection based on such header field. Netfilter: - Do not remove netdev basechain when last device is gone, allowing netdev basechains without devices. - Revisit the flowtable teardown strategy, dealing better with fin, reset and re-open events. - Scale-up IP-vs connection dumping by avoiding linear search on each restart. Protocols: - A significant XDP socket refactor, consolidating and optimizing several helpers into the core - Better scaling of ICMP rate-limiting, by removing false-sharing in inet peers handling. - Introduces netlink notifications for multicast IPv4 and IPv6 address changes. - Add ipsec support for IP-TFS/AggFrag encapsulation, allowing aggregation and fragmentation of the inner IP. - Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to avoid local port exhaustion issues when the average connection lifetime is very short. - Support updating keys (re-keying) for connections using kernel TLS (for TLS 1.3 only). - Support ipv4-mapped ipv6 address clients in smc-r v2. - Add support for jumbo data packet transmission in RxRPC sockets, gluing multiple data packets in a single UDP packet. - Support RxRPC RACK-TLP to manage packet loss and retransmission in conjunction with the congestion control algorithm. Driver API: - Introduce a unified and structured interface for reporting PHY statistics, exposing consistent data across different H/W via ethtool. - Make timestamping selectable, allow the user to select the desired hwtstamp provider (PHY or MAC) administratively. - Add support for configuring a header-data-split threshold (HDS) value via ethtool, to deal with partial or buggy H/W implementation. - Consolidate DSA drivers Energy Efficiency Ethernet support. - Add EEE management to phylink, making use of the phylib implementation. - Add phylib support for in-band capabilities negotiation. - Simplify how phylib-enabled mac drivers expose the supported interfaces. Tests and tooling: - Make the YNL tool package-friendly to make it easier to deploy it separately from the kernel. - Increase TCP selftest coverage importing several packetdrill test-cases. - Regenerate the ethtool uapi header from the YNL spec, to ease maintenance and future development. - Add YNL support for decoding the link types used in net self-tests, allowing a single build to run both net and drivers/net. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - add cross E-Switch QoS support - add SW Steering support for ConnectX-8 - implement support for HW-Managed Flow Steering, improving the rule deletion/insertion rate - support for multi-host LAG - Intel (ixgbe, ice, igb): - ice: add support for devlink health events - ixgbe: add initial support for E610 chipset variant - igb: add support for AF_XDP zero-copy - Meta: - add support for basic RSS config - allow changing the number of channels - add hardware monitoring support - Broadcom (bnxt): - implement TCP data split and HDS threshold ethtool support, enabling Device Memory TCP. - Marvell Octeon: - implement egress ipsec offload support for the cn10k family - Hisilicon (HIBMC): - implement unicast MAC filtering - Ethernet NICs embedded and virtual: - Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding contented atomic operations for drop counters - Freescale: - quicc: phylink conversion - enetc: support Tx and Rx checksum offload and improve TSO performances - MediaTek: - airoha: introduce support for ETS and HTB Qdisc offload - Microchip: - lan78XX USB: preparation work for phylink conversion - Synopsys (stmmac): - support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45 - refactor EEE support to leverage the new driver API - optimize DMA and cache access to increase raw RX performances by 40% - TI: - icssg-prueth: add multicast filtering support for VLAN interface - netkit: - add ability to configure head/tailroom - VXLAN: - accepts packets with user-defined reserved bit - Ethernet switches: - Microchip: - lan969x: add RGMII support - lan969x: improve TX and RX performance using the FDMA engine - nVidia/Mellanox: - move Tx header handling to PCI driver, to ease XDP support - Ethernet PHYs: - Texas Instruments DP83822: - add support for GPIO2 clock output - Realtek: - 8169: add support for RTL8125D rev.b - rtl822x: add hwmon support for the temperature sensor - Microchip: - add support for RDS PTP hardware - consolidate periodic output signal generation - CAN: - several DT-bindings to DT schema conversions - tcan4x5x: - add HW standby support - support nWKRQ voltage selection - kvaser: - allowing Bus Error Reporting runtime configuration - WiFi: - the on-going Multi-Link Operation (MLO) effort continues, affecting both the stack and in drivers - mac80211/cfg80211: - Emergency Preparedness Communication Services (EPCS) station mode support - support for adding and removing station links for MLO - add support for WiFi 7/EHT mesh over 320 MHz channels - report Tx power info for each link - RealTek (rtw88): - enable USB Rx aggregation and USB 3 to improve performance - LED support - RealTek (rtw89): - refactor power save to support Multi-Link Operations - add support for RTL8922AE-VS variant - MediaTek (mt76): - single wiphy multiband support (preparation for MLO) - p2p device support - add TP-Link TXE50UH USB adapter support - Qualcomm (ath10k): - support for the QCA6698AQ IP core - Qualcomm (ath12k): - enable MLO for QCN9274 - Bluetooth: - Allow sysfs to trigger hdev reset, to allow recovering devices not responsive from user-space - MediaTek: add support for MT7922, MT7925, MT7921e devices - Realtek: add support for RTL8851BE devices - Qualcomm: add support for WCN785x devices - ISO: allow BIG re-sync" * tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1386 commits) net/rose: prevent integer overflows in rose_setsockopt() net: phylink: fix regression when binding a PHY net: ethernet: ti: am65-cpsw: streamline TX queue creation and cleanup net: ethernet: ti: am65-cpsw: streamline RX queue creation and cleanup net: ethernet: ti: am65-cpsw: ensure proper channel cleanup in error path ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL. ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL. ipv6: Move lifetime validation to inet6_rtm_newaddr(). ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr(). ipv6: Pass dev to inet6_addr_add(). ipv6: Convert inet6_ioctl() to per-netns RTNL. ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup(). ipv6: Hold rtnl_net_lock() in addrconf_dad_work(). ipv6: Hold rtnl_net_lock() in addrconf_verify_work(). ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL. ipv6: Add __in6_dev_get_rtnl_net(). net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags net: mii: Fix the Speed display when the network cable is not connected sysctl net: Remove macro checks for CONFIG_SYSCTL eth: bnxt: update header sizing defaults ... |
||
![]() |
f96a974170 |
lsm/stable-6.14 PR 20250121
-----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmeQFBoUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPvcA//XCdwMz0bGtWKv58nuyP8vkQx08n6 //olz/O8te3uWK5O3kRiarzFLwH8qsHQ6A7GYalwwix34hatR4ndJE0Y/guVRWa1 +aBmJxJ7Jm/q3fvpAEfqiSgreuE6kBoztlDOWEq+hUQGu4qfnQGm2EnvbvfFrAmN VheOfIQSU2KCL/Scc3FGnF6uru4WrqN0JJ9RbvrEpfdQgmcyTGLnQsZLljutWSIq kDWkteIr7cj3O9J45zpxZsTftvYSgVn/y1iKeXbHI4DBA1eheK12vsHB9AADKI1J GwHxOrnLpZtv+ICUKqcfFTmWTl+NmfJJurAT5KXKdBjL3xM5MoJlBvK1A5qE9CMo LaHVG/TZR2MmBaoM3EN+gvWhDgWlvT02Q/0cYaafTlVLMez3HtfctxN6OnCvTXTB Y8dqYClhhlBm/mHQwYfMoeKw4MftUpzEqBd1Nj7Qe8dbP0f/62Ca3K2B3D6Rf8QV pj3ryMlSWYV9mdTerruLNQexTGoN7l66jPwzdWpTbFeL3WmNtfCako8OZGbXgPIu Iahm3P+jnSVx8ZQro2c9zwdKXI5xiI335pCBbDZ8aX+JAsfj0OofHsFx5Q5diber M7tAEhxDqRisbpz7Ei+/LOAEGg2Z619XKg8ks4z6Y4P5PF7zEgeWTkZJk2iLbxXe 6LLOjmF7LLw+G4M= =fgyr -----END PGP SIGNATURE----- Merge tag 'lsm-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Improved handling of LSM "secctx" strings through lsm_context struct The LSM secctx string interface is from an older time when only one LSM was supported, migrate over to the lsm_context struct to better support the different LSMs we now have and make it easier to support new LSMs in the future. These changes explain the Rust, VFS, and networking changes in the diffstat. - Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are enabled Small tweak to be a bit smarter about when we build the LSM's common audit helpers. - Check for absurdly large policies from userspace in SafeSetID SafeSetID policies rules are fairly small, basically just "UID:UID", it easy to impose a limit of KMALLOC_MAX_SIZE on policy writes which helps quiet a number of syzbot related issues. While work is being done to address the syzbot issues through other mechanisms, this is a trivial and relatively safe fix that we can do now. - Various minor improvements and cleanups A collection of improvements to the kernel selftests, constification of some function parameters, removing redundant assignments, and local variable renames to improve readability. * tag 'lsm-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: lockdown: initialize local array before use to quiet static analysis safesetid: check size of policy writes net: corrections for security_secid_to_secctx returns lsm: rename variable to avoid shadowing lsm: constify function parameters security: remove redundant assignment to return variable lsm: Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are set selftests: refactor the lsm `flags_overset_lsm_set_self_attr` test binder: initialize lsm_context structure rust: replace lsm context+len with lsm_context lsm: secctx provider check on release lsm: lsm_context in security_dentry_init_security lsm: use lsm_context in security_inode_getsecctx lsm: replace context+len with lsm_context lsm: ensure the correct LSM context releaser |
||
![]() |
1d6d399223 |
Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: _ kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: * Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() * Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware * Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. * Default affine kthread to its preferred node. * Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation * Implement kthreads preferred affinity * Unify kthread worker and kthread API's style * Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2 eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5 wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/ AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ= =g8In -----END PGP SIGNATURE----- Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "Kthreads affinity follow either of 4 existing different patterns: 1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: - kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: - Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() - Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware - Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. - Default affine kthread to its preferred node. - Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation - Implement kthreads preferred affinity - Unify kthread worker and kthread API's style - Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation" * tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: kthread: modify kernel-doc function name to match code rcu: Use kthread preferred affinity for RCU exp kworkers treewide: Introduce kthread_run_worker[_on_cpu]() kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format rcu: Use kthread preferred affinity for RCU boost kthread: Implement preferred affinity mm: Create/affine kswapd to its preferred node mm: Create/affine kcompactd to its preferred node kthread: Default affine kthread to its preferred NUMA node kthread: Make sure kthread hasn't started while binding it sched,arm64: Handle CPU isolation on last resort fallback rq selection arm64: Exclude nohz_full CPUs from 32bits el0 support lib: test_objpool: Use kthread_run_on_cpu() kallsyms: Use kthread_run_on_cpu() soc/qman: test: Use kthread_run_on_cpu() arm/bL_switcher: Use kthread_run_on_cpu() |
||
![]() |
c92066e786 |
sunrpc: Remove gss_{de,en}crypt_xdr_buf deadcode
Commit
|
||
![]() |
afc52b1eeb |
sunrpc: Remove gss_generic_token deadcode
Commit
|
||
![]() |
ee0d90d4b9 |
sunrpc: Remove unused xprt_iter_get_xprt
xprt_iter_get_xprt() was added by
commit
|
||
![]() |
966a675da8 |
Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages"
I noticed that a handful of NFSv3 fstests were taking an
unexpectedly long time to run. Troubleshooting showed that the
server's TCP window closed and never re-opened, which caused the
client to trigger an RPC retransmit timeout after 180 seconds.
The client's recovery action was to establish a fresh connection
and retransmit the timed-out requests. This worked, but it adds a
long delay.
I tracked the problem to the commit that attempted to reduce the
rate at which the network layer delivers TCP socket data_ready
callbacks. Under most circumstances this change worked as expected,
but for NFSv3, which has no session or other type of throttling, it
can overwhelm the receiver on occasion.
I'm sure I could tweak the lowat settings, but the small benefit
doesn't seem worth the bother. Just revert it.
Fixes:
|
||
![]() |
cf33d96f50 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts and no adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
![]() |
d640627663 |
net/rose: prevent integer overflows in rose_setsockopt()
In case of possible unpredictably large arguments passed to
rose_setsockopt() and multiplied by extra values on top of that,
integer overflows may occur.
Do the safest minimum and fix these issues by checking the
contents of 'opt' and returning -EINVAL if they are too large. Also,
switch to unsigned int and remove useless check for negative 'opt'
in ROSE_IDLE case.
Fixes:
|