mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
3f84d577b7
524 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
3955802f16 |
af_unix: Define locking order for unix_table_double_lock().
When created, AF_UNIX socket is put into net->unx.table.buckets[], and the hash is stored in sk->sk_hash. * unbound socket : 0 <= sk_hash <= UNIX_HASH_MOD When bind() is called, the socket could be moved to another bucket. * pathname socket : 0 <= sk_hash <= UNIX_HASH_MOD * abstract socket : UNIX_HASH_MOD + 1 <= sk_hash <= UNIX_HASH_MOD * 2 + 1 Then, we call unix_table_double_lock() which locks a single bucket or two. Let's define the order as unix_table_lock_cmp_fn() instead of using spin_lock_nested(). The locking is always done in ascending order of sk->sk_hash, which is the index of buckets/locks array allocated by kvmalloc_array(). sk_hash_A < sk_hash_B <=> &locks[sk_hash_A].dep_map < &locks[sk_hash_B].dep_map So, the relation of two sk->sk_hash can be derived from the addresses of dep_map in the array of locks. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
![]() |
4c7d3d79c7 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts, no adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
a6736a0add |
af_unix: Read with MSG_PEEK loops if the first unread byte is OOB
Read with MSG_PEEK flag loops if the first byte to read is an OOB byte. commit |
||
![]() |
62b5bf58b9 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/pensando/ionic/ionic_txrx.c |
||
![]() |
83690b82d2 |
af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
If the socket type is SOCK_STREAM or SOCK_SEQPACKET, unix_release_sock()
checks the length of the peer socket's recvq under unix_state_lock().
However, unix_stream_read_generic() calls skb_unlink() after releasing
the lock. Also, for SOCK_SEQPACKET, __skb_try_recv_datagram() unlinks
skb without unix_state_lock().
Thues, unix_state_lock() does not protect qlen.
Let's use skb_queue_empty_lockless() in unix_release_sock().
Fixes:
|
||
![]() |
45d872f0e6 |
af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
Once sk->sk_state is changed to TCP_LISTEN, it never changes.
unix_accept() takes advantage of this characteristics; it does not
hold the listener's unix_state_lock() and only acquires recvq lock
to pop one skb.
It means unix_state_lock() does not prevent the queue length from
changing in unix_stream_connect().
Thus, we need to use unix_recvq_full_lockless() to avoid data-race.
Now we remove unix_recvq_full() as no one uses it.
Note that we can remove READ_ONCE() for sk->sk_max_ack_backlog in
unix_recvq_full_lockless() because of the following reasons:
(1) For SOCK_DGRAM, it is a written-once field in unix_create1()
(2) For SOCK_STREAM and SOCK_SEQPACKET, it is changed under the
listener's unix_state_lock() in unix_listen(), and we hold
the lock in unix_stream_connect()
Fixes:
|
||
![]() |
bd9f2d0573 |
af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
net->unx.sysctl_max_dgram_qlen is exposed as a sysctl knob and can be
changed concurrently.
Let's use READ_ONCE() in unix_create1().
Fixes:
|
||
![]() |
b0632e53e0 |
af_unix: Annotate data-races around sk->sk_sndbuf.
sk_setsockopt() changes sk->sk_sndbuf under lock_sock(), but it's
not used in af_unix.c.
Let's use READ_ONCE() to read sk->sk_sndbuf in unix_writable(),
unix_dgram_sendmsg(), and unix_stream_sendmsg().
Fixes:
|
||
![]() |
af4c733b6b |
af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
unix_stream_read_skb() is called from sk->sk_data_ready() context
where unix_state_lock() is not held.
Let's use READ_ONCE() there.
Fixes:
|
||
![]() |
8a34d4e8d9 |
af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg().
The following functions read sk->sk_state locklessly and proceed only if the state is TCP_ESTABLISHED. * unix_stream_sendmsg * unix_stream_read_generic * unix_seqpacket_sendmsg * unix_seqpacket_recvmsg Let's use READ_ONCE() there. Fixes: |
||
![]() |
1b536948e8 |
af_unix: Annotate data-race of sk->sk_state in unix_accept().
Once sk->sk_state is changed to TCP_LISTEN, it never changes.
unix_accept() takes the advantage and reads sk->sk_state without
holding unix_state_lock().
Let's use READ_ONCE() there.
Fixes:
|
||
![]() |
a9bf9c7dc6 |
af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
As small optimisation, unix_stream_connect() prefetches the client's
sk->sk_state without unix_state_lock() and checks if it's TCP_CLOSE.
Later, sk->sk_state is checked again under unix_state_lock().
Let's use READ_ONCE() for the first check and TCP_CLOSE directly for
the second check.
Fixes:
|
||
![]() |
eb0718fb3e |
af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
unix_poll() and unix_dgram_poll() read sk->sk_state locklessly and calls unix_writable() which also reads sk->sk_state without holding unix_state_lock(). Let's use READ_ONCE() in unix_poll() and unix_dgram_poll() and pass it to unix_writable(). While at it, we remove TCP_SYN_SENT check in unix_dgram_poll() as that state does not exist for AF_UNIX socket since the code was added. Fixes: |
||
![]() |
3a0f38eb28 |
af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
ioctl(SIOCINQ) calls unix_inq_len() that checks sk->sk_state first
and returns -EINVAL if it's TCP_LISTEN.
Then, for SOCK_STREAM sockets, unix_inq_len() returns the number of
bytes in recvq.
However, unix_inq_len() does not hold unix_state_lock(), and the
concurrent listen() might change the state after checking sk->sk_state.
If the race occurs, 0 is returned for the listener, instead of -EINVAL,
because the length of skb with embryo is 0.
We could hold unix_state_lock() in unix_inq_len(), but it's overkill
given the result is true for pre-listen() TCP_CLOSE state.
So, let's use READ_ONCE() for sk->sk_state in unix_inq_len().
Fixes:
|
||
![]() |
942238f973 |
af_unix: Annodate data-races around sk->sk_state for writers.
sk->sk_state is changed under unix_state_lock(), but it's read locklessly in many places. This patch adds WRITE_ONCE() on the writer side. We will add READ_ONCE() to the lockless readers in the following patches. Fixes: |
||
![]() |
26bfb8b570 |
af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
When a SOCK_DGRAM socket connect()s to another socket, the both sockets'
sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
to BPF SOCKMAP.
When the socket disconnects from the peer by connect(AF_UNSPEC), the state
is set back to TCP_CLOSE.
Then, the peer's state is also set to TCP_CLOSE, but the update is done
locklessly and unconditionally.
Let's say socket A connect()ed to B, B connect()ed to C, and A disconnects
from B.
After the first two connect()s, all three sockets' sk->sk_state are
TCP_ESTABLISHED:
$ ss -xa
Netid State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
u_dgr ESTAB 0 0 @A 641 * 642
u_dgr ESTAB 0 0 @B 642 * 643
u_dgr ESTAB 0 0 @C 643 * 0
And after the disconnect, B's state is TCP_CLOSE even though it's still
connected to C and C's state is TCP_ESTABLISHED.
$ ss -xa
Netid State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
u_dgr UNCONN 0 0 @A 641 * 0
u_dgr UNCONN 0 0 @B 642 * 643
u_dgr ESTAB 0 0 @C 643 * 0
In this case, we cannot register B to SOCKMAP.
So, when a socket disconnects from the peer, we should not set TCP_CLOSE to
the peer if the peer is connected to yet another socket, and this must be
done under unix_state_lock().
Note that we use WRITE_ONCE() for sk->sk_state as there are many lockless
readers. These data-races will be fixed in the following patches.
Fixes:
|
||
![]() |
b5c0898807 |
af_unix: Remove dead code in unix_stream_read_generic().
When splice() support was added in commit |
||
![]() |
51d1b25a72 |
af_unix: Read sk->sk_hash under bindlock during bind().
syzkaller reported data-race of sk->sk_hash in unix_autobind() [0],
and the same ones exist in unix_bind_bsd() and unix_bind_abstract().
The three bind() functions prefetch sk->sk_hash locklessly and
use it later after validating that unix_sk(sk)->addr is NULL under
unix_sk(sk)->bindlock.
The prefetched sk->sk_hash is the hash value of unbound socket set
in unix_create1() and does not change until bind() completes.
There could be a chance that sk->sk_hash changes after the lockless
read. However, in such a case, non-NULL unix_sk(sk)->addr is visible
under unix_sk(sk)->bindlock, and bind() returns -EINVAL without using
the prefetched value.
The KCSAN splat is false-positive, but let's silence it by reading
sk->sk_hash under unix_sk(sk)->bindlock.
[0]:
BUG: KCSAN: data-race in unix_autobind / unix_autobind
write to 0xffff888034a9fb88 of 4 bytes by task 4468 on cpu 0:
__unix_set_addr_hash net/unix/af_unix.c:331 [inline]
unix_autobind+0x47a/0x7d0 net/unix/af_unix.c:1185
unix_dgram_connect+0x7e3/0x890 net/unix/af_unix.c:1373
__sys_connect_file+0xd7/0xe0 net/socket.c:2048
__sys_connect+0x114/0x140 net/socket.c:2065
__do_sys_connect net/socket.c:2075 [inline]
__se_sys_connect net/socket.c:2072 [inline]
__x64_sys_connect+0x40/0x50 net/socket.c:2072
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x46/0x4e
read to 0xffff888034a9fb88 of 4 bytes by task 4465 on cpu 1:
unix_autobind+0x28/0x7d0 net/unix/af_unix.c:1134
unix_dgram_connect+0x7e3/0x890 net/unix/af_unix.c:1373
__sys_connect_file+0xd7/0xe0 net/socket.c:2048
__sys_connect+0x114/0x140 net/socket.c:2065
__do_sys_connect net/socket.c:2075 [inline]
__se_sys_connect net/socket.c:2072 [inline]
__x64_sys_connect+0x40/0x50 net/socket.c:2072
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x46/0x4e
value changed: 0x000000e4 -> 0x000001e3
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 4465 Comm: syz-executor.0 Not tainted 6.8.0-12822-gcd51db110a7e #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Fixes:
|
||
![]() |
97e1db06c7 |
af_unix: Annotate data-race around unix_sk(sk)->addr.
Once unix_sk(sk)->addr is assigned under net->unx.table.locks and unix_sk(sk)->bindlock, *(unix_sk(sk)->addr) and unix_sk(sk)->path are fully set up, and unix_sk(sk)->addr is never changed. unix_getname() and unix_copy_addr() access the two fields locklessly, and commit |
||
![]() |
66ad4829dd |
Quite smaller than usual. Notably it includes the fix for the unix
regression you have been notified of in the past weeks. The TCP window fix will require some follow-up, already queued. Current release - regressions: - af_unix: fix garbage collection of embryos Previous releases - regressions: - af_unix: fix race between GC and receive path - ipv6: sr: fix missing sk_buff release in seg6_input_core - tcp: remove 64 KByte limit for initial tp->rcv_wnd value - eth: r8169: fix rx hangup - eth: lan966x: remove ptp traps in case the ptp is not enabled. - eth: ixgbe: fix link breakage vs cisco switches. - eth: ice: prevent ethtool from corrupting the channels. Previous releases - always broken: - openvswitch: set the skbuff pkt_type for proper pmtud support. - tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). Misc: - a bunch of selftests stabilization patches. Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmZPXmUSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOk/o4QAJTA/LcQmHkObgQWyJ7vSykhRFmxSsfR Qc/DstWuNkM+xDbasdjlxaM+BPgf0RduyB/bsPOr8UvGw0S0NUwQBC9V9bgQ0p67 D9qrZH6gEDRbzG+mkbF49SXksJMSdNSygWc4YnYaCW+eufpCaZwN15q+4pAgAWfW UmSra9wCkgl9nRc7N4+UEJbhhi0Lso/yaRlHUUUooHOP0ENDe3JSKidUyS3UuhYc Ah75gKIMm9BygUhg/+mrsRyeb1kfXMfJ54ku/uEIimErG4rTntCJCAc+dBoRXtob pImg4xfgr1OBL1wQKTHM+nvhE+DThLAJOSguX2RYvTvklx/l00tL1PQkA/kn6XNM HdQGnDoN1JpUs3xw90hxWp0gzOwJ1XCjbXT/Dx2kp+ltFj0A1EZViTNNTgh6y2E0 B5oo8NFD0y02ilMdaGW/KOpceglO82p2P4DEc0kBAYvCICQ8MKMdtThuubQeB0FK EO7Xs7lKbDXLJUDtmN4EiE1sofvLVD+1htGt5FG2jtizyQ5Ho/b2aTk2uq0kRN3F mZgaXcNR3sOJGBdaTvzquALZ2Dt69w0D3EHGv/30tD5zwQO8j71W5OoWTnjknWUp Nh7ytL/YlqvwJI47UuuTeDBh95jb/KpTWFv8EYsQLI0JOTfa1VXsoDxidg6rnHuX mvLdIOtzTZqU =zd2T -----END PGP SIGNATURE----- Merge tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Quite smaller than usual. Notably it includes the fix for the unix regression from the past weeks. The TCP window fix will require some follow-up, already queued. Current release - regressions: - af_unix: fix garbage collection of embryos Previous releases - regressions: - af_unix: fix race between GC and receive path - ipv6: sr: fix missing sk_buff release in seg6_input_core - tcp: remove 64 KByte limit for initial tp->rcv_wnd value - eth: r8169: fix rx hangup - eth: lan966x: remove ptp traps in case the ptp is not enabled - eth: ixgbe: fix link breakage vs cisco switches - eth: ice: prevent ethtool from corrupting the channels Previous releases - always broken: - openvswitch: set the skbuff pkt_type for proper pmtud support - tcp: Fix shift-out-of-bounds in dctcp_update_alpha() Misc: - a bunch of selftests stabilization patches" * tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits) r8169: Fix possible ring buffer corruption on fragmented Tx packets. idpf: Interpret .set_channels() input differently ice: Interpret .set_channels() input differently nfc: nci: Fix handling of zero-length payload packets in nci_rx_work() net: relax socket state check at accept time. tcp: remove 64 KByte limit for initial tp->rcv_wnd value net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe() tls: fix missing memory barrier in tls_init net: fec: avoid lock evasion when reading pps_enable Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" testing: net-drv: use stats64 for testing net: mana: Fix the extra HZ in mana_hwc_send_request net: lan966x: Remove ptp traps in case the ptp is not enabled. openvswitch: Set the skbuff pkt_type for proper pmtud support. selftest: af_unix: Make SCM_RIGHTS into OOB data. af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). selftests/net: use tc rule to filter the na packet ipv6: sr: fix memleak in seg6_hmac_init_algo af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. ... |
||
![]() |
9841991a44 |
af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
Billy Jheng Bing-Jhong reported a race between __unix_gc() and
queue_oob().
__unix_gc() tries to garbage-collect close()d inflight sockets,
and then if the socket has MSG_OOB in unix_sk(sk)->oob_skb, GC
will drop the reference and set NULL to it locklessly.
However, the peer socket still can send MSG_OOB message and
queue_oob() can update unix_sk(sk)->oob_skb concurrently, leading
NULL pointer dereference. [0]
To fix the issue, let's update unix_sk(sk)->oob_skb under the
sk_receive_queue's lock and take it everywhere we touch oob_skb.
Note that we defer kfree_skb() in manage_oob() to silence lockdep
false-positive (See [1]).
[0]:
BUG: kernel NULL pointer dereference, address: 0000000000000008
PF: supervisor write access in kernel mode
PF: error_code(0x0002) - not-present page
PGD 8000000009f5e067 P4D 8000000009f5e067 PUD 9f5d067 PMD 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc5-00191-gd091e579b864 #110
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: events delayed_fput
RIP: 0010:skb_dequeue (./include/linux/skbuff.h:2386 ./include/linux/skbuff.h:2402 net/core/skbuff.c:3847)
Code: 39 e3 74 3e 8b 43 10 48 89 ef 83 e8 01 89 43 10 49 8b 44 24 08 49 c7 44 24 08 00 00 00 00 49 8b 14 24 49 c7 04 24 00 00 00 00 <48> 89 42 08 48 89 10 e8 e7 c5 42 00 4c 89 e0 5b 5d 41 5c c3 cc cc
RSP: 0018:ffffc900001bfd48 EFLAGS: 00000002
RAX: 0000000000000000 RBX: ffff8880088f5ae8 RCX: 00000000361289f9
RDX: 0000000000000000 RSI: 0000000000000206 RDI: ffff8880088f5b00
RBP: ffff8880088f5b00 R08: 0000000000080000 R09: 0000000000000001
R10: 0000000000000003 R11: 0000000000000001 R12: ffff8880056b6a00
R13: ffff8880088f5280 R14: 0000000000000001 R15: ffff8880088f5a80
FS: 0000000000000000(0000) GS:ffff88807dd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000006314000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
<TASK>
unix_release_sock (net/unix/af_unix.c:654)
unix_release (net/unix/af_unix.c:1050)
__sock_release (net/socket.c:660)
sock_close (net/socket.c:1423)
__fput (fs/file_table.c:423)
delayed_fput (fs/file_table.c:444 (discriminator 3))
process_one_work (kernel/workqueue.c:3259)
worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
kthread (kernel/kthread.c:388)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
</TASK>
Modules linked in:
CR2: 0000000000000008
Link: https://lore.kernel.org/netdev/a00d3993-c461-43f2-be6d-07259c98509a@rbox.co/ [1]
Fixes:
|
||
![]() |
89721e3038 |
net-accept-more-20240515
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmZFcFwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuP1EADKJOJRtcvO/av2cUR+HZFDC+s/jBwHIJK+ 4UY633zQlxjqc7dt4rX7zk/uk4mkhZnsGY+wS6xH08kB3VO9YksrwREVt6Ur9lP8 UXNVPpPcZ7fFcIp41rYkZX9pTDp2N8z2qsVg7V8wcXJ7EeTXd6L4ZLjhfCiHvs2s i6yIEwLrW+voYuqSFV7vWBIM3mSXSRTIiO2DqRAOtT2lsj374DOthvP2lOSSb5wq 6TF4s4z3HMGs+HF3rjP5kJ6ic6RdC6i31lzEivUMhwCiKN1AZXdp96KXaC+NVPRV t5//EdS+pSenQgkg6XH7d5kzFoCUFJfVZt05w0GCqMA081Q9ySjUymN1zedJbGd9 8CDlW01N8XLqG6+F9yakJLSFY+mUFGduPuueTNiUJWP8kTkQCtYIRzZDeyjxQrE5 c17NW5S1uWkf26Ucyi1r+gxw9N4kGkuB3+NitC6DOc7BW5CocEIoqLWi/UH7cEZe 0v6loTakqBAdgh03RCDMUj9Rt/37pQs2KFT9/CazVpbkvkKsue4xK4K2CUFsxqOj qcoc/LD62at4S3AUWwhUIs3YaQ7v/6AY5hIktqAwsFHmDffUbPdRrXWY1keKIprJ 4qS/sY0M+kvKGnp+80fPVHab9l6/fMLfabIyFuh0M3W/M4eHGt2YfKWreoGEy/1x xLq2iq+ehw== =S6Xt -----END PGP SIGNATURE----- Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux Pull more io_uring updates from Jens Axboe: "This adds support for IORING_CQE_F_SOCK_NONEMPTY for io_uring accept requests. This is very similar to previous work that enabled the same hint for doing receives on sockets. By far the majority of the work here is refactoring to enable the networking side to pass back whether or not the socket had more pending requests after accepting the current one, the last patch just wires it up for io_uring. Not only does this enable applications to know whether there are more connections to accept right now, it also enables smarter logic for io_uring multishot accept on whether to retry immediately or wait for a poll trigger" * tag 'net-accept-more-20240515' of git://git.kernel.dk/linux: io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for accept net: pass back whether socket was empty post accept net: have do_accept() take a struct proto_accept_arg argument net: change proto and proto_ops accept type |
||
![]() |
654de42f3f |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.10 net-next PR. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
92ef0fd55a |
net: change proto and proto_ops accept type
Rather than pass in flags, error pointer, and whether this is a kernel invocation or not, add a struct proto_accept_arg struct as the argument. This then holds all of these arguments, and prepares accept for being able to pass back more information. No functional changes in this patch. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
540bf24fba |
af_unix: Fix data races in unix_release_sock/unix_stream_sendmsg
A data-race condition has been identified in af_unix. In one data path, the write function unix_release_sock() atomically writes to sk->sk_shutdown using WRITE_ONCE. However, on the reader side, unix_stream_sendmsg() does not read it atomically. Consequently, this issue is causing the following KCSAN splat to occur: BUG: KCSAN: data-race in unix_release_sock / unix_stream_sendmsg write (marked) to 0xffff88867256ddbb of 1 bytes by task 7270 on cpu 28: unix_release_sock (net/unix/af_unix.c:640) unix_release (net/unix/af_unix.c:1050) sock_close (net/socket.c:659 net/socket.c:1421) __fput (fs/file_table.c:422) __fput_sync (fs/file_table.c:508) __se_sys_close (fs/open.c:1559 fs/open.c:1541) __x64_sys_close (fs/open.c:1541) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) read to 0xffff88867256ddbb of 1 bytes by task 989 on cpu 14: unix_stream_sendmsg (net/unix/af_unix.c:2273) __sock_sendmsg (net/socket.c:730 net/socket.c:745) ____sys_sendmsg (net/socket.c:2584) __sys_sendmmsg (net/socket.c:2638 net/socket.c:2724) __x64_sys_sendmmsg (net/socket.c:2753 net/socket.c:2750 net/socket.c:2750) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) value changed: 0x01 -> 0x03 The line numbers are related to commit |
||
![]() |
41e3ddb291 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: include/trace/events/rpcgss.h |
||
![]() |
fd86344823 |
af_unix: Try not to hold unix_gc_lock during accept().
Commit
|
||
![]() |
22dd70eb2c |
af_unix: Don't peek OOB data without MSG_OOB.
Currently, we can read OOB data without MSG_OOB by using MSG_PEEK
when OOB data is sitting on the front row, which is apparently
wrong.
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> c1.send(b'a', MSG_OOB)
1
>>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
b'a'
If manage_oob() is called when no data has been copied, we only
check if the socket enables SO_OOBINLINE or MSG_PEEK is not used.
Otherwise, the skb is returned as is.
However, here we should return NULL if MSG_PEEK is set and no data
has been copied.
Also, in such a case, we should not jump to the redo label because
we will be caught in the loop and hog the CPU until normal data
comes in.
Then, we need to handle skb == NULL case with the if-clause below
the manage_oob() block.
With this patch:
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> c1.send(b'a', MSG_OOB)
1
>>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BlockingIOError: [Errno 11] Resource temporarily unavailable
Fixes:
|
||
![]() |
283454c8a1 |
af_unix: Call manage_oob() for every skb in unix_stream_read_generic().
When we call recv() for AF_UNIX socket, we first peek one skb and
calls manage_oob() to check if the skb is sent with MSG_OOB.
However, when we fetch the next (and the following) skb, manage_oob()
is not called now, leading a wrong behaviour.
Let's say a socket send()s "hello" with MSG_OOB and the peer tries
to recv() 5 bytes with MSG_PEEK. Here, we should get only "hell"
without 'o', but actually not:
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> c1.send(b'hello', MSG_OOB)
5
>>> c2.recv(5, MSG_PEEK)
b'hello'
The first skb fills 4 bytes, and the next skb is peeked but not
properly checked by manage_oob().
Let's move up the again label to call manage_oob() for evry skb.
With this patch:
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> c1.send(b'hello', MSG_OOB)
5
>>> c2.recv(5, MSG_PEEK)
b'hell'
Fixes:
|
||
![]() |
94426ed213 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: net/unix/garbage.c |
||
![]() |
b46f4eaa4f |
af_unix: Clear stale u->oob_skb.
syzkaller started to report deadlock of unix_gc_lock after commit |
||
![]() |
118f457da9 |
af_unix: Remove lock dance in unix_peek_fds().
In the previous GC implementation, the shape of the inflight socket graph was not expected to change while GC was in progress. MSG_PEEK was tricky because it could install inflight fd silently and transform the graph. Let's say we peeked a fd, which was a listening socket, and accept()ed some embryo sockets from it. The garbage collection algorithm would have been confused because the set of sockets visited in scan_inflight() would change within the same GC invocation. That's why we placed spin_lock(&unix_gc_lock) and spin_unlock() in unix_peek_fds() with a fat comment. In the new GC implementation, we no longer garbage-collect the socket if it exists in another queue, that is, if it has a bridge to another SCC. Also, accept() will require the lock if it has edges. Thus, we need not do the complicated lock dance. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240401173125.92184-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
7c349ed090 |
af_unix: Remove scm_fp_dup() in unix_attach_fds().
When we passed fds, we used to bump each file's refcount twice
in scm_fp_copy() and scm_fp_dup() before linking the socket to
gc_inflight_list.
This is because we incremented the inflight count of the socket
and linked it to the list in advance before passing skb to the
destination socket.
Otherwise, the inflight socket could have been garbage-collected
in a small race window between linking the socket to the list and
queuing skb:
CPU 1 : sendmsg(X) w/ A's fd CPU 2 : close(A)
----- -----
/* Here A's refcount is 1, and inflight count is 0 */
bump A's refcount to 2 in scm_fp_copy()
bump A's inflight count to 1
link A to gc_inflight_list
decrement A's refcount to 1
/* A's refcount == inflight count, thus A could be GC candidate */
start GC
mark A as candidate
purge A's receive queue
queue skb w/ A's fd to X
/* A is queued, but all data has been lost */
After commit
|
||
![]() |
1abe267f17 |
net: add sk_wake_async_rcu() helper
While looking at UDP receive performance, I saw sk_wake_async() was no longer inlined. This matters at least on AMD Zen1-4 platforms (see SRSO) This might be because rcu_read_lock() and rcu_read_unlock() are no longer nops in recent kernels ? Add sk_wake_async_rcu() variant, which must be called from contexts already holding rcu lock. As SOCK_FASYNC is deprecated in modern days, use unlikely() to give a hint to the compiler. sk_wake_async_rcu() is properly inlined from __udp_enqueue_schedule_skb() and sock_def_readable(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240328144032.1864988-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
4090fa373f |
af_unix: Replace garbage collection algorithm.
If we find a dead SCC during iteration, we call unix_collect_skb() to splice all skb in the SCC to the global sk_buff_head, hitlist. After iterating all SCC, we unlock unix_gc_lock and purge the queue. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-15-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
dcf70df204 |
af_unix: Fix up unix_edge.successor for embryo socket.
To garbage collect inflight AF_UNIX sockets, we must define the cyclic reference appropriately. This is a bit tricky if the loop consists of embryo sockets. Suppose that the fd of AF_UNIX socket A is passed to D and the fd B to C and that C and D are embryo sockets of A and B, respectively. It may appear that there are two separate graphs, A (-> D) and B (-> C), but this is not correct. A --. .-- B X C <-' `-> D Now, D holds A's refcount, and C has B's refcount, so unix_release() will never be called for A and B when we close() them. However, no one can call close() for D and C to free skbs holding refcounts of A and B because C/D is in A/B's receive queue, which should have been purged by unix_release() for A and B. So, here's another type of cyclic reference. When a fd of an AF_UNIX socket is passed to an embryo socket, the reference is indirectly held by its parent listening socket. .-> A .-> B | `- sk_receive_queue | `- sk_receive_queue | `- skb | `- skb | `- sk == C | `- sk == D | `- sk_receive_queue | `- sk_receive_queue | `- skb +---------' `- skb +-. | | `---------------------------------------------------------' Technically, the graph must be denoted as A <-> B instead of A (-> D) and B (-> C) to find such a cyclic reference without touching each socket's receive queue. .-> A --. .-- B <-. | X | == A <-> B `-- C <-' `-> D --' We apply this fixup during GC by fetching the real successor by unix_edge_successor(). When we call accept(), we clear unix_sock.listener under unix_gc_lock not to confuse GC. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
aed6ecef55 |
af_unix: Save listener for embryo socket.
This is a prep patch for the following change, where we need to fetch the listening socket from the successor embryo socket during GC. We add a new field to struct unix_sock to save a pointer to a listening socket. We set it when connect() creates a new socket, and clear it when accept() is called. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
42f298c06b |
af_unix: Link struct unix_edge when queuing skb.
Just before queuing skb with inflight fds, we call scm_stat_add(), which is a good place to set up the preallocated struct unix_vertex and struct unix_edge in UNIXCB(skb).fp. Then, we call unix_add_edges() and construct the directed graph as follows: 1. Set the inflight socket's unix_sock to unix_edge.predecessor. 2. Set the receiver's unix_sock to unix_edge.successor. 3. Set the preallocated vertex to inflight socket's unix_sock.vertex. 4. Link inflight socket's unix_vertex.entry to unix_unvisited_vertices. 5. Link unix_edge.vertex_entry to the inflight socket's unix_vertex.edges. Let's say we pass the fd of AF_UNIX socket A to B and the fd of B to C. The graph looks like this: +-------------------------+ | unix_unvisited_vertices | <-------------------------. +-------------------------+ | + | | +--------------+ +--------------+ | +--------------+ | | unix_sock A | <---. .---> | unix_sock B | <-|-. .---> | unix_sock C | | +--------------+ | | +--------------+ | | | +--------------+ | .-+ | vertex | | | .-+ | vertex | | | | | vertex | | | +--------------+ | | | +--------------+ | | | +--------------+ | | | | | | | | | | +--------------+ | | | +--------------+ | | | | '-> | unix_vertex | | | '-> | unix_vertex | | | | | +--------------+ | | +--------------+ | | | `---> | entry | +---------> | entry | +-' | | |--------------| | | |--------------| | | | edges | <-. | | | edges | <-. | | +--------------+ | | | +--------------+ | | | | | | | | | .----------------------' | | .----------------------' | | | | | | | | | +--------------+ | | | +--------------+ | | | | unix_edge | | | | | unix_edge | | | | +--------------+ | | | +--------------+ | | `-> | vertex_entry | | | `-> | vertex_entry | | | |--------------| | | |--------------| | | | predecessor | +---' | | predecessor | +---' | |--------------| | |--------------| | | successor | +-----' | successor | +-----' +--------------+ +--------------+ Henceforth, we denote such a graph as A -> B (-> C). Now, we can express all inflight fd graphs that do not contain embryo sockets. We will support the particular case later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
1fbfdfaa59 |
af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd.
We will replace the garbage collection algorithm for AF_UNIX, where we will consider each inflight AF_UNIX socket as a vertex and its file descriptor as an edge in a directed graph. This patch introduces a new struct unix_vertex representing a vertex in the graph and adds its pointer to struct unix_sock. When we send a fd using the SCM_RIGHTS message, we allocate struct scm_fp_list to struct scm_cookie in scm_fp_copy(). Then, we bump each refcount of the inflight fds' struct file and save them in scm_fp_list.fp. After that, unix_attach_fds() inexplicably clones scm_fp_list of scm_cookie and sets it to skb. (We will remove this part after replacing GC.) Here, we add a new function call in unix_attach_fds() to preallocate struct unix_vertex per inflight AF_UNIX fd and link each vertex to skb's scm_fp_list.vertices. When sendmsg() succeeds later, if the socket of the inflight fd is still not inflight yet, we will set the preallocated vertex to struct unix_sock.vertex and link it to a global list unix_unvisited_vertices under spin_lock(&unix_gc_lock). If the socket is already inflight, we free the preallocated vertex. This is to avoid taking the lock unnecessarily when sendmsg() could fail later. In the following patch, we will similarly allocate another struct per edge, which will finally be linked to the inflight socket's unix_vertex.edges. And then, we will count the number of edges as unix_vertex.out_degree. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
fecc51559a |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/udp.c |
||
![]() |
56667da739 |
net: implement lockless setsockopt(SO_PEEK_OFF)
syzbot reported a lockdep violation [1] involving af_unix
support of SO_PEEK_OFF.
Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket
sk_peek_off field), there is really no point to enforce a pointless
thread safety in the kernel.
After this patch :
- setsockopt(SO_PEEK_OFF) no longer acquires the socket lock.
- skb_consume_udp() no longer has to acquire the socket lock.
- af_unix no longer needs a special version of sk_set_peek_off(),
because it does not lock u->iolock anymore.
As a followup, we could replace prot->set_peek_off to be a boolean
and avoid an indirect call, since we always use sk_set_peek_off().
[1]
WARNING: possible circular locking dependency detected
6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Not tainted
syz-executor.2/30025 is trying to acquire lock:
ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
but task is already holding lock:
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}:
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
lock_sock_nested+0x48/0x100 net/core/sock.c:3524
lock_sock include/net/sock.h:1691 [inline]
__unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415
sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046
____sys_recvmsg+0x3c0/0x470 net/socket.c:2801
___sys_recvmsg net/socket.c:2845 [inline]
do_recvmmsg+0x474/0xae0 net/socket.c:2939
__sys_recvmmsg net/socket.c:3018 [inline]
__do_sys_recvmmsg net/socket.c:3041 [inline]
__se_sys_recvmmsg net/socket.c:3034 [inline]
__x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
-> #0 (&u->iolock){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
sk_setsockopt+0x207e/0x3360
do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
__sys_setsockopt+0x1ad/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_UNIX);
lock(&u->iolock);
lock(sk_lock-AF_UNIX);
lock(&u->iolock);
*** DEADLOCK ***
1 lock held by syz-executor.2/30025:
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193
stack backtrace:
CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
sk_setsockopt+0x207e/0x3360
do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
__sys_setsockopt+0x1ad/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f78a1c7dda9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9
RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006
RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000
R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8
Fixes:
|
||
![]() |
cf244463a2 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
4d322dce82 |
af_unix: fix lockdep positive in sk_diag_dump_icons()
syzbot reported a lockdep splat [1].
Blamed commit hinted about the possible lockdep
violation, and code used unix_state_lock_nested()
in an attempt to silence lockdep.
It is not sufficient, because unix_state_lock_nested()
is already used from unix_state_double_lock().
We need to use a separate subclass.
This patch adds a distinct enumeration to make things
more explicit.
Also use swap() in unix_state_double_lock() as a clean up.
v2: add a missing inline keyword to unix_state_lock_nested()
[1]
WARNING: possible circular locking dependency detected
6.8.0-rc1-syzkaller-00356-g8a696a29c690 #0 Not tainted
syz-executor.1/2542 is trying to acquire lock:
ffff88808b5df9e8 (rlock-AF_UNIX){+.+.}-{2:2}, at: skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863
but task is already holding lock:
ffff88808b5dfe70 (&u->lock/1){+.+.}-{2:2}, at: unix_dgram_sendmsg+0xfc7/0x2200 net/unix/af_unix.c:2089
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&u->lock/1){+.+.}-{2:2}:
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
_raw_spin_lock_nested+0x31/0x40 kernel/locking/spinlock.c:378
sk_diag_dump_icons net/unix/diag.c:87 [inline]
sk_diag_fill+0x6ea/0xfe0 net/unix/diag.c:157
sk_diag_dump net/unix/diag.c:196 [inline]
unix_diag_dump+0x3e9/0x630 net/unix/diag.c:220
netlink_dump+0x5c1/0xcd0 net/netlink/af_netlink.c:2264
__netlink_dump_start+0x5d7/0x780 net/netlink/af_netlink.c:2370
netlink_dump_start include/linux/netlink.h:338 [inline]
unix_diag_handler_dump+0x1c3/0x8f0 net/unix/diag.c:319
sock_diag_rcv_msg+0xe3/0x400
netlink_rcv_skb+0x1df/0x430 net/netlink/af_netlink.c:2543
sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:280
netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
netlink_unicast+0x7e6/0x980 net/netlink/af_netlink.c:1367
netlink_sendmsg+0xa37/0xd70 net/netlink/af_netlink.c:1908
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
sock_write_iter+0x39a/0x520 net/socket.c:1160
call_write_iter include/linux/fs.h:2085 [inline]
new_sync_write fs/read_write.c:497 [inline]
vfs_write+0xa74/0xca0 fs/read_write.c:590
ksys_write+0x1a0/0x2c0 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
-> #0 (rlock-AF_UNIX){+.+.}-{2:2}:
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x1909/0x5ab0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863
unix_dgram_sendmsg+0x15d9/0x2200 net/unix/af_unix.c:2112
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
____sys_sendmsg+0x592/0x890 net/socket.c:2584
___sys_sendmsg net/socket.c:2638 [inline]
__sys_sendmmsg+0x3b2/0x730 net/socket.c:2724
__do_sys_sendmmsg net/socket.c:2753 [inline]
__se_sys_sendmmsg net/socket.c:2750 [inline]
__x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2750
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&u->lock/1);
lock(rlock-AF_UNIX);
lock(&u->lock/1);
lock(rlock-AF_UNIX);
*** DEADLOCK ***
1 lock held by syz-executor.1/2542:
#0: ffff88808b5dfe70 (&u->lock/1){+.+.}-{2:2}, at: unix_dgram_sendmsg+0xfc7/0x2200 net/unix/af_unix.c:2089
stack backtrace:
CPU: 1 PID: 2542 Comm: syz-executor.1 Not tainted 6.8.0-rc1-syzkaller-00356-g8a696a29c690 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
check_noncircular+0x366/0x490 kernel/locking/lockdep.c:2187
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x1909/0x5ab0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863
unix_dgram_sendmsg+0x15d9/0x2200 net/unix/af_unix.c:2112
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
____sys_sendmsg+0x592/0x890 net/socket.c:2584
___sys_sendmsg net/socket.c:2638 [inline]
__sys_sendmmsg+0x3b2/0x730 net/socket.c:2724
__do_sys_sendmmsg net/socket.c:2753 [inline]
__se_sys_sendmmsg net/socket.c:2750 [inline]
__x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2750
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f26d887cda9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f26d95a60c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00007f26d89abf80 RCX: 00007f26d887cda9
RDX: 000000000000003e RSI: 00000000200bd000 RDI: 0000000000000004
RBP: 00007f26d88c947a R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000008c0 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007f26d89abf80 R15: 00007ffcfe081a68
Fixes:
|
||
![]() |
99a7a5b994 |
af_unix: Remove CONFIG_UNIX_SCM.
Originally, the code related to garbage collection was all in garbage.c. Commit |
||
![]() |
d9f21b3613 |
af_unix: Try to run GC async.
If more than 16000 inflight AF_UNIX sockets exist and the garbage collector is not running, unix_(dgram|stream)_sendmsg() call unix_gc(). Also, they wait for unix_gc() to complete. In unix_gc(), all inflight AF_UNIX sockets are traversed at least once, and more if they are the GC candidate. Thus, sendmsg() significantly slows down with too many inflight AF_UNIX sockets. However, if a process sends data with no AF_UNIX FD, the sendmsg() call does not need to wait for GC. After this change, only the process that meets the condition below will be blocked under such a situation. 1) cmsg contains AF_UNIX socket 2) more than 32 AF_UNIX sent by the same user are still inflight Note that even a sendmsg() call that does not meet the condition but has AF_UNIX FD will be blocked later in unix_scm_to_skb() by the spinlock, but we allow that as a bonus for sane users. The results below are the time spent in unix_dgram_sendmsg() sending 1 byte of data with no FD 4096 times on a host where 32K inflight AF_UNIX sockets exist. Without series: the sane sendmsg() needs to wait gc unreasonably. $ sudo /usr/share/bcc/tools/funclatency -p 11165 unix_dgram_sendmsg Tracing 1 functions for "unix_dgram_sendmsg"... Hit Ctrl-C to end. ^C nsecs : count distribution [...] 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 3881 |****************************************| 2097152 -> 4194303 : 214 |** | 4194304 -> 8388607 : 1 | | avg = 1825567 nsecs, total: 7477526027 nsecs, count: 4096 With series: the sane sendmsg() can finish much faster. $ sudo /usr/share/bcc/tools/funclatency -p 8702 unix_dgram_sendmsg Tracing 1 functions for "unix_dgram_sendmsg"... Hit Ctrl-C to end. ^C nsecs : count distribution [...] 128 -> 255 : 0 | | 256 -> 511 : 4092 |****************************************| 512 -> 1023 : 2 | | 1024 -> 2047 : 0 | | 2048 -> 4095 : 0 | | 4096 -> 8191 : 1 | | 8192 -> 16383 : 1 | | avg = 410 nsecs, total: 1680510 nsecs, count: 4096 Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240123170856.41348-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
97af84a6bb |
af_unix: Do not use atomic ops for unix_sk(sk)->inflight.
When touching unix_sk(sk)->inflight, we are always under spin_lock(&unix_gc_lock). Let's convert unix_sk(sk)->inflight to the normal unsigned long. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240123170856.41348-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
8866730aed |
bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
AF_UNIX stream sockets are a paired socket. So sending on one of the pairs
will lookup the paired socket as part of the send operation. It is possible
however to put just one of the pairs in a BPF map. This currently increments
the refcnt on the sock in the sockmap to ensure it is not free'd by the
stack before sockmap cleans up its state and stops any skbs being sent/recv'd
to that socket.
But we missed a case. If the peer socket is closed it will be free'd by the
stack. However, the paired socket can still be referenced from BPF sockmap
side because we hold a reference there. Then if we are sending traffic through
BPF sockmap to that socket it will try to dereference the free'd pair in its
send logic creating a use after free. And following splat:
[59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
[59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
[...]
[59.905468] Call Trace:
[59.905787] <TASK>
[59.906066] dump_stack_lvl+0x130/0x1d0
[59.908877] print_report+0x16f/0x740
[59.910629] kasan_report+0x118/0x160
[59.912576] sk_wake_async+0x31/0x1b0
[59.913554] sock_def_readable+0x156/0x2a0
[59.914060] unix_stream_sendmsg+0x3f9/0x12a0
[59.916398] sock_sendmsg+0x20e/0x250
[59.916854] skb_send_sock+0x236/0xac0
[59.920527] sk_psock_backlog+0x287/0xaa0
To fix let BPF sockmap hold a refcnt on both the socket in the sockmap and its
paired socket. It wasn't obvious how to contain the fix to bpf_unix logic. The
primarily problem with keeping this logic in bpf_unix was: In the sock close()
we could handle the deref by having a close handler. But, when we are destroying
the psock through a map delete operation we wouldn't have gotten any signal
thorugh the proto struct other than it being replaced. If we do the deref from
the proto replace its too early because we need to deref the sk_pair after the
backlog worker has been stopped.
Given all this it seems best to just cache it at the end of the psock and eat 8B
for the af_unix and vsock users. Notice dgram sockets are OK because they handle
locking already.
Fixes:
|
||
![]() |
4b7b492615 |
af_unix: fix use-after-free in unix_stream_read_actor()
syzbot reported the following crash [1]
After releasing unix socket lock, u->oob_skb can be changed
by another thread. We must temporarily increase skb refcount
to make sure this other thread will not free the skb under us.
[1]
BUG: KASAN: slab-use-after-free in unix_stream_read_actor+0xa7/0xc0 net/unix/af_unix.c:2866
Read of size 4 at addr ffff88801f3b9cc4 by task syz-executor107/5297
CPU: 1 PID: 5297 Comm: syz-executor107 Not tainted 6.6.0-syzkaller-15910-gb8e3a87a627b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0xc4/0x620 mm/kasan/report.c:475
kasan_report+0xda/0x110 mm/kasan/report.c:588
unix_stream_read_actor+0xa7/0xc0 net/unix/af_unix.c:2866
unix_stream_recv_urg net/unix/af_unix.c:2587 [inline]
unix_stream_read_generic+0x19a5/0x2480 net/unix/af_unix.c:2666
unix_stream_recvmsg+0x189/0x1b0 net/unix/af_unix.c:2903
sock_recvmsg_nosec net/socket.c:1044 [inline]
sock_recvmsg+0xe2/0x170 net/socket.c:1066
____sys_recvmsg+0x21f/0x5c0 net/socket.c:2803
___sys_recvmsg+0x115/0x1a0 net/socket.c:2845
__sys_recvmsg+0x114/0x1e0 net/socket.c:2875
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7fc67492c559
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fc6748ab228 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
RAX: ffffffffffffffda RBX: 000000000000001c RCX: 00007fc67492c559
RDX: 0000000040010083 RSI: 0000000020000140 RDI: 0000000000000004
RBP: 00007fc6749b6348 R08: 00007fc6748ab6c0 R09: 00007fc6748ab6c0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc6749b6340
R13: 00007fc6749b634c R14: 00007ffe9fac52a0 R15: 00007ffe9fac5388
</TASK>
Allocated by task 5295:
kasan_save_stack+0x33/0x50 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
__kasan_slab_alloc+0x81/0x90 mm/kasan/common.c:328
kasan_slab_alloc include/linux/kasan.h:188 [inline]
slab_post_alloc_hook mm/slab.h:763 [inline]
slab_alloc_node mm/slub.c:3478 [inline]
kmem_cache_alloc_node+0x180/0x3c0 mm/slub.c:3523
__alloc_skb+0x287/0x330 net/core/skbuff.c:641
alloc_skb include/linux/skbuff.h:1286 [inline]
alloc_skb_with_frags+0xe4/0x710 net/core/skbuff.c:6331
sock_alloc_send_pskb+0x7e4/0x970 net/core/sock.c:2780
sock_alloc_send_skb include/net/sock.h:1884 [inline]
queue_oob net/unix/af_unix.c:2147 [inline]
unix_stream_sendmsg+0xb5f/0x10a0 net/unix/af_unix.c:2301
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0xd5/0x180 net/socket.c:745
____sys_sendmsg+0x6ac/0x940 net/socket.c:2584
___sys_sendmsg+0x135/0x1d0 net/socket.c:2638
__sys_sendmsg+0x117/0x1e0 net/socket.c:2667
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Freed by task 5295:
kasan_save_stack+0x33/0x50 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x40 mm/kasan/generic.c:522
____kasan_slab_free mm/kasan/common.c:236 [inline]
____kasan_slab_free+0x15b/0x1b0 mm/kasan/common.c:200
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1800 [inline]
slab_free_freelist_hook+0x114/0x1e0 mm/slub.c:1826
slab_free mm/slub.c:3809 [inline]
kmem_cache_free+0xf8/0x340 mm/slub.c:3831
kfree_skbmem+0xef/0x1b0 net/core/skbuff.c:1015
__kfree_skb net/core/skbuff.c:1073 [inline]
consume_skb net/core/skbuff.c:1288 [inline]
consume_skb+0xdf/0x170 net/core/skbuff.c:1282
queue_oob net/unix/af_unix.c:2178 [inline]
unix_stream_sendmsg+0xd49/0x10a0 net/unix/af_unix.c:2301
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0xd5/0x180 net/socket.c:745
____sys_sendmsg+0x6ac/0x940 net/socket.c:2584
___sys_sendmsg+0x135/0x1d0 net/socket.c:2638
__sys_sendmsg+0x117/0x1e0 net/socket.c:2667
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
The buggy address belongs to the object at ffff88801f3b9c80
which belongs to the cache skbuff_head_cache of size 240
The buggy address is located 68 bytes inside of
freed 240-byte region [ffff88801f3b9c80, ffff88801f3b9d70)
The buggy address belongs to the physical page:
page:ffffea00007cee40 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1f3b9
flags: 0xfff00000000800(slab|node=0|zone=1|lastcpupid=0x7ff)
page_type: 0xffffffff()
raw: 00fff00000000800 ffff888142a60640 dead000000000122 0000000000000000
raw: 0000000000000000 00000000000c000c 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 5299, tgid 5283 (syz-executor107), ts 103803840339, free_ts 103600093431
set_page_owner include/linux/page_owner.h:31 [inline]
post_alloc_hook+0x2cf/0x340 mm/page_alloc.c:1537
prep_new_page mm/page_alloc.c:1544 [inline]
get_page_from_freelist+0xa25/0x36c0 mm/page_alloc.c:3312
__alloc_pages+0x1d0/0x4a0 mm/page_alloc.c:4568
alloc_pages_mpol+0x258/0x5f0 mm/mempolicy.c:2133
alloc_slab_page mm/slub.c:1870 [inline]
allocate_slab+0x251/0x380 mm/slub.c:2017
new_slab mm/slub.c:2070 [inline]
___slab_alloc+0x8c7/0x1580 mm/slub.c:3223
__slab_alloc.constprop.0+0x56/0xa0 mm/slub.c:3322
__slab_alloc_node mm/slub.c:3375 [inline]
slab_alloc_node mm/slub.c:3468 [inline]
kmem_cache_alloc_node+0x132/0x3c0 mm/slub.c:3523
__alloc_skb+0x287/0x330 net/core/skbuff.c:641
alloc_skb include/linux/skbuff.h:1286 [inline]
alloc_skb_with_frags+0xe4/0x710 net/core/skbuff.c:6331
sock_alloc_send_pskb+0x7e4/0x970 net/core/sock.c:2780
sock_alloc_send_skb include/net/sock.h:1884 [inline]
queue_oob net/unix/af_unix.c:2147 [inline]
unix_stream_sendmsg+0xb5f/0x10a0 net/unix/af_unix.c:2301
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0xd5/0x180 net/socket.c:745
____sys_sendmsg+0x6ac/0x940 net/socket.c:2584
___sys_sendmsg+0x135/0x1d0 net/socket.c:2638
__sys_sendmsg+0x117/0x1e0 net/socket.c:2667
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1137 [inline]
free_unref_page_prepare+0x4f8/0xa90 mm/page_alloc.c:2347
free_unref_page+0x33/0x3b0 mm/page_alloc.c:2487
__unfreeze_partials+0x21d/0x240 mm/slub.c:2655
qlink_free mm/kasan/quarantine.c:168 [inline]
qlist_free_all+0x6a/0x170 mm/kasan/quarantine.c:187
kasan_quarantine_reduce+0x18e/0x1d0 mm/kasan/quarantine.c:294
__kasan_slab_alloc+0x65/0x90 mm/kasan/common.c:305
kasan_slab_alloc include/linux/kasan.h:188 [inline]
slab_post_alloc_hook mm/slab.h:763 [inline]
slab_alloc_node mm/slub.c:3478 [inline]
slab_alloc mm/slub.c:3486 [inline]
__kmem_cache_alloc_lru mm/slub.c:3493 [inline]
kmem_cache_alloc+0x15d/0x380 mm/slub.c:3502
vm_area_dup+0x21/0x2f0 kernel/fork.c:500
__split_vma+0x17d/0x1070 mm/mmap.c:2365
split_vma mm/mmap.c:2437 [inline]
vma_modify+0x25d/0x450 mm/mmap.c:2472
vma_modify_flags include/linux/mm.h:3271 [inline]
mprotect_fixup+0x228/0xc80 mm/mprotect.c:635
do_mprotect_pkey+0x852/0xd60 mm/mprotect.c:809
__do_sys_mprotect mm/mprotect.c:830 [inline]
__se_sys_mprotect mm/mprotect.c:827 [inline]
__x64_sys_mprotect+0x78/0xb0 mm/mprotect.c:827
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Memory state around the buggy address:
ffff88801f3b9b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88801f3b9c00: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88801f3b9c80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88801f3b9d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
ffff88801f3b9d80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
Fixes:
|
||
![]() |
3a04927f8d |
af_unix: Remove module remnants.
Since commit
|
||
![]() |
859051dd16 |
bpf: Implement cgroup sockaddr hooks for unix sockets
These hooks allows intercepting connect(), getsockname(), getpeername(), sendmsg() and recvmsg() for unix sockets. The unix socket hooks get write access to the address length because the address length is not fixed when dealing with unix sockets and needs to be modified when a unix socket address is modified by the hook. Because abstract socket unix addresses start with a NUL byte, we cannot recalculate the socket address in kernelspace after running the hook by calculating the length of the unix socket path using strlen(). These hooks can be used when users want to multiplex syscall to a single unix socket to multiple different processes behind the scenes by redirecting the connect() and other syscalls to process specific sockets. We do not implement support for intercepting bind() because when using bind() with unix sockets with a pathname address, this creates an inode in the filesystem which must be cleaned up. If we rewrite the address, the user might try to clean up the wrong file, leaking the socket in the filesystem where it is never cleaned up. Until we figure out a solution for this (and a use case for intercepting bind()), we opt to not allow rewriting the sockaddr in bind() calls. We also implement recvmsg() support for connected streams so that after a connect() that is modified by a sockaddr hook, any corresponding recmvsg() on the connected socket can also be modified to make the connected program think it is connected to the "intended" remote. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> |
||
![]() |
ade32bd8a7 |
af_unix: Fix data-race around unix_tot_inflight.
unix_tot_inflight is changed under spin_lock(unix_gc_lock), but unix_release_sock() reads it locklessly. Let's use READ_ONCE() for unix_tot_inflight. Note that the writer side was marked by commit |
||
![]() |
11695c6e96 |
net: add missing data-race annotations around sk->sk_peek_off
sk_getsockopt() runs locklessly, thus we need to annotate the read
of sk->sk_peek_off.
While we are at it, add corresponding annotations to sk_set_peek_off()
and unix_set_peek_off().
Fixes:
|
||
![]() |
ecb4534b6a |
af_unix: Terminate sun_path when bind()ing pathname socket.
kernel test robot reported slab-out-of-bounds access in strlen(). [0] Commit |
||
![]() |
06d4c8a808 |
af_unix: Fix fortify_panic() in unix_bind_bsd().
syzkaller found a bug in unix_bind_bsd() [0]. We can reproduce it
by bind()ing a socket on a path with length 108.
108 is the size of sun_addr of struct sockaddr_un and is the maximum
valid length for the pathname socket. When calling bind(), we use
struct sockaddr_storage as the actual buffer size, so terminating
sun_addr[108] with null is legitimate as done in unix_mkname_bsd().
However, strlen(sunaddr) for such a case causes fortify_panic() if
CONFIG_FORTIFY_SOURCE=y. __fortify_strlen() has no idea about the
actual buffer size and see the string as unterminated.
Let's use strnlen() to allow sun_addr to be unterminated at 107.
[0]:
detected buffer overflow in __fortify_strlen
kernel BUG at lib/string_helpers.c:1031!
Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 255 Comm: syz-executor296 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #4
Hardware name: linux,dummy-virt (DT)
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : fortify_panic+0x1c/0x20 lib/string_helpers.c:1030
lr : fortify_panic+0x1c/0x20 lib/string_helpers.c:1030
sp : ffff800089817af0
x29: ffff800089817af0 x28: ffff800089817b40 x27: 1ffff00011302f68
x26: 000000000000006e x25: 0000000000000012 x24: ffff800087e60140
x23: dfff800000000000 x22: ffff800089817c20 x21: ffff800089817c8e
x20: 000000000000006c x19: ffff00000c323900 x18: ffff800086ab1630
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000001
x14: 1ffff00011302eb8 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : 64a26b65474d2a00
x8 : 64a26b65474d2a00 x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffff800089817438 x4 : ffff800086ac99e0 x3 : ffff800080f19e8c
x2 : 0000000000000001 x1 : 0000000100000000 x0 : 000000000000002c
Call trace:
fortify_panic+0x1c/0x20 lib/string_helpers.c:1030
_Z16__fortify_strlenPKcU25pass_dynamic_object_size1 include/linux/fortify-string.h:217 [inline]
unix_bind_bsd net/unix/af_unix.c:1212 [inline]
unix_bind+0xba8/0xc58 net/unix/af_unix.c:1326
__sys_bind+0x1ac/0x248 net/socket.c:1792
__do_sys_bind net/socket.c:1803 [inline]
__se_sys_bind net/socket.c:1801 [inline]
__arm64_sys_bind+0x7c/0x94 net/socket.c:1801
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall+0x98/0x2c0 arch/arm64/kernel/syscall.c:52
el0_svc_common+0x134/0x240 arch/arm64/kernel/syscall.c:139
do_el0_svc+0x64/0x198 arch/arm64/kernel/syscall.c:188
el0_svc+0x2c/0x7c arch/arm64/kernel/entry-common.c:647
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:665
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591
Code: aa0003e1 d0000e80 91030000 97ffc91a (d4210000)
Fixes:
|
||
![]() |
a9c49cc2f5 |
net: scm: introduce and use scm_recv_unix helper
Recently, our friends from bluetooth subsystem reported [1] that after commit |
||
![]() |
9d797ee2dc |
Revert "af_unix: Call scm_recv() only after scm_set_cred()."
This reverts commit |
||
![]() |
dc97391e66 |
sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)
Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages and multipage folios to be passed through. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-afs@lists.infradead.org cc: mptcp@lists.linux.dev cc: rds-devel@oss.oracle.com cc: tipc-discussion@lists.sourceforge.net cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
3f5f118bb6 |
af_unix: Call scm_recv() only after scm_set_cred().
syzkaller hit a WARN_ON_ONCE(!scm->pid) in scm_pidfd_recv(). In unix_stream_read_generic(), if there is no skb in the queue, we could bail out the do-while loop without calling scm_set_cred(): 1. No skb in the queue 2. sk is non-blocking or shutdown(sk, RCV_SHUTDOWN) is called concurrently or peer calls close() If the socket is configured with SO_PASSCRED or SO_PASSPIDFD, scm_recv() would populate cmsg with garbage. Let's not call scm_recv() unless there is skb to receive. WARNING: CPU: 1 PID: 3245 at include/net/scm.h:138 scm_pidfd_recv include/net/scm.h:138 [inline] WARNING: CPU: 1 PID: 3245 at include/net/scm.h:138 scm_recv.constprop.0+0x754/0x850 include/net/scm.h:177 Modules linked in: CPU: 1 PID: 3245 Comm: syz-executor.1 Not tainted 6.4.0-rc5-01219-gfa0e21fa4443 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:scm_pidfd_recv include/net/scm.h:138 [inline] RIP: 0010:scm_recv.constprop.0+0x754/0x850 include/net/scm.h:177 Code: 67 fd e9 55 fd ff ff e8 4a 70 67 fd e9 7f fd ff ff e8 40 70 67 fd e9 3e fb ff ff e8 36 70 67 fd e9 02 fd ff ff e8 8c 3a 20 fd <0f> 0b e9 fe fb ff ff e8 50 70 67 fd e9 2e f9 ff ff e8 46 70 67 fd RSP: 0018:ffffc90009af7660 EFLAGS: 00010216 RAX: 00000000000000a1 RBX: ffff888041e58a80 RCX: ffffc90003852000 RDX: 0000000000040000 RSI: ffffffff842675b4 RDI: 0000000000000007 RBP: ffffc90009af7810 R08: 0000000000000007 R09: 0000000000000013 R10: 00000000000000f8 R11: 0000000000000001 R12: ffffc90009af7db0 R13: 0000000000000000 R14: ffff888041e58a88 R15: 1ffff9200135eecc FS: 00007f6b7113f640(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6b7111de38 CR3: 0000000012a6e002 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> unix_stream_read_generic+0x5fe/0x1f50 net/unix/af_unix.c:2830 unix_stream_recvmsg+0x194/0x1c0 net/unix/af_unix.c:2880 sock_recvmsg_nosec net/socket.c:1019 [inline] sock_recvmsg+0x188/0x1d0 net/socket.c:1040 ____sys_recvmsg+0x210/0x610 net/socket.c:2712 ___sys_recvmsg+0xff/0x190 net/socket.c:2754 do_recvmmsg+0x25d/0x6c0 net/socket.c:2848 __sys_recvmmsg net/socket.c:2927 [inline] __do_sys_recvmmsg net/socket.c:2950 [inline] __se_sys_recvmmsg net/socket.c:2943 [inline] __x64_sys_recvmmsg+0x224/0x290 net/socket.c:2943 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3f/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f6b71da2e5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007f6b7113ecc8 EFLAGS: 00000246 ORIG_RAX: 000000000000012b RAX: ffffffffffffffda RBX: 00000000004bc050 RCX: 00007f6b71da2e5d RDX: 0000000000000007 RSI: 0000000020006600 RDI: 000000000000000b RBP: 00000000004bc050 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000120 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007f6b71e03530 R15: 0000000000000000 </TASK> Fixes: |
||
![]() |
7b26952a91 |
net: core: add getsockopt SO_PEERPIDFD
Add SO_PEERPIDFD which allows to get pidfd of peer socket holder pidfd. This thing is direct analog of SO_PEERCRED which allows to get plain PID. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: David Ahern <dsahern@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Luca Boccassi <bluca@debian.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Stanislav Fomichev <sdf@google.com> Cc: bpf@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Cc: linux-arch@vger.kernel.org Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Tested-by: Luca Boccassi <bluca@debian.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
5e2ff6704a |
scm: add SO_PASSPIDFD and SCM_PIDFD
Implement SCM_PIDFD, a new type of CMSG type analogical to SCM_CREDENTIALS, but it contains pidfd instead of plain pid, which allows programmers not to care about PID reuse problem. We mask SO_PASSPIDFD feature if CONFIG_UNIX is not builtin because it depends on a pidfd_prepare() API which is not exported to the kernel modules. Idea comes from UAPI kernel group: https://uapi-group.org/kernel-features/ Big thanks to Christian Brauner and Lennart Poettering for productive discussions about this. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: David Ahern <dsahern@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Luca Boccassi <bluca@debian.org> Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Cc: linux-arch@vger.kernel.org Tested-by: Luca Boccassi <bluca@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
d4031ec844 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/raw.c |
||
![]() |
57d44a354a |
unix: Convert unix_stream_sendpage() to use MSG_SPLICE_PAGES
Convert unix_stream_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather than directly splicing in the pages itself. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> cc: Kuniyuki Iwashima <kuniyu@amazon.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
a0dbf5f818 |
af_unix: Support MSG_SPLICE_PAGES
Make AF_UNIX sendmsg() support MSG_SPLICE_PAGES, splicing in pages from the source iterator if possible and copying the data in otherwise. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> cc: Kuniyuki Iwashima <kuniyu@amazon.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
96449f9024 |
net: Pass max frags into skb_append_pagefrags()
Pass the maximum number of fragments into skb_append_pagefrags() rather than using MAX_SKB_FRAGS so that it can be used from code that wants to specify sysctl_max_skb_frags. Signed-off-by: David Howells <dhowells@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
78fa0d61d9 |
bpf, sockmap: Pass skb ownership through read_skb
The read_skb hook calls consume_skb() now, but this means that if the
recv_actor program wants to use the skb it needs to inc the ref cnt
so that the consume_skb() doesn't kfree the sk_buff.
This is problematic because in some error cases under memory pressure
we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
Then we get this,
skb_linearize()
__pskb_pull_tail()
pskb_expand_head()
BUG_ON(skb_shared(skb))
Because we incremented users refcnt from sk_psock_verdict_recv() we
hit the bug on with refcnt > 1 and trip it.
To fix lets simply pass ownership of the sk_buff through the skb_read
call. Then we can drop the consume from read_skb handlers and assume
the verdict recv does any required kfree.
Bug found while testing in our CI which runs in VMs that hit memory
constraints rather regularly. William tested TCP read_skb handlers.
[ 106.536188] ------------[ cut here ]------------
[ 106.536197] kernel BUG at net/core/skbuff.c:1693!
[ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
[ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
[ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
[ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
[ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
[ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
[ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
[ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
[ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
[ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
[ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 106.542255] Call Trace:
[ 106.542383] <IRQ>
[ 106.542487] __pskb_pull_tail+0x4b/0x3e0
[ 106.542681] skb_ensure_writable+0x85/0xa0
[ 106.542882] sk_skb_pull_data+0x18/0x20
[ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
[ 106.543536] ? migrate_disable+0x66/0x80
[ 106.543871] sk_psock_verdict_recv+0xe2/0x310
[ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0
[ 106.544561] tcp_read_skb+0x7b/0x120
[ 106.544740] tcp_data_queue+0x904/0xee0
[ 106.544931] tcp_rcv_established+0x212/0x7c0
[ 106.545142] tcp_v4_do_rcv+0x174/0x2a0
[ 106.545326] tcp_v4_rcv+0xe70/0xf60
[ 106.545500] ip_protocol_deliver_rcu+0x48/0x290
[ 106.545744] ip_local_deliver_finish+0xa7/0x150
Fixes:
|
||
![]() |
e1d09c2c2f |
af_unix: Fix data races around sk->sk_shutdown.
KCSAN found a data race around sk->sk_shutdown where unix_release_sock() and unix_shutdown() update it under unix_state_lock(), OTOH unix_poll() and unix_dgram_poll() read it locklessly. We need to annotate the writes and reads with WRITE_ONCE() and READ_ONCE(). BUG: KCSAN: data-race in unix_poll / unix_release_sock write to 0xffff88800d0f8aec of 1 bytes by task 264 on cpu 0: unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631 unix_release+0x59/0x80 net/unix/af_unix.c:1042 __sock_release+0x7d/0x170 net/socket.c:653 sock_close+0x19/0x30 net/socket.c:1397 __fput+0x179/0x5e0 fs/file_table.c:321 ____fput+0x15/0x20 fs/file_table.c:349 task_work_run+0x116/0x1a0 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline] syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297 do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x72/0xdc read to 0xffff88800d0f8aec of 1 bytes by task 222 on cpu 1: unix_poll+0xa3/0x2a0 net/unix/af_unix.c:3170 sock_poll+0xcf/0x2b0 net/socket.c:1385 vfs_poll include/linux/poll.h:88 [inline] ep_item_poll.isra.0+0x78/0xc0 fs/eventpoll.c:855 ep_send_events fs/eventpoll.c:1694 [inline] ep_poll fs/eventpoll.c:1823 [inline] do_epoll_wait+0x6c4/0xea0 fs/eventpoll.c:2258 __do_sys_epoll_wait fs/eventpoll.c:2270 [inline] __se_sys_epoll_wait fs/eventpoll.c:2265 [inline] __x64_sys_epoll_wait+0xcc/0x190 fs/eventpoll.c:2265 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc value changed: 0x00 -> 0x03 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 222 Comm: dbus-broker Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: |
||
![]() |
679ed006d4 |
af_unix: Fix a data race of sk->sk_receive_queue->qlen.
KCSAN found a data race of sk->sk_receive_queue->qlen where recvmsg()
updates qlen under the queue lock and sendmsg() checks qlen under
unix_state_sock(), not the queue lock, so the reader side needs
READ_ONCE().
BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_wait_for_peer
write (marked) to 0xffff888019fe7c68 of 4 bytes by task 49792 on cpu 0:
__skb_unlink include/linux/skbuff.h:2347 [inline]
__skb_try_recv_from_queue+0x3de/0x470 net/core/datagram.c:197
__skb_try_recv_datagram+0xf7/0x390 net/core/datagram.c:263
__unix_dgram_recvmsg+0x109/0x8a0 net/unix/af_unix.c:2452
unix_dgram_recvmsg+0x94/0xa0 net/unix/af_unix.c:2549
sock_recvmsg_nosec net/socket.c:1019 [inline]
____sys_recvmsg+0x3a3/0x3b0 net/socket.c:2720
___sys_recvmsg+0xc8/0x150 net/socket.c:2764
do_recvmmsg+0x182/0x560 net/socket.c:2858
__sys_recvmmsg net/socket.c:2937 [inline]
__do_sys_recvmmsg net/socket.c:2960 [inline]
__se_sys_recvmmsg net/socket.c:2953 [inline]
__x64_sys_recvmmsg+0x153/0x170 net/socket.c:2953
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x72/0xdc
read to 0xffff888019fe7c68 of 4 bytes by task 49793 on cpu 1:
skb_queue_len include/linux/skbuff.h:2127 [inline]
unix_recvq_full net/unix/af_unix.c:229 [inline]
unix_wait_for_peer+0x154/0x1a0 net/unix/af_unix.c:1445
unix_dgram_sendmsg+0x13bc/0x14b0 net/unix/af_unix.c:2048
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0x148/0x160 net/socket.c:747
____sys_sendmsg+0x20e/0x620 net/socket.c:2503
___sys_sendmsg+0xc6/0x140 net/socket.c:2557
__sys_sendmmsg+0x11d/0x370 net/socket.c:2643
__do_sys_sendmmsg net/socket.c:2672 [inline]
__se_sys_sendmmsg net/socket.c:2669 [inline]
__x64_sys_sendmmsg+0x58/0x70 net/socket.c:2669
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x72/0xdc
value changed: 0x0000000b -> 0x00000001
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 49793 Comm: syz-executor.0 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Fixes:
|
||
![]() |
cc04410af7 |
af_unix: annotate lockless accesses to sk->sk_err
unix_poll() and unix_dgram_poll() read sk->sk_err without any lock held. Add relevant READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
2aab4b9690 |
af_unix: fix struct pid leaks in OOB support
syzbot reported struct pid leak [1].
Issue is that queue_oob() calls maybe_add_creds() which potentially
holds a reference on a pid.
But skb->destructor is not set (either directly or by calling
unix_scm_to_skb())
This means that subsequent kfree_skb() or consume_skb() would leak
this reference.
In this fix, I chose to fully support scm even for the OOB message.
[1]
BUG: memory leak
unreferenced object 0xffff8881053e7f80 (size 128):
comm "syz-executor242", pid 5066, jiffies 4294946079 (age 13.220s)
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff812ae26a>] alloc_pid+0x6a/0x560 kernel/pid.c:180
[<ffffffff812718df>] copy_process+0x169f/0x26c0 kernel/fork.c:2285
[<ffffffff81272b37>] kernel_clone+0xf7/0x610 kernel/fork.c:2684
[<ffffffff812730cc>] __do_sys_clone+0x7c/0xb0 kernel/fork.c:2825
[<ffffffff849ad699>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff849ad699>] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
[<ffffffff84a0008b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes:
|
||
![]() |
5b7c4cabbb |
Networking changes for 6.3.
Core ---- - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols --------- - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF --- - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter --------- - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt. races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API ---------- - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers ---------------------- - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers ------- - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - enetc: support XDP_REDIRECT for XDP non-linear buffers - enetc: improve reconfig, avoid link flap and waiting for idle - enetc: support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation. Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmP1VIYACgkQMUZtbf5S IrvsChAApz0rNL/sPKxXTEfxZ1tN7D3sYxYKQPomxvl5BV+MvicrLddJy3KmzEFK nnJNO3nuRNuH422JQ/ylZ4mGX1opa6+5QJb0UINImXUI7Fm8HHBIuPGkv7d5CheZ 7JexFqjPJXUy9nPyh1Rra+IA9AcRd2U7jeGEZR38wb99bHJQj5Bzdk20WArEB0el n44aqg49LXH71bSeXRz77x5SjkwVtYiccQxLcnmTbjLU2xVraLvI2J+wAhHnVXWW 9lrU1+V4Ex2Xcd1xR0L0cHeK+meP1TrPRAeF+JDpVI3a/zJiE7cZjfHdG/jH5xWl leZJqghVozrZQNtewWWO7XhUFhMDgFu3W/1vNLjSHPZEqaz1JpM67J1+ql6s63l4 LMWoXbcYZz+SL9ZRCoPkbGue/5fKSHv8/Jl9Sh58+eTS+c/zgN8uFGRNFXLX1+EP n8uvt985PxMd6x1+dHumhOUzxnY4Sfi1vjitSunTsNFQ3Cmp4SO0IfBVJWfLUCuC xz5hbJGJJbSpvUsO+HWyCg83E5OWghRE/Onpt2jsQSZCrO9HDg4FRTEf3WAMgaqc edb5KfbRZPTJQM08gWdluXzSk1nw3FNP2tXW4XlgUrEbjb+fOk0V9dQg2gyYTxQ1 Nhvn8ZQPi6/GMMELHAIPGmmW1allyOGiAzGlQsv8EmL+OFM6WDI= =xXhC -----END PGP SIGNATURE----- Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ... |
||
![]() |
509f15b9c5 |
net: add missing includes of linux/splice.h
Number of files depend on linux/splice.h getting included by linux/skbuff.h which soon will no longer be the case. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
abf08576af
|
fs: port vfs_*() helpers to struct mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
![]() |
b27401a30e |
unix: Improve locking scheme in unix_show_fdinfo()
After switching to TCP_ESTABLISHED or TCP_LISTEN sk_state, alive SOCK_STREAM
and SOCK_SEQPACKET sockets can't change it anymore (since commit
|
||
![]() |
3ff8bff704 |
unix: Fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()
There is a race resulting in alive SOCK_SEQPACKET socket
may change its state from TCP_ESTABLISHED to TCP_CLOSE:
unix_release_sock(peer) unix_dgram_sendmsg(sk)
sock_orphan(peer)
sock_set_flag(peer, SOCK_DEAD)
sock_alloc_send_pskb()
if !(sk->sk_shutdown & SEND_SHUTDOWN)
OK
if sock_flag(peer, SOCK_DEAD)
sk->sk_state = TCP_CLOSE
sk->sk_shutdown = SHUTDOWN_MASK
After that socket sk remains almost normal: it is able to connect, listen, accept
and recvmsg, while it can't sendmsg.
Since this is the only possibility for alive SOCK_SEQPACKET to change
the state in such way, we should better fix this strange and potentially
danger corner case.
Note, that we will return EPIPE here like this is normally done in sock_alloc_send_pskb().
Originally used ECONNREFUSED looks strange, since it's strange to return
a specific retval in dependence of race in kernel, when user can't affect on this.
Also, move TCP_CLOSE assignment for SOCK_DGRAM sockets under state lock
to fix race with unix_dgram_connect():
unix_dgram_connect(other) unix_dgram_sendmsg(sk)
unix_peer(sk) = NULL
unix_state_unlock(sk)
unix_state_double_lock(sk, other)
sk->sk_state = TCP_ESTABLISHED
unix_peer(sk) = other
unix_state_double_unlock(sk, other)
sk->sk_state = TCP_CLOSED
This patch fixes both of these races.
Fixes:
|
||
![]() |
73e341e028 |
af_unix: call proto_unregister() in the error path in af_unix_init()
If register unix_stream_proto returns error, unix_dgram_proto needs
be unregistered.
Fixes:
|
||
![]() |
a251c17aa5 |
treewide: use get_random_u32() when possible
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
||
![]() |
30c999937f |
Scheduler changes for v6.1:
- Debuggability: - Change most occurances of BUG_ON() to WARN_ON_ONCE() - Reorganize & fix TASK_ state comparisons, turn it into a bitmap - Update/fix misc scheduler debugging facilities - Load-balancing & regular scheduling: - Improve the behavior of the scheduler in presence of lot of SCHED_IDLE tasks - in particular they should not impact other scheduling classes. - Optimize task load tracking, cleanups & fixes - Clean up & simplify misc load-balancing code - Freezer: - Rewrite the core freezer to behave better wrt thawing and be simpler in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting all the fallout. - Deadline scheduler: - Fix the DL capacity-aware code - Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period() - Relax/optimize locking in task_non_contending() - Cleanups: - Factor out the update_current_exec_runtime() helper - Various cleanups, simplifications Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/01cRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1geZA/+PB4KC1T9aVxzaTHI36R03YgJYZmIdtxw wTf02MixePmz+gQCbepJbempGOh5ST28aOcI0xhdYOql5B63MaUBBMlB0HvGUyDG IU3zETqLMRtAbnSTdQFv8m++ECUtZYp8/x1FCel4WO7ya4ETkRu1NRfCoUepEhpZ aVAlae9LH3NBaF9t7s0PT2lTjf3pIzMFRkddJ0ywJhbFR3VnWat05fAK+J6fGY8+ LS54coefNlJD4oDh5TY8uniL1j5SmWmmwbk9Cdj7bLU5P3dFSS0/+5FJNHJPVGDE srGT7wstRUcDrN0CnZo48VIUBiApJCCDqTfJYi9wNYd0NAHvwY6MIJJgEIY8mKsI L/qH26H81Wt+ezSZ/5JIlGlZ/LIeNaa6OO/fbWEYABBQogvvx3nxsRNUYKSQzumH CnSBasBjLnjWyLlK4qARM9cI7NFSEK6NUigrEx/7h8JFu/8T4DlSy6LsF1HUyKgq 4+FJLAqG6cL0tcwB/fHYd0oRESN8dStnQhGxSojgufwLc7dlFULvCYF5JM/dX+/V IKwbOfIOeOn6ViMtSOXAEGdII+IQ2/ZFPwr+8Z5JC7NzvTVL6xlu/3JXkLZR3L7o yaXTSaz06h1vil7Z+GRf7RHc+wUeGkEpXh5vnarGZKXivhFdWsBdROIJANK+xR0i TeSLCxQxXlU= =KjMD -----END PGP SIGNATURE----- Merge tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Debuggability: - Change most occurances of BUG_ON() to WARN_ON_ONCE() - Reorganize & fix TASK_ state comparisons, turn it into a bitmap - Update/fix misc scheduler debugging facilities Load-balancing & regular scheduling: - Improve the behavior of the scheduler in presence of lot of SCHED_IDLE tasks - in particular they should not impact other scheduling classes. - Optimize task load tracking, cleanups & fixes - Clean up & simplify misc load-balancing code Freezer: - Rewrite the core freezer to behave better wrt thawing and be simpler in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting all the fallout. Deadline scheduler: - Fix the DL capacity-aware code - Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period() - Relax/optimize locking in task_non_contending() Cleanups: - Factor out the update_current_exec_runtime() helper - Various cleanups, simplifications" * tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits) sched: Fix more TASK_state comparisons sched: Fix TASK_state comparisons sched/fair: Move call to list_last_entry() in detach_tasks sched/fair: Cleanup loop_max and loop_break sched/fair: Make sure to try to detach at least one movable task sched: Show PF_flag holes freezer,sched: Rewrite core freezer logic sched: Widen TAKS_state literals sched/wait: Add wait_event_state() sched/completion: Add wait_for_completion_state() sched: Add TASK_ANY for wait_task_inactive() sched: Change wait_task_inactive()s match_state freezer,umh: Clean up freezer/initrd interaction freezer: Have {,un}lock_system_sleep() save/restore flags sched: Rename task_running() to task_on_cpu() sched/fair: Cleanup for SIS_PROP sched/fair: Default to false in test_idle_cores() sched/fair: Remove useless check in select_idle_core() sched/fair: Avoid double search on same cpu sched/fair: Remove redundant check in select_idle_smt() ... |
||
![]() |
e52f7c1ddf |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in the left-over fixes before the net-next pull-request. Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c |
||
![]() |
7a62ed6136 |
af_unix: Fix memory leaks of the whole sk due to OOB skb.
syzbot reported a sequence of memory leaks, and one of them indicated we
failed to free a whole sk:
unreferenced object 0xffff8880126e0000 (size 1088):
comm "syz-executor419", pid 326, jiffies 4294773607 (age 12.609s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 7d 00 00 00 00 00 00 00 ........}.......
01 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace:
[<000000006fefe750>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:1970
[<0000000074006db5>] sk_alloc+0x3b/0x800 net/core/sock.c:2029
[<00000000728cd434>] unix_create1+0xaf/0x920 net/unix/af_unix.c:928
[<00000000a279a139>] unix_create+0x113/0x1d0 net/unix/af_unix.c:997
[<0000000068259812>] __sock_create+0x2ab/0x550 net/socket.c:1516
[<00000000da1521e1>] sock_create net/socket.c:1566 [inline]
[<00000000da1521e1>] __sys_socketpair+0x1a8/0x550 net/socket.c:1698
[<000000007ab259e1>] __do_sys_socketpair net/socket.c:1751 [inline]
[<000000007ab259e1>] __se_sys_socketpair net/socket.c:1748 [inline]
[<000000007ab259e1>] __x64_sys_socketpair+0x97/0x100 net/socket.c:1748
[<000000007dedddc1>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<000000007dedddc1>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
[<000000009456679f>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
We can reproduce this issue by creating two AF_UNIX SOCK_STREAM sockets,
send()ing an OOB skb to each other, and close()ing them without consuming
the OOB skbs.
int skpair[2];
socketpair(AF_UNIX, SOCK_STREAM, 0, skpair);
send(skpair[0], "x", 1, MSG_OOB);
send(skpair[1], "x", 1, MSG_OOB);
close(skpair[0]);
close(skpair[1]);
Currently, we free an OOB skb in unix_sock_destructor() which is called via
__sk_free(), but it's too late because the receiver's unix_sk(sk)->oob_skb
is accounted against the sender's sk->sk_wmem_alloc and __sk_free() is
called only when sk->sk_wmem_alloc is 0.
In the repro sequences, we do not consume the OOB skb, so both two sk's
sock_put() never reach __sk_free() due to the positive sk->sk_wmem_alloc.
Then, no one can consume the OOB skb nor call __sk_free(), and we finally
leak the two whole sk.
Thus, we must free the unconsumed OOB skb earlier when close()ing the
socket.
Fixes:
|
||
![]() |
d6e3b27cbd |
af_unix: Refactor unix_read_skb()
Similar to udp_read_skb(), delete the unnecessary while loop in unix_read_skb() for readability. Since recv_actor() cannot return a value greater than skb->len (see sk_psock_verdict_recv()), remove the redundant check. Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Link: https://lore.kernel.org/r/7009141683ad6cd3785daced3e4a80ba0eb773b5.1663909008.git.peilin.ye@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
f5d39b0208 |
freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org |
||
![]() |
de43708924 |
af_unix: Show number of inflight fds for sockets in TCP_LISTEN state too
TCP_LISTEN sockets is a special case. They preserve skb with a newly connected sock till accept() makes it fully functional socket. Receive queue of such socket may grow after connected peer send messages there. Since these messages may contain scm_fds, we should expose correct fdinfo::scm_fds for listening socket too. Signed-off-by: Kirill Tkhai <tkhai@ya.ru> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
0076cad301 |
Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-07-09 We've added 94 non-merge commits during the last 19 day(s) which contain a total of 125 files changed, 5141 insertions(+), 6701 deletions(-). The main changes are: 1) Add new way for performing BTF type queries to BPF, from Daniel Müller. 2) Add inlining of calls to bpf_loop() helper when its function callback is statically known, from Eduard Zingerman. 3) Implement BPF TCP CC framework usability improvements, from Jörn-Thorben Hinz. 4) Add LSM flavor for attaching per-cgroup BPF programs to existing LSM hooks, from Stanislav Fomichev. 5) Remove all deprecated libbpf APIs in prep for 1.0 release, from Andrii Nakryiko. 6) Add benchmarks around local_storage to BPF selftests, from Dave Marchevsky. 7) AF_XDP sample removal (given move to libxdp) and various improvements around AF_XDP selftests, from Magnus Karlsson & Maciej Fijalkowski. 8) Add bpftool improvements for memcg probing and bash completion, from Quentin Monnet. 9) Add arm64 JIT support for BPF-2-BPF coupled with tail calls, from Jakub Sitnicki. 10) Sockmap optimizations around throughput of UDP transmissions which have been improved by 61%, from Cong Wang. 11) Rework perf's BPF prologue code to remove deprecated functions, from Jiri Olsa. 12) Fix sockmap teardown path to avoid sleepable sk_psock_stop, from John Fastabend. 13) Fix libbpf's cleanup around legacy kprobe/uprobe on error case, from Chuang Wang. 14) Fix libbpf's bpf_helpers.h to work with gcc for the case of its sec/pragma macro, from James Hilliard. 15) Fix libbpf's pt_regs macros for riscv to use a0 for RC register, from Yixun Lan. 16) Fix bpftool to show the name of type BPF_OBJ_LINK, from Yafang Shao. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (94 commits) selftests/bpf: Fix xdp_synproxy build failure if CONFIG_NF_CONNTRACK=m/n bpf: Correctly propagate errors up from bpf_core_composites_match libbpf: Disable SEC pragma macro on GCC bpf: Check attach_func_proto more carefully in check_return_code selftests/bpf: Add test involving restrict type qualifier bpftool: Add support for KIND_RESTRICT to gen min_core_btf command MAINTAINERS: Add entry for AF_XDP selftests files selftests, xsk: Rename AF_XDP testing app bpf, docs: Remove deprecated xsk libbpf APIs description selftests/bpf: Add benchmark for local_storage RCU Tasks Trace usage libbpf, riscv: Use a0 for RC register libbpf: Remove unnecessary usdt_rel_ip assignments selftests/bpf: Fix few more compiler warnings selftests/bpf: Fix bogus uninitialized variable warning bpftool: Remove zlib feature test from Makefile libbpf: Cleanup the legacy uprobe_event on failed add/attach_event() libbpf: Fix wrong variable used in perf_event_uprobe_open_legacy() libbpf: Cleanup the legacy kprobe_event on failed add/attach_event() selftests/bpf: Add type match test against kernel's task_struct selftests/bpf: Add nested type to type based tests ... ==================== Link: https://lore.kernel.org/r/20220708233145.32365-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
cf21b355cc |
af_unix: Optimise hash table layout.
Commit |
||
![]() |
51bae889fe |
af_unix: Put pathname sockets in the global hash table.
Commit |
||
![]() |
2f7ca90a01 |
af_unix: Remove unix_table_locks.
unix_table_locks are to protect the global hash table, unix_socket_table. The previous commit removed it, so let's clean up the unnecessary locks. Here is a test result on EC2 c5.9xlarge where 10 processes run concurrently in different netns and bind 100,000 sockets for each. without this series : 1m 38s with this series : 11s It is ~10x faster because the global hash table is split into 10 netns in this case. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
cf2f225e26 |
af_unix: Put a socket into a per-netns hash table.
This commit replaces the global hash table with a per-netns one and removes the global one. We now link a socket in each netns's hash table so we can save some netns comparisons when iterating through a hash bucket. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
79b05beaa5 |
af_unix: Acquire/Release per-netns hash table's locks.
This commit adds extra spin_lock/spin_unlock() for a per-netns hash table inside the existing ones for unix_table_locks. As of this commit, sockets are still linked in the global hash table. After putting sockets in a per-netns hash table and removing the old one in the next patch, we remove the global locks in the last patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
b6e8113830 |
af_unix: Define a per-netns hash table.
This commit adds a per netns hash table for AF_UNIX, which size is fixed as UNIX_HASH_SIZE for now. The first implementation defines a per-netns hash table as a single array of lock and list: struct unix_hashbucket { spinlock_t lock; struct hlist_head head; }; struct netns_unix { struct unix_hashbucket *hash; ... }; But, Eric pointed out memory cost that the structure has holes because of sizeof(spinlock_t), which is 4 (or more if LOCKDEP is enabled). [0] It could be expensive on a host with thousands of netns and few AF_UNIX sockets. For this reason, a per-netns hash table uses two dense arrays. struct unix_table { spinlock_t *locks; struct hlist_head *buckets; }; struct netns_unix { struct unix_table table; ... }; Note the length of the list has a significant impact rather than lock contention, so having shared locks can be an option. But, per-netns locks and lists still perform better than the global locks and per-netns lists. [1] Also, this patch adds a change so that struct netns_unix disappears from struct net if CONFIG_UNIX is disabled. [0]: https://lore.kernel.org/netdev/CANn89iLVxO5aqx16azNU7p7Z-nz5NrnM5QTqOzueVxEnkVTxyg@mail.gmail.com/ [1]: https://lore.kernel.org/netdev/20220617175215.1769-1-kuniyu@amazon.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
f302d180c6 |
af_unix: Include the whole hash table size in UNIX_HASH_SIZE.
Currently, the size of AF_UNIX hash table is UNIX_HASH_SIZE * 2, the first half for bind()ed sockets and the second half for unbound ones. UNIX_HASH_SIZE * 2 is used to define the table and iterate over it. In some places, we use ARRAY_SIZE(unix_socket_table) instead of UNIX_HASH_SIZE * 2. However, we cannot use it anymore because we will allocate the hash table dynamically. Then, we would have to add UNIX_HASH_SIZE * 2 in many places, which would be troublesome. This patch adapts the UNIX_HASH_SIZE definition to include bound and unbound sockets and defines a new UNIX_HASH_MOD macro to ease calculations. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
340c3d3371 |
af_unix: Clean up some sock_net() uses.
Some functions define a net pointer only for one-shot use. Others call sock_net() redundantly even when a net pointer is available. Let's fix these and make the code simpler. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
965b57b469 |
net: Introduce a new proto_ops ->read_skb()
Currently both splice() and sockmap use ->read_sock() to read skb from receive queue, but for sockmap we only read one entire skb at a time, so ->read_sock() is too conservative to use. Introduce a new proto_ops ->read_skb() which supports this sematic, with this we can finally pass the ownership of skb to recv actors. For non-TCP protocols, all ->read_sock() can be simply converted to ->read_skb(). Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com |
||
![]() |
dd29c67dbb |
af_unix: use DEBUG_NET_WARN_ON_ONCE()
Replace four WARN_ON() that have not triggered recently with DEBUG_NET_WARN_ON_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
662a80946c |
af_unix: Fix a data-race in unix_dgram_peer_wake_me().
unix_dgram_poll() calls unix_dgram_peer_wake_me() without `other`'s
lock held and check if its receive queue is full. Here we need to
use unix_recvq_full_lockless() instead of unix_recvq_full(), otherwise
KCSAN will report a data-race.
Fixes:
|
||
![]() |
7e062cda7d |
Networking changes for 5.19.
Core ---- - Support TCPv6 segmentation offload with super-segments larger than 64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP). - Generalize skb freeing deferral to per-cpu lists, instead of per-socket lists. - Add a netdev statistic for packets dropped due to L2 address mismatch (rx_otherhost_dropped). - Continue work annotating skb drop reasons. - Accept alternative netdev names (ALT_IFNAME) in more netlink requests. - Add VLAN support for AF_PACKET SOCK_RAW GSO. - Allow receiving skb mark from the socket as a cmsg. - Enable memcg accounting for veth queues, sysctl tables and IPv6. BPF --- - Add libbpf support for User Statically-Defined Tracing (USDTs). - Speed up symbol resolution for kprobes multi-link attachments. - Support storing typed pointers to referenced and unreferenced objects in BPF maps. - Add support for BPF link iterator. - Introduce access to remote CPU map elements in BPF per-cpu map. - Allow middle-of-the-road settings for the kernel.unprivileged_bpf_disabled sysctl. - Implement basic types of dynamic pointers e.g. to allow for dynamically sized ringbuf reservations without extra memory copies. Protocols --------- - Retire port only listening_hash table, add a second bind table hashed by port and address. Avoid linear list walk when binding to very popular ports (e.g. 443). - Add bridge FDB bulk flush filtering support allowing user space to remove all FDB entries matching a condition. - Introduce accept_unsolicited_na sysctl for IPv6 to implement router-side changes for RFC9131. - Support for MPTCP path manager in user space. - Add MPTCP support for fallback to regular TCP for connections that have never connected additional subflows or transmitted out-of-sequence data (partial support for RFC8684 fallback). - Avoid races in MPTCP-level window tracking, stabilize and improve throughput. - Support lockless operation of GRE tunnels with seq numbers enabled. - WiFi support for host based BSS color collision detection. - Add support for SO_TXTIME/SCM_TXTIME on CAN sockets. - Support transmission w/o flow control in CAN ISOTP (ISO 15765-2). - Support zero-copy Tx with TLS 1.2 crypto offload (sendfile). - Allow matching on the number of VLAN tags via tc-flower. - Add tracepoint for tcp_set_ca_state(). Driver API ---------- - Improve error reporting from classifier and action offload. - Add support for listing line cards in switches (devlink). - Add helpers for reporting page pool statistics with ethtool -S. - Add support for reading clock cycles when using PTP virtual clocks, instead of having the driver convert to time before reporting. This makes it possible to report time from different vclocks. - Support configuring low-latency Tx descriptor push via ethtool. - Separate Clause 22 and Clause 45 MDIO accesses more explicitly. New hardware / drivers ---------------------- - Ethernet: - Marvell's Octeon NIC PCI Endpoint support (octeon_ep) - Sunplus SP7021 SoC (sp7021_emac) - Add support for Renesas RZ/V2M (in ravb) - Add support for MediaTek mt7986 switches (in mtk_eth_soc) - Ethernet PHYs: - ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting) - TI DP83TD510 PHY - Microchip LAN8742/LAN88xx PHYs - WiFi: - Driver for pureLiFi X, XL, XC devices (plfxlc) - Driver for Silicon Labs devices (wfx) - Support for WCN6750 (in ath11k) - Support Realtek 8852ce devices (in rtw89) - Mobile: - MediaTek T700 modems (Intel 5G 5000 M.2 cards) - CAN: - ctucanfd: add support for CTU CAN FD open-source IP core from Czech Technical University in Prague Drivers ------- - Delete a number of old drivers still using virt_to_bus(). - Ethernet NICs: - intel: support TSO on tunnels MPLS - broadcom: support multi-buffer XDP - nfp: support VF rate limiting - sfc: use hardware tx timestamps for more than PTP - mlx5: multi-port eswitch support - hyper-v: add support for XDP_REDIRECT - atlantic: XDP support (including multi-buffer) - macb: improve real-time perf by deferring Tx processing to NAPI - High-speed Ethernet switches: - mlxsw: implement basic line card information querying - prestera: add support for traffic policing on ingress and egress - Embedded Ethernet switches: - lan966x: add support for packet DMA (FDMA) - lan966x: add support for PTP programmable pins - ti: cpsw_new: enable bc/mc storm prevention - Qualcomm 802.11ax WiFi (ath11k): - Wake-on-WLAN support for QCA6390 and WCN6855 - device recovery (firmware restart) support - support setting Specific Absorption Rate (SAR) for WCN6855 - read country code from SMBIOS for WCN6855/QCA6390 - enable keep-alive during WoWLAN suspend - implement remain-on-channel support - MediaTek WiFi (mt76): - support Wireless Ethernet Dispatch offloading packet movement between the Ethernet switch and WiFi interfaces - non-standard VHT MCS10-11 support - mt7921 AP mode support - mt7921 IPv6 NS offload support - Ethernet PHYs: - micrel: ksz9031/ksz9131: cabletest support - lan87xx: SQI support for T1 PHYs - lan937x: add interrupt support for link detection Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmKNMPQACgkQMUZtbf5S IrsRARAAuDyYs6jFYB3p+xazZdOnbF4iAgVv71+DQGvmsCl6CB9OrsNZMlvE85OL Q3gjcRbgjrkN4lhgI8DmiGYbsUJnAvVjFdNjccz1Z/vTLYvuIM0ol54MUp5S+9WY StncOJkOGJxxR/Gi5gzVmejPDsysU3Jik+hm/fpIcz8pybXxAsFKU5waY5qfl+/T TZepfV0VCfqRDjqcF1qA5+jJZNU8pdodQlZ1+mh8bwu6Jk1ZkWkj6Ov8MWdwQldr LnPeK/9hIGzkdJYHZfajxA3t8D0K5CHzSuih2bJ9ry8ZXgVBkXEThew778/R5izW uB0YZs9COFlrIP7XHjtRTy/2xHOdYIPlj2nWhVdfuQDX8Crvt4VRN6EZ1rjko1ZJ WanfG6WHF8NH5pXBRQbh3kIMKBnYn6OIzuCfCQSqd+niHcxFIM4vRiggeXI5C5TW vJgEWfK6X+NfDiFVa3xyCrEmp5ieA/pNecpwd8rVkql+MtFAAw4vfsotLKOJEAru J/XL6UE+YuLqIJV9ACZ9x1AFXXAo661jOxBunOo4VXhXVzWS9lYYz5r5ryIkgT/8 /Fr0zjANJWgfIuNdIBtYfQ4qG+LozGq038VA06RhFUAZ5tF9DzhqJs2Q2AFuWWBC ewCePJVqo1j2Ceq2mGonXRt47OEnlePoOxTk9W+cKZb7ZWE+zEo= =Wjii -----END PGP SIGNATURE----- Merge tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core ---- - Support TCPv6 segmentation offload with super-segments larger than 64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP). - Generalize skb freeing deferral to per-cpu lists, instead of per-socket lists. - Add a netdev statistic for packets dropped due to L2 address mismatch (rx_otherhost_dropped). - Continue work annotating skb drop reasons. - Accept alternative netdev names (ALT_IFNAME) in more netlink requests. - Add VLAN support for AF_PACKET SOCK_RAW GSO. - Allow receiving skb mark from the socket as a cmsg. - Enable memcg accounting for veth queues, sysctl tables and IPv6. BPF --- - Add libbpf support for User Statically-Defined Tracing (USDTs). - Speed up symbol resolution for kprobes multi-link attachments. - Support storing typed pointers to referenced and unreferenced objects in BPF maps. - Add support for BPF link iterator. - Introduce access to remote CPU map elements in BPF per-cpu map. - Allow middle-of-the-road settings for the kernel.unprivileged_bpf_disabled sysctl. - Implement basic types of dynamic pointers e.g. to allow for dynamically sized ringbuf reservations without extra memory copies. Protocols --------- - Retire port only listening_hash table, add a second bind table hashed by port and address. Avoid linear list walk when binding to very popular ports (e.g. 443). - Add bridge FDB bulk flush filtering support allowing user space to remove all FDB entries matching a condition. - Introduce accept_unsolicited_na sysctl for IPv6 to implement router-side changes for RFC9131. - Support for MPTCP path manager in user space. - Add MPTCP support for fallback to regular TCP for connections that have never connected additional subflows or transmitted out-of-sequence data (partial support for RFC8684 fallback). - Avoid races in MPTCP-level window tracking, stabilize and improve throughput. - Support lockless operation of GRE tunnels with seq numbers enabled. - WiFi support for host based BSS color collision detection. - Add support for SO_TXTIME/SCM_TXTIME on CAN sockets. - Support transmission w/o flow control in CAN ISOTP (ISO 15765-2). - Support zero-copy Tx with TLS 1.2 crypto offload (sendfile). - Allow matching on the number of VLAN tags via tc-flower. - Add tracepoint for tcp_set_ca_state(). Driver API ---------- - Improve error reporting from classifier and action offload. - Add support for listing line cards in switches (devlink). - Add helpers for reporting page pool statistics with ethtool -S. - Add support for reading clock cycles when using PTP virtual clocks, instead of having the driver convert to time before reporting. This makes it possible to report time from different vclocks. - Support configuring low-latency Tx descriptor push via ethtool. - Separate Clause 22 and Clause 45 MDIO accesses more explicitly. New hardware / drivers ---------------------- - Ethernet: - Marvell's Octeon NIC PCI Endpoint support (octeon_ep) - Sunplus SP7021 SoC (sp7021_emac) - Add support for Renesas RZ/V2M (in ravb) - Add support for MediaTek mt7986 switches (in mtk_eth_soc) - Ethernet PHYs: - ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting) - TI DP83TD510 PHY - Microchip LAN8742/LAN88xx PHYs - WiFi: - Driver for pureLiFi X, XL, XC devices (plfxlc) - Driver for Silicon Labs devices (wfx) - Support for WCN6750 (in ath11k) - Support Realtek 8852ce devices (in rtw89) - Mobile: - MediaTek T700 modems (Intel 5G 5000 M.2 cards) - CAN: - ctucanfd: add support for CTU CAN FD open-source IP core from Czech Technical University in Prague Drivers ------- - Delete a number of old drivers still using virt_to_bus(). - Ethernet NICs: - intel: support TSO on tunnels MPLS - broadcom: support multi-buffer XDP - nfp: support VF rate limiting - sfc: use hardware tx timestamps for more than PTP - mlx5: multi-port eswitch support - hyper-v: add support for XDP_REDIRECT - atlantic: XDP support (including multi-buffer) - macb: improve real-time perf by deferring Tx processing to NAPI - High-speed Ethernet switches: - mlxsw: implement basic line card information querying - prestera: add support for traffic policing on ingress and egress - Embedded Ethernet switches: - lan966x: add support for packet DMA (FDMA) - lan966x: add support for PTP programmable pins - ti: cpsw_new: enable bc/mc storm prevention - Qualcomm 802.11ax WiFi (ath11k): - Wake-on-WLAN support for QCA6390 and WCN6855 - device recovery (firmware restart) support - support setting Specific Absorption Rate (SAR) for WCN6855 - read country code from SMBIOS for WCN6855/QCA6390 - enable keep-alive during WoWLAN suspend - implement remain-on-channel support - MediaTek WiFi (mt76): - support Wireless Ethernet Dispatch offloading packet movement between the Ethernet switch and WiFi interfaces - non-standard VHT MCS10-11 support - mt7921 AP mode support - mt7921 IPv6 NS offload support - Ethernet PHYs: - micrel: ksz9031/ksz9131: cabletest support - lan87xx: SQI support for T1 PHYs - lan937x: add interrupt support for link detection" * tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits) ptp: ocp: Add firmware header checks ptp: ocp: fix PPS source selector debugfs reporting ptp: ocp: add .init function for sma_op vector ptp: ocp: vectorize the sma accessor functions ptp: ocp: constify selectors ptp: ocp: parameterize input/output sma selectors ptp: ocp: revise firmware display ptp: ocp: add Celestica timecard PCI ids ptp: ocp: Remove #ifdefs around PCI IDs ptp: ocp: 32-bit fixups for pci start address Revert "net/smc: fix listen processing for SMC-Rv2" ath6kl: Use cc-disable-warning to disable -Wdangling-pointer selftests/bpf: Dynptr tests bpf: Add dynptr data slices bpf: Add bpf_dynptr_read and bpf_dynptr_write bpf: Dynptr support for ring buffers bpf: Add bpf_dynptr_from_mem for local dynptrs bpf: Add verifier support for dynptrs bpf: Suppress 'passing zero to PTR_ERR' warning bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack ... |
||
![]() |
b146cbf2e3 |
af_unix: Silence randstruct GCC plugin warning
While preparing for Clang randstruct support (which duplicated many of the warnings the randstruct GCC plugin warned about), one strange one remained only for the randstruct GCC plugin. Eliminating this rids the plugin of the last exception. It seems the plugin is happy to dereference individual members of a cross-struct cast, but it is upset about casting to a whole object pointer. This only manifests in one place in the kernel, so just replace the variable with individual member accesses. There is no change in executable instruction output. Drop the last exception from the randstruct GCC plugin. Cc: "David S. Miller" <davem@davemloft.net> Cc: Christoph Hellwig <hch@infradead.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Cong Wang <cong.wang@bytedance.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: netdev@vger.kernel.org Cc: linux-hardening@vger.kernel.org Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Link: https://lore.kernel.org/lkml/20220511022217.58586-1-kuniyu@amazon.co.jp Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/lkml/20220511151542.4cb3ff17@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> |
||
![]() |
ec095263a9 |
net: remove noblock parameter from recvmsg() entities
The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
|
||
![]() |
f4b41f062c |
net: remove noblock parameter from skb_recv_datagram()
skb_recv_datagram() has two parameters 'flags' and 'noblock' that are merged inside skb_recv_datagram() by 'flags | (noblock ? MSG_DONTWAIT : 0)' As 'flags' may contain MSG_DONTWAIT as value most callers split the 'flags' into 'flags' and 'noblock' with finally obsolete bit operations like this: skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc); And this is not even done consistently with the 'flags' parameter. This patch removes the obsolete and costly splitting into two parameters and only performs bit operations when really needed on the caller side. One missing conversion thankfully reported by kernel test robot. I missed to enable kunit tests to build the mctp code. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
![]() |
89695196f0 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in overtime fixes, no conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
4edf21aa94 |
af_unix: Remove unnecessary brackets around CONFIG_AF_UNIX_OOB.
Let's remove unnecessary brackets around CONFIG_AF_UNIX_OOB. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Link: https://lore.kernel.org/r/20220317032308.65372-1-kuniyu@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org> |