Merge tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Replace busylock at the Tx queuing layer with a lockless list.

     Resulting in a 300% (4x) improvement on heavy TX workloads, sending
     twice the number of packets per second, for half the cpu cycles.

   - Allow constantly busy flows to migrate to a more suitable CPU/NIC
     queue.

     Normally we perform queue re-selection when flow comes out of idle,
     but under extreme circumstances the flows may be constantly busy.

     Add sysctl to allow periodic rehashing even if it'd risk packet
     reordering.

   - Optimize the NAPI skb cache, make it larger, use it in more paths.

   - Attempt returning Tx skbs to the originating CPU (like we already
     did for Rx skbs).

   - Various data structure layout and prefetch optimizations from Eric.

   - Remove ktime_get() from the recvmsg() fast path, ktime_get() is
     sadly quite expensive on recent AMD machines.

   - Extend threaded NAPI polling to allow the kthread busy poll for
     packets.

   - Make MPTCP use Rx backlog processing. This lowers the lock
     pressure, improving the Rx performance.

   - Support memcg accounting of MPTCP socket memory.

   - Allow admin to opt sockets out of global protocol memory accounting
     (using a sysctl or BPF-based policy). The global limits are a poor
     fit for modern container workloads, where limits are imposed using
     cgroups.

   - Improve heuristics for when to kick off AF_UNIX garbage collection.

   - Allow users to control TCP SACK compression, and default to 33% of
     RTT.

   - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
     unnecessarily aggressive rcvbuf growth and overshot when the
     connection RTT is low.

   - Preserve skb metadata space across skb_push / skb_pull operations.

   - Support for IPIP encapsulation in the nftables flowtable offload.

   - Support appending IP interface information to ICMP messages (RFC
     5837).

   - Support setting max record size in TLS (RFC 8449).

   - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.

   - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.

   - Let users configure the number of write buffers in SMC.

   - Add new struct sockaddr_unsized for sockaddr of unknown length,
     from Kees.

   - Some conversions away from the crypto_ahash API, from Eric Biggers.

   - Some preparations for slimming down struct page.

   - YAML Netlink protocol spec for WireGuard.

   - Add a tool on top of YAML Netlink specs/lib for reporting commonly
     computed derived statistics and summarized system state.

  Driver API:

   - Add CAN XL support to the CAN Netlink interface.

   - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
     defined by the OPEN Alliance's "Advanced diagnostic features for
     100BASE-T1 automotive Ethernet PHYs" specification.

   - Add DPLL phase-adjust-gran pin attribute (and implement it in
     zl3073x).

   - Refactor xfrm_input lock to reduce contention when NIC offloads
     IPsec and performs RSS.

   - Add info to devlink params whether the current setting is the
     default or a user override. Allow resetting back to default.

   - Add standard device stats for PSP crypto offload.

   - Leverage DSA frame broadcast to implement simple HSR frame
     duplication for a lot of switches without dedicated HSR offload.

   - Add uAPI defines for 1.6Tbps link modes.

  Device drivers:

   - Add Motorcomm YT921x gigabit Ethernet switch support.

   - Add MUCSE driver for N500/N210 1GbE NIC series.

   - Convert drivers to support dedicated ops for timestamping control,
     and away from the direct IOCTL handling. While at it support GET
     operations for PHY timestamping.

   - Add (and convert most drivers to) a dedicated ethtool callback for
     reading the Rx ring count.

   - Significant refactoring efforts in the STMMAC driver, which
     supports Synopsys turn-key MAC IP integrated into a ton of SoCs.

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support PPS in/out on all pins
      - Intel (100G, ice, idpf):
         - ice: implement standard ethtool and timestamping stats
         - i40e: support setting the max number of MAC addresses per VF
         - iavf: support RSS of GTP tunnels for 5G and LTE deployments
      - nVidia/Mellanox (mlx5):
         - reduce downtime on interface reconfiguration
         - disable being an XDP redirect target by default (same as
           other drivers) to avoid wasting resources if feature is
           unused
      - Meta (fbnic):
         - add support for Linux-managed PCS on 25G, 50G, and 100G links
      - Wangxun:
         - support Rx descriptor merge, and Tx head writeback
         - support Rx coalescing offload
         - support 25G SPF and 40G QSFP modules

   - Ethernet virtual:
      - Google (gve):
         - allow ethtool to configure rx_buf_len
         - implement XDP HW RX Timestamping support for DQ descriptor
           format
      - Microsoft vNIC (mana):
         - support HW link state events
         - handle hardware recovery events when probing the device

   - Ethernet NICs consumer, and embedded:
      - usbnet: add support for Byte Queue Limits (BQL)
      - AMD (amd-xgbe):
         - add device selftests
      - NXP (enetc):
         - add i.MX94 support
      - Broadcom integrated MACs (bcmgenet, bcmasp):
         - bcmasp: add support for PHY-based Wake-on-LAN
      - Broadcom switches (b53):
         - support port isolation
         - support BCM5389/97/98 and BCM63XX ARL formats
      - Lantiq/MaxLinear switches:
         - support bridge FDB entries on the CPU port
         - use regmap for register access
         - allow user to enable/disable learning
         - support Energy Efficient Ethernet
         - support configuring RMII clock delays
         - add tagging driver for MaxLinear GSW1xx switches
      - Synopsys (stmmac):
         - support using the HW clock in free running mode
         - add Eswin EIC7700 support
         - add Rockchip RK3506 support
         - add Altera Agilex5 support
      - Cadence (macb):
         - cleanup and consolidate descriptor and DMA address handling
         - add EyeQ5 support
      - TI:
         - icssg-prueth: support AF_XDP
      - Airoha access points:
         - add missing Ethernet stats and link state callback
         - add AN7583 support
         - support out-of-order Tx completion processing
      - Power over Ethernet:
         - pd692x0: preserve PSE configuration across reboots
         - add support for TPS23881B devices

   - Ethernet PHYs:
      - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
      - Support 50G SerDes and 100G interfaces in Linux-managed PHYs
      - micrel:
         - support for non PTP SKUs of lan8814
         - enable in-band auto-negotiation on lan8814
      - realtek:
         - cable testing support on RTL8224
         - interrupt support on RTL8221B
      - motorcomm: support for PHY LEDs on YT853
      - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
      - mscc: support for PHY LED control

   - CAN drivers:
      - m_can: add support for optional reset and system wake up
      - remove can_change_mtu() obsoleted by core handling
      - mcp251xfd: support GPIO controller functionality

   - Bluetooth:
      - add initial support for PASTa

   - WiFi:
      - split ieee80211.h file, it's way too big
      - improvements in VHT radiotap reporting, S1G, Channel Switch
        Announcement handling, rate tracking in mesh networks
      - improve multi-radio monitor mode support, and add a cfg80211
        debugfs interface for it
      - HT action frame handling on 6 GHz
      - initial chanctx work towards NAN
      - MU-MIMO sniffer improvements

   - WiFi drivers:
      - RealTek (rtw89):
         - support USB devices RTL8852AU and RTL8852CU
         - initial work for RTL8922DE
         - improved injection support
      - Intel:
         - iwlwifi: new sniffer API support
      - MediaTek (mt76):
         - WED support for >32-bit DMA
         - airoha NPU support
         - regdomain improvements
         - continued WiFi7/MLO work
      - Qualcomm/Atheros:
         - ath10k: factory test support
         - ath11k: TX power insertion support
         - ath12k: BSS color change support
         - ath12k: statistics improvements
      - brcmfmac: Acer A1 840 tablet quirk
      - rtl8xxxu: 40 MHz connection fixes/support"

* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
  net: page_pool: sanitise allocation order
  net: page pool: xa init with destroy on pp init
  net/mlx5e: Support XDP target xmit with dummy program
  net/mlx5e: Update XDP features in switch channels
  selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
  net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
  wireguard: netlink: generate netlink code
  wireguard: uapi: generate header with ynl-gen
  wireguard: uapi: move flag enums
  wireguard: uapi: move enum wg_cmd
  wireguard: netlink: add YNL specification
  selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
  selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
  selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
  selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
  selftests: drv-net: introduce Iperf3Runner for measurement use cases
  selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
  net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
  Documentation: net: dsa: mention simple HSR offload helpers
  Documentation: net: dsa: mention availability of RedBox
  ...
This commit is contained in:
Linus Torvalds
2025-12-03 17:24:33 -08:00
1652 changed files with 58109 additions and 24036 deletions

View File

@@ -733,19 +733,7 @@ static void unix_release_sock(struct sock *sk, int embrion)
/* ---- Socket is dead now and most probably destroyed ---- */
/*
* Fixme: BSD difference: In BSD all sockets connected to us get
* ECONNRESET and we die on the spot. In Linux we behave
* like files and pipes do and wait for the last
* dereference.
*
* Can't we simply set sock->err?
*
* What the above comment does talk about? --ANK(980817)
*/
if (READ_ONCE(unix_tot_inflight))
unix_gc(); /* Garbage collect fds */
unix_schedule_gc(NULL);
}
struct unix_peercred {
@@ -854,8 +842,8 @@ out:
}
static int unix_release(struct socket *);
static int unix_bind(struct socket *, struct sockaddr *, int);
static int unix_stream_connect(struct socket *, struct sockaddr *,
static int unix_bind(struct socket *, struct sockaddr_unsized *, int);
static int unix_stream_connect(struct socket *, struct sockaddr_unsized *,
int addr_len, int flags);
static int unix_socketpair(struct socket *, struct socket *);
static int unix_accept(struct socket *, struct socket *, struct proto_accept_arg *arg);
@@ -877,7 +865,7 @@ static int unix_dgram_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_dgram_recvmsg(struct socket *, struct msghdr *, size_t, int);
static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
static int unix_dgram_connect(struct socket *, struct sockaddr *,
static int unix_dgram_connect(struct socket *, struct sockaddr_unsized *,
int, int);
static int unix_seqpacket_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_seqpacket_recvmsg(struct socket *, struct msghdr *, size_t,
@@ -1468,7 +1456,7 @@ out:
return err;
}
static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
static int unix_bind(struct socket *sock, struct sockaddr_unsized *uaddr, int addr_len)
{
struct sockaddr_un *sunaddr = (struct sockaddr_un *)uaddr;
struct sock *sk = sock->sk;
@@ -1514,7 +1502,7 @@ static void unix_state_double_unlock(struct sock *sk1, struct sock *sk2)
unix_state_unlock(sk2);
}
static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
static int unix_dgram_connect(struct socket *sock, struct sockaddr_unsized *addr,
int alen, int flags)
{
struct sockaddr_un *sunaddr = (struct sockaddr_un *)addr;
@@ -1633,7 +1621,7 @@ static long unix_wait_for_peer(struct sock *other, long timeo)
return timeo;
}
static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uaddr,
int addr_len, int flags)
{
struct sockaddr_un *sunaddr = (struct sockaddr_un *)uaddr;
@@ -2101,8 +2089,6 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
if (err < 0)
return err;
wait_for_unix_gc(scm.fp);
if (msg->msg_flags & MSG_OOB) {
err = -EOPNOTSUPP;
goto out;
@@ -2396,8 +2382,6 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
if (err < 0)
return err;
wait_for_unix_gc(scm.fp);
if (msg->msg_flags & MSG_OOB) {
err = -EOPNOTSUPP;
#if IS_ENABLED(CONFIG_AF_UNIX_OOB)

View File

@@ -24,14 +24,12 @@ struct unix_skb_parms {
#define UNIXCB(skb) (*(struct unix_skb_parms *)&((skb)->cb))
/* GC for SCM_RIGHTS */
extern unsigned int unix_tot_inflight;
void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver);
void unix_del_edges(struct scm_fp_list *fpl);
void unix_update_edges(struct unix_sock *receiver);
int unix_prepare_fpl(struct scm_fp_list *fpl);
void unix_destroy_fpl(struct scm_fp_list *fpl);
void unix_gc(void);
void wait_for_unix_gc(struct scm_fp_list *fpl);
void unix_schedule_gc(struct user_struct *user);
/* SOCK_DIAG */
long unix_inq_len(struct sock *sk);

View File

@@ -121,8 +121,13 @@ static struct unix_vertex *unix_edge_successor(struct unix_edge *edge)
return edge->successor->vertex;
}
static bool unix_graph_maybe_cyclic;
static bool unix_graph_grouped;
enum {
UNIX_GRAPH_NOT_CYCLIC,
UNIX_GRAPH_MAYBE_CYCLIC,
UNIX_GRAPH_CYCLIC,
};
static unsigned char unix_graph_state;
static void unix_update_graph(struct unix_vertex *vertex)
{
@@ -132,8 +137,7 @@ static void unix_update_graph(struct unix_vertex *vertex)
if (!vertex)
return;
unix_graph_maybe_cyclic = true;
unix_graph_grouped = false;
WRITE_ONCE(unix_graph_state, UNIX_GRAPH_MAYBE_CYCLIC);
}
static LIST_HEAD(unix_unvisited_vertices);
@@ -196,7 +200,6 @@ static void unix_free_vertices(struct scm_fp_list *fpl)
}
static DEFINE_SPINLOCK(unix_gc_lock);
unsigned int unix_tot_inflight;
void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
{
@@ -222,7 +225,6 @@ void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
} while (i < fpl->count_unix);
receiver->scm_stat.nr_unix_fds += fpl->count_unix;
WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + fpl->count_unix);
out:
WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight + fpl->count);
@@ -253,7 +255,6 @@ void unix_del_edges(struct scm_fp_list *fpl)
receiver = fpl->edges[0].successor;
receiver->scm_stat.nr_unix_fds -= fpl->count_unix;
}
WRITE_ONCE(unix_tot_inflight, unix_tot_inflight - fpl->count_unix);
out:
WRITE_ONCE(fpl->user->unix_inflight, fpl->user->unix_inflight - fpl->count);
@@ -299,6 +300,8 @@ int unix_prepare_fpl(struct scm_fp_list *fpl)
if (!fpl->edges)
goto err;
unix_schedule_gc(fpl->user);
return 0;
err:
@@ -404,9 +407,11 @@ static bool unix_scc_cyclic(struct list_head *scc)
static LIST_HEAD(unix_visited_vertices);
static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_index,
struct sk_buff_head *hitlist)
static unsigned long __unix_walk_scc(struct unix_vertex *vertex,
unsigned long *last_index,
struct sk_buff_head *hitlist)
{
unsigned long cyclic_sccs = 0;
LIST_HEAD(vertex_stack);
struct unix_edge *edge;
LIST_HEAD(edge_stack);
@@ -497,8 +502,8 @@ prev_vertex:
if (unix_vertex_max_scc_index < vertex->scc_index)
unix_vertex_max_scc_index = vertex->scc_index;
if (!unix_graph_maybe_cyclic)
unix_graph_maybe_cyclic = unix_scc_cyclic(&scc);
if (unix_scc_cyclic(&scc))
cyclic_sccs++;
}
list_del(&scc);
@@ -507,13 +512,17 @@ prev_vertex:
/* Need backtracking ? */
if (!list_empty(&edge_stack))
goto prev_vertex;
return cyclic_sccs;
}
static unsigned long unix_graph_cyclic_sccs;
static void unix_walk_scc(struct sk_buff_head *hitlist)
{
unsigned long last_index = UNIX_VERTEX_INDEX_START;
unsigned long cyclic_sccs = 0;
unix_graph_maybe_cyclic = false;
unix_vertex_max_scc_index = UNIX_VERTEX_INDEX_START;
/* Visit every vertex exactly once.
@@ -523,18 +532,20 @@ static void unix_walk_scc(struct sk_buff_head *hitlist)
struct unix_vertex *vertex;
vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
__unix_walk_scc(vertex, &last_index, hitlist);
cyclic_sccs += __unix_walk_scc(vertex, &last_index, hitlist);
}
list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
swap(unix_vertex_unvisited_index, unix_vertex_grouped_index);
unix_graph_grouped = true;
WRITE_ONCE(unix_graph_cyclic_sccs, cyclic_sccs);
WRITE_ONCE(unix_graph_state,
cyclic_sccs ? UNIX_GRAPH_CYCLIC : UNIX_GRAPH_NOT_CYCLIC);
}
static void unix_walk_scc_fast(struct sk_buff_head *hitlist)
{
unix_graph_maybe_cyclic = false;
unsigned long cyclic_sccs = unix_graph_cyclic_sccs;
while (!list_empty(&unix_unvisited_vertices)) {
struct unix_vertex *vertex;
@@ -551,34 +562,38 @@ static void unix_walk_scc_fast(struct sk_buff_head *hitlist)
scc_dead = unix_vertex_dead(vertex);
}
if (scc_dead)
if (scc_dead) {
cyclic_sccs--;
unix_collect_skb(&scc, hitlist);
else if (!unix_graph_maybe_cyclic)
unix_graph_maybe_cyclic = unix_scc_cyclic(&scc);
}
list_del(&scc);
}
list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
WRITE_ONCE(unix_graph_cyclic_sccs, cyclic_sccs);
WRITE_ONCE(unix_graph_state,
cyclic_sccs ? UNIX_GRAPH_CYCLIC : UNIX_GRAPH_NOT_CYCLIC);
}
static bool gc_in_progress;
static void __unix_gc(struct work_struct *work)
static void unix_gc(struct work_struct *work)
{
struct sk_buff_head hitlist;
struct sk_buff *skb;
spin_lock(&unix_gc_lock);
if (!unix_graph_maybe_cyclic) {
if (unix_graph_state == UNIX_GRAPH_NOT_CYCLIC) {
spin_unlock(&unix_gc_lock);
goto skip_gc;
}
__skb_queue_head_init(&hitlist);
if (unix_graph_grouped)
if (unix_graph_state == UNIX_GRAPH_CYCLIC)
unix_walk_scc_fast(&hitlist);
else
unix_walk_scc(&hitlist);
@@ -595,36 +610,27 @@ skip_gc:
WRITE_ONCE(gc_in_progress, false);
}
static DECLARE_WORK(unix_gc_work, __unix_gc);
static DECLARE_WORK(unix_gc_work, unix_gc);
void unix_gc(void)
#define UNIX_INFLIGHT_SANE_USER (SCM_MAX_FD * 8)
void unix_schedule_gc(struct user_struct *user)
{
WRITE_ONCE(gc_in_progress, true);
queue_work(system_dfl_wq, &unix_gc_work);
}
#define UNIX_INFLIGHT_TRIGGER_GC 16000
#define UNIX_INFLIGHT_SANE_USER (SCM_MAX_FD * 8)
void wait_for_unix_gc(struct scm_fp_list *fpl)
{
/* If number of inflight sockets is insane,
* force a garbage collect right now.
*
* Paired with the WRITE_ONCE() in unix_inflight(),
* unix_notinflight(), and __unix_gc().
*/
if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
!READ_ONCE(gc_in_progress))
unix_gc();
if (READ_ONCE(unix_graph_state) == UNIX_GRAPH_NOT_CYCLIC)
return;
/* Penalise users who want to send AF_UNIX sockets
* but whose sockets have not been received yet.
*/
if (!fpl || !fpl->count_unix ||
READ_ONCE(fpl->user->unix_inflight) < UNIX_INFLIGHT_SANE_USER)
if (user &&
READ_ONCE(user->unix_inflight) < UNIX_INFLIGHT_SANE_USER)
return;
if (READ_ONCE(gc_in_progress))
if (!READ_ONCE(gc_in_progress)) {
WRITE_ONCE(gc_in_progress, true);
queue_work(system_dfl_wq, &unix_gc_work);
}
if (user && READ_ONCE(unix_graph_cyclic_sccs))
flush_work(&unix_gc_work);
}