mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

10127 Commits

Author SHA1 Message Date
Suchit Karunakaran
8b7ab8eb52 net: stream: add description for sk_stream_write_space()
Add a proper description for the sk_stream_write_space() function, which
was previously marked with a FIXME comment.
No functional changes.

Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Link: https://patch.msgid.link/20250716153404.7385-1-suchitkarunakaran@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18 16:57:21 -07:00
Kuniyuki Iwashima
dc2a27e524 neighbour: Update pneigh_entry in pneigh_create().
neigh_add() updates the pneigh_entry found or created by pneigh_create().

This update is serialised by RTNL, but we will remove RTNL from this path.

Let's move the update part into pneigh_create() and make it return an
errno instead of a pointer to pneigh_entry.

Now, the pneigh code is RTNL-free.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:22 -07:00
Kuniyuki Iwashima
13a936bb99 neighbour: Protect tbl->phash_buckets[] with a dedicated mutex.
tbl->phash_buckets[] is only modified in the slow path by pneigh_create()
and pneigh_delete() under the table lock.

Both of them are called under RTNL, so no extra lock is needed, but we
will remove RTNL from the paths.

pneigh_create() looks up a pneigh_entry, and this part can be lockless,
but that would complicate the logic, like:

  1. lookup
  2. allocate a pneigh_entry with GFP_KERNEL
  3. look up again, but under the lock
  4. if found, return it after freeing the allocated memory
  5. else, return the new one

Instead, let's add a per-table mutex and run lookup and allocation
under it.
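
A sketch of the resulting shape (the mutex name below is illustrative,
not necessarily what the patch uses):

  mutex_lock(&tbl->phash_lock);

  n = pneigh_lookup(tbl, net, pkey, dev);
  if (n)
          goto out;

  n = kzalloc(sizeof(*n) + key_len, GFP_KERNEL);  /* may sleep; fine under a mutex */
  if (!n)
          goto out;

  /* ... initialise the new entry and chain it to the bucket head ... */
  rcu_assign_pointer(tbl->phash_buckets[hash_val], n);
out:
  mutex_unlock(&tbl->phash_lock);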

Note that the pneigh_entry update in neigh_add() is still protected
by RTNL and will be moved to pneigh_create() in the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
b8b7ed1ea8 neighbour: Drop read_lock_bh(&tbl->lock) in pneigh_lookup().
Now, all callers of pneigh_lookup() are under RCU, and the read
lock there is no longer needed.

Let's drop the lock, inline __pneigh_lookup_1() into pneigh_lookup(),
and call it from pneigh_create().

The next patch will remove tbl->lock from pneigh_create().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
dd103c9a53 neighbour: Remove __pneigh_lookup().
__pneigh_lookup() is the lockless version of pneigh_lookup(),
but its only caller, pndisc_is_router(), holds the table lock and
reads pneigh_entry.flags.

This is because accessing pneigh_entry after pneigh_lookup() was
illegal unless the caller holds RTNL or the table lock.

Now, pneigh_entry is guaranteed to be alive during the RCU critical
section.

Let's call pneigh_lookup() and use READ_ONCE() for n->flags in
pndisc_is_router() and remove __pneigh_lookup().
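
The resulting shape of pndisc_is_router() is roughly the following
(the argument list of the split pneigh_lookup() is paraphrased):

  rcu_read_lock();
  n = pneigh_lookup(&nd_tbl, dev_net(dev), pkey, dev);
  if (n)
          ret = !!(READ_ONCE(n->flags) & NTF_ROUTER);
  rcu_read_unlock();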

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
b9c89fa128 neighbour: Use rcu_dereference() in pneigh_get_{first,next}().
Now pneigh_entry is guaranteed to be alive during the
RCU critical section even without holding tbl->lock.

Let's use rcu_dereference() in pneigh_get_{first,next}().

Note that neigh_seq_start() still holds tbl->lock for the
normal neighbour entry.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-12-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
32d5eaabf1 neighbour: Drop read_lock_bh(&tbl->lock) in pneigh_dump_table().
Now pneigh_entry is guaranteed to be alive during the
RCU critical section even without holding tbl->lock.

Let's drop read_lock_bh(&tbl->lock) and use rcu_dereference()
to iterate tbl->phash_buckets[] in pneigh_dump_table().
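
The iteration then takes roughly this form (sketch only; dump skip/index
handling omitted):

  rcu_read_lock();
  for (h = 0; h <= PNEIGH_HASHMASK; h++) {
          for (n = rcu_dereference(tbl->phash_buckets[h]); n;
               n = rcu_dereference(n->next)) {
                  if (pneigh_net(n) != net)
                          continue;
                  /* emit one RTM_NEWNEIGH message for this entry */
          }
  }
  rcu_read_unlock();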

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
ed6e380d2d neighbour: Convert RTM_GETNEIGH to RCU.
__dev_get_by_index() is the only RTNL-dependent call in neigh_get().

Let's replace it with dev_get_by_index_rcu() and convert RTM_GETNEIGH
to RCU.
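
Roughly (sketch; error handling trimmed):

  rcu_read_lock();

  dev = dev_get_by_index_rcu(net, ndm->ndm_ifindex);
  if (!dev) {
          err = -ENODEV;
          goto out;
  }

  /* look up the (p)neigh entry and fill the reply while still under RCU */
out:
  rcu_read_unlock();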

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
cc03492c7b neighbour: Annotate access to struct pneigh_entry.{flags,protocol}.
We will convert pneigh readers to RCU, and its flags and protocol
will be read locklessly.

Let's annotate the access to the two fields.
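
That is (sketch):

  WRITE_ONCE(pn->flags, flags);            /* writer, still serialised */
  WRITE_ONCE(pn->protocol, protocol);

  flags = READ_ONCE(pn->flags);            /* future lockless reader */
  protocol = READ_ONCE(pn->protocol);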

Note that all access to pn->permanent is under RTNL (neigh_add()
and pneigh_ifdown_and_unlock()), so WRITE_ONCE() and READ_ONCE()
are not needed.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
d539d8fbd8 neighbour: Free pneigh_entry after RCU grace period.
We will convert RTM_GETNEIGH to RCU.

neigh_get() looks up pneigh_entry by pneigh_lookup() and passes
it to pneigh_fill_info().

Then, we must ensure that the entry is alive till pneigh_fill_info()
completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not
guarantee that.

Also, we will convert all readers of tbl->phash_buckets[] to RCU.

Let's use call_rcu() to free pneigh_entry and update phash_buckets[]
and ->next by rcu_assign_pointer().
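
A sketch of the call_rcu() part (helper name and exact field placement
are illustrative):

  struct pneigh_entry {
          struct pneigh_entry __rcu *next;
          /* ... existing fields ... */
          struct rcu_head rcu;
          u32 key[];
  };

  static void pneigh_destroy(struct rcu_head *rcu)
  {
          /* release of the device reference etc. omitted */
          kfree(container_of(rcu, struct pneigh_entry, rcu));
  }

  /* in place of a direct kfree() when an entry is unlinked */
  call_rcu(&n->rcu, pneigh_destroy);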

pneigh_ifdown_and_unlock() uses list_head to avoid overwriting
->next and moving RCU iterators to another list.

pndisc_destructor() (only IPv6 ndisc uses this) takes a mutex, so it
is not deferred to call_rcu(), where we cannot sleep.  This is fine
because the mcast code works with RCU and ipv6_dev_mc_dec() frees
mcast objects after an RCU grace period.

While at it, we change the return type of pneigh_ifdown_and_unlock()
to void.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
d63382aea7 neighbour: Annotate neigh_table.phash_buckets and pneigh_entry.next with __rcu.
The next patch will free pneigh_entry with call_rcu().

Then, we need to annotate neigh_table.phash_buckets[] and
pneigh_entry.next with __rcu.

To make the next patch cleaner, let's annotate the fields in advance.

Currently, all accesses to the fields are under the neigh table lock,
so rcu_dereference_protected() is used with 1 for now, but most of them
(except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be
replaced with rcu_dereference() and rcu_dereference_check().
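
That is, the pointers become (declarations paraphrased):

  struct pneigh_entry __rcu **phash_buckets;      /* in struct neigh_table */
  struct pneigh_entry __rcu *next;                /* in struct pneigh_entry */

and every current access is still serialised by the table lock, hence
the "1":

  n = rcu_dereference_protected(tbl->phash_buckets[hash_val], 1);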

Note that pneigh_ifdown_and_unlock() changes pneigh_entry.next to a
local list, which is illegal because the RCU iterator could be moved
to another list.  This part will be fixed in the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
e804bd83c1 neighbour: Split pneigh_lookup().
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which
is confusing.

When called with its last argument, creat, set to 0, pneigh_lookup()
literally looks up a proxy neighbour entry.  This is the case for the
reader paths: the fast path and RTM_GETNEIGH.

pneigh_lookup(), however, creates a pneigh_entry when called with creat
set to 1 from RTM_NEWNEIGH and SIOCSARP, which require RTNL.

Let's split pneigh_lookup() into two functions.
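
Roughly, the split gives (prototypes paraphrased from the log):

  /* reader paths: fast path and RTM_GETNEIGH */
  struct pneigh_entry *pneigh_lookup(struct neigh_table *tbl, struct net *net,
                                     const void *pkey, struct net_device *dev);

  /* writer paths: RTM_NEWNEIGH and SIOCSARP, which require RTNL */
  struct pneigh_entry *pneigh_create(struct neigh_table *tbl, struct net *net,
                                     const void *pkey, struct net_device *dev);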

We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock)
in the new pneigh_lookup() will be dropped.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
0e5ac19c78 neighbour: Move neigh_find_table() to neigh_get().
neigh_valid_get_req() calls neigh_find_table() to fetch neigh_tables[].

neigh_find_table() uses rcu_dereference_rtnl(), but RTNL actually does
not protect it at all; neigh_table_clear() can be called without RTNL
and only waits for RCU readers by synchronize_rcu().

Fortunately, there is no bug because IPv4 is built-in, IPv6 cannot be
unloaded, and DECNET was removed.

To fetch neigh_tables[] by rcu_dereference() later, let's move
neigh_find_table() from neigh_valid_get_req() to neigh_get().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
3dfe0b57dc neighbour: Allocate skb in neigh_get().
We will remove RTNL for neigh_get() and run it under RCU instead.

neigh_get_reply() and pneigh_get_reply() allocate skb with GFP_KERNEL.

Let's move the allocation before __dev_get_by_index() in neigh_get().

Now, neigh_get_reply() and pneigh_get_reply() are inlined and
rtnl_unicast() is factorised.

We will convert pneigh_lookup() to __pneigh_lookup() later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
f5046fbc1b neighbour: Move two validations from neigh_get() to neigh_valid_get_req().
We will remove RTNL for neigh_get() and run it under RCU instead.

neigh_get() returns -EINVAL in the following cases:

  * NDA_DST is not specified
  * Both ndm->ndm_ifindex and NTF_PROXY are not specified

These validations do not require RCU.

Let's move them to neigh_valid_get_req().

While at it, the extack string for the first case is replaced with
NL_SET_ERR_ATTR_MISS().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
caf0a753a8 neighbour: Make neigh_valid_get_req() return ndmsg.
neigh_get() passes 4 local variable pointers to neigh_valid_get_req().

If it returns a pointer to struct ndmsg, we do not need to pass two
of them.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Samiullah Khawaja
2677010e77 Add support to set NAPI threaded for individual NAPI
A net device has a threaded sysctl that can be used to enable threaded
NAPI polling on all of the NAPI contexts under that device. Allow
enabling threaded NAPI polling at individual NAPI level using netlink.

Extend the netlink operation `napi-set` and allow setting the threaded
attribute of a NAPI. This will enable the threaded polling on a NAPI
context.

Add a test in `nl_netdev.py` that verifies various cases of threaded
NAPI being set at NAPI and at device level.

Tested:
 ./tools/testing/selftests/net/nl_netdev.py
 TAP version 13
 1..7
 ok 1 nl_netdev.empty_check
 ok 2 nl_netdev.lo_check
 ok 3 nl_netdev.page_pool_check
 ok 4 nl_netdev.napi_list_check
 ok 5 nl_netdev.dev_set_threaded
 ok 6 nl_netdev.napi_set_threaded
 ok 7 nl_netdev.nsim_rxq_reset_down
 # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250710211203.3979655-1-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14 18:02:37 -07:00
Kuniyuki Iwashima
2a683d0052 dev: Pass netdevice_tracker to dev_get_by_flags_rcu().
This is a follow-up for commit eb1ac9ff6c ("ipv6: anycast: Don't
hold RTNL for IPV6_JOIN_ANYCAST.").

We should not add a new device lookup API without netdevice_tracker.

Let's pass netdevice_tracker to dev_get_by_flags_rcu() and rename it
with netdev_ prefix to match other newer APIs.

Note that we always use GFP_ATOMIC for netdev_hold() as it's expected
to be called under RCU.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20250708184053.102109f6@kernel.org/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250711051120.2866855-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14 17:11:14 -07:00
Feng Yang
76d727ae02 skbuff: Add MSG_MORE flag to optimize tcp large packet transmission
When using sockmap for forwarding, the average latency for different packet sizes
after sending 10,000 packets is as follows:
size    old(us)         new(us)
512     56              55
1472    58              58
1600    106             81
3000    145             105
5000    182             125

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250708054053.39551-1-yangfeng59949@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09 19:25:57 -07:00
Kuniyuki Iwashima
eb1ac9ff6c ipv6: anycast: Don't hold RTNL for IPV6_JOIN_ANYCAST.
inet6_sk(sk)->ipv6_ac_list is protected by lock_sock().

In ipv6_sock_ac_join(), only __dev_get_by_index(), __dev_get_by_flags(),
and __in6_dev_get() require RTNL.

__dev_get_by_flags() is only used by ipv6_sock_ac_join() and can be
converted to RCU version.

Let's replace it with the RCU version of the helper and drop RTNL from
IPV6_JOIN_ANYCAST.

setsockopt_needs_rtnl() will be removed in the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-15-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 18:32:39 -07:00
Jakub Kicinski
4e655028c2 net: ethtool: remove the compat code for _rxfh_context ops
All drivers are now converted to dedicated _rxfh_context ops.
Remove the use of ->set_rxfh() to manage additional contexts.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250707184115.2285277-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 11:56:40 -07:00
Fengyuan Gong
a41851bea7 net: account for encap headers in qdisc pkt len
Refine qdisc_pkt_len_init() to include headers up through the inner
transport header when computing the header size for encapsulations.
Also refine the code in net/sched/sch_cake.c that was borrowed from
qdisc_pkt_len_init().

Signed-off-by: Fengyuan Gong <gfengyuan@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250702160741.1204919-1-gfengyuan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 08:55:33 -07:00
Michal Luczaj
ab34e14258 net: skbuff: Drop unused @skb
Since its introduction in commit 6fa01ccd88 ("skbuff: Add pskb_extract()
helper function"), pskb_carve_frag_list() never used the argument @skb.
Drop it and adapt the only caller.

No functional change intended.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 08:37:22 -07:00
Michal Luczaj
ad0ac6cd9c net: skbuff: Drop unused @skb
Since its introduction in commit ce098da149 ("skbuff: Introduce
slab_build_skb()"), __slab_build_skb() never used the @skb argument. Remove
it and adapt both callers.

No functional change intended.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 08:37:22 -07:00
Michal Luczaj
25489a4f55 net: splice: Drop unused @gfp
Since its introduction in commit 2e910b9532 ("net: Add a function to
splice pages into an skbuff for MSG_SPLICE_PAGES"), skb_splice_from_iter()
never used the @gfp argument. Remove it and adapt callers.

No functional change intended.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-2-55f68b60d2b7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 08:37:15 -07:00
Michal Luczaj
1024f12071 net: splice: Drop unused @pipe
Since commit 41c73a0d44 ("net: speedup skb_splice_bits()"),
__splice_segment() and spd_fill_page() do not use the @pipe argument. Drop
it.

While adapting the callers, move one line to enforce reverse xmas tree
order.

No functional change intended.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-1-55f68b60d2b7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 08:37:14 -07:00
Breno Leitao
eb4e773f13 netpoll: move Ethernet setup to push_eth() helper
Refactor Ethernet header population into dedicated function, completing
the layered abstraction with:

- push_eth() for link layer
- push_udp() for transport
- push_ipv4()/push_ipv6() for network
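
With that, the body of netpoll_send_udp() boils down to a sketch like
the following (argument lists illustrative):

  push_udp(np, skb, len);
  if (np->ipv6)
          push_ipv6(np, skb, len);
  else
          push_ipv4(np, skb, len);
  push_eth(np, skb);

  netpoll_send_skb(np, skb);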

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-6-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:52:56 -07:00
Breno Leitao
cacfb1f4e9 netpoll: factor out UDP header setup into push_udp() helper
Move UDP header construction from netpoll_send_udp() into a new
static helper function push_udp(). This completes the protocol
layer refactoring by:

1. Creating a dedicated helper for UDP header assembly
2. Removing UDP-specific logic from the main send function
3. Establishing a consistent pattern with existing IPv4/IPv6 helpers:
   - push_udp()
   - push_ipv4()
   - push_ipv6()

The change improves code organization and maintains the encapsulation
pattern established in previous refactorings.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-5-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:52:56 -07:00
Breno Leitao
8c27639dbe netpoll: factor out IPv4 header setup into push_ipv4() helper
Move IPv4 header construction from netpoll_send_udp() into a new
static helper function push_ipv4(). This completes the refactoring
started with IPv6 header handling, creating symmetric helper functions
for both IP versions.

Changes include:
1. Extracting IPv4 header setup logic into push_ipv4()
2. Replacing inline IPv4 code with helper call
3. Moving eth assignment after helper calls for consistency

The refactoring reduces code duplication and improves maintainability
by isolating IP version-specific logic.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-4-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:52:56 -07:00
Breno Leitao
839388f39a netpoll: factor out IPv6 header setup into push_ipv6() helper
Move IPv6 header construction from netpoll_send_udp() into a new
static helper function, push_ipv6(). This refactoring reduces code
duplication and improves readability in netpoll_send_udp().

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-3-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:52:55 -07:00
Breno Leitao
01dae7a61c netpoll: factor out UDP checksum calculation into helper
Extract UDP checksum calculation logic from netpoll_send_udp()
into a new static helper function netpoll_udp_checksum(). This
reduces code duplication and improves readability for both IPv4
and IPv6 cases.

No functional change intended.
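
A sketch of such a helper, derived from the previously open-coded logic
in netpoll_send_udp() (the exact prototype is an assumption):

  static void netpoll_udp_checksum(struct netpoll *np, struct sk_buff *skb,
                                   int len)
  {
          struct udphdr *udph = udp_hdr(skb);
          int udp_len = len + sizeof(struct udphdr);

          udph->check = 0;
          if (np->ipv6)
                  udph->check = csum_ipv6_magic(&np->local_ip.in6,
                                                &np->remote_ip.in6, udp_len,
                                                IPPROTO_UDP,
                                                csum_partial(udph, udp_len, 0));
          else
                  udph->check = csum_tcpudp_magic(np->local_ip.ip,
                                                  np->remote_ip.ip, udp_len,
                                                  IPPROTO_UDP,
                                                  csum_partial(udph, udp_len, 0));
          if (udph->check == 0)
                  udph->check = CSUM_MANGLED_0;
  }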

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-2-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:52:55 -07:00
Breno Leitao
4b52cdfcce netpoll: Improve code clarity with explicit struct size calculations
Replace pointer-dereference sizeof() operations with explicit struct names
for improved readability and maintainability. This change:

1. Replaces `sizeof(*udph)` with `sizeof(struct udphdr)`
2. Replaces `sizeof(*ip6h)` with `sizeof(struct ipv6hdr)`
3. Replaces `sizeof(*iph)` with `sizeof(struct iphdr)`

This will make it easy to move code in the upcoming patches.

No functional changes are introduced by this patch.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-1-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:52:55 -07:00
Eric Dumazet
6058099da5 net: remove RTNL use for /proc/sys/net/core/rps_default_mask
Use a dedicated mutex instead.
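
That is, roughly (sketch; the mutex name is illustrative):

  static DEFINE_MUTEX(rps_default_mask_mutex);

  /* in the rps_default_mask sysctl handler, replacing rtnl_lock()/rtnl_unlock() */
  mutex_lock(&rps_default_mask_mutex);
  ...
  mutex_unlock(&rps_default_mask_mutex);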

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250702061558.1585870-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:42:12 -07:00
Byungchul Park
b56ce86846 page_pool: rename __page_pool_alloc_pages_slow() to __page_pool_alloc_netmems_slow()
Now that __page_pool_alloc_pages_slow() is for allocating netmem, not
struct page, rename it to __page_pool_alloc_netmems_slow() to reflect
what it does.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250702053256.4594-4-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:40:09 -07:00
Byungchul Park
4ad125ae38 page_pool: rename __page_pool_release_page_dma() to __page_pool_release_netmem_dma()
Now that __page_pool_release_page_dma() is for releasing netmem, not
struct page, rename it to __page_pool_release_netmem_dma() to reflect
what it does.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://patch.msgid.link/20250702053256.4594-3-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:40:09 -07:00
Byungchul Park
61a3324753 page_pool: rename page_pool_return_page() to page_pool_return_netmem()
Now that page_pool_return_page() is for returning netmem, not struct
page, rename it to page_pool_return_netmem() to reflect what it does.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://patch.msgid.link/20250702053256.4594-2-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:40:08 -07:00
Willem de Bruijn
d2527ad3a9 net: preserve MSG_ZEROCOPY with forwarding
MSG_ZEROCOPY data must be copied before it is queued to local sockets,
to avoid an indefinite timeout until the memory is released.

This test is performed by skb_orphan_frags_rx, which is called when
looping an egress skb to packet sockets, error queue or ingress path.

To preserve zerocopy for skbs that are looped to ingress but are then
forwarded to an egress device rather than delivered locally, defer
this last check until an skb enters the local IP receive path.

This is analogous to existing behavior of skb_clear_delivery_time.
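
A sketch of the deferred check (the exact spot in the receive path is
paraphrased from the description above):

  /* on entry to the local IP receive path */
  if (skb_orphan_frags_rx(skb, GFP_ATOMIC))
          goto drop;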

Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:07:16 -07:00
Eric Dumazet
88fe14253e net: dst: add four helpers to annotate data-races around dst->dev
dst->dev is read locklessly in many contexts,
and written in dst_dev_put().

Fixing all the races is going to need many changes.

We probably will have to add full RCU protection.

Add four helpers to ease this painful process.

static inline struct net_device *dst_dev(const struct dst_entry *dst)
{
       return READ_ONCE(dst->dev);
}

static inline struct net_device *skb_dst_dev(const struct sk_buff *skb)
{
       return dst_dev(skb_dst(skb));
}

static inline struct net *skb_dst_dev_net(const struct sk_buff *skb)
{
       return dev_net(skb_dst_dev(skb));
}

static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb)
{
       return dev_net_rcu(skb_dst_dev(skb));
}

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
2dce8c52a9 net: dst: annotate data-races around dst->output
dst_dev_put() can overwrite dst->output while other
CPUs might read this field (for instance from dst_output()).

Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
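
Concretely, the annotated accesses look like (sketch):

  /* writer, in dst_dev_put() */
  WRITE_ONCE(dst->output, dst_discard_out);

  /* lockless reader, in dst_output() */
  return READ_ONCE(skb_dst(skb)->output)(net, sk, skb);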

We will likely need RCU protection in the future.

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
f1c5fd3489 net: dst: annotate data-races around dst->input
dst_dev_put() can overwrite dst->input while other
CPUs might read this field (for instance from dst_input()).

Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.

We will likely need full RCU protection later.

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Eric Dumazet
8f2b2282d0 net: dst: annotate data-races around dst->lastuse
(dst_entry)->lastuse is read and written locklessly,
add corresponding annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Eric Dumazet
8a402bbe54 net: dst: annotate data-races around dst->obsolete
(dst_entry)->obsolete is read locklessly, add corresponding
annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Eric Dumazet
8308133741 tcp: move tcp_memory_allocated into net_aligned_data
The ____cacheline_aligned_in_smp attribute only makes sure a field is
aligned to a cache line.  It does not prevent the linker from using the
remainder of the cache line for other variables, causing potential
false sharing.

Move tcp_memory_allocated into a dedicated cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
Eric Dumazet
998642e999 net: move net_cookie into net_aligned_data
Using per-cpu data for net->net_cookie generation is overkill,
because even busy hosts do not create hundreds of netns per second.

Make sure to put net_cookie in a private cache line to avoid
potential false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
Eric Dumazet
3715b5df09 net: add struct net_aligned_data
This structure will hold networking data that must
consume a full cache line to avoid accidental false sharing.
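
Combined with the two patches above, the result is roughly the
following (field types paraphrased):

  struct net_aligned_data {
          atomic64_t      net_cookie           ____cacheline_aligned_in_smp;
  #if defined(CONFIG_INET)
          atomic_long_t   tcp_memory_allocated ____cacheline_aligned_in_smp;
  #endif
  };

  extern struct net_aligned_data net_aligned_data;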

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
RubenKelevra
21deb2d966 net: ieee8021q: fix insufficient table-size assertion
_Static_assert(ARRAY_SIZE(map) != IEEE8021Q_TT_MAX - 1) rejects only a
length of 7 and allows any other mismatch. Replace it with a strict
equality test via a helper macro so that every mapping table must have
exactly IEEE8021Q_TT_MAX (8) entries.
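
An illustrative form of the strict check (the actual macro name in the
patch may differ):

  #define ieee8021q_assert_map_size(map) \
          static_assert(ARRAY_SIZE(map) == IEEE8021Q_TT_MAX)
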

Signed-off-by: RubenKelevra <rubenkelevra@gmail.com>
Link: https://patch.msgid.link/20250626205907.1566384-1-rubenkelevra@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-01 12:55:49 +02:00
Eric Dumazet
aed4969f2b net: net->nsid_lock does not need BH safety
At the time of commit bc51dddf98 ("netns: avoid disabling irq
for netns id") peernet2id() was not yet using RCU.

Commit 2dce224f46 ("netns: protect netns
ID lookups with RCU") changed peernet2id() to no longer
acquire net->nsid_lock (potentially from BH context).

We do not need to block soft interrupts when acquiring
net->nsid_lock anymore.
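
That is (sketch):

  /* before */
  spin_lock_bh(&net->nsid_lock);
  ...
  spin_unlock_bh(&net->nsid_lock);

  /* after */
  spin_lock(&net->nsid_lock);
  ...
  spin_unlock(&net->nsid_lock);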

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Guillaume Nault <gnault@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250627163242.230866-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-30 18:58:20 -07:00
Ido Schimmel
03dc03fa04 neighbor: Add NTF_EXT_VALIDATED flag for externally validated entries
tl;dr
=====

Add a new neighbor flag ("extern_valid") that can be used to indicate to
the kernel that a neighbor entry was learned and determined to be valid
externally. The kernel will not try to remove or invalidate such an
entry, leaving these decisions to the user space control plane. This is
needed for EVPN multi-homing where a neighbor entry for a multi-homed
host needs to be synced across all the VTEPs among which the host is
multi-homed.

Background
==========

In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches (VTEPs). VTEPs that are connected to the same ES are called ES
peers.

When a neighbor entry is learned on a VTEP, it is distributed to both ES
peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers
use the neighbor entry when routing traffic towards the multi-homed host
and remote VTEPs use it for ARP/NS suppression.

Motivation
==========

If the ES link between a host and the VTEP on which the neighbor entry
was locally learned goes down, the EVPN MAC/IP advertisement route will
be withdrawn and the neighbor entries will be removed from both ES peers
and remote VTEPs. Routing towards the multi-homed host and ARP/NS
suppression can fail until another ES peer locally learns the neighbor
entry and distributes it via an EVPN MAC/IP advertisement route.

"draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these
intermittent failures by having the ES peers install the neighbor
entries as before, but also injecting EVPN MAC/IP advertisement routes
with a proxy indication. When the previously mentioned ES link goes down
and the original EVPN MAC/IP advertisement route is withdrawn, the ES
peers will not withdraw their neighbor entries, but instead start aging
timers for the proxy indication.

If an ES peer locally learns the neighbor entry (i.e., it becomes
"reachable"), it will restart its aging timer for the entry and emit an
EVPN MAC/IP advertisement route without a proxy indication. An ES peer
will stop its aging timer for the proxy indication if it observes the
removal of the proxy indication from at least one of the ES peers
advertising the entry.

In the event that the aging timer for the proxy indication expired, an
ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer
expired on all ES peers and they all withdrew their proxy
advertisements, the neighbor entry will be completely removed from the
EVPN fabric.

Implementation
==============

In the above scheme, when the control plane (e.g., FRR) advertises a
neighbor entry with a proxy indication, it expects the corresponding
entry in the data plane (i.e., the kernel) to remain valid and not be
removed due to garbage collection or loss of carrier. The control plane
also expects the kernel to notify it if the entry was learned locally
(i.e., became "reachable") so that it will remove the proxy indication
from the EVPN MAC/IP advertisement route. That is why these entries
cannot be programmed with dummy states such as "permanent" or "noarp".

Instead, add a new neighbor flag ("extern_valid") which indicates that
the entry was learned and determined to be valid externally and should
not be removed or invalidated by the kernel. The kernel can probe the
entry and notify user space when it becomes "reachable" (it is initially
installed as "stale"). However, if the kernel does not receive a
confirmation, have it return the entry to the "stale" state instead of
the "failed" state.

In other words, an entry marked with the "extern_valid" flag behaves
like any other dynamically learned entry other than the fact that the
kernel cannot remove or invalidate it.

One can argue that the "extern_valid" flag should not prevent garbage
collection and that instead a neighbor entry should be programmed with
both the "extern_valid" and "extern_learn" flags. There are two reasons
for not doing that:

1. Unclear why a control plane would like to program an entry that the
   kernel cannot invalidate but can completely remove.

2. The "extern_learn" flag is used by FRR for neighbor entries learned
   on remote VTEPs (for ARP/NS suppression) whereas here we are
   concerned with local entries. This distinction is currently irrelevant
   for the kernel, but might be relevant in the future.

Given that the flag only makes sense when the neighbor has a valid
state, reject attempts to add a neighbor with an invalid state and with
this flag set. For example:

 # ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid
 Error: Cannot create externally validated neighbor with an invalid state.
 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
 # ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid
 Error: Cannot mark neighbor as externally validated with an invalid state.

The above means that a neighbor cannot be created with the
"extern_valid" flag and flags such as "use" or "managed" as they result
in a neighbor being created with an invalid state ("none") and
immediately getting probed:

 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 Error: Cannot create externally validated neighbor with an invalid state.

However, these flags can be used together with "extern_valid" after the
neighbor was created with a valid state:

 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use

One consequence of preventing the kernel from invalidating a neighbor
entry is that by default it will only try to determine reachability
using unicast probes. This can be changed using the "mcast_resolicit"
sysctl:

 # sysctl net.ipv4.neigh.br0/10.mcast_resolicit
 0
 # tcpdump -nn -e -i br0.10 -Q out arp &
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 # sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28

iproute2 patches can be found here [2].

[1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03
[2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-30 18:14:23 -07:00
Jakub Kicinski
32155c6fd9 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCaF3LFhUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISINtRgD+JagJmBokoPnsk7DfauJnVhaP95aV
 tsnna+fU1kGwS7MBAMINCoLyeISiD/XG0O+Om38czhhglWbl4+TgrthegPkE
 =opKf
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2025-06-27

We've added 6 non-merge commits during the last 8 day(s) which contain
a total of 6 files changed, 120 insertions(+), 20 deletions(-).

The main changes are:

1) Fix RCU usage in task_cls_state() for BPF programs using helpers like
   bpf_get_cgroup_classid_curr() outside of networking, from Charalampos
   Mitrodimas.

2) Fix a sockmap race between map_update and a pending workqueue from
   an earlier map_delete freeing the old psock where both pointed to the
   same psock->sk, from Jiayuan Chen.

3) Fix a data corruption issue when using bpf_msg_pop_data() in kTLS which
   failed to recalculate the ciphertext length, also from Jiayuan Chen.

4) Remove xdp_redirect_map{,_err} trace events since they are unused and
   also hide XDP trace events under CONFIG_BPF_SYSCALL, from Steven Rostedt.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  xdp: tracing: Hide some xdp events under CONFIG_BPF_SYSCALL
  xdp: Remove unused events xdp_redirect_map and xdp_redirect_map_err
  net, bpf: Fix RCU usage in task_cls_state() for BPF programs
  selftests/bpf: Add test to cover ktls with bpf_msg_pop_data
  bpf, ktls: Fix data corruption when using bpf_msg_pop_data() in ktls
  bpf, sockmap: Fix psock incorrectly pointing to sk
====================

Link: https://patch.msgid.link/20250626230111.24772-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26 17:13:17 -07:00
Jakub Kicinski
28aa52b618 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.16-rc4).

Conflicts:

Documentation/netlink/specs/mptcp_pm.yaml
  9e6dd4c256 ("netlink: specs: mptcp: replace underscores with dashes in names")
  ec362192aa ("netlink: specs: fix up indentation errors")
https://lore.kernel.org/20250626122205.389c2cd4@canb.auug.org.au

Adjacent changes:

Documentation/netlink/specs/fou.yaml
  791a9ed0a4 ("netlink: specs: fou: replace underscores with dashes in names")
  880d43ca9a ("netlink: specs: clean up spaces in brackets")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26 10:40:50 -07:00