The number of SYN + MPC retransmissions before falling back to TCP was
fixed to 2. This is certainly a good default value, but having a fixed
number can be a problem in some environments.
The current behaviour means that if all packets are dropped, there will
be:
- The initial SYN + MPC
- 2 retransmissions with MPC
- The next ones will be without MPTCP.
So typically ~3 seconds before falling back to TCP. In some networks
where some temporally blackholes are unfortunately frequent, or when a
client tries to initiate connections while the network is not ready yet,
this can cause new connections not to have MPTCP connections.
In such environments, it is now possible to increase the number of SYN
retransmissions with MPTCP options to make sure MPTCP is used.
Interesting values are:
- 0: the first retransmission will be done without MPTCP options: quite
aggressive, but also a higher risk of detecting false-positive
MPTCP blackholes.
- >= 128: all SYN retransmissions will keep the MPTCP options: back to
the < 6.12 behaviour.
The default behaviour is not changed here.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250117-net-next-mptcp-syn_retrans_before_tcp_fallback-v1-1-ab4b187099b0@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Integrate with the standard infrastructure for reporting hardware packet
timestamping statistics.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250116104628.123555-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For packets with two-step timestamp requests, the hardware timestamp
comes back to the driver through a confirmation mechanism of sorts,
which allows the driver to confidently bump the successful "pkts"
counter.
For one-step PTP, the NIC is supposed to autonomously insert its
hardware TX timestamp in the packet headers while simultaneously
transmitting it. There may be a confirmation that this was done
successfully, or there may not.
None of the current drivers which implement ethtool_ops :: get_ts_stats()
also support HWTSTAMP_TX_ONESTEP_SYNC or HWTSTAMP_TX_ONESTEP_SYNC, so it
is a bit unclear which model to follow. But there are NICs, such as DSA,
where there is no transmit confirmation at all. Here, it would be wrong /
misleading to increment the successful "pkts" counter, because one-step
PTP packets can be dropped on TX just like any other packets.
So introduce a special counter which signifies "yes, an attempt was made,
but we don't know whether it also exited the port or not". I expect that
for one-step PTP packets where a confirmation is available, the "pkts"
counter would be bumped.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250116104628.123555-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The existing error message ("Invalid qdisc name") is confusing
because it suggests that there is no qdisc with the given name. In
fact, the name does refer to a valid qdisc, but it doesn't match
the kind of an existing qdisc being modified or replaced. The
new error message provides more detail to eliminate confusion.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20250116195642.2794-1-ouster@cs.stanford.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use inet_sk_dscp() to get the socket DSCP value as dscp_t, instead of
ip_sock_rt_tos() which returns a __u8. This will ease the conversion
of fl4->flowi4_tos to dscp_t, which now just becomes a matter of
dropping the inet_dscp_to_dsfield() call.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/208dc5ca28bb5595d7a545de026bba18b1d63bda.1737032802.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The following problem was encountered during stability test:
(NULL net_device): NAPI poll function process_backlog+0x0/0x530 \
returned 1, exceeding its budget of 0.
------------[ cut here ]------------
list_add double add: new=ffff88905f746f48, prev=ffff88905f746f48, \
next=ffff88905f746e40.
WARNING: CPU: 18 PID: 5462 at lib/list_debug.c:35 \
__list_add_valid_or_report+0xf3/0x130
CPU: 18 UID: 0 PID: 5462 Comm: ping Kdump: loaded Not tainted 6.13.0-rc7+
RIP: 0010:__list_add_valid_or_report+0xf3/0x130
Call Trace:
? __warn+0xcd/0x250
? __list_add_valid_or_report+0xf3/0x130
enqueue_to_backlog+0x923/0x1070
netif_rx_internal+0x92/0x2b0
__netif_rx+0x15/0x170
loopback_xmit+0x2ef/0x450
dev_hard_start_xmit+0x103/0x490
__dev_queue_xmit+0xeac/0x1950
ip_finish_output2+0x6cc/0x1620
ip_output+0x161/0x270
ip_push_pending_frames+0x155/0x1a0
raw_sendmsg+0xe13/0x1550
__sys_sendto+0x3bf/0x4e0
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x5b/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The reproduction command is as follows:
sysctl -w net.core.dev_weight=0
ping 127.0.0.1
This is because when the napi's weight is set to 0, process_backlog() may
return 0 and clear the NAPI_STATE_SCHED bit of napi->state, causing this
napi to be re-polled in net_rx_action() until __do_softirq() times out.
Since the NAPI_STATE_SCHED bit has been cleared, napi_schedule_rps() can
be retriggered in enqueue_to_backlog(), causing this issue.
Making the napi's weight always non-zero solves this problem.
Triggering this issue requires system-wide admin (setting is
not namespaced).
Fixes: e387660545 ("[NET]: Fix sysctl net.core.dev_weight")
Fixes: 3d48b53fb2 ("net: dev_weight: TX/RX orthogonality")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Link: https://patch.msgid.link/20250116143053.4146855-1-liujian56@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reduce duplicate code by using netlink helpers which return the
soft/hard interface directly. Instead of returning an interface index
which we are typically not interested in.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
MEM_WRITE attribute is defined as: "Non-presence of MEM_WRITE means that
MEM is only being read". bpf_load_hdr_opt() both reads and writes from
its arg2 - void *search_res.
This matters a lot for the next commit where we more precisely track
stack accesses. Without this annotation, the verifier will make false
assumptions about the contents of memory written to by helpers and
possibly prune valid branches.
Fixes: 6fad274f06 ("bpf: Add MEM_WRITE attribute")
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/730e45f8c39be2a5f3d8c4406cceca9d574cbf14.1736886479.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Basically, dev_ifsioc() operates on the passed single netns (except
for netdev notifier chains with lower/upper devices for which we will
need more changes).
Let's hold rtnl_net_lock() for dev_ifsioc().
Now that NETDEV_CHANGENAME is always triggered under rtnl_net_lock()
of the device's netns. (do_setlink() and dev_ifsioc())
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
devnet_rename_sem is no longer used since commit
0840556e5a ("net: Protect dev->name by seqlock.").
Also, RTNL serialises dev_change_name().
Let's remove devnet_rename_sem.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cited commit forgot to add netdev_rename_lock in one of the
error paths in dev_change_name().
Let's hold netdev_rename_lock before restoring the old dev->name.
Fixes: 0840556e5a ("net: Protect dev->name by seqlock.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The tool pp_alloc_fail.py tested error recovery by injecting errors
into the function page_pool_alloc_pages(). The page pool allocation
function page_pool_dev_alloc() does not end up calling
page_pool_alloc_pages(). page_pool_alloc_netmems() seems to be the
function that is called by all of the page pool alloc functions in
the API, so move error injection to that function instead.
Signed-off-by: John Daley <johndale@cisco.com>
Link: https://patch.msgid.link/20250115181312.3544-2-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Store rtm->rtm_tos in a dscp_t variable, which can then be used for
setting fl4.flowi4_tos and also be passed as parameter of
ip_route_input_rcu().
The .flowi4_tos field is going to be converted to dscp_t to ensure ECN
bits aren't erroneously taken into account during route lookups. Having
a dscp_t variable available will simplify that conversion, as we'll
just have to drop the inet_dscp_to_dsfield() call.
Note that we can't just convert rtm->rtm_tos to dscp_t because this
structure is exported to user space.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/7bc1c7dc47ad1393569095d334521fae59af5bc7.1736944951.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use ip4h_dscp() to get the tunnel DSCP option as dscp_t, instead of
manually masking the raw tos field with INET_DSCP_MASK. This will ease
the conversion of fl4->flowi4_tos to dscp_t, which just becomes a
matter of dropping the inet_dscp_to_dsfield() call.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/6c05a11afdc61530f1a4505147e0909ad51feb15.1736941806.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
net: reduce RTNL pressure in unregister_netdevice()
One major source of RTNL contention resides in unregister_netdevice()
Due to RCU protection of various network structures, and
unregister_netdevice() being a synchronous function,
it is calling potentially slow functions while holding RTNL.
I think we can release RTNL in two points, so that three
slow functions are called while RTNL can be used
by other threads.
v1: https://lore.kernel.org/netdev/20250107130906.098fc8d6@kernel.org/T/#m398c95f5778e1ff70938e079d3c4c43c050ad2a6
====================
Link: https://patch.msgid.link/20250114205531.967841-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
One synchronize_net() call is currently done while holding RTNL.
This is source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.
For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().
This should be safe, because devices are no longer visible
to other threads after unlist_netdevice() call
and setting dev->reg_state to NETREG_UNREGISTERING.
In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Two synchronize_net() calls are currently done while holding RTNL.
This is source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.
For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().
This should be safe, because devices are no longer visible
to other threads at this point.
In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
flush_all_backlogs() is called from unregister_netdevice_many_notify()
as part of netdevice dismantles.
This is currently called under RTNL, and can last up to 50 ms
on busy hosts.
There is no reason to hold RTNL at this stage, if our caller
is cleanup_net() : netns are no more visible, devices
are in NETREG_UNREGISTERING state and no other thread
could mess our state while RTNL is temporarily released.
In order to provide isolation, this patch provides a separate
'net_todo_list' for cleanup_net().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
flush_all_backlogs() uses per-cpu and static data to hold its
temporary data, on the assumption it is called under RTNL
protection.
Following patch in the series will break this assumption.
Use instead a dynamically allocated piece of memory.
In the unlikely case the allocation fails,
use a boot-time allocated memory.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cleanup_net() is the single thread responsible
for netns dismantles, and a serious bottleneck.
Before we can get per-netns RTNL, make sure
all synchronize_net() called from this thread
are using rcu_synchronize_expedited().
v3: deal with CONFIG_NET_NS=n
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
NAPI lifetime, visibility and config are all fully under
netdev_lock protection now.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Protect the following members of netdev and napi by netdev_lock:
- defer_hard_irqs,
- gro_flush_timeout,
- irq_suspend_timeout.
The first two are written via sysfs (which this patch switches
to new lock), and netdev genl which holds both netdev and rtnl locks.
irq_suspend_timeout is only written by netdev genl.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Take netdev_lock() in netif_napi_set_irq(). All NAPI "control fields"
are now protected by that lock (most of the other ones are set during
napi add/del). The napi_hash_node is fully protected by the hash
spin lock, but close enough for the kdoc...
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that NAPI instances can't come and go without holding
netdev->lock we can trivially switch from rtnl_lock() to
netdev_lock() for setting netdev->threaded via sysfs.
Note that since we do not lock netdev_lock around sysfs
calls in the core we don't have to "trylock" like we do
with rtnl_lock.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In prep for dropping rtnl_lock, start locking netdev->lock in netlink
genl ops. We need to be using netdev->up instead of flags & IFF_UP.
We can remove the RCU lock protection for the NAPI since NAPI list
is protected by netdev->lock already.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wrap napi_enable() / napi_disable() with netdev_lock().
Provide the "already locked" flavor of the API.
iavf needs the usual adjustment. A number of drivers call
napi_enable() under a spin lock, so they have to be modified
to take netdev_lock() first, then spin lock then call
napi_enable_locked().
Protecting napi_enable() implies that napi->napi_id is protected
by netdev_lock().
Acked-by: Francois Romieu <romieu@fr.zoreil.com> # via-velocity
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hold netdev->lock when NAPIs are getting added or removed.
This will allow safe access to NAPI instances of a net_device
without rtnl_lock.
Create a family of helpers which assume the lock is already taken.
Switch iavf to them, as it makes extensive use of netdev->lock,
already.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some uAPI (netdev netlink) hide net_device's sub-objects while
the interface is down to ensure uniform behavior across drivers.
To remove the rtnl_lock dependency from those uAPIs we need a way
to safely tell if the device is down or up.
Add an indication of whether device is open or closed, protected
by netdev->lock. The semantics are the same as IFF_UP, but taking
netdev_lock around every write to ->flags would be a lot of code
churn.
We don't want to blanket the entire open / close path by netdev_lock,
because it will prevent us from applying it to specific structures -
core helpers won't be able to take that lock from any function
called by the drivers on open/close paths.
So the state of the flag is "pessimistic", as in it may report false
negatives, but never false positives.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add helpers for accessing netdevs under netdev_lock().
There's some careful handling needed to find the device and lock it
safely, without it getting unregistered, and without taking rtnl_lock
(the latter being the whole point of the new locking, after all).
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Protect writes to netdev->reg_state with netdev_lock().
From now on holding netdev_lock() is sufficient to prevent
the net_device from getting unregistered, so code which
wants to hold just a single netdev around no longer needs
to hold rtnl_lock.
We do not protect the NETREG_UNREGISTERED -> NETREG_RELEASED
transition. We'd need to move mutex_destroy(netdev->lock)
to .release, but the real reason is that trying to stop
the unregistration process mid-way would be unsafe / crazy.
Taking references on such devices is not safe, either.
So the intended semantics are to lock REGISTERED devices.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add helpers for locking the netdev instance, use it in drivers
and the shaper code. This will make grepping for the lock usage
much easier, as we extend the lock to cover more fields.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20250115035319.559603-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Following fields of 'struct mr_mfc' can be updated
concurrently (no lock protection) from ip_mr_forward()
and ip6_mr_forward()
- bytes
- pkt
- wrong_if
- lastuse
They also can be read from other functions.
Convert bytes, pkt and wrong_if to atomic_long_t,
and use READ_ONCE()/WRITE_ONCE() for lastuse.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a single buffer XDP is attached, NIC should guarantee only single
page packets will be received.
tcp-data-split feature splits packets into header and payload. single
buffer XDP can't handle it properly.
So attaching single buffer XDP should be disallowed when tcp-data-split
is enabled.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-6-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While the devmem is running, the tcp-data-split and
hds-thresh configuration should not be changed.
If user tries to change tcp-data-split and threshold value while the
devmem is running, it fails and shows extack message.
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-5-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If driver doesn't support ring parameter or tcp-data-split configuration
is not sufficient, the devmem should not be set up.
Before setup the devmem, tcp-data-split should be ON and hds-thresh
value should be 0.
Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-4-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The hds-thresh option configures the threshold value of
the header-data-split.
If a received packet size is larger than this threshold value, a packet
will be split into header and payload.
The header indicates TCP and UDP header, but it depends on driver spec.
The bnxt_en driver supports HDS(Header-Data-Split) configuration at
FW level, affecting TCP and UDP too.
So, If hds-thresh is set, it affects UDP and TCP packets.
Example:
# ethtool -G <interface name> hds-thresh <value>
# ethtool -G enp14s0f0np0 tcp-data-split on hds-thresh 256
# ethtool -g enp14s0f0np0
Ring parameters for enp14s0f0np0:
Pre-set maximums:
...
HDS thresh: 1023
Current hardware settings:
...
TCP data split: on
HDS thresh: 256
The default/min/max values are not defined in the ethtool so the drivers
should define themself.
The 0 value means that all TCP/UDP packets' header and payload
will be split.
Tested-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-3-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When tcp-data-split is UNKNOWN mode, drivers arbitrarily handle it.
For example, bnxt_en driver automatically enables if at least one of
LRO/GRO/JUMBO is enabled.
If tcp-data-split is UNKNOWN and LRO is enabled, a driver returns
ENABLES of tcp-data-split, not UNKNOWN.
So, `ethtool -g eth0` shows tcp-data-split is enabled.
The problem is in the setting situation.
In the ethnl_set_rings(), it first calls get_ringparam() to get the
current driver's config.
At that moment, if driver's tcp-data-split config is UNKNOWN, it returns
ENABLE if LRO/GRO/JUMBO is enabled.
Then, it sets values from the user and driver's current config to
kernel_ethtool_ringparam.
Last it calls .set_ringparam().
The driver, especially bnxt_en driver receives
ETHTOOL_TCP_DATA_SPLIT_ENABLED.
But it can't distinguish whether it is set by the user or just the
current config.
When user updates ring parameter, the new hds_config value is updated
and current hds_config value is stored to old_hdsconfig.
Driver's .set_ringparam() callback can distinguish a passed
tcp-data-split value is came from user explicitly.
If .set_ringparam() is failed, hds_config is rollbacked immediately.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-2-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 1c670b39ce ("mptcp: change local addr type of subflow_destroy")
introduced a bug in mptcp_pm_nl_subflow_destroy_doit().
ipv6_addr_set_v4mapped() should be called to set the remote ipv4 address
'addr_r.addr.s_addr' to the remote ipv6 address 'addr_r.addr6', not
'addr_l.addr.addr6', which is the local ipv6 address.
Fixes: 1c670b39ce ("mptcp: change local addr type of subflow_destroy")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250114-net-next-mptcp-fix-remote-addr-v1-1-debcd84ea86f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Allow sysfs to trigger hdev reset. This is required to recover devices
that are not responsive from userspace.
Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The hdev->reset is never used now and the hdev->cmd_timeout actually
does reset. This patch changes the call path from
hdev->cmd_timeout -> vendor_cmd_timeout -> btusb_reset -> hdev->reset
, to
hdev->reset -> vendor_reset -> btusb_reset
Which makes it clear when we export the hdev->reset to a wider usage
e.g. allowing reset from sysfs.
This patch doesn't introduce any behavior change.
Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
A NULL sock pointer is passed into l2cap_sock_alloc() when it is called
from l2cap_sock_new_connection_cb() and the error handling paths should
also be aware of it.
Seemingly a more elegant solution would be to swap bt_sock_alloc() and
l2cap_chan_create() calls since they are not interdependent to that moment
but then l2cap_chan_create() adds the soon to be deallocated and still
dummy-initialized channel to the global list accessible by many L2CAP
paths. The channel would be removed from the list in short period of time
but be a bit more straight-forward here and just check for NULL instead of
changing the order of function calls.
Found by Linux Verification Center (linuxtesting.org) with SVACE static
analysis tool.
Fixes: 7c4f78cdb8 ("Bluetooth: L2CAP: do not leave dangling sk pointer on error in l2cap_sock_create()")
Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_bdaddr_list_del_with_flags() was added in 2020's
commit 8baaa4038e ("Bluetooth: Add bdaddr_list_with_flags for classic
whitelist")
but has remained unused.
hci_remove_ext_adv_instance() was added in 2020's
commit eca0ae4aea ("Bluetooth: Add initial implementation of BIS
connections")
but has remained unused.
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This marks LL Privacy as stable by removing its experimental UUID and
move its functionality to Device Flag (HCI_CONN_FLAG_ADDRESS_RESOLUTION)
which can be set by MGMT Device Set Flags so userspace retain control of
the feature.
Link: https://github.com/bluez/bluez/issues/1028
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
In 'cfg80211_scan_6ghz()', an instances of 'struct cfg80211_colocated_ap'
are allocated as if they would have 'ssid' as trailing VLA member. Since
this is not so, extra IEEE80211_MAX_SSID_LEN bytes are not needed.
Briefly tested with KUnit.
Fixes: c8cb5b854b ("nl80211/cfg80211: support 6 GHz scanning")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://patch.msgid.link/20250113155417.552587-1-dmantipov@yandex.ru
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Existing primitive has several problems:
1) calling conventions are clumsy - it returns a dentry reference
that is either identical to its second argument or is an ERR_PTR(-E...);
in both cases no refcount changes happen. Inconvenient for users and
bug-prone; it would be better to have it return 0 on success and -E... on
failure.
2) it allows cross-directory moves; however, no such caller have
ever materialized and considering the way debugfs is used, it's unlikely
to happen in the future. What's more, any such caller would have fun
issues to deal with wrt interplay with recursive removal. It also makes
the calling conventions clumsier...
3) tautological rename fails; the callers have no race-free way
to deal with that.
4) new name must have been formed by the caller; quite a few
callers have it done by sprintf/kasprintf/etc., ending up with considerable
boilerplate.
Proposed replacement: int debugfs_change_name(dentry, fmt, ...). All callers
convert to that easily, and it's simpler internally.
IMO debugfs_rename() should go; if we ever get a real-world use case for
cross-directory moves in debugfs, we can always look into the right way
to handle that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250112080705.141166-21-viro@zeniv.linux.org.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When __netpoll_setup() is called directly, instead of through
netpoll_setup(), the np->skb_pool list head isn't initialized.
If skb_pool_flush() is later called, then we hit a NULL pointer
in skb_queue_purge_reason(). This can be seen with this repro,
when CONFIG_NETCONSOLE is enabled as a module:
ip tuntap add mode tap tap0
ip link add name br0 type bridge
ip link set dev tap0 master br0
modprobe netconsole netconsole=4444@10.0.0.1/br0,9353@10.0.0.2/
rmmod netconsole
The backtrace is:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
... ... ...
Call Trace:
<TASK>
__netpoll_free+0xa5/0xf0
br_netpoll_cleanup+0x43/0x50 [bridge]
do_netpoll_cleanup+0x43/0xc0
netconsole_netdev_event+0x1e3/0x300 [netconsole]
unregister_netdevice_notifier+0xd9/0x150
cleanup_module+0x45/0x920 [netconsole]
__se_sys_delete_module+0x205/0x290
do_syscall_64+0x70/0x150
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Move the skb_pool list setup and initial skb fill into __netpoll_setup().
Fixes: 221a9c1df7 ("net: netpoll: Individualize the skb pool")
Signed-off-by: John Sperbeck <jsperbeck@google.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250114011354.2096812-1-jsperbeck@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The last use of kernel_sendmsg_locked() was removed in 2023 by
commit dc97391e66 ("sock: Remove ->sendpage*() in favour of
sendmsg(MSG_SPLICE_PAGES)")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250112131318.63753-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The wake-up condition currently implemented by mptcp_epollin_ready()
is wrong, as it could mark the MPTCP socket as readable even when
no data are present and the system is under memory pressure.
Explicitly check for some data being available in the receive queue.
Fixes: 5684ab1a0e ("mptcp: give rcvlowat some love")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250113-net-mptcp-connect-st-flakes-v1-2-0d986ee7b1b6@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mptcp_cleanup_rbuf() is responsible to send acks when the user-space
reads enough data to update the receive windows significantly.
It tries hard to avoid acquiring the subflow sockets locks by checking
conditions similar to the ones implemented at the TCP level.
To avoid too much code duplication - the MPTCP protocol can't reuse the
TCP helpers as part of the relevant status is maintained into the msk
socket - and multiple costly window size computation, mptcp_cleanup_rbuf
uses a rough estimate for the most recently advertised window size:
the MPTCP receive free space, as recorded as at last-ack time.
Unfortunately the above does not allow mptcp_cleanup_rbuf() to detect
a zero to non-zero win change in some corner cases, skipping the
tcp_cleanup_rbuf call and leaving the peer stuck.
After commit ea66758c17 ("tcp: allow MPTCP to update the announced
window"), MPTCP has actually cheap access to the announced window value.
Use it in mptcp_cleanup_rbuf() for a more accurate ack generation.
Fixes: e3859603ba ("mptcp: better msk receive window updates")
Cc: stable@vger.kernel.org
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/20250107131845.5e5de3c5@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250113-net-mptcp-connect-st-flakes-v1-1-0d986ee7b1b6@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A Broadcast Sink might require BIG sync to be terminated and
re-established multiple times, while keeping the same PA sync
handle active. This can be possible if the configuration of the
listening (PA sync) socket is reset once all bound BISes are
established and accepted by the user space:
1. The DEFER setup flag needs to be reset on the parent socket,
to allow another BIG create sync procedure to be started on socket
read.
2. The BT_SK_BIG_SYNC flag needs to be cleared on the parent socket,
to allow another BIG create sync command to be sent.
3. The socket state needs to transition from BT_LISTEN to BT_CONNECTED,
to mark that the listening process has completed and another one can
be started if needed.
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Prior patch in the series added TCP_RFC7323_PAWS_ACK drop reason.
This patch adds the corresponding SNMP counter, for folks
using nstat instead of tracing for TCP diagnostics.
nstat -az | grep PAWSOldAck
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250113135558.3180360-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
XPS can cause reorders because of the relaxed OOO
conditions for pure ACK packets.
For hosts not using RFS, what can happpen is that ACK
packets are sent on behalf of the cpu processing NIC
interrupts, selecting TX queue A for ACK packet P1.
Then a subsequent sendmsg() can run on another cpu.
TX queue selection uses the socket hash and can choose
another queue B for packets P2 (with payload).
If queue A is more congested than queue B,
the ACK packet P1 could be sent on the wire after
P2.
A linux receiver when processing P1 (after P2) currently increments
LINUX_MIB_PAWSESTABREJECTED (TcpExtPAWSEstab)
and use TCP_RFC7323_PAWS drop reason.
It might also send a DUPACK if not rate limited.
In order to better understand this pattern, this
patch adds a new drop_reason : TCP_RFC7323_PAWS_ACK.
For old ACKS like these, we no longer increment
LINUX_MIB_PAWSESTABREJECTED and no longer sends a DUPACK,
keeping credit for other more interesting DUPACK.
perf record -e skb:kfree_skb -a
perf script
...
swapper 0 [148] 27475.438637: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [208] 27475.438706: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [208] 27475.438908: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [148] 27475.439010: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [148] 27475.439214: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [208] 27475.439286: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250113135558.3180360-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Following patch is adding a new drop_reason to tcp_validate_incoming().
Change tcp_disordered_ack() to not return a boolean anymore,
but a drop reason.
Change its name to tcp_disordered_ack_check()
Refactor tcp_validate_incoming() to ease the code
review of the following patch, and reduce indentation
level.
This patch is a refactor, with no functional change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250113135558.3180360-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ethtool_get_status callback currently handles all status and PSE
information within a single function. This approach has two key
drawbacks:
1. If the core requires some information for purposes other than
ethtool_get_status, redundant code will be needed to fetch the same
data from the driver (like is_enabled).
2. Drivers currently have access to all information passed to ethtool.
New variables will soon be added to ethtool status, such as PSE ID,
power domain IDs, and budget evaluation strategies, which are meant
to be managed solely by the core. Drivers should not have the ability
to modify these variables.
To resolve these issues, ethtool_get_status has been split into multiple
callbacks, with each handling a specific piece of information required
by ethtool or the core.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Recent reports have shown how we sometimes call vsock_*_has_data()
when a vsock socket has been de-assigned from a transport (see attached
links), but we shouldn't.
Previous commits should have solved the real problems, but we may have
more in the future, so to avoid null-ptr-deref, we can return 0
(no space, no data available) but with a warning.
This way the code should continue to run in a nearly consistent state
and have a warning that allows us to debug future problems.
Fixes: c0cfa2d8a7 ("vsock: add multi-transports support")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/netdev/Z2K%2FI4nlHdfMRTZC@v4bel-B760M-AORUS-ELITE-AX/
Link: https://lore.kernel.org/netdev/5ca20d4c-1017-49c2-9516-f6f75fd331e9@rbox.co/
Link: https://lore.kernel.org/netdev/677f84a8.050a0220.25a300.01b3.GAE@google.com/
Co-developed-by: Hyunwoo Kim <v4bel@theori.io>
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Co-developed-by: Wongi Lee <qwerty@theori.io>
Signed-off-by: Wongi Lee <qwerty@theori.io>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Hyunwoo Kim <v4bel@theori.io>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Transport's release() and destruct() are called when de-assigning the
vsock transport. These callbacks can touch some socket state like
sock flags, sk_state, and peer_shutdown.
Since we are reassigning the socket to a new transport during
vsock_connect(), let's reset these fields to have a clean state with
the new transport.
Fixes: c0cfa2d8a7 ("vsock: add multi-transports support")
Cc: stable@vger.kernel.org
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
During virtio_transport_release() we can schedule a delayed work to
perform the closing of the socket before destruction.
The destructor is called either when the socket is really destroyed
(reference counter to zero), or it can also be called when we are
de-assigning the transport.
In the former case, we are sure the delayed work has completed, because
it holds a reference until it completes, so the destructor will
definitely be called after the delayed work is finished.
But in the latter case, the destructor is called by AF_VSOCK core, just
after the release(), so there may still be delayed work scheduled.
Refactor the code, moving the code to delete the close work already in
the do_close() to a new function. Invoke it during destruction to make
sure we don't leave any pending work.
Fixes: c0cfa2d8a7 ("vsock: add multi-transports support")
Cc: stable@vger.kernel.org
Reported-by: Hyunwoo Kim <v4bel@theori.io>
Closes: https://lore.kernel.org/netdev/Z37Sh+utS+iV3+eb@v4bel-B760M-AORUS-ELITE-AX/
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Tested-by: Hyunwoo Kim <v4bel@theori.io>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some of the core functions can only be called if the transport
has been assigned.
As Michal reported, a socket might have the transport at NULL,
for example after a failed connect(), causing the following trace:
BUG: kernel NULL pointer dereference, address: 00000000000000a0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 12faf8067 P4D 12faf8067 PUD 113670067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 15 UID: 0 PID: 1198 Comm: a.out Not tainted 6.13.0-rc2+
RIP: 0010:vsock_connectible_has_data+0x1f/0x40
Call Trace:
vsock_bpf_recvmsg+0xca/0x5e0
sock_recvmsg+0xb9/0xc0
__sys_recvfrom+0xb3/0x130
__x64_sys_recvfrom+0x20/0x30
do_syscall_64+0x93/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
So we need to check the `vsk->transport` in vsock_bpf_recvmsg(),
especially for connected sockets (stream/seqpacket) as we already
do in __vsock_connectible_recvmsg().
Fixes: 634f1a7110 ("vsock: support sockmap")
Cc: stable@vger.kernel.org
Reported-by: Michal Luczaj <mhal@rbox.co>
Closes: https://lore.kernel.org/netdev/5ca20d4c-1017-49c2-9516-f6f75fd331e9@rbox.co/
Tested-by: Michal Luczaj <mhal@rbox.co>
Reported-by: syzbot+3affdbfc986ecd9200fd@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/677f84a8.050a0220.25a300.01b3.GAE@google.com/
Tested-by: syzbot+3affdbfc986ecd9200fd@syzkaller.appspotmail.com
Reviewed-by: Hyunwoo Kim <v4bel@theori.io>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If the socket has been de-assigned or assigned to another transport,
we must discard any packets received because they are not expected
and would cause issues when we access vsk->transport.
A possible scenario is described by Hyunwoo Kim in the attached link,
where after a first connect() interrupted by a signal, and a second
connect() failed, we can find `vsk->transport` at NULL, leading to a
NULL pointer dereference.
Fixes: c0cfa2d8a7 ("vsock: add multi-transports support")
Cc: stable@vger.kernel.org
Reported-by: Hyunwoo Kim <v4bel@theori.io>
Reported-by: Wongi Lee <qwerty@theori.io>
Closes: https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Hyunwoo Kim <v4bel@theori.io>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Create an API to get the net_device to the slave port of HSR device. The
API will take hsr net_device and enum hsr_port_type for which we want the
net_device as arguments.
This API can be used by client drivers who support HSR and want to get
the net_devcie of slave ports from the hsr device. Export this API for
the same.
This API needs the enum hsr_port_type to be accessible by the drivers using
hsr. Move the enum hsr_port_type from net/hsr/hsr_main.h to
include/linux/if_hsr.h for the same.
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add multicast filtering support for VLAN interfaces in dual EMAC mode
for ICSSG driver.
The driver uses vlan_for_each() API to get the list of available
vlans. The driver then sync mc addr of vlan interface with a locally
mainatined list emac->vlan_mcast_list[vid] using __hw_addr_sync_multiple()
API.
__hw_addr_sync_multiple() is used instead of __hw_addr_sync() to sync
vdev->mc with local list because the sync_cnt for addresses in vdev->mc
will already be set by the vlan_dev_set_rx_mode() [net/8021q/vlan_dev.c]
and __hw_addr_sync() only syncs when the sync_cnt == 0. Whereas
__hw_addr_sync_multiple() can sync addresses even if sync_cnt is not 0.
Export __hw_addr_sync_multiple() so that driver can use it.
Once the local list is synced, driver calls __hw_addr_sync_dev() with
the local list, vdev, sync and unsync callbacks.
__hw_addr_sync_dev() is used with the local maintained list as the list
to synchronize instead of using __dev_mc_sync() on vdev because
__dev_mc_sync() on vdev will call __hw_addr_sync_dev() on vdev->mc and
sync_cnt for addresses in vdev->mc will already be set by the
vlan_dev_set_rx_mode() [net/8021q/vlan_dev.c] and __hw_addr_sync_dev()
only syncs if the sync_cnt of addresses in the list (vdev->mc in this case)
is 0. Whereas __hw_addr_sync_dev() on local list will work fine as the
sync_cnt for addresses in the local list will still be 0.
Based on change in addresses in the local list, sync / unsync callbacks
are invoked. In the sync / unsync API in driver, based on whether the ndev
is vlan or not, driver passes appropriate vid to FDB helper functions.
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmeC97YACgkQ1w0aZmrP
KyEUfg//e+n/YKyHn3MRlHeKf9HnPdAdzrHqNrq8t8332x/4nFGBeMWPbEuM0H7B
kM4eZUp8XjS5JF4ze8mZV6HJ7c2a0JHMkrs1/I9uZzBkvZayFl2ueL0cCQoizK0G
opOLinsw/RUe6H/ulGEq2K7rtBjIAvWi5d2/i+oMERkr3ADOq0d4cWlGJra0a6Lb
RyCPDJoQ65kNBHCChxYVhhpC8LlMCSDTuZPwcl58qGRRqiTIyVme05K9yCQrcJAO
91trgnsTHghJ1xBcTAvewcSTDylkcL7qkiFuCYcvUmPmVBrFKn1g/va99rrsvVI6
Fz2pbIPMZa/6Gdpx/mBzDaPv0XAU+cqLuIJ/t5fdRiCviHzEYZaGXh3+ZWLyhR8d
FLXXC3V1IKnsBqqN3TC7Rx81h9c6jt3+KPKNMFOpoYtn3V9shHtjpAHCLW7UDSU6
zPp8FhzwDS7BV91Oyp2uKWtmUr0GQFRpr9F/iSGp8Ix0l8kL+AD5t3M3LA/QDjnq
GSpAsakrZMXgfFfAZxecvT8cVzWE0KydKrAawsFTn11s++rVCQlBLayZjHD6ZfuM
IlP69GZcPgEvxT5lZIcg74pzCQpbR8jV6KME3XV8Kvu8FLNqv2l+hy1bB+x3YAFQ
8BOalM7KPDNqNdK91DVkfEl0DxabmJcRj7R0C1L85R5EN4xYATA=
=6jpN
-----END PGP SIGNATURE-----
Merge tag 'nf-next-25-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains a small batch of Netfilter/IPVS updates
for net-next:
1) Remove unused genmask parameter in nf_tables_addchain()
2) Speed up reads from /proc/net/ip_vs_conn, from Florian Westphal.
3) Skip empty buckets in hashlimit to avoid atomic operations that results
in false positive reports by syzbot with lockdep enabled, patch from
Eric Dumazet.
4) Add conntrack event timestamps available via ctnetlink,
from Florian Westphal.
netfilter pull request 25-01-11
* tag 'nf-next-25-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: conntrack: add conntrack event timestamp
netfilter: xt_hashlimit: htable_selective_cleanup() optimization
ipvs: speed up reads from ip_vs_conn proc file
netfilter: nf_tables: remove the genmask parameter
====================
Link: https://patch.msgid.link/20250111230800.67349-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce a new way to report PHY statistics in a structured and
standardized format using the netlink API. This new method does not
replace the old driver-specific stats, which can still be accessed with
`ethtool -S <eth name>`. The structured stats are available with
`ethtool -S <eth name> --all-groups`.
This new method makes it easier to diagnose problems by organizing stats
in a consistent and documented way.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce support for standardized PHY statistics reporting in ethtool
by extending the PHYLIB framework. Add the functions
phy_ethtool_get_phy_stats() and phy_ethtool_get_link_ext_stats() to
provide a consistent interface for retrieving PHY-level and
link-specific statistics. These functions are used within the ethtool
implementation to avoid direct access to the phy_device structure
outside of the PHYLIB framework.
A new structure, ethtool_phy_stats, is introduced to standardize PHY
statistics such as packet counts, byte counts, and error counters.
Drivers are updated to include callbacks for retrieving PHY and
link-specific statistics, ensuring values are explicitly set only for
supported fields, initialized with ETHTOOL_STAT_NOT_SET to avoid
ambiguity.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Adapt linkstate_get_sqi() and linkstate_get_sqi_max() to take a
phy_device argument directly, enabling support for setups with
multiple PHYs. The previous assumption of a single PHY attached to
a net_device no longer holds.
Use ethnl_req_get_phydev() to identify the appropriate PHY device
for the operation. Update linkstate_prepare_data() and related
logic to accommodate this change, ensuring compatibility with
multi-PHY configurations.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As discussed in [0], rehash4 could be missed in udp_lib_rehash() when
udp hash4 changes while hash2 doesn't change. This patch fixes this by
moving rehash4 codes out of rehash2 checking, and then rehash2 and
rehash4 are done separately.
By doing this, we no longer need to call rehash4 explicitly in
udp_lib_hash4(), as the rehash callback in __ip4_datagram_connect takes
it. Thus, now udp_lib_hash4() returns directly if the sk is already
hashed.
Note that uhash4 may fail to work under consecutive connect(<dst
address>) calls because rehash() is not called with every connect(). To
overcome this, connect(<AF_UNSPEC>) needs to be called after the next
connect to a new destination.
[0]
https://lore.kernel.org/all/4761e466ab9f7542c68cdc95f248987d127044d2.1733499715.git.pabeni@redhat.com/
Fixes: 78c91ae2c6 ("ipv4/udp: Add 4-tuple hash for connected socket")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://patch.msgid.link/20250110010810.107145-1-lulie@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
dev_deactivate_many() role is to remove the qdiscs
of a network device.
When/if a qdisc is dismantled, an rcu grace period
is needed to make sure all outstanding qdisc enqueue
are done before we proceed with a qdisc reset.
Most virtual devices do not have a qdisc.
We can call the expensive synchronize_net() only
if needed.
Note that dev_deactivate_many() does not have to deal
with qdisc-less dev_queue_xmit, as an old comment
was claiming.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250109171850.2871194-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
init_dummy_netdev_core() used to cater to net_devices which
did not come from alloc_netdev_mqs(). Since that's no longer
supported remove the init logic which duplicates alloc_netdev_mqs().
While at it rename back to init_dummy_netdev().
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250113003456.3904110-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
init_dummy_netdev() can initialize statically declared or embedded
net_devices. Such netdevs did not come from alloc_netdev_mqs().
After recent work by Breno, there are the only two cases where
we have do that.
Switch those cases to alloc_netdev_mqs() and delete init_dummy_netdev().
Dealing with static netdevs is not worth the maintenance burden.
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250113003456.3904110-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When recvmsg with MSG_PEEK flag, the data will be copied to
user's buffer without advancing consume cursor and without
reducing the length of rx available data. Once the expected
peek length is larger than the value of bytes_to_rcv, in the
loop of do while in smc_rx_recvmsg, the first loop will copy
bytes_to_rcv bytes of data from the position local_tx_ctrl.cons,
the second loop will copy the min(bytes_to_rcv, read_remaining)
bytes from the position local_tx_ctrl.cons again because of the
lacking of process with advancing consume cursor and reducing
the length of available data. So do the subsequent loops. The
data copied in the second loop and the subsequent loops will
result in data error, as it should not be copied if no more data
arrives and it should be copied from the position advancing
bytes_to_rcv bytes from the local_tx_ctrl.cons if more data arrives.
This issue can be reproduce by the following python script:
server.py:
import socket
import time
server_ip = '0.0.0.0'
server_port = 12346
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind((server_ip, server_port))
server_socket.listen(1)
print('Server is running and listening for connections...')
conn, addr = server_socket.accept()
print('Connected by', addr)
while True:
data = conn.recv(1024)
if not data:
break
print('Received request:', data.decode())
conn.sendall(b'Hello, client!\n')
time.sleep(5)
conn.sendall(b'Hello, again!\n')
conn.close()
client.py:
import socket
server_ip = '<server ip>'
server_port = 12346
resp=b'Hello, client!\nHello, again!\n'
client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client_socket.connect((server_ip, server_port))
request = 'Hello, server!'
client_socket.sendall(request.encode())
peek_data = client_socket.recv(len(resp),
socket.MSG_PEEK | socket.MSG_WAITALL)
print('Peeked data:', peek_data.decode())
client_socket.close()
Fixes: 952310ccf2 ("smc: receive data from RMBE")
Reported-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Link: https://patch.msgid.link/20250104143201.35529-1-guangguan.wang@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Display the total number of RPC tasks, including tasks waiting
on workqueue and wait queues, for rpc_clnt.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Under heavy write load, we've seen the cl_tasks list grows to
millions of entries. Even though the list is extremely long,
the system still runs fine until the user wants to get the
information of all active RPC tasks by doing:
When this happens, tasks_start acquires the cl_lock to walk the
cl_tasks list, returning one entry at a time to the caller. The
cl_lock is held until all tasks on this list have been processed.
While the cl_lock is held, completed RPC tasks have to spin wait
in rpc_task_release_client for the cl_lock. If there are millions
of entries in the cl_tasks list it will take a long time before
tasks_stop is called and the cl_lock is released.
The spin wait tasks can use up all the available CPUs in the system,
preventing other jobs to run, this causes the system to temporarily
lock up.
This patch fixes this problem by delaying inserting the RPC
task on the cl_tasks list until the RPC call slot is reserved.
This limits the length of the cl_tasks to the number of call
slots available in the system.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
In case of authentication/association timeout (as detected in
ieee80211_iface_work->ieee80211_sta_work), ieee80211_destroy_auth_data
is called.
At the beginning of it, the pointer to ifmgd::auth_data memory is
copied to a local variable.
If iface_work is queued again during the execution of the current one,
and then the driver is flushing the wiphy_works (for its needs),
ieee80211_destroy_auth_data will run again and free auth_data.
Then when the execution of the original worker continues, the previously
copied pointer will be freed, causing a kernel bug:
kernel BUG at mm/slub.c:553! (double free)
Same for association timeout (just with ieee80211_destroy_assoc_data and
ifmgd::assoc_data)
Fix this by NULLifying auth/assoc data right after we copied
the pointer to it. That way, even in the scenario above, the code will
not handle the same timeout twice.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250102161730.0c3f7f781096.I2b458fb53291b06717077a815755288a81274756@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Mark that we left the IBSS before actually leaving (which
requires calling the driver). Otherwise, it's possible to
have the driver do some work flushing etc. while leaving,
and then get into the work trying to join again while all
data is being destroyed.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.81a6c12b304c.I8484f768371e128152a84aa164854cca9ec1066b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We disable the carrier, but that doesn't immediately take
effect, and as such a concurrent transmit can go into the
driver while drv_leave_ibss() is happening and confuse it.
Synchronize to avoid that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.83b58167e80e.I538751fbe1b4302d20f6ed80afb024bca6a2dbf5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If STA state is pre-moved to AUTHORIZED (such as in IBSS
scenarios) and insertion fails, the station is freed.
In this case, the driver never knew about the station,
so trying to flush it is unexpected and may crash.
Check if the sta was uploaded to the driver before and
fix this.
Fixes: d00800a289 ("wifi: mac80211: add flush_sta method")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.e3d10970a7c7.I491bbcccc46f835ade07df0640a75f6ed92f20a3@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When in non-MLO mode, the key ID was set to -1 even for keys that are
not pairwise. Change the link ID to be the link ID of the deflink in
this case so that drivers do not need to special cases for this.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.0c066f084677.I4a5c288465e75119edb6a0df90dddf6f30d14a02@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The link ID passed to drv_mgd_complete_tx when handling the association
response was not set.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.2b06504ecaef.Ifb94e9375b910de6cdd2e5865d8cb3ab9790b314@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for parsing an ML element of type EPCS priority
access, which can optionally be included in EHT protected action
frames used to configure EPCS.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.5afdf65cff46.I0ffa30b40fbad47bc5b608b5fd46047a8c44e904@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for configuring Emergency Preparedness Communication
Services (EPCS) for station mode.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.ea54ac94445c.I11d750188bc0871e13e86146a3b5cc048d853e69@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for adding and removing station links:
- Adding links is done asynchronously, i.e., first
an ML reconfiguration action frame is sent to the AP
requesting to add links, and only when the AP replies,
links which were added successfully by the AP are added
locally.
- Removing links is done synchronously, i.e., the links
are removed before sending the ML reconfiguration
action frame to the AP (to avoid using this links after
the AP MLD removed them but before the station got the
ML reconfiguration response). In case the AP replies with a
status indicating that a link removal was not successful,
disconnect (as this should not happen an is an indication
that something might be wrong on the AP MLD).
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.ec0492a8dd21.I2869686642bbc0f86c40f284ebf7e6f644b551ab@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull the calculation of the size needed for a link in an association
request frame to a function, so it could also be used during the
construction of other frames as well, e.g., ML link configuration
request frame.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.ac16adfa39d4.I9e28c2fcd5ca252341c817fc03ea8df7b807fcbf@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Instead of always using 'sdata->u.mgd.assoc_data' have
the association data be passed as an argument.
This will later allow to use the same functionality
for adding links to the current association.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.191f58f2bba7.I6baa6e2989a39937234ff91d7db5ff1359a6bb30@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
And move it to a separate function so it could later be reused for
dynamic addition of links.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250102161730.1e9c1873796a.I27a51c8c1d455f0a6d5b59f93f2c9ac49282febb@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since the wiphy_guard changes, rdev_start_radar_detection's return value
in nl80211_start_radar_detection is ignored and we always returned 0.
Fixes: f42d22d3f7 ("wifi: cfg80211: define and use wiphy guard")
Signed-off-by: Nicolas Escande <nico.escande@gmail.com>
Link: https://patch.msgid.link/20250109161040.325742-1-nico.escande@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The kernel performs several regulatory checks for AP mode in
nl80211/cfg80211. These checks include radar detection,
verification of whether the sub-channel is disabled, and
an examination to determine if the channel is a DFS channel
(both DFS usable and DFS available). These checks are
performed across a frequency range, examining each sub-channel.
However, these checks are also performed on subchannels that
have been punctured which should not be examined as they are
not in use.
This leads to the issue where the AP stops because one of
the 20 MHz sub-channels is disabled or radar detected on
the channel, even when the sub-channel is punctured.
To address this issue, add a condition check wherever
regulatory checks exist for AP mode in nl80211/cfg80211.
This check identifies punctured channels and, upon finding
them, skips the regulatory checks for those channels.
Co-developed-by: Manaswini Paluri <quic_mpaluri@quicinc.com>
Signed-off-by: Manaswini Paluri <quic_mpaluri@quicinc.com>
Signed-off-by: Kavita Kavita <quic_kkavita@quicinc.com>
Link: https://patch.msgid.link/20250109050409.25351-1-quic_kkavita@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With change (wifi: mac80211: fix receiving A-MSDU
frames on mesh interfaces), a non-zero TID assignment
is lost during slow path mesh forwarding.
Prior to this change, ieee80211_rx_h_mesh_fwding()
left the TID intact in the header.
As a result of this header corruption, packets belonging
to non-zero TIDs will get treating as belonging
TID 0 by functions such as ieee80211_get_tid().
While this miscategorization by itself is an
issue, there are additional ramifications
due to the fact that skb->priority still reflects
the mesh forwarded packet's ingress (correct) TID.
The mt7915 driver inspects the TID recorded within
skb->priority and relays this to the
hardware/radio during TX. The radio firmware appears to
react to this by changing the sequence control
header, but it does not also ensure/correct the TID in
the QoS control header. As a result, the receiver
will see packets with sequence numbers corresponding
to the wrong TID. The receiver of the forwarded
packet will see TID 0 in QoS control but a sequence number
corresponding to the correct (different) TID in sequence
control. This causes data stalls for TID 0 until
the TID 0 sequence number advances past what the receiver
believes it should be due to this bug.
Mesh routing mpath changes cause a brief transition
from fast path forwarding to slow path forwarding.
Since this bug only affects the slow path forwarding,
mpath changes bring opportunity for the bug to be triggered.
In the author's case, he was experiencing TID 0 data stalls
after mpath changes on an intermediate mesh node.
These observed stalls may be specific
to mediatek radios. But the inconsistency between
the packet header and skb->priority may cause problems
for other drivers as well. Regardless if this causes
connectivity issues on other radios, this change is
necessary in order transmit (forward) the packet on the
correct TID and to have a consistent view a packet's TID
within mac80211.
Fixes: 986e43b19a ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces")
Signed-off-by: Andy Strohman <andrew@andrewstrohman.com>
Link: https://patch.msgid.link/20250107104431.446775-1-andrew@andrewstrohman.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some functions that should be tested may expect an sdata object that is
configured to a basic degree. Add setup code to create such an object
for use by tests.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.12eeefd3c98b.I6e8c2b8374d4305f16675524ca30621e089b6fb0@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Parse both the Supported Rates and BSS Membership Selectors as well as
the extended version of the tag when verifying whether we support all
features.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.f1840f19afa7.I12e3a0e634ce7014f5067256d9a6215fec6bf165@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We should not attempt a connection if the BSS we are connecting to
requires support for a basic rate or other feature using the BSS
membership selector. Add a check verifying this.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.e58a0f34c798.Ifeb3bfd7b157ffa2ccdb20ca1cba6cf068fd117d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the SAE_H2E selector already exists, which needs to be
implemented by the SME. As new such selectors might be added in the
future, add a feature to permit userspace to report a selector as
supported.
If not given, the kernel should assume that userspace does support
SAE_H2E.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.fe67b871cc39.Ieb98390328927e998e612345a58b6dbc00b0e3a2@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Doing so enables further checking whether we are implementing the
requested features. Also allow passing in NULL for more parameters as
they may not be needed by the caller.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.76433fd3d69f.I94e8718de26ab32282b60ae257b8c6c334b7c528@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The GLK and EPD Selectors are also not rates, so add a new macro for the
minimum value of a selector and test against that instead of the entire
list. Also fix the typo in the EPD selector define.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.2c19a2dc53db.If187b7d93d8b43a6c70e422c837b7636538fb358@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211_determine_chan_mode is called for each link and if there is a
downgrade, then it is interesting to know on which link it happened.
Pass through the link_id where relevant and use the new link_id_info
macro instead of sdata_info so that the link ID is printed when
relevant.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.d400da710fc4.I64775ec914603d3c7b0c6ea14b507c0370c11622@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The basic_rates variable was passed to mesh_sta_info_init as an out
parameter even though the result is not used. Passing NULL instead is
safe here, so do that.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.23a86a9bad0c.If79bc2c1c98d01cfb4c7e93c18b198fe6c6ea44c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add the refcount. This can be useful when we want to understand why a
queue stays stopped after it is woken.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.bd320c6e6702.I6ae0f19d922aea1f28236d72bf260acac428fc02@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Check that additionally extended MLD capa/ops for the MLD is
consistent, i.e. the same value is reported by all affiliated
APs/links.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.e29f42c7ae21.Ib2cdce608321ad154e4b13103cc315c3e3cb6b2b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There really shouldn't be duplicate entries when we give
the list to the driver, and since we already have a list
it's easy to avoid.
While at it, remove the unnecessary allocation there.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.b0012c70f503.Id6fcad979434c1437340aa283abae2906345cca1@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The ieee80211_config_bw() function is called in different
contexts: during association with the association response
and during beacon tracking with the beacon. This can be a
bit misleading, so disambiguate the messages for those.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.ee574cf7553b.Ie7c78877d20b5e9de4cce3cf8e4f1b9e0c7ee005@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The MLME code doesn't currently handle adding vendor elements
correctly with multi-link due to element inheritance. Simply
prevent that for now completely, if someone needs it we can
fix this later.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.bb82d3aaf6ef.Ib30573d0666430a3d7a905e513dfc661edf0bf65@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Coverity pointed out that __ieee80211_rx_h_amsdu() checks if rx->sta is
NULL before dereferencing it but not always.
Since rx->sta can't be NULL at this point, just remove the check to
avoid confusion
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.033096029d0a.I0923387246a6152f589d278f27f27bce52daee79@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In order to save power, it can be desirable to change the
RX operating mode using OMI to reduce the bandwidth. As the
handshake must be done in the HTC+ field, it cannot be done
by mac80211 directly, so expose functions to the driver to
request and finalize the necessary updates.
Note that RX OMI really only changes what the peer (AP) will
transmit to us, but in order to use it to actually save some
power (by reducing the listen bandwidth) we also update rate
scaling and then the channel context's mindef accordingly.
The updates are split into two in order to sequence them
correctly, when reducing bandwidth first reduce the rate
scaling and thus TX, then send OMI, then reduce the listen
bandwidth (chandef); when increasing bandwidth this is the
other way around. This also requires tracking in different
variables which part is applicable already.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250101070249.2c1a1934bd73.I4e90fd503504e37f9eac5bdae62e3f07e7071275@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The last use of ieee80211_smps_is_restrictive() was removed in 2020 by
commit 52b4810bed ("mac80211: Remove support for changing AP SMPS mode")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20241226170119.108947-1-linux@treblig.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
According to Draft P802.11be_D7.0 clause 35.3.4.2, if a multi-link
request requests an MLD with which an AP corresponding to the
nontransmitted BSSID, the corresponding multi-link probe response
shall carry a basic multi-mink element of that MLD in the frame body
of the multi-link probe response, whose location is outside of the
Multiple BSSID element carried in the frame.
Therefore additional handing is needed for parsing multi-link probe
response and generating the merged elements so that the MLD in the frame
body can be correctly copied to the generated elements. Otherwise, the
nontransmitted BSS looks like non-MLD.
Signed-off-by: Money Wang <money.wang@mediatek.com>
Signed-off-by: Michael-CY Lee <michael-cy.lee@mediatek.com>
Link: https://patch.msgid.link/20241225073725.847062-1-michael-cy.lee@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, the sequence goes like this (among others):
1. flush all stations (including the AP ones) -> this will tell the
drivers to remove the stations
2. notify the driver the vif is not associated.
Which means that in between 1 and 2, the state is that the vif is
associated, but there is no AP station, which makes no sense, and may be
problematic for some drivers (for example iwlwifi)
Change the sequence to:
1. flush the TDLS stations
2. move the AP station to IEEE80211_STA_NONE
3. notify the driver about the vif being unassociated
4. flush the AP station
In order to not break other drivers, add a vif flag to indicate whether
the driver wants to new sequence or not. If the flag is not set, then
things will be done in the old sequence.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20241224192322.996ad1be6cb3.I7815d33415aa1d65c0120b54be7a15a45388f807@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sometimes we might want to flush only part of the stations of a vif,
for example only the TDLS ones.
To allow this, add a do_not_flush_sta argument to __sta_info_flush,
which in turn, will not flush this station.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20241224192322.535e1fcca192.Icecf7f443bf98c9535ce8ec03b24d0d17dfbc28e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The last use of ieee80211_debugfs_key_sta_del() was removed in 2007 by
commit 11a843b7e1 ("[MAC80211]: rework key handling")
The last use of ieee80211_debugfs_key_add_mgmt_default() was removed
in 2010 by
commit f7e0104c1a ("mac80211: support separate default keys")
The last use of ieee80211_debugfs_key_add_beacon_default() was
removed in 2020 by
commit e5473e80d4 ("mac80211: Support BIGTK configuration for Beacon
protection")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20241224013257.185742-2-linux@treblig.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Passing a sufficient amount of imix entries leads to invalid access to the
pkt_dev->imix_entries array because of the incorrect boundary check.
UBSAN: array-index-out-of-bounds in net/core/pktgen.c:874:24
index 20 is out of range for type 'imix_pkt [20]'
CPU: 2 PID: 1210 Comm: bash Not tainted 6.10.0-rc1 #121
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
Call Trace:
<TASK>
dump_stack_lvl lib/dump_stack.c:117
__ubsan_handle_out_of_bounds lib/ubsan.c:429
get_imix_entries net/core/pktgen.c:874
pktgen_if_write net/core/pktgen.c:1063
pde_write fs/proc/inode.c:334
proc_reg_write fs/proc/inode.c:346
vfs_write fs/read_write.c:593
ksys_write fs/read_write.c:644
do_syscall_64 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe arch/x86/entry/entry_64.S:130
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 52a62f8603 ("pktgen: Parse internet mix (imix) input")
Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru>
[ fp: allow to fill the array completely; minor changelog cleanup ]
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need the debugfs / driver-core fixes in here as well for testing and
to build on top of.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since task->comm is guaranteed to be NUL-terminated, we can print it
directly without the need to copy it into a separate buffer. This
simplifies the code and avoids unnecessary operations.
Link: https://lkml.kernel.org/r/20241219023452.69907-4-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "André Almeida" <andrealmeid@igalia.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Danilo Krummrich <dakr@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.
This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:
@@ constant C; @@
- msecs_to_jiffies(C * 1000)
+ secs_to_jiffies(C)
@@ constant C; @@
- msecs_to_jiffies(C * MSEC_PER_SEC)
+ secs_to_jiffies(C)
Link: https://lkml.kernel.org/r/20241210-converge-secs-to-jiffies-v3-15-ddfefd7e9f2a@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Daniel Mack <daniel@zonque.org>
Cc: David Airlie <airlied@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jeff Johnson <jjohnson@kernel.org>
Cc: Jeff Johnson <quic_jjohnson@quicinc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jeroen de Borst <jeroendb@google.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Louis Peens <louis.peens@corigine.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Ofir Bitton <obitton@habana.ai>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Praveen Kaligineedi <pkaligineedi@google.com>
Cc: Ray Jui <rjui@broadcom.com>
Cc: Robert Jarzmik <robert.jarzmik@free.fr>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Shailend Chand <shailend@google.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Simon Horman <horms@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Converge on using secs_to_jiffies()", v3.
This is a series that follows up on my previous series to introduce
secs_to_jiffies() and convert a few initial users.[1] In the review for
that series, Anna-Maria requested converting other users with Coccinelle.
[2] This is part 1 that converts users of msecs_to_jiffies() that use the
multiply pattern of either of:
- msecs_to_jiffies(N*1000), or
- msecs_to_jiffies(N*MSEC_PER_SEC)
where N is a constant, to avoid the multiplication.
The entire conversion is made with Coccinelle in the script added in
patch 2. Some changes suggested by Coccinelle have been deferred to
later parts that will address other possible variant patterns.
[1] https://lore.kernel.org/all/20241030-open-coded-timeouts-v3-0-9ba123facf88@linux.microsoft.com/
[2] https://lore.kernel.org/all/8734kngfni.fsf@somnus/
This patch (of 19):
None of the higher order definitions are used anymore, so remove
definitions for minutes, hours, and days timeouts. Convert the seconds
denominated timeouts to secs_to_jiffies()
Link: https://lkml.kernel.org/r/20241210-converge-secs-to-jiffies-v3-0-ddfefd7e9f2a@linux.microsoft.com
Link: https://lkml.kernel.org/r/20241210-converge-secs-to-jiffies-v3-1-ddfefd7e9f2a@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Daniel Mack <daniel@zonque.org>
Cc: David Airlie <airlied@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jeff Johnson <jjohnson@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jeroen de Borst <jeroendb@google.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Johan Hedberg <johan.hedberg@gmail.com>:
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Louis Peens <louis.peens@corigine.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Marcel Holtmann <marcel@holtmann.org>:
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Ofir Bitton <obitton@habana.ai>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Praveen Kaligineedi <pkaligineedi@google.com>
Cc: Ray Jui <rjui@broadcom.com>
Cc: Robert Jarzmik <robert.jarzmik@free.fr>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Shailend Chand <shailend@google.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Simon Horman <horms@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Here "buf" is a void pointer so sizeof(*buf) is one. Doing a divide
by one makes the code less readable. Delete it.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/ee1a790b-f874-4512-b3ae-9c45f99dc640@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A subtlety of this API is that if the @nbytes region traverses a
page boundary, the next __xdr_commit_encode will shift the data item
in the XDR encode buffer. This makes the returned pointer point to
something else, leading to unexpected behavior.
There are a few cases where the caller saves the returned pointer
and then later uses it to insert a computed value into an earlier
part of the stream. This can be safe only if either:
- the data item is guaranteed to be in the XDR buffer's head, and
thus is not ever going to be near a page boundary, or
- the data item is no larger than 4 octets, since XDR alignment
rules require all data items to start on 4-octet boundaries
But that safety is only an artifact of the current implementation.
It would be less brittle if these "safe" uses were eventually
replaced.
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There are no module callers of dev_get_by_napi_id(),
and commit d1cacd7477 ("netdev: prevent accessing NAPI instances
from another namespace") proves that getting NAPI by id
needs to be done with care. So hide dev_get_by_napi_id().
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250110004924.3212260-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dump continuation depends on the NAPI list being sorted.
Broken netlink dump continuation may be rare and hard to debug
so add a warning if we notice the potential problem while walking
the list.
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250110004505.3210140-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit in a fixes tag attempted to fix the issue in the following
sequence of calls:
do_output
-> ovs_vport_send
-> dev_queue_xmit
-> __dev_queue_xmit
-> netdev_core_pick_tx
-> skb_tx_hash
When device is unregistering, the 'dev->real_num_tx_queues' goes to
zero and the 'while (unlikely(hash >= qcount))' loop inside the
'skb_tx_hash' becomes infinite, locking up the core forever.
But unfortunately, checking just the carrier status is not enough to
fix the issue, because some devices may still be in unregistering
state while reporting carrier status OK.
One example of such device is a net/dummy. It sets carrier ON
on start, but it doesn't implement .ndo_stop to set the carrier off.
And it makes sense, because dummy doesn't really have a carrier.
Therefore, while this device is unregistering, it's still easy to hit
the infinite loop in the skb_tx_hash() from the OVS datapath. There
might be other drivers that do the same, but dummy by itself is
important for the OVS ecosystem, because it is frequently used as a
packet sink for tcpdump while debugging OVS deployments. And when the
issue is hit, the only way to recover is to reboot.
Fix that by also checking if the device is running. The running
state is handled by the net core during unregistering, so it covers
unregistering case better, and we don't really need to send packets
to devices that are not running anyway.
While only checking the running state might be enough, the carrier
check is preserved. The running and the carrier states seem disjoined
throughout the code and different drivers. And other core functions
like __dev_direct_xmit() check both before attempting to transmit
a packet. So, it seems safer to check both flags in OVS as well.
Fixes: 066b86787f ("net: openvswitch: fix race on port output")
Reported-by: Friedrich Weber <f.weber@proxmox.com>
Closes: https://mail.openvswitch.org/pipermail/ovs-discuss/2025-January/053423.html
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Tested-by: Friedrich Weber <f.weber@proxmox.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250109122225.4034688-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
hwprov should be protected by rcu_read_lock to prevent possible UAF
Fixes: 4c61d809cf ("net: ethtool: Fix suspicious rcu_dereference usage")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Kory Maincent <kory.maincent@bootlin.com>
diff with v1: move and use err varialbe, instead of define a new variable
net/ethtool/common.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
Link: https://patch.msgid.link/20250109111057.4746-1-lirongqing@baidu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 86e25f40aa ("net: napi: Add napi_config") moved napi->napi_id
assignment to a later point in time (napi_hash_add_with_id). This breaks
__xdp_rxq_info_reg which copies napi_id at an earlier time and now
stores 0 napi_id. It also makes sk_mark_napi_id_once_xdp and
__sk_mark_napi_id_once useless because they now work against 0 napi_id.
Since sk_busy_loop requires valid napi_id to busy-poll on, there is no way
to busy-poll AF_XDP sockets anymore.
Bring back the ability to busy-poll on XSK by resolving socket's napi_id
at bind time. This relies on relatively recent netif_queue_set_napi,
but (assume) at this point most popular drivers should have been converted.
This also removes per-tx/rx cycles which used to check and/or set
the napi_id value.
Confirmed by running a busy-polling AF_XDP socket
(github.com/fomichev/xskrtt) on mlx5 and looking at BusyPollRxPackets
from /proc/net/netstat.
Fixes: 86e25f40aa ("net: napi: Add napi_config")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250109003436.2829560-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As pointed out in the original comment, lookup in sockmap can return a TCP
ESTABLISHED socket. Such TCP socket may have had SO_ATTACH_REUSEPORT_EBPF
set before it was ESTABLISHED. In other words, a non-NULL sk_reuseport_cb
does not imply a non-refcounted socket.
Drop sk's reference in both error paths.
unreferenced object 0xffff888101911800 (size 2048):
comm "test_progs", pid 44109, jiffies 4297131437
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
80 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 9336483b):
__kmalloc_noprof+0x3bf/0x560
__reuseport_alloc+0x1d/0x40
reuseport_alloc+0xca/0x150
reuseport_attach_prog+0x87/0x140
sk_reuseport_attach_bpf+0xc8/0x100
sk_setsockopt+0x1181/0x1990
do_sock_setsockopt+0x12b/0x160
__sys_setsockopt+0x7b/0xc0
__x64_sys_setsockopt+0x1b/0x30
do_syscall_64+0x93/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: 64d85290d7 ("bpf: Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASH")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250110-reuseport-memleak-v1-1-fa1ddab0adfe@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmd/mNkACgkQrB3Eaf9P
W7cQww/9Hnv7+wosuBxFW2o2ptgZThiwj/Ovz0oiWGWvctpN7VlpwmrDOsXQ+XMn
xF6JBFEjnJanYoDBb78D0dMffJcMcUKZopTUU/ZMitSNr8aIHYiuB4SWPG1tqxl4
Ete7Mr2m3tS96YePQNnAaRZzEuGsx3BQb28VLTWl9So81MByD2OK4fsAbYz22Gg8
7A6tDHn1mUd9b2VG+LeeBZaDDFG8C0O2x4E/8Z3DX3z1N8y3LABPwZ38jcgTviKO
1ZldGrJT+PownBydu23bWDassKE2TuVvGH9e/SOPeQj8DJ4Lmd0bafMZTY6xwfNT
RJCwhlzZUpYRXFzvcf3+U3egsqEWEemV7/LzAapdT0V9OqfLWMUh3b1jMA4KcblZ
qmlm/MhZyXutDoeuASwtM4jgM3wGwovOofrKKsb13hD9VLBs5jFZmFSw5TlbmwE3
sjv7V4pFwNyfJnwQtyMmfuuHiy8w+fzqAA2GCg8mF3OosHABH/FOvEBP8xg1Vqu1
iKlLkByfyaCFD+GxTPzqSDvSB8nDzeZBgBM/ILGuwH0OWr+0gxBzl1sxrhAkw9hC
Gf+4J3wg7EknTxfrJk4LyqfyS50GvUIzpLSkHYxDtQRd4zv75bxRzMvoZvuapTtN
GGpWYftKO8n2kgLORQ0dTtFee3c4w/No+KXWsFdtRVNpyk3N0Yw=
=U79q
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2025-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
ipsec-next-2025-01-09
1) Implement the AGGFRAG protocol and basic IP-TFS (RFC9347) functionality.
From Christian Hopps.
2) Support ESN context update to hardware for TX.
From Jianbo Liu.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When jumping to 'martian_destination' a drop reason is always set but
that label falls-through the 'e_nobufs' one, overriding the value.
The behavior was introduced by the mentioned commit. The logic went
from,
goto martian_destination;
...
martian_destination:
...
e_inval:
err = -EINVAL;
goto out;
e_nobufs:
err = -ENOBUFS;
goto out;
to,
reason = ...;
goto martian_destination;
...
martian_destination:
...
e_nobufs:
reason = SKB_DROP_REASON_NOMEM;
goto out;
A 'goto out' is clearly missing now after 'martian_destination' to avoid
overriding the drop reason.
Fixes: 5b92112acd ("net: ip: make ip_route_input_slow() return drop reasons")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250108165725.404564-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Switch to using hvhdk.h everywhere in the kernel. This header
includes all the new Hyper-V headers in include/hyperv, which form a
superset of the definitions found in hyperv-tlfs.h.
This makes it easier to add new Hyper-V interfaces without being
restricted to those in the TLFS doc (reflected in hyperv-tlfs.h).
To be more consistent with the original Hyper-V code, the names of
some definitions are changed slightly. Update those where needed.
Update comments in mshyperv.h files to point to include/hyperv for
adding new definitions.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com
Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmd/wXMACgkQ1w0aZmrP
KyHUPRAAolGTWs541hsCAqrtGN+hk4jzm9WKBAVr7A/52TVqxIT4pfzVoeFMkCyI
Q+1US3W/PuuZOUcdE1Nh6fGu+FycTVY2HO/hsWTzNt617JmtYYCANkNYnfKBcMrQ
o7yCwxCI8KNBdsG/yBByrnHiuxZC1gI0d2SzCLN/2phtW/8nMevHcMZfhUzl6D0s
eIbzgNQE4KB2+ntGGgCqfNf2hTJrhQDvIDUKDho5hC+FQOo91tNt1zED+u/N3S5Y
ak16JYGBju9JLuMssDqgYLuupKGOzxmUFQokqMEMCkydoPlX8R3fxRSVAXv7DVUq
Y+W6Lo+5zj8BduDsCTEIkk3piv7jMcZAc7E8BkXLcR/Wj/XUyP4pKHhX+pDGZjA5
rHsemh17Xozj431XLhxxDhJ8p7n4xx8d9auijoBoeWREpacLIQvBUW4L7HEwfq+p
0wg8a0F1izwJZWQS4+zQis8EcDo3da/0idFSMP0JFVigPVGefHnJkj9iO0l+VfoL
PlDY5Qqty+uQbykvV7YFVRt3iD4nttux2hP+r7k/KCcrD2BmJbkgpC9tTej9noiK
OjStQWbvyLKFE9bAxvxOYa7NaE3TdOkQFTgJYVny5r0oQU7aClv092WM3ReQSlvr
tHN3+ze2hiehKwYKmRvHUSbJPNrC6ISh4qpNBoZKJ2TgGkKsGys=
=PG5v
-----END PGP SIGNATURE-----
Merge tag 'nf-25-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix imbalance between flowtable BIND and UNBIND calls to configure
hardware offload, this fixes a possible kmemleak.
2) Clamp maximum conntrack hashtable size to INT_MAX to fix a possible
WARN_ON_ONCE splat coming from kvmalloc_array(), only possible from
init_netns.
* tag 'nf-25-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: conntrack: clamp maximum hashtable size to INT_MAX
netfilter: nf_tables: imbalance in flowtable binding
====================
Link: https://patch.msgid.link/20250109123532.41768-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The per-netns structure can be obtained from the table->data using
container_of(), then the 'net' one can be retrieved from the listen
socket (if available).
Fixes: c6a58ffed5 ("RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-9-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' structure can be obtained from the table->data using
container_of().
Note that table->data could also be used directly, as this is the only
member needed from the 'net' structure, but that would increase the size
of this fix, to use '*data' everywhere 'net->sctp.probe_interval' is
used.
Fixes: d1e462a7a5 ("sctp: add probe_interval in sysctl and sock/asoc/transport")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-8-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' structure can be obtained from the table->data using
container_of().
Note that table->data could also be used directly, but that would
increase the size of this fix, while 'sctp.ctl_sock' still needs to be
retrieved from 'net' structure.
Fixes: 046c052b47 ("sctp: enable udp tunneling socks")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-7-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' structure can be obtained from the table->data using
container_of().
Note that table->data could also be used directly, but that would
increase the size of this fix, while 'sctp.ctl_sock' still needs to be
retrieved from 'net' structure.
Fixes: b14878ccb7 ("net: sctp: cache auth_enable per endpoint")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-6-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' structure can be obtained from the table->data using
container_of().
Note that table->data could also be used directly, as this is the only
member needed from the 'net' structure, but that would increase the size
of this fix, to use '*data' everywhere 'net->sctp.rto_min/max' is used.
Fixes: 4f3fdf3bc5 ("sctp: add check rto_min and rto_max in sysctl")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-5-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' structure can be obtained from the table->data using
container_of().
Note that table->data could also be used directly, as this is the only
member needed from the 'net' structure, but that would increase the size
of this fix, to use '*data' everywhere 'net->sctp.sctp_hmac_alg' is
used.
Fixes: 3c68198e75 ("sctp: Make hmac algorithm selection for cookie generation dynamic")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-4-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As mentioned in the previous commit, using the 'net' structure via
'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'pernet' structure can be obtained from the table->data using
container_of().
Fixes: 27069e7cb3 ("mptcp: disable active MPTCP in case of blackhole")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-3-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
'net.mptcp.available_schedulers' sysctl knob is there to list available
schedulers, not to modify this list.
There are then no reasons to give write access to it.
Nothing would have been written anyway, but no errors would have been
returned, which is unexpected.
Fixes: 73c900aa36 ("mptcp: add net.mptcp.available_schedulers")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-1-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Even though we fixed a logic error in the commit cited below, syzbot
still managed to trigger an underflow of the per-host bulk flow
counters, leading to an out of bounds memory access.
To avoid any such logic errors causing out of bounds memory accesses,
this commit factors out all accesses to the per-host bulk flow counters
to a series of helpers that perform bounds-checking before any
increments and decrements. This also has the benefit of improving
readability by moving the conditional checks for the flow mode into
these helpers, instead of having them spread out throughout the
code (which was the cause of the original logic error).
As part of this change, the flow quantum calculation is consolidated
into a helper function, which means that the dithering applied to the
ost load scaling is now applied both in the DRR rotation and when a
sparse flow's quantum is first initiated. The only user-visible effect
of this is that the maximum packet size that can be sent while a flow
stays sparse will now vary with +/- one byte in some cases. This should
not make a noticeable difference in practice, and thus it's not worth
complicating the code to preserve the old behaviour.
Fixes: 546ea84d07 ("sched: sch_cake: fix bulk flow accounting logic for host fairness")
Reported-by: syzbot+f63600d288bfb7057424@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
Link: https://patch.msgid.link/20250107120105.70685-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus suggested during one of past maintainer summits (in context of
a DMA_BUF discussion) that symbol namespaces can be used to prevent
unwelcome but in-tree code from using all exported functions.
Create a namespace for netdev.
Export netdev_rx_queue_restart(), drivers may want to use it since
it gives them a simple and safe way to restart a queue to apply
config changes. But it's both too low level and too actively developed
to be used outside netdev.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Netlink code depends on NAPI instances being sorted by ID on
the netdev list for dump continuation. We need to be able to
find the position on the list where we left off if dump does
not fit in a single skb, and in the meantime NAPI instances
can come and go.
This was trivially true when we were assigning a new ID to every
new NAPI instance. Since we added the NAPI config API, we try
to retain the ID previously used for the same queue, but still
add the new NAPI instance at the start of the list.
This is fine if we reset the entire netdev and all NAPIs get
removed and added back. If driver replaces a NAPI instance
during an operation like DEVMEM queue reset, or recreates
a subset of NAPI instances in other ways we may end up with
broken ordering, and therefore Netlink dumps with either
missing or duplicated entries.
At this stage the problem is theoretical. Only two drivers
support queue API, bnxt and gve. gve recreates NAPIs during
queue reset, but it doesn't support NAPI config.
bnxt supports NAPI config but doesn't recreate instances
during reset.
We need to save the ID in the config as soon as it is assigned
because otherwise the new NAPI will not know what ID it will
get at enable time, at the time it is being added.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Nadia Pinaeva writes:
I am working on a tool that allows collecting network performance
metrics by using conntrack events.
Start time of a conntrack entry is used to evaluate seen_reply
latency, therefore the sooner it is timestamped, the better the
precision is.
In particular, when using this tool to compare the performance of the
same feature implemented using iptables/nftables/OVS it is crucial
to have the entry timestamped earlier to see any difference.
At this time, conntrack events can only get timestamped at recv time in
userspace, so there can be some delay between the event being generated
and the userspace process consuming the message.
There is sys/net/netfilter/nf_conntrack_timestamp, which adds a
64bit timestamp (ns resolution) that records start and stop times,
but its not suited for this either, start time is the 'hashtable insertion
time', not 'conntrack allocation time'.
There is concern that moving the start-time moment to conntrack
allocation will add overhead in case of flooding, where conntrack
entries are allocated and released right away without getting inserted
into the hashtable.
Also, even if this was changed it would not with events other than
new (start time) and destroy (stop time).
Pablo suggested to add new CTA_TIMESTAMP_EVENT, this adds this feature.
The timestamp is recorded in case both events are requested and the
sys/net/netfilter/nf_conntrack_timestamp toggle is enabled.
Reported-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use INT_MAX as maximum size for the conntrack hashtable. Otherwise, it
is possible to hit WARN_ON_ONCE in __kvmalloc_node_noprof() when
resizing hashtable because __GFP_NOWARN is unset. See:
0708a0afe2 ("mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls")
Note: hashtable resize is only possible from init_netns.
Fixes: 9cc1c73ad6 ("netfilter: conntrack: avoid integer overflow when resizing")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All these cases cause imbalance between BIND and UNBIND calls:
- Delete an interface from a flowtable with multiple interfaces
- Add a (device to a) flowtable with --check flag
- Delete a netns containing a flowtable
- In an interactive nft session, create a table with owner flag and
flowtable inside, then quit.
Fix it by calling FLOW_BLOCK_UNBIND when unregistering hooks, then
remove late FLOW_BLOCK_UNBIND call when destroying flowtable.
Fixes: ff4bf2f42a ("netfilter: nf_tables: add nft_unregister_flowtable_hook()")
Reported-by: Phil Sutter <phil@nwl.cc>
Tested-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A synchronize_rcu() was added by mistake in commit
c5a7591172 ("net/hsr: Use list_head (and rcu) instead
of array for slave devices.")
RCU does not mandate to observe a grace period after
list_add_tail_rcu().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250107144701.503884-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In commit 66e4c8d950 ("net: warn if transport header was not set")
I added a debug check in skb_transport_header() to detect
if a caller expects the transport_header to be set to a meaningful
value by a prior code path.
Unfortunately, __netif_receive_skb_core() resets the transport header
to the same value than the network header, defeating this check
in receive paths.
Pretending the transport and network headers are the same
is usually wrong.
This patch removes this reset for CONFIG_DEBUG_NET=y builds
to let fuzzers and CI find bugs.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250107144342.499759-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change introduces a mechanism for notifying userspace
applications about changes to IPv6 anycast addresses via netlink. It
includes:
* Addition and deletion of IPv6 anycast addresses are reported using
RTM_NEWANYCAST and RTM_DELANYCAST.
* A new netlink group (RTNLGRP_IPV6_ACADDR) for subscribing to these
notifications.
This enables user space applications(e.g. ip monitor) to efficiently
track anycast addresses through netlink messages, improving metrics
collection and system monitoring. It also unlocks the potential for
advanced anycast management in user space, such as hardware offload
control and fine grained network control.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250107114355.1766086-1-yuyanghuang@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- btmtk: Fix failed to send func ctrl for MediaTek devices.
- hci_sync: Fix not setting Random Address when required
- MGMT: Fix Add Device to responding before completing
- btnxpuart: Fix driver sending truncated data
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmd+pn4ZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKf0tD/wLkaVKDMSOL30kQLjgaHS/
hrInB7XHkT3JvWtxaOc3z/D7+cdtvZQ+j5h+Hzuh1SCTvGro0is67LqMb9BfBtNQ
DR6Fn1+H8Kr5iLA1wgxepuSiRD5xmakqceX3UfGMfyh5utgYr7y/6br0056a9lz/
b7dHbrXAaao1GrAW9RPcyl93WJswGxfEbfk6b5DQydZc+zRu5EXdBdrdQyArTOS4
YjyOvLPQB43E88fDYVW4JCjue6dcw4MQi93L1JV67ygG8EE7ONFxjWuEVZ9d01kq
f+jegcB9lFDkgZZ0Hp7SNiRBMdkUW4eI221h7F7/Jzh6R0izCBkSoTsHc9sELMiW
T1WcQ9VDP5tY107SlDq9eCwQ0rrsFRo7Aqiii40ZIt1hXzoN7ge6cVQ+0LeHmjNK
Afg9iJS427YazJfJcX6ouKXhZGJaUu7qhG6IaXlCT5s38wb2Bv2qDLWpL2rzsfRG
HUwn5VPpWALljqHQBlrIdTh0bD9MnzRxE/bZl8uf1CFnyNvWj1SMKVIckODTur2n
MRosDawcHt31Zb8oEH75m32Nwnpj0yzJh/d6M9F0RQMpIw/jj9ijYLrfyg9P8NMR
blVLOjztxliXYcBgslhGmigOXQ0esMj0BBMBlVMTrc24jWq359ds20Xoj/ooD7VM
BLt8HL0cdQlyQSpDw8rZgg==
=8VPB
-----END PGP SIGNATURE-----
Merge tag 'for-net-2025-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- btmtk: Fix failed to send func ctrl for MediaTek devices.
- hci_sync: Fix not setting Random Address when required
- MGMT: Fix Add Device to responding before completing
- btnxpuart: Fix driver sending truncated data
* tag 'for-net-2025-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: btmtk: Fix failed to send func ctrl for MediaTek devices.
Bluetooth: btnxpuart: Fix driver sending truncated data
Bluetooth: MGMT: Fix Add Device to responding before completing
Bluetooth: hci_sync: Fix not setting Random Address when required
====================
Link: https://patch.msgid.link/20250108162627.1623760-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bpf_sk_storage_clone() will call bpf_selem_free() to free the clone
element when the allocation of new sock storage fails. bpf_selem_free()
will call check_and_free_fields() to free the special fields in the
element. Since the allocated element is not visible to bpf syscall or
bpf program when bpf_local_storage_alloc() fails, these special fields
in the element must be all zero when invoking bpf_selem_free().
To be uniform with other callers of bpf_selem_free(), disabling
migration when cloning sock storage. Adding migrate_{disable|enable}
pair also benefits the potential switching from kzalloc to bpf memory
allocator for sock storage.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-9-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When destroying sock storage, it invokes bpf_local_storage_destroy() to
remove all storage elements saved in the sock storage. The destroy
procedure will call bpf_selem_free() to free the element, and
bpf_selem_free() calls bpf_obj_free_fields() to free the special fields
in map value (e.g., kptr). Since kptrs may be allocated from bpf memory
allocator, migrate_{disable|enable} pairs are necessary for the freeing
of these kptrs.
To simplify reasoning about when migrate_disable() is needed for the
freeing of these dynamically-allocated kptrs, let the caller to
guarantee migration is disabled before invoking bpf_obj_free_fields().
Therefore, the patch adds migrate_{disable|enable} pair in
bpf_sock_storage_free(). The migrate_{disable|enable} pairs in the
underlying implementation of bpf_obj_free_fields() will be removed by
The following patch.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-8-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This is a follow-up to 3c5b4d69c3 ("net: annotate data-races around
sk->sk_mark"). sk->sk_mark can be read and written without holding
the socket lock. IPv6 equivalent is already covered with READ_ONCE()
annotation in tcp_v6_send_response().
Fixes: 3c5b4d69c3 ("net: annotate data-races around sk->sk_mark")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/f459d1fc44f205e13f6d8bdca2c8bfb9902ffac9.1736244569.git.daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The NAPI IDs were not fully exposed to user space prior to the netlink
API, so they were never namespaced. The netlink API must ensure that
at the very least NAPI instance belongs to the same netns as the owner
of the genl sock.
napi_by_id() can become static now, but it needs to move because of
dev_get_by_napi_id().
Cc: stable@vger.kernel.org
Fixes: 1287c1ae0f ("netdev-genl: Support setting per-NAPI config values")
Fixes: 27f91aaf49 ("netdev-genl: Add netlink framework functions for napi")
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250106180137.1861472-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kthread_create() creates a kthread without running it yet. kthread_run()
creates a kthread and runs it.
On the other hand, kthread_create_worker() creates a kthread worker and
runs it.
This difference in behaviours is confusing. Also there is no way to
create a kthread worker and affine it using kthread_bind_mask() or
kthread_affine_preferred() before starting it.
Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]()
that behaves just like kthread_run(). kthread_create_worker[_on_cpu]()
will now only create a kthread worker without starting it.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Use usb_autopm_get_interface() and usb_autopm_put_interface()
in btmtk_usb_shutdown(), it could send func ctrl after enabling
autosuspend.
Bluetooth: btmtk_usb_hci_wmt_sync() hci0: Execution of wmt command
timed out
Bluetooth: btmtk_usb_shutdown() hci0: Failed to send wmt func ctrl
(-110)
Fixes: 5c5e8c52e3 ("Bluetooth: btmtk: move btusb_mtk_[setup, shutdown] to btmtk.c")
Signed-off-by: Chris Lu <chris.lu@mediatek.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Add Device with LE type requires updating resolving/accept list which
requires quite a number of commands to complete and each of them may
fail, so instead of pretending it would always work this checks the
return of hci_update_passive_scan_sync which indicates if everything
worked as intended.
Fixes: e8907f7654 ("Bluetooth: hci_sync: Make use of hci_cmd_sync_queue set 3")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This fixes errors such as the following when Own address type is set to
Random Address but it has not been programmed yet due to either be
advertising or connecting:
< HCI Command: LE Set Exte.. (0x08|0x0041) plen 13
Own address type: Random (0x03)
Filter policy: Ignore not in accept list (0x01)
PHYs: 0x05
Entry 0: LE 1M
Type: Passive (0x00)
Interval: 60.000 msec (0x0060)
Window: 30.000 msec (0x0030)
Entry 1: LE Coded
Type: Passive (0x00)
Interval: 180.000 msec (0x0120)
Window: 90.000 msec (0x0090)
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Scan Parameters (0x08|0x0041) ncmd 1
Status: Success (0x00)
< HCI Command: LE Set Exten.. (0x08|0x0042) plen 6
Extended scan: Enabled (0x01)
Filter duplicates: Enabled (0x01)
Duration: 0 msec (0x0000)
Period: 0.00 sec (0x0000)
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Scan Enable (0x08|0x0042) ncmd 1
Status: Invalid HCI Command Parameters (0x12)
Fixes: c45074d68a ("Bluetooth: Fix not generating RPA when required")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
All implementations of get_mac_eee() now just return zero without doing
anything useful. Remove the call to this method in preparation to
removing the method from each DSA driver.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tUlkz-007Uyl-UA@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(un)?register_netdevice_notifier_dev_net() hold RTNL before triggering
the notifier for all netdev in the netns.
Let's convert the RTNL to rtnl_net_lock().
Note that move_netdevice_notifiers_dev_net() is assumed to be (but not
yet) protected by per-netns RTNL of both src and dst netns; we need to
convert wireless and hyperv drivers that call dev_change_net_namespace().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(un)?register_netdevice_notifier_net() hold RTNL before triggering the
notifier for all netdev in the netns.
Let's convert the RTNL to rtnl_net_lock().
Note that the per-netns netdev notifier is protected by per-netns RTNL.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit d7811e623d ("[NET]: Drop tx lock in dev_watchdog_up")
dev_watchdog_up() became a simple wrapper for __netdev_watchdog_up()
Herbert also said : "In 2.6.19 we can eliminate the unnecessary
__dev_watchdog_up and replace it with dev_watchdog_up."
This patch consolidates things to have only two functions, with
a common prefix.
- netdev_watchdog_up(), exported for the sake of one freescale driver.
This replaces __netdev_watchdog_up() and dev_watchdog_up().
- netdev_watchdog_down(), static to net/sched/sch_generic.c
This replaces dev_watchdog_down().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250105090924.1661822-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We've noticed that NFS can hang when using RPC over TLS on an unstable
connection, and investigation shows that the RPC layer is stuck in a tight
loop attempting to transmit, but forever getting -EBADMSG back from the
underlying network. The loop begins when tcp_sendmsg_locked() returns
-EPIPE to tls_tx_records(), but that error is converted to -EBADMSG when
calling the socket's error reporting handler.
Instead of converting errors from tcp_sendmsg_locked(), let's pass them
along in this path. The RPC layer handles -EPIPE by reconnecting the
transport, which prevents the endless attempts to transmit on a broken
connection.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Link: https://patch.msgid.link/9594185559881679d81f071b181a10eb07cd079f.1736004079.git.bcodding@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The skb_buff struct in br_is_nd_neigh_msg() is never modified. Mark it as
const.
Signed-off-by: Ted Chen <znscnchen@gmail.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250104083846.71612-1-znscnchen@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Let's hold per-netns RTNL of dev_net(dev) in register_netdev()
and unregister_netdev().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
rtnl_lock_killable() is used only in register_netdev()
and will be converted to per-netns RTNL.
Let's unexport it and add the corresponding helper.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Previously xfrm_dev_state_advance_esn() was added for RX only. But
it's possible that ESN context also need to be synced to hardware for
TX, so call it for outbound in this patch.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We use NAPI ID as the key for continuing dumps. We also depend
on the NAPIs being sorted by ID within the driver list. Tx NAPIs
(which don't have an ID assigned) break this expectation, it's
not currently possible to dump them reliably. Since Tx NAPIs
are relatively rare, and can't be used in doit (GET or SET)
hide them from the dump API as well.
Fixes: 27f91aaf49 ("netdev-genl: Add netlink framework functions for napi")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250103183207.1216004-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use kfree_rcu() instead of synchronize_rcu()+kfree().
This might allow syzbot to fuzz HSR a bit faster...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250103101148.3594545-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Define inet_sk_dscp() to get a dscp_t value from struct inet_sock, so
that sctp_v4_get_dst() can easily set ->flowi4_tos from a dscp_t
variable. For the SCTP_DSCP_SET_MASK case, we can just use
inet_dsfield_to_dscp() to get a dscp_t value.
Then, when converting ->flowi4_tos from __u8 to dscp_t, we'll just have
to drop the inet_dscp_to_dsfield() conversion function.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/1a645f4a0bc60ad18e7c0916642883ce8a43c013.1735835456.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rcu_read_lock/rcu_read_unlock has already provide protection for the
pointer we will reference when we call c_show. Therefore, there is no
need to obtain a cache reference to help protect cache_head.
Additionally, the .put such as expkey_put/svc_export_put will invoke
dput, which can sleep and break rcu. Stop get cache reference to fix
them all.
Fixes: ae74136b4b ("SUNRPC: Allow cache lookups to use RCU protection rather than the r/w spinlock")
Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This is a prepare patch to add cache_check_rcu, will use it with follow
patch.
Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that the connection limit only apply to unconfirmed connections,
there is no need to configure it. So remove all the configuration and
fix the number of unconfirmed connections as always 64 - which is
now given a name: XPT_MAX_TMP_CONN
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The heuristic for limiting the number of incoming connections to nfsd
currently uses sv_nrthreads - allowing more connections if more threads
were configured.
A future patch will allow number of threads to grow dynamically so that
there will be no need to configure sv_nrthreads. So we need a different
solution for limiting connections.
It isn't clear what problem is solved by limiting connections (as
mentioned in a code comment) but the most likely problem is a connection
storm - many connections that are not doing productive work. These will
be closed after about 6 minutes already but it might help to slow down a
storm.
This patch adds a per-connection flag XPT_PEER_VALID which indicates
that the peer has presented a filehandle for which it has some sort of
access. i.e the peer is known to be trusted in some way. We now only
count connections which have NOT been determined to be valid. There
should be relative few of these at any given time.
If the number of non-validated peer exceed a limit - currently 64 - we
close the oldest non-validated peer to avoid having too many of these
useless connections.
Note that this patch significantly changes the meaning of the various
configuration parameters for "max connections". The next patch will
remove all of these.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reading is very slow because ->start() performs a linear re-scan of the
entire hash table until it finds the successor to the last dumped
element. The current implementation uses 'pos' as the 'number of
elements to skip, then does linear iteration until it has skipped
'pos' entries.
Store the last bucket and the number of elements to skip in that
bucket instead, so we can resume from bucket b directly.
before this patch, its possible to read ~35k entries in one second, but
each read() gets slower as the number of entries to skip grows:
time timeout 60 cat /proc/net/ip_vs_conn > /tmp/all; wc -l /tmp/all
real 1m0.007s
user 0m0.003s
sys 0m59.956s
140386 /tmp/all
Only ~100k more got read in remaining the remaining 59s, and did not get
nowhere near the 1m entries that are stored at the time.
after this patch, dump completes very quickly:
time cat /proc/net/ip_vs_conn > /tmp/all; wc -l /tmp/all
real 0m2.286s
user 0m0.004s
sys 0m2.281s
1000001 /tmp/all
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The genmask parameter is not used within the nf_tables_addchain function
body. It should be removed to simplify the function parameter list.
Signed-off-by: tuqiang <tu.qiang35@zte.com.cn>
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
security_secid_to_secctx() returns the size of the new context,
whereas previous versions provided that via a pointer parameter.
Correct the type of the value returned in nfqnl_get_sk_secctx()
and the check for error in netlbl_unlhsh_add(). Add an error
check.
Fixes: 2d470c7781 ("lsm: replace context+len with lsm_context")
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Stefan Schmidt says:
====================
pull-request: ieee802154-next 2025-01-03
Leo Stone provided a documatation fix to improve the grammar.
David Gilbert spotted a non-used fucntion we can safely remove.
* tag 'ieee802154-for-net-next-2025-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan-next:
net: mac802154: Remove unused ieee802154_mlme_tx_one
Documentation: ieee802154: fix grammar
====================
Link: https://patch.msgid.link/20250103154605.440478-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stefan Schmidt says:
====================
pull-request: ieee802154 for net 2025-01-03
Keisuke Nishimura provided a fix to check for kfifo_alloc() in the ca8210
driver.
Lizhi Xu provided a fix a corrupted list, found by syzkaller, by checking local
interfaces first.
* tag 'ieee802154-for-net-2025-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan:
mac802154: check local interfaces before deleting sdata list
ieee802154: ca8210: Add missing check for kfifo_alloc() in ca8210_probe()
====================
Link: https://patch.msgid.link/20250103160046.469363-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
802.2+LLC+SNAP frames received by napi_complete_done() with GRO and DSA
have skb->transport_header set two bytes short, or pointing 2 bytes
before network_header & skb->data. This was an issue as snap_rcv()
expected offset to point to SNAP header (OID:PID), causing packet to
be dropped.
A fix at llc_fixup_skb() (a024e377ef) resets transport_header for any
LLC consumers that may care about it, and stops SNAP packets from being
dropped, but doesn't fix the problem which is that LLC and SNAP should
not use transport_header offset.
Ths patch eliminates the use of transport_header offset for SNAP lookup
of OID:PID so that SNAP does not rely on the offset at all.
The offset is reset after pull for any SNAP packet consumers that may
(but shouldn't) use it.
Fixes: fda55eca5a ("net: introduce skb_transport_header_was_set()")
Signed-off-by: Antonio Pastor <antonio.pastor@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250103012303.746521-1-antonio.pastor@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
two-thirds of our normal weekly fix count, but delaying sending these
until -rc7 seemed like a really bad idea.
AFAIK we have no bugs under investigation. One or two reverts for
stuff for which we haven't gotten a proper fix will likely come in
the next PR.
Including fixes from wireles and netfilter.
Current release - fix to a fix:
- netfilter: nft_set_hash: unaligned atomic read on struct nft_set_ext
- eth: gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup
Previous releases - regressions:
- net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets
- mptcp:
- fix sleeping rcvmsg sleeping forever after bad recvbuffer adjust
- fix TCP options overflow
- prevent excessive coalescing on receive, fix throughput
- net: fix memory leak in tcp_conn_request() if map insertion fails
- wifi: cw1200: fix potential NULL dereference after conversion
to GPIO descriptors
- phy: micrel: dynamically control external clock of KSZ PHY,
fix suspend behavior
Previous releases - always broken:
- af_packet: fix VLAN handling with MSG_PEEK
- net: restrict SO_REUSEPORT to inet sockets
- netdev-genl: avoid empty messages in NAPI get
- dsa: microchip: fix set_ageing_time function on KSZ9477 and LAN937X
- eth: gve: XDP fixes around transmit, queue wakeup etc.
- eth: ti: icssg-prueth: fix firmware load sequence to prevent time
jump which breaks timesync related operations
Misc:
- netlink: specs: mptcp: add missing attr and improve documentation
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmd4FCgACgkQMUZtbf5S
IruB7hAAkppr4+4kywDq/PIqgkisjYzpNIQIrphebUeIHIcAizWbJEL9CiHXnCQD
w95+nVhzDmsWXdzE2AvwM/2YAByLP6CpgKEa+7rI3K5BQZ86lbzm+ftOOFYiF3eV
2v5yLfjgMVbzwqzuP3ivkNrtG95IGQFPobLl7tI9Wpgm3yD0+H8BCqtZNAwA4N6g
mEdZD1jMOzkdINoJBg35O70B2GDq/WSS1N8dgxtj1F7EPDuQdO1kijJmqFjYT3fw
30/NdjlfGtbFro6zp+8ZY6P9RG3CLNhYXu+nUqdOHJOJw65aXUxOPjWCRp3o6m12
HIIp22imDTvNTSCfn0wlbGJfC+v9HdIgB9gqApAaxNOS02bQgEy5tuxxXNTo2B1I
bqH7GJ+6a7axsQ6BnpcOcDhu2fnbtguj1W/zC/MlGQmZZj2g+UNCf4QqPir/XkcG
r43/iy2/L8cgZWYXzH1skfUMgbymI5uLytO+n+CLsb6P/N3WLa7wxHNV5Lni+Xng
qDyd0TwUccCr2KBfZqjgFeGbvQ7qub+eeQCYx9kUAaFhKUsQB/eZ5i9C7SsfutMa
xzwITqCkcvmdyU3gG3v/oABtQkxdTcRqWCfYDGcXm0zb1jzMf5ZGN4E/gABYM7p9
EewL1kCAt4ULmBc8Hqrj+2Erlzh+xsodvvfpgpJandMVXaJW44k=
=YKS4
-----END PGP SIGNATURE-----
Merge tag 'net-6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireles and netfilter.
Nothing major here. Over the last two weeks we gathered only around
two-thirds of our normal weekly fix count, but delaying sending these
until -rc7 seemed like a really bad idea.
AFAIK we have no bugs under investigation. One or two reverts for
stuff for which we haven't gotten a proper fix will likely come in the
next PR.
Current release - fix to a fix:
- netfilter: nft_set_hash: unaligned atomic read on struct
nft_set_ext
- eth: gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup
Previous releases - regressions:
- net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets
- mptcp:
- fix sleeping rcvmsg sleeping forever after bad recvbuffer adjust
- fix TCP options overflow
- prevent excessive coalescing on receive, fix throughput
- net: fix memory leak in tcp_conn_request() if map insertion fails
- wifi: cw1200: fix potential NULL dereference after conversion to
GPIO descriptors
- phy: micrel: dynamically control external clock of KSZ PHY, fix
suspend behavior
Previous releases - always broken:
- af_packet: fix VLAN handling with MSG_PEEK
- net: restrict SO_REUSEPORT to inet sockets
- netdev-genl: avoid empty messages in NAPI get
- dsa: microchip: fix set_ageing_time function on KSZ9477 and LAN937X
- eth:
- gve: XDP fixes around transmit, queue wakeup etc.
- ti: icssg-prueth: fix firmware load sequence to prevent time
jump which breaks timesync related operations
Misc:
- netlink: specs: mptcp: add missing attr and improve documentation"
* tag 'net-6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
net: ti: icssg-prueth: Fix clearing of IEP_CMP_CFG registers during iep_init
net: ti: icssg-prueth: Fix firmware load sequence.
mptcp: prevent excessive coalescing on receive
mptcp: don't always assume copied data in mptcp_cleanup_rbuf()
mptcp: fix recvbuffer adjust on sleeping rcvmsg
ila: serialize calls to nf_register_net_hooks()
af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK
af_packet: fix vlan_get_tci() vs MSG_PEEK
net: wwan: iosm: Properly check for valid exec stage in ipc_mmio_init()
net: restrict SO_REUSEPORT to inet sockets
net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets
net: sfc: Correct key_len for efx_tc_ct_zone_ht_params
net: wwan: t7xx: Fix FSM command timeout issue
sky2: Add device ID 11ab:4373 for Marvell 88E8075
mptcp: fix TCP options overflow.
net: mv643xx_eth: fix an OF node reference leak
gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup
eth: bcmsysport: fix call balance of priv->clk handling routines
net: llc: reset skb->transport_header
netlink: specs: mptcp: fix missing doc
...
Constify the following API:
struct device *device_find_child(struct device *dev, void *data,
int (*match)(struct device *dev, void *data));
To :
struct device *device_find_child(struct device *dev, const void *data,
device_match_t match);
typedef int (*device_match_t)(struct device *dev, const void *data);
with the following reasons:
- Protect caller's match data @*data which is for comparison and lookup
and the API does not actually need to modify @*data.
- Make the API's parameters (@match)() and @data have the same type as
all of other device finding APIs (bus|class|driver)_find_device().
- All kinds of existing device match functions can be directly taken
as the API's argument, they were exported by driver core.
Constify the API and adapt for various existing usages.
BTW, various subsystem changes are squashed into this commit to meet
'git bisect' requirement, and this commit has the minimal and simplest
changes to complement squashing shortcoming, and that may bring extra
code improvement.
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Acked-by: Uwe Kleine-König <ukleinek@kernel.org> # for drivers/pwm
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Link: https://lore.kernel.org/r/20241224-const_dfc_done-v5-4-6623037414d4@quicinc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently the skb size after coalescing is only limited by the skb
layout (the skb must not carry frag_list). A single coalesced skb
covering several MSS can potentially fill completely the receive
buffer. In such a case, the snd win will zero until the receive buffer
will be empty again, affecting tput badly.
Fixes: 8268ed4c9d ("mptcp: introduce and use mptcp_try_coalesce()")
Cc: stable@vger.kernel.org # please delay 2 weeks after 6.13-final release
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241230-net-mptcp-rbuf-fixes-v1-3-8608af434ceb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Under some corner cases the MPTCP protocol can end-up invoking
mptcp_cleanup_rbuf() when no data has been copied, but such helper
assumes the opposite condition.
Explicitly drop such assumption and performs the costly call only
when strictly needed - before releasing the msk socket lock.
Fixes: fd8976790a ("mptcp: be careful on MPTCP-level ack.")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241230-net-mptcp-rbuf-fixes-v1-2-8608af434ceb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the recvmsg() blocks after receiving some data - i.e. due to
SO_RCVLOWAT - the MPTCP code will attempt multiple times to
adjust the receive buffer size, wrongly accounting every time the
cumulative of received data - instead of accounting only for the
delta.
Address the issue moving mptcp_rcv_space_adjust just after the
data reception and passing it only the just received bytes.
This also removes an unneeded difference between the TCP and MPTCP
RX code path implementation.
Fixes: 5813022985 ("mptcp: error out earlier on disconnect")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241230-net-mptcp-rbuf-fixes-v1-1-8608af434ceb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The "struct sock *sk" parameter in ip_rcv_finish_core is unused, which
leads the compiler to optimize it out. As a result, the
"struct sk_buff *skb" parameter is passed using x1. And this make kprobe
hard to use.
Signed-off-by: Yu Tian <tianyu2@kernelsoft.com>
Link: https://patch.msgid.link/20241231023610.1657926-1-tianyu2@kernelsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The blamed commit disabled hardware offoad of IPv6 packets with
extension headers on devices that advertise NETIF_F_IPV6_CSUM,
based on the definition of that feature in skbuff.h:
* * - %NETIF_F_IPV6_CSUM
* - Driver (device) is only able to checksum plain
* TCP or UDP packets over IPv6. These are specifically
* unencapsulated packets of the form IPv6|TCP or
* IPv6|UDP where the Next Header field in the IPv6
* header is either TCP or UDP. IPv6 extension headers
* are not supported with this feature. This feature
* cannot be set in features for a device with
* NETIF_F_HW_CSUM also set. This feature is being
* DEPRECATED (see below).
The change causes skb_warn_bad_offload to fire for BIG TCP
packets.
[ 496.310233] WARNING: CPU: 13 PID: 23472 at net/core/dev.c:3129 skb_warn_bad_offload+0xc4/0xe0
[ 496.310297] ? skb_warn_bad_offload+0xc4/0xe0
[ 496.310300] skb_checksum_help+0x129/0x1f0
[ 496.310303] skb_csum_hwoffload_help+0x150/0x1b0
[ 496.310306] validate_xmit_skb+0x159/0x270
[ 496.310309] validate_xmit_skb_list+0x41/0x70
[ 496.310312] sch_direct_xmit+0x5c/0x250
[ 496.310317] __qdisc_run+0x388/0x620
BIG TCP introduced an IPV6_TLV_JUMBO IPv6 extension header to
communicate packet length, as this is an IPv6 jumbogram. But, the
feature is only enabled on devices that support BIG TCP TSO. The
header is only present for PF_PACKET taps like tcpdump, and not
transmitted by physical devices.
For this specific case of extension headers that are not
transmitted, return to the situation before the blamed commit
and support hardware offload.
ipv6_has_hopopt_jumbo() tests not only whether this header is present,
but also that it is the only extension header before a terminal (L4)
header.
Fixes: 04c20a9356 ("net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89iK1hdC3Nt8KPhOtTF8vCPc1AHDCtse_BTNki1pWxAByTQ@mail.gmail.com/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250101164909.1331680-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current PF number description is vague, sometimes interpreted as
some PF index. VF number in the PCI specification starts at 1; however
in kernel, it starts at 0 for representor model.
Improve the description of devlink port attributes PF, VF and SF
numbers with these details.
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/20241224183706.26571-1-parav@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Marek Lindner <marek.lindner@mailbox.org>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
ieee802154_mlme_tx_one() was added in 2022 by
commit ddd9ee7cda ("net: mac802154: Introduce a synchronous API for MLME
commands") but has remained unused.
Remove it.
Note, there's still a ieee802154_mlme_tx_one_locked()
variant that is used.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/20241225012423.439229-1-linux@treblig.org
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
802.2+LLC+SNAP frames received by napi_complete_done with GRO and DSA
have skb->transport_header set two bytes short, or pointing 2 bytes
before network_header & skb->data. As snap_rcv expects transport_header
to point to SNAP header (OID:PID) after LLC processing advances offset
over LLC header (llc_rcv & llc_fixup_skb), code doesn't find a match
and packet is dropped.
Between napi_complete_done and snap_rcv, transport_header is not used
until __netif_receive_skb_core, where originally it was being reset.
Commit fda55eca5a ("net: introduce skb_transport_header_was_set()")
only does so if not set, on the assumption the value was set correctly
by GRO (and also on assumption that "network stacks usually reset the
transport header anyway"). Afterwards it is moved forward by
llc_fixup_skb.
Locally generated traffic shows up at __netif_receive_skb_core with no
transport_header set and is processed without issue. On a setup with
GRO but no DSA, transport_header and network_header are both set to
point to skb->data which is also correct.
As issue is LLC specific, to avoid impacting non-LLC traffic, and to
follow up on original assumption made on previous code change,
llc_fixup_skb to reset the offset after skb pull. llc_fixup_skb
assumes the LLC header is at skb->data, and by definition SNAP header
immediately follows.
Fixes: fda55eca5a ("net: introduce skb_transport_header_was_set()")
Signed-off-by: Antonio Pastor <antonio.pastor@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241225010723.2830290-1-antonio.pastor@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Corrected the netlink message size calculation for multicast group
join/leave notifications. The previous calculation did not account for
the inclusion of both IPv4/IPv6 addresses and ifa_cacheinfo in the
payload. This fix ensures that the allocated message size is
sufficient to hold all necessary information.
This patch also includes the following improvements:
* Uses GFP_KERNEL instead of GFP_ATOMIC when holding the RTNL mutex.
* Uses nla_total_size(sizeof(struct in6_addr)) instead of
nla_total_size(16).
* Removes unnecessary EXPORT_SYMBOL().
Fixes: 2c2b61d213 ("netlink: add IGMP/MLD join/leave notifications")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Link: https://patch.msgid.link/20241221100007.1910089-1-yuyanghuang@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The bridge input code may drop frames for various reasons and at various
points in the ingress handling logic. Currently kfree_skb() is used
everywhere, and therefore no drop reason is specified. Add drop reasons
to the most common drop points.
Drop reasons are not added exhaustively to the entire bridge code. The
intention is to incrementally add drop reasons to the rest of the bridge
code in follow up patches.
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241219163606.717758-3-rrendec@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While by default max_autoclose equals to INT_MAX / HZ, one may set
net.sctp.max_autoclose to UINT_MAX. There is code in
sctp_association_init() that can consequently trigger overflow.
Cc: stable@vger.kernel.org
Fixes: 9f70f46bd4 ("sctp: properly latch and use autoclose value from sock to association")
Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20241219162114.2863827-1-kniv@yandex-team.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The device denoted by tunnel->parms.link resides in the underlay net
namespace. Therefore pass tunnel->net to ip_tunnel_init_flow().
Fixes: db53cd3d88 ("net: Handle l3mdev in ip_tunnel_init_flow")
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20241219130336.103839-1-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.
Secondary hash chains were introduced by commit 30fff9231f ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().
This operation was introduced by commit 719f835853 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.
This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.
Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:
LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in
while :; do
taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
sleep 0.1 || sleep 1
taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
wait
done
where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):
2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused
This issue was first observed as a seldom failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:
LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
while :; do
./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
sleep 0.2 || sleep 1
socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
wait
cmp tmp.in tmp.out
done
Once this fails:
tmp.in tmp.out differ: char 8193, line 29
we can finally have a look at what's going on:
$ tshark -r pasta.pcap
1 0.000000 :: ? ff02::16 ICMPv6 110 Multicast Listener Report Message v2
2 0.168690 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
3 0.168767 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
4 0.168806 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
5 0.168827 c6:47:05:8d:dc:04 ? Broadcast ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
6 0.168851 9a:55:9a:55:9a:55 ? c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
7 0.168875 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
8 0.168896 88.198.0.164 ? 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
9 0.168926 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
10 0.168959 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
11 0.168989 88.198.0.161 ? 88.198.0.164 UDP 4138 60260 ? 1337 Len=4096
12 0.169010 88.198.0.161 ? 88.198.0.164 UDP 42 60260 ? 1337 Len=0
On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.
In another variant of this reproducer, starting the client with:
strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &
and connecting to the socat server using a loopback address:
socat OPEN:tmp.in UDP4:localhost:1337,shut-null
we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:
[pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
[...]
[pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
[...]
[pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)
and, somewhat confusingly, after a connect() on the same socket
succeeded.
Until commit 4cdeeee925 ("net: udp: prefer listeners bound to an
address"), the race between receive address change and lookup didn't
actually cause visible issues, because, once the lookup based on the
secondary hash chain failed, we would still attempt a lookup based on
the primary hash (destination port only), and find the socket with the
outdated secondary hash.
That change, however, dropped port-only lookups altogether, as side
effect, making the race visible.
To fix this, while avoiding the need to make address changes and
rehash atomic against lookups, reintroduce primary hash lookups as
fallback, if lookups based on four-tuple and secondary hashes fail.
To this end, introduce a simplified lookup implementation, which
doesn't take care of SO_REUSEPORT groups: if we have one, there are
multiple sockets that would match the four-tuple or secondary hash,
meaning that we can't run into this race at all.
v2:
- instead of synchronising lookup operations against address change
plus rehash, reintroduce a simplified version of the original
primary hash lookup as fallback
v1:
- fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
usage (Kuniyuki Iwashima)
- directly use sk_rcv_saddr for IPv4 receive addresses instead of
fetching inet_rcv_saddr (Kuniyuki Iwashima)
- move inet_update_saddr() to inet_hashtables.h and use that
to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
- rebase onto net-next, update commit message accordingly
Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/24147
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 30fff9231f ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP
systems (Andrea Righi)
- Fix BPF USDT selftests helper code to use asm constraint "m"
for LoongArch (Tiezhu Yang)
- Fix BPF selftest compilation error in get_uprobe_offset when
PROCMAP_QUERY is not defined (Jerome Marchand)
- Fix BPF bpf_skb_change_tail helper when used in context of
BPF sockmap to handle negative skb header offsets (Cong Wang)
- Several fixes to BPF sockmap code, among others, in the area
of socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-----BEGIN PGP SIGNATURE-----
iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ2YJABUcZGFuaWVsQGlv
Z2VhcmJveC5uZXQACgkQ2yufC7HISINDEgD+N4uVg+rp8Z8pg9jcai4WUERmRG20
NcQTfBXczLHkwIcBALvn7NVvbTAINJzBTnukbjX3XbWFz2cJ/xHxDYXycP4I
=SwXG
-----END PGP SIGNATURE-----
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull BPF fixes from Daniel Borkmann:
- Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP
systems (Andrea Righi)
- Fix BPF USDT selftests helper code to use asm constraint "m" for
LoongArch (Tiezhu Yang)
- Fix BPF selftest compilation error in get_uprobe_offset when
PROCMAP_QUERY is not defined (Jerome Marchand)
- Fix BPF bpf_skb_change_tail helper when used in context of BPF
sockmap to handle negative skb header offsets (Cong Wang)
- Several fixes to BPF sockmap code, among others, in the area of
socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Test bpf_skb_change_tail() in TC ingress
selftests/bpf: Introduce socket_helpers.h for TC tests
selftests/bpf: Add a BPF selftest for bpf_skb_change_tail()
bpf: Check negative offsets in __bpf_skb_min_len()
tcp_bpf: Fix copied value in tcp_bpf_sendmsg
skmsg: Return copied bytes in sk_msg_memcopy_from_iter
tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection
tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()
selftests/bpf: Fix compilation error in get_uprobe_offset()
selftests/bpf: Use asm constraint "m" for LoongArch
bpf: Fix bpf_get_smp_processor_id() on !CONFIG_SMP
corruption due to a buffer overrun, potential infinite loop and several
memory leaks on the error paths. All but one marked for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmdlxvsTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi4vXB/9Y91WEVWK/ZX525z3sXbh0UK6yubPQ
LRlHZ+01TE1uLTpU7wrv3rIApqqHDkL65au8Qc4/Cdxp3uJxZWFCs+0HuWwSeQ68
TmSFicQAPxii/IqqB9oLWtXvLNQkJ+kAlnpHwgka7eGZco+biOVaB+OcSViTaGXd
Sczfc42zxdSR342Zz9GnLzBDygtu983c3frcFMYG2qqhd1ZyvVittASUR2dP8krl
qZUoKaQjBC7z4C5X2iwJ+89m/0UjE0sxQr1tY7qhX0yetbElD5Ddke+lE2yjzVsF
tD4IasZaTWvN5Ywh6H3uQzfySWUVrdwzErJBw7g+h7IA1Ok6J9zqB1r+
=vleo
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.13-rc4' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A handful of important CephFS fixes from Max, Alex and myself: memory
corruption due to a buffer overrun, potential infinite loop and
several memory leaks on the error paths. All but one marked for
stable"
* tag 'ceph-for-6.13-rc4' of https://github.com/ceph/ceph-client:
ceph: allocate sparse_ext map only for sparse reads
ceph: fix memory leak in ceph_direct_read_write()
ceph: improve error handling and short/overflow-read logic in __ceph_sync_read()
ceph: validate snapdirname option length when mounting
ceph: give up on paths longer than PATH_MAX
ceph: fix memory leaks in __ceph_sync_read()
skb_network_offset() and skb_transport_offset() can be negative when
they are called after we pull the transport header, for example, when
we use eBPF sockmap at the point of ->sk_data_ready().
__bpf_skb_min_len() uses an unsigned int to get these offsets, this
leads to a very large number which then causes bpf_skb_change_tail()
failed unexpectedly.
Fix this by using a signed int to get these offsets and ensure the
minimum is at least zero.
Fixes: 5293efe62d ("bpf: add bpf_skb_change_tail helper")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241213034057.246437-2-xiyou.wangcong@gmail.com
bpf kselftest sockhash::test_txmsg_cork_hangs in test_sockmap.c triggers a
kernel NULL pointer dereference:
BUG: kernel NULL pointer dereference, address: 0000000000000008
? __die_body+0x6e/0xb0
? __die+0x8b/0xa0
? page_fault_oops+0x358/0x3c0
? local_clock+0x19/0x30
? lock_release+0x11b/0x440
? kernelmode_fixup_or_oops+0x54/0x60
? __bad_area_nosemaphore+0x4f/0x210
? mmap_read_unlock+0x13/0x30
? bad_area_nosemaphore+0x16/0x20
? do_user_addr_fault+0x6fd/0x740
? prb_read_valid+0x1d/0x30
? exc_page_fault+0x55/0xd0
? asm_exc_page_fault+0x2b/0x30
? splice_to_socket+0x52e/0x630
? shmem_file_splice_read+0x2b1/0x310
direct_splice_actor+0x47/0x70
splice_direct_to_actor+0x133/0x300
? do_splice_direct+0x90/0x90
do_splice_direct+0x64/0x90
? __ia32_sys_tee+0x30/0x30
do_sendfile+0x214/0x300
__se_sys_sendfile64+0x8e/0xb0
__x64_sys_sendfile64+0x25/0x30
x64_sys_call+0xb82/0x2840
do_syscall_64+0x75/0x110
entry_SYSCALL_64_after_hwframe+0x4b/0x53
This is caused by tcp_bpf_sendmsg() returning a larger value(12289) than
size (8192), which causes the while loop in splice_to_socket() to release
an uninitialized pipe buf.
The underlying cause is that this code assumes sk_msg_memcopy_from_iter()
will copy all bytes upon success but it actually might only copy part of
it.
This commit changes it to use the real copied bytes.
Signed-off-by: Levi Zim <rsworktech@outlook.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241130-tcp-bpf-sendmsg-v1-2-bae583d014f3@outlook.com
Previously sk_msg_memcopy_from_iter returns the copied bytes from the
last copy_from_iter{,_nocache} call upon success.
This commit changes it to return the total number of copied bytes on
success.
Signed-off-by: Levi Zim <rsworktech@outlook.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241130-tcp-bpf-sendmsg-v1-1-bae583d014f3@outlook.com
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in l2tp_ip_sendmsg() instead of passing parameters manually
to ip_route_output_ports().
Override ->daddr with the value passed in the msghdr structure if
provided.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: James Chapman <jchapman@katalix.com>
Link: https://patch.msgid.link/2ff22a3560c5050228928456662b80b9c84a8fe4.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in __ip_queue_xmit() instead of passing parameters manually
to ip_route_output_ports().
Override ->flowi4_tos with the value passed as parameter since that's
required by SCTP.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/37e64ffbd9adac187b14aa9097b095f5c86e85be.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPv4 code commonly has to initialise a flowi4 structure from an IPv4
socket. This requires looking at potential IPv4 options to set the
proper destination address, call flowi4_init_output() with the correct
set of parameters and run the sk_classify_flow security hook.
Instead of reimplementing these operations in different parts of the
stack, let's define inet_sk_init_flowi4() which does all these
operations.
The first user is inet_sk_rebuild_header(), where inet_sk_init_flowi4()
replaces ip_route_output_ports(). Unlike ip_route_output_ports(), which
sets the flowi4 structure and performs the route lookup in one go,
inet_sk_init_flowi4() only initialises the flow. The route lookup is
then done by ip_route_output_flow(). Decoupling flow initialisation
from route lookup makes this new interface applicable more broadly as
it will allow some users to overwrite specific struct flowi4 members
before the route lookup.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/fd416275262b1f518d5abfcef740ce4f4a1a6522.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When bridge binding is enabled on a VLAN netdevice, its link state should
track bridge ports that are members of the corresponding VLAN. This works
for newly-added netdevices. However toggling the option does not have the
effect of enabling or disabling the behavior as appropriate.
In this patch, react to bridge_binding toggles on VLAN uppers.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/90a8ca8aea4d81378b29d75d9e562433e0d5c7ff.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the BROPT_VLAN_BRIDGE_BINDING bridge option is only toggled when
VLAN devices are added on top of a bridge or removed from it. Extract the
toggling of the option to a function so that it could be invoked by a
subsequent patch when the state of an upper VLAN device changes.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/a7455f6fe1dfa7b13126ed8a7fb33d3b611eecb8.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Under DOS, inet_peer_xrlim_allow() might be called millions
of times per second from different cpus.
Make sure to write over peer->rate_tokens and peer->rate_last
only when really needed.
Note the inherent races of this function are still there,
we do not care of precise ICMP rate limiting.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241219150330.3159027-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Empty netlink responses from do() are not correct (as opposed to
dump() where not dumping anything is perfectly fine).
We should return an error if the target object does not exist,
in this case if the netdev is down we "hide" the NAPI instances.
Fixes: 27f91aaf49 ("netdev-genl: Add netlink framework functions for napi")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241219032833.1165433-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we do sk_psock_verdict_apply->sk_psock_skb_ingress, an sk_msg will
be created out of the skb, and the rmem accounting of the sk_msg will be
handled by the skb.
For skmsgs in __SK_REDIRECT case of tcp_bpf_send_verdict, when redirecting
to the ingress of a socket, although we sk_rmem_schedule and add sk_msg to
the ingress_msg of sk_redir, we do not update sk_rmem_alloc. As a result,
except for the global memory limit, the rmem of sk_redir is nearly
unlimited. Thus, add sk_rmem_alloc related logic to limit the recv buffer.
Since the function sk_msg_recvmsg and __sk_psock_purge_ingress_msg are
used in these two paths. We use "msg->skb" to test whether the sk_msg is
skb backed up. If it's not, we shall do the memory accounting explicitly.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241210012039.1669389-3-zijianzhang@bytedance.com
When bpf_tcp_ingress() is called, the skmsg is being redirected to the
ingress of the destination socket. Therefore, we should charge its
receive socket buffer, instead of sending socket buffer.
Because sk_rmem_schedule() tests pfmemalloc of skb, we need to
introduce a wrapper and call it for skmsg.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241210012039.1669389-2-zijianzhang@bytedance.com
Same as with converting &xdp_buff to skb on Rx, the code which allocates
a new skb and copies the XSk frame there is identical across the
drivers, so make it generic. This includes copying all the frags if they
are present in the original buff.
System percpu page_pools greatly improve XDP_PASS performance on XSk:
instead of page_alloc() + page_free(), the net core recycles the same
pages, so the only overhead left is memcpy()s. When the Page Pool is
not compiled in, the whole function is a return-NULL (but it always
gets selected when eBPF is enabled).
Note that the passed buff gets freed if the conversion is done w/o any
error, assuming you don't need this buffer after you convert it to an
skb.
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-6-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The code which builds an skb from an &xdp_buff keeps multiplying itself
around the drivers with almost no changes. Let's try to stop that by
adding a generic function.
Unlike __xdp_build_skb_from_frame(), always allocate an skbuff head
using napi_build_skb() and make use of the available xdp_rxq pointer to
assign the Rx queue index. In case of PP-backed buffer, mark the skb to
be recycled, as every PP user's been switched to recycle skbs.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The code piece which would attach a frag to &xdp_buff is almost
identical across the drivers supporting XDP multi-buffer on Rx.
Make it a generic elegant "oneliner".
Also, I see lots of drivers calculating frags_truesize as
`xdp->frame_sz * nr_frags`. I can't say this is fully correct, since
frags might be backed by chunks of different sizes, especially with
stuff like the header split. Even page_pool_alloc() can give you two
different truesizes on two subsequent requests to allocate the same
buffer size. Add a field to &skb_shared_info (unionized as there's no
free slot currently on x86_64) to track the "true" truesize. It can
be used later when updating the skb.
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We already have enough variants of ip_route_output*() functions. We
don't need a GRE specific one in the generic route.h header file.
Furthermore, ip_route_output_gre() is only used once, in ipgre_open(),
where it can be easily replaced by a simple call to
ip_route_output_key().
While there, and for clarity, explicitly set .flowi4_scope to
RT_SCOPE_UNIVERSE instead of relying on the implicit zero
initialisation.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/ab7cba47b8558cd4bfe2dc843c38b622a95ee48e.1734527729.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This makes it possible to disable the MSG_OOB support in .config.
Signed-off-by: Florent Revest <revest@chromium.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241218143334.1507465-1-revest@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Multi-Link Operation implementation continues, both in stack and in
drivers. Otherwise it has been relatively quiet.
Major changes:
cfg80211/mac80211
* define wiphy guard
* get TX power per link
* EHT 320 MHz channel support for mesh
ath11k
* QCA6698AQ support
ath9k
* RX inactivity detection
rtl8xxxu
* add more USB device IDs
rtw88
* add more USB device IDs
* enable USB RX aggregation and USB 3 to improve performance
rtw89
* PowerSave flow for Multi-Link Operation
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmdkbAoRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZuRIgf/dcjYr+eg3I7iU9qGxvEwHlDAC5CaMRwe
8+/SO6gy49xf6igleNQ2jBn/qAsJTiro6IzJwb1D6i16ax4TRUTEkTZSiYCzntKI
9Nkq59qhsRI4Vxrhp6NibUtVnjuRdSruVM5uLCccUCJ8tfq13WGhecR2pmV0TDO3
bRSza6L64XIuSmqHkuWS3Hz1YQvIvIZMeeiWoC35mtXg6ORRXpYloLtCzFn1zxoP
YPoeSfoAqIlaVwdB1DoaakU6is8oGZ0oI6zw/qaN8P7pYfqO62ATf6ZzAdwHE1dU
fow9nvwzln+BqgpdIK81QFR+XC+7LorCGSaQlYu6C0nxjSzycSrgOw==
=WIP7
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2024-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.14
Multi-Link Operation implementation continues, both in stack and in
drivers. Otherwise it has been relatively quiet.
Major changes:
cfg80211/mac80211
- define wiphy guard
- get TX power per link
- EHT 320 MHz channel support for mesh
ath11k
- QCA6698AQ support
ath9k
- RX inactivity detection
rtl8xxxu
- add more USB device IDs
rtw88
- add more USB device IDs
- enable USB RX aggregation and USB 3 to improve performance
rtw89
- PowerSave flow for Multi-Link Operation
* tag 'wireless-next-2024-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (121 commits)
wifi: wlcore: sysfs: constify 'struct bin_attribute'
wifi: brcmfmac: clarify unmodifiable headroom log message
wifi: brcmfmac: add missing header include for brcmf_dbg
wifi: brcmsmac: add gain range check to wlc_phy_iqcal_gainparams_nphy()
wifi: qtnfmac: fix spelling error in core.h
wifi: rtw89: phy: add dummy C2H event handler for report of TAS power
wifi: rtw89: 8851b: rfk: remove unnecessary assignment of return value of _dpk_dgain_read()
wifi: rtw89: 8852c: rfk: refine target channel calculation in _rx_dck_channel_calc()
wifi: rtlwifi: pci: wait for firmware loading before releasing memory
wifi: rtlwifi: fix memory leaks and invalid access at probe error path
wifi: rtlwifi: destroy workqueue at rtl_deinit_core
wifi: rtlwifi: remove unused check_buddy_priv
wifi: rtw89: 8922a: update format of RFK pre-notify H2C command v2
wifi: rtw89: regd: update regulatory map to R68-R51
wifi: rtw89: 8852c: disable ER SU when 4x HE-LTF and 0.8 GI capability differ
wifi: rtw89: disable firmware training HE GI and LTF
wifi: rtw89: ps: update data for firmware and settings for hardware before/after PS
wifi: rtw89: ps: refactor channel info to firmware before entering PS
wifi: rtw89: ps: refactor PS flow to support MLO
wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s
...
====================
Link: https://patch.msgid.link/20241219185709.774EDC4CECE@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot can figure out a way to redirect a netlink message to a tap.
Sending empty skbs to devices is not valid and we end up hitting
a skb_assert_len() in __dev_queue_xmit().
Make catching these mistakes easier, assert the skb size directly
in netlink core.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241218024400.824355-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The default IPv6 multipath hash policy takes the flow label into account
when calculating a multipath hash and previous patches added a flow
label selector to IPv6 FIB rules.
Allow user space to specify a flow label in route get requests by adding
a new netlink attribute and using its value to populate the "flowlabel"
field in the IPv6 flow info structure prior to a route lookup.
Deny the attribute in RTM_{NEW,DEL}ROUTE requests by checking for it in
rtm_to_fib6_config() and returning an error if present.
A subsequent patch will use this capability to test the new flow label
selector in IPv6 FIB rules.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Now that both IPv4 and IPv6 correctly handle the new flow label
attributes, enable user space to configure FIB rules that make use of
the flow label by changing the policy to stop rejecting them and
accepting 32 bit values in big-endian byte order.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Implement support for the new flow label selector which allows IPv6 FIB
rules to match on the flow label with a mask. Ensure that both flow
label attributes are specified (or none) and that the mask is valid.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
IPv4 FIB rules cannot match on flow label so reject requests that try to
add such rules. Do that in the IPv4 configure callback as the netlink
policy resides in the core and used by both IPv4 and IPv6.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add new FIB rule attributes which will allow user space to match on the
IPv6 flow label with a mask. Temporarily set the type of the attributes
to 'NLA_REJECT' while support is being added in the IPv6 code.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, we don't use the return value from sock_queue_rcv_skb, which
means we may leak skbs if a message is not successfully queued to a
socket.
Instead, ensure that we're freeing the skb where the sock hasn't
otherwise taken ownership of the skb by adding checks on the
sock_queue_rcv_skb() to invoke a kfree on failure.
In doing so, rather than using the 'rc' value to trigger the
kfree_skb(), use the skb pointer itself, which is more explicit.
Also, add a kunit test for the sock delivery failure cases.
Fixes: 4a992bbd36 ("mctp: Implement message fragmentation & reassembly")
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20241218-mctp-next-v2-1-1c1729645eaa@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmdjW1EACgkQ1w0aZmrP
KyEslhAAtoUA09hpV/MvApX42612MouOGaEeDw4e3PQrGpgarCP6I/ZAquZHUano
+BrleIEV6fUanMbH94rsLHUUtytZKbPlFR3qEKhLZAqm5HnCO5yZLylUFGfWqKFn
kYGRxdvqj502kUgl6crvYqLeBu+fHV9MvbAChgwVH4xfjCPjWKTAIpL1Ot8HOXqQ
G5crPBGKHZk09GWkgfc29k9BKg9fFmcSWtWcuepX555RNoKd2+VEHx9U7Jtnql3m
WZCGX9pVzO1T9H8xvtc2XOCYg4asOmTyNyONrDcH9Nt+j/JHfSNeWaQk8LjjChyT
2+H0DylJHdzF4QopPHLGuwPRzbPs6FM/nSKzj08nAjZ++JF8MPrx55X5xqxb+HX7
V4W1LLZlrSOs4lo5MA241anK+sOp1bo5dHc2np2dHu4hHgXQ2FBcjwLIjkTkJ4t7
tkjDCG4cE+sjzdI3k6hvb8RAS9TmjToMSMKoWIj8LM2rlG/+URbWYklI4UvwuwzQ
VTU7nA82LHHyEYu8TQqp+8QBuONBejfl/UTujqqreL1CaHDI/hfWiLa4ON/kY/kt
hUtfNhws0hOf9K4JV68BMMp2HXHEH4WQkWv2qH5vlsTuE85PIb7I976GQeoZKQsB
q7/WVus1kJPvFwMtHsVesZW6xnoKljHGXbeC7UJ+bQTc8r/vEkE=
=Vzt5
-----END PGP SIGNATURE-----
Merge tag 'nf-24-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following series contains two fixes for Netfilter/IPVS:
1) Possible build failure in IPVS on systems with less than 512MB
memory due to incorrect use of clamp(), from David Laight.
2) Fix bogus lockdep nesting splat with ipset list:set type,
from Phil Sutter.
netfilter pull request 24-12-19
* tag 'nf-24-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: ipset: Fix for recursive locking warning
ipvs: Fix clamp() of ip_vs_conn_tab on small memory systems
====================
Link: https://patch.msgid.link/20241218234137.1687288-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If PSAMPLE_ATTR_SAMPLE_PROBABILITY flag is to be sent, the available
size for the packet data has to be adjusted accordingly.
Also, check the error code returned by nla_put_flag.
Fixes: 7b1b2b60c6 ("net: psample: allow using rate as probability")
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20241217113739.3929300-1-amorenoz@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Empty netlink responses from do() are not correct (as opposed to
dump() where not dumping anything is perfectly fine).
We should return an error if the target object does not exist,
in this case if the netdev is down it has no queues.
Fixes: 6b6171db7f ("netdev-genl: Add netlink framework functions for queue")
Reported-by: syzbot+0a884bc2d304ce4af70f@syzkaller.appspotmail.com
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20241218022508.815344-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>