AF_UNIX stream sockets are a paired socket. So sending on one of the pairs
will lookup the paired socket as part of the send operation. It is possible
however to put just one of the pairs in a BPF map. This currently increments
the refcnt on the sock in the sockmap to ensure it is not free'd by the
stack before sockmap cleans up its state and stops any skbs being sent/recv'd
to that socket.
But we missed a case. If the peer socket is closed it will be free'd by the
stack. However, the paired socket can still be referenced from BPF sockmap
side because we hold a reference there. Then if we are sending traffic through
BPF sockmap to that socket it will try to dereference the free'd pair in its
send logic creating a use after free. And following splat:
[59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
[59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
[...]
[59.905468] Call Trace:
[59.905787] <TASK>
[59.906066] dump_stack_lvl+0x130/0x1d0
[59.908877] print_report+0x16f/0x740
[59.910629] kasan_report+0x118/0x160
[59.912576] sk_wake_async+0x31/0x1b0
[59.913554] sock_def_readable+0x156/0x2a0
[59.914060] unix_stream_sendmsg+0x3f9/0x12a0
[59.916398] sock_sendmsg+0x20e/0x250
[59.916854] skb_send_sock+0x236/0xac0
[59.920527] sk_psock_backlog+0x287/0xaa0
To fix let BPF sockmap hold a refcnt on both the socket in the sockmap and its
paired socket. It wasn't obvious how to contain the fix to bpf_unix logic. The
primarily problem with keeping this logic in bpf_unix was: In the sock close()
we could handle the deref by having a close handler. But, when we are destroying
the psock through a map delete operation we wouldn't have gotten any signal
thorugh the proto struct other than it being replaced. If we do the deref from
the proto replace its too early because we need to deref the sk_pair after the
backlog worker has been stopped.
Given all this it seems best to just cache it at the end of the psock and eat 8B
for the af_unix and vsock users. Notice dgram sockets are OK because they handle
locking already.
Fixes: 94531cfcbe ("af_unix: Add unix_stream_proto for sockmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20231129012557.95371-2-john.fastabend@gmail.com
This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).
The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,_complete}
are static inline, we don't incur any extra overhead doing
indirect calls.
The benefit of this scheme is as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- don't need any extra flags to maintain to keep track of what
offloads are implemented; if the callback is implemented - the offload
is supported (used by netlink reporting code)
Two offloads are defined right now:
1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata
area upon completion (tx_timestamp field)
XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes
SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass
metadata pointer).
The struct is forward-compatible and can be extended in the future
by appending more fields.
Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231127190319.1190813-3-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Mute the periodic "stalled pool shutdown" warning if the page pool
is visible to user space. Rolling out a driver using page pools
to just a few hundred hosts at Meta surfaces applications which
fail to reap their broken sockets. Obviously it's best if the
applications are fixed, but we don't generally print warnings
for application resource leaks. Admins can now depend on the
netlink interface for getting page pool info to detect buggy
apps.
While at it throw in the ID of the pool into the message,
in rare cases (pools from destroyed netns) this will make
finding the pool with a debugger easier.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Report when page pool was destroyed. Together with the inflight
/ memory use reporting this can serve as a replacement for the
warning about leaked page pools we currently print to dmesg.
Example output for a fake leaked page pool using some hacks
in netdevsim (one "live" pool, and one "leaked" on the same dev):
$ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \
--dump page-pool-get
[{'id': 2, 'ifindex': 3},
{'id': 1, 'ifindex': 3, 'destroyed': 133, 'inflight': 1}]
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Advanced deployments need the ability to check memory use
of various system components. It makes it possible to make informed
decisions about memory allocation and to find regressions and leaks.
Report memory use of page pools. Report both number of references
and bytes held.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Generate netlink notifications about page pool state changes.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
To avoid any issues with race conditions on accessing napi
and having to think about the lifetime of NAPI objects
in netlink GET - stash the napi_id to which page pool
was linked at creation time.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link the page pools with netdevs. This needs to be netns compatible
so we have two options. Either we record the pools per netns and
have to worry about moving them as the netdev gets moved.
Or we record them directly on the netdev so they move with the netdev
without any extra work.
Implement the latter option. Since pools may outlast netdev we need
a place to store orphans. In time honored tradition use loopback
for this purpose.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
To give ourselves the flexibility of creating netlink commands
and ability to refer to page pool instances in uAPIs create
IDs for page pools.
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We'll soon (next change in the series) need a fuller unwind path
in page_pool_create() so create the inverse of page_pool_init().
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
To fully benefit from previous commit add one byte of state
in the first cache line recording if we need to look at
the slow part.
The packing isn't all that impressive right now, we create
a 7B hole. I'm expecting Olek's rework will reshuffle this,
anyway.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20231121000048.789613-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
struct page_pool is rather performance critical and we use
16B of the first cache line to store 2 pointers used only
by test code. Future patches will add more informational
(non-fast path) attributes.
It's convenient for the user of the API to not have to worry
which fields are fast and which are slow path. Use struct
groups to split the params into the two categories internally.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20231121000048.789613-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Networking supports changing netdevice's netns and name
at the same time. This allows avoiding name conflicts
and having to rename the interface in multiple steps.
E.g. netns1={eth0, eth1}, netns2={eth1} - we want
to move netns1:eth1 to netns2 and call it eth0 there.
If we can't rename "in flight" we'd need to (1) rename
eth1 -> $tmp, (2) change netns, (3) rename $tmp -> eth0.
To rename the underlying struct device we have to call
device_rename(). The rename()'s MOVE event, however, doesn't
"belong" to either the old or the new namespace.
If there are conflicts on both sides it's actually impossible
to issue a real MOVE (old name -> new name) without confusing
user space. And Daniel reports that such confusions do in fact
happen for systemd, in real life.
Since we already issue explicit REMOVE and ADD events
manually - suppress the MOVE event completely. Move
the ADD after the rename, so that the REMOVE uses
the old name, and the ADD the new one.
If there is no rename this changes the picture as follows:
Before:
old ns | KERNEL[213.399289] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[213.401302] add /devices/virtual/net/eth0 (net)
new ns | KERNEL[213.401397] move /devices/virtual/net/eth0 (net)
After:
old ns | KERNEL[266.774257] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[266.774509] add /devices/virtual/net/eth0 (net)
If there is a rename and a conflict (using the exact eth0/eth1
example explained above) we get this:
Before:
old ns | KERNEL[224.316833] remove /devices/virtual/net/eth1 (net)
new ns | KERNEL[224.318551] add /devices/virtual/net/eth1 (net)
new ns | KERNEL[224.319662] move /devices/virtual/net/eth0 (net)
After:
old ns | KERNEL[333.033166] remove /devices/virtual/net/eth1 (net)
new ns | KERNEL[333.035098] add /devices/virtual/net/eth0 (net)
Note that "in flight" rename is only performed when needed.
If there is no conflict for old name in the target netns -
the rename will be performed separately by dev_change_name(),
as if the rename was a different command, and there will still
be a MOVE event for the rename:
Before:
old ns | KERNEL[194.416429] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.418809] add /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.418869] move /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.420866] move /devices/virtual/net/eth1 (net)
After:
old ns | KERNEL[71.917520] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[71.919155] add /devices/virtual/net/eth0 (net)
new ns | KERNEL[71.920729] move /devices/virtual/net/eth1 (net)
If deleting the MOVE event breaks some user space we should insert
an explicit kobject_uevent(MOVE) after the ADD, like this:
@@ -11192,6 +11192,12 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
netdev_adjacent_add_links(dev);
+ /* User space wants an explicit MOVE event, issue one unless
+ * dev_change_name() will get called later and issue one.
+ */
+ if (!pat || new_name[0])
+ kobject_uevent(&dev->dev.kobj, KOBJ_MOVE);
+
/* Adapt owner in case owning user namespace of target network
* namespace is different from the original one.
*/
Reported-by: Daniel Gröber <dxld@darkboxed.org>
Link: https://lore.kernel.org/all/20231010121003.x3yi6fihecewjy4e@House.clients.dxld.at/
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/all/20231120184140.578375-1-kuba@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ndo_get_peer_dev is used in tcx BPF fast path, therefore make use of
indirect call wrapper and therefore optimize the bpf_redirect_peer()
internal handling a bit. Add a small skb_get_peer_dev() wrapper which
utilizes the INDIRECT_CALL_1() macro instead of open coding.
Future work could potentially add a peer pointer directly into struct
net_device in future and convert veth and netkit over to use it so
that eventually ndo_get_peer_dev can be removed.
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231114004220.6495-7-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium)
is not accounted for in the RX stats of supported devices (that is, veth
and netkit), confusing user space metrics collectors such as cAdvisor [0],
as reported by Youlun.
Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update
RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the
@tstats per-CPU counters (instead of @lstats, or @dstats).
To make this more fool-proof, error out when ndo_get_peer_dev is set but
@tstats are not selected.
[0] Specifically, the "container_network_receive_{byte,packet}s_total"
counters are affected.
Fixes: 9aa1206e8f ("bpf: Add redirect_peer helper")
Reported-by: Youlun Zhang <zhangyoulun@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Move {l,t,d}stats allocation to the core and let netdevs pick the stats
type they need. That way the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc) - all happening in the core.
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
if a PF has 256 or more VFs, ip link command will allocate an order 3
memory or more, and maybe trigger OOM due to memory fragment,
the VFs needed memory size is computed in rtnl_vfinfo_size.
so introduce nlmsg_new_large which calls netlink_alloc_large_skb in
which vmalloc is used for large memory, to avoid the failure of
allocating memory
ip invoked oom-killer: gfp_mask=0xc2cc0(GFP_KERNEL|__GFP_NOWARN|\
__GFP_COMP|__GFP_NOMEMALLOC), order=3, oom_score_adj=0
CPU: 74 PID: 204414 Comm: ip Kdump: loaded Tainted: P OE
Call Trace:
dump_stack+0x57/0x6a
dump_header+0x4a/0x210
oom_kill_process+0xe4/0x140
out_of_memory+0x3e8/0x790
__alloc_pages_slowpath.constprop.116+0x953/0xc50
__alloc_pages_nodemask+0x2af/0x310
kmalloc_large_node+0x38/0xf0
__kmalloc_node_track_caller+0x417/0x4d0
__kmalloc_reserve.isra.61+0x2e/0x80
__alloc_skb+0x82/0x1c0
rtnl_getlink+0x24f/0x370
rtnetlink_rcv_msg+0x12c/0x350
netlink_rcv_skb+0x50/0x100
netlink_unicast+0x1b2/0x280
netlink_sendmsg+0x355/0x4a0
sock_sendmsg+0x5b/0x60
____sys_sendmsg+0x1ea/0x250
___sys_sendmsg+0x88/0xd0
__sys_sendmsg+0x5e/0xa0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f95a65a5b70
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20231115120108.3711-1-lirongqing@baidu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Revert following commits:
commit acec05fb78 ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
commit 11d55be06d ("net: ethtool: Add a command to expose current time stamping layer")
commit bb8645b00c ("netlink: specs: Introduce new netlink command to get current timestamp")
commit d905f9c753 ("net: ethtool: Add a command to list available time stamping layers")
commit aed5004ee7 ("netlink: specs: Introduce new netlink command to list available time stamping layers")
commit 51bdf3165f ("net: Replace hwtstamp_source by timestamping layer")
commit 0f7f463d48 ("net: Change the API of PHY default timestamp to MAC")
commit 091fab1228 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
commit 152c75e1d0 ("net: ethtool: ts: Let the active time stamping layer be selectable")
commit ee60ea6be0 ("netlink: specs: Introduce time stamping set command")
They need more time for reviews.
Link: https://lore.kernel.org/all/20231118183529.6e67100c@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Change the API to select MAC default time stamping instead of the PHY.
Indeed the PHY is closer to the wire therefore theoretically it has less
delay than the MAC timestamping but the reality is different. Due to lower
time stamping clock frequency, latency in the MDIO bus and no PHC hardware
synchronization between different PHY, the PHY PTP is often less precise
than the MAC. The exception is for PHY designed specially for PTP case but
these devices are not very widespread. For not breaking the compatibility I
introduce a default_timestamp flag in phy_device that is set by the phy
driver to know we are using the old API behavior.
The phy_set_timestamp function is called at each call of phy_attach_direct.
In case of MAC driver using phylink this function is called when the
interface is turned up. Then if the interface goes down and up again the
last choice of timestamp will be overwritten by the default choice.
A solution could be to cache the timestamp status but it can bring other
issues. In case of SFP, if we change the module, it doesn't make sense to
blindly re-set the timestamp back to PHY, if the new module has a PHY with
mediocre timestamping capabilities.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace hwtstamp_source which is only used by the kernel_hwtstamp_config
structure by the more widely use timestamp_layer structure. This is done
to prepare the support of selectable timestamping source.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the dev_set_hwtstamp_phylib function accessible in prevision to use
it from ethtool to reset the tstamp current configuration.
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cited commit removed the strscpy() call and kept the snprintf() only.
It is common to use 'dev->name' as the format string before a netdev is
registered, this results in 'res' and 'name' pointers being equal.
According to POSIX, if copying takes place between objects that overlap
as a result of a call to sprintf() or snprintf(), the results are
undefined.
Add back the strscpy() and use 'buf' as an intermediate buffer.
Fixes: 7ad17b04dc ("net: trust the bitmap in __dev_alloc_name()")
Cc: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The test allocs a single page to hold all the frag_list skbs. This
is insufficient on kernels with CONFIG_MAX_SKB_FRAGS=45, due to the
increased skb_shared_info frags[] array length.
gso_test_func: ASSERTION FAILED at net/core/gso_test.c:210
Expected alloc_size <= ((1UL) << 12), but
alloc_size == 5075 (0x13d3)
((1UL) << 12) == 4096 (0x1000)
Simplify the logic. Just allocate a page for each frag_list skb.
Fixes: 4688ecb138 ("net: expand skb_segment unit test with frag_list coverage")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current release - regressions:
- sched: fix SKB_NOT_DROPPED_YET splat under debug config
Current release - new code bugs:
- tcp: fix usec timestamps with TCP fastopen
- tcp_sigpool: fix some off by one bugs
- tcp: fix possible out-of-bounds reads in tcp_hash_fail()
- tcp: fix SYN option room calculation for TCP-AO
- bpf: fix compilation error without CGROUPS
- ptp:
- ptp_read() should not release queue
- fix tsevqs corruption
Previous releases - regressions:
- llc: verify mac len before reading mac header
Previous releases - always broken:
- bpf:
- fix check_stack_write_fixed_off() to correctly spill imm
- fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END
- check map->usercnt after timer->timer is assigned
- dsa: lan9303: consequently nested-lock physical MDIO
- dccp/tcp: call security_inet_conn_request() after setting IP addr
- tg3: fix the TX ring stall due to incorrect full ring handling
- phylink: initialize carrier state at creation
- ice: fix direction of VF rules in switchdev mode
Misc:
- fill in a bunch of missing MODULE_DESCRIPTION()s, more to come
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmVNRnAACgkQMUZtbf5S
IrsYaA/+IUoYi96/oLtvvrET6HIbXeMaLKef0UlytEicQKy8h5EWlhcTZPhQEY0g
dtaKOemQsO0dQTma4eQBiBDHeCeSkitgD9p7fh0i+//QFYWSFqHrBiF2mlToc/ZQ
T1p4BlVL7D2Xsr1Lki93zk+EhFGEy2KroYgrWbZc9TWE5ap9PtSVF9eqeHAVCmZ7
ocre/eo4pqUM9rAHIAyhoL+0xtVQ59dBevbJC0qYcmflhafr82Gtdveo6pBBKuYm
GhwbRrAXER3Neav9c6NHqat4zsMwGpC27SiN9dYWm6dlkeS9U9t2PUu71OkJGVfw
VaSE+utkC/WmzGbuiUIjqQLBrRe372ItHCr78BfSRMshS+RBTHtoK7njeH8Iv67E
RsMeCyVNj9dtGlOQG5JAv8IoCQ1WbMw9B36Yzw3ip/MmDX/ntXz7Dcr4ZMZ6VURS
CHhHFZPnmMykMXkT6SIlxeAg2r8ELtESzkvLimdTVFPAlk3cPkibKJbh3F/tEqXS
PDb3y0uoEgRQBAsWXXx9FQEvv9rTL6YrzbMhmJBIIEoNxppQYQ7FZBJX9utAVp5B
1GdyqhR6IRTaKb9cMRj/K1xPwm2KgCw9xj9pjKdAA7QUMslXbFp8blv1rIkFGshg
hiNXmPcI8wo0j+0lZYktEcIERL5y6c8BgK2NnPU6RULua96tuQ4=
=k6Wk
-----END PGP SIGNATURE-----
Merge tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter and bpf.
Current release - regressions:
- sched: fix SKB_NOT_DROPPED_YET splat under debug config
Current release - new code bugs:
- tcp:
- fix usec timestamps with TCP fastopen
- fix possible out-of-bounds reads in tcp_hash_fail()
- fix SYN option room calculation for TCP-AO
- tcp_sigpool: fix some off by one bugs
- bpf: fix compilation error without CGROUPS
- ptp:
- ptp_read() should not release queue
- fix tsevqs corruption
Previous releases - regressions:
- llc: verify mac len before reading mac header
Previous releases - always broken:
- bpf:
- fix check_stack_write_fixed_off() to correctly spill imm
- fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END
- check map->usercnt after timer->timer is assigned
- dsa: lan9303: consequently nested-lock physical MDIO
- dccp/tcp: call security_inet_conn_request() after setting IP addr
- tg3: fix the TX ring stall due to incorrect full ring handling
- phylink: initialize carrier state at creation
- ice: fix direction of VF rules in switchdev mode
Misc:
- fill in a bunch of missing MODULE_DESCRIPTION()s, more to come"
* tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
net: ti: icss-iep: fix setting counter value
ptp: fix corrupted list in ptp_open
ptp: ptp_read should not release queue
net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP
net: kcm: fill in MODULE_DESCRIPTION()
net/sched: act_ct: Always fill offloading tuple iifidx
netfilter: nat: fix ipv6 nat redirect with mapped and scoped addresses
netfilter: xt_recent: fix (increase) ipv6 literal buffer length
ipvs: add missing module descriptions
netfilter: nf_tables: remove catchall element in GC sync path
netfilter: add missing module descriptions
drivers/net/ppp: use standard array-copy-function
net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN
virtio/vsock: Fix uninit-value in virtio_transport_recv_pkt()
r8169: respect userspace disabling IFF_MULTICAST
selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly
bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg
net: phylink: initialize carrier state at creation
test/vsock: add dobule bind connect test
test/vsock: refactor vsock_accept
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZUsiDAAKCRDbK58LschI
g9xXAQCaFjj55sXDpr1qKG2D3PMSDURx7SzmpzIay/A/dqVDPgEAlgU6XsMW6w6S
poMN8KniDLtBgj6nIKfJEAgIXeIYTAs=
=qXjW
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-11-08
We've added 16 non-merge commits during the last 6 day(s) which contain
a total of 30 files changed, 341 insertions(+), 130 deletions(-).
The main changes are:
1) Fix a BPF verifier issue in precision tracking for BPF_ALU | BPF_TO_BE |
BPF_END where the source register was incorrectly marked as precise,
from Shung-Hsi Yu.
2) Fix a concurrency issue in bpf_timer where the former could still have
been alive after an application releases or unpins the map, from Hou Tao.
3) Fix a BPF verifier issue where immediates are incorrectly cast to u32
before being spilled and therefore losing sign information, from Hao Sun.
4) Fix a misplaced BPF_TRACE_ITER in check_css_task_iter_allowlist which
incorrectly compared bpf_prog_type with bpf_attach_type, from Chuyi Zhou.
5) Add __bpf_hook_{start,end} as well as __bpf_kfunc_{start,end}_defs macros,
migrate all BPF-related __diag callsites over to it, and add a new
__diag_ignore_all for -Wmissing-declarations to the macros to address
recent build warnings, from Dave Marchevsky.
6) Fix broken BPF selftest build of xdp_hw_metadata test on architectures
where char is not signed, from Björn Töpel.
7) Fix test_maps selftest to properly use LIBBPF_OPTS() macro to initialize
the bpf_map_create_opts, from Andrii Nakryiko.
8) Fix bpffs selftest to avoid unmounting /sys/kernel/debug as it may have
been mounted and used by other applications already, from Manu Bretelle.
9) Fix a build issue without CONFIG_CGROUPS wrt css_task open-coded
iterators, from Matthieu Baerts.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly
bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg
selftests/bpf: Fix broken build where char is unsigned
selftests/bpf: precision tracking test for BPF_NEG and BPF_END
bpf: Fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END
selftests/bpf: Add test for using css_task iter in sleepable progs
selftests/bpf: Add tests for css_task iter combining with cgroup iter
bpf: Relax allowlist for css_task iter
selftests/bpf: fix test_maps' use of bpf_map_create_opts
bpf: Check map->usercnt after timer->timer is assigned
bpf: Add __bpf_hook_{start,end} macros
bpf: Add __bpf_kfunc_{start,end}_defs macros
selftests/bpf: fix test_bpffs
selftests/bpf: Add test for immediate spilled to stack
bpf: Fix check_stack_write_fixed_off() to correctly spill imm
bpf: fix compilation error without CGROUPS
====================
Link: https://lore.kernel.org/r/20231108132448.1970-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
there's little I can say which isn't in the individual changelogs.
The lengthier patch series are
- "kdump: use generic functions to simplify crashkernel reservation in
arch", from Baoquan He. This is mainly cleanups and consolidation of
the "crashkernel=" kernel parameter handling.
- After much discussion, David Laight's "minmax: Relax type checks in
min() and max()" is here. Hopefully reduces some typecasting and the
use of min_t() and max_t().
- A group of patches from Oleg Nesterov which clean up and slightly fix
our handling of reads from /proc/PID/task/... and which remove
task_struct.therad_group.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA
jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4
63BNcG3+hM9nwGJHb5lyh5m79nBMRg0=
=On4u
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"As usual, lots of singleton and doubleton patches all over the tree
and there's little I can say which isn't in the individual changelogs.
The lengthier patch series are
- 'kdump: use generic functions to simplify crashkernel reservation
in arch', from Baoquan He. This is mainly cleanups and
consolidation of the 'crashkernel=' kernel parameter handling
- After much discussion, David Laight's 'minmax: Relax type checks in
min() and max()' is here. Hopefully reduces some typecasting and
the use of min_t() and max_t()
- A group of patches from Oleg Nesterov which clean up and slightly
fix our handling of reads from /proc/PID/task/... and which remove
task_struct.thread_group"
* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
scripts/gdb/vmalloc: disable on no-MMU
scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
.mailmap: add address mapping for Tomeu Vizoso
mailmap: update email address for Claudiu Beznea
tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
.mailmap: map Benjamin Poirier's address
scripts/gdb: add lx_current support for riscv
ocfs2: fix a spelling typo in comment
proc: test ProtectionKey in proc-empty-vm test
proc: fix proc-empty-vm test with vsyscall
fs/proc/base.c: remove unneeded semicolon
do_io_accounting: use sig->stats_lock
do_io_accounting: use __for_each_thread()
ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
ocfs2: fix a typo in a comment
scripts/show_delta: add __main__ judgement before main code
treewide: mark stuff as __ro_after_init
fs: ocfs2: check status values
proc: test /proc/${pid}/statm
compiler.h: move __is_constexpr() to compiler.h
...
When ptr_ring_init() returns failure in page_pool_init(), free_percpu()
is not called to free pool->recycle_stats, which may cause memory
leak.
Fixes: ad6fa1e1ab ("page_pool: Add recycle stats")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20231030091256.2915394-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
BPF kfuncs are meant to be called from BPF programs. Accordingly, most
kfuncs are not called from anywhere in the kernel, which the
-Wmissing-prototypes warning is unhappy about. We've peppered
__diag_ignore_all("-Wmissing-prototypes", ... everywhere kfuncs are
defined in the codebase to suppress this warning.
This patch adds two macros meant to bound one or many kfunc definitions.
All existing kfunc definitions which use these __diag calls to suppress
-Wmissing-prototypes are migrated to use the newly-introduced macros.
A new __diag_ignore_all - for "-Wmissing-declarations" - is added to the
__bpf_kfunc_start_defs macro based on feedback from Andrii on an earlier
version of this patch [0] and another recent mailing list thread [1].
In the future we might need to ignore different warnings or do other
kfunc-specific things. This change will make it easier to make such
modifications for all kfunc defs.
[0]: https://lore.kernel.org/bpf/CAEf4BzaE5dRWtK6RPLnjTW-MW9sx9K3Fn6uwqCTChK2Dcb1Xig@mail.gmail.com/
[1]: https://lore.kernel.org/bpf/ZT+2qCc%2FaXep0%2FLf@krava/
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20231031215625.2343848-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Support GRO decapsulation for IPsec ESP in UDP.
- Add a handful of MODULE_DESCRIPTION()s.
- Drop questionable alignment check in TCP AO to avoid
build issue after changes in the crypto tree.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmVBpugACgkQMUZtbf5S
Iruf+xAAs49JUSYPxrEKy1CjTIsNoazm2qjqrUTLg04taFWq18GFZsfqakC+2Kjv
nTGCbMM+p7B5L7Sb+4Uc7U1vDZDGRpGRVQmO47IqrtyePE6NdinodQb13VK/53+x
B2e0TVa2uWKwWcVXwJGWYL4F29rlwe1kCn0anx24+Es94C/O2ScMtq34m8/dgDz+
6Y4h2Y+WNq6c+Zc07FaZIttiynvAAG3WpqxrAlRX5oBWpoSApqq3axgPtH75qSWT
mJD3AMD/7tn0j4D4UC3wwOtmnyu5AlAJZUTyi7ItiJ1wfMGPD/mS4H7lLM3q8FFP
wNGRLTc4TTmAMBrOYJIoU84Sv/uPtvXb/lfglBDaY+/DiDQl8vvwe036PF7LTtG9
VWWNkRt3FCab6yK+d0USxBsva/1lhVBGnRvAPn/AYLDl47Kr4uGnVDG31nEbvWKv
YomFQcBl6asYxssJ7dsKnkD3EFMneXCFS1UTd7HiNRXIxW3uSBSi1kMJmqLlcXl0
H9fwNgaHY6uQzxfhx8Jl7px/Ahn3qqw6OWPbMUcuSsUd7HGquqE/Z7Fg6iDUC8+a
3Rdn+C/hAVe8MNbUdC2y9iNmA8+5PKITb2BqcgkZoD40BzZaqZFMVhE7m6NXpvig
h4ewCp4n3vyEgfhaP2J7TWP6wWwbrnkHug1zTVy97z6dhCr+tAw=
=fqHj
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.7-followup' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull more networking updates from Jakub Kicinski:
- Support GRO decapsulation for IPsec ESP in UDP
- Add a handful of MODULE_DESCRIPTION()s
- Drop questionable alignment check in TCP AO to avoid
build issue after changes in the crypto tree
* tag 'net-next-6.7-followup' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next:
net: tcp: remove call to obsolete crypto_ahash_alignmask()
net: fill in MODULE_DESCRIPTION()s under drivers/net/
net: fill in MODULE_DESCRIPTION()s under net/802*
net: fill in MODULE_DESCRIPTION()s under net/core
net: fill in MODULE_DESCRIPTION()s in kuba@'s modules
xfrm: policy: fix layer 4 flowi decoding
xfrm Fix use after free in __xfrm6_udp_encap_rcv.
xfrm: policy: replace session decode with flow dissector
xfrm: move mark and oif flowi decode into common code
xfrm: pass struct net to xfrm_decode_session wrappers
xfrm: Support GRO for IPv6 ESP in UDP encapsulation
xfrm: Support GRO for IPv4 ESP in UDP encapsulation
xfrm: Use the XFRM_GRO to indicate a GRO call on input
xfrm: Annotate struct xfrm_sec_ctx with __counted_by
xfrm: Remove unused function declarations
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vdwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpr2rD/0astIsj/AACVSPzHARg9lnhkIvUeweMSSl
CjifLTzK3a9E3R2IrC4sflObUKIEL3fste0Lva141eNULZvBJ6cQJDvY7Bp72Bkc
CTPEwEQiwDJKLhTzQh3gY0H0+nFMWwEm1uc4dyeNAft/R9bPP/qOq62ttCoCp9+S
1UoFmTlJE3bhejyS7fytoGZvKqhkpdR7rtbR4ya7CXWPoAG+v9amo8fputbxm0dj
WECpKdd65JHWwYV4rbPA69T7jZ9V0oUsLen9RJ9BmjMLOFggHYqQdvEwG0Htirhw
t5uaXqSvc8pXsJhKXMS3tXCrLNtBha5nlWHBpSE+6ovcmKiRzFjUaRXkRbcIrOAx
ljIm0HHto1+xv0pDrNl3/lIjv5dpNOEauqqgMeYytQJIHa0JpSWbYzvjwQ8EZXQv
WWDiRfH5Z0/3BsFdOCVqd8mTt4Pbksp2VFcxGkojRtSqSr4CML3mPZSmqGcs3nE6
Fc16XXw7oLEWoF1tQYMP6KG0cVLem4on28c8CcVMJ/pRvcun3jBCif2gmMHJkWyA
a6Uq116amqQ61f1p+EQ3ChqyTA5uALrXPmovu6Ne3Y/btW5yG4+Vu7AsPLjPHdFN
oGHjOPV77XQzEqzUWRXmXPecZ+QifkcCV/8kbqtEHQqk5n+HUKQZmpC8+014ms3V
Af6LYI/vYg==
=sk8+
-----END PGP SIGNATURE-----
Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux
Pull io_uring {get,set}sockopt support from Jens Axboe:
"This adds support for using getsockopt and setsockopt via io_uring.
The main use cases for this is to enable use of direct descriptors,
rather than first instantiating a normal file descriptor, doing the
option tweaking needed, then turning it into a direct descriptor. With
this support, we can avoid needing a regular file descriptor
completely.
The net and bpf bits have been signed off on their side"
* tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux:
selftests/bpf/sockopt: Add io_uring support
io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT
io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT
io_uring/cmd: return -EOPNOTSUPP if net is disabled
selftests/net: Extract uring helpers to be reusable
tools headers: Grab copy of io_uring.h
io_uring/cmd: Pass compat mode in issue_flags
net/socket: Break down __sys_getsockopt
net/socket: Break down __sys_setsockopt
bpf: Add sockptr support for setsockopt
bpf: Add sockptr support for getsockopt
Core & protocols
----------------
- Support usec resolution of TCP timestamps, enabled selectively by
a route attribute.
- Defer regular TCP ACK while processing socket backlog, try to send
a cumulative ACK at the end. Increase single TCP flow performance
on a 200Gbit NIC by 20% (100Gbit -> 120Gbit).
- The Fair Queuing (FQ) packet scheduler:
- add built-in 3 band prio / WRR scheduling
- support bypass if the qdisc is mostly idle (5% speed up for TCP RR)
- improve inactive flow reporting
- optimize the layout of structures for better cache locality
- Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern
replacement for the old MD5 option.
- Add more retransmission timeout (RTO) related statistics to TCP_INFO.
- Support sending fragmented skbs over vsock sockets.
- Make sure we send SIGPIPE for vsock sockets if socket was shutdown().
- Add sysctl for ignoring lower limit on lifetime in Router
Advertisement PIO, based on an in-progress IETF draft.
- Add sysctl to control activation of TCP ping-pong mode.
- Add sysctl to make connection timeout in MPTCP configurable.
- Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps
limit the number of wakeups.
- Support netlink GET for MDB (multicast forwarding), allowing user
space to request a single MDB entry instead of dumping the entire
table.
- Support selective FDB flushing in the VXLAN tunnel driver.
- Allow limiting learned FDB entries in bridges, prevent OOM attacks.
- Allow controlling via configfs netconsole targets which were created
via the kernel cmdline at boot, rather than via configfs at runtime.
- Support multiple PTP timestamp event queue readers with different
filters.
- MCTP over I3C.
BPF
---
- Add new veth-like netdevice where BPF program defines the logic
of the xmit routine. It can operate in L3 and L2 mode.
- Support exceptions - allow asserting conditions which should
never be true but are hard for the verifier to infer.
With some extra flexibility around handling of the exit / failure.
https://lwn.net/Articles/938435/
- Add support for local per-cpu kptr, allow allocating and storing
per-cpu objects in maps. Access to those objects operates on
the value for the current CPU. This allows to deprecate local
one-off implementations of per-CPU storage like
BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps.
- Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is
for systemd to re-implement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services.
- Enable open-coded task_vma iteration, after maple tree conversion
made it hard to directly walk VMAs in tracing programs.
- Add open-coded task, css_task and css iterator support.
One of the use cases is customizable OOM victim selection via BPF.
- Allow source address selection with bpf_*_fib_lookup().
- Add ability to pin BPF timer to the current CPU.
- Prevent creation of infinite loops by combining tail calls and
fentry/fexit programs.
- Add missed stats for kprobes to retrieve the number of missed kprobe
executions and subsequent executions of BPF programs.
- Inherit system settings for CPU security mitigations.
- Add BPF v4 CPU instruction support for arm32 and s390x.
Changes to common code
----------------------
- overflow: add DEFINE_FLEX() for on-stack definition of structs
with flexible array members.
- Process doc update with more guidance for reviewers.
Driver API
----------
- Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy
mutex in most places and remove a lot of smaller locks.
- Create a common DPLL configuration API. Allow configuring
and querying state of PLL circuits used for clock syntonization,
in network time distribution.
- Unify fragmented and full page allocation APIs in page pool code.
Let drivers be ignorant of PAGE_SIZE.
- Rework PHY state machine to avoid races with calls to phy_stop().
- Notify DSA drivers of MAC address changes on user ports, improve
correctness of offloads which depend on matching port MAC addresses.
- Allow antenna control on injected WiFi frames.
- Reduce the number of variants of napi_schedule().
- Simplify error handling when composing devlink health messages.
Misc
----
- A lot of KCSAN data race "fixes", from Eric.
- A lot of __counted_by() annotations, from Kees.
- A lot of strncpy -> strscpy and printf format fixes.
- Replace master/slave terminology with conduit/user in DSA drivers.
- Handful of KUnit tests for netdev and WiFi core.
Removed
-------
- AppleTalk COPS.
- AppleTalk ipddp.
- TI AR7 CPMAC Ethernet driver.
Drivers
-------
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- add a driver for the Intel E2000 IPUs
- make CRC/FCS stripping configurable
- cross-timestamping for E823 devices
- basic support for E830 devices
- use aux-bus for managing client drivers
- i40e: report firmware versions via devlink
- nVidia/Mellanox:
- support 4-port NICs
- increase max number of channels to 256
- optimize / parallelize SF creation flow
- Broadcom (bnxt):
- enhance NIC temperature reporting
- support PAM4 speeds and lane configuration
- Marvell OcteonTX2:
- PTP pulse-per-second output support
- enable hardware timestamping for VFs
- Solarflare/AMD:
- conntrack NAT offload and offload for tunnels
- Wangxun (ngbe/txgbe):
- expose HW statistics
- Pensando/AMD:
- support PCI level reset
- narrow down the condition under which skbs are linearized
- Netronome/Corigine (nfp):
- support CHACHA20-POLY1305 crypto in IPsec offload
- Ethernet NICs embedded, slower, virtual:
- Synopsys (stmmac):
- add Loongson-1 SoC support
- enable use of HW queues with no offload capabilities
- enable PPS input support on all 5 channels
- increase TX coalesce timer to 5ms
- RealTek USB (r8152): improve efficiency of Rx by using GRO frags
- xen: support SW packet timestamping
- add drivers for implementations based on TI's PRUSS (AM64x EVM)
- nVidia/Mellanox Ethernet datacenter switches:
- avoid poor HW resource use on Spectrum-4 by better block selection
for IPv6 multicast forwarding and ordering of blocks in ACL region
- Ethernet embedded switches:
- Microchip:
- support configuring the drive strength for EMI compliance
- ksz9477: partial ACL support
- ksz9477: HSR offload
- ksz9477: Wake on LAN
- Realtek:
- rtl8366rb: respect device tree config of the CPU port
- Ethernet PHYs:
- support Broadcom BCM5221 PHYs
- TI dp83867: support hardware LED blinking
- CAN:
- add support for Linux-PHY based CAN transceivers
- at91_can: clean up and use rx-offload helpers
- WiFi:
- MediaTek (mt76):
- new sub-driver for mt7925 USB/PCIe devices
- HW wireless <> Ethernet bridging in MT7988 chips
- mt7603/mt7628 stability improvements
- Qualcomm (ath12k):
- WCN7850:
- enable 320 MHz channels in 6 GHz band
- hardware rfkill support
- enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS
to make scan faster
- read board data variant name from SMBIOS
- QCN9274: mesh support
- RealTek (rtw89):
- TDMA-based multi-channel concurrency (MCC)
- Silicon Labs (wfx):
- Remain-On-Channel (ROC) support
- Bluetooth:
- ISO: many improvements for broadcast support
- mark BCM4378/BCM4387 as BROKEN_LE_CODED
- add support for QCA2066
- btmtksdio: enable Bluetooth wakeup from suspend
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmU8XsYACgkQMUZtbf5S
Irv19RAAnud/24OOF5XMEJkIcYlnfqximh4XO6PujRSYkSkOUJdZTF6iJPgf3pSP
YpwoHYbYKHYfeOf8+3bTNESiQNSnoVmvmvwiS6/7lZ3behHUrGLQzW9Htc3EZyWH
2h6QkDZ5OOjfg0bwYSfp3vXkmMH2k8WE9Y0NvCkhcohqZi13Rmp14RnyPmNb2d1V
yZRYDMSM133KqE6gnBr1Ct65IEvnKeGlCUN2mTGqOJgdn6DZMsyxvtt0y4rmN7Ab
41+CgPU5SfxfbYpW+Dl2HJpgfte3WrC57KC6AM0PAPJzPmQWgeB/m9mjz/apj6Bg
bhsEIo7FdvbCnQm3yWPhK2OgCAcSwLr8jfGMU+Q+W4VnL5SRRR3Rm0zjsze+kHNP
OfqJgxzl3DpvoJqVBy1h5FGcZt0XHwhksm4cTxWqIahsF+veY0ECBXbuBBQx9XTF
Y7INfI8ulg7wISJs+CJfIClYkgOibTw2u8taBS5ikbtgxNqp5D4QqODn7UefQap1
PR/IDYODF+zRgmMJLeBqSa6fij6BkfOEDiOWak5kggBoZdtbtmeKI6tzze06CNdW
lWv1WEhRufxnwK+IuWsEkjhiMbs2WGLvkJ5JbgQV9BfqHfIfiqBCrcWtT/WbQnGt
lmU46CXh1t/FZEqbmK9h+8vsIIfrcDl6jb5npEiKPRG00vDKRTM=
=46nS
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Support usec resolution of TCP timestamps, enabled selectively by a
route attribute.
- Defer regular TCP ACK while processing socket backlog, try to send
a cumulative ACK at the end. Increase single TCP flow performance
on a 200Gbit NIC by 20% (100Gbit -> 120Gbit).
- The Fair Queuing (FQ) packet scheduler:
- add built-in 3 band prio / WRR scheduling
- support bypass if the qdisc is mostly idle (5% speed up for TCP RR)
- improve inactive flow reporting
- optimize the layout of structures for better cache locality
- Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern
replacement for the old MD5 option.
- Add more retransmission timeout (RTO) related statistics to
TCP_INFO.
- Support sending fragmented skbs over vsock sockets.
- Make sure we send SIGPIPE for vsock sockets if socket was
shutdown().
- Add sysctl for ignoring lower limit on lifetime in Router
Advertisement PIO, based on an in-progress IETF draft.
- Add sysctl to control activation of TCP ping-pong mode.
- Add sysctl to make connection timeout in MPTCP configurable.
- Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps
limit the number of wakeups.
- Support netlink GET for MDB (multicast forwarding), allowing user
space to request a single MDB entry instead of dumping the entire
table.
- Support selective FDB flushing in the VXLAN tunnel driver.
- Allow limiting learned FDB entries in bridges, prevent OOM attacks.
- Allow controlling via configfs netconsole targets which were
created via the kernel cmdline at boot, rather than via configfs at
runtime.
- Support multiple PTP timestamp event queue readers with different
filters.
- MCTP over I3C.
BPF:
- Add new veth-like netdevice where BPF program defines the logic of
the xmit routine. It can operate in L3 and L2 mode.
- Support exceptions - allow asserting conditions which should never
be true but are hard for the verifier to infer. With some extra
flexibility around handling of the exit / failure:
https://lwn.net/Articles/938435/
- Add support for local per-cpu kptr, allow allocating and storing
per-cpu objects in maps. Access to those objects operates on the
value for the current CPU.
This allows to deprecate local one-off implementations of per-CPU
storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps.
- Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is
for systemd to re-implement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services.
- Enable open-coded task_vma iteration, after maple tree conversion
made it hard to directly walk VMAs in tracing programs.
- Add open-coded task, css_task and css iterator support. One of the
use cases is customizable OOM victim selection via BPF.
- Allow source address selection with bpf_*_fib_lookup().
- Add ability to pin BPF timer to the current CPU.
- Prevent creation of infinite loops by combining tail calls and
fentry/fexit programs.
- Add missed stats for kprobes to retrieve the number of missed
kprobe executions and subsequent executions of BPF programs.
- Inherit system settings for CPU security mitigations.
- Add BPF v4 CPU instruction support for arm32 and s390x.
Changes to common code:
- overflow: add DEFINE_FLEX() for on-stack definition of structs with
flexible array members.
- Process doc update with more guidance for reviewers.
Driver API:
- Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy
mutex in most places and remove a lot of smaller locks.
- Create a common DPLL configuration API. Allow configuring and
querying state of PLL circuits used for clock syntonization, in
network time distribution.
- Unify fragmented and full page allocation APIs in page pool code.
Let drivers be ignorant of PAGE_SIZE.
- Rework PHY state machine to avoid races with calls to phy_stop().
- Notify DSA drivers of MAC address changes on user ports, improve
correctness of offloads which depend on matching port MAC
addresses.
- Allow antenna control on injected WiFi frames.
- Reduce the number of variants of napi_schedule().
- Simplify error handling when composing devlink health messages.
Misc:
- A lot of KCSAN data race "fixes", from Eric.
- A lot of __counted_by() annotations, from Kees.
- A lot of strncpy -> strscpy and printf format fixes.
- Replace master/slave terminology with conduit/user in DSA drivers.
- Handful of KUnit tests for netdev and WiFi core.
Removed:
- AppleTalk COPS.
- AppleTalk ipddp.
- TI AR7 CPMAC Ethernet driver.
Drivers:
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- add a driver for the Intel E2000 IPUs
- make CRC/FCS stripping configurable
- cross-timestamping for E823 devices
- basic support for E830 devices
- use aux-bus for managing client drivers
- i40e: report firmware versions via devlink
- nVidia/Mellanox:
- support 4-port NICs
- increase max number of channels to 256
- optimize / parallelize SF creation flow
- Broadcom (bnxt):
- enhance NIC temperature reporting
- support PAM4 speeds and lane configuration
- Marvell OcteonTX2:
- PTP pulse-per-second output support
- enable hardware timestamping for VFs
- Solarflare/AMD:
- conntrack NAT offload and offload for tunnels
- Wangxun (ngbe/txgbe):
- expose HW statistics
- Pensando/AMD:
- support PCI level reset
- narrow down the condition under which skbs are linearized
- Netronome/Corigine (nfp):
- support CHACHA20-POLY1305 crypto in IPsec offload
- Ethernet NICs embedded, slower, virtual:
- Synopsys (stmmac):
- add Loongson-1 SoC support
- enable use of HW queues with no offload capabilities
- enable PPS input support on all 5 channels
- increase TX coalesce timer to 5ms
- RealTek USB (r8152): improve efficiency of Rx by using GRO frags
- xen: support SW packet timestamping
- add drivers for implementations based on TI's PRUSS (AM64x EVM)
- nVidia/Mellanox Ethernet datacenter switches:
- avoid poor HW resource use on Spectrum-4 by better block
selection for IPv6 multicast forwarding and ordering of blocks
in ACL region
- Ethernet embedded switches:
- Microchip:
- support configuring the drive strength for EMI compliance
- ksz9477: partial ACL support
- ksz9477: HSR offload
- ksz9477: Wake on LAN
- Realtek:
- rtl8366rb: respect device tree config of the CPU port
- Ethernet PHYs:
- support Broadcom BCM5221 PHYs
- TI dp83867: support hardware LED blinking
- CAN:
- add support for Linux-PHY based CAN transceivers
- at91_can: clean up and use rx-offload helpers
- WiFi:
- MediaTek (mt76):
- new sub-driver for mt7925 USB/PCIe devices
- HW wireless <> Ethernet bridging in MT7988 chips
- mt7603/mt7628 stability improvements
- Qualcomm (ath12k):
- WCN7850:
- enable 320 MHz channels in 6 GHz band
- hardware rfkill support
- enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to
make scan faster
- read board data variant name from SMBIOS
- QCN9274: mesh support
- RealTek (rtw89):
- TDMA-based multi-channel concurrency (MCC)
- Silicon Labs (wfx):
- Remain-On-Channel (ROC) support
- Bluetooth:
- ISO: many improvements for broadcast support
- mark BCM4378/BCM4387 as BROKEN_LE_CODED
- add support for QCA2066
- btmtksdio: enable Bluetooth wakeup from suspend"
* tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits)
net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers
net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos()
net: mana: Use xdp_set_features_flag instead of direct assignment
vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size()
iavf: delete the iavf client interface
iavf: add a common function for undoing the interrupt scheme
iavf: use unregister_netdev
iavf: rely on netdev's own registered state
iavf: fix the waiting time for initial reset
iavf: in iavf_down, don't queue watchdog_task if comms failed
iavf: simplify mutex_trylock+sleep loops
iavf: fix comments about old bit locks
doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name
tools: ynl: introduce option to process unknown attributes or types
ipvlan: properly track tx_errors
netdevsim: Block until all devices are released
nfp: using napi_build_skb() to replace build_skb()
net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy"
net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN
net: dsa: microchip: Refactor switch shutdown routine for WoL preparation
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppQwAKCRCRxhvAZXjc
om2kAP4u+eLsrhJHfUPUttGUEkSkZE+5/s/f1A/1GcV51usLSgEAu8urxAnP49GW
INaDABXaFfKx8/KI/H2YFZPKGwlNEwY=
=PDDE
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull iov_iter updates from Christian Brauner:
"This contain's David's iov_iter cleanup work to convert the iov_iter
iteration macros to inline functions:
- Remove last_offset from iov_iter as it was only used by ITER_PIPE
- Add a __user tag on copy_mc_to_user()'s dst argument on x86 to
match that on powerpc and get rid of a sparse warning
- Convert iter->user_backed to user_backed_iter() in the sound PCM
driver
- Convert iter->user_backed to user_backed_iter() in a couple of
infiniband drivers
- Renumber the type enum so that the ITER_* constants match the order
in iterate_and_advance*()
- Since the preceding patch puts UBUF and IOVEC at 0 and 1, change
user_backed_iter() to just use the type value and get rid of the
extra flag
- Convert the iov_iter iteration macros to always-inline functions to
make the code easier to follow. It uses function pointers, but they
get optimised away
- Move the check for ->copy_mc to _copy_from_iter() and
copy_page_from_iter_atomic() rather than in memcpy_from_iter_mc()
where it gets repeated for every segment. Instead, we check once
and invoke a side function that can use iterate_bvec() rather than
iterate_and_advance() and supply a different step function
- Move the copy-and-csum code to net/ where it can be in proximity
with the code that uses it
- Fold memcpy_and_csum() in to its two users
- Move csum_and_copy_from_iter_full() out of line and merge in
csum_and_copy_from_iter() since the former is the only caller of
the latter
- Move hash_and_copy_to_iter() to net/ where it can be with its only
caller"
* tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
iov_iter, net: Move hash_and_copy_to_iter() to net/
iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
iov_iter, net: Fold in csum_and_memcpy()
iov_iter, net: Move csum_and_copy_to/from_iter() to net/
iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
iov_iter: Convert iterate*() to inline funcs
iov_iter: Derive user-backedness from the iterator type
iov_iter: Renumber ITER_* constants
infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC
sound: Fix snd_pcm_readv()/writev() to use iov access functions
iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
During a W=1 build GCC 13.2 says:
net/core/selftests.c: In function ‘net_selftest_get_strings’:
net/core/selftests.c:404:52: error: ‘%s’ directive output may be truncated writing up to 279 bytes into a region of size 28 [-Werror=format-truncation=]
404 | snprintf(p, ETH_GSTRING_LEN, "%2d. %s", i + 1,
| ^~
net/core/selftests.c:404:17: note: ‘snprintf’ output between 5 and 284 bytes into a destination of size 32
404 | snprintf(p, ETH_GSTRING_LEN, "%2d. %s", i + 1,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
405 | net_selftests[i].name);
| ~~~~~~~~~~~~~~~~~~~~~~
avoid it by using ethtool_sprintf().
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20231026022916.566661-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that both the bridge and VXLAN drivers implement the MDB get net
device operation, expose the functionality to user space by registering
a handler for RTM_GETMDB messages. Derive the net device from the
ifindex specified in the ancillary header and invoke its MDB get NDO.
Note that unlike other get handlers, the allocation of the skb
containing the response is not performed in the common rtnetlink code as
the size is variable and needs to be determined by the respective
driver.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZTp12QAKCRDbK58LschI
g8BrAQDifqp5liEEdXV8jdReBwJtqInjrL5tzy5LcyHUMQbTaAEA6Ph3Ct3B+3oA
mFnIW/y6UJiJrby0Xz4+vV5BXI/5WQg=
=pLCV
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-10-26
We've added 51 non-merge commits during the last 10 day(s) which contain
a total of 75 files changed, 5037 insertions(+), 200 deletions(-).
The main changes are:
1) Add open-coded task, css_task and css iterator support.
One of the use cases is customizable OOM victim selection via BPF,
from Chuyi Zhou.
2) Fix BPF verifier's iterator convergence logic to use exact states
comparison for convergence checks, from Eduard Zingerman,
Andrii Nakryiko and Alexei Starovoitov.
3) Add BPF programmable net device where bpf_mprog defines the logic
of its xmit routine. It can operate in L3 and L2 mode,
from Daniel Borkmann and Nikolay Aleksandrov.
4) Batch of fixes for BPF per-CPU kptr and re-enable unit_size checking
for global per-CPU allocator, from Hou Tao.
5) Fix libbpf which eagerly assumed that SHT_GNU_verdef ELF section
was going to be present whenever a binary has SHT_GNU_versym section,
from Andrii Nakryiko.
6) Fix BPF ringbuf correctness to fold smp_mb__before_atomic() into
atomic_set_release(), from Paul E. McKenney.
7) Add a warning if NAPI callback missed xdp_do_flush() under
CONFIG_DEBUG_NET which helps checking if drivers were missing
the former, from Sebastian Andrzej Siewior.
8) Fix missed RCU read-lock in bpf_task_under_cgroup() which was throwing
a warning under sleepable programs, from Yafang Shao.
9) Avoid unnecessary -EBUSY from htab_lock_bucket by disabling IRQ before
checking map_locked, from Song Liu.
10) Make BPF CI linked_list failure test more robust,
from Kumar Kartikeya Dwivedi.
11) Enable samples/bpf to be built as PIE in Fedora, from Viktor Malik.
12) Fix xsk starving when multiple xsk sockets were associated with
a single xsk_buff_pool, from Albert Huang.
13) Clarify the signed modulo implementation for the BPF ISA standardization
document that it uses truncated division, from Dave Thaler.
14) Improve BPF verifier's JEQ/JNE branch taken logic to also consider
signed bounds knowledge, from Andrii Nakryiko.
15) Add an option to XDP selftests to use multi-buffer AF_XDP
xdp_hw_metadata and mark used XDP programs as capable to use frags,
from Larysa Zaremba.
16) Fix bpftool's BTF dumper wrt printing a pointer value and another
one to fix struct_ops dump in an array, from Manu Bretelle.
* tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (51 commits)
netkit: Remove explicit active/peer ptr initialization
selftests/bpf: Fix selftests broken by mitigations=off
samples/bpf: Allow building with custom bpftool
samples/bpf: Fix passing LDFLAGS to libbpf
samples/bpf: Allow building with custom CFLAGS/LDFLAGS
bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free
selftests/bpf: Add selftests for netkit
selftests/bpf: Add netlink helper library
bpftool: Extend net dump with netkit progs
bpftool: Implement link show support for netkit
libbpf: Add link-based API for netkit
tools: Sync if_link uapi header
netkit, bpf: Add bpf programmable net device
bpf: Improve JEQ/JNE branch taken logic
bpf: Fold smp_mb__before_atomic() into atomic_set_release()
bpf: Fix unnecessary -EBUSY from htab_lock_bucket
xsk: Avoid starving the xsk further down the list
bpf: print full verifier states on infinite loop detection
selftests/bpf: test if state loops are detected in a tricky case
bpf: correct loop detection for iterators convergence
...
====================
Link: https://lore.kernel.org/r/20231026150509.2824-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use more inclusive terms throughout the DSA subsystem by moving away
from "master" which is replaced by "conduit" and "slave" which is
replaced by "user". No functional changes.
Acked-by: Rob Herring <robh@kernel.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/20231023181729.1191071-2-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove unnecessary else clauses after return.
I copied this if / else construct from somewhere,
it makes the code harder to read.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prior to restructuring __dev_alloc_name() handled both printf
and non-printf names. In a clever attempt at code reuse it
always prints the name into a buffer and checks if it's
a duplicate.
Trust the bitmap, and return an error if its full.
This shrinks the possible ID space by one from 32K to 32K - 1,
as previously the max value would have been tried as a valid ID.
It seems very unlikely that anyone would care as we heard
no requests to increase the max beyond 32k.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All callers of __dev_valid_name() go thru dev_prep_valid_name()
which handles the non-printf case. Focus __dev_alloc_name() on
the sprintf case, remove the indentation level.
Minor functional change of returning -EINVAL if % is not found,
which should now never happen.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__dev_alloc_name() handles both the sprintf and non-sprintf
target names. This complicates the code.
dev_prep_valid_name() already handles the non-sprintf case,
before calling __dev_alloc_name(), make the only other caller
also go thru dev_prep_valid_name(). This way we can drop
the non-sprintf handling in __dev_alloc_name() in one of
the next changes.
commit 55a5ec9b77 ("Revert "net: core: dev_get_valid_name is now the same as dev_alloc_name_ns"") and
commit 029b6d1405 ("Revert "net: core: maybe return -EEXIST in __dev_alloc_name"")
tell us that we can't start returning -EEXIST from dev_alloc_name()
on name duplicates. Bite the bullet and pass the expected errno to
dev_prep_valid_name().
dev_prep_valid_name() must now propagate out the allocated id
for printf names.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Callers of __dev_alloc_name() want to pass dev->name as
the output buffer. Make __dev_alloc_name() not clobber
that buffer on failure, and remove the workarounds
in callers.
dev_alloc_name_ns() is now completely unnecessary.
The extra strscpy() added here will be gone by the end
of the patch series.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before sockets became aware of net-memcg's memory pressure since
commit e1aab161e0 ("socket: initial cgroup code."), the memory
usage would be granted to raise if below average even when under
protocol's pressure. This provides fairness among the sockets of
same protocol.
That commit changes this because the heuristic will also be
effective when only memcg is under pressure which makes no sense.
So revert that behavior.
After reverting, __sk_mem_raise_allocated() no longer considers
memcg's pressure. As memcgs are isolated from each other w.r.t.
memory accounting, consuming one's budget won't affect others.
So except the places where buffer sizes are needed to be tuned,
allow workloads to use the memory they are provisioned.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231019120026.42215-3-wuyun.abel@bytedance.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There are now two accounting infrastructures for skmem, while the
heuristics in __sk_mem_raise_allocated() were actually introduced
before memcg was born.
Add some comments to clarify whether they can be applied to both
infrastructures or not.
Suggested-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231019120026.42215-2-wuyun.abel@bytedance.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Code cleanup for both better simplicity and readability.
No functional change intended.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231019120026.42215-1-wuyun.abel@bytedance.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently page pool supports the below use cases:
use case 1: allocate page without page splitting using
page_pool_alloc_pages() API if the driver knows
that the memory it need is always bigger than
half of the page allocated from page pool.
use case 2: allocate page frag with page splitting using
page_pool_alloc_frag() API if the driver knows
that the memory it need is always smaller than
or equal to the half of the page allocated from
page pool.
There is emerging use case [1] & [2] that is a mix of the
above two case: the driver doesn't know the size of memory it
need beforehand, so the driver may use something like below to
allocate memory with least memory utilization and performance
penalty:
if (size << 1 > max_size)
page = page_pool_alloc_pages();
else
page = page_pool_alloc_frag();
To avoid the driver doing something like above, add the
page_pool_alloc() API to support the above use case, and update
the true size of memory that is acctually allocated by updating
'*size' back to the driver in order to avoid exacerbating
truesize underestimate problem.
Rename page_pool_free() which is used in the destroy process to
__page_pool_destroy() to avoid confusion with the newly added
API.
1. https://lore.kernel.org/all/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org/
2. https://lore.kernel.org/all/20230526054621.18371-3-liangchen.linux@gmail.com/
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Liang Chen <liangchen.linux@gmail.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20231020095952.11055-4-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
PP_FLAG_PAGE_FRAG is not really needed after pp_frag_count
handling is unified and page_pool_alloc_frag() is supported
in 32-bit arch with 64-bit DMA, so remove it.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Liang Chen <liangchen.linux@gmail.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20231020095952.11055-3-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently when page_pool_create() is called with
PP_FLAG_PAGE_FRAG flag, page_pool_alloc_pages() is only
allowed to be called under the below constraints:
1. page_pool_fragment_page() need to be called to setup
page->pp_frag_count immediately.
2. page_pool_defrag_page() often need to be called to drain
the page->pp_frag_count when there is no more user will
be holding on to that page.
Those constraints exist in order to support a page to be
split into multi fragments.
And those constraints have some overhead because of the
cache line dirtying/bouncing and atomic update.
Those constraints are unavoidable for case when we need a
page to be split into more than one fragment, but there is
also case that we want to avoid the above constraints and
their overhead when a page can't be split as it can only
hold a fragment as requested by user, depending on different
use cases:
use case 1: allocate page without page splitting.
use case 2: allocate page with page splitting.
use case 3: allocate page with or without page splitting
depending on the fragment size.
Currently page pool only provide page_pool_alloc_pages() and
page_pool_alloc_frag() API to enable the 1 & 2 separately,
so we can not use a combination of 1 & 2 to enable 3, it is
not possible yet because of the per page_pool flag
PP_FLAG_PAGE_FRAG.
So in order to allow allocating unsplit page without the
overhead of split page while still allow allocating split
page we need to remove the per page_pool flag in
page_pool_is_last_frag(), as best as I can think of, it seems
there are two methods as below:
1. Add per page flag/bit to indicate a page is split or
not, which means we might need to update that flag/bit
everytime the page is recycled, dirtying the cache line
of 'struct page' for use case 1.
2. Unify the page->pp_frag_count handling for both split and
unsplit page by assuming all pages in the page pool is split
into a big fragment initially.
As page pool already supports use case 1 without dirtying the
cache line of 'struct page' whenever a page is recyclable, we
need to support the above use case 3 with minimal overhead,
especially not adding any noticeable overhead for use case 1,
and we are already doing an optimization by not updating
pp_frag_count in page_pool_defrag_page() for the last fragment
user, this patch chooses to unify the pp_frag_count handling
to support the above use case 3.
There is no noticeable performance degradation and some
justification for unifying the frag_count handling with this
patch applied using a micro-benchmark testing in [1].
1. https://lore.kernel.org/all/bf2591f8-7b3c-4480-bb2c-31dc9da1d6ac@huawei.com/
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Liang Chen <liangchen.linux@gmail.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20231020095952.11055-2-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) tbl->gc_thresh1, tbl->gc_thresh2, tbl->gc_thresh3 and tbl->gc_interval
can be written from sysfs.
2) tbl->last_flush is read locklessly from neigh_alloc()
3) tbl->proxy_queue.qlen is read locklessly from neightbl_fill_info()
4) neightbl_fill_info() reads cpu stats that can be changed concurrently.
Fixes: c7fb64db00 ("[NETLINK]: Neighbour table configuration and statistics via rtnetlink")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231019122104.1448310-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Split __sys_getsockopt() into two functions by removing the core
logic into a sub-function (do_sock_getsockopt()). This will avoid
code duplication when doing the same operation in other callers, for
instance.
do_sock_getsockopt() will be called by io_uring getsockopt() command
operation in the following patch.
The same was done for the setsockopt pair.
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231016134750.1381153-5-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The altname nodes are currently not moved to the new netns
when netdevice itself moves:
[ ~]# ip netns add test
[ ~]# ip -netns test link add name eth0 type dummy
[ ~]# ip -netns test link property add dev eth0 altname some-name
[ ~]# ip -netns test link show dev some-name
2: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 1e:67:ed:19:3d:24 brd ff:ff:ff:ff:ff:ff
altname some-name
[ ~]# ip -netns test link set dev eth0 netns 1
[ ~]# ip link
...
3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff
altname some-name
[ ~]# ip li show dev some-name
Device "some-name" does not exist.
Remove them from the hash table when device is unlisted
and add back when listed again.
Fixes: 36fbf1e52b ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Altnames are accessed under RCU (dev_get_by_name_rcu())
but freed by kfree() with no synchronization point.
Each node has one or two allocations (node and a variable-size
name, sometimes the name is netdev->name). Adding rcu_heads
here is a bit tedious. Besides most code which unlists the names
already has rcu barriers - so take the simpler approach of adding
synchronize_rcu(). Note that the one on the unregistration path
(which matters more) is removed by the next fix.
Fixes: ff92741270 ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
It's currently possible to create an altname conflicting
with an altname or real name of another device by creating
it in another netns and moving it over:
[ ~]$ ip link add dev eth0 type dummy
[ ~]$ ip netns add test
[ ~]$ ip -netns test link add dev ethX netns test type dummy
[ ~]$ ip -netns test link property add dev ethX altname eth0
[ ~]$ ip -netns test link set dev ethX netns 1
[ ~]$ ip link
...
3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff
...
5: ethX: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 26:b7:28:78:38:0f brd ff:ff:ff:ff:ff:ff
altname eth0
Create a macro for walking the altnames, this hopefully makes
it clearer that the list we walk contains only altnames.
Which is otherwise not entirely intuitive.
Fixes: 36fbf1e52b ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
dev_get_valid_name() overwrites the netdev's name on success.
This makes it hard to use in prepare-commit-like fashion,
where we do validation first, and "commit" to the change
later.
Factor out a helper which lets us save the new name to a buffer.
Use it to fix the problem of notification on netns move having
incorrect name:
5: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff
6: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether 1e:4a:34:36:e3:cd brd ff:ff:ff:ff:ff:ff
[ ~]# ip link set dev eth0 netns 1 name eth1
ip monitor inside netns:
Deleted inet eth0
Deleted inet6 eth0
Deleted 5: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff new-netnsid 0 new-ifindex 7
Name is reported as eth1 in old netns for ifindex 5, already renamed.
Fixes: d90310243f ("net: device name allocation cleanups")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We currently have napi_if_scheduled_mark_missed that can be used to
check if napi is scheduled but that does more thing than simply checking
it and return a bool. Some driver already implement custom function to
check if napi is scheduled.
Drop these custom function and introduce napi_is_scheduled that simply
check if napi is scheduled atomically.
Update any driver and code that implement a similar check and instead
use this new helper.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Assume that caller's 'to' offset really represents an upper boundary for
the pattern search, so patterns extending past this offset are to be
rejected.
The old behaviour also was kind of inconsistent when it comes to
fragmentation (or otherwise non-linear skbs): If the pattern started in
between 'to' and 'from' offsets but extended to the next fragment, it
was not found if 'to' offset was still within the current fragment.
Test the new behaviour in a kselftest using iptables' string match.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: f72b948dcb ("[NET]: skb_find_text ignores to argument")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Device flags are displayed incorrectly:
1) The comparison (i == F_FLOW_SEQ) is always false, because F_FLOW_SEQ
is equal to (1 << FLOW_SEQ_SHIFT) == 2048, and the maximum value
of the 'i' variable is (NR_PKT_FLAG - 1) == 17. It should be compared
with FLOW_SEQ_SHIFT.
2) Similarly to the F_IPSEC flag.
3) Also add spaces to the print end of the string literal "spi:%u"
to prevent the output from merging with the flag that follows.
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.
Fixes: 99c6d3d20d ("pktgen: Remove brute-force printing of flags")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few drivers were missing a xdp_do_flush() invocation after
XDP_REDIRECT.
Add three helper functions each for one of the per-CPU lists. Return
true if the per-CPU list is non-empty and flush the list.
Add xdp_do_check_flushed() which invokes each helper functions and
creates a warning if one of the functions had a non-empty list.
Hide everything behind CONFIG_DEBUG_NET.
Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20231016125738.Yt79p1uF@linutronix.de
consume_skb() doesn't walk the segment list, so segments other than
the first are leaked.
Move this skb_consume call into the loop.
Cc: Willem de Bruijn <willemb@google.com>
Fixes: b3098d32ed ("net: add skb_segment kunit test")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZS1d4wAKCRDbK58LschI
g4DSAP441CdKh8fd+wNKUSKHFbpCQ6EvocR6Nf+Sj2DFUx/w/QEA7mfju7Abqjc3
xwDEx0BuhrjMrjV5MmEpxc7lYl9XcQU=
=vuWk
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-10-16
We've added 90 non-merge commits during the last 25 day(s) which contain
a total of 120 files changed, 3519 insertions(+), 895 deletions(-).
The main changes are:
1) Add missed stats for kprobes to retrieve the number of missed kprobe
executions and subsequent executions of BPF programs, from Jiri Olsa.
2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is
for systemd to reimplement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services, from Daan De Meyer.
3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.
4) Improve BPF verifier log output for scalar registers to better
disambiguate their internal state wrt defaults vs min/max values
matching, from Andrii Nakryiko.
5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving
the source IP address with a new BPF_FIB_LOOKUP_SRC flag,
from Martynas Pumputis.
6) Add support for open-coded task_vma iterator to help with symbolization
for BPF-collected user stacks, from Dave Marchevsky.
7) Add libbpf getters for accessing individual BPF ring buffers which
is useful for polling them individually, for example, from Martin Kelly.
8) Extend AF_XDP selftests to validate the SHARED_UMEM feature,
from Tushar Vyavahare.
9) Improve BPF selftests cross-building support for riscv arch,
from Björn Töpel.
10) Add the ability to pin a BPF timer to the same calling CPU,
from David Vernet.
11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic
implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments,
from Alexandre Ghiti.
12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.
13) Fix bpftool's skeleton code generation to guarantee that ELF data
is 8 byte aligned, from Ian Rogers.
14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4
security mitigations in BPF verifier, from Yafang Shao.
15) Annotate struct bpf_stack_map with __counted_by attribute to prepare
BPF side for upcoming __counted_by compiler support, from Kees Cook.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
bpf: Ensure proper register state printing for cond jumps
bpf: Disambiguate SCALAR register state output in verifier logs
selftests/bpf: Make align selftests more robust
selftests/bpf: Improve missed_kprobe_recursion test robustness
selftests/bpf: Improve percpu_alloc test robustness
selftests/bpf: Add tests for open-coded task_vma iter
bpf: Introduce task_vma open-coded iterator kfuncs
selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
bpf: Don't explicitly emit BTF for struct btf_iter_num
bpf: Change syscall_nr type to int in struct syscall_tp_t
net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
bpf: Avoid unnecessary audit log for CPU security mitigations
selftests/bpf: Add tests for cgroup unix socket address hooks
selftests/bpf: Make sure mount directory exists
documentation/bpf: Document cgroup unix socket address hooks
bpftool: Add support for cgroup unix socket address hooks
libbpf: Add support for cgroup unix socket address hooks
bpf: Implement cgroup sockaddr hooks for unix sockets
bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
bpf: Propagate modified uaddrlen from cgroup sockaddr programs
...
====================
Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently page_pool_alloc_frag() is not supported in 32-bit
arch with 64-bit DMA because of the overlap issue between
pp_frag_count and dma_addr_upper in 'struct page' for those
arches, which seems to be quite common, see [1], which means
driver may need to handle it when using fragment API.
It is assumed that the combination of the above arch with an
address space >16TB does not exist, as all those arches have
64b equivalent, it seems logical to use the 64b version for a
system with a large address space. It is also assumed that dma
address is page aligned when we are dma mapping a page aligned
buffer, see [2].
That means we're storing 12 bits of 0 at the lower end for a
dma address, we can reuse those bits for the above arches to
support 32b+12b, which is 16TB of memory.
If we make a wrong assumption, a warning is emitted so that
user can report to us.
1. https://lore.kernel.org/all/20211117075652.58299-1-linyunsheng@huawei.com/
2. https://lore.kernel.org/all/20230818145145.4b357c89@kernel.org/
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Liang Chen <liangchen.linux@gmail.com>
CC: Guillaume Tucker <guillaume.tucker@collabora.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: Linux-MM <linux-mm@kvack.org>
Link: https://lore.kernel.org/r/20231013064827.61135-2-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When modifying netclassid, the command("echo 0x100001 > net_cls.classid")
will take more time on many threads of one process, because the process
create many fds.
for example, one process exists 28000 fds and 60000 threads, echo command
will task 45 seconds.
Now, we only consider the main process when exec "iterate_fd", and the
time is about 52 milliseconds.
Signed-off-by: Liansen Zhai <zhailiansen@kuaishou.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231012090330.29636-1-zhailiansen@kuaishou.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the kfree_skb_reason() in sch_handle_{ingress,egress}() can only
express a basic SKB_DROP_REASON_TC_INGRESS or SKB_DROP_REASON_TC_EGRESS reason.
Victor kicked-off an initial proposal to make this more flexible by disambiguating
verdict from return code by moving the verdict into struct tcf_result and
letting tcf_classify() return a negative error. If hit, then two new drop
reasons were added in the proposal, that is SKB_DROP_REASON_TC_INGRESS_ERROR
as well as SKB_DROP_REASON_TC_EGRESS_ERROR. Further analysis of the actual
error codes would have required to attach to tcf_classify via kprobe/kretprobe
to more deeply debug skb and the returned error.
In order to make the kfree_skb_reason() in sch_handle_{ingress,egress}() more
extensible, it can be addressed in a more straight forward way, that is: Instead
of placing the verdict into struct tcf_result, we can just put the drop reason
in there, which does not require changes throughout various classful schedulers
given the existing verdict logic can stay as is.
Then, SKB_DROP_REASON_TC_ERROR{,_*} can be added to the enum skb_drop_reason
to disambiguate between an error or an intentional drop. New drop reason error
codes can be added successively to the tc code base.
For internal error locations which have not yet been annotated with a
SKB_DROP_REASON_TC_ERROR{,_*}, the fallback is SKB_DROP_REASON_TC_INGRESS and
SKB_DROP_REASON_TC_EGRESS, respectively. Generic errors could be marked with a
SKB_DROP_REASON_TC_ERROR code until they are converted to more specific ones
if it is found that they would be useful for troubleshooting.
While drop reasons have infrastructure for subsystem specific error codes which
are currently used by mac80211 and ovs, Jakub mentioned that it is preferred
for tc to use the enum skb_drop_reason core codes given it is a better fit and
currently the tooling support is better, too.
With regards to the latter:
[...] I think Alastair (bpftrace) is working on auto-prettifying enums when
bpftrace outputs maps. So we can do something like:
$ bpftrace -e 'tracepoint:skb:kfree_skb { @[args->reason] = count(); }'
Attaching 1 probe...
^C
@[SKB_DROP_REASON_TC_INGRESS]: 2
@[SKB_CONSUMED]: 34
^^^^^^^^^^^^ names!!
Auto-magically. [...]
Add a small helper tcf_set_drop_reason() which can be used to set the drop reason
into the tcf_result.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Link: https://lore.kernel.org/netdev/20231006063233.74345d36@kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231009092655.22025-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As reported by Tom, .NET and applications build on top of it rely
on connect(AF_UNSPEC) to async cancel pending I/O operations on TCP
socket.
The blamed commit below caused a regression, as such cancellation
can now fail.
As suggested by Eric, this change addresses the problem explicitly
causing blocking I/O operation to terminate immediately (with an error)
when a concurrent disconnect() is executed.
Instead of tracking the number of threads blocked on a given socket,
track the number of disconnect() issued on such socket. If such counter
changes after a blocking operation releasing and re-acquiring the socket
lock, error out the current operation.
Fixes: 4faeee0cf8 ("tcp: deny tcp_disconnect() when threads are waiting")
Reported-by: Tom Deseyn <tdeseyn@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886305
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/f3b95e47e3dbed840960548aebaa8d954372db41.1697008693.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The merge commit 9271686937 ("Merge branch 'br-flush-filtering'")
added support for FDB flushing in bridge driver. The following patches
will extend VXLAN driver to support FDB flushing as well. The netlink
message for bulk delete is shared between the drivers. With the existing
implementation, there is no way to prevent user from flushing with
attributes that are not supported per driver. For example, when VNI will
be added, user will not get an error for flush FDB entries in bridge
with VNI, although this attribute is not relevant for bridge.
As preparation for support of FDB flush in VXLAN driver, move the policy
to be handled in bridge driver, later a new policy for VXLAN will be
added in VXLAN driver. Do not pass 'vid' as part of ndo_fdb_del_bulk(),
as this field is relevant only for bridge.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
kernel/bpf/verifier.c
829955981c ("bpf: Fix verifier log for async callback return values")
a923819fb2 ("bpf: Treat first argument as return value for bpf_throw")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
gcc 12 errors out with:
net/core/gso_test.c:58:48: error: initializer element is not constant
58 | .segs = (const unsigned int[]) { gso_size },
This version isn't old (2022), so switch to preprocessor-bsaed constant
instead of 'static const int'.
Cc: Willem de Bruijn <willemb@google.com>
Reported-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com>
Closes: https://lore.kernel.org/netdev/79fbe35c-4dd1-4f27-acb2-7a60794bc348@linux.vnet.ibm.com/
Fixes: 1b4fa28a8b ("net: parametrize skb_segment unit test to expand coverage")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20231012120901.10765-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
These hooks allows intercepting connect(), getsockname(),
getpeername(), sendmsg() and recvmsg() for unix sockets. The unix
socket hooks get write access to the address length because the
address length is not fixed when dealing with unix sockets and
needs to be modified when a unix socket address is modified by
the hook. Because abstract socket unix addresses start with a
NUL byte, we cannot recalculate the socket address in kernelspace
after running the hook by calculating the length of the unix socket
path using strlen().
These hooks can be used when users want to multiplex syscall to a
single unix socket to multiple different processes behind the scenes
by redirecting the connect() and other syscalls to process specific
sockets.
We do not implement support for intercepting bind() because when
using bind() with unix sockets with a pathname address, this creates
an inode in the filesystem which must be cleaned up. If we rewrite
the address, the user might try to clean up the wrong file, leaking
the socket in the filesystem where it is never cleaned up. Until we
figure out a solution for this (and a use case for intercepting bind()),
we opt to not allow rewriting the sockaddr in bind() calls.
We also implement recvmsg() support for connected streams so that
after a connect() that is modified by a sockaddr hook, any corresponding
recmvsg() on the connected socket can also be modified to make the
connected program think it is connected to the "intended" remote.
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
As prep for adding unix socket support to the cgroup sockaddr hooks,
let's add a kfunc bpf_sock_addr_set_sun_path() that allows modifying a unix
sockaddr from bpf. While this is already possible for AF_INET and AF_INET6,
we'll need this kfunc when we add unix socket support since modifying the
address for those requires modifying both the address and the sockaddr
length.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-4-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Expand the test with these variants that use skb frag_list:
- GSO_TEST_FRAG_LIST: frag_skb length is gso_size
- GSO_TEST_FRAG_LIST_PURE: same, data exclusively in frag skbs
- GSO_TEST_FRAG_LIST_NON_UNIFORM: frag_skb length may vary
- GSO_TEST_GSO_BY_FRAGS: frag_skb length defines gso_size,
i.e., segs may have varying sizes.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expand the test with variants
- GSO_TEST_NO_GSO: payload size less than or equal to gso_size
- GSO_TEST_FRAGS: payload in both linear and page frags
- GSO_TEST_FRAGS_PURE: payload exclusively in page frags
- GSO_TEST_GSO_PARTIAL: produce one gso segment of multiple of gso_size,
plus optionally one non-gso trailer segment
Define a test struct that encodes the input gso skb and output segs.
Input in terms of linear and fragment lengths. Output as length of
each segment.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add unit testing for skb segment. This function is exercised by many
different code paths, such as GSO_PARTIAL or GSO_BY_FRAGS, linear
(with or without head_frag), frags or frag_list skbs, etc.
It is infeasible to manually run tests that cover all code paths when
making changes. The long and complex function also makes it hard to
establish through analysis alone that a patch has no unintended
side-effects.
Add code coverage through kunit regression testing. Introduce kunit
infrastructure for tests under net/core, and add this first test.
This first skb_segment test exercises a simple case: a linear skb.
Follow-on patches will parametrize the test and add more variants.
Tested: Built and ran the test with
make ARCH=um mrproper
./tools/testing/kunit/kunit.py run \
--kconfig_add CONFIG_NET=y \
--kconfig_add CONFIG_DEBUG_KERNEL=y \
--kconfig_add CONFIG_DEBUG_INFO=y \
--kconfig_add=CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y \
net_core_gso
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although there is a kfree_skb_reason() helper function that can be used to
find the reason why this skb is dropped, but most callers didn't increase
one of rx_dropped, tx_dropped, rx_nohandler and rx_otherhost_dropped.
For the users, people are more concerned about why the dropped in ip
is increasing.
Introduce netdev_core_stats_inc() for trace the caller of
dev_core_stats_*_inc().
Also, add __code to netdev_core_stats_alloc(), as it's called with small
probability. And add noinline make sure netdev_core_stats_inc was never
inlined.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot uses panic_on_warn.
This means that the skb_dump() I added in the blamed commit are
not even called.
Rewrite this so that we get the needed skb dump before syzbot crashes.
Fixes: eeee4b77dc ("net: add more debug info in skb_checksum_help()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20231006173355.2254983-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend the bpf_fib_lookup() helper by making it to return the source
IPv4/IPv6 address if the BPF_FIB_LOOKUP_SRC flag is set.
For example, the following snippet can be used to derive the desired
source IP address:
struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr };
ret = bpf_skb_fib_lookup(skb, p, sizeof(p),
BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH);
if (ret != BPF_FIB_LKUP_RET_SUCCESS)
return TC_ACT_SHOT;
/* the p.ipv4_src now contains the source address */
The inability to derive the proper source address may cause malfunctions
in BPF-based dataplanes for hosts containing netdevs with more than one
routable IP address or for multi-homed hosts.
For example, Cilium implements packet masquerading in BPF. If an
egressing netdev to which the Cilium's BPF prog is attached has
multiple IP addresses, then only one [hardcoded] IP address can be used for
masquerading. This breaks connectivity if any other IP address should have
been selected instead, for example, when a public and private addresses
are attached to the same egress interface.
The change was tested with Cilium [1].
Nikolay Aleksandrov helped to figure out the IPv6 addr selection.
[1]: https://github.com/cilium/cilium/pull/28283
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Link: https://lore.kernel.org/r/20231007081415.33502-2-m@lambda.lt
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Move hash_and_copy_to_iter() to be with its only caller in networking code.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230925120309.1731676-13-dhowells@redhat.com
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Move csum_and_copy_from_iter_full() out of line and then merge
csum_and_copy_from_iter() into its only caller.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230925120309.1731676-12-dhowells@redhat.com
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Fold csum_and_memcpy() in to its callers.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230925120309.1731676-11-dhowells@redhat.com
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Move csum_and_copy_to/from_iter() to net code now that the iteration
framework can be #included.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230925120309.1731676-10-dhowells@redhat.com
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Exit early if the list is empty.
Some applications using TCP zerocopy are calling
recvmsg( ... MSG_ERRQUEUE) and hit this case quite often,
probably because busy polling only deals with sk_receive_queue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231005114504.642589-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) Exit early if the list is empty.
2) splice the list into a local list,
so that we block hard irqs only once.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231003181920.3280453-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a kthread_stop_put() helper that stops a thread and puts its task
struct. Use it to replace the various instances of kthread_stop()
followed by put_task_struct().
Remove the kthread_stop_put() macro in usbip that is similar but doesn't
return the result of kthread_stop().
[agruenba@redhat.com: fix kerneldoc comment]
Link: https://lkml.kernel.org/r/20230911111730.2565537-1-agruenba@redhat.com
[akpm@linux-foundation.org: document kthread_stop_put()'s argument]
Link: https://lkml.kernel.org/r/20230907234048.2499820-1-agruenba@redhat.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZRqk1wAKCRDbK58LschI
g8GRAQC4E0bw6BTFRl0b3MxvpZES6lU0BUtX2gKVK4tLZdXw/wEAmTlBXQqNzF3b
BkCQknVbFTSw/8l8pzUW123Fb46wUAQ=
=E3hd
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-10-02
We've added 11 non-merge commits during the last 12 day(s) which contain
a total of 12 files changed, 176 insertions(+), 41 deletions(-).
The main changes are:
1) Fix BPF verifier to reset backtrack_state masks on global function
exit as otherwise subsequent precision tracking would reuse them,
from Andrii Nakryiko.
2) Several sockmap fixes for available bytes accounting,
from John Fastabend.
3) Reject sk_msg egress redirects to non-TCP sockets given this
is only supported for TCP sockets today, from Jakub Sitnicki.
4) Fix a syzkaller splat in bpf_mprog when hitting maximum program
limits with BPF_F_BEFORE directive, from Daniel Borkmann
and Nikolay Aleksandrov.
5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust
size_index for selecting a bpf_mem_cache, from Hou Tao.
6) Fix arch_prepare_bpf_trampoline return code for s390 JIT,
from Song Liu.
7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off,
from Leon Hwang.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Use kmalloc_size_roundup() to adjust size_index
selftest/bpf: Add various selftests for program limits
bpf, mprog: Fix maximum program check on mprog attachment
bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
bpf, sockmap: Add tests for MSG_F_PEEK
bpf, sockmap: Do not inc copied_seq when PEEK flag set
bpf: tcp_read_skb needs to pop skb regardless of seq
bpf: unconditionally reset backtrack_state masks on global func exit
bpf: Fix tr dereferencing
selftests/bpf: Check bpf_cubic_acked() is called via struct_ops
s390/bpf: Let arch_prepare_bpf_trampoline return program size
====================
Link: https://lore.kernel.org/r/20231002113417.2309-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This field can be read or written without socket lock being held.
Add annotations to avoid load-store tearing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_txrehash readers are already safe against
concurrent change of this field.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SO_MAX_PACING_RATE setsockopt() does not need to hold
the socket lock, because sk->sk_pacing_rate readers
can run fine if the value is changed by other threads,
after adding READ_ONCE() accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting sk->sk_ll_usec, sk_prefer_busy_poll and sk_busy_poll_budget
do not require the socket lock, readers are lockless anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This options can not be set and return -ENOPROTOOPT,
no need to acqure socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock->flags are atomic, no need to hold the socket lock
in sk_setsockopt() for SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a followup of 8bf43be799 ("net: annotate data-races
around sk->sk_priority").
sk->sk_priority can be read and written without holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
n->output field can be read locklessly, while a writer
might change the pointer concurrently.
Add missing annotations to prevent load-store tearing.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at a related syzbot report involving neigh_periodic_work(),
I found that I forgot to add an annotation when deleting an
RCU protected item from a list.
Readers use rcu_deference(*np), we need to use either
rcu_assign_pointer() or WRITE_ONCE() on writer side
to prevent store tearing.
I use rcu_assign_pointer() to have lockdep support,
this was the choice made in neigh_flush_dev().
Fixes: 767e97e1e0 ("neigh: RCU conversion of struct neighbour")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With a SOCKMAP/SOCKHASH map and an sk_msg program user can steer messages
sent from one TCP socket (s1) to actually egress from another TCP
socket (s2):
tcp_bpf_sendmsg(s1) // = sk_prot->sendmsg
tcp_bpf_send_verdict(s1) // __SK_REDIRECT case
tcp_bpf_sendmsg_redir(s2)
tcp_bpf_push_locked(s2)
tcp_bpf_push(s2)
tcp_rate_check_app_limited(s2) // expects tcp_sock
tcp_sendmsg_locked(s2) // ditto
There is a hard-coded assumption in the call-chain, that the egress
socket (s2) is a TCP socket.
However in commit 122e6c79ef ("sock_map: Update sock type checks for
UDP") we have enabled redirects to non-TCP sockets. This was done for the
sake of BPF sk_skb programs. There was no indention to support sk_msg
send-to-egress use case.
As a result, attempts to send-to-egress through a non-TCP socket lead to a
crash due to invalid downcast from sock to tcp_sock:
BUG: kernel NULL pointer dereference, address: 000000000000002f
...
Call Trace:
<TASK>
? show_regs+0x60/0x70
? __die+0x1f/0x70
? page_fault_oops+0x80/0x160
? do_user_addr_fault+0x2d7/0x800
? rcu_is_watching+0x11/0x50
? exc_page_fault+0x70/0x1c0
? asm_exc_page_fault+0x27/0x30
? tcp_tso_segs+0x14/0xa0
tcp_write_xmit+0x67/0xce0
__tcp_push_pending_frames+0x32/0xf0
tcp_push+0x107/0x140
tcp_sendmsg_locked+0x99f/0xbb0
tcp_bpf_push+0x19d/0x3a0
tcp_bpf_sendmsg_redir+0x55/0xd0
tcp_bpf_send_verdict+0x407/0x550
tcp_bpf_sendmsg+0x1a1/0x390
inet_sendmsg+0x6a/0x70
sock_sendmsg+0x9d/0xc0
? sockfd_lookup_light+0x12/0x80
__sys_sendto+0x10e/0x160
? syscall_enter_from_user_mode+0x20/0x60
? __this_cpu_preempt_check+0x13/0x20
? lockdep_hardirqs_on+0x82/0x110
__x64_sys_sendto+0x1f/0x30
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Reject selecting a non-TCP sockets as redirect target from a BPF sk_msg
program to prevent the crash. When attempted, user will receive an EACCES
error from send/sendto/sendmsg() syscall.
Fixes: 122e6c79ef ("sock_map: Update sock type checks for UDP")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230920102055.42662-1-jakub@cloudflare.com
Currently, skbs generated by pktgen always have their reference count
incremented before transmission, causing their reference count to be
always greater than 1, leading to two issues:
1. Only the code paths for shared skbs can be tested.
2. In certain situations, skbs can only be released by pktgen.
To enhance testing comprehensiveness, we are introducing the "SHARED"
flag to indicate whether an SKB is shared. This flag is enabled by
default, aligning with the current behavior. However, disabling this
flag allows skbs with a reference count of 1 to be transmitted.
So we can test non-shared skbs and code paths where skbs are released
within the stack.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/20230920125658.46978-2-liangchen.linux@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When specifying an unknown flag, it will print all available flags.
Currently, these flags are provided as fixed strings, which requires
manual updates when flags change. Replacing it with automated flag
enumeration.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/20230920125658.46978-1-liangchen.linux@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alexei Starovoitov says:
====================
The following pull-request contains BPF updates for your *net-next* tree.
We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 79 files changed, 5275 insertions(+), 600 deletions(-).
The main changes are:
1) Basic BTF validation in libbpf, from Andrii Nakryiko.
2) bpf_assert(), bpf_throw(), exceptions in bpf progs, from Kumar Kartikeya Dwivedi.
3) next_thread cleanups, from Oleg Nesterov.
4) Add mcpu=v4 support to arm32, from Puranjay Mohan.
5) Add support for __percpu pointers in bpf progs, from Yonghong Song.
6) Fix bpf tailcall interaction with bpf trampoline, from Leon Hwang.
7) Raise irq_work in bpf_mem_alloc while irqs are disabled to improve refill probabablity, from Hou Tao.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
Thanks a lot!
Also thanks to reporters, reviewers and testers of commits in this pull-request:
Alan Maguire, Andrey Konovalov, Dave Marchevsky, "Eric W. Biederman",
Jiri Olsa, Maciej Fijalkowski, Quentin Monnet, Russell King (Oracle),
Song Liu, Stanislav Fomichev, Yonghong Song
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In case netdevice represents a SyncE port, the user needs to understand
the connection between netdevice and associated DPLL pin. There might me
multiple netdevices pointing to the same pin, in case of VF/SF
implementation.
Add a IFLA Netlink attribute to nest the DPLL pin handle, similar to
how it is implemented for devlink port. Add a struct dpll_pin pointer
to netdev and protect access to it by RTNL. Expose netdev_dpll_pin_set()
and netdev_dpll_pin_clear() helpers to the drivers so they can set/clear
the DPLL pin relationship to netdev.
Note that during the lifetime of struct dpll_pin the pin handle does not
change. Therefore it is save to access it lockless. It is drivers
responsibility to call netdev_dpll_pin_clear() before dpll_pin_put().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use bitmap_zalloc() and bitmap_free() instead of hand-writing them.
It is less verbose and it improves the type checking and semantic.
While at it, add missing header inclusion (should be bitops.h,
but with the above change it becomes bitmap.h).
Suggested-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230911154534.4174265-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It can be time consuming to track driver bugs, that might be detected
too late from this confusing warning in skb_try_coalesce()
WARN_ON_ONCE(delta < len);
Add sanity check in skb_add_rx_frag() and skb_coalesce_rx_frag()
to better track bug origin for CONFIG_DEBUG_NET=y builds.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding sk->sk_prot->release_cb() call from __sk_flush_backlog()
Paolo suggested using indirect call helpers to take care of
CONFIG_RETPOLINE=y case.
It turns out Google had such mitigation for years in release_sock(),
it is time to make this public :)
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new xdp-rx-metadata-features member to netdev netlink
which exports a bitmask of supported kfuncs. Most of the patch
is autogenerated (headers), the only relevant part is netdev.yaml
and the changes in netdev-genl.c to marshal into netlink.
Example output on veth:
$ ip link add veth0 type veth peer name veth1 # ifndex == 12
$ ./tools/net/ynl/samples/netdev 12
Select ifc ($ifindex; or 0 = dump; or -2 ntf check): 12
veth1[12] xdp-features (23): basic redirect rx-sg xdp-rx-metadata-features (3): timestamp hash xdp-zc-max-segs=0
Cc: netdev@vger.kernel.org
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230913171350.369987-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
No functional changes.
Instead of having hand-crafted code in bpf_dev_bound_resolve_kfunc,
move kfunc <> xmo handler relationship into XDP_METADATA_KFUNC_xxx.
This way, any time new kfunc is added, we don't have to touch
bpf_dev_bound_resolve_kfunc.
Also document XDP_METADATA_KFUNC_xxx arguments since we now have
more than two and it might be confusing what is what.
Cc: netdev@vger.kernel.org
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230913171350.369987-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Add inet6_{test|set|clear|assign}_bit() helpers.
Note that I am using bits from inet->inet_flags,
this might change in the future if we need more flags.
While solving data-races accessing np->mc_loop,
this patch also allows to implement lockless accesses
to np->mcast_hops in the following patch.
Also constify sk_mc_loop() argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
__sk_flush_backlog() / sk_flush_backlog() are used
when TCP recvmsg()/sendmsg() process large chunks,
to not let packets in the backlog too long.
It makes sense to call tcp_release_cb() to also
process actions held in sk->sk_tsq_flags for smoother
scheduling.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This partially reverts c3f9b01849 ("tcp: tcp_release_cb()
should release socket ownership").
prequeue has been removed by Florian in commit e7942d0633
("tcp: remove prequeue support")
__tcp_checksum_complete_user() being gone, we no longer
have to release socket ownership in tcp_release_cb().
This is a prereq for third patch in the series
("net: call prot->release_cb() when processing backlog").
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Since commit 1202cdd66531("Remove DECnet support from kernel") has been
merged, all callers pass in the initial_ref value of 1 when they call
dst_alloc(). Therefore, remove initial_ref when the dst_alloc() is
declared and replace initial_ref with 1 in dst_alloc().
Also when all callers call dst_init(), the value of initial_ref is 1.
Therefore, remove the input parameter initial_ref of the dst_init() and
replace initial_ref with the value 1 in dst_init.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20230911125045.346390-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Current release - regressions:
- eth: stmmac: fix failure to probe without MAC interface specified
Current release - new code bugs:
- docs: netlink: fix missing classic_netlink doc reference
Previous releases - regressions:
- deal with integer overflows in kmalloc_reserve()
- use sk_forward_alloc_get() in sk_get_meminfo()
- bpf_sk_storage: fix the missing uncharge in sk_omem_alloc
- fib: avoid warn splat in flow dissector after packet mangling
- skb_segment: call zero copy functions before using skbuff frags
- eth: sfc: check for zero length in EF10 RX prefix
Previous releases - always broken:
- af_unix: fix msg_controllen test in scm_pidfd_recv() for
MSG_CMSG_COMPAT
- xsk: fix xsk_build_skb() dereferencing possible ERR_PTR()
- netfilter:
- nft_exthdr: fix non-linear header modification
- xt_u32, xt_sctp: validate user space input
- nftables: exthdr: fix 4-byte stack OOB write
- nfnetlink_osf: avoid OOB read
- one more fix for the garbage collection work from last release
- igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU
- bpf, sockmap: fix preempt_rt splat when using raw_spin_lock_t
- handshake: fix null-deref in handshake_nl_done_doit()
- ip: ignore dst hint for multipath routes to ensure packets
are hashed across the nexthops
- phy: micrel:
- correct bit assignments for cable test errata
- disable EEE according to the KSZ9477 errata
Misc:
- docs/bpf: document compile-once-run-everywhere (CO-RE) relocations
- Revert "net: macsec: preserve ingress frame ordering", it appears
to have been developed against an older kernel, problem doesn't
exist upstream
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmT6R6wACgkQMUZtbf5S
IrsmTg//TgmRjxSZ0lrPQtJwZR/eN3ZR2oQG3rwnssCx+YgHEGGxQsfT4KHEMacR
ZgGDZVTpthUJkkACBPi8ZMoy++RdjEmlCcanfeDkGHoYGtiX1lhkofhLMn1KUHbI
rIbP9EdNKxQT0SsBlw/U28pD5jKyqOgL23QobEwmcjLTdMpamb+qIsD6/xNv9tEj
Tu4BdCIkhjxnBD622hsE3pFTG7oSn2WM6rf5NT1E43mJ3W8RrMcydSB27J7Oryo9
l3nYMAhz0vQINS2WQ9eCT1/7GI6gg1nDtxFtrnV7ASvxayRBPIUr4kg1vT+Tixsz
CZMnwVamEBIYl9agmj7vSji7d5nOUgXPhtWhwWUM2tRoGdeGw3vSi1pgDvRiUCHE
PJ4UHv7goa2AgnOlOQCFtRybAu+9nmSGm7V+GkeGLnH7xbFsEa5smQ/+FSPJs8Dn
Yf4q5QAhdN8tdnofRlrN/nCssoDF3cfmBsTJ7wo5h71gW+BWhsP58eDCJlXd/r8k
+Qnvoe2kw27ktFR1tjsUDZ0AcSmeVARNwmXCOBYZsG4tEek8pLyj008mDvJvdfyn
PGPn7Eo5DyaERlHVmPuebHXSyniDEPe2GLTmlHcGiRpGspoUHbB+HRiDAuRLMB9g
pkL8RHpNfppnuUXeUoNy3rgEkYwlpTjZX0QHC6N8NQ76ccB6CNM=
=YpmE
-----END PGP SIGNATURE-----
Merge tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking updates from Jakub Kicinski:
"Including fixes from netfilter and bpf.
Current release - regressions:
- eth: stmmac: fix failure to probe without MAC interface specified
Current release - new code bugs:
- docs: netlink: fix missing classic_netlink doc reference
Previous releases - regressions:
- deal with integer overflows in kmalloc_reserve()
- use sk_forward_alloc_get() in sk_get_meminfo()
- bpf_sk_storage: fix the missing uncharge in sk_omem_alloc
- fib: avoid warn splat in flow dissector after packet mangling
- skb_segment: call zero copy functions before using skbuff frags
- eth: sfc: check for zero length in EF10 RX prefix
Previous releases - always broken:
- af_unix: fix msg_controllen test in scm_pidfd_recv() for
MSG_CMSG_COMPAT
- xsk: fix xsk_build_skb() dereferencing possible ERR_PTR()
- netfilter:
- nft_exthdr: fix non-linear header modification
- xt_u32, xt_sctp: validate user space input
- nftables: exthdr: fix 4-byte stack OOB write
- nfnetlink_osf: avoid OOB read
- one more fix for the garbage collection work from last release
- igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU
- bpf, sockmap: fix preempt_rt splat when using raw_spin_lock_t
- handshake: fix null-deref in handshake_nl_done_doit()
- ip: ignore dst hint for multipath routes to ensure packets are
hashed across the nexthops
- phy: micrel:
- correct bit assignments for cable test errata
- disable EEE according to the KSZ9477 errata
Misc:
- docs/bpf: document compile-once-run-everywhere (CO-RE) relocations
- Revert "net: macsec: preserve ingress frame ordering", it appears
to have been developed against an older kernel, problem doesn't
exist upstream"
* tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
net: enetc: distinguish error from valid pointers in enetc_fixup_clear_rss_rfs()
Revert "net: team: do not use dynamic lockdep key"
net: hns3: remove GSO partial feature bit
net: hns3: fix the port information display when sfp is absent
net: hns3: fix invalid mutex between tc qdisc and dcb ets command issue
net: hns3: fix debugfs concurrency issue between kfree buffer and read
net: hns3: fix byte order conversion issue in hclge_dbg_fd_tcam_read()
net: hns3: Support query tx timeout threshold by debugfs
net: hns3: fix tx timeout issue
net: phy: Provide Module 4 KSZ9477 errata (DS80000754C)
netfilter: nf_tables: Unbreak audit log reset
netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID
netfilter: nfnetlink_osf: avoid OOB read
netfilter: nftables: exthdr: fix 4-byte stack OOB write
selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc
bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
bpf: bpf_sk_storage: Fix invalid wait context lockdep report
s390/bpf: Pass through tail call counter in trampolines
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZPhEYwAKCRDbK58LschI
g6puAQCGLE9S2d5xw68Fq2KLrTjTXC5trB9N25NWvZWqgXHAeQEA3wXxTC8eF0dG
J24RwZC2vOC4hF3aNGeI0fv/6j7SKA0=
=0IU4
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-09-06
We've added 9 non-merge commits during the last 6 day(s) which contain
a total of 12 files changed, 189 insertions(+), 44 deletions(-).
The main changes are:
1) Fix bpf_sk_storage to address an invalid wait context lockdep
report and another one to address missing omem uncharge,
from Martin KaFai Lau.
2) Two BPF recursion detection related fixes,
from Sebastian Andrzej Siewior.
3) Fix tailcall limit enforcement in trampolines for s390 JIT,
from Ilya Leoshkevich.
4) Fix a sockmap refcount race where skbs in sk_psock_backlog can
be referenced after user space side has already skb_consumed them,
from John Fastabend.
5) Fix BPF CI flake/race wrt sockmap vsock write test where
the transport endpoint is not connected, from Xu Kuohai.
6) Follow-up doc fix to address a cross-link warning,
from Eduard Zingerman.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc
bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
bpf: bpf_sk_storage: Fix invalid wait context lockdep report
s390/bpf: Pass through tail call counter in trampolines
bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check.
bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf().
bpf, sockmap: Fix skb refcnt race after locking changes
docs/bpf: Fix "file doesn't exist" warnings in {llvm_reloc,btf}.rst
selftests/bpf: Fix a CI failure caused by vsock write
====================
Link: https://lore.kernel.org/r/20230906095117.16941-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__skb_get_hash_symmetric() was added to compute a symmetric hash over
the protocol, addresses and transport ports, by commit eb70db8756
("packet: Use symmetric hash for PACKET_FANOUT_HASH."). It uses
flow_keys_dissector_symmetric_keys as the flow_dissector to incorporate
IPv4 addresses, IPv6 addresses and ports. However, it should not specify
the flag as FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL, which stops further
dissection when an IPv6 flow label is encountered, making transport
ports not being incorporated in such case.
As a consequence, the symmetric hash is based on 5-tuple for IPv4 but
3-tuple for IPv6 when flow label is present. It caused a few problems,
e.g. when nft symhash and openvswitch l4_sym rely on the symmetric hash
to perform load balancing as different L4 flows between two given IPv6
addresses would always get the same symmetric hash, leading to uneven
traffic distribution.
Removing the use of FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL makes sure the
symmetric hash is based on 5-tuple for both IPv4 and IPv6 consistently.
Fixes: eb70db8756 ("packet: Use symmetric hash for PACKET_FANOUT_HASH.")
Reported-by: Lars Ekman <uablrek@gmail.com>
Closes: https://github.com/antrea-io/antrea/issues/5457
Signed-off-by: Quan Tian <qtian@vmware.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As with sk->sk_shutdown shown in the previous patch, sk->sk_err can be
read locklessly by unix_dgram_sendmsg().
Let's use READ_ONCE() for sk_err as well.
Note that the writer side is marked by commit cc04410af7 ("af_unix:
annotate lockless accesses to sk->sk_err").
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race where skb's from the sk_psock_backlog can be referenced
after userspace side has already skb_consumed() the sk_buff and its refcnt
dropped to zer0 causing use after free.
The flow is the following:
while ((skb = skb_peek(&psock->ingress_skb))
sk_psock_handle_Skb(psock, skb, ..., ingress)
if (!ingress) ...
sk_psock_skb_ingress
sk_psock_skb_ingress_enqueue(skb)
msg->skb = skb
sk_psock_queue_msg(psock, msg)
skb_dequeue(&psock->ingress_skb)
The sk_psock_queue_msg() puts the msg on the ingress_msg queue. This is
what the application reads when recvmsg() is called. An application can
read this anytime after the msg is placed on the queue. The recvmsg hook
will also read msg->skb and then after user space reads the msg will call
consume_skb(skb) on it effectively free'ing it.
But, the race is in above where backlog queue still has a reference to
the skb and calls skb_dequeue(). If the skb_dequeue happens after the
user reads and free's the skb we have a use after free.
The !ingress case does not suffer from this problem because it uses
sendmsg_*(sk, msg) which does not pass the sk_buff further down the
stack.
The following splat was observed with 'test_progs -t sockmap_listen':
[ 1022.710250][ T2556] general protection fault, ...
[...]
[ 1022.712830][ T2556] Workqueue: events sk_psock_backlog
[ 1022.713262][ T2556] RIP: 0010:skb_dequeue+0x4c/0x80
[ 1022.713653][ T2556] Code: ...
[...]
[ 1022.720699][ T2556] Call Trace:
[ 1022.720984][ T2556] <TASK>
[ 1022.721254][ T2556] ? die_addr+0x32/0x80^M
[ 1022.721589][ T2556] ? exc_general_protection+0x25a/0x4b0
[ 1022.722026][ T2556] ? asm_exc_general_protection+0x22/0x30
[ 1022.722489][ T2556] ? skb_dequeue+0x4c/0x80
[ 1022.722854][ T2556] sk_psock_backlog+0x27a/0x300
[ 1022.723243][ T2556] process_one_work+0x2a7/0x5b0
[ 1022.723633][ T2556] worker_thread+0x4f/0x3a0
[ 1022.723998][ T2556] ? __pfx_worker_thread+0x10/0x10
[ 1022.724386][ T2556] kthread+0xfd/0x130
[ 1022.724709][ T2556] ? __pfx_kthread+0x10/0x10
[ 1022.725066][ T2556] ret_from_fork+0x2d/0x50
[ 1022.725409][ T2556] ? __pfx_kthread+0x10/0x10
[ 1022.725799][ T2556] ret_from_fork_asm+0x1b/0x30
[ 1022.726201][ T2556] </TASK>
To fix we add an skb_get() before passing the skb to be enqueued in the
engress queue. This bumps the skb->users refcnt so that consume_skb()
and kfree_skb will not immediately free the sk_buff. With this we can
be sure the skb is still around when we do the dequeue. Then we just
need to decrement the refcnt or free the skb in the backlog case which
we do by calling kfree_skb() on the ingress case as well as the sendmsg
case.
Before locking change from fixes tag we had the sock locked so we
couldn't race with user and there was no issue here.
Fixes: 799aa7f98d ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Reported-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Xu Kuohai <xukuohai@huawei.com>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230901202137.214666-1-john.fastabend@gmail.com
Blamed commit changed:
ptr = kmalloc(size);
if (ptr)
size = ksize(ptr);
to:
size = kmalloc_size_roundup(size);
ptr = kmalloc(size);
This allowed various crash as reported by syzbot [1]
and Kyle Zeng.
Problem is that if @size is bigger than 0x80000001,
kmalloc_size_roundup(size) returns 2^32.
kmalloc_reserve() uses a 32bit variable (obj_size),
so 2^32 is truncated to 0.
kmalloc(0) returns ZERO_SIZE_PTR which is not handled by
skb allocations.
Following trace can be triggered if a netdev->mtu is set
close to 0x7fffffff
We might in the future limit netdev->mtu to more sensible
limit (like KMALLOC_MAX_SIZE).
This patch is based on a syzbot report, and also a report
and tentative fix from Kyle Zeng.
[1]
BUG: KASAN: user-memory-access in __build_skb_around net/core/skbuff.c:294 [inline]
BUG: KASAN: user-memory-access in __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
Write of size 32 at addr 00000000fffffd10 by task syz-executor.4/22554
CPU: 1 PID: 22554 Comm: syz-executor.4 Not tainted 6.1.39-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
Call trace:
dump_backtrace+0x1c8/0x1f4 arch/arm64/kernel/stacktrace.c:279
show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:286
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x120/0x1a0 lib/dump_stack.c:106
print_report+0xe4/0x4b4 mm/kasan/report.c:398
kasan_report+0x150/0x1ac mm/kasan/report.c:495
kasan_check_range+0x264/0x2a4 mm/kasan/generic.c:189
memset+0x40/0x70 mm/kasan/shadow.c:44
__build_skb_around net/core/skbuff.c:294 [inline]
__alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
alloc_skb include/linux/skbuff.h:1316 [inline]
igmpv3_newpack+0x104/0x1088 net/ipv4/igmp.c:359
add_grec+0x81c/0x1124 net/ipv4/igmp.c:534
igmpv3_send_cr net/ipv4/igmp.c:667 [inline]
igmp_ifc_timer_expire+0x1b0/0x1008 net/ipv4/igmp.c:810
call_timer_fn+0x1c0/0x9f0 kernel/time/timer.c:1474
expire_timers kernel/time/timer.c:1519 [inline]
__run_timers+0x54c/0x710 kernel/time/timer.c:1790
run_timer_softirq+0x28/0x4c kernel/time/timer.c:1803
_stext+0x380/0xfbc
____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:79
call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:891
do_softirq_own_stack+0x20/0x2c arch/arm64/kernel/irq.c:84
invoke_softirq kernel/softirq.c:437 [inline]
__irq_exit_rcu+0x1c0/0x4cc kernel/softirq.c:683
irq_exit_rcu+0x14/0x78 kernel/softirq.c:695
el0_interrupt+0x7c/0x2e0 arch/arm64/kernel/entry-common.c:717
__el0_irq_handler_common+0x18/0x24 arch/arm64/kernel/entry-common.c:724
el0t_64_irq_handler+0x10/0x1c arch/arm64/kernel/entry-common.c:729
el0t_64_irq+0x1a0/0x1a4 arch/arm64/kernel/entry.S:584
Fixes: 12d6c1d3a2 ("skbuff: Proactively round up to kmalloc bucket size")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit bf5c25d608 ("skbuff: in skb_segment, call zerocopy functions
once per nskb") added the call to zero copy functions in skb_segment().
The change introduced a bug in skb_segment() because skb_orphan_frags()
may possibly change the number of fragments or allocate new fragments
altogether leaving nrfrags and frag to point to the old values. This can
cause a panic with stacktrace like the one below.
[ 193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc
[ 193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G O 5.15.123+ #26
[ 193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0
[ 194.021892] Call Trace:
[ 194.027422] <TASK>
[ 194.072861] tcp_gso_segment+0x107/0x540
[ 194.082031] inet_gso_segment+0x15c/0x3d0
[ 194.090783] skb_mac_gso_segment+0x9f/0x110
[ 194.095016] __skb_gso_segment+0xc1/0x190
[ 194.103131] netem_enqueue+0x290/0xb10 [sch_netem]
[ 194.107071] dev_qdisc_enqueue+0x16/0x70
[ 194.110884] __dev_queue_xmit+0x63b/0xb30
[ 194.121670] bond_start_xmit+0x159/0x380 [bonding]
[ 194.128506] dev_hard_start_xmit+0xc3/0x1e0
[ 194.131787] __dev_queue_xmit+0x8a0/0xb30
[ 194.138225] macvlan_start_xmit+0x4f/0x100 [macvlan]
[ 194.141477] dev_hard_start_xmit+0xc3/0x1e0
[ 194.144622] sch_direct_xmit+0xe3/0x280
[ 194.147748] __dev_queue_xmit+0x54a/0xb30
[ 194.154131] tap_get_user+0x2a8/0x9c0 [tap]
[ 194.157358] tap_sendmsg+0x52/0x8e0 [tap]
[ 194.167049] handle_tx_zerocopy+0x14e/0x4c0 [vhost_net]
[ 194.173631] handle_tx+0xcd/0xe0 [vhost_net]
[ 194.176959] vhost_worker+0x76/0xb0 [vhost]
[ 194.183667] kthread+0x118/0x140
[ 194.190358] ret_from_fork+0x1f/0x30
[ 194.193670] </TASK>
In this case calling skb_orphan_frags() updated nr_frags leaving nrfrags
local variable in skb_segment() stale. This resulted in the code hitting
i >= nrfrags prematurely and trying to move to next frag_skb using
list_skb pointer, which was NULL, and caused kernel panic. Move the call
to zero copy functions before using frags and nr_frags.
Fixes: bf5c25d608 ("skbuff: in skb_segment, call zerocopy functions once per nskb")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reported-by: Amit Goyal <agoyal@purestorage.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_bind_phc is read locklessly. Add corresponding annotations.
Fixes: d463126e23 ("net: sock: extend SO_TIMESTAMPING for PHC binding")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_tsflags can be read locklessly, add corresponding annotations.
Fixes: b9f40e21ef ("net-timestamp: move timestamp flags out of sk_flags")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every time sk->sk_forward_alloc is read locklessly,
add a READ_ONCE().
Add sk_forward_alloc_add() helper to centralize updates,
to reduce number of WRITE_ONCE().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_sk_diag_fill() has been changed to use sk_forward_alloc_get(),
but sk_get_meminfo() was forgotten.
Fixes: 292e6077b0 ("net: introduce sk_forward_alloc_get()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZPD2qwAKCRDbK58LschI
gzy9APoCsV3B0rJCX2PnxoKmx7ZwAbEhWRHN3iDAGgEOwuAdLQEAi1Mafivr/4Rr
WLi6AQOy+Erv7dAQRq2KbR2yE8rkEgg=
=BJ9X
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-08-31
We've added 15 non-merge commits during the last 3 day(s) which contain
a total of 17 files changed, 468 insertions(+), 97 deletions(-).
The main changes are:
1) BPF selftest fixes: one flake and one related to clang18 testing,
from Yonghong Song.
2) Fix a d_path BPF selftest failure after fast-forward from Linus'
tree, from Jiri Olsa.
3) Fix a preempt_rt splat in sockmap when using raw_spin_lock_t,
from John Fastabend.
4) Fix a xsk_diag_fill use-after-free race during socket cleanup,
from Magnus Karlsson.
5) Fix xsk_build_skb to address a buggy dereference of an ERR_PTR(),
from Tirthendu Sarkar.
6) Fix a bpftool build warning when compiled with -Wtype-limits,
from Yafang Shao.
7) Several misc fixes and cleanups in standardization docs,
from David Vernet.
8) Fix BPF selftest install to consider no_alu32/cpuv4/bpf-gcc flavors,
from Björn Töpel.
9) Annotate a data race in bpf_long_memcpy for KCSAN, from Daniel Borkmann.
10) Extend documentation with a description for CO-RE relocations,
from Eduard Zingerman.
11) Fix several invalid escape sequence warnings in bpf_doc.py script,
from Vishal Chourasia.
12) Fix the instruction set doc wrt offset of BPF-to-BPF call,
from Will Hawkins.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Include build flavors for install target
bpf: Annotate bpf_long_memcpy with data_race
selftests/bpf: Fix d_path test
bpf, docs: Fix invalid escape sequence warnings in bpf_doc.py
xsk: Fix xsk_diag use-after-free error during socket cleanup
bpf, docs: s/eBPF/BPF in standards documents
bpf, docs: Add abi.rst document to standardization subdirectory
bpf, docs: Move linux-notes.rst to root bpf docs tree
bpf, sockmap: Fix preempt_rt splat when using raw_spin_lock_t
docs/bpf: Add description for CO-RE relocations
bpf, docs: Correct source of offset for program-local call
selftests/bpf: Fix flaky cgroup_iter_sleepable subtest
xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR()
bpftool: Fix build warnings with -Wtype-limits
bpf: Prevent inlining of bpf_fentry_test7()
====================
Link: https://lore.kernel.org/r/20230831210019.14417-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sockmap and sockhash maps are a collection of psocks that are
objects representing a socket plus a set of metadata needed
to manage the BPF programs associated with the socket. These
maps use the stab->lock to protect from concurrent operations
on the maps, e.g. trying to insert to objects into the array
at the same time in the same slot. Additionally, a sockhash map
has a bucket lock to protect iteration and insert/delete into
the hash entry.
Each psock has a psock->link which is a linked list of all the
maps that a psock is attached to. This allows a psock (socket)
to be included in multiple sockmap and sockhash maps. This
linked list is protected the psock->link_lock.
They _must_ be nested correctly to avoid deadlock:
lock(stab->lock)
: do BPF map operations and psock insert/delete
lock(psock->link_lock)
: add map to psock linked list of maps
unlock(psock->link_lock)
unlock(stab->lock)
For non PREEMPT_RT kernels both raw_spin_lock_t and spin_lock_t
are guaranteed to not sleep. But, with PREEMPT_RT kernels the
spin_lock_t variants may sleep. In the current code we have
many patterns like this:
rcu_critical_section:
raw_spin_lock(stab->lock)
spin_lock(psock->link_lock) <- may sleep ouch
spin_unlock(psock->link_lock)
raw_spin_unlock(stab->lock)
rcu_critical_section
Nesting spin_lock() inside a raw_spin_lock() violates locking
rules for PREEMPT_RT kernels. And additionally we do alloc(GFP_ATOMICS)
inside the stab->lock, but those might sleep on PREEMPT_RT kernels.
The result is splats like this:
./test_progs -t sockmap_basic
[ 33.344330] bpf_testmod: loading out-of-tree module taints kernel.
[ 33.441933]
[ 33.442089] =============================
[ 33.442421] [ BUG: Invalid wait context ]
[ 33.442763] 6.5.0-rc5-01731-gec0ded2e0282 #4958 Tainted: G O
[ 33.443320] -----------------------------
[ 33.443624] test_progs/2073 is trying to lock:
[ 33.443960] ffff888102a1c290 (&psock->link_lock){....}-{3:3}, at: sock_map_update_common+0x2c2/0x3d0
[ 33.444636] other info that might help us debug this:
[ 33.444991] context-{5:5}
[ 33.445183] 3 locks held by test_progs/2073:
[ 33.445498] #0: ffff88811a208d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: sock_map_update_elem_sys+0xff/0x330
[ 33.446159] #1: ffffffff842539e0 (rcu_read_lock){....}-{1:3}, at: sock_map_update_elem_sys+0xf5/0x330
[ 33.446809] #2: ffff88810d687240 (&stab->lock){+...}-{2:2}, at: sock_map_update_common+0x177/0x3d0
[ 33.447445] stack backtrace:
[ 33.447655] CPU: 10 PID
To fix observe we can't readily remove the allocations (for that
we would need to use/create something similar to bpf_map_alloc). So
convert raw_spin_lock_t to spin_lock_t. We note that sock_map_update
that would trigger the allocate and potential sleep is only allowed
through sys_bpf ops and via sock_ops which precludes hw interrupts
and low level atomic sections in RT preempt kernel. On non RT
preempt kernel there are no changes here and spin locks sections
and alloc(GFP_ATOMIC) are still not sleepable.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230830053517.166611-1-john.fastabend@gmail.com
Long ago we set out to remove the kitchen sink on kernel/sysctl.c arrays and
placings sysctls to their own sybsystem or file to help avoid merge conflicts.
Matthew Wilcox pointed out though that if we're going to do that we might as
well also *save* space while at it and try to remove the extra last sysctl
entry added at the end of each array, a sentintel, instead of bloating the
kernel by adding a new sentinel with each array moved.
Doing that was not so trivial, and has required slowing down the moves of
kernel/sysctl.c arrays and measuring the impact on size by each new move.
The complex part of the effort to help reduce the size of each sysctl is being
done by the patient work of el señor Don Joel Granados. A lot of this is truly
painful code refactoring and testing and then trying to measure the savings of
each move and removing the sentinels. Although Joel already has code which does
most of this work, experience with sysctl moves in the past shows is we need to
be careful due to the slew of odd build failures that are possible due to the
amount of random Kconfig options sysctls use.
To that end Joel's work is split by first addressing the major housekeeping
needed to remove the sentinels, which is part of this merge request. The rest
of the work to actually remove the sentinels will be done later in future
kernel releases.
At first I was only going to send his first 7 patches of his patch series,
posted 1 month ago, but in retrospect due to the testing the changes have
received in linux-next and the minor changes they make this goes with the
entire set of patches Joel had planned: just sysctl house keeping. There are
networking changes but these are part of the house keeping too.
The preliminary math is showing this will all help reduce the overall build
time size of the kernel and run time memory consumed by the kernel by about
~64 bytes per array where we are able to remove each sentinel in the future.
That also means there is no more bloating the kernel with the extra ~64 bytes
per array moved as no new sentinels are created.
Most of this has been in linux-next for about a month, the last 7 patches took
a minor refresh 2 week ago based on feedback.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmTuVnMSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinIckP/imvRlfkO6L0IP7MmJBRPtwY01rsTAKO
Q14dZ//bG4DVQeGl1FdzrF6hhuLgekU0qW1YDFIWiCXO7CbaxaNBPSUkeW6ReVoC
R/VHNUPxSR1PWQy1OTJV2t4XKri2sB7ijmUsfsATtISwhei9bggTHEysShtP4tv+
U87DzhoqMnbYIsfMo49KCqOa1Qm7TmjC1a7WAp6Fph3GJuXAzZR5pXpsd0NtOZ9x
Ud5RT22icnQpMl7K+yPsqY6XcS5JkgBe/WbSzMAUkYZvBZFBq9t2D+OW5h9TZMhw
piJWQ9X0Rm7qI2D15mJfXwaOhhyDhWci391hzdJmS6DI0prf6Ma2NFdAWOt/zomI
uiRujS4bGeBUaK5F4TX2WQ1+jdMtAZ+0FncFnzt4U8q7dzUc91uVCm6iHW3gcfAb
N7OEg2ZL0gkkgCZHqKxN8wpNQiC2KwnNk+HLAbnL2a/oJYfBtdopQmlxWfrN2hpF
xxROiENqk483BRdMXDq6DR/gyDZmZWCobXIglSzlqCOjCOcLbDziIJ7pJk83ok09
h/QnXTYHf9protBq9OIQesgh2pwNzBBLifK84KZLKcb7IbdIKjpQrW5STp04oNGf
wcGJzEz8tXUe0UKyMM47AcHQGzIy6cdXNLjyF8a+m7rnZzr1ndnMqZyRStZzuQin
AUg2VWHKPmW9
=sq2p
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"Long ago we set out to remove the kitchen sink on kernel/sysctl.c
arrays and placings sysctls to their own sybsystem or file to help
avoid merge conflicts. Matthew Wilcox pointed out though that if we're
going to do that we might as well also *save* space while at it and
try to remove the extra last sysctl entry added at the end of each
array, a sentintel, instead of bloating the kernel by adding a new
sentinel with each array moved.
Doing that was not so trivial, and has required slowing down the moves
of kernel/sysctl.c arrays and measuring the impact on size by each new
move.
The complex part of the effort to help reduce the size of each sysctl
is being done by the patient work of el señor Don Joel Granados. A lot
of this is truly painful code refactoring and testing and then trying
to measure the savings of each move and removing the sentinels.
Although Joel already has code which does most of this work,
experience with sysctl moves in the past shows is we need to be
careful due to the slew of odd build failures that are possible due to
the amount of random Kconfig options sysctls use.
To that end Joel's work is split by first addressing the major
housekeeping needed to remove the sentinels, which is part of this
merge request. The rest of the work to actually remove the sentinels
will be done later in future kernel releases.
The preliminary math is showing this will all help reduce the overall
build time size of the kernel and run time memory consumed by the
kernel by about ~64 bytes per array where we are able to remove each
sentinel in the future. That also means there is no more bloating the
kernel with the extra ~64 bytes per array moved as no new sentinels
are created"
* tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: Use ctl_table_size as stopping criteria for list macro
sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl
vrf: Update to register_net_sysctl_sz
networking: Update to register_net_sysctl_sz
netfilter: Update to register_net_sysctl_sz
ax.25: Update to register_net_sysctl_sz
sysctl: Add size to register_net_sysctl function
sysctl: Add size arg to __register_sysctl_init
sysctl: Add size to register_sysctl
sysctl: Add a size arg to __register_sysctl_table
sysctl: Add size argument to init_header
sysctl: Add ctl_table_size to ctl_table_header
sysctl: Use ctl_table_header in list_for_each_table_entry
sysctl: Prefer ctl_table_header in proc_sysctl
While looking at TC_ACT_* handling, the TC_ACT_CONSUMED is only handled in
sch_handle_ingress but not sch_handle_egress. This was added via cd11b16407
("net/tc: introduce TC_ACT_REINSERT.") and e5cf1baf92 ("act_mirred: use
TC_ACT_REINSERT when possible") and later got renamed into TC_ACT_CONSUMED
via 720f22fed8 ("net: sched: refactor reinsert action").
The initial work was targeted for ovs back then and only needed on ingress,
and the mirred action module also restricts it to only that. However, given
it's an API contract it would still make sense to make this consistent to
sch_handle_ingress and handle it on egress side in the same way, that is,
setting return code to "success" and returning NULL back to the caller as
otherwise an action module sitting on egress returning TC_ACT_CONSUMED could
lead to an UAF when untreated.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a memory leak for the tc egress path with TC_ACT_{STOLEN,QUEUED,TRAP}:
[...]
unreferenced object 0xffff88818bcb4f00 (size 232):
comm "softirq", pid 0, jiffies 4299085078 (age 134.028s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 80 70 61 81 88 ff ff 00 41 31 14 81 88 ff ff ..pa.....A1.....
backtrace:
[<ffffffff9991b938>] kmem_cache_alloc_node+0x268/0x400
[<ffffffff9b3d9231>] __alloc_skb+0x211/0x2c0
[<ffffffff9b3f0c7e>] alloc_skb_with_frags+0xbe/0x6b0
[<ffffffff9b3bf9a9>] sock_alloc_send_pskb+0x6a9/0x870
[<ffffffff9b6b3f00>] __ip_append_data+0x14d0/0x3bf0
[<ffffffff9b6ba24e>] ip_append_data+0xee/0x190
[<ffffffff9b7e1496>] icmp_push_reply+0xa6/0x470
[<ffffffff9b7e4030>] icmp_reply+0x900/0xa00
[<ffffffff9b7e42e3>] icmp_echo.part.0+0x1a3/0x230
[<ffffffff9b7e444d>] icmp_echo+0xcd/0x190
[<ffffffff9b7e9566>] icmp_rcv+0x806/0xe10
[<ffffffff9b699bd1>] ip_protocol_deliver_rcu+0x351/0x3d0
[<ffffffff9b699f14>] ip_local_deliver_finish+0x2b4/0x450
[<ffffffff9b69a234>] ip_local_deliver+0x174/0x1f0
[<ffffffff9b69a4b2>] ip_sublist_rcv_finish+0x1f2/0x420
[<ffffffff9b69ab56>] ip_sublist_rcv+0x466/0x920
[...]
I was able to reproduce this via:
ip link add dev dummy0 type dummy
ip link set dev dummy0 up
tc qdisc add dev eth0 clsact
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip protocol 1 0xff action mirred egress redirect dev dummy0
ping 1.1.1.1
<stolen>
After the fix, there are no kmemleak reports with the reproducer. This is
in line with what is also done on the ingress side, and from debugging the
skb_unref(skb) on dummy xmit and sch_handle_egress() side, it is visible
that these are two different skbs with both skb_unref(skb) as true. The two
seen skbs are due to mirred doing a skb_clone() internally as use_reinsert
is false in tcf_mirred_act() for egress. This was initially reported by Gal.
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/bdfc2640-8f65-5b56-4472-db8e2b161aab@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZOjkTAAKCRDbK58LschI
gx32AP9gaaHFBtOYBfoenKTJfMgv1WhtQHIBas+WN9ItmBx9MAEA4gm/VyQ6oD7O
EBjJKJQ2CZ/QKw7cNacXw+l5jF7/+Q0=
=8P7g
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-08-25
We've added 87 non-merge commits during the last 8 day(s) which contain
a total of 104 files changed, 3719 insertions(+), 4212 deletions(-).
The main changes are:
1) Add multi uprobe BPF links for attaching multiple uprobes
and usdt probes, which is significantly faster and saves extra fds,
from Jiri Olsa.
2) Add support BPF cpu v4 instructions for arm64 JIT compiler,
from Xu Kuohai.
3) Add support BPF cpu v4 instructions for riscv64 JIT compiler,
from Pu Lehui.
4) Fix LWT BPF xmit hooks wrt their return values where propagating
the result from skb_do_redirect() would trigger a use-after-free,
from Yan Zhai.
5) Fix a BPF verifier issue related to bpf_kptr_xchg() with local kptr
where the map's value kptr type and locally allocated obj type
mismatch, from Yonghong Song.
6) Fix BPF verifier's check_func_arg_reg_off() function wrt graph
root/node which bypassed reg->off == 0 enforcement,
from Kumar Kartikeya Dwivedi.
7) Lift BPF verifier restriction in networking BPF programs to treat
comparison of packet pointers not as a pointer leak,
from Yafang Shao.
8) Remove unmaintained XDP BPF samples as they are maintained
in xdp-tools repository out of tree, from Toke Høiland-Jørgensen.
9) Batch of fixes for the tracing programs from BPF samples in order
to make them more libbpf-aware, from Daniel T. Lee.
10) Fix a libbpf signedness determination bug in the CO-RE relocation
handling logic, from Andrii Nakryiko.
11) Extend libbpf to support CO-RE kfunc relocations. Also follow-up
fixes for bpf_refcount shared ownership implementation,
both from Dave Marchevsky.
12) Add a new bpf_object__unpin() API function to libbpf,
from Daniel Xu.
13) Fix a memory leak in libbpf to also free btf_vmlinux
when the bpf_object gets closed, from Hao Luo.
14) Small error output improvements to test_bpf module, from Helge Deller.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (87 commits)
selftests/bpf: Add tests for rbtree API interaction in sleepable progs
bpf: Allow bpf_spin_{lock,unlock} in sleepable progs
bpf: Consider non-owning refs to refcounted nodes RCU protected
bpf: Reenable bpf_refcount_acquire
bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes
bpf: Consider non-owning refs trusted
bpf: Ensure kptr_struct_meta is non-NULL for collection insert and refcount_acquire
selftests/bpf: Enable cpu v4 tests for RV64
riscv, bpf: Support unconditional bswap insn
riscv, bpf: Support signed div/mod insns
riscv, bpf: Support 32-bit offset jmp insn
riscv, bpf: Support sign-extension mov insns
riscv, bpf: Support sign-extension load insns
riscv, bpf: Fix missing exception handling and redundant zext for LDX_B/H/W
samples/bpf: Add note to README about the XDP utilities moved to xdp-tools
samples/bpf: Cleanup .gitignore
samples/bpf: Remove the xdp_sample_pkts utility
samples/bpf: Remove the xdp1 and xdp2 utilities
samples/bpf: Remove the xdp_rxq_info utility
samples/bpf: Remove the xdp_redirect* utilities
...
====================
Link: https://lore.kernel.org/r/20230825194319.12727-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove the necessity to modify skb_ext_total_length() when new extension
types are added.
Also reduces the line count a bit.
With optimizations enabled the function is folded down to the same
constant value as before during compilation.
This has been validated on x86 with GCC 6.5.0 and 13.2.1.
Also a similar construct has been validated on godbolt.org with GCC 5.1.
In any case the compiler has to be able to evaluate the construct at
compile-time for the BUILD_BUG_ON() in skb_extensions_init().
Even if not evaluated at compile-time this function would only ever
be executed once at run-time, so the overhead would be very minuscule.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230823-skb_ext-simplify-v2-1-66e26cd66860@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the ifdown function in the dst_ops structure is referenced, the input
parameter 'how' is always true. In the current implementation of the
ifdown interface, ip6_dst_ifdown does not use the input parameter 'how',
xfrm6_dst_ifdown and xfrm4_dst_ifdown functions use the input parameter
'unregister'. But false judgment on 'unregister' in xfrm6_dst_ifdown and
xfrm4_dst_ifdown is false, so remove the input parameter 'how' in ifdown
function.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230821084104.3812233-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sk_getsockopt() runs locklessly. This means sk->sk_lingertime
can be read while other threads are changing its value.
Other reads also happen without socket lock being held,
and must be annotated.
Remove preprocessor logic using BITS_PER_LONG, compilers
are smart enough to figure this by themselves.
v2: fixed a clang W=1 (-Wtautological-constant-out-of-range-compare) warning
(Jakub)
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting IP_RECVERR and IPV6_RECVERR options to zero currently
purges the socket error queue, which was probably not expected
for zerocopy and tx_timestamp users.
I discovered this issue while preparing commit 6b5f43ea08
("inet: move inet->recverr to inet->inet_flags"), I presume this
change does not need to be backported to stable kernels.
Add skb_errqueue_purge() helper to purge error messages only.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
veth and vxcan need to make sure the ifindexes of the peer
are not negative, core does not validate this.
Using iproute2 with user-space-level checking removed:
Before:
# ./ip link add index 10 type veth peer index -1
# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:74:b2:03 brd ff:ff:ff:ff:ff:ff
10: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 8a:90:ff:57:6d:5d brd ff:ff:ff:ff:ff:ff
-1: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ae:ed:18:e6:fa:7f brd ff:ff:ff:ff:ff:ff
Now:
$ ./ip link add index 10 type veth peer index -1
Error: ifindex can't be negative.
This problem surfaced in net-next because an explicit WARN()
was added, the root cause is older.
Fixes: e6f8f1a739 ("veth: Allow to create peer link with given ifindex")
Fixes: a8f820a380 ("can: add Virtual CAN Tunnel driver (vxcan)")
Reported-by: syzbot+5ba06978f34abb058571@syzkaller.appspotmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_queue_purge() and __skb_queue_purge() become wrappers
around the new generic functions.
New SKB_DROP_REASON_QUEUE_PURGE drop reason is added,
but users can start adding more specific reasons.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since v6.5-rc1 MM-tree is merged and contains a new flag SLAB_NO_MERGE
in commit d0bf7d5759 ("mm/slab: introduce kmem_cache flag SLAB_NO_MERGE")
now is the time to use this flag for networking as proposed
earlier see link.
The SKB (sk_buff) kmem_cache slab is critical for network performance.
Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
performance by amortising the alloc/free cost.
For the bulk API to perform efficiently the slub fragmentation need to
be low. Especially for the SLUB allocator, the efficiency of bulk free
API depend on objects belonging to the same slab (page).
When running different network performance microbenchmarks, I started
to notice that performance was reduced (slightly) when machines had
longer uptimes. I believe the cause was 'skbuff_head_cache' got
aliased/merged into the general slub for 256 bytes sized objects (with
my kernel config, without CONFIG_HARDENED_USERCOPY).
For SKB kmem_cache network stack have other various reasons for
not merging, but it varies depending on kernel config (e.g.
CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NO_MERGE
for this kmem_cache to get most out of kmem_cache_{alloc,free}_bulk APIs.
When CONFIG_SLUB_TINY is configured the bulk APIs are essentially
disabled. Thus, for this case drop the SLAB_NO_MERGE flag.
Link: https://lore.kernel.org/all/167396280045.539803.7540459812377220500.stgit@firesoul/
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/169211265663.1491038.8580163757548985946.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/sfc/tc.c
fa165e1949 ("sfc: don't unregister flow_indr if it was never registered")
3bf969e88a ("sfc: add MAE table machinery for conntrack table")
https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
BPF encap ops can return different types of positive values, such like
NET_RX_DROP, NET_XMIT_CN, NETDEV_TX_BUSY, and so on, from function
skb_do_redirect and bpf_lwt_xmit_reroute. At the xmit hook, such return
values would be treated implicitly as LWTUNNEL_XMIT_CONTINUE in
ip(6)_finish_output2. When this happens, skbs that have been freed would
continue to the neighbor subsystem, causing use-after-free bug and
kernel crashes.
To fix the incorrect behavior, skb_do_redirect return values can be
simply discarded, the same as tc-egress behavior. On the other hand,
bpf_lwt_xmit_reroute returns useful errors to local senders, e.g. PMTU
information. Thus convert its return values to avoid the conflict with
LWTUNNEL_XMIT_CONTINUE.
Fixes: 3a0af8fd61 ("bpf: BPF for lightweight tunnel infrastructure")
Reported-by: Jordan Griege <jgriege@cloudflare.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/0d2b878186cfe215fec6b45769c1cd0591d3628d.1692326837.git.yan@cloudflare.com
The status of global socket memory pressure is updated when:
a) __sk_mem_raise_allocated():
enter: sk_memory_allocated(sk) > sysctl_mem[1]
leave: sk_memory_allocated(sk) <= sysctl_mem[0]
b) __sk_mem_reduce_allocated():
leave: sk_under_memory_pressure(sk) &&
sk_memory_allocated(sk) < sysctl_mem[0]
So the conditions of leaving global pressure are inconstant, which
may lead to the situation that one pressured net-memcg prevents the
global pressure from being cleared when there is indeed no global
pressure, thus the global constrains are still in effect unexpectedly
on the other sockets.
This patch fixes this by ignoring the net-memcg's pressure when
deciding whether should leave global memory pressure.
Fixes: e1aab161e0 ("socket: initial cgroup code.")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Link: https://lore.kernel.org/r/20230816091226.1542-1-wuyun.abel@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IP_MULTICAST_LOOP socket option can now be set/read
without locking the socket.
v3: fix build bot error reported in ipvs set_mcast_loop()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the xarray changes we mix returning valid ifindex and negative
errno in a single int returned from dev_index_reserve(). This depends
on the fact that ifindexes can't be negative. Otherwise we may insert
into the xarray and return a very large negative value. This in turn
may break ERR_PTR().
OvS is susceptible to this problem and lacking validation (fix posted
separately for net).
Reject negative ifindex explicitly. Add a warning because the input
validation is better handled by the caller.
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230814205627.2914583-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move from register_net_sysctl to register_net_sysctl_sz for all the
networking related files. Do this while making sure to mirror the NULL
assignments with a table_size of zero for the unprivileged users.
We need to move to the new function in preparation for when we change
SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do
so would erroneously allow ARRAY_SIZE() to be called on a pointer. We
hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all
the relevant net sysctl registering functions to register_net_sysctl_sz
in subsequent commits.
An additional size function was added to the following files in order to
calculate the size of an array that is defined in another file:
include/net/ipv6.h
net/ipv6/icmp.c
net/ipv6/route.c
net/ipv6/sysctl_net_ipv6.c
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZNRuIQAKCRBar9k/UBDW
4++9AP9ymOcPOKTKdQwZ6cnq3vkmvN37H6teufTyM8vsCha9NAD+OQE+vg1304RM
aETtG6d5Nb+byIHZGJrdUyT7g9jRzgw=
=qr/C
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Martin KaFai Lau says:
====================
pull-request: bpf 2023-08-09
We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 6 files changed, 102 insertions(+), 8 deletions(-).
The main changes are:
1) A bpf sockmap memleak fix and a fix in accessing the programs of
a sockmap under the incorrect map type from Xu Kuohai.
2) A refcount underflow fix in xsk from Magnus Karlsson.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add sockmap test for redirecting partial skb data
selftests/bpf: fix a CI failure caused by vsock sockmap test
bpf, sockmap: Fix bug that strp_done cannot be called
bpf, sockmap: Fix map type error in sock_map_del_link
xsk: fix refcount underflow in error path
====================
Link: https://lore.kernel.org/r/20230810055303.120917-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
strp_done is only called when psock->progs.stream_parser is not NULL,
but stream_parser was set to NULL by sk_psock_stop_strp(), called
by sk_psock_drop() earlier. So, strp_done can never be called.
Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock.
Change the condition for calling strp_done from judging whether
stream_parser is set to judging whether this flag is set. This flag is
only set once when strp_init() succeeds, and will never be cleared later.
Fixes: c0d95d3380 ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230804073740.194770-3-xukuohai@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
sock_map_del_link() operates on both SOCKMAP and SOCKHASH, although
both types have member named "progs", the offset of "progs" member in
these two types is different, so "progs" should be accessed with the
real map type.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230804073740.194770-2-xukuohai@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
IPV6_ADDRFORM socket option is evil, because it can change sock->ops
while other threads might read it. Same issue for sk->sk_family
being set to AF_INET.
Adding READ_ONCE() over sock->ops reads is needed for sockets
that might be impacted by IPV6_ADDRFORM.
Note that mptcp_is_tcpsk() can also overwrite sock->ops.
Adding annotations for all sk->sk_family reads will require
more patches :/
BUG: KCSAN: data-race in ____sys_sendmsg / do_ipv6_setsockopt
write to 0xffff888109f24ca0 of 8 bytes by task 4470 on cpu 0:
do_ipv6_setsockopt+0x2c5e/0x2ce0 net/ipv6/ipv6_sockglue.c:491
ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012
udpv6_setsockopt+0x95/0xa0 net/ipv6/udp.c:1690
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3663
__sys_setsockopt+0x1c3/0x230 net/socket.c:2273
__do_sys_setsockopt net/socket.c:2284 [inline]
__se_sys_setsockopt net/socket.c:2281 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2281
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff888109f24ca0 of 8 bytes by task 4469 on cpu 1:
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x349/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmmsg+0x263/0x500 net/socket.c:2643
__do_sys_sendmmsg net/socket.c:2672 [inline]
__se_sys_sendmmsg net/socket.c:2669 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2669
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0xffffffff850e32b8 -> 0xffffffff850da890
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 4469 Comm: syz-executor.1 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230808135809.2300241-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Change the new (unreleased) SO_PEERPIDFD sockopt to return ENODATA
rather than ESRCH if a socket type does not support remote peer-PID
queries.
Currently, SO_PEERPIDFD returns ESRCH when the socket in question is
not an AF_UNIX socket. This is quite unexpected, given that one would
assume ESRCH means the peer process already exited and thus cannot be
found. However, in that case the sockopt actually returns EINVAL (via
pidfd_prepare()). This is rather inconsistent with other syscalls, which
usually return ESRCH if a given PID refers to a non-existant process.
This changes SO_PEERPIDFD to return ENODATA instead. This is also what
SO_PEERGROUPS returns, and thus keeps a consistent behavior across
sockopts.
Note that this code is returned in 2 cases: First, if the socket type is
not AF_UNIX, and secondly if the socket was not yet connected. In both
cases ENODATA seems suitable.
Signed-off-by: David Rheinsberg <david@readahead.eu>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Luca Boccassi <bluca@debian.org>
Fixes: 7b26952a91 ("net: core: add getsockopt SO_PEERPIDFD")
Link: https://lore.kernel.org/r/20230807081225.816199-1-david@readahead.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller reported the following issue:
=======================================
Too BIG xdp->frame_sz = 131072
WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121
____bpf_xdp_adjust_tail net/core/filter.c:4121 [inline]
WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121
bpf_xdp_adjust_tail+0x466/0xa10 net/core/filter.c:4103
...
Call Trace:
<TASK>
bpf_prog_4add87e5301a4105+0x1a/0x1c
__bpf_prog_run include/linux/filter.h:600 [inline]
bpf_prog_run_xdp include/linux/filter.h:775 [inline]
bpf_prog_run_generic_xdp+0x57e/0x11e0 net/core/dev.c:4721
netif_receive_generic_xdp net/core/dev.c:4807 [inline]
do_xdp_generic+0x35c/0x770 net/core/dev.c:4866
tun_get_user+0x2340/0x3ca0 drivers/net/tun.c:1919
tun_chr_write_iter+0xe8/0x210 drivers/net/tun.c:2043
call_write_iter include/linux/fs.h:1871 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x650/0xe40 fs/read_write.c:584
ksys_write+0x12f/0x250 fs/read_write.c:637
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
xdp->frame_sz > PAGE_SIZE check was introduced in commit c8741e2bfe
("xdp: Allow bpf_xdp_adjust_tail() to grow packet size"). But Jesper
Dangaard Brouer <jbrouer@redhat.com> noted that after introducing the
xdp_init_buff() which all XDP driver use - it's safe to remove this
check. The original intend was to catch cases where XDP drivers have
not been updated to use xdp.frame_sz, but that is not longer a concern
(since xdp_init_buff).
Running the initial syzkaller repro it was discovered that the
contiguous physical memory allocation is used for both xdp paths in
tun_get_user(), e.g. tun_build_skb() and tun_alloc_skb(). It was also
stated by Jesper Dangaard Brouer <jbrouer@redhat.com> that XDP can
work on higher order pages, as long as this is contiguous physical
memory (e.g. a page).
Reported-and-tested-by: syzbot+f817490f5bd20541b90a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000774b9205f1d8a80d@google.com/T/
Link: https://syzkaller.appspot.com/bug?extid=f817490f5bd20541b90a
Link: https://lore.kernel.org/all/20230725155403.796-1-andrew.kanner@gmail.com/T/
Fixes: 43b5169d83 ("net, xdp: Introduce xdp_init_buff utility routine")
Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20230803190316.2380231-1-andrew.kanner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 8c48eea3ad ("page_pool: allow caching from safely localized
NAPI") allowed direct recycling of skb pages to their PP for some cases,
but unfortunately missed a couple of other majors.
For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
which unconditionally passes `false` as @napi_safe. Thus, all pages go
through ptr_ring and locks, although most of time we're actually inside
the NAPI polling this PP is linked with, so that it would be perfectly
safe to recycle pages directly.
Let's address such. If @napi_safe is true, we're fine, don't change
anything for this path. But if it's false, check whether we are in the
softirq context. It will most likely be so and then if ->list_owner
is our current CPU, we're good to use direct recycling, even though
@napi_safe is false -- concurrent access is excluded. in_softirq()
protection is needed mostly due to we can hit this place in the
process context (not the hardirq though).
For the mentioned xdp-drop-skb-mode case, the improvement I got is
3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and
alloc_slow counter doesn't change most of time, which means the
MM layer is not even called to allocate any new pages.
Suggested-by: Jakub Kicinski <kuba@kernel.org> # in_softirq()
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-7-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Page pool use in hardirq is prohibited, add debug checks
to catch misuses. IIRC we previously discussed using
DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
that people will have DEBUG_NET enabled in perf testing.
I don't think anyone enables lockdep in perf testing,
so use lockdep to avoid pushback and arguing :)
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-6-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, pp->p.napi is always read, but the actual variable it gets
assigned to is read-only when @napi_safe is true. For the !napi_safe
cases, which yet is still a pack, it's an unneeded operation.
Moreover, it can lead to premature or even redundant page_pool
cacheline access. For example, when page_pool_is_last_frag() returns
false (with the recent frag improvements).
Thus, read it only when @napi_safe is true. This also allows moving
@napi inside the condition block itself. Constify it while we are
here, because why not.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, touching <net/page_pool/types.h> triggers a rebuild of more
than half of the kernel. That's because it's included in
<linux/skbuff.h>. And each new include to page_pool/types.h adds more
[useless] data for the toolchain to process per each source file from
that pile.
In commit 6a5bcd84e8 ("page_pool: Allow drivers to hint on SKB
recycling"), Matteo included it to be able to call a couple of functions
defined there. Then, in commit 57f05bc2ab ("page_pool: keep pp info as
long as page pool owns the page") one of the calls was removed, so only
one was left. It's the call to page_pool_return_skb_page() in
napi_frag_unref(). The function is external and doesn't have any
dependencies. Having very niche page_pool_types.h included only for that
looks like an overkill.
As %PP_SIGNATURE is not local to page_pool.c (was only in the
early submissions), nothing holds this function there. Teleport
page_pool_return_skb_page() to skbuff.c, just next to the main consumer,
skb_pp_recycle(), and rename it to napi_pp_put_page(), as it doesn't
work with skbs at all and the former name tells nothing. The #if guards
here are only to not compile and have it in the vmlinux when not needed
-- both call sites are already guarded.
Now, touching page_pool_types.h only triggers rebuilding of the drivers
using it and a couple of core networking files.
Suggested-by: Jakub Kicinski <kuba@kernel.org> # make skbuff.h less heavy
Suggested-by: Alexander Duyck <alexanderduyck@fb.com> # move to skbuff.c
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Split types and pure function declarations from page_pool.h
and add them in page_page/types.h, so that C sources can
include page_pool.h and headers should generally only include
page_pool/types.h as suggested by jakub.
Rename page_pool.h to page_pool/helpers.h to have both in
one place.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com
[Jakub: change microsoft/mana, fix kdoc paths in Documentation]
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Setting dev->priv_flags & IFF_SEE_ALL_HWTSTAMP_REQUESTS is only legal
for drivers which were converted to ndo_hwtstamp_get() and
ndo_hwtstamp_set(), and it is only there that we call ndo_hwtstamp_set()
for a request that otherwise goes to phylib (for stuff like packet traps,
which need to be undone if phylib failed, hence the old_cfg logic).
The problem is that we end up calling ndo_hwtstamp_get() when we don't
need to (even if the SIOCSHWTSTAMP wasn't intended for phylib, or if it
was, but the driver didn't set IFF_SEE_ALL_HWTSTAMP_REQUESTS). For those
unnecessary conditions, we share a code path with virtual drivers (vlan,
macvlan, bonding) where ndo_hwtstamp_get() is implemented as
generic_hwtstamp_get_lower(), and may be resolved through
generic_hwtstamp_ioctl_lower() if the lower device is unconverted.
I.e. this situation:
$ ip link add link eno0 name eno0.100 type vlan id 100
$ hwstamp_ctl -i eno0.100 -t 1
We are unprepared to deal with this, because if ndo_hwtstamp_get() is
resolved through a legacy ndo_eth_ioctl(SIOCGHWTSTAMP) lower_dev
implementation, that needs a non-NULL old_cfg.ifr pointer, and we don't
have it.
But we don't even need to deal with it either. In the general case,
drivers may not even implement SIOCGHWTSTAMP handling, only SIOCSHWTSTAMP,
so it makes sense to completely avoid a SIOCGHWTSTAMP call if we can.
The solution is to split the single "if" condition into 3 smaller ones,
thus separating the decision to call ndo_hwtstamp_get() from the
decision to call ndo_hwtstamp_set(). The third "if" condition is
identical to the first one, and both are subsets of the second one.
Thus, the "cfg" argument of kernel_hwtstamp_config_changed() is always
valid.
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89iLOspJsvjPj+y8jikg7erXDomWe8sqHMdfL_2LQSFrPAg@mail.gmail.com/
Fixes: fd770e856e ("net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW
42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP
Tngj6Yid60o39/IZXXblhV37HfSiyQ8=
=/kVE
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2023-08-03
We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).
The main changes are:
1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
Daniel Borkmann
2) Support new insns from cpu v4 from Yonghong Song
3) Non-atomically allocate freelist during prefill from YiFei Zhu
4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu
5) Add tracepoint to xdp attaching failure from Leon Hwang
6) struct netdev_rx_queue and xdp.h reshuffling to reduce
rebuild time from Jakub Kicinski
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
net: invert the netdevice.h vs xdp.h dependency
net: move struct netdev_rx_queue out of netdevice.h
eth: add missing xdp.h includes in drivers
selftests/bpf: Add testcase for xdp attaching failure tracepoint
bpf, xdp: Add tracepoint to xdp attaching failure
selftests/bpf: fix static assert compilation issue for test_cls_*.c
bpf: fix bpf_probe_read_kernel prototype mismatch
riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
libbpf: fix typos in Makefile
tracing: bpf: use struct trace_entry in struct syscall_tp_t
bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
netfilter: bpf: Only define get_proto_defrag_hook() if necessary
bpf: Fix an array-index-out-of-bounds issue in disasm.c
net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
docs/bpf: Fix malformed documentation
bpf: selftests: Add defrag selftests
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Support not connecting client socket
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
...
====================
Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvqewAKCRBar9k/UBDW
48yeAQCnPnwzcvy+JDrdosuJEErhMv0pH3ECixNpPBpns95kzAEA9QhSYwjAhlFf
61d6hoiXj/sIibgMQT/ihODgeJ4wfQE=
=u7qn
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Martin KaFai Lau says:
====================
pull-request: bpf 2023-08-03
We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 3 files changed, 37 insertions(+), 20 deletions(-).
The main changes are:
1) Disable preemption in perf_event_output helpers code,
from Jiri Olsa
2) Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing,
from Lin Ma
3) Multiple warning splat fixes in cpumap from Hou Tao
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, cpumap: Handle skb as well when clean up ptr_ring
bpf, cpumap: Make sure kthread is running before map update returns
bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
bpf: Disable preemption in bpf_event_output
bpf: Disable preemption in bpf_perf_event_output
====================
Link: https://lore.kernel.org/r/20230803181429.994607-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All struct members of the driver-facing APIs are documented twice,
in the code and under Documentation. This is a bit tedious.
I also get the feeling that a lot of developers will read the header
when coding, rather than the doc. Bring the two a little closer
together by using kdoc for structs and functions.
Using kdoc also gives us links (mentioning a function or struct
in the text gets replaced by a link to its doc).
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230802161821.3621985-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
struct netdev_rx_queue is touched in only a few places
and having it defined in netdevice.h brings in the dependency
on xdp.h, because struct xdp_rxq_info gets embedded in
struct netdev_rx_queue.
In prep for removal of xdp.h from netdevice.h move all
the netdev_rx_queue stuff to a new header.
We could technically break the new header up to avoid
the sysfs.h include but it's so rarely included it
doesn't seem to be worth it at this point.
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
It is desirable that the new .ndo_hwtstamp_set() API gives more
uniformity, less overhead and future flexibility w.r.t. the PHY
timestamping behavior.
Currently there are some drivers which allow PHY timestamping through
the procedure mentioned in Documentation/networking/timestamping.rst.
They don't do anything locally if phy_has_hwtstamp() is set, except for
lan966x which installs PTP packet traps.
Centralize that behavior in a new dev_set_hwtstamp_phylib() code
function, which calls either phy_mii_ioctl() for the phylib PHY,
or .ndo_hwtstamp_set() of the netdev, based on a single policy
(currently simplistic: phy_has_hwtstamp()).
Any driver converted to .ndo_hwtstamp_set() will automatically opt into
the centralized phylib timestamping policy. Unconverted drivers still
get to choose whether they let the PHY handle timestamping or not.
Netdev drivers with integrated PHY drivers that don't use phylib
presumably don't set dev->phydev, and those will always see
HWTSTAMP_SOURCE_NETDEV requests even when converted. The timestamping
policy will remain 100% up to them.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-13-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The stackable net devices with hwtstamping support (vlan, macvlan,
bonding) only pass the hwtstamping ops to the lower (real) device.
These drivers are the first that need to be converted to the new
timestamping API, because if they aren't prepared to handle that,
then no real device driver cannot be converted to the new API either.
After studying what vlan_dev_ioctl(), macvlan_eth_ioctl() and
bond_eth_ioctl() have in common, here we propose two generic
implementations of ndo_hwtstamp_get() and ndo_hwtstamp_set() which
can be called by those 3 drivers, with "dev" being their lower device.
These helpers cover both cases, when the lower driver is converted to
the new API or unconverted.
We need some hacks in case of an unconverted driver, namely to stuff
some pointers in struct kernel_hwtstamp_config which shouldn't have
been there (since the new API isn't supposed to need it). These will
be removed when all drivers will have been converted to the new API.
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current hardware timestamping API for NICs requires implementing
.ndo_eth_ioctl() for SIOCGHWTSTAMP and SIOCSHWTSTAMP.
That API has some boilerplate such as request parameter translation
between user and kernel address spaces, handling possible translation
failures correctly, etc. Since it is the same all across the board, it
would be desirable to handle it through generic code.
Here we introduce .ndo_hwtstamp_get() and .ndo_hwtstamp_set(), which
implement that boilerplate and allow drivers to just act upon requests.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When error happens in dev_xdp_attach(), it should have a way to tell
users the error message like the netlink approach.
To avoid breaking uapi, adding a tracepoint in bpf_xdp_link_attach() is
an appropriate way to notify users the error message.
Hence, bpf libraries are able to retrieve the error message by this
tracepoint, and then report the error message to users.
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20230801142621.7925-2-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch enables offload for TC classifier
flower rules which matches against SPI field.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support for dissecting IPSEC field SPI (which is
32bits in size) for ESP and AH packets.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disabling preemption in sock_map_sk_acquire conflicts with GFP_ATOMIC
allocation later in sk_psock_init_link on PREEMPT_RT kernels, since
GFP_ATOMIC might sleep on RT (see bpf: Make BPF and PREEMPT_RT co-exist
patchset notes for details).
This causes calling bpf_map_update_elem on BPF_MAP_TYPE_SOCKMAP maps to
BUG (sleeping function called from invalid context) on RT kernels.
preempt_disable was introduced together with lock_sk and rcu_read_lock
in commit 99ba2b5aba ("bpf: sockhash, disallow bpf_tcp_close and update
in parallel"), probably to match disabled migration of BPF programs, and
is no longer necessary.
Remove preempt_disable to fix BUG in sock_map_update_common on RT.
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/all/20200224140131.461979697@linutronix.de/
Fixes: 99ba2b5aba ("bpf: sockhash, disallow bpf_tcp_close and update in parallel")
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230728064411.305576-1-tglozar@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit df8fc4e934 ("kbuild: Enable -fstrict-flex-arrays=3") started
applying strict rules to standard string functions.
It does not work well with conventional socket code around each protocol-
specific sockaddr_XXX struct, which is cast from sockaddr_storage and has
a bigger size than fortified functions expect. See these commits:
commit 06d4c8a808 ("af_unix: Fix fortify_panic() in unix_bind_bsd().")
commit ecb4534b6a ("af_unix: Terminate sun_path when bind()ing pathname socket.")
commit a0ade8404c ("af_packet: Fix warning of fortified memcpy() in packet_getname().")
We must cast the protocol-specific address back to sockaddr_storage
to call such functions.
However, in the case of getsockaddr(SO_PEERNAME), the rationale is a bit
unclear as the buffer is defined by char[128] which is the same size as
sockaddr_storage.
Let's use sockaddr_storage explicitly.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As 32bits of dissector->used_keys are exhausted,
increase the size to 64bits.
This is base change for ESP/AH flow dissector patch.
Please find patch and discussions at
https://lore.kernel.org/netdev/ZMDNjD46BvZ5zp5I@corigine.com/T/#t
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw
Tested-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs locklessly. This means sk->sk_priority
can be read while other threads are changing its value.
Other reads also happen without socket lock being held.
Add missing annotations where needed.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a prior commit I forgot that sk_getsockopt() reads
sk->sk_ll_usec without holding a lock.
Fixes: 0dbffbb533 ("net: annotate data race around sk_ll_usec")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs locklessly, thus we need to annotate the read
of sk->sk_peek_off.
While we are at it, add corresponding annotations to sk_set_peek_off()
and unix_set_peek_off().
Fixes: b9bb53f383 ("sock: convert sk_peek_offset functions to WRITE_ONCE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_mark is often read while another thread could change the value.
Fixes: 4a19ec5800 ("[NET]: Introducing socket mark socket option.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a prior commit, I forgot to change sk_getsockopt()
when reading sk->sk_rcvbuf locklessly.
Fixes: ebb3b78db7 ("tcp: annotate sk->sk_rcvbuf lockless reads")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a prior commit, I forgot to change sk_getsockopt()
when reading sk->sk_sndbuf locklessly.
Fixes: e292f05e0d ("tcp: annotate sk->sk_sndbuf lockless reads")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs without locks, we must add annotations
to sk->sk_rcvtimeo and sk->sk_sndtimeo.
In the future we might allow fetching these fields before
we lock the socket in TCP fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a prior commit, I forgot to change sk_getsockopt()
when reading sk->sk_rcvlowat locklessly.
Fixes: eac66402d1 ("net: annotate sk->sk_rcvlowat lockless reads")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs locklessly. This means sk->sk_max_pacing_rate
can be read while other threads are changing its value.
Fixes: 62748f32d5 ("net: introduce SO_MAX_PACING_RATE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs locklessly. This means sk->sk_txrehash
can be read while other threads are changing its value.
Other locations were handled in commit cb6cd2cec7
("tcp: Change SYN ACK retransmit behaviour to account for rehash")
Fixes: 26859240e4 ("txhash: Add socket option to control TX hash rethink behavior")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Akhmat Karakotov <hmukos@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs locklessly. This means sk->sk_reserved_mem
can be read while other threads are changing its value.
Add missing annotations where they are needed.
Fixes: 2bb2f5fb21 ("net: add new socket option SO_RESERVE_MEM")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reap the benefits of easier iteration thanks to the xarray.
Convert just the genetlink ones, those are easier to test.
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230726185530.2247698-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Iterating over the netdev hash table for netlink dumps is hard.
Dumps are done in "chunks" so we need to save the position
after each chunk, so we know where to restart from. Because
netdevs are stored in a hash table we remember which bucket
we were in and how many devices we dumped.
Since we don't hold any locks across the "chunks" - devices may
come and go while we're dumping. If that happens we may miss
a device (if device is deleted from the bucket we were in).
We indicate to user space that this may have happened by setting
NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
if it sees that. Somehow I doubt most user space gets this right..
To illustrate let's look at an example:
System state:
start: # [A, B, C]
del: B # [A, C]
with the hash table we may dump [A, B], missing C completely even
tho it existed both before and after the "del B".
Add an xarray and use it to allocate ifindexes. This way we
can iterate ifindexes in order, without the worry that we'll
skip one. We may still generate a dump of a state which "never
existed", for example for a set of values and sequence of ops:
System state:
start: # [A, B]
add: C # [A, C, B]
del: B # [A, C]
we may generate a dump of [A], if C got an index between A and B.
System has never been in such state. But I'm 90% sure that's perfectly
fine, important part is that we can't _miss_ devices which exist before
and after. User space which wants to mirror kernel's state subscribes
to notifications and does periodic dumps so it will know that C exists
from the notification about its creation or from the next dump
(next dump is _guaranteed_ to include C, if it doesn't get removed).
To avoid any perf regressions keep the hash table for now. Most
net namespaces have very few devices and microbenchmarking 1M lookups
on Skylake I get the following results (not counting loopback
to number of devs):
#devs | hash | xa | delta
2 | 18.3 | 20.1 | + 9.8%
16 | 18.3 | 20.1 | + 9.5%
64 | 18.3 | 26.3 | +43.8%
128 | 20.4 | 26.3 | +28.6%
256 | 20.0 | 26.4 | +32.1%
1024 | 26.6 | 26.7 | + 0.2%
8192 |541.3 | 33.5 | -93.8%
No surprises since the hash table has 256 entries.
The microbenchmark scans indexes in order, if the pattern is more
random xa starts to win at 512 devices already. But that's a lot
of devices, in practice.
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230726185530.2247698-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The DT of_device.h and of_platform.h date back to the separate
of_platform_bus_type before it as merged into the regular platform bus.
As part of that merge prepping Arm DT support 13 years ago, they
"temporarily" include each other. They also include platform_device.h
and of.h. As a result, there's a pretty much random mix of those include
files used throughout the tree. In order to detangle these headers and
replace the implicit includes with struct declarations, users need to
explicitly include the correct includes.
Acked-by: Alex Elder <elder@linaro.org>
Reviewed-by: Bhupesh Sharma <bhupesh.sharma@linaro.org>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230727014944.3972546-1-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are totally 9 ndo_bridge_setlink handlers in the current kernel,
which are 1) bnxt_bridge_setlink, 2) be_ndo_bridge_setlink 3)
i40e_ndo_bridge_setlink 4) ice_bridge_setlink 5)
ixgbe_ndo_bridge_setlink 6) mlx5e_bridge_setlink 7)
nfp_net_bridge_setlink 8) qeth_l2_bridge_setlink 9) br_setlink.
By investigating the code, we find that 1-7 parse and use nlattr
IFLA_BRIDGE_MODE but 3 and 4 forget to do the nla_len check. This can
lead to an out-of-attribute read and allow a malformed nlattr (e.g.,
length 0) to be viewed as a 2 byte integer.
To avoid such issues, also for other ndo_bridge_setlink handlers in the
future. This patch adds the nla_len check in rtnl_bridge_setlink and
does an early error return if length mismatches. To make it works, the
break is removed from the parsing for IFLA_BRIDGE_FLAGS to make sure
this nla_for_each_nested iterates every attribute.
Fixes: b1edc14a3f ("ice: Implement ice_bridge_getlink and ice_bridge_setlink")
Fixes: 51616018dd ("i40e: Add support for getlink, setlink ndo ops")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20230726075314.1059224-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The nla_for_each_nested parsing in function bpf_sk_storage_diag_alloc
does not check the length of the nested attribute. This can lead to an
out-of-attribute read and allow a malformed nlattr (e.g., length 0) to
be viewed as a 4 byte integer.
This patch adds an additional check when the nlattr is getting counted.
This makes sure the latter nla_get_u32 can access the attributes with
the correct length.
Fixes: 1ed4d92458 ("bpf: INET_DIAG support in bpf_sk_storage")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230725023330.422856-1-linma@zju.edu.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
There are currently two paths that call remove_xps_queue():
1. __netif_set_xps_queue -> remove_xps_queue
2. clean_xps_maps -> remove_xps_queue_cpu -> remove_xps_queue
There is no need to check dev_maps in remove_xps_queue() because
dev_maps has been checked on these two paths.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20230724023735.2751602-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy,
which configures SO_REUSEPORT on its sockets. In turn, we're blocked
from removing TPROXY from our setup.
The reason that bpf_sk_assign refuses such sockets is that the
bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead,
one of the reuseport sockets is selected by hash. This could cause
dispatch to the "wrong" socket:
sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.
Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straight forward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.
Fixes: 8e368dc72e ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe660f ("bpf: Add socket assign support")
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Joe Stringer <joe@cilium.io>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-7-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The semantics for bpf_sk_assign are as follows:
sk = some_lookup_func()
bpf_sk_assign(skb, sk)
bpf_sk_release(sk)
That is, the sk is not consumed by bpf_sk_assign. The function
therefore needs to make sure that sk lives long enough to be
consumed from __inet_lookup_skb. The path through the stack for a
TCPv4 packet is roughly:
netif_receive_skb_core: takes RCU read lock
__netif_receive_skb_core:
sch_handle_ingress:
tcf_classify:
bpf_sk_assign()
deliver_ptype_list_skb:
deliver_skb:
ip_packet_type->func == ip_rcv:
ip_rcv_core:
ip_rcv_finish_core:
dst_input:
ip_local_deliver:
ip_local_deliver_finish:
ip_protocol_deliver_rcu:
tcp_v4_rcv:
__inet_lookup_skb:
skb_steal_sock
The existing helper takes advantage of the fact that everything
happens in the same RCU critical section: for sockets with
SOCK_RCU_FREE set bpf_sk_assign never takes a reference.
skb_steal_sock then checks SOCK_RCU_FREE again and does sock_put
if necessary.
This approach assumes that SOCK_RCU_FREE is never set on a sk
between bpf_sk_assign and skb_steal_sock, but this invariant is
violated by unhashed UDP sockets. A new UDP socket is created
in TCP_CLOSE state but without SOCK_RCU_FREE set. That flag is only
added in udp_lib_get_port() which happens when a socket is bound.
When bpf_sk_assign was added it wasn't possible to access unhashed
UDP sockets from BPF, so this wasn't a problem. This changed
in commit 0c48eefae7 ("sock_map: Lift socket state restriction
for datagram sockets"), but the helper wasn't adjusted accordingly.
The following sequence of events will therefore lead to a refcount
leak:
1. Add socket(AF_INET, SOCK_DGRAM) to a sockmap.
2. Pull socket out of sockmap and bpf_sk_assign it. Since
SOCK_RCU_FREE is not set we increment the refcount.
3. bind() or connect() the socket, setting SOCK_RCU_FREE.
4. skb_steal_sock will now set refcounted = false due to
SOCK_RCU_FREE.
5. tcp_v4_rcv() skips sock_put().
Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
This matches the behaviour of __inet_lookup_skb which is ultimately
the goal of bpf_sk_assign().
Fixes: cf7fbe660f ("bpf: Add socket assign support")
Cc: Joe Stringer <joe@cilium.io>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-2-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Now that page_pool_release_page() is not exported we can
merge it with page_pool_return_page(). I believe that
the "Do not replace this with page_pool_return_page()"
comment was there in case page_pool_return_page() was
not inlined, to avoid two function calls.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Link: https://lore.kernel.org/r/20230720010409.1967072-5-kuba@kernel.org
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There seems to be no user calling page_pool_release_page()
for legit reasons, all the users simply haven't been converted
to skb-based recycling, yet. Previous changes converted them.
Update the docs, and unexport the function.
Link: https://lore.kernel.org/r/20230720010409.1967072-4-kuba@kernel.org
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmS4IUIACgkQ6rmadz2v
bTrVCw/9GG5A5ebqwoh/DrsFXEzpKDmZFIAWd5wB+Fx2i8y+6Jl/Fw6SjkkAtUnc
215T3YX2u3Xg1WFC5zxY9lYm2OeMq2lPHVwjlqgt/pHE8D6b8cZ44eyN+f0ZSiLy
wyx0wHLd3oP4KvMyiqm7/ZmhDjAtBpuqMjY5FNsbUxrIGUUI2ZLC4VFVWhnWmzRA
eEOQuUge4e1YD62kfkWlT/GEv710ysqFZD2zs4yhevDfmr/6DAIaA7dhfKMYsM/S
hCPoCuuXWVoHiqksm0U1BwpEiAQrqR91Sx8RCAakw5Pyp5hkj9dJc9sLwkgMH/k7
2352IIPXddH8cGKQM+hIBrc/io+6MxMbVk7Pe+1OUIBrvP//zQrHWk0zbssF3D8C
z6TbxBLdSzbDELPph3gZu5bNaLSkpuODhNjLcIVGSOeSJ5nsgATCQtXFAAPV0E/Q
v2O7Te5aTjTOpFMcIrIK1eWXUS56yRA+YwDa1VuWXAiLrr+Rq0tm4tBqxhof3KlH
bfCoqFNa12MfpCJURHICcV7DJo53rWbCtDSJPaYwZXb/jJPd3gPb8EVixoLN2A1M
dV/ou9rKEEkJXxsZ4Bctuh7t5YwpqxTq74YSdvnkOJ8P1lBDYST2SfHgQVOayQPv
XH9MlMO3Qtb9Sl0ZiI7gHbpK7h6v9RvRuHJcnN2e3wwMEx256xE=
=VRCb
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-07-19
We've added 45 non-merge commits during the last 3 day(s) which contain
a total of 71 files changed, 7808 insertions(+), 592 deletions(-).
The main changes are:
1) multi-buffer support in AF_XDP, from Maciej Fijalkowski,
Magnus Karlsson, Tirthendu Sarkar.
2) BPF link support for tc BPF programs, from Daniel Borkmann.
3) Enable bpf_map_sum_elem_count kfunc for all program types,
from Anton Protopopov.
4) Add 'owner' field to bpf_rb_node to fix races in shared ownership,
Dave Marchevsky.
5) Prevent potential skb_header_pointer() misuse, from Alexei Starovoitov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
bpf, net: Introduce skb_pointer_if_linear().
bpf: sync tools/ uapi header with
selftests/bpf: Add mprog API tests for BPF tcx links
selftests/bpf: Add mprog API tests for BPF tcx opts
bpftool: Extend net dump with tcx progs
libbpf: Add helper macro to clear opts structs
libbpf: Add link-based API for tcx
libbpf: Add opts-based attach/detach/query API for tcx
bpf: Add fd-based tcx multi-prog infra with link support
bpf: Add generic attach/detach/query API for multi-progs
selftests/xsk: reset NIC settings to default after running test suite
selftests/xsk: add test for too many frags
selftests/xsk: add metadata copy test for multi-buff
selftests/xsk: add invalid descriptor test for multi-buffer
selftests/xsk: add unaligned mode test for multi-buffer
selftests/xsk: add basic multi-buffer test
selftests/xsk: transmit and receive multi-buffer packets
xsk: add multi-buffer documentation
i40e: xsk: add TX multi-buffer support
ice: xsk: Tx multi-buffer support
...
====================
Link: https://lore.kernel.org/r/20230719175424.75717-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:
- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]
- From Cilium-side the tc BPF programs we attach to host-facing veth devices
and phys devices build the core datapath for Kubernetes Pods, and they
implement forwarding, load-balancing, policy, EDT-management, etc, within
BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
experienced hard-to-debug issues in a user's staging environment where
another Kubernetes application using tc BPF attached to the same prio/handle
of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
it. The goal is to establish a clear/safe ownership model via links which
cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.
[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will
carry maximum fragments that underlying ZC driver is able to handle on
TX side. It is going to be included in netlink response only when driver
supports ZC. Any value higher than 1 implies multi-buffer ZC support on
underlying device.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add multi-buffer support for AF_XDP by extending the XDP multi-buffer
support to be reflected in user-space when a packet is redirected to
an AF_XDP socket.
In the XDP implementation, the NIC driver builds the xdp_buff from the
first frag of the packet and adds any subsequent frags in the skb_shinfo
area of the xdp_buff. In AF_XDP core, XDP buffers are allocated from
xdp_sock's pool and data is copied from the driver's xdp_buff and frags.
Once an allocated XDP buffer is full and there is still data to be
copied, the 'XDP_PKT_CONTD' flag in'options' field of the corresponding
xdp ring descriptor is set and passed to the application. When application
sees the aforementioned flag set it knows there is pending data for this
packet that will be carried in the following descriptors. If there is no
more data to be copied, the flag in 'options' field is cleared for that
descriptor signalling EOP to the application.
If application reads a batch of descriptors using for example the libxdp
interfaces, it is not guaranteed that the batch will end with a full
packet. It might end in the middle of a packet and the rest of the frames
of that packet will arrive at the beginning of the next batch.
AF_XDP ensures that only a complete packet (along with all its frags) is
sent to application.
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-6-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a new bridge port attribute that allows attaching a nexthop object
ID to an skb that is redirected to a backup bridge port with VLAN
tunneling enabled.
Specifically, when redirecting a known unicast packet, read the backup
nexthop ID from the bridge port that lost its carrier and set it in the
bridge control block of the skb before forwarding it via the backup
port. Note that reading the ID from the bridge port should not result in
a cache miss as the ID is added next to the 'backup_port' field that was
already accessed. After this change, the 'state' field still stays on
the first cache line, together with other data path related fields such
as 'flags and 'vlgrp':
struct net_bridge_port {
struct net_bridge * br; /* 0 8 */
struct net_device * dev; /* 8 8 */
netdevice_tracker dev_tracker; /* 16 0 */
struct list_head list; /* 16 16 */
long unsigned int flags; /* 32 8 */
struct net_bridge_vlan_group * vlgrp; /* 40 8 */
struct net_bridge_port * backup_port; /* 48 8 */
u32 backup_nhid; /* 56 4 */
u8 priority; /* 60 1 */
u8 state; /* 61 1 */
u16 port_no; /* 62 2 */
/* --- cacheline 1 boundary (64 bytes) --- */
[...]
} __attribute__((__aligned__(8)));
When forwarding an skb via a bridge port that has VLAN tunneling
enabled, check if the backup nexthop ID stored in the bridge control
block is valid (i.e., not zero). If so, instead of attaching the
pre-allocated metadata (that only has the tunnel key set), allocate a
new metadata, set both the tunnel key and the nexthop object ID and
attach it to the skb.
By default, do not dump the new attribute to user space as a value of
zero is an invalid nexthop object ID.
The above is useful for EVPN multihoming. When one of the links
composing an Ethernet Segment (ES) fails, traffic needs to be redirected
towards the host via one of the other ES peers. For example, if a host
is multihomed to three different VTEPs, the backup port of each ES link
needs to be set to the VXLAN device and the backup nexthop ID needs to
point to an FDB nexthop group that includes the IP addresses of the
other two VTEPs. The VXLAN driver will extract the ID from the metadata
of the redirected skb, calculate its flow hash and forward it towards
one of the other VTEPs. If the ID does not exist, or represents an
invalid nexthop object, the VXLAN driver will drop the skb. This
relieves the bridge driver from the need to validate the ID.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make rtnl_fill_vf() cancel the vfinfo attribute on error instead of the
inner rtnl_fill_vfinfo(), as it is the function that starts it.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20230716072440.2372567-1-gal@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmSvKGIACgkQ6rmadz2v
bTorMBAAl25eNvQbLlNGRK4o8BE3ykND/tQT285rrgpvzBd6+7okGeEGKIfj5OCv
1QxNwN6bfn6dvwKCloxJKt5Us43WyxgbuwCXXC34trdpUb2AsLcBDVImUIafWdIE
ZAks8ty84war9fhRGmpB79ZmFFscQ51L+jzdAwS0pcSQXImWuEgFsHR2iwGV3wtW
s3zdvxmkEE4HFKeKkjFKACkQz45BCMjW+L1Lu6bEyV/8CKRRx0F4RnUEoeUzEN5m
dB9GncDW12FRsAC0U8vkmutoHeAUCKfCKW/mvw27xadzZ3gSfdg5HXsQ6oo+oGgZ
+nOPbUzqP5NH/D8X7ZZAN5lQmx55edxPbFcm9x9dFbOWB/qlef/mt2yWUPmnpt9c
8TOlf0Wt9SxeUABORoXdXvXBoNvje7jBfjKgIj/ewaKqjeOlM9JwxW/+m4CHbpa1
5wfu4Gm0zsIRymalKbUzOlCeQPZYB4Zp6MlhddkzfxNh449/mXQWKkKa7sFhCg3X
MQ38CSXtkWZ48PUcWuRPOQu1fyJixYqkt821xbfbDRf7GfLfWFXdVv0xrPIbfLti
jT2GEhA8MTU/PvN2uQACWDNnUPbkC+ijVuRu3OPwrLx7F8lkf+6uMIRke62ToxjW
uL9xWc7Y4jYdq+wzRl5/5OyYIY15ZHLMB2ujCprl7BPwOo8wrBw=
=Mt88
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:
====================
pull-request: bpf 2023-07-12
We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 7 files changed, 93 insertions(+), 28 deletions(-).
The main changes are:
1) Fix max stack depth check for async callbacks, from Kumar.
2) Fix inconsistent JIT image generation, from Björn.
3) Use trusted arguments in XDP hints kfuncs, from Larysa.
4) Fix memory leak in cpu_map_update_elem, from Pu.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
xdp: use trusted arguments in XDP hints kfuncs
bpf: cpumap: Fix memory leak in cpu_map_update_elem
riscv, bpf: Fix inconsistent JIT image generation
selftests/bpf: Add selftest for check_stack_max_depth bug
bpf: Fix max stack depth check for async callbacks
====================
Link: https://lore.kernel.org/r/20230712223045.40182-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, verifier does not reject XDP programs that pass NULL pointer to
hints functions. At the same time, this case is not handled in any driver
implementation (including veth). For example, changing
bpf_xdp_metadata_rx_timestamp(ctx, ×tamp);
to
bpf_xdp_metadata_rx_timestamp(ctx, NULL);
in xdp_metadata test successfully crashes the system.
Add KF_TRUSTED_ARGS flag to hints kfunc definitions, so driver code
does not have to worry about getting invalid pointers.
Fixes: 3d76a4d3d4 ("bpf: XDP metadata RX kfuncs")
Reported-by: Stanislav Fomichev <sdf@google.com>
Closes: https://lore.kernel.org/bpf/ZKWo0BbpLfkZHbyE@google.com/
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230711105930.29170-1-larysa.zaremba@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We have for some time the assign_bit() API to replace open coded
if (foo)
set_bit(n, bar);
else
clear_bit(n, bar);
Use this API in the code. No functional change intended.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Message-ID: <20230710100830.89936-1-andriy.shevchenko@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The tracepoint has existed for 12 years, but it only covered udp
over the legacy IPv4 protocol. Having it enabled for udp6 removes
the unnecessary difference in error visibility.
Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
Fixes: 296f7ea75b ("udp: add tracepoints for queueing skb to rcvbuf")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage in
skb_send_sock(). This causes pages to be spliced from the source iterator
if possible.
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Note that this could perhaps be improved to fill out a bvec array with all
the frags and then make a single sendmsg call, possibly sticking the header
on the front also.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/r/20230623225513.2732256-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZJX+ygAKCRDbK58LschI
g0/2AQDHg12smf9mPfK9wOFDNRIIX8r2iufB8LUFQMzCwltN6gEAkAdkAyfbof7P
TMaNUiHABijAFtChxoSI35j3OOSRrwE=
=GJgN
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-06-23
We've added 49 non-merge commits during the last 24 day(s) which contain
a total of 70 files changed, 1935 insertions(+), 442 deletions(-).
The main changes are:
1) Extend bpf_fib_lookup helper to allow passing the route table ID,
from Louis DeLosSantos.
2) Fix regsafe() in verifier to call check_ids() for scalar registers,
from Eduard Zingerman.
3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and()
and a rework of bpf_cpumask_any*() kfuncs. Additionally,
add selftests, from David Vernet.
4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings,
from Gilad Sever.
5) Change bpf_link_put() to use workqueue unconditionally to fix it
under PREEMPT_RT, from Sebastian Andrzej Siewior.
6) Follow-ups to address issues in the bpf_refcount shared ownership
implementation, from Dave Marchevsky.
7) A few general refactorings to BPF map and program creation permissions
checks which were part of the BPF token series, from Andrii Nakryiko.
8) Various fixes for benchmark framework and add a new benchmark
for BPF memory allocator to BPF selftests, from Hou Tao.
9) Documentation improvements around iterators and trusted pointers,
from Anton Protopopov.
10) Small cleanup in verifier to improve allocated object check,
from Daniel T. Lee.
11) Improve performance of bpf_xdp_pointer() by avoiding access
to shared_info when XDP packet does not have frags,
from Jesper Dangaard Brouer.
12) Silence a harmless syzbot-reported warning in btf_type_id_size(),
from Yonghong Song.
13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper,
from Jarkko Sakkinen.
14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS,
from Viktor Malik.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits)
bpf, docs: Document existing macros instead of deprecated
bpf, docs: BPF Iterator Document
selftests/bpf: Fix compilation failure for prog vrf_socket_lookup
selftests/bpf: Add vrf_socket_lookup tests
bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings
bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint
bpf: Factor out socket lookup functions for the TC hookpoint.
selftests/bpf: Set the default value of consumer_cnt as 0
selftests/bpf: Ensure that next_cpu() returns a valid CPU number
selftests/bpf: Output the correct error code for pthread APIs
selftests/bpf: Use producer_cnt to allocate local counter array
xsk: Remove unused inline function xsk_buff_discard()
bpf: Keep BPF_PROG_LOAD permission checks clear of validations
bpf: Centralize permissions checks for all BPF map types
bpf: Inline map creation logic in map_create() function
bpf: Move unprivileged checks into map_create() and bpf_prog_load()
bpf: Remove in_atomic() from bpf_link_put().
selftests/bpf: Verify that check_ids() is used for scalars in regsafe()
bpf: Verify scalar ids mapping in regsafe() using check_ids()
selftests/bpf: Check if mark_chain_precision() follows scalar ids
...
====================
Link: https://lore.kernel.org/r/20230623211256.8409-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit ad72c4a06a introduced optimization to return from function
quickly if the MAC address is not changing at all. It was reported
that such change causes dev->addr_assign_type to not change
to NET_ADDR_SET from _PERM or _RANDOM.
Restore the old behavior and skip only call to ndo_set_mac_address.
Fixes: ad72c4a06a ("net: add check for current MAC address in dev_set_mac_address")
Reported-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230621132106.991342-1-piotrx.gardocki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 1f86123b97 ("net: align SO_RCVMARK required
privileges with SO_MARK") because the reasoning in the commit message
is not really correct:
SO_RCVMARK is used for 'reading' incoming skb mark (via cmsg), as such
it is more equivalent to 'getsockopt(SO_MARK)' which has no priv check
and retrieves the socket mark, rather than 'setsockopt(SO_MARK) which
sets the socket mark and does require privs.
Additionally incoming skb->mark may already be visible if
sysctl_fwmark_reflect and/or sysctl_tcp_fwmark_accept are enabled.
Furthermore, it is easier to block the getsockopt via bpf
(either cgroup setsockopt hook, or via syscall filters)
then to unblock it if it requires CAP_NET_RAW/ADMIN.
On Android the socket mark is (among other things) used to store
the network identifier a socket is bound to. Setting it is privileged,
but retrieving it is not. We'd like unprivileged userspace to be able
to read the network id of incoming packets (where mark is set via
iptables [to be moved to bpf])...
An alternative would be to add another sysctl to control whether
setting SO_RCVMARK is privilged or not.
(or even a MASK of which bits in the mark can be exposed)
But this seems like over-engineering...
Note: This is a non-trivial revert, due to later merged commit e42c7beee7
("bpf: net: Consider has_current_bpf_ctx() when testing capable() in sk_setsockopt()")
which changed both 'ns_capable' into 'sockopt_ns_capable' calls.
Fixes: 1f86123b97 ("net: align SO_RCVMARK required privileges with SO_MARK")
Cc: Larysa Zaremba <larysa.zaremba@intel.com>
Cc: Simon Horman <simon.horman@corigine.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eyal Birger <eyal.birger@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Patrick Rohr <prohr@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230618103130.51628-1-maze@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When calling bpf_sk_lookup_tcp(), bpf_sk_lookup_udp() or
bpf_skc_lookup_tcp() from tc/xdp ingress, VRF socket bindings aren't
respoected, i.e. unbound sockets are returned, and bound sockets aren't
found.
VRF binding is determined by the sdif argument to sk_lookup(), however
when called from tc the IP SKB control block isn't initialized and thus
inet{,6}_sdif() always returns 0.
Fix by calculating sdif for the tc/xdp flows by observing the device's
l3 enslaved state.
The cg/sk_skb hooking points which are expected to support
inet{,6}_sdif() pass sdif=-1 which makes __bpf_skc_lookup() use the
existing logic.
Fixes: 6acc9b432e ("bpf: Add helper to retrieve socket in BPF")
Signed-off-by: Gilad Sever <gilad9366@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/bpf/20230621104211.301902-4-gilad9366@gmail.com
skb->dev always exists in the tc flow. There is no need to use
bpf_skc_lookup(), bpf_sk_lookup() from this code path.
This change facilitates fixing the tc flow to be VRF aware.
Signed-off-by: Gilad Sever <gilad9366@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230621104211.301902-3-gilad9366@gmail.com
Change BPF helper socket lookup functions to use TC specific variants:
bpf_tc_sk_lookup_tcp() / bpf_tc_sk_lookup_udp() / bpf_tc_skc_lookup_tcp()
instead of sharing implementation with the cg / sk_skb hooking points.
This allows introducing a separate logic for the TC flow.
The tc functions are identical to the original code.
Signed-off-by: Gilad Sever <gilad9366@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230621104211.301902-2-gilad9366@gmail.com
Blamed commit added these helpers for sake of detecting RAW
sockets specific ioctl.
syzbot complained about it [1].
Issue here is that RAW sockets could pretend there was no need
to call ipmr_sk_ioctl()
Regardless of inet_sk(sk)->inet_num, we must be prepared
for ipmr_ioctl() being called later. This must happen
from ipmr_sk_ioctl() context only.
We could add a safety check in ipmr_ioctl() at the risk of breaking
applications.
Instead, remove sk_is_ipmr() and sk_is_icmpv6() because their
name would be misleading, once we change their implementation.
[1]
BUG: KASAN: stack-out-of-bounds in ipmr_ioctl+0xb12/0xbd0 net/ipv4/ipmr.c:1654
Read of size 4 at addr ffffc90003aefae4 by task syz-executor105/5004
CPU: 0 PID: 5004 Comm: syz-executor105 Not tainted 6.4.0-rc6-syzkaller-01304-gc08afcdcf952 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:351
print_report mm/kasan/report.c:462 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:572
ipmr_ioctl+0xb12/0xbd0 net/ipv4/ipmr.c:1654
raw_ioctl+0x4e/0x1e0 net/ipv4/raw.c:881
sock_ioctl_out net/core/sock.c:4186 [inline]
sk_ioctl+0x151/0x440 net/core/sock.c:4214
inet_ioctl+0x18c/0x380 net/ipv4/af_inet.c:1001
sock_do_ioctl+0xcc/0x230 net/socket.c:1189
sock_ioctl+0x1f8/0x680 net/socket.c:1306
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f2944bf6ad9
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd8897a028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2944bf6ad9
RDX: 0000000000000000 RSI: 00000000000089e1 RDI: 0000000000000003
RBP: 00007f2944bbac80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2944bbad10
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
The buggy address belongs to stack of task syz-executor105/5004
and is located at offset 36 in frame:
sk_ioctl+0x0/0x440 net/core/sock.c:4172
This frame has 2 objects:
[32, 36) 'karg'
[48, 88) 'buffer'
Fixes: e1d001fa5b ("net: ioctl: Use kernel memory on protocol ioctl callbacks")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230619124336.651528-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This allows to do more centralized decisions later on, and generally
makes it very explicit which maps are privileged and which are not
(e.g., LRU_HASH and LRU_PERCPU_HASH, which are privileged HASH variants,
as opposed to unprivileged HASH and HASH_PERCPU; now this is explicit
and easy to verify).
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230613223533.3689589-4-andrii@kernel.org
The double ifdefs (one for the variable declaration and
one around the code) are quite aesthetically displeasing.
Factor this code out into a helper for easier wrapping.
This will become even more ugly when another skb ext
comparison is added in the future.
The resulting machine code looks the same, the compiler
seems to try to use %rax more and some blocks more around
but I haven't spotted minor differences.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some cases it is possible for kernel to come with request
to change primary MAC address to the address that is already
set on the given interface.
Add proper check to return fast from the function in these cases.
An example of such case is adding an interface to bonding
channel in balance-alb mode:
modprobe bonding mode=balance-alb miimon=100 max_bonds=1
ip link set bond0 up
ifenslave bond0 <eth>
Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Most of the ioctls to net protocols operates directly on userspace
argument (arg). Usually doing get_user()/put_user() directly in the
ioctl callback. This is not flexible, because it is hard to reuse these
functions without passing userspace buffers.
Change the "struct proto" ioctls to avoid touching userspace memory and
operate on kernel buffers, i.e., all protocol's ioctl callbacks is
adapted to operate on a kernel memory other than on userspace (so, no
more {put,get}_user() and friends being called in the ioctl callback).
This changes the "struct proto" ioctl format in the following way:
int (*ioctl)(struct sock *sk, int cmd,
- unsigned long arg);
+ int *karg);
(Important to say that this patch does not touch the "struct proto_ops"
protocols)
So, the "karg" argument, which is passed to the ioctl callback, is a
pointer allocated to kernel space memory (inside a function wrapper).
This buffer (karg) may contain input argument (copied from userspace in
a prep function) and it might return a value/buffer, which is copied
back to userspace if necessary. There is not one-size-fits-all format
(that is I am using 'may' above), but basically, there are three type of
ioctls:
1) Do not read from userspace, returns a result to userspace
2) Read an input parameter from userspace, and does not return anything
to userspace
3) Read an input from userspace, and return a buffer to userspace.
The default case (1) (where no input parameter is given, and an "int" is
returned to userspace) encompasses more than 90% of the cases, but there
are two other exceptions. Here is a list of exceptions:
* Protocol RAW:
* cmd = SIOCGETVIFCNT:
* input and output = struct sioc_vif_req
* cmd = SIOCGETSGCNT
* input and output = struct sioc_sg_req
* Explanation: for the SIOCGETVIFCNT case, userspace passes the input
argument, which is struct sioc_vif_req. Then the callback populates
the struct, which is copied back to userspace.
* Protocol RAW6:
* cmd = SIOCGETMIFCNT_IN6
* input and output = struct sioc_mif_req6
* cmd = SIOCGETSGCNT_IN6
* input and output = struct sioc_sg_req6
* Protocol PHONET:
* cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
* input int (4 bytes)
* Nothing is copied back to userspace.
For the exception cases, functions sock_sk_ioctl_inout() will
copy the userspace input, and copy it back to kernel space.
The wrapper that prepare the buffer and put the buffer back to user is
sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the callee now
calls sk_ioctl(), which will handle all cases.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 5fa5ae6058 ("netpoll: add net device refcount tracker to struct netpoll")
was part of one of the initial netdev tracker introduction patches.
It added an explicit netdev_tracker_alloc() for netpoll, presumably
because the flow of the function is somewhat odd.
After most of the core networking stack was converted to use
the tracking hold() variants, netpoll's call to old dev_hold()
stands out a bit.
np is allocated by the caller and ready to use, we can use
netdev_hold() here, even tho np->ndev will only be set to
ndev inside __netpoll_setup().
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
New users of dev_get_by_index() and dev_get_by_name() keep
getting added and it would be nice to steer them towards
the APIs with reference tracking.
Add variants of those calls which allocate the reference
tracker and use them in a couple of places.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves validate_linkmsg() out of do_setlink() to its callers
and deletes the early validate_linkmsg() call in __rtnl_newlink(), so
that it will not call validate_linkmsg() twice in either of the paths:
- __rtnl_newlink() -> do_setlink()
- __rtnl_newlink() -> rtnl_newlink_create() -> rtnl_create_link()
Additionally, as validate_linkmsg() is now only called with a real
dev, we can remove the NULL check for dev in validate_linkmsg().
Note that we moved validate_linkmsg() check to the places where it has
not done any changes to the dev, as Jakub suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/cf2ef061e08251faf9e8be25ff0d61150c030475.1686585334.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This filter already exists for excluding IPv6 SNMP stats. Extend its
definition to also exclude IFLA_VF_INFO stats in RTM_GETLINK.
This patch constitutes a partial fix for a netlink attribute nesting
overflow bug in IFLA_VFINFO_LIST. By excluding the stats when the
requester doesn't need them, the truncation of the VF list is avoided.
While it was technically only the stats added in commit c5a9f6f0ab
("net/core: Add drop counters to VF statistics") breaking the camel's
back, the appreciable size of the stats data should never have been
included without due consideration for the maximum number of VFs
supported by PCI.
Fixes: 3b766cd832 ("net/core: Add reading VF statistics through the PF netdevice")
Fixes: c5a9f6f0ab ("net/core: Add drop counters to VF statistics")
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Cc: Edwin Peer <espeer@gmail.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20230611105108.122586-1-gal@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add support for dissecting cfm packets. The cfm packet header
fields maintenance domain level and opcode can be dissected.
Signed-off-by: Zahari Doychev <zdoychev@maxlinear.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add SO_PEERPIDFD which allows to get pidfd of peer socket holder pidfd.
This thing is direct analog of SO_PEERCRED which allows to get plain PID.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: bpf@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement SCM_PIDFD, a new type of CMSG type analogical to SCM_CREDENTIALS,
but it contains pidfd instead of plain pid, which allows programmers not
to care about PID reuse problem.
We mask SO_PASSPIDFD feature if CONFIG_UNIX is not builtin because
it depends on a pidfd_prepare() API which is not exported to the kernel
modules.
Idea comes from UAPI kernel group:
https://uapi-group.org/kernel-features/
Big thanks to Christian Brauner and Lennart Poettering for productive
discussions about this.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Tested-by: Luca Boccassi <bluca@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move declarations into include/net/gso.h and code into net/core/gso.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
net/sched/sch_taprio.c
d636fc5dd6 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
dced11ef84 ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")
net/ipv4/sysctl_net_ipv4.c
e209fee411 ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294")
ccce324dab ("tcp: make the first N SYN RTO backoffs linear")
https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZIDxUwAKCRDbK58LschI
g5hDAQD7ukrniCvMRNIm2yUZIGSxE4RvGiXptO4a0NfLck5R/wEAsfN2KUsPcPhW
HS37lVfx7VVXfj42+REf7lWLu4TXpwk=
=6mS/
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-06-07
We've added 7 non-merge commits during the last 7 day(s) which contain
a total of 12 files changed, 112 insertions(+), 7 deletions(-).
The main changes are:
1) Fix a use-after-free in BPF's task local storage, from KP Singh.
2) Make struct path handling more robust in bpf_d_path, from Jiri Olsa.
3) Fix a syzbot NULL-pointer dereference in sockmap, from Eric Dumazet.
4) UAPI fix for BPF_NETFILTER before final kernel ships,
from Florian Westphal.
5) Fix map-in-map array_map_gen_lookup code generation where elem_size was
not being set for inner maps, from Rhys Rustad-Elliott.
6) Fix sockopt_sk selftest's NETLINK_LIST_MEMBERSHIPS assertion,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add extra path pointer check to d_path helper
selftests/bpf: Fix sockopt_sk selftest
bpf: netfilter: Add BPF_NETFILTER bpf_attach_type
selftests/bpf: Add access_inner_map selftest
bpf: Fix elem_size not being set for inner maps
bpf: Fix UAF in task local storage
bpf, sockmap: Avoid potential NULL dereference in sk_psock_verdict_data_ready()
====================
Link: https://lore.kernel.org/r/20230607220514.29698-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add READ_ONCE()/WRITE_ONCE() on accesses to the sock flow table.
This also prevents a (smart ?) compiler to remove the condition in:
if (table->ents[index] != newval)
table->ents[index] = newval;
We need the condition to avoid dirtying a shared cache line.
Fixes: fec5e652e5 ("rfs: Receive Flow Steering")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The GRO control block (NAPI_GRO_CB) is currently at its maximum size.
This commit reduces its size by putting two groups of fields that are
used only at different times into a union.
Specifically, the fields frag0 and frag0_len are the fields that make up
the frag0 optimisation mechanism, which is used during the initial
parsing of the SKB.
The fields last and age are used after the initial parsing, while the
SKB is stored in the GRO list, waiting for other packets to arrive.
There was one location in dev_gro_receive that modified the frag0 fields
after setting last and age. I changed this accordingly without altering
the code behaviour.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230601161407.GA9253@debian
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Andrzej Hajda says:
====================
drm/i915: use ref_tracker library for tracking wakerefs
This is reviewed series of ref_tracker patches, ready to merge
via network tree, rebased on net-next/main.
i915 patches will be merged later via intel-gfx tree.
====================
Merge on top of an -rc tag in case it's needed in another tree.
Link: https://lore.kernel.org/r/20230224-track_gt-v9-0-5b47a33f55d1@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In case the library is tracking busy subsystem, simply
printing stack for every active reference will spam log
with long, hard to read, redundant stack traces. To improve
readabilty following changes have been made:
- reports are printed per stack_handle - log is more compact,
- added display name for ref_tracker_dir - it will differentiate
multiple subsystems,
- stack trace is printed indented, in the same printk call,
- info about dropped references is printed as well.
Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently we observed a significant performance degradation in
samples/bpf xdp1 and xdp2, due XDP multibuffer "xdp.frags" handling,
added in commit 7722517422 ("samples/bpf: fixup some tools to be able
to support xdp multibuffer").
This patch reduce the overhead by avoiding to read/load shared_info
(sinfo) memory area, when XDP packet don't have any frags. This improves
performance because sinfo is located in another cacheline.
Function bpf_xdp_pointer() is used by BPF helpers bpf_xdp_load_bytes()
and bpf_xdp_store_bytes(). As a help to reviewers, xdp_get_buff_len() can
potentially access sinfo, but it uses xdp_buff_has_frags() flags bit check
to avoid accessing sinfo in no-frags case.
The likely/unlikely instrumentation lays out asm code such that sinfo
access isn't interleaved with no-frags case (checked on GCC 12.2.1-4).
The generated asm code is more compact towards the no-frags case.
The BPF kfunc bpf_dynptr_slice() also use bpf_xdp_pointer(). Thus, it
should also take effect for that.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/168563651438.3436004.17735707525651776648.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add ability to specify routing table ID to the `bpf_fib_lookup` BPF
helper.
A new field `tbid` is added to `struct bpf_fib_lookup` used as
parameters to the `bpf_fib_lookup` BPF helper.
When the helper is called with the `BPF_FIB_LOOKUP_DIRECT` and
`BPF_FIB_LOOKUP_TBID` flags the `tbid` field in `struct bpf_fib_lookup`
will be used as the table ID for the fib lookup.
If the `tbid` does not exist the fib lookup will fail with
`BPF_FIB_LKUP_RET_NOT_FWDED`.
The `tbid` field becomes a union over the vlan related output fields
in `struct bpf_fib_lookup` and will be zeroed immediately after usage.
This functionality is useful in containerized environments.
For instance, if a CNI wants to dictate the next-hop for traffic leaving
a container it can create a container-specific routing table and perform
a fib lookup against this table in a "host-net-namespace-side" TC program.
This functionality also allows `ip rule` like functionality at the TC
layer, allowing an eBPF program to pick a routing table based on some
aspect of the sk_buff.
As a concrete use case, this feature will be used in Cilium's SRv6 L3VPN
datapath.
When egress traffic leaves a Pod an eBPF program attached by Cilium will
determine which VRF the egress traffic should target, and then perform a
FIB lookup in a specific table representing this VRF's FIB.
Signed-off-by: Louis DeLosSantos <louis.delos.devel@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230505-bpf-add-tbid-fib-lookup-v2-1-0a31c22c748c@gmail.com
This fixes the issue that dev gro_max_size and gso_ipv4_max_size
can be set to a huge value:
# ip link add dummy1 type dummy
# ip link set dummy1 gro_max_size 4294967295
# ip -d link show dummy1
dummy addrgenmode eui64 ... gro_max_size 4294967295
Fixes: 0fe79f28bf ("net: allow gro_max_size to exceed 65536")
Fixes: 9eefedd58a ("net: add gso_ipv4_max_size and gro_ipv4_max_size per device")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These IFLA_GSO_* tb check should also be done for the new created link,
otherwise, they can be set to a huge value when creating links:
# ip link add dummy1 gso_max_size 4294967295 type dummy
# ip -d link show dummy1
dummy addrgenmode eui64 ... gso_max_size 4294967295
Fixes: 46e6b992c2 ("rtnetlink: allow GSO maximums to be set on device creation")
Fixes: 9eefedd58a ("net: add gso_ipv4_max_size and gro_ipv4_max_size per device")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
validate_linkmsg() was introduced by commit 1840bb13c2 ("[RTNL]:
Validate hardware and broadcast address attribute for RTM_NEWLINK")
to validate tb[IFLA_ADDRESS/BROADCAST] for existing links. The same
check should also be done for newly created links.
This patch adds validate_linkmsg() call in rtnl_create_link(), to
avoid the invalid address set when creating some devices like:
# ip link add dummy0 type dummy
# ip link add link dummy0 name mac0 address 01:02 type macsec
Fixes: 0e06877c6f ("[RTNETLINK]: rtnl_link: allow specifying initial device address")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If PREEMPT_RT is set, then assume that the user focuses on minimum
latency. Therefore don't set sw irq coalescing defaults.
This affects the defaults only, users can override these settings
via sysfs.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/f9439c7f-c92c-4c2c-703e-110f96d841b7@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller got the following report:
BUG: KASAN: use-after-free in sk_setup_caps+0x621/0x690 net/core/sock.c:2018
Read of size 8 at addr ffff888027f82780 by task syz-executor276/3255
The function sk_setup_caps (called by ip6_sk_dst_store_flow->
ip6_dst_store) referenced already freed memory as this memory was
freed by parallel task in udpv6_sendmsg->ip6_sk_dst_lookup_flow->
sk_dst_check.
task1 (connect) task2 (udp6_sendmsg)
sk_setup_caps->sk_dst_set |
| sk_dst_check->
| sk_dst_set
| dst_release
sk_setup_caps references |
to already freed dst_entry|
The reason for this race condition is: sk_setup_caps() keeps using
the dst after transferring the ownership to the dst cache.
Found by Linux Verification Center (linuxtesting.org) with syzkaller.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Vladislav Efanov <VEfanov@ispras.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the 'FLOW_DISSECTOR_KEY_META' key with a new 'l2_miss' field and
populate it from a field with the same name in the tc skb extension.
This field is set by the bridge driver for packets that incur an FDB or
MDB miss.
The next patch will extend the flower classifier to be able to match on
layer 2 misses.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The "len" variable needs to be signed for the error handling to work
correctly.
Fixes: 2e910b9532 ("net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/366861a7-87c8-4bbf-9101-69dd41021d07@kili.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZHEm+wAKCRDbK58LschI
gyIKAQCqO7B4sIu8hYVxBTwfHV2tIuXSMSCV4P9e78NUOPcO2QEAvLP/WVSjB0Bm
vpyTKKM22SpZvPe/jSp52j6t20N+qAc=
=HFxD
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-05-26
We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 76 files changed, 2729 insertions(+), 1003 deletions(-).
The main changes are:
1) Add the capability to destroy sockets in BPF through a new kfunc,
from Aditi Ghag.
2) Support O_PATH fds in BPF_OBJ_PIN and BPF_OBJ_GET commands,
from Andrii Nakryiko.
3) Add capability for libbpf to resize datasec maps when backed via mmap,
from JP Kobryn.
4) Move all the test kfuncs for CI out of the kernel and into bpf_testmod,
from Jiri Olsa.
5) Big batch of xsk selftest improvements to prep for multi-buffer testing,
from Magnus Karlsson.
6) Show the target_{obj,btf}_id in tracing link's fdinfo and dump it
via bpftool, from Yafang Shao.
7) Various misc BPF selftest improvements to work with upcoming LLVM 17,
from Yonghong Song.
8) Extend bpftool to specify netdevice for resolving XDP hints,
from Larysa Zaremba.
9) Document masking in shift operations for the insn set document,
from Dave Thaler.
10) Extend BPF selftests to check xdp_feature support for bond driver,
from Lorenzo Bianconi.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
bpf: Fix bad unlock balance on freeze_mutex
libbpf: Ensure FD >= 3 during bpf_map__reuse_fd()
libbpf: Ensure libbpf always opens files with O_CLOEXEC
selftests/bpf: Check whether to run selftest
libbpf: Change var type in datasec resize func
bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command
libbpf: Selftests for resizing datasec maps
libbpf: Add capability for resizing datasec maps
selftests/bpf: Add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests
libbpf: Add opts-based bpf_obj_pin() API and add support for path_fd
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
libbpf: Start v1.3 development cycle
bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM
bpftool: Specify XDP Hints ifname when loading program
selftests/bpf: Add xdp_feature selftest for bond device
selftests/bpf: Test bpf_sock_destroy
selftests/bpf: Add helper to get port using getsockname
bpf: Add bpf_sock_destroy kfunc
bpf: Add kfunc filter function to 'struct btf_kfunc_id_set'
bpf: udp: Implement batching for sockets iterator
...
====================
Link: https://lore.kernel.org/r/20230526222747.17775-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To keep things simple we used to include the uAPI header
in the kernel in the #include <linux/$family.h> format.
This works well enough, most of the genl families should
have headers in include/net/ so linux/$family.h ends up
referring to the uAPI header, anyway. And if it doesn't
no big deal, we'll just include more info than we need.
Unless that is there is a naming conflict. Someone recently
created include/linux/psp.h which will be a problem when
supporting the PSP protocol. (I'm talking about
work-in-progress patches, but it's just a proof that assuming
lack of name conflicts was overly optimistic.)
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Busy polling is currently not allowed on PREEMPT_RT, because it disables
preemption while invoking the NAPI callback. It is not possible to acquire
sleeping locks with disabled preemption. For details see commit
20ab39d13e ("net/core: disable NET_RX_BUSY_POLL on PREEMPT_RT").
However, strict cyclic and/or low latency network applications may prefer busy
polling e.g., using AF_XDP instead of interrupt driven communication.
The preempt_disable() is used in order to prevent the poll_owner and NAPI owner
to be preempted while owning the resource to ensure progress. Netpoll performs
busy polling in order to acquire the lock. NAPI is locked by setting the
NAPIF_STATE_SCHED flag. There is no busy polling if the flag is set and the
"owner" is preempted. Worst case is that the task owning NAPI gets preempted and
NAPI processing stalls. This is can be prevented by properly prioritising the
tasks within the system.
Allow RX_BUSY_POLL on PREEMPT_RT if NETPOLL is disabled. Don't disable
preemption on PREEMPT_RT within the busy poll loop.
Tested on x86 hardware with v6.1-RT and v6.3-RT on Intel i225 (igc) with
AF_XDP/ZC sockets configured to run in busy polling mode.
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZG4AiAAKCRDbK58LschI
g+xlAQCmefGbDuwPckZLnomvt6gl4bkIjs7kc1ySbG9QBnaInwD/WyrJaQIPijuD
qziHPAyx+MEgPseFU1b7Le35SZ66IwM=
=s4R1
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-05-24
We've added 19 non-merge commits during the last 10 day(s) which contain
a total of 20 files changed, 738 insertions(+), 448 deletions(-).
The main changes are:
1) Batch of BPF sockmap fixes found when running against NGINX TCP tests,
from John Fastabend.
2) Fix a memleak in the LRU{,_PERCPU} hash map when bucket locking fails,
from Anton Protopopov.
3) Init the BPF offload table earlier than just late_initcall,
from Jakub Kicinski.
4) Fix ctx access mask generation for 32-bit narrow loads of 64-bit fields,
from Will Deacon.
5) Remove a now unsupported __fallthrough in BPF samples,
from Andrii Nakryiko.
6) Fix a typo in pkg-config call for building sign-file,
from Jeremy Sowden.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, sockmap: Test progs verifier error with latest clang
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0
bpf, sockmap: Build helper to create connected socket pair
bpf, sockmap: Pull socket helpers out of listen test for general use
bpf, sockmap: Incorrectly handling copied_seq
bpf, sockmap: Wake up polling after data copy
bpf, sockmap: TCP data stall on recv before accept
bpf, sockmap: Handle fin correctly
bpf, sockmap: Improved check for empty queue
bpf, sockmap: Reschedule is now done through backlog
bpf, sockmap: Convert schedule_work into delayed_work
bpf, sockmap: Pass skb ownership through read_skb
bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
samples/bpf: Drop unnecessary fallthrough
bpf: netdev: init the offload table earlier
selftests/bpf: Fix pkg-config call building sign-file
====================
Link: https://lore.kernel.org/r/20230524170839.13905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 50749f2dd6 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with
TX timestamp.") added a call to skb_orphan_frags_rx() to fix leaks with
zerocopy skbs. But it ended up adding a leak of its own. When
skb_orphan_frags_rx() fails, the function just returns, leaking the skb
it just cloned. Free it before returning.
This bug was discovered and resolved using Coverity Static Analysis
Security Testing (SAST) by Synopsys, Inc.
Fixes: 50749f2dd6 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.")
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230522153020.32422-1-ptyadav@amazon.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a function to handle MSG_SPLICE_PAGES being passed internally to
sendmsg(). Pages are spliced into the given socket buffer if possible and
copied in if not (e.g. they're slab pages or have a zero refcount).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass the maximum number of fragments into skb_append_pagefrags() rather
than using MAX_SKB_FRAGS so that it can be used from code that wants to
specify sysctl_max_skb_frags.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
page_pool_ring_[un]lock() use in_softirq() to decide which
spin lock variant to use, and when they are called in the
context with in_softirq() being false, spin_lock_bh() is
called in page_pool_ring_lock() while spin_unlock() is
called in page_pool_ring_unlock(), because spin_lock_bh()
has disabled the softirq in page_pool_ring_lock(), which
causes inconsistency for spin lock pair calling.
This patch fixes it by returning in_softirq state from
page_pool_producer_lock(), and use it to decide which
spin lock variant to use in page_pool_producer_unlock().
As pool->ring has both producer and consumer lock, so
rename it to page_pool_producer_[un]lock() to reflect
the actual usage. Also move them to page_pool.c as they
are only used there, and remove the 'inline' as the
compiler may have better idea to do inlining or not.
Fixes: 7886244736 ("net: page_pool: Add bulk support for ptr_ring")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20230522031714.5089-1-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The read_skb() logic is incrementing the tcp->copied_seq which is used for
among other things calculating how many outstanding bytes can be read by
the application. This results in application errors, if the application
does an ioctl(FIONREAD) we return zero because this is calculated from
the copied_seq value.
To fix this we move tcp->copied_seq accounting into the recv handler so
that we update these when the recvmsg() hook is called and data is in
fact copied into user buffers. This gives an accurate FIONREAD value
as expected and improves ACK handling. Before we were calling the
tcp_rcv_space_adjust() which would update 'number of bytes copied to
user in last RTT' which is wrong for programs returning SK_PASS. The
bytes are only copied to the user when recvmsg is handled.
Doing the fix for recvmsg is straightforward, but fixing redirect and
SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then
call this from skmsg handlers. This fixes another issue where a broken
socket with a BPF program doing a resubmit could hang the receiver. This
happened because although read_skb() consumed the skb through sock_drop()
it did not update the copied_seq. Now if a single reccv socket is
redirecting to many sockets (for example for lb) the receiver sk will be
hung even though we might expect it to continue. The hang comes from
not updating the copied_seq numbers and memory pressure resulting from
that.
We have a slight layer problem of calling tcp_eat_skb even if its not
a TCP socket. To fix we could refactor and create per type receiver
handlers. I decided this is more work than we want in the fix and we
already have some small tweaks depending on caller that use the
helper skb_bpf_strparser(). So we extend that a bit and always set
the strparser bit when it is in use and then we can gate the
seq_copied updates on this.
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
When TCP stack has data ready to read sk_data_ready() is called. Sockmap
overwrites this with its own handler to call into BPF verdict program.
But, the original TCP socket had sock_def_readable that would additionally
wake up any user space waiters with sk_wake_async().
Sockmap saved the callback when the socket was created so call the saved
data ready callback and then we can wake up any epoll() logic waiting
on the read.
Note we call on 'copied >= 0' to account for returning 0 when a FIN is
received because we need to wake up user for this as well so they
can do the recvmsg() -> 0 and detect the shutdown.
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-8-john.fastabend@gmail.com
We noticed some rare sk_buffs were stepping past the queue when system was
under memory pressure. The general theory is to skip enqueueing
sk_buffs when its not necessary which is the normal case with a system
that is properly provisioned for the task, no memory pressure and enough
cpu assigned.
But, if we can't allocate memory due to an ENOMEM error when enqueueing
the sk_buff into the sockmap receive queue we push it onto a delayed
workqueue to retry later. When a new sk_buff is received we then check
if that queue is empty. However, there is a problem with simply checking
the queue length. When a sk_buff is being processed from the ingress queue
but not yet on the sockmap msg receive queue its possible to also recv
a sk_buff through normal path. It will check the ingress queue which is
zero and then skip ahead of the pkt being processed.
Previously we used sock lock from both contexts which made the problem
harder to hit, but not impossible.
To fix instead of popping the skb from the queue entirely we peek the
skb from the queue and do the copy there. This ensures checks to the
queue length are non-zero while skb is being processed. Then finally
when the entire skb has been copied to user space queue or another
socket we pop it off the queue. This way the queue length check allows
bypassing the queue only after the list has been completely processed.
To reproduce issue we run NGINX compliance test with sockmap running and
observe some flakes in our testing that we attributed to this issue.
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
Now that the backlog manages the reschedule() logic correctly we can drop
the partial fix to reschedule from recvmsg hook.
Rescheduling on recvmsg hook was added to address a corner case where we
still had data in the backlog state but had nothing to kick it and
reschedule the backlog worker to run and finish copying data out of the
state. This had a couple limitations, first it required user space to
kick it introducing an unnecessary EBUSY and retry. Second it only
handled the ingress case and egress redirects would still be hung.
With the correct fix, pushing the reschedule logic down to where the
enomem error occurs we can drop this fix.
Fixes: bec217197b ("skmsg: Schedule psock work if the cached skb exists on the psock")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of skb is done by attaching
another parser program) or directly through tcp_read_sock. The
tcp_read_sock is the preferred method for performance when the BPF logic is
a stream parser.
The flow for Cilium's common use case with a stream parser is,
tcp_read_sock()
sk_psock_verdict_recv
ret = bpf_prog_run_pin_on_cpu()
sk_psock_verdict_apply(sock, skb, ret)
// if system is under memory pressure or app is slow we may
// need to queue skb. Do this queuing through ingress_skb and
// then kick timer to wake up handler
skb_queue_tail(ingress_skb, skb)
schedule_work(work);
The work queue is wired up to sk_psock_backlog(). This will then walk the
ingress_skb skb list that holds our sk_buffs that could not be handled,
but should be OK to run at some later point. However, its possible that
the workqueue doing this work still hits an error when sending the skb.
When this happens the skbuff is requeued on a temporary 'state' struct
kept with the workqueue. This is necessary because its possible to
partially send an skbuff before hitting an error and we need to know how
and where to restart when the workqueue runs next.
Now for the trouble, we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when its the last packet in a flow and no further
packets come along that would cause the system to kick the workqueue from
that side.
To fix we could do simple schedule_work(), but while under memory pressure
it makes sense to back off some instead of continue to retry repeatedly. So
instead to fix convert schedule_work to schedule_delayed_work and add
backoff logic to reschedule from backlog queue on errors. Its not obvious
though what a good backoff is so use '1'.
To test we observed some flakes whil running NGINX compliance test with
sockmap we attributed these failed test to this bug and subsequent issue.
>From on list discussion. This commit
bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")
was intended to address similar race, but had a couple cases it missed.
Most obvious it only accounted for receiving traffic on the local socket
so if redirecting into another socket we could still get an sk_buff stuck
here. Next it missed the case where copied=0 in the recv() handler and
then we wouldn't kick the scheduler. Also its sub-optimal to require
userspace to kick the internal mechanisms of sockmap to wake it up and
copy data to user. It results in an extra syscall and requires the app
to actual handle the EAGAIN correctly.
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
The read_skb hook calls consume_skb() now, but this means that if the
recv_actor program wants to use the skb it needs to inc the ref cnt
so that the consume_skb() doesn't kfree the sk_buff.
This is problematic because in some error cases under memory pressure
we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
Then we get this,
skb_linearize()
__pskb_pull_tail()
pskb_expand_head()
BUG_ON(skb_shared(skb))
Because we incremented users refcnt from sk_psock_verdict_recv() we
hit the bug on with refcnt > 1 and trip it.
To fix lets simply pass ownership of the sk_buff through the skb_read
call. Then we can drop the consume from read_skb handlers and assume
the verdict recv does any required kfree.
Bug found while testing in our CI which runs in VMs that hit memory
constraints rather regularly. William tested TCP read_skb handlers.
[ 106.536188] ------------[ cut here ]------------
[ 106.536197] kernel BUG at net/core/skbuff.c:1693!
[ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
[ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
[ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
[ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
[ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
[ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
[ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
[ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
[ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
[ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
[ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 106.542255] Call Trace:
[ 106.542383] <IRQ>
[ 106.542487] __pskb_pull_tail+0x4b/0x3e0
[ 106.542681] skb_ensure_writable+0x85/0xa0
[ 106.542882] sk_skb_pull_data+0x18/0x20
[ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
[ 106.543536] ? migrate_disable+0x66/0x80
[ 106.543871] sk_psock_verdict_recv+0xe2/0x310
[ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0
[ 106.544561] tcp_read_skb+0x7b/0x120
[ 106.544740] tcp_data_queue+0x904/0xee0
[ 106.544931] tcp_rcv_established+0x212/0x7c0
[ 106.545142] tcp_v4_do_rcv+0x174/0x2a0
[ 106.545326] tcp_v4_rcv+0xe70/0xf60
[ 106.545500] ip_protocol_deliver_rcu+0x48/0x290
[ 106.545744] ip_local_deliver_finish+0xa7/0x150
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Reported-by: William Findlay <will@isovalent.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium
load-balancing to terminate client sockets that continue to connect to
deleted backends. The other use case is on-the-fly policy enforcement
where existing socket connections prevented by policies need to be
forcefully terminated. The kfunc also allows terminating sockets that may
or may not be actively sending traffic.
The kfunc can currently be called only from BPF TCP and UDP iterators
where users can filter, and terminate selected sockets. More
specifically, it can only be called from BPF contexts that ensure
socket locking in order to allow synchronous execution of protocol
specific `diag_destroy` handlers. The previous commit that batches UDP
sockets during iteration facilitated a synchronous invocation of the UDP
destroy callback from BPF context by skipping socket locks in
`udp_abort`. TCP iterator already supported batching of sockets being
iterated. To that end, `tracing_iter_filter` callback filter is added so
that verifier can restrict the kfunc to programs with `BPF_TRACE_ITER`
attach type, and reject other programs.
The kfunc takes `sock_common` type argument, even though it expects, and
casts them to a `sock` pointer. This enables the verifier to allow the
sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
`sock` structs. Furthermore, as `sock_common` only has a subset of
certain fields of `sock`, casting pointer to the latter type might not
always be safe for certain sockets like request sockets, but these have a
special handling in the diag_destroy handlers.
Additionally, the kfunc is defined with `KF_TRUSTED_ARGS` flag to avoid the
cases where a `PTR_TO_BTF_ID` sk is obtained by following another pointer.
eg. getting a sk pointer (may be even NULL) by following another sk
pointer. The pointer socket argument passed in TCP and UDP iterators is
tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info. The TRUSTED arg changes
are contributed by Martin KaFai Lau <martin.lau@kernel.org>.
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZGKqEAAKCRDbK58LschI
g6LYAQDp1jAszCOkmJ8VUA0ZyC5NAFDv+7y9Nd1toYWYX1btzAEAkf8+5qBJ1qmI
P5M0hjMTbH4MID9Aql10ZbMHheyOBAo=
=NUQM
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-05-16
We've added 57 non-merge commits during the last 19 day(s) which contain
a total of 63 files changed, 3293 insertions(+), 690 deletions(-).
The main changes are:
1) Add precision propagation to verifier for subprogs and callbacks,
from Andrii Nakryiko.
2) Improve BPF's {g,s}setsockopt() handling with wrong option lengths,
from Stanislav Fomichev.
3) Utilize pahole v1.25 for the kernel's BTF generation to filter out
inconsistent function prototypes, from Alan Maguire.
4) Various dyn-pointer verifier improvements to relax restrictions,
from Daniel Rosenberg.
5) Add a new bpf_task_under_cgroup() kfunc for designated task,
from Feng Zhou.
6) Unblock tests for arm64 BPF CI after ftrace supporting direct call,
from Florent Revest.
7) Add XDP hint kfunc metadata for RX hash/timestamp for igc,
from Jesper Dangaard Brouer.
8) Add several new dyn-pointer kfuncs to ease their usability,
from Joanne Koong.
9) Add in-depth LRU internals description and dot function graph,
from Joe Stringer.
10) Fix KCSAN report on bpf_lru_list when accessing node->ref,
from Martin KaFai Lau.
11) Only dump unprivileged_bpf_disabled log warning upon write,
from Kui-Feng Lee.
12) Extend test_progs to directly passing allow/denylist file,
from Stephen Veiss.
13) Fix BPF trampoline memleak upon failure attaching to fentry,
from Yafang Shao.
14) Fix emitting struct bpf_tcp_sock type in vmlinux BTF,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (57 commits)
bpf: Fix memleak due to fentry attach failure
bpf: Remove bpf trampoline selector
bpf, arm64: Support struct arguments in the BPF trampoline
bpftool: JIT limited misreported as negative value on aarch64
bpf: fix calculation of subseq_idx during precision backtracking
bpf: Remove anonymous union in bpf_kfunc_call_arg_meta
bpf: Document EFAULT changes for sockopt
selftests/bpf: Correctly handle optlen > 4096
selftests/bpf: Update EFAULT {g,s}etsockopt selftests
bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
libbpf: fix offsetof() and container_of() to work with CO-RE
bpf: Address KCSAN report on bpf_lru_list
bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25
selftests/bpf: Accept mem from dynptr in helper funcs
bpf: verifier: Accept dynptr mem as mem in helpers
selftests/bpf: Check overflow in optional buffer
selftests/bpf: Test allowing NULL buffer in dynptr slice
bpf: Allow NULL buffers in bpf_dynptr_slice(_rw)
selftests/bpf: Add testcase for bpf_task_under_cgroup
bpf: Add bpf_task_under_cgroup() kfunc
...
====================
Link: https://lore.kernel.org/r/20230515225603.27027-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Most users use __skb_frag_set_page()/skb_frag_off_set()/
skb_frag_size_set() to fill the page desc for a skb frag.
Introduce skb_frag_fill_page_desc() to do that.
net/bpf/test_run.c does not call skb_frag_off_set() to
set the offset, "copy_from_user(page_address(page), ...)"
and 'shinfo' being part of the 'data' kzalloced in
bpf_test_init() suggest that it is assuming offset to be
initialized as zero, so call skb_frag_fill_page_desc()
with offset being zero for this case.
Also, skb_frag_set_page() is not used anymore, so remove
it.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__condition is evaluated twice in sk_wait_event() macro.
First invocation is lockless, and reads can race with writes,
as spotted by syzbot.
BUG: KCSAN: data-race in sk_stream_wait_connect / tcp_disconnect
write to 0xffff88812d83d6a0 of 4 bytes by task 9065 on cpu 1:
tcp_disconnect+0x2cd/0xdb0
inet_shutdown+0x19e/0x1f0 net/ipv4/af_inet.c:911
__sys_shutdown_sock net/socket.c:2343 [inline]
__sys_shutdown net/socket.c:2355 [inline]
__do_sys_shutdown net/socket.c:2363 [inline]
__se_sys_shutdown+0xf8/0x140 net/socket.c:2361
__x64_sys_shutdown+0x31/0x40 net/socket.c:2361
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff88812d83d6a0 of 4 bytes by task 9040 on cpu 0:
sk_stream_wait_connect+0x1de/0x3a0 net/core/stream.c:75
tcp_sendmsg_locked+0x2e4/0x2120 net/ipv4/tcp.c:1266
tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1484
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:651
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
__sys_sendto+0x246/0x300 net/socket.c:2142
__do_sys_sendto net/socket.c:2154 [inline]
__se_sys_sendto net/socket.c:2150 [inline]
__x64_sys_sendto+0x78/0x90 net/socket.c:2150
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x00000000 -> 0x00000068
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c9929f0e34 ("mm/slob: remove CONFIG_SLOB") removes CONFIG_SLOB.
Now, we can also remove special handling for socket buffers with the SLOB
allocator. The code with HAVE_SKB_SMALL_HEAD_CACHE=1 is now the default
behavior for all allocators.
Remove an unnecessary distinction between SLOB and SLAB/SLUB allocator
after the SLOB allocator is gone.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230509071207.28942-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In one of our internal testing, we found a case where
- uapi struct bpf_tcp_sock is in vmlinux.h where vmlinux.h is not
generated from the testing kernel
- struct bpf_tcp_sock is not in vmlinux BTF
The above combination caused bpf load failure as the following
memory access
struct bpf_tcp_sock *tcp_sock = ...;
... tcp_sock->snd_cwnd ...
needs CORE relocation but the relocation cannot be resolved since
the kernel BTF does not have corresponding type.
Similar to other previous cases (nf_conn___init, tcp6_sock, mctcp_sock, etc.),
add the type to vmlinux BTF with BTF_EMIT_TYPE macro.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230502180543.1832140-1-yhs@fb.com
David Ahern reported crashes in skb_copy_ubufs() caused by TCP tx zerocopy
using hugepages, and skb length bigger than ~68 KB.
skb_copy_ubufs() assumed it could copy all payload using up to
MAX_SKB_FRAGS order-0 pages.
This assumption broke when BIG TCP was able to put up to 512 KB per skb.
We did not hit this bug at Google because we use CONFIG_MAX_SKB_FRAGS=45
and limit gso_max_size to 180000.
A solution is to use higher order pages if needed.
v2: add missing __GFP_COMP, or we leak memory.
Fixes: 7c4e983c4f ("net: allow gso_max_size to exceed 65536")
Reported-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/netdev/c70000f6-baa4-4a05-46d0-4b3e0dc1ccc8@gmail.com/T/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Core
----
- Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
default value allows for better BIG TCP performances.
- Reduce compound page head access for zero-copy data transfers.
- RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible.
- Threaded NAPI improvements, adding defer skb free support and unneeded
softirq avoidance.
- Address dst_entry reference count scalability issues, via false
sharing avoidance and optimize refcount tracking.
- Add lockless accesses annotation to sk_err[_soft].
- Optimize again the skb struct layout.
- Extends the skb drop reasons to make it usable by multiple
subsystems.
- Better const qualifier awareness for socket casts.
BPF
---
- Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and variable-sized
accesses.
- Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward.
- Add more precise memory usage reporting for all BPF map types.
- Adds support for using {FOU,GUE} encap with an ipip device operating
in collect_md mode and add a set of BPF kfuncs for controlling encap
params.
- Allow BPF programs to detect at load time whether a particular kfunc
exists or not, and also add support for this in light skeleton.
- Bigger batch of BPF verifier improvements to prepare for upcoming BPF
open-coded iterators allowing for less restrictive looping capabilities.
- Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
programs to NULL-check before passing such pointers into kfunc.
- Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
local storage maps.
- Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps.
- Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree.
- Add BPF verifier support for ST instructions in convert_ctx_access()
which will help new -mcpu=v4 clang flag to start emitting them.
- Add ARM32 USDT support to libbpf.
- Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations.
Protocols
---------
- IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
indicates the provenance of the IP address.
- IPv6: optimize route lookup, dropping unneeded R/W lock acquisition.
- Add the handshake upcall mechanism, allowing the user-space
to implement generic TLS handshake on kernel's behalf.
- Bridge: support per-{Port, VLAN} neighbor suppression, increasing
resilience to nodes failures.
- SCTP: add support for Fair Capacity and Weighted Fair Queueing
schedulers.
- MPTCP: delay first subflow allocation up to its first usage. This
will allow for later better LSM interaction.
- xfrm: Remove inner/outer modes from input/output path. These are
not needed anymore.
- WiFi:
- reduced neighbor report (RNR) handling for AP mode
- HW timestamping support
- support for randomized auth/deauth TA for PASN privacy
- per-link debugfs for multi-link
- TC offload support for mac80211 drivers
- mac80211 mesh fast-xmit and fast-rx support
- enable Wi-Fi 7 (EHT) mesh support
Netfilter
---------
- Add nf_tables 'brouting' support, to force a packet to be routed
instead of being bridged.
- Update bridge netfilter and ovs conntrack helpers to handle
IPv6 Jumbo packets properly, i.e. fetch the packet length
from hop-by-hop extension header. This is needed for BIT TCP
support.
- The iptables 32bit compat interface isn't compiled in by default
anymore.
- Move ip(6)tables builtin icmp matches to the udptcp one.
This has the advantage that icmp/icmpv6 match doesn't load the
iptables/ip6tables modules anymore when iptables-nft is used.
- Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device.
Driver API
----------
- Remove redundant Device Control Error Reporting Enable, as PCI core
has already error reporting enabled at enumeration time.
- Move Multicast DB netlink handlers to core, allowing devices other
then bridge to use them.
- Allow the page_pool to directly recycle the pages from safely
localized NAPI.
- Implement lockless TX queue stop/wake combo macros, allowing for
further code de-duplication and sanitization.
- Add YNL support for user headers and struct attrs.
- Add partial YNL specification for devlink.
- Add partial YNL specification for ethtool.
- Add tc-mqprio and tc-taprio support for preemptible traffic classes.
- Add tx push buf len param to ethtool, specifies the maximum number
of bytes of a transmitted packet a driver can push directly to the
underlying device.
- Add basic LED support for switch/phy.
- Add NAPI documentation, stop relaying on external links.
- Convert dsa_master_ioctl() to netdev notifier. This is a preparatory
work to make the hardware timestamping layer selectable by user
space.
- Add transceiver support and improve the error messages for CAN-FD
controllers.
New hardware / drivers
----------------------
- Ethernet:
- AMD/Pensando core device support
- MediaTek MT7981 SoC
- MediaTek MT7988 SoC
- Broadcom BCM53134 embedded switch
- Texas Instruments CPSW9G ethernet switch
- Qualcomm EMAC3 DWMAC ethernet
- StarFive JH7110 SoC
- NXP CBTX ethernet PHY
- WiFi:
- Apple M1 Pro/Max devices
- RealTek rtl8710bu/rtl8188gu
- RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
- Bluetooth:
- Realtek RTL8821CS, RTL8851B, RTL8852BS
- Mediatek MT7663, MT7922
- NXP w8997
- Actions Semi ATS2851
- QTI WCN6855
- Marvell 88W8997
- Can:
- STMicroelectronics bxcan stm32f429
Drivers
-------
- Ethernet NICs:
- Intel (1G, icg):
- add tracking and reporting of QBV config errors.
- add support for configuring max SDU for each Tx queue.
- Intel (100G, ice):
- refactor mailbox overflow detection to support Scalable IOV
- GNSS interface optimization
- Intel (i40e):
- support XDP multi-buffer
- nVidia/Mellanox:
- add the support for linux bridge multicast offload
- enable TC offload for egress and engress MACVLAN over bond
- add support for VxLAN GBP encap/decap flows offload
- extend packet offload to fully support libreswan
- support tunnel mode in mlx5 IPsec packet offload
- extend XDP multi-buffer support
- support MACsec VLAN offload
- add support for dynamic msix vectors allocation
- drop RX page_cache and fully use page_pool
- implement thermal zone to report NIC temperature
- Netronome/Corigine:
- add support for multi-zone conntrack offload
- Solarflare/Xilinx:
- support offloading TC VLAN push/pop actions to the MAE
- support TC decap rules
- support unicast PTP
- Other NICs:
- Broadcom (bnxt): enforce software based freq adjustments only
on shared PHC NIC
- RealTek (r8169): refactor to addess ASPM issues during NAPI poll.
- Micrel (lan8841): add support for PTP_PF_PEROUT
- Cadence (macb): enable PTP unicast
- Engleder (tsnep): add XDP socket zero-copy support
- virtio-net: implement exact header length guest feature
- veth: add page_pool support for page recycling
- vxlan: add MDB data path support
- gve: add XDP support for GQI-QPL format
- geneve: accept every ethertype
- macvlan: allow some packets to bypass broadcast queue
- mana: add support for jumbo frame
- Ethernet high-speed switches:
- Microchip (sparx5): Add support for TC flower templates.
- Ethernet embedded switches:
- Broadcom (b54):
- configure 6318 and 63268 RGMII ports
- Marvell (mv88e6xxx):
- faster C45 bus scan
- Microchip:
- lan966x:
- add support for IS1 VCAP
- better TX/RX from/to CPU performances
- ksz9477: add ETS Qdisc support
- ksz8: enhance static MAC table operations and error handling
- sama7g5: add PTP capability
- NXP (ocelot):
- add support for external ports
- add support for preemptible traffic classes
- Texas Instruments:
- add CPSWxG SGMII support for J7200 and J721E
- Intel WiFi (iwlwifi):
- preparation for Wi-Fi 7 EHT and multi-link support
- EHT (Wi-Fi 7) sniffer support
- hardware timestamping support for some devices/firwmares
- TX beacon protection on newer hardware
- Qualcomm 802.11ax WiFi (ath11k):
- MU-MIMO parameters support
- ack signal support for management packets
- RealTek WiFi (rtw88):
- SDIO bus support
- better support for some SDIO devices
(e.g. MAC address from efuse)
- RealTek WiFi (rtw89):
- HW scan support for 8852b
- better support for 6 GHz scanning
- support for various newer firmware APIs
- framework firmware backwards compatibility
- MediaTek WiFi (mt76):
- P2P support
- mesh A-MSDU support
- EHT (Wi-Fi 7) support
- coredump support
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B
9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ
Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6
cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC
Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7
Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI
HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C
eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm
pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf
F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9
aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ
vP7ugFnttneg
=ITVa
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
default value allows for better BIG TCP performances
- Reduce compound page head access for zero-copy data transfers
- RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
possible
- Threaded NAPI improvements, adding defer skb free support and
unneeded softirq avoidance
- Address dst_entry reference count scalability issues, via false
sharing avoidance and optimize refcount tracking
- Add lockless accesses annotation to sk_err[_soft]
- Optimize again the skb struct layout
- Extends the skb drop reasons to make it usable by multiple
subsystems
- Better const qualifier awareness for socket casts
BPF:
- Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and
variable-sized accesses
- Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward
- Add more precise memory usage reporting for all BPF map types
- Adds support for using {FOU,GUE} encap with an ipip device
operating in collect_md mode and add a set of BPF kfuncs for
controlling encap params
- Allow BPF programs to detect at load time whether a particular
kfunc exists or not, and also add support for this in light
skeleton
- Bigger batch of BPF verifier improvements to prepare for upcoming
BPF open-coded iterators allowing for less restrictive looping
capabilities
- Rework RCU enforcement in the verifier, add kptr_rcu and enforce
BPF programs to NULL-check before passing such pointers into kfunc
- Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
in local storage maps
- Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps
- Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree
- Add BPF verifier support for ST instructions in
convert_ctx_access() which will help new -mcpu=v4 clang flag to
start emitting them
- Add ARM32 USDT support to libbpf
- Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations
Protocols:
- IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
indicates the provenance of the IP address
- IPv6: optimize route lookup, dropping unneeded R/W lock acquisition
- Add the handshake upcall mechanism, allowing the user-space to
implement generic TLS handshake on kernel's behalf
- Bridge: support per-{Port, VLAN} neighbor suppression, increasing
resilience to nodes failures
- SCTP: add support for Fair Capacity and Weighted Fair Queueing
schedulers
- MPTCP: delay first subflow allocation up to its first usage. This
will allow for later better LSM interaction
- xfrm: Remove inner/outer modes from input/output path. These are
not needed anymore
- WiFi:
- reduced neighbor report (RNR) handling for AP mode
- HW timestamping support
- support for randomized auth/deauth TA for PASN privacy
- per-link debugfs for multi-link
- TC offload support for mac80211 drivers
- mac80211 mesh fast-xmit and fast-rx support
- enable Wi-Fi 7 (EHT) mesh support
Netfilter:
- Add nf_tables 'brouting' support, to force a packet to be routed
instead of being bridged
- Update bridge netfilter and ovs conntrack helpers to handle IPv6
Jumbo packets properly, i.e. fetch the packet length from
hop-by-hop extension header. This is needed for BIT TCP support
- The iptables 32bit compat interface isn't compiled in by default
anymore
- Move ip(6)tables builtin icmp matches to the udptcp one. This has
the advantage that icmp/icmpv6 match doesn't load the
iptables/ip6tables modules anymore when iptables-nft is used
- Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device
Driver API:
- Remove redundant Device Control Error Reporting Enable, as PCI core
has already error reporting enabled at enumeration time
- Move Multicast DB netlink handlers to core, allowing devices other
then bridge to use them
- Allow the page_pool to directly recycle the pages from safely
localized NAPI
- Implement lockless TX queue stop/wake combo macros, allowing for
further code de-duplication and sanitization
- Add YNL support for user headers and struct attrs
- Add partial YNL specification for devlink
- Add partial YNL specification for ethtool
- Add tc-mqprio and tc-taprio support for preemptible traffic classes
- Add tx push buf len param to ethtool, specifies the maximum number
of bytes of a transmitted packet a driver can push directly to the
underlying device
- Add basic LED support for switch/phy
- Add NAPI documentation, stop relaying on external links
- Convert dsa_master_ioctl() to netdev notifier. This is a
preparatory work to make the hardware timestamping layer selectable
by user space
- Add transceiver support and improve the error messages for CAN-FD
controllers
New hardware / drivers:
- Ethernet:
- AMD/Pensando core device support
- MediaTek MT7981 SoC
- MediaTek MT7988 SoC
- Broadcom BCM53134 embedded switch
- Texas Instruments CPSW9G ethernet switch
- Qualcomm EMAC3 DWMAC ethernet
- StarFive JH7110 SoC
- NXP CBTX ethernet PHY
- WiFi:
- Apple M1 Pro/Max devices
- RealTek rtl8710bu/rtl8188gu
- RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
- Bluetooth:
- Realtek RTL8821CS, RTL8851B, RTL8852BS
- Mediatek MT7663, MT7922
- NXP w8997
- Actions Semi ATS2851
- QTI WCN6855
- Marvell 88W8997
- Can:
- STMicroelectronics bxcan stm32f429
Drivers:
- Ethernet NICs:
- Intel (1G, icg):
- add tracking and reporting of QBV config errors
- add support for configuring max SDU for each Tx queue
- Intel (100G, ice):
- refactor mailbox overflow detection to support Scalable IOV
- GNSS interface optimization
- Intel (i40e):
- support XDP multi-buffer
- nVidia/Mellanox:
- add the support for linux bridge multicast offload
- enable TC offload for egress and engress MACVLAN over bond
- add support for VxLAN GBP encap/decap flows offload
- extend packet offload to fully support libreswan
- support tunnel mode in mlx5 IPsec packet offload
- extend XDP multi-buffer support
- support MACsec VLAN offload
- add support for dynamic msix vectors allocation
- drop RX page_cache and fully use page_pool
- implement thermal zone to report NIC temperature
- Netronome/Corigine:
- add support for multi-zone conntrack offload
- Solarflare/Xilinx:
- support offloading TC VLAN push/pop actions to the MAE
- support TC decap rules
- support unicast PTP
- Other NICs:
- Broadcom (bnxt): enforce software based freq adjustments only on
shared PHC NIC
- RealTek (r8169): refactor to addess ASPM issues during NAPI poll
- Micrel (lan8841): add support for PTP_PF_PEROUT
- Cadence (macb): enable PTP unicast
- Engleder (tsnep): add XDP socket zero-copy support
- virtio-net: implement exact header length guest feature
- veth: add page_pool support for page recycling
- vxlan: add MDB data path support
- gve: add XDP support for GQI-QPL format
- geneve: accept every ethertype
- macvlan: allow some packets to bypass broadcast queue
- mana: add support for jumbo frame
- Ethernet high-speed switches:
- Microchip (sparx5): Add support for TC flower templates
- Ethernet embedded switches:
- Broadcom (b54):
- configure 6318 and 63268 RGMII ports
- Marvell (mv88e6xxx):
- faster C45 bus scan
- Microchip:
- lan966x:
- add support for IS1 VCAP
- better TX/RX from/to CPU performances
- ksz9477: add ETS Qdisc support
- ksz8: enhance static MAC table operations and error handling
- sama7g5: add PTP capability
- NXP (ocelot):
- add support for external ports
- add support for preemptible traffic classes
- Texas Instruments:
- add CPSWxG SGMII support for J7200 and J721E
- Intel WiFi (iwlwifi):
- preparation for Wi-Fi 7 EHT and multi-link support
- EHT (Wi-Fi 7) sniffer support
- hardware timestamping support for some devices/firwmares
- TX beacon protection on newer hardware
- Qualcomm 802.11ax WiFi (ath11k):
- MU-MIMO parameters support
- ack signal support for management packets
- RealTek WiFi (rtw88):
- SDIO bus support
- better support for some SDIO devices (e.g. MAC address from
efuse)
- RealTek WiFi (rtw89):
- HW scan support for 8852b
- better support for 6 GHz scanning
- support for various newer firmware APIs
- framework firmware backwards compatibility
- MediaTek WiFi (mt76):
- P2P support
- mesh A-MSDU support
- EHT (Wi-Fi 7) support
- coredump support"
* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
net: phy: hide the PHYLIB_LEDS knob
net: phy: marvell-88x2222: remove unnecessary (void*) conversions
tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
net: amd: Fix link leak when verifying config failed
net: phy: marvell: Fix inconsistent indenting in led_blink_set
lan966x: Don't use xdp_frame when action is XDP_TX
tsnep: Add XDP socket zero-copy TX support
tsnep: Add XDP socket zero-copy RX support
tsnep: Move skb receive action to separate function
tsnep: Add functions for queue enable/disable
tsnep: Rework TX/RX queue initialization
tsnep: Replace modulo operation with mask
net: phy: dp83867: Add led_brightness_set support
net: phy: Fix reading LED reg property
drivers: nfc: nfcsim: remove return value check of `dev_dir`
net: phy: dp83867: Remove unnecessary (void*) conversions
net: ethtool: coalesce: try to make user settings stick twice
net: mana: Check if netdev/napi_alloc_frag returns single page
net: mana: Rename mana_refill_rxoob and remove some empty lines
net: veth: add page_pool stats
...
syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
skbs. We can reproduce the problem with these sequences:
sk = socket(AF_INET, SOCK_DGRAM, 0)
sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
sk.close()
sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
skb is sent, __skb_tstamp_tx() clones it and puts the clone into
the socket's error queue with the TX timestamp.
When the original skb is received locally, skb_copy_ubufs() calls
skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
This additional count is decremented while freeing the skb, but struct
ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
not called.
The last refcnt is not released unless we retrieve the TX timestamped
skb by recvmsg(). Since we clear the error queue in inet_sock_destruct()
after the socket's refcnt reaches 0, there is a circular dependency.
If we close() the socket holding such skbs, we never call sock_put()
and leak the count, sk, and skb.
TCP has the same problem, and commit e0c8bccd40 ("net: stream:
purge sk_error_queue in sk_stream_kill_queues()") tried to fix it
by calling skb_queue_purge() during close(). However, there is a
small chance that skb queued in a qdisc or device could be put
into the error queue after the skb_queue_purge() call.
In __skb_tstamp_tx(), the cloned skb should not have a reference
to the ubuf to remove the circular dependency, but skb_clone() does
not call skb_copy_ubufs() for zerocopy skb. So, we need to call
skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs().
[0]:
BUG: memory leak
unreferenced object 0xffff88800c6d2d00 (size 1152):
comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00 ................
02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace:
[<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024
[<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083
[<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline]
[<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245
[<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515
[<00000000b9b11231>] sock_create net/socket.c:1566 [inline]
[<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline]
[<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline]
[<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636
[<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline]
[<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline]
[<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647
[<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
[<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
BUG: memory leak
unreferenced object 0xffff888017633a00 (size 240):
comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff .........-m.....
backtrace:
[<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497
[<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline]
[<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596
[<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline]
[<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370
[<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037
[<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652
[<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253
[<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819
[<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline]
[<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734
[<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117
[<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline]
[<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline]
[<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125
[<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
[<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Fixes: b5947e5d1e ("udp: msg_zerocopy")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYCQAAKCRBZ7Krx/gZQ
64FdAQDZ2hTDyZEWPt486dWYPYpiKyaGFXSXDGo7wgP0fiwxXQEA/mROKb6JqYw6
27mZ9A7qluT8r3AfTTQ0D+Yse/dr4AM=
=GA9W
-----END PGP SIGNATURE-----
Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fget updates from Al Viro:
"fget() to fdget() conversions"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fuse_dev_ioctl(): switch to fdget()
cgroup_get_from_fd(): switch to fdget_raw()
bpf: switch to fdget_raw()
build_mount_idmapped(): switch to fdget()
kill the last remaining user of proc_ns_fget()
SVM-SEV: convert the rest of fget() uses to fdget() in there
convert sgx_set_attribute() to fdget()/fdput()
convert setns(2) to fdget()/fdput()
o MAINTAINERS files additions and changes.
o Fix hotplug warning in nohz code.
o Tick dependency changes by Zqiang.
o Lazy-RCU shrinker fixes by Zqiang.
o rcu-tasks stall reporting improvements by Neeraj.
o Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
name for robustness.
o Documentation Updates:
o Significant changes to srcu_struct size.
o Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
o rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
o Other misc changes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
+Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
=Rgbj
-----END PGP SIGNATURE-----
Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux
Pull RCU updates from Joel Fernandes:
- Updates and additions to MAINTAINERS files, with Boqun being added to
the RCU entry and Zqiang being added as an RCU reviewer.
I have also transitioned from reviewer to maintainer; however, Paul
will be taking over sending RCU pull-requests for the next merge
window.
- Resolution of hotplug warning in nohz code, achieved by fixing
cpu_is_hotpluggable() through interaction with the nohz subsystem.
Tick dependency modifications by Zqiang, focusing on fixing usage of
the TICK_DEP_BIT_RCU_EXP bitmask.
- Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
kernels, fixed by Zqiang.
- Improvements to rcu-tasks stall reporting by Neeraj.
- Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
increased robustness, affecting several components like mac802154,
drbd, vmw_vmci, tracing, and more.
A report by Eric Dumazet showed that the API could be unknowingly
used in an atomic context, so we'd rather make sure they know what
they're asking for by being explicit:
https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/
- Documentation updates, including corrections to spelling,
clarifications in comments, and improvements to the srcu_size_state
comments.
- Better srcu_struct cache locality for readers, by adjusting the size
of srcu_struct in support of SRCU usage by Christoph Hellwig.
- Teach lockdep to detect deadlocks between srcu_read_lock() vs
synchronize_srcu() contributed by Boqun.
Previously lockdep could not detect such deadlocks, now it can.
- Integration of rcutorture and rcu-related tools, targeted for v6.4
from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
module parameter, and more
- Miscellaneous changes, various code cleanups and comment improvements
* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
checkpatch: Error out if deprecated RCU API used
mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
rcu/trace: use strscpy() to instead of strncpy()
tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
...
ocelot_xmit_get_vlan_info() calls __skb_vlan_pop() as the most
appropriate helper I could find which strips away a VLAN header.
That's all I need it to do, but __skb_vlan_pop() has more logic, which
will become incompatible with the future revert of commit 6d1ccff627
("net: reset mac header in dev_start_xmit()").
Namely, it performs a sanity check on skb_mac_header(), which will stop
being set after the above revert, so it will return an error instead of
removing the VLAN tag.
ocelot_xmit_get_vlan_info() gets called in 2 circumstances:
(1) the port is under a VLAN-aware bridge and the bridge sends
VLAN-tagged packets
(2) the port is under a VLAN-aware bridge and somebody else (an 8021q
upper) sends VLAN-tagged packets (using a VID that isn't in the
bridge vlan tables)
In case (1), there is actually no bug to defend against, because
br_dev_xmit() calls skb_reset_mac_header() and things continue to work.
However, in case (2), illustrated using the commands below, it can be
seen that our intervention is needed, since __skb_vlan_pop() complains:
$ ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up
$ ip link set $eth master br0 && ip link set $eth up
$ ip link add link $eth name $eth.100 type vlan id 100 && ip link set $eth.100 up
$ ip addr add 192.168.100.1/24 dev $eth.100
I could fend off the checks in __skb_vlan_pop() with some
skb_mac_header_was_set() calls, but seeing how few callers of
__skb_vlan_pop() there are from TX paths, that seems rather
unproductive.
As an alternative solution, extract the bare minimum logic to strip a
VLAN header, and move it to a new helper named vlan_remove_tag(), close
to the definition of vlan_insert_tag(). Document it appropriately and
make ocelot_xmit_get_vlan_info() call this smaller helper instead.
Seeing that it doesn't appear illegal to test skb->protocol in the TX
path, I guess it would be a good for vlan_remove_tag() to also absorb
the vlan_set_encap_proto() function call.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use napi_threaded_poll() in order to reduce our softirq dependency.
We can add a followup of 821eba962d ("net: optimize napi_schedule_rps()")
to further remove the need of firing NET_RX_SOFTIRQ whenever
RPS/RFS are used.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we call skb_defer_free_flush() from napi_threaded_poll(),
we can avoid to raise IPI from skb_attempt_defer_free()
when the list becomes too big.
This allows napi_threaded_poll() to rely less on softirqs,
and lowers latency caused by a too big list.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We plan using skb_defer_free_flush() from napi_threaded_poll()
in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kfree_skb() can be called from hard irq handlers,
but skb_attempt_defer_free() is meant to be used
from process or BH contexts, and skb_defer_free_flush()
is meant to be called from BH contexts.
Not having to mask hard irq can save some cycles.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZELn8wAKCRDbK58LschI
g1khAQC1nmXPuKjM4EAfFK8Ysb3KoF8ADmpE97n+/HEDydCagwD/bX0+NABR75Nh
ueGcoU1TcfcbshDzrH0s+C95owZDZw4=
=BeZM
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-04-21
We've added 71 non-merge commits during the last 8 day(s) which contain
a total of 116 files changed, 13397 insertions(+), 8896 deletions(-).
The main changes are:
1) Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward,
from Florian Westphal.
2) Fix race between btf_put and btf_idr walk which caused a deadlock,
from Alexei Starovoitov.
3) Second big batch to migrate test_verifier unit tests into test_progs
for ease of readability and debugging, from Eduard Zingerman.
4) Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree, from Dave Marchevsky.
5) Migrate bpf_for(), bpf_for_each() and bpf_repeat() macros from BPF
selftests into libbpf-provided bpf_helpers.h header and improve
kfunc handling, from Andrii Nakryiko.
6) Support 64-bit pointers to kfuncs needed for archs like s390x,
from Ilya Leoshkevich.
7) Support BPF progs under getsockopt with a NULL optval,
from Stanislav Fomichev.
8) Improve verifier u32 scalar equality checking in order to enable
LLVM transformations which earlier had to be disabled specifically
for BPF backend, from Yonghong Song.
9) Extend bpftool's struct_ops object loading to support links,
from Kui-Feng Lee.
10) Add xsk selftest follow-up fixes for hugepage allocated umem,
from Magnus Karlsson.
11) Support BPF redirects from tc BPF to ifb devices,
from Daniel Borkmann.
12) Add BPF support for integer type when accessing variable length
arrays, from Feng Zhou.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (71 commits)
selftests/bpf: verifier/value_ptr_arith converted to inline assembly
selftests/bpf: verifier/value_illegal_alu converted to inline assembly
selftests/bpf: verifier/unpriv converted to inline assembly
selftests/bpf: verifier/subreg converted to inline assembly
selftests/bpf: verifier/spin_lock converted to inline assembly
selftests/bpf: verifier/sock converted to inline assembly
selftests/bpf: verifier/search_pruning converted to inline assembly
selftests/bpf: verifier/runtime_jit converted to inline assembly
selftests/bpf: verifier/regalloc converted to inline assembly
selftests/bpf: verifier/ref_tracking converted to inline assembly
selftests/bpf: verifier/map_ptr_mixing converted to inline assembly
selftests/bpf: verifier/map_in_map converted to inline assembly
selftests/bpf: verifier/lwt converted to inline assembly
selftests/bpf: verifier/loops1 converted to inline assembly
selftests/bpf: verifier/jeq_infer_not_null converted to inline assembly
selftests/bpf: verifier/direct_packet_access converted to inline assembly
selftests/bpf: verifier/d_path converted to inline assembly
selftests/bpf: verifier/ctx converted to inline assembly
selftests/bpf: verifier/btf_ctx_access converted to inline assembly
selftests/bpf: verifier/bpf_get_stack converted to inline assembly
...
====================
Link: https://lore.kernel.org/r/20230421211035.9111-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xfrm_alloc_dst() followed by xfrm4_dst_destroy(), without a
xfrm4_fill_dst() call in between, causes the following BUG:
BUG: spinlock bad magic on CPU#0, fbxhostapd/732
lock: 0x890b7668, .magic: 890b7668, .owner: <none>/-1, .owner_cpu: 0
CPU: 0 PID: 732 Comm: fbxhostapd Not tainted 6.3.0-rc6-next-20230414-00613-ge8de66369925-dirty #9
Hardware name: Marvell Kirkwood (Flattened Device Tree)
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x28/0x30
dump_stack_lvl from do_raw_spin_lock+0x20/0x80
do_raw_spin_lock from rt_del_uncached_list+0x30/0x64
rt_del_uncached_list from xfrm4_dst_destroy+0x3c/0xbc
xfrm4_dst_destroy from dst_destroy+0x5c/0xb0
dst_destroy from rcu_process_callbacks+0xc4/0xec
rcu_process_callbacks from __do_softirq+0xb4/0x22c
__do_softirq from call_with_stack+0x1c/0x24
call_with_stack from do_softirq+0x60/0x6c
do_softirq from __local_bh_enable_ip+0xa0/0xcc
Patch "net: dst: Prevent false sharing vs. dst_entry:: __refcnt" moved
rt_uncached and rt_uncached_list fields from rtable struct to dst
struct, so they are more zeroed by memset_after(xdst, 0, u.dst) in
xfrm_alloc_dst().
Note that rt_uncached (list_head) was never properly initialized at
alloc time, but xfrm[46]_dst_destroy() is written in such a way that
it was not an issue thanks to the memset:
if (xdst->u.rt.dst.rt_uncached_list)
rt_del_uncached_list(&xdst->u.rt);
The route code does it the other way around: rt_uncached_list is
assumed to be valid IIF rt_uncached list_head is not empty:
void rt_del_uncached_list(struct rtable *rt)
{
if (!list_empty(&rt->dst.rt_uncached)) {
struct uncached_list *ul = rt->dst.rt_uncached_list;
spin_lock_bh(&ul->lock);
list_del_init(&rt->dst.rt_uncached);
spin_unlock_bh(&ul->lock);
}
}
This patch adds mandatory rt_uncached list_head initialization in
generic dst_init(), and adapt xfrm[46]_dst_destroy logic to match the
rest of the code.
Fixes: d288a162dd ("net: dst: Prevent false sharing vs. dst_entry:: __refcnt")
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202304162125.18b7bcdd-oliver.sang@intel.com
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
CC: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Link: https://lore.kernel.org/r/20230420182508.2417582-1-mbizon@freebox.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs
that will be invoked via the NF_HOOK() points in the ip stack.
Invocation incurs an indirect call. This is not a necessity: Its
possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the
program invocation with the same method already done for xdp progs.
This isn't done here to keep the size of this chunk down.
Verifier restricts verdicts to either DROP or ACCEPT.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a new bridge port attribute that allows user space to enable
per-{Port, VLAN} neighbor suppression. Example:
# bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
false
# bridge link set dev swp1 neigh_vlan_suppress on
# bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
true
# bridge link set dev swp1 neigh_vlan_suppress off
# bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
false
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend drop reasons to make them usable by subsystems
other than core by reserving the high 16 bits for a
new subsystem ID, of which 0 of course is used for the
existing reasons immediately.
To still be able to have string reasons, restructure
that code a bit to make the loopup under RCU, the only
user of this (right now) is drop_monitor.
Link: https://lore.kernel.org/netdev/00659771ed54353f92027702c5bbb84702da62ce.camel@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
lookups by descriptor are better off closer to syscall surface...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__kfree_skb_defer() uses the old naming where "defer" meant
slab bulk free/alloc APIs. In the meantime we also made
__kfree_skb_defer() feed the per-NAPI skb cache, which
implies bulk APIs. So take away the 'defer' and add 'napi'.
While at it add a drop reason. This only matters on the
tx_action path, if the skb has a frag_list. But getting
rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net
benefit so why not.
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230420020005.815854-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jesper points out that we must prevent recycling into cache
after page_pool_destroy() is called, because page_pool_destroy()
is not synchronized with recycling (some pages may still be
outstanding when destroy() gets called).
I assumed this will not happen because NAPI can't be scheduled
if its page pool is being destroyed. But I missed the fact that
NAPI may get reused. For instance when user changes ring configuration
driver may allocate a new page pool, stop NAPI, swap, start NAPI,
and then destroy the old pool. The NAPI is running so old page
pool will think it can recycle to the cache, but the consumer
at that point is the destroy() path, not NAPI.
To avoid extra synchronization let the drivers do "unlinking"
during the "swap" stage while NAPI is indeed disabled.
Fixes: 8c48eea3ad ("page_pool: allow caching from safely localized NAPI")
Reported-by: Jesper Dangaard Brouer <jbrouer@redhat.com>
Link: https://lore.kernel.org/all/e8df2654-6a5b-3c92-489d-2fe5e444135f@redhat.com/
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/20230419182006.719923-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit c519fe9a4f ("bnxt: add dma mapping attributes") added
DMA_ATTR_WEAK_ORDERING to DMA attrs on bnxt. It has since spread
to a few more drivers (possibly as a copy'n'paste).
DMA_ATTR_WEAK_ORDERING only seems to matter on Sparc and PowerPC/cell,
the rarity of these platforms is likely why we never bothered adding
the attribute in the page pool, even though it should be safe to add.
To make the page pool migration in drivers which set this flag less
of a risk (of regressing the precious sparc database workloads or
whatever needed this) let's add DMA_ATTR_WEAK_ORDERING on all
page pool DMA mappings.
We could make this a driver opt-in but frankly I don't think it's
worth complicating the API. I can't think of a reason why device
accesses to packet memory would have to be ordered.
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/20230417152805.331865-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SCTP is not universally deployed, allow hiding its bit
from the skb.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Datacenter kernel builds will very likely not include WIRELESS,
so let them shave 2 bits off the skb by hiding the wifi fields.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some use-cases where it is desirable to use bpf_redirect()
in combination with ifb device, which currently is not supported, for
example, around filtering inbound traffic with BPF to then push it to
ifb which holds the qdisc for shaping in contrast to doing that on the
egress device.
Toke mentions the following case related to OpenWrt:
Because there's not always a single egress on the other side. These are
mainly home routers, which tend to have one or more WiFi devices bridged
to one or more ethernet ports on the LAN side, and a single upstream WAN
port. And the objective is to control the total amount of traffic going
over the WAN link (in both directions), to deal with bufferbloat in the
ISP network (which is sadly still all too prevalent).
In this setup, the traffic can be split arbitrarily between the links
on the LAN side, and the only "single bottleneck" is the WAN link. So we
install both egress and ingress shapers on this, configured to something
like 95-98% of the true link bandwidth, thus moving the queues into the
qdisc layer in the router. It's usually necessary to set the ingress
bandwidth shaper a bit lower than the egress due to being "downstream"
of the bottleneck link, but it does work surprisingly well.
We usually use something like a matchall filter to put all ingress
traffic on the ifb, so doing the redirect from BPF has not been an
immediate requirement thus far. However, it does seem a bit odd that
this is not possible, and we do have a BPF-based filter that layers on
top of this kind of setup, which currently uses u32 as the ingress
filter and so it could presumably be improved to use BPF instead if
that was available.
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reported-by: Yafang Shao <laoar.shao@gmail.com>
Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README
Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk
Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.
Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).
If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.
Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.
With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).
The CPU use of relevant functions decreases as expected:
page_pool_refill_alloc_cache 1.17% -> 0%
_raw_spin_lock 2.41% -> 0.98%
Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.
The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We maintain a NAPI-local cache of skbs which is fed by napi_consume_skb().
Going forward we will also try to cache head and data pages.
Plumb the "are we in a normal NAPI context" information thru
deeper into the freeing path, up to skb_release_data() and
skb_free_head()/skb_pp_recycle(). The "not normal NAPI context"
comes from netpoll which passes budget of 0 to try to reap
the Tx completions but not perform any Rx.
Use "bool napi_safe" rather than bare "int budget",
the further we get from NAPI the more confusing the budget
argument may seem (particularly whether 0 or MAX is the
correct value to pass in when not in NAPI).
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 1f466e1f15 ("net: cleanly handle kernel vs user
buffers for ->msg_control"), pointers to user buffers should be
stored in struct msghdr::msg_control_user, instead of the
msg_control field. Most users of msg_control have already been
converted (where user buffers are involved), but not all of them.
This patch attempts to address the remaining cases. An exception is
made for null checks, as it should be safe to use msg_control
unconditionally for that purpose.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZDhSiwAKCRDbK58LschI
g8cbAQCH4xrquOeDmYyGXFQGchHZAIj++tKg8ABU4+hYeJtrlwEA6D4W6wjoSZRk
mLSptZ9qro8yZA86BvyPvlBT1h9ELQA=
=StAc
-----END PGP SIGNATURE-----
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-04-13
We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).
The main changes are:
1) Rework BPF verifier log behavior and implement it as a rotating log
by default with the option to retain old-style fixed log behavior,
from Andrii Nakryiko.
2) Adds support for using {FOU,GUE} encap with an ipip device operating
in collect_md mode and add a set of BPF kfuncs for controlling encap
params, from Christian Ehrig.
3) Allow BPF programs to detect at load time whether a particular kfunc
exists or not, and also add support for this in light skeleton,
from Alexei Starovoitov.
4) Optimize hashmap lookups when key size is multiple of 4,
from Anton Protopopov.
5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps, from David Vernet.
6) Add support for stashing local BPF kptr into a map value via
bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
for new cgroups, from Dave Marchevsky.
7) Fix BTF handling of is_int_ptr to skip modifiers to work around
tracing issues where a program cannot be attached, from Feng Zhou.
8) Migrate a big portion of test_verifier unit tests over to
test_progs -a verifier_* via inline asm to ease {read,debug}ability,
from Eduard Zingerman.
9) Several updates to the instruction-set.rst documentation
which is subject to future IETF standardization
(https://lwn.net/Articles/926882/), from Dave Thaler.
10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
known bits information propagation, from Daniel Borkmann.
11) Add skb bitfield compaction work related to BPF with the overall goal
to make more of the sk_buff bits optional, from Jakub Kicinski.
12) BPF selftest cleanups for build id extraction which stand on its own
from the upcoming integration work of build id into struct file object,
from Jiri Olsa.
13) Add fixes and optimizations for xsk descriptor validation and several
selftest improvements for xsk sockets, from Kal Conley.
14) Add BPF links for struct_ops and enable switching implementations
of BPF TCP cong-ctls under a given name by replacing backing
struct_ops map, from Kui-Feng Lee.
15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
offset stack read as earlier Spectre checks cover this,
from Luis Gerhorst.
16) Fix issues in copy_from_user_nofault() for BPF and other tracers
to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
and Alexei Starovoitov.
17) Add --json-summary option to test_progs in order for CI tooling to
ease parsing of test results, from Manu Bretelle.
18) Batch of improvements and refactoring to prep for upcoming
bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
from Martin KaFai Lau.
19) Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations,
from Quentin Monnet.
20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
the module name from BTF of the target and searching kallsyms of
the correct module, from Viktor Malik.
21) Improve BPF verifier handling of '<const> <cond> <non_const>'
to better detect whether in particular jmp32 branches are taken,
from Yonghong Song.
22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
A built-in cc or one from a kernel module is already able to write
to app_limited, from Yixin Shen.
Conflicts:
Documentation/bpf/bpf_devel_QA.rst
b7abcd9c65 ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
0f10f647f4 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/
include/net/ip_tunnels.h
bc9d003dc4 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
ac931d4cde ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/
net/bpf/test_run.c
e5995bc7e2 ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
294635a816 ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================
Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The RSS hash type specifies what portion of packet data NIC hardware used
when calculating RSS hash value. The RSS types are focused on Internet
traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often get hash
value zero and no RSS type. For L3 focused on IPv4 vs. IPv6, and L4
primarily TCP vs UDP, but some hardware supports SCTP.
Hardware RSS types are differently encoded for each hardware NIC. Most
hardware represent RSS hash type as a number. Determining L3 vs L4 often
requires a mapping table as there often isn't a pattern or sorting
according to ISO layer.
The patch introduce a XDP RSS hash type (enum xdp_rss_hash_type) that
contains both BITs for the L3/L4 types, and combinations to be used by
drivers for their mapping tables. The enum xdp_rss_type_bits get exposed
to BPF via BTF, and it is up to the BPF-programmer to match using these
defines.
This proposal change the kfunc API bpf_xdp_metadata_rx_hash() adding
a pointer value argument for provide the RSS hash type.
Change signature for all xmo_rx_hash calls in drivers to make it compile.
The RSS type implementations for each driver comes as separate patches.
Fixes: 3d76a4d3d4 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 1effe8ca4e ("skbuff: fix coalescing for page_pool fragment
recycling") allowed coalescing to proceed with non page pool page and page
pool page when @from is cloned, i.e.
to->pp_recycle --> false
from->pp_recycle --> true
skb_cloned(from) --> true
However, it actually requires skb_cloned(@from) to hold true until
coalescing finishes in this situation. If the other cloned SKB is
released while the merging is in process, from_shinfo->nr_frags will be
set to 0 toward the end of the function, causing the increment of frag
page _refcount to be unexpectedly skipped resulting in inconsistent
reference counts. Later when SKB(@to) is released, it frees the page
directly even though the page pool page is still in use, leading to
use-after-free or double-free errors. So it should be prohibited.
The double-free error message below prompted us to investigate:
BUG: Bad page state in process swapper/1 pfn:0e0d1
page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
index:0x2 pfn:0xe0d1
flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
page dumped because: nonzero _refcount
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G E 6.2.0+
Call Trace:
<IRQ>
dump_stack_lvl+0x32/0x50
bad_page+0x69/0xf0
free_pcp_prepare+0x260/0x2f0
free_unref_page+0x20/0x1c0
skb_release_data+0x10b/0x1a0
napi_consume_skb+0x56/0x150
net_rx_action+0xf0/0x350
? __napi_schedule+0x79/0x90
__do_softirq+0xc8/0x2b1
__irq_exit_rcu+0xb9/0xf0
common_interrupt+0x82/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40
RIP: 0010:default_idle+0xb/0x20
Fixes: 53e0961da1 ("page_pool: add frag page recycling support in page pool")
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230413090353.14448-1-liangchen.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The commits referenced below allows userspace to use the NLM_F_ECHO flag
for RTM_NEW/DELLINK operations to receive unicast notifications for the
affected link. Prior to these changes, applications may have relied on
multicast notifications to learn the same information without specifying
the NLM_F_ECHO flag.
For such applications, the mentioned commits changed the behavior for
requests not using NLM_F_ECHO. Multicast notifications are still received,
but now use the portid of the requester and the sequence number of the
request instead of zero values used previously. For the application, this
message may be unexpected and likely handled as a response to the
NLM_F_ACKed request, especially if it uses the same socket to handle
requests and notifications.
To fix existing applications relying on the old notification behavior,
set the portid and sequence number in the notification only if the
request included the NLM_F_ECHO flag. This restores the old behavior
for applications not using it, but allows unicasted notifications for
others.
Fixes: f3a63cce1b ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link")
Fixes: d88e136cab ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create")
Signed-off-by: Martin Willi <martin@strongswan.org>
Acked-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20230411074319.24133-1-martin@strongswan.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When huang uses sched_switch tracepoint, the tracepoint
does only one thing in the mounted ebpf program, which
deletes the fixed elements in sockhash ([0])
It seems that elements in sockhash are rarely actively
deleted by users or ebpf program. Therefore, we do not
pay much attention to their deletion. Compared with hash
maps, sockhash only provides spin_lock_bh protection.
This causes it to appear to have self-locking behavior
in the interrupt context.
[0]:https://lore.kernel.org/all/CABcoxUayum5oOqFMMqAeWuS8+EzojquSOSyDA3J_2omY=2EeAg@mail.gmail.com/
Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Xin Liu <liuxin350@huawei.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230406122622.109978-1-liuxin350@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There was a sort of rush surrounding commit 88c0a6b503 ("net: create a
netdev notifier for DSA to reject PTP on DSA master"), due to a desire
to convert DSA's attempt to deny TX timestamping on a DSA master to
something that doesn't block the kernel-wide API conversion from
ndo_eth_ioctl() to ndo_hwtstamp_set().
What was required was a mechanism that did not depend on ndo_eth_ioctl(),
and what was provided was a mechanism that did not depend on
ndo_eth_ioctl(), while at the same time introducing something that
wasn't absolutely necessary - a new netdev notifier.
There have been objections from Jakub Kicinski that using notifiers in
general when they are not absolutely necessary creates complications to
the control flow and difficulties to maintainers who look at the code.
So there is a desire to not use notifiers.
In addition to that, the notifier chain gets called even if there is no
DSA in the system and no one is interested in applying any restriction.
Take the model of udp_tunnel_nic_ops and introduce a stub mechanism,
through which net/core/dev_ioctl.c can call into DSA even when
CONFIG_NET_DSA=m.
Compared to the code that existed prior to the notifier conversion, aka
what was added in commits:
- 4cfab35667 ("net: dsa: Add wrappers for overloaded ndo_ops")
- 3369afba1e ("net: Call into DSA netdevice_ops wrappers")
this is different because we are not overloading any struct
net_device_ops of the DSA master anymore, but rather, we are exposing a
rather specific functionality which is orthogonal to which API is used
to enable it - ndo_eth_ioctl() or ndo_hwtstamp_set().
Also, what is similar is that both approaches use function pointers to
get from built-in code to DSA.
There is no point in replicating the function pointers towards
__dsa_master_hwtstamp_validate() once for every CPU port (dev->dsa_ptr).
Instead, it is sufficient to introduce a singleton struct dsa_stubs,
built into the kernel, which contains a single function pointer to
__dsa_master_hwtstamp_validate().
I find this approach preferable to what we had originally, because
dev->dsa_ptr->netdev_ops->ndo_do_ioctl() used to require going through
struct dsa_port (dev->dsa_ptr), and so, this was incompatible with any
attempts to add any data encapsulation and hide DSA data structures from
the outside world.
Link: https://lore.kernel.org/netdev/20230403083019.120b72fd@kernel.org/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 217f697436 ("net: busy-poll: allow preemption
in sk_busy_loop()"), a thread willing to use busy polling
is not hurting other threads anymore in a non preempt kernel.
I think it is safe to remove CAP_NET_ADMIN check.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230406194634.1804691-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
assume the following setup on a single machine:
1. An openvswitch instance with one bridge and default flows
2. two network namespaces "server" and "client"
3. two ovs interfaces "server" and "client" on the bridge
4. for each ovs interface a veth pair with a matching name and 32 rx and
tx queues
5. move the ends of the veth pairs to the respective network namespaces
6. assign ip addresses to each of the veth ends in the namespaces (needs
to be the same subnet)
7. start some http server on the server network namespace
8. test if a client in the client namespace can reach the http server
when following the actions below the host has a chance of getting a cpu
stuck in a infinite loop:
1. send a large amount of parallel requests to the http server (around
3000 curls should work)
2. in parallel delete the network namespace (do not delete interfaces or
stop the server, just kill the namespace)
there is a low chance that this will cause the below kernel cpu stuck
message. If this does not happen just retry.
Below there is also the output of bpftrace for the functions mentioned
in the output.
The series of events happening here is:
1. the network namespace is deleted calling
`unregister_netdevice_many_notify` somewhere in the process
2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and
then runs `synchronize_net`
3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER`
4. this is then handled by `dp_device_event` which calls
`ovs_netdev_detach_dev` (if a vport is found, which is the case for
the veth interface attached to ovs)
5. this removes the rx_handlers of the device but does not prevent
packages to be sent to the device
6. `dp_device_event` then queues the vport deletion to work in
background as a ovs_lock is needed that we do not hold in the
unregistration path
7. `unregister_netdevice_many_notify` continues to call
`netdev_unregister_kobject` which sets `real_num_tx_queues` to 0
8. port deletion continues (but details are not relevant for this issue)
9. at some future point the background task deletes the vport
If after 7. but before 9. a packet is send to the ovs vport (which is
not deleted at this point in time) which forwards it to the
`dev_queue_xmit` flow even though the device is unregistering.
In `skb_tx_hash` (which is called in the `dev_queue_xmit`) path there is
a while loop (if the packet has a rx_queue recorded) that is infinite if
`dev->real_num_tx_queues` is zero.
To prevent this from happening we update `do_output` to handle devices
without carrier the same as if the device is not found (which would
be the code path after 9. is done).
Additionally we now produce a warning in `skb_tx_hash` if we will hit
the infinite loop.
bpftrace (first word is function name):
__dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2
ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2
netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2
netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
__dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024
ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604
stuck message:
watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [curl:1929279]
Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover
CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu
Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:netdev_pick_tx+0xf1/0x320
Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01
RSP: 0018:ffffb78b40298820 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000
RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000
R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000
FS: 00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0
Call Trace:
<IRQ>
netdev_core_pick_tx+0xa4/0xb0
__dev_queue_xmit+0xf8/0x510
? __bpf_prog_exit+0x1e/0x30
dev_queue_xmit+0x10/0x20
ovs_vport_send+0xad/0x170 [openvswitch]
do_output+0x59/0x180 [openvswitch]
do_execute_actions+0xa80/0xaa0 [openvswitch]
? kfree+0x1/0x250
? kfree+0x1/0x250
? kprobe_perf_func+0x4f/0x2b0
? flow_lookup.constprop.0+0x5c/0x110 [openvswitch]
ovs_execute_actions+0x4c/0x120 [openvswitch]
ovs_dp_process_packet+0xa1/0x200 [openvswitch]
? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch]
? ovs_ct_fill_key+0x1d/0x30 [openvswitch]
? ovs_flow_key_extract+0x2db/0x350 [openvswitch]
ovs_vport_receive+0x77/0xd0 [openvswitch]
? __htab_map_lookup_elem+0x4e/0x60
? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714
? trace_call_bpf+0xc8/0x150
? kfree+0x1/0x250
? kfree+0x1/0x250
? kprobe_perf_func+0x4f/0x2b0
? kprobe_perf_func+0x4f/0x2b0
? __mod_memcg_lruvec_state+0x63/0xe0
netdev_port_receive+0xc4/0x180 [openvswitch]
? netdev_port_receive+0x180/0x180 [openvswitch]
netdev_frame_hook+0x1f/0x40 [openvswitch]
__netif_receive_skb_core.constprop.0+0x23d/0xf00
__netif_receive_skb_one_core+0x3f/0xa0
__netif_receive_skb+0x15/0x60
process_backlog+0x9e/0x170
__napi_poll+0x33/0x180
net_rx_action+0x126/0x280
? ttwu_do_activate+0x72/0xf0
__do_softirq+0xd9/0x2e7
? rcu_report_exp_cpu_mult+0x1b0/0x1b0
do_softirq+0x7d/0xb0
</IRQ>
<TASK>
__local_bh_enable_ip+0x54/0x60
ip_finish_output2+0x191/0x460
__ip_finish_output+0xb7/0x180
ip_finish_output+0x2e/0xc0
ip_output+0x78/0x100
? __ip_finish_output+0x180/0x180
ip_local_out+0x5e/0x70
__ip_queue_xmit+0x184/0x440
? tcp_syn_options+0x1f9/0x300
ip_queue_xmit+0x15/0x20
__tcp_transmit_skb+0x910/0x9c0
? __mod_memcg_state+0x44/0xa0
tcp_connect+0x437/0x4e0
? ktime_get_with_offset+0x60/0xf0
tcp_v4_connect+0x436/0x530
__inet_stream_connect+0xd4/0x3a0
? kprobe_perf_func+0x4f/0x2b0
? aa_sk_perm+0x43/0x1c0
inet_stream_connect+0x3b/0x60
__sys_connect_file+0x63/0x70
__sys_connect+0xa6/0xd0
? setfl+0x108/0x170
? do_fcntl+0xe8/0x5a0
__x64_sys_connect+0x18/0x20
do_syscall_64+0x5c/0xc0
? __x64_sys_fcntl+0xa9/0xd0
? exit_to_user_mode_prepare+0x37/0xb0
? syscall_exit_to_user_mode+0x27/0x50
? do_syscall_64+0x69/0xc0
? __sys_setsockopt+0xea/0x1e0
? exit_to_user_mode_prepare+0x37/0xb0
? syscall_exit_to_user_mode+0x27/0x50
? __x64_sys_setsockopt+0x1f/0x30
? do_syscall_64+0x69/0xc0
? irqentry_exit+0x1d/0x30
? exc_page_fault+0x89/0x170
entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f7b8101c6a7
Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89
RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7
RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005
RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e
R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410
R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000
</TASK>
Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz>
Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The kfree_rcu() and kvfree_rcu() macros' single-argument forms are
deprecated. Therefore switch to the new kfree_rcu_mightsleep() and
kvfree_rcu_mightsleep() variants. The goal is to avoid accidental use
of the single-argument forms, which can introduce functionality bugs in
atomic contexts and latency bugs in non-atomic contexts.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
bpf_[sk|inode|task|cgrp]_storage_[get|delete]() and bpf_get_socket_cookie() helpers
perform run-time check that sk|inode|task|cgrp pointer != NULL.
Teach verifier about this fact and allow bpf programs to pass
PTR_TO_BTF_ID | PTR_MAYBE_NULL into such helpers.
It will be used in the subsequent patch that will do
bpf_sk_storage_get(.., skb->sk, ...);
Even when 'skb' pointer is trusted the 'sk' pointer may be NULL.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-5-alexei.starovoitov@gmail.com
Remove duplicated if (atype == BPF_READ) btf_struct_access() from
btf_struct_access() callback and invoke it only for writes. This is
possible to do because currently btf_struct_access() custom callback
always delegates to generic btf_struct_access() helper for BPF_READ
accesses.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-2-alexei.starovoitov@gmail.com
The fact that PTP 2-step TX timestamping is broken on DSA switches if
the master also timestamps the same packets is documented by commit
f685e609a3 ("net: dsa: Deny PTP on master if switch supports it").
We attempt to help the users avoid shooting themselves in the foot by
making DSA reject the timestamping ioctls on an interface that is a DSA
master, and the switch tree beneath it contains switches which are aware
of PTP.
The only problem is that there isn't an established way of intercepting
ndo_eth_ioctl calls, so DSA creates avoidable burden upon the network
stack by creating a struct dsa_netdevice_ops with overlaid function
pointers that are manually checked from the relevant call sites. There
used to be 2 such dsa_netdevice_ops, but now, ndo_eth_ioctl is the only
one left.
There is an ongoing effort to migrate driver-visible hardware timestamping
control from the ndo_eth_ioctl() based API to a new ndo_hwtstamp_set()
model, but DSA actively prevents that migration, since dsa_master_ioctl()
is currently coded to manually call the master's legacy ndo_eth_ioctl(),
and so, whenever a network device driver would be converted to the new
API, DSA's restrictions would be circumvented, because any device could
be used as a DSA master.
The established way for unrelated modules to react on a net device event
is via netdevice notifiers. So we create a new notifier which gets
called whenever there is an attempt to change hardware timestamping
settings on a device.
Finally, there is another reason why a netdev notifier will be a good
idea, besides strictly DSA, and this has to do with PHY timestamping.
With ndo_eth_ioctl(), all MAC drivers must manually call
phy_has_hwtstamp() before deciding whether to act upon SIOCSHWTSTAMP,
otherwise they must pass this ioctl to the PHY driver via
phy_mii_ioctl().
With the new ndo_hwtstamp_set() API, it will be desirable to simply not
make any calls into the MAC device driver when timestamping should be
performed at the PHY level.
But there exist drivers, such as the lan966x switch, which need to
install packet traps for PTP regardless of whether they are the layer
that provides the hardware timestamps, or the PHY is. That would be
impossible to support with the new API.
The proposal there, too, is to introduce a netdev notifier which acts as
a better cue for switching drivers to add or remove PTP packet traps,
than ndo_hwtstamp_set(). The one introduced here "almost" works there as
well, except for the fact that packet traps should only be installed if
the PHY driver succeeded to enable hardware timestamping, whereas here,
we need to deny hardware timestamping on the DSA master before it
actually gets enabled. This is why this notifier is called "PRE_", and
the notifier that would get used for PHY timestamping and packet traps
would be called NETDEV_CHANGE_HWTSTAMP. This isn't a new concept, for
example NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER do the same thing.
In expectation of future netlink UAPI, we also pass a non-NULL extack
pointer to the netdev notifier, and we make DSA populate it with an
informative reason for the rejection. To avoid making it go to waste, we
make the ioctl-based dev_set_hwtstamp() create a fake extack and print
the message to the kernel log.
Link: https://lore.kernel.org/netdev/20230401191215.tvveoi3lkawgg6g4@skbuf/
Link: https://lore.kernel.org/netdev/20230310164451.ls7bbs6pdzs4m6pw@skbuf/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski suggested that we may want to add new UAPI for
controlling hardware timestamping through netlink in the future, and in
that case, we will be limited to the struct hwtstamp_config that is
currently passed in fixed binary format through the SIOCGHWTSTAMP and
SIOCSHWTSTAMP ioctls. It would be good if new kernel code already
started operating on an extensible kernel variant of that structure,
similar in concept to struct kernel_ethtool_coalesce vs struct
ethtool_coalesce.
Since struct hwtstamp_config is in include/uapi/linux/net_tstamp.h, here
we introduce include/linux/net_tstamp.h which shadows that other header,
but also includes it, so that existing includers of this header work as
before. In addition to that, we add the definition for the kernel-only
structure, and a helper which translates all fields by manual copying.
I am doing a manual copy in order to not force the alignment (or type)
of the fields of struct kernel_hwtstamp_config to be the same as of
struct hwtstamp_config, even though now, they are the same.
Link: https://lore.kernel.org/netdev/20230330223519.36ce7d23@kernel.org/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel will want to start using the more meaningful struct
hwtstamp_config pointer in more places, so move the copy_from_user() at
the beginning of dev_set_hwtstamp() in order to get to that, and pass
this argument to net_hwtstamp_validate().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA does not want to intercept all ioctls handled by dev_eth_ioctl(),
only SIOCSHWTSTAMP. This can be seen from commit f685e609a3 ("net:
dsa: Deny PTP on master if switch supports it"). However, the way in
which the dsa_ndo_eth_ioctl() is called would suggest otherwise.
Split the handling of SIOCSHWTSTAMP and SIOCGHWTSTAMP ioctls into
separate case statements of dev_ifsioc(), and make each one call its own
sub-function. This also removes the dsa_ndo_eth_ioctl() call from
dev_eth_ioctl(), which from now on exclusively handles PHY ioctls.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the expression "x == 0 || x != -95", the term "x == 0" does not
change the expression's logical value, because 0 != -95, and so,
if x is 0, the expression would still be true by virtue of the second
term. If x is non-zero, the expression depends on the truth value of
the second term anyway. As such, the first term is redundant and can
be deleted.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "switch (cmd)" block from dev_ifsioc() gained a bit too much
unnecessary manual handling of "cmd" in the "default" case, starting
with the private ioctls.
Clean that up by using the "ellipsis" gcc extension, adding separate
cases for the rest of the ioctls, and letting the default case only
return -EINVAL.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0db3dc73f7 ("[NETPOLL]: tx lock deadlock fix") narrowed
down the region under netif_tx_trylock() inside netpoll_send_skb().
(At that point in time netif_tx_trylock() would lock all queues of
the device.) Taking the tx lock was problematic because driver's
cleanup method may take the same lock. So the change made us hold
the xmit lock only around xmit, and expected the driver to take
care of locking within ->ndo_poll_controller().
Unfortunately this only works if netpoll isn't itself called with
the xmit lock already held. Netpoll code is careful and uses
trylock(). The drivers, however, may be using plain lock().
Printing while holding the xmit lock is going to result in rare
deadlocks.
Luckily we record the xmit lock owners, so we can scan all the queues,
the same way we scan NAPI owners. If any of the xmit locks is held
by the local CPU we better not attempt any polling.
It would be nice if we could narrow down the check to only the NAPIs
and the queue we're trying to use. I don't see a way to do that now.
Reported-by: Roman Gushchin <roman.gushchin@linux.dev>
Fixes: 0db3dc73f7 ("[NETPOLL]: tx lock deadlock fix")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
____napi_schedule() adds a napi into current cpu softnet_data poll_list,
then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
from net_rx_action(), because we can process poll_list from this point,
without going to full softirq loop.
This needs a change in net_rx_action() to make sure we restart
its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
being raised.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Based on initial patch from Jason Xing.
Idea is to not raise NET_RX_SOFTIRQ from napi_schedule_rps()
when we queued a packet into another cpu backlog.
We can do this only in the context of us being called indirectly
from net_rx_action(), to have the guarantee our rps_ipi_list
will be processed before we exit from net_rx_action().
Link: https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We want to make two optimizations in napi_schedule_rps() and
____napi_schedule() which require to know if these helpers are
called from net_rx_action(), instead of being called from
other contexts.
sd.in_net_rx_action is only read/written by the owning cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
napi_schedule_rps() return value is ignored, remove it.
Change the comment to clarify the intent.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Under high contention dst_entry::__refcnt becomes a significant bottleneck.
atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
high retry rates on contention.
Switch the reference count to rcuref_t which results in a significant
performance gain. Rename the reference count member to __rcuref to reflect
the change.
The gain depends on the micro-architecture and the number of concurrent
operations and has been measured in the range of +25% to +130% with a
localhost memtier/memcached benchmark which amplifies the problem
massively.
Running the memtier/memcached benchmark over a real (1Gb) network
connection the conversion on top of the false sharing fix for struct
dst_entry::__refcnt results in a total gain in the 2%-5% range over the
upstream baseline.
Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de
Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This fixes the following warning when compiled with GCC 12.2.0 and W=1.
net/core/dev_ioctl.c:475: warning: Function parameter or member 'data'
not described in 'dev_ioctl'
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch uses bpf_mem_alloc for the task and cgroup local storage that
the bpf prog can easily get a hold of the storage owner's PTR_TO_BTF_ID.
eg. bpf_get_current_task_btf() can be used in some of the kmalloc code
path which will cause deadlock/recursion. bpf_mem_cache_alloc is
deadlock free and will solve a legit use case in [1].
For sk storage, its batch creation benchmark shows a few percent
regression when the sk create/destroy batch size is larger than 32.
The sk creation/destruction happens much more often and
depends on external traffic. Considering it is hypothetical
to be able to cause deadlock with sk storage, it can cross
the bridge to use bpf_mem_alloc till a legit (ie. useful)
use case comes up.
For inode storage, bpf_local_storage_destroy() is called before
waiting for a rcu gp and its memory cannot be reused immediately.
inode stays with kmalloc/kfree after the rcu [or tasks_trace] gp.
A 'bool bpf_ma' argument is added to bpf_local_storage_map_alloc().
Only task and cgroup storage have 'bpf_ma == true' which
means to use bpf_mem_cache_alloc/free(). This patch only changes
selem to use bpf_mem_alloc for task and cgroup. The next patch
will change the local_storage to use bpf_mem_alloc also for
task and cgroup.
Here is some more details on the changes:
* memory allocation:
After bpf_mem_cache_alloc(), the SDATA(selem)->data is zero-ed because
bpf_mem_cache_alloc() could return a reused selem. It is to keep
the existing bpf_map_kzalloc() behavior. Only SDATA(selem)->data
is zero-ed. SDATA(selem)->data is the visible part to the bpf prog.
No need to use zero_map_value() to do the zeroing because
bpf_selem_free(..., reuse_now = true) ensures no bpf prog is using
the selem before returning the selem through bpf_mem_cache_free().
For the internal fields of selem, they will be initialized when
linking to the new smap and the new local_storage.
When 'bpf_ma == false', nothing changes in this patch. It will
stay with the bpf_map_kzalloc().
* memory free:
The bpf_selem_free() and bpf_selem_free_rcu() are modified to handle
the bpf_ma == true case.
For the common selem free path where its owner is also being destroyed,
the mem is freed in bpf_local_storage_destroy(), the owner (task
and cgroup) has gone through a rcu gp. The memory can be reused
immediately, so bpf_local_storage_destroy() will call
bpf_selem_free(..., reuse_now = true) which will do
bpf_mem_cache_free() for immediate reuse consideration.
An exception is the delete elem code path. The delete elem code path
is called from the helper bpf_*_storage_delete() and the syscall
bpf_map_delete_elem(). This path is an unusual case for local
storage because the common use case is to have the local storage
staying with its owner life time so that the bpf prog and the user
space does not have to monitor the owner's destruction. For the delete
elem path, the selem cannot be reused immediately because there could
be bpf prog using it. It will call bpf_selem_free(..., reuse_now = false)
and it will wait for a rcu tasks trace gp before freeing the elem. The
rcu callback is changed to do bpf_mem_cache_raw_free() instead of kfree().
When 'bpf_ma == false', it should be the same as before.
__bpf_selem_free() is added to do the kfree_rcu and call_tasks_trace_rcu().
A few words on the 'reuse_now == true'. When 'reuse_now == true',
it is still racing with bpf_local_storage_map_free which is under rcu
protection, so it still needs to wait for a rcu gp instead of kfree().
Otherwise, the selem may be reused by slab for a totally different struct
while the bpf_local_storage_map_free() is still using it (as a
rcu reader). For the inode case, there may be other rcu readers also.
In short, when bpf_ma == false and reuse_now == true => vanilla rcu.
[1]: https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@kernel.org/
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230322215246.1675516-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZBzSGQAKCRDbK58LschI
g+dhAP95enbrlwaQ+9aoqrU+GqCq+uo4SkaqnUtq6GSvRNiVBQD8C6iZxrAjyXnm
1wRr3JN/HszPBzgjl3HvDc9y69I/PAI=
=8JwR
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-03-23
We've added 8 non-merge commits during the last 13 day(s) which contain
a total of 21 files changed, 238 insertions(+), 161 deletions(-).
The main changes are:
1) Fix verification issues in some BPF programs due to their stack usage
patterns, from Eduard Zingerman.
2) Fix to add missing overflow checks in xdp_umem_reg and return an error
in such case, from Kal Conley.
3) Fix and undo poisoning of strlcpy in libbpf given it broke builds for
libcs which provided the former like uClibc-ng, from Jesus Sanchez-Palencia.
4) Fix insufficient bpf_jit_limit default to avoid users running into hard
to debug seccomp BPF errors, from Daniel Borkmann.
5) Fix driver return code when they don't support a bpf_xdp_metadata kfunc
to make it unambiguous from other errors, from Jesper Dangaard Brouer.
6) Two BPF selftest fixes to address compilation errors from recent changes
in kernel structures, from Alexei Starovoitov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support
bpf: Adjust insufficient default bpf_jit_limit
xsk: Add missing overflow check in xdp_umem_reg
selftests/bpf: Fix progs/test_deny_namespace.c issues.
selftests/bpf: Fix progs/find_vma_fail1.c build error.
libbpf: Revert poisoning of strlcpy
selftests/bpf: Tests for uninitialized stack reads
bpf: Allow reads from uninit stack
====================
Link: https://lore.kernel.org/r/20230323225221.6082-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to remove our use of skb_mac_header() in tx paths,
eg remove skb_reset_mac_header() from __dev_queue_xmit().
Idea is that ndo_start_xmit() can get the mac header
simply looking at skb->data.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When setting the XPS value of a TX queue, warn the user once if the
index of the queue is greater than the number of allocated TX queues.
Previously, this scenario went uncaught. In the best case, it resulted
in unnecessary allocations. In the worst case, it resulted in
out-of-bounds memory references through calls to `netdev_get_tx_queue(
dev, index)`. Therefore, it is important to inform the user but not
worth returning an error and risk downing the netdevice.
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Piotr Raczynski <piotr.raczynski@intel.com>
Link: https://lore.kernel.org/r/20230321150725.127229-1-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch changes the return types of bpf_map_ops functions to long, where
previously int was returned. Using long allows for bpf programs to maintain
the sign bit in the absence of sign extension during situations where
inlined bpf helper funcs make calls to the bpf_map_ops funcs and a negative
error is returned.
The definitions of the helper funcs are generated from comments in the bpf
uapi header at `include/uapi/linux/bpf.h`. The return type of these
helpers was previously changed from int to long in commit bdb7b79b4c. For
any case where one of the map helpers call the bpf_map_ops funcs that are
still returning 32-bit int, a compiler might not include sign extension
instructions to properly convert the 32-bit negative value a 64-bit
negative value.
For example:
bpf assembly excerpt of an inlined helper calling a kernel function and
checking for a specific error:
; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST);
...
46: call 0xffffffffe103291c ; htab_map_update_elem
; if (err && err != -EEXIST) {
4b: cmp $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax
kernel function assembly excerpt of return value from
`htab_map_update_elem` returning 32-bit int:
movl $0xffffffef, %r9d
...
movl %r9d, %eax
...results in the comparison:
cmp $0xffffffffffffffef, $0x00000000ffffffef
Fixes: bdb7b79b4c ("bpf: Switch most helper return values from 32-bit int to 64-bit long")
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Link: https://lore.kernel.org/r/20230322194754.185781-3-inwardvessel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When driver doesn't implement a bpf_xdp_metadata kfunc the fallback
implementation returns EOPNOTSUPP, which indicate device driver doesn't
implement this kfunc.
Currently many drivers also return EOPNOTSUPP when the hint isn't
available, which is ambiguous from an API point of view. Instead
change drivers to return ENODATA in these cases.
There can be natural cases why a driver doesn't provide any hardware
info for a specific hint, even on a frame to frame basis (e.g. PTP).
Lets keep these cases as separate return codes.
When describing the return values, adjust the function kernel-doc layout
to get proper rendering for the return values.
Fixes: ab46182d0d ("net/mlx4_en: Support RX XDP metadata")
Fixes: bc8d405b1b ("net/mlx5e: Support RX XDP metadata")
Fixes: 306531f024 ("veth: Support RX XDP metadata")
Fixes: 3d76a4d3d4 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/167940675120.2718408.8176058626864184420.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When compound pages are enabled, although the mm layer still
returns an array of page pointers, a subset (or all) of them
may have the same page head since a max 180kb skb can span 2
hugepages if it is on the boundary, be a mix of pages and 1 hugepage,
or fit completely in a hugepage. Instead of referencing page head
on all page pointers, use page length arithmetic to only call page
head when referencing a known different page head to avoid touching
a cold cacheline.
Tested:
See next patch with changes to tcp_mmap
Correntess:
On a pair of separate hosts as send with MSG_ZEROCOPY will
force a copy on tx if using loopback alone, check that the SHA
on the message sent is equivalent to checksum on the message received,
since the current program already checks for the length.
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
./tcp_mmap -s -z
./tcp_mmap -H $DADDR -z
SHA256 is correct
received 2 MB (100 % mmap'ed) in 0.005914 s, 2.83686 Gbit
cpu usage user:0.001984 sys:0.000963, 1473.5 usec per MB, 10 c-switches
Performance:
Run neper between adjacent hosts with the same config
tcp_stream -Z --skip-rx-copy -6 -T 20 -F 1000 --stime-use-proc --test-length=30
Before patch: stime_end=37.670000
After patch: stime_end=30.310000
Signed-off-by: Coco Li <lixiaoyan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230321081202.2370275-1-lixiaoyan@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Sometimes we need to know which one of backlog queue can be exactly
long enough to cause some latency when debugging this part is needed.
Thus, we can then separate the display of both.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230321015746.96994-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jiri suggests it reads more naturally to skip the explicit
array size when possible. When we export the symbol we want
to make sure that the size is right but for statics its
not needed.
Link: https://lore.kernel.org/r/20230321044159.1031040-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rcu_bh is no longer a win, especially for objects freed
with standard call_rcu().
Switch neighbour code to no longer disable BH when not necessary.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
vlan_present is gone since
commit 354259fa73 ("net: remove skb->vlan_present")
rename the offset field to what BPF is currently looking
for in this byte - mono_delivery_time and tc_at_ingress.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230321014115.997841-2-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
In the upcoming VXLAN MDB implementation, the 0.0.0.0 and :: MDB entries
will act as catchall entries for unregistered IP multicast traffic in a
similar fashion to the 00:00:00:00:00:00 VXLAN FDB entry that is used to
transmit BUM traffic.
In deployments where inter-subnet multicast forwarding is used, not all
the VTEPs in a tenant domain are members in all the broadcast domains.
It is therefore advantageous to transmit BULL (broadcast, unknown
unicast and link-local multicast) and unregistered IP multicast traffic
on different tunnels. If the same tunnel was used, a VTEP only
interested in IP multicast traffic would also pull all the BULL traffic
and drop it as it is not a member in the originating broadcast domain
[1].
Prepare for this change by allowing the 0.0.0.0 group address in the
common rtnetlink MDB code and forbid it in the bridge driver. A similar
change is not needed for IPv6 because the common code only validates
that the group address is not the all-nodes address.
[1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-2.6
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>