linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00

Author	SHA1	Message	Date
Jonathan Lemon	36177832f4	skbuff: Add skb parameter to the ubuf zerocopy callback Add an optional skb parameter to the zerocopy callback parameter, which is passed down from skb_zcopy_clear(). This gives access to the original skb, which is needed for upcoming RX zero-copy error handling. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-07 16:06:37 -08:00
Yunjian Wang	950271d7cc	tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS Currently the tun_napi_alloc_frags() function returns -ENOMEM when the number of iovs exceeds MAX_SKB_FRAGS + 1. However, this is inappropriate; we should use -EMSGSIZE instead of -ENOMEM. The distinction matters: 1. the caller needs to drop the bad packet when -EMSGSIZE is returned, which indicates a persistent failure. 2. the caller can try again when -ENOMEM is returned, which indicates a transient failure. Fixes: `90e33d4594` ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/1608864736-24332-1-git-send-email-wangyunjian@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-28 13:34:36 -08:00
Jakub Kicinski	a1dd1d8697	Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-12-03 The main changes are: 1) Support BTF in kernel modules, from Andrii. 2) Introduce preferred busy-polling, from Björn. 3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh. 4) Memcg-based memory accounting for bpf objects, from Roman. 5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits) selftests/bpf: Fix invalid use of strncat in test_sockmap libbpf: Use memcpy instead of strncpy to please GCC selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module selftests/bpf: Add tp_btf CO-RE reloc test for modules libbpf: Support attachment of BPF tracing programs to kernel modules libbpf: Factor out low-level BPF program loading helper bpf: Allow to specify kernel module BTFs when attaching BPF programs bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF selftests/bpf: Add support for marking sub-tests as skipped selftests/bpf: Add bpf_testmod kernel module for testing libbpf: Add kernel module BTF support for CO-RE relocations libbpf: Refactor CO-RE relocs to not assume a single BTF object libbpf: Add internal helper to load BTF data by FD bpf: Keep module's btf_data_size intact after load bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address() selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP bpf: Adds support for setting window clamp samples/bpf: Fix spelling mistake "recieving" -> "receiving" bpf: Fix cold build of test_progs-no_alu32 ... ==================== Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 07:48:12 -08:00
Björn Töpel	b02e5a0ebb	xsk: Propagate napi_id to XDP socket Rx path Add napi_id to the xdp_rxq_info structure, and make sure the XDP socket pick up the napi_id in the Rx path. The napi_id is used to find the corresponding NAPI structure for socket busy polling. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com	2020-12-01 00:09:25 +01:00
Jakub Kicinski	5c39f26e67	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Trivial conflict in CAN, keep the net-next + the byteswap wrapper. Conflicts: drivers/net/can/usb/gs_usb.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-27 18:25:27 -08:00
Martin Schiller	8e1e33ffa6	net/tun: Call type change netdev notifiers Call netdev notifiers before and after changing the device type. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Link: https://lore.kernel.org/r/20201118063919.29485-1-ms@dev.tdt.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-23 10:32:39 -08:00
Jens Axboe	5aac0390a6	tun: honor IOCB_NOWAIT flag tun only checks the file O_NONBLOCK flag, but it should also be checking the iocb IOCB_NOWAIT flag. Any fops using ->read/write_iter() should check both, otherwise it breaks users that correctly expect O_NONBLOCK semantics if IOCB_NOWAIT is set. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/e9451860-96cc-c7c7-47b8-fe42cadd5f4c@kernel.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-21 15:19:08 -08:00
Heiner Kallweit	497a5757ce	tun: switch to net core provided statistics counters Switch tun to the standard statistics pattern: - use netdev->stats for the less frequently accessed counters - use netdev->tstats for the frequently accessed per-cpu counters v3: - add atomic_long_t member rx_frame_errors for making counter updates atomic Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-09 17:50:28 -08:00
Jakub Kicinski	44a8c4f33c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net We got slightly different patches removing a double word in a comment in net/ipv4/raw.c - picked the version from net. Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached values instead of VNIC login response buffer (following what commit `507ebe6444` ("ibmvnic: Fix use-after-free of VNIC login response buffer") did). Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-09-04 21:28:59 -07:00
Gustavo A. R. Silva	df561f6688	treewide: Use fallthrough pseudo-keyword Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2020-08-23 17:36:59 -05:00
Maciej Żenczykowski	596b5ef458	net-tun: Eliminate two tun/xdp related function calls from vhost-net This provides a minor performance boost by virtue of inlining instead of cross module function calls. Test: builds Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200819010710.3959310-2-zenczykowski@gmail.com	2020-08-19 14:02:49 -07:00
Maciej Żenczykowski	b558b6c240	net-tun: Add type safety to tun_xdp_to_ptr() and tun_ptr_to_xdp() This reduces likelihood of incorrect use. Test: builds Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200819010710.3959310-1-zenczykowski@gmail.com	2020-08-19 14:02:49 -07:00
David S. Miller	2e7199bd77	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2020-08-04 The following pull-request contains BPF updates for your net-next tree. We've added 73 non-merge commits during the last 9 day(s) which contain a total of 135 files changed, 4603 insertions(+), 1013 deletions(-). The main changes are: 1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko. 2) Add BPF iterator for map elements and to iterate all BPF programs for efficient in-kernel inspection, from Yonghong Song and Alexei Starovoitov. 3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid unwinder errors, from Song Liu. 4) Allow cgroup local storage map to be shared between programs on the same cgroup. Also extend BPF selftests with coverage, from YiFei Zhu. 5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM load instructions, from Jean-Philippe Brucker. 6) Follow-up fixes on BPF socket lookup in combination with reuseport group handling. Also add related BPF selftests, from Jakub Sitnicki. 7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for socket create/release as well as bind functions, from Stanislav Fomichev. 8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct xdp_statistics, from Peilin Ye. 9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime. 10) Extend BPF kernel test infra with skb->family and skb->{local,remote}_ip{4,6} fields and allow user space to specify skb->dev via ifindex, from Dmitry Yakunin. 11) Fix a bpftool segfault due to missing program type name and make it more robust to prevent them in future gaps, from Quentin Monnet. 
12) Consolidate cgroup helper functions across selftests and fix a v6 localhost resolver issue, from John Fastabend. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-08-03 18:27:40 -07:00
Jason Wang	8f3f330da2	tun: add missing rcu annotation in tun_set_ebpf() We expect prog_p to be protected by rcu, so add the rcu annotation to fix the following sparse warnings: drivers/net/tun.c:3003:36: warning: incorrect type in argument 2 (different address spaces) drivers/net/tun.c:3003:36: expected struct tun_prog [noderef] __rcu prog_p drivers/net/tun.c:3003:36: got struct tun_prog prog_p drivers/net/tun.c:3292:42: warning: incorrect type in argument 2 (different address spaces) drivers/net/tun.c:3292:42: expected struct tun_prog prog_p drivers/net/tun.c:3292:42: got struct tun_prog [noderef] __rcu drivers/net/tun.c:3296:42: warning: incorrect type in argument 2 (different address spaces) drivers/net/tun.c:3296:42: expected struct tun_prog prog_p drivers/net/tun.c:3296:42: got struct tun_prog [noderef] __rcu Reported-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-31 17:17:53 -07:00
Andrii Nakryiko	e8407fdeb9	bpf, xdp: Remove XDP_QUERY_PROG and XDP_QUERY_PROG_HW XDP commands Now that BPF program/link management is centralized in generic net_device code, kernel code never queries program id from drivers, so XDP_QUERY_PROG/XDP_QUERY_PROG_HW commands are unnecessary. This patch removes all the implementations of those commands in kernel, along the xdp_attachment_query(). This patch was compile-tested on allyesconfig. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-10-andriin@fb.com	2020-07-25 20:37:02 -07:00
Jason A. Donenfeld	b9815eb1d1	tun: implement header_ops->parse_protocol for AF_PACKET The tun driver passes up skb->protocol to userspace in the form of PI headers. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and the tun driver then gives userspace bogus values that it can't deal with. Note that this isn't the case with tap, because tap already benefits from the shared infrastructure for ethernet headers. But with tun, there's nothing. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-06-30 12:29:39 -07:00
Lorenzo Bianconi	1b698fa5d8	xdp: Rename convert_to_xdp_frame in xdp_convert_buff_to_frame In order to use standard 'xdp' prefix, rename convert_to_xdp_frame utility routine in xdp_convert_buff_to_frame and replace all the occurrences Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/6344f739be0d1a08ab2b9607584c4d5478c8c083.1590698295.git.lorenzo@kernel.org	2020-06-01 15:02:53 -07:00
Willem de Bruijn	96aa1b22bd	tun: correct header offsets in napi frags mode Tun in IFF_NAPI_FRAGS mode calls napi_gro_frags. Unlike netif_rx and netif_gro_receive, this expects skb->data to point to the mac layer. But skb_probe_transport_header, __skb_get_hash_symmetric, and xdp_do_generic in tun_get_user need skb->data to point to the network header. Flow dissection also needs skb->protocol set, so eth_type_trans has to be called. Ensure the link layer header lies in linear as eth_type_trans pulls ETH_HLEN. Then take the same code paths for frags as for not frags. Push the link layer header back just before calling napi_gro_frags. By pulling up to ETH_HLEN from frag0 into linear, this disables the frag0 optimization in the special case when IFF_NAPI_FRAGS is used with zero length iov[0] (and thus empty skb->linear). Fixes: `90e33d4594` ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Petar Penkov <ppenkov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-06-01 12:01:46 -07:00
Jesper Dangaard Brouer	fb3e6e9307	tun: Add XDP frame size The tun driver has two code paths for running XDP (bpf_prog_run_xdp). In both cases 'buflen' contains enough tailroom for skb_shared_info. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/bpf/158945343419.97035.9594485183958037621.stgit@firesoul	2020-05-14 21:21:55 -07:00
Gilberto Bertin	3fe260e00c	net: tun: record RX queue in skb before do_xdp_generic() This allows netif_receive_generic_xdp() to correctly determine the RX queue from which the skb is coming, so that the context passed to the XDP program will contain the correct RX queue index. Signed-off-by: Gilberto Bertin <me@jibi.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-04-12 20:58:24 -07:00
Will Deacon	bee348907d	tun: Don't put_page() for all negative return values from XDP program When an XDP program is installed, tun_build_skb() grabs a reference to the current page fragment page if the program returns XDP_REDIRECT or XDP_TX. However, since tun_xdp_act() passes through negative return values from the XDP program, it is possible to trigger the error path by mistake and accidentally drop a reference to the fragments page without taking one, leading to a spurious free. This is believed to be the cause of some KASAN use-after-free reports from syzbot [1], although without a reproducer it is not possible to confirm whether this patch fixes the problem. Ensure that we only drop a reference to the fragments page if the XDP transmit or redirect operations actually fail. [1] https://syzkaller.appspot.com/bug?id=e76a6af1be4acd727ff6bbca669833f98cbf5d95 Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> CC: Eric Dumazet <edumazet@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Fixes: `8ae1aff0b3` ("tuntap: split out XDP logic") Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-04-06 10:00:43 -07:00
Jakub Kicinski	e5ad00b34d	tun: reject unsupported coalescing params Set ethtool_ops->supported_coalesce_params to let the core reject unsupported coalescing parameters. This driver did not previously reject unsupported parameters. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-06 22:45:55 -08:00
Michal Kubecek	5af0907134	tun: drop TUN_DEBUG and tun_debug() TUN_DEBUG and tun_debug() are no longer used anywhere, drop them. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-05 21:38:03 -08:00
Michal Kubecek	3424170f37	tun: replace tun_debug() by netif_info() The tun driver uses a custom macro tun_debug() which is only available if TUN_DEBUG is set. Replace it by standard netif_info(). For that purpose, rename tun_struct::debug to msg_enable and make it u32 and always present. Finally, make tun_get_msglevel(), tun_set_msglevel() and TUNSETDEBUG ioctl independent of TUN_DEBUG. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-05 21:38:02 -08:00
Michal Kubecek	182094348a	tun: drop useless debugging statements Some of the tun_debug() statements only inform us about entering a function which can be easily achieved with ftrace or kprobe. As tun_debug() is no-op unless TUN_DEBUG is set which requires editing the source and recompiling, setting up ftrace or kprobe is easier. Drop these debug statements. Also drop the tun_debug() statement informing about SIOCSIFHWADDR ioctl. We can monitor these through rtnetlink and it makes little sense to log address changes through ioctl but not changes through rtnetlink. Moreover, this tun_debug() is called even if the actual address change fails which makes it even less useful. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-05 21:38:02 -08:00
Michal Kubecek	7522416d25	tun: get rid of DBG1() macro This macro is no-op unless TUN_DEBUG is defined (which requires editing and recompiling the source) and only does something if variable debug is 2 but that variable is zero initialized and never set to anything else. Moreover, the only use of the macro informs about entering function tun_chr_open() which can be easily achieved using ftrace or kprobe. Drop DBG1() macro, its only use and global variable debug. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-05 21:38:02 -08:00
Michal Kubecek	516c512bde	tun: fix misleading comment format The comment above tun_flow_save_rps_rxhash() starts with "/**" which makes it look like kerneldoc comment and results in warnings when building with W=1. Fix the format to make it look like a normal comment. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-05 21:38:02 -08:00
David Ahern	fb0b1c6042	tun: Remove unnecessary BUG_ON check in tun_net_xmit The BUG_ON for NULL tfile is now redundant due to a recently added null check after the rcu_dereference. Remove the BUG_ON. Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-23 16:39:21 -08:00
David S. Miller	4d8773b68e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Minor conflict in mlx5 because changes happened to code that has moved meanwhile. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-01-26 10:40:21 +01:00
Eric Dumazet	1efba987c4	tun: add mutex_unlock() call and napi.skb clearing in tun_get_user() If both IFF_NAPI_FRAGS mode and XDP are enabled, and the XDP program consumes the skb, we need to clear the napi.skb (or risk a use-after-free) and release the mutex (or risk a deadlock) WARNING: lock held when returning to user space! 5.5.0-rc6-syzkaller #0 Not tainted ------------------------------------------------ syz-executor.0/455 is leaving the kernel with locks still held! 1 lock held by syz-executor.0/455: #0: ffff888098f6e748 (&tfile->napi_mutex){+.+.}, at: tun_get_user+0x1604/0x3fc0 drivers/net/tun.c:1835 Fixes: `90e33d4594` ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Petar Penkov <ppenkov@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-01-23 11:42:44 +01:00
Toke Høiland-Jørgensen	1d233886dd	xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths Since the bulk queue used by XDP_REDIRECT now lives in struct net_device, we can re-use the bulking for the non-map version of the bpf_redirect() helper. This is a simple matter of having xdp_do_redirect_slow() queue the frame on the bulk queue instead of sending it out with __bpf_tx_xdp(). Unfortunately we can't make the bpf_redirect() helper return an error if the ifindex doesn't exist (as bpf_redirect_map() does), because we don't have a reference to the network namespace of the ingress device at the time the helper is called. So we have to leave it as-is and keep the device lookup in xdp_do_redirect_slow(). Since this leaves less reason to have the non-map redirect code in a separate function, we get rid of the xdp_do_redirect_slow() function entirely. This does lose us the tracepoint disambiguation, but fortunately the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint entry structures. This means both can contain a map index, so we can just amend the tracepoint definitions so we always emit the xdp_redirect(_err) tracepoints, but with the map ID only populated if a map is present. This means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep the definitions around in case someone is still listening for them. With this change, the performance of the xdp_redirect sample program goes from 5Mpps to 8.4Mpps (a 68% increase). Since the flush functions are no longer map-specific, rename the flush() functions to drop _map from their names. One of the renamed functions is the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To keep from having to update all drivers, use a #define to keep the old name working, and only update the virtual drivers in this patch.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk	2020-01-16 20:03:34 -08:00
Petar Penkov	c39e342a05	tun: fix data-race in gro_normal_list() There is a race in the TUN driver between napi_busy_loop and napi_gro_frags. This commit resolves the race by adding the NAPI struct via netif_tx_napi_add, instead of netif_napi_add, which disables polling for the NAPI struct. KCSAN reported: BUG: KCSAN: data-race in gro_normal_list.part.0 / napi_busy_loop write to 0xffff8880b5d474b0 of 4 bytes by task 11205 on cpu 0: gro_normal_list.part.0+0x77/0xb0 net/core/dev.c:5682 gro_normal_list net/core/dev.c:5678 [inline] gro_normal_one net/core/dev.c:5692 [inline] napi_frags_finish net/core/dev.c:5705 [inline] napi_gro_frags+0x625/0x770 net/core/dev.c:5778 tun_get_user+0x2150/0x26a0 drivers/net/tun.c:1976 tun_chr_write_iter+0x79/0xd0 drivers/net/tun.c:2022 call_write_iter include/linux/fs.h:1895 [inline] do_iter_readv_writev+0x487/0x5b0 fs/read_write.c:693 do_iter_write fs/read_write.c:970 [inline] do_iter_write+0x13b/0x3c0 fs/read_write.c:951 vfs_writev+0x118/0x1c0 fs/read_write.c:1015 do_writev+0xe3/0x250 fs/read_write.c:1058 __do_sys_writev fs/read_write.c:1131 [inline] __se_sys_writev fs/read_write.c:1128 [inline] __x64_sys_writev+0x4e/0x60 fs/read_write.c:1128 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffff8880b5d474b0 of 4 bytes by task 11168 on cpu 1: gro_normal_list net/core/dev.c:5678 [inline] napi_busy_loop+0xda/0x4f0 net/core/dev.c:6126 sk_busy_loop include/net/busy_poll.h:108 [inline] __skb_recv_udp+0x4ad/0x560 net/ipv4/udp.c:1689 udpv6_recvmsg+0x29e/0xe90 net/ipv6/udp.c:288 inet6_recvmsg+0xbb/0x240 net/ipv6/af_inet6.c:592 sock_recvmsg_nosec net/socket.c:871 [inline] sock_recvmsg net/socket.c:889 [inline] sock_recvmsg+0x92/0xb0 net/socket.c:885 sock_read_iter+0x15f/0x1e0 net/socket.c:967 call_read_iter include/linux/fs.h:1889 [inline] new_sync_read+0x389/0x4f0 fs/read_write.c:414 __vfs_read+0xb1/0xc0 fs/read_write.c:427 vfs_read fs/read_write.c:461 [inline] vfs_read+0x143/0x2c0 
fs/read_write.c:446 ksys_read+0xd5/0x1b0 fs/read_write.c:587 __do_sys_read fs/read_write.c:597 [inline] __se_sys_read fs/read_write.c:595 [inline] __x64_sys_read+0x4c/0x60 fs/read_write.c:595 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 11168 Comm: syz-executor.0 Not tainted 5.4.0-rc6+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: `943170998b` ("tun: enable NAPI for TUN/TAP driver") Signed-off-by: Petar Penkov <ppenkov@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-15 12:46:49 -08:00
Eric Dumazet	5260dd3ed1	tun: switch to u64_stats_t In order to fix this data-race found by KCSAN [1], switch to u64_stats_t helpers. They provide all the needed annotations, without adding extra cost. [1] BUG: KCSAN: data-race in tun_get_user / tun_net_get_stats64 read to 0xffffe8ffffd8aca8 of 8 bytes by task 4882 on cpu 0: tun_net_get_stats64+0x9b/0x230 drivers/net/tun.c:1171 dev_get_stats+0x89/0x1e0 net/core/dev.c:9103 rtnl_fill_stats+0x56/0x370 net/core/rtnetlink.c:1177 rtnl_fill_ifinfo+0xd3b/0x2100 net/core/rtnetlink.c:1667 rtmsg_ifinfo_build_skb+0xb0/0x150 net/core/rtnetlink.c:3472 rtmsg_ifinfo_event.part.0+0x4e/0xb0 net/core/rtnetlink.c:3504 rtmsg_ifinfo_event net/core/rtnetlink.c:3515 [inline] rtmsg_ifinfo+0x85/0x90 net/core/rtnetlink.c:3513 __dev_notify_flags+0x18b/0x200 net/core/dev.c:7649 dev_change_flags+0xb8/0xe0 net/core/dev.c:7691 dev_ifsioc+0x201/0x6a0 net/core/dev_ioctl.c:237 dev_ioctl+0x149/0x660 net/core/dev_ioctl.c:489 sock_do_ioctl+0xdb/0x230 net/socket.c:1061 sock_ioctl+0x3a3/0x5e0 net/socket.c:1189 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0x991/0xc60 fs/ioctl.c:696 write to 0xffffe8ffffd8aca8 of 8 bytes by task 4883 on cpu 1: tun_get_user+0x1d94/0x2ba0 drivers/net/tun.c:2002 tun_chr_write_iter+0x79/0xd0 drivers/net/tun.c:2022 call_write_iter include/linux/fs.h:1895 [inline] new_sync_write+0x388/0x4a0 fs/read_write.c:483 __vfs_write+0xb1/0xc0 fs/read_write.c:496 __kernel_write+0xb8/0x240 fs/read_write.c:515 write_pipe_buf+0xb6/0xf0 fs/splice.c:794 splice_from_pipe_feed fs/splice.c:500 [inline] __splice_from_pipe+0x248/0x480 fs/splice.c:624 splice_from_pipe+0xbb/0x100 fs/splice.c:659 default_file_splice_write+0x45/0x90 fs/splice.c:806 do_splice_from fs/splice.c:848 [inline] direct_splice_actor+0xa0/0xc0 fs/splice.c:1020 splice_direct_to_actor+0x215/0x510 fs/splice.c:975 do_splice_direct+0x161/0x1e0 fs/splice.c:1063 do_sendfile+0x384/0x7f0 fs/read_write.c:1464 Reported by Kernel Concurrency Sanitizer on: 
CPU: 1 PID: 4883 Comm: syz-executor.1 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-07 20:03:08 -08:00
David S. Miller	2f184393e0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Several cases of overlapping changes which were for the most part trivially resolvable. Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-20 10:43:00 -07:00
Eric Dumazet	4ffdd22e49	tun: remove possible false sharing in tun_flow_update() As mentioned in https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE#it-may-improve-performance a C compiler can legally transform if (e->queue_index != queue_index) e->queue_index = queue_index; to : e->queue_index = queue_index; Note that the code using jiffies has no issue, since jiffies has volatile attribute. if (e->updated != jiffies) e->updated = jiffies; Fixes: `83b1bc122c` ("tun: align write-heavy flow entry members to a cache line") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Zhang Yu <zhangyu31@baidu.com> Cc: Wang Li <wangli39@baidu.com> Cc: Li RongQing <lirongqing@baidu.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-09 21:29:33 -07:00
Eric Dumazet	bacb7e1855	Revert "tun: call dev_get_valid_name() before register_netdevice()" This reverts commit `0ad646c81b`. As noticed by Jakub, this is no longer needed after commit `11fc7d5a0a` ("tun: fix memory leak in error path") This no longer exports dev_get_valid_name() for the exclusive use of tun driver. Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-08 20:49:15 -07:00
Eric Dumazet	11fc7d5a0a	tun: fix memory leak in error path syzbot reported a warning [1] that triggered after recent Jiri patch. This exposes a bug that we hit already in the past (see commit `ff244c6b29` ("tun: handle register_netdevice() failures properly") for details) tun uses priv->destructor without an ndo_init() method. register_netdevice() can return an error, but will not call priv->destructor() in some cases. Jiri recent patch added one more. A long term fix would be to transfer the initialization of what we destroy in ->destructor() in the ndo_init() This looks a bit risky given the complexity of tun driver. A simpler fix is to detect after the failed register_netdevice() if the tun_free_netdev() function was called already. [1] ODEBUG: free active (active state 0) object type: timer_list hint: tun_flow_cleanup+0x0/0x280 drivers/net/tun.c:457 WARNING: CPU: 0 PID: 8653 at lib/debugobjects.c:481 debug_print_object+0x168/0x250 lib/debugobjects.c:481 Kernel panic - not syncing: panic_on_warn set ... 
CPU: 0 PID: 8653 Comm: syz-executor976 Not tainted 5.4.0-rc1-next-20191004 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 panic+0x2dc/0x755 kernel/panic.c:220 __warn.cold+0x2f/0x3c kernel/panic.c:581 report_bug+0x289/0x300 lib/bug.c:195 fixup_bug arch/x86/kernel/traps.c:174 [inline] fixup_bug arch/x86/kernel/traps.c:169 [inline] do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028 RIP: 0010:debug_print_object+0x168/0x250 lib/debugobjects.c:481 Code: dd 80 b9 e6 87 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 b5 00 00 00 48 8b 14 dd 80 b9 e6 87 48 c7 c7 e0 ae e6 87 e8 80 84 ff fd <0f> 0b 83 05 e3 ee 80 06 01 48 83 c4 20 5b 41 5c 41 5d 41 5e 5d c3 RSP: 0018:ffff888095997a28 EFLAGS: 00010082 RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff815cb526 RDI: ffffed1012b32f37 RBP: ffff888095997a68 R08: ffff8880a92ac580 R09: ffffed1015d04101 R10: ffffed1015d04100 R11: ffff8880ae820807 R12: 0000000000000001 R13: ffffffff88fb5340 R14: ffffffff81627110 R15: ffff8880aa41eab8 __debug_check_no_obj_freed lib/debugobjects.c:963 [inline] debug_check_no_obj_freed+0x2d4/0x43f lib/debugobjects.c:994 kfree+0xf8/0x2c0 mm/slab.c:3755 kvfree+0x61/0x70 mm/util.c:593 netdev_freemem net/core/dev.c:9384 [inline] free_netdev+0x39d/0x450 net/core/dev.c:9533 tun_set_iff drivers/net/tun.c:2871 [inline] __tun_chr_ioctl+0x317b/0x3f30 drivers/net/tun.c:3075 tun_chr_ioctl+0x2b/0x40 drivers/net/tun.c:3355 vfs_ioctl fs/ioctl.c:47 [inline] file_ioctl fs/ioctl.c:539 [inline] do_vfs_ioctl+0xdb6/0x13e0 fs/ioctl.c:726 ksys_ioctl+0xab/0xd0 fs/ioctl.c:743 __do_sys_ioctl fs/ioctl.c:750 [inline] __se_sys_ioctl fs/ioctl.c:748 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:748 do_syscall_64+0xfa/0x760 
arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x441439 Code: e8 9c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fff61c37438 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000441439 RDX: 0000000020000400 RSI: 00000000400454ca RDI: 0000000000000004 RBP: 00007fff61c37470 R08: 0000000000000001 R09: 0000000100000000 R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000000 Kernel Offset: disabled Rebooting in 86400 seconds.. Fixes: `ff92741270` ("net: introduce name_node struct to be used in hashlist") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jiri Pirko <jiri@mellanox.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-08 13:06:11 -07:00
Florian Westphal	895b5c9f20	netfilter: drop bridge nf reset from nf_reset commit `174e23810c` ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi recycle always drop skb extensions. The additional skb_ext_del() that is performed via nf_reset on napi skb recycle is not needed anymore. Most nf_reset() calls in the stack are there so queued skb won't block 'rmmod nf_conntrack' indefinitely. This removes the skb_ext_del from nf_reset, and renames it to a more fitting nf_reset_ct(). In a few selected places, add a call to skb_ext_reset to make sure that no active extensions remain. I am submitting this for "net", because we're still early in the release cycle. The patch applies to net-next too, but I think the rename causes needless divergence between those trees. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2019-10-01 18:42:15 +02:00
Yang Yingliang	77f22f92df	tun: fix use-after-free when register netdev failed I got a UAF repport in tun driver when doing fuzzy test: [ 466.269490] ================================================================== [ 466.271792] BUG: KASAN: use-after-free in tun_chr_read_iter+0x2ca/0x2d0 [ 466.271806] Read of size 8 at addr ffff888372139250 by task tun-test/2699 [ 466.271810] [ 466.271824] CPU: 1 PID: 2699 Comm: tun-test Not tainted 5.3.0-rc1-00001-g5a9433db2614-dirty #427 [ 466.271833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 466.271838] Call Trace: [ 466.271858] dump_stack+0xca/0x13e [ 466.271871] ? tun_chr_read_iter+0x2ca/0x2d0 [ 466.271890] print_address_description+0x79/0x440 [ 466.271906] ? vprintk_func+0x5e/0xf0 [ 466.271920] ? tun_chr_read_iter+0x2ca/0x2d0 [ 466.271935] __kasan_report+0x15c/0x1df [ 466.271958] ? tun_chr_read_iter+0x2ca/0x2d0 [ 466.271976] kasan_report+0xe/0x20 [ 466.271987] tun_chr_read_iter+0x2ca/0x2d0 [ 466.272013] do_iter_readv_writev+0x4b7/0x740 [ 466.272032] ? default_llseek+0x2d0/0x2d0 [ 466.272072] do_iter_read+0x1c5/0x5e0 [ 466.272110] vfs_readv+0x108/0x180 [ 466.299007] ? compat_rw_copy_check_uvector+0x440/0x440 [ 466.299020] ? fsnotify+0x888/0xd50 [ 466.299040] ? __fsnotify_parent+0xd0/0x350 [ 466.299064] ? fsnotify_first_mark+0x1e0/0x1e0 [ 466.304548] ? vfs_write+0x264/0x510 [ 466.304569] ? ksys_write+0x101/0x210 [ 466.304591] ? 
do_preadv+0x116/0x1a0 [ 466.304609] do_preadv+0x116/0x1a0 [ 466.309829] do_syscall_64+0xc8/0x600 [ 466.309849] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 466.309861] RIP: 0033:0x4560f9 [ 466.309875] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 466.309889] RSP: 002b:00007ffffa5166e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000127 [ 466.322992] RAX: ffffffffffffffda RBX: 0000000000400460 RCX: 00000000004560f9 [ 466.322999] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003 [ 466.323007] RBP: 00007ffffa516700 R08: 0000000000000004 R09: 0000000000000000 [ 466.323014] R10: 0000000000000000 R11: 0000000000000206 R12: 000000000040cb10 [ 466.323021] R13: 0000000000000000 R14: 00000000006d7018 R15: 0000000000000000 [ 466.323057] [ 466.323064] Allocated by task 2605: [ 466.335165] save_stack+0x19/0x80 [ 466.336240] __kasan_kmalloc.constprop.8+0xa0/0xd0 [ 466.337755] kmem_cache_alloc+0xe8/0x320 [ 466.339050] getname_flags+0xca/0x560 [ 466.340229] user_path_at_empty+0x2c/0x50 [ 466.341508] vfs_statx+0xe6/0x190 [ 466.342619] __do_sys_newstat+0x81/0x100 [ 466.343908] do_syscall_64+0xc8/0x600 [ 466.345303] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 466.347034] [ 466.347517] Freed by task 2605: [ 466.348471] save_stack+0x19/0x80 [ 466.349476] __kasan_slab_free+0x12e/0x180 [ 466.350726] kmem_cache_free+0xc8/0x430 [ 466.351874] putname+0xe2/0x120 [ 466.352921] filename_lookup+0x257/0x3e0 [ 466.354319] vfs_statx+0xe6/0x190 [ 466.355498] __do_sys_newstat+0x81/0x100 [ 466.356889] do_syscall_64+0xc8/0x600 [ 466.358037] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 466.359567] [ 466.360050] The buggy address belongs to the object at ffff888372139100 [ 466.360050] which belongs to the cache names_cache of size 4096 [ 466.363735] The buggy address is located 336 bytes inside of [ 466.363735] 4096-byte region [ffff888372139100, 
ffff88837213a100) [ 466.367179] The buggy address belongs to the page: [ 466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0 [ 466.371582] flags: 0x2fffff80010200(slab\|head) [ 466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00 [ 466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000 [ 466.377778] page dumped because: kasan: bad access detected [ 466.379730] [ 466.380288] Memory state around the buggy address: [ 466.381844] ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.384009] ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.388257] ^ [ 466.390234] ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.392512] ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.394667] ================================================================== tun_chr_read_iter() accessed the memory which freed by free_netdev() called by tun_set_iff(): CPUA CPUB tun_set_iff() alloc_netdev_mqs() tun_attach() tun_chr_read_iter() tun_get() tun_do_read() tun_ring_recv() register_netdevice() <-- inject error goto err_detach tun_detach_all() <-- set RCV_SHUTDOWN free_netdev() <-- called from err_free_dev path netdev_freemem() <-- free the memory without check refcount (In this path, the refcount cannot prevent freeing the memory of dev, and the memory will be used by dev_put() called by tun_chr_read_iter() on CPUB.) (Break from tun_ring_recv(), because RCV_SHUTDOWN is set) tun_put() dev_put() <-- use the memory freed by netdev_freemem() Put the publishing of tfile->tun after register_netdevice(), so tun_get() won't get the tun pointer that freed by err_detach path if register_netdevice() failed. 
Fixes: `eb0fb363f9` ("tuntap: attach queue 0 before registering netdevice") Reported-by: Hulk Robot <hulkci@huawei.com> Suggested-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-09-12 11:17:26 +01:00
Alexis Bauvin	4b66336624	tun: mark small packets as owned by the tap sock - v1 -> v2: Move skb_set_owner_w to __tun_build_skb to reduce patch size Small packets going out of a tap device go through an optimized code path that uses build_skb() rather than sock_alloc_send_pskb(). The latter calls skb_set_owner_w(), but the small packet code path does not. The net effect is that small packets are not owned by the userland application's socket (e.g. QEMU), while large packets are. This can be seen with a TCP session, where packets are not owned when the window size is small enough (around PAGE_SIZE), while they are once the window grows (note that this requires the host to support virtio tso for the guest to offload segmentation). All this leads to inconsistent behaviour in the kernel, especially on netfilter modules that uses sk->socket (e.g. xt_owner). Fixes: `66ccbc9c87` ("tap: use build_skb() for small packet") Signed-off-by: Alexis Bauvin <abauvin@scaleway.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-07-25 11:38:32 -07:00
Al Viro	333f7909a8	coallocate socket_wq with socket itself socket->wq is assign-once, set when we are initializing both struct socket it's in and struct socket_wq it points to. As the matter of fact, the only reason for separate allocation was the ability to RCU-delay freeing of socket_wq. RCU-delaying the freeing of socket itself gets rid of that need, so we can just fold struct socket_wq into the end of struct socket and simplify the life both for sock_alloc_inode() (one allocation instead of two) and for tun/tap oddballs, where we used to embed struct socket and struct socket_wq into the same structure (now - embedding just the struct socket). Note that reference to struct socket_wq in struct sock does remain a reference - that's unchanged. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-07-08 19:25:19 -07:00
Fei Li	72b319dc08	tun: wake up waitqueues after IFF_UP is set Currently, after setting the tap0 link up, the tun code wakes up the tx/rx wait queues in tun_net_open() when .ndo_open() is called; however, the IFF_UP flag has not been set yet. If there's already a wait queue, it would fail to transmit when checking the IFF_UP flag in tun_sendmsg(). Then the saving vhost_poll_start() will add the wq into wqh until it is woken up again. Although this works when the IFF_UP flag has been set by the time tun_chr_poll detects it, this is not true if the IFF_UP flag has not been set at that time. Sadly the latter case is a fatal error, as the wq will never be woken up in future unless the link is later manually set up on purpose. Fix this by moving the wakeup process into the NETDEV_UP event notifying process; this makes sure IFF_UP has been set before all waiting queues are woken up. Signed-off-by: Fei Li <lifei.shirley@bytedance.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 10:46:52 -07:00
Thomas Gleixner	c942fddf87	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157 Based on 3 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version [author] [kishon] [vijay] [abraham] [i] [kishon]@[ti] [com] this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version [author] [graeme] [gregory] [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i] [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema] [hk] [hemahk]@[ti] [com] this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 1105 file(s). 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-05-30 11:26:37 -07:00
Jason Wang	9871a9e47a	tuntap: synchronize through tfiles array instead of tun->numqueues When a queue(tfile) is detached through __tun_detach(), we move the last enabled tfile to the position where detached one sit but don't NULL out last position. We expect to synchronize the datapath through tun->numqueues. Unfortunately, this won't work since we're lacking sufficient mechanism to order or synchronize the access to tun->numqueues. To fix this, NULL out the last position during detaching and check RCU protected tfile against NULL instead of checking tun->numqueues in datapath. Cc: YueHaibing <yuehaibing@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: weiyongjun (A) <weiyongjun1@huawei.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Fixes: `c8d68e6be1` ("tuntap: multiqueue support") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-05-09 09:21:42 -07:00
Jason Wang	a35d310f03	tuntap: fix dividing by zero in ebpf queue selection We need to check if tun->numqueues is zero (e.g. for the persist device) before trying to use it for modular arithmetic. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Fixes: 96f84061620c6("tun: add eBPF based queue selection method") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-05-09 09:21:42 -07:00
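The guard described in a35d310f03 can be illustrated in plain userspace C. This is only a minimal sketch, not the driver code: `pick_queue`, `prog_ret` and the choice to fall back to queue 0 are hypothetical names and assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for eBPF-based queue selection: map a value
 * returned by the steering program onto one of numqueues queues. */
static uint16_t pick_queue(uint32_t prog_ret, uint32_t numqueues)
{
	/* A persistent device may have no queues attached. Taking the
	 * remainder by zero would be a division by zero, so bail out
	 * before the modular arithmetic (assumed fallback: queue 0). */
	if (numqueues == 0)
		return 0;

	return (uint16_t)(prog_ret % numqueues);
}
```

With four queues, `pick_queue(7, 4)` selects queue 3; with no queues attached, the function returns early instead of dividing.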
Stanislav Fomichev	c43f1255b8	net: pass net_device argument to the eth_get_headlen Update all users of eth_get_headlen to pass network device, fetch network namespace from it and pass it down to the flow dissector. This commit is a noop until administrator inserts BPF flow dissector program. Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: intel-wired-lan@lists.osuosl.org Cc: Yisen Zhuang <yisen.zhuang@huawei.com> Cc: Salil Mehta <salil.mehta@huawei.com> Cc: Michael Chan <michael.chan@broadcom.com> Cc: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-04-23 18:36:34 +02:00
Eric Dumazet	dc05360fee	net: convert rps_needed and rfs_needed to new static branch api We prefer static_branch_unlikely() over static_key_false() these days. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-03-23 21:57:38 -04:00
Kirill Tkhai	12132768dc	tun: Remove unused first parameter of tun_get_iff() Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-03-21 13:19:15 -07:00
Kirill Tkhai	0c3e0e3bb6	tun: Add ioctl() TUNGETDEVNETNS cmd to allow obtaining real net ns of tun device In commit `f2780d6d74` "tun: Add ioctl() SIOCGSKNS cmd to allow obtaining net ns of tun device" it was missed that tun may change its net ns, while net ns of socket remains the same as it was created initially. SIOCGSKNS returns net ns of socket, so it is not suitable for obtaining net ns of device. We may have two tun devices with the same names in two net ns, and in this case it's not possible to determine which of them the fd refers to (TUNGETIFF will return the same name). This patch adds new ioctl() cmd for obtaining net ns of a device. Reported-by: Harald Albrecht <harald.albrecht@gmx.net> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-03-21 13:19:15 -07:00
Paolo Abeni	a350eccee5	net: remove 'fallback' argument from dev->ndo_select_queue() After the previous patch, all the callers of ndo_select_queue() provide as a 'fallback' argument netdev_pick_tx. The only exceptions are nested calls to ndo_select_queue(), which pass down the 'fallback' available in the current scope - still netdev_pick_tx. We can drop such argument and replace fallback() invocation with netdev_pick_tx(). This avoids an indirect call per xmit packet in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen) with device drivers implementing such ndo. It also clean the code a bit. Tested with ixgbe and CONFIG_FCOE=m With pktgen using queue xmit: threads vanilla patched (kpps) (kpps) 1 2334 2428 2 4166 4278 4 7895 8100 v1 -> v2: - rebased after helper's name change Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-03-20 11:18:55 -07:00
Eric Dumazet	9180bb4f04	tun: add a missing rcu_read_unlock() in error path In my latest patch I missed one rcu_read_unlock(), in case device is down. Fixes: `4477138fa0` ("tun: properly test for IFF_UP") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-03-16 13:16:37 -07:00
Eric Dumazet	4477138fa0	tun: properly test for IFF_UP Same reasons as the ones explained in commit `4179cb5a4c` ("vxlan: test dev->flags & IFF_UP before calling netif_rx()") netif_rx_ni() or napi_gro_frags() must be called under a strict contract. At device dismantle phase, core networking clears IFF_UP and flush_all_backlogs() is called after rcu grace period to make sure no incoming packet might be in a cpu backlog and still referencing the device. A similar protocol is used for gro layer. Most drivers call netif_rx() from their interrupt handler, and since the interrupts are disabled at device dismantle, netif_rx() does not have to check dev->flags & IFF_UP. Virtual drivers do not have this guarantee, and must therefore make the check themselves. Fixes: `1bd4978a88` ("tun: honor IFF_UP in tun_get_user()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-03-15 15:42:11 -07:00
David S. Miller	9eb359140c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2019-03-02 12:54:35 -08:00
Timur Celik	ecef67cb10	tun: remove unnecessary memory barrier Replace set_current_state with __set_current_state since no memory barrier is needed at this point. Signed-off-by: Timur Celik <mail@timurcelik.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-02-25 14:27:21 -08:00
Timur Celik	71828b2240	tun: fix blocking read This patch moves setting of the current state into the loop. Otherwise the task may end up in a busy wait loop if none of the break conditions are met. Signed-off-by: Timur Celik <mail@timurcelik.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-02-24 22:11:53 -08:00
Maxim Mikityanskiy	d2aa125d62	net: Don't set transport offset to invalid value If the socket was created with socket(AF_PACKET, SOCK_RAW, 0), skb->protocol will be unset, __skb_flow_dissect() will fail, and skb_probe_transport_header() will fall back to the offset_hint, making the resulting skb_transport_offset incorrect. If, however, there is no transport header in the packet, transport_header shouldn't be set to an arbitrary value. Fix it by leaving the transport offset unset if it couldn't be found, to be explicit rather than to fill it with some wrong value. It changes the behavior, but if some code relied on the old behavior, it would be broken anyway, as the old one is incorrect. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-02-22 12:55:31 -08:00
George Amanakis	3a03cb8456	tun: move the call to tun_set_real_num_queues Call tun_set_real_num_queues() after the increment of tun->numqueues since the former depends on it. Otherwise, the number of queues is not correctly accounted for, which results to warnings similar to: "vnet0 selects TX queue 11, but real number of TX queues is 11". Fixes: `0b7959b625` ("tun: publish tfile after it's fully initialized") Reported-and-tested-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-01-30 21:40:25 -08:00
Stanislav Fomichev	0b7959b625	tun: publish tfile after it's fully initialized BUG: unable to handle kernel NULL pointer dereference at 00000000000000d1 Call Trace: ? napi_gro_frags+0xa7/0x2c0 tun_get_user+0xb50/0xf20 tun_chr_write_iter+0x53/0x70 new_sync_write+0xff/0x160 vfs_write+0x191/0x1e0 __x64_sys_write+0x5e/0xd0 do_syscall_64+0x47/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 I think there is a subtle race between sending a packet via tap and attaching it: CPU0: CPU1: tun_chr_ioctl(TUNSETIFF) tun_set_iff tun_attach rcu_assign_pointer(tfile->tun, tun); tun_fops->write_iter() tun_chr_write_iter tun_napi_alloc_frags napi_get_frags napi->skb = napi_alloc_skb tun_napi_init netif_napi_add napi->skb = NULL napi->skb is NULL here napi_gro_frags napi_frags_skb skb = napi->skb skb_reset_mac_header(skb) panic() Move rcu_assign_pointer(tfile->tun) and rcu_assign_pointer(tun->tfiles) to be the last thing we do in tun_attach(); this should guarantee that when we call tun_get() we always get an initialized object. v2 changes: * remove extra napi_mutex locks/unlocks for napi operations Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: `90e33d4594` ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-01-10 09:24:38 -05:00
Prashant Bhole	6342ca6447	tun: replace get_cpu_ptr with this_cpu_ptr when bh disabled tun_xdp_one() runs with local bh disabled. So there is no need to disable preemption by calling get_cpu_ptr while updating stats. This patch replaces the use of get_cpu_ptr() with this_cpu_ptr() as a micro-optimization. Also removes related put_cpu_ptr call. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-12-14 13:36:26 -08:00
Petr Machata	3a37a9636c	net: dev: Add extack argument to dev_set_mac_address() A follow-up patch will add a notifier type NETDEV_PRE_CHANGEADDR, which allows vetoing of MAC address changes. One prominent path to that notification is through dev_set_mac_address(). Therefore give this function an extack argument, so that it can be packed together with the notification. Thus a textual reason for rejection (or a warning) can be communicated back to the user. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-12-13 18:41:38 -08:00
David S. Miller	4cc1feeb6f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Several conflicts, seemingly all over the place. I used Stephen Rothwell's sample resolutions for many of these, if not just to double check my own work, so definitely the credit largely goes to him. The NFP conflict consisted of a bug fix (moving operations past the rhashtable operation) while changing the initial argument in the function call in the moved code. The net/dsa/master.c conflict had to do with a bug fix intermixing of making dsa_master_set_mtu() static with the fixing of the tagging attribute location. cls_flower had a conflict because the dup reject fix from Or overlapped with the addition of port range classification. __set_phy_supported()'s conflict was relatively easy to resolve because Andrew fixed it in both trees, so it was just a matter of taking the net-next copy. Or at least I think it was :-) Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup() intermixed with changes on how the sdif and caller_net are calculated in these code paths in net-next. The remaining BPF conflicts were largely about the addition of the __bpf_md_ptr stuff in 'net' overlapping with adjustments and additions to the relevant data structure where the MD pointer macros are used. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-12-09 21:43:31 -08:00
Li RongQing	5c327f673d	tun: remove unnecessary check in tun_flow_update The caller has guaranteed that rxhash is not zero. Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-12-06 12:15:53 -08:00
Li RongQing	83b1bc122c	tun: align write-heavy flow entry members to a cache line The tun flow entry 'updated' field is written on every received packet. Thus if a flow is receiving packets through a particular flow entry, it causes false sharing with all the others that have looked it up. Move it into its own cache line, and update the 'queue_index' and 'updated' fields only when they change, to reduce cache false sharing. Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Wang Li <wangli39@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-12-06 12:15:27 -08:00
Prashant Bhole	4e4b08e558	tun: remove skb access after netif_receive_skb In tun.c skb->len was accessed while doing stats accounting after a call to netif_receive_skb. We can not access skb after this call because buffers may be dropped. The fix for this bug would be to store skb->len in local variable and then use it after netif_receive_skb(). IMO using xdp data size for accounting bytes will be better because input for tun_xdp_one() is xdp_buff. Hence this patch: - fixes a bug by removing skb access after netif_receive_skb() - uses xdp data size for accounting bytes [613.019057] BUG: KASAN: use-after-free in tun_sendmsg+0x77c/0xc50 [tun] [613.021062] Read of size 4 at addr ffff8881da9ab7c0 by task vhost-1115/1155 [613.023073] [613.024003] CPU: 0 PID: 1155 Comm: vhost-1115 Not tainted 4.20.0-rc3-vm+ #232 [613.026029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [613.029116] Call Trace: [613.031145] dump_stack+0x5b/0x90 [613.032219] print_address_description+0x6c/0x23c [613.034156] ? tun_sendmsg+0x77c/0xc50 [tun] [613.036141] kasan_report.cold.5+0x241/0x308 [613.038125] tun_sendmsg+0x77c/0xc50 [tun] [613.040109] ? tun_get_user+0x1960/0x1960 [tun] [613.042094] ? __isolate_free_page+0x270/0x270 [613.045173] vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net] [613.047127] ? peek_head_len.part.13+0x90/0x90 [vhost_net] [613.049096] ? get_tx_bufs+0x5a/0x2c0 [vhost_net] [613.051106] ? vhost_enable_notify+0x2d8/0x420 [vhost] [613.053139] handle_tx_copy+0x2d0/0x8f0 [vhost_net] [613.053139] ? vhost_net_buf_peek+0x340/0x340 [vhost_net] [613.053139] ? __mutex_lock+0x8d9/0xb30 [613.053139] ? finish_task_switch+0x8f/0x3f0 [613.053139] ? handle_tx+0x32/0x120 [vhost_net] [613.053139] ? mutex_trylock+0x110/0x110 [613.053139] ? finish_task_switch+0xcf/0x3f0 [613.053139] ? finish_task_switch+0x240/0x3f0 [613.053139] ? __switch_to_asm+0x34/0x70 [613.053139] ? __switch_to_asm+0x40/0x70 [613.053139] ? 
__schedule+0x506/0xf10 [613.053139] handle_tx+0xc7/0x120 [vhost_net] [613.053139] vhost_worker+0x166/0x200 [vhost] [613.053139] ? vhost_dev_init+0x580/0x580 [vhost] [613.053139] ? __kthread_parkme+0x77/0x90 [613.053139] ? vhost_dev_init+0x580/0x580 [vhost] [613.053139] kthread+0x1b1/0x1d0 [613.053139] ? kthread_park+0xb0/0xb0 [613.053139] ret_from_fork+0x35/0x40 [613.088705] [613.088705] Allocated by task 1155: [613.088705] kasan_kmalloc+0xbf/0xe0 [613.088705] kmem_cache_alloc+0xdc/0x220 [613.088705] __build_skb+0x2a/0x160 [613.088705] build_skb+0x14/0xc0 [613.088705] tun_sendmsg+0x4f0/0xc50 [tun] [613.088705] vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net] [613.088705] handle_tx_copy+0x2d0/0x8f0 [vhost_net] [613.088705] handle_tx+0xc7/0x120 [vhost_net] [613.088705] vhost_worker+0x166/0x200 [vhost] [613.088705] kthread+0x1b1/0x1d0 [613.088705] ret_from_fork+0x35/0x40 [613.088705] [613.088705] Freed by task 1155: [613.088705] __kasan_slab_free+0x12e/0x180 [613.088705] kmem_cache_free+0xa0/0x230 [613.088705] ip6_mc_input+0x40f/0x5a0 [613.088705] ipv6_rcv+0xc9/0x1e0 [613.088705] __netif_receive_skb_one_core+0xc1/0x100 [613.088705] netif_receive_skb_internal+0xc4/0x270 [613.088705] br_pass_frame_up+0x2b9/0x2e0 [613.088705] br_handle_frame_finish+0x2fb/0x7a0 [613.088705] br_handle_frame+0x30f/0x6c0 [613.088705] __netif_receive_skb_core+0x61a/0x15b0 [613.088705] __netif_receive_skb_one_core+0x8e/0x100 [613.088705] netif_receive_skb_internal+0xc4/0x270 [613.088705] tun_sendmsg+0x738/0xc50 [tun] [613.088705] vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net] [613.088705] handle_tx_copy+0x2d0/0x8f0 [vhost_net] [613.088705] handle_tx+0xc7/0x120 [vhost_net] [613.088705] vhost_worker+0x166/0x200 [vhost] [613.088705] kthread+0x1b1/0x1d0 [613.088705] ret_from_fork+0x35/0x40 [613.088705] [613.088705] The buggy address belongs to the object at ffff8881da9ab740 [613.088705] which belongs to the cache skbuff_head_cache of size 232 Fixes: `043d222f93` ("tuntap: accept an array of XDP buffs 
through sendmsg()") Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-12-03 14:10:27 -08:00
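The use-after-free pattern behind 4e4b08e558 is generic: once a buffer has been handed to a consumer that may free it, none of its fields may be read again. A minimal userspace sketch of the first remedy mentioned in the message (copy the length into a local before the hand-off) follows. The names `struct pkt`, `struct stats`, `deliver` and `receive_one` are hypothetical and only stand in for the skb, the stats counters and netif_receive_skb(); the patch itself accounts bytes from the XDP data size instead.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for an skb and per-device counters. */
struct pkt { size_t len; };
struct stats { size_t rx_packets; size_t rx_bytes; };

/* Hypothetical consumer: takes ownership and may free the packet,
 * as the receive path is allowed to do with an skb. */
static void deliver(struct pkt *p)
{
	free(p);
}

/* Returns 0 on success, -1 if the packet could not be allocated. */
static int receive_one(struct stats *st, size_t len)
{
	struct pkt *p = malloc(sizeof(*p));
	size_t bytes;

	if (!p)
		return -1;
	p->len = len;

	bytes = p->len;	/* copy what is needed before the hand-off */
	deliver(p);	/* p must not be dereferenced after this call */

	st->rx_packets++;
	st->rx_bytes += bytes;	/* uses the saved copy, not p->len */
	return 0;
}
```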
Nicolas Dichtel	35b827b6d0	tun: forbid iface creation with rtnl ops It's not supported right now (the goal of the initial patch was to support 'ip link del' only). Before the patch: $ ip link add foo type tun [ 239.632660] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [snip] [ 239.636410] RIP: 0010:register_netdevice+0x8e/0x3a0 This panic occurs because dev->netdev_ops is not set by tun_setup(). But to have something usable, it will require more than just setting netdev_ops. Fixes: `f019a7a594` ("tun: Implement ip link del tunXXX") CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-11-30 17:31:03 -08:00
Nicolas Dichtel	26d31925cd	tun: implement carrier change The userspace may need to control the carrier state. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Didier Pallard <didier.pallard@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-11-30 17:16:38 -08:00
David S. Miller	f2be6d710d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2018-11-19 10:55:00 -08:00
Matthew Cover	8ebebcba55	tuntap: fix multiqueue rx When writing packets to a descriptor associated with a combined queue, the packets should end up on that queue. Before this change all packets written to any descriptor associated with a tap interface end up on rx-0, even when the descriptor is associated with a different queue. The rx traffic can be generated by either of the following. 1. a simple tap program which spins up multiple queues and writes packets to each of the file descriptors 2. tx from a qemu vm with a tap multiqueue netdev The queue for rx traffic can be observed by either of the following (done on the hypervisor in the qemu case). 1. a simple netmap program which opens and reads from per-queue descriptors 2. configuring RPS and doing per-cpu captures with rxtxcpu Alternatively, if you printk() the return value of skb_get_rx_queue() just before each instance of netif_receive_skb() in tun.c, you will get 65535 for every skb. Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes the association between descriptor and rx queue. Signed-off-by: Matthew Cover <matthew.cover@stackpath.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-11-18 19:05:43 -08:00
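The value 65535 reported in 8ebebcba55 is what one would expect if the rx queue is stored offset by one in a 16-bit field, so that zero means "not recorded". The sketch below models that convention in userspace C. It assumes this is how skb_record_rx_queue() and skb_get_rx_queue() behave; `struct fake_skb` and both helpers are hypothetical stand-ins, not the kernel definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of struct sk_buff. */
struct fake_skb { uint16_t queue_mapping; };

/* Assumed convention: store the queue index plus one, so that a
 * zeroed field means no rx queue was recorded. */
static void record_rx_queue(struct fake_skb *skb, uint16_t rx_queue)
{
	skb->queue_mapping = (uint16_t)(rx_queue + 1);
}

/* Reading back subtracts one; on an unrecorded skb the 16-bit
 * result wraps around to 65535. */
static uint16_t get_rx_queue(const struct fake_skb *skb)
{
	return (uint16_t)(skb->queue_mapping - 1);
}
```

Under that assumption, a freshly zeroed skb reads back as 65535, and recording the descriptor's queue index before delivery makes the read return that index.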
Eric Dumazet	aa6daacaa1	tun: use netdev_alloc_frag() in tun_napi_alloc_frags() In order to cook skbs in the same way as Ethernet drivers, it is probably better to not use GFP_KERNEL, but rather use the GFP_ATOMIC and PFMEMALLOC mechanisms provided by netdev_alloc_frag(). This would allow the tun driver to be used even in memory stress situations, especially if swap is used over this tun channel. Fixes: `90e33d4594` ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Petar Penkov <peterpenkov96@gmail.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-11-18 19:01:11 -08:00
David S. Miller	6f0271d929	tun: Adjust on-stack tun_page initialization. Instead of constantly playing with the struct initializer syntax trying to make both gcc and Clang happy, just clear it out using memset(). >> drivers/net/tun.c:2503:42: warning: Using plain integer as NULL pointer Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-11-17 16:53:46 -08:00
Jason Wang	f9e06c45cb	tuntap: free XDP dropped packets in a batch Thanks to the XDP buffs batched through msg_control, instead of calling put_page() for each page, which involves an atomic operation, let's batch them by recording the last page that needs to be freed and its refcount, and free them in a batch. Testpmd(virtio-user + vhost_net) + XDP_DROP shows 3.8% improvement. Before: 4.71Mpps After : 4.89Mpps Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-11-17 12:00:42 -08:00
Paolo Abeni	f29eb2a96c	tun: compute the RFS hash only if needed. The tun XDP sendmsg code path unconditionally computes the symmetric hash of each packet for RFS's sake, even when we could skip it, e.g. when the device has a single queue. This change adds the check already in place for the skb sendmsg path to avoid unneeded hashing. The above gives a small, but measurable, performance gain for the VM xmit path when zerocopy is not enabled. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-11-07 22:22:16 -08:00
Serhey Popovych	df52eab23d	tun: Consistently configure generic netdev params via rtnetlink Configuring generic network device parameters on tun will fail in the presence of the IFLA_INFO_KIND attribute in the IFLA_LINKINFO nested attribute, since tun_validate() always returns failure. This can be visualized with the following ip-link(8) command sequences: # ip link set dev tun0 group 100 # ip link set dev tun0 group 100 type tun RTNETLINK answers: Invalid argument in contrast to the dummy and veth drivers: # ip link set dev dummy0 group 100 # ip link set dev dummy0 type dummy # ip link set dev veth0 group 100 # ip link set dev veth0 group 100 type veth Fix by returning zero in tun_validate() when @data is NULL, which is always the case since rtnl_link_ops->maxtype is zero in the tun driver. Fixes: `f019a7a594` ("tun: Implement ip link del tunXXX") Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-15 21:40:31 -07:00
Wang Li	4b035271fe	net: tun: remove useless codes of tun_automq_select_queue Because the function __skb_get_hash_symmetric always returns non-zero. Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Wang Li <wangli39@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:35:22 -07:00
David S. Miller	6f41617bf2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net' overlapped the renaming of a netlink attribute in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-03 21:00:17 -07:00
Eric Dumazet	af3fb24eec	tun: napi flags belong to tfile Since tun->flags might be shared by multiple tfile structures, it is better to make sure tun_get_user() is using the flags for the current tfile. Presence of the READ_ONCE() in tun_napi_frags_enabled() gave a hint of what could happen, but we need something stronger to please syzbot. kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 13647 Comm: syz-executor5 Not tainted 4.19.0-rc5+ #59 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427 Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4 RSP: 0018:ffff8801c400f410 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8618d325 RDX: 000000000000001a RSI: ffffffff86189f97 RDI: 00000000000000d0 RBP: ffff8801c400f608 R08: ffff8801c8fb4300 R09: 0000000000000000 R10: ffffed0038801ed7 R11: 0000000000000003 R12: ffff8801d327d358 R13: 0000000000000000 R14: ffff8801c16dd8c0 R15: 0000000000000004 FS: 00007fe003615700(0000) GS:ffff8801dac00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe1f3c43db8 CR3: 00000001bebb2000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: napi_gro_frags+0x3f4/0xc90 net/core/dev.c:5715 tun_get_user+0x31d5/0x42a0 drivers/net/tun.c:1922 tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1967 call_write_iter include/linux/fs.h:1808 [inline] new_sync_write fs/read_write.c:474 [inline] __vfs_write+0x6b8/0x9f0 fs/read_write.c:487 vfs_write+0x1fc/0x560 fs/read_write.c:549 ksys_write+0x101/0x260 fs/read_write.c:598 
__do_sys_write fs/read_write.c:610 [inline] __se_sys_write fs/read_write.c:607 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:607 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457579 Code: 1d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fe003614c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457579 RDX: 0000000000000012 RSI: 0000000020000000 RDI: 000000000000000a RBP: 000000000072c040 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0036156d4 R13: 00000000004c5574 R14: 00000000004d8e98 R15: 00000000ffffffff Modules linked in: RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427 Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4 RSP: 0018:ffff8801c400f410 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8618d325 RDX: 000000000000001a RSI: ffffffff86189f97 RDI: 00000000000000d0 RBP: ffff8801c400f608 R08: ffff8801c8fb4300 R09: 0000000000000000 R10: ffffed0038801ed7 R11: 0000000000000003 R12: ffff8801d327d358 R13: 0000000000000000 R14: ffff8801c16dd8c0 R15: 0000000000000004 FS: 00007fe003615700(0000) GS:ffff8801dac00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe1f3c43db8 CR3: 00000001bebb2000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: `90e33d4594` ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> 
Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 23:27:28 -07:00
Eric Dumazet	c7256f579f	tun: initialize napi_mutex unconditionally This is the first part to fix following syzbot report : console output: https://syzkaller.appspot.com/x/log.txt?x=145378e6400000 kernel config: https://syzkaller.appspot.com/x/.config?x=443816db871edd66 dashboard link: https://syzkaller.appspot.com/bug?extid=e662df0ac1d753b57e80 Following patch is fixing the race condition, but it seems safer to initialize this mutex at tfile creation anyway. Fixes: `90e33d4594` ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+e662df0ac1d753b57e80@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 23:27:28 -07:00
Eric Dumazet	06e55addd3	tun: remove unused parameters tun_napi_disable() and tun_napi_del() do not need a pointer to the tun_struct Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 23:27:28 -07:00
David S. Miller	a06ee256e5	Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net Version bump conflict in batman-adv, take what's in net-next. iavf conflict, adjustment of netdev_ops in net-next conflicting with poll controller method removal in net. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-25 10:35:29 -07:00
Eric Dumazet	765cdc209c	tun: remove ndo_poll_controller As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture can last for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. tun uses NAPI for TX completions, so we better let core networking stack call the napi->poll() to avoid the capture. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-23 21:55:25 -07:00
Jason Wang	043d222f93	tuntap: accept an array of XDP buffs through sendmsg() This patch implements the TUN_MSG_PTR msg_control type. This type allows the caller to pass an array of XDP buffs to tuntap through the ptr field of struct tun_msg_ctl. If an XDP program is attached, tuntap can run the XDP program directly. If not, tuntap will build an skb and do a fast receive, since part of the work has already been done by vhost_net. This avoids many indirect calls, which improves icache utilization, and allows batched XDP flushing when doing XDP redirection. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-13 09:25:40 -07:00
Jason Wang	fe8dd45bb7	tun: switch to new type of msg_control This patch introduces a new tun/tap specific msg_control: #define TUN_MSG_UBUF 1 #define TUN_MSG_PTR 2 struct tun_msg_ctl { int type; void *ptr; }; This allows us to pass different kinds of msg_control through sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF), which will be used by the existing vhost_net zerocopy code. The second is XDP buff, which allows vhost_net to pass XDP buffs to TUN. This could be used to implement accepting an array of XDP buffs from vhost_net in the following patches. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-13 09:25:40 -07:00
Jason Wang	1a097910ad	tuntap: move XDP flushing out of tun_do_xdp() This will allow adding batch flushing on top. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-13 09:25:40 -07:00
Jason Wang	8ae1aff0b3	tuntap: split out XDP logic This patch splits out the XDP logic into a single function so that it can be reused by the XDP batching path in the following patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-13 09:25:40 -07:00
Jason Wang	ac1f1f6c5a	tuntap: tweak on the path of skb XDP case in tun_build_skb() If we're sure not to go native XDP, there's no need for several things like bh and rcu handling. So this patch introduces a helper to build an skb and hold the page refcount. When we find that we will go through the skb path, build the skb directly. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-13 09:25:40 -07:00
Jason Wang	f7053b6ccb	tuntap: simplify error handling in tun_build_skb() There's no need to duplicate the page get logic in each action. So this patch gets the page and calculates the offset before processing XDP actions (except for XDP_DROP), and undoes them on errors (we don't care about performance on errors). This will be used for factoring out XDP logic. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-13 09:25:40 -07:00
Jason Wang	291aeb2b1d	tuntap: enable bh early during processing XDP This patch moves the bh enabling a little earlier; this will be used for factoring out the core XDP logic of tuntap. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-13 09:25:40 -07:00
Jason Wang	4f23aff871	tuntap: switch to use XDP_PACKET_HEADROOM Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-13 09:25:40 -07:00
Jason Wang	e4a2a3048e	net: sock: introduce SOCK_XDP This patch introduces a new sock flag - SOCK_XDP. This will be used for notifying the upper layer that an XDP program is attached on the lower socket and requires extra headroom. TUN will be the first user. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-13 09:25:40 -07:00
Linus Torvalds	0214f46b3a	Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull core signal handling updates from Eric Biederman: "It was observed that a periodic timer in combination with a sufficiently expensive fork could prevent fork from ever completing. This contains the changes to remove the need for that restart. This set of changes is split into several parts: - The first part makes PIDTYPE_TGID a proper pid type instead of something only for very special cases. The part starts using PIDTYPE_TGID enough so that in __send_signal where signals are actually delivered we know if the signal is being sent to a group of processes or just a single process. - With that prep work out of the way the logic in fork is modified so that fork logically makes signals received while it is running appear to be received after the fork completes" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits) signal: Don't send signals to tasks that don't exist signal: Don't restart fork when signals come in. fork: Have new threads join on-going signal group stops fork: Skip setting TIF_SIGPENDING in ptrace_init_task signal: Add calculate_sigpending() fork: Unconditionally exit if a fatal signal is pending fork: Move and describe why the code examines PIDNS_ADDING signal: Push pid type down into complete_signal. 
signal: Push pid type down into __send_signal signal: Push pid type down into send_signal signal: Pass pid type into do_send_sig_info signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task signal: Pass pid type into group_send_sig_info signal: Pass pid and pid type into send_sigqueue posix-timers: Noralize good_sigevent signal: Use PIDTYPE_TGID to clearly store where file signals will be sent pid: Implement PIDTYPE_TGID pids: Move the pgrp and session pid pointers from task_struct to signal_struct kvm: Don't open code task_pid in kvm_vcpu_ioctl pids: Compute task_tgid using signal->leader_pid ...	2018-08-21 13:47:29 -07:00
Li RongQing	f13b546847	tun: not use hardcoded mask value 0x3ff in tun_hashfn is the mask of TUN_NUM_FLOW_ENTRIES; instead of hardcoding it, define a macro that sets up the relationship with TUN_NUM_FLOW_ENTRIES Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-04 13:15:11 -07:00
Eric W. Biederman	019191342f	signal: Use PIDTYPE_TGID to clearly store where file signals will be sent When f_setown is called a pid and a pid type are stored. Replace the use of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread group. Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now is only for a thread. Update the users of __f_setown to use PIDTYPE_TGID instead of PIDTYPE_PID. For now the code continues to capture task_pid (when task_tgid would really be appropriate), and iterate on PIDTYPE_PID (even when type == PIDTYPE_TGID) out of an abundance of caution to preserve existing behavior. Oleg Nesterov suggested using the test to ensure we use PIDTYPE_PID for tgid lookup also be used to avoid taking the tasklist lock. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2018-07-21 10:43:12 -05:00
David S. Miller	c4c5551df1	Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-20 21:17:12 -07:00
Toshiaki Makita	6e8cfd6d9d	tun: Fix use-after-free on XDP_TX On XDP_TX we need to free up the frame only when tun_xdp_tx() returns a negative value. A positive value indicates that the packet is successfully enqueued to the ptr_ring, so freeing the page causes use-after-free. Fixes: `735fc4054b` ("xdp: change ndo_xdp_xmit API to support bulking") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 13:38:29 -07:00
David S. Miller	2aa4a3378a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2018-07-15 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Various different arm32 JIT improvements in order to optimize code emission and make the JIT code itself more robust, from Russell. 2) Support simultaneous driver and offloaded XDP in order to allow for advanced use-cases where some work is offloaded to the NIC and some to the host. Also add ability for bpftool to load programs and maps beyond just the cgroup case, from Jakub. 3) Add BPF JIT support in nfp for multiplication as well as division. For the latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong. 4) Add BTF pretty print functionality to bpftool in plain and JSON output format, from Okash. 5) Add build and installation to the BPF helper man page into bpftool, from Quentin. 6) Add a TCP BPF callback for listening sockets which is triggered right after the socket transitions to TCP_LISTEN state, from Andrey. 7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup tree and prints all attached programs, from Roman. 8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged packets, from Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-14 18:47:44 -07:00
Jakub Kicinski	6b86758973	xdp: don't make drivers report attachment mode prog_attached of struct netdev_bpf should have been superseded by simply setting prog_id a long time ago, but we kept it around to allow offloading drivers to communicate attachment mode (drv vs hw). Subsequently drivers were also allowed to report back attachment flags (prog_flags), and since nowadays only programs attached with XDP_FLAGS_HW_MODE can get offloaded, we can tell the attachment mode from the flags the driver reports. Remove prog_attached member. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-07-13 20:26:35 +02:00
Alexander Duyck	4f49dec907	net: allow ndo_select_queue to pass netdev This patch makes it so that instead of passing a void pointer as the accel_priv we instead pass a net_device pointer as sb_dev. Making this change allows us to pass the subordinate device through to the fallback function eventually so that we can keep the actual code in the ndo_select_queue call as focused as possible on the exception cases. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2018-07-09 13:41:34 -07:00
Willem de Bruijn	fd3a886258	net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlan Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr to communicate packet metadata to userspace. For skbuffs with vlan, the first two return the packet as it may have existed on the wire, inserting the VLAN tag in the user buffer. Then virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes. Commit `f09e2249c4` ("macvtap: restore vlan header on user read") added this feature to macvtap. Commit `3ce9b20f19` ("macvtap: Fix csum_start when VLAN tags are present") then fixed up csum_start. Virtio, packet and uml do not insert the vlan header in the user buffer. When introducing virtio_net_hdr_from_skb to deduplicate filling in the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was applied uniformly, breaking csum offset for packets with vlan on virtio and packet. Make insertion of VLAN_HLEN optional. Convert the callers to pass it when needed. Fixes: `e858fae2b0` ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion") Fixes: `1276f24eee` ("packet: use common code for virtio_net_hdr and skb GSO conversion") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-06-07 16:15:38 -04:00
David S. Miller	fd129f8941	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2018-06-05 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Add a new BPF hook for sendmsg similar to existing hooks for bind and connect: "This allows to override source IP (including the case when it's set via cmsg(3)) and destination IP:port for unconnected UDP (slow path). TCP and connected UDP (fast path) are not affected. This makes UDP support complete, that is, connected UDP is handled by connect hooks, unconnected by sendmsg ones.", from Andrey. 2) Rework of the AF_XDP API to allow extending it in future for type writer model if necessary. In this mode a memory window is passed to hardware and multiple frames might be filled into that window instead of just one that is the case in the current fixed frame-size model. With the new changes made this can be supported without having to add a new descriptor format. Also, core bits for the zero-copy support for AF_XDP have been merged as agreed upon, where i40e bits will be routed via Jeff later on. Various improvements to documentation and sample programs included as well, all from Björn and Magnus. 3) Given BPF's flexibility, a new program type has been added to implement infrared decoders. Quote: "The kernel IR decoders support the most widely used IR protocols, but there are many protocols which are not supported. [...] There is a 'long tail' of unsupported IR protocols, for which lircd is need to decode the IR. IR encoding is done in such a way that some simple circuit can decode it; therefore, BPF is ideal. [...] user-space can define a decoder in BPF, attach it to the rc device through the lirc chardev.", from Sean. 
4) Several improvements and fixes to BPF core, among others, dumping map and prog IDs into fdinfo which is a straight forward way to correlate BPF objects used by applications, removing an indirect call and therefore retpoline in all map lookup/update/delete calls by invoking the callback directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper for tc BPF programs to have an efficient way of looking up cgroup v2 id for policy or other use cases. Fixes to make sure we zero tunnel/xfrm state that hasn't been filled, to allow context access wrt pt_regs in 32 bit archs for tracing, and last but not least various test cases for fixes that landed in bpf earlier, from Daniel. 5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper. 6) Add a new bpf_get_current_cgroup_id() helper that can be used in tracing to retrieve the cgroup id from the current process in order to allow for e.g. aggregation of container-level events, from Yonghong. 7) Two follow-up fixes for BTF to reject invalid input values and related to that also two test cases for BPF kselftests, from Martin. 8) Various API improvements to the bpf_fib_lookup() helper, that is, dropping MPLS bits which are not fully hashed out yet, rejecting invalid helper flags, returning error for unsupported address families as well as renaming flowlabel to flowinfo, from David. 9) Various fixes and improvements to sockmap BPF kselftests in particular in proper error detection and data verification, from Prashant. 10) Two arm32 BPF JIT improvements. One is to fix imm range check with regards to whether immediate fits into 24 bits, and a naming cleanup to get functions related to rsh handling consistent to those handling lsh, from Wang. 11) Two compile warning fixes in BPF, one for BTF and a false positive to silent gcc in stack_map_get_build_id_offset(), from Arnd. 
12) Add missing seg6.h header into tools include infrastructure in order to fix compilation of BPF kselftests, from Mathieu. 13) Several formatting cleanups in the BPF UAPI helper description that also fix an error during rst2man compilation, from Quentin. 14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is not built into the kernel, from Yue. 15) Remove a useless double assignment in dev_map_enqueue(), from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-06-05 12:42:19 -04:00
Jesper Dangaard Brouer	42421a5654	tun: remove ndo_xdp_flush call tun_xdp_flush Remove the ndo_xdp_flush call implementation tun_xdp_flush as no callers of ndo_xdp_flush are left. The tun drivers XDP_TX implementation also used tun_xdp_flush (and tun_xdp_xmit). This is easily solved by passing the XDP_XMIT_FLUSH flag to tun_xdp_xmit in tun_xdp_tx. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-06-05 14:03:16 +02:00


587 Commits