2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

554 Commits

Author SHA1 Message Date
Matthias Schiffer
019b13ae85 vxlan: fix incorrect nlattr access in MTU check
The access to the wrong variable could lead to a NULL dereference and
possibly other invalid memory reads in vxlan newlink/changelink requests
with a IFLA_MTU attribute.

Fixes: a985343ba9 "vxlan: refactor verification and application of configuration"
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-27 14:40:35 -04:00
Matthias Schiffer
a8b8a889e3 net: add netlink_ext_ack argument to rtnl_link_ops.validate
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:22 -04:00
Matthias Schiffer
ad744b223c net: add netlink_ext_ack argument to rtnl_link_ops.changelink
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:22 -04:00
Matthias Schiffer
7a3f4a1851 net: add netlink_ext_ack argument to rtnl_link_ops.newlink
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:21 -04:00
Matthias Schiffer
49f810f00f vxlan: allow multiple VXLANs with same VNI for IPv6 link-local addresses
As link-local addresses are only valid for a single interface, we can allow
to use the same VNI for multiple independent VXLANs, as long as the used
interfaces are distinct. This way, VXLANs can always be used as a drop-in
replacement for VLANs with greater ID space.

This also extends VNI lookup to respect the ifindex when link-local IPv6
addresses are used, so using the same VNI on multiple interfaces can
actually work.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:37:03 -04:00
Matthias Schiffer
87613de950 vxlan: fix snooping for link-local IPv6 addresses
If VXLAN is run over link-local IPv6 addresses, it is necessary to store
the ifindex in the FDB entries. Otherwise, the used interface is undefined
and unicast communication will most likely fail.

Support for link-local IPv4 addresses should be possible as well, but as
the semantics aren't as well defined as for IPv6, and there doesn't seem to
be much interest in having the support, it's not implemented for now.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:37:03 -04:00
Matthias Schiffer
0f22a3c68d vxlan: check valid combinations of address scopes
* Multicast addresses are never valid as local address
* Link-local IPv6 unicast addresses may only be used as remote when the
  local address is link-local as well
* Don't allow link-local IPv6 local/remote addresses without interface

We also store in the flags field if link-local addresses are used for the
follow-up patches that actually make VXLAN over link-local IPv6 work.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:37:02 -04:00
Matthias Schiffer
ce44a4aea5 vxlan: improve validation of address family configuration
Address families of source and destination addresses must match, and
changelink operations can't change the address family.

In addition, always use the VXLAN_F_IPV6 to check if a VXLAN device uses
IPv4 or IPv6.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:37:02 -04:00
Matthias Schiffer
dc5321d796 vxlan: get rid of redundant vxlan_dev.flags
There is no good reason to keep the flags twice in vxlan_dev and
vxlan_config.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:37:02 -04:00
Matthias Schiffer
a985343ba9 vxlan: refactor verification and application of configuration
The vxlan_dev_configure function was mixing validation and application of
the vxlan configuration; this could easily lead to bugs with the changelink
operation, as it was hard to see if the function wcould return an error
after parts of the configuration had already been applied.

This commit splits validation and application out of vxlan_dev_configure as
separate functions to make it clearer where error returns are allowed and
where the vxlan_dev or net_device may be configured. Log messages in these
functions are removed, as it is generally unexpected to find error output
for netlink requests in the kernel log. Userspace should be able to handle
errors based on the error codes returned via netlink just fine.

In addition, some validation and initialization is moved to vxlan_validate
and vxlan_setup respectively to improve grouping of similar settings.

Finally, this also fixes two actual bugs:

* if set, conf->mtu would overwrite dev->mtu in each changelink operation,
  reverting other changes of dev->mtu
* the "if (!conf->dst_port)" branch would never be run, as conf->dst_port
  was set in vxlan_setup before. This caused VXLAN-GPE to use the same
  default port as other VXLAN sockets instead of the intended IANA-assigned
  4790.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:37:01 -04:00
Johannes Berg
d58ff35122 networking: make skb_push & __skb_push return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    @@
    expression SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - fn(SKB, LEN)[0]
    + *(u8 *)fn(SKB, LEN)

Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:40 -04:00
Johannes Berg
b080db5853 networking: convert many more places to skb_put_zero()
There were many places that my previous spatch didn't find,
as pointed out by yuan linyu in various patches.

The following spatch found many more and also removes the
now unnecessary casts:

    @@
    identifier p, p2;
    expression len;
    expression skb;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_zero(skb, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_zero(skb, len);
    )
    ... when != p
    (
    p2 = (t2)p;
    -memset(p2, 0, len);
    |
    -memset(p, 0, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_zero(skb, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_zero(skb, sizeof(t));
    )
    ... when != p
    (
    p2 = (t2)p;
    -memset(p2, 0, sizeof(*p));
    |
    -memset(p, 0, sizeof(*p));
    )

    @@
    expression skb, len;
    @@
    -memset(skb_put(skb, len), 0, len);
    +skb_put_zero(skb, len);

Apply it to the tree (with one manual fixup to keep the
comment in vxlan.c, which spatch removed.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:35 -04:00
David S. Miller
0ddead90b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 11:59:32 -04:00
Roopa Prabhu
e0090a9e97 vxlan: dont migrate permanent fdb entries during learn
This patch fixes vxlan_snoop to not move permanent fdb entries
on learn events. This is consistent with the bridge fdb
handling of permanent entries.

Fixes: 26a41ae604 ("vxlan: only migrate dynamic FDB entries")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-12 11:01:00 -04:00
David S. Miller
cf124db566 net: Fix inconsistent teardown and release of private netdev state.
Network devices can allocate reasources and private memory using
netdev_ops->ndo_init().  However, the release of these resources
can occur in one of two different places.

Either netdev_ops->ndo_uninit() or netdev->destructor().

The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.

netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.

netdev->destructor(), on the other hand, does not run until the
netdev references all go away.

Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().

This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.

If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
it is not able to invoke netdev->destructor().

This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.

However, this means that the resources that would normally be released
by netdev->destructor() will not be.

Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.

Many drivers do not try to deal with this, and instead we have leaks.

Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().

netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().

netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().

Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().

And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07 15:53:24 -04:00
Mark Bloch
57d88182ea vxlan: use a more suitable function when assigning NULL
When stopping the vxlan interface we detach it from the socket.
Use RCU_INIT_POINTER() and not rcu_assign_pointer() to do so.

Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07 15:26:03 -04:00
Mark Bloch
a53cb29b0a vxlan: fix use-after-free on deletion
Adding a vxlan interface to a socket isn't symmetrical, while adding
is done in vxlan_open() the deletion is done in vxlan_dellink().
This can cause a use-after-free error when we close the vxlan
interface before deleting it.

We add vxlan_vs_del_dev() to match vxlan_vs_add_dev() and call
it from vxlan_stop() to match the call from vxlan_open().

Fixes: 56ef9c909b ("vxlan: Move socket initialization to within rtnl scope")
Acked-by: Jiri Benc <jbenc@redhat.com>
Tested-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-02 14:29:16 -04:00
Lance Richardson
35cf284556 vxlan: eliminate cached dst leak
After commit 0c1d70af92 ("net: use dst_cache for vxlan device"),
cached dst entries could be leaked when more than one remote was
present for a given vxlan_fdb entry, causing subsequent netns
operations to block indefinitely and "unregister_netdevice: waiting
for lo to become free." messages to appear in the kernel log.

Fix by properly releasing cached dst and freeing resources in this
case.

Fixes: 0c1d70af92 ("net: use dst_cache for vxlan device")
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-01 14:45:21 -04:00
Jiri Benc
baf4d78607 vxlan: do not output confusing error message
The message "Cannot bind port X, err=Y" creates only confusion. In metadata
based mode, failure of IPv6 socket creation is okay if IPv6 is disabled and
no error message should be printed. But when IPv6 tunnel was requested, such
failure is fatal. The vxlan_socket_create does not know when the error is
harmless and when it's not.

Instead of passing such information down to vxlan_socket_create, remove the
message completely. It's not useful. We propagate the error code up to the
user space and the port number comes from the user space. There's nothing in
the message that the process creating vxlan interface does not know.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:30:13 -04:00
Jiri Benc
d074bf9600 vxlan: correctly handle ipv6.disable module parameter
When IPv6 is compiled but disabled at runtime, __vxlan_sock_add returns
-EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole
operation of bringing up the tunnel.

Ignore failure of IPv6 socket creation for metadata based tunnels caused by
IPv6 not being available.

Fixes: b1be00a6c3 ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:30:13 -04:00
Vincent Bernat
f1fb08f633 vxlan: fix ND proxy when skb doesn't have transport header offset
When an incoming frame is tagged or when GRO is disabled, the skb
handled to vxlan_xmit() doesn't contain a valid transport header
offset. This makes ND proxying fail.

We combine two changes: replace use of skb_transport_offset() and ensure
the necessary amount of skb is linear just before using it:

 - In vxlan_xmit(), when determining if we have an ICMPv6 neighbor
   discovery packet, just check if it is an ICMPv6 packet and rely on
   neigh_reduce() to do more checks if this is the case. The use of
   pskb_may_pull() is replaced by skb_header_pointer() for just the IPv6
   header.

 - In neigh_reduce(), add pskb_may_pull() for IPv6 header and neighbor
   discovery message since this was removed from vxlan_xmit(). Replace
   skb_transport_header() with ipv6_hdr() + 1.

 - In vxlan_na_create(), replace first skb_transport_offset() with
   ipv6_hdr() + 1 and second with skb_network_offset() + sizeof(struct
   ipv6hdr). Additionally, ensure we pskb_may_pull() the whole skb as we
   need it to iterate over the options.

Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03 18:50:42 -07:00
Felix Manlunas
d6acfeb17d vxlan: vxlan dev should inherit lowerdev's gso_max_size
vxlan dev currently ignores lowerdev's gso_max_size, which adversely
affects TSO performance of liquidio if it's the lowerdev.  Egress TCP
packets' skb->len often exceed liquidio's advertised gso_max_size.  This
may happen on other NIC drivers.

Fix it by assigning lowerdev's gso_max_size to that of vxlan dev.  Might as
well do likewise for gso_max_segs.

Single flow TSO throughput of liquidio as lowerdev (using iperf3):

    Before the patch:    139 Mbps
    After the patch :   8.68 Gbps
    Percent increase:  6,144 %

Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 11:43:38 -07:00
Roopa Prabhu
def499c929 vxlan: don't age NTF_EXT_LEARNED fdb entries
vxlan driver already implicitly supports installing
of external fdb entries with NTF_EXT_LEARNED. This
patch just makes sure these entries are not aged
by the vxlan driver. An external entity managing these
entries will age them out. This is consistent with
the use of NTF_EXT_LEARNED in the bridge driver.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 17:57:16 -07:00
David S. Miller
101c431492 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/genet/bcmgenet.c
	net/core/sock.c

Conflicts were overlapping changes in bcmgenet and the
lockdep handling of sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 11:59:10 -07:00
Nicolas Dichtel
c80498e36d vxlan: fix ovs support
The required changes in the function vxlan_dev_create() were missing
in commit 8bcdc4f3a2.
The vxlan device is not registered anymore after this patch and the error
path causes an stack dump:
 WARNING: CPU: 3 PID: 1498 at net/core/dev.c:6713 rollback_registered_many+0x9d/0x3f0

Fixes: 8bcdc4f3a2 ("vxlan: add changelink support")
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 16:03:42 -07:00
Vincent Bernat
8f48ba71ed vxlan: use appropriate family on L3 miss
When sending a L3 miss, the family is set to AF_INET even for IPv6. This
causes userland (eg "ip monitor") to be confused. Ensure we send the
appropriate family in this case. For L2 miss, keep using AF_INET.

Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:31:05 -07:00
Jakub Kicinski
56de859e99 vxlan: lock RCU on TX path
There is no guarantees that callers of the TX path will hold
the RCU lock.  Grab it explicitly.

Fixes: c6fcc4fc5f ("vxlan: avoid using stale vxlan socket.")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 09:58:31 -08:00
Brian Russell
1158632b5a vxlan: don't allow overwrite of config src addr
When using IPv6 transport and a default dst, a pointer to the configured
source address is passed into the route lookup. If no source address is
configured, then the value is overwritten.

IPv6 route lookup ignores egress ifindex match if the source address is set,
so if egress ifindex match is desired, the source address must be passed
as any. The overwrite breaks this for subsequent lookups.

Avoid this by copying the configured address to an existing stack variable
and pass a pointer to that instead.

Fixes: 272d96a5ab ("net: vxlan: lwt: Use source ip address during route lookup.")

Signed-off-by: Brian Russell <brussell@brocade.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24 13:36:24 -05:00
Matthias Schiffer
4e37d6911f vxlan: correctly validate VXLAN ID against VXLAN_N_VID
The incorrect check caused an off-by-one error: the maximum VID 0xffffff
was unusable.

Fixes: d342894c5d ("vxlan: virtual extensible lan")
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24 11:42:55 -05:00
Roopa Prabhu
8dcd81a993 vxlan: remove unused variable saddr in neigh_reduce
silences the below warning:
    drivers/net/vxlan.c: In function ‘neigh_reduce’:
    drivers/net/vxlan.c:1599:25: warning: variable ‘saddr’ set but not used
    [-Wunused-but-set-variable]

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21 12:19:21 -05:00
Roopa Prabhu
8bcdc4f3a2 vxlan: add changelink support
This patch adds changelink rtnl op support for vxlan netdevs.
code changes involve:
    - refactor vxlan_newlink into vxlan_nl2conf to be
    used by vxlan_newlink and vxlan_changelink
    - vxlan_nl2conf and vxlan_dev_configure take a
    changelink argument to isolate changelink checks
    and updates.
    - Allow changing only a few attributes:
        - return -EOPNOTSUPP for attributes that cannot
        be changed for now. Incremental patches can
        make the non-supported one available in the future
        if needed.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21 12:19:21 -05:00
David S. Miller
f787d1debf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-19 11:18:46 -05:00
Paolo Abeni
22f0708a71 vxlan: fix oops in dev_fill_metadata_dst
Since the commit 0c1d70af92 ("net: use dst_cache for vxlan device")
vxlan_fill_metadata_dst() calls vxlan_get_route() passing a NULL
dst_cache pointer, so the latter should explicitly check for
valid dst_cache ptr. Unfortunately the commit d71785ffc7 ("net: add
dst_cache to ovs vxlan lwtunnel") removed said check.

As a result is possible to trigger a null pointer access calling
vxlan_fill_metadata_dst(), e.g. with:

ovs-vsctl add-br ovs-br0
ovs-vsctl add-port ovs-br0 vxlan0 -- set interface vxlan0 \
	type=vxlan options:remote_ip=192.168.1.1 \
	options:key=1234 options:dst_port=4789 ofport_request=10
ip address add dev ovs-br0 172.16.1.2/24
ovs-vsctl set Bridge ovs-br0 ipfix=@i -- --id=@i create IPFIX \
	targets=\"172.16.1.1:1234\" sampling=1
iperf -c 172.16.1.1 -u -l 1000 -b 10M -t 1 -p 1234

This commit addresses the issue passing to vxlan_get_route() the
dst_cache already available into the lwt info processed by
vxlan_fill_metadata_dst().

Fixes: d71785ffc7 ("net: add dst_cache to ovs vxlan lwtunnel")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 15:32:06 -05:00
Roopa Prabhu
98eb253cbd vxlan: remove vni zero check and drop for COLLECT_METADATA
This patch drops the vni zero check for COLLECT_METADATA mode.
It is not really needed, vni zero is a valid vni.

Fixes: 3ad7a4b141 ("vxlan: support fdb and learning in COLLECT_METADATA mode"
Reported-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-11 21:25:50 -05:00
Roopa Prabhu
3ad7a4b141 vxlan: support fdb and learning in COLLECT_METADATA mode
Vxlan COLLECT_METADATA mode today solves the per-vni netdev
scalability problem in l3 networks. It expects all forwarding
information to be present in dst_metadata. This patch series
enhances collect metadata mode to include the case where only
vni is present in dst_metadata, and the vxlan driver can then use
the rest of the forwarding information datbase to make forwarding
decisions. There is no change to default COLLECT_METADATA
behaviour. These changes only apply to COLLECT_METADATA when
used with the bridging use-case with a special dst_metadata
tunnel info flag (eg: where vxlan device is part of a bridge).
For all this to work, the vxlan driver will need to now support a
single fdb table hashed by mac + vni. This series essentially makes
this happen.

use-case and workflow:
vxlan collect metadata device participates in bridging vlan
to vn-segments. Bridge driver above the vxlan device,
sends the vni corresponding to the vlan in the dst_metadata.
vxlan driver will lookup forwarding database with (mac + vni)
for the required remote destination information to forward the
packet.

Changes introduced by this patch:
    - allow learning and forwarding database state in vxlan netdev in
      COLLECT_METADATA mode. Current behaviour is not changed
      by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used
      to support the new bridge friendly mode.
    - A single fdb table hashed by (mac, vni) to allow fdb entries with
      multiple vnis in the same fdb table
    - rx path already has the vni
    - tx path expects a vni in the packet with dst_metadata
    - prior to this series, fdb remote_dsts carried remote vni and
      the vxlan device carrying the fdb table represented the
      source vni. With the vxlan device now representing multiple vnis,
      this patch adds a src vni attribute to the fdb entry. The remote
      vni already uses NDA_VNI attribute. This patch introduces
      NDA_SRC_VNI netlink attribute to represent the src vni in a multi
      vni fdb table.

iproute2 example (patched and pruned iproute2 output to just show
relevant fdb entries):
example shows same host mac learnt on two vni's.

before (netdev per vni):
$bridge fdb show | grep "00:02:00:00:00:03"
00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self
00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self

after this patch with collect metadata in bridged mode (single netdev):
$bridge fdb show | grep "00:02:00:00:00:03"
00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self
00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 15:21:21 -05:00
David S. Miller
4e8f2fc1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two trivial overlapping changes conflicts in MPLS and mlx5.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28 10:33:06 -05:00
Balakrishnan Raman
efb5f68f32 vxlan: do not age static remote mac entries
Mac aging is applicable only for dynamically learnt remote mac
entries. Check for user configured static remote mac entries
and skip aging.

Signed-off-by: Balakrishnan Raman <ramanb@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 15:01:58 -05:00
Roopa Prabhu
8b3f9337e1 vxlan: don't flush static fdb entries on admin down
This patch skips flushing static fdb entries in
ndo_stop, but flushes all fdb entries during vxlan
device delete. This is consistent with the bridge
driver fdb

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 15:01:57 -05:00
Lance Richardson
1f6cc07e17 vxlan: preserve type of dst_port parm for encap_bypass_if_local()
Eliminate sparse warning by maintaining type of dst_port
as __be16.

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:12:14 -05:00
Lance Richardson
d5ff72d9af vxlan: fix byte order of vxlan-gpe port number
vxlan->cfg.dst_port is in network byte order, so an htons()
is needed here. Also reduced comment length to stay closer
to 80 column width (still slightly over, however).

Fixes: e1e5314de0 ("vxlan: implement GPE")
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17 15:23:43 -05:00
Martynas Pumputis
4ecb1d83f6 vxlan: Set ports in flow key when doing route lookups
Otherwise, a xfrm policy with sport/dport being set cannot be matched.

Signed-off-by: Martynas Pumputis <martynas@weave.works>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11 16:09:47 -05:00
David S. Miller
2745529ac7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Couple conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the
   RX tail pointer properly overlapped with some changes
   to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
   overlapping with a reorganization of the driver to support
   ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the
   stmmac driver, meanwhile a lot of this code was cleaned up
   and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in
   'net-next' called __fl_delete() and this overlapped with
   Daniel Borkamann's bug fix to use RCU for object destruction
   in 'net'.  It also overlapped with Jiri's change to guard
   the rhashtable_remove_fast() call with a check against
   tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some
   unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head()
   bug fix in 'net' overlapped with a large reorganization of
   the same code in 'net-next'.  Since the 'net-next' code no
   longer had the bug in question, there was nothing to do
   other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 12:29:53 -05:00
Haishuang Yan
17b463654f vxlan: fix a potential issue when create a new vxlan fdb entry.
vxlan_fdb_append may return error, so add the proper check,
otherwise it will cause memory leak.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>

Changes in v2:
  - Unnecessary to initialize rc to zero.
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 12:02:49 -05:00
Alexey Dobriyan
c7d03a00b5 netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

	void f(long *p, int i)
	{
		g(p[i]);
	}

  roughly translates to

	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call 	g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

	static inline void *net_generic(const struct net *net, int id)
	{
		...
		ptr = ng->ptr[id - 1];
		...
	}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
	function                                     old     new   delta
	nfsd4_lock                                  3886    3959     +73
	tipc_link_build_proto_msg                   1096    1140     +44
	mac80211_hwsim_new_radio                    2776    2808     +32
	tipc_mon_rcv                                1032    1058     +26
	svcauth_gss_legacy_init                     1413    1429     +16
	tipc_bcbase_select_primary                   379     392     +13
	nfsd4_exchange_id                           1247    1260     +13
	nfsd4_setclientid_confirm                    782     793     +11
		...
	put_client_renew_locked                      494     480     -14
	ip_set_sockfn_get                            730     716     -14
	geneve_sock_add                              829     813     -16
	nfsd4_sequence_done                          721     703     -18
	nlmclnt_lookup_host                          708     686     -22
	nfsd4_lockt                                 1085    1063     -22
	nfs_get_client                              1077    1050     -27
	tcf_bpf_init                                1106    1076     -30
	nfsd4_encode_fattr                          5997    5930     -67
	Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:59:15 -05:00
David S. Miller
8ebd115bb2 vxlan: Fix uninitialized variable warnings.
drivers/net/vxlan.c: In function ‘vxlan_xmit_one’:
drivers/net/vxlan.c:2141:10: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 16:32:11 -05:00
pravin shelar
0770b53bd2 vxlan: simplify vxlan xmit
Existing vxlan xmit function handles two distinct cases.
1. vxlan net device
2. vxlan lwt device.
By seperating initialization these two cases the egress path
looks better.

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 12:16:13 -05:00
pravin shelar
fee1fad7c7 vxlan: simplify RTF_LOCAL handling.
Avoid code duplicate code for handling RTF_LOCAL routes.

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 12:16:13 -05:00
pravin shelar
655c3de165 vxlan: improve vxlan route lookup checks.
Move route sanity check to respective vxlan[4/6]_get_route functions.
This allows us to perform all sanity checks before caching the dst so
that we can avoid these checks on subsequent packets.
This give move accurate metadata information for packet from
fill_metadata_dst().

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 12:16:13 -05:00
pravin shelar
c46b7897ad vxlan: simplify exception handling
vxlan egress path error handling has became complicated, it
need to handle IPv4 and IPv6 tunnel cases.
Earlier patch removes vlan handling from vxlan_build_skb(), so
vxlan_build_skb does not need to free skb and we can simplify
the xmit path by having single error handling for both type of
tunnels.

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 12:16:13 -05:00
pravin shelar
03dc52a86d vxlan: avoid checking socket multiple times.
Check the vxlan socket in vxlan6_getroute().

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 12:16:13 -05:00
pravin shelar
4a4f86cc7d vxlan: avoid vlan processing in vxlan device.
VxLan device does not have special handling for vlan taging on egress.
Therefore it does not make sense to expose vlan offloading feature.
This patch does not change vxlan functinality.

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 12:16:13 -05:00
David S. Miller
bb598c1b8c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of bug fixes in 'net' overlapping other changes in
'net-next-.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 10:54:36 -05:00
Arnd Bergmann
4053ab1bf9 vxlan: hide unused local variable
A bugfix introduced a harmless warning in v4.9-rc4:

drivers/net/vxlan.c: In function 'vxlan_group_used':
drivers/net/vxlan.c:947:21: error: unused variable 'sock6' [-Werror=unused-variable]

This hides the variable inside of the same #ifdef that is
around its user. The extraneous initialization is removed
at the same time, it was accidentally introduced in the
same commit.

Fixes: c6fcc4fc5f ("vxlan: avoid using stale vxlan socket.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:59:50 -05:00
David S. Miller
27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
pravin shelar
c6fcc4fc5f vxlan: avoid using stale vxlan socket.
When vxlan device is closed vxlan socket is freed. This
operation can race with vxlan-xmit function which
dereferences vxlan socket. Following patch uses RCU
mechanism to avoid this situation.

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 20:56:31 -04:00
Jarod Wilson
91572088e3 net: use core MTU range checking in core net infra
geneve:
- Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
- This one isn't quite as straight-forward as others, could use some
  closer inspection and testing

macvlan:
- set min/max_mtu

tun:
- set min/max_mtu, remove tun_net_change_mtu

vxlan:
- Merge __vxlan_change_mtu back into vxlan_change_mtu
- Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
  change_mtu function
- This one is also not as straight-forward and could use closer inspection
  and testing from vxlan folks

bridge:
- set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
  change_mtu function

openvswitch:
- set min/max_mtu, remove internal_dev_change_mtu
- note: max_mtu wasn't checked previously, it's been set to 65535, which
  is the largest possible size supported

sch_teql:
- set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)

macsec:
- min_mtu = 0, max_mtu = 65535

macvlan:
- min_mtu = 0, max_mtu = 65535

ntb_netdev:
- min_mtu = 0, max_mtu = 65535

veth:
- min_mtu = 68, max_mtu = 65535

8021q:
- min_mtu = 0, max_mtu = 65535

CC: netdev@vger.kernel.org
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Tom Herbert <tom@herbertland.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Jiri Benc <jbenc@redhat.com>
CC: WANG Cong <xiyou.wangcong@gmail.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Pravin B Shelar <pshelar@ovn.org>
CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Patrick McHardy <kaber@trash.net>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Pravin Shelar <pshelar@nicira.com>
CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:09 -04:00
Sabrina Dubroca
fcd91dd449 net: add recursion limit to GRO
Currently, GRO can do unlimited recursion through the gro_receive
handlers.  This was fixed for tunneling protocols by limiting tunnel GRO
to one level with encap_mark, but both VLAN and TEB still have this
problem.  Thus, the kernel is vulnerable to a stack overflow, if we
receive a packet composed entirely of VLAN headers.

This patch adds a recursion counter to the GRO layer to prevent stack
overflow.  When a gro_receive function hits the recursion limit, GRO is
aborted for this skb and it is processed normally.  This recursion
counter is put in the GRO CB, but could be turned into a percpu counter
if we run out of space in the CB.

Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report.

Fixes: CVE-2016-7039
Fixes: 9b174d88c2 ("net: Add Transparent Ethernet Bridging GRO support.")
Fixes: 66e5133f19 ("vlan: Add GRO support for non hardware accelerated vlan")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:32:22 -04:00
David S. Miller
b20b378d49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mediatek/mtk_eth_soc.c
	drivers/net/ethernet/qlogic/qed/qed_dcbx.c
	drivers/net/phy/Kconfig

All conflicts were cases of overlapping commits.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-12 15:52:44 -07:00
Amir Vadai
d817f432c2 net/ip_tunnels: Introduce tunnel_id_to_key32() and key32_to_tunnel_id()
Add utility functions to convert a 32 bits key into a 64 bits tunnel and
vice versa.
These functions will be used instead of cloning code in GRE and VXLAN,
and in tc act_iptunnel which will be introduced in a following patch in
this patchset.

Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 20:53:55 -07:00
Haishuang Yan
5e1e61a33f vxlan: Update tx_errors statistics if vxlan_build_skb return err.
If vxlan_build_skb return err < 0, tx_errors should be also increased.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 13:41:42 -07:00
Jiri Benc
3555621de7 vxlan: fix duplicated and wrong error messages
vxlan_dev_configure outputs error messages before returning, no need to
print again the same mesages in vxlan_newlink. Also, vxlan_dev_configure may
return a particular error code for a different reason than vxlan_newlink
thinks.

Move the remaining error messages into vxlan_dev_configure and let
vxlan_newlink just pass on the error code.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:42:56 -07:00
Jiri Benc
9b4cdd516d vxlan: reject multicast destination without an interface
Currently, kernel accepts configurations such as:

  ip l a type vxlan dstport 4789 id 1 group 239.192.0.1
  ip l a type vxlan dstport 4789 id 1 group ff0e::110

However, neither of those really works. In the IPv4 case, the interface
cannot be brought up ("RTNETLINK answers: No such device"). This is because
multicast join will be rejected without the interface being specified.

In the IPv6 case, multicast wil be joined on the first interface found. This
is not what the user wants as it depends on random factors (order of
interfaces).

Note that it's possible to add a local address but it doesn't solve
anything. For IPv4, it's not considered in the multicast join (thus the same
error as above is returned on ifup). This could be added but it wouldn't
help for IPv6 anyway. For IPv6, we do need the interface.

Just reject a configuration that sets multicast address and does not provide
an interface. Nobody can depend on the previous behavior as it never worked.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:42:56 -07:00
WANG Cong
38f507f1ba vxlan: call peernet2id() in fdb notification
netns id should be already allocated each time we change
netns, that is, in dev_change_net_namespace() (more precisely
in rtnl_fill_ifinfo()). It is safe to just call peernet2id() here.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:39:58 -07:00
Roopa Prabhu
d297653dd6 rtnetlink: fdb dump: optimize by saving last interface markers
fdb dumps spanning multiple skb's currently restart from the first
interface again for every skb. This results in unnecessary
iterations on the already visited interfaces and their fdb
entries. In large scale setups, we have seen this to slow
down fdb dumps considerably. On a system with 30k macs we
see fdb dumps spanning across more than 300 skbs.

To fix the problem, this patch replaces the existing single fdb
marker with three markers: netdev hash entries, netdevs and fdb
index to continue where we left off instead of restarting from the
first netdev. This is consistent with link dumps.

In the process of fixing the performance issue, this patch also
re-implements fix done by
commit 472681d57a ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
(with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return error code instead
of the last fdb index
- use cb->args strictly for dump frag markers and not error codes.
This is consistent with other dump functions.

Below results were taken on a system with 1000 netdevs
and 35085 fdb entries:
before patch:
$time bridge fdb show | wc -l
15065

real    1m11.791s
user    0m0.070s
sys 1m8.395s

(existing code does not return all macs)

after patch:
$time bridge fdb show | wc -l
35085

real    0m2.017s
user    0m0.113s
sys 0m1.942s

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:56:15 -07:00
Zhu Yanjun
2a7a3c5644 vxlan: remove the useless header file protocol.h
This header file is not used in vxlan.c file.

Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:48:41 -07:00
pravin shelar
bbec7802c6 net: vxlan: lwt: Fix vxlan local traffic.
vxlan driver has bypass for local vxlan traffic, but that
depends on information about all VNIs on local system in
vxlan driver. This is not available in case of LWT.
Therefore following patch disable encap bypass for LWT
vxlan traffic.

Fixes: ee122c79d4 ("vxlan: Flow based tunneling").
Reported-by: Jakub Libosvar <jlibosva@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-08 14:16:24 -07:00
pravin shelar
272d96a5ab net: vxlan: lwt: Use source ip address during route lookup.
LWT user can specify destination as well as source ip address
for given tunnel endpoint. But vxlan is ignoring given source
ip address. Following patch uses both ip address to route the
tunnel packet. This consistent with other LWT implementations,
like GENEVE and GRE.

Fixes: ee122c79d4 ("vxlan: Flow based tunneling").
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-08 14:16:23 -07:00
Sabrina Dubroca
e5de25dce9 drivers/net: fixup comments after "Future-proof tunnel offload handlers"
Some comments weren't updated to reflect the renaming of ndo's and the
change of arguments.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:42:11 -07:00
David S. Miller
ee58b57100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:03:36 -04:00
Alexander Duyck
b9adcd69bd vxlan: Add new UDP encapsulation offload type for VXLAN-GPE
The fact is VXLAN with Generic Protocol Extensions cannot be supported by
the same hardware parsers that support VXLAN.  The protocol extensions
allow for things like a Next Protocol field which in turn allows for things
other than Ethernet to be passed over the tunnel.  Most existing parsers
will not know how to interpret this.

To resolve this I am giving VXLAN-GPE its own UDP encapsulation offload
type.  This way hardware that does support GPE can simply add this type to
the switch statement for VXLAN, and if they don't support it then this will
fix any issues where headers might be interpreted incorrectly.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17 20:23:32 -07:00
Alexander Duyck
7c46a640de net: Merge VXLAN and GENEVE push notifiers into a single notifier
This patch merges the notifiers for VXLAN and GENEVE into a single UDP
tunnel notifier.  The idea is that we will want to only have to make one
notifier call to receive the list of ports for VXLAN and GENEVE tunnels
that need to be offloaded.

In addition we add a new set of ndo functions named ndo_udp_tunnel_add and
ndo_udp_tunnel_del that are meant to allow us to track the tunnel meta-data
such as port and address family as tunnels are added and removed.  The
tunnel meta-data is now transported in a structure named udp_tunnel_info
which for now carries the type, address family, and port number.  In the
future this could be updated so that we can include a tuple of values
including things such as the destination IP address and other fields.

I also ended up going with a naming scheme that consisted of using the
prefix udp_tunnel on function names.  I applied this to the notifier and
ndo ops as well so that it hopefully points to the fact that these are
primarily used in the udp_tunnel functions.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17 20:23:29 -07:00
Alexander Duyck
e7b3db5e60 net: Combine GENEVE and VXLAN port notifiers into single functions
This patch merges the GENEVE and VXLAN code so that both functions pass
through a shared code path.  This way we can start the effort of using a
single function on the network device drivers to handle both of these
tunnel types.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17 20:23:29 -07:00
Alexander Duyck
86a9805725 vxlan/geneve: Include udp_tunnel.h in vxlan/geneve.h and fixup includes
This patch makes it so that we add udp_tunnel.h to vxlan.h and geneve.h
header files.  This is useful as I plan to move the generic handlers for
the port offloads into the udp_tunnel header file and leave the vxlan and
geneve headers to be a bit more protocol specific.

I also went through and cleaned out a number of redundant includes that
where in the .h and .c files for these drivers.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17 20:23:29 -07:00
Nicolas Dichtel
cf5da330bb ovs/vxlan: fix rtnl notifications on iface deletion
The function vxlan_dev_create() (only used by ovs) never calls
rtnl_configure_link(). The consequence is that dev->rtnl_link_state is
never set to RTNL_LINK_INITIALIZED.
During the deletion phase, the function rollback_registered_many() sends
a RTM_DELLINK only if dev->rtnl_link_state is set to RTNL_LINK_INITIALIZED.

Note that the function vxlan_dev_create() is moved after the rtnl stuff so
that vxlan_dellink() can be called in this function.

Fixes: dcc38c033b ("openvswitch: Re-add CONFIG_OPENVSWITCH_VXLAN")
CC: Thomas Graf <tgraf@suug.ch>
CC: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 22:21:44 -07:00
Chen Haiquan
ce577668a4 vxlan: Accept user specified MTU value when create new vxlan link
When create a new vxlan link, example:
  ip link add vtap mtu 1440 type vxlan vni 1 dev eth0

The argument "mtu" has no effect, because it is not set to conf->mtu. The
default value is used in vxlan_dev_configure function.

This problem was introduced by commit 0dfbdf4102 (vxlan: Factor out device
configuration).

Fixes: 0dfbdf4102 (vxlan: Factor out device configuration)
Signed-off-by:  Chen Haiquan <oc@yunify.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-31 11:46:00 -07:00
Hannes Frederic Sowa
e5aed006be udp: prevent skbs lingering in tunnel socket queues
In case we find a socket with encapsulation enabled we should call
the encap_recv function even if just a udp header without payload is
available. The callbacks are responsible for correctly verifying and
dropping the packets.

Also, in case the header validation fails for geneve and vxlan we
shouldn't put the skb back into the socket queue, no one will pick
them up there.  Instead we can simply discard them in the respective
encap_recv functions.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 19:56:02 -04:00
Jiri Benc
8be0cfa4d3 vxlan: set mac_header correctly in GPE mode
For VXLAN-GPE, the interface is ARPHRD_NONE, thus we need to reset
mac_header after pulling the outer header.

v2: Put the code to the existing conditional block as suggested by
    Shmulik Ladkani.

Fixes: e1e5314de0 ("vxlan: implement GPE")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16 13:37:10 -04:00
David S. Miller
e800072c18 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
In netdevice.h we removed the structure in net-next that is being
changes in 'net'.  In macsec.c and rtnetlink.c we have overlaps
between fixes in 'net' and the u64 attribute changes in 'net-next'.

The mlx5 conflicts have to do with vxlan support dependencies.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09 15:59:24 -04:00
Jarno Rajahalme
229740c631 udp_offload: Set encapsulation before inner completes.
UDP tunnel segmentation code relies on the inner offsets being set for
an UDP tunnel GSO packet, but the inner *_complete() functions will
set the inner offsets only if 'encapsulation' is set before calling
them.  Currently, udp_gro_complete() sets 'encapsulation' only after
the inner *_complete() functions are done.  This causes the inner
offsets having invalid values after udp_gro_complete() returns, which
in turn will make it impossible to properly segment the packet in case
it needs to be forwarded, which would be visible to the user either as
invalid packets being sent or as packet loss.

This patch fixes this by setting skb's 'encapsulation' in
udp_gro_complete() before calling into the inner complete functions,
and by making each possible UDP tunnel gro_complete() callback set the
inner_mac_header to the beginning of the tunnel payload.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Reviewed-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06 18:25:26 -04:00
Jarno Rajahalme
43b8448cd7 udp_tunnel: Remove redundant udp_tunnel_gro_complete().
The setting of the UDP tunnel GSO type is already performed by
udp[46]_gro_complete().

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06 18:25:26 -04:00
Jiri Benc
65226ef8ea vxlan: fix initialization with custom link parameters
Commit 0c867c9bf8 ("vxlan: move Ethernet initialization to a separate
function") changed initialization order and as an unintended result, when the
user specifies additional link parameters (such as IFLA_ADDRESS) while
creating vxlan interface, those are overwritten by vxlan_ether_setup later.

It's necessary to call ether_setup from withing the ->setup callback. That
way, the correct parameters are set by rtnl_create_link later. This is done
also for VXLAN-GPE, as we don't know the interface type yet at that point,
and changed to the correct interface type later.

Fixes: 0c867c9bf8 ("vxlan: move Ethernet initialization to a separate function")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-29 15:08:56 -04:00
Hannes Frederic Sowa
b7aade1548 vxlan: break dependency with netdev drivers
Currently all drivers depend and autoload the vxlan module because how
vxlan_get_rx_port is linked into them. Remove this dependency:

By using a new event type in the netdevice notifier call chain we proxy
the request from the drivers to flush and resetup the vxlan ports not
directly via function call but by the already existing netdevice
notifier call chain.

I added a separate new event type, NETDEV_OFFLOAD_PUSH_VXLAN, to do so.
We don't need to save those ids, as the event type field is an unsigned
long and using specialized event types for this purpose seemed to be a
more elegant way. This also comes in beneficial if in future we want to
add offloading knobs for vxlan.

Cc: Jesse Gross <jesse@kernel.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 15:35:44 -04:00
Alexander Duyck
aed069df09 ip_tunnel_core: iptunnel_handle_offloads returns int and doesn't free skb
This patch updates the IP tunnel core function iptunnel_handle_offloads so
that we return an int and do not free the skb inside the function.  This
actually allows us to clean up several paths in several tunnels so that we
can free the skb at one point in the path without having to have a
secondary path if we are supporting tunnel offloads.

In addition it should resolve some double-free issues I have found in the
tunnels paths as I believe it is possible for us to end up triggering such
an event in the case of fou or gue.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16 19:09:13 -04:00
Hannes Frederic Sowa
544a773a01 vxlan: reduce usage of synchronize_net in ndo_stop
We only need to do the synchronize_net dance once for both, ipv4 and
ipv6 sockets, thus removing one synchronize_net in case both sockets get
dismantled.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16 18:23:23 -04:00
Hannes Frederic Sowa
0412bd931f vxlan: synchronously and race-free destruction of vxlan sockets
Due to the fact that the udp socket is destructed asynchronously in a
work queue, we have some nondeterministic behavior during shutdown of
vxlan tunnels and creating new ones. Fix this by keeping the destruction
process synchronous in regards to the user space process so IFF_UP can
be reliably set.

udp_tunnel_sock_release destroys vs->sock->sk if reference counter
indicates so. We expect to have the same lifetime of vxlan_sock and
vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
only destruct the whole socket after we can be sure it cannot be found
by searching vxlan_net->sock_list.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jiri Benc <jbenc@redhat.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16 18:23:01 -04:00
Jiri Benc
61618eeac3 vxlan: fix incorrect type
The protocol is 16bit, not 32bit.

Fixes: e1e5314de0 ("vxlan: implement GPE")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:01:56 -04:00
Tom Herbert
5602c48cf8 vxlan: change vxlan to use UDP socket GRO
Adapt vxlan_gro_receive, vxlan_gro_complete to take a socket argument.
Set these functions in tunnel_config.  Don't set udp_offloads any more.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:29 -04:00
Jiri Benc
e1e5314de0 vxlan: implement GPE
Implement VXLAN-GPE. Only COLLECT_METADATA is supported for now (it is
possible to support static configuration, too, if there is demand for it).

The GPE header parsing has to be moved before iptunnel_pull_header, as we
need to know the protocol.

v2: Removed what was called "L2 mode" in v1 of the patchset. Only "L3 mode"
    (now called "raw mode") is added by this patch. This mode does not allow
    Ethernet header to be encapsulated in VXLAN-GPE when using ip route to
    specify the encapsulation, IP header is encapsulated instead. The patch
    does support Ethernet to be encapsulated, though, using ETH_P_TEB in
    skb->protocol. This will be utilized by other COLLECT_METADATA users
    (openvswitch in particular).

    If there is ever demand for Ethernet encapsulation with VXLAN-GPE using
    ip route, it's easy to add a new flag switching the interface to
    "Ethernet mode" (called "L2 mode" in v1 of this patchset). For now,
    leave this out, it seems we don't need it.

    Disallowed more flag combinations, especially RCO with GPE.
    Added comment explaining that GBP and GPE cannot be set together.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06 16:50:32 -04:00
Jiri Benc
47e5d1b063 vxlan: move fdb code to common location in vxlan_xmit
Handle VXLAN_F_COLLECT_METADATA before VXLAN_F_PROXY. The latter does not
make sense with the former, as it needs populated fdb which does not happen
in metadata mode.

After this cleanup, the fdb code in vxlan_xmit is moved to a common location
and can be later skipped for VXLAN-GPE which does not necessarily carry
inner Ethernet header.

v2: changed commit description to not reference L3 mode

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06 16:50:32 -04:00
Jiri Benc
0c867c9bf8 vxlan: move Ethernet initialization to a separate function
This will allow to initialize vxlan in ARPHRD_NONE mode based on the passed
rtnl attributes.

v2: renamed "l2mode" to "ether".

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06 16:50:31 -04:00
Jiri Benc
7d34fa75d3 vxlan: fix too large pskb_may_pull with remote checksum
vxlan_remcsum is called after iptunnel_pull_header and thus the skb has
vxlan header already pulled. Don't include vxlan header again in the
calculation.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-21 13:32:19 -04:00
Daniel Borkmann
eaa93bf4c6 vxlan: fix populating tclass in vxlan6_get_route
Jiri mentioned that flowi6_tos of struct flowi6 is never used/read
anywhere. In fact, rest of the kernel uses the flowi6's flowlabel,
where the traffic class _and_ the flowlabel (aka flowinfo) is encoded.

For example, for policy routing, fib6_rule_match() uses ip6_tclass()
that is applied on the flowlabel member for matching on tclass. Similar
fix is needed for geneve, where flowi6_tos is set as well. Installing
a v6 blackhole rule that f.e. matches on tos is now working with vxlan.

Fixes: 1400615d64 ("vxlan: allow setting ipv6 traffic class")
Reported-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20 13:44:34 -04:00
Alexander Duyck
c194cf93c1 gro: Defer clearing of flush bit in tunnel paths
This patch updates the GRO handlers for GRE, VXLAN, GENEVE, and FOU so that
we do not clear the flush bit until after we have called the next level GRO
handler.  Previously this was being cleared before parsing through the list
of frames, however this resulted in several paths where either the bit
needed to be reset but wasn't as in the case of FOU, or cases where it was
being set as in GENEVE.  By just deferring the clearing of the bit until
after the next level protocol has been parsed we can avoid any unnecessary
bit twiddling and avoid bugs.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13 15:01:00 -04:00
Daniel Borkmann
e7f70af111 vxlan: support setting IPv6 flow label
This work adds support for setting the IPv6 flow label for vxlan per
device and through collect metadata (ip_tunnel_key) frontends. The
vxlan dst cache does not need any special considerations here, for
the cases where caches can be used, the label is static per cache.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-11 15:14:26 -05:00
Daniel Borkmann
134611446d ip_tunnel: add support for setting flow label via collect metadata
This patch extends udp_tunnel6_xmit_skb() to pass in the IPv6 flow label
from call sites. Currently, there's no such option and it's always set to
zero when writing ip6_flow_hdr(). Add a label member to ip_tunnel_key, so
that flow-based tunnels via collect metadata frontends can make use of it.
vxlan and geneve will be converted to add flow label support separately.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-11 15:14:26 -05:00
Daniel Borkmann
1400615d64 vxlan: allow setting ipv6 traffic class
We can already do that for IPv4, but IPv6 support was missing. Add
it for vxlan, so it can be used with collect metadata frontends.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 13:58:47 -05:00
Daniel Borkmann
db3c6139e6 bpf, vxlan, geneve, gre: fix usage of dst_cache on xmit
The assumptions from commit 0c1d70af92 ("net: use dst_cache for vxlan
device"), 468dfffcd7 ("geneve: add dst caching support") and 3c1cb4d260
("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
when ip_tunnel_info is used is unfortunately not always valid as assumed.

While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
with different remote dsts, tos, etc, therefore they cannot be assumed as
packet independent.

Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
would reuse the same entry that was first created on the initial route
lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
a different one.

Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
in vxlan needs to be handeled differently in this context as it is currently
inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable()
helper is added for the three tunnel cases, which checks if we can use dst
cache.

Fixes: 0c1d70af92 ("net: use dst_cache for vxlan device")
Fixes: 468dfffcd7 ("geneve: add dst caching support")
Fixes: 3c1cb4d260 ("net/ipv4: add dst cache support for gre lwtunnels")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 13:58:47 -05:00
David S. Miller
810813c47a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 12:34:12 -05:00
Zhang Shengju
6297b91c7f vxlan: use reset to set header pointers
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add operation
in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-04 22:45:13 -05:00
Daniel Borkmann
4024fcf705 vxlan: fix missing options_len update on RX with collect metadata
When signalling to metadata consumers that the metadata_dst entry
carries additional GBP extension data for vxlan (TUNNEL_VXLAN_OPT),
the dst's vxlan_metadata information is populated, but options_len
is left to zero. F.e. in ovs, ovs_flow_key_extract() checks for
options_len before extracting the data through ip_tunnel_info_opts_get().

Geneve uses ip_tunnel_info_opts_set() helper in receive path, which
sets options_len internally, vxlan however uses ip_tunnel_info_opts(),
so when filling vxlan_metadata, we do need to update options_len.

Fixes: 4c22279848 ("ip-tunnel: Use API to access tunnel metadata options.")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 17:10:31 -05:00
MINOURA Makoto / 箕浦 真
472681d57a net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump.
When the send skbuff reaches the end, nlmsg_put and friends returns
-EMSGSIZE but it is silently thrown away in ndo_fdb_dump. It is called
within a for_each_netdev loop and the first fdb entry of a following
netdev could fit in the remaining skbuff.  This breaks the mechanism
of cb->args[0] and idx to keep track of the entries that are already
dumped, which results missing entries in bridge fdb show command.

Signed-off-by: Minoura Makoto <minoura@valinux.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-26 15:04:02 -05:00
Jiri Benc
10a5af238c vxlan: simplify metadata_dst usage in vxlan_rcv
Now when the packet is scrubbed early, the metadata_dst can be assigned to
the skb as soon as it is allocated. This simplifies the error cleanup path,
as the dst will be freed by kfree_skb. It is also not necessary to pass it
as a parameter to functions anymore.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 15:17:12 -05:00
Jiri Benc
f2d1968ec8 vxlan: consolidate rx handling to a single function
Now when both vxlan_udp_encap_recv and vxlan_rcv are much shorter, combine
them into a single function.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 15:17:12 -05:00
Jiri Benc
760c68054e vxlan: move ECN decapsulation to a separate function
It simplifies the vxlan_rcv function.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 15:17:12 -05:00
Jiri Benc
1ab016e237 vxlan: move inner L2 header processing to a separate function
This code will be different for VXLAN-GPE, so move it to a separate
function. It will also make the rx path less spaghetti-like.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 15:17:11 -05:00
Jiri Benc
64f87d3616 vxlan: consolidate GBP handling even more
Now when the packet is scrubbed early, skb->mark can be set in the GBP
handling code.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 15:17:11 -05:00
David S. Miller
b633353115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/bcm7xxx.c
	drivers/net/phy/marvell.c
	drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 00:09:14 -05:00
Alexander Duyck
6ceb31ca5f VXLAN: Support outer IPv4 Tx checksums by default
This change makes it so that if UDP CSUM is not specified we will default
to enabling it.  The main motivation behind this is the fact that with the
use of outer checksum we can greatly improve the performance for VXLAN
tunnels on devices that don't know how to parse tunnel headers.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21 22:05:50 -05:00
Jiri Benc
f468a729a2 vxlan: do not use fdb in metadata mode
In metadata mode, the vxlan interface is not supposed to use the fdb control
plane but an external one (openvswitch or static routes). With the current
code, packets may leak into the fdb handling code which usually causes them
to be dropped anyway but may have strange side effects.

Just drop the packets directly when in metadata mode if the destination data
are not correctly provided on egress.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 15:01:14 -05:00
Jiri Benc
82a0f6b4ab vxlan: clear IFF_TX_SKB_SHARING
ether_setup sets IFF_TX_SKB_SHARING but this is not supported by vxlan
as it modifies the skb on xmit.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:43:47 -05:00
Jiri Benc
7f290c9435 iptunnel: scrub packet in iptunnel_pull_header
Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call
skb_scrub_packet directly instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:34:54 -05:00
Jiri Benc
c9e78efb6f vxlan: move vxlan device lookup before iptunnel_pull_header
This is in preparation for iptunnel_pull_header calling skb_scrub_packet.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:34:54 -05:00
Jiri Benc
07dabf20d9 vxlan: tun_id is 64bit, not 32bit
The tun_id field in struct ip_tunnel_key is __be64, not __be32. We need to
convert the vni to tun_id correctly.

Fixes: 54bfd872bf ("vxlan: keep flags and vni in network byte order")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 13:55:24 -05:00
Jiri Benc
b9167b2e77 vxlan: treat vni in metadata based tunnels consistently
For metadata based tunnels, VNI is ignored when doing vxlan device lookups
(because such tunnel receives all VNIs). However, this was not honored by
vxlan_xmit_one when doing encapsulation bypass. Move the check for metadata
based tunnel to the common place where it belongs.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 23:52:12 -05:00
Jiri Benc
288b01c8c4 vxlan: clean up rx error path
When there are unrecognized flags present in the vxlan header, it doesn't
make much sense to return the packet for further UDP processing, especially
considering that for other invalid flag combinations we drop the packet
because of previous checks.

This means we return positive value only at the beginning of the function
where tun_dst is not yet allocated. This allows us to get rid of the
bad_flags and error jump labels.

When we're dropping packet, we need to free tun_dst now.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 23:52:12 -05:00
Jiri Benc
f14ecebb3a vxlan: clean up extension handling on rx
Bring the extension handling to a single place and move the actual handling
logic out of vxlan_udp_encap_recv as much as possible.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 23:52:12 -05:00
Jiri Benc
3288af0892 vxlan: move GBP header parsing to a separate function
To make vxlan_udp_encap_recv shorter and more comprehensible.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 23:52:12 -05:00
Jiri Benc
be5cfeab8f vxlan: simplify vxlan_remcsum
Part of the parameters is not needed. Simplify the caller of this function
in preparation of making vxlan rx more comprehensible.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 23:52:11 -05:00
Jiri Benc
54bfd872bf vxlan: keep flags and vni in network byte order
Prevent repeated conversions from and to network order in the fast path.

To achieve this, define all flag constants in big endian order and store VNI
as __be32. To prevent confusion between the actual VNI value and the VNI
field from the header (which contains additional reserved byte), strictly
distinguish between "vni" and "vni_field".

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 23:52:11 -05:00
Jiri Benc
d4ac05ff36 vxlan: introduce vxlan_hdr
Currently, pointer to the vxlan header is kept in a local variable. It has
to be reloaded whenever the pskb pull operations are performed which usually
happens somewhere deep in called functions.

Create a vxlan_hdr function and use it to reference the vxlan header
instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 23:52:11 -05:00
Paolo Abeni
d71785ffc7 net: add dst_cache to ovs vxlan lwtunnel
In case of UDP traffic with datagram length
below MTU this give about 2% performance increase
when tunneling over ipv4 and about 60% when tunneling
over ipv6

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:21:48 -05:00
Paolo Abeni
0c1d70af92 net: use dst_cache for vxlan device
In case of UDP traffic with datagram length
below MTU this give about 3% performance increase
when tunneling over ipv4 and about 70% when
tunneling over ipv6.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:21:48 -05:00
Edward Cree
6fa79666e2 net: ip_tunnel: remove 'csum_help' argument to iptunnel_handle_offloads
All users now pass false, so we can remove it, and remove the code that
 was conditional upon it.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:52:16 -05:00
Edward Cree
b57085019d net: vxlan: enable local checksum offload
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:52:15 -05:00
stephen hemminger
40d29af057 vxlan: udp_tunnel duplicate include net/udp_tunnel.h
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 09:45:24 -05:00
David Wragg
7e059158d5 vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices
Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could
transmit vxlan packets of any size, constrained only by the ability to
send out the resulting packets.  4.3 introduced netdevs corresponding
to tunnel vports.  These netdevs have an MTU, which limits the size of
a packet that can be successfully encapsulated.  The default MTU
values are low (1500 or less), which is awkwardly small in the context
of physical networks supporting jumbo frames, and leads to a
conspicuous change in behaviour for userspace.

Instead, set the MTU on openvswitch-created netdevs to be the relevant
maximum (i.e. the maximum IP packet size minus any relevant overhead),
effectively restoring the behaviour prior to 4.3.

Signed-off-by: David Wragg <david@weave.works>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-10 05:50:03 -05:00
David Wragg
72564b59ff vxlan: Relax MTU constraints
Allow the MTU of vxlan devices without an underlying device to be set
to larger values (up to a maximum based on IP packet limits and vxlan
overhead).

Previously, their MTUs could not be set to higher than the
conventional ethernet value of 1500.  This is a very arbitrary value
in the context of vxlan, and prevented vxlan devices from being able
to take advantage of jumbo frames etc.

The default MTU remains 1500, for compatibility.

Signed-off-by: David Wragg <david@weave.works>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-10 05:50:03 -05:00
Jiri Benc
f491e56dba vxlan: consolidate vxlan_xmit_skb and vxlan6_xmit_skb
There's a lot of code duplication. Factor out the duplicate code to a new
function shared between IPv4 and IPv6 xmit path.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 13:51:00 -05:00
Jiri Benc
b4ed5cad24 vxlan: consolidate csum flag handling
The flag for tx checksumming for tunneling over IPv4 and IPv6 is different.
Decide whether to do tx checksumming in vxlan_xmit_one and pass it on as
a separate flag. This will allow for tx path consolidation in the next
patch.

Unfortunately, gcc is not clever enough to see that udp_sum is always
initialized and gives an uninitialized variable warning. Set it to false to
silence the warning.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 13:51:00 -05:00
Jiri Benc
1a8496ba40 vxlan: consolidate output route calculation
The code for output route lookup is duplicated for ndo_start_xmit and
ndo_fill_metadata_dst. Move it to a common function.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 13:51:00 -05:00
Li RongQing
7256eac13b vxlan: fix a out of bounds access in __vxlan_find_mac
The size of all_zeros_mac is 6 byte, but eth_hash() will access the
8 byte, and KASan reported the below bug:

[ 8596.479031] BUG: KASan: out of bounds access in __vxlan_find_mac+0x24/0x100 at addr ffffffff841514c0
[ 8596.487647] Read of size 8 by task ip/52820
[ 8596.490818] Address belongs to variable all_zeros_mac+0x0/0x40
[ 8596.496051] CPU: 0 PID: 52820 Comm: ip Tainted: G WC 4.1.15 #1
[ 8596.503520] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 02/10/2014
[ 8596.509365] ffffffff841514c0 ffff88007450f0b8 ffffffff822fa5e1 0000000000000032
[ 8596.516112] ffff88007450f150 ffff88007450f138 ffffffff812dd58c ffff88007450f1d8
[ 8596.522856] ffffffff81113b80 0000000000000282 0000000000000001 ffffffff8101ee4d
[ 8596.529599] Call Trace:
[ 8596.530858] [<ffffffff822fa5e1>] dump_stack+0x4f/0x7b
[ 8596.535080] [<ffffffff812dd58c>] kasan_report_error+0x3bc/0x3f0
[ 8596.540258] [<ffffffff81113b80>] ? __lock_acquire+0x90/0x2140
[ 8596.545245] [<ffffffff8101ee4d>] ? save_stack_trace+0x2d/0x80
[ 8596.550234] [<ffffffff812dda70>] kasan_report+0x40/0x50
[ 8596.554647] [<ffffffff81b211e4>] ? __vxlan_find_mac+0x24/0x100
[ 8596.559729] [<ffffffff812dc399>] __asan_load8+0x69/0xa0
[ 8596.564141] [<ffffffff81b211e4>] __vxlan_find_mac+0x24/0x100
[ 8596.569033] [<ffffffff81b2683d>] vxlan_fdb_create+0x9d/0x570

it can be fixed by enlarging the all_zeros_mac to 8 byte, although it is
harmless; eth_hash() will be called in other place with the memory which
is larger and equal to 8 byte.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29 20:11:16 -08:00
Jesse Gross
35e2d1152b tunnels: Allow IPv6 UDP checksums to be correctly controlled.
When configuring checksums on UDP tunnels, the flags are different
for IPv4 vs. IPv6 (and reversed). However, when lightweight tunnels
are enabled the flags used are always the IPv4 versions, which are
ignored in the IPv6 code paths. This uses the correct IPv6 flags, so
checksums can be controlled appropriately.

Fixes: a725e514 ("vxlan: metadata based tunneling for IPv6")
Fixes: abe492b4 ("geneve: UDP checksum configuration via netlink")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-21 11:10:40 -08:00
David S. Miller
9d367eddf3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/bonding/bond_main.c
	drivers/net/ethernet/mellanox/mlxsw/spectrum.h
	drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c

The bond_main.c and mellanox switch conflicts were cases of
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 23:55:43 -05:00
Hannes Frederic Sowa
787d7ac308 udp: restrict offloads to one namespace
udp tunnel offloads tend to aggregate datagrams based on inner
headers. gro engine gets notified by tunnel implementations about
possible offloads. The match is solely based on the port number.

Imagine a tunnel bound to port 53, the offloading will look into all
DNS packets and tries to aggregate them based on the inner data found
within. This could lead to data corruption and malformed DNS packets.

While this patch minimizes the problem and helps an administrator to find
the issue by querying ip tunnel/fou, a better way would be to match on
the specific destination ip address so if a user space socket is bound
to the same address it will conflict.

Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:28:24 -05:00
Nicolas Dichtel
07b9b37c22 vxlan: fix test which detect duplicate vxlan iface
When a vxlan interface is created, the driver checks that there is not
another vxlan interface with the same properties. To do this, it checks
the existing vxlan udp socket. Since commit 1c51a9159d, the creation of
the vxlan socket is done only when the interface is set up, thus it breaks
that test.

Example:
$ ip l a vxlan10 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
$ ip l a vxlan11 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
$ ip -br l | grep vxlan
vxlan10          DOWN           f2:55:1c:6a:fb:00 <BROADCAST,MULTICAST>
vxlan11          DOWN           7a:cb:b9:38:59:0d <BROADCAST,MULTICAST>

Instead of checking sockets, let's loop over the vxlan iface list.

Fixes: 1c51a9159d ("vxlan: fix race caused by dropping rtnl_unlock")
Reported-by: Thomas Faivre <thomas.faivre@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-09 21:04:20 -05:00
Pravin B Shelar
039f50629b ip_tunnel: Move stats update to iptunnel_xmit()
By moving stats update into iptunnel_xmit(), we can simplify
iptunnel_xmit() usage. With this change there is no need to
call another function (iptunnel_xmit_stats()) to update stats
in tunnel xmit code path.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-25 23:32:23 -05:00
Jiri Benc
ce212d0f6f vxlan: interpret IP headers for ECN correctly
When looking for outer IP header, use the actual socket address family, not
the address family of the default destination which is not set for metadata
based interfaces (and doesn't have to match the address family of the
received packet even if it was set).

Fix also the misleading comment.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 17:06:33 -05:00
Jiri Benc
239e944ff5 vxlan: support ndo_fill_metadata_dst also for IPv6
Fill the metadata correctly even when tunneling over IPv6. Also, check that
the provided metadata is of an address family that is supported by the
tunnel.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 16:31:25 -05:00
Jiri Benc
e5d4b29fe8 vxlan: move IPv6 outpute route calculation to a function
Will be used also for ndo_fill_metadata_dst.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 16:31:24 -05:00
David S. Miller
ba3e2084f2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv6/xfrm6_output.c
	net/openvswitch/flow_netlink.c
	net/openvswitch/vport-gre.c
	net/openvswitch/vport-vxlan.c
	net/openvswitch/vport.c
	net/openvswitch/vport.h

The openvswitch conflicts were overlapping changes.  One was
the egress tunnel info fix in 'net' and the other was the
vport ->send() op simplification in 'net-next'.

The xfrm6_output.c conflicts was also a simplification
overlapping a bug fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:54:12 -07:00
Pravin B Shelar
fc4099f172 openvswitch: Fix egress tunnel info.
While transitioning to netdev based vport we broke OVS
feature which allows user to retrieve tunnel packet egress
information for lwtunnel devices.  Following patch fixes it
by introducing ndo operation to get the tunnel egress info.
Same ndo operation can be used for lwtunnel devices and compat
ovs-tnl-vport devices. So after adding such device operation
we can remove similar operation from ovs-vport.

Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device").
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 19:39:25 -07:00
David S. Miller
26440c835f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/asix_common.c
	net/ipv4/inet_connection_sock.c
	net/switchdev/switchdev.c

In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.

The other two conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-20 06:08:27 -07:00
Jesse Gross
e277de5f3f tunnels: Don't require remote endpoint or ID during creation.
Before lightweight tunnels existed, it really didn't make sense to
create a tunnel that was not fully specified, such as without a
destination IP address - the resulting packets would go nowhere.
However, with lightweight tunnels, the opposite is true - it doesn't
make sense to require this information when it will be provided later
on by the route. This loosens the requirements for this information.

An alternative would be to allow the relaxed version only when
COLLECT_METADATA is enabled. However, since there are several
variations on this theme (such as NBMA tunnels in GRE), just dropping
the restrictions seems the most consistent across tunnels and with
the existing configuration.

CC: John Linville <linville@tuxdriver.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:44:10 -07:00
Jiri Benc
b1be00a6c3 vxlan: support both IPv4 and IPv6 sockets in a single vxlan device
For metadata based vxlan interface, open both IPv4 and IPv6 socket. This is
much more user friendly: it's not necessary to create two vxlan interfaces
and pay attention to using the right one in routing rules.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-26 22:40:55 -07:00
Jiri Benc
205f356d16 vxlan: make vxlan_sock_add and vxlan_sock_release complementary
Make vxlan_sock_add both alloc the socket and attach it to vxlan_dev. Let
vxlan_sock_release accept vxlan_dev as its parameter instead of vxlan_sock.

This makes vxlan_sock_add and vxlan_sock release complementary. It reduces
code duplication in the next patch.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-26 22:40:55 -07:00
Jiri Benc
057ba29bbe vxlan: reject IPv6 addresses if IPv6 is not configured
When IPv6 address is set without IPv6 configured, the vxlan socket is mostly
treated as an IPv4 one but various lookus in fdb etc. still take the
AF_INET6 into account. This creates incosistencies with weird consequences.

Just reject IPv6 addresses in such case.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 22:32:15 -07:00
Jiri Benc
9dc2ad1008 vxlan: set needed headroom correctly
vxlan_setup is called when allocating the net_device, i.e. way before
vxlan_newlink (or vxlan_dev_configure) is called. This means
vxlan->default_dst is actually unset in vxlan_setup and the condition that
sets needed_headroom always takes the else branch.

Set the needed_headrom at the point when we have the information about
the address family available.

Fixes: e4c7ed4153 ("vxlan: add ipv6 support")
Fixes: 2853af6a2e ("vxlan: use dev->needed_headroom instead of dev->hard_header_len")
CC: Cong Wang <cwang@twopensource.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 22:32:15 -07:00
Geert Uytterhoeven
0f1b7354e0 vxlan: Refactor vxlan_udp_encap_recv() to kill compiler warning
drivers/net/vxlan.c: In function ‘vxlan_udp_encap_recv’:
drivers/net/vxlan.c:1226: warning: ‘info’ may be used uninitialized in this function

While this warning is a false positive, it can be killed easily by
getting rid of the pointer intermediary and referring directly to the
ip_tunnel_info structure.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-06 19:48:03 -07:00
Pravin B Shelar
4c22279848 ip-tunnel: Use API to access tunnel metadata options.
Currently tun-info options pointer is used in few cases to
pass options around. But tunnel options can be accessed using
ip_tunnel_info_opts() API without using the pointer. Following
patch removes the redundant pointer and consistently make use
of API.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:28:56 -07:00
Jiri Benc
a43a9ef6a2 vxlan: do not receive IPv4 packets on IPv6 socket
By default (subject to the sysctl settings), IPv6 sockets listen also for
IPv4 traffic. Vxlan is not prepared for that and expects IPv6 header in
packets received through an IPv6 socket.

In addition, it's currently not possible to have both IPv4 and IPv6 vxlan
tunnel on the same port (unless bindv6only sysctl is enabled), as it's not
possible to create and bind both IPv4 and IPv6 vxlan interfaces and there's
no way to specify both IPv4 and IPv6 remote/group IP addresses.

Set IPV6_V6ONLY on vxlan sockets to fix both of these issues. This is not
done globally in udp_tunnel, as l2tp and tipc seems to work okay when
receiving IPv4 packets on IPv6 socket and people may rely on this behavior.
The other tunnels (geneve and fou) do not support IPv6.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:07:54 -07:00
Jiri Benc
7f9562a1f4 ip_tunnels: record IP version in tunnel info
There's currently nothing preventing directing packets with IPv6
encapsulation data to IPv4 tunnels (and vice versa). If this happens,
IPv6 addresses are incorrectly interpreted as IPv4 ones.

Track whether the given ip_tunnel_key contains IPv4 or IPv6 data. Store this
in ip_tunnel_info. Reject packets at appropriate places if they are supposed
to be encapsulated into an incompatible protocol.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:07:54 -07:00
Jiri Benc
46fa062ad6 ip_tunnels: convert the mode field of ip_tunnel_info to flags
The mode field holds a single bit of information only (whether the
ip_tunnel_info struct is for rx or tx). Change the mode field to bit flags.
This allows more mode flags to be added.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:07:54 -07:00
David S. Miller
0d36938bb8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-08-27 21:45:31 -07:00
Pravin B Shelar
c29a70d2ca tunnel: introduce udp_tun_rx_dst()
Introduce function udp_tun_rx_dst() to initialize tunnel dst on
receive path.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 15:42:47 -07:00
Marcelo Ricardo Leitner
bef0057b7b vxlan: re-ignore EADDRINUSE from igmp_join
Before 56ef9c909b40[1] it used to ignore all errors from igmp_join().
That commit enhanced that and made it error out whatever error happened
with igmp_join(), but that's not good because when using multicast
groups vxlan will try to join it multiple times if the socket is reused
and then the 2nd and further attempts will fail with EADDRINUSE.

As we don't track to which groups the socket is already subscribed, it's
okay to just ignore that error.

Fixes: 56ef9c909b ("vxlan: Move socket initialization to within rtnl scope")
Reported-by: John Nielsen <lists@jnielsen.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:24:35 -07:00
Tom Herbert
58ce31cca1 vxlan: GRO support at tunnel layer
Add calls to gro_cells infrastructure to do GRO when receiving on a tunnel.

Testing:

Ran 200 netperf TCP_STREAM instance

  - With fix (GRO enabled on VXLAN interface)

    Verify GRO is happening.

    9084 MBps tput
    3.44% CPU utilization

  - Without fix (GRO disabled on VXLAN interface)

    Verified no GRO is happening.

    9084 MBps tput
    5.54% CPU utilization

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 15:59:56 -07:00
Tom Herbert
b7fe10e5eb gro: Fix remcsum offload to deal with frags in GRO
The remote checksum offload GRO did not consider the case that frag0
might be in use. This patch fixes that by accessing headers using the
skb_gro functions and not saving offsets relative to skb->head.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 15:59:56 -07:00
Jiri Benc
a725e514db vxlan: metadata based tunneling for IPv6
Support metadata based (formerly flow based) tunneling also for IPv6.
This complements commit ee122c79d4 ("vxlan: Flow based tunneling").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:37 -07:00
Jiri Benc
6f264e47d4 vxlan: do not shadow flags variable
The 'flags' variable is already defined in the outer scope.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:37 -07:00
Jiri Benc
705cc62f67 vxlan: provide access function for vxlan socket address family
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:37 -07:00
Jiri Benc
61adedf3e3 route: move lwtunnel state to dst_entry
Currently, the lwtunnel state resides in per-protocol data. This is
a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa).
The xmit function of the tunnel does not know whether the packet has been
routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the
lwtstate data to dst_entry makes such inter-protocol tunneling possible.

As a bonus, this brings a nice diffstat.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:36 -07:00
Jiri Benc
7c383fb225 ip_tunnels: use tos and ttl fields also for IPv6
Rename the ipv4_tos and ipv4_ttl fields to just 'tos' and 'ttl', as they'll
be used with IPv6 tunnels, too.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:36 -07:00
Jiri Benc
c1ea5d672a ip_tunnels: add IPv6 addresses to ip_tunnel_key
Add the IPv6 addresses as an union with IPv4 ones. When using IPv4, the
newly introduced padding after the IPv4 addresses needs to be zeroed out.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:36 -07:00
Phil Sutter
22dba393a3 net: vxlan: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:05 -07:00
Atzm Watanabe
07a51cd379 vxlan: fix fdb_dump index calculation
When too many remotes are bound to an FDB entry, index may not be increased.
This problem will be caused on the large scale environment that is based on
the unicast default destination, for instance.

Signed-off-by: Atzm Watanabe <atzm@iij.ad.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 21:15:18 -07:00
Alexei Starovoitov
da8b43c0e1 vxlan: combine VXLAN_FLOWBASED into VXLAN_COLLECT_METADATA
IFLA_VXLAN_FLOWBASED is useless without IFLA_VXLAN_COLLECT_METADATA,
so combine them into single IFLA_VXLAN_COLLECT_METADATA flag.
'flowbased' doesn't convey real meaning of the vxlan tunnel mode.
This mode can be used by routing, tc+bpf and ovs.
Only ovs is strictly flow based, so 'collect metadata' is a better
name for this tunnel mode.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07 11:46:34 -07:00
Alexei Starovoitov
f8a9b1bc1b vxlan: expose COLLECT_METADATA flag to user space
Two vxlan driver flags FLOWBASED and COLLECT_METADATA need to be set to
make use of its new flow mode. The former already exposed. Expose the latter.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:24:24 -07:00
Roopa Prabhu
343d60aada ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argument
This patch adds net argument to ipv6_stub_impl.ipv6_dst_lookup
for use cases where sk is not available (like mpls).
sk appears to be needed to get the namespace 'net' and is optional
otherwise. This patch series changes ipv6_stub_impl.ipv6_dst_lookup
to take net argument. sk remains optional.

All callers of ipv6_stub_impl.ipv6_dst_lookup have been modified
to pass net. I have modified them to use already available
'net' in the scope of the call. I can change them to
sock_net(sk) to avoid any unintended change in behaviour if sock
namespace is different. They dont seem to be from code inspection.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:21:30 -07:00
Thomas Graf
6b6948dda7 vxlan: Use proper endian type for vni in vxlan[6]_xmit_skb
Silences the following sparse warnings:
drivers/net/vxlan.c:1818:21: warning: incorrect type in assignment (different base types)
drivers/net/vxlan.c:1818:21:    expected restricted __be32 [usertype] vx_vni
drivers/net/vxlan.c:1818:21:    got unsigned int [unsigned] [usertype] vni
drivers/net/vxlan.c:2014:58: warning: incorrect type in argument 11 (different base types)
drivers/net/vxlan.c:2014:58:    expected unsigned int [unsigned] [usertype] vni
drivers/net/vxlan.c:2014:58:    got restricted __be32 [usertype] <noident>

Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 16:33:26 -07:00
Thomas Graf
614732eaa1 openvswitch: Use regular VXLAN net_device device
This gets rid of all OVS specific VXLAN code in the receive and
transmit path by using a VXLAN net_device to represent the vport.
Only a small shim layer remains which takes care of handling the
VXLAN specific OVS Netlink configuration.

Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
since they are no longer needed.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:07 -07:00
Thomas Graf
0dfbdf4102 vxlan: Factor out device configuration
This factors out the device configuration out of the RTNL newlink
API which allows for in-kernel creation of VXLAN net_devices.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Thomas Graf
e7030878fc fib: Add fib rule match on tunnel id
This add the ability to select a routing table based on the tunnel
id which allows to maintain separate routing tables for each virtual
tunnel network.

ip rule add from all tunnel-id 100 lookup 100
ip rule add from all tunnel-id 200 lookup 200

A new static key controls the collection of metadata at tunnel level
upon demand.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Thomas Graf
3093fbe7ff route: Per route IP tunnel metadata via lightweight tunnel
This introduces a new IP tunnel lightweight tunnel type which allows
to specify IP tunnel instructions per route. Only IPv4 is supported
at this point.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Thomas Graf
ee122c79d4 vxlan: Flow based tunneling
Allows putting a VXLAN device into a new flow-based mode in which
skbs with a ip_tunnel_info dst metadata attached will be encapsulated
according to the instructions stored in there with the VXLAN device
defaults taken into consideration.

Similar on the receive side, if the VXLAN_F_COLLECT_METADATA flag is
set, the packet processing will populate a ip_tunnel_info struct for
each packet received and attach it to the skb using the new metadata
dst.  The metadata structure will contain the outer header and tunnel
header fields which have been stripped off. Layers further up in the
stack such as routing, tc or netfitler can later match on these fields
and perform forwarding. It is the responsibility of upper layers to
ensure that the flag is set if the metadata is needed. The flag limits
the additional cost of metadata collecting based on demand.

This prepares the VXLAN device to be steered by the routing and other
subsystems which allows to support encapsulation for a large number
of tunnel endpoints and tunnel ids through a single net_device which
improves the scalability.

It also allows for OVS to leverage this mode which in turn allows for
the removal of the OVS specific VXLAN code.

Because the skb is currently scrubed in vxlan_rcv(), the attachment of
the new dst metadata is postponed until after scrubing which requires
the temporary addition of a new member to vxlan_metadata. This member
is removed again in a later commit after the indirect VXLAN receive API
has been removed.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Sowmini Varadhan
7177a3b037 net/vxlan: Fix kernel unaligned access in __vxlan_find_mac
__vxlan_find_mac invokes ether_addr_equal on the eth_addr field,
which triggers unaligned access messages, so rearrange vxlan_fdb
to avoid this in the most non-intrusive way.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:12:34 -07:00
Sorin Dumitru
14e1d0fa97 vxlan: release lock after each bucket in vxlan_cleanup
We're seeing some softlockups from this function when there
are a lot fdb entries on a vxlan device. Taking the lock for
each bucket instead of the whole table is enough to fix that.

Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-27 13:33:21 -04:00
David S. Miller
36583eb54d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/cadence/macb.c
	drivers/net/phy/phy.c
	include/linux/skbuff.h
	net/ipv4/tcp.c
	net/switchdev/switchdev.c

Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.

phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.

tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.

macb.c involved the addition of two zyncq device entries.

skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23 01:22:35 -04:00
John W. Linville
13c3ed6a92 vxlan: correct typo in call to unregister_netdevice_queue
By inspection, this appears to be a typo.  The gating comparison
involves vxlan->dev rather than dev.  In fact, dev is the iterator in
the preceding loop above but it is actually constant in the 2nd loop.

Use of dev seems to be a bad cut-n-paste from the prior call to
unregister_netdevice_queue.  Change dev to vxlan->dev, since that is
what is actually being checked.

Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 16:57:09 -04:00
Nicolas Dichtel
7a0877d4b4 netns: rename peernet2id() to peernet2id_alloc()
In a following commit, a new function will be introduced to only lookup for
a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the
existing function is renamed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 22:15:30 -04:00
Thomas Graf
239fb791d4 vxlan: Correctly set flow*i_mark and flow4i_proto in route lookups
VXLAN must provide the skb mark and specifiy IPPROTO_UDP when doing
the FIB lookup for the remote ip. Otherwise an invalid route might
be returned.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05 19:37:54 -04:00
Li RongQing
608404290e vxlan: remove the unnecessary codes
The return value of vxlan_fdb_replace always is greater than or equal to 0

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22 18:45:49 -04:00
David S. Miller
87ffabb1f0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The dwmac-socfpga.c conflict was a case of a bug fix overlapping
changes in net-next to handle an error pointer differently.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-14 15:44:14 -04:00
Jesse Gross
b736a623bd udptunnels: Call handle_offloads after inserting vlan tag.
handle_offloads() calls skb_reset_inner_headers() to store
the layer pointers to the encapsulated packet. However, we
currently push the vlag tag (if there is one) onto the packet
afterwards. This changes the MAC header for the encapsulated
packet but it is not reflected in skb->inner_mac_header, which
breaks GSO and drivers which attempt to use this for encapsulation
offloads.

Fixes: 1eaa8178 ("vxlan: Add tx-vlan offload support.")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 14:56:32 -04:00
WANG Cong
f13b1689d7 vxlan: do not exit on error in vxlan_stop()
We need to clean up vxlan despite vxlan_igmp_leave() fails.

This fixes the following kernel warning:

 WARNING: CPU: 0 PID: 6 at lib/debugobjects.c:263 debug_print_object+0x7c/0x8d()
 ODEBUG: free active (active state 0) object type: timer_list hint: vxlan_cleanup+0x0/0xd0
 CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.0.0-rc7+ #953
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 Workqueue: netns cleanup_net
  0000000000000009 ffff88011955f948 ffffffff81a25f5a 00000000253f253e
  ffff88011955f998 ffff88011955f988 ffffffff8107608e 0000000000000000
  ffffffff814deba2 ffff8800d4e94000 ffffffff82254c30 ffffffff81fbe455
 Call Trace:
  [<ffffffff81a25f5a>] dump_stack+0x4c/0x65
  [<ffffffff8107608e>] warn_slowpath_common+0x9c/0xb6
  [<ffffffff814deba2>] ? debug_print_object+0x7c/0x8d
  [<ffffffff81076116>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff814deba2>] debug_print_object+0x7c/0x8d
  [<ffffffff81666bf1>] ? vxlan_fdb_destroy+0x5b/0x5b
  [<ffffffff814dee02>] __debug_check_no_obj_freed+0xc3/0x15f
  [<ffffffff814df728>] debug_check_no_obj_freed+0x12/0x16
  [<ffffffff8117ae4e>] slab_free_hook+0x64/0x6c
  [<ffffffff8114deaa>] ? kvfree+0x31/0x33
  [<ffffffff8117dc66>] kfree+0x101/0x1ac
  [<ffffffff8114deaa>] kvfree+0x31/0x33
  [<ffffffff817d4137>] netdev_freemem+0x18/0x1a
  [<ffffffff817e8b52>] netdev_release+0x2e/0x32
  [<ffffffff815b4163>] device_release+0x5a/0x92
  [<ffffffff814bd4dd>] kobject_cleanup+0x49/0x5e
  [<ffffffff814bd3ff>] kobject_put+0x45/0x49
  [<ffffffff817d3fc1>] netdev_run_todo+0x26f/0x283
  [<ffffffff817d4873>] ? rollback_registered_many+0x20f/0x23b
  [<ffffffff817e0c80>] rtnl_unlock+0xe/0x10
  [<ffffffff817d4af0>] default_device_exit_batch+0x12a/0x139
  [<ffffffff810aadfa>] ? wait_woken+0x8f/0x8f
  [<ffffffff817c8e14>] ops_exit_list+0x2b/0x57
  [<ffffffff817c9b21>] cleanup_net+0x154/0x1e7
  [<ffffffff8108b05d>] process_one_work+0x255/0x4ad
  [<ffffffff8108af69>] ? process_one_work+0x161/0x4ad
  [<ffffffff8108b4b1>] worker_thread+0x1cd/0x2ab
  [<ffffffff8108b2e4>] ? process_scheduled_works+0x2f/0x2f
  [<ffffffff81090686>] kthread+0xd4/0xdc
  [<ffffffff8109eca3>] ? local_clock+0x19/0x22
  [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83
  [<ffffffff81a31c48>] ret_from_fork+0x58/0x90
  [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83

For the long-term, we should handle NETDEV_{UP,DOWN} event
from the lower device of a tunnel device.

Fixes: 56ef9c909b ("vxlan: Move socket initialization to within rtnl scope")
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 22:48:08 -04:00
WANG Cong
2790460e14 vxlan: fix a shadow local variable
Commit 79b16aadea
("udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb()")
introduce 'sk' but we already have one inner 'sk'.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 14:56:35 -04:00
David Miller
79b16aadea udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb().
That was we can make sure the output path of ipv4/ipv6 operate on
the UDP socket rather than whatever random thing happens to be in
skb->sk.

Based upon a patch by Jiri Pirko.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
2015-04-07 15:29:08 -04:00
Simon Horman
c4b495128c vxlan: correct spelling in comments
Fix some spelling / typos:
* droppped -> dropped
* asddress -> address
* compatbility -> compatibility

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-01 22:52:29 -04:00
Jiri Benc
67b61f6c13 netlink: implement nla_get_in_addr and nla_get_in6_addr
Those are counterparts to nla_put_in_addr and nla_put_in6_addr.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:58:35 -04:00
Jiri Benc
930345ea63 netlink: implement nla_put_in_addr and nla_put_in6_addr
IP addresses are often stored in netlink attributes. Add generic functions
to do that.

For nla_put_in_addr, it would be nicer to pass struct in_addr but this is
not used universally throughout the kernel, in way too many places __be32 is
used to store IPv4 address.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:58:35 -04:00
Jiri Benc
f0ef31264c vxlan: fix indentation
Some lines in vxlan code are indented by 7 spaces instead of a tab.

Fixes: e4c7ed4153 ("vxlan: add ipv6 support")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:55:21 -04:00
Marcelo Ricardo Leitner
24c0e6838c vxlan: simplify if clause in dev_close
Dan Carpenter's static checker warned that in vxlan_stop we are checking
if 'vs' can be NULL while later we simply derreference it.

As after commit 56ef9c909b ("vxlan: Move socket initialization to
within rtnl scope") 'vs' just cannot be NULL in vxlan_stop() anymore, as
the interface won't go up if the socket initialization fails. So we are
good to just remove the check and make it consistent.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 17:01:58 -04:00
David S. Miller
0fa74a4be4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	net/core/sysctl_net_core.c
	net/ipv4/inet_diag.c

The be_main.c conflict resolution was really tricky.  The conflict
hunks generated by GIT were very unhelpful, to say the least.  It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().

So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged.  And this worked beautifully.

The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 18:51:09 -04:00
Marcelo Ricardo Leitner
149d7549c2 vxlan: fix possible use of uninitialized in vxlan_igmp_{join, leave}
Test robot noticed that we check the return of vxlan_igmp_join and leave
but inside them there was a path that it could be used initialized.

It's not really possible because those if() inside these igmp functions
would always match as we can't have sockets of other type in there, but
this way we keep the compiler happy.

Fixes: 56ef9c909b ("vxlan: Move socket initialization to within rtnl scope")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 13:31:24 -04:00
Marcelo Ricardo Leitner
56ef9c909b vxlan: Move socket initialization to within rtnl scope
Currently, if a multicast join operation fail, the vxlan interface will
be UP but not functional, without even a log message informing the user.

Now that we can grab socket lock after already having rntl, we don't
need to defer socket creation and multicast operations. By not deferring
we can do proper error reporting to the user through ip exit code.

This patch thus removes all deferred work that vxlan had and put it back
inline. Now the socket will only be created, bound and join multicast
group when one bring the interface up, and will undo all that as soon as
one put the interface down.

As vxlan_sock_hold() is not used after this patch, it was removed too.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:05:10 -04:00
Marcelo Ricardo Leitner
54ff9ef36b ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}
in favor of their inner __ ones, which doesn't grab rtnl.

As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.

So this patch:
- move rtnl handling to callers instead while already fixing some
  reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
  __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
  __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:05:09 -04:00
Alexey Kodanev
40fb70f3aa vxlan: fix wrong usage of VXLAN_VID_MASK
commit dfd8645ea1 wrongly assumes that VXLAN_VDI_MASK includes
eight lower order reserved bits of VNI field that are using for remote
checksum offload.

Right now, when VNI number greater then 0xffff, vxlan_udp_encap_recv()
will always return with 'bad_flag' error, reducing the usable vni range
from 0..16777215 to 0..65535. Also, it doesn't really check whether RCO
bits processed or not.

Fix it by adding new VNI mask which has all 32 bits of VNI field:
24 bits for id and 8 bits for other usage.

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 13:08:07 -04:00
Simon Horman
719a11cdbf vxlan: Don't set s_addr in vxlan_create_sock
In the case of AF_INET s_addr was set to INADDR_ANY (0) which which both
symmetric with the AF_INET6 case, where s_addr is not set, and unnecessary
as udp_conf is zeroed out earlier in the same function.

I suspect this change does not have any run-time effect due to compiler
optimisations. But it does make the code a little easier on the/my eyes.

Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 23:23:16 -04:00
Tom Herbert
0ace2ca89c vxlan: Use checksum partial with remote checksum offload
Change remote checksum handling to set checksum partial as default
behavior. Added an iflink parameter to configure not using
checksum partial (calling csum_partial to update checksum).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 15:12:13 -08:00
Tom Herbert
15e2396d4e net: Infrastructure for CHECKSUM_PARTIAL with remote checsum offload
This patch adds infrastructure so that remote checksum offload can
set CHECKSUM_PARTIAL instead of calling csum_partial and writing
the modfied checksum field.

Add skb_remcsum_adjust_partial function to set an skb for using
CHECKSUM_PARTIAL with remote checksum offload.  Changed
skb_remcsum_process and skb_gro_remcsum_process to take a boolean
argument to indicate if checksum partial can be set or the
checksum needs to be modified using the normal algorithm.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 15:12:12 -08:00
Tom Herbert
26c4f7da3e net: Fix remcsum in GRO path to not change packet
Remote checksum offload processing is currently the same for both
the GRO and non-GRO path. When the remote checksum offload option
is encountered, the checksum field referred to is modified in
the packet. So in the GRO case, the packet is modified in the
GRO path and then the operation is skipped when the packet goes
through the normal path based on skb->remcsum_offload. There is
a problem in that the packet may be modified in the GRO path, but
then forwarded off host still containing the remote checksum option.
A remote host will again perform RCO but now the checksum verification
will fail since GRO RCO already modified the checksum.

To fix this, we ensure that GRO restores a packet to it's original
state before returning. In this model, when GRO processes a remote
checksum option it still changes the checksum per the algorithm
but on return from lower layer processing the checksum is restored
to its original value.

In this patch we add define gro_remcsum structure which is passed
to skb_gro_remcsum_process to save offset and delta for the checksum
being changed. After lower layer processing, skb_gro_remcsum_cleanup
is called to restore the checksum before returning from GRO.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 15:12:09 -08:00