2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

1590 Commits

Author SHA1 Message Date
Nikolay Aleksandrov
974daef7f8 net: add missing dev_put() in __netdev_adjacent_dev_insert
I think that a dev_put() is needed in the error path to preserve the
proper dev refcount.

CC: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25 19:03:39 -04:00
Eric Dumazet
3347c96029 ipv4: gso: make inet_gso_segment() stackable
In order to support GSO on IPIP, we need to make
inet_gso_segment() stackable.

It should not assume network header starts right after mac
header.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-19 19:36:18 -04:00
David S. Miller
53af53ae83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	include/linux/netdevice.h
	net/core/sock.c

Trivial merge issues.

Removal of "extern" for functions declaration in netdevice.h
at the same time "const" was added to an argument.

Two parallel line additions in net/core/sock.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 23:07:53 -04:00
Eric W. Biederman
5cde282938 net: Separate the close_list and the unreg_list v2
Separate the unreg_list and the close_list in dev_close_many preventing
dev_close_many from permuting the unreg_list.  The permutations of the
unreg_list have resulted in cases where the loopback device is accessed
it has been freed in code such as dst_ifdown.  Resulting in subtle memory
corruption.

This is the second bug from sharing the storage between the close_list
and the unreg_list.  The issues that crop up with sharing are
apparently too subtle to show up in normal testing or usage, so let's
forget about being clever and use two separate lists.

v2: Make all callers pass in a close_list to dev_close_many

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07 15:23:14 -04:00
Michael S. Tsirkin
3573540caf netif_set_xps_queue: make cpu mask const
virtio wants to pass in cpumask_of(cpu), make parameter
const to avoid build warnings.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07 12:29:26 -04:00
David S. Miller
4fbef95af4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be.h
	drivers/net/usb/qmi_wwan.c
	drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
	include/net/netfilter/nf_conntrack_synproxy.h
	include/net/secure_seq.h

The conflicts are of two varieties:

1) Conflicts with Joe Perches's 'extern' removal from header file
   function declarations.  Usually it's an argument signature change
   or a function being added/removed.  The resolutions are trivial.

2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds
   a new value, another changes an existing value.  That sort of
   thing.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01 17:06:14 -04:00
Nicolas Dichtel
991fb3f74c dev: always advertise rx_flags changes via netlink
When flags IFF_PROMISC and IFF_ALLMULTI are changed, netlink messages are not
consistent. For example, if a multicast daemon is running (flag IFF_ALLMULTI
set in dev->flags but not dev->gflags, ie not exported to userspace) and then a
user sets it via netlink (flag IFF_ALLMULTI set in dev->flags and dev->gflags, ie
exported to userspace), no netlink message is sent.
Same for IFF_PROMISC and because dev->promiscuity is exported via
IFLA_PROMISCUITY, we may send a netlink message after each change of this
counter.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30 15:08:13 -04:00
Nicolas Dichtel
a528c219df dev: update __dev_notify_flags() to send rtnl msg
This patch only prepares the next one, there is no functional change.
Now, __dev_notify_flags() can also be used to notify flags changes via
rtnetlink.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30 15:08:12 -04:00
Eric W. Biederman
50624c934d net: Delay default_device_exit_batch until no devices are unregistering v2
There is currently serialization network namespaces exiting and
network devices exiting as the final part of netdev_run_todo does not
happen under the rtnl_lock.  This is compounded by the fact that the
only list of devices unregistering in netdev_run_todo is local to the
netdev_run_todo.

This lack of serialization in extreme cases results in network devices
unregistering in netdev_run_todo after the loopback device of their
network namespace has been freed (making dst_ifdown unsafe), and after
the their network namespace has exited (making the NETDEV_UNREGISTER,
and NETDEV_UNREGISTER_FINAL callbacks unsafe).

Add the missing serialization by a per network namespace count of how
many network devices are unregistering and having a wait queue that is
woken up whenever the count is decreased.  The count and wait queue
allow default_device_exit_batch to wait until all of the unregistration
activity for a network namespace has finished before proceeding to
unregister the loopback device and then allowing the network namespace
to exit.

Only a single global wait queue is used because there is a single global
lock, and there is a single waiter, per network namespace wait queues
would be a waste of resources.

The per network namespace count of unregistering devices gives a
progress guarantee because the number of network devices unregistering
in an exiting network namespace must ultimately drop to zero (assuming
network device unregistration completes).

The basic logic remains the same as in v1.  This patch is now half
comment and half rtnl_lock_unregistering an expanded version of
wait_event performs no extra work in the common case where no network
devices are unregistering when we get to default_device_exit_batch.

Reported-by: Francesco Ruggeri <fruggeri@aristanetworks.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28 15:09:15 -07:00
Veaceslav Falico
5831d66e80 net: create sysfs symlinks for neighbour devices
Also, remove the same functionality from bonding - it will be already done
for any device that links to its lower/upper neighbour.

The links will be created for dev's kobject, and will look like
lower_eth0 for lower device eth0 and upper_bridge0 for upper device
bridge0.

CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26 16:02:08 -04:00
Veaceslav Falico
842d67a7b3 net: expose the master link to sysfs, and remove it from bond
Currently, we can have only one master upper neighbour, so it would be
useful to create a symlink to it in the sysfs device directory, the way
that bonding now does it, for every device. Lower devices from
bridge/team/etc will automagically get it, so we could rely on it.

Also, remove the same functionality from bonding.

CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26 16:02:08 -04:00
Veaceslav Falico
b6ccba4c68 net: add a possibility to get private from netdev_adjacent->list
It will be useful to get first/last element.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26 16:02:06 -04:00
Veaceslav Falico
31088a113c net: add for_each iterators through neighbour lower link's private
Add a possibility to iterate through netdev_adjacent's private, currently
only for lower neighbours.

Add both RCU and RTNL/other locking variants of iterators, and make the
non-rcu variant to be safe from removal.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26 16:02:04 -04:00
Veaceslav Falico
402dae9614 net: add netdev_adjacent->private and allow to use it
Currently, even though we can access any linked device, we can't attach
anything to it, which is vital to properly manage them.

To fix this, add a new void *private to netdev_adjacent and functions
setting/getting it (per link), so that we can save, per example, bonding's
slave structures there, per slave device.

netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to
upper dev and populates the neighbour link only with private.

netdev_lower_dev_get_private{,_rcu}() returns the private, if found.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26 16:02:04 -04:00
Veaceslav Falico
5249dec738 net: add RCU variant to search for netdev_adjacent link
Currently we have only the RTNL flavour, however we can traverse it while
holding only RCU, so add the RCU search. Add an RCU variant that uses
list_head * as an argument, so that it can be universally used afterwards.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26 16:02:04 -04:00
Veaceslav Falico
2f268f129c net: add adj_list to save only neighbours
Currently, we distinguish neighbours (first-level linked devices) from
non-neighbours by the neighbour bool in the netdev_adjacent. This could be
quite time-consuming in case we would like to traverse *only* through
neighbours - cause we'd have to traverse through all devices and check for
this flag, and in a (quite common) scenario where we have lots of vlans on
top of bridge, which is on top of a bond - the bonding would have to go
through all those vlans to get its upper neighbour linked devices.

This situation is really unpleasant, cause there are already a lot of cases
when a device with slaves needs to go through them in hot path.

To fix this, introduce a new upper/lower device lists structure -
adj_list, which contains only the neighbours. It works always in
pair with the all_adj_list structure (renamed from upper/lower_dev_list),
i.e. both of them contain the same links, only that all_adj_list contains
also non-neighbour device links. It's really a small change visible,
currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
change the main linked logic at all.

Also, add some comments a fix a name collision in
netdev_for_each_upper_dev_rcu() and rework the naming by the following
rules:

netdev_(all_)(upper|lower)_*

If "all_" is present, then we work with the whole list of upper/lower
devices, otherwise - only with direct neighbours. Uninline functions - to
get better stack traces.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26 16:02:04 -04:00
Veaceslav Falico
7863c054d1 net: use lists as arguments instead of bool upper
Currently we make use of bool upper when we want to specify if we want to
work with upper/lower list. It's, however, harder to read, debug and
occupies a lot more code.

Fix this by just passing the correct upper/lower_dev_list list_head pointer
instead of bool upper, and work internally with it.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26 16:02:03 -04:00
Veaceslav Falico
82476b3160 net: correctly interlink lower/upper devices
Currently we're linking upper devices to lower ones, which results in
upside-down relationship: upper devices seeing lower devices via its upper
lists.

Fix this by correctly linking lower devices to the upper ones.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:26 -04:00
Nicolas Dichtel
8b27f27797 skb: allow skb_scrub_packet() to be used by tunnels
This function was only used when a packet was sent to another netns. Now, it can
also be used after tunnel encapsulation or decapsulation.

Only skb_orphan() should not be done when a packet is not crossing netns.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:25 -04:00
Veaceslav Falico
48311f4685 net: add netdev_upper_get_next_dev_rcu(dev, iter)
This function returns the next dev in the dev->upper_dev_list after the
struct list_head **iter position, and updates *iter accordingly. Returns
NULL if there are no devices left.

Caller must hold RCU read lock.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 16:19:42 -04:00
Veaceslav Falico
620f3186ca net: remove search_list from netdev_adjacent
We already don't need it cause we see every upper/lower device in the list
already.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 16:19:42 -04:00
Veaceslav Falico
5d261913ca net: add lower_dev_list to net_device and make a full mesh
This patch adds lower_dev_list list_head to net_device, which is the same
as upper_dev_list, only for lower devices, and begins to use it in the same
way as the upper list.

It also changes the way the whole adjacent device lists work - now they
contain *all* of upper/lower devices, not only the first level. The first
level devices are distinguished by the bool neighbour field in
netdev_adjacent, also added by this patch.

There are cases when a device can be added several times to the adjacent
list, the simplest would be:

     /---- eth0.10 ---\
eth0-		       --- bond0
     \---- eth0.20 ---/

where both bond0 and eth0 'see' each other in the adjacent lists two times.
To avoid duplication of netdev_adjacent structures ref_nr is being kept as
the number of times the device was added to the list.

The 'full view' is achieved by adding, on link creation, all of the
upper_dev's upper_dev_list devices as upper devices to all of the
lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
versa. On unlink they are removed using the same logic.

I've tested it with thousands vlans/bonds/bridges, everything works ok and
no observable lags even on a huge number of interfaces.

Memory footprint for 128 devices interconnected with each other via both
upper and lower (which is impossible, but for the comparison) lists would be:

128*128*2*sizeof(netdev_adjacent) = 1.5MB

but in the real world we usualy have at most several devices with slaves
and a lot of vlans, so the footprint will be much lower.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 16:19:42 -04:00
Veaceslav Falico
aa9d85605f net: rename netdev_upper to netdev_adjacent
Rename the structure to reflect the upcoming addition of lower_dev_list.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 16:19:41 -04:00
Nicolas Dichtel
64261f230a dev: move skb_scrub_packet() after eth_type_trans()
skb_scrub_packet() was called before eth_type_trans() to let eth_type_trans()
set pkt_type.

In fact, we should force pkt_type to PACKET_HOST, so move the call after
eth_type_trans().

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 01:00:20 -07:00
Jiri Pirko
66b52b0dc8 net: add ndo to get id of physical port of the device
This patch adds a ndo for getting physical port of the device. Driver
which is aware of being virtual function of some physical port should
implement this ndo. This is applicable not only for IOV, but for other
solutions (NPAR, multichannel) as well. Basically if there is possible
to have multiple netdevs on the single hw port.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30 17:31:24 -07:00
Thomas Gleixner
18afa4b028 net: Make devnet_rename_seq static
No users outside net/core/dev.c.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:57:26 -07:00
Eric Dumazet
d4b812dea4 vlan: mask vlan prio bits
In commit 48cc32d38a
("vlan: don't deliver frames for unknown vlans to protocols")
Florian made sure we set pkt_type to PACKET_OTHERHOST
if the vlan id is set and we could find a vlan device for this
particular id.

But we also have a problem if prio bits are set.

Steinar reported an issue on a router receiving IPv6 frames with a
vlan tag of 4000 (id 0, prio 2), and tunneled into a sit device,
because skb->vlan_tci is set.

Forwarded frame is completely corrupted : We can see (8100:4000)
being inserted in the middle of IPv6 source address :

16:48:00.780413 IP6 2001:16d8:8100:4000:ee1c:0:9d9:bc87 >
9f94:4d95:2001:67c:29f4::: ICMP6, unknown icmp6 type (0), length 64
       0x0000:  0000 0029 8000 c7c3 7103 0001 a0ae e651
       0x0010:  0000 0000 ccce 0b00 0000 0000 1011 1213
       0x0020:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223
       0x0030:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233

It seems we are not really ready to properly cope with this right now.

We can probably do better in future kernels :
vlan_get_ingress_priority() should be a netdev property instead of
a per vlan_dev one.

For stable kernels, lets clear vlan_tci to fix the bugs.

Reported-by: Steinar H. Gunderson <sesse@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-18 13:05:23 -07:00
Alexander Duyck
cdbaa0bb26 gso: Update tunnel segmentation to support Tx checksum offload
This change makes it so that the GRE and VXLAN tunnels can make use of Tx
checksum offload support provided by some drivers via the hw_enc_features.
Without this fix enabling GSO means sacrificing Tx checksum offload and
this actually leads to a performance regression as shown below:

            Utilization
            Send
Throughput  local         GSO
10^6bits/s  % S           state
  6276.51   8.39          enabled
  7123.52   8.42          disabled

To resolve this it was necessary to address two items.  First
netif_skb_features needed to be updated so that it would correctly handle
the Trans Ether Bridging protocol without impacting the need to check for
Q-in-Q tagging.  To do this it was necessary to update harmonize_features
so that it used skb_network_protocol instead of just using the outer
protocol.

Second it was necessary to update the GRE and UDP tunnel segmentation
offloads so that they would reset the encapsulation bit and inner header
offsets after the offload was complete.

As a result of this change I have seen the following results on a interface
with Tx checksum enabled for encapsulated frames:

            Utilization
            Send
Throughput  local         GSO
10^6bits/s  % S           state
  7123.52   8.42          disabled
  8321.75   5.43          enabled

v2: Instead of replacing refrence to skb->protocol with
    skb_network_protocol just replace the protocol reference in
    harmonize_features to allow for double VLAN tag checks.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11 12:18:49 -07:00
David S. Miller
0c1072ae02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/freescale/fec_main.c
	drivers/net/ethernet/renesas/sh_eth.c
	net/ipv4/gre.c

The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list)
and the splitting of the gre.c code into seperate files.

The FEC conflict was two sets of changes adding ethtool support code
in an "!CONFIG_M5272" CPP protected block.

Finally the sh_eth.c conflict was between one commit add bits set
in the .eesr_err_check mask whilst another commit removed the
.tx_error_check member and assignments.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 14:55:13 -07:00
Isaku Yamahata
06a23fe31c core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()
The dev_forward_skb() assignment of pkt_type should be done
after the call to eth_type_trans().

ip-encapsulated packets can be handled by localhost. But skb->pkt_type
can be PACKET_OTHERHOST when packet comes via veth into ip tunnel device.
In that case, the packet is dropped by ip_rcv().
Although this example uses gretap. l2tp-eth also has same issue.
For l2tp-eth case, add dummy device for ip address and ip l2tp command.

netns A |                     root netns                      | netns B
   veth<->veth=bridge=gretap <-loop back-> gretap=bridge=veth<->veth

arp packet ->
pkt_type
         BROADCAST------------>ip_rcv()------------------------>

                                                             <- arp reply
                                                                pkt_type
                               ip_rcv()<-----------------OTHERHOST
                               drop

sample operations
  ip link add tapa type gretap remote 172.17.107.4 local 172.17.107.3
  ip link add tapb type gretap remote 172.17.107.3 local 172.17.107.4
  ip link set tapa up
  ip link set tapb up
  ip address add 172.17.107.3 dev tapa
  ip address add 172.17.107.4 dev tapb
  ip route get 172.17.107.3
  > local 172.17.107.3 dev lo  src 172.17.107.3
  >    cache <local>
  ip route get 172.17.107.4
  > local 172.17.107.4 dev lo  src 172.17.107.4
  >    cache <local>
  ip link add vetha type veth peer name vetha-peer
  ip link add vethb type veth peer name vethb-peer
  brctl addbr bra
  brctl addbr brb
  brctl addif bra tapa
  brctl addif bra vetha-peer
  brctl addif brb tapb
  brctl addif brb vethb-peer
  brctl show
  > bridge name     bridge id               STP enabled     interfaces
  > bra             8000.6ea21e758ff1       no              tapa
  >                                                         vetha-peer
  > brb             8000.420020eb92d5       no              tapb
  >                                                         vethb-peer
  ip link set vetha-peer up
  ip link set vethb-peer up
  ip link set bra up
  ip link set brb up
  ip netns add a
  ip netns add b
  ip link set vetha netns a
  ip link set vethb netns b
  ip netns exec a ip address add 10.0.0.3/24 dev vetha
  ip netns exec b ip address add 10.0.0.4/24 dev vethb
  ip netns exec a ip link set vetha up
  ip netns exec b ip link set vethb up
  ip netns exec a arping -I vetha 10.0.0.4
  ARPING 10.0.0.4 from 10.0.0.3 vetha
  ^CSent 2 probes (2 broadcast(s))
  Received 0 response(s)

Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Hong Zhiguo <honkiko@gmail.com>
Cc: Rami Rosen <ramirose@gmail.com>
Cc: Tom Parkin <tparkin@katalix.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Jesse Gross <jesse@nicira.com>
Cc: dev@openvswitch.org
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-02 15:59:18 -07:00
Nicolas Dichtel
621e84d6f3 dev: introduce skb_scrub_packet()
The goal of this new function is to perform all needed cleanup before sending
an skb into another netns.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-27 22:29:05 -07:00
Nicolas Schichan
5dbe7c178d net: fix kernel deadlock with interface rename and netdev name retrieval.
When the kernel (compiled with CONFIG_PREEMPT=n) is performing the
rename of a network interface, it can end up waiting for a workqueue
to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a
SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to
the fact that read_secklock_begin() will spin forever waiting for the
writer process (the one doing the interface rename) to update the
devnet_rename_seq sequence.

This patch fixes the problem by adding a helper (netdev_get_name())
and using it in the code handling the SIOCGIFNAME ioctl and
SO_BINDTODEVICE setsockopt.

The netdev_get_name() helper uses raw_seqcount_begin() to avoid
spinning forever, waiting for devnet_rename_seq->sequence to become
even. cond_resched() is used in the contended case, before retrying
the access to give the writer process a chance to finish.

The use of raw_seqcount_begin() will incur some unneeded work in the
reader process in the contended case, but this is better than
deadlocking the system.

Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26 13:42:54 -07:00
Eric Dumazet
60877a32bc net: allow large number of tx queues
netif_alloc_netdev_queues() uses kcalloc() to allocate memory
for the "struct netdev_queue *_tx" array.

For large number of tx queues, kcalloc() might fail, so this
patch does a fallback to vzalloc().

As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT
to kzalloc() flags to do this fallback only when really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-23 23:56:55 -07:00
Eliezer Tamir
af12fa6e46 net: add napi_id and hash
Adds a napi_id and a hashing mechanism to lookup a napi by id.
This will be used by subsequent patches to implement low latency
Ethernet device polling.
Based on a code sample by Eric Dumazet.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:35 -07:00
Baruch Siach
430f03cde2 net: mark netdev_create_hash __net_init
netdev_create_hash() is only called from netdev_init() which is marked
__net_init.

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04 16:47:27 -07:00
Simon Horman
ced14f6804 net: Correct comparisons and calculations using skb->tail and skb-transport_header
This corrects an regression introduced by "net: Use 16bits for *_headers
fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In
that case skb->tail will be a pointer whereas skb->transport_header
will be an offset from head. This is corrected by using wrappers that
ensure that comparisons and calculations are always made using pointers.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 23:49:07 -07:00
Cong Wang
75538c2b85 net: always pass struct netdev_notifier_info to netdevice notifiers
commit 351638e7de (net: pass info struct via netdevice notifier)
breaks booting of my KVM guest, this is due to we still forget to pass
struct netdev_notifier_info in several places. This patch completes it.

Cc: Jiri Pirko <jiri@resnulli.us>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 21:58:54 -07:00
Jiri Pirko
be9efd3653 net: pass changed flags along with NETDEV_CHANGE event
Use new netdevice notifier infrastructure to pass along changed flags.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>

v2->v3: shortened notifier_info struct name
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 13:11:01 -07:00
Jiri Pirko
351638e7de net: pass info struct via netdevice notifier
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>

v2->v3: fix typo on simeth
	shortened dev_getter
	shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 13:11:01 -07:00
dingtianhong
da6e378ba9 netpoll: remove return value from netpoll_rx_disable()
The netpoll_rx_disable() will always return 0, it is no use and looks wordy,
so remove the unnecessary code and get rid of it in _dev_open and _dev_close.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-27 23:18:50 -07:00
Simon Horman
0d89d2035f MPLS: Add limited GSO support
In the case where a non-MPLS packet is received and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.

The aim of this code is to provide GSO in software for MPLS packets
whose skbs are GSO.

SKB Usage:

When an implementation adds an MPLS stack to a non-MPLS packet it should do
the following to skb metadata:

* Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
  skb->inner_protocol is added by this patch.

* Set skb->protocol to the new MPLS ethertype of the packet.

* Set skb->network_header to correspond to the
  end of the L3 header, including the MPLS label stack.

I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
That patch sets the above requirements in datapath/actions.c:push_mpls()
and was used to exercise this code.  The datapath patch is against the Open
vSwtich tree but it is intended that it be added to the Open vSwtich code
present in the mainline Linux kernel at some point.

Features:

I believe that the approach that I have taken is at least partially
consistent with the handling of other protocols.  Jesse, I understand that
you have some ideas here.  I am more than happy to change my implementation.

This patch adds dev->mpls_features which may be used by devices
to advertise features supported for MPLS packets.

A new NETIF_F_MPLS_GSO feature is added for devices which support
hardware MPLS GSO offload.  Currently no devices support this
and MPLS GSO always falls back to software.

Alternate Implementation:

One possible alternate implementation is to teach netif_skb_features()
and skb_network_protocol() about MPLS, in a similar way to their
understanding of VLANs. I believe this would avoid the need
for net/mpls/mpls_gso.c and in particular the calls to
__skb_push() and __skb_push() in mpls_gso_segment().

I have decided on the implementation in this patch as it should
not introduce any overhead in the case where mpls_gso is not compiled
into the kernel or inserted as a module.

MPLS GSO suggested by Jesse Gross.
Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
by Pravin B Shelar.

Cc: Jesse Gross <jesse@nicira.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-27 22:50:59 -07:00
Jiri Pirko
42e52bf9e3 net: add netnotifier event for upper device change
Now when upper device is changed, event is not propagated via RT Netlink
to userspace. Userspace might never now about the change. Fix this by
adding upper-device-change notifier event.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-25 23:12:19 -07:00
Willem de Bruijn
99bbc70741 rps: selective flow shedding during softnet overflow
A cpu executing the network receive path sheds packets when its input
queue grows to netdev_max_backlog. A single high rate flow (such as a
spoofed source DoS) can exceed a single cpu processing rate and will
degrade throughput of other flows hashed onto the same cpu.

This patch adds a more fine grained hashtable. If the netdev backlog
is above a threshold, IRQ cpus track the ratio of total traffic of
each flow (using 4096 buckets, configurable). The ratio is measured
by counting the number of packets per flow over the last 256 packets
from the source cpu. Any flow that occupies a large fraction of this
(set at 50%) will see packet drop while above the threshold.

Tested:
Setup is a muli-threaded UDP echo server with network rx IRQ on cpu0,
kernel receive (RPS) on cpu0 and application threads on cpus 2--7
each handling 20k req/s. Throughput halves when hit with a 400 kpps
antagonist storm. With this patch applied, antagonist overload is
dropped and the server processes its complete load.

The patch is effective when kernel receive processing is the
bottleneck. The above RPS scenario is a extreme, but the same is
reached with RFS and sufficient kernel processing (iptables, packet
socket tap, ..).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-20 13:48:04 -07:00
Nicolas Dichtel
57b354e66b dev: remove duplicate 'skb->dev = dev' in dev_forward_skb()
This was added by commit 59b9997bab (Revert "net: maintain namespace
isolation between vlan and real device").
In fact, before the initial commit - the one that is reverted -, this
statement was not present.
'skb->dev = dev' is already done in eth_type_trans(), which is call just
after.

Spotted-by: Alain Ritoux <alain.ritoux@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-17 18:23:08 -07:00
Pravin B Shelar
19acc32725 gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol()
Rather than having logic to calculate inner protocol in every
tunnel gso handler move it to gso code. This simplifies code.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-08 13:13:30 -07:00
Patrick McHardy
6708c9e5cc net: use netdev_features_t in skb_needs_linearize()
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-02 13:58:12 -04:00
Sridhar Samudrala
0c772159d1 net: Use consume_skb() to free gso segmented skb
Use consume_skb() to free the original skb that is successfully transmitted
as gso segmented skbs so that it is not treated as a drop due to an error.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-30 00:18:13 -04:00
Eric Dumazet
e12472dc57 net: remove redundant code in dev_hard_start_xmit()
This reverts commit 068a2de57d (net: release dst entry while
cache-hot for GSO case too)

Before GSO packet segmentation, we already take care of skb->dst if it
can be released.

There is no point adding extra test for every segment in the gso loop.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-25 01:38:25 -04:00
David S. Miller
6e0895c2ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	drivers/net/ethernet/intel/igb/igb_main.c
	drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
	include/net/scm.h
	net/batman-adv/routing.c
	net/ipv4/tcp_input.c

The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
cleanup in net-next to now pass cred structs around.

The be2net driver had a bug fix in 'net' that overlapped with the VLAN
interface changes by Patrick McHardy in net-next.

An IGB conflict existed because in 'net' the build_skb() support was
reverted, and in 'net-next' there was a comment style fix within that
code.

Several batman-adv conflicts were resolved by making sure that all
calls to batadv_is_my_mac() are changed to have a new bat_priv first
argument.

Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
rewrite in 'net-next', mostly overlapping changes.

Thanks to Stephen Rothwell and Antonio Quartulli for help with several
of these merge resolutions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22 20:32:51 -04:00
dingtianhong
53759be997 net: Remove return value from list_netdevice()
The return value from list_netdevice() is not used and no need, so remove it.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22 16:09:24 -04:00
Ben Greear
c846ad9b88 net: rate-limit warn-bad-offload splats.
If one does do something unfortunate and allow a
bad offload bug into the kernel, this the
skb_warn_bad_offload can effectively live-lock the
system, filling the logs with the same error over
and over.

Add rate limitation to this so that box remains otherwise
functional in this case.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 17:57:49 -04:00
Patrick McHardy
8ad227ff89 net: vlan: add 802.1ad support
Add support for 802.1ad VLAN devices. This mainly consists of checking for
ETH_P_8021AD in addition to ETH_P_8021Q in a couple of places and check
offloading capabilities based on the used protocol.

Configuration is done using "ip link":

# ip link add link eth0 eth0.1000 \
	type vlan proto 802.1ad id 1000
# ip link add link eth0.1000 eth0.1000.1000 \
	type vlan proto 802.1q id 1000

52:54:00:12:34:56 > 92:b1:54:28:e4:8c, ethertype 802.1Q (0x8100), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    20.1.0.2 > 20.1.0.1: ICMP echo request, id 3003, seq 8, length 64
92:b1:54:28:e4:8c > 52:54:00:12:34:56, ethertype 802.1Q-QinQ (0x88a8), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 47944, offset 0, flags [none], proto ICMP (1), length 84)
    20.1.0.1 > 20.1.0.2: ICMP echo reply, id 3003, seq 8, length 64

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 14:46:06 -04:00
Patrick McHardy
86a9bad3ab net: vlan: add protocol argument to packet tagging functions
Add a protocol argument to the VLAN packet tagging functions. In case of HW
tagging, we need that protocol available in the ndo_start_xmit functions,
so it is stored in a new field in the skb. The new field fits into a hole
(on 64 bit) and doesn't increase the sks's size.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 14:46:06 -04:00
Patrick McHardy
f646968f8f net: vlan: rename NETIF_F_HW_VLAN_* feature flags to NETIF_F_HW_VLAN_CTAG_*
Rename the hardware VLAN acceleration features to include "CTAG" to indicate
that they only support CTAGs. Follow up patches will introduce 802.1ad
server provider tagging (STAGs) and require the distinction for hardware not
supporting acclerating both.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 14:45:26 -04:00
David S. Miller
d978a6361a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/nfc/microread/mei.c
	net/netfilter/nfnetlink_queue_core.c

Pull in 'net' to get Eric Biederman's AF_UNIX fix, upon which
some cleanups are going to go on-top.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-07 18:37:01 -04:00
Patrick McHardy
124dff01af netfilter: don't reset nf_trace in nf_reset()
Commit 130549fe ("netfilter: reset nf_trace in nf_reset") added code
to reset nf_trace in nf_reset(). This is wrong and unnecessary.

nf_reset() is used in the following cases:

- when passing packets up the the socket layer, at which point we want to
  release all netfilter references that might keep modules pinned while
  the packet is queued. nf_trace doesn't matter anymore at this point.

- when encapsulating or decapsulating IPsec packets. We want to continue
  tracing these packets after IPsec processing.

- when passing packets through virtual network devices. Only devices on
  that encapsulate in IPv4/v6 matter since otherwise nf_trace is not
  used anymore. Its not entirely clear whether those packets should
  be traced after that, however we've always done that.

- when passing packets through virtual network devices that make the
  packet cross network namespace boundaries. This is the only cases
  where we clearly want to reset nf_trace and is also what the
  original patch intended to fix.

Add a new function nf_reset_trace() and use it in dev_forward_skb() to
fix this properly.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-05 15:38:10 -04:00
David S. Miller
a210576cf8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/mac80211/sta_info.c
	net/wireless/core.h

Two minor conflicts in wireless.  Overlapping additions of extern
declarations in net/wireless/core.h and a bug fix overlapping with
the addition of a boolean parameter to __ieee80211_key_free().

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-01 13:36:50 -04:00
Eric Dumazet
00cfec3748 net: add a synchronize_net() in netdev_rx_handler_unregister()
commit 35d48903e9 (bonding: fix rx_handler locking) added a race
in bonding driver, reported by Steven Rostedt who did a very good
diagnosis :

<quoting Steven>

I'm currently debugging a crash in an old 3.0-rt kernel that one of our
customers is seeing. The bug happens with a stress test that loads and
unloads the bonding module in a loop (I don't know all the details as
I'm not the one that is directly interacting with the customer). But the
bug looks to be something that may still be present and possibly present
in mainline too. It will just be much harder to trigger it in mainline.

In -rt, interrupts are threads, and can schedule in and out just like
any other thread. Note, mainline now supports interrupt threads so this
may be easily reproducible in mainline as well. I don't have the ability
to tell the customer to try mainline or other kernels, so my hands are
somewhat tied to what I can do.

But according to a core dump, I tracked down that the eth irq thread
crashed in bond_handle_frame() here:

        slave = bond_slave_get_rcu(skb->dev);
        bond = slave->bond; <--- BUG

the slave returned was NULL and accessing slave->bond caused a NULL
pointer dereference.

Looking at the code that unregisters the handler:

void netdev_rx_handler_unregister(struct net_device *dev)
{

        ASSERT_RTNL();
        RCU_INIT_POINTER(dev->rx_handler, NULL);
        RCU_INIT_POINTER(dev->rx_handler_data, NULL);
}

Which is basically:
        dev->rx_handler = NULL;
        dev->rx_handler_data = NULL;

And looking at __netif_receive_skb() we have:

        rx_handler = rcu_dereference(skb->dev->rx_handler);
        if (rx_handler) {
                if (pt_prev) {
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = NULL;
                }
                switch (rx_handler(&skb)) {

My question to all of you is, what stops this interrupt from happening
while the bonding module is unloading?  What happens if the interrupt
triggers and we have this:

        CPU0                    CPU1
        ----                    ----
  rx_handler = skb->dev->rx_handler

                        netdev_rx_handler_unregister() {
                           dev->rx_handler = NULL;
                           dev->rx_handler_data = NULL;

  rx_handler()
   bond_handle_frame() {
    slave = skb->dev->rx_handler;
    bond = slave->bond; <-- NULL pointer dereference!!!

What protection am I missing in the bond release handler that would
prevent the above from happening?

</quoting Steven>

We can fix bug this in two ways. First is adding a test in
bond_handle_frame() and others to check if rx_handler_data is NULL.

A second way is adding a synchronize_net() in
netdev_rx_handler_unregister() to make sure that a rcu protected reader
has the guarantee to see a non NULL rx_handler_data.

The second way is better as it avoids an extra test in fast path.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-29 15:38:15 -04:00
Shmulik Ladkani
a561cf7edf net: core: Remove redundant call to 'nf_reset' in 'dev_forward_skb'
'nf_reset' is called just prior calling 'netif_rx'.
No need to call it twice.

Reported-by: Igor Michailov <rgohita@gmail.com>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-29 15:25:28 -04:00
David S. Miller
e2a553dbf1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	include/net/ipip.h

The changes made to ipip.h in 'net' were already included
in 'net-next' before that header was moved to another location.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-27 13:52:49 -04:00
Jason Wang
15e5a03071 net_sched: better precise estimation on packet length for untrusted packets
gso_segs were reset to zero when kernel receive packets from untrusted
source. But we use this zero value to estimate precise packet len which is
wrong. So this patch tries to estimate the correct gso_segs value before using
it in qdisc_pkt_len_init().

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26 12:44:44 -04:00
Eric Dumazet
9979a55a83 net: remove a WARN_ON() in net_enable_timestamp()
The WARN_ON(in_interrupt()) in net_enable_timestamp() can get false
positive, in socket clone path, run from softirq context :

[ 3641.624425] WARNING: at net/core/dev.c:1532 net_enable_timestamp+0x7b/0x80()
[ 3641.668811] Call Trace:
[ 3641.671254]  <IRQ>  [<ffffffff80286817>] warn_slowpath_common+0x87/0xc0
[ 3641.677871]  [<ffffffff8028686a>] warn_slowpath_null+0x1a/0x20
[ 3641.683683]  [<ffffffff80742f8b>] net_enable_timestamp+0x7b/0x80
[ 3641.689668]  [<ffffffff80732ce5>] sk_clone_lock+0x425/0x450
[ 3641.695222]  [<ffffffff8078db36>] inet_csk_clone_lock+0x16/0x170
[ 3641.701213]  [<ffffffff807ae449>] tcp_create_openreq_child+0x29/0x820
[ 3641.707663]  [<ffffffff807d62e2>] ? ipt_do_table+0x222/0x670
[ 3641.713354]  [<ffffffff807aaf5b>] tcp_v4_syn_recv_sock+0xab/0x3d0
[ 3641.719425]  [<ffffffff807af63a>] tcp_check_req+0x3da/0x530
[ 3641.724979]  [<ffffffff8078b400>] ? inet_hashinfo_init+0x60/0x80
[ 3641.730964]  [<ffffffff807ade6f>] ? tcp_v4_rcv+0x79f/0xbe0
[ 3641.736430]  [<ffffffff807ab9bd>] tcp_v4_do_rcv+0x38d/0x4f0
[ 3641.741985]  [<ffffffff807ae14a>] tcp_v4_rcv+0xa7a/0xbe0

Its safe at this point because the parent socket owns a reference
on the netstamp_needed, so we cant have a 0 -> 1 transition, which
requires to lock a mutex.

Instead of refining the check, lets remove it, as all known callers
are safe. If it ever changes in the future, static_key_slow_inc()
will complain anyway.

Reported-by: Laurent Chavey <chavey@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-24 17:27:27 -04:00
David S. Miller
61816596d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull in the 'net' tree to get Daniel Borkmann's flow dissector
infrastructure change.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20 12:46:26 -04:00
Kusanagi Kouichi
166ec36968 net: Fix a comment typo
Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-18 13:06:09 -04:00
Li RongQing
c80a8512ee net/core: move vlan_depth out of while loop in skb_network_protocol()
[ Bug added added in commit 05e8ef4ab2 (net: factor out
  skb_mac_gso_segment() from skb_gso_segment() ) ]

move vlan_depth out of while loop, or else vlan_depth always is ETH_HLEN,
can not be increased, and lead to infinite loop when frame has two vlan headers.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12 11:47:40 -04:00
David S. Miller
e5f2ef7ab4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/e1000e/netdev.c

Minor conflict in e1000e, a line that got fixed in 'net'
has been removed in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12 05:52:22 -04:00
Pravin B Shelar
ee579677c2 tunnel: Inherit NETIF_F_SG for hw_enc_features.
Inherit scatergather feature for tunnel devices to avoid
copy for TSO packets of tunneling device like GRE.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-09 16:08:57 -05:00
Pravin B Shelar
ec5f061564 net: Kill link between CSUM and SG features.
Earlier SG was unset if CSUM was not available for given device to
force skb copy to avoid sending inconsistent csum.
Commit c9af6db4c1 (net: Fix possible wrong checksum generation)
added explicit flag to force copy to fix this issue.  Therefore
there is no need to link SG and CSUM, following patch kills this
link between there two features.

This patch is also required following patch in series.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-09 16:08:57 -05:00
Cristian Bercaru
3bc1b1add7 bridging: fix rx_handlers return code
The frames for which rx_handlers return RX_HANDLER_CONSUMED are no longer
counted as dropped. They are counted as successfully received by
'netif_receive_skb'.

This allows network interface drivers to correctly update their RX-OK and
RX-DRP counters based on the result of 'netif_receive_skb'.

Signed-off-by: Cristian Bercaru <B43982@freescale.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-08 12:19:59 -05:00
Eric Dumazet
d1f41b67ff net: reduce net_rx_action() latency to 2 HZ
We should use time_after_eq() to get maximum latency of two ticks,
instead of three.

Bug added in commit 24f8b2385 (net: increase receive packet quantum)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-06 02:47:06 -05:00
Randy Dunlap
691b3b7e13 net: fix new kernel-doc warnings in net core
Fix new kernel-doc warnings in net/core/dev.c:

Warning(net/core/dev.c:4788): No description found for parameter 'new_carrier'
Warning(net/core/dev.c:4788): Excess function parameter 'new_carries' description in 'dev_change_carrier'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-06 02:47:06 -05:00
Eric Dumazet
82dc3c63c6 net: introduce NAPI_POLL_WEIGHT
Some drivers use a too big NAPI poll weight.

This patch adds a NAPI_POLL_WEIGHT default value
and issues an error message if a driver attempts
to use a bigger weight.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-05 23:40:01 -05:00
Sasha Levin
b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Alexander Duyck
2bb60cb9b7 net: Fix locking bug in netif_set_xps_queue
Smatch found a locking bug in netif_set_xps_queue in which we were not
releasing the lock in the case of an allocation failure.

This change corrects that so that we release the xps_map_mutex before
returning -ENOMEM in the case of an allocation failure.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-22 15:10:19 -05:00
Cong Wang
cd0615746b net: fix a build failure when !CONFIG_PROC_FS
When !CONFIG_PROC_FS dev_mcast_init() is not defined,
actually we can just merge dev_mcast_init() into
dev_proc_init().

Reported-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-19 13:18:13 -05:00
Cong Wang
900ff8c632 net: move procfs code to net/core/net-procfs.c
Similar to net/core/net-sysfs.c, group procfs code to
a single unit.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-19 00:51:10 -05:00
Gao feng
ece31ffd53 net: proc: change proc_net_remove to remove_proc_entry
proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.

this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
Gao feng
d4beaa66ad net: proc: change proc_net_fops_create to proc_create
Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.

It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
Cong Wang
96b45cbd95 net: move ioctl functions into a separated file
They well deserve a separated unit.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 12:27:32 -05:00
Eric Dumazet
efd9450e7e net: use skb_reset_mac_len() in dev_gro_receive()
We no longer need to use mac_len, lets cleanup things.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-15 15:36:39 -05:00
Pravin B Shelar
68c3316311 v4 GRE: Add TCP segmentation offload for GRE
Following patch adds GRE protocol offload handler so that
skb_gso_segment() can segment GRE packets.
SKB GSO CB is added to keep track of total header length so that
skb_segment can push entire header. e.g. in case of GRE, skb_segment
need to push inner and outer headers to every segment.
New NETIF_F_GRE_GSO feature is added for devices which support HW
GRE TSO offload. Currently none of devices support it therefore GRE GSO
always fall backs to software GSO.

[ Compute pkt_len before ip_local_out() invocation. -DaveM ]

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-15 15:17:11 -05:00
Pravin B Shelar
05e8ef4ab2 net: factor out skb_mac_gso_segment() from skb_gso_segment()
This function will be used in next GRE_GSO patch. This patch does
not change any functionality. It only exports skb_mac_gso_segment()
function.

[ Use skb_reset_mac_len() -DaveM ]

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-15 15:16:03 -05:00
David S. Miller
9754e29349 net: Don't write to current task flags on every packet received.
Even for non-pfmalloc SKBs, __netif_receive_skb() will do a
tsk_restore_flags() on current unconditionally.

Make __netif_receive_skb() a shim around the existing code, renamed to
__netif_receive_skb_core().  Let __netif_receive_skb() wrap the
__netif_receive_skb_core() call with the task flag modifications, if
necessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-14 15:57:38 -05:00
Eric Dumazet
6d1ccff627 net: reset mac header in dev_start_xmit()
On 64 bit arches :

There is a off-by-one error in qdisc_pkt_len_init() because
mac_header is not set in xmit path.

skb_mac_header() returns an out of bound value that was
harmless because hdr_len is an 'unsigned int'

On 32bit arches, the error is abysmal.

This patch is also a prereq for "macvlan: add multicast filter"

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-06 15:59:47 -05:00
Cong Wang
12b0004d1d net: adjust skb_gso_segment() for calling in rx path
skb_gso_segment() is almost always called in tx path,
except for openvswitch. It calls this function when
it receives the packet and tries to queue it to user-space.
In this special case, the ->ip_summed check inside
skb_gso_segment() is no longer true, as ->ip_summed value
has different meanings on rx path.

This patch adjusts skb_gso_segment() so that we can at least
avoid such warnings on checksum.

Cc: Jesse Gross <jesse@nicira.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-06 15:58:00 -05:00
Neil Horman
ca99ca14c9 netpoll: protect napi_poll and poll_controller during dev_[open|close]
Ivan Vercera was recently backporting commit
9c13cb8bb4 to a RHEL kernel, and I noticed that,
while this patch protects the tg3 driver from having its ndo_poll_controller
routine called during device initalization, it does nothing for the driver
during shutdown. I.e. it would be entirely possible to have the
ndo_poll_controller method (or subsequently the ndo_poll) routine called for a
driver in the netpoll path on CPU A while in parallel on CPU B, the ndo_close or
ndo_open routine could be called.  Given that the two latter routines tend to
initizlize and free many data structures that the former two rely on, the result
can easily be data corruption or various other crashes.  Furthermore, it seems
that this is potentially a problem with all net drivers that support netpoll,
and so this should ideally be fixed in a common path.

As Ben H Pointed out to me, we can't preform dev_open/dev_close in atomic
context, so I've come up with this solution.  We can use a mutex to sleep in
open/close paths and just do a mutex_trylock in the napi poll path and abandon
the poll attempt if we're locked, as we'll just retry the poll on the next send
anyway.

I've tested this here by flooding netconsole with messages on a system whos nic
driver I modfied to periodically return NETDEV_TX_BUSY, so that the netpoll tx
workqueue would be forced to send frames and poll the device.  While this was
going on I rapidly ifdown/up'ed the interface and watched for any problems.
I've not found any.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Ivan Vecera <ivecera@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Francois Romieu <romieu@fr.zoreil.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-06 15:45:03 -05:00
Joe Perches
62b5942aa5 net: core: Remove unnecessary alloc/OOM messages
alloc failures already get standardized OOM
messages and a dump_stack.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-06 14:58:52 -05:00
Michał Mirosław
d2ed273d30 net: disallow drivers with buggy VLAN accel to register_netdevice()
Instead of jumping aroung bugs that are easily fixed just don't let them in:
affected drivers should be either fixed or have NETIF_F_HW_VLAN_FILTER
removed from advertised features.

Quick grep in drivers/net shows two drivers that have NETIF_F_HW_VLAN_FILTER
but not ndo_vlan_rx_add/kill_vid(), but those are false-positives (features
are commented out).

OTOH two drivers have ndo_vlan_rx_add/kill_vid() implemented but don't
advertise NETIF_F_HW_VLAN_FILTER. Those are:

+ethernet/cisco/enic/enic_main.c
+ethernet/qlogic/qlcnic/qlcnic_main.c

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 22:58:40 -05:00
Eric Dumazet
cef401de7b net: fix possible wrong checksum generation
Pravin Shelar mentioned that GSO could potentially generate
wrong TX checksum if skb has fragments that are overwritten
by the user between the checksum computation and transmit.

He suggested to linearize skbs but this extra copy can be
avoided for normal tcp skbs cooked by tcp_sendmsg().

This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
in skb_shinfo(skb)->gso_type if at least one frag can be
modified by the user.

Typical sources of such possible overwrites are {vm}splice(),
sendfile(), and macvtap/tun/virtio_net drivers.

Tested:

$ netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
7.7.8.84 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3959.52

$ netperf -H 7.7.8.84 -t TCP_SENDFILE
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3216.80

Performance of the SENDFILE is impacted by the extra allocation and
copy, and because we use order-0 pages, while the TCP_STREAM uses
bigger pages.

Reported-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-28 00:27:15 -05:00
Cong Wang
441d9d327f net: move rx and tx hash functions to net/core/flow_dissector.c
__skb_tx_hash() and __skb_get_rxhash() are all for calculating hash
value based by some fields in skb, mostly used for selecting queues
by device drivers.

Meanwhile, net/core/dev.c is bloating.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 14:26:17 -05:00
Eric Dumazet
757b8b1d2b net_sched: fix qdisc_pkt_len_init()
commit 1def9238d4 (net_sched: more precise pkt_len computation)
does a wrong computation of mac + network headers length, as it includes
the padding before the frame.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-16 00:41:19 -05:00
David S. Miller
4b87f92259 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	Documentation/networking/ip-sysctl.txt
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c

Both conflicts were simply overlapping context.

A build fix for qlcnic is in here too, simply removing the added
devinit annotations which no longer exist.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-15 15:05:59 -05:00
Stanislaw Gruszka
d07d7507bf net, wireless: overwrite default_ethtool_ops
Since:

commit 2c60db0370
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Sep 16 09:17:26 2012 +0000

    net: provide a default dev->ethtool_ops

wireless core does not correctly assign ethtool_ops.

After alloc_netdev*() call, some cfg80211 drivers provide they own
ethtool_ops, but some do not. For them, wireless core provide generic
cfg80211_ethtool_ops, which is assigned in NETDEV_REGISTER notify call:

        if (!dev->ethtool_ops)
                dev->ethtool_ops = &cfg80211_ethtool_ops;

But after Eric's commit, dev->ethtool_ops is no longer NULL (on cfg80211
drivers without custom ethtool_ops), but points to &default_ethtool_ops.

In order to fix the problem, provide function which will overwrite
default_ethtool_ops and use it by wireless core.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-11 15:55:48 -08:00
Alexander Duyck
87696f9234 net: Export __netdev_pick_tx so that it can be used in modules
When testing with FCoE enabled we discovered that I had not exported
__netdev_pick_tx.  As a result ixgbe doesn't build with the RFC patches
applied because ixgbe_select_queue was calling the function.  This change
corrects that build issue by correctly exporting __netdev_pick_tx so it
can be used by modules.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-11 15:47:27 -08:00
Alexander Duyck
024e9679a2 net: Add support for XPS without sysfs being defined
This patch makes it so that we can support transmit packet steering without
sysfs needing to be enabled.  The reason for making this change is to make
it so that a driver can make use of the XPS even while the sysfs portion of
the interface is not present.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10 22:47:04 -08:00
Alexander Duyck
01c5f864e6 net: Rewrite netif_set_xps_queues to address several issues
This change is meant to address several issues I found within the
netif_set_xps_queues function.

If the allocation of one of the maps to be assigned to new_dev_maps failed
we could end up with the device map in an inconsistent state since we had
already worked through a number of CPUs and removed or added the queue.  To
address that I split the process into several steps.  The first of which is
just the allocation of updated maps for CPUs that will need larger maps to
store the queue.  By doing this we can fail gracefully without actually
altering the contents of the current device map.

The second issue I found was the fact that we were always allocating a new
device map even if we were not adding any queues.  I have updated the code
so that we only allocate a new device map if we are adding queues,
otherwise if we are not adding any queues to CPUs we just skip to the
removal process.

The last change I made was to reuse the code from remove_xps_queue to remove
the queue from the CPU.  By making this change we can be consistent in how
we go about adding and removing the queues from the CPUs.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10 22:47:04 -08:00
Alexander Duyck
10cdc3f3cd net: Rewrite netif_reset_xps_queue to allow for better code reuse
This patch does a minor refactor on netif_reset_xps_queue to address a few
items I noticed.

First is the fact that we are doing removal of queues in both
netif_reset_xps_queue and netif_set_xps_queue.  Since there is no need to
have the code in two places I am pushing it out into a separate function
and will come back in another patch and reuse the code in
netif_set_xps_queue.

The second item this change addresses is the fact that the Tx queues were
not getting their numa_node value cleared as a part of the XPS queue reset.
This patch resolves that by resetting the numa_node value if the dev_maps
value is set.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10 22:47:04 -08:00
Alexander Duyck
537c00de1c net: Add functions netif_reset_xps_queue and netif_set_xps_queue
This patch adds two functions, netif_reset_xps_queue and
netif_set_xps_queue.  The main idea behind these two functions is to
provide a mechanism through which drivers can update their defaults in
regards to XPS.

Currently no such mechanism exists and as a result we cannot use XPS for
things such as ATR which would require a basic configuration to start in
which the Tx queues are mapped to CPUs via a 1:1 mapping.  With this change
I am making it possible for drivers such as ixgbe to be able to use the XPS
feature by controlling the default configuration.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10 22:47:03 -08:00
Alexander Duyck
416186fbf8 net: Split core bits of netdev_pick_tx into __netdev_pick_tx
This change splits the core bits of netdev_pick_tx into a separate function.
The main idea behind this is to make this code accessible to select queue
functions when they decide to process the standard path instead of their
own custom path in their select queue routine.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10 22:47:03 -08:00
Eric Dumazet
1def9238d4 net_sched: more precise pkt_len computation
One long standing problem with TSO/GSO/GRO packets is that skb->len
doesn't represent a precise amount of bytes on wire.

Headers are only accounted for the first segment.
For TCP, thats typically 66 bytes per 1448 bytes segment missing,
an error of 4.5 % for normal MSS value.

As consequences :

1) TBF/CBQ/HTB/NETEM/... can send more bytes than the assigned limits.
2) Device stats are slightly under estimated as well.

Fix this by taking account of headers in qdisc_skb_cb(skb)->pkt_len
computation.

Packet schedulers should use qdisc pkt_len instead of skb->len for their
bandwidth limitations, and TSO enabled devices drivers could use pkt_len
if their statistics are not hardware assisted, and if they don't scratch
skb->cb[] first word.

Both egress and ingress paths work, thanks to commit fda55eca5a
(net: introduce skb_transport_header_was_set()) : If GRO built
a GSO packet, it also set the transport header for us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Paolo Valente <paolo.valente@unimore.it>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10 14:58:13 -08:00
Jiri Pirko
948b337e62 net: init perm_addr in register_netdevice()
Benefit from the fact that dev->addr_assign_type is set to NET_ADDR_PERM
in case the device has permanent address.

This also fixes the problem that many drivers do not set perm_addr at
all.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-08 18:00:47 -08:00
Eric Dumazet
fda55eca5a net: introduce skb_transport_header_was_set()
We have skb_mac_header_was_set() helper to tell if mac_header
was set on a skb. We would like the same for transport_header.

__netif_receive_skb() doesn't reset the transport header if already
set by GRO layer.

Note that network stacks usually reset the transport header anyway,
after pulling the network header, so this change only allows
a followup patch to have more precise qdisc pkt_len computation
for GSO packets at ingress side.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-08 17:51:54 -08:00
Jiri Pirko
8b98a70c28 net: remove no longer used netdev_set_bond_master() and netdev_set_master()
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-04 13:31:50 -08:00
Jiri Pirko
9ff162a8b9 net: introduce upper device lists
This lists are supposed to serve for storing pointers to all upper devices.
Eventually it will replace dev->master pointer which is used for
bonding, bridge, team but it cannot be used for vlan, macvlan where
there might be multiple upper present. In case the upper link is
replacement for dev->master, it is marked with "master" flag.

New upper device list resolves this limitation. Also, the information
stored in lists is used for preventing looping setups like
"bond->somethingelse->samebond"

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-04 13:31:49 -08:00
Jiri Pirko
fbdeca2d77 net: add address assign type "SET"
This is the way to indicate that mac address of a device has been set by
dev_set_mac_address()

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-03 22:37:36 -08:00
Jiri Pirko
f652151640 net: call add_device_randomness() only after successful mac change
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-03 22:37:35 -08:00
Jiri Pirko
4bf84c35c6 net: add change_carrier netdev op
This allows a driver to register change_carrier callback which will be
called whenever user will like to change carrier state. This is useful
for devices like dummy, gre, team and so on.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-28 15:24:18 -08:00
Eric Dumazet
30e6c9fa93 net: devnet_rename_seq should be a seqcount
Using a seqlock for devnet_rename_seq is not a good idea,
as device_rename() can sleep.

As we hold RTNL, we dont need a protection for writers,
and only need a seqcount so that readers can catch a change done
by a writer.

Bug added in commit c91f6df2db (sockopt: Change getsockopt() of
SO_BINDTODEVICE to return an interface name)

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-21 13:14:01 -08:00
Linus Torvalds
6be35c700f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller:

1) Allow to dump, monitor, and change the bridge multicast database
   using netlink.  From Cong Wang.

2) RFC 5961 TCP blind data injection attack mitigation, from Eric
   Dumazet.

3) Networking user namespace support from Eric W. Biederman.

4) tuntap/virtio-net multiqueue support by Jason Wang.

5) Support for checksum offload of encapsulated packets (basically,
   tunneled traffic can still be checksummed by HW).  From Joseph
   Gasparakis.

6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
   Daniel Borkmann.

7) Bridge port parameters over netlink and BPDU blocking support
   from Stephen Hemminger.

8) Improve data access patterns during inet socket demux by rearranging
   socket layout, from Eric Dumazet.

9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
   Jon Maloy.

10) Update TCP socket hash sizing to be more in line with current day
    realities.  The existing heurstics were choosen a decade ago.
    From Eric Dumazet.

11) Fix races, queue bloat, and excessive wakeups in ATM and
    associated drivers, from Krzysztof Mazur and David Woodhouse.

12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
    in VXLAN driver, from David Stevens.

13) Add "oops_only" mode to netconsole, from Amerigo Wang.

14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
    allow DCB netlink to work on namespaces other than the initial
    namespace.  From John Fastabend.

15) Support PTP in the Tigon3 driver, from Matt Carlson.

16) tun/vhost zero copy fixes and improvements, plus turn it on
    by default, from Michael S. Tsirkin.

17) Support per-association statistics in SCTP, from Michele
    Baldessari.

And many, many, driver updates, cleanups, and improvements.  Too
numerous to mention individually.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
  net/mlx4_en: Add support for destination MAC in steering rules
  net/mlx4_en: Use generic etherdevice.h functions.
  net: ethtool: Add destination MAC address to flow steering API
  bridge: add support of adding and deleting mdb entries
  bridge: notify mdb changes via netlink
  ndisc: Unexport ndisc_{build,send}_skb().
  uapi: add missing netconf.h to export list
  pkt_sched: avoid requeues if possible
  solos-pci: fix double-free of TX skb in DMA mode
  bnx2: Fix accidental reversions.
  bna: Driver Version Updated to 3.1.2.1
  bna: Firmware update
  bna: Add RX State
  bna: Rx Page Based Allocation
  bna: TX Intr Coalescing Fix
  bna: Tx and Rx Optimizations
  bna: Code Cleanup and Enhancements
  ath9k: check pdata variable before dereferencing it
  ath5k: RX timestamp is reported at end of frame
  ath9k_htc: RX timestamp is reported at end of frame
  ...
2012-12-12 18:07:07 -08:00
Eric Dumazet
89c5fa3369 net: gro: dev_gro_receive() cleanup
__napi_gro_receive() is inlined from two call sites for no good reason.

Lets move the prep stuff in a function of its own, called only if/when
needed. This saves 300 bytes on x86 :

# size net/core/dev.o.after net/core/dev.o.before
   text	   data	    bss	    dec	    hex	filename
  51968	   1238	   1040	  54246	   d3e6	net/core/dev.o.before
  51664	   1238	   1040	  53942	   d2b6	net/core/dev.o.after

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11 12:49:53 -05:00
Alexander Duyck
fc70fb640b net: Handle encapsulated offloads before fragmentation or handing to lower dev
This change allows the VXLAN to enable Tx checksum offloading even on
devices that do not support encapsulated checksum offloads. The
advantage to this is that it allows for the lower device to change due
to routing table changes without impacting features on the VXLAN itself.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-09 00:20:28 -05:00
Eric Dumazet
c3c7c254b2 net: gro: fix possible panic in skb_gro_receive()
commit 2e71a6f808 (net: gro: selective flush of packets) added
a bug for skbs using frag_list. This part of the GRO stack is rarely
used, as it needs skb not using a page fragment for their skb->head.

Most drivers do use a page fragment, but some of them use GFP_KERNEL
allocations for the initial fill of their RX ring buffer.

napi_gro_flush() overwrite skb->prev that was used for these skb to
point to the last skb in frag_list.

Fix this using a separate field in struct napi_gro_cb to point to the
last fragment.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-07 14:39:29 -05:00
Jiri Pirko
e3d8fabee3 net: call notifiers for mtu change even if iface is not up
Do the same thing as in set mac. Call notifiers every time.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-07 12:22:30 -05:00
Serge Hallyn
4e66ae2ea3 net: dev_change_net_namespace: send a KOBJ_REMOVED/KOBJ_ADD
When a new nic is created in namespace ns1, the kernel sends a KOBJ_ADD uevent
to ns1.  When the nic is moved to ns2, we only send a KOBJ_MOVE to ns2, and
nothing to ns1.

This patch changes that behavior so that when moving a nic from ns1 to ns2, we
send a KOBJ_REMOVED to ns1 and KOBJ_ADD to ns2.  (The KOBJ_MOVE is still
sent to ns2).

The effects of this can be seen when starting and stopping containers in
an upstart based host.  Lxc will create a pair of veth nics, the kernel
sends KOBJ_ADD, and upstart starts network-instance jobs for each.  When
one nic is moved to the container, because no KOBJ_REMOVED event is
received, the network-instance job for that veth never goes away.  This
was reported at https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1065589
With this patch the networ-instance jobs properly go away.

The other oddness solved here is that if a nic is passed into a running
upstart-based container, without this patch no network-instance job is
started in the container.  But when the container creates a new nic
itself (ip link add new type veth) then network-interface jobs are
created.  With this patch, behavior comes in line with a regular host.

v2: also send KOBJ_ADD to new netns.  There will then be a
_MOVE event from the device_rename() call, but that should
be innocuous.

Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04 13:25:57 -05:00
Rami Rosen
bb728820fe core: make GRO methods static.
This patch changes three methods to be static and removes their
EXPORT_SYMBOLs in core/dev.c and their external declaration in
netdevice.h. The methods, dev_gro_receive(), napi_frags_finish() and
napi_skb_finish(), which are in the GRO rx path, are not used
outside core/dev.c.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-29 13:18:32 -05:00
Brian Haley
c91f6df2db sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name
Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
will then require another call like if_indextoname() to get the actual interface
name, have it return the name directly.

This also matches the existing man page description on socket(7) which mentions
the argument being an interface name.

If the value has not been set, zero is returned and optlen will be set to zero
to indicate there is no interface name present.

Added a seqlock to protect this code path, and dev_ifname(), from someone
changing the device name via dev_change_name().

v2: Added seqlock protection while copying device name.

v3: Fixed word wrap in patch.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-26 17:22:14 -05:00
Sachin Kamat
388dfc2d2d net: Remove redundant null check before kfree in dev.c
kfree on a null pointer is a no-op.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-20 13:48:09 -05:00
Eric W. Biederman
5e1fccc0bf net: Allow userns root control of the core of the network stack.
Allow an unpriviled user who has created a user namespace, and then
created a network namespace to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.

Settings that merely control a single network device are allowed.
Either the network device is a logical network device where
restrictions make no difference or the network device is hardware NIC
that has been explicity moved from the initial network namespace.

In general policy and network stack state changes are allowed
while resource control is left unchanged.

Allow ethtool ioctls.

Allow binding to network devices.
Allow setting the socket mark.
Allow setting the socket priority.

Allow setting the network device alias via sysfs.
Allow setting the mtu via sysfs.
Allow changing the network device flags via sysfs.
Allow setting the network device group via sysfs.

Allow the following network device ioctls.
SIOCGMIIPHY
SIOCGMIIREG
SIOCSIFNAME
SIOCSIFFLAGS
SIOCSIFMETRIC
SIOCSIFMTU
SIOCSIFHWADDR
SIOCSIFSLAVE
SIOCADDMULTI
SIOCDELMULTI
SIOCSIFHWBROADCAST
SIOCSMIIREG
SIOCBONDENSLAVE
SIOCBONDRELEASE
SIOCBONDSETHWADDR
SIOCBONDCHANGEACTIVE
SIOCBRADDIF
SIOCBRDELIF
SIOCSHWTSTAMP

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18 20:32:45 -05:00
David S. Miller
67f4efdce7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor line offset auto-merges.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-17 22:00:43 -05:00
Tom Herbert
baefa31db2 net-rps: Fix brokeness causing OOO packets
In commit c445477d74 which adds aRFS to the kernel, the CPU
selected for RFS is not set correctly when CPU is changing.
This is causing OOO packets and probably other issues.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-16 14:35:56 -05:00
Eric Dumazet
c53aa5058a net: use right lock in __dev_remove_offload
offload_base is protected by offload_lock, not ptype_lock

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-16 13:41:08 -05:00
Vlad Yasevich
f191a1d17f net: Remove code duplication between offload structures
Move the offload callbacks into its own structure.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:39:51 -05:00
Vlad Yasevich
22061d8014 net: Switch to using the new packet offload infrustructure
Convert to using the new GSO/GRO registration mechanism and new
packet offload structure.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:36:17 -05:00
Vlad Yasevich
62532da9d5 net: Add generic packet offload infrastructure.
Create a new data structure to contain the GRO/GSO callbacks and add
a new registration mechanism.

Singed-off-by: Vlad Yasevich <vyasevic@redhat.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:36:16 -05:00
David S. Miller
d4185bbf62 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c

Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net.  Based upon a conflict resolution
patch posted by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-10 18:32:51 -05:00
Eric Leblond
a3d744e995 af-packet: fix oops when socket is not present
Due to a NULL dereference, the following patch is causing oops
in normal trafic condition:

commit c0de08d042
Author: Eric Leblond <eric@regit.org>
Date:   Thu Aug 16 22:02:58 2012 +0000

    af_packet: don't emit packet on orig fanout group

This buggy patch was a feature fix and has reached most stable
branches.

When skb->sk is NULL and when packet fanout is used, there is a
crash in match_fanout_group where skb->sk is accessed.
This patch fixes the issue by returning false as soon as the
socket is NULL: this correspond to the wanted behavior because
the kernel as to resend the skb to all the listening socket in
this case.

Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-07 15:40:14 -05:00
Rami Rosen
47b70db555 net:dev: remove double indentical assignment in dev_change_net_namespace().
This patch removes double assignment of err to -EINVAL in dev_change_net_namespace().

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-21 20:44:39 -04:00
Florian Zumbiehl
48cc32d38a vlan: don't deliver frames for unknown vlans to protocols
6a32e4f9dd made the vlan code skip marking
vlan-tagged frames for not locally configured vlans as PACKET_OTHERHOST if
there was an rx_handler, as the rx_handler could cause the frame to be received
on a different (virtual) vlan-capable interface where that vlan might be
configured.

As rx_handlers do not necessarily return RX_HANDLER_ANOTHER, this could cause
frames for unknown vlans to be delivered to the protocol stack as if they had
been received untagged.

For example, if an ipv6 router advertisement that's tagged for a locally not
configured vlan is received on an interface with macvlan interfaces attached,
macvlan's rx_handler returns RX_HANDLER_PASS after delivering the frame to the
macvlan interfaces, which caused it to be passed to the protocol stack, leading
to ipv6 addresses for the announced prefix being configured even though those
are completely unusable on the underlying interface.

The fix moves marking as PACKET_OTHERHOST after the rx_handler so the
rx_handler, if there is one, sees the frame unchanged, but afterwards,
before the frame is delivered to the protocol stack, it gets marked whether
there is an rx_handler or not.

Signed-off-by: Florian Zumbiehl <florz@florz.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08 15:21:55 -04:00
Eric Dumazet
2e71a6f808 net: gro: selective flush of packets
Current GRO can hold packets in gro_list for almost unlimited
time, in case napi->poll() handler consumes its budget over and over.

In this case, napi_complete()/napi_gro_flush() are not called.

Another problem is that gro_list is flushed in non friendly way :
We scan the list and complete packets in the reverse order.
(youngest packets first, oldest packets last)
This defeats priorities that sender could have cooked.

Since GRO currently only store TCP packets, we dont really notice the
bug because of retransmits, but this behavior can add unexpected
latencies, particularly on mice flows clamped by elephant flows.

This patch makes sure no packet can stay more than 1 ms in queue, and
only in stress situations.

It also complete packets in the right order to minimize latencies.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08 14:51:51 -04:00
Eric Dumazet
ca07e43e28 net: gro: fix a potential crash in skb_gro_reset_offset
Before accessing skb first fragment, better make sure there
is one.

This is probably not needed for old kernels, since an ethernet frame
cannot contain only an ethernet header, but the recent GRO addition
to tunnels makes this patch needed.

Also skb_gro_reset_offset() can be static, it actually allows
compiler to inline it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-07 14:49:17 -04:00
Linus Torvalds
aecdc33e11 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller:

 1) GRE now works over ipv6, from Dmitry Kozlov.

 2) Make SCTP more network namespace aware, from Eric Biederman.

 3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

 4) Make openvswitch network namespace aware, from Pravin B Shelar.

 5) IPV6 NAT implementation, from Patrick McHardy.

 6) Server side support for TCP Fast Open, from Jerry Chu and others.

 7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

 8) Increate the loopback default MTU to 64K, from Eric Dumazet.

 9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic.  This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail.  Benefits are
    a) less segments for driver to process b) less calls to page
    allocator c) less waste of space.

    From Eric Dumazet.

11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
  hyperv: Add buffer for extended info after the RNDIS response message.
  hyperv: Report actual status in receive completion packet
  hyperv: Remove extra allocated space for recv_pkt_list elements
  hyperv: Fix page buffer handling in rndis_filter_send_request()
  hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
  hyperv: Fix the max_xfer_size in RNDIS initialization
  vxlan: put UDP socket in correct namespace
  vxlan: Depend on CONFIG_INET
  sfc: Fix the reported priorities of different filter types
  sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
  sfc: Fix loopback self-test with separate_tx_channels=1
  sfc: Fix MCDI structure field lookup
  sfc: Add parentheses around use of bitfield macro arguments
  sfc: Fix null function pointer in efx_sriov_channel_type
  vxlan: virtual extensible lan
  igmp: export symbol ip_mc_leave_group
  netlink: add attributes to fdb interface
  tg3: unconditionally select HWMON support when tg3 is enabled.
  Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
  gre: fix sparse warning
  ...
2012-10-02 13:38:27 -07:00
Linus Torvalds
437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Eric Dumazet
c9e6bc644e net: add gro_cells infrastructure
This adds a new include file (include/net/gro_cells.h), to bring GRO
(Generic Receive Offload) capability to tunnels, in a modular way.

Because tunnels receive path is lockless, and GRO adds a serialization
using a napi_struct, I chose to add an array of up to
DEFAULT_MAX_NUM_RSS_QUEUES cells, so that multi queue devices wont be
slowed down because of GRO layer.

skb_get_rx_queue() is used as selector.

In the future, we might add optional fanout capabilities, using rxhash
for example.

With help from Ben Hutchings who reminded me
netif_get_num_default_rss_queues() function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-01 17:01:46 -04:00
Linus Torvalds
06d2fe153b Driver core merge for 3.7-rc1
Here is the big driver core update for 3.7-rc1.
 
 A number of firmware_class.c updates (as you saw a month or so ago), and
 some hyper-v updates and some printk fixes as well.  All patches that
 are outside of the drivers/base area have been acked by the respective
 maintainers, and have all been in the linux-next tree for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlBp3vkACgkQMUfUDdst+ylQoACgldktGFgkCLzH+rGYthrXOC5P
 9hUAnjmOhdoHlMTL81vWTlH+BrGernym
 =khrr
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core merge from Greg Kroah-Hartman:
 "Here is the big driver core update for 3.7-rc1.

  A number of firmware_class.c updates (as you saw a month or so ago),
  and some hyper-v updates and some printk fixes as well.  All patches
  that are outside of the drivers/base area have been acked by the
  respective maintainers, and have all been in the linux-next tree for a
  while.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
  memory: tegra{20,30}-mc: Fix reading incorrect register in mc_readl()
  device.h: Add missing inline to #ifndef CONFIG_PRINTK dev_vprintk_emit
  memory: emif: Add ifdef CONFIG_DEBUG_FS guard for emif_debugfs_[init|exit]
  Documentation: Fixes some translation error in Documentation/zh_CN/gpio.txt
  Documentation: Remove 3 byte redundant code at the head of the Documentation/zh_CN/arm/booting
  Documentation: Chinese translation of Documentation/video4linux/omap3isp.txt
  device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
  dev: Add dev_vprintk_emit and dev_printk_emit
  netdev_printk/netif_printk: Remove a superfluous logging colon
  netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
  dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack
  driver-core: Shut up dev_dbg_reatelimited() without DEBUG
  tools/hv: Parse /etc/os-release
  tools/hv: Check for read/write errors
  tools/hv: Fix exit() error code
  tools/hv: Fix file handle leak
  Tools: hv: Implement the KVP verb - KVP_OP_GET_IP_INFO
  Tools: hv: Rename the function kvp_get_ip_address()
  Tools: hv: Implement the KVP verb - KVP_OP_SET_IP_INFO
  Tools: hv: Add an example script to configure an interface
  ...
2012-10-01 12:10:44 -07:00
David S. Miller
6a06e5e1bb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/team/team.c
	drivers/net/usb/qmi_wwan.c
	net/batman-adv/bat_iv_ogm.c
	net/ipv4/fib_frontend.c
	net/ipv4/route.c
	net/l2tp/l2tp_netlink.c

The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.

qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

With help from Antonio Quartulli.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-28 14:40:49 -04:00
Ed Cashin
c0d680e577 net: do not disable sg for packets requiring no checksum
A change in a series of VLAN-related changes appears to have
inadvertently disabled the use of the scatter gather feature of
network cards for transmission of non-IP ethernet protocols like ATA
over Ethernet (AoE).  Below is a reference to the commit that
introduces a "harmonize_features" function that turns off scatter
gather when the NIC does not support hardware checksumming for the
ethernet protocol of an sk buff.

  commit f01a5236bd
  Author: Jesse Gross <jesse@nicira.com>
  Date:   Sun Jan 9 06:23:31 2011 +0000

      net offloading: Generalize netif_get_vlan_features().

The can_checksum_protocol function is not equipped to consider a
protocol that does not require checksumming.  Calling it for a
protocol that requires no checksum is inappropriate.

The patch below has harmonize_features call can_checksum_protocol when
the protocol needs a checksum, so that the network layer is not forced
to perform unnecessary skb linearization on the transmission of AoE
packets.  Unnecessary linearization results in decreased performance
and increased memory pressure, as reported here:

  http://www.spinics.net/lists/linux-mm/msg15184.html

The problem has probably not been widely experienced yet, because
only recently has the kernel.org-distributed aoe driver acquired the
ability to use payloads of over a page in size, with the patchset
recently included in the mm tree:

  https://lkml.org/lkml/2012/8/28/140

The coraid.com-distributed aoe driver already could use payloads of
greater than a page in size, but its users generally do not use the
newest kernels.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-20 22:23:40 -04:00
Amerigo Wang
8c4c49df5c netpoll: call ->ndo_select_queue() in tx path
In netpoll tx path, we miss the chance of calling ->ndo_select_queue(),
thus could cause problems when bonding is involved.

This patch makes dev_pick_tx() extern (and rename it to netdev_pick_tx())
to let netpoll call it in netpoll_send_skb_on_dev().

Reported-by: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Tested-by: Sylvain Munaut <s.munaut@whatever-company.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:19:09 -04:00
Eric Dumazet
2c60db0370 net: provide a default dev->ethtool_ops
Instead of forcing device drivers to provide empty ethtool_ops or tweak
net/core/ethtool.c again, we could provide a generic ethtool_ops.

This occurred to me when I wanted to add GSO support to GRE tunnels.
ethtool -k support should be generic for all drivers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Maciej Żenczykowski <maze@google.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 15:40:15 -04:00
Gao feng
828de4f6bf net: dev: fix incorrect getting net device's name
When moving a nic from net namespace A to net namespace B,
in dev_change_net_namesapce,we call __dev_get_by_name to
decide if the netns B has the device has the same name.

if the netns B already has the same named device,we call
dev_get_valid_name to try to get a valid name for this nic in
the netns B,but net_device->nd_net still point to netns A now.

this patch fix it.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 15:37:01 -04:00
Eric Dumazet
b40863c667 net: more accurate network taps in transmit path
dev_queue_xmit_nit() should be called right before ndo_start_xmit()
calls or we might give wrong packet contents to taps users :

Packet checksum can be changed, or packet can be linearized or
segmented, and segments partially sent for the later case.

Also a memory allocation can fail and packet never really hit the
driver entry point.

Reported-by: Jamie Gloudon <jamie.gloudon@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 15:32:42 -04:00
Michael S. Tsirkin
0e698bf662 net: fix memory leak on oom with zerocopy
If orphan flags fails, we don't free the skb
on receive, which leaks the skb memory.

Return value was also wrong: netif_receive_skb
is supposed to return NET_RX_DROP, not ENOMEM.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18 16:24:00 -04:00
Eric W. Biederman
e1760bd5ff userns: Convert the audit loginuid to be a kuid
Always store audit loginuids in type kuid_t.

Print loginuids by converting them into uids in the appropriate user
namespace, and then printing the resulting uid.

Modify audit_get_loginuid to return a kuid_t.

Modify audit_set_loginuid to take a kuid_t.

Modify /proc/<pid>/loginuid on read to convert the loginuid into the
user namespace of the opener of the file.

Modify /proc/<pid>/loginud on write to convert the loginuid
rom the user namespace of the opener of the file.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Paul Moore <paul@paul-moore.com> ?
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-09-17 18:08:54 -07:00
Joe Perches
666f355f38 device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
Convert direct calls of vprintk_emit and printk_emit to the
dev_ equivalents.

Make create_syslog_header static.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-17 06:10:05 -07:00
Joe Perches
c2c5a7051c netdev_printk/netif_printk: Remove a superfluous logging colon
netdev_printk originally called dev_printk with %pV.

This style emitted the complete dev_printk header with
a colon followed by the netdev_name prefix followed
by a colon.

Now that netdev_printk does not call dev_printk, the
extra colon is superfluous.  Remove it.

Example:
old: sky2 0000:02:00.0: eth0: Link is up at 100 Mbps, full duplex, flow control both
new: sky2 0000:02:00.0 eth0: Link is up at 100 Mbps, full duplex, flow control both

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-17 06:10:05 -07:00
Joe Perches
b004ff4972 netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
A lot of stack is used in recursive printks with %pV.

Using multiple levels of %pV (a logging function with %pV
that calls another logging function with %pV) can consume
more stack than necessary.

Avoid excessive stack use by not calling dev_printk from
netdev_printk and dynamic_netdev_dbg.  Duplicate the logic
and form of dev_printk instead.

Make __netdev_printk static.
Remove EXPORT_SYMBOL(__netdev_printk)
Whitespace and brace style neatening.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-17 06:08:30 -07:00
David S. Miller
b48b63a1f6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/netfilter/nfnetlink_log.c
	net/netfilter/xt_LOG.c

Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.

Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-15 11:43:53 -04:00
Chema Gonzalez
6862234238 net: small bug on rxhash calculation
In the current rxhash calculation function, while the
sorting of the ports/addrs is coherent (you get the
same rxhash for packets sharing the same 4-tuple, in
both directions), ports and addrs are sorted
independently. This implies packets from a connection
between the same addresses but crossed ports hash to
the same rxhash.

For example, traffic between A=S:l and B=L:s is hashed
(in both directions) from {L, S, {s, l}}. The same
rxhash is obtained for packets between C=S:s and D=L:l.

This patch ensures that you either swap both addrs and ports,
or you swap none. Traffic between A and B, and traffic
between C and D, get their rxhash from different sources
({L, S, {l, s}} for A<->B, and {L, S, {s, l}} for C<->D)

The patch is co-written with Eric Dumazet <edumazet@google.com>

Signed-off-by: Chema Gonzalez <chema@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-08 18:41:48 -04:00
Rami Rosen
d1a53dfd11 net: fix documentation of skb_needs_linearize().
skb_needs_linearize() does not check highmem DMA as it does not call
illegal_highdma() anymore, so there is no need to mention highmem DMA here.

(Indeed, ~NETIF_F_SG flag, which is checked in skb_needs_linearize(), can
be set when illegal_highdma() returns true, and we are assured that
illegal_highdma() is invoked prior to skb_needs_linearize() as
skb_needs_linearize() is a static method called only once.
But ~NETIF_F_SG can be set not only there in this same invocation path.
It can also be set when can_checksum_protocol() returns false).

see commit 02932ce9e2,
Convert skb_need_linearize() to use precomputed features.
Signed-off-by: Rami Rosen <rosenr@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 16:24:02 -04:00
Gao feng
6549dd43c0 net: dev: fix the incorrect hold of net namespace's lo device
When moving a net device from one net namespace to another
net namespace,dev_change_net_namespace calls NETDEV_DOWN
event,so the original net namespace's dst entries which
beloned to this net device will be put into dst_garbage
list.

then dev_change_net_namespace will set this net device's
net to the new net namespace.

If we unregister this net device's driver, this will trigger
the NETDEV_UNREGISTER_FINAL event, dst_ifdown will be called,
and get this net device's dst entries from dst_garbage list,
put these entries' dev to the new net namespace's lo device.

It's not what we want,actually we need these dst entries hold
the original net namespace's lo device,this incorrect device
holding will trigger emg message like below.
unregister_netdevice: waiting for lo to become free. Usage count = 1

so we should call NETDEV_UNREGISTER_FINAL event in
dev_change_net_namespace too,in order to make sure dst entries
already in the dst_garbage list, we need rcu_barrier before we
call NETDEV_UNREGISTER_FINAL event.

With help form Eric Dumazet.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-30 12:21:16 -04:00
David S. Miller
e6acb38480 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
This is an initial merge in of Eric Biederman's work to start adding
user namespace support to the networking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24 18:54:37 -04:00
Ben Hutchings
8f4cccbbd9 net: Set device operstate at registration time
The operstate of a device is initially IF_OPER_UNKNOWN and is updated
asynchronously by linkwatch after each change of carrier state
reported by the driver.  The default carrier state of a net device is
on, and this will never be changed on drivers that do not support
carrier detection, thus the operstate remains IF_OPER_UNKNOWN.

For devices that do support carrier detection, the driver must set the
carrier state to off initially, then poll the hardware state when the
device is opened.  However, we must not activate linkwatch for a
unregistered device, and commit b473001 ('net: Do not fire linkwatch
events until the device is registered.') ensured that we don't.  But
this means that the operstate for many devices that support carrier
detection remains IF_OPER_UNKNOWN when it should be IF_OPER_DOWN.

The same issue exists with the dormant state.

The proper initialisation sequence, avoiding a race with opening of
the device, is:

        rtnl_lock();
        rc = register_netdevice(dev);
        if (rc)
                goto out_unlock;
        netif_carrier_off(dev); /* or netif_dormant_on(dev) */
        rtnl_unlock();

but it seems silly that this should have to be repeated in so many
drivers.  Further, the operstate seen immediately after opening the
device may still be IF_OPER_UNKNOWN due to the asynchronous nature of
linkwatch.

Commit 22604c8 ('net: Fix for initial link state in 2.6.28') attempted
to fix this by setting the operstate synchronously, but it was
reverted as it could lead to deadlock.

This initialises the operstate synchronously at registration time
only.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24 12:46:13 -04:00
Eric Dumazet
748e2d9396 net: reinstate rtnl in call_netdevice_notifiers()
Eric Biederman pointed out that not holding RTNL while calling
call_netdevice_notifiers() was racy.

This patch is a direct transcription his feedback
against commit 0115e8e30d (net: remove delay at device dismantle)

Thanks Eric !

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-23 09:24:42 -07:00
Eric Dumazet
0115e8e30d net: remove delay at device dismantle
I noticed extra one second delay in device dismantle, tracked down to
a call to dst_dev_event() while some call_rcu() are still in RCU queues.

These call_rcu() were posted by rt_free(struct rtable *rt) calls.

We then wait a little (but one second) in netdev_wait_allrefs() before
kicking again NETDEV_UNREGISTER.

As the call_rcu() are now completed, dst_dev_event() can do the needed
device swap on busy dst.

To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
after a rcu_barrier(), but outside of RTNL lock.

Use NETDEV_UNREGISTER_FINAL with care !

Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL

Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after
IP cache removal.

With help from Gao feng

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22 21:50:36 -07:00
David S. Miller
1304a7343b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-08-22 14:21:38 -07:00
Randy Dunlap
3de7a37b02 net/core/dev.c: fix kernel-doc warning
Fix kernel-doc warning:

Warning(net/core/dev.c:5745): No description found for parameter 'dev'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc:	"David S. Miller" <davem@davemloft.net>
Cc:	netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 03:00:55 -07:00
Eric Leblond
c0de08d042 af_packet: don't emit packet on orig fanout group
If a packet is emitted on one socket in one group of fanout sockets,
it is transmitted again. It is thus read again on one of the sockets
of the fanout group. This result in a loop for software which
generate packets when receiving one.
This retransmission is not the intended behavior: a fanout group
must behave like a single socket. The packet should not be
transmitted on a socket if it originates from a socket belonging
to the same fanout group.

This patch fixes the issue by changing the transmission check to
take fanout group info account.

Reported-by: Aleksandr Kotov <a1k@mail.ru>
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 02:37:29 -07:00
Eric W. Biederman
d04a48b06d userns: Convert __dev_set_promiscuity to use kuids in audit logs
Cc: Klaus Heinrich Kiwi <klausk@br.ibm.com>
Cc: Eric Paris <eparis@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-08-14 21:42:40 -07:00
Amerigo Wang
b7bc2a5b5b net: remove netdev_bonding_change()
I don't see any benifits to use netdev_bonding_change() than
using call_netdevice_notifiers() directly.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:28:24 -07:00
Amerigo Wang
ee89bab14e net: move and rename netif_notify_peers()
I believe net/core/dev.c is a better place for netif_notify_peers(),
because other net event notify functions also stay in this file.

And rename it to netdev_notify_peers().

Cc: David S. Miller <davem@davemloft.net>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:28:23 -07:00
Pavel Emelyanov
aa79e66eee net: Make ifindex generation per-net namespace
Strictly speaking this is only _really_ required for checkpoint-restore to
make loopback device always have the same index.

This change appears to be safe wrt "ifindex should be unique per-system"
concept, as all the ifindex usage is either already made per net namespace
of is explicitly limited with init_net only.

There are two cool side effects of this. The first one -- ifindices of
devices in container are always small, regardless of how many containers
we've started (and re-started) so far. The second one is -- we can speed
up the loopback ifidex access as shown in the next patch.

v2: Place ifindex right after dev_base_seq : avoid two holes and use the
    same cache line, dirtied in list_netdevice()/unlist_netdevice()

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-09 16:18:07 -07:00
Pavel Emelyanov
9c7dafbfab net: Allow to create links with given ifindex
Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
is not zero. I propose to allow requesting ifindices on link creation. This
is required by the checkpoint-restore to correctly restore a net namespace
(i.e. -- a container).

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-09 16:18:06 -07:00
Alexey Khoroshilov
7364e445f6 net/core: Fix potential memory leak in dev_set_alias()
Do not leak memory by updating pointer with potentially NULL realloc return value.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-08 16:06:23 -07:00
Ben Hutchings
30b678d844 net: Allow driver to limit number of GSO segments per skb
A peer (or local user) may cause TCP to use a nominal MSS of as little
as 88 (actual MSS of 76 with timestamps).  Given that we have a
sufficiently prodigious local sender and the peer ACKs quickly enough,
it is nevertheless possible to grow the window for such a connection
to the point that we will try to send just under 64K at once.  This
results in a single skb that expands to 861 segments.

In some drivers with TSO support, such an skb will require hundreds of
DMA descriptors; a substantial fraction of a TX ring or even more than
a full ring.  The TX queue selected for the skb may stall and trigger
the TX watchdog repeatedly (since the problem skb will be retried
after the TX reset).  This particularly affects sfc, for which the
issue is designated as CVE-2012-3412.

Therefore:
1. Add the field net_device::gso_max_segs holding the device-specific
   limit.
2. In netif_skb_features(), if the number of segments is too high then
   mask out GSO features to force fall back to software GSO.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-02 00:19:17 -07:00
Linus Torvalds
ac694dbdbc Merge branch 'akpm' (Andrew's patch-bomb)
Merge Andrew's second set of patches:
 - MM
 - a few random fixes
 - a couple of RTC leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
  rtc/rtc-88pm80x: remove unneed devm_kfree
  rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
  mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
  tmpfs: distribute interleave better across nodes
  mm: remove redundant initialization
  mm: warn if pg_data_t isn't initialized with zero
  mips: zero out pg_data_t when it's allocated
  memcg: gix memory accounting scalability in shrink_page_list
  mm/sparse: remove index_init_lock
  mm/sparse: more checks on mem_section number
  mm/sparse: optimize sparse_index_alloc
  memcg: add mem_cgroup_from_css() helper
  memcg: further prevent OOM with too many dirty pages
  memcg: prevent OOM with too many dirty pages
  mm: mmu_notifier: fix freed page still mapped in secondary MMU
  mm: memcg: only check anon swapin page charges for swap cache
  mm: memcg: only check swap cache pages for repeated charging
  mm: memcg: split swapin charge function into private and public part
  mm: memcg: remove needless !mm fixup to init_mm when charging
  mm: memcg: remove unneeded shmem charge type
  ...
2012-07-31 19:25:39 -07:00
Linus Torvalds
3e9a97082f This patch series contains a major revamp of how we collect entropy
from interrupts for /dev/random and /dev/urandom.  The goal is to
 addresses weaknesses discussed in the paper "Mining your Ps and Qs:
 Detection of Widespread Weak Keys in Network Devices", by Nadia
 Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman, which will
 be published in the Proceedings of the 21st Usenix Security Symposium,
 August 2012.  (See https://factorable.net for more information and an
 extended version of the paper.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQF/0DAAoJENNvdpvBGATwIowQAOep9QKtLrBvb2lwIRVmeiy8
 lRf7V/tYZnz4FePbR0W92JQfKYkCV8yyOO0bmeRzWL3v4m+lRwDTSyA1DDyQMoH+
 LOMzvDKSLJMSXTXdSOIr1WYACphViCR/9CrbMBCKSkYfZLJ1MdaEDxT3rcpTGD0T
 6iknUweiSkHHhkerU5yQL7FKzD5kYUe0hsF47w7QVlHRHJsW2fsZqkFoh+RpnhNw
 03u+djxNGBo9qV81vZ9D1b0vA9uRlEjoWOOEG2XE4M2iq6TUySueA72dQnCwunfi
 3kG/u1Swv2dgq6aRrP3H7zdwhYSourGxziu3jNhEKwKEohrxYY7xjNX3RVeTqP67
 AzlKsOTWpRLIDrzjSLlb8VxRQiZewu8Unex3e1G+eo20sbcIObHGrxNp7K00zZvd
 QZiMHhOwItwFTe4lBO+XbqH2JKbL9/uJmwh5EipMpQTraKO9E6N3CJiUHjzBLo2K
 iGDZxRMKf4gVJRwDxbbP6D70JPVu8ZJ09XVIpsXQ3Z1xNqaMF0QdCmP3ty56q1o0
 NvkSXxPKrijZs8Sk0rVDqnJ3ll8PuDnXMv5eDtL42VT818I5WxESn9djjwEanGv0
 TYxbFub/NRxmPEE5B2Js5FBpqsLf5f282OSMeS/5WLBbnHJR1OoPoAhGVpHvxntC
 bi5FC1OolqhvzVIdsqgt
 =u7KM
 -----END PGP SIGNATURE-----

Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random

Pull random subsystem patches from Ted Ts'o:
 "This patch series contains a major revamp of how we collect entropy
  from interrupts for /dev/random and /dev/urandom.

  The goal is to addresses weaknesses discussed in the paper "Mining
  your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
  by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J.  Alex Halderman,
  which will be published in the Proceedings of the 21st Usenix Security
  Symposium, August 2012.  (See https://factorable.net for more
  information and an extended version of the paper.)"

Fix up trivial conflicts due to nearby changes in
drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
  random: mix in architectural randomness in extract_buf()
  dmi: Feed DMI table to /dev/random driver
  random: Add comment to random_initialize()
  random: final removal of IRQF_SAMPLE_RANDOM
  um: remove IRQF_SAMPLE_RANDOM which is now a no-op
  sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
  board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
  isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
  uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
  drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
  xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
  n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
  i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
  input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
  mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
  ...
2012-07-31 19:07:42 -07:00
Mel Gorman
b4b9e35585 netvm: set PF_MEMALLOC as appropriate during SKB processing
In order to make sure pfmemalloc packets receive all memory needed to
proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
This is limited to a subset of protocols that are expected to be used for
writing to swap.  Taps are not allowed to use PF_MEMALLOC as these are
expected to communicate with userspace processes which could be paged out.

[a.p.zijlstra@chello.nl: Ideas taken from various patches]
[jslaby@suse.cz: Lock imbalance fix]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:46 -07:00
David S. Miller
b68581778c net: Make skb->skb_iif always track skb->dev
Make it follow device decapsulation, from things such as VLAN and
bonding.

The stuff that actually cares about pre-demuxed device pointers, is
handled by the "orig_dev" variable in __netif_receive_skb().  And
the only consumer of that is the po->origdev feature of AF_PACKET
sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 16:36:27 -07:00
Michael S. Tsirkin
1080e512d4 net: orphan frags on receive
zero copy packets are normally sent to the outside
network, but bridging, tun etc might loop them
back to host networking stack. If this happens
destructors will never be called, so orphan
the frags immediately on receive.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 12:39:33 -07:00
David S. Miller
abaa72d7fd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
2012-07-19 11:17:30 -07:00
Rustad, Mark D
734b65417b net: Statically initialize init_net.dev_base_head
This change eliminates an initialization-order hazard most
recently seen when netprio_cgroup is built into the kernel.

With thanks to Eric Dumazet for catching a bug.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-18 13:32:27 -07:00
Theodore Ts'o
7bf2357524 net: feed /dev/random with the MAC address when registering a device
Cc: David Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-07-14 20:17:46 -04:00
David S. Miller
04c9f416e3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/batman-adv/bridge_loop_avoidance.c
	net/batman-adv/bridge_loop_avoidance.h
	net/batman-adv/soft-interface.c
	net/mac80211/mlme.c

With merge help from Antonio Quartulli (batman-adv) and
Stephen Rothwell (drivers/net/usb/qmi_wwan.c).

The net/mac80211/mlme.c conflict seemed easy enough, accounting for a
conversion to some new tracing macros.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 23:56:33 -07:00
Ben Hutchings
2c53040f01 net: Fix (nearly-)kernel-doc comments for various functions
Fix incorrect start markers, wrapped summary lines, missing section
breaks, incorrect separators, and some name mismatches.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 23:13:45 -07:00
Ben Hutchings
a55b138b1d net: Properly define functions with no parameters
Defining a function with no parameters as 'T foo()' is the deprecated
K&R style, and is not strictly equivalent to defining it as 'T foo(void)'.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 23:13:45 -07:00
Eric Dumazet
91c68ce2b2 net: cgroup: fix out of bounds accesses
dev->priomap is allocated by extend_netdev_table() called from
update_netdev_tables().
And this is only called if write_priomap() is called.

But if write_priomap() is not called, it seems we can have out of bounds
accesses in cgrp_destroy(), read_priomap() & skb_update_prio()

With help from Gao Feng

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-09 14:50:54 -07:00
Yuval Mintz
16917b87a2 net-next: Add netif_get_num_default_rss_queues
Most multi-queue networking driver consider the number of online cpus when
configuring RSS queues.
This patch adds a wrapper to the number of cpus, setting an upper limit on the
number of cpus a driver should consider (by default) when allocating resources
for his queues.

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05 03:06:44 -07:00
David S. Miller
b26d344c6b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/caif/caif_hsi.c
	drivers/net/usb/qmi_wwan.c

The qmi_wwan merge was trivial.

The caif_hsi.c, on the other hand, was not.  It's a conflict between
1c385f1fdf ("caif-hsi: Replace platform
device with ops structure.") in the net-next tree and commit
39abbaef19 ("caif-hsi: Postpone init of
HIS until open()") in the net tree.

I did my best with that one and will ask Sjur to check it out.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28 17:37:00 -07:00
Vinson Lee
7cecb523ad net: Downgrade CAP_SYS_MODULE deprecated message from error to warning.
Make logging level consistent with other deprecation messages in net
subsystem.

Signed-off-by: Vinson Lee <vlee@twitter.com>
Cc: David Mackey <tdmackey@twitter.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28 16:55:05 -07:00
David S. Miller
7e52b33bd5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv6/route.c

This deals with a merge conflict between the net-next addition of the
inetpeer network namespace ops, and Thomas Graf's bug fix in
2a0c451ade which makes sure we don't
register /proc/net/ipv6_route before it is actually safe to do so.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-15 15:51:55 -07:00
Eric Dumazet
62b1a8ab9b net: remove skb_orphan_try()
Orphaning skb in dev_hard_start_xmit() makes bonding behavior
unfriendly for applications sending big UDP bursts : Once packets
pass the bonding device and come to real device, they might hit a full
qdisc and be dropped. Without orphaning, the sender is automatically
throttled because sk->sk_wmemalloc reaches sk->sk_sndbuf (assuming
sk_sndbuf is not too big)

We could try to defer the orphaning adding another test in
dev_hard_start_xmit(), but all this seems of little gain,
now that BQL tends to make packets more likely to be parked
in Qdisc queues instead of NIC TX ring, in cases where performance
matters.

Reverts commits :
fc6055a5ba net: Introduce skb_orphan_try()
87fd308cfc net: skb_tx_hash() fix relative to skb_orphan_try()
and removes SKBTX_DRV_NEEDS_SK_REF flag

Reported-and-bisected-by: Jean-Michel Hautbois <jhautbois@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-15 15:30:15 -07:00
Michel Machado
95603e2293 net-next: add dev_loopback_xmit() to avoid duplicate code
Add dev_loopback_xmit() in order to deduplicate functions
ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).

I was about to reinvent the wheel when I noticed that
ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
need and are not IP-only functions, but they were not available to reuse
elsewhere.

ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
understand that this is harmless, and should be in dev_loopback_xmit().

Signed-off-by: Michel Machado <michel@digirati.com.br>
CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jpirko@redhat.com>
CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-12 18:51:09 -07:00
Eric Dumazet
4adb9c4ac8 net: napi_frags_skb() is static
No need to export napi_frags_skb()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-19 02:51:00 -04:00
David S. Miller
028940342a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-16 22:17:37 -04:00
Paul Gortmaker
211ed86510 net: delete all instances of special processing for token ring
We are going to delete the Token ring support.  This removes any
special processing in the core networking for token ring, (aside
from net/tr.c itself), leaving the drivers and remaining tokenring
support present but inert.

The mass removal of the drivers and net/tr.c will be in a separate
commit, so that the history of these files that we still care
about won't have the giant deletion tied into their history.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-05-15 20:14:35 -04:00
Joe Perches
e87cc4728f net: Convert net_ratelimit uses to net_<level>_ratelimited
Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-15 13:45:03 -04:00
David S. Miller
59b9997bab Revert "net: maintain namespace isolation between vlan and real device"
This reverts commit 8a83a00b07.

It causes regressions for S390 devices, because it does an
unconditional DST drop on SKBs for vlans and the QETH device
needs the neighbour entry hung off the DST for certain things
on transmit.

Arnd can't remember exactly why he even needed this change.

Conflicts:

	drivers/net/macvlan.c
	net/8021q/vlan_dev.c
	net/core/dev.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 23:03:34 -04:00
Eric Dumazet
d7e8883cfc net: make GRO aware of skb->head_frag
GRO can check if skb to be merged has its skb->head mapped to a page
fragment, instead of a kmalloc() area.

We 'upgrade' skb->head as a fragment in itself

This avoids the frag_list fallback, and permits to build true GRO skb
(one sk_buff and up to 16 fragments), using less memory.

This reduces number of cache misses when user makes its copy, since a
single sk_buff is fetched.

This is a followup of patch "net: allow skb->head to be a page fragment"

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30 21:35:49 -04:00
Eric Dumazet
daa8654828 net: gro: GRO_MERGED_FREE consumes packets
As part of GRO processing, merged skbs should be consumed, not freed, to
not confuse dropwatch/drop_monitor.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-19 14:23:56 -04:00
Eric Dumazet
95c9617472 net: cleanup unsigned to unsigned int
Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 12:44:40 -04:00
Eric W. Biederman
7d3d43dab4 net: In unregister_netdevice_notifier unregister the netdevices.
We already synthesize events in register_netdevice_notifier and synthesizing
events in unregister_netdevice_notifier allows to us remove the need for
special case cleanup code.

This change should be safe as it adds no new cases for existing callers
of unregiser_netdevice_notifier to handle.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-13 11:01:43 -04:00
Eric Dumazet
2def16ae6b net: fix /proc/net/dev regression
Commit f04565ddf5 (dev: use name hash for dev_seq_ops) added a second
regression, as some devices are missing from /proc/net/dev if many
devices are defined.

When seq_file buffer is filled, the last ->next/show() method is
canceled (pos value is reverted to value prior ->next() call)

Problem is after above commit, we dont restart the lookup at right
position in ->start() method.

Fix this by removing the internal 'pos' pointer added in commit, since
we need to use the 'loff_t *pos' provided by seq_file layer.

This also reverts commit 5cac98dd0 (net: Fix corruption
in /proc/*/net/dev_mcast), since its not needed anymore.

Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Mihai Maruseac <mmaruseac@ixiacom.com>
Tested-by:  Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03 17:23:23 -04:00
Linus Torvalds
ed359a3b7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Provide device string properly for USB i2400m wimax devices, also
    don't OOPS when providing firmware string.  From Phil Sutter.

 2) Add support for sh_eth SH7734 chips, from Nobuhiro Iwamatsu.

 3) Add another device ID to USB zaurus driver, from Guan Xin.

 4) Loop index start in pool vector iterator is wrong causing MAC to not
    get configured in bnx2x driver, fix from Dmitry Kravkov.

 5) EQL driver assumes HZ=100, fix from Eric Dumazet.

 6) Now that skb_add_rx_frag() can specify the truesize increment
    separately, do so in f_phonet and cdc_phonet, also from Eric
    Dumazet.

 7) virtio_net accidently uses net_ratelimit() not only on the kernel
    warning but also the statistic bump, fix from Rick Jones.

 8) ip_route_input_mc() uses fixed init_net namespace, oops, use
    dev_net(dev) instead.  Fix from Benjamin LaHaise.

 9) dev_forward_skb() needs to clear the incoming interface index of the
    SKB so that it looks like a new incoming packet, also from Benjamin
    LaHaise.

10) iwlwifi mistakenly initializes a channel entry as 2GHZ instead of
    5GHZ, fix from Stanislav Yakovlev.

11) Missing kmalloc() return value checks in orinoco, from Santosh
    Nayak.

12) ath9k doesn't check for HT capabilities in the right way, it is
    checking ht_supported instead of the ATH9K_HW_CAP_HT flag.  Fix from
    Sujith Manoharan.

13) Fix x86 BPF JIT emission of 16-bit immediate field of AND
    instructions, from Feiran Zhuang.

14) Avoid infinite loop in GARP code when registering sysfs entries.
    From David Ward.

15) rose protocol uses memcpy instead of memcmp in a device address
    comparison, oops.  Fix from Daniel Borkmann.

16) Fix build of lpc_eth due to dev_hw_addr_rancom() interface being
    renamed to eth_hw_addr_random().  From Roland Stigge.

17) Make ipv6 RTM_GETROUTE interpret RTA_IIF attribute the same way
    that ipv4 does.  Fix from Shmulik Ladkani.

18) via-rhine has an inverted bit test, causing suspend/resume
    regressions.  Fix from Andreas Mohr.

19) RIONET assumes 4K page size, fix from Akinobu Mita.

20) Initialization of imask register in sky2 is buggy, because bits are
    "or'd" into an uninitialized local variable.  Fix from Lino
    Sanfilippo.

21) Fix FCOE checksum offload handling, from Yi Zou.

22) Fix VLAN processing regression in e1000, from Jiri Pirko.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  sky2: dont overwrite settings for PHY Quick link
  tg3: Fix 5717 serdes powerdown problem
  net: usb: cdc_eem: fix mtu
  net: sh_eth: fix endian check for architecture independent
  usb/rtl8150 : Remove duplicated definitions
  rionet: fix page allocation order of rionet_active
  via-rhine: fix wait-bit inversion.
  ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be consistent with ipv4
  net: lpc_eth: Fix rename of dev_hw_addr_random
  net/netfilter/nfnetlink_acct.c: use linux/atomic.h
  rose_dev: fix memcpy-bug in rose_set_mac_address
  Fix non TBI PHY access; a bad merge undid bug fix in a previous commit.
  net/garp: avoid infinite loop if attribute already exists
  x86 bpf_jit: fix a bug in emitting the 16-bit immediate operand of AND
  bonding: emit event when bonding changes MAC
  mac80211: fix oper channel timestamp updation
  ath9k: Use HW HT capabilites properly
  MAINTAINERS: adding maintainer for ipw2x00
  net: orinoco: add error handling for failed kmalloc().
  net/wireless: ipw2x00: fix a typo in wiphy struct initilization
  ...
2012-04-02 17:53:39 -07:00
Linus Torvalds
0195c00244 Disintegrate and delete asm/system.h
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIVAwUAT3NKzROxKuMESys7AQKElw/+JyDxJSlj+g+nymkx8IVVuU8CsEwNLgRk
 8KEnRfLhGtkXFLSJYWO6jzGo16F8Uqli1PdMFte/wagSv0285/HZaKlkkBVHdJ/m
 u40oSjgT013bBh6MQ0Oaf8pFezFUiQB5zPOA9QGaLVGDLXCmgqUgd7exaD5wRIwB
 ZmyItjZeAVnDfk1R+ZiNYytHAi8A5wSB+eFDCIQYgyulA1Igd1UnRtx+dRKbvc/m
 rWQ6KWbZHIdvP1ksd8wHHkrlUD2pEeJ8glJLsZUhMm/5oMf/8RmOCvmo8rvE/qwl
 eDQ1h4cGYlfjobxXZMHqAN9m7Jg2bI946HZjdb7/7oCeO6VW3FwPZ/Ic75p+wp45
 HXJTItufERYk6QxShiOKvA+QexnYwY0IT5oRP4DrhdVB/X9cl2MoaZHC+RbYLQy+
 /5VNZKi38iK4F9AbFamS7kd0i5QszA/ZzEzKZ6VMuOp3W/fagpn4ZJT1LIA3m4A9
 Q0cj24mqeyCfjysu0TMbPtaN+Yjeu1o1OFRvM8XffbZsp5bNzuTDEvviJ2NXw4vK
 4qUHulhYSEWcu9YgAZXvEWDEM78FXCkg2v/CrZXH5tyc95kUkMPcgG+QZBB5wElR
 FaOKpiC/BuNIGEf02IZQ4nfDxE90QwnDeoYeV+FvNj9UEOopJ5z5bMPoTHxm4cCD
 NypQthI85pc=
 =G9mT
 -----END PGP SIGNATURE-----

Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system

Pull "Disintegrate and delete asm/system.h" from David Howells:
 "Here are a bunch of patches to disintegrate asm/system.h into a set of
  separate bits to relieve the problem of circular inclusion
  dependencies.

  I've built all the working defconfigs from all the arches that I can
  and made sure that they don't break.

  The reason for these patches is that I recently encountered a circular
  dependency problem that came about when I produced some patches to
  optimise get_order() by rewriting it to use ilog2().

  This uses bitops - and on the SH arch asm/bitops.h drags in
  asm-generic/get_order.h by a circuituous route involving asm/system.h.

  The main difficulty seems to be asm/system.h.  It holds a number of
  low level bits with no/few dependencies that are commonly used (eg.
  memory barriers) and a number of bits with more dependencies that
  aren't used in many places (eg.  switch_to()).

  These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

        Move memory barriers here.  This already done for MIPS and Alpha.

    (2) asm/switch_to.h

        Move switch_to() and related stuff here.

    (3) asm/exec.h

        Move arch_align_stack() here.  Other process execution related bits
        could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

        Move xchg() and cmpxchg() here as they're full word atomic ops and
        frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

        Move die() and related bits.

    (6) asm/auxvec.h

        Move AT_VECTOR_SIZE_ARCH here.

  Other arch headers are created as needed on a per-arch basis."

Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that.  We'll find out anything that got broken and fix it..

* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
  Delete all instances of asm/system.h
  Remove all #inclusions of asm/system.h
  Add #includes needed to permit the removal of asm/system.h
  Move all declarations of free_initmem() to linux/mm.h
  Disintegrate asm/system.h for OpenRISC
  Split arch_align_stack() out from asm-generic/system.h
  Split the switch_to() wrapper out of asm-generic/system.h
  Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
  Create asm-generic/barrier.h
  Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
  Disintegrate asm/system.h for Xtensa
  Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
  Disintegrate asm/system.h for Tile
  Disintegrate asm/system.h for Sparc
  Disintegrate asm/system.h for SH
  Disintegrate asm/system.h for Score
  Disintegrate asm/system.h for S390
  Disintegrate asm/system.h for PowerPC
  Disintegrate asm/system.h for PA-RISC
  Disintegrate asm/system.h for MN10300
  ...
2012-03-28 15:58:21 -07:00
David Howells
9ffc93f203 Remove all #inclusions of asm/system.h
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it.  Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28 18:30:03 +01:00
Benjamin LaHaise
3b9785c6b0 net/core: dev_forward_skb() should clear skb_iif
While investigating another bug, I found that the code on the incoming path
in __netif_receive_skb will only set skb->skb_iif if it is already 0.  When
dev_forward_skb() is used in the case of interfaces like veth, skb_iif may
already have been set.  Making dev_forward_skb() cause the packet to look
like a newly received packet would seem to the the correct behaviour here,
as otherwise the wrong incoming interface can be reported for such a packet.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-28 04:45:37 -04:00
Eric Dumazet
2a2a459eee net: fix napi_reuse_skb() skb reserve
napi->skb is allocated in napi_get_frags() using
netdev_alloc_skb_ip_align(), with a reserve of NET_SKB_PAD +
NET_IP_ALIGN bytes.

However, when such skb is recycled in napi_reuse_skb(), it ends with a
reserve of NET_IP_ALIGN which is suboptimal.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-21 16:52:09 -04:00
Linus Torvalds
3b59bf0816 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking merge from David Miller:
 "1) Move ixgbe driver over to purely page based buffering on receive.
     From Alexander Duyck.

  2) Add receive packet steering support to e1000e, from Bruce Allan.

  3) Convert TCP MD5 support over to RCU, from Eric Dumazet.

  4) Reduce cpu usage in handling out-of-order TCP packets on modern
     systems, also from Eric Dumazet.

  5) Support the IP{,V6}_UNICAST_IF socket options, making the wine
     folks happy, from Erich Hoover.

  6) Support VLAN trunking from guests in hyperv driver, from Haiyang
     Zhang.

  7) Support byte-queue-limtis in r8169, from Igor Maravic.

  8) Outline code intended for IP_RECVTOS in IP_PKTOPTIONS existed but
     was never properly implemented, Jiri Benc fixed that.

  9) 64-bit statistics support in r8169 and 8139too, from Junchang Wang.

  10) Support kernel side dump filtering by ctmark in netfilter
      ctnetlink, from Pablo Neira Ayuso.

  11) Support byte-queue-limits in gianfar driver, from Paul Gortmaker.

  12) Add new peek socket options to assist with socket migration, from
      Pavel Emelyanov.

  13) Add sch_plug packet scheduler whose queue is controlled by
      userland daemons using explicit freeze and release commands.  From
      Shriram Rajagopalan.

  14) Fix FCOE checksum offload handling on transmit, from Yi Zou."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1846 commits)
  Fix pppol2tp getsockname()
  Remove printk from rds_sendmsg
  ipv6: fix incorrent ipv6 ipsec packet fragment
  cpsw: Hook up default ndo_change_mtu.
  net: qmi_wwan: fix build error due to cdc-wdm dependecy
  netdev: driver: ethernet: Add TI CPSW driver
  netdev: driver: ethernet: add cpsw address lookup engine support
  phy: add am79c874 PHY support
  mlx4_core: fix race on comm channel
  bonding: send igmp report for its master
  fs_enet: Add MPC5125 FEC support and PHY interface selection
  net: bpf_jit: fix BPF_S_LDX_B_MSH compilation
  net: update the usage of CHECKSUM_UNNECESSARY
  fcoe: use CHECKSUM_UNNECESSARY instead of CHECKSUM_PARTIAL on tx
  net: do not do gso for CHECKSUM_UNNECESSARY in netif_needs_gso
  ixgbe: Fix issues with SR-IOV loopback when flow control is disabled
  net/hyperv: Fix the code handling tx busy
  ixgbe: fix namespace issues when FCoE/DCB is not enabled
  rtlwifi: Remove unused ETH_ADDR_LEN defines
  igbvf: Use ETH_ALEN
  ...

Fix up fairly trivial conflicts in drivers/isdn/gigaset/interface.c and
drivers/net/usb/{Kconfig,qmi_wwan.c} as per David.
2012-03-20 21:04:47 -07:00
David S. Miller
95f050bf7f net: Use bool for return value of dev_valid_name().
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-06 16:12:15 -05:00
Eric Dumazet
77a1abf54f net: export netdev_stats_to_stats64
Some drivers use internal netdev stats member to store part of their
stats, yet advertize ndo_get_stats64() to implement some 64bit fields.

Allow them to use netdev_stats_to_stats64() helper to make the copy of
netdev stats before they compute their 64bit counters.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-05 15:38:32 -05:00
Ingo Molnar
737f24bda7 Merge branch 'perf/urgent' into perf/core
Conflicts:
	tools/perf/builtin-record.c
	tools/perf/builtin-top.c
	tools/perf/perf.h
	tools/perf/util/top.h

Merge reason: resolve these cherry-picking conflicts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 09:20:08 +01:00
Ingo Molnar
c5905afb0e static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()
So here's a boot tested patch on top of Jason's series that does
all the cleanups I talked about and turns jump labels into a
more intuitive to use facility. It should also address the
various misconceptions and confusions that surround jump labels.

Typical usage scenarios:

        #include <linux/static_key.h>

        struct static_key key = STATIC_KEY_INIT_TRUE;

        if (static_key_false(&key))
                do unlikely code
        else
                do likely code

Or:

        if (static_key_true(&key))
                do likely code
        else
                do unlikely code

The static key is modified via:

        static_key_slow_inc(&key);
        ...
        static_key_slow_dec(&key);

The 'slow' prefix makes it abundantly clear that this is an
expensive operation.

I've updated all in-kernel code to use this everywhere. Note
that I (intentionally) have not pushed through the rename
blindly through to the lowest levels: the actual jump-label
patching arch facility should be named like that, so we want to
decouple jump labels from the static-key facility a bit.

On non-jump-label enabled architectures static keys default to
likely()/unlikely() branches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-24 10:05:59 +01:00
Eric Dumazet
5ca3b72c5d gro: more generic L2 header check
Shlomo Pongratz reported GRO L2 header check was suited for Ethernet
only, and failed on IB/ipoib traffic.

He provided a patch faking a zeroed header to let GRO aggregates frames.

Roland Dreier, Herbert Xu, and others suggested we change GRO L2 header
check to be more generic, ie not assuming L2 header is 14 bytes, but
taking into account hard_header_len.

__napi_gro_receive() has special handling for the common case (Ethernet)
to avoid a memcmp() call and use an inline optimized function instead.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Shlomo Pongratz <shlomop@mellanox.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-08 18:26:54 -05:00
Eric Dumazet
43480aecb1 gro: more generic L2 header check
Shlomo Pongratz reported GRO L2 header check was suited for Ethernet
only, and failed on IB/ipoib traffic.

He provided a patch faking a zeroed header to let GRO aggregates frames.

Roland Dreier, Herbert Xu, and others suggested we change GRO L2 header
check to be more generic, ie not assuming L2 header is 14 bytes, but
taking into account hard_header_len.

__napi_gro_receive() has special handling for the common case (Ethernet)
to avoid a memcmp() call and use an inline optimized function instead.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Shlomo Pongratz <shlomop@mellanox.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-08 15:50:01 -05:00
Joe Perches
7b6cd1ce72 PATCH V2 net-next] net: dev: Convert printks to pr_<level>
Use the current logging style.
Coalesce formats where appropriate.
Update grammar where appropriate.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-01 15:57:08 -05:00
Michał Mirosław
65e9d2faab net: fix NULL-deref in WARN() in skb_gso_segment()
Bug was introduced in commit c8f44affb7.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-17 15:51:23 -05:00
Ben Hutchings
36c9247449 net: WARN if skb_checksum_help() is called on skb requiring segmentation
skb_checksum_help() has never done anything useful with skbs that
require segmentation.  Setting skb->ip_summed = CHECKSUM_NONE makes
them invalid and provokes a later WARNing in skb_gso_segment().

Passing such an skb to skb_checksum_help() indicates a bug, so we
should warn about it immediately.  Move the warning from
skb_gso_segment() into a shared function, and add gso_type and
gso_size to it.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-17 15:49:03 -05:00
Ben Hutchings
e52ac3398c net: Use device model to get driver name in skb_gso_segment()
ethtool operations generally require the caller to hold RTNL and are
not safe to call in atomic context.  The device model provides this
information for most devices; we'll only lose it for some old ISA
drivers.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-17 10:31:12 -05:00
David S. Miller
b3613118eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2011-12-02 13:49:21 -05:00
Eric Dumazet
b536db9332 net: net_device flags is an unsigned int
commit b00055aacd ([NET] core: add RFC2863 operstate) changed
net_device flags from unsigned short to unsigned int.

Some core functions still assume its an unsigned short.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-01 11:41:48 -05:00
RongQing.Li
8f89148986 net/core: fix rollback handler in register_netdevice_notifier
Within nested statements, the break statement terminates only the
do, for, switch, or while statement that immediately encloses it,
So replace the break with goto.

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30 23:43:07 -05:00
Igor Maravic
6977a79d36 net: Fix skb_update_prio RCU usage.
Change function rcu_dereference to rcu_dereference_bh to avoid warning

[ INFO: suspicious RCU usage. ]
-------------------------------
net/core/dev.c:2459 suspicious rcu_dereference_check() usage!

because we are locking with

rcu_read_lock_bh();

in function dev_queue_xmit(struct sk_buff *skb)

Signed-off-by: Igor Maravic <igorm@etf.rs>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 18:25:17 -05:00
Tom Herbert
114cf58021 bql: Byte queue limits
Networking stack support for byte queue limits, uses dynamic queue
limits library.  Byte queue limits are maintained per transmit queue,
and a dql structure has been added to netdev_queue structure for this
purpose.

Configuration of bql is in the tx-<n> sysfs directory for the queue
under the byte_queue_limits directory.  Configuration includes:
limit_min, bql minimum limit
limit_max, bql maximum limit
hold_time, bql slack hold time

Also under the directory are:
limit, current byte limit
inflight, current number of bytes on the queue

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 12:46:19 -05:00
Tom Herbert
7346649826 net: Add queue state xoff flag for stack
Create separate queue state flags so that either the stack or drivers
can turn on XOFF.  Added a set of functions used in the stack to determine
if a queue is really stopped (either by stack or driver)

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 12:46:19 -05:00
Eric Dumazet
b90e5794c5 net: dont call jump_label_dec from irq context
Igor Maravic reported an error caused by jump_label_dec() being called
from IRQ context :

 BUG: sleeping function called from invalid context at kernel/mutex.c:271
 in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper
 1 lock held by swapper/0:
  #0:  (&n->timer){+.-...}, at: [<ffffffff8107ce90>] call_timer_fn+0x0/0x340
 Pid: 0, comm: swapper Not tainted 3.2.0-rc2-net-next-mpls+ #1
Call Trace:
 <IRQ>  [<ffffffff8104f417>] __might_sleep+0x137/0x1f0
 [<ffffffff816b9a2f>] mutex_lock_nested+0x2f/0x370
 [<ffffffff810a89fd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8109a37f>] ? local_clock+0x6f/0x80
 [<ffffffff810a90a5>] ? lock_release_holdtime.part.22+0x15/0x1a0
 [<ffffffff81557929>] ? sock_def_write_space+0x59/0x160
 [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
 [<ffffffff810969cd>] atomic_dec_and_mutex_lock+0x5d/0x80
 [<ffffffff8112fc1d>] jump_label_dec+0x1d/0x50
 [<ffffffff81566525>] net_disable_timestamp+0x15/0x20
 [<ffffffff81557a75>] sock_disable_timestamp+0x45/0x50
 [<ffffffff81557b00>] __sk_free+0x80/0x200
 [<ffffffff815578d0>] ? sk_send_sigurg+0x70/0x70
 [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
 [<ffffffff81557cba>] sock_wfree+0x3a/0x70
 [<ffffffff8155c2b0>] skb_release_head_state+0x70/0x120
 [<ffffffff8155c0b6>] __kfree_skb+0x16/0x30
 [<ffffffff8155c119>] kfree_skb+0x49/0x170
 [<ffffffff815e936e>] arp_error_report+0x3e/0x90
 [<ffffffff81575bd9>] neigh_invalidate+0x89/0xc0
 [<ffffffff81578dbe>] neigh_timer_handler+0x9e/0x2a0
 [<ffffffff81578d20>] ? neigh_update+0x640/0x640
 [<ffffffff81073558>] __do_softirq+0xc8/0x3a0

Since jump_label_{inc|dec} must be called from process context only,
we must defer jump_label_dec() if net_disable_timestamp() is called
from interrupt context.

Reported-by: Igor Maravic <igorm@etf.rs>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 00:26:25 -05:00
Eric Dumazet
4504b8613b net: use skb_flow_dissect() in __skb_get_rxhash()
No functional changes.

This uses the code we factorized in skb_flow_dissect()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-28 19:09:07 -05:00
Anton Blanchard
5cac98dd06 net: Fix corruption in /proc/*/net/dev_mcast
I just hit this during my testing. Isn't there another bug lurking?

BUG kmalloc-8: Redzone overwritten

INFO: 0xc0000000de9dec48-0xc0000000de9dec4b. First byte 0x0 instead of 0xcc
INFO: Allocated in .__seq_open_private+0x30/0xa0 age=0 cpu=5 pid=3896
	.__kmalloc+0x1e0/0x2d0
	.__seq_open_private+0x30/0xa0
	.seq_open_net+0x60/0xe0
	.dev_mc_seq_open+0x4c/0x70
	.proc_reg_open+0xd8/0x260
	.__dentry_open.clone.11+0x2b8/0x400
	.do_last+0xf4/0x950
	.path_openat+0xf8/0x480
	.do_filp_open+0x48/0xc0
	.do_sys_open+0x140/0x250
	syscall_exit+0x0/0x40

dev_mc_seq_ops uses dev_seq_start/next/stop but only allocates
sizeof(struct seq_net_private) of private data, whereas it expects
sizeof(struct dev_iter_state):

struct dev_iter_state {
	struct seq_net_private p;
	unsigned int pos; /* bucket << BUCKET_SPACE + offset */
};

Create dev_seq_open_ops and use it so we don't have to expose
struct dev_iter_state.

[ Problem added by commit f04565ddf5 (dev: use name hash for
  dev_seq_ops) -Eric ]

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-28 18:07:29 -05:00
Neil Horman
5bc1421e34 net: add network priority cgroup infrastructure (v4)
This patch adds in the infrastructure code to create the network priority
cgroup.  The cgroup, in addition to the standard processes file creates two
control files:

1) prioidx - This is a read-only file that exports the index of this cgroup.
This is a value that is both arbitrary and unique to a cgroup in this subsystem,
and is used to index the per-device priority map

2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
<name:priority> where name is the name of a network interface and priority is
indicates the priority assigned to frames egresessing on the named interface and
originating from a pid in this cgroup

This cgroup allows for skb priority to be set prior to a root qdisc getting
selected. This is benenficial for DCB enabled systems, in that it allows for any
application to use dcb configured priorities so without application modification

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-22 15:22:23 -05:00
Eric Dumazet
adc9300e78 net: use jump_label to shortcut RPS if not setup
Most machines dont use RPS/RFS, and pay a fair amount of instructions in
netif_receive_skb() / netif_rx() / get_rps_cpu() just to discover
RPS/RFS is not setup.

Add a jump_label named rps_needed.

If no device rps_map or global rps_sock_flow_table is setup,
netif_receive_skb() / netif_rx() do a single instruction instead of many
ones, including conditional jumps.

jmp +0    (if CONFIG_JUMP_LABEL=y)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-17 17:06:08 -05:00
Michał Mirosław
34324dc2bf net: remove NETIF_F_NO_CSUM feature bit
Only distinct use is checking if NETIF_F_NOCACHE_COPY should be
enabled by default. The check heuristics is altered a bit here,
so it hits other people than before. The default shouldn't be
trusted for performance-critical cases anyway.

For all other uses NETIF_F_NO_CSUM is equivalent to NETIF_F_HW_CSUM.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-16 17:43:12 -05:00
Michał Mirosław
c8f44affb7 net: introduce and use netdev_features_t for device features sets
v2:	add couple missing conversions in drivers
	split unexporting netdev_fix_features()
	implemented %pNF
	convert sock::sk_route_(no?)caps

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-16 17:43:10 -05:00
Michał Mirosław
bc5787c612 net: remove legacy ethtool ops
As all drivers are converted, we may now remove discrete offload setting
callback handling.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-16 17:43:08 -05:00
Eric Dumazet
588f033075 net: use jump_label for netstamp_needed
netstamp_needed seems a good candidate to jump_label conversion.

This avoids 3 conditional branches per incoming packet in fast path.

No measurable difference, given that these conditional branches are
predicted on modern cpus. Only a small icache reduction, thanks to the
unlikely() stuff.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-16 17:30:06 -05:00
Eric Dumazet
6a32e4f9dd vlan: allow nested vlan_do_receive()
commit 2425717b27 (net: allow vlan traffic to be received under bond)
broke ARP processing on vlan on top of bonding.

       +-------+
eth0 --| bond0 |---bond0.103
eth1 --|       |
       +-------+

52870.115435: skb_gro_reset_offset <-napi_gro_receive
52870.115435: dev_gro_receive <-napi_gro_receive
52870.115435: napi_skb_finish <-napi_gro_receive
52870.115435: netif_receive_skb <-napi_skb_finish
52870.115435: get_rps_cpu <-netif_receive_skb
52870.115435: __netif_receive_skb <-netif_receive_skb
52870.115436: vlan_do_receive <-__netif_receive_skb
52870.115436: bond_handle_frame <-__netif_receive_skb
52870.115436: vlan_do_receive <-__netif_receive_skb
52870.115436: arp_rcv <-__netif_receive_skb
52870.115436: kfree_skb <-arp_rcv

Packet is dropped in arp_rcv() because its pkt_type was set to
PACKET_OTHERHOST in the first vlan_do_receive() call, since no eth0.103
exists.

We really need to change pkt_type only if no more rx_handler is about to
be called for the packet.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-30 04:43:30 -04:00
Linus Torvalds
8a9ea3237e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1745 commits)
  dp83640: free packet queues on remove
  dp83640: use proper function to free transmit time stamping packets
  ipv6: Do not use routes from locally generated RAs
  |PATCH net-next] tg3: add tx_dropped counter
  be2net: don't create multiple RX/TX rings in multi channel mode
  be2net: don't create multiple TXQs in BE2
  be2net: refactor VF setup/teardown code into be_vf_setup/clear()
  be2net: add vlan/rx-mode/flow-control config to be_setup()
  net_sched: cls_flow: use skb_header_pointer()
  ipv4: avoid useless call of the function check_peer_pmtu
  TCP: remove TCP_DEBUG
  net: Fix driver name for mdio-gpio.c
  ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT
  rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces
  ipv4: fix ipsec forward performance regression
  jme: fix irq storm after suspend/resume
  route: fix ICMP redirect validation
  net: hold sock reference while processing tx timestamps
  tcp: md5: add more const attributes
  Add ethtool -g support to virtio_net
  ...

Fix up conflicts in:
 - drivers/net/Kconfig:
	The split-up generated a trivial conflict with removal of a
	stale reference to Documentation/networking/net-modules.txt.
	Remove it from the new location instead.
 - fs/sysfs/dir.c:
	Fairly nasty conflicts with the sysfs rb-tree usage, conflicting
	with Eric Biederman's changes for tagged directories.
2011-10-25 13:25:22 +02:00
Linus Torvalds
2d03423b23 Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits)
  mm: memory hotplug: Check if pages are correctly reserved on a per-section basis
  Revert "memory hotplug: Correct page reservation checking"
  Update email address for stable patch submission
  dynamic_debug: fix undefined reference to `__netdev_printk'
  dynamic_debug: use a single printk() to emit messages
  dynamic_debug: remove num_enabled accounting
  dynamic_debug: consolidate repetitive struct _ddebug descriptor definitions
  uio: Support physical addresses >32 bits on 32-bit systems
  sysfs: add unsigned long cast to prevent compile warning
  drivers: base: print rejected matches with DEBUG_DRIVER
  memory hotplug: Correct page reservation checking
  memory hotplug: Refuse to add unaligned memory regions
  remove the messy code file Documentation/zh_CN/SubmitChecklist
  ARM: mxc: convert device creation to use platform_device_register_full
  new helper to create platform devices with dma mask
  docs/driver-model: Update device class docs
  docs/driver-model: Document device.groups
  kobj_uevent: Ignore if some listeners cannot handle message
  dynamic_debug: make netif_dbg() call __netdev_printk()
  dynamic_debug: make netdev_dbg() call __netdev_printk()
  ...
2011-10-25 12:13:59 +02:00
David S. Miller
1805b2f048 Merge branch 'master' of ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2011-10-24 18:18:09 -04:00
Eric W. Biederman
d2237d3574 rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces
Renato Westphal noticed that since commit a2835763e1
"rtnetlink: handle rtnl_link netlink notifications manually" was merged
we no longer send a netlink message when a networking device is moved
from one network namespace to another.

Fix this by adding the missing manual notification in dev_change_net_namespaces.

Since all network devices that are processed by dev_change_net_namspaces are
in the initialized state the complicated tests that guard the manual
rtmsg_ifinfo calls in rollback_registered and register_netdevice are
unnecessary and we can just perform a plain notification.

Cc: stable@kernel.org
Tested-by: Renato Westphal <renatowestphal@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-24 03:03:37 -04:00
Mihai Maruseac
f04565ddf5 dev: use name hash for dev_seq_ops
Instead of using the dev->next chain and trying to resync at each call to
dev_seq_start, use the name hash, keeping the bucket and the offset in
seq->private field.

Tests revealed the following results for ifconfig > /dev/null
	* 1000 interfaces:
		* 0.114s without patch
		* 0.089s with patch
	* 3000 interfaces:
		* 0.489s without patch
		* 0.110s with patch
	* 5000 interfaces:
		* 1.363s without patch
		* 0.250s with patch
	* 128000 interfaces (other setup):
		* ~100s without patch
		* ~30s with patch

Signed-off-by: Mihai Maruseac <mmaruseac@ixiacom.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-21 02:54:38 -04:00
Richard Cochran
4dc360c5e7 net: validate HWTSTAMP ioctl parameters
This patch adds a sanity check on the values provided by user space for
the hardware time stamping configuration. If the values lie outside of
the absolute limits, then the ioctl request will be denied.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-19 17:00:35 -04:00
Eric W. Biederman
850a545bd8 net: Move rcu_barrier from rollback_registered_many to netdev_run_todo.
This patch moves the rcu_barrier from rollback_registered_many
(inside the rtnl_lock) into netdev_run_todo (just outside the rtnl_lock).
This allows us to gain the full benefit of sychronize_net calling
synchronize_rcu_expedited when the rtnl_lock is held.

The rcu_barrier in rollback_registered_many was originally a synchronize_net
but was promoted to be a rcu_barrier() when it was found that people were
unnecessarily hitting the 250ms wait in netdev_wait_allrefs().  Changing
the rcu_barrier back to a synchronize_net is therefore safe.

Since we only care about waiting for the rcu callbacks before we get
to netdev_wait_allrefs() it is also safe to move the wait into
netdev_run_todo.

This was tested by creating and destroying 1000 tap devices and observing
/proc/lock_stat.  /proc/lock_stat reports this change reduces the hold
times of the rtnl_lock by a factor of 10.  There was no observable
difference in the amount of time it takes to destroy a network device.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-19 16:59:42 -04:00
Eric Dumazet
9e903e0852 net: add skb frag size accessors
To ease skb->truesize sanitization, its better to be able to localize
all references to skb frags size.

Define accessors : skb_frag_size() to fetch frag size, and
skb_frag_size_{set|add|sub}() to manipulate it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-19 03:10:46 -04:00
John Fastabend
2425717b27 net: allow vlan traffic to be received under bond
The following configuration used to work as I expected. At least
we could use the fcoe interfaces to do MPIO and the bond0 iface
to do load balancing or failover.

       ---eth2.228-fcoe
       |
eth2 -----|
          |
          |---- bond0
          |
eth3 -----|
       |
       ---eth3.228-fcoe

This worked because of a change we added to allow inactive slaves
to rx 'exact' matches. This functionality was kept intact with the
rx_handler mechanism. However now the vlan interface attached to the
active slave never receives traffic because the bonding rx_handler
updates the skb->dev and goto's another_round. Previously, the
vlan_do_receive() logic was called before the bonding rx_handler.

Now by the time vlan_do_receive calls vlan_find_dev() the
skb->dev is set to bond0 and it is clear no vlan is attached
to this iface. The vlan lookup fails.

This patch moves the VLAN check above the rx_handler. A VLAN
tagged frame is now routed to the eth2.228-fcoe iface in the
above schematic. Untagged frames continue to the bond0 as
normal. This case also remains intact,

eth2 --> bond0 --> vlan.228

Here the skb is VLAN tagged but the vlan lookup fails on eth2
causing the bonding rx_handler to be called. On the second
pass the vlan lookup is on the bond0 iface and completes as
expected.

Putting a VLAN.228 on both the bond0 and eth2 device will
result in eth2.228 receiving the skb. I don't think this is
completely unexpected and was the result prior to the rx_handler
result.

Note, the same setup is also used for other storage traffic that
MPIO is used with eg. iSCSI and similar setups can be contrived
without storage protocols.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Jiri Pirko <jpirko@redhat.com>
Tested-by: Hans Schillstrom <hams.schillstrom@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-18 23:46:46 -04:00
Ben Hutchings
09994d1b09 RPS: Ensure that an expired hardware filter can be re-added later
Amir Vadai wrote:
> When a stream is paused, and its rule is expired while it is paused,
> no new rule will be configured to the HW when traffic resume.
[...]
> - When stream was resumed, traffic was steered again by RSS, and
> because current-cpu was equal to desired-cpu,  ndo_rx_flow_steer
> wasn't called and no rule was configured to the HW.

Fix this by setting the flow's current CPU only in the table for the
newly selected RX queue.

Reported-and-tested-by: Amir Vadai <amirv@dev.mellanox.co.il>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-03 12:14:45 -04:00
Changli Gao
5dd17e08f3 net: rps: fix the support for PPPOE
The upper protocol numbers of PPPOE are different, and should be treated
specially.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-28 13:34:25 -04:00
David S. Miller
8decf86879 Merge branch 'master' of github.com:davem330/net
Conflicts:
	MAINTAINERS
	drivers/net/Kconfig
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
	drivers/net/ethernet/broadcom/tg3.c
	drivers/net/wireless/iwlwifi/iwl-pci.c
	drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
	drivers/net/wireless/rt2x00/rt2800usb.c
	drivers/net/wireless/wl12xx/main.c
2011-09-22 03:23:13 -04:00
Jiri Pirko
4bc71cb983 net: consolidate and fix ethtool_ops->get_settings calling
This patch does several things:
- introduces __ethtool_get_settings which is called from ethtool code and
  from drivers as well. Put ASSERT_RTNL there.
- dev_ethtool_get_settings() is replaced by __ethtool_get_settings()
- changes calling in drivers so rtnl locking is respected. In
  iboe_get_rate was previously ->get_settings() called unlocked. This
  fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same
  problem. Also fixed by calling __dev_get_by_index() instead of
  dev_get_by_index() and holding rtnl_lock for both calls.
- introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create()
  so bnx2fc_if_create() and fcoe_if_create() are called locked as they
  are from other places.
- use __ethtool_get_settings() in bonding code

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

v2->v3:
	-removed dev_ethtool_get_settings()
	-added ASSERT_RTNL into __ethtool_get_settings()
	-prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock
	 around it and __ethtool_get_settings() call
v1->v2:
        add missing export_symbol
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits]
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-15 17:32:26 -04:00
Michael S. Tsirkin
48c830120f net: copy userspace buffers on device forwarding
dev_forward_skb loops an skb back into host networking
stack which might hang on the memory indefinitely.
In particular, this can happen in macvtap in bridged mode.
Copy the userspace fragments to avoid blocking the
sender in that case.

As this patch makes skb_copy_ubufs extern now,
I also added some documentation and made it clear
the SKBTX_DEV_ZEROCOPY flag automatically instead
of doing it in all callers. This can be made into a separate
patch if people feel it's worth it.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-15 14:49:44 -04:00
Ian Campbell
ea2ab69379 net: convert core to skb paged frag APIs
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-24 17:52:11 -07:00
Eric Dumazet
ec5efe7946 rps: support IPIP encapsulation
Skip IPIP header to get proper layer-4 information.

Like GRE tunnels, this only works if rxhash is not already provided by
the device itself (ethtool -K ethX rxhash off), to allow kernel compute
a software rxhash.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-24 16:13:55 -07:00
Jason Baron
ffa10cb47a dynamic_debug: make netdev_dbg() call __netdev_printk()
Previously, if dynamic debug was enabled netdev_dbg() was using
dynamic_dev_dbg() to print out the underlying msg. Fix this by making
sure netdev_dbg() uses __netdev_printk().

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-22 18:23:06 -07:00
Jiri Pirko
0dfe178239 net: vlan: goto another_round instead of calling __netif_receive_skb
Now, when vlan tag on untagged in non-accelerated path is stripped from
skb, headers are reset right away. Benefit from that and avoid calling
__netif_receive_skb recursivelly and just use another_round.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-22 12:43:22 -07:00
Changli Gao
ae1511bf76 net: rps: support PPPOE session messages
Inspect the payload of PPPOE session messages for the 4 tuples to generate
skb->rxhash.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-18 23:23:47 -07:00
Changli Gao
1ff1986fc9 net: rps: support 802.1Q
For the 802.1Q packets, if the NIC doesn't support hw-accel-vlan-rx, RPS
won't inspect the internal 4 tuples to generate skb->rxhash, so this kind
of traffic can't get any benefit from RPS.

This patch adds the support for 802.1Q to RPS.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-18 22:07:54 -07:00
Jiri Pirko
b81693d914 net: remove ndo_set_multicast_list callback
Remove no longer used operation.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:22:03 -07:00
Jiri Pirko
01789349ee net: introduce IFF_UNICAST_FLT private flag
Use IFF_UNICAST_FTL to find out if driver handles unicast address
filtering. In case it does not, promisc mode is entered.

Patch also fixes following drivers:
stmmac, niu: support uc filtering and yet it propagated
	ndo_set_multicast_list
bna, benet, pxa168_eth, ks8851, ks8851_mll, ksz884x : has set
	ndo_set_rx_mode but do not support uc filtering

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:21:27 -07:00
Tom Herbert
c6865cb3cc rps: Inspect GRE encapsulated packets to get flow hash
Crack open GRE packets in __skb_get_rxhash to compute 4-tuple hash on
in encapsulated packet.  Note that this is used only when the
__skb_get_rxhash is taken, in particular only when the device does
not compute provide the rxhash (ie. feature is disabled).

This was tested by creating a single GRE tunnel between two 16 core
AMD machines.  200 netperf TCP_RR streams were ran with 1 byte
request and response size.

Without patch: 157497 tps, 50/90/99% latencies 1250/1292/1364 usecs
With patch: 325896 tps, 50/90/99% latencies 603/848/1169

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:06:03 -07:00
Tom Herbert
e971b7225b rps: Infrastructure in __skb_get_rxhash for deep inspection
Basics for looking for ports in encapsulated packets in tunnels.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:06:03 -07:00
Tom Herbert
bdeab99191 rps: Add flag to skb to indicate rxhash is based on L4 tuple
The l4_rxhash flag was added to the skb structure to indicate
that the rxhash value was computed over the 4 tuple for the
packet which includes the port information in the encapsulated
transport packet.  This is used by the stack to preserve the
rxhash value in __skb_rx_tunnel.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:06:03 -07:00
Tom Herbert
792df22cd0 rps: Some minor cleanup in get_rps_cpus
Use some variables for clarity and extensibility.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:06:03 -07:00
Eric Dumazet
33d480ce6d net: cleanup some rcu_dereference_raw
RCU api had been completed and rcu_access_pointer() or
rcu_dereference_protected() are better than generic
rcu_dereference_raw()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-12 02:55:28 -07:00
Stephen Hemminger
a9b3cd7f32 rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER
When assigning a NULL value to an RCU protected pointer, no barrier
is needed. The rcu_assign_pointer, used to handle that but will soon
change to not handle the special case.

Convert all rcu_assign_pointer of NULL value.

//smpl
@@ expression P; @@

- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)

// </smpl>

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-02 04:29:23 -07:00
Joe Perches
2d348d1f56 net: Convert struct net_device uc_promisc to bool
No need to use int, its uses are boolean.
May save a few bytes one day.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-25 16:17:35 -07:00
Michał Mirosław
fec30c3381 net: unexport netdev_fix_features()
It is not used anywhere except net/core/dev.c now.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-14 14:44:32 -07:00
Michał Mirosław
1180e7d659 net: cleanup vlan_features setting in register_netdev
vlan_features contains features inherited from underlying device.
NETIF_SOFT_FEATURES are not inherited but belong to the vlan device
itself (ensured in vlan_dev_fix_features()).

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-14 14:41:11 -07:00
Shan Wei
f9d7a1187d net: Add GSO to vlan_features initialization
Just add GSO to vlan_features initialization, and update comments.

When we set offload features, vlan_dev_fix_features() will do more check.
In vlan_dev_fix_features(), final features is decided by
features of real device and vlan_features of real device.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-05 22:34:52 -07:00
Thomas Graf
4e985adaa5 rtnl: provide link dump consistency info
This patch adds a change sequence counter to each net namespace
which is bumped whenever a netdevice is added or removed from
the list. If such a change occurred while a link dump took place,
the dump will have the NLM_F_DUMP_INTR flag set in the first
message which has been interrupted and in all subsequent messages
of the same dump.

Note that links may still be modified or renamed while a dump is
taking place but we can guarantee for userspace to receive a
complete list of links and not miss any.

Testing:
I have added 500 VLAN netdevices to make sure the dump is split
over multiple messages. Then while continuously dumping links in
one process I also continuously deleted and re-added a dummy
netdevice in another process. Multiple dumps per seconds have
had the NLM_F_DUMP_INTR flag set.

I guess we can wait for Johannes patch to hit net-next via the
wireless tree.  I just wanted to give this some testing right away.

Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 15:39:53 -07:00
Paul Gortmaker
56f8a75c17 ip: introduce ip_is_fragment helper inline function
There are enough instances of this:

    iph->frag_off & htons(IP_MF | IP_OFFSET)

that a helper function is probably warranted.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 20:33:34 -07:00
David S. Miller
9f6ec8d697 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn-rxon.c
	drivers/net/wireless/rtlwifi/pci.c
	net/netfilter/ipvs/ip_vs_core.c
2011-06-20 22:29:08 -07:00
Jiri Pirko
0b5c9db1b1 vlan: Fix the ingress VLAN_FLAG_REORDER_HDR check
Testing of VLAN_FLAG_REORDER_HDR does not belong in vlan_untag
but rather in vlan_do_receive.  Otherwise the vlan header
will not be properly put on the packet in the case of
vlan header accelleration.

As we remove the check from vlan_check_reorder_header
rename it vlan_reorder_header to keep the naming clean.

Fix up the skb->pkt_type early so we don't look at the packet
after adding the vlan tag, which guarantees we don't goof
and look at the wrong field.

Use a simple if statement instead of a complicated switch
statement to decided that we need to increment rx_stats
for a multicast packet.

Hopefully at somepoint we will just declare the case where
VLAN_FLAG_REORDER_HDR is cleared as unsupported and remove
the code.  Until then this keeps it working correctly.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-11 16:15:50 -07:00
Alexander Duyck
bff55273f9 v2 ethtool: remove support for ETHTOOL_GRXNTUPLE
This change is meant to remove all support for displaying an ntuple as
strings via ETHTOOL_GRXNTUPLE.  The reason for this change is due to the
fact that multiple issues have been found including:
 - Multiple buffer overruns for strings being displayed.
 - Incorrect filters displayed, cleared filters with ring of -2 are displayed
 - Setting get_rx_ntuple displays no rules if defined.
 - Endianess wrong on displayed values.
 - Hard limit of 1024 filters makes display functionality extremely limited

The only driver that had supported this interface was ixgbe.  Since it no
longer uses the interface and due to the issues mentioned above I am
submitting this patch to remove it.

v2:
Updated based on comments from Ben Hutchings
 - Left ETH_SS_NTUPLE_FILTERS in code but commented on it being deprecated
 - Removed ethtool_rx_ntuple_list and ethtool_rx_ntuple_flow_spec_container
 - Left ETHTOOL_GRXNTUPLE but commented it as deprecated

Also cleaned up set_rx_ntuple since there is no flow spec container to
maintain we can drop all the code for the alloc and free of it and just
return ops->set_rx_ntuple().
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-08 16:45:31 -07:00
Heiko Carstens
264524d5e5 net: cpu offline cause napi stall
Frank Blaschka reported :
<quote>
  During heavy network load we turn off/on cpus.
  Sometimes this causes a stall on the network device.
  Digging into the dump I found out following:

  napi is scheduled but does not run. From the I/O buffers
  and the napi state I see napi/rx_softirq processing has stopped
  because the budget was reached. napi stays in the
  softnet_data poll_list and the rx_softirq was raised again.

  I assume at this time the cpu offline comes in,
  the rx softirq is raised/moved to another cpu but napi stays in the
  poll_list of the softnet_data of the now offline cpu.

  Reviewing dev_cpu_callback (net/core/dev.c) I did not find the
  poll_list is transfered to the new cpu.
</quote>

This patch is a straightforward implementation of Frank suggestion :

Transfert poll_list and trigger NET_RX_SOFTIRQ on new cpu.

Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-07 01:01:22 -07:00
David S. Miller
3019de124b net: Rework netdev_drivername() to avoid warning.
This interface uses a temporary buffer, but for no real reason.
And now can generate warnings like:

net/sched/sch_generic.c: In function dev_watchdog
net/sched/sch_generic.c:254:10: warning: unused variable drivername

Just return driver->name directly or "".

Reported-by: Connor Hansen <cmdkhh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-06 16:41:33 -07:00
Koki Sanagi
ec764bf083 net: tracepoint of net_dev_xmit sees freed skb and causes panic
Because there is a possibility that skb is kfree_skb()ed and zero cleared
after ndo_start_xmit, we should not see the contents of skb like skb->len and
skb->dev->name after ndo_start_xmit. But trace_net_dev_xmit does that
and causes panic by NULL pointer dereference.
This patch fixes trace_net_dev_xmit not to see the contents of skb directly.

If you want to reproduce this panic,

1. Get tracepoint of net_dev_xmit on
2. Create 2 guests on KVM
2. Make 2 guests use virtio_net
4. Execute netperf from one to another for a long time as a network burden
5. host will panic(It takes about 30 minutes)

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-02 14:06:31 -07:00
Neil Horman
f11970e383 net: make dev_disable_lro use physical device if passed a vlan dev (v2)
If the device passed into dev_disable_lro is a vlan, then repoint the dev
poniter so that we actually modify the underlying physical device.

Signed-of-by: Neil Horman <nhorman@tuxdriver.com>
CC: davem@davemloft.net
CC: bhutchings@solarflare.com

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-25 17:55:25 -04:00
Eric Dumazet
be3fc413da net: use synchronize_rcu_expedited()
synchronize_rcu() is very slow in various situations (HZ=100,
CONFIG_NO_HZ=y, CONFIG_PREEMPT=n)

Extract from my (mostly idle) 8 core machine :

 synchronize_rcu() in 99985 us
 synchronize_rcu() in 79982 us
 synchronize_rcu() in 87612 us
 synchronize_rcu() in 79827 us
 synchronize_rcu() in 109860 us
 synchronize_rcu() in 98039 us
 synchronize_rcu() in 89841 us
 synchronize_rcu() in 79842 us
 synchronize_rcu() in 80151 us
 synchronize_rcu() in 119833 us
 synchronize_rcu() in 99858 us
 synchronize_rcu() in 73999 us
 synchronize_rcu() in 79855 us
 synchronize_rcu() in 79853 us

When we hold RTNL mutex, we would like to spend some cpu cycles but not
block too long other processes waiting for this mutex.

We also want to setup/dismantle network features as fast as possible at
boot/shutdown time.

This patch makes synchronize_net() call the expedited version if RTNL is
locked.

synchronize_rcu_expedited() typical delay is about 20 us on my machine.

 synchronize_rcu_expedited() in 18 us
 synchronize_rcu_expedited() in 18 us
 synchronize_rcu_expedited() in 18 us
 synchronize_rcu_expedited() in 18 us
 synchronize_rcu_expedited() in 20 us
 synchronize_rcu_expedited() in 16 us
 synchronize_rcu_expedited() in 20 us
 synchronize_rcu_expedited() in 18 us
 synchronize_rcu_expedited() in 18 us

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Ben Greear <greearb@candelatech.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-24 13:26:12 -04:00
Eric Dumazet
6df427fe8c net: remove synchronize_net() from netdev_set_master()
In the old days, we used to access dev->master in __netif_receive_skb()
in a rcu_read_lock section.

So one synchronize_net() call was needed in netdev_set_master() to make
sure another cpu could not use old master while/after we release it.

We now use netdev_rx_handler infrastructure and added one
synchronize_net() call in bond_release()/bond_release_all()

Remove the obsolete synchronize_net() from netdev_set_master() and add
one in bridge del_nbp() after its netdev_rx_handler_unregister() call.

This makes enslave -d a bit faster.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jiri Pirko <jpirko@redhat.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-22 21:01:20 -04:00
Eric Dumazet
449f454426 macvlan: remove one synchronize_rcu() call
When one macvlan device is dismantled, we can avoid one
synchronize_rcu() call done after deletion from hash list, since caller
will perform a synchronize_net() call after its ndo_stop() call.

Add a new netdev->dismantle field to signal this dismantle intent.

Reduces RTNL hold time.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-20 00:33:18 -04:00
David S. Miller
9cbc94eabb Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/vmxnet3/vmxnet3_ethtool.c
	net/core/dev.c
2011-05-17 17:33:11 -04:00
Michael S. Tsirkin
604ae14ffb net: Change netdev_fix_features messages loglevel
Cool, how about we make 'Features changed' debug as well?
This way userspace can't fill up the log just by tweaking tun features
with an ioctl.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-17 15:44:10 -04:00
Eric Dumazet
372b231201 net: use hlist_del_rcu() in dev_change_name()
Using plain hlist_del() in dev_change_name() is wrong since a
concurrent reader can crash trying to dereference LIST_POISON1.

Bug introduced in commit 72c9528bab (net: Introduce
dev_get_by_name_rcu())

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-17 13:56:59 -04:00
Michał Mirosław
6f404e441d net: Change netdev_fix_features messages loglevel
Those reduced to DEBUG can possibly be triggered by unprivileged processes
and are nothing exceptional. Illegal checksum combinations can only be
caused by driver bug, so promote those messages to WARN.

Since GSO without SG will now only cause DEBUG message from
netdev_fix_features(), remove the workaround from register_netdevice().

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-16 15:14:21 -04:00
Peter Pan(潘卫平)
0696c3a8ac net:set valid name before calling ndo_init()
In commit 1c5cae815d (net: call dev_alloc_name from register_netdevice),
a bug of bonding was involved, see example 1 and 2.

In register_netdevice(), the name of net_device is not valid until
dev_get_valid_name() is called. But dev->netdev_ops->ndo_init(that is
bond_init) is called before dev_get_valid_name(),
and it uses the invalid name of net_device.

I think register_netdevice() should make sure that the name of net_device is
valid before calling ndo_init().

example 1:
modprobe bonding
ls  /proc/net/bonding/bond%d

ps -eLf
root      3398     2  3398  0    1 21:34 ?        00:00:00 [bond%d]

example 2:
modprobe bonding max_bonds=3

[  170.100292] bonding: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
[  170.101090] bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
[  170.102469] ------------[ cut here ]------------
[  170.103150] WARNING: at /home/pwp/net-next-2.6/fs/proc/generic.c:586 proc_register+0x126/0x157()
[  170.104075] Hardware name: VirtualBox
[  170.105065] proc_dir_entry 'bonding/bond%d' already registered
[  170.105613] Modules linked in: bonding(+) sunrpc ipv6 uinput microcode ppdev parport_pc parport joydev e1000 pcspkr i2c_piix4 i2c_core [last unloaded: bonding]
[  170.108397] Pid: 3457, comm: modprobe Not tainted 2.6.39-rc2+ #14
[  170.108935] Call Trace:
[  170.109382]  [<c0438f3b>] warn_slowpath_common+0x6a/0x7f
[  170.109911]  [<c051a42a>] ? proc_register+0x126/0x157
[  170.110329]  [<c0438fc3>] warn_slowpath_fmt+0x2b/0x2f
[  170.110846]  [<c051a42a>] proc_register+0x126/0x157
[  170.111870]  [<c051a4dd>] proc_create_data+0x82/0x98
[  170.112335]  [<f94e6af6>] bond_create_proc_entry+0x3f/0x73 [bonding]
[  170.112905]  [<f94dd806>] bond_init+0x77/0xa5 [bonding]
[  170.113319]  [<c0721ac6>] register_netdevice+0x8c/0x1d3
[  170.113848]  [<f94e0e30>] bond_create+0x6c/0x90 [bonding]
[  170.114322]  [<f94f4763>] bonding_init+0x763/0x7b1 [bonding]
[  170.114879]  [<c0401240>] do_one_initcall+0x76/0x122
[  170.115317]  [<f94f4000>] ? 0xf94f3fff
[  170.115799]  [<c0463f1e>] sys_init_module+0x1286/0x140d
[  170.116879]  [<c07c6d9f>] sysenter_do_call+0x12/0x28
[  170.117404] ---[ end trace 64e4fac3ae5fff1a ]---
[  170.117924] bond%d: Warning: failed to register to debugfs
[  170.128728] ------------[ cut here ]------------
[  170.129360] WARNING: at /home/pwp/net-next-2.6/fs/proc/generic.c:586 proc_register+0x126/0x157()
[  170.130323] Hardware name: VirtualBox
[  170.130797] proc_dir_entry 'bonding/bond%d' already registered
[  170.131315] Modules linked in: bonding(+) sunrpc ipv6 uinput microcode ppdev parport_pc parport joydev e1000 pcspkr i2c_piix4 i2c_core [last unloaded: bonding]
[  170.133731] Pid: 3457, comm: modprobe Tainted: G        W   2.6.39-rc2+ #14
[  170.134308] Call Trace:
[  170.134743]  [<c0438f3b>] warn_slowpath_common+0x6a/0x7f
[  170.135305]  [<c051a42a>] ? proc_register+0x126/0x157
[  170.135820]  [<c0438fc3>] warn_slowpath_fmt+0x2b/0x2f
[  170.137168]  [<c051a42a>] proc_register+0x126/0x157
[  170.137700]  [<c051a4dd>] proc_create_data+0x82/0x98
[  170.138174]  [<f94e6af6>] bond_create_proc_entry+0x3f/0x73 [bonding]
[  170.138745]  [<f94dd806>] bond_init+0x77/0xa5 [bonding]
[  170.139278]  [<c0721ac6>] register_netdevice+0x8c/0x1d3
[  170.139828]  [<f94e0e30>] bond_create+0x6c/0x90 [bonding]
[  170.140361]  [<f94f4763>] bonding_init+0x763/0x7b1 [bonding]
[  170.140927]  [<c0401240>] do_one_initcall+0x76/0x122
[  170.141494]  [<f94f4000>] ? 0xf94f3fff
[  170.141975]  [<c0463f1e>] sys_init_module+0x1286/0x140d
[  170.142463]  [<c07c6d9f>] sysenter_do_call+0x12/0x28
[  170.142974] ---[ end trace 64e4fac3ae5fff1b ]---
[  170.144949] bond%d: Warning: failed to register to debugfs

Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Reviewed-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 16:49:49 -04:00
Michał Mirosław
afe12cc86b net: introduce netdev_change_features()
It will be needed by bonding and other drivers changing vlan_features
after ndo_init callback.

As a bonus, this includes kernel-doc for netdev_update_features().

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-12 18:40:56 -04:00
David S. Miller
3c709f8fb4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-3.6
Conflicts:
	drivers/net/benet/be_main.c
2011-05-11 14:26:58 -04:00
Eric Dumazet
e14a599335 net: dev_close() should check IFF_UP
Commit 443457242b (factorize sync-rcu call in
unregister_netdevice_many) mistakenly removed one test from dev_close()

Following actions trigger a BUG :

modprobe bonding
modprobe dummy
ifconfig bond0 up
ifenslave bond0 dummy0
rmmod dummy

dev_close() must not close a non IFF_UP device.

With help from Frank Blaschka and Einar EL Lueck

Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Reported-by: Einar EL Lueck <ELELUECK@de.ibm.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 15:03:33 -07:00
David S. Miller
7143b7d412 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/tg3.c
2011-05-05 14:59:02 -07:00
Jiri Pirko
1c5cae815d net: call dev_alloc_name from register_netdevice
Force dev_alloc_name() to be called from register_netdevice() by
dev_get_valid_name(). That allows to remove multiple explicit
dev_alloc_name() calls.

The possibility to call dev_alloc_name in advance remains.

This also fixes veth creation regresion caused by
84c49d8c3e

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-05 10:57:45 -07:00
Lifeng Sun
41c31f318a networking: inappropriate ioctl operation should return ENOTTY
ioctl() calls against a socket with an inappropriate ioctl operation
are incorrectly returning EINVAL rather than ENOTTY:

  [ENOTTY]
      Inappropriate I/O control operation.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=33992

Signed-off-by: Lifeng Sun <lifongsun@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-02 15:41:29 -07:00
David Decotigny
8ae6daca85 ethtool: Call ethtool's get/set_settings callbacks with cleaned data
This makes sure that when a driver calls the ethtool's
get/set_settings() callback of another driver, the data passed to it
is clean. This guarantees that speed_hi will be zeroed correctly if
the called callback doesn't explicitely set it: we are sure we don't
get a corrupted speed from the underlying driver. We also take care of
setting the cmd field appropriately (ETHTOOL_GSET/SSET).

This applies to dev_ethtool_get_settings(), which now makes sure it
sets up that ethtool command parameter correctly before passing it to
drivers. This also means that whoever calls dev_ethtool_get_settings()
does not have to clean the ethtool command parameter. This function
also becomes an exported symbol instead of an inline.

All drivers visible to make allyesconfig under x86_64 have been
updated.

Signed-off-by: David Decotigny <decot@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-29 14:01:30 -07:00
Michał Mirosław
1742f183fc net: fix netdev_increment_features()
Simplify and fix netdev_increment_features() to conform to what is
stated in netdevice.h comments about NETIF_F_ONE_FOR_ALL.
Include FCoE segmentation and VLAN-challedged flags in computation.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 13:33:07 -07:00
Jiri Pirko
3aba891dde bonding: move processing of recv handlers into handle_frame()
Since now when bonding uses rx_handler, all traffic going into bond
device goes thru bond_handle_frame. So there's no need to go back into
bonding code later via ptype handlers. This patch converts
original ptype handlers into "bonding receive probes". These functions
are called from bond_handle_frame and they are registered per-mode.

Note that vlan packets are also handled because they are always untagged
thanks to vlan_untag()

Note that this also allows arpmon for eth-bond-bridge-vlan topology.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-25 12:00:30 -07:00
Michał Mirosław
22d5969fb4 net: make WARN_ON in dev_disable_lro() useful
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-25 11:56:27 -07:00
Eric Dumazet
b71d1d426d inet: constify ip headers and in6_addr
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-22 11:04:14 -07:00
David S. Miller
e1943424e4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x_ethtool.c
2011-04-19 00:21:33 -07:00
Ben Hutchings
31d8b9e099 net: Disable NETIF_F_TSO_ECN when TSO is disabled
NETIF_F_TSO_ECN has no effect when TSO is disabled; this just means
that feature state will be accurately reported to user-space.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-12 19:29:45 -07:00
Ben Hutchings
ea2d36883c net: Disable all TSO features when SG is disabled
The feature flags NETIF_F_TSO and NETIF_F_TSO6 independently enable
TSO for IPv4 and IPv6 respectively.  However, the test in
netdev_fix_features() and its predecessor functions was never updated
to check for NETIF_F_TSO6, possibly because it was originally proposed
that TSO for IPv6 would be dependent on both feature flags.

Now that these feature flags can be changed independently from
user-space and we depend on netdev_fix_features() to fix invalid
feature combinations, it's important to disable them both if
scatter-gather is disabled.  Also disable NETIF_F_TSO_ECN so
user-space sees all TSO features as disabled.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-12 19:29:45 -07:00
Michał Mirosław
872674858f net: add RTNL_ASSERT in __netdev_update_features()
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-12 14:36:07 -07:00
Jiri Pirko
bcc6d47903 net: vlan: make non-hw-accel rx path similar to hw-accel
Now there are 2 paths for rx vlan frames. When rx-vlan-hw-accel is
enabled, skb is untagged by NIC, vlan_tci is set and the skb gets into
vlan code in __netif_receive_skb - vlan_hwaccel_do_receive.

For non-rx-vlan-hw-accel however, tagged skb goes thru whole
__netif_receive_skb, it's untagged in ptype_base hander and reinjected

This incosistency is fixed by this patch. Vlan untagging happens early in
__netif_receive_skb so the rest of code (ptype_all handlers, rx_handlers)
see the skb like it was untagged by hw.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

v1->v2:
	remove "inline" from vlan_core.c functions
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-12 14:15:19 -07:00
David S. Miller
1c01a80cfe Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/smsc911x.c
2011-04-11 13:44:25 -07:00
Linus Torvalds
42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Tom Herbert
c6e1a0d12c net: Allow no-cache copy from user on transmit
This patch uses __copy_from_user_nocache on transmit to bypass data
cache for a performance improvement.  skb_add_data_nocache and
skb_copy_to_page_nocache can be called by sendmsg functions to use
this feature, initial support is in tcp_sendmsg.  This functionality is
configurable per device using ethtool.

Presumably, this feature would only be useful when the driver does
not touch the data.  The feature is turned on by default if a device
indicates that it does some form of checksum offload; it is off by
default for devices that do no checksum offload or indicate no checksum
is necessary.  For the former case copy-checksum is probably done
anyway, in the latter case the device is likely loopback in which case
the no cache copy is probably not beneficial.

This patch was tested using 200 instances of netperf TCP_RR with
1400 byte request and one byte reply.  Platform is 16 core AMD x86.

No-cache copy disabled:
   672703 tps, 97.13% utilization
   50/90/99% latency:244.31 484.205 1028.41

No-cache copy enabled:
   702113 tps, 96.16% utilization,
   50/90/99% latency 238.56 467.56 956.955

Using 14000 byte request and response sizes demonstrate the
effects more dramatically:

No-cache copy disabled:
   79571 tps, 34.34 %utlization
   50/90/95% latency 1584.46 2319.59 5001.76

No-cache copy enabled:
   83856 tps, 34.81% utilization
   50/90/95% latency 2508.42 2622.62 2735.88

Note especially the effect on latency tail (95th percentile).

This seems to provide a nice performance improvement and is
consistent in the tests I ran.  Presumably, this would provide
the greatest benfits in the presence of an application workload
stressing the cache and a lot of transmit data happening.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-04 22:30:30 -07:00
Michał Mirosław
6cb6a27c45 net: Call netdev_features_change() from netdev_update_features()
Issue FEAT_CHANGE notification when features are changed by
netdev_update_features().  This will allow changes made by extra constraints
on e.g. MTU change to be properly propagated like changes via ethtool.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-02 22:48:47 -07:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Daniel Lezcano
79b569f0ec netdev: fix mtu check when TSO is enabled
In case the device where is coming from the packet has TSO enabled,
we should not check the mtu size value as this one could be bigger
than the expected value.

This is the case for the macvlan driver when the lower device has
TSO enabled. The macvlan inherit this feature and forward the packets
without fragmenting them. Then the packets go through dev_forward_skb
and are dropped. This patch fix this by checking TSO is not enabled
when we want to check the mtu size.

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-30 02:42:17 -07:00
stephen hemminger
edf947f100 bridge: notify applications if address of bridge device changes
The mac address of the bridge device may be changed when a new interface
is added to the bridge. If this happens, then the bridge needs to call
the network notifiers to tickle any other systems that care. Since bridge
can be a module, this also means exporting the notifier function.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-27 23:35:02 -07:00
Amerigo Wang
3b261ade42 net: remove useless comments in net/core/dev.c
The code itself can explain what it is doing, no need these comments.

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-27 23:34:59 -07:00
Michał Mirosław
27660515a2 net: implement dev_disable_lro() hw_features compatibility
Implement compatibility with new hw_features for dev_disable_lro().
This is a transition path - dev_disable_lro() should be later
integrated into netdev_fix_features() after all drivers are converted.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22 01:00:26 -07:00
Jiri Pirko
8a4eb5734e net: introduce rx_handler results and logic around that
This patch allows rx_handlers to better signalize what to do next to
it's caller. That makes skb->deliver_no_wcard no longer needed.

kernel-doc for rx_handler_result is taken from Nicolas' patch.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-16 12:53:54 -07:00
David S. Miller
33175d84ee Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x_cmn.c
2011-03-10 14:26:00 -08:00
Vasiliy Kulikov
8909c9ad8f net: don't allow CAP_NET_ADMIN to load non-netdev kernel modules
Since a8f80e8ff9 any process with
CAP_NET_ADMIN may load any module from /lib/modules/.  This doesn't mean
that CAP_NET_ADMIN is a superset of CAP_SYS_MODULE as modules are
limited to /lib/modules/**.  However, CAP_NET_ADMIN capability shouldn't
allow anybody load any module not related to networking.

This patch restricts an ability of autoloading modules to netdev modules
with explicit aliases.  This fixes CVE-2011-1019.

Arnd Bergmann suggested to leave untouched the old pre-v2.6.32 behavior
of loading netdev modules by name (without any prefix) for processes
with CAP_SYS_MODULE to maintain the compatibility with network scripts
that use autoloading netdev modules by aliases like "eth0", "wlan0".

Currently there are only three users of the feature in the upstream
kernel: ipip, ip_gre and sit.

    root@albatros:~# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) --
    root@albatros:~# grep Cap /proc/$$/status
    CapInh:	0000000000000000
    CapPrm:	fffffff800001000
    CapEff:	fffffff800001000
    CapBnd:	fffffff800001000
    root@albatros:~# modprobe xfs
    FATAL: Error inserting xfs
    (/lib/modules/2.6.38-rc6-00001-g2bf4ca3/kernel/fs/xfs/xfs.ko): Operation not permitted
    root@albatros:~# lsmod | grep xfs
    root@albatros:~# ifconfig xfs
    xfs: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep xfs
    root@albatros:~# lsmod | grep sit
    root@albatros:~# ifconfig sit
    sit: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep sit
    root@albatros:~# ifconfig sit0
    sit0      Link encap:IPv6-in-IPv4
	      NOARP  MTU:1480  Metric:1

    root@albatros:~# lsmod | grep sit
    sit                    10457  0
    tunnel4                 2957  1 sit

For CAP_SYS_MODULE module loading is still relaxed:

    root@albatros:~# grep Cap /proc/$$/status
    CapInh:	0000000000000000
    CapPrm:	ffffffffffffffff
    CapEff:	ffffffffffffffff
    CapBnd:	ffffffffffffffff
    root@albatros:~# ifconfig xfs
    xfs: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep xfs
    xfs                   745319  0

Reference: https://lkml.org/lkml/2011/2/24/203

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: James Morris <jmorris@namei.org>
2011-03-10 10:25:19 +11:00
Jiri Pirko
e3f48d37cf net: allow handlers to be processed for orig_dev
This was there before, I forgot about this. Allows deliveries to
ptype_base handlers registered for orig_dev. I presume this is still
desired.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-07 15:37:16 -08:00
David S. Miller
63d8ea7f93 net: Forgot to commit net/core/dev.c part of Jiri's ->rx_handler patch.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-28 10:48:59 -08:00
Michał Mirosław
14d1232f49 net: avoid initial "Features changed" message
Avoid "Features changed" message and ndo_set_features call on device
registration caused by automatic enabling of GSO and GRO. Driver should
have enabled hardware offloads it set in features, so the ndo_set_features()
is not needed at registration time.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 14:23:31 -08:00
Michał Mirosław
8e9b59b219 Fix "(unregistered net_device): Features changed" message
Fix netdev_update_features() messages on register time by moving
the call further in register_netdevice(). When
netdev->reg_state != NETREG_REGISTERED, netdev_name() returns
"(unregistered netdevice)" even if the dev's name is already filled.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 14:23:31 -08:00
David S. Miller
2a3bcfdde6 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next-2.6 2011-02-22 10:21:36 -08:00
David S. Miller
da935c66ba Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	Documentation/feature-removal-schedule.txt
	drivers/net/e1000e/netdev.c
	net/xfrm/xfrm_policy.c
2011-02-19 19:17:35 -08:00
Eric Dumazet
ceaaec98ad net: deinit automatic LIST_HEAD
commit 9b5e383c11 (net: Introduce
unregister_netdevice_many()) left an active LIST_HEAD() in
rollback_registered(), with possible memory corruption.

Even if device is freed without touching its unreg_list (and therefore
touching the previous memory location holding LISTE_HEAD(single), better
close the bug for good, since its really subtle.

(Same fix for default_device_exit_batch() for completeness)

Reported-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Eric W. Biderman <ebiderman@xmission.com>
Tested-by: Eric W. Biderman <ebiderman@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Octavian Purdila <opurdila@ixiacom.com>
CC: stable <stable@kernel.org> [.33+]
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-18 11:49:36 -08:00
Linus Torvalds
f87e6f4793 net: dont leave active on stack LIST_HEAD
Eric W. Biderman and Michal Hocko reported various memory corruptions
that we suspected to be related to a LIST head located on stack, that
was manipulated after thread left function frame (and eventually exited,
so its stack was freed and reused).

Eric Dumazet suggested the problem was probably coming from commit
443457242b (net: factorize
sync-rcu call in unregister_netdevice_many)

This patch fixes __dev_close() and dev_close() to properly deinit their
respective LIST_HEAD(single) before exiting.

References: https://lkml.org/lkml/2011/2/16/304
References: https://lkml.org/lkml/2011/2/14/223

Reported-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Eric W. Biderman <ebiderman@xmission.com>
Tested-by: Eric W. Biderman <ebiderman@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-18 11:49:35 -08:00
Michał Mirosław
5455c6998d net: Introduce new feature setting ops
This introduces a new framework to handle device features setting.
It consists of:
  - new fields in struct net_device:
	+ hw_features - features that hw/driver supports toggling
	+ wanted_features - features that user wants enabled, when possible
  - new netdev_ops:
	+ feat = ndo_fix_features(dev, feat) - API checking constraints for
		enabling features or their combinations
	+ ndo_set_features(dev) - API updating hardware state to match
		changed dev->features
  - new ethtool commands:
	+ ETHTOOL_GFEATURES/ETHTOOL_SFEATURES: get/set dev->wanted_features
		and trigger device reconfiguration if resulting dev->features
		changed
	+ ETHTOOL_GSTRINGS(ETH_SS_FEATURES): get feature bits names (meaning)

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 14:16:33 -08:00
Michał Mirosław
212b573f55 ethtool: enable GSO and GRO by default
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 14:16:32 -08:00
Ben Hutchings
69a19ee60d net: RPS: Make hardware-accelerated RFS conditional on NETIF_F_NTUPLE
For testing and debugging purposes it is useful to be able to disable
hardware acceleration of RFS without disabling RFS altogether.  Since
this is a similar feature to 'n-tuple' flow steering through the
ethtool API, test the same feature flag that controls that.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2011-02-15 20:36:11 +00:00
David S. Miller
f878b995b0 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next-2.6 2011-02-15 12:25:19 -08:00
Ben Hutchings
5c56580b74 net: Adjust TX queue kobjects if number of queues changes during unregister
If the root qdisc for a net device is mqprio, and the driver's
ndo_setup_tc() operation dynamically adds and remvoes TX queues,
netif_set_real_num_tx_queues() will be called during device
unregistration to remove the extra TX queues when the qdisc is
destroyed.  Currently this causes the corresponding kobjects
to be leaked, and the device's reference count never drops to 0.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2011-02-15 19:45:33 +00:00
Jiri Pirko
1765a57533 net: make dev->master general
dev->master is now tightly connected to bonding driver. This patch makes
this pointer more general and ready to be used by others.

 - netdev_set_master() - bond specifics moved to new function
   netdev_set_bond_master()
 - introduced netif_is_bond_slave() to check if device is a bonding slave

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-13 10:42:07 -08:00
Jiri Pirko
d59cfde2fb net: remove the unnecessary dance around skb_bond_should_drop
No need to check (master) twice and to drive in and out the header file.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-13 10:42:06 -08:00
David S. Miller
263fb5b1bf Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/e1000e/netdev.c
2011-02-08 17:19:01 -08:00
David S. Miller
8d3bdbd55a net: Fix lockdep regression caused by initializing netdev queues too early.
In commit aa94210411 ("net: init ingress
queue") we moved the allocation and lock initialization of the queues
into alloc_netdev_mq() since register_netdevice() is way too late.

The problem is that dev->type is not setup until the setup()
callback is invoked by alloc_netdev_mq(), and the dev->type is
what determines the lockdep class to use for the locks in the
queues.

Fix this by doing the queue allocation after the setup() callback
runs.

This is safe because the setup() callback is not allowed to make any
state changes that need to be undone on error (memory allocations,
etc.).  It may, however, make state changes that are undone by
free_netdev() (such as netif_napi_add(), which is done by the
ipoib driver's setup routine).

The previous code also leaked a reference to the &init_net namespace
object on RX/TX queue allocation failures.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-08 15:02:50 -08:00
David S. Miller
bd4a6974cc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-02-04 14:28:58 -08:00
Andy Gospodarek
6d152e23ad gro: reset skb_iif on reuse
Like Herbert's change from a few days ago:

66c46d741e gro: Reset dev pointer on reuse

this may not be necessary at this point, but we should still clean up
the skb->skb_iif.  If not we may end up with an invalid valid for
skb->skb_iif when the skb is reused and the check is done in
__netif_receive_skb.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-02 14:53:25 -08:00
Tom Herbert
8587523640 net: Check rps_flow_table when RPS map length is 1
In get_rps_cpu, add check that the rps_flow_table for the device is
NULL when trying to take fast path when RPS map length is one.
Without this, RFS is effectively disabled if map length is one which
is not correct.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-31 16:23:42 -08:00
David S. Miller
5403c8a295 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-01-31 13:13:24 -08:00
Herbert Xu
66c46d741e gro: Reset dev pointer on reuse
On older kernels the VLAN code may zero skb->dev before dropping
it and causing it to be reused by GRO.

Unfortunately we didn't reset skb->dev in that case which causes
the next GRO user to get a bogus skb->dev pointer.

This particular problem no longer happens with the current upstream
kernel due to changes in VLAN processing.

However, for correctness we should still reset the skb->dev pointer
in the GRO reuse function in case a future user does the same thing.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-29 22:36:24 -08:00
Eric Dumazet
ccf434380d net: fix dev_seq_next()
Commit c6d14c8456 (net: Introduce for_each_netdev_rcu() iterator)
added a race in dev_seq_next().

The rcu_dereference() call should be done _before_ testing the end of
list, or we might return a wrong net_device if a concurrent thread
changes net_device list under us.

Note : discovered thanks to a sparse warning :

net/core/dev.c:3919:9: error: incompatible types in comparison expression
(different address spaces)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-27 15:02:56 -08:00
Michał Mirosław
acd1130e87 net: reduce and unify printk level in netdev_fix_features()
Reduce printk() levels to KERN_INFO in netdev_fix_features() as this will
be used by ethtool and might spam dmesg unnecessarily.

This converts the function to use netdev_info() instead of plain printk().

As a side effect, bonding and bridge devices will now log dropped features
on every slave device change.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 15:45:15 -08:00
Michał Mirosław
04ed3e741d net: change netdev->features to u32
Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.

Occurences found by `grep -r` on net/, drivers/net, include/

[ Move features and vlan_features next to each other in
  struct netdev, as per Eric Dumazet's suggestion -DaveM ]

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 15:32:47 -08:00
Michał Mirosław
57422dc530 net: Move check of checksum features to netdev_fix_features()
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 15:29:11 -08:00
Ben Hutchings
c445477d74 net: RPS: Enable hardware acceleration of RFS
Allow drivers for multiqueue hardware with flow filter tables to
accelerate RFS.  The driver must:

1. Set net_device::rx_cpu_rmap to a cpu_rmap of the RX completion
IRQs (in queue order).  This will provide a mapping from CPUs to the
queues for which completions are handled nearest to them.

2. Implement net_device_ops::ndo_rx_flow_steer.  This operation adds
or replaces a filter steering the given flow to the given RX queue, if
possible.

3. Periodically remove filters for which rps_may_expire_flow() returns
true.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 14:53:01 -08:00
David S. Miller
5bdc22a565 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/sched/sch_hfsc.c
	net/sched/sch_htb.c
	net/sched/sch_tbf.c
2011-01-24 14:09:35 -08:00
David S. Miller
e92427b289 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2011-01-24 13:17:06 -08:00
Eric Dumazet
c506653d35 net: arp_ioctl() must hold RTNL
Commit 941666c2e3 "net: RCU conversion of dev_getbyhwaddr() and
arp_ioctl()" introduced a regression, reported by Jamie Heilman.
"arp -Ds 192.168.2.41 eth0 pub" triggered the ASSERT_RTNL() assert
in pneigh_lookup()

Removing RTNL requirement from arp_ioctl() was a mistake, just revert
that part.

Reported-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 13:16:16 -08:00
Eric Dumazet
bb134d2298 net: netif_setup_tc() is static
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-21 13:08:27 -08:00
Eric Dumazet
a2da570d62 net_sched: RCU conversion of stab
This patch converts stab qdisc management to RCU, so that we can perform
the qdisc_calculate_pkt_len() call before getting qdisc lock.

This shortens the lock's held time in __dev_xmit_skb().

This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding lot of
cache misses and so reducing latencies.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 16:59:32 -08:00
Eric Dumazet
3fbd8758b0 net: dev_close_many() is static
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Octavian Purdila <opurdila@ixiacom.com>
Reviewed-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 16:55:30 -08:00
John Fastabend
4f57c087de net: implement mechanism for HW based QOS
This patch provides a mechanism for lower layer devices to
steer traffic using skb->priority to tx queues. This allows
for hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock. While reliably receiving skbs on the correct tx ring
to avoid head of line blocking resulting from shuffling in
the LLD. Finally, all the goodness from txq caching and xps/rps
can still be leveraged.

Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.

By using select_queue for this drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally if admins are expected to build a
qdisc and filter rules to steer traffic this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
Also this approach incurs all the overhead of a qdisc with filters.

With the mechanism in this patch users can set skb priority using
expected methods ie setsockopt() or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case with
single traffic class and all queues in this class everything
works as is until the LLD enables multiple tcs.

To steer the skb we mask out the lower 4 bits of the priority
and allow the hardware to configure upto 15 distinct classes
of traffic. This is expected to be sufficient for most applications
at any rate it is more then the 8021Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.

This in conjunction with a userspace application such as
lldpad can be used to implement 8021Q transmission selection
algorithms one of these algorithms being the extended transmission
selection algorithm currently being used for DCB.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-19 23:31:10 -08:00
Vlad Dogaru
cbda10fa97 net_device: add support for network device groups
Net devices can now be grouped, enabling simpler manipulation from
userspace. This patch adds a group field to the net_device structure, as
well as rtnetlink support to query and modify it.

Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-19 23:31:09 -08:00
Linus Torvalds
1268afe676 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (41 commits)
  sctp: user perfect name for Delayed SACK Timer option
  net: fix can_checksum_protocol() arguments swap
  Revert "netlink: test for all flags of the NLM_F_DUMP composite"
  gianfar: Fix misleading indentation in startup_gfar()
  net/irda/sh_irda: return to RX mode when TX error
  net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan.
  USB CDC NCM: tx_fixup() race condition fix
  ns83820: Avoid bad pointer deref in ns83820_init_one().
  ipv6: Silence privacy extensions initialization
  bnx2x: Update bnx2x version to 1.62.00-4
  bnx2x: Fix AER setting for BCM57712
  bnx2x: Fix BCM84823 LED behavior
  bnx2x: Mark full duplex on some external PHYs
  bnx2x: Fix BCM8073/BCM8727 microcode loading
  bnx2x: LED fix for BCM8727 over BCM57712
  bnx2x: Common init will be executed only once after POR
  bnx2x: Swap BCM8073 PHY polarity if required
  iwlwifi: fix valid chain reading from EEPROM
  ath5k: fix locking in tx_complete_poll_work
  ath9k_hw: do PA offset calibration only on longcal interval
  ...
2011-01-19 20:25:45 -08:00
Eric Dumazet
d402786ea4 net: fix can_checksum_protocol() arguments swap
commit 0363466866 (net offloading: Convert checksums to use
centrally computed features.) mistakenly swapped can_checksum_protocol()
arguments.

This broke IPv6 on bnx2 for instance, on NIC without TCPv6 checksum
offloads.

Reported-by: Hans de Bruin <jmdebruin@xmsnet.nl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-19 14:15:21 -08:00
David S. Miller
a5db219f4c Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-01-18 16:28:31 -08:00
Jesse Gross
6ee400aafb net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan.
In netif_skb_features() we return only the features that are valid for vlans
if we have a vlan packet.  However, we should not mask out NETIF_F_HW_VLAN_TX
since it enables transmission of vlan tags and is obviously valid.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-18 16:13:50 -08:00
Linus Torvalds
d018b6f4f1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
  GRETH: resolve SMP issues and other problems
  GRETH: handle frame error interrupts
  GRETH: avoid writing bad speed/duplex when setting transfer mode
  GRETH: fixed skb buffer memory leak on frame errors
  GRETH: GBit transmit descriptor handling optimization
  GRETH: fix opening/closing
  GRETH: added raw AMBA vendor/device number to match against.
  cassini: Fix build bustage on x86.
  e1000e: consistent use of Rx/Tx vs. RX/TX/rx/tx in comments/logs
  e1000e: update Copyright for 2011
  e1000: Avoid unhandled IRQ
  r8169: keep firmware in memory.
  netdev: tilepro: Use is_unicast_ether_addr helper
  etherdevice.h: Add is_unicast_ether_addr function
  ks8695net: Use default implementation of ethtool_ops::get_link
  ks8695net: Disable non-working ethtool operations
  USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable.
  vxge: Remember to release firmware after upgrading firmware
  netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr
  ipsec: update MAX_AH_AUTH_LEN to support sha512
  ...
2011-01-14 13:25:30 -08:00
Eric Dumazet
1ac9ad1394 net: remove dev_txq_stats_fold()
After recent changes, (percpu stats on vlan/tunnels...), we dont need
anymore per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters.

Only remaining users are ixgbe, sch_teql, gianfar & macvlan :

1) ixgbe can be converted to use existing tx_ring counters.

2) macvlan incremented txq->tx_dropped, it can use the
dev->stats.tx_dropped counter.

3) sch_teql : almost revert ab35cd4b8f (Use net_device internal stats)
    Now we have ndo_get_stats64(), use it, even for "unsigned long"
fields (No need to bring back a struct net_device_stats)

4) gianfar adds a stats structure per tx queue to hold
tx_bytes/tx_packets

This removes a lockdep warning (and possible lockup) in rndis gadget,
calling dev_get_stats() from hard IRQ context.

Ref: http://www.spinics.net/lists/netdev/msg149202.html

Reported-by: Neil Jones <neiljay@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Sandeep Gopalpet <sandeep.kumar@freescale.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-13 21:44:34 -08:00
Linus Torvalds
008d23e485 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  Documentation/trace/events.txt: Remove obsolete sched_signal_send.
  writeback: fix global_dirty_limits comment runtime -> real-time
  ppc: fix comment typo singal -> signal
  drivers: fix comment typo diable -> disable.
  m68k: fix comment typo diable -> disable.
  wireless: comment typo fix diable -> disable.
  media: comment typo fix diable -> disable.
  remove doc for obsolete dynamic-printk kernel-parameter
  remove extraneous 'is' from Documentation/iostats.txt
  Fix spelling milisec -> ms in snd_ps3 module parameter description
  Fix spelling mistakes in comments
  Revert conflicting V4L changes
  i7core_edac: fix typos in comments
  mm/rmap.c: fix comment
  sound, ca0106: Fix assignment to 'channel'.
  hrtimer: fix a typo in comment
  init/Kconfig: fix typo
  anon_inodes: fix wrong function name in comment
  fix comment typos concerning "consistent"
  poll: fix a typo in comment
  ...

Fix up trivial conflicts in:
 - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
 - fs/ext4/ext4.h

Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-13 10:05:56 -08:00
Eric Dumazet
bfe0d0298f net_sched: factorize qdisc stats handling
HTB takes into account skb is segmented in stats updates.
Generalize this to all schedulers.

They should use qdisc_bstats_update() helper instead of manipulating
bstats.bytes and bstats.packets

Add bstats_update() helper too for classes that use
gnet_stats_basic_packed fields.

Note : Right now, TCQ_F_CAN_BYPASS shortcurt can be taken only if no
stab is setup on qdisc.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-10 16:07:54 -08:00
Tom Herbert
36909ea438 net: Add alloc_netdev_mqs function
Added alloc_netdev_mqs function which allows the number of transmit and
receive queues to be specified independenty.  alloc_netdev_mq was
changed to a macro to call the new function.  Also added
alloc_etherdev_mqs with same purpose.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-10 16:05:30 -08:00
Jesse Gross
0363466866 net offloading: Convert checksums to use centrally computed features.
In order to compute the features for other offloads (primarily
scatter/gather), we need to first check the ability of the NIC to
offload the checksum for the packet.  Since we have already computed
this, we can directly use the result instead of figuring it out
again.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:35 -08:00
Jesse Gross
02932ce9e2 net offloading: Convert skb_need_linearize() to use precomputed features.
This switches skb_need_linearize() to use the features that have
been centrally computed.  In doing so, this fixes a problem where
scatter/gather should not be used because the card does not support
checksum offloading on that type of packet.  On device registration
we only check that some form of checksum offloading is available if
scatter/gatther is enabled but we must also check at transmission
time.  Examples of this include IPv6 or vlan packets on a NIC that
only supports IPv4 offloading.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:35 -08:00
Jesse Gross
91ecb63c07 net offloading: Convert dev_gso_segment() to use precomputed features.
This switches dev_gso_segment() to use the device features computed
by the centralized routine.  In doing so, it fixes a problem where
it would always use dev->features, instead of those appropriate
to the number of vlan tags if any are present.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:34 -08:00
Jesse Gross
fc741216db net offloading: Pass features into netif_needs_gso().
Now that there is a single function that can compute the device
features relevant to a packet, we don't want to run it for each
offload.  This converts netif_needs_gso() to take the features
of the device, rather than computing them itself.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:34 -08:00
Jesse Gross
f01a5236bd net offloading: Generalize netif_get_vlan_features().
netif_get_vlan_features() is currently only used by netif_needs_gso(),
so it only concerns itself with GSO features.  However, several other
places also should take into account the contents of the packet when
deciding whether to offload to hardware.  This generalizes the function
to return features about all of the various forms of offloading.  Since
offloads tend to be linked together, this avoids duplicating the logic
in each location (i.e. the scatter/gather code also needs the checksum
logic).

Suggested-by: Michał Mirosław <mirqus@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:33 -08:00
Jesse Gross
9497a0518e net offloading: Accept NETIF_F_HW_CSUM for all protocols.
We currently only have software fallback for one type of checksum: the
TCP/UDP one's complement.  This means that a protocol that uses hardware
offloading for a different type of checksum (FCoE, SCTP) must directly
check the device's features and do the right thing ahead of time.  By
the time we get to dev_can_checksum(), we're only deciding whether to
apply the one algorithm in software or hardware.  NETIF_F_HW_CSUM has the
same capabilities as the software version, so we should always use it if
present.  The primary advantage of this is multiply tagged vlans can use
hardware checksumming.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:33 -08:00
Jiri Kosina
4b7bd36470 Merge branch 'master' into for-next
Conflicts:
	MAINTAINERS
	arch/arm/mach-omap2/pm24xx.c
	drivers/scsi/bfa/bfa_fcpim.c

Needed to update to apply fixes for which the old branch was too
outdated.
2010-12-22 18:57:02 +01:00
Eric Dumazet
70978182d4 net: timestamp cloned packet in dev_queue_xmit_nit
Le vendredi 17 décembre 2010 à 10:26 +0100, Eric Dumazet a écrit :

>
> I think we can add this after latest Changli patch :
>
> He does one skb_clone() before calling the sniffers.
> We could set timestamp on this clone, instead of original skb.
>
> Problem solved.
>

[PATCH net-next-2.6] net: timestamp cloned packet in dev_queue_xmit_nit

Now we do one clone of skb if at least one sniffer might take packet,
we also can do the skb timestamping on the clone and let original packet
unchanged.

This is a generalization of commit 8caf153974 (net: sch_netem: Fix an
inconsistency in ingress netem timestamps.)

This way, we can have a good idea when packets are delivered to our
stack (tcpdump -i ifb0), while a tcpdump on original device gives
timestamps right before ingressing.

This also speedup our stack, avoiding taking timestamps if not needed.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Changli Gao <xiaosuo@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-21 10:50:38 -08:00
Changli Gao
71d9dec24d net: increase skb->users instead of skb_clone()
In dev_queue_xmit_nit(), we have to clone skbs as we need to mangle skbs,
however, we don't need to clone skbs for all the packet_types.

Except for the first packet_type, we increase skb->users instead of
skb_clone().

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-19 21:50:20 -08:00
Michał Mirosław
55508d601d net: Use skb_checksum_start_offset()
Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().

Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 14:43:14 -08:00
Octavian Purdila
443457242b net: factorize sync-rcu call in unregister_netdevice_many
Add dev_close_many and dev_deactivate_many to factorize another
sync-rcu operation on the netdevice unregister path.

$ modprobe dummy numdummies=10000
$ ip link set dev dummy* up
$ time rmmod dummy

Without the patch           With the patch

real    0m 24.63s           real    0m 5.15s
user    0m 0.00s            user    0m 0.00s
sys     0m 6.05s            sys     0m 5.14s

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 14:04:44 -08:00
Changli Gao
b236da6931 net: use NUMA_NO_NODE instead of the magic number -1
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 13:16:06 -08:00
Vladislav Zolotarov
a3d22a68d7 bnx2x: Take the distribution range definition out of skb_tx_hash()
Move the calcualation of the Tx hash for a given hash range into a separate
function and define the skb_tx_hash(), which calculates a Tx hash for a
[0; dev->real_num_tx_queues - 1] hash values range, using this
function (__skb_tx_hash()).

Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 13:15:53 -08:00
Eric Dumazet
15c2d75f49 net: call dev_queue_xmit_nit() after skb_dst_drop()
Avoid some atomic ops on dst refcount, calling dev_queue_xmit_nit()
after skb_dst_drop() in dev_hard_start_xmit().

When queueing a packet into af_packet socket, we drop dst anyway.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-08 10:39:54 -08:00
Eric Dumazet
941666c2e3 net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()
Le dimanche 05 décembre 2010 à 09:19 +0100, Eric Dumazet a écrit :

> Hmm..
>
> If somebody can explain why RTNL is held in arp_ioctl() (and therefore
> in arp_req_delete()), we might first remove RTNL use in arp_ioctl() so
> that your patch can be applied.
>
> Right now it is not good, because RTNL wont be necessarly held when you
> are going to call arp_invalidate() ?

While doing this analysis, I found a refcount bug in llc, I'll send a
patch for net-2.6

Meanwhile, here is the patch for net-next-2.6

Your patch then can be applied after mine.

Thanks

[PATCH] net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()

dev_getbyhwaddr() was called under RTNL.

Rename it to dev_getbyhwaddr_rcu() and change all its caller to now use
RCU locking instead of RTNL.

Change arp_ioctl() to use RCU instead of RTNL locking.

Note: this fix a dev refcount bug in llc

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-08 10:07:24 -08:00
Changli Gao
aa94210411 net: init ingress queue
The dev field of ingress queue is forgot to initialized, then NULL
pointer dereference happens in qdisc_alloc().

Move inits of tx queues to netif_alloc_netdev_queues().

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-08 09:43:27 -08:00
Michał Mirosław
7903264402 net: Fix too optimistic NETIF_F_HW_CSUM features
NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM+NETIF_F_IPV6_CSUM, but
some drivers miss the difference. Fix this and also fix UFO dependency
on checksumming offload as it makes the same mistake in assumptions.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Jon Mason <jon.mason@exar.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-06 12:59:04 -08:00
Eric Dumazet
f2cd2d3e9b net sched: use xps information for qdisc NUMA affinity
Allocate qdisc memory according to NUMA properties of cpus included in
xps map.

To be effective, qdisc should be (re)setup after changes
of /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus

I added a numa_node field in struct netdev_queue, containing NUMA node
if all cpus included in xps_cpus share same node, else -1.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-01 12:47:42 -08:00
Tom Herbert
bf26414510 xps: Add CONFIG_XPS
This patch adds XPS_CONFIG option to enable and disable XPS.  This is
done in the same manner as RPS_CONFIG.  This is also fixes build
failure in XPS code when SMP is not enabled.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-28 18:24:14 -08:00
Tom Herbert
1d24eb4815 xps: Transmit Packet Steering
This patch implements transmit packet steering (XPS) for multiqueue
devices.  XPS selects a transmit queue during packet transmission based
on configuration.  This is done by mapping the CPU transmitting the
packet to a queue.  This is the transmit side analogue to RPS-- where
RPS is selecting a CPU based on receive queue, XPS selects a queue
based on the CPU (previously there was an XPS patch from Eric
Dumazet, but that might more appropriately be called transmit completion
steering).

Each transmit queue can be associated with a number of CPUs which will
use the queue to send packets.  This is configured as a CPU mask on a
per queue basis in:

/sys/class/net/eth<n>/queues/tx-<n>/xps_cpus

The mappings are stored per device in an inverted data structure that
maps CPUs to queues.  In the netdevice structure this is an array of
num_possible_cpu structures where each structure holds and array of
queue_indexes for queues which that CPU can use.

The benefits of XPS are improved locality in the per queue data
structures.  Also, transmit completions are more likely to be done
nearer to the sending thread, so this should promote locality back
to the socket on free (e.g. UDP).  The benefits of XPS are dependent on
cache hierarchy, application load, and other factors.  XPS would
nominally be configured so that a queue would only be shared by CPUs
which are sharing a cache, the degenerative configuration woud be that
each CPU has it's own queue.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR test
with 1 byte req. and resp.

bnx2x on 16 core AMD
   XPS (16 queues, 1 TX queue per CPU)  1234K at 100% CPU
   No XPS (16 queues)                   996K at 100% CPU

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-24 11:44:20 -08:00
Tom Herbert
3853b5841c xps: Improvements in TX queue selection
In dev_pick_tx, don't do work in calculating queue
index or setting
the index in the sock unless the device has more than one queue.  This
allows the sock to be set only with a queue index of a multi-queue
device which is desirable if device are stacked like in a tunnel.

We also allow the mapping of a socket to queue to be changed.  To
maintain in order packet transmission a flag (ooo_okay) has been
added to the sk_buff structure.  If a transport layer sets this flag
on a packet, the transmit queue can be changed for the socket.
Presumably, the transport would set this if there was no possbility
of creating OOO packets (for instance, there are no packets in flight
for the socket).  This patch includes the modification in TCP output
for setting this flag.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-24 11:44:19 -08:00
David S. Miller
6b35308850 net: Export netif_get_vlan_features().
ERROR: "netif_get_vlan_features" [drivers/net/xen-netfront.ko] undefined!

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 20:15:03 -08:00
Tom Herbert
fe8222406c net: Simplify RX queue allocation
This patch move RX queue allocation to alloc_netdev_mq and freeing of
the queues to free_netdev (symmetric to TX queue allocation).  Each
kobject RX queue takes a reference to the queue's device so that the
device can't be freed before all the kobjects have been released-- this
obviates the need for reference counts specific to RX queues.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 10:57:28 -08:00
Tom Herbert
ed9af2e839 net: Move TX queue allocation to alloc_netdev_mq
TX queues are now allocated in alloc_netdev_mq and freed in
free_netdev.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 10:56:54 -08:00
Jesse Gross
58e998c6d2 offloading: Force software GSO for multiple vlan tags.
We currently use vlan_features to check for TSO support if there is
a vlan tag.  However, it's quite likely that the NIC is not able to
do TSO when there is an arbitrary number of tags.  Therefore if there
is more than one tag (in-band or out-of-band), fall back to software
emulation.

Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 09:22:53 -08:00
Jesse Gross
c8d5bcd1af offloading: Support multiple vlan tags in GSO.
We assume that hardware TSO can't support multiple levels of vlan tags
but we allow it to be done.  Therefore, enable GSO to parse these tags
so we can fallback to software.

Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 09:22:53 -08:00
Jesse Gross
e1e78db628 offloading: Make scatter/gather more tolerant of vlans.
When checking if it is necessary to linearize a packet, we currently
use vlan_features if the packet contains either an in-band or out-
of-band vlan tag.  However, in-band tags aren't special in any way
for scatter/gather since they are part of the packet buffer and are
simply more data to DMA.  Therefore, only use vlan_features for out-
of-band tags, which could potentially have some interaction with
scatter/gather.

Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 09:22:52 -08:00
Joe Perches
b194a3674f net/core/dev.c: Update WARN uses
Coalesce long formats.
Add missing newlines.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-09 09:22:31 -08:00
Tom Herbert
df32cc193a net: check queue_index from sock is valid for device
In dev_pick_tx recompute the queue index if the value stored in the
socket is greater than or equal to the number of real queues for the
device.  The saved index in the sock structure is not guaranteed to
be appropriate for the egress device (this could happen on a route
change or in presence of tunnelling).  The result of the queue index
being bad would be to return a bogus queue (crash could prersumably
follow).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-01 12:55:52 -07:00
Uwe Kleine-König
b595076a18 tree-wide: fix comment/printk typos
"gadget", "through", "command", "maintain", "maintain", "controller", "address",
"between", "initiali[zs]e", "instead", "function", "select", "already",
"equal", "access", "management", "hierarchy", "registration", "interest",
"relative", "memory", "offset", "already",

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-01 15:38:34 -04:00
Ben Hutchings
66c68bcc48 net: NETIF_F_HW_CSUM does not imply FCoE CRC offload
NETIF_F_HW_CSUM indicates the ability to update an TCP/IP-style 16-bit
checksum with the checksum of an arbitrary part of the packet data,
whereas the FCoE CRC is something entirely different.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org [2.6.32+]
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-27 11:37:29 -07:00
Ben Hutchings
af1905dbec net: Fix some corner cases in dev_can_checksum()
dev_can_checksum() incorrectly returns true in these cases:

1. The skb has both out-of-band and in-band VLAN tags and the device
   supports checksum offload for the encapsulated protocol but only with
   one layer of encapsulation.
2. The skb has a VLAN tag and the device supports generic checksumming
   but not in conjunction with VLAN encapsulation.

Rearrange the VLAN tag checks to avoid these.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-27 11:37:29 -07:00
Eric Dumazet
6e3f7faf3e rps: add __rcu annotations
Add __rcu annotations to :
	(struct netdev_rx_queue)->rps_map
	(struct netdev_rx_queue)->rps_flow_table
	struct rps_sock_flow_table *rps_sock_flow_table;

And use appropriate rcu primitives.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-25 14:18:27 -07:00
Eric Dumazet
198caeca3e ipv6: ip6_ptr rcu annotations
(struct net_device)->ip6_ptr is rcu protected :

add __rcu annotation and proper rcu primitives.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-25 13:09:43 -07:00
David S. Miller
11a766ce91 net: Increase xmit RECURSION_LIMIT to 10.
Three is definitely too low, and we know from reports that GRE tunnels
stacked as deeply as 37 levels cause stack overflows, so pick some
reasonable value between those two.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-25 12:51:55 -07:00
Linus Torvalds
5f05647dd8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
  bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
  vlan: Calling vlan_hwaccel_do_receive() is always valid.
  tproxy: use the interface primary IP address as a default value for --on-ip
  tproxy: added IPv6 support to the socket match
  cxgb3: function namespace cleanup
  tproxy: added IPv6 support to the TPROXY target
  tproxy: added IPv6 socket lookup function to nf_tproxy_core
  be2net: Changes to use only priority codes allowed by f/w
  tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
  tproxy: added tproxy sockopt interface in the IPV6 layer
  tproxy: added udp6_lib_lookup function
  tproxy: added const specifiers to udp lookup functions
  tproxy: split off ipv6 defragmentation to a separate module
  l2tp: small cleanup
  nf_nat: restrict ICMP translation for embedded header
  can: mcp251x: fix generation of error frames
  can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
  can-raw: add msg_flags to distinguish local traffic
  9p: client code cleanup
  rds: make local functions/variables static
  ...

Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
2010-10-23 11:47:02 -07:00
David S. Miller
2198a10b50 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/core/dev.c
2010-10-21 08:43:05 -07:00
stephen hemminger
d0c2b0d265 napi: unexport napi_reuse_skb
The function napi_reuse_skb is only used inside core.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 04:26:38 -07:00
Ben Greear
d2ed817766 net/core: Allow tagged VLAN packets to flow through VETH devices.
When there are VLANs on a VETH device, the packets being transmitted
through the VETH device may be 4 bytes bigger than MTU.  A check
in dev_forward_skb did not take this into account and so dropped
these packets.

This patch is needed at least as far back as 2.6.34.7 and should
be considered for -stable.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 04:06:29 -07:00
Jesse Gross
3701e51382 vlan: Centralize handling of hardware acceleration.
Currently each driver that is capable of vlan hardware acceleration
must be aware of the vlan groups that are configured and then pass
the stripped tag to a specialized receive function.  This is

different from other types of hardware offload in that it places a
significant amount of knowledge in the driver itself rather keeping
it in the networking core.

This makes vlan offloading function more similarly to other forms
of offloading (such as checksum offloading or TSO) by doing the
following:
* On receive, stripped vlans are passed directly to the network
core, without attempting to check for vlan groups or reconstructing
the header if no group
* vlans are made less special by folding the logic into the main
receive routines
* On transmit, the device layer will add the vlan header in software
if the hardware doesn't support it, instead of spreading that logic
out in upper layers, such as bonding.

There are a number of advantages to this:
* Fixes all bugs with drivers incorrectly dropping vlan headers at once.
* Avoids having to disable VLAN acceleration when in promiscuous mode
(good for bridging since it always puts devices in promiscuous mode).
* Keeps VLAN tag separate until given to ultimate consumer, which
avoids needing to do header reconstruction as in tg3 unless absolutely
necessary.
* Consolidates common code in core networking.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 01:26:53 -07:00
Jesse Gross
7b9c609037 vlan: Enable software emulation for vlan accleration.
Currently users of hardware vlan accleration need to know whether
the device supports it before generating packets.  However, vlan
acceleration will soon be available in a more flexible manner so
knowing ahead of time becomes much more difficult.  This adds
a software fallback path for vlan packets on devices without the
necessary offloading support, similar to other types of hardware
accleration.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 01:26:52 -07:00
Tom Herbert
e6484930d7 net: allocate tx queues in register_netdevice
This patch introduces netif_alloc_netdev_queues which is called from
register_device instead of alloc_netdev_mq.  This makes TX queue
allocation symmetric with RX allocation.  Also, queue locks allocation
is done in netdev_init_one_queue.  Change set_real_num_tx_queues to
fail if requested number < 1 or greater than number of allocated
queues.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-20 02:27:59 -07:00
Tom Herbert
bd25fa7ba5 net: cleanups in RX queue allocation
Clean up in RX queue allocation.  In netif_set_real_num_rx_queues
return error on attempt to set zero queues, or requested number is
greater than number of allocated queues.  In netif_alloc_rx_queues,
do BUG_ON if queue_count is zero.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-20 02:27:59 -07:00
Tom Herbert
55513fb428 net: fail alloc_netdev_mq if queue count < 1
In alloc_netdev_mq fail if requested queue_count < 1.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-20 02:27:58 -07:00
Eric Dumazet
29b4433d99 net: percpu net_device refcount
We tried very hard to remove all possible dev_hold()/dev_put() pairs in
network stack, using RCU conversions.

There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers, mmap af_packet)

We can switch to a percpu refcount implementation, now dynamic per_cpu
infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes
per device.

On x86, dev_hold(dev) code :

before
        lock    incl 0x280(%ebx)
after:
        movl    0x260(%ebx),%eax
        incl    fs:(%eax)

Stress bench :

(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE)

Before:

real    1m1.662s
user    0m14.373s
sys     12m55.960s

After:

real    0m51.179s
user    0m15.329s
sys     10m15.942s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-12 12:35:25 -07:00
Tom Herbert
4315d834c1 net: Fix rxq ref counting
The rx->count reference is used to track reference counts to the
number of rx-queue kobjects created for the device.  This patch
eliminates initialization of the counter in netif_alloc_rx_queues
and instead increments the counter each time a kobject is created.
This is now symmetric with the decrement that is done when an object is
released.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-08 14:34:32 -07:00
Ben Hutchings
4e7f79511e net: Update kernel-doc for netif_set_real_num_rx_queues()
Synchronise the comment with the preceding implementation change.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-08 10:33:39 -07:00
John Fastabend
3d3211ef5c net: netif_set_real_num_rx_queues may cap num_rx_queues at init time
Do not set num_rx_queues in netif_set_real_num_rx_queues() some
drivers will increase the real_num_rx_queues later due to a feature
changes or available interrupts increasing. By setting num_rx_queues
here this ends up creating a cap on the number of rx queues
available.

For example the ixgbe driver sets the max number of queues it intends
to use ever then sets the current number in use with the
netif_set_num_{rx|tx}_queues calls. With the current implementation
the number of rx queues gets limited so when a feature such as DCB
or FCoE is enabled the queues are no longer available.

kobjects will only be allocated for real_num_rx_queues so the waste
in memory is minimal.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-06 23:35:15 -07:00
Eric Dumazet
caf586e5f2 net: add a core netdev->rx_dropped counter
In various situations, a device provides a packet to our stack and we
drop it before it enters protocol stack :
- softnet backlog full (accounted in /proc/net/softnet_stat)
- bad vlan tag (not accounted)
- unknown/unregistered protocol (not accounted)

We can handle a per-device counter of such dropped frames at core level,
and automatically adds it to the device provided stats (rx_dropped), so
that standard tools can be used (ifconfig, ip link, cat /proc/net/dev)

This is a generalization of commit 8990f468a (net: rx_dropped
accounting), thus reverting it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05 14:47:55 -07:00
Eric Dumazet
24824a09e3 net: dynamic ingress_queue allocation
ingress being not used very much, and net_device->ingress_queue being
quite a big object (128 or 256 bytes), use a dynamic allocation if
needed (tc qdisc add dev eth0 ingress ...)

dev_ingress_queue(dev) helper should be used only with RTNL taken.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05 00:23:44 -07:00
Eric Dumazet
bfa5ae63b8 net: rename netdev rx_queue to ingress_queue
There is some confusion with rx_queue name after RPS, and net drivers
private rx_queue fields.

I suggest to rename "struct net_device"->rx_queue to ingress_queue.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-29 13:25:53 -07:00
Eric Dumazet
745e20f1b6 net: add a recursion limit in xmit path
As tunnel devices are going to be lockless, we need to make sure a
misconfigured machine wont enter an infinite loop.

Add a percpu variable, and limit to three the number of stacked xmits.

Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-29 13:23:09 -07:00
Ben Hutchings
62fe0b40ab net: Allow changing number of RX queues after device allocation
For RPS, we create a kobject for each RX queue based on the number of
queues passed to alloc_netdev_mq().  However, drivers generally do not
determine the numbers of hardware queues to use until much later, so
this usually represents the maximum number the driver may use and not
the actual number in use.

For TX queues, drivers can update the actual number using
netif_set_real_num_tx_queues().  Add a corresponding function for RX
queues, netif_set_real_num_rx_queues().

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-27 22:09:49 -07:00
David S. Miller
e40051d134 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/qlcnic/qlcnic_init.c
	net/ipv4/ip_output.c
2010-09-27 01:03:03 -07:00
Eric Dumazet
1b4bf461f0 rps: allocate rx queues in register_netdevice only
Instead of having two places were we allocate dev->_rx, introduce
netif_alloc_rx_queues() helper and call it only from
register_netdevice(), not from alloc_netdev_mq()

Goal is to let drivers change dev->num_rx_queues after allocating netdev
and before registering it.

This also removes a lot of ifdefs in net/core/dev.c

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-26 19:04:07 -07:00
Eric Dumazet
c5256c5123 net: propagate NETIF_F_HIGHDMA to vlans
Automatically allows vlans to get NETIF_F_HIGHDMA if underlying device
supports it.

On 32bit arches (and more precisely if CONFIG_HIGHMEM is enabled), it
can help to reduce cost of illegal_highdma() and __skb_linearize()
calls.

Tested on tg3 , bnx2, bonding, this worked very well.

This is a generalization of a patch provided by Yi Zou & Jeff Kirsher.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-26 18:27:15 -07:00
Ingo Molnar
7ed569206e Merge commit 'v2.6.36-rc5' into perf/core
Merge reason: Pick up the latest fixes in -rc5.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-21 13:55:11 +02:00
David Lamparter
3b27e10555 netns: keep vlan slaves on master netns move
previously, if a vlan master device was moved from one network namespace
to another, all 802.1q and macvlan slaves were deleted.

we can use dev->reg_state to figure out whether dev_change_net_namespace
is happening, since that won't set dev->reg_state NETREG_UNREGISTERING.
so, this changes 8021q and macvlan to ignore NETDEV_UNREGISTER when
reg_state is not NETREG_UNREGISTERING.

Signed-off-by: David Lamparter <equinox@diac24.net>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-17 16:46:04 -07:00
Stephen Rothwell
caeda9b926 net: include inetdevice.h for rcu_dereference_raw api change
rcu_dereference_raw() now needs to know the type of its argument.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-16 21:39:16 -07:00
Brandon Philips
16c3ea785f net: enable GRO by default for vlan devices
Currently vlan devices don't have GRO by default as none of the Ethernet
drivers add NETIF_F_GRO to their vlan_features.

As GRO is a software feature add GRO to dev->vlan_features in
register_netdevice() and let vlan_dev_init() take care that it gets
enabled only when dev->features has NETIF_F_GRO too.

Signed-off-by: Brandon Philips <bphilips@suse.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-15 22:32:29 -07:00
Eric Dumazet
95ae6b228f ipv4: ip_ptr cleanups
dev->ip_ptr is protected by rtnl and rcu.

Yet some places dont use appropriate primitives and/or locking rules.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-15 22:06:05 -07:00
Ingo Molnar
3aabae7d9d Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core 2010-09-15 10:27:31 +02:00
Eric Dumazet
ef885afbf8 net: use rcu_barrier() in rollback_registered_many
netdev_wait_allrefs() waits that all references to a device vanishes.

It currently uses a _very_ pessimistic 250 ms delay between each probe.
Some users reported that no more than 4 devices can be dismantled per
second, this is a pretty serious problem for some setups.

Most of the time, a refcount is about to be released by an RCU callback,
that is still in flight because rollback_registered_many() uses a
synchronize_rcu() call instead of rcu_barrier(). Problem is visible if
number of online cpus is one, because synchronize_rcu() is then a no op.

time to remove 50 ipip tunnels on a UP machine :

before patch : real 11.910s
after patch : real 1.250s

Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: Octavian Purdila <opurdila@ixiacom.com>
Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-14 14:27:29 -07:00
David S. Miller
e548833df8 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/mac80211/main.c
2010-09-09 22:27:33 -07:00
Changli Gao
6febfca98f net: rps: add the shortcut for one rps_cpus
When there is only one rps_cpus, skb_get_rxhash() can be eliminated.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-08 13:10:53 -07:00
Helmut Schaa
deabc772f3 net: fix tx queue selection for bridged devices implementing select_queue
When a net device is implementing the select_queue callback and is part of
a bridge, frames coming from the bridge already have a tx queue associated
to the socket (introduced in commit a4ee3ce329,
"net: Use sk_tx_queue_mapping for connected sockets"). The call to
sk_tx_queue_get will then return the tx queue used by the bridge instead
of calling the select_queue callback.

In case of mac80211 this broke QoS which is implemented by using the
select_queue callback. Furthermore it introduced problems with rt2x00
because frames with the same TID and RA sometimes appeared on different
tx queues which the hw cannot handle correctly.

Fix this by always calling select_queue first if it is available and only
afterwards use the socket tx queue mapping.

Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-07 13:57:20 -07:00
Koki Sanagi
07dc22e729 skb: Add tracepoints to freeing skb
This patch adds tracepoint to consume_skb and add trace_kfree_skb
before __kfree_skb in skb_free_datagram_locked and net_tx_action.
Combinating with tracepoint on dev_hard_start_xmit, we can check
how long it takes to free transmitted packets. And using it, we can
calculate how many packets driver had at that time. It is useful when
a drop of transmitted packet is a problem.

            sshd-6828  [000] 112689.258154: consume_skb: skbaddr=f2d99bb8

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <4C724364.50903@jp.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-09-07 17:51:53 +02:00
Koki Sanagi
cf66ba58b5 netdev: Add tracepoints to netdev layer
This patch adds tracepoint to dev_queue_xmit, dev_hard_start_xmit,
netif_rx and netif_receive_skb. These tracepoints help you to monitor
network driver's input/output.

          <idle>-0     [001] 112447.902030: netif_rx: dev=eth1 skbaddr=f3ef0900 len=84
          <idle>-0     [001] 112447.902039: netif_receive_skb: dev=eth1 skbaddr=f3ef0900 len=84
            sshd-6828  [000] 112447.903257: net_dev_queue: dev=eth4 skbaddr=f3fca538 len=226
            sshd-6828  [000] 112447.903260: net_dev_xmit: dev=eth4 skbaddr=f3fca538 len=226 rc=0

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <4C72431E.3000901@jp.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-09-07 17:51:33 +02:00
Eric Dumazet
c07b68e841 net: dev_add_pack() & __dev_remove_pack() changes
Add a small helper ptype_head() to get the head to manipulate

dev_add_pack() & __dev_remove_pack() can use a spinlock without
blocking BH, since softirq use RCU, and these functions are run from
process context only.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-02 10:12:06 -07:00
Eric Dumazet
86cac58b71 skge: add GRO support
- napi_gro_flush() is exported from net/core/dev.c, to avoid
  an irq_save/irq_restore in the packet receive path.
- use napi_gro_receive() instead of netif_receive_skb()
- use napi_gro_flush() before calling __napi_complete()
- turn on NETIF_F_GRO by default
- Tested on a Marvell 88E8001 Gigabit NIC

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-01 10:57:55 -07:00
Eric Dumazet
40d0802b3e gro: __napi_gro_receive() optimizations
compare_ether_header() can have a special implementation on 64 bit
arches if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.

__napi_gro_receive() and vlan_gro_common() can avoid a conditional
branch to perform device match.

On x86_64, __napi_gro_receive() has now 38 instructions instead of 53

As gcc-4.4.3 still choose to not inline it, add inline keyword to this
performance critical function.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-26 22:03:08 -07:00
David S. Miller
21dc330157 net: Rename skb_has_frags to skb_has_frag_list
SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).

Since skb_has_frags() tests the latter, it's name is confusing
since it sounds more like it's testing the former.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 00:13:46 -07:00
Changli Gao
05532121da net: 802.1q: make vlan_hwaccel_do_receive() return void
vlan_hwaccel_do_receive() always returns 0, so make it return void.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-22 21:03:33 -07:00
David S. Miller
d3c6e7ad09 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-08-21 23:32:24 -07:00
Changli Gao
1003489e06 net: rps: fix the wrong network header pointer
__skb_get_rxhash() was broken after the commit:

 commit bfb564e739
 Author: Krishna Kumar <krkumar2@in.ibm.com>
 Date:   Wed Aug 4 06:15:52 2010 +0000

 core: Factor out flow calculation from get_rps_cpu

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-21 22:54:49 -07:00
Changli Gao
12fcdefb36 net: rps: use proto_ports_offset() to handle the AH message correctly
The SPI isn't at the beginning of an AH message.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19 17:16:23 -07:00
Changli Gao
dbe5775bbc net: rps: skip fragment when computing rxhash
Fragmented IP packets may have no transfer header, so when computing
rxhash, we should skip them.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19 17:10:38 -07:00
Changli Gao
2d47b45951 net: rps: reset network header before calling skb_get_rxhash()
skb_get_rxhash() assumes the network header pointer of the skb is set
properly after the commit:

commit bfb564e739
Author: Krishna Kumar <krkumar2@in.ibm.com>
Date:   Wed Aug 4 06:15:52 2010 +0000

    core: Factor out flow calculation from get_rps_cpu

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19 17:08:37 -07:00
Oliver Hartkopp
2244d07bfa net: simplify flags for tx timestamping
This patch removes the abstraction introduced by the union skb_shared_tx in
the shared skb data.

The access of the different union elements at several places led to some
confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

    http://marc.info/?l=linux-netdev&m=128084897415886&w=2

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19 00:08:30 -07:00
Jarek Poplawski
e5093aec2e net: Fix a memmove bug in dev_gro_receive()
>Xin Xiaohui wrote:
> I looked into the code dev_gro_receive(), found the code here:
> if the frags[0] is pulled to 0, then the page will be released,
> and memmove() frags left.
> Is that right? I'm not sure if memmove do right or not, but
> frags[0].size is never set after memove at least. what I think
> a simple way is not to do anything if we found frags[0].size == 0.
> The patch is as followed.
...

This version of the patch fixes the bug directly in memmove.

Reported-by: "Xin, Xiaohui" <xiaohui.xin@intel.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-17 17:37:28 -07:00
Krishna Kumar
bfb564e739 core: Factor out flow calculation from get_rps_cpu
Factor out flow calculation code from get_rps_cpu, since other
functions can use the same code.

Revisions:

v2 (Ben): Separate flow calcuation out and use in select queue.
v3 (Arnd): Don't re-implement MIN.
v4 (Changli): skb->data points to ethernet header in macvtap, and
	make a fast path. Tested macvtap with this patch.
v5 (Changli):
	- Cache skb->rxhash in skb_get_rxhash
	- macvtap may not have pow(2) queues, so change code for
	  queue selection.
    (Arnd):
	- Use first available queue if all fails.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-16 21:06:24 -07:00
Changli Gao
cece1945bf net: disable preemption before call smp_processor_id()
Although netif_rx() isn't expected to be called in process context with
preemption enabled, it'd better handle this case. And this is why get_cpu()
is used in the non-RPS #ifdef branch. If tree RCU is selected,
rcu_read_lock() won't disable preemption, so preempt_disable() should be
called explictly.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-07 20:35:43 -07:00
Jarek Poplawski
ce9e76c845 net: Fix napi_gro_frags vs netpoll path
The netpoll_rx_on() check in __napi_gro_receive() skips part of the
"common" GRO_NORMAL path, especially "pull:" in dev_gro_receive(),
where at least eth header should be copied for entirely paged skbs.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-05 13:21:25 -07:00
David S. Miller
3578b0c8ab Revert "net: remove zap_completion_queue"
This reverts commit 15e83ed788.

As explained by Johannes Berg, the optimization made here is
invalid.  Or, at best, incomplete.

Not only destructor invocation, but conntract entry releasing
must be executed outside of hw IRQ context.

So just checking "skb->destructor" is insufficient.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-03 00:24:04 -07:00
Changli Gao
a427615e04 net: cleanup inclusion
Commit ab95bfe01f replaces bridge and macvlan
hooks in __netif_receive_skb(), so dev.c doesn't need to include their headers.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-02 22:45:49 -07:00
Stephen Hemminger
de38483010 net: ingress filter message limit
If user misconfigures ingress and causes a redirection loop, don't
overwhelm the log.  This is also a error case so make it unlikely.
Found by inspection, luckily not in real system.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-01 00:33:23 -07:00
David S. Miller
bb7e95c8fd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x_main.c

Merge bnx2x bug fixes in by hand... :-/

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-27 21:01:35 -07:00
Ben Greear
c736eefadb net: dev_forward_skb should call nf_reset
With conn-track zones and probably with different network
namespaces, the netfilter logic needs to be re-calculated
on packet receive.  If the netfilter logic is not reset,
it will not be recalculated properly.  This patch adds
the nf_reset logic to dev_forward_skb.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-25 21:58:46 -07:00
David S. Miller
11fe883936 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/vhost/net.c
	net/bridge/br_device.c

Fix merge conflict in drivers/vhost/net.c with guidance from
Stephen Rothwell.

Revert the effects of net-2.6 commit 573201f36f
since net-next-2.6 has fixes that make bridge netpoll work properly thus
we don't need it disabled.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-20 18:25:24 -07:00
Eric Dumazet
bd27290a59 net: 64bit stats for netdev_queue
Since struct netdev_queue tx_bytes/tx_packets/tx_dropped are already
protected by _xmit_lock, its easy to convert these fields to u64 instead
of unsigned long.
This completes 64bit stats for devices using them (vlan, macvlan, ...)

Strictly, we could avoid the locking in dev_txq_stats_fold() on 64bit
arches, but its slow path and we prefer keep it simple.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-19 09:35:40 -07:00
Richard Cochran
c1f19b51d1 net: support time stamping in phy devices.
This patch adds a new networking option to allow hardware time stamps
from PHY devices. When enabled, likely candidates among incoming and
outgoing network packets are offered to the PHY driver for possible
time stamping. When accepted by the PHY driver, incoming packets are
deferred for later delivery by the driver.

The patch also adds phylib driver methods for the SIOCSHWTSTAMP ioctl
and callbacks for transmit and receive time stamping. Drivers may
optionally implement these functions.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-18 19:15:26 -07:00
Tom Herbert
b0f77d0eae net: fix problem in reading sock TX queue
Fix problem in reading the tx_queue recorded in a socket.  In
dev_pick_tx, the TX queue is read by doing a check with
sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
The problem is that there is not mutual exclusion across these
calls in the socket so it it is possible that the queue in the
sock can be invalidated after sk_tx_queue_recorded is called so
that sk_tx_queue get returns -1, which sets 65535 in queue_index
and thus dev_pick_tx returns 65536 which is a bogus queue and
can cause crash in dev_queue_xmit.

We fix this by only calling sk_tx_queue_get which does the proper
checks.  The interface is that sk_tx_queue_get returns the TX queue
if the sock argument is non-NULL and TX queue is recorded, else it
returns -1.  sk_tx_queue_recorded is no longer used so it can be
completely removed.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-14 20:50:29 -07:00
Eric Dumazet
87fd308cfc net: skb_tx_hash() fix relative to skb_orphan_try()
commit fc6055a5ba (net: Introduce skb_orphan_try()) added early
orphaning of skbs.

This unfortunately added a performance regression in skb_tx_hash() in
case of stacked devices (bonding, vlans, ...)

Since skb->sk is now NULL, we cannot access sk->sk_hash anymore to
spread tx packets to multiple NIC queues on multiqueue devices.

skb_tx_hash() in this case only uses skb->protocol, same value for all
flows.

skb_orphan_try() can copy sk->sk_hash into skb->rxhash and skb_tx_hash()
can use this saved sk_hash value to compute its internal hash value.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-14 15:33:27 -07:00
Ben Hutchings
d77535162e net: Document that dev_get_stats() returns the given pointer
Document that dev_get_stats() returns the same stats pointer it was
given.  Remove const qualification from the returned pointer since the
caller may do what it likes with that structure.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-09 17:41:57 -07:00
Ben Hutchings
3cfde79c6c net: Get rid of rtnl_link_stats64 / net_device_stats union
In commit be1f3c2c02 "net: Enable 64-bit
net device statistics on 32-bit architectures" I redefined struct
net_device_stats so that it could be used in a union with struct
rtnl_link_stats64, avoiding the need for explicit copying or
conversion between the two.  However, this is unsafe because there is
no locking required and no lock consistently held around calls to
dev_get_stats() and use of the statistics structure it returns.

In commit 28172739f0 "net: fix 64 bit
counters on 32 bit arches" Eric Dumazet dealt with that problem by
requiring callers of dev_get_stats() to provide storage for the
result.  This means that the net_device::stats64 field and the padding
in struct net_device_stats are now redundant, so remove them.

Update the comment on net_device_ops::ndo_get_stats64 to reflect its
new usage.

Change dev_txq_stats_fold() to use struct rtnl_link_stats64, since
that is what all its callers are really using and it is no longer
going to be compatible with struct net_device_stats.

Eric Dumazet suggested the separate function for the structure
conversion.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-09 17:41:56 -07:00
David S. Miller
597e608a84 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-07-07 15:59:38 -07:00
Eric Dumazet
28172739f0 net: fix 64 bit counters on 32 bit arches
There is a small possibility that a reader gets incorrect values on 32
bit arches. SNMP applications could catch incorrect counters when a
32bit high part is changed by another stats consumer/provider.

One way to solve this is to add a rtnl_link_stats64 param to all
ndo_get_stats64() methods, and also add such a parameter to
dev_get_stats().

Rule is that we are not allowed to use dev->stats64 as a temporary
storage for 64bit stats, but a caller provided area (usually on stack)

Old drivers (only providing get_stats() method) need no changes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-07 14:58:56 -07:00
Joe Perches
256df2f387 netdevice.h net/core/dev.c: Convert netdev_<level> logging macros to functions
Reduces an x86 defconfig text and data ~2k.
text is smaller, data is larger.

$ size vmlinux*
   text	   data	    bss	    dec	    hex	filename
7198862	 720112	1366288	9285262	 8dae8e	vmlinux
7205273	 716016	1366288	9287577	 8db799	vmlinux.device_h

Uses %pV and struct va_format
Format arguments are verified before printk

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-04 10:40:18 -07:00
John Fastabend
f0796d5c73 net: decreasing real_num_tx_queues needs to flush qdisc
Reducing real_num_queues needs to flush the qdisc otherwise
skbs with queue_mappings greater then real_num_tx_queues can
be sent to the underlying driver.

The flow for this is,

dev_queue_xmit()
	dev_pick_tx()
		skb_tx_hash()  => hash using real_num_tx_queues
		skb_set_queue_mapping()
	...
	qdisc_enqueue_root() => enqueue skb on txq from hash
...
dev->real_num_tx_queues -= n
...
sch_direct_xmit()
	dev_hard_start_xmit()
		ndo_start_xmit(skb,dev) => skb queue set with old hash

skbs are enqueued on the qdisc with skb->queue_mapping set
0 < queue_mappings < real_num_tx_queues.  When the driver
decreases real_num_tx_queues skb's may be dequeued from the
qdisc with a queue_mapping greater then real_num_tx_queues.

This fixes a case in ixgbe where this was occurring with DCB
and FCoE. Because the driver is using queue_mapping to map
skbs to tx descriptor rings we can potentially map skbs to
rings that no longer exist.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-02 21:59:07 -07:00
Sebastian Andrzej Siewior
70777d0346 net/core: use ntohs for skb->protocol
This is only noticed by people that are not doing everything correct in
the first place.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-30 10:39:19 -07:00
John Fastabend
6afff0caa7 net: consolidate netif_needs_gso() checks
netif_needs_gso() is checked twice in the TX path once,
before submitting the skb to the qdisc and once after
it is dequeued from the qdisc just before calling
ndo_hard_start().  This opens a window for a user to
change the gso/tso or tx checksum settings that can
cause netif_needs_gso to be true in one check and false
in the other.

Specifically, changing TX checksum setting may cause
the warning in skb_gso_segment() to be triggered if
the checksum is calculated earlier.

This consolidates the netif_needs_gso() calls so that
the stack only checks if gso is needed in
dev_hard_start_xmit().

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-23 12:58:41 -07:00
Jiri Pirko
f350a0a873 bridge: use rx_handler_data pointer to store net_bridge_port pointer
Register net_bridge_port pointer as rx_handler data pointer. As br_port is
removed from struct net_device, another netdev priv_flag is added to indicate
the device serves as a bridge port. Also rcuized pointers are now correctly
dereferenced in br_fdb.c and in netfilter parts.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-15 11:48:58 -07:00
Jiri Pirko
93e2c32b5c net: add rx_handler data pointer
Add possibility to register rx_handler data pointer along with a rx_handler.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-15 11:47:11 -07:00
Ben Hutchings
be1f3c2c02 net: Enable 64-bit net device statistics on 32-bit architectures
Use struct rtnl_link_stats64 as the statistics structure.

On 32-bit architectures, insert 32 bits of padding after/before each
field of struct net_device_stats to make its layout compatible with
struct rtnl_link_stats64.  Add an anonymous union in net_device; move
stats into the union and add struct rtnl_link_stats64 stats64.

Add net_device_ops::ndo_get_stats64, implementations of which will
return a pointer to struct rtnl_link_stats64.  Drivers that implement
this operation must not update the structure asynchronously.

Change dev_get_stats() to call ndo_get_stats64 if available, and to
return a pointer to struct rtnl_link_stats64.  Change callers of
dev_get_stats() accordingly.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-12 15:51:22 -07:00
David S. Miller
62522d36d7 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-06-11 13:32:31 -07:00
John Fastabend
597a264b1a net: deliver skbs on inactive slaves to exact matches
Currently, the accelerated receive path for VLAN's will
drop packets if the real device is an inactive slave and
is not one of the special pkts tested for in
skb_bond_should_drop().  This behavior is different then
the non-accelerated path and for pkts over a bonded vlan.

For example,

vlanx -> bond0 -> ethx

will be dropped in the vlan path and not delivered to any
packet handlers at all.  However,

bond0 -> vlanx -> ethx

and

bond0 -> ethx

will be delivered to handlers that match the exact dev,
because the VLAN path checks the real_dev which is not a
slave and netif_recv_skb() doesn't drop frames but only
delivers them to exact matches.

This patch adds a sk_buff flag which is used for tagging
skbs that would previously been dropped and allows the
skb to continue to skb_netif_recv().  Here we add
logic to check for the deliver_no_wcard flag and if it
is set only deliver to handlers that match exactly.  This
makes both paths above consistent and gives pkt handlers
a way to identify skbs that come from inactive slaves.
Without this patch in some configurations skbs will be
delivered to handlers with exact matches and in others
be dropped out right in the vlan path.

I have tested the following 4 configurations in failover modes
and load balancing modes.

# bond0 -> ethx

# vlanx -> bond0 -> ethx

# bond0 -> vlanx -> ethx

# bond0 -> ethx
            |
  vlanx -> --

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-10 22:23:34 -07:00
Tim Gardner
08c801f8d4 net: Print num_rx_queues imbalance warning only when there are allocated queues
BugLink: http://bugs.launchpad.net/bugs/591416

There are a number of network drivers (bridge, bonding, etc) that are not yet
receive multi-queue enabled and use alloc_netdev(), so don't print a
num_rx_queues imbalance warning in that case.

Also, only print the warning once for those drivers that _are_ multi-queue
enabled.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-06-09 13:46:03 -06:00
Eric Dumazet
bb69ae049f anycast: Some RCU conversions
- dev_get_by_flags() changed to dev_get_by_flags_rcu()

- ipv6_sock_ac_join() dont touch dev & idev refcounts
- ipv6_sock_ac_drop() dont touch dev & idev refcounts
- ipv6_sock_ac_close() dont touch dev & idev refcounts
- ipv6_dev_ac_dec() dount touch idev refcount
- ipv6_chk_acast_addr() dont touch idev refcount

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-07 22:49:25 -07:00
jamal
271c1dfa61 net: Remove unnecessary net action assertion
The extra assertion to allow packet munging only when there are
no other ptypes listening which may have worked around an old bug
is unnecessary. It is sufficient to check if the skb is cloned before
trampling on it. Thanks to Herbert Xu for being persistent and patient
in getting this across.
[Note that cloning checks and assertions are the general rule used
by tc actions (documentation/networking/tc-actions-env-rules.txt)].

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-07 01:10:44 -07:00
David S. Miller
eedc765ca4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/sfc/net_driver.h
	drivers/net/sfc/siena.c
2010-06-06 17:42:02 -07:00
Alexander Duyck
b78462ebc6 skbuff: add check for non-linear to warn_if_lro and needs_linearize
We can avoid an unecessary cache miss by checking if the skb is non-linear
before accessing gso_size/gso_type in skb_warn_if_lro, the same can also be
done to avoid a cache miss on nr_frags if data_len is 0.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-05 02:23:16 -07:00
Jiri Pirko
ab95bfe01f net: replace hooks in __netif_receive_skb V5
What this patch does is it removes two receive frame hooks (for bridge and for
macvlan) from __netif_receive_skb. These are replaced them with a single
hook for both. It only supports one hook per device because it makes no
sense to do bridging and macvlan on the same device.

Then a network driver (of virtual netdev like macvlan or bridge) can register
an rx_handler for needed net device.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-02 07:11:15 -07:00
Eric Dumazet
79640a4ca6 net: add additional lock to qdisc to increase throughput
When many cpus compete for sending frames on a given qdisc, the qdisc
spinlock suffers from very high contention.

The cpu owning __QDISC_STATE_RUNNING bit has same priority to acquire
the lock, and cannot dequeue packets fast enough, since it must wait for
this lock for each dequeued packet.

One solution to this problem is to force all cpus spinning on a second
lock before trying to get the main lock, when/if they see
__QDISC_STATE_RUNNING already set.

The owning cpu then compete with at most one other cpu for the main
lock, allowing for higher dequeueing rate.

Based on a previous patch from Alexander Duyck. I added the heuristic to
avoid the atomic in fast path, and put the new lock far away from the
cache line used by the dequeue worker. Also try to release the busylock
lock as late as possible.

Tests with following script gave a boost from ~50.000 pps to ~600.000
pps on a dual quad core machine (E5450 @3.00GHz), tg3 driver.
(A single netperf flow can reach ~800.000 pps on this platform)

for j in `seq 0 3`; do
  for i in `seq 0 7`; do
    netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
  done
done

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-02 05:09:29 -07:00
John Fastabend
2df4a0fa15 net: fix conflict between null_or_orig and null_or_bond
If a skb is received on an inactive bond that does not meet
the special cases checked for by skb_bond_should_drop it should
only be delivered to exact matches as the comment in
netif_receive_skb() says.

However because null_or_bond could also be null this is not
always true.  This patch renames null_or_bond to orig_or_bond
and initializes it to orig_dev.  This keeps the intent of
null_or_bond to pass frames received on VLAN interfaces stacked
on bonding interfaces without invalidating the statement for
null_or_orig.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-02 03:35:18 -07:00
Eric Dumazet
bc135b23d0 net: Define accessors to manipulate QDISC_STATE_RUNNING
Define three helpers to manipulate QDISC_STATE_RUNNIG flag, that a
second patch will move on another location.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-02 03:23:51 -07:00
Eric Dumazet
15e83ed788 net: remove zap_completion_queue
netpoll does an interesting work in zap_completion_queue(), but this was
before we did skb orphaning before delivering packets to device.

It now makes sense to add a test in dev_kfree_skb_irq() to not queue a
skb if already orphaned, and to remove netpoll zap_completion_queue() as
a bonus.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-31 00:24:01 -07:00
Eric Dumazet
27f39c73e6 net: Use __this_cpu_inc() in fast path
This patch saves 224 bytes of text on my machine.

__this_cpu_inc() generates a single instruction, using no scratch
registers :

  65 ff 04 25 a8 30 01 00      incl   %gs:0x130a8

instead of :

  48 c7 c2 80 30 01 00         mov    $0x13080,%rdx
  65 48 8b 04 25 88 ea 00 00   mov    %gs:0xea88,%rax
  83 44 10 28 01               addl   $0x1,0x28(%rax,%rdx,1)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-31 00:24:01 -07:00
Linus Torvalds
b1cdc4670b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits)
  drivers/net/usb/asix.c: Fix pointer cast.
  be2net: Bug fix to avoid disabling bottom half during firmware upgrade.
  proc_dointvec: write a single value
  hso: add support for new products
  Phonet: fix potential use-after-free in pep_sock_close()
  ath9k: remove VEOL support for ad-hoc
  ath9k: change beacon allocation to prefer the first beacon slot
  sock.h: fix kernel-doc warning
  cls_cgroup: Fix build error when built-in
  macvlan: do proper cleanup in macvlan_common_newlink() V2
  be2net: Bug fix in init code in probe
  net/dccp: expansion of error code size
  ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep
  wireless: fix sta_info.h kernel-doc warnings
  wireless: fix mac80211.h kernel-doc warnings
  iwlwifi: testing the wrong variable in iwl_add_bssid_station()
  ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs()
  ath9k_htc: dereferencing before check in hif_usb_tx_cb()
  rt2x00: Fix rt2800usb TX descriptor writing.
  rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions.
  ...
2010-05-25 16:59:51 -07:00
Daniel Lezcano
8ce6cebc2f net-2.6 : V2 - fix dev_get_valid_name
the commit:

commit d90310243f
Author: Octavian Purdila <opurdila@ixiacom.com>
Date:   Wed Nov 18 02:36:59 2009 +0000

    net: device name allocation cleanups

introduced a bug when there is a hash collision making impossible
to rename a device with eth%d. This bug is very hard to reproduce
and appears rarely.

The problem is coming from we don't pass a temporary buffer to
__dev_alloc_name but 'dev->name' which is modified by the function.

A detailed explanation is here:

http://marc.info/?l=linux-netdev&m=127417784011987&w=2

Changelog:
 V2 : replaced strings comparison by pointers comparison

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Reviewed-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-23 23:24:36 -07:00
Eric W. Biederman
a1b3f594dc net: Expose all network devices in a namespaces in sysfs
This reverts commit aaf8cdc34d.

Drivers like the ipw2100 call device_create_group when they
are initialized and device_remove_group when they are shutdown.
Moving them between namespaces deletes their sysfs groups early.

In particular the following call chain results.
netdev_unregister_kobject -> device_del -> kobject_del -> sysfs_remove_dir
With sysfs_remove_dir recursively deleting all of it's subdirectories,
and nothing adding them back.

Ouch!

Therefore we need to call something that ultimate calls sysfs_mv_dir
as that sysfs function can move sysfs directories between namespaces
without deleting their subdirectories or their contents.   Allowing
us to avoid placing extra boiler plate into every driver that does
something interesting with sysfs.

Currently the function that provides that capability is device_rename.
That is the code works without nasty side effects as originally written.

So remove the misguided fix for moving devices between namespaces.  The
bug in the kobject layer that inspired it has now been recognized and
fixed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-21 09:37:34 -07:00
Tom Herbert
76cc8b13a6 net: fix problem in dequeuing from input_pkt_queue
Fix some issues introduced in batch skb dequeuing for input_pkt_queue.
The primary issue it that the queue head must be incremented only
after a packet has been processed, that is only after
__netif_receive_skb has been called.  This is needed for the mechanism
to prevent OOO packet in RFS.  Also when flushing the input_pkt_queue
and process_queue, the process queue should be done first to prevent
OOO packets.

Because the input_pkt_queue has been effectively split into two queues,
the calculation of the tail ptr is no longer correct.  The correct value
would be head+input_pkt_queue->len+process_queue->len.  To avoid
this calculation we added an explict input_queue_tail in softnet_data.
The tail value is simply incremented when queuing to input_pkt_queue.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-21 00:38:33 -07:00
Eric Dumazet
7fee226ad2 net: add a noref bit on skb dst
Use low order bit of skb->_skb_dst to tell dst is not refcounted.

Change _skb_dst to _skb_refdst to make sure all uses are catched.

skb_dst() returns the dst, regardless of noref bit set or not, but
with a lockdep check to make sure a noref dst is not given if current
user is not rcu protected.

New skb_dst_set_noref() helper to set an notrefcounted dst on a skb.
(with lockdep check)

skb_dst_drop() drops a reference only if skb dst was refcounted.

skb_dst_force() helper is used to force a refcount on dst, when skb
is queued and not anymore RCU protected.

Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
!IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
sock_queue_rcv_skb(), in __nf_queue().

Use skb_dst_force() in dev_requeue_skb().

Note: dst_use_noref() still dirties dst, we might transform it
later to do one dirtying per jiffies.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-17 17:18:50 -07:00
Eric Dumazet
ebda37c27d rps: avoid one atomic in enqueue_to_backlog
If CONFIG_SMP=y, then we own a queue spinlock, we can avoid the atomic
test_and_set_bit() from napi_schedule_prep().

We now have same number of atomic ops per netif_rx() calls than with
pre-RPS kernel.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-17 17:18:50 -07:00
Eric Dumazet
3b098e2d7c net: Consistent skb timestamping
With RPS inclusion, skb timestamping is not consistent in RX path.

If netif_receive_skb() is used, its deferred after RPS dispatch.

If netif_rx() is used, its done before RPS dispatch.

This can give strange tcpdump timestamps results.

I think timestamping should be done as soon as possible in the receive
path, to get meaningful values (ie timestamps taken at the time packet
was delivered by NIC driver to our stack), even if NAPI already can
defer timestamping a bit (RPS can help to reduce the gap)

Tom Herbert prefer to sample timestamps after RPS dispatch. In case
sampling is expensive (HPET/acpi_pm on x86), this makes sense.

Let admins switch from one mode to another, using a new
sysctl, /proc/sys/net/core/netdev_tstamp_prequeue

Its default value (1), means timestamps are taken as soon as possible,
before backlog queueing, giving accurate timestamps.

Setting a 0 value permits to sample timestamps when processing backlog,
after RPS dispatch, to lower the load of the pre-RPS cpu.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-15 23:57:10 -07:00
Jiri Pirko
a14462f1bd net: adjust handle_macvlan to pass port struct to hook
Now there's null check here and also again in the hook. Looking at bridge bits
which are simmilar, port structure is rcu_dereferenced right away in
handle_bridge and passed to hook. Looks nicer.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-15 23:48:02 -07:00
David S. Miller
278554bd65 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	Documentation/feature-removal-schedule.txt
	drivers/net/wireless/ath/ar9170/usb.c
	drivers/scsi/iscsi_tcp.c
	net/ipv4/ipmr.c
2010-05-12 00:05:35 -07:00
Eric Dumazet
eecfd7c4e3 rps: Various optimizations
Introduce ____napi_schedule() helper for callers in irq disabled
contexts. rps_trigger_softirq() becomes a leaf function.

Use container_of() in process_backlog() instead of accessing per_cpu
address.

Use a custom inlined version of __napi_complete() in process_backlog()
to avoid one locked instruction :

 only current cpu owns and manipulates this napi,
 and NAPI_STATE_SCHED is the only possible flag set on backlog.
 we can use a plain write instead of clear_bit(),
 and we dont need an smp_mb() memory barrier, since RPS is on,
 backlog is protected by a spinlock.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-06 22:07:48 -07:00
Eric Dumazet
6ec82562ff veth: Dont kfree_skb() after dev_forward_skb()
In case of congestion, netif_rx() frees the skb, so we must assume
dev_forward_skb() also consume skb.

Bug introduced by commit 445409602c
(veth: move loopback logic to common location)

We must change dev_forward_skb() to always consume skb, and veth to not
double free it.

Bug report : http://marc.info/?l=linux-netdev&m=127310770900442&w=3

Reported-by: Martín Ferrari <martin.ferrari@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-06 00:53:53 -07:00
Changli Gao
dee42870a4 net: fix softnet_stat
Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context
without any protection. And enqueue_to_backlog should update the netdev_rx_stat
of the target CPU.

This patch renames softnet_data.total to softnet_data.processed: the number of
packets processed in uppper levels(IP stacks).

softnet_stat data is moved into softnet_data.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/linux/netdevice.h |   17 +++++++----------
 net/core/dev.c            |   26 ++++++++++++--------------
 net/sched/sch_generic.c   |    2 +-
 3 files changed, 20 insertions(+), 25 deletions(-)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-02 22:26:57 -07:00
Changli Gao
6e7676c1a7 net: batch skb dequeueing from softnet input_pkt_queue
batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
contention when RPS is enabled.

Note: in the worst case, the number of packets in a softnet_data may
be double of netdev_max_backlog.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27 15:11:49 -07:00
Changli Gao
a9cbd588fd net: reimplement softnet_data.output_queue as a FIFO queue
reimplement softnet_data.output_queue as a FIFO queue to keep the
fairness among the qdiscs rescheduled.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
----
 include/linux/netdevice.h |    1 +
 net/core/dev.c            |   22 ++++++++++++----------
 2 files changed, 13 insertions(+), 10 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27 14:32:12 -07:00
Changli Gao
8c52d509e8 rps: optimize rps_get_cpu()
optimize rps_get_cpu().

don't initialize ports when we can get the ports. one memory access
for ports than two.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-24 22:50:10 -07:00
David S. Miller
9ccb897594 net: Orphan and de-dst skbs earlier in xmit path.
This way GSO packets don't get handled differently.

With help from Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-04-22 01:02:07 -07:00
Eric Dumazet
e326bed2f4 rps: immediate send IPI in process_backlog()
If some skb are queued to our backlog, we are delaying IPI sending at
the end of net_rx_action(), increasing latencies. This defeats the
queueing, since we want to quickly dispatch packets to the pool of
worker cpus, then eventually deeply process our packets.

It's better to send IPI before processing our packets in upper layers,
from process_backlog().

Change the _and_disable_irq suffix to _and_enable_irq(), since we enable
local irq in net_rps_action(), sorry for the confusion.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 00:22:45 -07:00
Eric Dumazet
9a20e3197e net: Introduce skb_orphan_try()
At this point, skb->destructor is not the original one (stored in
DEV_GSO_CB(skb)->destructor)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 22:54:08 -07:00
David S. Miller
87eb367003 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-6000.c
	net/core/dev.c
2010-04-21 01:14:25 -07:00
David Howells
05d17608a6 net: Fix an RCU warning in dev_pick_tx()
Fix the following RCU warning in dev_pick_tx():

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
net/core/dev.c:1993 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
2 locks held by swapper/0:
 #0:  (&idev->mc_ifc_timer){+.-...}, at: [<ffffffff81039e65>] run_timer_softirq+0x17b/0x278
 #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff812ea3eb>] dev_queue_xmit+0x14e/0x4dc

stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.34-rc5-cachefs #4
Call Trace:
 <IRQ>  [<ffffffff810516c4>] lockdep_rcu_dereference+0xaa/0xb2
 [<ffffffff812ea4f6>] dev_queue_xmit+0x259/0x4dc
 [<ffffffff812ea3eb>] ? dev_queue_xmit+0x14e/0x4dc
 [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81035362>] ? local_bh_enable_ip+0xbc/0xc1
 [<ffffffff812f0954>] neigh_resolve_output+0x24b/0x27c
 [<ffffffff8134f673>] ip6_output_finish+0x7c/0xb4
 [<ffffffff81350c34>] ip6_output2+0x256/0x261
 [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff813517fb>] ip6_output+0xbbc/0xbcb
 [<ffffffff8135bc5d>] ? fib6_force_start_gc+0x2b/0x2d
 [<ffffffff81368acb>] mld_sendpack+0x273/0x39d
 [<ffffffff81368858>] ? mld_sendpack+0x0/0x39d
 [<ffffffff81052099>] ? mark_held_locks+0x52/0x70
 [<ffffffff813692fc>] mld_ifc_timer_expire+0x24f/0x288
 [<ffffffff81039ed6>] run_timer_softirq+0x1ec/0x278
 [<ffffffff81039e65>] ? run_timer_softirq+0x17b/0x278
 [<ffffffff813690ad>] ? mld_ifc_timer_expire+0x0/0x288
 [<ffffffff81035531>] ? __do_softirq+0x69/0x140
 [<ffffffff8103556a>] __do_softirq+0xa2/0x140
 [<ffffffff81002e0c>] call_softirq+0x1c/0x28
 [<ffffffff81004b54>] do_softirq+0x38/0x80
 [<ffffffff81034f06>] irq_exit+0x45/0x47
 [<ffffffff810177c3>] smp_apic_timer_interrupt+0x88/0x96
 [<ffffffff810028d3>] apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff810488dd>] ? __atomic_notifier_call_chain+0x0/0x86
 [<ffffffff810096bf>] ? mwait_idle+0x6e/0x78
 [<ffffffff810096b6>] ? mwait_idle+0x65/0x78
 [<ffffffff810011cb>] cpu_idle+0x4d/0x83
 [<ffffffff81380b05>] rest_init+0xb9/0xc0
 [<ffffffff81380a4c>] ? rest_init+0x0/0xc0
 [<ffffffff8168dcf0>] start_kernel+0x392/0x39d
 [<ffffffff8168d2a3>] x86_64_start_reservations+0xb3/0xb7
 [<ffffffff8168d38b>] x86_64_start_kernel+0xe4/0xeb

An rcu_dereference() should be an rcu_dereference_bh().

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 01:09:44 -07:00
Jiri Pirko
ab9304717f net: emphasize rtnl lock required in call_netdevice_notifiers
Since netdev_chain is guarded by rtnl_lock, ASSERT_RTNL should be
present here to make sure that all callers of call_netdevice_notifiers
does the locking properly.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 01:45:37 -07:00
Eric Dumazet
b249dcb82d rps: consistent rxhash
In case we compute a software skb->rxhash, we can generate a consistent
hash : Its value will be the same in both flow directions.

This helps some workloads, like conntracking, since the same state needs
to be accessed in both directions.

tbench + RFS + this patch gives better results than tbench with default
kernel configuration (no RPS, no RFS)

Also fixed some sparse warnings.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 01:18:06 -07:00
Eric Dumazet
e36fa2f7e9 rps: cleanups
struct softnet_data holds many queues, so consistent use "sd" name
instead of "queue" is better.

Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog()

Adds a _and_irq_disable suffix to net_rps_action() name, as David
suggested.

incr_input_queue_head() becomes input_queue_head_incr()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 01:18:05 -07:00
Eric Dumazet
88751275b8 rps: shortcut net_rps_action()
net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.

Tom Herbert used two bitmasks to hold information needed to send IPI,
but a single LIFO list seems more appropriate.

Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)

Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.

In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-19 13:20:34 -07:00
Eric Dumazet
fc6055a5ba net: Introduce skb_orphan_try()
Transmitted skb might be attached to a socket and a destructor, for
memory accounting purposes.

Traditionally, this destructor is called at tx completion time, when skb
is freed.

When tx completion is performed by another cpu than the sender, this
forces some cache lines to change ownership. XPS was an attempt to give
tx completion to initial cpu.

David idea is to call destructor right before giving skb to device (call
to ndo_start_xmit()). Because device queues are usually small, orphaning
skb before tx completion is not a big deal. Some drivers already do
this, we could do it in upper level.

There is one known exception to this early orphaning, called tx
timestamping. It needs to keep a reference to socket until device can
give a hardware or software timestamp.

This patch adds a skb_orphan_try() helper, to centralize all exceptions
to early orphaning in one spot, and use it in dev_hard_start_xmit().

"tbench 16" results on a Nehalem machine (2 X5570  @ 2.93GHz)
before: Throughput 4428.9 MB/sec 16 procs
after: Throughput 4448.14 MB/sec 16 procs

UDP should get even better results, its destructor being more complex,
since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-18 02:39:41 -07:00
Eric Dumazet
9958da0501 net: remove time limit in process_backlog()
- There is no point to enforce a time limit in process_backlog(), since
other napi instances dont follow same rule. We can exit after only one
packet processed...
The normal quota of 64 packets per napi instance should be the norm, and
net_rx_action() already has its own time limit.
Note : /proc/net/core/dev_weight can be used to tune this 64 default
value.

- Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-18 02:36:13 -07:00
Eric Dumazet
8770acf049 rps: rps_sock_flow_table is mostly read
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-17 00:54:36 -07:00
Tom Herbert
fec5e652e5 rfs: Receive Flow Steering
This patch implements receive flow steering (RFS).  RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running.  RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure.  The rxhash is passed in skb's received on
the connection from netif_receive_skb.  For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.

The convolution of the simple approach is that it would potentially
allow OOO packets.  If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_table is a global hash table.  Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue.  Each entry
contains a CPU and a tail queue counter.  The CPU is the "current"
CPU for a matching flow.  The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length.  When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted.  When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table.  This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality.  2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS.  The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue.  Both are rounded to power of two.

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors.  On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation.  However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp.  The RPC test is an request/response
test similar in structure to netperf RR test ith 100 threads on
each host, but does more work in userspace that netperf.

e1000e on 8 core Intel
   No RFS or RPS		104K tps at 30% CPU
   No RFS (best RPS config):    290K tps at 63% CPU
   RFS				303K tps at 61% CPU

RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
  No RFS/RPS	103K	48%	757/900/3185		4472.35
  RPS only:	174K	73%	415/993/2468		491.66
  RFS		223K	73%	379/651/1382		315.61

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
Eric Dumazet
8728c544a9 net: dev_pick_tx() fix
When dev_pick_tx() caches tx queue_index on a socket, we must check
socket dst_entry matches skb one, or risk a crash later, as reported by
Denys Fedorysychenko, if old packets are in flight during a route
change, involving devices with different number of queues.

Bug introduced by commit a4ee3ce3
(net: Use sk_tx_queue_mapping for connected sockets)

Reported-by: Denys Fedorysychenko <nuclearcat@nuclearcat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15 01:27:11 -07:00
Eric Dumazet
b0e28f1eff net: netif_rx() must disable preemption
Eric Paris reported netif_rx() is calling smp_processor_id() from
preemptible context, in particular when caller is
ip_dev_loopback_xmit().

RPS commit added this smp_processor_id() call, this patch makes sure
preemption is disabled. rps_get_cpus() wants rcu_read_lock() anyway, we
can dot it a bit earlier.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15 00:14:07 -07:00
Eric Dumazet
acbbc07145 net: uninline skb_bond_should_drop()
skb_bond_should_drop() is too big to be inlined.

This patch reduces kernel text size, and its compilation time as well
(shrinking include/linux/netdevice.h)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 03:32:42 -07:00
Eric Dumazet
b6c6712a42 net: sk_dst_cache RCUification
With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
work.

sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)

This rwlock is readlocked for a very small amount of time, and dst
entries are already freed after RCU grace period. This calls for RCU
again :)

This patch converts sk_dst_lock to a spinlock, and use RCU for readers.

__sk_dst_get() is supposed to be called with rcu_read_lock() or if
socket locked by user, so use appropriate rcu_dereference_check()
condition (rcu_read_lock_held() || sock_owned_by_user(sk))

This patch avoids two atomic ops per tx packet on UDP connected sockets,
for example, and permits sk_dst_lock to be much less dirtied.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 01:41:33 -07:00
Eric Dumazet
7a161ea924 net: Dont use netdev_warn()
Dont use netdev_warn() in dev_cap_txqueue() and get_rps_cpu() so that we
can catch following warnings without crash.

bond0.2240 received packet on queue 6, but number of RX queues is 1
bond0.2240 received packet on queue 11, but number of RX queues is 1

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 01:41:32 -07:00
David S. Miller
871039f02f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/stmmac/stmmac_main.c
	drivers/net/wireless/wl12xx/wl1271_cmd.c
	drivers/net/wireless/wl12xx/wl1271_main.c
	drivers/net/wireless/wl12xx/wl1271_spi.c
	net/core/ethtool.c
	net/mac80211/scan.c
2010-04-11 14:53:53 -07:00
Eric Dumazet
e4008276fd net: Add a missing local_irq_enable()
As noticed by Changli Gao, we must call local_irq_enable() after
rps_unlock()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-05 15:42:39 -07:00
Tom Herbert
5a6d234e73 rps: fixed missed rps_unlock
Fix spin_unlock_irq which needs to be rps_unlock.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-05 14:37:55 -07:00
Jiri Pirko
22bedad3ce net: convert multicast list to list_head
Converts the list and the core manipulating with it to be the same as uc_list.

+uses two functions for adding/removing mc address (normal and "global"
 variant) instead of a function parameter.
+removes dev_mcast.c completely.
+exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for
 manipulation with lists on a sandbox (used in bonding and 80211 drivers)

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-03 14:22:15 -07:00
Jiri Pirko
a748ee2426 net: move address list functions to a separate file
+little renaming of unicast functions to be smooth with multicast ones

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-03 14:22:11 -07:00
Eric Dumazet
9092c658ba net: illegal_highdma() fix
Followup to commit 5acbbd428d
(net: change illegal_highdma to use dma_mask)

If dev->dev.parent is NULL, we should not try to dereference it.

Dont force inline illegal_highdma() as its pretty big now.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-02 13:34:49 -07:00
FUJITA Tomonori
5acbbd428d net: change illegal_highdma to use dma_mask
Robert Hancock pointed out two problems about NETIF_F_HIGHDMA:

-Many drivers only set the flag when they detect they can use 64-bit DMA,
since otherwise they could receive DMA addresses that they can't handle
(which on platforms without IOMMU/SWIOTLB support is fatal). This means that if
64-bit support isn't available, even buffers located below 4GB will get copied
unnecessarily.

-Some drivers set the flag even though they can't actually handle 64-bit DMA,
which would mean that on platforms without IOMMU/SWIOTLB they would get a DMA
mapping error if the memory they received happened to be located above 4GB.

http://lkml.org/lkml/2010/3/3/530

We can use the dma_mask if we need bouncing or not here. Then we can
safely fix drivers that misuse NETIF_F_HIGHDMA.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-01 19:53:12 -07:00
Changli Gao
152102c7f2 rps: keep the old behavior on SMP without rps
keep the old behavior on SMP without rps

RPS introduces a lock operation to per cpu variable input_pkt_queue on
SMP whenever rps is enabled or not. On SMP without RPS, this lock isn't
needed at all.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
net/core/dev.c | 42 ++++++++++++++++++++++++++++--------------
1 file changed, 28 insertions(+), 14 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-01 18:41:40 -07:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Eric Dumazet
10f744d205 net: __netif_receive_skb should be static
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-28 23:07:20 -07:00
Eric Dumazet
df3345457a rps: add CONFIG_RPS
RPS currently depends on SMP and SYSFS

Adding a CONFIG_RPS makes sense in case this requirement changes in the
future. This patch saves about 1500 bytes of kernel text in case SMP is
on but SYSFS is off.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-25 12:07:00 -07:00
Tom Herbert
e51d739ab7 net: Fix locking in flush_backlog
Need to take spinlocks when dequeuing from input_pkt_queue in flush_backlog.
Also, flush_backlog can now be called directly from netdev_run_todo.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-23 23:17:18 -07:00
Eric Dumazet
99fe3c391d net: dev_getfirstbyhwtype() optimization
Use RCU to avoid RTNL use in dev_getfirstbyhwtype()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-21 20:33:36 -07:00
Eric Dumazet
283f2fe87e net: speedup netdev_set_master()
We currently force a synchronize_net() in netdev_set_master()

This seems necessary only when a slave had a master and we dismantle it.

In the other case ("ifenslave bond0 ethO"), we dont need this long
delay.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-21 18:34:15 -07:00
Jiri Pirko
32a806c194 bonding: flush unicast and multicast lists when changing type
After the type change, addresses in unicast and multicast lists wouldn't make
sense, not to mention possible different lenghts. So flush both lists here.

Note "dev_addr_discard" will be very soon replaced by "dev_mc_flush" (once
mc_list conversion will be done).

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-21 18:31:34 -07:00
David S. Miller
e77c8e83dd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-03-20 15:24:29 -07:00
Eric Dumazet
0641e4fbf2 net: Potential null skb->dev dereference
When doing "ifenslave -d bond0 eth0", there is chance to get NULL
dereference in netif_receive_skb(), because dev->master suddenly becomes
NULL after we tested it.

We should use ACCESS_ONCE() to avoid this (or rcu_dereference())

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-18 21:16:45 -07:00
Jiri Pirko
3ca5b4042e bonding: check return value of nofitier when changing type
This patch adds the possibility to refuse the bonding type change for
other subsystems (such as for example bridge, vlan, etc.)

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-18 20:00:02 -07:00
Tom Herbert
1e94d72fea rps: Fixed build with CONFIG_SMP not enabled.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-18 17:45:44 -07:00
Tom Herbert
0a9627f264 rps: Receive Packet Steering
This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask.  The IPI mechanism is used to raise networking receive
softirqs between CPUs.  This effectively emulates in software what a multi-queue
NIC can provide, but is generic requiring no device support.

Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash).  This patch allow drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.

The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
bit maps for receive queues in the device (numbered by <n>).  If a device
does not support multi-queue, a single variable is used for the device (rx-0).

Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization.  Optimal settings for the CPU mask
seem to depend on architectures and cache hierarcy.  Below are some results
running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.

e1000e on 8 core Intel
   Without RPS: 108K tps at 33% CPU
   With RPS:    311K tps at 64% CPU

forcedeth on 16 core AMD
   Without RPS: 156K tps at 15% CPU
   With RPS:    404K tps at 49% CPU

bnx2x on 16 core AMD
   Without RPS  567K tps at 61% CPU (4 HW RX queues)
   Without RPS  738K tps at 96% CPU (8 HW RX queues)
   With RPS:    854K tps at 76% CPU (4 HW RX queues)

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet.  In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically, however whenever the mask is changed
this introduces the possibility of generating out of order packets.  It's
probably best not change the masks too frequently.

Signed-off-by: Tom Herbert <therbert@google.com>

 include/linux/netdevice.h |   32 ++++-
 include/linux/skbuff.h    |    3 +
 net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
 net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
 net/core/skbuff.c         |    2 +
 5 files changed, 538 insertions(+), 59 deletions(-)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:23:18 -07:00
David S. Miller
47871889c6 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/firmware/iscsi_ibft.c
2010-02-28 19:23:06 -08:00
Patrick McHardy
bd38081160 dev: support deferring device flag change notifications
Split dev_change_flags() into two functions: __dev_change_flags() to
perform the actual changes and __dev_notify_flags() to invoke netdevice
notifiers. This will be used by rtnl_link to defer netlink notifications
until the device has been fully configured.

This changes ordering of some operations, in particular:

- netlink notifications are sent after all changes have been performed.
  As a side effect this surpresses one unnecessary netlink message when
  the IFF_UP and other flags are changed simultaneously.

- The NETDEV_UP/NETDEV_DOWN and NETDEV_CHANGE notifiers are invoked
  after all changes have been performed. Their relative is unchanged.

- net_dmaengine_put() is invoked before the NETDEV_DOWN notifier instead
  of afterwards. This should not make any difference since both RX and TX
  are already shut down at this point.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-27 02:43:40 -08:00
Patrick McHardy
a2835763e1 rtnetlink: handle rtnl_link netlink notifications manually
In order to support specifying device flags during device creation,
we must be able to roll back device registration in case setting the
flags fails without sending any notifications related to the device
to userspace.

This patch changes rollback_registered_many() and register_netdevice()
to manually send netlink notifications for devices not handled by
rtnl_link and allows to defer notifications for devices handled by
rtnl_link until setup is complete.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-27 02:43:39 -08:00
stephen hemminger
e5e26d75f4 netdev: use list_first_entry macro
Use list_first_entry macro; no longer any need to use
'next' directly in list to find first entry.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-26 04:18:35 -08:00
Paul E. McKenney
a898def29e net: Add checking to rcu_dereference() primitives
Update rcu_dereference() primitives to use new lockdep-based
checking. The rcu_dereference() in __in6_dev_get() may be
protected either by rcu_read_lock() or RTNL, per Eric Dumazet.
The rcu_dereference() in __sk_free() is protected by the fact
that it is never reached if an update could change it.  Check
for this by using rcu_dereference_check() to verify that the
struct sock's ->sk_wmem_alloc counter is zero.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 09:41:03 +01:00
Ajit Khaparde
c4d49794ff net: bug fix for vlan + gro issue
Traffic (tcp) doesnot start on a vlan interface when gro is enabled.
Even the tcp handshake was not taking place.
This is because, the eth_type_trans call before the netif_receive_skb
in napi_gro_finish() resets the skb->dev to napi->dev from the previously
set vlan netdev interface. This causes the ip_route_input to drop the
incoming packet considering it as a packet coming from a martian source.

I could repro this on 2.6.32.7 (stable) and 2.6.33-rc7.
With this fix, the traffic starts and the test runs fine on both vlan
and non-vlan interfaces.

CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-23 19:09:31 -08:00
Ajit Khaparde
e76b69cc01 net: bug fix for vlan + gro issue
Traffic (tcp) doesnot start on a vlan interface when gro is enabled.
Even the tcp handshake was not taking place.
This is because, the eth_type_trans call before the netif_receive_skb
in napi_gro_finish() resets the skb->dev to napi->dev from the previously
set vlan netdev interface. This causes the ip_route_input to drop the
incoming packet considering it as a packet coming from a martian source.

I could repro this on 2.6.32.7 (stable) and 2.6.33-rc7.
With this fix, the traffic starts and the test runs fine on both vlan
and non-vlan interfaces.

CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-17 15:59:47 -08:00
Jiri Pirko
4cd24eaf0c net: use netdev_mc_count and netdev_mc_empty when appropriate
This patch replaces dev->mc_count in all drivers (hopefully I didn't miss
anything). Used spatch and did small tweaks and conding style changes when
it was suitable.

Jirka

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-12 11:38:58 -08:00
Peter P Waskiewicz Jr
15682bc488 ethtool: Introduce n-tuple filter programming support
This patchset enables the ethtool layer to program n-tuple
filters to an underlying device.  The idea is to allow capable
hardware to have static rules applied that can assist steering
flows into appropriate queues.

Hardware that is known to support these types of filters today
are ixgbe and niu.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-10 20:03:05 -08:00
Arnd Bergmann
8a83a00b07 net: maintain namespace isolation between vlan and real device
In the vlan and macvlan drivers, the start_xmit function forwards
data to the dev_queue_xmit function for another device, which may
potentially belong to a different namespace.

To make sure that classification stays within a single namespace,
this resets the potentially critical fields.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-03 20:20:32 -08:00
Jiri Pirko
32e7bfc411 net: use helpers to access uc list V2
This patch introduces three macros to work with uc list from net drivers.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-25 13:36:10 -08:00
Krishna Kumar
4b258461c0 net: Optimize non-gso test checks
Avoid checking twice whether skb needs to be linearized, if one
skb_linearize was already done.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-21 01:26:29 -08:00
David S. Miller
11380a4b2d net: Unexport napi_gro_flush().
Nothing outside of net/core/dev.c uses it.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-19 13:46:10 -08:00
Jesper Dangaard Brouer
2d13bafeba net: Make it easier to parse /proc/net/dev contents.
The contents of /proc/net/dev is annoying to parse, because
it changes whether there is a space after the "ethX:" or not.
It depends upon the size of the "Receive bytes" counter,
if the number is below 7 digits, then there is whitespaces
else if the number is 8 digits or above there is no space
between the ":" and the number.

This patch changes the output to assure there is always a space
between the ":" and the number.  Given that all existing userspace
application already need to handle the whitespaces, I see
no breakage of existing tools.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-07 00:59:10 -08:00
Andy Gospodarek
ca8d9ea30b fix bonding: allow arp_ip_targets on separate vlans to use arp validation
On Wed, Jan 06, 2010 at 10:10:03PM +0100, Eric Dumazet wrote:
> Le 06/01/2010 19:38, Eric Dumazet a écrit :
> >
> > (net-next-2.6 doesnt work well on my bond/vlan setup, I suspect I need a bisection)
>
> David, I had to revert 1f3c8804ac
> (bonding: allow arp_ip_targets on separate vlans to use arp validation)
>
> Or else, my vlan devices dont work (unfortunatly I dont have much time
> these days to debug the thing)
>
> My config :
>
>               +---------+
> vlan.103 -----+ bond0   +--- eth1 (bnx2)
>               |         +
> vlan.825 -----+         +--- eth2 (tg3)
>               +---------+
>
> $ cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
>
> Bonding Mode: fault-tolerance (active-backup)
> Primary Slave: None
> Currently Active Slave: eth2
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 0
> Down Delay (ms): 0
>
> Slave Interface: eth1  (bnx2)
> MII Status: down
> Link Failure Count: 1
> Permanent HW addr: 00:1e:0b:ec:d3:d2
>
> Slave Interface: eth2   (tg3)
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:1e:0b:92:78:50
>

This patch fixes up a problem with found with commit
1f3c8804ac.  The original change
overloaded null_or_orig, but doing that prevented any packet handlers
that were not tied to a specific device (i.e. ptype->dev == NULL) from
ever receiving any frames.

The null_or_orig variable cannot be overloaded, and must be kept as NULL
to prevent the frame from being ignored by packet handlers designed to
accept frames on any interface.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-07 00:46:39 -08:00
Andy Gospodarek
1f3c8804ac bonding: allow arp_ip_targets on separate vlans to use arp validation
This allows a bond device to specify an arp_ip_target as a host that is
not on the same vlan as the base bond device and still use arp
validation.  A configuration like this, now works:

BONDING_OPTS="mode=active-backup arp_interval=1000 arp_ip_target=10.0.100.1 arp_validate=3"

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::213:21ff:febe:33e9/64 scope link
       valid_lft forever preferred_lft forever
9: bond0.100@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
    inet 10.0.100.2/24 brd 10.0.100.255 scope global bond0.100
    inet6 fe80::213:21ff:febe:33e9/64 scope link
       valid_lft forever preferred_lft forever

Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 1000
ARP IP target/s (n.n.n.n form): 10.0.100.1

Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:40:05:30:ff:30

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:13:21:be:33:e9

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-03 21:17:16 -08:00
Krishna Kumar
068a2de57d net: release dst entry while cache-hot for GSO case too
Non-GSO code drops dst entry for performance reasons, but
the same is missing for GSO code. Drop dst while cache-hot
for GSO case too.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-23 14:13:30 -08:00
Eric W. Biederman
d90a909e1f net: Fix userspace RTM_NEWLINK notifications.
I received some bug reports about userspace programs having problems
because after RTM_NEWLINK was received they could not immediate access
files under /proc/sys/net/ because they had not been registered yet.

The original problem was trivially fixed by moving the userspace
notification from rtnetlink_event() to the end of
register_netdevice().

When testing that change I discovered I was still getting RTM_NEWLINK
events before I could access proc and I was also getting RTM_NEWLINK
events after I was seeing RTM_DELLINK.  Things practically guaranteed
to confuse userspace.

After a little more investigation these extra notifications proved to
be from the new notifiers NETDEV_POST_INIT and NETDEV_UNREGISTER_BATCH
hitting the default case in rtnetlink_event, and triggering
unnecessary RTM_NEWLINK messages.

rtnetlink_event now explicitly handles NETDEV_UNREGISTER_BATCH and
NETDEV_POST_INIT to avoid sending the incorrect userspace
notifications.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-13 19:45:22 -08:00
Krishna Kumar
e93737b0f0 net: Handle NETREG_UNINITIALIZED devices correctly
Fix two problems:

1. If unregister_netdevice_many() is called with both registered
   and unregistered devices, rollback_registered_many() bails out
   when it reaches the first unregistered device. The processing
   of the prior registered devices is unfinished, and the
   remaining devices are skipped, and possible registered netdev's
   are leaked/unregistered.

2. System hangs or panics depending on how the devices are passed,
   since when netdev_run_todo() runs, some devices were not fully
   processed.

Tested by passing intermingled unregistered and registered vlan
devices to unregister_netdevice_many() as follows:
	1. dev, fake_dev1, fake_dev2: hangs in run_todo
	   ("unregister_netdevice: waiting for eth1.100 to become
	    free. Usage count = 1")
	2. fake_dev1, dev, fake_dev2: failure during de-registration
	   and next registration, followed by a vlan driver Oops
	   during subsequent registration.

Confirmed that the patch fixes both cases.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-11 15:11:45 -08:00
Patrick Mullaney
fc4a748966 netdevice: provide common routine for macvlan and vlan operstate management
Provide common routine for the transition of operational state for a leaf
device during a root device transition.

Signed-off-by: Patrick Mullaney <pmullaney@novell.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 15:59:22 -08:00
Eric W. Biederman
04dc7f6be3 net: Move network device exit batching
Move network device exit batching from a special case in
net_namespace.c to using common mechanisms in dev.c

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 12:22:02 -08:00
Eric W. Biederman
e008b5fc8d net: Simplfy default_device_exit and improve batching.
- Defer dellink to net_cleanup() allowing for batching.
- Fix comment.
- Use for_each_netdev_safe again as dev_change_net_namespace touches
  at most one network device (unlike veth dellink).

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-01 16:15:52 -08:00
Eric W. Biederman
a5ee155136 net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCH
The motivation for an additional notifier in batched netdevice
notification (rt_do_flush) only needs to be called once per batch not
once per namespace.

For further batching improvements I need a guarantee that the
netdevices are unregistered in order allowing me to unregister an all
of the network devices in a network namespace at the same time with
the guarantee that the loopback device is really and truly
unregistered last.

Additionally it appears that we moved the route cache flush after
the final synchronize_net, which seems wrong and there was no
explanation.  So I have restored the original location of the final
synchronize_net.

Cc: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-01 16:15:50 -08:00
Joe Perches
f64f9e7192 net: Move && and || to end of previous line
Not including net/atm/

Compiled tested x86 allyesconfig only
Added a > 80 column line or two, which I ignored.
Existing checkpatch plaints willfully, cheerfully ignored.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-29 16:55:45 -08:00
Arnd Bergmann
445409602c veth: move loopback logic to common location
The veth driver contains code to forward an skb
from the start_xmit function of one network
device into the receive path of another device.

Moving that code into a common location lets us
reuse the code for direct forwarding of data
between macvlan ports, and possibly in other
drivers.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-26 15:52:58 -08:00
Octavian Purdila
09ad9bc752 net: use net_eq to compare nets
Generated with the following semantic patch

@@
struct net *n1;
struct net *n2;
@@
- n1 == n2
+ net_eq(n1, n2)

@@
struct net *n1;
struct net *n2;
@@
- n1 != n2
+ !net_eq(n1, n2)

applied over {include,net,drivers/net}.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-25 15:14:13 -08:00
Jaswinder Singh Rajput
6ebfbc0656 net: Fix missing kernel-doc notation
Fix the following htmldocs warning:

  Warning(net/core/dev.c:5378): bad line:

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-22 20:43:13 -08:00
Eric Dumazet
8964be4a9a net: rename skb->iif to skb->skb_iif
To help grep games, rename iif to skb_iif

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-20 15:35:04 -08:00
Octavian Purdila
d90310243f net: device name allocation cleanups
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-18 05:03:35 -08:00
Eric Dumazet
e014debecd linkwatch: linkwatch_forget_dev() to speedup device dismantle
Herbert Xu a écrit :
> On Tue, Nov 17, 2009 at 04:26:04AM -0800, David Miller wrote:
>> Really, the link watch stuff is just due for a redesign.  I don't
>> think a simple hack is going to cut it this time, sorry Eric :-)
>
> I have no objections against any redesigns, but since the only
> caller of linkwatch_forget_dev runs in process context with the
> RTNL, it could also legally emit those events.

Thanks guys, here an updated version then, before linkwatch surgery ?

In this version, I force the event to be sent synchronously.

[PATCH net-next-2.6] linkwatch: linkwatch_forget_dev() to speedup device dismantle

time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105

real	0m0.266s
user	0m0.000s
sys	0m0.001s

real	0m0.770s
user	0m0.000s
sys	0m0.000s

real	0m1.022s
user	0m0.000s
sys	0m0.000s

One problem of current schem in vlan dismantle phase is the
holding of device done by following chain :

vlan_dev_stop() ->
	netif_carrier_off(dev) ->
		linkwatch_fire_event(dev) ->
			dev_hold() ...

And __linkwatch_run_queue() runs up to one second later...

A generic fix to this problem is to add a linkwatch_forget_dev() method
to unlink the device from the list of watched devices.

dev->link_watch_next becomes dev->link_watch_list (and use a bit more memory),
to be able to unlink device in O(1).

After patch :
time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105

real    0m0.024s
user    0m0.000s
sys     0m0.000s

real    0m0.032s
user    0m0.000s
sys     0m0.001s

real    0m0.033s
user    0m0.000s
sys     0m0.000s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-18 05:03:11 -08:00
Octavian Purdila
395264d509 net: introduce NETDEV_UNREGISTER_PERNET
This new event is called once for each unique net namespace in batched
unregister operations (with the argument set to a random device from
that namespace) and once per device in non-batched unregister
operations.

It allows us to factorize some device unregister work such as clearing the
routing cache.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-18 05:03:03 -08:00
Eric Dumazet
d83345adf9 net: add dev_txq_stats_fold() helper
Some drivers ndo_get_stats() method need to perform txqueue stats folding.

Move folding from dev_get_stats() to a new dev_txq_stats_fold() function

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-17 23:51:52 -08:00
David S. Miller
a2bfbc072e Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/can/Kconfig
2009-11-17 00:05:02 -08:00
Eric Dumazet
91e9c07bd6 net: Fix the rollback test in dev_change_name()
net: Fix the rollback test in dev_change_name()

In dev_change_name() an err variable is used for storing the original
call_netdevice_notifiers() errno (negative) and testing for a rollback
error later, but the test for non-zero is wrong, because the err might
have positive value as well - from dev_alloc_name(). It means the
rollback for a netdevice with a number > 0 will never happen. (The err
test is reordered btw. to make it more readable.)

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-16 03:30:35 -08:00
Jarek Poplawski
9a1654ba0b net: Optimize hard_start_xmit() return checking
Recent changes in the TX error propagation require additional checking
and masking of values returned from hard_start_xmit(), mainly to
separate cases where skb was consumed. This aim can be simplified by
changing the order of NETDEV_TX and NET_XMIT codes, because the latter
are treated similarly to negative (ERRNO) values.

After this change much simpler dev_xmit_complete() is also used in
sch_direct_xmit(), so it is moved to netdevice.h.

Additionally NET_RX definitions in netdevice.h are moved up from
between TX codes to avoid confusion while reading the TX comment.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-15 22:08:33 -08:00
Eric Dumazet
ed04642f75 net: check the return value of ndo_select_queue()
Check the return value of ndo_select_queue(). If the value isn't smaller
than the real_num_tx_queues, print a warning message, and reset it to zero.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
----
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-15 22:08:05 -08:00
Patrick McHardy
572a9d7b6f net: allow to propagate errors through ->ndo_hard_start_xmit()
Currently the ->ndo_hard_start_xmit() callbacks are only permitted to return
one of the NETDEV_TX codes. This prevents any kind of error propagation for
virtual devices, like queue congestion of the underlying device in case of
layered devices, or unreachability in case of tunnels.

This patches changes the NET_XMIT codes to avoid clashes with the NETDEV_TX
codes and changes the two callers of dev_hard_start_xmit() to expect either
errno codes, NET_XMIT codes or NETDEV_TX codes as return value.

In case of qdisc_restart(), all non NETDEV_TX codes are mapped to NETDEV_TX_OK
since no error propagation is possible when using qdiscs. In case of
dev_queue_xmit(), the error is propagated upwards.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13 14:07:32 -08:00
stephen hemminger
08e9897d51 netdev: fold name hash properly (v3)
The full_name_hash function does not produce well distributed values in
the lower bits, so most code uses hash_32() to fold it.  This is really
a bug introduced when name hashing was added, back in 2.5 when I added
name hashing.

hash_32 is all that is needed since full_name_hash returns unsigned int
which is only 32 bits on 64 bit platforms.

Also, there is no point in using hash_32 on ifindex, because the is naturally
sequential and usually well distributed.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-11 19:22:12 -08:00
Eric Dumazet
c6d14c8456 net: Introduce for_each_netdev_rcu() iterator
Adds RCU management to the list of netdevices.

Convert some for_each_netdev() users to RCU version, if
it can avoid read_lock-ing dev_base_lock

Ie:
	read_lock(&dev_base_loack);
	for_each_netdev(net, dev)
		some_action();
	read_unlock(&dev_base_lock);

becomes :

	rcu_read_lock();
	for_each_netdev_rcu(net, dev)
		some_action();
	rcu_read_unlock();


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-04 05:43:23 -08:00
Eric Dumazet
3710becf8a net: RCU locking for simple ioctl()
All ioctls() implemented by dev_ifsioc_locked() :
SIOCGIFFLAGS, SIOCGIFMETRIC, SIOCGIFMTU, SIOCGIFHWADDR,
SIOCGIFSLAVE, SIOCGIFMAP, SIOCGIFINDEX & SIOCGIFTXQLEN
can use RCU lock instead of dev_base_lock rwlock

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-01 23:55:11 -08:00
Eric W. Biederman
9fdce099bb veth: Fix unregister_netdevice_queue for veth
I tested the recent unregister many changes and got a weird,
nasty and seemingly unrelasted kernel oops. Changing
unregister_netdevice_queue to use list_move_tail fixes
the problem for me.

ip link add type veth
rmmod veth

ls /sys/class/net/
showed one of the veth devices still present.

A subsequent ip link oopsed the box.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-01 23:55:09 -08:00
Eric Dumazet
72c9528bab net: Introduce dev_get_by_name_rcu()
Some workloads hit dev_base_lock rwlock pretty hard.
We can use RCU lookups to avoid touching this rwlock
(and avoid touching netdevice refcount)

netdevices are already freed after a RCU grace period, so this patch
adds no penalty at device dismantle time.

However, it adds a synchronize_rcu() call in dev_change_name()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-01 23:55:08 -08:00
Eric Dumazet
0bd8d53656 net: use hlist_for_each_entry()
Small cleanup of __dev_get_by_name() and __dev_get_by_index()
to use hlist_for_each_entry() : They'll look like their _rcu variant.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-30 01:40:11 -07:00
Ben Hutchings
c7c4b3b6e9 gro: Change all receive functions to return GRO result codes
This will allow drivers to adjust their receive path dynamically
based on whether GRO is being applied successfully.

Currently all in-tree callers ignore the return values of these
functions and do not need to be changed.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 21:36:53 -07:00
Ben Hutchings
5b252f0c2f gro: Name the GRO result enumeration type
This clarifies which return and parameter types are GRO result codes
and not RX result codes.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 21:33:55 -07:00
Eric Dumazet
fb699dfd42 net: Introduce dev_get_by_index_rcu()
Some workloads hit dev_base_lock rwlock pretty hard.
We can use RCU lookups to avoid touching this rwlock.

netdevices are already freed after a RCU grace period, so this patch
adds no penalty at device dismantle time.

dev_ifname() converted to dev_get_by_index_rcu()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 01:42:55 -07:00
Eric Dumazet
63c8099d90 vlan: Optimize multiple unregistration
Use unregister_netdevice_many() to speedup master device unregister.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 02:22:08 -07:00
Eric Dumazet
23289a37e2 net: add a list_head parameter to dellink() method
Adding a list_head parameter to rtnl_link_ops->dellink() methods
allow us to queue devices on a list, in order to dismantle
them all at once.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 02:22:07 -07:00
Eric Dumazet
9b5e383c11 net: Introduce unregister_netdevice_many()
Introduce rollback_registered_many() and unregister_netdevice_many()

rollback_registered_many() is able to perform necessary steps at device dismantle
time, factorizing two expensive synchronize_net() calls.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 02:22:06 -07:00
Eric Dumazet
44a0873d52 net: Introduce unregister_netdevice_queue()
This patchs adds an unreg_list anchor to struct net_device, and
introduces an unregister_netdevice_queue() function, able to queue
a net_device to a list instead of immediately unregister it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 02:22:06 -07:00
Eric Dumazet
05423b2413 vlan: allow null VLAN ID to be used
We currently use a 16 bit field (vlan_tci) to store VLAN ID/PRIO on a skb.

Null value is used as a special value, meaning vlan tagging not enabled.
This forbids use of null vlan ID.

As pointed by David, some drivers use the 3 high order bits (PRIO)

As VLAN ID is 12 bits, we can use the remaining bit (CFI) as a flag, and
allow null VLAN ID.

In case future code really wants to use VLAN_CFI_MASK, we'll have to use
a bit outside of vlan_tci.

#define VLAN_PRIO_MASK         0xe000 /* Priority Code Point */
#define VLAN_PRIO_SHIFT        13
#define VLAN_CFI_MASK          0x1000 /* Canonical Format Indicator */
#define VLAN_TAG_PRESENT       VLAN_CFI_MASK
#define VLAN_VID_MASK          0x0fff /* VLAN Identifier */

Reported-by: Gertjan Hofman <gertjan_hofman@yahoo.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-27 01:02:33 -07:00
Eric Dumazet
7c28bd0b8e rtnetlink: speedup rtnl_dump_ifinfo()
When handling large number of netdevice, rtnl_dump_ifinfo()
is very slow because it has O(N^2) complexity.

Instead of scanning one single list, we can use the 256 sub lists
of the dev_index hash table.

This considerably speedups "ip link" operations

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-24 06:13:17 -07:00
Krishna Kumar
a4ee3ce329 net: Use sk_tx_queue_mapping for connected sockets
For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. This is not saved if
either the device has a queue select or the socket is not
connected. Next iterations of dev_pick_tx uses the cached value
of sk_tx_queue_mapping.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-20 18:55:47 -07:00
Eric Dumazet
89d71a66c4 net: Use netdev_alloc_skb_ip_align()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-13 11:48:18 -07:00
Sridhar Samudrala
d9f5950f90 net: Make UFO on master device independent of attached devices
Now that software UFO is supported, UFO can be enabled on master
devices like bridge, bond even though the attached device doesn't
support this feature in hardware.

This allows UFO to be used between KVM host and guest even when a
physical interface attached to the bridge doesn't support UFO.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-07 22:00:23 -07:00
Johannes Berg
7ffbe3fdac net: introduce NETDEV_POST_INIT notifier
For various purposes including a wireless extensions
bugfix, we need to hook into the netdev creation before
before netdev_register_kobject(). This will also ease
doing the dev type assignment that Marcel was working
on for cfg80211 drivers w/o touching them all.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-05 00:43:34 -07:00
Eric Dumazet
81bbb3d404 net: restore tx timestamping for accelerated vlans
Since commit 9b22ea5609
( net: fix packet socket delivery in rx irq handler )

We lost rx timestamping of packets received on accelerated vlans.

Effect is that tcpdump on real dev can show strange timings, since it gets rx timestamps
too late (ie at skb dequeueing time, not at skb queueing time)

14:47:26.986871 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 1
14:47:26.986786 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 1

14:47:27.986888 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 2
14:47:27.986781 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 2

14:47:28.986896 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 3
14:47:28.986780 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 3

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-30 16:42:42 -07:00
Moni Shoua
75c78500dd bonding: remap muticast addresses without using dev_close() and dev_open()
This patch fixes commit e36b9d16c6. The approach
there is to call dev_close()/dev_open() whenever the device type is changed in
order to remap the device IP multicast addresses to HW multicast addresses.
This approach suffers from 2 drawbacks:

*. It assumes tha the device is UP when calling dev_close(), or otherwise
   dev_close() has no affect. It is worth to mention that initscripts (Redhat)
   and sysconfig (Suse) doesn't act the same in this matter. 
*. dev_close() has other side affects, like deleting entries from the routing
   table, which might be unnecessary.

The fix here is to directly remap the IP multicast addresses to HW multicast
addresses for a bonding device that changes its type, and nothing else.
   
Reported-by:   Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-15 02:37:40 -07:00
Linus Torvalds
d7e9660ad9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits)
  netxen: update copyright
  netxen: fix tx timeout recovery
  netxen: fix file firmware leak
  netxen: improve pci memory access
  netxen: change firmware write size
  tg3: Fix return ring size breakage
  netxen: build fix for INET=n
  cdc-phonet: autoconfigure Phonet address
  Phonet: back-end for autoconfigured addresses
  Phonet: fix netlink address dump error handling
  ipv6: Add IFA_F_DADFAILED flag
  net: Add DEVTYPE support for Ethernet based devices
  mv643xx_eth.c: remove unused txq_set_wrr()
  ucc_geth: Fix hangs after switching from full to half duplex
  ucc_geth: Rearrange some code to avoid forward declarations
  phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs
  drivers/net/phy: introduce missing kfree
  drivers/net/wan: introduce missing kfree
  net: force bridge module(s) to be GPL
  Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded
  ...

Fixed up trivial conflicts:

 - arch/x86/include/asm/socket.h

   converted to <asm-generic/socket.h> in the x86 tree.  The generic
   header has the same new #define's, so that works out fine.

 - drivers/net/tun.c

   fix conflict between 89f56d1e9 ("tun: reuse struct sock fields") that
   switched over to using 'tun->socket.sk' instead of the redundantly
   available (and thus removed) 'tun->sk', and 2b980dbd ("lsm: Add hooks
   to the TUN driver") which added a new 'tun->sk' use.

   Noted in 'next' by Stephen Rothwell.
2009-09-14 10:37:28 -07:00
Stephen Hemminger
4fb019a01a net: force bridge module(s) to be GPL
The only valid usage for the bridge frame hooks are by a
GPL components (such as the bridge module).
The kernel should not leave a crack in the door for proprietary
networking stacks to slip in.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-11 12:54:26 -07:00
Eric Dumazet
55f9d6786d net: Remove debugging code
Remove a debugging aid I accidently left in previous 'cleanup' patch

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-03 05:17:20 -07:00
Eric Dumazet
d1b19dff91 net: net/core/dev.c cleanups
Pure style cleanup patch before surgery :)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-03 01:29:39 -07:00
Krishna Kumar
03a9a447d2 net: convert remaining non-symbolic return values in dev_queue_xmit
Patch compiled and 32 simultaneous netperf testing ran fine.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-30 22:16:57 -07:00
Dmitry Eremin-Solenikov
929122cdd5 Drop ARPHRD_IEEE802154_PHY
There are not maste devices in mac802154 anymore, so drop
ARPHRD_IEEE802154_PHY definition.

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
2009-08-19 23:08:24 +04:00
Eric Paris
a8f80e8ff9 Networking: use CAP_NET_ADMIN when deciding to call request_module
The networking code checks CAP_SYS_MODULE before using request_module() to
try to load a kernel module.  While this seems reasonable it's actually
weakening system security since we have to allow CAP_SYS_MODULE for things
like /sbin/ip and bluetoothd which need to be able to trigger module loads.
CAP_SYS_MODULE actually grants those binaries the ability to directly load
any code into the kernel.  We should instead be protecting modprobe and the
modules on disk, rather than granting random programs the ability to load code
directly into the kernel.  Instead we are going to gate those networking checks
on CAP_NET_ADMIN which still limits them to root but which does not grant
those processes the ability to load arbitrary code into the kernel.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Paul Moore <paul.moore@hp.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: James Morris <jmorris@namei.org>
2009-08-14 11:18:34 +10:00
David S. Miller
aa11d958d1 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	arch/microblaze/include/asm/socket.h
2009-08-12 17:44:53 -07:00
Krishna Kumar
bbd8a0d3a3 net: Avoid enqueuing skb for default qdiscs
dev_queue_xmit enqueue's a skb and calls qdisc_run which
dequeue's the skb and xmits it. In most cases, the skb that
is enqueue'd is the same one that is dequeue'd (unless the
queue gets stopped or multiple cpu's write to the same queue
and ends in a race with qdisc_run). For default qdiscs, we
can remove the redundant enqueue/dequeue and simply xmit the
skb since the default qdisc is work-conserving.

The patch uses a new flag - TCQ_F_CAN_BYPASS to identify the
default fast queue. The controversial part of the patch is
incrementing qlen when a skb is requeued - this is to avoid
checks like the second line below:

+  } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>>         !q->gso_skb &&
+          !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {

Results of a 2 hour testing for multiple netperf sessions (1,
2, 4, 8, 12 sessions on a 4 cpu system-X). The BW numbers are
aggregate Mb/s across iterations tested with this version on
System-X boxes with Chelsio 10gbps cards:

----------------------------------
Size |  ORG BW          NEW BW   |
----------------------------------
128K |  156964          159381   |
256K |  158650          162042   |
----------------------------------

Changes from ver1:

1. Move sch_direct_xmit declaration from sch_generic.h to
   pkt_sched.h
2. Update qdisc basic statistics for direct xmit path.
3. Set qlen to zero in qdisc_reset.
4. Changed some function names to more meaningful ones.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-06 20:10:18 -07:00
Jan Engelhardt
36cbd3dcc1 net: mark read-only arrays as const
String literals are constant, and usually, we can also tag the array
of pointers const too, moving it to the .rodata section.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-05 10:42:58 -07:00
Ingo Molnar
0bf52b9817 net: Fix spinlock use in alloc_netdev_mq()
-tip testing found this lockdep warning:

[    2.272010] calling  net_dev_init+0x0/0x164 @ 1
[    2.276033] device class 'net': registering
[    2.280191] INFO: trying to register non-static key.
[    2.284005] the code is fine but needs lockdep annotation.
[    2.284005] turning off the locking correctness validator.
[    2.284005] Pid: 1, comm: swapper Not tainted 2.6.31-rc5-tip #1145
[    2.284005] Call Trace:
[    2.284005]  [<7958eb4e>] ? printk+0xf/0x11
[    2.284005]  [<7904f83c>] __lock_acquire+0x11b/0x622
[    2.284005]  [<7908c9b7>] ? alloc_debug_processing+0xf9/0x144
[    2.284005]  [<7904e2be>] ? mark_held_locks+0x3a/0x52
[    2.284005]  [<7908dbc4>] ? kmem_cache_alloc+0xa8/0x13f
[    2.284005]  [<7904e475>] ? trace_hardirqs_on_caller+0xa2/0xc3
[    2.284005]  [<7904fdf6>] lock_acquire+0xb3/0xd0
[    2.284005]  [<79489678>] ? alloc_netdev_mq+0xf5/0x1ad
[    2.284005]  [<79591514>] _spin_lock_bh+0x2d/0x5d
[    2.284005]  [<79489678>] ? alloc_netdev_mq+0xf5/0x1ad
[    2.284005]  [<79489678>] alloc_netdev_mq+0xf5/0x1ad
[    2.284005]  [<793a38f2>] ? loopback_setup+0x0/0x74
[    2.284005]  [<798eecd0>] loopback_net_init+0x20/0x5d
[    2.284005]  [<79483efb>] register_pernet_device+0x23/0x4b
[    2.284005]  [<798f5c9f>] net_dev_init+0x115/0x164
[    2.284005]  [<7900104f>] do_one_initcall+0x4a/0x11a
[    2.284005]  [<798f5b8a>] ? net_dev_init+0x0/0x164
[    2.284005]  [<79066f6d>] ? register_irq_proc+0x8c/0xa8
[    2.284005]  [<798cc29a>] do_basic_setup+0x42/0x52
[    2.284005]  [<798cc30a>] kernel_init+0x60/0xa1
[    2.284005]  [<798cc2aa>] ? kernel_init+0x0/0xa1
[    2.284005]  [<79003e03>] kernel_thread_helper+0x7/0x10
[    2.284078] device: 'lo': device_add
[    2.288248] initcall net_dev_init+0x0/0x164 returned 0 after 11718 usecs
[    2.292010] calling  neigh_init+0x0/0x66 @ 1
[    2.296010] initcall neigh_init+0x0/0x66 returned 0 after 0 usecs

it's using an zero-initialized spinlock. This is a side-effect of:

        dev_unicast_init(dev);

in alloc_netdev_mq() making use of dev->addr_list_lock.

The device has just been allocated freshly, it's not accessible
anywhere yet so no locking is needed at all - in fact it's wrong
to lock it here (the lock isnt initialized yet).

This bug was introduced via:

| commit a6ac65db23
| Date:   Thu Jul 30 01:06:12 2009 +0000
|
|     net: restore the original spinlock to protect unicast list

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jiri Pirko <jpirko@redhat.com>
Tested-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-05 08:35:11 -07:00
Jiri Pirko
a6ac65db23 net: restore the original spinlock to protect unicast list
There is a path when an assetion in dev_unicast_sync() appears.

igmp6_group_added -> dev_mc_add -> __dev_set_rx_mode ->
-> vlan_dev_set_rx_mode -> dev_unicast_sync

Therefore we cannot protect this list with rtnl. This patch restores the
original protecting this list with spinlock.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-02 12:20:46 -07:00
Johannes Berg
463d018323 cfg80211: make aware of net namespaces
In order to make cfg80211/nl80211 aware of network namespaces,
we have to do the following things:

 * del_virtual_intf method takes an interface index rather
   than a netdev pointer - simply change this

 * nl80211 uses init_net a lot, it changes to use the sender's
   network namespace

 * scan requests use the interface index, hold a netdev pointer
   and reference instead

 * we want a wiphy and its associated virtual interfaces to be
   in one netns together, so
    - we need to be able to change ns for a given interface, so
      export dev_change_net_namespace()
    - for each virtual interface set the NETIF_F_NETNS_LOCAL
      flag, and clear that flag only when the wiphy changes ns,
      to disallow breaking this invariant

 * when a network namespace goes away, we need to reparent the
   wiphy to init_net

 * cfg80211 users that support creating virtual interfaces must
   create them in the wiphy's namespace, currently this affects
   only mac80211

The end result is that you can now switch an entire wiphy into
a different network namespace with the new command
	iw phy#<idx> set netns <pid>
and all virtual interfaces will follow (or the operation fails).

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-27 15:24:07 -04:00
Johannes Berg
c4029083e2 net: export __dev_addr_sync/__dev_addr_unsync
For mac80211, with the master netdev removal, we need to be
able to sync a multicast address list onto another list that
is not tracked within a netdev, so we need access to the
functions doing that.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-24 15:05:30 -04:00
Patrick McHardy
ec634fe328 net: convert remaining non-symbolic return values in ndo_start_xmit() functions
This patch converts the remaining occurences of raw return values to their
symbolic counterparts in ndo_start_xmit() functions that were missed by the
previous automatic conversion.

Additionally code that assumed the symbolic value of NETDEV_TX_OK to be zero
is changed to explicitly use NETDEV_TX_OK.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-05 19:23:38 -07:00
Herbert Xu
ff780cd8f2 gro: Flush GRO packets in napi_disable_pending path
When NAPI is disabled while we're in net_rx_action, we end up
calling __napi_complete without flushing GRO packets.  This is
a bug as it would cause the GRO packets to linger, of course it
also literally BUGs to catch error like this :)

This patch changes it to napi_complete, with the obligatory IRQ
reenabling.  This should be safe because we've only just disabled
IRQs and it does not materially affect the test conditions in
between.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-26 19:27:04 -07:00
Herbert Xu
d55d87fdff net: Move rx skb_orphan call to where needed
In order to get the tun driver to account packets, we need to be
able to receive packets with destructors set.  To be on the safe
side, I added an skb_orphan call for all protocols by default since
some of them (IP in particular) cannot handle receiving packets
destructors properly.

Now it seems that at least one protocol (CAN) expects to be able
to pass skb->sk through the rx path without getting clobbered.

So this patch attempts to fix this properly by moving the skb_orphan
call to where it's actually needed.  In particular, I've added it
to skb_set_owner_[rw] which is what most users of skb->destructor
call.

This is actually an improvement for tun too since it means that
we only give back the amount charged to the socket when the skb
is passed to another socket that will also be charged accordingly.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Oliver Hartkopp <olver@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-23 16:36:25 -07:00
Jiri Pirko
31278e7147 net: group address list and its count
This patch is inspired by patch recently posted by Johannes Berg. Basically what
my patch does is to group list and a count of addresses into newly introduced
structure netdev_hw_addr_list. This brings us two benefits:
1) struct net_device becames a bit nicer.
2) in the future there will be a possibility to operate with lists independently
   on netdevices (with exporting right functions).
I wanted to introduce this patch before I'll post a multicast lists conversion.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

 drivers/net/bnx2.c              |    4 +-
 drivers/net/e1000/e1000_main.c  |    4 +-
 drivers/net/ixgbe/ixgbe_main.c  |    6 +-
 drivers/net/mv643xx_eth.c       |    2 +-
 drivers/net/niu.c               |    4 +-
 drivers/net/virtio_net.c        |   10 ++--
 drivers/s390/net/qeth_l2_main.c |    2 +-
 include/linux/netdevice.h       |   17 +++--
 net/core/dev.c                  |  130 ++++++++++++++++++--------------------
 9 files changed, 89 insertions(+), 90 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-18 00:29:08 -07:00
David S. Miller
9cbc1cb8cd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:
	Documentation/feature-removal-schedule.txt
	drivers/scsi/fcoe/fcoe.c
	net/core/drop_monitor.c
	net/core/net-traces.c
2009-06-15 03:02:23 -07:00
Michał Mirosław
da6782927d bridge: Simplify interface for ATM LANE
This patch changes FDB entry check for ATM LANE bridge integration.
There's no point in holding a FDB entry around SKB building.

br_fdb_get()/br_fdb_put() pair are changed into single br_fdb_test_addr()
hook that checks if the addr has FDB entry pointing to other port
to the one the request arrived on.

FDB entry refcounting is removed as it's not used anywhere else.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 21:03:21 -07:00
John Dykstra
746e6ad23c [PATCH] net core: Some interface flags not returned by SIOCGIFFLAGS
Commit b00055aacd " [NET] core: add
RFC2863 operstate" defined new interface flag values.  Its
documentation specified that these flags could be accessed from user
space via SIOCGIFFLAGS.  However, this does not work because the new
flags do not fit in that ioctl's argument width.

Change the documentation to match the code's behavior.  Also change
the source to explicitly show the truncation.  This _should_ have no
effect on executable code, and did not with gcc 4.2.4 generating x86
code.

A new ioctl could be defined to return all interface flags to user
space.  However, since this has been broken for three years with no
one complaining, there doesn't seem much need.  They are still
accessible via netlink.

Reported-by:  "Fredrik Arnerup" <fredrik.arnerup@edgeware.tv>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 20:57:21 -07:00
Sergey Lapin
fcb94e4224 Add constants for the ieee 802.15.4 stack
IEEE 802.15.4 stack requires several constants to be defined/adjusted.

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Signed-off-by: Sergey Lapin <slapin@ossfans.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-09 05:25:30 -07:00
Eric Dumazet
0c27922e49 net: dev_addr_init() fix
commit f001fde5ea
(net: introduce a list of device addresses dev_addr_list (v6))
added one regression Vegard Nossum found in its testings.

With kmemcheck help, Vegard found some uninitialized memory
was read and reported to user, potentialy leaking kernel data.
( thread can be found on http://lkml.org/lkml/2009/5/30/177 )

dev_addr_init() incorrectly uses sizeof() operator. We were
initializing one byte instead of MAX_ADDR_LEN bytes.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-09 05:11:42 -07:00
David S. Miller
4cf704fbea net/core/dev.c: Use frag list abstraction interfaces.
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-09 00:18:51 -07:00
Johannes Berg
3b8bcfd5d3 net: introduce pre-up netdev notifier
NETDEV_UP is called after the device is set UP, but sometimes
it is useful to be able to veto the device UP. Introduce a
new NETDEV_PRE_UP notifier that can be used for exactly this.
The first use case will be cfg80211 denying interfaces to be
set UP if the device is known to be rfkill'ed.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-06-03 14:05:12 -04:00
Eric Dumazet
adf30907d6 net: skb->dst accessors
Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03 02:51:04 -07:00
Jiri Pirko
ccffad25b5 net: convert unicast addr list
This patch converts unicast address list to standard list_head using
previously introduced struct netdev_hw_addr. It also relaxes the
locking. Original spinlock (still used for multicast addresses) is not
needed and is no longer used for a protection of this list. All
reading and writing takes place under rtnl (with no changes).

I also removed a possibility to specify the length of the address
while adding or deleting unicast address. It's always dev->addr_len.

The convertion touched especially e1000 and ixgbe codes when the
change is not so trivial.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

 drivers/net/bnx2.c               |   13 +--
 drivers/net/e1000/e1000_main.c   |   24 +++--
 drivers/net/ixgbe/ixgbe_common.c |   14 ++--
 drivers/net/ixgbe/ixgbe_common.h |    4 +-
 drivers/net/ixgbe/ixgbe_main.c   |    6 +-
 drivers/net/ixgbe/ixgbe_type.h   |    4 +-
 drivers/net/macvlan.c            |   11 +-
 drivers/net/mv643xx_eth.c        |   11 +-
 drivers/net/niu.c                |    7 +-
 drivers/net/virtio_net.c         |    7 +-
 drivers/s390/net/qeth_l2_main.c  |    6 +-
 drivers/scsi/fcoe/fcoe.c         |   16 ++--
 include/linux/netdevice.h        |   18 ++--
 net/8021q/vlan.c                 |    4 +-
 net/8021q/vlan_dev.c             |   10 +-
 net/core/dev.c                   |  195 +++++++++++++++++++++++++++-----------
 net/dsa/slave.c                  |   10 +-
 net/packet/af_packet.c           |    4 +-
 18 files changed, 227 insertions(+), 137 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-29 22:12:32 -07:00
Eric Dumazet
1ce8e7b57b net: ALIGN/PTR_ALIGN cleanup in alloc_netdev_mq()/netdev_priv()
Use ALIGN() and PTR_ALIGN() macros instead of handcoding them.

Get rid of NETDEV_ALIGN_CONST ugly define

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-27 15:47:06 -07:00
Herbert Xu
cb18978cbf gro: Open-code final pskb_may_pull
As we know the only packets which need the final pskb_may_pull
are completely non-linear, and have all the required bits in
frag0, we can perform a straight memcpy instead of going through
pskb_may_pull and doing skb_copy_bits.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-27 03:26:02 -07:00
Herbert Xu
a5b1cf288d gro: Avoid unnecessary comparison after skb_gro_header
For the overwhelming majority of cases, skb_gro_header's return
value cannot be NULL.  Yet we must check it because of its current
form.  This patch splits it up into multiple functions in order
to avoid this.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-27 03:26:01 -07:00
Herbert Xu
7489594cb2 gro: Optimise length comparison in skb_gro_header
By caching frag0_len, we can avoid checking both frag0 and the
length separately in skb_gro_header.  This helps as skb_gro_header
is called four times per packet which amounts to a few million
times at 10Gb/s.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-27 03:26:01 -07:00
Herbert Xu
78d3fd0b7d gro: Only use skb_gro_header for completely non-linear packets
Currently skb_gro_header is used for packets which put the hardware
header in skb->data with the rest in frags.  Since the drivers that
need this optimisation all provide completely non-linear packets,
we can gain extra optimisations by only performing the frag0
optimisation for completely non-linear packets.

In particular, we can simply test frag0 (instead of skb_headlen)
to see whether the optimisation is in force.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-27 03:25:57 -07:00
Herbert Xu
78a478d0ef gro: Inline skb_gro_header and cache frag0 virtual address
The function skb_gro_header is called four times per packet which
quickly adds up at 10Gb/s.  This patch inlines it to allow better
optimisations.

Some architectures perform multiplication for page_address, which
is done by each skb_gro_header invocation.  This patch caches that
value in skb->cb to avoid the unnecessary multiplications.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-27 03:25:55 -07:00
Eric Dumazet
08baf56108 net: txq_trans_update() helper
We would like to get rid of netdev->trans_start = jiffies; that about all net
drivers have to use in their start_xmit() function, and use txq->trans_start
instead.

This can be done generically in core network, as suggested by David.

Some devices, (particularly loopback) dont need trans_start update, because
they dont have transmit watchdog. We could add a new device flag, or rely
on fact that txq->tran_start can be updated is txq->xmit_lock_owner is
different than -1. Use a helper function to hide our choice.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-25 22:58:01 -07:00
Alexander Beregalov
e3804cbebb net: remove COMPAT_NET_DEV_OPS
All drivers are already converted to new net_device_ops API
and nobody uses old API anymore.

Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-25 01:53:53 -07:00
Neil Horman
4ea7e38696 dropmon: add ability to detect when hardware dropsrxpackets
Patch to add the ability to detect drops in hardware interfaces via dropwatch.
Adds a tracepoint to net_rx_action to signal everytime a napi instance is
polled.  The dropmon code then periodically checks to see if the rx_frames
counter has changed, and if so, adds a drop notification to the netlink
protocol, using the reserved all-0's vector to indicate the drop location was in
hardware, rather than somewhere in the code.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

 include/linux/net_dropmon.h |    8 ++
 include/trace/napi.h        |   11 +++
 net/core/dev.c              |    5 +
 net/core/drop_monitor.c     |  124 ++++++++++++++++++++++++++++++++++++++++++--
 net/core/net-traces.c       |    4 +
 net/core/netpoll.c          |    2
 6 files changed, 149 insertions(+), 5 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-21 16:50:21 -07:00
Eric Dumazet
93f154b594 net: release dst entry in dev_hard_start_xmit()
One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls
dev_kree_skb() long after original call to dev_queue_xmit(skb).

CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.

It seems right place to release dst is in dev_hard_start_xmit(), for most
devices but ones that are virtual, and some exceptions.

David Miller suggested to define a new device flag, set in alloc_netdev_mq()
(so that most devices set it at init time), and carefuly unset in devices
which dont want a NULL skb->dst in their ndo_start_xmit().

List of devices that must clear this flag is :

- loopback device, because it calls netif_rx() and quoting Patrick :
    "ip_route_input() doesn't accept loopback addresses, so loopback packets
     already need to have a dst_entry attached."
- appletalk/ipddp.c : needs skb->dst in its xmit function

- And all devices that call again dev_queue_xmit() from their xmit function
(as some classifiers need skb->dst) : bonding, vlan, macvlan, eql, ifb, hdlc_fr

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-18 22:19:19 -07:00
Eric Dumazet
7004bf252c net: add tx_packets/tx_bytes/tx_dropped counters in struct netdev_queue
offsetof(struct net_device, features)=0x44
offsetof(struct net_device, stats.tx_packets)=0x54
offsetof(struct net_device, stats.tx_bytes)=0x5c
offsetof(struct net_device, stats.tx_dropped)=0x6c

Network drivers that touch dev->stats.tx_packets/stats.tx_bytes in their
tx path can slow down SMP operations, since they dirty a cache line
that should stay shared (dev->features is needed in rx and tx paths)

We could move away stats field in net_device but it wont help that much.
(Two cache lines dirtied in tx path, we can do one only)

Better solution is to add tx_packets/tx_bytes/tx_dropped in struct
netdev_queue because this structure is already touched in tx path and
counters updates will then be free (no increase in size)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-18 15:15:06 -07:00
Jiri Pirko
ab9c73ccb5 net: check retval of dev_addr_init()
Add missed checking of dev_addr_init return value in alloc_netdev_mq.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

 net/core/dev.c |   15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-09 13:15:48 -07:00
Jiri Pirko
f001fde5ea net: introduce a list of device addresses dev_addr_list (v6)
v5 -> v6 (current):
-removed so far unused static functions
-corrected dev_addr_del_multiple to call del instead of add

v4 -> v5:
-added device address type (suggested by davem)
-removed refcounting (better to have simplier code then safe potentially few
 bytes)

v3 -> v4:
-changed kzalloc to kmalloc in __hw_addr_add_ii()
-ASSERT_RTNL() avoided in dev_addr_flush() and dev_addr_init()

v2 -> v3:
-removed unnecessary rcu read locking
-moved dev_addr_flush() calling to ensure no null dereference of dev_addr

v1 -> v2:
-added forgotten ASSERT_RTNL to dev_addr_init and dev_addr_flush
-removed unnecessary rcu_read locking in dev_addr_init
-use compare_ether_addr_64bits instead of compare_ether_addr
-use L1_CACHE_BYTES as size for allocating struct netdev_hw_addr
-use call_rcu instead of rcu_synchronize
-moved is_etherdev_addr into __KERNEL__ ifdef

This patch introduces a new list in struct net_device and brings a set of
functions to handle the work with device address list. The list is a replacement
for the original dev_addr field and because in some situations there is need to
carry several device addresses with the net device. To be backward compatible,
dev_addr is made to point to the first member of the list so original drivers
sees no difference.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-05 12:26:24 -07:00
David S. Miller
513de11bba net: Avoid modulus in skb_tx_hash() for forwarding case.
Based almost entirely upon a patch by Eric Dumazet.

The common case is to have num-tx-queues <= num_rx_queues
and even if num_tx_queues is larger it will not be significantly
larger.

Therefore, a subtraction loop is always going to be faster than
modulus.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-03 14:43:10 -07:00
David S. Miller
d252a5e7b7 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2009-05-03 14:07:43 -07:00
Eric Dumazet
ec581f6a42 net: Fix skb_tx_hash() for forwarding workloads.
When skb_rx_queue_recorded() is true, we dont want to use jash distribution
as the device driver exactly told us which queue was selected at RX time.
jhash makes a statistical shuffle, but this wont work with 8 static inputs.

Later improvements would be to compute reciprocal value of real_num_tx_queues
to avoid a divide here. But this computation should be done once,
when real_num_tx_queues is set. This needs a separate patch, and a new
field in struct net_device.

Reported-by: Andrew Dickinson <andrew@whydna.net>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-01 09:05:06 -07:00
Herbert Xu
edbd9e3030 gro: Fix handling of headers that extend over the tail
The skb_gro_* code fails to handle the case where a header starts
in the linear area but ends in the frags area.  Since the goal
of skb_gro_* is to optimise the case of completely non-linear
packets, we can simply bail out if we have anything in the linear
area.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-27 05:44:29 -07:00
David S. Miller
e5e9743bb7 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/core/dev.c
2009-04-21 01:32:26 -07:00
Ben Hutchings
5db8765a86 net: Fix GRO for multiple page fragments
This loop over fragments in napi_fraginfo_skb() was "interesting".

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-20 02:20:30 -07:00
Marcin Slusarz
eb39c57ff7 net: fix "compatibility" typos
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-20 02:15:01 -07:00
Jarek Poplawski
8caf153974 net: sch_netem: Fix an inconsistency in ingress netem timestamps.
Alex Sidorenko reported:

"while experimenting with 'netem' we have found some strange behaviour. It
seemed that ingress delay as measured by 'ping' command shows up on some
hosts but not on others.

After some investigation I have found that the problem is that skbuff->tstamp
field value depends on whether there are any packet sniffers enabled. That
is:

- if any ptype_all handler is registered, the tstamp field is as expected
- if there are no ptype_all handlers, the tstamp field does not show the delay"

This patch prevents unnecessary update of tstamp in dev_queue_xmit_nit()
on ingress path (with act_mirred) adding a check, so minimal overhead on
the fast path, but only when sniffers etc. are active.

Since netem at ingress seems to logically emulate a network before a host,
tstamp is zeroed to trigger the update and pretend delays are from the
outside.

Reported-by: Alex Sidorenko <alexandre.sidorenko@hp.com>
Tested-by: Alex Sidorenko <alexandre.sidorenko@hp.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-20 02:14:59 -07:00
David S. Miller
a54bfa40fd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2009-04-16 17:35:26 -07:00
Herbert Xu
76620aafd6 gro: New frags interface to avoid copying shinfo
It turns out that copying a 16-byte area at ~800k times a second
can be really expensive :) This patch redesigns the frags GRO
interface to avoid copying that area twice.

The two disciples of the frags interface have been converted.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-16 02:02:07 -07:00
Herbert Xu
fc59f9a3bf gro: Restore correct value to gso_size
Since everybody has been focusing on baremetal GRO performance
no one noticed when I added a bug that zapped gso_size for all
GRO packets.  This only gets picked up when you forward the skb
out of an interface.

Thanks to Mark Wagner for noticing this bug when testing kvm.

Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-14 15:11:06 -07:00
Alexander Duyck
d543103a0c net: netif_device_attach/detach should start/stop all queues
Currently netif_device_attach/detach are only stopping one queue.  They
should be starting and stopping all the queues on a given device.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-11 01:43:10 -07:00
Stephen Hemminger
f2bde73286 net: allow multiple dev per napi with GRO
GRO assumes that there is a one-to-one relationship between NAPI
structure and network device. Some devices like sky2 share multiple
devices on a single interrupt so only have one NAPI handler. Rather than
split GRO from NAPI, just have GRO assume if device changes that
it is a different flow.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-02 01:07:37 -07:00
Linus Torvalds
d54b3538b0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (119 commits)
  [SCSI] scsi_dh_rdac: Retry for NOT_READY check condition
  [SCSI] mpt2sas: make global symbols unique
  [SCSI] sd: Make revalidate less chatty
  [SCSI] sd: Try READ CAPACITY 16 first for SBC-2 devices
  [SCSI] sd: Refactor sd_read_capacity()
  [SCSI] mpt2sas v00.100.11.15
  [SCSI] mpt2sas: add MPT2SAS_MINOR(221) to miscdevice.h
  [SCSI] ch: Add scsi type modalias
  [SCSI] 3w-9xxx: add power management support
  [SCSI] bsg: add linux/types.h include to bsg.h
  [SCSI] cxgb3i: fix function descriptions
  [SCSI] libiscsi: fix possbile null ptr session command cleanup
  [SCSI] iscsi class: remove host no argument from session creation callout
  [SCSI] libiscsi: pass session failure a session struct
  [SCSI] iscsi lib: remove qdepth param from iscsi host allocation
  [SCSI] iscsi lib: have lib create work queue for transmitting IO
  [SCSI] iscsi class: fix lock dep warning on logout
  [SCSI] libiscsi: don't cap queue depth in iscsi modules
  [SCSI] iscsi_tcp: replace scsi_debug/tcp_debug logging with iscsi conn logging
  [SCSI] libiscsi_tcp: replace tcp_debug/scsi_debug logging with session/conn logging
  ...
2009-03-28 13:30:43 -07:00
Herbert Xu
8f1ead2d1a GRO: Disable GRO on legacy netif_rx path
When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete.  Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.

Since we can't seem to find a fix that works properly right now,
this patch reverts all the GRO support from the netif_rx path.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-26 22:24:28 -07:00
Eric Dumazet
ed734a97c6 net: remove useless prefetch() call
There is no gain using prefetch() in dev_hard_start_xmit(), since
we already had to read ops->ndo_select_queue pointer in dev_pick_tx(),
and both pointers are probably located in the same cache line.

This prefetch call slows down fast path because of a stall in address
computation.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-21 13:42:55 -07:00
Stephen Hemminger
9247744e5e skb: expose and constify hash primitives
Some minor changes to queue hashing:
 1. Use const on accessor functions
 2. Export skb_tx_hash for use in drivers (see ixgbe)

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-21 13:39:26 -07:00
David S. Miller
2b1c4354de Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/virtio_net.c
2009-03-20 02:27:41 -07:00
Roel Kluin
e4a389a9b5 net: kfree(napi->skb) => kfree_skb
struct sk_buff pointers should be freed with kfree_skb.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-18 23:12:13 -07:00
David S. Miller
2d6a5e9500 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/igb/igb_main.c
	drivers/net/qlge/qlge_main.c
	drivers/net/wireless/ath9k/ath9k.h
	drivers/net/wireless/ath9k/core.h
	drivers/net/wireless/ath9k/hw.c
2009-03-17 15:01:30 -07:00
Herbert Xu
303c6a0251 gro: Fix legacy path napi_complete crash
On the legacy netif_rx path, I incorrectly tried to optimise
the napi_complete call by using __napi_complete before we reenable
IRQs.  This simply doesn't work since we need to flush the held
GRO packets first.

This patch fixes it by doing the obvious thing of reenabling
IRQs first and then calling napi_complete.

Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-17 13:11:29 -07:00
Herbert Xu
d1c76af9e2 GRO: Move netpoll checks to correct location
As my netpoll fix for net doesn't really work for net-next, we
need this update to move the checks into the right place.  As it
stands we may pass freed skbs to netpoll_receive_skb.

This patch also introduces a netpoll_rx_on function to avoid GRO
completely if we're invoked through netpoll.  This might seem
paranoid but as netpoll may have an external receive hook it's
better to be safe than sorry.  I don't think we need this for
2.6.29 though since there's nothing immediately broken by it.

This patch also moves the GRO_* return values to netdevice.h since
VLAN needs them too (I tried to avoid this originally but alas
this seems to be the easiest way out).  This fixes a bug in VLAN
where it continued to use the old return value 2 instead of the
correct GRO_DROP.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-16 10:50:02 -07:00
Yi Zou
1c8dbcf649 [SCSI] net: add NETIF_F_FCOE_CRC to can_checksum_protocol
Add FC CRC offload check for ETH_P_FCOE.

Signed-off-by: Yi Zou <yi.zou@intel.com>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-03-13 15:11:38 -05:00
David S. Miller
508827ff0a Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/tokenring/tmspci.c
	drivers/net/ucc_geth_mii.c
2009-03-05 02:06:47 -08:00
David S. Miller
9d40bbda59 vlan: Fix vlan-in-vlan crashes.
As analyzed by Patrick McHardy, vlan needs to reset it's
netdev_ops pointer in it's ->init() function but this
leaves the compat method pointers stale.

Add a netdev_resync_ops() and call it from the vlan code.

Any other driver which changes ->netdev_ops after register_netdevice()
will need to call this new function after doing so too.

With help from Patrick McHardy.

Tested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-04 23:46:25 -08:00
David S. Miller
54acd0efab net: Fix missing dev->neigh_setup in register_netdevice().
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-04 23:01:02 -08:00
Eric W. Biederman
17edde5209 netns: Remove net_alive
It turns out that net_alive is unnecessary, and the original problem
that led to it being added was simply that the icmp code thought
it was a network device and wound up being unable to handle packets
while there were still packets in the network namespace.

Now that icmp and tcp have been fixed to properly register themselves
this problem is no longer present and we have a stronger guarantee
that packets will not arrive in a network namespace then that provided
by net_alive in netif_receive_skb.  So remove net_alive allowing
packet reception run a little faster.

Additionally document the strong reason why network namespace cleanup
is safe so that if something happens again someone else will have
a chance of figuring it out.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-03 01:14:27 -08:00
David S. Miller
aa4abc9bcc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-tx.c
	net/8021q/vlan_core.c
	net/core/dev.c
2009-03-01 21:35:16 -08:00
Herbert Xu
4ead443163 netpoll: Add drop checks to all entry points
The netpoll entry checks are required to ensure that we don't
receive normal packets when invoked via netpoll.  Unfortunately
it only ever worked for the netif_receive_skb/netif_rx entry
points.  The VLAN (and subsequently GRO) entry point didn't
have the check and therefore can trigger all sorts of weird
problems.

This patch adds the netpoll check to all entry points.

I'm still uneasy with receiving at all under netpoll (which
apparently is only used by the out-of-tree kdump code).  The
reason is it is perfectly legal to receive all data including
headers into highmem if netpoll is off, but if you try to do
that with netpoll on and someone gets a printk in an IRQ handler                                             
you're going to get a nice BUG_ON.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-01 00:11:52 -08:00
Eric W. Biederman
ce16c5337a netns: Remove net_alive
It turns out that net_alive is unnecessary, and the original problem
that led to it being added was simply that the icmp code thought
it was a network device and wound up being unable to handle packets
while there were still packets in the network namespace.

Now that icmp and tcp have been fixed to properly register themselves
this problem is no longer present and we have a stronger guarantee
that packets will not arrive in a network namespace then that provided
by net_alive in netif_receive_skb.  So remove net_alive allowing
packet reception run a little faster.

Additionally document the strong reason why network namespace cleanup
is safe so that if something happens again someone else will have
a chance of figuring it out.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-22 19:54:50 -08:00
Patrick Ohly
cd4d8fdad1 net: kernel panic in dev_hard_start_xmit: remove faulty software TX time stamping
The current implementation of the TX software time stamping fallback is
faulty because it accesses the skb after ndo_start_xmit() returns
successfully. This patch removes the fallback, which fixes kernel panics
seen during stress tests. Hardware time stamping is not affected by this
removal.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-21 02:42:18 -08:00
Krishna Kumar
e88721f87d net: Optimize skb_tx_hash() by eliminating a comparison
Optimize skb_tx_hash() by eliminating a comparison that executes for
every packet. skb_tx_hashrnd initialization is moved to a later part of
the startup sequence, namely after the "random" driver is initialized.

Rebooted the system three times and verified that the code generates
different random numbers each time.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-18 17:55:02 -08:00
Patrick Ohly
d24fff22d8 net: pass new SIOCSHWTSTAMP through to device drivers
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-15 22:43:38 -08:00
Patrick Ohly
ac45f602ee net: infrastructure for hardware time stamping
The additional per-packet information (16 bytes for time stamps, 1
byte for flags) is stored for all packets in the skb_shared_info
struct. This implementation detail is hidden from users of that
information via skb_* accessor functions. A separate struct resp.
union is used for the additional information so that it can be
stored/copied easily outside of skb_shared_info.

Compared to previous implementations (reusing the tstamp field
depending on the context, optional additional structures) this
is the simplest solution. It does not extend sk_buff itself.

TX time stamping is implemented in software if the device driver
doesn't support hardware time stamping.

The new semantic for hardware/software time stamping around
ndo_start_xmit() is based on two assumptions about existing
network device drivers which don't support hardware time
stamping and know nothing about it:
 - they leave the new skb_shared_tx unmodified
 - the keep the connection to the originating socket in skb->sk
   alive, i.e., don't call skb_orphan()

Given that skb_shared_tx is new, the first assumption is safe.
The second is only true for some drivers. As a result, software
TX time stamping currently works with the bnx2 driver, but not
with the unmodified igb driver (the two drivers this patch series
was tested with).

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-15 22:43:34 -08:00
Herbert Xu
aa4b9f533e gro: Optimise Ethernet header comparison
This patch optimises the Ethernet header comparison to use 2-byte
and 4-byte xors instead of memcmp.  In order to facilitate this,
the actual comparison is now carried out by the callers of the
shared dev_gro_receive function.

This has a significant impact when receiving 1500B packets through
10GbE.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-08 20:22:18 -08:00
Herbert Xu
4ae5544f9a gro: Remember number of held packets instead of counting every time
This patch prepares for the move of the same_flow checks out of
dev_gro_receive.  As such we need to remember the number of held
packets since doing a loop just to count them every time is silly.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-08 20:22:17 -08:00
David S. Miller
409f0a9014 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn.c
	drivers/net/wireless/iwlwifi/iwl3945-base.c
2009-02-07 02:52:44 -08:00
David S. Miller
b4bd07c20b net_dma: call dmaengine_get only if NET_DMA enabled
Based upon a patch from Atsushi Nemoto <anemo@mba.ocn.ne.jp>

--------------------
The commit 649274d993 ("net_dma:
acquire/release dma channels on ifup/ifdown") added unconditional call
of dmaengine_get() to net_dma.  The API should be called only if
NET_DMA was enabled.
--------------------

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
2009-02-06 22:06:43 -08:00