2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

759 Commits

Author SHA1 Message Date
David S. Miller
92101b3b2e ipv4: Prepare for change of rt->rt_iif encoding.
Use inet_iif() consistently, and for TCP record the input interface of
cached RX dst in inet sock.

rt->rt_iif is going to be encoded differently, so that we can
legitimately cache input routes in the FIB info more aggressively.

When the input interface is "use SKB device index" the rt->rt_iif will
be set to zero.

This forces us to move the TCP RX dst cache installation into the ipv4
specific code, and as well it should since doing the route caching for
ipv6 is pointless at the moment since it is not inspected in the ipv6
input paths yet.

Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
obsolete set to a non-zero value to force invocation of the check
callback.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 16:36:26 -07:00
Eric Dumazet
563d34d057 tcp: dont drop MTU reduction indications
ICMP messages generated in output path if frame length is bigger than
mtu are actually lost because socket is owned by user (doing the xmit)

One example is the ipgre_tunnel_xmit() calling
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));

We had a similar case fixed in commit a34a101e1e (ipv6: disable GSO on
sockets hitting dst_allfrag).

Problem of such fix is that it relied on retransmit timers, so short tcp
sessions paid a too big latency increase price.

This patch uses the tcp_release_cb() infrastructure so that MTU
reduction messages (ICMP messages) are not lost, and no extra delay
is added in TCP transmits.

Reported-by: Maciej Żenczykowski <maze@google.com>
Diagnosed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 00:58:46 -07:00
David S. Miller
ba3f7f04ef ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:36:54 -07:00
Yuchung Cheng
cf60af03ca net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)
sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
and write(2). The application should replace connect() with it to
send data in the opening SYN packet.

For blocking socket, sendmsg() blocks until all the data are buffered
locally and the handshake is completed like connect() call. It
returns similar errno like connect() if the TCP handshake fails.

For non-blocking socket, it returns the number of bytes queued (and
transmitted in the SYN-data packet) if cookie is available. If cookie
is not available, it transmits a data-less SYN packet with Fast Open
cookie request option and returns -EINPROGRESS like connect().

Using MSG_FASTOPEN on connecting or connected socket will result in
simlar errno like repeating connect() calls. Therefore the application
should only use this flag on new sockets.

The buffer size of sendmsg() is independent of the MSS of the connection.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 11:02:03 -07:00
Yuchung Cheng
2100c8d2d9 net-tcp: Fast Open base
This patch impelements the common code for both the client and server.

1. TCP Fast Open option processing. Since Fast Open does not have an
   option number assigned by IANA yet, it shares the experiment option
   code 254 by implementing draft-ietf-tcpm-experimental-options
   with a 16 bits magic number 0xF989. This enables global experiments
   without clashing the scarce(2) experimental options available for TCP.

   When the draft status becomes standard (maybe), the client should
   switch to the new option number assigned while the server supports
   both numbers for transistion.

2. The new sysctl tcp_fastopen

3. A place holder init function

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 10:55:36 -07:00
Eric Dumazet
be9f4a44e7 ipv4: tcp: remove per net tcp_sock
tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
per network namespace.

This leads to bad behavior on multiqueue NICS, because many cpus
contend for the socket lock and once socket lock is acquired, extra
false sharing on various socket fields slow down the operations.

To better resist to attacks, we use a percpu socket. Each cpu can
run without contention, using appropriate memory (local node)

Additional features :

1) We also mirror the queue_mapping of the incoming skb, so that
answers use the same queue if possible.

2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree()

3) We now limit the number of in-flight RST/ACK [1] packets
per cpu, instead of per namespace, and we honor the sysctl_wmem_default
limit dynamically. (Prior to this patch, sysctl_wmem_default value was
copied at boot time, so any further change would not affect tcp_sock
limit)

[1] These packets are only generated when no socket was matched for
the incoming packet.

Reported-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 10:35:30 -07:00
David S. Miller
6700c2709c net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}()
This will be used so that we can compose a full flow key.

Even though we have a route in this context, we need more.  In the
future the routes will be without destination address, source address,
etc. keying.  One ipv4 route will cover entire subnets, etc.

In this environment we have to have a way to possess persistent storage
for redirects and PMTU information.  This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here.  Using that flow key will do a fib_lookup()
and create/update the persistent entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-17 03:29:28 -07:00
David S. Miller
80d0a69fc5 ipv4: Add helper inet_csk_update_pmtu().
This abstracts away the call to dst_ops->update_pmtu() so that we can
transparently handle the fact that, in the future, the dst itself can
be invalidated by the PMTU update (when we have non-host routes cached
in sockets).

So we try to rebuild the socket cached route after the method
invocation if necessary.

This isn't used by SCTP because it needs to cache dsts per-transport,
and thus will need it's own local version of this helper.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-16 03:28:06 -07:00
David S. Miller
1ed5c48f23 net: Remove checks for dst_ops->redirect being NULL.
No longer necessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-12 00:41:25 -07:00
David S. Miller
55be7a9c60 ipv4: Add redirect support to all protocol icmp error handlers.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 21:27:49 -07:00
Eric Dumazet
46d3ceabd8 tcp: TCP Small Queues
This introduce TSQ (TCP Small Queues)

TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.

sk->sk_wmem_alloc not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.

TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.

As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.

This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.

Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.

Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)

I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.

As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklest per cpu for performance reasons.

If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.

[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
  but some drivers call it in their start_xmit() handler.
  These drivers should at least use BQL, or else a single TCP
  session can still fill the whole NIC TX ring, since TSQ will
  have no effect.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 18:12:59 -07:00
David S. Miller
16d1839907 inet: Remove ->get_peer() method.
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:10 -07:00
David S. Miller
81166dd6fa tcp: Move timestamps from inetpeer to metrics cache.
With help from Lin Ming.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:08 -07:00
David S. Miller
ab92bb2f67 tcp: Abstract back handling peer aliveness test into helper function.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 20:33:49 -07:00
David S. Miller
70e7341673 ipv4: Show that ip_send_reply() is purely unicast routine.
Rename it to ip_send_unicast_reply() and add explicit 'saddr'
argument.

This removed one of the few users of rt->rt_spec_dst.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28 03:21:41 -07:00
David S. Miller
160eb5a6b1 ipv4: Kill early demux method return value.
It's completely unnecessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27 22:01:22 -07:00
David S. Miller
c10237e077 Revert "ipv4: tcp: dont cache unconfirmed intput dst"
This reverts commit c074da2810.

This change has several unwanted side effects:

1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
   thus never create a real cached route.

2) All TCP traffic will use DST_NOCACHE and never use the routing
   cache at all.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27 17:05:06 -07:00
Eric Dumazet
c074da2810 ipv4: tcp: dont cache unconfirmed intput dst
DDOS synflood attacks hit badly IP route cache.

On typical machines, this cache is allowed to hold up to 8 Millions dst
entries, 256 bytes for each, for a total of 2GB of memory.

rt_garbage_collect() triggers and tries to cleanup things.

Eventually route cache is disabled but machine is under fire and might
OOM and crash.

This patch exploits the new TCP early demux, to set a nocache
boolean in case incoming TCP frame is for a not yet ESTABLISHED or
TIMEWAIT socket.

This 'nocache' boolean is then used in case dst entry is not found in
route cache, to create an unhashed dst entry (DST_NOCACHE)

SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache
output dst for syncookies), so after this patch, a machine is able to
absorb a DDOS synflood attack without polluting its IP route cache.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27 15:34:24 -07:00
Vijay Subramanian
7011d0851b tcp: Fix bug in tcp socket early demux
The dest port for the call to __inet_lookup_established() in TCP early demux
code is passed with the wrong endian-ness. This causes the lookup to fail
leading to early demux not being used.

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-23 23:22:38 -07:00
Eric Dumazet
7586eceb0a ipv4: tcp: dont cache output dst for syncookies
Don't cache output dst for syncookies, as this adds pressure on IP route
cache and rcu subsystem for no gain.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-22 21:47:33 -07:00
David S. Miller
fd62e09b94 tcp: Validate route interface in early demux.
Otherwise we might violate reverse path filtering.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-21 14:58:10 -07:00
David S. Miller
41063e9dd1 ipv4: Early TCP socket demux.
Input packet processing for local sockets involves two major demuxes.
One for the route and one for the socket.

But we can optimize this down to one demux for certain kinds of local
sockets.

Currently we only do this for established TCP sockets, but it could
at least in theory be expanded to other kinds of connections.

If a TCP socket is established then it's identity is fully specified.

This means that whatever input route was used during the three-way
handshake must work equally well for the rest of the connection since
the keys will not change.

Once we move to established state, we cache the receive packet's input
route to use later.

Like the existing cached route in sk->sk_dst_cache used for output
packets, we have to check for route invalidations using dst->obsolete
and dst->ops->check().

Early demux occurs outside of a socket locked section, so when a route
invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
actually inside of established state packet processing and thus have
the socket locked.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-19 21:22:05 -07:00
David S. Miller
2397849baa [PATCH] tcp: Cache inetpeer in timewait socket, and only when necessary.
Since it's guarenteed that we will access the inetpeer if we're trying
to do timewait recycling and TCP options were enabled on the
connection, just cache the peer in the timewait socket.

In the future, inetpeer lookups will be context dependent (per routing
realm), and this helps facilitate that as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 14:56:12 -07:00
David S. Miller
4670fd819e tcp: Get rid of inetpeer special cases.
The get_peer method TCP uses is full of special cases that make no
sense accommodating, and it also gets in the way of doing more
reasonable things here.

First of all, if the socket doesn't have a usable cached route, there
is no sense in trying to optimize timewait recycling.

Likewise for the case where we have IP options, such as SRR enabled,
that make the IP header destination address (and thus the destination
address of the route key) differ from that of the connection's
destination address.

Just return a NULL peer in these cases, and thus we're also able to
get rid of the clumsy inetpeer release logic.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 01:25:47 -07:00
David S. Miller
fbfe95a42e inet: Create and use rt{,6}_get_peer_create().
There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one.  So add an rt{,6}_get_peer_create()
for their sake.

There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 23:24:18 -07:00
Gao feng
54db0cc2ba inetpeer: add parameter net for inet_getpeer_v4,v6
add struct net as a parameter of inet_getpeer_v[4,6],
use net to replace &init_net.

and modify some places to provide net for inet_getpeer_v[4,6]

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 14:27:23 -07:00
Eric Dumazet
4aea39c11c tcp: tcp_make_synack() consumes dst parameter
tcp_make_synack() clones the dst, and callers release it.

We can avoid two atomic operations per SYNACK if tcp_make_synack()
consumes dst instead of cloning it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-04 11:27:39 -04:00
Eric Dumazet
fff3269907 tcp: reflect SYN queue_mapping into SYNACK packets
While testing how linux behaves on SYNFLOOD attack on multiqueue device
(ixgbe), I found that SYNACK messages were dropped at Qdisc level
because we send them all on a single queue.

Obvious choice is to reflect incoming SYN packet @queue_mapping to
SYNACK packet.

Under stress, my machine could only send 25.000 SYNACK per second (for
200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues.

After patch, not a single SYNACK is dropped.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-01 14:22:11 -04:00
Eric Dumazet
a2a385d627 tcp: bool conversions
bool conversions where possible.

__inline__ -> inline

space cleanups

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-17 14:59:59 -04:00
Joe Perches
e87cc4728f net: Convert net_ratelimit uses to net_<level>_ratelimited
Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-15 13:45:03 -04:00
Eric Dumazet
bd14b1b2e2 tcp: be more strict before accepting ECN negociation
It appears some networks play bad games with the two bits reserved for
ECN. This can trigger false congestion notifications and very slow
transferts.

Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can
disable TCP ECN negociation if it happens we receive mangled CT bits in
the SYN packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Perry Lorier <perryl@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Wilmer van der Gaast <wilmer@google.com>
Cc: Ankur Jain <jankur@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Dave Täht <dave.taht@bufferbloat.net>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-04 12:05:27 -04:00
Eric Dumazet
da882c1f2e tcp: sk_add_backlog() is too agressive for TCP
While investigating TCP performance problems on 10Gb+ links, we found a
tcp sender was dropping lot of incoming ACKS because of sk_rcvbuf limit
in sk_add_backlog(), especially if receiver doesnt use GRO/LRO and sends
one ACK every two MSS segments.

A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
value (87380), allowing a too small backlog.

A TCP ACK, even being small, can consume nearly same truesize space than
outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense and
is fast to compute.

Performance results on netperf, single flow, receiver with disabled
GRO/LRO : 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop
increments at sender.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 22:28:28 -04:00
Eric Dumazet
f545a38f74 net: add a limit parameter to sk_add_backlog()
sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
memory limit. We need to make this limit a parameter for TCP use.

No functional change expected in this patch, all callers still using the
old sk_rcvbuf limit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 22:28:28 -04:00
David S. Miller
ac807fa8e6 tcp: Fix build warning after tcp_{v4,v6}_init_sock consolidation.
net/ipv4/tcp_ipv4.c: In function 'tcp_v4_init_sock':
net/ipv4/tcp_ipv4.c:1891:19: warning: unused variable 'tp' [-Wunused-variable]
net/ipv6/tcp_ipv6.c: In function 'tcp_v6_init_sock':
net/ipv6/tcp_ipv6.c:1836:19: warning: unused variable 'tp' [-Wunused-variable]

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 03:21:58 -04:00
Neal Cardwell
900f65d361 tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock()
This commit moves the (substantial) common code shared between
tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
independent function, tcp_init_sock().

Centralizing this functionality should help avoid drift issues,
e.g. where the IPv4 side is updated without a corresponding update to
IPv6. There was already some drift: IPv4 initialized snd_cwnd to
TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
2 (in this case it should not matter, since snd_cwnd is also
initialized in tcp_init_metrics(), but the general risks and
maintenance overhead remain).

When diffing the old and new code, note that new tcp_init_sock()
function uses the order of steps from the tcp_v4_init_sock()
implementation (the order is slightly different in
tcp_v6_init_sock()).

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 16:36:42 -04:00
Pavel Emelyanov
ee9952831c tcp: Initial repair mode
This includes (according the the previous description):

* TCP_REPAIR sockoption

This one just puts the socket in/out of the repair mode.
Allowed for CAP_NET_ADMIN and for closed/establised sockets only.
When repair mode is turned off and the socket happens to be in
the established state the window probe is sent to the peer to
'unlock' the connection.

* TCP_REPAIR_QUEUE sockoption

This one sets the queue which we're about to repair. The
'no-queue' is set by default.

* TCP_QUEUE_SEQ socoption

Sets the write_seq/rcv_nxt of a selected repaired queue.
Allowed for TCP_CLOSE-d sockets only. When the socket changes
its state the other seq-s are changed by the kernel according
to the protocol rules (most of the existing code is actually
reused).

* Ability to forcibly bind a socket to a port

The sk->sk_reuse is set to SK_FORCE_REUSE.

* Immediate connect modification

The connect syscall initializes the connection, then directly jumps
to the code which finalizes it.

* Silent close modification

The close just aborts the connection (similar to SO_LINGER with 0
time) but without sending any FIN/RST-s to peer.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 15:52:25 -04:00
Linus Torvalds
94fb175c04 dmaengine-fixes for 3.4-rc3
1/ regression fix for Xen as it now trips over a broken assumption
    about the dma address size on 32-bit builds
 
 2/ new quirk for netdma to ignore dma channels that cannot meet
    netdma alignment requirements
 
 3/ fixes for two long standing issues in ioatdma (ring size overflow)
    and iop-adma (potential stack corruption)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJPhIfhAAoJEB7SkWpmfYgCguIQAL4qF+RC9/JggSHIjfOrYiPd
 yboV80GqqQHHBwy8hfZVUrIEPMebvD/xUIk6iUQNXR+6EA8Ln0jukvQMpWNnI+Cc
 TXgA5Ok70an4PD1MqnCsWyCJjsyPyhprbRHurxBcesf+y96POJxhING0rcKvft50
 mvYnbtrkYe9M9x3b8TBGc0JaTVeL29Ck3FtkTz4uUktbkhRNfCcfEd28NRQpf8MB
 vkjbjRGBQmGsnKxYCaEhlF1GPJyTlYjg4BBWtseJgb2R9s7tvJrkotFea/NmSfjq
 XCuVKjpiFp3YyJuxJERWdwqRWvyAZFfcYyZX440nG0b7GBgSn+T7A9XhUs8vMboi
 tLwoDfBbJDlKMaFpHex7Z6RtZZmVl3gWDNZTqpG44n4pabd4RPip04f0k7Wfs+cp
 tzU9hGAOvgsZ8w4/JgxH8YJOZbIGzbDGOA1IhWcbxIbmFTblMiFnV3TC7qfhoRbR
 8qtScIE7bUck2MYVlMMn9utd9tvKFa6HNgo41+f78/4+U7zQ/VrsbA/DWQct40R5
 5k+EEvyYFUzIXn79E0GVN5h4NHH5gfAs3MZ7jIgwgHedBp4Ki68XYKNu+pIV3YwG
 CFTPn1mVOXnCdt+fsjG5tL9Jecx1Mij6w3nWU93ZU6cHmC77YmU+DLxPIGuyR1a2
 EmpObwfq5peXzkgQpEsB
 =F3IR
 -----END PGP SIGNATURE-----

Merge tag 'dmaengine-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine

Pull dmaengine fixes from Dan Williams:

1/ regression fix for Xen as it now trips over a broken assumption
   about the dma address size on 32-bit builds

2/ new quirk for netdma to ignore dma channels that cannot meet
   netdma alignment requirements

3/ fixes for two long standing issues in ioatdma (ring size overflow)
   and iop-adma (potential stack corruption)

* tag 'dmaengine-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
  netdma: adding alignment check for NETDMA ops
  ioatdma: DMA copy alignment needed to address IOAT DMA silicon errata
  ioat: ring size variables need to be 32bit to avoid overflow
  iop-adma: Corrected array overflow in RAID6 Xscale(R) test.
  ioat: fix size of 'completion' for Xen
2012-04-10 15:30:16 -07:00
Dave Jiang
a2bd1140a2 netdma: adding alignment check for NETDMA ops
This is the fallout from adding memcpy alignment workaround for certain
IOATDMA hardware. NetDMA will only use DMA engine that can handle byte align
ops.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2012-04-05 15:27:12 -07:00
David S. Miller
4da0bd7365 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-03-18 23:29:41 -04:00
Joe Perches
afd465030a net: ipv4: Standardize prefixes for message logging
Add #define pr_fmt(fmt) as appropriate.

Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
Standardize on "UDPLite: " for appropriate uses.
Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

Add KBUILD_MODNAME ": " to icmp and gre.
Remove embedded prefixes as appropriate.

Add missing "\n" to pr_info in gre.c.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-12 17:05:21 -07:00
Joe Perches
058bd4d2a4 net: Convert printks to pr_<level>
Use a more current kernel messaging style.

Convert a printk block to print_hex_dump.
Coalesce formats, align arguments.
Use %s, __func__ instead of embedding function names.

Some messages that were prefixed with <foo>_close are
now prefixed with <foo>_fini.  Some ah4 and esp messages
are now not prefixed with "ip ".

The intent of this patch is to later add something like
  #define pr_fmt(fmt) "IPv4: " fmt.
to standardize the output messages.

Text size is trivially reduced. (x86-32 allyesconfig)

$ size net/ipv4/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 887888	  31558	 249696	1169142	 11d6f6	net/ipv4/built-in.o.new
 887934	  31558	 249800	1169292	 11d78c	net/ipv4/built-in.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-11 23:42:51 -07:00
Eric Dumazet
dfd25ffffc tcp: fix syncookie regression
commit ea4fc0d619 (ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit())
added a serious regression on synflood handling.

Simon Kirby discovered a successful connection was delayed by 20 seconds
before being responsive.

In my tests, I discovered that xmit frames were lost, and needed ~4
retransmits and a socket dst rebuild before being really sent.

In case of syncookie initiated connection, we use a different path to
initialize the socket dst, and inet->cork.fl.u.ip4 is left cleared.

As ip_queue_xmit() now depends on inet flow being setup, fix this by
copying the temp flowi4 we use in cookie_v4_check().

Reported-by: Simon Kirby <sim@netnation.com>
Bisected-by: Simon Kirby <sim@netnation.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-11 15:52:12 -07:00
Eric Dumazet
b4fb05ea40 tcp: md5: correct a RCU lockdep splat
commit a8afca0329 (tcp: md5: protects md5sig_info with RCU) added a
lockdep splat in tcp_md5_do_lookup() in case a timer fires a tcp
retransmit.

At this point, socket lock is owned by the sofirq handler, not the user,
so we should adjust a bit the lockdep condition, as we dont hold
rcu_read_lock().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-07 15:13:42 -05:00
Jiri Benc
4c507d2897 net: implement IP_RECVTOS for IP_PKTOPTIONS
Currently, it is not easily possible to get TOS/DSCP value of packets from
an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt
with IP_RECVTOS set, the same way as incoming TTL can be queried. This is
not actually implemented for TOS, though.

This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6
(IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of
the TOS field is stored from the other party's ACK.

This is needed for proxies which require DSCP transparency. One such example
is at http://zph.bratcheda.org/.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-13 00:46:41 -05:00
David S. Miller
d5ef8a4d87 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/infiniband/hw/nes/nes_cm.c

Simple whitespace conflict.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-10 23:32:28 -05:00
Shawn Lu
e2446eaab5 tcp_v4_send_reset: binding oif to iif in no sock case
Binding RST packet outgoing interface to incoming interface
for tcp v4 when there is no socket associate with it.
when sk is not NULL, using sk->sk_bound_dev_if instead.
(suggested by Eric Dumazet).

This has few benefits:
1. tcp_v6_send_reset already did that.
2. This helps tcp connect with SO_BINDTODEVICE set. When
connection is lost, we still able to sending out RST using
same interface.
3. we are sending reply, it is most likely to be succeed
if iif is used

Signed-off-by: Shawn Lu <shawn.lu@ericsson.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-04 18:20:05 -05:00
Shawn Lu
658ddaaf66 tcp: md5: RST: getting md5 key from listener
TCP RST mechanism is broken in TCP md5(RFC2385). When
connection is gone, md5 key is lost, sending RST
without md5 hash is deem to ignored by peer. This can
be a problem since RST help protocal like bgp to fast
recove from peer crash.

In most case, users of tcp md5, such as bgp and ldp,
have listener on both sides to accept connection from peer.
md5 keys for peers are saved in listening socket.

There are two cases in finding md5 key when connection is
lost:
1.Passive receive RST: The message is send to well known port,
tcp will associate it with listner. md5 key is gotten from
listener.

2.Active receive RST (no sock): The message is send to ative
side, there is no socket associated with the message. In this
case, finding listener from source port, then find md5 key from
listener.

we are not loosing sercuriy here:
packet is checked with md5 hash. No RST is generated
if md5 hash doesn't match or no md5 key can be found.

Signed-off-by: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-01 12:43:54 -05:00
Eric Dumazet
a8afca0329 tcp: md5: protects md5sig_info with RCU
This patch makes sure we use appropriate memory barriers before
publishing tp->md5sig_info, allowing tcp_md5_do_lookup() being used from
tcp_v4_send_reset() without holding socket lock (upcoming patch from
Shawn Lu)

Note we also need to respect rcu grace period before its freeing, since
we can free socket without this grace period thanks to
SLAB_DESTROY_BY_RCU

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-01 02:11:47 -05:00
Eric Dumazet
5f3d9cb296 tcp: md5: use sock_kmalloc() to limit md5 keys
There is no limit on number of MD5 keys an application can attach to a
tcp socket.

This patch adds a per tcp socket limit based
on /proc/sys/net/core/optmem_max

With current default optmem_max values, this allows about 150 keys on
64bit arches, and 88 keys on 32bit arches.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-31 16:11:48 -05:00
Eric Dumazet
a915da9b69 tcp: md5: rcu conversion
In order to be able to support proper RST messages for TCP MD5 flows, we
need to allow access to MD5 keys without locking listener socket.

This conversion is a nice cleanup, and shrinks size of timewait sockets
by 80 bytes.

IPv6 code reuses generic code found in IPv4 instead of duplicating it.

Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-31 12:14:00 -05:00
Eric Dumazet
a2d91241a8 tcp: md5: remove obsolete md5_add() method
We no longer use md5_add() method from struct tcp_sock_af_ops

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-31 12:13:59 -05:00
shawnlu
8a622e71f5 tcp: md5: using remote adress for md5 lookup in rst packet
md5 key is added in socket through remote address.
remote address should be used in finding md5 key when
sending out reset packet.

Signed-off-by: shawnlu <shawn.lu@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-22 15:08:45 -05:00
Glauber Costa
3dc43e3e4d per-netns ipv4 sysctl_tcp_mem
This patch allows each namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we need to make this values
per group of process somehow. This is achieved in the
patches that follows in this patchset.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12 19:04:11 -05:00
Glauber Costa
d1a4c0b37c tcp memory pressure controls
This patch introduces memory pressure controls for the tcp
protocol. It uses the generic socket memory pressure code
introduced in earlier patches, and fills in the
necessary data in cg_proto struct.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12 19:04:10 -05:00
Glauber Costa
180d8cd942 foundations of per-cgroup memory pressure controlling.
This patch replaces all uses of struct sock fields' memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem to acessor
macros. Those macros can either receive a socket argument, or a mem_cgroup
argument, depending on the context they live in.

Since we're only doing a macro wrapping here, no performance impact at all is
expected in the case where we don't have cgroups disabled.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12 19:04:10 -05:00
Eric Dumazet
d8a6e65f8b tcp: inherit listener congestion control for passive cnx
Rick Jones reported that TCP_CONGESTION sockopt performed on a listener
was ignored for its children sockets : right after accept() the
congestion control for new socket is the system default one.

This seems an oversight of the initial design (quoted from Stephen)

Based on prior investigation and patch from Rick.

Reported-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Yuchung Cheng <ycheng@google.com>
Tested-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30 16:55:26 -05:00
Eric Dumazet
709e8697af tcp: clear xmit timers in tcp_v4_syn_recv_sock()
Simon Kirby reported divides by zero errors in __tcp_select_window()

This happens when inet_csk_route_child_sock() returns a NULL pointer :

We free new socket while we eventually armed keepalive timer in
tcp_create_openreq_child()

Fix this by a call to tcp_clear_xmit_timers()

[ This is a followup to commit 918eb39962 (net: add missing
bh_unlock_sock() calls) ]

Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-16 16:57:45 -05:00
Eric Dumazet
918eb39962 net: add missing bh_unlock_sock() calls
Simon Kirby reported lockdep warnings and following messages :

[104661.897577] huh, entered softirq 3 NET_RX ffffffff81613740
preempt_count 00000101, exited with 00000102?

[104661.923653] huh, entered softirq 3 NET_RX ffffffff81613740
preempt_count 00000101, exited with 00000102?

Problem comes from commit 0e734419
(ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.)

If inet_csk_route_child_sock() returns NULL, we should release socket
lock before freeing it.

Another lock imbalance exists if __inet_inherit_port() returns an error
since commit 093d282321 ( tproxy: fix hash locking issue when using
port redirection in __inet_inherit_port()) a backport is also needed for
>= 2.6.37 kernels.

Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Balazs Scheidler <bazsi@balabit.hu>
CC: KOVACS Krisztian <hidden@balabit.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-03 18:06:18 -04:00
Arjan van de Ven
73cb88ecb9 net: make the tcp and udp file_operations for the /proc stuff const
the tcp and udp code creates a set of struct file_operations at runtime
while it can also be done at compile time, with the added benefit of then
having these file operations be const.

the trickiest part was to get the "THIS_MODULE" reference right; the naive
method of declaring a struct in the place of registration would not work
for this reason.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-01 17:56:14 -04:00
Eric Dumazet
66b13d99d9 ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT
There is a long standing bug in linux tcp stack, about ACK messages sent
on behalf of TIME_WAIT sockets.

In the IP header of the ACK message, we choose to reflect TOS field of
incoming message, and this might break some setups.

Example of things that were broken :
  - Routing using TOS as a selector
  - Firewalls
  - Trafic classification / shaping

We now remember in timewait structure the inet tos field and use it in
ACK generation, and route lookup.

Notes :
 - We still reflect incoming TOS in RST messages.
 - We could extend MuraliRaja Muniraju patch to report TOS value in
netlink messages for TIME_WAIT sockets.
 - A patch is needed for IPv6

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-24 03:06:21 -04:00
Eric Dumazet
318cf7aaa0 tcp: md5: add more const attributes
Now tcp_md5_hash_header() has a const tcphdr argument, we can add more
const attributes to callers.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-24 02:46:04 -04:00
Eric Dumazet
cf533ea53e tcp: add const qualifiers where possible
Adding const qualifiers to pointers can ease code review, and spot some
bugs. It might allow compiler to optimize code further.

For example, is it legal to temporary write a null cksum into tcphdr
in tcp_md5_hash_header() ? I am afraid a sniffer could catch the
temporary null value...

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-21 05:22:42 -04:00
David S. Miller
88c5100c28 Merge branch 'master' of github.com:davem330/net
Conflicts:
	net/batman-adv/soft-interface.c
2011-10-07 13:38:43 -04:00
Yan, Zheng
260fcbeb1a tcp: properly handle md5sig_pool references
tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers
only hold one reference to md5sig_pool. but tcp_v4_md5_do_add()
increases use count of md5sig_pool for each peer. This patch
makes tcp_v4_md5_do_add() only increases use count for the first
tcp md5sig peer.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-04 23:31:24 -04:00
Eric Dumazet
b82d1bb4fd tcp: unalias tcp_skb_cb flags and ip_dsfield
struct tcp_skb_cb contains a "flags" field containing either tcp flags
or IP dsfield depending on context (input or output path)

Introduce ip_dsfield to make the difference clear and ease maintenance.
If later we want to save space, we can union flags/ip_dsfield

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-27 02:20:08 -04:00
David S. Miller
8decf86879 Merge branch 'master' of github.com:davem330/net
Conflicts:
	MAINTAINERS
	drivers/net/Kconfig
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
	drivers/net/ethernet/broadcom/tg3.c
	drivers/net/wireless/iwlwifi/iwl-pci.c
	drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
	drivers/net/wireless/rt2x00/rt2800usb.c
	drivers/net/wireless/wl12xx/main.c
2011-09-22 03:23:13 -04:00
Eric Dumazet
946cedccbd tcp: Change possible SYN flooding messages
"Possible SYN flooding on port xxxx " messages can fill logs on servers.

Change logic to log the message only once per listener, and add two new
SNMP counters to track :

TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client

TCPReqQFullDrop : number of times a SYN request was dropped because
syncookies were not enabled.

Based on a prior patch from Tom Herbert, and suggestions from David.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-15 14:49:43 -04:00
Tom Herbert
bdeab99191 rps: Add flag to skb to indicate rxhash is based on L4 tuple
The l4_rxhash flag was added to the skb structure to indicate
that the rxhash value was computed over the 4 tuple for the
packet which includes the port information in the encapsulated
transport packet.  This is used by the stack to preserve the
rxhash value in __skb_rx_tunnel.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:06:03 -07:00
David S. Miller
6e5714eaf7 net: Compute protocol sequence numbers and fragment IDs using MD5.
Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.

MD5 is a much safer choice, and is inline with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)

Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation.  So the periodic
regeneration and 8-bit counter have been removed.  We compute and
use a full 32-bit sequence number.

For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 43-bits) and that is fixed here as well.

Reported-by: Dan Kaminsky <dan@doxpara.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-06 18:33:19 -07:00
David S. Miller
9f6ec8d697 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn-rxon.c
	drivers/net/wireless/rtlwifi/pci.c
	net/netfilter/ipvs/ip_vs_core.c
2011-06-20 22:29:08 -07:00
Eric Dumazet
1eddceadb0 net: rfs: enable RFS before first data packet is received
Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit :
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Fri, 17 Jun 2011 00:50:46 +0100
>
> > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
> >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
> >>  			goto discard;
> >>
> >>  		if (nsk != sk) {
> >> +			sock_rps_save_rxhash(nsk, skb->rxhash);
> >>  			if (tcp_child_process(sk, nsk, skb)) {
> >>  				rsk = nsk;
> >>  				goto reset;
> >>
> >
> > I haven't tried this, but it looks reasonable to me.
> >
> > What about IPv6?  The logic in tcp_v6_do_rcv() looks very similar.
>
> Indeed ipv6 side needs the same fix.
>
> Eric please add that part and resubmit.  And in fact I might stick
> this into net-2.6 instead of net-next-2.6
>

OK, here is the net-2.6 based one then, thanks !

[PATCH v2] net: rfs: enable RFS before first data packet is received

First packet received on a passive tcp flow is not correctly RFS
steered.

One sock_rps_record_flow() call is missing in inet_accept()

But before that, we also must record rxhash when child socket is setup.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
2011-06-17 15:27:31 -04:00
Jerry Chu
9ad7c049f0 tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side
This patch lowers the default initRTO from 3secs to 1sec per
RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
has been retransmitted, AND the TCP timestamp option is not on.

It also adds support to take RTT sample during 3WHS on the passive
open side, just like its active open counterpart, and uses it, if
valid, to seed the initRTO for the data transmission phase.

The patch also resets ssthresh to its initial default at the
beginning of the data transmission phase, and reduces cwnd to 1 if
there has been MORE THAN ONE retransmission during 3WHS per RFC5681.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-08 17:05:30 -07:00
Dan Rosenberg
71338aa7d0 net: convert %p usage to %pK
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces.  Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers.  The behavior of %pK depends on the kptr_restrict sysctl.

If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs.  If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
 If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges.  Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".

The supporting code for kptr_restrict and %pK are currently in the -mm
tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
pointers to the syslog are not covered, since this would eliminate useful
information for postmortem debugging and the reading of the syslog is
already optionally protected by the dmesg_restrict sysctl.

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-24 01:13:12 -04:00
David S. Miller
a48eff1288 ipv4: Pass explicit destination address to rt_bind_peer().
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-18 18:42:43 -04:00
David S. Miller
ed2361e66e ipv4: Pass explicit destination address to rt_get_peer().
This will next trickle down to rt_bind_peer().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-18 18:38:54 -04:00
David S. Miller
6bd023f3dd ipv4: Make caller provide flowi4 key to inet_csk_route_req().
This way the caller can get at the fully resolved fl4->{daddr,saddr}
etc.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-18 18:32:03 -04:00
David S. Miller
0a5ebb8000 ipv4: Pass explicit daddr arg to ip_send_reply().
This eliminates an access to rt->rt_src.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 13:32:46 -07:00
David S. Miller
c5216cc70f tcp: Use cork flow info instead of rt->rt_dst in tcp_v4_get_peer()
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 15:28:29 -07:00
David S. Miller
0e73441992 ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.
Operation order is now transposed, we first create the child
socket then we try to hook up the route.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 15:28:03 -07:00
David S. Miller
da905bd1d5 tcp: Use cork flow in tcp_v4_connect()
Since this is invoked from inet_stream_connect() the socket is locked
and therefore this usage is safe.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 13:18:54 -07:00
David S. Miller
d4fb3d74d7 ipv4: Get route daddr from flow key in tcp_v4_connect().
Now that output route lookups update the flow with
destination address selection, we can fetch it from
fl4->daddr instead of rt->rt_dst

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 23:50:32 -07:00
David S. Miller
4071cfff84 ipv4: Fetch route saddr from flow key in tcp_v4_connect().
Now that output route lookups update the flow with
source address selection, we can fetch it from
fl4->saddr instead of rt->rt_src

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 23:17:31 -07:00
Eric Dumazet
f6d8bd051c inet: add RCU protection to inet->opt
We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 13:16:35 -07:00
David S. Miller
2d7192d6cb ipv4: Sanitize and simplify ip_route_{connect,newports}()
These functions are used together as a unit for route resolution
during connect().  They address the chicken-and-egg problem that
exists when ports need to be allocated during connect() processing,
yet such port allocations require addressing information from the
routing code.

It's currently more heavy handed than it needs to be, and in
particular we allocate and initialize a flow object twice.

Let the callers provide the on-stack flow object.  That way we only
need to initialize it once in the ip_route_connect() call.

Later, if ip_route_newports() needs to do anything, it re-uses that
flow object as-is except for the ports which it updates before the
route re-lookup.

Also, describe why this set of facilities are needed and how it works
in a big comment.

Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
2011-04-27 13:59:04 -07:00
Eric Dumazet
b71d1d426d inet: constify ip headers and in6_addr
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-22 11:04:14 -07:00
David S. Miller
b23dd4fe42 ipv4: Make output route lookup return rtable directly.
Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02 14:31:35 -08:00
David S. Miller
abdf7e7239 ipv4: Can final ip_route_connect() arg to boolean "can_sleep".
Since that's what the current vague "flags" thing means.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:15:24 -08:00
David S. Miller
dca8b089c9 ipv4: Rearrange how ip_route_newports() gets port keys.
ip_route_newports() is the only place in the entire kernel that
cares about the port members in the routing cache entry's lookup
flow key.

Therefore the only reason we store an entire flow inside of the
struct rtentry is for this one special case.

Rewrite ip_route_newports() such that:

1) The caller passes in the original port values, so we don't need
   to use the rth->fl.fl_ip_{s,d}port values to remember them.

2) The lookup flow is constructed by hand instead of being copied
   from the routing cache entry's flow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 13:38:12 -08:00
Shan Wei
089c34827e tcp: Remove debug macro of TCP_CHECK_TIMER
Now, TCP_CHECK_TIMER is not used for debuging, it does nothing.
And, it has been there for several years, maybe 6 years.

Remove it to keep code clearer.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-20 11:10:14 -08:00
David S. Miller
7a71ed899e inetpeer: Abstract address representation further.
Future changes will add caching information, and some of
these new elements will be addresses.

Since the family is implicit via the ->daddr.family member,
replicating the family in ever address we store is entirely
redundant.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-10 13:22:28 -08:00
Eric Dumazet
fd0273c503 tcp: fix bug in listening_get_next()
commit a8b690f98b (tcp: Fix slowness in read /proc/net/tcp)
introduced a bug in handling of SYN_RECV sockets.

st->offset represents number of sockets found since beginning of
listening_hash[st->bucket].

We should not reset st->offset when iterating through
syn_table[st->sbucket], or else if more than ~25 sockets (if
PAGE_SIZE=4096) are in SYN_RECV state, we exit from listening_get_next()
with a too small st->offset

Next time we enter tcp_seek_last_pos(), we are not able to seek past
already found sockets.

Reported-by: PK <runningdoglackey@yahoo.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 14:41:20 -08:00
David S. Miller
17f7f4d9fc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/ipv4/fib_frontend.c
2010-12-26 22:37:05 -08:00
Eric Dumazet
1bde5ac493 tcp: fix listening_get_next()
Alexey Vlasov found /proc/net/tcp could sometime loop and display
millions of sockets in LISTEN state.

In 2.6.29, when we converted TCP hash tables to RCU, we left two
sk_next() calls in listening_get_next().

We must instead use sk_nulls_next() to properly detect an end of chain.

Reported-by: Alexey Vlasov <renton@renton.name>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-23 09:32:46 -08:00
David S. Miller
0dbaee3b37 net: Abstract default ADVMSS behind an accessor.
Make all RTAX_ADVMSS metric accesses go through a new helper function,
dst_metric_advmss().

Leave the actual default metric as "zero" in the real metric slot,
and compute the actual default value dynamically via a new dst_ops
AF specific callback.

For stacked IPSEC routes, we use the advmss of the path which
preserves existing behavior.

Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates
advmss on pmtu updates.  This inconsistency in advmss handling
results in more raw metric accesses than I wish we ended up with.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-13 12:52:14 -08:00
David S. Miller
fe6c791570 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/ath/ath9k/ar9003_eeprom.c
	net/llc/af_llc.c
2010-12-08 13:47:38 -08:00
David S. Miller
ccb7c410dd timewait_sock: Create and use getpeer op.
The only thing AF-specific about remembering the timestamp
for a time-wait TCP socket is getting the peer.

Abstract that behind a new timewait_sock_ops vector.

Support for real IPV6 sockets is not filled in yet, but
curiously this makes timewait recycling start to work
for v4-mapped ipv6 sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-01 18:09:13 -08:00
David S. Miller
3f419d2d48 inet: Turn ->remember_stamp into ->get_peer in connection AF ops.
Then we can make a completely generic tcp_remember_stamp()
that uses ->get_peer() as a helper, minimizing the AF specific
code and minimizing the eventual code duplication when we implement
the ipv6 side of TW recycling.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30 12:28:06 -08:00
David S. Miller
b534ecf1cd inetpeer: Make inet_getpeer() take an inet_peer_adress_t pointer.
And make an inet_getpeer_v4() helper, update callers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30 11:54:19 -08:00
David S. Miller
582a72da9a inetpeer: Introduce inet_peer_address_t.
Currently only the v4 aspect is used, but this will change.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30 11:53:55 -08:00
Pavel Emelyanov
8475ef9fd1 netns: Don't leak others' openreq-s in proc
The /proc/net/tcp leaks openreq sockets from other namespaces.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-27 22:57:48 -08:00
David S. Miller
8f49c2703b tcp: Don't change unlocked socket state in tcp_v4_err().
Alexey Kuznetsov noticed a regression introduced by
commit f1ecd5d9e7
("Revert Backoff [v3]: Revert RTO on ICMP destination unreachable")

The RTO and timer modification code added to tcp_v4_err()
doesn't check sock_owned_by_user(), which if true means we
don't have exclusive access to the socket and therefore cannot
modify it's critical state.

Just skip this new code block if sock_owned_by_user() is true
and eliminate the now superfluous sock_owned_by_user() code
block contained within.

Reported-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-11-12 13:35:00 -08:00
Balazs Scheidler
093d282321 tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()
When __inet_inherit_port() is called on a tproxy connection the wrong locks are
held for the inet_bind_bucket it is added to. __inet_inherit_port() made an
implicit assumption that the listener's port number (and thus its bind bucket).
Unfortunately, if you're using the TPROXY target to redirect skbs to a
transparent proxy that assumption is not true anymore and things break.

This patch adds code to __inet_inherit_port() so that it can handle this case
by looking up or creating a new bind bucket for the child socket and updates
callers of __inet_inherit_port() to gracefully handle __inet_inherit_port()
failing.

Reported by and original patch from Stephen Buck <stephen.buck@exinda.com>.
See http://marc.info/?t=128169268200001&r=1&w=2 for the original discussion.

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-10-21 13:06:43 +02:00
Eric Dumazet
1639ab6f78 gro: unexport tcp4_gro_receive and tcp4_gro_complete
tcp4_gro_receive() and tcp4_gro_complete() dont need to be exported.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-31 13:37:07 -07:00
Changli Gao
7ba4291007 inet, inet6: make tcp_sendmsg() and tcp_sendpage() through inet_sendmsg() and inet_sendpage()
a new boolean flag no_autobind is added to structure proto to avoid the autobind
calls when the protocol is TCP. Then sock_rps_record_flow() is called int the
TCP's sendmsg() and sendpage() pathes.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/net/inet_common.h |    4 ++++
 include/net/sock.h        |    1 +
 include/net/tcp.h         |    8 ++++----
 net/ipv4/af_inet.c        |   15 +++++++++------
 net/ipv4/tcp.c            |   11 +++++------
 net/ipv4/tcp_ipv4.c       |    3 +++
 net/ipv6/af_inet6.c       |    8 ++++----
 net/ipv6/tcp_ipv6.c       |    3 +++
 8 files changed, 33 insertions(+), 20 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-12 20:21:46 -07:00
Eric Dumazet
4bc2f18ba4 net/ipv4: EXPORT_SYMBOL cleanups
CodingStyle cleanups

EXPORT_SYMBOL should immediately follow the symbol declaration.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-12 12:57:54 -07:00
Florian Westphal
172d69e63c syncookies: add support for ECN
Allows use of ECN when syncookies are in effect by encoding ecn_ok
into the syn-ack tcp timestamp.

While at it, remove a uneeded #ifdef CONFIG_SYN_COOKIES.
With CONFIG_SYN_COOKIES=nm want_cookie is ifdef'd to 0 and gcc
removes the "if (0)".

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-26 22:00:03 -07:00
Eric Dumazet
317fe0e6c5 inetpeer: restore small inet_peer structures
Addition of rcu_head to struct inet_peer added 16bytes on 64bit arches.

Thats a bit unfortunate, since old size was exactly 64 bytes.

This can be solved, using an union between this rcu_head an four fields,
that are normally used only when a refcount is taken on inet_peer.
rcu_head is used only when refcnt=-1, right before structure freeing.

Add a inet_peer_refcheck() function to check this assertion for a while.

We can bring back SLAB_HWCACHE_ALIGN qualifier in kmem cache creation.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-16 11:55:39 -07:00
Changli Gao
d8d1f30b95 net-next: remove useless union keyword
remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-10 23:31:35 -07:00
Tom Herbert
a8b690f98b tcp: Fix slowness in read /proc/net/tcp
This patch address a serious performance issue in reading the
TCP sockets table (/proc/net/tcp).

Reading the full table is done by a number of sequential read
operations.  At each read operation, a seek is done to find the
last socket that was previously read.  This seek operation requires
that the sockets in the table need to be counted up to the current
file position, and to count each of these requires taking a lock for
each non-empty bucket.  The whole algorithm is O(n^2).

The fix is to cache the last bucket value, offset within the bucket,
and the file position returned by the last read operation.   On the
next sequential read, the bucket and offset are used to find the
last read socket immediately without needing ot scan the previous
buckets  the table.  This algorithm t read the whole table is O(n).

The improvement offered by this patch is easily show by performing
cat'ing /proc/net/tcp on a machine with a lot of connections.  With
about 182K connections in the table, I see the following:

- Without patch
time cat /proc/net/tcp > /dev/null

real	1m56.729s
user	0m0.214s
sys	1m56.344s

- With patch
time cat /proc/net/tcp > /dev/null

real	0m0.894s
user	0m0.290s
sys	0m0.594s

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-07 00:43:42 -07:00
David S. Miller
eedc765ca4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/sfc/net_driver.h
	drivers/net/sfc/siena.c
2010-06-06 17:42:02 -07:00
Florian Westphal
af9b473857 syncookies: avoid unneeded tcp header flag double check
caller: if (!th->rst && !th->syn && th->ack)
callee: if (!th->ack)

make the caller only check for !syn (common for 3whs), and move
the !rst / ack test to the callee.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-05 02:23:14 -07:00
Florian Westphal
2a1d4bd460 syncookies: make v4/v6 synflood warning behaviour the same
both syn_flood_warning functions print a message, but
ipv4 version only prints a warning if CONFIG_SYN_COOKIES=y.

Make the v4 one behave like the v6 one.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-05 02:23:14 -07:00
Eric Dumazet
ca55158c6e rps: tcp: fix rps_sock_flow_table table updates
I believe a moderate SYN flood attack can corrupt RFS flow table
(rps_sock_flow_table), making RPS/RFS much less effective.

Even in a normal situation, server handling short lived sessions suffer
from bad steering for the first data packet of a session, if another SYN
packet is received for another session.

We do following action in tcp_v4_rcv() :

	sock_rps_save_rxhash(sk, skb->rxhash);

We should _not_ do this if sk is a LISTEN socket, as about each
packet received on a LISTEN socket has a different rxhash than
previous one.
 -> RPS_NO_CPU markers are spread all over rps_sock_flow_table.

Also, it makes sense to protect sk->rxhash field changes with socket
lock (We currently can change it even if user thread owns the lock
and might use rxhash)

This patch moves sock_rps_save_rxhash() to a sock locked section,
and only for non LISTEN sockets.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-04 15:56:02 -07:00
Eric Dumazet
a465419b1f net: Introduce sk_route_nocaps
TCP-MD5 sessions have intermittent failures, when route cache is
invalidated. ip_queue_xmit() has to find a new route, calls
sk_setup_caps(sk, &rt->u.dst), destroying the 

sk->sk_route_caps &= ~NETIF_F_GSO_MASK

that MD5 desperately try to make all over its way (from
tcp_transmit_skb() for example)

So we send few bad packets, and everything is fine when
tcp_transmit_skb() is called again for this socket.

Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.

Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-16 00:36:33 -07:00
David S. Miller
c58dc01bab net: Make RFS socket operations not be inet specific.
Idea from Eric Dumazet.

As for placement inside of struct sock, I tried to choose a place
that otherwise has a 32-bit hole on 64-bit systems.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-04-27 15:11:48 -07:00
Eric Dumazet
0eae88f31c net: Fix various endianness glitches
Sparse can help us find endianness bugs, but we need to make some
cleanups to be able to more easily spot real bugs.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 19:06:52 -07:00
Tom Herbert
fec5e652e5 rfs: Receive Flow Steering
This patch implements receive flow steering (RFS).  RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running.  RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure.  The rxhash is passed in skb's received on
the connection from netif_receive_skb.  For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.

The convolution of the simple approach is that it would potentially
allow OOO packets.  If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_table is a global hash table.  Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue.  Each entry
contains a CPU and a tail queue counter.  The CPU is the "current"
CPU for a matching flow.  The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length.  When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted.  When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table.  This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality.  2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS.  The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue.  Both are rounded to power of two.

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors.  On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation.  However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp.  The RPC test is an request/response
test similar in structure to netperf RR test ith 100 threads on
each host, but does more work in userspace that netperf.

e1000e on 8 core Intel
   No RFS or RPS		104K tps at 30% CPU
   No RFS (best RPS config):    290K tps at 63% CPU
   RFS				303K tps at 61% CPU

RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
  No RFS/RPS	103K	48%	757/900/3185		4472.35
  RPS only:	174K	73%	415/993/2468		491.66
  RFS		223K	73%	379/651/1382		315.61

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
Herbert Xu
bb29624614 inet: Remove unused send_check length argument
inet: Remove unused send_check length argument

This patch removes the unused length argument from the send_check
function in struct inet_connection_sock_af_ops.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Yinghai <yinghai.lu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-11 15:29:09 -07:00
Herbert Xu
419f9f8960 tcp: Handle CHECKSUM_PARTIAL for SYNACK packets for IPv4
tcp: Handle CHECKSUM_PARTIAL for SYNACK packets for IPv4

This patch moves the common code between tcp_v4_send_check and
tcp_v4_gso_send_check into a new function __tcp_v4_send_check.

It then uses the new function in tcp_v4_send_synack so that it
handles CHECKSUM_PARTIAL properly.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Yinghai <yinghai.lu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-11 15:29:08 -07:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
stephen hemminger
97e3ecd112 TCP: check min TTL on received ICMP packets
This adds RFC5082 checks for TTL on received ICMP packets.
It adds some security against spoofed ICMP packets
disrupting GTSM protected sessions.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-19 21:00:42 -07:00
Eric Dumazet
bb134d5d95 tcp: Fix tcp_v4_rcv()
Commit d218d111 (tcp: Generalized TTL Security Mechanism) added a bug
for TIMEWAIT sockets. We should not test min_ttl for TW sockets.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-10 07:32:27 -08:00
Eric Dumazet
6cce09f87a tcp: Add SNMP counters for backlog and min_ttl drops
Commit 6b03a53a (tcp: use limited socket backlog) added the possibility
of dropping frames when backlog queue is full.

Commit d218d111 (tcp: Generalized TTL Security Mechanism) added the
possibility of dropping frames when TTL is under a given limit.

This patch adds new SNMP MIB entries, named TCPBacklogDrop and
TCPMinTTLDrop, published in /proc/net/netstat in TcpExt: line

netstat -s | egrep "TCPBacklogDrop|TCPMinTTLDrop"
    TCPBacklogDrop: 0
    TCPMinTTLDrop: 0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-08 10:45:27 -08:00
Zhu Yi
a3a858ff18 net: backlog functions rename
sk_add_backlog -> __sk_add_backlog
sk_add_backlog_limited -> sk_add_backlog

Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-05 13:34:03 -08:00
Zhu Yi
6b03a53a5a tcp: use limited socket backlog
Make tcp adapt to the limited socket backlog change.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-05 13:34:00 -08:00
Alexey Dobriyan
2c8c1e7297 net: spread __net_init, __net_exit
__net_init/__net_exit are apparently not going away, so use them
to full extent.

In some cases __net_init was removed, because it was called from
__net_exit code.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-17 19:16:02 -08:00
Octavian Purdila
72659ecce6 tcp: account SYN-ACK timeouts & retransmissions
Currently we don't increment SYN-ACK timeouts & retransmissions
although we do increment the same stats for SYN. We seem to have lost
the SYN-ACK accounting with the introduction of tcp_syn_recv_timer
(commit 2248761e in the netdev-vger-cvs tree).

This patch fixes this issue. In the process we also rename the v4/v6
syn/ack retransmit functions for clarity. We also add a new
request_socket operations (syn_ack_timeout) so we can keep code in
inet_connection_sock.c protocol agnostic.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-17 19:09:39 -08:00
Stephen Hemminger
d218d11133 tcp: Generalized TTL Security Mechanism
This patch adds the kernel portions needed to implement
RFC 5082 Generalized TTL Security Mechanism (GTSM).
It is a lightweight security measure against forged
packets causing DoS attacks (for BGP). 

This is already implemented the same way in BSD kernels.
For the necessary Quagga patch 
  http://www.gossamer-threads.com/lists/quagga/dev/17389

Description from Cisco
  http://www.cisco.com/en/US/docs/ios/12_3t/12_3t7/feature/guide/gt_btsh.html

It does add one byte to each socket structure, but I did
a little rearrangement to reuse a hole (on 64 bit), but it
does grow the structure on 32 bit

This should be documented on ip(4) man page and the Glibc in.h
file also needs update.  IPV6_MINHOPLIMIT should also be added
(although BSD doesn't support that).  

Only TCP is supported, but could also be added to UDP, DCCP, SCTP
if desired.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-11 16:28:01 -08:00
David S. Miller
bb5b7c1126 tcp: Revert per-route SACK/DSACK/TIMESTAMP changes.
It creates a regression, triggering badness for SYN_RECV
sockets, for example:

[19148.022102] Badness at net/ipv4/inet_connection_sock.c:293
[19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000
[19148.023035] REGS: eeecbd30 TRAP: 0700   Not tainted  (2.6.32)
[19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR>  CR: 24002442  XER: 00000000
[19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000

This is likely caused by the change in the 'estab' parameter
passed to tcp_parse_options() when invoked by the functions
in net/ipv4/tcp_minisocks.c

But even if that is fixed, the ->conn_request() changes made in
this patch series is fundamentally wrong.  They try to use the
listening socket's 'dst' to probe the route settings.  The
listening socket doesn't even have a route, and you can't
get the right route (the child request one) until much later
after we setup all of the state, and it must be done by hand.

This stuff really isn't ready, so the best thing to do is a
full revert.  This reverts the following commits:

f55017a93f
022c3f7d82
1aba721eba
cda42ebd67
345cda2fd6
dc343475ed
05eaade278
6a2a2d6bf8

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-15 20:56:42 -08:00
Eric Dumazet
9327f7053e tcp: Fix a connect() race with timewait sockets
First patch changes __inet_hash_nolisten() and __inet6_hash()
to get a timewait parameter to be able to unhash it from ehash
at same time the new socket is inserted in hash.

This makes sure timewait socket wont be found by a concurrent
writer in __inet_check_established()

Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-08 20:17:51 -08:00
Eric Dumazet
49d0900787 tcp: diag: Dont report negative values for rx queue
Both netlink and /proc/net/tcp interfaces can report transient
negative values for rx queue.

ss ->
State   Recv-Q Send-Q  Local Address:Port  Peer Address:Port
ESTAB   -6     6       127.0.0.1:45956     127.0.0.1:3333 

netstat ->
tcp   4294967290      6 127.0.0.1:37784  127.0.0.1:3333 ESTABLISHED

This is because we dont lock socket while computing 
tp->rcv_nxt - tp->copied_seq,
and another CPU can update copied_seq before rcv_next in RX path.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 16:06:13 -08:00
Eric W. Biederman
b099ce2602 net: Batch inet_twsk_purge
This function walks the whole hashtable so there is no point in
passing it a network namespace.  Instead I purge all timewait
sockets from dead network namespaces that I find.  If the namespace
is one of the once I am trying to purge I am guaranteed no new timewait
sockets can be formed so this will get them all.  If the namespace
is one I am not acting for it might form a few more but I will
call inet_twsk_purge again and  shortly to get rid of them.  In
any even if the network namespace is dead timewait sockets are
useless.

Move the calls of inet_twsk_purge into batch_exit routines so
that if I am killing a bunch of namespaces at once I will just
call inet_twsk_purge once and save a lot of redundant unnecessary
work.

My simple 4k network namespace exit test the cleanup time dropped from
roughly 8.2s to 1.6s.  While the time spent running inet_twsk_purge fell
to about 2ms.  1ms for ipv4 and 1ms for ipv6.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 12:23:47 -08:00
William Allen Simpson
4957faade1 TCPCT part 1g: Responder Cookie => Initiator
Parse incoming TCP_COOKIE option(s).

Calculate <SYN,ACK> TCP_COOKIE option.

Send optional <SYN,ACK> data.

This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

Requires:
   TCPCT part 1a: add request_values parameter for sending SYNACK
   TCPCT part 1b: generate Responder Cookie secret
   TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
   TCPCT part 1d: define TCP cookie option, extend existing struct's
   TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS
   TCPCT part 1f: Initiator Cookie => Responder

Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:07:26 -08:00
William Allen Simpson
435cf559f0 TCPCT part 1d: define TCP cookie option, extend existing struct's
Data structures are carefully composed to require minimal additions.
For example, the struct tcp_options_received cookie_plus variable fits
between existing 16-bit and 8-bit variables, requiring no additional
space (taking alignment into consideration).  There are no additions to
tcp_request_sock, and only 1 pointer in tcp_sock.

This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

The principle difference is using a TCP option to carry the cookie nonce,
instead of a user configured offset in the data.  This is more flexible and
less subject to user configuration error.  Such a cookie option has been
suggested for many years, and is also useful without SYN data, allowing
several related concepts to use the same extension option.

    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

    "Re: what a new TCP header might look like", May 12, 1998.
    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail

These functions will also be used in subsequent patches that implement
additional features.

Requires:
   TCPCT part 1a: add request_values parameter for sending SYNACK
   TCPCT part 1b: generate Responder Cookie secret
   TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS

Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:07:25 -08:00
William Allen Simpson
e6b4d11367 TCPCT part 1a: add request_values parameter for sending SYNACK
Add optional function parameters associated with sending SYNACK.
These parameters are not needed after sending SYNACK, and are not
used for retransmission.  Avoids extending struct tcp_request_sock,
and avoids allocating kernel memory.

Also affects DCCP as it uses common struct request_sock_ops,
but this parameter is currently reserved for future use.

Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:07:23 -08:00
Eric Dumazet
2c1409a0a2 inetpeer: Optimize inet_getid()
While investigating for network latencies, I found inet_getid() was a
contention point for some workloads, as inet_peer_idlock is shared
by all inet_getid() users regardless of peers.

One way to fix this is to make ip_id_count an atomic_t instead
of __u16, and use atomic_add_return().

In order to keep sizeof(struct inet_peer) = 64 on 64bit arches
tcp_ts_stamp is also converted to __u32 instead of "unsigned long".

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13 20:46:58 -08:00
William Allen Simpson
bee7ca9ec0 net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED
Define two symbols needed in both kernel and user space.

Remove old (somewhat incorrect) kernel variant that wasn't used in
most cases.  Default should apply to both RMSS and SMSS (RFC2581).

Replace numeric constants with defined symbols.

Stand-alone patch, originally developed for TCPCT.

Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13 20:38:48 -08:00
Gilad Ben-Yossef
022c3f7d82 Allow tcp_parse_options to consult dst entry
We need tcp_parse_options to be aware of dst_entry to
take into account per dst_entry TCP options settings

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 01:28:41 -07:00
Eric Dumazet
c720c7e838 inet: rename some inet_sock fields
In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.

Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)

This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-18 18:52:53 -07:00
Eric Dumazet
f373b53b5f tcp: replace ehash_size by ehash_mask
Storing the mask (size - 1) instead of the size allows fast path to be
a bit faster.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-13 03:44:02 -07:00
Ilpo Järvinen
0b6a05c1db tcp: fix ssthresh u16 leftover
It was once upon time so that snd_sthresh was a 16-bit quantity.
...That has not been true for long period of time. I run across
some ancient compares which still seem to trust such legacy.
Put all that magic into a single place, I hopefully found all
of them.

Compile tested, though linking of allyesconfig is ridiculous
nowadays it seems.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-15 01:30:10 -07:00
Wu Fengguang
aa1330766c tcp: replace hard coded GFP_KERNEL with sk_allocation
This fixed a lockdep warning which appeared when doing stress
memory tests over NFS:

	inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.

	page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock

	mount_root => nfs_root_data => tcp_close => lock sk_lock =>
			tcp_send_fin => alloc_skb_fclone => page reclaim

David raised a concern that if the allocation fails in tcp_send_fin(), and it's
GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting
for the allocation to succeed.

But fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks
weird, but it is no worse the implicit sleep inside GFP_KERNEL. Both could
loop endlessly under memory pressure.

CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
CC: David S. Miller <davem@davemloft.net>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-02 23:45:45 -07:00
Stephen Hemminger
3b401a81c0 inet: inet_connection_sock_af_ops const
The function block inet_connect_sock_af_ops contains no data
make it constant.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-02 01:03:49 -07:00
Stephen Hemminger
b2e4b3debc tcp: MD5 operations should be const
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-02 01:03:43 -07:00
Damian Lukowski
f1ecd5d9e7 Revert Backoff [v3]: Revert RTO on ICMP destination unreachable
Here, an ICMP host/network unreachable message, whose payload fits to
TCP's SND.UNA, is taken as an indication that the RTO retransmission has
not been lost due to congestion, but because of a route failure
somewhere along the path.
With true congestion, a router won't trigger such a message and the
patched TCP will operate as standard TCP.

This patch reverts one RTO backoff, if an ICMP host/network unreachable
message, whose payload fits to TCP's SND.UNA, arrives.
Based on the new RTO, the retransmission timer is reset to reflect the
remaining time, or - if the revert clocked out the timer - a retransmission
is sent out immediately.
Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if
there have been retransmissions and reversible backoffs, already.

Changes from v2:
1) Renaming of skb in tcp_v4_err() moved to another patch.
2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
3) Fixed code comments.

Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-01 02:45:42 -07:00
Damian Lukowski
4d1a2d9ec1 Revert Backoff [v3]: Rename skb to icmp_skb in tcp_v4_err()
This supplementary patch renames skb to icmp_skb in tcp_v4_err() in order to
disambiguate from another sk_buff variable, which will be introduced
in a separate patch.

Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-01 02:45:38 -07:00
John Dykstra
e547bc1ecc tcp: Use correct peer adr when copying MD5 keys
When the TCP connection handshake completes on the passive
side, a variety of state must be set up in the "child" sock,
including the key if MD5 authentication is being used.  Fix TCP
for both address families to label the key with the peer's
destination address, rather than the address from the listening
sock, which is usually the wildcard.

Reported-by:   Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-20 07:49:08 -07:00
John Dykstra
e3afe7b75e tcp: Fix MD5 signature checking on IPv4 mapped sockets
Fix MD5 signature checking so that an IPv4 active open
to an IPv6 socket can succeed.  In particular, use the
correct address family's signature generation function
for the SYN/ACK.

Reported-by:   Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-20 07:49:07 -07:00
Eric Dumazet
adf30907d6 net: skb->dst accessors
Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03 02:51:04 -07:00
Eric Dumazet
511c3f92ad net: skb->rtable accessor
Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

Delete skb->rtable field

Setting rtable is not allowed, just set dst instead as rtable is an alias.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03 02:51:02 -07:00
Shan Wei
ae8d7f884a tcp:fix the code indent
Signed-off-by: Shan Wei<shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-05 12:29:47 -07:00
Herbert Xu
36e7b1b8da gro: Fix COMPLETE checksum handling
On a brand new GRO skb, we cannot call ip_hdr since the header
may lie in the non-linear area.  This patch adds the helper
skb_gro_network_header to handle this.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-27 05:44:45 -07:00
Paul Moore
284904aa79 lsm: Relocate the IPv4 security_inet_conn_request() hooks
The current placement of the security_inet_conn_request() hooks do not allow
individual LSMs to override the IP options of the connection's request_sock.
This is a problem as both SELinux and Smack have the ability to use labeled
networking protocols which make use of IP options to carry security attributes
and the inability to set the IP options at the start of the TCP handshake is
problematic.

This patch moves the IPv4 security_inet_conn_request() hooks past the code
where the request_sock's IP options are set/reset so that the LSM can safely
manipulate the IP options as needed.  This patch intentionally does not change
the related IPv6 hooks as IPv6 based labeling protocols which use IPv6 options
are not currently implemented, once they are we will have a better idea of
the correct placement for the IPv6 hooks.

Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: James Morris <jmorris@namei.org>
2009-03-28 15:01:36 +11:00
Eric Dumazet
fc1ad92dfc tcp: allow timestamps even if SYN packet has tsval=0
Some systems send SYN packets with apparently wrong RFC1323 timestamp
option values [timestamp tsval=0 tsecr=0].
It might be for security reasons (http://www.secuobs.com/plugs/25220.shtml )

Linux TCP stack ignores this option and sends back a SYN+ACK packet
without timestamp option, thus many TCP flows cannot use timestamps
and lose some benefit of RFC1323.

Other operating systems seem to not care about initial tsval value, and let
tcp flows to negotiate timestamp option.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-11 09:23:57 -07:00
Eric W. Biederman
6a1b3054d9 tcp: Like icmp use register_pernet_subsys
To remove the possibility of packets flying around when network
devices are being cleaned up use reisger_pernet_subsys instead of
register_pernet_device.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-22 19:54:49 -08:00
Herbert Xu
86911732d3 gro: Avoid copying headers of unmerged packets
Unfortunately simplicity isn't always the best.  The fraginfo
interface turned out to be suboptimal.  The problem was quite
obvious.  For every packet, we have to copy the headers from
the frags structure into skb->head, even though for 99% of the
packets this part is immediately thrown away after the merge.

LRO didn't have this problem because it directly read the headers
from the frags structure.

This patch attempts to address this by creating an interface
that allows GRO to access the headers in the first frag without
having to copy it.  Because all drivers that use frags place the
headers in the first frag this optimisation should be enough.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-29 16:33:03 -08:00
Dan Williams
f67b459992 net_dma: convert to dma_find_channel
Use the general-purpose channel allocation provided by dmaengine.

Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-01-06 11:38:15 -07:00
Herbert Xu
eb4dea5853 net: Fix percpu counters deadlock
When we converted the protocol atomic counters such as the orphan
count and the total socket count deadlocks were introduced due to
the mismatch in BH status of the spots that used the percpu counter
operations.

Based on the diagnosis and patch by Peter Zijlstra, this patch
fixes these issues by disabling BH where we may be in process
context.

Reported-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-29 23:04:08 -08:00
Herbert Xu
bf296b125b tcp: Add GRO support
This patch adds the TCP-specific portion of GRO.  The criterion for
merging is extremely strict (the TCP header must match exactly apart
from the checksum) so as to allow refragmentation.  Otherwise this
is pretty much identical to LRO, except that we support the merging
of ECN packets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-15 23:43:36 -08:00
Eric Dumazet
1748376b66 net: Use a percpu_counter for sockets_allocated
Instead of using one atomic_t per protocol, use a percpu_counter
for "sockets_allocated", to reduce cache line contention on
heavy duty network servers. 

Note : We revert commit (248969ae31
net: af_unix can make unix_nr_socks visbile in /proc),
since it is not anymore used after sock_prot_inuse_add() addition

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 21:16:35 -08:00
Eric Dumazet
c25eb3bfb9 net: Convert TCP/DCCP listening hash tables to use RCU
This is the last step to be able to perform full RCU lookups
in __inet_lookup() : After established/timewait tables, we
add RCU lookups to listening hash table.

The only trick here is that a socket of a given type (TCP ipv4,
TCP ipv6, ...) can now flight between two different tables
(established and listening) during a RCU grace period, so we
must use different 'nulls' end-of-chain values for two tables.

We define a large value :

#define LISTENING_NULLS_BASE (1U << 29)

So that slots in listening table are guaranteed to have different
end-of-chain values than slots in established table. A reader can
still detect it finished its lookup in the right chain.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-23 17:22:55 -08:00
Eric Dumazet
9db66bdcc8 net: convert TCP/DCCP ehash rwlocks to spinlocks
Now TCP & DCCP use RCU lookups, we can convert ehash rwlocks to spinlocks.

/proc/net/tcp and other seq_file 'readers' can safely be converted to 'writers'.

This should speedup writers, since spin_lock()/spin_unlock()
only use one atomic operation instead of two for write_lock()/write_unlock()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 20:39:09 -08:00
Eric Dumazet
5caea4ea70 net: listening_hash get a spinlock per bucket
This patch prepares RCU migration of listening_hash table for
TCP/DCCP protocols.

listening_hash table being small (32 slots per protocol), we add
a spinlock for each slot, instead of a single rwlock for whole table.

This should reduce hold time of readers, and writers concurrency.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 00:40:07 -08:00
Joe Perches
07f0757a68 include/net net/ - csum_partial - remove unnecessary casts
The first argument to csum_partial is const void *
casts to char/u8 * are not necessary

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-19 15:44:53 -08:00
Eric Dumazet
3ab5aee7fe net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls
RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
  price of call_rcu() at freeing time.
- hlist_nulls permits to use few memory barriers.

This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.

Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speedup this case.

__inet_lookup_established() is pretty fast now we dont have to
dirty a contended cache line (read_lock/read_unlock)

Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-16 19:40:17 -08:00
Jianjun Kong
5799de0b12 net: clean up net/ipv4/tcp_ipv4.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 02:49:10 -08:00
Harvey Harrison
673d57e723 net: replace NIPQUAD() in net/ipv4/ net/ipv6/
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31 00:53:57 -07:00
Ilpo Järvinen
78e645cb89 tcpv[46]: fix md5 pseudoheader address field ordering
Maybe it's just me but I guess those md5 people made a mess
out of it by having *_md5_hash_* to use daddr, saddr order
instead of the one that is natural (and equal to what csum
functions use). For the segment were sending, the original
addresses are reversed so buff's saddr == skb's daddr and
vice-versa.

Maybe I can finally proceed with unification of some code
after fixing it first... :-)

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-09 14:37:47 -07:00
Ilpo Järvinen
52cd5750e8 tcp: fix length used for checksum in a reset
While looking for some common code I came across difference
in checksum calculation between tcp_v6_send_(reset|ack) I
couldn't explain. I checked both v4 and v6 and found out that
both seem to have the same "feature". I couldn't find anything
in rfc nor anywhere else which would state that md5 option
should be ignored like it was in case of reset so I came to
a conclusion that this is probably a genuine bug. I suspect
that addition of md5 just was fooled by the excessive
copy-paste code in those functions and the reset part was
never tested well enough to find out the problem.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-08 11:34:06 -07:00
Arnaldo Carvalho de Melo
9a1f27c480 inet_hashtables: Add inet_lookup_skb helpers
To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 11:41:57 -07:00
KOVACS Krisztian
88ef4a5a78 tcp: Handle TCP SYN+ACK/ACK/RST transparency
The TCP stack sends out SYN+ACK/ACK/RST reply packets in response to
incoming packets. The non-local source address check on output bites
us again, as replies for transparently redirected traffic won't have a
chance to leave the node.

This patch selectively sets the FLOWI_FLAG_ANYSRC flag when doing the
route lookup for those replies. Transparent replies are enabled if the
listening socket has the transparent socket flag set.

Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-01 07:41:00 -07:00
David S. Miller
b262e60309 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/wireless/ath9k/core.c
	drivers/net/wireless/ath9k/main.c
	net/core/dev.c
2008-10-01 06:12:56 -07:00
Vitaliy Gusev
4dd7972d12 tcp: Fix NULL dereference in tcp_4_send_ack()
Fix NULL dereference in tcp_4_send_ack().

As skb->dev is reset to NULL in tcp_v4_rcv() thus OOPS occurs:

BUG: unable to handle kernel NULL pointer dereference at 00000000000004d0
IP: [<ffffffff80498503>] tcp_v4_send_ack+0x203/0x250

Stack:  ffff810005dbb000 ffff810015c8acc0 e77b2c6e5f861600 a01610802e90cb6d
 0a08010100000000 88afffff88afffff 0000000080762be8 0000000115c872e8
 0004122000000000 0000000000000001 ffffffff80762b88 0000000000000020
Call Trace:
 <IRQ>  [<ffffffff80499c33>] tcp_v4_reqsk_send_ack+0x20/0x22
 [<ffffffff8049bce5>] tcp_check_req+0x108/0x14c
 [<ffffffff8047aaf7>] ? rt_intern_hash+0x322/0x33c
 [<ffffffff80499846>] tcp_v4_do_rcv+0x399/0x4ec
 [<ffffffff8045ce4b>] ? skb_checksum+0x4f/0x272
 [<ffffffff80485b74>] ? __inet_lookup_listener+0x14a/0x15c
 [<ffffffff8049babc>] tcp_v4_rcv+0x6a1/0x701
 [<ffffffff8047e739>] ip_local_deliver_finish+0x157/0x24a
 [<ffffffff8047ec9a>] ip_local_deliver+0x72/0x7c
 [<ffffffff8047e5bd>] ip_rcv_finish+0x38d/0x3b2
 [<ffffffff803d3548>] ? scsi_io_completion+0x19d/0x39e
 [<ffffffff8047ebe5>] ip_rcv+0x2a2/0x2e5
 [<ffffffff80462faa>] netif_receive_skb+0x293/0x303
 [<ffffffff80465a9b>] process_backlog+0x80/0xd0
 [<ffffffff802630b4>] ? __rcu_process_callbacks+0x125/0x1b4
 [<ffffffff8046560e>] net_rx_action+0xb9/0x17f
 [<ffffffff80234cc5>] __do_softirq+0xa3/0x164
 [<ffffffff8020c52c>] call_softirq+0x1c/0x28
 <EOI>  [<ffffffff8020de1c>] do_softirq+0x34/0x72
 [<ffffffff80234b8e>] local_bh_enable_ip+0x3f/0x50
 [<ffffffff804d43ca>] _spin_unlock_bh+0x12/0x14
 [<ffffffff804599cd>] release_sock+0xb8/0xc1
 [<ffffffff804a6f9a>] inet_stream_connect+0x146/0x25c
 [<ffffffff80243078>] ? autoremove_wake_function+0x0/0x38
 [<ffffffff8045751f>] sys_connect+0x68/0x8e
 [<ffffffff80291818>] ? fd_install+0x5f/0x68
 [<ffffffff80457784>] ? sock_map_fd+0x55/0x62
 [<ffffffff8020b39b>] system_call_after_swapgs+0x7b/0x80

Code: 41 10 11 d0 83 d0 00 4d 85 ed 89 45 c0 c7 45 c4 08 00 00 00 74 07 41 8b 45 04 89 45 c8 48 8b 43 20 8b 4d b8 48 8d 55 b0 48 89 de <48> 8b 80 d0 04 00 00 48 8b b8 60 01 00 00 e8 20 ae fe ff 65 48
RIP  [<ffffffff80498503>] tcp_v4_send_ack+0x203/0x250
 RSP <ffffffff80762b78>
CR2: 00000000000004d0

Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-01 01:51:39 -07:00
Tom Quetchenbach
f5fff5dc8a tcp: advertise MSS requested by user
I'm trying to use the TCP_MAXSEG option to setsockopt() to set the MSS
for both sides of a bidirectional connection.

man tcp says: "If this option is set before connection establishment, it
also changes the MSS value announced to the other end in the initial
packet."

However, the kernel only uses the MTU/route cache to set the advertised
MSS. That means if I set the MSS to, say, 500 before calling connect(),
I will send at most 500-byte packets, but I will still receive 1500-byte
packets in reply.

This is a bug, either in the kernel or the documentation.

This patch (applies to latest net-2.6) reduces the advertised value to
that requested by the user as long as setsockopt() is called before
connect() or accept(). This seems like the behavior that one would
expect as well as that which is documented.

I've tried to make sure that things that depend on the advertised MSS
are set correctly.

Signed-off-by: Tom Quetchenbach <virtualphtn@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-21 00:21:51 -07:00
David S. Miller
17dce5dfe3 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	net/mac80211/mlme.c
2008-09-08 16:59:05 -07:00
Daniel Lezcano
d315492b1a netns : fix kernel panic in timewait socket destruction
How to reproduce ?
 - create a network namespace
 - use tcp protocol and get timewait socket
 - exit the network namespace
 - after a moment (when the timewait socket is destroyed), the kernel
   panics.

# BUG: unable to handle kernel NULL pointer dereference at
0000000000000007
IP: [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8
PGD 119985067 PUD 11c5c0067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: ipv6 button battery ac loop dm_mod tg3 libphy ext3 jbd
edd fan thermal processor thermal_sys sg sata_svw libata dock serverworks
sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table]
Pid: 0, comm: swapper Not tainted 2.6.27-rc2 #3
RIP: 0010:[<ffffffff821e394d>] [<ffffffff821e394d>]
inet_twdr_do_twkill_work+0x6e/0xb8
RSP: 0018:ffff88011ff7fed0 EFLAGS: 00010246
RAX: ffffffffffffffff RBX: ffffffff82339420 RCX: ffff88011ff7ff30
RDX: 0000000000000001 RSI: ffff88011a4d03c0 RDI: ffff88011ac2fc00
RBP: ffffffff823392e0 R08: 0000000000000000 R09: ffff88002802a200
R10: ffff8800a5c4b000 R11: ffffffff823e4080 R12: ffff88011ac2fc00
R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
FS: 0000000041cbd940(0000) GS:ffff8800bff839c0(0000)
knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000007 CR3: 00000000bd87c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff8800bff9e000, task
ffff88011ff76690)
Stack: ffffffff823392e0 0000000000000100 ffffffff821e3a3a
0000000000000008
0000000000000000 ffffffff821e3a61 ffff8800bff7c000 ffffffff8203c7e7
ffff88011ff7ff10 ffff88011ff7ff10 0000000000000021 ffffffff82351108
Call Trace:
<IRQ> [<ffffffff821e3a3a>] ? inet_twdr_hangman+0x0/0x9e
[<ffffffff821e3a61>] ? inet_twdr_hangman+0x27/0x9e
[<ffffffff8203c7e7>] ? run_timer_softirq+0x12c/0x193
[<ffffffff820390d1>] ? __do_softirq+0x5e/0xcd
[<ffffffff8200d08c>] ? call_softirq+0x1c/0x28
[<ffffffff8200e611>] ? do_softirq+0x2c/0x68
[<ffffffff8201a055>] ? smp_apic_timer_interrupt+0x8e/0xa9
[<ffffffff8200cad6>] ? apic_timer_interrupt+0x66/0x70
<EOI> [<ffffffff82011f4c>] ? default_idle+0x27/0x3b
[<ffffffff8200abbd>] ? cpu_idle+0x5f/0x7d


Code: e8 01 00 00 4c 89 e7 41 ff c5 e8 8d fd ff ff 49 8b 44 24 38 4c 89 e7
65 8b 14 25 24 00 00 00 89 d2 48 8b 80 e8 00 00 00 48 f7 d0 <48> 8b 04 d0
48 ff 40 58 e8 fc fc ff ff 48 89 df e8 c0 5f 04 00
RIP [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8
RSP <ffff88011ff7fed0>
CR2: 0000000000000007

This patch provides a function to purge all timewait sockets related
to a network namespace. The timewait sockets life cycle is not tied with
the network namespace, that means the timewait sockets stay alive while
the network namespace dies. The timewait sockets are for avoiding to
receive a duplicate packet from the network, if the network namespace is
freed, the network stack is removed, so no chance to receive any packets
from the outside world. Furthermore, having a pending destruction timer
on these sockets with a network namespace freed is not safe and will lead
to an oops if the timer callback which try to access data belonging to 
the namespace like for example in:
	inet_twdr_do_twkill_work
		-> NET_INC_STATS_BH(twsk_net(tw), LINUX_MIB_TIMEWAITED);

Purging the timewait sockets at the network namespace destruction will:
 1) speed up memory freeing for the namespace
 2) fix kernel panic on asynchronous timewait destruction

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-08 13:17:27 -07:00
Andi Kleen
6eac560407 tcp: Skip empty hash buckets faster in /proc/net/tcp
On most systems most of the TCP established/time-wait hash buckets are empty.
When walking the hash table for /proc/net/tcp their read locks would
always be aquired just to find out they're empty. This patch changes the code
to check first if the buckets have any entries before taking the lock, which
is much cheaper than taking a lock. Since the hash tables are large
this makes a measurable difference on processing /proc/net/tcp, 
especially on architectures with slow read_lock (e.g. PPC) 

On a 2GB Core2 system time cat /proc/net/tcp > /dev/null (with a mostly
empty hash table) goes from 0.046s to 0.005s.

On systems with slower atomics (like P4 or POWER4) or larger hash tables
(more RAM) the difference is much higher.

This can be noticeable because there are some daemons around who regularly
scan /proc/net/tcp.

Original idea for this patch from Marcus Meissner, but redone by me.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-08-28 01:08:02 -07:00
Gui Jianfeng
6edafaaf6f tcp: Fix kernel panic when calling tcp_v(4/6)_md5_do_lookup
If the following packet flow happen, kernel will panic.
MathineA			MathineB
		SYN
	---------------------->    
        	SYN+ACK
	<----------------------
		ACK(bad seq)
	---------------------->
When a bad seq ACK is received, tcp_v4_md5_do_lookup(skb->sk, ip_hdr(skb)->daddr))
is finally called by tcp_v4_reqsk_send_ack(), but the first parameter(skb->sk) is 
NULL at that moment, so kernel panic happens.
This patch fixes this bug.

OOPS output is as following:
[  302.812793] IP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42
[  302.817075] Oops: 0000 [#1] SMP 
[  302.819815] Modules linked in: ipv6 loop dm_multipath rtc_cmos rtc_core rtc_lib pcspkr pcnet32 mii i2c_piix4 parport_pc i2c_core parport ac button ata_piix libata dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
[  302.849946] 
[  302.851198] Pid: 0, comm: swapper Not tainted (2.6.27-rc1-guijf #5)
[  302.855184] EIP: 0060:[<c05cfaa6>] EFLAGS: 00010296 CPU: 0
[  302.858296] EIP is at tcp_v4_md5_do_lookup+0x12/0x42
[  302.861027] EAX: 0000001e EBX: 00000000 ECX: 00000046 EDX: 00000046
[  302.864867] ESI: ceb69e00 EDI: 1467a8c0 EBP: cf75f180 ESP: c0792e54
[  302.868333]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[  302.871287] Process swapper (pid: 0, ti=c0792000 task=c0712340 task.ti=c0746000)
[  302.875592] Stack: c06f413a 00000000 cf75f180 ceb69e00 00000000 c05d0d86 000016d0 ceac5400 
[  302.883275]        c05d28f8 000016d0 ceb69e00 ceb69e20 681bf6e3 00001000 00000000 0a67a8c0 
[  302.890971]        ceac5400 c04250a3 c06f413a c0792eb0 c0792edc cf59a620 cf59a620 cf59a634 
[  302.900140] Call Trace:
[  302.902392]  [<c05d0d86>] tcp_v4_reqsk_send_ack+0x17/0x35
[  302.907060]  [<c05d28f8>] tcp_check_req+0x156/0x372
[  302.910082]  [<c04250a3>] printk+0x14/0x18
[  302.912868]  [<c05d0aa1>] tcp_v4_do_rcv+0x1d3/0x2bf
[  302.917423]  [<c05d26be>] tcp_v4_rcv+0x563/0x5b9
[  302.920453]  [<c05bb20f>] ip_local_deliver_finish+0xe8/0x183
[  302.923865]  [<c05bb10a>] ip_rcv_finish+0x286/0x2a3
[  302.928569]  [<c059e438>] dev_alloc_skb+0x11/0x25
[  302.931563]  [<c05a211f>] netif_receive_skb+0x2d6/0x33a
[  302.934914]  [<d0917941>] pcnet32_poll+0x333/0x680 [pcnet32]
[  302.938735]  [<c05a3b48>] net_rx_action+0x5c/0xfe
[  302.941792]  [<c042856b>] __do_softirq+0x5d/0xc1
[  302.944788]  [<c042850e>] __do_softirq+0x0/0xc1
[  302.948999]  [<c040564b>] do_softirq+0x55/0x88
[  302.951870]  [<c04501b1>] handle_fasteoi_irq+0x0/0xa4
[  302.954986]  [<c04284da>] irq_exit+0x35/0x69
[  302.959081]  [<c0405717>] do_IRQ+0x99/0xae
[  302.961896]  [<c040422b>] common_interrupt+0x23/0x28
[  302.966279]  [<c040819d>] default_idle+0x2a/0x3d
[  302.969212]  [<c0402552>] cpu_idle+0xb2/0xd2
[  302.972169]  =======================
[  302.974274] Code: fc ff 84 d2 0f 84 df fd ff ff e9 34 fe ff ff 83 c4 0c 5b 5e 5f 5d c3 90 90 57 89 d7 56 53 89 c3 50 68 3a 41 6f c0 e8 e9 55 e5 ff <8b> 93 9c 04 00 00 58 85 d2 59 74 1e 8b 72 10 31 db 31 c9 85 f6 
[  303.011610] EIP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42 SS:ESP 0068:c0792e54
[  303.018360] Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-08-06 23:50:04 -07:00
Adam Langley
90b7e1120b tcp: MD5: Fix MD5 signatures on certain ACK packets
I noticed, looking at tcpdumps, that timewait ACKs were getting sent
with an incorrect MD5 signature when signatures were enabled.

I broke this in 49a72dfb88 ("tcp: Fix
MD5 signatures for non-linear skbs"). I didn't take into account that
the skb passed to tcp_*_send_ack was the inbound packet, thus the
source and dest addresses need to be swapped when calculating the MD5
pseudoheader.

Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-31 20:49:48 -07:00
David S. Miller
785957d3e8 tcp: MD5: Use MIB counter instead of warning for MD5 mismatch.
From a report by Matti Aarnio, and preliminary patch by Adam Langley.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-30 03:27:25 -07:00
Ilpo Järvinen
547b792cac net: convert BUG_TRAP to generic WARN_ON
Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.

I could make at least one BUILD_BUG_ON conversion.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-25 21:43:18 -07:00
Daniel Lezcano
bdccc4ca13 tcp: fix kernel panic with listening_get_next
# BUG: unable to handle kernel NULL pointer dereference at
0000000000000038
IP: [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3
PGD 11e4b9067 PUD 11d16c067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 3
Modules linked in: bridge ipv6 button battery ac loop dm_mod tg3 ext3
jbd edd fan thermal processor thermal_sys hwmon sg sata_svw libata dock
serverworks sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table]
Pid: 3368, comm: slpd Not tainted 2.6.26-rc2-mm1-lxc4 #1
RIP: 0010:[<ffffffff821ed01e>] [<ffffffff821ed01e>]
listening_get_next+0x50/0x1b3
RSP: 0018:ffff81011e1fbe18 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8100be0ad3c0 RCX: ffff8100619f50c0
RDX: ffffffff82475be0 RSI: ffff81011d9ae6c0 RDI: ffff8100be0ad508
RBP: ffff81011f4f1240 R08: 00000000ffffffff R09: ffff8101185b6780
R10: 000000000000002d R11: ffffffff820fdbfa R12: ffff8100be0ad3c8
R13: ffff8100be0ad6a0 R14: ffff8100be0ad3c0 R15: ffffffff825b8ce0
FS: 00007f6a0ebd16d0(0000) GS:ffff81011f424540(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000038 CR3: 000000011dc20000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process slpd (pid: 3368, threadinfo ffff81011e1fa000, task
ffff81011f4b8660)
Stack: 00000000000002ee ffff81011f5a57c0 ffff81011f4f1240
ffff81011e1fbe90
0000000000001000 0000000000000000 00007fff16bf2590 ffffffff821ed9c8
ffff81011f5a57c0 ffff81011d9ae6c0 000000000000041a ffffffff820b0abd
Call Trace:
[<ffffffff821ed9c8>] ? tcp_seq_next+0x34/0x7e
[<ffffffff820b0abd>] ? seq_read+0x1aa/0x29d
[<ffffffff820d21b4>] ? proc_reg_read+0x73/0x8e
[<ffffffff8209769c>] ? vfs_read+0xaa/0x152
[<ffffffff82097a7d>] ? sys_read+0x45/0x6e
[<ffffffff8200bd2b>] ? system_call_after_swapgs+0x7b/0x80


Code: 31 a9 25 00 e9 b5 00 00 00 ff 45 20 83 7d 0c 01 75 79 4c 8b 75 10
48 8b 0e eb 1d 48 8b 51 20 0f b7 45 08 39 02 75 0e 48 8b 41 28 <4c> 39
78 38 0f 84 93 00 00 00 48 8b 09 48 85 c9 75 de 8b 55 1c
RIP [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3
RSP <ffff81011e1fbe18>
CR2: 0000000000000038

This kernel panic appears with CONFIG_NET_NS=y.

How to reproduce ?

    On the buggy host (host A)
       * ip addr add 1.2.3.4/24 dev eth0

    On a remote host (host B)
       * ip addr add 1.2.3.5/24 dev eth0
       * iptables -A INPUT -p tcp -s 1.2.3.4 -j DROP
       * ssh 1.2.3.4

    On host A:
       * netstat -ta or cat /proc/net/tcp

This bug happens when reading /proc/net/tcp[6] when there is a req_sock
at the SYN_RECV state.

When a SYN is received the minisock is created and the sk field is set to
NULL. In the listening_get_next function, we try to look at the field 
req->sk->sk_net.

When looking at how to fix this bug, I noticed that is useless to do
the check for the minisock belonging to the namespace. A minisock belongs
to a listen point and this one is per namespace, so when browsing the
minisock they are always per namespace.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 00:15:13 -07:00
Adam Langley
49a72dfb88 tcp: Fix MD5 signatures for non-linear skbs
Currently, the MD5 code assumes that the SKBs are linear and, in the case
that they aren't, happily goes off and hashes off the end of the SKB and
into random memory.

Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy
Polyakov. Also includes a couple of missed route_caps from Stephen's patch
in [2].

[1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2
[2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2

Signed-off-by: Adam Langley <agl@imperialviolet.org>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 00:01:42 -07:00
Pavel Emelyanov
de0744af1f mib: add net to NET_INC_STATS_BH
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:31:16 -07:00
Pavel Emelyanov
63231bddf6 mib: add net to TCP_INC_STATS_BH
Same as before - the sock is always there to get the net from,
but there are also some places with the net already saved on 
the stack.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:22:25 -07:00
Pavel Emelyanov
a86b1e3019 inet: prepare struct net for TCP MIB accounting
This is the same as the first patch in the set, but preparing
the net for TCP_XXX_STATS - save the struct net on the stack
where required and possible.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:20:58 -07:00
Pavel Emelyanov
7c73a6faff mib: add net to IP_INC_STATS_BH
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:20:11 -07:00
Pavel Emelyanov
dcfc23cac1 mib: add struct net to ICMP_INC_STATS_BH
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 23:05:29 -07:00
Pavel Emelyanov
fd54d716b1 inet: toss struct net initialization around
Some places, that deal with ICMP statistics already have where
to get a struct net from, but use it directly, without declaring
a separate variable on the stack.

Since I will need this net soon, I declare a struct net on the
stack and use it in the existing places in a separate patch not
to spoil the future ones.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 23:05:26 -07:00
David S. Miller
1b63ba8a86 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/wireless/iwlwifi/iwl4965-base.c
2008-06-28 01:19:40 -07:00
Stephen Hemminger
7be87351a1 tcp: /proc/net/tcp rto,ato values not scaled properly (v2)
I found another case where we are sending information to userspace
in the wrong HZ scale.  This should have been fixed back in 2.5 :-(

This means an ABI change but as it stands there is no way for an application
like ss to get the right value.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 20:00:19 -07:00
David S. Miller
caea902f72 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/wireless/rt2x00/Kconfig
	drivers/net/wireless/rt2x00/rt2x00usb.c
	net/sctp/protocol.c
2008-06-16 18:25:48 -07:00
Rami Rosen
a9d246dbb0 ipv4: Remove unused definitions in net/ipv4/tcp_ipv4.c.
1) Remove ICMP_MIN_LENGTH, as it is unused.

2) Remove unneeded tcp_v4_send_check() declaration.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-16 17:07:16 -07:00
Brian Haley
7d06b2e053 net: change proto destroy method to return void
Change struct proto destroy function pointer to return void.  Noticed
by Al Viro.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-14 17:04:49 -07:00
David S. Miller
4ae127d1b6 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/smc911x.c
2008-06-13 20:52:39 -07:00
David S. Miller
ec0a196626 tcp: Revert 'process defer accept as established' changes.
This reverts two changesets, ec3c0982a2
("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
the follow-on bug fix 9ae27e0adb
("tcp: Fix slab corruption with ipv6 and tcp6fuzz").

This change causes several problems, first reported by Ingo Molnar
as a distcc-over-loopback regression where connections were getting
stuck.

Ilpo Järvinen first spotted the locking problems.  The new function
added by this code, tcp_defer_accept_check(), only has the
child socket locked, yet it is modifying state of the parent
listening socket.

Fixing that is non-trivial at best, because we can't simply just grab
the parent listening socket lock at this point, because it would
create an ABBA deadlock.  The normal ordering is parent listening
socket --> child socket, but this code path would require the
reverse lock ordering.

Next is a problem noticed by Vitaliy Gusev, he noted:

----------------------------------------
>--- a/net/ipv4/tcp_timer.c
>+++ b/net/ipv4/tcp_timer.c
>@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
> 		goto death;
> 	}
>
>+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
>+		tcp_send_active_reset(sk, GFP_ATOMIC);
>+		goto death;

Here socket sk is not attached to listening socket's request queue. tcp_done()
will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
release this sk) as socket is not DEAD. Therefore socket sk will be lost for
freeing.
----------------------------------------

Finally, Alexey Kuznetsov argues that there might not even be any
real value or advantage to these new semantics even if we fix all
of the bugs:

----------------------------------------
Hiding from accept() sockets with only out-of-order data only
is the only thing which is impossible with old approach. Is this really
so valuable? My opinion: no, this is nothing but a new loophole
to consume memory without control.
----------------------------------------

So revert this thing for now.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-12 16:34:35 -07:00
David S. Miller
e6e30add6b Merge branch 'net-next-2.6-misc-20080612a' of git://git.linux-ipv6.org/gitroot/yoshfuji/linux-2.6-next 2008-06-11 22:33:59 -07:00
Adrian Bunk
0b04082995 net: remove CVS keywords
This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-11 21:00:38 -07:00
YOSHIFUJI Hideaki
9501f97229 tcp md5sig: Let the caller pass appropriate key for tcp_v{4,6}_do_calc_md5_hash().
As we do for other socket/timewait-socket specific parameters,
let the callers pass appropriate arguments to
tcp_v{4,6}_do_calc_md5_hash().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-06-12 03:46:30 +09:00
YOSHIFUJI Hideaki
8d26d76dd4 tcp md5sig: Share most of hash calcucaltion bits between IPv4 and IPv6.
We can share most part of the hash calculation code because
the only difference between IPv4 and IPv6 is their pseudo headers.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-06-12 02:38:20 +09:00
YOSHIFUJI Hideaki
076fb72233 tcp md5sig: Remove redundant protocol argument.
Protocol is always TCP, so remove useless protocol argument.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-06-12 02:38:19 +09:00
YOSHIFUJI Hideaki
7d5d5525bd tcp md5sig: Share MD5 Signature option parser between IPv4 and IPv6.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-06-12 02:38:18 +09:00
Arnaldo Carvalho de Melo
ce4a7d0d48 inet{6}_request_sock: Init ->opt and ->pktopts in the constructor
Wei Yongjun noticed that we may call reqsk_free on request sock objects where
the opt fields may not be initialized, fix it by introducing inet_reqsk_alloc
where we initialize ->opt to NULL and set ->pktopts to NULL in
inet6_reqsk_alloc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-10 12:39:35 -07:00
Denis V. Lunev
84841c3c6c ipv4: assign PDE->data before gluing PDE into /proc tree
The check for PDE->data != NULL becomes useless after the replacement
of proc_net_fops_create with proc_create_data.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-02 04:10:08 -07:00
Pavel Emelyanov
5e659e4cb0 [NET]: Fix heavy stack usage in seq_file output routines.
Plan C: we can follow the Al Viro's proposal about %n like in this patch.
The same applies to udp, fib (the /proc/net/route file), rt_cache and 
sctp debug. This is minus ~150-200 bytes for each.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-24 01:02:16 -07:00
YOSHIFUJI Hideaki
a7d632b6b4 [IPV4]: Use NIPQUAD_FMT to format ipv4 addresses.
And use %u to format port.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14 04:09:00 -07:00
Denis V. Lunev
5f4472c5a6 [TCP]: Remove owner from tcp_seq_afinfo.
Move it to tcp_seq_afinfo->seq_fops as should be.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:13:53 -07:00
Denis V. Lunev
68fcadd16c [TCP]: Place file operations directly into tcp_seq_afinfo.
No need to have separate never-used variable.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:13:30 -07:00
Denis V. Lunev
52d6f3f11b [TCP]: Cleanup /proc/tcp[6] creation/removal.
Replace seq_open with seq_open_net and remove tcp_seq_release
completely.  seq_release_net will do this job just fine.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:12:41 -07:00
Denis V. Lunev
9427c4b36b [TCP]: Move seq_ops from tcp_iter_state to tcp_seq_afinfo.
No need to create seq_operations for each instance of 'netstat'.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:12:13 -07:00
Denis V. Lunev
1abf4fb20d [TCP]: No need to check afinfo != NULL in tcp_proc_(un)register.
tcp_proc_register/tcp_proc_unregister are called with a static pointer only.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:11:46 -07:00
Denis V. Lunev
a4146b1b2c [TCP]: Replace struct net on tcp_iter_state with seq_net_private.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:11:14 -07:00
Florian Westphal
4dfc281702 [Syncookies]: Add support for TCP options via timestamps.
Allow the use of SACK and window scaling when syncookies are used
and the client supports tcp timestamps. Options are encoded into
the timestamp sent in the syn-ack and restored from the timestamp
echo when the ack is received.

Based on earlier work by Glenn Griffin.
This patch avoids increasing the size of structs by encoding TCP
options into the least significant bits of the timestamp and
by not using any 'timestamp offset'.

The downside is that the timestamp sent in the packet after the synack
will increase by several seconds.

changes since v1:
 don't duplicate timestamp echo decoding function, put it into ipv4/syncookie.c
 and have ipv6/syncookies.c use it.
 Feedback from Glenn Griffin: fix line indented with spaces, kill redundant if ()

Reviewed-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 03:12:40 -07:00
Denis V. Lunev
7feb49c82a [NETNS]: Use TCP control socket from a correct namespace.
Signed-off-by: Denis V.Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:32:00 -07:00
Denis V. Lunev
046ee90235 [NETNS]: Create tcp control socket in the each namespace.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:31:33 -07:00
Denis V. Lunev
5677242f43 [NETNS]: Inet control socket should not hold a namespace.
This is a generic requirement, so make inet_ctl_sock_create namespace
aware and create a inet_ctl_sock_destroy wrapper around
sk_release_kernel.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:28:30 -07:00
Denis V. Lunev
eee4fe4ded [INET]: Let inet_ctl_sock_create return sock rather than socket.
All upper protocol layers are already use sock internally.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:27:58 -07:00
Denis V. Lunev
3d58b5fa8e [INET]: Rename inet_csk_ctl_sock_create to inet_ctl_sock_create.
This call is nothing common with INET connection sockets code. It
simply creates an unhashes kernel sockets for protocol messages.

Move the new call into af_inet.c after the rename.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:22:32 -07:00
Denis V. Lunev
14c0c8e8e0 [TCP]: Replace socket with sock for reset sending.
Replace tcp_socket with tcp_sock. This is more effective (less
derefferences on fast paths). Additionally, the approach is unified to
one used in ICMP.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:19:38 -07:00
Pavel Emelyanov
bdcde3d71a [SOCK]: Drop inuse pcounter from struct proto (v2).
An uppercut - do not use the pcounter on struct proto.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 16:39:33 -07:00
YOSHIFUJI Hideaki
878628fbf2 [NET] NETNS: Omit namespace comparision without CONFIG_NET_NS.
Introduce an inline net_eq() to compare two namespaces.
Without CONFIG_NET_NS, since no namespace other than &init_net
exists, it is always 1.

We do not need to convert 1) inline vs inline and
2) inline vs &init_net comparisons.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:40:00 +09:00
YOSHIFUJI Hideaki
3b1e0a655f [NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.
Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:39:55 +09:00
YOSHIFUJI Hideaki
c346dca108 [NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS.
Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:39:53 +09:00
Pavel Emelyanov
757764f61d [NETNS][TCP]: Register /proc/net/tcp in a namespace.
After the commit f40c8174d3 ([NETNS][IPV4] 
tcp - make proc handle the network namespaces) it is now possible to make
this file present in newly created namespaces.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 14:56:02 -07:00
Pavel Emelyanov
39d8cda76c [SOCK]: Add udp_hash member to struct proto.
Inspired by the commit ab1e0a13 ([SOCK] proto: Add hashinfo member to 
struct proto) from Arnaldo, I made similar thing for UDP/-Lite IPv4 
and -v6 protocols.

The result is not that exciting, but it removes some levels of
indirection in udpxxx_get_port and saves some space in code and text.

The first step is to union existing hashinfo and new udp_hash on the
struct proto and give a name to this union, since future initialization 
of tcpxxx_prot, dccp_vx_protinfo and udpxxx_protinfo will cause gcc 
warning about inability to initialize anonymous member this way.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 16:50:58 -07:00
Patrick McManus
ec3c0982a2 [TCP]: TCP_DEFER_ACCEPT updates - process as established
Change TCP_DEFER_ACCEPT implementation so that it transitions a
connection to ESTABLISHED after handshake is complete instead of
leaving it in SYN-RECV until some data arrvies. Place connection in
accept queue when first data packet arrives from slow path.

Benefits:
  - established connection is now reset if it never makes it
   to the accept queue

 - diagnostic state of established matches with the packet traces
   showing completed handshake

 - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be
   enforced with reasonable accuracy instead of rounding up to next
   exponential back-off of syn-ack retry.

Signed-off-by: Patrick McManus <mcmanus@ducksong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 16:33:01 -07:00
Pavel Emelyanov
28518fc170 [NET]: NULL pointer dereference and other nasty things in /proc/net/(tcp|udp)[6]
Commits f40c81 ([NETNS][IPV4] tcp - make proc handle the network
namespaces) and a91275 ([NETNS][IPV6] udp - make proc handle the
network namespace) both introduced bad checks on sockets and tw
buckets to belong to proper net namespace.

I.e. when checking for socket to belong to given net and family the

	do {
		sk = sk_next(sk);
	} while (sk && sk->sk_net != net && sk->sk_family != family);

constructions were used. This is wrong, since as soon as the
sk->sk_net fits the net the socket is immediately returned, even if it
belongs to other family.

As the result four /proc/net/(udp|tcp)[6] entries show wrong info.
The udp6 entry even oopses when dereferencing inet6_sk(sk) pointer:

static void udp6_sock_seq_show(struct seq_file *seq, struct sock *sp, int bucket)
{
	...
        struct ipv6_pinfo *np = inet6_sk(sp);
	...

        dest  = &np->daddr; /* will be NULL for AF_INET sockets */
	...
	seq_printf(...
	           dest->s6_addr32[0], dest->s6_addr32[1],
                   dest->s6_addr32[2], dest->s6_addr32[3],
	...

Fix it by converting && to ||.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 15:52:00 -07:00
Daniel Lezcano
6f8b13bcb3 [NETNS][IPV6] tcp6 - make proc per namespace
Make the proc for tcp6 to be per namespace.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 04:14:45 -07:00
Daniel Lezcano
f40c8174d3 [NETNS][IPV4] tcp - make proc handle the network namespaces
This patch, like udp proc, makes the proc functions to take care of
which namespace the socket belongs.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 04:13:54 -07:00
Eric Dumazet
ee6b967301 [IPV4]: Add 'rtable' field in struct sk_buff to alias 'dst' and avoid casts
(Anonymous) unions can help us to avoid ugly casts.

A common cast it the (struct rtable *)skb->dst one.

Defining an union like  :
union {
     struct dst_entry *dst;
     struct rtable *rtable;
};
permits to use skb->rtable in place.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-05 18:30:47 -08:00
Denis V. Lunev
7cd04fa7e3 [TCP]: Merge exit paths in tcp_v4_conn_request.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-03 11:59:32 -08:00
Denis V. Lunev
fd80eb942a [INET]: Remove struct dst_entry *dst from request_sock_ops.rtx_syn_ack.
It looks like dst parameter is used in this API due to historical
reasons.  Actually, it is really used in the direct call to
tcp_v4_send_synack only.  So, create a wrapper for tcp_v4_send_synack
and remove dst from rtx_syn_ack.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-29 11:43:03 -08:00
Denis V. Lunev
9b0f976f27 [INET]: Remove struct net_proto_family* from _init calls.
struct net_proto_family* is not used in icmp[v6]_init, ndisc_init,
igmp_init and tcp_v4_init. Remove it.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-29 11:13:15 -08:00
Kris Katterjohn
9bf1d83e7e [TCP]: Fix tcp_v4_send_synack() comment
Signed-off-by: Kris Katterjohn <katterjohn@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-17 22:29:19 -08:00
Arnaldo Carvalho de Melo
ab1e0a13d7 [SOCK] proto: Add hashinfo member to struct proto
This way we can remove TCP and DCCP specific versions of

sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash:     inet_hash is directly used, only v6 need
                       a specific version to deal with mapped sockets
sk->sk_prot->unhash:   both v4 and v6 use inet_hash directly

struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.

Now only the lookup routines receive as a parameter a struct inet_hashtable.

With this we further reuse code, reducing the difference among INET transport
protocols.

Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.

net-2.6/net/ipv4/inet_hashtables.c:
  struct proto			     |   +8
  struct inet_connection_sock_af_ops |   +8
 2 structs changed
  __inet_hash_nolisten               |  +18
  __inet_hash                        | -210
  inet_put_port                      |   +8
  inet_bind_bucket_create            |   +1
  __inet_hash_connect                |   -8
 5 functions changed, 27 bytes added, 218 bytes removed, diff: -191

net-2.6/net/core/sock.c:
  proto_seq_show                     |   +3
 1 function changed, 3 bytes added, diff: +3

net-2.6/net/ipv4/inet_connection_sock.c:
  inet_csk_get_port                  |  +15
 1 function changed, 15 bytes added, diff: +15

net-2.6/net/ipv4/tcp.c:
  tcp_set_state                      |   -7
 1 function changed, 7 bytes removed, diff: -7

net-2.6/net/ipv4/tcp_ipv4.c:
  tcp_v4_get_port                    |  -31
  tcp_v4_hash                        |  -48
  tcp_v4_destroy_sock                |   -7
  tcp_v4_syn_recv_sock               |   -2
  tcp_unhash                         | -179
 5 functions changed, 267 bytes removed, diff: -267

net-2.6/net/ipv6/inet6_hashtables.c:
  __inet6_hash |   +8
 1 function changed, 8 bytes added, diff: +8

net-2.6/net/ipv4/inet_hashtables.c:
  inet_unhash                        | +190
  inet_hash                          | +242
 2 functions changed, 432 bytes added, diff: +432

vmlinux:
 16 functions changed, 485 bytes added, 492 bytes removed, diff: -7

/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
  tcp_v6_get_port                    |  -31
  tcp_v6_hash                        |   -7
  tcp_v6_syn_recv_sock               |   -9
 3 functions changed, 47 bytes removed, diff: -47

/home/acme/git/net-2.6/net/dccp/proto.c:
  dccp_destroy_sock                  |   -7
  dccp_unhash                        | -179
  dccp_hash                          |  -49
  dccp_set_state                     |   -7
  dccp_done                          |   +1
 5 functions changed, 1 bytes added, 242 bytes removed, diff: -241

/home/acme/git/net-2.6/net/dccp/ipv4.c:
  dccp_v4_get_port                   |  -31
  dccp_v4_request_recv_sock          |   -2
 2 functions changed, 33 bytes removed, diff: -33

/home/acme/git/net-2.6/net/dccp/ipv6.c:
  dccp_v6_get_port                   |  -31
  dccp_v6_hash                       |   -7
  dccp_v6_request_recv_sock          |   +5
 3 functions changed, 5 bytes added, 38 bytes removed, diff: -33

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-03 04:28:52 -08:00
Pavel Emelyanov
c67499c0e7 [NETNS]: Tcp-v4 sockets per-net lookup.
Add a net argument to inet_lookup and propagate it further
into lookup calls. Plus tune the __inet_check_established.

The dccp and inet_diag, which use that lookup functions
pass the init_net into them.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31 19:28:19 -08:00
YOSHIFUJI Hideaki
9cb5734e5b [TCP]: Convert several length variable to unsigned.
Several length variables cannot be negative, so convert int to
unsigned int.  This also allows us to do sane shift operations
on those variables.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:59:56 -08:00
Pavel Emelyanov
152da81deb [INET]: Uninline the __inet_hash function.
This one is used in quite many places in the networking code and
seems to big to be inline.

After the patch net/ipv4/build-in.o loses ~650 bytes:
add/remove: 2/0 grow/shrink: 0/5 up/down: 461/-1114 (-653)
function                                     old     new   delta
__inet_hash_nolisten                           -     282    +282
__inet_hash                                    -     179    +179
tcp_sacktag_write_queue                     2255    2254      -1
__inet_lookup_listener                       284     274     -10
tcp_v4_syn_recv_sock                         755     493    -262
tcp_v4_hash                                  389      35    -354
inet_hash_connect                           1086     599    -487

This version addresses the issue pointed by Eric, that
while being inline this function was optimized by gcc
in respect to the 'listen_possible' argument.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:59:26 -08:00
YOSHIFUJI Hideaki
354faf0977 [IPV4] TCPMD5: Use memmove() instead of memcpy() because we have overlaps.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:30:31 -08:00
YOSHIFUJI Hideaki
a80cc20da4 [IPV4] TCPMD5: Omit redundant NULL check for kfree() argument.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:30:06 -08:00
Eric Dumazet
230140cffa [INET]: Remove per bucket rwlock in tcp/dccp ehash table.
As done two years ago on IP route cache table (commit
22c047ccbc) , we can avoid using one
lock per hash bucket for the huge TCP/DCCP hash tables.

On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
litle performance differences. (we hit a different cache line for the
rwlock, but then the bucket cache line have a better sharing factor
among cpus, since we dirty it less often). For netstat or ss commands
that want a full scan of hash table, we perform fewer memory accesses.

Using a 'small' table of hashed rwlocks should be more than enough to
provide correct SMP concurrency between different buckets, without
using too much memory. Sizing of this table depends on
num_possible_cpus() and various CONFIG settings.

This patch provides some locking abstraction that may ease a future
work using a different model for TCP/DCCP table.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:11 -08:00
Eric Dumazet
47a31a6ffc [IPV4]: Use the {DEFINE|REF}_PROTO_INUSE infrastructure
Trivial patch to make "tcp,udp,udplite,raw" protocols uses the fast
"inuse sockets" infrastructure

Each protocol use then a static percpu var, instead of a dynamic one.
This saves some ram and some cpu cycles

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:58 -08:00
Jens Axboe
c46f2334c8 [SG] Get rid of __sg_mark_end()
sg_mark_end() overwrites the page_link information, but all users want
__sg_mark_end() behaviour where we just set the end bit. That is the most
natural way to use the sg list, since you'll fill it in and then mark the
end point.

So change sg_mark_end() to only set the termination bit. Add a sg_magic
debug check as well, and clear a chain pointer if it is set.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-02 08:47:06 +01:00
David S. Miller
51c739d1f4 [NET]: Fix incorrect sg_mark_end() calls.
This fixes scatterlist corruptions added by

	commit 68e3f5dd4d
	[CRYPTO] users: Fix up scatterlist conversion errors

The issue is that the code calls sg_mark_end() which clobbers the
sg_page() pointer of the final scatterlist entry.

The first part fo the fix makes skb_to_sgvec() do __sg_mark_end().

After considering all skb_to_sgvec() call sites the most correct
solution is to call __sg_mark_end() in skb_to_sgvec() since that is
what all of the callers would end up doing anyways.

I suspect this might have fixed some problems in virtio_net which is
the sole non-crypto user of skb_to_sgvec().

Other similar sg_mark_end() cases were converted over to
__sg_mark_end() as well.

Arguably sg_mark_end() is a poorly named function because it doesn't
just "mark", it clears out the page pointer as a side effect, which is
what led to these bugs in the first place.

The one remaining plain sg_mark_end() call is in scsi_alloc_sgtable()
and arguably it could be converted to __sg_mark_end() if only so that
we can delete this confusing interface from linux/scatterlist.h

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-30 21:29:29 -07:00
Matthias M. Dellweg
b0a713e9e6 [TCP] MD5: Remove some more unnecessary casting.
while reviewing the tcp_md5-related code further i came across with
another two of these casts which you probably have missed. I don't
actually think that they impose a problem by now, but as you said we
should remove them.

Signed-off-by: Matthias M. Dellweg <2500@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-29 22:37:27 -07:00
David S. Miller
c7da57a183 [TCP]: Fix scatterlist handling in MD5 signature support.
Use sg_init_table() and sg_mark_end() as needed.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 00:41:21 -07:00
Stephen Hemminger
227b60f510 [INET]: local port range robustness
Expansion of original idea from Denis V. Lunev <den@openvz.org>

Add robustness and locking to the local_port_range sysctl.
1. Enforce that low < high when setting.
2. Use seqlock to ensure atomic update.

The locking might seem like overkill, but there are
cases where sysadmin might want to change value in the
middle of a DoS attack.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 17:30:46 -07:00
Eric W. Biederman
457c4cbc5a [NET]: Make /proc/net per network namespace
This patch makes /proc/net per network namespace.  It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.

Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:06 -07:00
David S. Miller
f8ab18d2d9 [TCP]: Fix MD5 signature handling on big-endian.
Based upon a report and initial patch by Peter Lieven.

tcp4_md5sig_key and tcp6_md5sig_key need to start with
the exact same members as tcp_md5sig_key.  Because they
are both cast to that type by tcp_v{4,6}_md5_do_lookup().

Unfortunately tcp{4,6}_md5sig_key use a u16 for the key
length instead of a u8, which is what tcp_md5sig_key
uses.  This just so happens to work by accident on
little-endian, but on big-endian it doesn't.

Instead of casting, just place tcp_md5sig_key as the first member of
the address-family specific structures, adjust the access sites, and
kill off the ugly casts.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-09-28 15:18:35 -07:00
David S. Miller
3516ffb0fe [TCP]: Invoke tcp_sendmsg() directly, do not use inet_sendmsg().
As discovered by Evegniy Polyakov, if we try to sendmsg after
a connection reset, we can do incredibly stupid things.

The core issue is that inet_sendmsg() tries to autobind the
socket, but we should never do that for TCP.  Instead we should
just go straight into TCP's sendmsg() code which will do all
of the necessary state and pending socket error checks.

TCP's sendpage already directly vectors to tcp_sendpage(), so this
merely brings sendmsg() in line with that.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-08-02 19:42:28 -07:00
Herbert Xu
a7ab4b501f [TCPv4]: Improve BH latency in /proc/net/tcp
Currently the code for /proc/net/tcp disable BH while iterating
over the entire established hash table.  Even though we call
cond_resched_softirq for each entry, we still won't process
softirq's as regularly as we would otherwise do which results
in poor performance when the system is loaded near capacity.

This anomaly comes from the 2.4 code where this was all in a
single function and the local_bh_disable might have made sense
as a small optimisation.

The cost of each local_bh_disable is so small when compared
against the increased latency in keeping it disabled over a
large but mostly empty TCP established hash table that we
should just move it to the individual read_lock/read_unlock
calls as we do in inet_diag.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:06:20 -07:00
David S. Miller
3d7dbeac58 [TCP]: Disable TSO if MD5SIG is enabled.
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-12 14:36:42 -07:00
Patrick McHardy
f0e48dbfc5 [TCP]: Honour sk_bound_dev_if in tcp_v4_send_ack
A time_wait socket inherits sk_bound_dev_if from the original socket,
but it is not used when sending ACK packets using ip_send_reply.

Fix by passing the oif to ip_send_reply in struct ip_reply_arg and
use it for output routing.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-07 13:38:51 -07:00
Wei Dong
584bdf8cbd [IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP
Signed-off-by: Wei Dong <weidong@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-03 18:08:50 -07:00
Herbert Xu
604763722c [NET]: Treat CHECKSUM_PARTIAL as CHECKSUM_UNNECESSARY
When a transmitted packet is looped back directly, CHECKSUM_PARTIAL
maps to the semantics of CHECKSUM_UNNECESSARY.  Therefore we should
treat it as such in the stack.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:43 -07:00
Herbert Xu
663ead3bb8 [NET]: Use csum_start offset instead of skb_transport_header
The skb transport pointer is currently used to specify the start
of the checksum region for transmit checksum offload.  Unfortunately,
the same pointer is also used during receive side processing.

This creates a problem when we want to retransmit a received
packet with partial checksums since the skb transport pointer
would be overwritten.

This patch solves this problem by creating a new 16-bit csum_start
offset value to replace the skb transport header for the purpose
of checksums.  This offset is calculated from skb->head so that
it does not have to change when skb->data changes.

No extra space is required since csum_offset itself fits within
a 16-bit word so we can use the other 16 bits for csum_start.

For backwards compatibility, just before we push a packet with
partial checksums off into the device driver, we set the skb
transport header to what it would have been under the old scheme.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:40 -07:00
Eric Dumazet
4103f8cd5c [TCP]: tcp_memory_pressure and tcp_socket are__read_mostly candidates
tcp_memory_pressure and tcp_socket currently share a cache line with tcp_memory_allocated, tcp_sockets_allocated.
(Very hot cache line)
It makes sense to declare these variables as __read_mostly, to avoid false sharing on SMP.

ffffffff8081d9c0 B tcp_orphan_count
ffffffff8081d9c4 B tcp_memory_allocated
ffffffff8081d9c8 B tcp_sockets_allocated
ffffffff8081d9cc B tcp_memory_pressure
ffffffff8081d9d0 b tcp_md5sig_users
ffffffff8081d9d8 b tcp_md5sig_pool
ffffffff8081d9e0 b warntime.31570
ffffffff8081d9e8 b tcp_socket

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:19 -07:00
Arnaldo Carvalho de Melo
aa8223c7bb [SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:26 -07:00
Arnaldo Carvalho de Melo
ab6a5bb6b2 [TCP]: Introduce tcp_hdrlen() and tcp_optlen()
The ip_hdrlen() buddy, created to reduce the number of skb->h.th-> uses and to
avoid the longer, open coded equivalent.

Ditched a no-op in bnx2 in the process.

I wonder if we should have a BUG_ON(skb->h.th->doff < 5) in tcp_optlen()...

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:24 -07:00
Arnaldo Carvalho de Melo
88c7664f13 [SK_BUFF]: Introduce icmp_hdr(), remove skb->h.icmph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:23 -07:00
Arnaldo Carvalho de Melo
eddc9ec53b [SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:10 -07:00
David S. Miller
fe067e8ab5 [TCP]: Abstract out all write queue operations.
This allows the write queue implementation to be changed,
for example, to one which allows fast interval searching.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:02 -07:00
James Morris
9d729f72dc [NET]: Convert xtime.tv_sec to get_seconds()
Where appropriate, convert references to xtime.tv_sec to the
get_seconds() helper function.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:32 -07:00
Ilpo Järvinen
cf4c6bf83d [TCP]: struct *sock argument renamed: sp -> sk
In general, TCP code uses "sk" for struct sock pointer.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:20 -07:00
YOSHIFUJI Hideaki
e905a9edab [NET] IPV4: Fix whitespace errors.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-10 23:19:39 -08:00
Eric Dumazet
dbca9b2750 [NET]: change layout of ehash table
ehash table layout is currently this one :

First half of this table is used by sockets not in TIME_WAIT state
Second half of it is used by sockets in TIME_WAIT state.

This is non optimal because of for a given hash or socket, the two chain heads 
are located in separate cache lines.
Moreover the locks of the second half are never used.

If instead of this halving, we use two list heads in inet_ehash_bucket instead 
of only one, we probably can avoid one cache miss, and reduce ram usage, 
particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK, 
CONFIG_DEBUG_LOCK_ALLOC settings). So we still halves the table but we keep 
together related chains to speedup lookups and socket state change.

In this patch I did not try to align struct inet_ehash_bucket, but a future 
patch could try to make this structure have a convenient size (a power of two 
or a multiple of L1_CACHE_SIZE).
I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so 
maybe we dont need to scratch our heads to align the bucket...

Note : In case struct inet_ehash_bucket is not a power of two, we could 
probably change alloc_large_system_hash() (in case it use __get_free_pages()) 
to free the unused space. It currently allocates a big zone, but the last 
quarter of it could be freed. Again, this should be a temporary 'problem'.

Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-08 14:16:46 -08:00
David S. Miller
8eb9086f21 [IPV4/IPV6]: Always wait for IPSEC SA resolution in socket contexts.
Do this even for non-blocking sockets.  This avoids the silly -EAGAIN
that applications can see now, even for non-blocking sockets in some
cases (f.e. connect()).

With help from Venkat Tekkirala.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-08 12:38:45 -08:00
Frederik Deweerdt
ba7808eac1 [TCP]: remove tcp header from tcp_v4_check (take #2)
The tcphdr struct passed to tcp_v4_check is not used, the following
patch removes it from the parameter list.

This adds the netfilter modifications missing in the patch I sent
for rc3-mm1.

Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-08 12:38:44 -08:00
Craig Schlenter
cb48cfe807 [TCP]: Fix iov_len calculation in tcp_v4_send_ack().
This fixes the ftp stalls present in the current kernels.

All credit goes to Komuro <komurojun-mbn@nifty.com> for tracking
this down. The patch is untested but it looks *cough* obviously
correct.

Signed-off-by: Craig Schlenter <craig@codefountain.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-01-09 00:30:08 -08:00
Leigh Brown
a9fc00cca8 [TCP]: Trivial fix to message in tcp_v4_inbound_md5_hash
The message logged in tcp_v4_inbound_md5_hash when the hash was expected
but not found was reversed.

Signed-off-by: Leigh Brown <leigh@solinno.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-17 21:59:26 -08:00
Leigh Brown
8228a18dd3 [TCP]: Fix oops caused by tcp_v4_md5_do_del
md5sig_info.alloced4 must be set to zero when freeing keys4, otherwise
it will not be alloc'd again when another key is added to the same
socket by tcp_v4_md5_do_add.

Signed-off-by: Leigh Brown <leigh@solinno.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-17 21:59:25 -08:00
Andrew Morton
b6332e6cf9 [TCP]: Fix warnings with TCP_MD5SIG disabled.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:31:52 -08:00
Adrian Bunk
f5b99bcddd [NET]: Possible cleanups.
This patch contains the following possible cleanups:
- make the following needlessly global functions statis:
  - ipv4/tcp.c: __tcp_alloc_md5sig_pool()
  - ipv4/tcp_ipv4.c: tcp_v4_reqsk_md5_lookup()
  - ipv4/udplite.c: udplite_rcv()
  - ipv4/udplite.c: udplite_err()
- make the following needlessly global structs static:
  - ipv4/tcp_ipv4.c: tcp_request_sock_ipv4_ops
  - ipv4/tcp_ipv4.c: tcp_sock_ipv4_specific
  - ipv6/tcp_ipv6.c: tcp_request_sock_ipv6_ops
- net/ipv{4,6}/udplite.c: remove inline's from static functions
                          (gcc should know best when to inline them)

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:31:51 -08:00
David S. Miller
08dd1a506b [TCP] MD5SIG: Kill CONFIG_TCP_MD5SIG_DEBUG.
It just obfuscates the code and adds limited value.  And as Adrian
Bunk noticed, it lacked Kconfig help text too, so just kill it.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:31:47 -08:00
Al Viro
ff1dcadb1b [NET]: Split skb->csum
... into anonymous union of __wsum and __u32 (csum and csum_offset resp.)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:27:18 -08:00
Al Viro
8e5200f540 [NET]: Fix assorted misannotations (from md5 and udplite merges).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:27:16 -08:00
Arnaldo Carvalho de Melo
f6685938f9 [TCP_IPV4]: Use kmemdup where appropriate
Also use a variable to avoid the longish tp->md5sig_info-> use
in tcp_v4_md5_do_add.

Code diff stats:

[acme@newtoy net-2.6.20]$ codiff /tmp/tcp_ipv4.o.before /tmp/tcp_ipv4.o.after
/pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv4/tcp_ipv4.c:
  tcp_v4_md5_do_add     |  -62
  tcp_v4_syn_recv_sock  |  -32
  tcp_v4_parse_md5_keys |  -86
 3 functions changed, 180 bytes removed
[acme@newtoy net-2.6.20]$

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2006-12-02 21:23:54 -08:00
Arnaldo Carvalho de Melo
7174259e6c [TCP_IPV4]: CodingStyle cleanups, no code change
Mostly related to CONFIG_TCP_MD5SIG recent merge.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2006-12-02 21:23:53 -08:00
Al Viro
b51655b958 [NET]: Annotate __skb_checksum_complete() and friends.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:23:38 -08:00
Al Viro
714e85be35 [IPV6]: Assorted trivial endianness annotations.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:50 -08:00
YOSHIFUJI Hideaki
cfb6eeb4c8 [TCP]: MD5 Signature Option (RFC2385) support.
Based on implementation by Rick Payne.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:39 -08:00
Gerrit Renker
b9df3cb8cf [TCP/DCCP]: Introduce net_xmit_eval
Throughout the TCP/DCCP (and tunnelling) code, it often happens that the
return code of a transmit function needs to be tested against NET_XMIT_CN
which is a value that does not indicate a strict error condition.

This patch uses a macro for these recurring situations which is consistent
with the already existing macro net_xmit_errno, saving on duplicated code.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2006-12-02 21:22:27 -08:00
Gerrit Renker
a94f723d59 [TCP]: Remove dead code in init_sequence
This removes two redundancies:

1) The test (skb->protocol == htons(ETH_P_IPV6) in tcp_v6_init_sequence()
   is always true, due to
	* tcp_v6_conn_request() is the only function calling this one
	* tcp_v6_conn_request() redirects all skb's with ETH_P_IP protocol to
	  tcp_v4_conn_request() [ cf. top of tcp_v6_conn_request()]

2) The first argument, `struct sock *sk' of tcp_v{4,6}_init_sequence() is
   never used.

Signed-off-by: Gerrit Renker  <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:10 -08:00
Eric Dumazet
72a3effaf6 [NET]: Size listen hash tables using backlog hint
We currently allocate a fixed size (TCP_SYNQ_HSIZE=512) slots hash table for
each LISTEN socket, regardless of various parameters (listen backlog for
example)

On x86_64, this means order-1 allocations (might fail), even for 'small'
sockets, expecting few connections. On the contrary, a huge server wanting a
backlog of 50000 is slowed down a bit because of this fixed limit.

This patch makes the sizing of listen hash table a dynamic parameter,
depending of :
- net.core.somaxconn tunable (default is 128)
- net.ipv4.tcp_max_syn_backlog tunable (default : 256, 1024 or 128)
- backlog value given by user application  (2nd parameter of listen())

For large allocations (bigger than PAGE_SIZE), we use vmalloc() instead of
kmalloc().

We still limit memory allocation with the two existing tunables (somaxconn &
tcp_max_syn_backlog). So for standard setups, this patch actually reduce RAM
usage.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:21:44 -08:00
Eric Dumazet
06ca719fad [TCP]: One NET_INC_STATS() could be NET_INC_STATS_BH in tcp_v4_err()
I believe this NET_INC_STATS() call can be replaced by
NET_INC_STATS_BH(), a little bit cheaper.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-10-20 00:22:25 -07:00
YOSHIFUJI Hideaki
9469c7b4aa [NET]: Use typesafe inet_twsk() inline function instead of cast.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-10-11 23:59:58 -07:00
YOSHIFUJI Hideaki
4244f8a9f8 [TCP]: Use TCPOLEN_TSTAMP_ALIGNED macro instead of magic number.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-10-11 23:59:54 -07:00
Al Viro
23f33c2d4f [IPV4]: struct inet_timewait_sock annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-28 18:02:27 -07:00
Al Viro
adaf345b53 [IPV4]: annotate address in inet_request_sock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-28 18:01:51 -07:00
Al Viro
bada8adc4e [IPV4]: ip_route_connect() ipv4 address arguments annotated
annotated address arguments (port number left alone for now); ditto
for inferred net-endian variables in callers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-28 17:54:06 -07:00
Dmitry Mishin
fda9ef5d67 [NET]: Fix sk->sk_filter field access
Function sk_filter() is called from tcp_v{4,6}_rcv() functions with arg
needlock = 0, while socket is not locked at that moment. In order to avoid
this and similar issues in the future, use rcu for sk->sk_filter field read
protection.

Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
2006-09-22 15:18:47 -07:00
Brian Haley
ab32ea5d8a [NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly
Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly.

Couldn't actually measure any performance increase while testing (.3%
I consider noise), but seems like the right thing to do.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 14:55:03 -07:00
Herbert Xu
8f491069b4 [IPV4]: Use network-order dport for all visible inet_lookup_*
Right now most inet_lookup_* functions take a host-order hnum instead
of a network-order dport because that's how it is represented
internally.

This means that users of these functions have to be careful about
using the right byte-order.  To add more confusion, inet_lookup takes
a network-order dport unlike all other functions.

So this patch changes all visible inet_lookup functions to take a
dport and move all dport->hnum conversion inside them.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 14:54:14 -07:00
Patrick McHardy
84fa7933a3 [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE
Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose
checksum still needs to be completed) and CHECKSUM_COMPLETE (for
incoming packets, device supplied full checksum).

Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 14:53:53 -07:00
Venkat Yekkirala
4237c75c0a [MLSXFRM]: Auto-labeling of child sockets
This automatically labels the TCP, Unix stream, and dccp child sockets
as well as openreqs to be at the same MLS level as the peer. This will
result in the selection of appropriately labeled IPSec Security
Associations.

This also uses the sock's sid (as opposed to the isec sid) in SELinux
enforcement of secmark in rcv_skb and postroute_last hooks.

Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 14:53:29 -07:00
Wei Yongjun
3687b1dc6f [TCP]: SNMPv2 tcpAttemptFails counter error
Refer to RFC2012, tcpAttemptFails is defined as following:
  tcpAttemptFails OBJECT-TYPE
      SYNTAX      Counter32
      MAX-ACCESS  read-only
      STATUS      current
      DESCRIPTION
              "The number of times TCP connections have made a direct
              transition to the CLOSED state from either the SYN-SENT
              state or the SYN-RCVD state, plus the number of times TCP
              connections have made a direct transition to the LISTEN
              state from the SYN-RCVD state."
      ::= { tcp 7 }

When I lookup into RFC793, I found that the state change should occured
under following condition:
  1. SYN-SENT -> CLOSED
     a) Received ACK,RST segment when SYN-SENT state.

  2. SYN-RCVD -> CLOSED
     b) Received SYN segment when SYN-RCVD state(came from LISTEN).
     c) Received RST segment when SYN-RCVD state(came from SYN-SENT).
     d) Received SYN segment when SYN-RCVD state(came from SYN-SENT).

  3. SYN-RCVD -> LISTEN
     e) Received RST segment when SYN-RCVD state(came from LISTEN).

In my test, those direct state transition can not be counted to
tcpAttemptFails.

Signed-off-by: Wei Yongjun <yjwei@nanjing-fnst.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-02 13:38:19 -07:00
Panagiotis Issaris
0da974f4f3 [NET]: Conversions from kmalloc+memset to k(z|c)alloc.
Signed-off-by: Panagiotis Issaris <takis@issaris.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-21 14:51:30 -07:00
Herbert Xu
a430a43d08 [NET] gso: Fix up GSO packets with broken checksums
Certain subsystems in the stack (e.g., netfilter) can break the partial
checksum on GSO packets.  Until they're fixed, this patch allows this to
work by recomputing the partial checksums through the GSO mechanism.

Once they've all been converted to update the partial checksum instead of
clearing it, this workaround can be removed.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-08 13:34:56 -07:00
Ingo Molnar
c636618485 [PATCH] lockdep: annotate bh_lock_sock()
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:08 -07:00
Ingo Molnar
e4d9191885 [PATCH] lockdep: locking init debugging improvement
Locking init improvement:

 - introduce and use __SPIN_LOCK_UNLOCKED for array initializations,
   to pass in the name string of locks, used by debugging

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:02 -07:00
Linus Torvalds
e37a72de84 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [IPV6]: Added GSO support for TCPv6
  [NET]: Generalise TSO-specific bits from skb_setup_caps
  [IPV6]: Added GSO support for TCPv6
  [IPV6]: Remove redundant length check on input
  [NETFILTER]: SCTP conntrack: fix crash triggered by packet without chunks
  [TG3]: Update version and reldate
  [TG3]: Add TSO workaround using GSO
  [TG3]: Turn on hw fix for ASF problems
  [TG3]: Add rx BD workaround
  [TG3]: Add tg3_netif_stop() in vlan functions
  [TCP]: Reset gso_segs if packet is dodgy
2006-06-30 15:40:17 -07:00
Herbert Xu
bcd7611117 [NET]: Generalise TSO-specific bits from skb_setup_caps
This patch generalises the TSO-specific bits from sk_setup_caps by adding
the sk_gso_type member to struct sock.  This makes sk_setup_caps generic
so that it can be used by TCPv6 or UFO.

The only catch is that whoever uses this must provide a GSO implementation
for their protocol which I think is a fair deal :) For now UFO continues to
live without a GSO implementation which is OK since it doesn't use the sock
caps field at the moment.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-30 14:12:08 -07:00
Jörn Engel
6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Sridhar Samudrala
47da8ee681 [TCP]: Export accept queue len of a TCP listening socket via rx_queue
While debugging a TCP server hang issue, we noticed that currently there is
no way for a user to get the acceptq backlog value for a TCP listen socket.

All the standard networking utilities that display socket info like netstat,
ss and /proc/net/tcp have 2 fields called rx_queue and tx_queue. These
fields do not mean much for listening sockets. This patch uses one of these
unused fields(rx_queue) to export the accept queue len for listening sockets.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:57:57 -07:00
Chris Leech
1a2449a87b [I/OAT]: TCP recv offload to I/OAT
Locks down user pages and sets up for DMA in tcp_recvmsg, then calls
dma_async_try_early_copy in tcp_v4_do_rcv

Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:25:56 -07:00
Adrian Bunk
6c97e72a16 [IPV4]: Possible cleanups.
This patch contains the following possible cleanups:
- make the following needlessly global function static:
  - arp.c: arp_rcv()
- remove the following unused EXPORT_SYMBOL's:
  - devinet.c: devinet_ioctl
  - fib_frontend.c: ip_rt_ioctl
  - inet_hashtables.c: inet_bind_bucket_create
  - inet_hashtables.c: inet_bind_hash
  - tcp_input.c: sysctl_tcp_abc
  - tcp_ipv4.c: sysctl_tcp_tw_reuse
  - tcp_output.c: sysctl_tcp_mtu_probing
  - tcp_output.c: sysctl_tcp_base_mss

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-14 15:00:20 -07:00
Arnaldo Carvalho de Melo
543d9cfeec [NET]: Identation & other cleanups related to compat_[gs]etsockopt cset
No code changes, just tidying up, in some cases moving EXPORT_SYMBOLs
to just after the function exported, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:48:35 -08:00
Dmitry Mishin
3fdadf7d27 [NET]: {get|set}sockopt compatibility layer
This patch extends {get|set}sockopt compatibility layer in order to
move protocol specific parts to their place and avoid huge universal
net/compat.c file in the future.

Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:45:21 -08:00
Arnaldo Carvalho de Melo
c4d9390941 [ICSK]: Introduce inet_csk_ctl_sock_create
Consolidating open coded sequences in tcp and dccp, v4 and v6.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:01:03 -08:00
John Heffner
5d424d5a67 [TCP]: MTU probing
Implementation of packetization layer path mtu discovery for TCP, based on
the internet-draft currently found at
<http://www.ietf.org/internet-drafts/draft-ietf-pmtud-method-05.txt>.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 17:53:41 -08:00
Sam Ravnborg
f9d9516db7 [NET]: Do not export inet_bind_bucket_create twice.
inet_bind_bucket_create was exported twice.  Keep the export in the
file where inet_bind_bucket_create is defined.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-31 17:47:02 -08:00
Patrick McHardy
5d39a795bf [IPV4]: Always set fl.proto in ip_route_newports
ip_route_newports uses the struct flowi from the struct rtable returned
by ip_route_connect for the new route lookup and just replaces the port
numbers if they have changed. If an IPsec policy exists which doesn't match
port 0 the struct flowi won't have the proto field set and no xfrm lookup
is done for the changed ports.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-31 17:35:35 -08:00
Patrick McHardy
b59c270104 [NETFILTER]: Keep conntrack reference until IPsec policy checks are done
Keep the conntrack reference until policy checks have been performed for
IPsec NAT support. The reference needs to be dropped before a packet is
queued to avoid having the conntrack module unloadable.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:36 -08:00
Arnaldo Carvalho de Melo
80e40daa47 [TCP]: syn_flood_warning is only needed if CONFIG_SYN_COOKIES is selected
CC      net/ipv4/tcp_ipv4.o
  /pub/scm/linux/kernel/git/acme/net-2.6/net/ipv4/tcp_ipv4.c:665: warning:
  'syn_flood_warning' defined but not used

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2006-01-04 01:58:06 -02:00
Stephen Hemminger
40efc6fa17 [TCP]: less inline's
TCP inline usage cleanup:
 * get rid of inline in several places
 * replace __inline__ with inline where possible
 * move functions used in one file out of tcp.h
 * let compiler decide on used once cases

On x86_64: 
   text	   data	    bss	    dec	    hex	filename
3594701	 648348	 567400	4810449	 4966d1	vmlinux.orig
3593133	 648580	 567400	4809113	 496199	vmlinux

On sparc64:
   text	   data	    bss	    dec	    hex	filename
2538278	 406152	 530392	3474822	 350586	vmlinux.ORIG
2536382	 406384	 530392	3473158	 34ff06	vmlinux

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 16:03:49 -08:00
Arnaldo Carvalho de Melo
d83d8461f9 [IP_SOCKGLUE]: Remove most of the tcp specific calls
As DCCP needs to be called in the same spots.

Now we have a member in inet_sock (is_icsk), set at sock creation time from
struct inet_protosw->flags (if INET_PROTOSW_ICSK is set, like for TCP and
DCCP) to see if a struct sock instance is a inet_connection_sock for places
like the ones in ip_sockglue.c (v4 and v6) where we previously were looking if
sk_type was SOCK_STREAM, that is insufficient because we now use the same code
for DCCP, that has sk_type SOCK_DCCP.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:58 -08:00
Arnaldo Carvalho de Melo
a7f5e7f164 [INET]: Generalise tcp_v4_hash_connect
Renaming it to inet_hash_connect, making it possible to ditch
dccp_v4_hash_connect and share the same code with TCP instead.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:55 -08:00
Arnaldo Carvalho de Melo
6d6ee43e0b [TWSK]: Introduce struct timewait_sock_ops
So that we can share several timewait sockets related functions and
make the timewait mini sockets infrastructure closer to the request
mini sockets one.

Next changesets will take advantage of this, moving more code out of
TCP and DCCP v4 and v6 to common infrastructure.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:54 -08:00
Arnaldo Carvalho de Melo
af05dc9394 [ICSK]: Move v4_addr2sockaddr from TCP to icsk
Renaming it to inet_csk_addr2sockaddr.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:39 -08:00
Arnaldo Carvalho de Melo
8292a17a39 [ICSK]: Rename struct tcp_func to struct inet_connection_sock_af_ops
And move it to struct inet_connection_sock. DCCP will use it in the
upcoming changesets.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:38 -08:00
Arnaldo Carvalho de Melo
971af18bbf [IPV6]: Reuse inet_csk_get_port in tcp_v6_get_port
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:33 -08:00
Stephen Hemminger
caa20d9abe [TCP]: spelling fixes
Minor spelling fixes for TCP code.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:13:47 -08:00
Herbert Xu
fb286bb299 [NET]: Detect hardware rx checksum faults correctly
Here is the patch that introduces the generic skb_checksum_complete
which also checks for hardware RX checksum faults.  If that happens,
it'll call netdev_rx_csum_fault which currently prints out a stack
trace with the device name.  In future it can turn off RX checksum.

I've converted every spot under net/ that does RX checksum checks to
use skb_checksum_complete or __skb_checksum_complete with the
exceptions of:

* Those places where checksums are done bit by bit.  These will call
netdev_rx_csum_fault directly.

* The following have not been completely checked/converted:

ipmr
ip_vs
netfilter
dccp

This patch is based on patches and suggestions from Stephen Hemminger
and David S. Miller.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 13:01:24 -08:00
Jesper Juhl
a51482bde2 [NET]: kfree cleanup
From: Jesper Juhl <jesper.juhl@gmail.com>

This is the net/ part of the big kfree cleanup patch.

Remove pointless checks for NULL prior to calling kfree() in net/.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2005-11-08 09:41:34 -08:00
Stephen Hemminger
6df716340d [TCP/DCCP]: Randomize port selection
This patch randomizes the port selected on bind() for connections
to help with possible security attacks. It should also be faster
in most cases because there is no need for a global lock.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-05 21:23:15 -02:00
Eric Dumazet
81c3d5470e [INET]: speedup inet (tcp/dccp) lookups
Arnaldo and I agreed it could be applied now, because I have other
pending patches depending on this one (Thank you Arnaldo)

(The other important patch moves skc_refcnt in a separate cache line,
so that the SMP/NUMA performance doesnt suffer from cache line ping pongs)

1) First some performance data :
--------------------------------

tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established()

The most time critical code is :

sk_for_each(sk, node, &head->chain) {
     if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
         goto hit; /* You sunk my battleship! */
}

The sk_for_each() does use prefetch() hints but only the begining of
"struct sock" is prefetched.

As INET_MATCH first comparison uses inet_sk(__sk)->daddr, wich is far
away from the begining of "struct sock", it has to bring into CPU
cache cold cache line. Each iteration has to use at least 2 cache
lines.

This can be problematic if some chains are very long.

2) The goal
-----------

The idea I had is to change things so that INET_MATCH() may return
FALSE in 99% of cases only using the data already in the CPU cache,
using one cache line per iteration.

3) Description of the patch
---------------------------

Adds a new 'unsigned int skc_hash' field in 'struct sock_common',
filling a 32 bits hole on 64 bits platform.

struct sock_common {
	unsigned short		skc_family;
	volatile unsigned char	skc_state;
	unsigned char		skc_reuse;
	int			skc_bound_dev_if;
	struct hlist_node	skc_node;
	struct hlist_node	skc_bind_node;
	atomic_t		skc_refcnt;
+	unsigned int		skc_hash;
	struct proto		*skc_prot;
};

Store in this 32 bits field the full hash, not masked by (ehash_size -
1) Using this full hash as the first comparison done in INET_MATCH
permits us immediatly skip the element without touching a second cache
line in case of a miss.

Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to
sk_hash and tw_hash) already contains the slot number if we mask with
(ehash_size - 1)

File include/net/inet_hashtables.h

64 bits platforms :
#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
     (((__sk)->sk_hash == (__hash))
     ((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie))   &&  \
     ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports))   &&  \
     (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))

32bits platforms:
#define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
     (((__sk)->sk_hash == (__hash))                 &&  \
     (inet_sk(__sk)->daddr          == (__saddr))   &&  \
     (inet_sk(__sk)->rcv_saddr      == (__daddr))   &&  \
     (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))


- Adds a prefetch(head->chain.first) in 
__inet_lookup_established()/__tcp_v4_check_established() and 
__inet6_lookup_established()/__tcp_v6_check_established() and 
__dccp_v4_check_established() to bring into cache the first element of the 
list, before the {read|write}_lock(&head->lock);

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03 14:13:38 -07:00
Arnaldo Carvalho de Melo
20380731bc [NET]: Fix sparse warnings
Of this type, mostly:

CHECK   net/ipv6/netfilter.c
net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:01:32 -07:00
Arnaldo Carvalho de Melo
6687e988d9 [ICSK]: Move TCP congestion avoidance members to icsk
This changeset basically moves tcp_sk()->{ca_ops,ca_state,etc} to inet_csk(),
minimal renaming/moving done in this changeset to ease review.

Most of it is just changes of struct tcp_sock * to struct sock * parameters.

With this we move to a state closer to two interesting goals:

1. Generalisation of net/ipv4/tcp_diag.c, becoming inet_diag.c, being used
   for any INET transport protocol that has struct inet_hashinfo and are
   derived from struct inet_connection_sock. Keeps the userspace API, that will
   just not display DCCP sockets, while newer versions of tools can support
   DCCP.

2. INET generic transport pluggable Congestion Avoidance infrastructure, using
   the current TCP CA infrastructure with DCCP.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:56:18 -07:00
Patrick McHardy
64ce207306 [NET]: Make NETDEBUG pure printk wrappers
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:56:08 -07:00
Arnaldo Carvalho de Melo
295ff7edb8 [TIMEWAIT]: Introduce inet_timewait_death_row
That groups all of the tables and variables associated to the TCP timewait
schedulling/recycling/killing code, that now can be isolated from the TCP
specific code and used by other transport protocols, such as DCCP.

Next changeset will move this code to net/ipv4/inet_timewait_sock.c

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:55:48 -07:00
Arnaldo Carvalho de Melo
0a5578cf8e [ICSK]: Generalise tcp_listen_{start,stop}
This also moved inet_iif from tcp to inet_hashtables.h, as it is
needed by the inet_lookup callers, perhaps this needs a bit of
polishing, but for now seems fine.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:49:24 -07:00
Arnaldo Carvalho de Melo
3f421baa47 [NET]: Just move the inet_connection_sock function from tcp sources
Completing the previous changeset, this also generalises tcp_v4_synq_add,
renaming it to inet_csk_reqsk_queue_hash_add, already geing used in the
DCCP tree, which I plan to merge RSN.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:49:14 -07:00
Arnaldo Carvalho de Melo
463c84b97f [NET]: Introduce inet_connection_sock
This creates struct inet_connection_sock, moving members out of struct
tcp_sock that are shareable with other INET connection oriented
protocols, such as DCCP, that in my private tree already uses most of
these members.

The functions that operate on these members were renamed, using a
inet_csk_ prefix while not being moved yet to a new file, so as to
ease the review of these changes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:43:19 -07:00
Arnaldo Carvalho de Melo
e48c414ee6 [INET]: Generalise the TCP sock ID lookup routines
And also some TIME_WAIT functions.

[acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size
/tmp/before.size: 282955   13122    9312  305389   4a8ed net/ipv4/built-in.o
/tmp/after.size:  281566   13122    9312  304000   4a380 net/ipv4/built-in.o
[acme@toy net-2.6.14]$

I kept them still inlined, will uninline at some point to see what
would be the performance difference.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:42:18 -07:00
Arnaldo Carvalho de Melo
8feaf0c0a5 [INET]: Generalise tcp_tw_bucket, aka TIME_WAIT sockets
This paves the way to generalise the rest of the sock ID lookup
routines and saves some bytes in TCPv4 TIME_WAIT sockets on distro
kernels (where IPv6 is always built as a module):

[root@qemu ~]# grep tw_sock /proc/slabinfo
tw_sock_TCPv6  0  0  128  31  1
tw_sock_TCP    0  0   96  41  1
[root@qemu ~]#

Now if a protocol wants to use the TIME_WAIT generic infrastructure it
only has to set the sk_prot->twsk_obj_size field with the size of its
inet_timewait_sock derived sock and proto_register will create
sk_prot->twsk_slab, for now its only for INET sockets, but we can
introduce timewait_sock later if some non INET transport protocolo
wants to use this stuff.

Next changesets will take advantage of this new infrastructure to
generalise even more TCP code.

[acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size
/tmp/before.size: 188646   11764    5068  205478   322a6 net/ipv4/built-in.o
/tmp/after.size:  188144   11764    5068  204976   320b0 net/ipv4/built-in.o
[acme@toy net-2.6.14]$

Tested with both IPv4 & IPv6 (::1 (localhost) & ::ffff:172.20.0.1
(qemu host)).

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:42:13 -07:00
Arnaldo Carvalho de Melo
33b6223190 [INET]: Generalise tcp_v4_lookup_listener
[acme@toy net-2.6.14]$ grep built-in /tmp/before /tmp/after
/tmp/before: 282560       13122    9312  304994   4a762 net/ipv4/built-in.o
/tmp/after:  282560       13122    9312  304994   4a762 net/ipv4/built-in.o

Will be used in DCCP, not exporting it right now not to get in Adrian
Bunk's exported-but-not-used-on-modules radar 8)

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:42:08 -07:00
Arnaldo Carvalho de Melo
81849d106b [INET]: Generalise tcp_v4_hash & tcp_unhash
It really just makes the existing code be a helper function that
tcp_v4_hash and tcp_unhash uses, specifying the right inet_hashinfo,
tcp_hashinfo.

One thing I'll investigate at some point is to have the inet_hashinfo
pointer in sk_prot, so that we get all the hashtable information from
the sk pointer, this can lead to some extra indirections that may well
hurt performance/code size, we'll see. Ultimate idea would be that
sk_prot would provide _all_ the information about a protocol
implementation.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:42:02 -07:00
Arnaldo Carvalho de Melo
f3f05f7046 [INET]: Generalise the tcp_listen_ lock routines
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:41:49 -07:00
Arnaldo Carvalho de Melo
6e04e02165 [INET]: Move tcp_port_rover to inet_hashinfo
Also expose all of the tcp_hashinfo members, i.e. killing those
tcp_ehash, etc macros, this will more clearly expose already generic
functions and some that need just a bit of work to become generic, as
we'll see in the upcoming changesets.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:41:44 -07:00
Arnaldo Carvalho de Melo
2d8c4ce519 [INET]: Generalise tcp_bind_hash & tcp_inherit_port
This required moving tcp_bucket_cachep to inet_hashinfo.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:40:29 -07:00
Arnaldo Carvalho de Melo
a55ebcc4c4 [INET]: Move bind_hash from tcp_sk to inet_sk
This should really be in a inet_connection_sock, but I'm leaving it
for a later optimization, when some more fields common to INET
transport protocols now in tcp_sk or inet_sk will be chunked out into
inet_connection_sock, for now its better to concentrate on getting the
changes in the core merged to leave the DCCP tree with only DCCP
specific code.

Next changesets will take advantage of this move to generalise things
like tcp_bind_hash, tcp_put_port, tcp_inherit_port, making the later
receive a inet_hashinfo parameter, and even __tcp_tw_hashdance, etc in
the future, when tcp_tw_bucket gets transformed into the struct
timewait_sock hierarchy.

tcp_destroy_sock also is eligible as soon as tcp_orphan_count gets
moved to sk_prot.

A cascade of incremental changes will ultimately make the tcp_lookup
functions be fully generic.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:38:48 -07:00
Arnaldo Carvalho de Melo
77d8bf9c62 [INET]: Move the TCP hashtable functions/structs to inet_hashtables.[ch]
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:38:39 -07:00
Arnaldo Carvalho de Melo
0f7ff9274e [INET]: Just rename the TCP hashtable functions/structs to inet_
This is to break down the complexity of the series of patches,
making it very clear that this one just does:

1. renames tcp_ prefixed hashtable functions and data structures that
   were already mostly generic to inet_ to share it with DCCP and
   other INET transport protocols.

2. Removes not used functions (__tb_head & tb_head)

3. Removes some leftover prototypes in the headers (tcp_bucket_unlock &
   tcp_v4_build_header)

Next changesets will move tcp_sk(sk)->bind_hash to inet_sock so that we can
make functions such as tcp_inherit_port, __tcp_inherit_port, tcp_v4_get_port,
__tcp_put_port,  generic and get others like tcp_destroy_sock closer to generic
(tcp_orphan_count will go to sk->sk_prot to allow this).

Eventually most of these functions will be used passing the transport protocol
inet_hashinfo structure.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:38:32 -07:00
Arnaldo Carvalho de Melo
304a16180f [INET]: Move the TCP ehash functions to include/net/inet_hashtables.h
To be shared with DCCP (and others), this is the start of a series of patches
that will expose the already generic TCP hash table routines.

The few changes noticed when calling gcc -S before/after on a pentium4 were of
this type:

        movl    40(%esp), %edx
        cmpl    %esi, 472(%edx)
        je      .L168
-       pushl   $291
+       pushl   $272
        pushl   $.LC0
        pushl   $.LC1
        pushl   $.LC2

[acme@toy net-2.6.14]$ size net/ipv4/tcp_ipv4.before.o net/ipv4/tcp_ipv4.after.o
   text    data     bss     dec     hex filename
  17804     516     140   18460    481c net/ipv4/tcp_ipv4.before.o
  17804     516     140   18460    481c net/ipv4/tcp_ipv4.after.o

Holler if some weird architecture has issues with things like this 8)

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:38:22 -07:00
Arnaldo Carvalho de Melo
32519f11d3 [INET]: Introduce inet_sk_rebuild_header
From tcp_v4_rebuild_header, that already was pretty generic, I only
needed to use sk->sk_protocol instead of the hardcoded IPPROTO_TCP and
establish the requirement that INET transport layer protocols that
want to use this function map TCP_SYN_SENT to its equivalent state.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:37:55 -07:00
Arnaldo Carvalho de Melo
6cbb0df788 [SOCK]: Introduce sk_setup_caps
From tcp_v4_setup_caps, that always is preceded by a call to
__sk_dst_set, so coalesce this sequence into sk_setup_caps, removing
one call to a TCP function in the IP layer.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:37:48 -07:00
Arnaldo Carvalho de Melo
614c6cb4f2 [SOCK]: Rename __tcp_v4_rehash to __sk_prot_rehash
This operation was already generic and DCCP will use it.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:37:42 -07:00
David S. Miller
d5d283751e [TCP]: Document non-trivial locking path in tcp_v{4,6}_get_port().
This trips up a lot of folks reading this code.
Put an unlikely() around the port-exhaustion test
for good measure.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-23 10:49:54 -07:00
Heikki Orsila
ca9334523c [IPV4]: Debug cleanup
Here's a small patch to cleanup NETDEBUG() use in net/ipv4/ for Linux 
kernel 2.6.13-rc5. Also weird use of indentation is changed in some
places.

Signed-off-by: Heikki Orsila <heikki.orsila@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-08 14:26:52 -07:00
David S. Miller
c1b4a7e695 [TCP]: Move to new TSO segmenting scheme.
Make TSO segment transmit size decisions at send time not earlier.

The basic scheme is that we try to build as large a TSO frame as
possible when pulling in the user data, but the size of the TSO frame
output to the card is determined at transmit time.

This is guided by tp->xmit_size_goal.  It is always set to a multiple
of MSS and tells sendmsg/sendpage how large an SKB to try and build.

Later, tcp_write_xmit() and tcp_push_one() chop up the packet if
necessary and conditions warrant.  These routines can also decide to
"defer" in order to wait for more ACKs to arrive and thus allow larger
TSO frames to be emitted.

A general observation is that TSO elongates the pipe, thus requiring a
larger congestion window and larger buffering especially at the sender
side.  Therefore, it is important that applications 1) get a large
enough socket send buffer (this is accomplished by our dynamic send
buffer expansion code) 2) do large enough writes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:24:38 -07:00
Stephen Hemminger
5f8ef48d24 [TCP]: Allow choosing TCP congestion control via sockopt.
Allow using setsockopt to set TCP congestion control to use on a per
socket basis.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 20:37:36 -07:00
Stephen Hemminger
317a76f9a4 [TCP]: Add pluggable congestion control algorithm infrastructure.
Allow TCP to have multiple pluggable congestion control algorithms.
Algorithms are defined by a set of operations and can be built in
or modules.  The legacy "new RENO" algorithm is used as a starting
point and fallback.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:19:55 -07:00
David S. Miller
e52c1f17e4 [NET]: Move sysctl_max_syn_backlog into request_sock.c
This fixes the CONFIG_INET=n build failure noticed
by Andrew Morton.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:49:40 -07:00
Arnaldo Carvalho de Melo
2ad69c55a2 [NET] rename struct tcp_listen_opt to struct listen_sock
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:48:55 -07:00
Arnaldo Carvalho de Melo
0e87506fcc [NET] Generalise tcp_listen_opt
This chunks out the accept_queue and tcp_listen_opt code and moves
them to net/core/request_sock.c and include/net/request_sock.h, to
make it useful for other transport protocols, DCCP being the first one
to use it.

Next patches will rename tcp_listen_opt to accept_sock and remove the
inline tcp functions that just call a reqsk_queue_ function.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:47:59 -07:00
Arnaldo Carvalho de Melo
60236fdd08 [NET] Rename open_request to request_sock
Ok, this one just renames some stuff to have a better namespace and to
dissassociate it from TCP:

struct open_request  -> struct request_sock
tcp_openreq_alloc    -> reqsk_alloc
tcp_openreq_free     -> reqsk_free
tcp_openreq_fastfree -> __reqsk_free

With this most of the infrastructure closely resembles a struct
sock methods subset.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:47:21 -07:00
Arnaldo Carvalho de Melo
2e6599cb89 [NET] Generalise TCP's struct open_request minisock infrastructure
Kept this first changeset minimal, without changing existing names to
ease peer review.

Basicaly tcp_openreq_alloc now receives the or_calltable, that in turn
has two new members:

->slab, that replaces tcp_openreq_cachep
->obj_size, to inform the size of the openreq descendant for
  a specific protocol

The protocol specific fields in struct open_request were moved to a
class hierarchy, with the things that are common to all connection
oriented PF_INET protocols in struct inet_request_sock, the TCP ones
in tcp_request_sock, that is an inet_request_sock, that is an
open_request.

I.e. this uses the same approach used for the struct sock class
hierarchy, with sk_prot indicating if the protocol wants to use the
open_request infrastructure by filling in sk_prot->rsk_prot with an
or_calltable.

Results? Performance is improved and TCP v4 now uses only 64 bytes per
open request minisock, down from 96 without this patch :-)

Next changeset will rename some of the structs, fields and functions
mentioned above, struct or_calltable is way unclear, better name it
struct request_sock_ops, s/struct open_request/struct request_sock/g,
etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:46:52 -07:00
Folkert van Heusden
0b2531bdc5 [TCP]: Optimize check in port-allocation code.
Signed-off-by: Folkert van Heusden <folkert@vanheusden.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03 14:36:08 -07:00
Linus Torvalds
1da177e4c3 Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
2005-04-16 15:20:36 -07:00