Currently, the SMC experimental TCP option in a SYN packet is lost on
the server side when SYN Cookies are active. However, the corresponding
SYNACK sent back to the client contains the SMC option. This causes an
inconsistent view of the SMC capabilities on the client and server.
This patch disables the SMC option in the SYNACK when SYN Cookies are
active to avoid this issue.
Fixes: 60e2a77807 ("tcp: TCP experimental option for SMC")
Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All of the conflicts were cases of overlapping changes.
In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.
Signed-off-by: David S. Miller <davem@davemloft.net>
When the connection is reset, there is no point in
keeping the packets on the write queue until the connection
is closed.
RFC 793 (page 70) and RFC 793-bis (page 64) both suggest
purging the write queue upon RST:
https://tools.ietf.org/html/draft-ietf-tcpm-rfc793bis-07
Moreover, this is essential for a correct MSG_ZEROCOPY
implementation, because userspace cannot call close(fd)
before receiving zerocopy signals even when the connection
is reset.
Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 89fe18e44f.
While the patch could detect more spurious timeouts, it could cause
poor TCP performance on broken middle-boxes that modifies TCP packets
(e.g. receive window, SACK options). Since the performance gain is
much smaller compared to the potential loss. The best solution is
to fully revert the change.
Fixes: 89fe18e44f ("tcp: extend F-RTO to catch more spurious timeouts")
Reported-by: Teodor Milkov <tm@del.bg>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit cc663f4d4c. While fixing
some broken middle-boxes that modifies receive window fields, it does not
address middle-boxes that strip off SACK options. The best solution is
to fully revert this patch and the root F-RTO enhancement.
Fixes: cc663f4d4c ("tcp: restrict F-RTO to work-around broken middle-boxes")
Reported-by: Teodor Milkov <tm@del.bg>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After previous commit, sk_can_gso() is always true.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
배석진 reported that in some situations, packets for a given 5-tuple
end up being processed by different CPUS.
This involves RPS, and fragmentation.
배석진 is seeing packet drops when a SYN_RECV request socket is
moved into ESTABLISH state. Other states are protected by socket lock.
This is caused by a CPU losing the race, and simply not caring enough.
Since this seems to occur frequently, we can do better and perform
a second lookup.
Note that all needed memory barriers are already in the existing code,
thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
and reqsk_put(). The second lookup must find the new socket,
unless it has already been accepted and closed by another cpu.
Note that the fragmentation could be avoided in the first place by
use of a correct TCP MSS option in the SYN{ACK} packet, but this
does not mean we can not be more robust.
Many thanks to 배석진 for a very detailed analysis.
Reported-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A persistent connection may send tiny amount of data (e.g. health-check)
for a long period of time. BBR's windowed min RTT filter may only see
RTT samples from delayed ACKs causing BBR to grossly over-estimate
the path delay depending how much the ACK was delayed at the receiver.
This patch skips RTT samples that are likely coming from delayed ACKs. Note
that it is possible the sender never obtains a valid measure to set the
min RTT. In this case BBR will continue to set cwnd to initial window
which seems fine because the connection is thin stream.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch avoids having TCP sender or congestion control
overestimate the min RTT by orders of magnitude. This happens when
all the samples in the windowed filter are one-packet transfer
like small request and health-check like chit-chat, which is farily
common for applications using persistent connections. This patch
tries to conservatively labels and skip RTT samples obtained from
this type of workload.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds an event to trace TCP stat variables with
slightly intrusive trace-event. This uses ftrace/perf
event log buffer to trace those state, no needs to
prepare own ring-buffer, nor custom user apps.
User can use ftrace to trace this event as below;
# cd /sys/kernel/debug/tracing
# echo 1 > events/tcp/tcp_probe/enable
(run workloads)
# cat trace
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Three sets of overlapping changes, two in the packet scheduler
and one in the meson-gxl PHY driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
When ms timestamp is used, current logic uses 1us in
tcp_rcv_rtt_update() when the real rcv_rtt is within 1 - 999us.
This could cause rcv_rtt underestimation.
Fix it by always using a min value of 1ms if ms timestamp is used.
Fixes: 645f4c6f2e ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Back in linux-3.13 (commit b0983d3c9b ("tcp: fix dynamic right sizing"))
I addressed the pressing issues we had with receiver autotuning.
But DRS suffers from extra latencies caused by rcv_rtt_est.rtt_us
drifts. One common problem happens during slow start, since the
apparent RTT measured by the receiver can be inflated by ~50%,
at the end of one packet train.
Also, a single drop can delay read() calls by one RTT, meaning
tcp_rcv_space_adjust() can be called one RTT too late.
By replacing the tri-modal heuristic with a continuous function,
we can offset the effects of not growing 'at the optimal time'.
The curve of the function matches prior behavior if the space
increased by 25% and 50% exactly.
Cost of added multiply/divide is small, considering a TCP flow
typically would run this part of the code few times in its life.
I tested this patch with 100 ms RTT / 1% loss link, 100 runs
of (netperf -l 5), and got an average throughput of 4600 Mbit
instead of 1700 Mbit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using large tcp_rmem[2] values (I did tests with 500 MB),
I noticed overflows while computing rcvwin.
Lets fix this before the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While rcvbuf is properly clamped by tcp_rmem[2], rcvwin
is left to a potentially too big value.
It has no serious effect, since :
1) tcp_grow_window() has very strict checks.
2) window_clamp can be mangled by user space to any value anyway.
tcp_init_buffer_space() and companions use tcp_full_space(),
we use tcp_win_from_space() to avoid reloading sk->sk_rcvbuf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sender detects spurious retransmission, all packets
marked lost are remarked to be in-flight. However some may
be considered lost based on its timestamps in RACK. This patch
forces RACK to re-evaluate, which may be skipped previously if
the ACK does not advance RACK timestamp.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark tcp_sock during a SACK reneging event and invalidate rate samples
while marked. Such rate samples may overestimate bw by including packets
that were SACKed before reneging.
< ack 6001 win 10000 sack 7001:38001
< ack 7001 win 0 sack 8001:38001 // Reneg detected
> seq 7001:8001 // RTO, SACK cleared.
< ack 38001 win 10000
In above example the rate sample taken after the last ack will count
7001-38001 as delivered while the actual delivery rate likely could
be much lower i.e. 7001-8001.
This patch adds a new field tcp_sock.sack_reneg and marks it when we
declare SACK reneging and entering TCP_CA_Loss, and unmarks it after
the last rate sample was taken before moving back to TCP_CA_Open. This
patch also invalidates rate samples taken while tcp_sock.is_sack_reneg
is set.
Fixes: b9f64820fb ("tcp: track data delivery rate for a TCP connection")
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I switched rcv_rtt_est to high resolution timestamps, I forgot
that tp->tcp_mstamp needed to be refreshed in tcp_rcv_space_adjust()
Using an old timestamp leads to autotuning lags.
Fixes: 645f4c6f2e ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the TLP scheduling logic so that when scheduling a TLP probe, we
ensure that the estimated time at which an RTO would fire accounts for
the fact that ACKs indicating forward progress should push back RTO
times.
After the following fix:
df92c8394e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
we had an unintentional behavior change in the following kind of
scenario: suppose the RTT variance has been very low recently. Then
suppose we send out a flight of N packets and our RTT is 100ms:
t=0: send a flight of N packets
t=100ms: receive an ACK for N-1 packets
The response before df92c8394e that was:
-> schedule a TLP for now + RTO_interval
The response after df92c8394e is:
-> schedule a TLP for t=0 + RTO_interval
Since RTO_interval = srtt + RTT_variance, this means that we have
scheduled a TLP timer at a point in the future that only accounts for
RTT_variance. If the RTT_variance term is small, this means that the
timer fires soon.
Before df92c8394e this would not happen, because in that code, when
we receive an ACK for a prefix of flight, we did:
1) Near the top of tcp_ack(), switch from TLP timer to RTO
at write_queue_head->paket_tx_time + RTO_interval:
if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
tcp_rearm_rto(sk);
2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
if (flag & FLAG_ACKED) {
tcp_rearm_rto(sk);
3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
to TLP at now + RTO_interval:
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
tcp_schedule_loss_probe(sk);
In df92c8394e we removed that 3-phase dance, and instead directly
set the TLP timer once: we set the TLP timer in cases like this to
write_queue_head->packet_tx_time + RTO_interval. So if the RTT
variance is small, then this means that this is setting the TLP timer
to fire quite soon. This means if the ACK for the tail of the flight
takes longer than an RTT to arrive (often due to delayed ACKs), then
the TLP timer fires too quickly.
Fixes: df92c8394e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2 updates
- almost all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
memory hotplug: fix comments when adding section
mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
mm: simplify nodemask printing
mm,oom_reaper: remove pointless kthread_run() error check
mm/page_ext.c: check if page_ext is not prepared
writeback: remove unused function parameter
mm: do not rely on preempt_count in print_vma_addr
mm, sparse: do not swamp log with huge vmemmap allocation failures
mm/hmm: remove redundant variable align_end
mm/list_lru.c: mark expected switch fall-through
mm/shmem.c: mark expected switch fall-through
mm/page_alloc.c: broken deferred calculation
mm: don't warn about allocations which stall for too long
fs: fuse: account fuse_inode slab memory as reclaimable
mm, page_alloc: fix potential false positive in __zone_watermark_ok
mm: mlock: remove lru_add_drain_all()
mm, sysctl: make NUMA stats configurable
shmem: convert shmem_init_inodecache() to void
Unify migrate_pages and move_pages access checks
mm, pagevec: rename pagevec drained field
...
Patch series "kmemcheck: kill kmemcheck", v2.
As discussed at LSF/MM, kill kmemcheck.
KASan is a replacement that is able to work without the limitation of
kmemcheck (single CPU, slow). KASan is already upstream.
We are also not aware of any users of kmemcheck (or users who don't
consider KASan as a suitable replacement).
The only objection was that since KASAN wasn't supported by all GCC
versions provided by distros at that time we should hold off for 2
years, and try again.
Now that 2 years have passed, and all distros provide gcc that supports
KASAN, kill kmemcheck again for the very same reasons.
This patch (of 4):
Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
[alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
"Highlights:
1) Maintain the TCP retransmit queue using an rbtree, with 1GB
windows at 100Gb this really has become necessary. From Eric
Dumazet.
2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.
3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
Lunn.
4) Add meter action support to openvswitch, from Andy Zhou.
5) Add a data meta pointer for BPF accessible packets, from Daniel
Borkmann.
6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.
7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.
8) More work to move the RTNL mutex down, from Florian Westphal.
9) Add 'bpftool' utility, to help with bpf program introspection.
From Jakub Kicinski.
10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
Dangaard Brouer.
11) Support 'blocks' of transformations in the packet scheduler which
can span multiple network devices, from Jiri Pirko.
12) TC flower offload support in cxgb4, from Kumar Sanghvi.
13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
Leitner.
14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.
15) Add RED qdisc offloadability, and use it in mlxsw driver. From
Nogah Frankel.
16) eBPF based device controller for cgroup v2, from Roman Gushchin.
17) Add some fundamental tracepoints for TCP, from Song Liu.
18) Remove garbage collection from ipv6 route layer, this is a
significant accomplishment. From Wei Wang.
19) Add multicast route offload support to mlxsw, from Yotam Gigi"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
tcp: highest_sack fix
geneve: fix fill_info when link down
bpf: fix lockdep splat
net: cdc_ncm: GetNtbFormat endian fix
openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
netem: remove unnecessary 64 bit modulus
netem: use 64 bit divide by rate
tcp: Namespace-ify sysctl_tcp_default_congestion_control
net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
ipv6: set all.accept_dad to 0 by default
uapi: fix linux/tls.h userspace compilation error
usbnet: ipheth: prevent TX queue timeouts when device not ready
vhost_net: conditionally enable tx polling
uapi: fix linux/rxrpc.h userspace compilation errors
net: stmmac: fix LPI transitioning for dwmac4
atm: horizon: Fix irq release error
net-sysfs: trigger netlink notification on ifalias change via sysfs
openvswitch: Using kfree_rcu() to simplify the code
openvswitch: Make local function ovs_nsh_key_attr_size() static
openvswitch: Fix return value check in ovs_meter_cmd_features()
...
Pull core locking updates from Ingo Molnar:
"The main changes in this cycle are:
- Another attempt at enabling cross-release lockdep dependency
tracking (automatically part of CONFIG_PROVE_LOCKING=y), this time
with better performance and fewer false positives. (Byungchul Park)
- Introduce lockdep_assert_irqs_enabled()/disabled() and convert
open-coded equivalents to lockdep variants. (Frederic Weisbecker)
- Add down_read_killable() and use it in the VFS's iterate_dir()
method. (Kirill Tkhai)
- Convert remaining uses of ACCESS_ONCE() to
READ_ONCE()/WRITE_ONCE(). Most of the conversion was Coccinelle
driven. (Mark Rutland, Paul E. McKenney)
- Get rid of lockless_dereference(), by strengthening Alpha atomics,
strengthening READ_ONCE() with smp_read_barrier_depends() and thus
being able to convert users of lockless_dereference() to
READ_ONCE(). (Will Deacon)
- Various micro-optimizations:
- better PV qspinlocks (Waiman Long),
- better x86 barriers (Michael S. Tsirkin)
- better x86 refcounts (Kees Cook)
- ... plus other fixes and enhancements. (Borislav Petkov, Juergen
Gross, Miguel Bernal Marin)"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE
rcu: Use lockdep to assert IRQs are disabled/enabled
netpoll: Use lockdep to assert IRQs are disabled/enabled
timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabled
sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabled
irq_work: Use lockdep to assert IRQs are disabled/enabled
irq/timings: Use lockdep to assert IRQs are disabled/enabled
perf/core: Use lockdep to assert IRQs are disabled/enabled
x86: Use lockdep to assert IRQs are disabled/enabled
smp/core: Use lockdep to assert IRQs are disabled/enabled
timers/hrtimer: Use lockdep to assert IRQs are disabled/enabled
timers/nohz: Use lockdep to assert IRQs are disabled/enabled
workqueue: Use lockdep to assert IRQs are disabled/enabled
irq/softirqs: Use lockdep to assert IRQs are disabled/enabled
locking/lockdep: Add IRQs disabled/enabled assertion APIs: lockdep_assert_irqs_enabled()/disabled()
locking/pvqspinlock: Implement hybrid PV queued/unfair locks
locking/rwlocks: Fix comments
x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized
block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion()
workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes
...
Replace the reordering distance measurement in packet unit with
sequence based approach. Previously it trackes the number of "packets"
toward the forward ACK (i.e. highest sacked sequence)in a state
variable "fackets_out".
Precisely measuring reordering degree on packet distance has not much
benefit, as the degree constantly changes by factors like path, load,
and congestion window. It is also complicated and prone to arcane bugs.
This patch replaces with sequence-based approach that's much simpler.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
FACK loss detection has been disabled by default and the
successor RACK subsumed FACK and can handle reordering better.
This patch removes FACK to simplify TCP loss recovery.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the cause of an WARNING indicatng TCP has pending
retransmission in Open state in tcp_fastretrans_alert().
The root cause is a bad interaction between path mtu probing,
if enabled, and the RACK loss detection. Upong receiving a SACK
above the sequence of the MTU probing packet, RACK could mark the
probe packet lost in tcp_fastretrans_alert(), prior to calling
tcp_simple_retransmit().
tcp_simple_retransmit() only enters Loss state if it newly marks
the probe packet lost. If the probe packet is already identified as
lost by RACK, the sender remains in Open state with some packets
marked lost and retransmitted. Then the next SACK would trigger
the warning. The likely scenario is that the probe packet was
lost due to its size or network congestion. The actual impact of
this warning is small by potentially entering fast recovery an
ACK later.
The simple fix is always entering recovery (Loss) state if some
packet is marked lost during path MTU probing.
Fixes: a0370b3f3f ("tcp: enable RACK loss detection to trigger recovery")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note that when a new netns is created, it inherits its
sysctl_tcp_rmem and sysctl_tcp_wmem from initial netns.
This change is needed so that we can refine TCP rcvbuf autotuning,
to take RTT into consideration.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple cases of overlapping changes in the packet scheduler.
Must easier to resolve this time.
Which probably means that I screwed it up somehow.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes DSACK-based undo when sender is in Open State and
an ACK advances snd_una.
Example scenario:
- Sender goes into recovery and makes some spurious rtx.
- It comes out of recovery and enters into open state.
- It sends some more packets, let's say 4.
- The receiver sends an ACK for the first two, but this ACK is lost.
- The sender receives ack for first two, and DSACK for previous
spurious rtx.
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently TCP RACK loss detection does not work well if packets are
being reordered beyond its static reordering window (min_rtt/4).Under
such reordering it may falsely trigger loss recoveries and reduce TCP
throughput significantly.
This patch improves that by increasing and reducing the reordering
window based on DSACK, which is now supported in major TCP implementations.
It makes RACK's reo_wnd adaptive based on DSACK and no. of recoveries.
- If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
by srtt), since there is possibility that spurious retransmission was
due to reordering delay longer than reo_wnd.
- Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
no. of successful recoveries (accounts for full DSACK-based loss
recovery undo). After that, reset it to default (min_rtt/4).
- At max, reo_wnd is incremented only once per rtt. So that the new
DSACK on which we are reacting, is due to the spurious retx (approx)
after the reo_wnd has been updated last time.
- reo_wnd is tracked in terms of steps (of min_rtt/4), rather than
absolute value to account for change in rtt.
In our internal testing, we observed significant increase in throughput,
in scenarios where reordering exceeds min_rtt/4 (previous static value).
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Files removed in 'net-next' had their license header updated
in 'net'. We take the remove from 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This causes build failures:
In file included from net/ipv4/tcp_input.c:79:0:
./include/linux/unaligned/access_ok.h:7:28: error: redefinition of
'get_unaligned_le16'
In file included from ./include/asm-generic/unaligned.h:17:0,
from ./arch/arm/include/generated/asm/unaligned.h:1,
from net/ipv4/tcp_input.c:76:
./include/linux/unaligned/le_struct.h:6:19: note: previous definition
of 'get_unaligned_le16' was here
In file included from net/ipv4/tcp_input.c:79:0:
./include/linux/unaligned/access_ok.h:12:28: error: redefinition of
'get_unaligned_le32'
Plain "asm/access_ok.h", which is already included, is
sufficient.
Fixes: 60e2a77807 ("tcp: TCP experimental option for SMC")
Reported-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also remove an obsolete comment about TCP pacing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SMC protocol [1] relies on the use of a new TCP experimental
option [2, 3]. With this option, SMC capabilities are exchanged
between peers during the TCP three way handshake. This patch adds
support for this experimental option to TCP.
References:
[1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
[2] Shared Use of TCP Experimental Options RFC 6994:
https://tools.ietf.org/rfc/rfc6994.txt
[3] IANA ExID SMCR:
http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-exids
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.
However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.
It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts the IPv4 TCP input code and comments to use
{READ,WRITE}_ONCE() consistently.
----
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-8-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We already allow to enable TFO without a cookie by using the
fastopen-sysctl and setting it to TFO_SERVER_COOKIE_NOT_REQD (or
TFO_CLIENT_NO_COOKIE).
This is safe to do in certain environments where we know that there
isn't a malicous host (aka., data-centers) or when the
application-protocol already provides an authentication mechanism in the
first flight of data.
A server however might be providing multiple services or talking to both
sides (public Internet and data-center). So, this server would want to
enable cookie-less TFO for certain services and/or for connections that
go to the data-center.
This patch exposes a socket-option and a per-route attribute to enable such
fine-grained configurations.
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New tracepoint trace_tcp_receive_reset is added and called from
tcp_reset(). This tracepoint is define with a new class tcp_event_sk.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There were quite a few overlapping sets of changes here.
Daniel's bug fix for off-by-ones in the new BPF branch instructions,
along with the added allowances for "data_end > ptr + x" forms
collided with the metadata additions.
Along with those three changes came veritifer test cases, which in
their final form I tried to group together properly. If I had just
trimmed GIT's conflict tags as-is, this would have split up the
meta tests unnecessarily.
In the socketmap code, a set of preemption disabling changes
overlapped with the rename of bpf_compute_data_end() to
bpf_compute_data_pointers().
Changes were made to the mv88e6060.c driver set addr method
which got removed in net-next.
The hyperv transport socket layer had a locking change in 'net'
which overlapped with a change of socket state macro usage
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Notice that in some cases I placed the "fall through" comment
on its own line, which is what GCC is expecting to find.
Addresses-Coverity-ID: 115108
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a linear list to store all skbs in write queue has been okay
for quite a while : O(N) is not too bad when N < 500.
Things get messy when N is the order of 100,000 : Modern TCP stacks
want 10Gbit+ of throughput even with 200 ms RTT flows.
40 ns per cache line miss means a full scan can use 4 ms,
blowing away CPU caches.
SACK processing often can use various hints to avoid parsing
whole retransmit queue. But with high packet losses and/or high
reordering, hints no longer work.
Sender has to process thousands of unfriendly SACK, accumulating
a huge socket backlog, burning a cpu and massively dropping packets.
Using an rb-tree for retransmit queue has been avoided for years
because it added complexity and overhead, but now is the time
to be more resistant and say no to quadratic behavior.
1) RTX queue is no longer part of the write queue : already sent skbs
are stored in one rb-tree.
2) Since reaching the head of write queue no longer needs
sk->sk_send_head, we added an union of sk_send_head and tcp_rtx_queue
Tested:
On receiver :
netem on ingress : delay 150ms 200us loss 1
GRO disabled to force stress and SACK storms.
for f in `seq 1 10`
do
./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'
Before patch :
323.87
351.48
339.59
338.62
306.72
204.07
304.93
291.88
202.47
176.88
2840
After patch:
1700.83
2207.98
2070.17
1544.26
2114.76
2124.89
1693.14
1080.91
2216.82
1299.94
18053
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to recompute previous skb, as it will be a bit more
expensive when rtx queue is converted to RB tree.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With upcoming rb-tree implementation, the checks will trigger
more often, and this is expected.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It will be a bit more expensive to get the head of rtx queue
once rtx queue is converted to an rb-tree.
We can avoid this extra cost in case tp->lost_skb_hint is set.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geeralize private netem_rb_to_skb()
TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new queue (list) that tracks the sent but not yet
acked or SACKed skbs for a TCP connection. The list is chronologically
ordered by skb->skb_mstamp (the head is the oldest sent skb).
This list will be used to optimize TCP Rack recovery, which checks
an skb's timestamp to judge if it has been lost and needs to be
retransmitted. Since TCP write queue is ordered by sequence instead
of sent time, RACK has to scan over the write queue to catch all
eligible packets to detect lost retransmission, and iterates through
SACKed skbs repeatedly.
Special cares for rare events:
1. TCP repair fakes skb transmission so the send queue needs adjusted
2. SACK reneging would require re-inserting SACKed skbs into the
send queue. For now I believe it's not worth the complexity to
make RACK work perfectly on SACK reneging, so we do nothing here.
3. Fast Open: currently for non-TFO, send-queue correctly queues
the pure SYN packet. For TFO which queues a pure SYN and
then a data packet, send-queue only queues the data packet but
not the pure SYN due to the structure of TFO code. This is okay
because the SYN receiver would never respond with a SACK on a
missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).
In order to not grow sk_buff, we use an union for the new list and
_skb_refdst/destructor fields. This is a bit complicated because
we need to make sure _skb_refdst and destructor are properly zeroed
before skb is cloned/copied at transmit, and before being freed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit does a cleanup and moves tcp_rearm_rto() call in the TFO
server case into a previous spot in tcp_rcv_state_process() to make
it more compact.
This is only a cosmetic change.
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently in the TCP code, the initialization sequence for cached
metrics, congestion control, BPF, etc, after successful connection
is very inconsistent. This introduces inconsistent bevhavior and is
prone to bugs. The current call sequence is as follows:
(1) for active case (tcp_finish_connect() case):
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);
(2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
icsk->icsk_af_ops->rebuild_header(sk);
tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
tcp_mtup_init(sk);
tcp_init_buffer_space(sk);
tcp_init_metrics(sk);
(3) for TFO passive case (tcp_fastopen_create_child()):
inet_csk(child)->icsk_af_ops->rebuild_header(child);
tcp_init_congestion_control(child);
tcp_mtup_init(child);
tcp_init_metrics(child);
tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_buffer_space(child);
This commit uniforms the above functions to have the following sequence:
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);
This sequence is the same as the (1) active case. We pick this sequence
because this order correctly allows BPF to override the settings
including congestion control module and initial cwnd, etc from
the route, and then allows the CC module to see those settings.
Suggested-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).
Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.
This saves some overhead in both TCP and netem.
v2: removes the swtstamp field from struct tcp_skb_cb
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the #ifdef into the static void function so that the use
of DBGUNDO is validated when FASTRETRANS_DEBUG <= 1.
Remove the now unnecessary #else and #define DBGUNDO.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 45f119bf93.
Eric Dumazet says:
We found at Google a significant regression caused by
45f119bf93 tcp: remove header prediction
In typical RPC (TCP_RR), when a TCP socket receives data, we now call
tcp_ack() while we used to not call it.
This touches enough cache lines to cause a slowdown.
so problem does not seem to be HP removal itself but the tcp_ack()
call. Therefore, it might be possible to remove HP after all, provided
one finds a way to elide tcp_ack for most cases.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change was a followup to the header prediction removal,
so first revert this as a prerequisite to back out hp removal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian reported UDP xmit drops that could be root caused to the
too small neigh limit.
Current limit is 64 KB, meaning that even a single UDP socket would hit
it, since its default sk_sndbuf comes from net.core.wmem_default
(~212992 bytes on 64bit arches).
Once ARP/ND resolution is in progress, we should allow a little more
packets to be queued, at least for one producer.
Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
limit and either block in sendmsg() or return -EAGAIN.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
timestamp corresponding to the highest sequence number data returned.
Previously the skb->tstamp is overwritten when a TCP packet is placed
in the out of order queue. While the packet is in the ooo queue, save the
timestamp in the TCB_SKB_CB. This space is shared with the gso_*
options which are only used on the tx path, and a previously unused 4
byte hole.
When skbs are coalesced either in the sk_receive_queue or the
out_of_order_queue always choose the timestamp of the appended skb to
maintain the invariant of returning the timestamp of the last byte in
the recvmsg buffer.
Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_peer_is_proven needs a proper route to make the
determination, but dst always is NULL. This bug may
be there at the beginning of git tree. This does not
look serious enough to deserve backports to stable
versions.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some situations tcp_send_loss_probe() can realize that it's unable
to send a loss probe (TLP), and falls back to calling tcp_rearm_rto()
to schedule an RTO timer. In such cases, sometimes tcp_rearm_rto()
realizes that the RTO was eligible to fire immediately or at some
point in the past (delta_us <= 0). Previously in such cases
tcp_rearm_rto() was scheduling such "overdue" RTOs to happen at now +
icsk_rto, which caused needless delays of hundreds of milliseconds
(and non-linear behavior that made reproducible testing
difficult). This commit changes the logic to schedule "overdue" RTOs
ASAP, rather than at now + icsk_rto.
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.
The TCP conflict was a case of local variables in a function
being removed from both net and net-next.
In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.
Signed-off-by: David S. Miller <davem@davemloft.net>
Using ssthresh to revert cwnd is less reliable when ssthresh is
bounded to 2 packets. This patch uses an existing variable in TCP
"prior_cwnd" that snapshots the cwnd right before entering fast
recovery and RTO recovery in Reno. This fixes the issue discussed
in netdev thread: "A buggy behavior for Linux TCP Reno and HTCP"
https://www.spinics.net/lists/netdev/msg444955.html
Suggested-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Wei Sun <unlcsewsun@gmail.com>
Signed-off-by: Yuchung Cheng <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a TCP loss recovery performance bug raised recently on the netdev
list, in two threads:
(i) July 26, 2017: netdev thread "TCP fast retransmit issues"
(ii) July 26, 2017: netdev thread:
"[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
outstanding TLP retransmission"
The basic problem is that incoming TCP packets that did not indicate
forward progress could cause the xmit timer (TLP or RTO) to be rearmed
and pushed back in time. In certain corner cases this could result in
the following problems noted in these threads:
- Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
could cause TCP to repeatedly schedule TLPs forever. We kept
sending TLPs after every ~200ms, which elicited bogus SACKs, which
caused more TLPs, ad infinitum; we never fired an RTO to fill in
the holes.
- Incoming data segments could, in some cases, cause us to reschedule
our RTO or TLP timer further out in time, for no good reason. This
could cause repeated inbound data to result in stalls in outbound
data, in the presence of packet loss.
This commit fixes these bugs by changing the TLP and RTO ACK
processing to:
(a) Only reschedule the xmit timer once per ACK.
(b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
ACK indicates sufficient forward progress (a packet was
cumulatively ACKed, or we got a SACK for a packet that was sent
before the most recent retransmit of the write queue head).
This brings us back into closer compliance with the RFCs, since, as
the comment for tcp_rearm_rto() notes, we should only restart the RTO
timer after forward progress on the connection. Previously we were
restarting the xmit timer even in these cases where there was no
forward progress.
As a side benefit, this commit simplifies and speeds up the TCP timer
arming logic. We had been calling inet_csk_reset_xmit_timer() three
times on normal ACKs that cumulatively acknowledged some data:
1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
tcp_rearm_rto(sk);
2) Once in tcp_clean_rtx_queue(), to update the RTO:
if (flag & FLAG_ACKED) {
tcp_rearm_rto(sk);
3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
to TLP:
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
tcp_schedule_loss_probe(sk);
This commit, by only rescheduling the xmit timer once per ACK,
simplifies the code and reduces CPU overhead.
This commit was tested in an A/B test with Google web server
traffic. SNMP stats and request latency metrics were within noise
levels, substantiating that for normal web traffic patterns this is a
rare issue. This commit was also tested with packetdrill tests to
verify that it fixes the timer behavior in the corner cases discussed
in the netdev threads mentioned above.
This patch is a bug fix patch intended to be queued for -stable
relases.
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Reported-by: Klavs Klavsen <kl@vsen.dk>
Reported-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pure refactor. This helper will be required in the xmit timer fix
later in the patch series. (Because the TLP logic will want to make
this calculation.)
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 45f119bf93 ("tcp: remove header prediction") introduced a
minor bug: the sk_state_change() and sk_wake_async() notifications for
a completed active connection happen twice: once in this new spot
inside tcp_finish_connect() and once in the existing code in
tcp_rcv_synsent_state_process() immediately after it calls
tcp_finish_connect(). This commit remoes the duplicate POLL_OUT
notifications.
Fixes: 45f119bf93 ("tcp: remove header prediction")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the sender switches the congestion control during ECN-triggered
cwnd-reduction state (CA_CWR), upon exiting recovery cwnd is set to
the ssthresh value calculated by the previous congestion control. If
the previous congestion control is BBR that always keep ssthresh
to TCP_INIFINITE_SSTHRESH, cwnd ends up being infinite. The safe
step is to avoid assigning invalid ssthresh value when recovery ends.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c13ee2a4f0 ("tcp: reindent two spots after prequeue removal")
removed code in tcp_data_queue().
We can go a little farther, removing an always true test,
and removing initializers for fragstolen and eaten variables.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
re-indent tcp_ack, and remove CA_ACK_SLOWPATH; it is always set now.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like prequeue, I am not sure this is overly useful nowadays.
If we receive a train of packets, GRO will aggregate them if the
headers are the same (HP predates GRO by several years) so we don't
get a per-packet benefit, only a per-aggregated-packet one.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
These two branches are now always true, remove the conditional.
objdiff shows no changes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.
This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.
In practice, this doesn't happen anymore that often because nowadays
servers tend to use an event driven (epoll) model.
Even normal client applications (web browsers) commonly use many tcp
connections in parallel.
This has measureable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%), however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last (4th) argument of tcp_rcv_established() is redundant as it
always equals to skb->len and the skb itself is always passed as 2th
agrument. There is no reason to have it.
Signed-off-by: Ilya V. Matveychikov <matvejchikov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added support for changing congestion control for SOCK_OPS bpf
programs through the setsockopt bpf helper function. It also adds
a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added callbacks to BPF SOCK_OPS type program before an active
connection is intialized and after a passive or active connection is
established.
The following patch demostrates how they can be used to set send and
receive buffer sizes.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for setting a per connection SYN and
SYN_ACK RTOs from within a BPF_SOCK_OPS program. For example,
to set small RTOs when it is known both hosts are within a
datacenter.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to move some TCP sysctls to net namespaces in the future.
tcp_window_scaling, tcp_sack and tcp_timestamps being fetched
from tcp_parse_options(), we need to pass an extra parameter.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when a data packet is retransmitted, we do not compute an
RTT sample for congestion control due to Kern's check. Therefore the
congestion control that uses RTT signals may not receive any update
during loss recovery which could last many round trips. For example,
BBR and Vegas may not be able to update its min RTT estimation if the
network path has shortened until it recovers from losses. This patch
mitigates that by using TCP timestamp options for RTT measurement
for congestion control. Note that we already use timestamps for
RTT estimation.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Fiterau Brostean reported :
<quote>
Linux TCP stack we analyze exhibits behavior that seems odd to me.
The scenario is as follows (all packets have empty payloads, no window
scaling, rcv/snd window size should not be a factor):
TEST HARNESS (CLIENT) LINUX SERVER
1. - LISTEN (server listen,
then accepts)
2. - --> <SEQ=100><CTL=SYN> --> SYN-RECEIVED
3. - <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED
4. - --> <SEQ=101><ACK=301><CTL=ACK> --> ESTABLISHED
5. - <-- <SEQ=301><ACK=101><CTL=FIN,ACK> <-- FIN WAIT-1 (server
opts to close the data connection calling "close" on the connection
socket)
6. - --> <SEQ=101><ACK=99999><CTL=FIN,ACK> --> CLOSING (client sends
FIN,ACK with not yet sent acknowledgement number)
7. - <-- <SEQ=302><ACK=102><CTL=ACK> <-- CLOSING (ACK is 102
instead of 101, why?)
... (silence from CLIENT)
8. - <-- <SEQ=301><ACK=102><CTL=FIN,ACK> <-- CLOSING
(retransmission, again ACK is 102)
Now, note that packet 6 while having the expected sequence number,
acknowledges something that wasn't sent by the server. So I would
expect
the packet to maybe prompt an ACK response from the server, and then be
ignored. Yet it is not ignored and actually leads to an increase of the
acknowledgement number in the server's retransmission of the FIN,ACK
packet. The explanation I found is that the FIN in packet 6 was
processed, despite the acknowledgement number being unacceptable.
Further experiments indeed show that the server processes this FIN,
transitioning to CLOSING, then on receiving an ACK for the FIN it had
send in packet 5, the server (or better said connection) transitions
from CLOSING to TIME_WAIT (as signaled by netstat).
</quote>
Indeed, tcp_rcv_state_process() calls tcp_ack() but
does not exploit the @acceptable status but for TCP_SYN_RECV
state.
What we want here is to send a challenge ACK, if not in TCP_SYN_RECV
state. TCP_FIN_WAIT1 state is not the only state we should fix.
Add a FLAG_NO_CHALLENGE_ACK so that tcp_rcv_state_process()
can choose to send a challenge ACK and discard the packet instead
of wrongly change socket state.
With help from Neal Cardwell.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paul Fiterau Brostean <p.fiterau-brostean@science.ru.nl>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit bafbb9c732 ("tcp: eliminate negative reordering
in tcp_clean_rtx_queue") fixes an issue for negative
reordering metrics.
To be resilient to such errors, warn and return
when a negative metric is passed to tcp_update_reordering().
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skbs in (re)transmit queue no longer have a copy of jiffies
at the time of the transmit : skb->skb_mstamp is now in usec unit,
with no correlation to tcp_jiffies32.
We have to convert rto from jiffies to usec, compute a time difference
in usec, then convert the delta to HZ units.
Fixes: 9a568de481 ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP Timestamps option is defined in RFC 7323
Traditionally on linux, it has been tied to the internal
'jiffies' variable, because it had been a cheap and good enough
generator.
For TCP flows on the Internet, 1 ms resolution would be much better
than 4ms or 10ms (HZ=250 or HZ=100 respectively)
For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1]
Receive size autotuning (DRS) is indeed more precise and converges
faster to optimal window size.
This patch converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.
This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.
[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After this patch, all uses of tcp_time_stamp will require
a change when we introduce 1 ms and/or 1 us TCP TS option.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This place wants to use tcp_jiffies32, this is good enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use tcp_jiffies32 instead of tcp_time_stamp to feed
tp->snd_cwnd_stamp.
tcp_time_stamp will soon be a litle bit more expensive
than simply reading 'jiffies'.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use tcp_jiffies32 instead of tcp_time_stamp to feed
tp->lsndtime.
tcp_time_stamp will soon be a litle bit more expensive
than simply reading 'jiffies'.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_ack() can call tcp_fragment() which may dededuct the
value tp->fackets_out when MSS changes. When prior_fackets
is larger than tp->fackets_out, tcp_clean_rtx_queue() can
invoke tcp_update_reordering() with negative values. This
results in absurd tp->reodering values higher than
sysctl_tcp_max_reordering.
Note that tcp_update_reordering indeeds sets tp->reordering
to min(sysctl_tcp_max_reordering, metric), but because
the comparison is signed, a negative metric always wins.
Fixes: c7caf8d3ed ("[TCP]: Fix reord detection due to snd_una covered holes")
Reported-by: Rebecca Isaacs <risaacs@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a bug in splitting an SKB during SACK
processing. Specifically if an skb contains multiple
packets and is only partially sacked in the higher sequences,
tcp_match_sack_to_skb() splits the skb and marks the second fragment
as SACKed.
The current code further attempts rounding up the first fragment
to MSS boundaries. But it misses a boundary condition when the
rounded-up fragment size (pkt_len) is exactly skb size. Spliting
such an skb is pointless and causses a kernel warning and aborts
the SACK processing. This patch universally checks such over-split
before calling tcp_fragment to prevent these unnecessary warnings.
Fixes: adb92db857 ("tcp: Make SACK code to split only at mss boundaries")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whole point of randomization was to hide server uptime, but an attacker
can simply start a syn flood and TCP generates 'old style' timestamps,
directly revealing server jiffies value.
Also, TSval sent by the server to a particular remote address vary
depending on syncookies being sent or not, potentially triggering PAWS
drops for innocent clients.
Lets implement proper randomization, including for SYNcookies.
Also we do not need to export sysctl_tcp_timestamps, since it is not
used from a module.
In v2, I added Florian feedback and contribution, adding tsoff to
tcp_get_cookie_sock().
v3 removed one unused variable in tcp_v4_connect() as Florian spotted.
Fixes: 95a22caee3 ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some devices or distributions use HZ=100 or HZ=250
TCP receive buffer autotuning has poor behavior caused by this choice.
Since autotuning happens after 4 ms or 10 ms, short distance flows
get their receive buffer tuned to a very high value, but after an initial
period where it was frozen to (too small) initial value.
With tp->tcp_mstamp introduction, we can switch to high resolution
timestamps almost for free (at the expense of 8 additional bytes per
TCP structure)
Note that some TCP stacks use usec TCP timestamps where this
patch makes even more sense : Many TCP flows have < 500 usec RTT.
Hopefully this finer TS option can be standardized soon.
Tested:
HZ=100 kernel
./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &
Peer without patch :
lpaa24:~# ss -tmi dst lpaa23
...
skmem:(r0,rb8388608,...)
rcv_rtt:10 rcv_space:3210000 minrtt:0.017
Peer with the patch :
lpaa23:~# ss -tmi dst lpaa24
...
skmem:(r0,rb428800,...)
rcv_rtt:0.069 rcv_space:30000 minrtt:0.017
We can see saner RCVBUF, and more precise rcv_rtt information.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is no longer needed, everything uses tp->tcp_mstamp instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch will remove ack_time from struct tcp_sacktag_state
Same info is now found in tp->tcp_mstamp
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer needed, since tp->tcp_mstamp holds the information.
This is needed to remove sack_state.ack_time in a following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer needed, since tp->tcp_mstamp holds the information.
This is needed to remove sack_state.ack_time in a following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not used anymore now tp->tcp_mstamp holds the information.
This is needed to remove sack_state.ack_time in a following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not used anymore now tp->tcp_mstamp holds the information.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is no longer used, since tcp_rack_detect_loss() takes
the timestamp from tp->tcp_mstamp
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to use precise timestamps in TCP stack, but we do not
want to call possibly expensive kernel time services too often.
tp->tcp_mstamp is guaranteed to be updated once per incoming packet.
We will use it in the following patches, removing specific
skb_mstamp_get() calls, and removing ack_time from
struct tcp_sacktag_state.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This counter records the number of times the firewall blackhole issue is
detected and active TFO is disabled.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X S
[retry and timeout]
https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C? X <- data ------ S
[retry and timeout]
This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.
The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST
We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO sockets not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().
The rational behind it is that when such firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.
Also, add a debug sysctl:
tcp_fastopen_blackhole_detect_timeout_sec:
the initial timeout to use when firewall blackhole issue happens.
This can be set and read.
When setting it to 0, it means to disable the active disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using TCP FastOpen for an active session, we send one wakeup event
from tcp_finish_connect(), right before the data eventually contained in
the received SYNACK is queued to sk->sk_receive_queue.
This means that depending on machine load or luck, poll() users
might receive POLLOUT events instead of POLLIN|POLLOUT
To fix this, we need to move the call to sk->sk_state_change()
after the (optional) call to tcp_rcv_fastopen_synack()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a RST packet is processed, we send two wakeup events to interested
polling users.
First one by a sk->sk_error_report(sk) from tcp_reset(),
followed by a sk->sk_state_change(sk) from tcp_done().
Depending on machine load and luck, poll() can either return POLLERR,
or POLLIN|POLLOUT|POLLERR|POLLHUP (this happens on 99 % of the cases)
This is probably fine, but we can avoid the confusion by reordering
things so that we have more TCP fields updated before the first wakeup.
This might even allow us to remove some barriers we added in the past.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were simply overlapping changes. In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.
In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent extension of F-RTO 89fe18e44 ("tcp: extend F-RTO
to catch more spurious timeouts") interacts badly with certain
broken middle-boxes. These broken boxes modify and falsely raise
the receive window on the ACKs. During a timeout induced recovery,
F-RTO would send new data packets to probe if the timeout is false
or not. Since the receive window is falsely raised, the receiver
would silently drop these F-RTO packets. The recovery would take N
(exponentially backoff) timeouts to repair N packet losses. A TCP
performance killer.
Due to this unfortunate situation, this patch removes this extension
to revert F-RTO back to the RFC specification.
Fixes: 89fe18e44f ("tcp: extend F-RTO to catch more spurious timeouts")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly simple cases of overlapping changes (adding code nearby,
a function whose name changes, for example).
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the reordering SNMP counters only increase if a connection
sees a higher degree then it has previously seen. It ignores if the
reordering degree is not greater than the default system threshold.
This significantly under-counts the number of reordering events
and falsely convey that reordering is rare on the network.
This patch properly and faithfully records the number of reordering
events detected by the TCP stack, just like the comment says "this
exciting event is worth to be remembered". Note that even so TCP
still under-estimate the actual reordering events because TCP
requires TS options or certain packet sequences to detect reordering
(i.e. ACKing never-retransmitted sequence in recovery or disordered
state).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define one new macro TCP_MAX_WSCALE instead of literal number '14',
and use U16_MAX instead of 65535 as the max value of TCP window.
There is another minor change, use rounddown(space, mss) instead of
(space / mss) * mss;
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Trippelsdorf reported that after commit dcb17d22e1 ("tcp: warn
on bogus MSS and try to amend it") the kernel started logging the
warning for a NIC driver that doesn't even support GRO.
It was diagnosed that it was possibly caused on connections that were
using TCP Timestamps but some packets lacked the Timestamps option. As
we reduce rcv_mss when timestamps are used, the lack of them would cause
the packets to be bigger than expected, although this is a valid case.
As this warning is more as a hint, getting a clean-cut on the
threshold is probably not worth the execution time spent on it. This
patch thus alleviates the false-positives with 2 quick checks: by
accounting for the entire TCP option space and also checking against the
interface MTU if it's available.
These changes, specially the MTU one, might mask some real positives,
though if they are really happening, it's possible that sooner or later
it will be triggered anyway.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/genet/bcmmii.c
drivers/net/hyperv/netvsc.c
kernel/bpf/hashtab.c
Almost entirely overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
icsk_ack.lrcvtime has a 0 value at socket creation time.
tcpi_last_data_recv can have bogus value if no payload is ever received.
This patch initializes icsk_ack.lrcvtime for active sessions
in tcp_finish_connect(), and for passive sessions in
tcp_create_openreq_child()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tcp_tw_recycle was already broken for connections
behind NAT, since the per-destination timestamp is not
monotonically increasing for multiple machines behind
a single destination address.
After the randomization of TCP timestamp offsets
in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
for each connection), the tcp_tw_recycle is broken for all
types of connections for the same reason: the timestamps
received from a single machine is not monotonically increasing,
anymore.
Remove tcp_tw_recycle, since it is not functional. Also, remove
the PAWSPassive SNMP counter since it is only used for
tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
since the strict argument is only set when tcp_tw_recycle is
enabled.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Lutz Vieweg <lvml@5t9.de>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection)
randomizes TCP timestamps per connection. After this commit,
there is no guarantee that the timestamps received from the
same destination are monotonically increasing. As a result,
the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts
in struct tcp_metrics_block) is broken and cannot be relied upon.
Remove the per-destination timestamp cache and all related code
paths.
Note that this cache was already broken for caching timestamps of
multiple machines behind a NAT sharing the same address.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Lutz Vieweg <lvml@5t9.de>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions that are returning tcp sequence number also setup
TS offset value, so rename them to better describe their purpose.
No functional changes in this patch.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
Use the new sk_dst_confirm() helper to propagate the
indication from received packets to sock_confirm_neigh().
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 5110effee8 ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bedf3 ("ipv4: Cache output routes in fib_info nexthops.")
Tested-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit().
Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient.
Fixes: 3c7151275c ("tcp: add memory barriers to write space paths")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_add_backlog() can use skb_condense() helper to get better
gains and less SKB_TRUESIZE() magic. This only happens when socket
backlog has to be used.
Some attacks involve specially crafted out of order tiny TCP packets,
clogging the ofo queue of (many) sockets.
Then later, expensive collapse happens, trying to copy all these skbs
into single ones.
This unfortunately does not work if each skb has no neighbor in TCP
sequence order.
By using skb_condense() if the skb could not be coalesced to a prior
one, we defeat these kind of threats, potentially saving 4K per skb
(or more, since this is one page fragment).
A typical NAPI driver allocates gro packets with GRO_MAX_HEAD bytes
in skb->head, meaning the copy done by skb_condense() is limited to
about 200 bytes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab'), are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then sometimes (they are rate-limited) does not
reply with any RST as would be expected on a closed socket.
This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.
A packetdrill test demonstrating the behavior:
// testing mac osx rst behavior
// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4
// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768
// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768
// make sure we are in TCP_CLOSE
0.400 %{
assert tcpi_state == 7
}%
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch disables FACK by default as RACK is the successor of FACK
(inspired by the insights behind FACK).
FACK[1] in Linux works as follows: a packet P is deemed lost,
if packet Q of higher sequence is s/acked and P and Q are distant
by at least dupthresh number of packets in sequence space.
FACK is more aggressive than the IETF recommened recovery for SACK
(RFC3517 A Conservative Selective Acknowledgment (SACK)-based Loss
Recovery Algorithm for TCP), because a single SACK may trigger
fast recovery. This obviously won't work well with reordering so
FACK is dynamically disabled upon detecting reordering.
RACK supersedes FACK by using time distance instead of sequence
distance. On reordering, RACK waits for a quarter of RTT receiving
a single SACK before starting recovery. (the timer can be made more
adaptive in the future by measuring reordering distance in time,
but currently RTT/4 seem to work well.) Once the recovery starts,
RACK behaves almost like FACK because it reduces the reodering
window to 1ms, so it fast retransmits quickly. In addition RACK
can detect loss retransmission as it does not care about the packet
sequences (being repeated or not), which is extremely useful when
the connection is going through a traffic policer.
Google server experiments indicate that disabling FACK after enabling
RACK has negligible impact on the overall loss recovery performance
with more reordering events detected. But we still keep the FACK
implementation for backup if RACK has bugs that needs to be disabled.
[1] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining
TCP Congestion Control," In Proceedings of SIGCOMM '96, August 1996.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thin stream DUPACK is to start fast recovery on only one DUPACK
provided the connection is a thin stream (i.e., low inflight). But
this older feature is now subsumed with RACK. If a connection
receives only a single DUPACK, RACK would arm a reordering timer
and soon starts fast recovery instead of timeout if no further
ACKs are received.
The socket option (THIN_DUPACK) is kept as a nop for compatibility.
Note that this patch does not change another thin-stream feature
which enables linear RTO. Although it might be good to generalize
that in the future (i.e., linear RTO for the first say 3 retries).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the (partial) implementation of the aggressive
limited transmit in RFC4653 TCP Non-Congestion Robustness (NCR).
NCR is a mitigation to the problem created by the dynamic
DUPACK threshold. With the current adaptive DUPACK threshold
(tp->reordering) could cause timeouts by preventing fast recovery.
For example, if the last packet of a cwnd burst was reordered, the
threshold will be set to the size of cwnd. But if next application
burst is smaller than threshold and has drops instead of reorderings,
the sender would not trigger fast recovery but instead resorts to a
timeout recovery.
NCR mitigates this issue by checking the number of DUPACKs against
the current flight size additionally. The techniqueue is similar to
the early retransmit RFC.
With RACK loss detection, this mitigation is not needed, because RACK
does not use DUPACK threshold to detect losses. RACK arms a reordering
timer to fire at most a quarter RTT later to start fast recovery.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the support of RFC5827 early retransmit (i.e.,
fast recovery on small inflight with <3 dupacks) because it is
subsumed by the new RACK loss detection. More specifically when
RACK receives DUPACKs, it'll arm a reordering timer to start fast
recovery after a quarter of (min)RTT, hence it covers the early
retransmit except RACK does not limit itself to specific inflight
or dupack numbers.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Forward retransmit is an esoteric feature in RFC3517 (condition(3)
in the NextSeg()). Basically if a packet is not considered lost by
the current criteria (# of dupacks etc), but the congestion window
has room for more packets, then retransmit this packet.
However it actually conflicts with the rest of recovery design. For
example, when reordering is detected we want to be conservative
in retransmitting packets but forward-retransmit feature would
break that to force more retransmission. Also the implementation is
fairly complicated inside the retransmission logic inducing extra
iterations in the write queue. With RACK losses are being detected
timely and this heuristic is no longer necessary. There this patch
removes the feature.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current F-RTO reverts cwnd reset whenever a never-retransmitted
packet was (s)acked. The timeout can be declared spurious because
the packets acknoledged with this ACK was transmitted before the
timeout, so clearly not all the packets are lost to reset the cwnd.
This nice detection does not really depend F-RTO internals. This
patch applies the detection universally. On Google servers this
change detected 20% more spurious timeouts.
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes two things:
1. Start fast recovery with RACK in addition to other heuristics
(e.g., DUPACK threshold, FACK). Prior to this change RACK
is enabled to detect losses only after the recovery has
started by other algorithms.
2. Disable TCP early retransmit. RACK subsumes the early retransmit
with the new reordering timer feature. A latter patch in this
series removes the early retransmit code.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently RACK would mark loss before the undo operations in TCP
loss recovery. This could incorrectly identify real losses as
spurious. For example a sender first experiences a delay spike and
then eventually some packets were lost due to buffer overrun.
In this case, the sender should perform fast recovery b/c not all
the packets were lost.
But the sender may first trigger a (spurious) RTO and reset
cwnd to 1. The following ACKs may used to mark real losses by
tcp_rack_mark_lost. Then in tcp_process_loss this ACK could trigger
F-RTO undo condition and unmark real losses and revert the cwnd
reduction. If there are no more ACKs coming back, eventually the
sender would timeout again instead of performing fast recovery.
The patch fixes this incorrect process by always performing
the undo checks before detecting losses.
Fixes: 4f41b1c58a ("tcp: use RACK to detect losses")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The packets inside a jumbo skb (e.g., TSO) share the same skb
timestamp, even though they are sent sequentially on the wire. Since
RACK is based on time, it can not detect some packets inside the
same skb are lost. However, we can leverage the packet sequence
numbers as extended timestamps to detect losses. Therefore, when
RACK timestamp is identical to skb's timestamp (i.e., one of the
packets of the skb is acked or sacked), we use the sequence numbers
of the acked and unacked packets to break ties.
We can use the same sequence logic to advance RACK xmit time as
well to detect more losses and avoid timeout.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes RACK install a reordering timer when it suspects
some packets might be lost, but wants to delay the decision
a little bit to accomodate reordering.
It does not create a new timer but instead repurposes the existing
RTO timer, because both are meant to retransmit packets.
Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
the RACK timing check fails. The wait time is set to
RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
This translates to expecting a packet (Packet) should take
(RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.
When there are multiple packets that need a timer, we use one timer
with the maximum timeout. Therefore the timer conservatively uses
the maximum window to expire N packets by one timeout, instead of
N timeouts to expire N packets sent at different times.
The fudge factor is 2 jiffies to ensure when the timer fires, all
the suspected packets would exceed the deadline and be marked lost
by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
clock may tick between calling icsk_reset_xmit_timer(timeout) and
actually hang the timer. The next jiffy is to lower-bound the timeout
to 2 jiffies when reo_wnd is < 1ms.
When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
in Recovery we'll enter fast recovery and force fast retransmit.
This is very similar to the early retransmit (RFC5827) except RACK
is not constrained to only enter recovery for small outstanding
flights.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Record the most recent RTT in RACK. It is often identical to the
"ca_rtt_us" values in tcp_clean_rtx_queue. But when the packet has
been retransmitted, RACK choses to believe the ACK is for the
(latest) retransmitted packet if the RTT is over minimum RTT.
This requires passing the arrival time of the most recent ACK to
RACK routines. The timestamp is now recorded in the "ack_time"
in tcp_sacktag_state during the ACK processing.
This patch does not change the RACK algorithm itself. It only adds
the RTT variable to prepare the next main patch.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new helper tcp_rack_detect_loss to prepare the upcoming
RACK reordering timer patch.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Different namespace application might require different maximal
number of remembered connection requests.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Different namespace application might require fast recycling
TIME-WAIT sockets independently of the host.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There have been some reports lately about TCP connection stalls caused
by NIC drivers that aren't setting gso_size on aggregated packets on rx
path. This causes TCP to assume that the MSS is actually the size of the
aggregated packet, which is invalid.
Although the proper fix is to be done at each driver, it's often hard
and cumbersome for one to debug, come to such root cause and report/fix
it.
This patch amends this situation in two ways. First, it adds a warning
on when this situation occurs, so it gives a hint to those trying to
debug this. It also limit the maximum probed MSS to the adverised MSS,
as it should never be any higher than that.
The result is that the connection may not have the best performance ever
but it shouldn't stall, and the admin will have a hint on what to look
for.
Tested with virtio by forcing gso_size to 0.
v2: updated msg per David's suggestion
v3: use skb_iif to find the interface and also log its name, per Eric
Dumazet's suggestion. As the skb may be backlogged and the interface
gone by then, we need to check if the number still has a meaning.
v4: use helper tcp_gro_dev_warn() and avoid pr_warn_once inside __once, per
David's suggestion
Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple
flows, we can deduct the relative qdisc delays (think of fq pacing).
This should work even if we have one flow per remote peer."
Having random per flow (or host) offsets doesn't allow that anymore so add
a way to turn this off.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
jiffies based timestamps allow for easy inference of number of devices
behind NAT translators and also makes tracking of hosts simpler.
commit ceaa1fef65 ("tcp: adding a per-socket timestamp offset")
added the main infrastructure that is needed for per-connection ts
randomization, in particular writing/reading the on-wire tcp header
format takes the offset into account so rest of stack can use normal
tcp_time_stamp (jiffies).
So only two items are left:
- add a tsoffset for request sockets
- extend the tcp isn generator to also return another 32bit number
in addition to the ISN.
Re-use of ISN generator also means timestamps are still monotonically
increasing for same connection quadruple, i.e. PAWS will still work.
Includes fixes from Eric Dumazet.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch measures the amount of time when TCP runs out of new data
to send to the network due to insufficient send buffer, while TCP
is still busy delivering (i.e. write queue is not empty). The goal
is to indicate either the send buffer autotuning or user SO_SNDBUF
setting has resulted network under-utilization.
The measurement starts conservatively by checking various conditions
to minimize false claims (i.e. under-estimation is more likely).
The measurement stops when the SOCK_NOSPACE flag is cleared. But it
does not account the time elapsed till the next application write.
Also the measurement only starts if the sender is still busy sending
data, s.t. the limit accounted is part of the total busy time.
Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch measures TCP busy time, which is defined as the period
of time when sender has data (or FIN) to send. The time starts when
data is buffered and stops when the write queue is flushed by ACKs
or error events.
Note the busy time does not include SYN time, unless data is
included in SYN (i.e. Fast Open). It does include FIN time even
if the FIN carries no payload. Excluding pure FIN is possible but
would incur one additional test in the fast path, which may not
be worth it.
Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh,
which un-does reno halving behaviour.
It seems more appropriate to let congctl algorithms pair .ssthresh
and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it
up for all congestion algorithms that used to rely on the fallback.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
We had various problems in the past in tcp_get_info() and used
specific synchronization to avoid deadlocks.
We would like to add more instrumentation points for TCP, and
avoiding grabing socket lock in tcp_getinfo() was too costly.
Being able to lock the socket allows to provide consistent set
of fields.
inet_diag_dump_icsk() can make sure ehash locks are not
held any more when tcp_get_info() is called.
We can remove syncp added in commit d654976cbf
("tcp: fix a potential deadlock in tcp_get_info()"), but we need
to use lock_sock_fast() instead of spin_lock_bh() since TCP input
path can now be run from process context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Per listen(fd, backlog) rules, there is really no point accepting a SYN,
sending a SYNACK, and dropping the following ACK packet if accept queue
is full, because application is not draining accept queue fast enough.
This behavior is fooling TCP clients that believe they established a
flow, while there is nothing at server side. They might then send about
10 MSS (if using IW10) that will be dropped anyway while server is under
stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/netfilter/core.c
net/netfilter/nf_tables_netdev.c
Resolve two conflicts before pull request for David's net-next tree:
1) Between c73c248490 ("netfilter: nf_tables_netdev: remove redundant
ip_hdr assignment") from the net tree and commit ddc8b6027a
("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").
2) Between e8bffe0cf9 ("net: Add _nf_(un)register_hooks symbols") and
Aaron Conole's patches to replace list_head with single linked list.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The introduction of TCP_NEW_SYN_RECV state, and the addition of request
sockets to the ehash table seems to have broken the --transparent option
of the socket match for IPv6 (around commit a9407000).
Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the
listener, the --transparent option tries to match on the no_srccheck flag
of the request socket.
Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req()
by copying the transparent flag of the listener socket. This effectively
causes '-m socket --transparent' not match on the ACK packet sent by the
client in a TCP handshake.
Based on the suggestion from Eric Dumazet, this change moves the code
initializing no_srccheck to tcp_conn_request(), rendering the above
scenario working again.
Fixes: a940700003 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV support")
Signed-off-by: Alex Badics <alex.badics@balabit.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If DBGUNDO() is enabled (FASTRETRANS_DEBUG > 1), a compile
error will happen, since inet6_sk(sk)->daddr became sk->sk_v6_daddr
Fixes: efe4208f47 ("ipv6: make lookups simpler and faster")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the TFO socket is accepted right off SYN-data, the socket
owner can call getsockopt(TCP_INFO) to collect ongoing SYN-ACK
retransmission or timeout stats (i.e., tcpi_total_retrans,
tcpi_retransmits). Currently those stats are only updated
upon handshake completes. This patch fixes it.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.
This function can also be used by a congestion control that prefers
not to use the default cwnd reduction approach (i.e., the PRR
algorithm) during CA_Recovery to control the cwnd and sending rate
during loss recovery.
We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.
With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase in the CA_Recovery.
For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.
This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specificy this callback, TCP continues to
use the previous default of 2.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch generates data delivery rate (throughput) samples on a
per-ACK basis. These rate samples can be used by congestion control
modules, and specifically will be used by TCP BBR in later patches in
this series.
Key state:
tp->delivered: Tracks the total number of data packets (original or not)
delivered so far. This is an already-existing field.
tp->delivered_mstamp: the last time tp->delivered was updated.
Algorithm:
A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:
d1: the current tp->delivered after processing the ACK
t1: the current time after processing the ACK
d0: the prior tp->delivered when the acked skb was transmitted
t0: the prior tp->delivered_mstamp when the acked skb was transmitted
When an skb is transmitted, we snapshot d0 and t0 in its control
block in tcp_rate_skb_sent().
When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
to reflect the latest (d0, t0).
Finally, tcp_rate_gen() generates a rate sample by storing
(d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.
One caveat: if an skb was sent with no packets in flight, then
tp->delivered_mstamp may be either invalid (if the connection is
starting) or outdated (if the connection was idle). In that case,
we'll re-stamp tp->delivered_mstamp.
At first glance it seems t0 should always be the time when an skb was
transmitted, but actually this could over-estimate the rate due to
phase mismatch between transmit and ACK events. To track the delivery
rate, we ensure that if packets are in flight then t0 and and t1 are
times at which packets were marked delivered.
If the initial and final RTTs are different then one may be corrupted
by some sort of noise. The noise we see most often is sending gaps
caused by delayed, compressed, or stretched acks. This either affects
both RTTs equally or artificially reduces the final RTT. We approach
this by recording the info we need to compute the initial RTT
(duration of the "send phase" of the window) when we recorded the
associated inflight. Then, for a filter to avoid bandwidth
overestimates, we generalize the per-sample bandwidth computation
from:
bw = delivered / ack_phase_rtt
to the following:
bw = delivered / max(send_phase_rtt, ack_phase_rtt)
In large-scale experiments, this filtering approach incorporating
send_phase_rtt is effective at avoiding bandwidth overestimates due to
ACK compression or stretched ACKs.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Count the number of packets that a TCP connection marks lost.
Congestion control modules can use this loss rate information for more
intelligent decisions about how fast to send.
Specifically, this is used in TCP BBR policer detection. BBR uses a
high packet loss rate as one signal in its policer detection and
policer bandwidth estimation algorithm.
The BBR policer detection algorithm cannot simply track retransmits,
because a retransmit can be (and often is) an indicator of packets
lost long, long ago. This is particularly true in a long CA_Loss
period that repairs the initial massive losses when a policer kicks
in.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the TCP min_rtt code to reuse the new win_minmax library in
lib/win_minmax.c to simplify the TCP code.
This is a pure refactor: the functionality is exactly the same. We
just moved the windowed min code to make TCP easier to read and
maintain, and to allow other parts of the kernel to use the windowed
min/max filter code.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When skb replaces another one in ooo queue, I forgot to also
update tp->ooo_last_skb as well, if the replaced skb was the last one
in the queue.
To fix this, we simply can re-use the code that runs after an insertion,
trying to merge skbs at the right of current skb.
This not only fixes the bug, but also remove all small skbs that might
be a subset of the new one.
Example:
We receive segments 2001:3001, 4001:5001
Then we receive 2001:8001 : We should replace 2001:3001 with the big
skb, but also remove 4001:50001 from the queue to save space.
packetdrill test demonstrating the bug
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
+0.100 < . 1:1(0) ack 1 win 1024
+0 accept(3, ..., ...) = 4
+0.01 < . 1001:2001(1000) ack 1 win 1024
+0 > . 1:1(0) ack 1 <nop,nop, sack 1001:2001>
+0.01 < . 1001:3001(2000) ack 1 win 1024
+0 > . 1:1(0) ack 1 <nop,nop, sack 1001:2001 1001:3001>
Fixes: 9f5afeae51 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem noticed that we could avoid an rbtree lookup if the
the attempt to coalesce incoming skb to the last skb failed
for some reason.
Since most ooo additions are at the tail, this is definitely
worth adding a test and fast path.
Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.
Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.
Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.
However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.
This patch converts it to a RB tree, to get bounded latencies.
Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.
Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)
Next step would be to also use an RB tree for the write queue at sender
side ;)
Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Over the years, TCP BDP has increased a lot, and is typically
in the order of ~10 Mbytes with help of clever Congestion Control
modules.
In presence of packet losses, TCP stores incoming packets into an out of
order queue, and number of skbs sitting there waiting for the missing
packets to be received can match the BDP (~10 Mbytes)
In some cases, TCP needs to make room for incoming skbs, and current
strategy can simply remove all skbs in the out of order queue as a last
resort, incurring a huge penalty, both for receiver and sender.
Unfortunately these 'last resort events' are quite frequent, forcing
sender to send all packets again, stalling the flow and wasting a lot of
resources.
This patch cleans only a part of the out of order queue in order
to meet the memory constraints.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: C. Stephen Gun <csg@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull security subsystem updates from James Morris:
"Highlights:
- TPM core and driver updates/fixes
- IPv6 security labeling (CALIPSO)
- Lots of Apparmor fixes
- Seccomp: remove 2-phase API, close hole where ptrace can change
syscall #"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (156 commits)
apparmor: fix SECURITY_APPARMOR_HASH_DEFAULT parameter handling
tpm: Add TPM 2.0 support to the Nuvoton i2c driver (NPCT6xx family)
tpm: Factor out common startup code
tpm: use devm_add_action_or_reset
tpm2_i2c_nuvoton: add irq validity check
tpm: read burstcount from TPM_STS in one 32-bit transaction
tpm: fix byte-order for the value read by tpm2_get_tpm_pt
tpm_tis_core: convert max timeouts from msec to jiffies
apparmor: fix arg_size computation for when setprocattr is null terminated
apparmor: fix oops, validate buffer size in apparmor_setprocattr()
apparmor: do not expose kernel stack
apparmor: fix module parameters can be changed after policy is locked
apparmor: fix oops in profile_unpack() when policy_db is not present
apparmor: don't check for vmalloc_addr if kvzalloc() failed
apparmor: add missing id bounds check on dfa verification
apparmor: allow SYS_CAP_RESOURCE to be sufficient to prlimit another task
apparmor: use list_next_entry instead of list_entry_next
apparmor: fix refcount race when finding a child profile
apparmor: fix ref count leak when profile sha1 hash is read
apparmor: check that xindex is in trans_table bounds
...
The per-socket rate limit for 'challenge acks' was introduced in the
context of limiting ack loops:
commit f2b2c582e8 ("tcp: mitigate ACK loops for connections as tcp_sock")
And I think it can be extended to rate limit all 'challenge acks' on a
per-socket basis.
Since we have the global tcp_challenge_ack_limit, this patch allows for
tcp_challenge_ack_limit to be set to a large value and effectively rely on
the per-socket limit, or set tcp_challenge_ack_limit to a lower value and
still prevents a single connections from consuming the entire challenge ack
quota.
It further moves in the direction of eliminating the global limit at some
point, as Eric Dumazet has suggested. This a follow-up to:
Subject: tcp: make challenge acks less predictable
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Yue Cao <ycao009@ucr.edu>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yue Cao claims that current host rate limiting of challenge ACKS
(RFC 5961) could leak enough information to allow a patient attacker
to hijack TCP sessions. He will soon provide details in an academic
paper.
This patch increases the default limit from 100 to 1000, and adds
some randomization so that the attacker can no longer hijack
sessions without spending a considerable amount of probes.
Based on initial analysis and patch from Linus.
Note that we also have per socket rate limiting, so it is tempting
to remove the host limit in the future.
v2: randomize the count of challenge acks per second, not the period.
Fixes: 282f23c6ee ("tcp: implement RFC 5961 3.2")
Reported-by: Yue Cao <ycao009@ucr.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If set, these will take precedence over the parent's options during
both sending and child creation. If they're not set, the parent's
options (if any) will be used.
This is to allow the security_inet_conn_request() hook to modify the
IPv6 options in just the same way that it already may do for IPv4.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Add in_flight (bytes in flight when packet was sent) field
to tx component of tcp_skb_cb and make it available to
congestion modules' pkts_acked() function through the
ack_sample function argument.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 5961 advises to only accept RST packets containing a seq number
matching the next expected seq number instead of the whole receive
window in order to avoid spoofing attacks.
However, this situation is not optimal in the case SACK is in use at the
time the RST is sent. I recently run into a scenario in which packet
losses were high while uploading data to a server, and userspace was
willing to frequently terminate connections by sending a RST. In
this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting
for a lost packet retransmission and SACK blocks are used to let the
client continue uploading data. At some point later on, the client sends
the RST (snd_nxt), which matches the next expected seq number of the
right-most SACK block on the receiver side which is going forward
receiving data.
In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the
frozen main ACK at receiver side and thus gets dropped and a challenge
ACK is sent, which gets usually lost due to network conditions. The main
consequence is that the connection stays alive for a while even if it
made sense to accept the RST. This can get really bad if lots of
connections like this one are created in few seconds, allocating all the
resources of the server easily.
For security reasons, not all SACK blocks are checked (there could be a
big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it
wouldn't make sense to check for RST in blocks other than the right-most
received one because the sender is not expected to be sending new data
after the RST. For simplicity, only up to the 4 most recently updated
SACK blocks (selective_acks[4] field) are compared to find the
right-most block, as usually those are the ones with bigger probability
to contain it.
This patch was tested in a 3.18 kernel and probed to improve the
situation in the scenario described above.
Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace 2 arguments (cnt and rtt) in the congestion control modules'
pkts_acked() function with a struct. This will allow adding more
information without having to modify existing congestion control
modules (tcp_nv in particular needs bytes in flight when packet
was sent).
As proposed by Neal Cardwell in his comments to the tcp_nv patch.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.
This triggers a lockdep splat on 32bit & SMP builds.
We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.
I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam <festevam@gmail.com>
Diagnosed-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
AFAIK, nothing in current TCP stack absolutely wants BH
being disabled once socket is owned by a thread running in
process context.
As mentioned in my prior patch ("tcp: give prequeue mode some care"),
processing a batch of packets might take time, better not block BH
at all.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to to make TCP stack preemptible, as draining prequeue
and backlog queues can take lot of time.
Many SNMP updates were assuming that BH (and preemption) was disabled.
Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()
Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
SKBTX_ACK_TSTAMP flag redundant.
Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
everywhere.
Note that this frees one bit in shinfo->tx_flags.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We now have proper per-listener but also per network namespace counters
for SYN packets that might be dropped.
We replace the kfree_skb() by consume_skb() to be drop monitor [1]
friendly, and remove an obsolete comment.
FastOpen SYN packets can carry payload in them just fine.
[1] perf record -a -g -e skb:kfree_skb sleep 1; perf report
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.
Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
- Less memory overhead, because write queues have less skbs
- Less cpu overhead at ACK processing.
- Better SACK processing, as lot of studies mentioned how
awful linux was at this ;)
- Less cpu overhead to send the rtx packets
(IP stack traversal, netfilter traversal, drivers...)
- Better latencies in presence of losses.
- Smaller spikes in fq like packet schedulers, as retransmits
are not constrained by TCP Small Queues.
1 % packet losses are common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to 1-MSS train of packets, at the time hosts are already
under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were two cases of simple overlapping changes,
nothing serious.
In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
Last known hot point during SYNFLOOD attack is the clearing
of rx_opt.saw_tstamp in tcp_rcv_state_process()
It is not needed for a listener, so we move it where it matters.
Performance while a SYNFLOOD hits a single listener socket
went from 5 Mpps to 6 Mpps on my test server (24 cores, 8 NIC RX queues)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When removing sk_refcnt manipulation on synflood, I missed that
using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
transitioned to 0.
We should hold sk_refcnt instead, but this is a big deal under attack.
(Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)
In this patch, I chose to not attach a socket to syncookies skb.
Performance is now 5 Mpps instead of 3.2 Mpps.
Following patch will remove last known false sharing in
tcp_rcv_state_process()
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Goal: packets dropped by a listener are accounted for.
This adds tcp_listendrop() helper, and clears sk_drops in sk_clone_lock()
so that children do not inherit their parent drop count.
Note that we no longer increment LINUX_MIB_LISTENDROPS counter when
sending a SYNCOOKIE, since the SYN packet generated a SYNACK.
We already have a separate LINUX_MIB_SYNCOOKIESSENT
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now ss can report sk_drops, we can instruct TCP to increment
this per socket counter when it drops an incoming frame, to refine
monitoring and debugging.
Following patch takes care of listeners drops.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips socket that do not have
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
implemented based on an implicit assumption that the
SOF_TIMESTAMPING_TX_ACK is set via socket options for the
duration that ACK timestamps are needed.
To implement per-write timestamps, this check should be
removed and replaced with a per-packet alternative that
quickly skips packets missing ACK timestamps marks without
a cache-line miss.
To enable per-packet marking without a cache line miss, use
one bit in TCP_SKB_CB to mark a whether a SKB might need a
ack tx timestamp or not. Further checks in tcp_ack_tstamp are not
modified and work as before.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For non-SACK connections, cwnd is lowered to inflight plus 3 packets
when the recovery ends. This is an optional feature in the NewReno
RFC 2582 to reduce the potential burst when cwnd is "re-opened"
after recovery and inflight is low.
This feature is questionably effective because of PRR: when
the recovery ends (i.e., snd_una == high_seq) NewReno holds the
CA_Recovery state for another round trip to prevent false fast
retransmits. But if the inflight is low, PRR will overwrite the
moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a
receiver responds bogus ACKs (i.e., acking future data) to speed up
transfer after recovery, it can only induce a burst up to a window
worth of data packets by acking up to SND.NXT. A restart from (short)
idle or receiving streched ACKs can both cause such bursts as well.
On the other hand, if the recovery ends because the sender
detects the losses were spurious (e.g., reordering). This feature
unconditionally lowers a reverted cwnd even though nothing
was lost.
By principle loss recovery module should not update cwnd. Further
pacing is much more effective to reduce burst. Hence this patch
removes the cwnd moderation feature.
v2 changes: revised commit message on bogus ACKs and burst, and
missing signature
Signed-off-by: Matt Mathis <mattmathis@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c
All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some cases where rtt_us derives from deltas of jiffies,
instead of using usec timestamps.
Since we want to track minimal rtt, better to assume a delta of 0 jiffie
might be in fact be very close to 1 jiffie.
It is kind of sad jiffies_to_usecs(1) calls a function instead of simply
using a constant.
Fixes: f672258391 ("tcp: track min RTT using windowed min-filter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor and consolidate cwnd and rate updates into a new function
tcp_cong_control().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change enables congestion control to update cwnd based on
not only packet cumulatively acked but also packets delivered
out-of-order. This makes congestion control robust against packet
reordering because it may raise cwnd as long as packets are being
delivered once reordering has been detected (i.e., it only cares
the amount of packets delivered, not the ordering among them).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A small refactoring that gets number of packets cumulatively acked
from tcp_clean_rtx_queue() directly.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the accounting of how many packets are
newly acked or sacked when the sender receives an ACK.
The current approach basically computes
newly_acked_sacked = (prior_packets - prior_sacked) -
(tp->packets_out - tp->sacked_out)
where prior_packets and prior_sacked out are snapshot
at the beginning of the ACK processing.
The new approach tracks the delivery information via a new
TCP state variable "delivered" which monotically increases
as new packets are delivered in order or out-of-order.
The reason for this change is that the current approach is
brittle that produces negative or inaccurate estimate.
1) For non-SACK connections, an ACK that advances the SND.UNA
could reset the DUPACK counters (tp->sacked_out) in
tcp_process_loss() or tcp_fastretrans_alert(). This inflates
the inflight suddenly and causes under-estimate or even
negative estimate. Here is a real example:
before after (processing ACK)
packets_out 75 73
sacked_out 23 0
ca state Loss Open
The old approach computes (75-23) - (73 - 0) = -21 delivered
while the new approach computes 1 delivered since it
considers the 2nd-24th packets are delivered OOO.
2) MSS change would re-count packets_out and sacked_out so
the estimate is in-accurate and can even become negative.
E.g., the inflight is doubled when MSS is halved.
3) Spurious retransmission signaled by DSACK is not accounted
The new approach is simpler and more robust. For SACK connections,
tp->delivered increments as packets are being acked or sacked in
SACK and ACK processing.
For non-sack connections, it's done in tcp_remove_reno_sacks() and
tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered
is incremented by the number of packets ACKed (less the current
number of DUPACKs received plus one packet hole). Upon receiving
a DUPACK, tp->delivered is incremented assuming one out-of-order
packet is delivered.
Upon receiving a DSACK, tp->delivered is incremtened assuming one
retransmission is delivered in tcp_sacktag_write_queue().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the cwnd is reduced and increased in various different
places. The reduction happens in various places in the recovery
state processing (tcp_fastretrans_alert) while the increase
happens afterward.
A better sequence is to identify lost packets and update
the congestion control state (icsk_ca_state) first. Then base
on the new state, up/down the cwnd in one central place. It's
more clear to reason cwnd changes.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The retransmission and F-RTO transmission currently happen inside
recovery state processing (tcp_fastretrans_alert) but before
congestion control. This refactoring moves the logic after both
s.t. we can determine how much to send (cwnd) before deciding what to
send.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we acknowledge a FIN, it is not enough to ack the sequence number
and queue the skb into receive queue. We also have to call tcp_fin()
to properly update socket state and send proper poll() notifications.
It seems we also had the problem if we received a SYN packet with the
FIN flag set, but it does not seem an urgent issue, as no known
implementation can do that.
Fixes: 61d2bcae99 ("tcp: fastopen: accept data/FIN present in SYNACK message")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message
MAY include data and/or FIN
This patch adds support for the client side :
If we receive a SYNACK with payload or FIN, queue the skb instead
of ignoring it.
Since we already support the same for SYN, we refactor the existing
code and reuse it. Note we need to clone the skb, so this operation
might fail under memory pressure.
Sara Dickinson pointed out FreeBSD server Fast Open implementation
was planned to generate such SYNACK in the future.
The server side might be implemented on linux later.
Reported-by: Sara Dickinson <sara@sinodun.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 4015 section 3.4 says the TCP sender MUST refrain from
reversing the congestion control state when the ACK signals
congestion through the ECN-Echo flag. Currently we may not
always do that when prior_ssthresh is reset upon receiving
ACKs with ECE marks. This patch fixes that.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit fixes a corner case in tcp_mark_head_lost() which was
causing the WARN_ON(len > skb->len) in tcp_fragment() to fire.
tcp_mark_head_lost() was assuming that if a packet has
tcp_skb_pcount(skb) of N, then it's safe to fragment off a prefix of
M*mss bytes, for any M < N. But with the tricky way TCP pcounts are
maintained, this is not always true.
For example, suppose the sender sends 4 1-byte packets and have the
last 3 packet sacked. It will merge the last 3 packets in the write
queue into an skb with pcount = 3 and len = 3 bytes. If another
recovery happens after a sack reneging event, tcp_mark_head_lost()
may attempt to split the skb assuming it has more than 2*MSS bytes.
This sounds very counterintuitive, but as the commit description for
the related commit c0638c247f ("tcp: don't fragment SACKed skbs in
tcp_mark_head_lost()") notes, this is because tcp_shifted_skb()
coalesces adjacent regions of SACKed skbs, and when doing this it
preserves the sum of their packet counts in order to reflect the
real-world dynamics on the wire. The c0638c247f commit tried to
avoid problems by not fragmenting SACKed skbs, since SACKed skbs are
where the non-proportionality between pcount and skb->len/mss is known
to be possible. However, that commit did not handle the case where
during a reneging event one of these weird SACKed skbs becomes an
un-SACKed skb, which tcp_mark_head_lost() can then try to fragment.
The fix is to simply mark the entire skb lost when this happens.
This makes the recovery slightly more aggressive in such corner
cases before we detect reordering. But once we detect reordering
this code path is by-passed because FACK is disabled.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patch 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode
conditionally") introduced a bug that cwnd may become 0 when both
inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
to a div-by-zero if the connection starts another cwnd reduction
phase by setting tp->prior_cwnd to the current cwnd (0) in
tcp_init_cwnd_reduction().
To prevent this we skip PRR operation when nothing is acked or
sacked. Then cwnd must be positive in all cases as long as ssthresh
is positive:
1) The proportional reduction mode
inflight > ssthresh > 0
2) The reduction bound mode
a) inflight == ssthresh > 0
b) inflight < ssthresh
sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh
Therefore in all cases inflight and sndcnt can not both be 0.
We check invalid tp->prior_cwnd to avoid potential div0 bugs.
In reality this bug is triggered only with a sequence of less common
events. For example, the connection is terminating an ECN-triggered
cwnd reduction with an inflight 0, then it receives reordered/old
ACKs or DSACKs from prior transmission (which acks nothing). Or the
connection is in fast recovery stage that marks everything lost,
but fails to retransmit due to local issues, then receives data
packets from other end which acks nothing.
Fixes: 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode conditionally")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow accepted sockets to derive their sk_bound_dev_if setting from the
l3mdev domain in which the packets originated. A sysctl setting is added
to control the behavior which is similar to sk_mark and
sysctl_tcp_fwmark_accept.
This effectively allow a process to have a "VRF-global" listen socket,
with child sockets bound to the VRF device in which the packet originated.
A similar behavior can be achieved using sk_mark, but a solution using marks
is incomplete as it does not handle duplicate addresses in different L3
domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
domain provides a complete solution.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry provided a syzkaller (http://github.com/google/syzkaller)
generated program that triggers the WARNING at
net/ipv4/tcp.c:1729 in tcp_recvmsg() :
WARN_ON(tp->copied_seq != tp->rcv_nxt &&
!(flags & (MSG_PEEK | MSG_TRUNC)));
His program is specifically attempting a Cross SYN TCP exchange,
that we support (for the pleasure of hackers ?), but it looks we
lack proper tcp->copied_seq initialization.
Thanks again Dmitry for your report and testings.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_send_rcvq() is used for re-injecting data into tcp receive queue.
Problems :
- No check against size is performed, allowed user to fool kernel in
attempting very large memory allocations, eventually triggering
OOM when memory is fragmented.
- In case of fault during the copy we do not return correct errno.
Lets use alloc_skb_with_frags() to cook optimal skbs.
Fixes: 292e8d8c85 ("tcp: Move rcvq sending to tcp_input.c")
Fixes: c0e88ff0f2 ("tcp: Repair socket queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.
tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.
The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.
Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.
We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is the first half of the RACK loss recovery.
RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.
But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery
RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.
Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.
This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.
Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.
We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.
The second half is implemented in the next patch that marks packet
lost using RACK timestamp.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a helper to prepare the main RACK patch
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the existing lost retransmit detection because RACK subsumes
it completely. This also stops the overloading the ack_seq field of
the skb control block.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.
Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.
Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code
The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.
The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently ca_seq_rtt_us does not use Kern's check. Fix that by
checking if any packet acked is a retransmit, for both RTT used
for RTT estimation and congestion control.
Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At the time of commit fff3269907 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had little ways to cope with SYN floods.
We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.
Note that all SYNACK retransmits were picking TX queue 0, this no longer
is a win given that SYNACK rtx are now distributed on all cpus.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One 32bit hole is following skc_refcnt, use it.
skc_incoming_cpu can also be an union for request_sock rcv_wnd.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_reqsk_alloc() is used to allocate a temporary request
in order to generate a SYNACK with a cookie. Then later,
syncookie validation also uses a temporary request.
These paths already took a reference on listener refcount,
we can avoid a couple of atomic operations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are multiple races that need fixes :
1) skb_get() + queue skb + kfree_skb() is racy
An accept() can be done on another cpu, data consumed immediately.
tcp_recvmsg() uses __kfree_skb() as it is assumed all skb found in
socket receive queue are private.
Then the kfree_skb() in tcp_rcv_state_process() uses an already freed skb
2) tcp_reqsk_record_syn() needs to be done before tcp_try_fastopen()
for the same reasons.
3) We want to send the SYNACK before queueing child into accept queue,
otherwise we might reintroduce the ooo issue fixed in
commit 7c85af8810 ("tcp: avoid reorders for TFO passive connections")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.
(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)
By attaching the skb to the request socket, we remove this
source of contention.
Tested:
listen(fd, 10485760); // single listener (no SO_REUSEPORT)
16 RX/TX queue NIC
Sustain a SYNFLOOD attack of ~320,000 SYN per second,
Sending ~1,400,000 SYNACK per second.
Perf profiles now show listener spinlock being next bottleneck.
20.29% [kernel] [k] queued_spin_lock_slowpath
10.06% [kernel] [k] __inet_lookup_established
5.12% [kernel] [k] reqsk_timer_handler
3.22% [kernel] [k] get_next_timer_interrupt
3.00% [kernel] [k] tcp_make_synack
2.77% [kernel] [k] ipt_do_table
2.70% [kernel] [k] run_timer_softirq
2.50% [kernel] [k] ip_finish_output
2.04% [kernel] [k] cascade
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.
ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.
In nominal conditions, this halves pressure on listener lock.
Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.
We will shrink listen_sock in the following patch to ease
code review.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
long term plan is to remove struct listen_sock when its hash
table is no longer there.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_syn_flood_action() will soon be called with unlocked socket.
In order to avoid SYN flood warning being emitted multiple times,
use xchg().
Extend max_qlen_log and synflood_warned fields in struct listen_sock
to u32
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factorize code to get tcp header from skb. It makes no sense
to duplicate code in callers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once we realize tcp_rcv_synsent_state_process() does not use
its 'len' argument and we get rid of it, then it becomes clear
this argument is no longer used in tcp_rcv_state_process()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We found that a TCP Fast Open passive connection was vulnerable
to reorders, as the exchange might look like
[1] C -> S S <FO ...> <request>
[2] S -> C S. ack request <options>
[3] S -> C . <answer>
packets [2] and [3] can be generated at almost the same time.
If C receives the 3rd packet before the 2nd, it will drop it as
the socket is in SYN_SENT state and expects a SYNACK.
S will have to retransmit the answer.
Current OOO avoidance in linux is defeated because SYNACK
packets are attached to the LISTEN socket, while DATA packets
are attached to the children. They might be sent by different cpus,
and different TX queues might be selected.
It turns out that for TFO, we created a child, which is a
full blown socket in TCP_SYN_RECV state, and we simply can attach
the SYNACK packet to this socket.
This means that at the time tcp_sendmsg() pushes DATA packet,
skb->ooo_okay will be set iff the SYNACK packet had been sent
and TX completed.
This removes the reorder source at the host level.
We also removed the export of tcp_try_fastopen(), as it is no
longer called from IPv6.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
RTT is often measured as 0ms or sometimes 1ms, which would affect
RTT estimation and min RTT samping used by some congestion control.
This patch improves SYN/ACK RTT to be usec resolution if platform
supports it. While the timestamping of SYN/ACK is done in request
sock, the RTT measurement is carefully arranged to avoid storing
another u64 timestamp in tcp_sock.
For regular handshake w/o SYNACK retransmission, the RTT is sampled
right after the child socket is created and right before the request
sock is released (tcp_check_req() in tcp_minisocks.c)
For Fast Open the child socket is already created when SYN/ACK was
sent, the RTT is sampled in tcp_rcv_state_process() after processing
the final ACK an right before the request socket is released.
If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
on TCP timestamps to measure the RTT. The sample is taken at the
same place in tcp_rcv_state_process() after the timestamp values
are validated in tcp_validate_incoming(). Note that we do not store
TS echo value in request_sock for SYN-cookies, because the value
is already stored in tp->rx_opt used by tcp_ack_update_rtt().
One side benefit is that the RTT measurement now happens before
initializing congestion control (of the passive side). Therefore
the congestion control can use the SYN/ACK RTT.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit b73c3d0e4f ("net: Save TX flow hash in sock and set in skbuf
on xmit"), Tom provided a l4 hash to most outgoing TCP packets.
We'd like to provide one as well for SYNACK packets, so that all packets
of a given flow share same txhash, to later enable bonding driver to
also use skb->hash to perform slave selection.
Note that a SYNACK retransmit shuffles the tx hash, as Tom did
in commit 265f94ff54 ("net: Recompute sk_txhash on negative routing
advice") for established sockets.
This has nice effect making TCP flows resilient to some kind of black
holes, even at connection establish phase.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the following case doesn't use DCTCP, even if it should:
A responder has f.e. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.
We were thinking on how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: lets cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
to only do a single metric feature lookup inside tcp_ecn_create_request().
Joint work with Florian Westphal.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP pacing was added back in linux-3.12, we chose
to apply a fixed ratio of 200 % against current rate,
to allow probing for optimal throughput even during
slow start phase, where cwnd can be doubled every other gRTT.
At Google, we found it was better applying a different ratio
while in Congestion Avoidance phase.
This ratio was set to 120 %.
We've used the normal tcp_in_slow_start() helper for a while,
then tuned the condition to select the conservative ratio
as soon as cwnd >= ssthresh/2 :
- After cwnd reduction, it is safer to ramp up more slowly,
as we approach optimal cwnd.
- Initial ramp up (ssthresh == INFINITY) still allows doubling
cwnd every other RTT.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.
With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.
Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.
As Neal pointed out, we also need to perform the check
if/when receive window opens.
Tested:
Following packetdrill test demonstrates the problem
// Test of slow start after idle
`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
+0 write(4, ..., 26000) = 26000
+0 > . 1:5001(5000) ack 1
+0 > . 5001:10001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10 }%
+.100 < . 1:1(0) ack 10001 win 511
+0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0 > . 10001:20001(10000) ack 1
+0 > P. 20001:26001(6000) ack 1
+.100 < . 1:1(0) ack 26001 win 511
+0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0 > . 26001:31001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0 > . 31001:36001(5000) ack 1
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bridge/br_mdb.c
br_mdb.c conflict was a function call being removed to fix a bug in
'net' but whose signature was changed in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently F-RTO may repeatedly send new data packets on non-recurring
timeouts in CA_Loss mode. This is a bug because F-RTO (RFC5682)
should only be used on either new recovery or recurring timeouts.
This exacerbates the recovery progress during frequent timeout &
repair, because we prioritize sending new data packets instead of
repairing the holes when the bandwidth is already scarce.
Fix it by correcting the test of a new recovery episode.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The congestion state and cwnd can be updated in the wrong order.
For example, upon receiving a dubious ACK, we incorrectly raise
the cwnd first (tcp_may_raise_cwnd()/tcp_cong_avoid()) because
the state is still Open, then enter recovery state to reduce cwnd.
For another example, if the ACK indicates spurious timeout or
retransmits, we first revert the cwnd reduction and congestion
state back to Open state. But we don't raise the cwnd even though
the ACK does not indicate any congestion.
To fix this problem we should first call tcp_fastretrans_alert() to
process the dubious ACK and update the congestion state, then call
tcp_may_raise_cwnd() that raises cwnd based on the current state.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
V1 of this patch contains Eric Dumazet's suggestion to move the per
dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric.
I ran some tests and after setting the "ip route change quickack 1"
knob there were still many delayed ACKs sent. This occured
because when icsk_ack.quick=0 the !icsk_ack.pingpong value is
subsequently ignored as tcp_in_quickack_mode() checks both these
values. The condition for a quick ack to trigger requires
that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently
only icsk_ack.pingpong is controlled by the knob. But the
icsk_ack.quick value changes dynamically depending on heuristics.
The crux of the matter is that delayed acks still cannot be entirely
disabled even with the RTAX_QUICKACK per dst knob enabled. This
patch ensures that a quick ack is always sent when the RTAX_QUICKACK
per dst knob is turned on.
The "ip route change quickack 1" knob was recently added to enable
quickacks. It was modeled around the TCP_QUICKACK setsockopt() option.
This issue is that even with "ip route change quickack 1" enabled
we still see delayed ACKs under some conditions. It would be nice
to be able to completely disable delayed ACKs.
Here is an example:
# netstat -s|grep dela
3 delayed acks sent
For all routes enable the knob
# ip route change quickack 1
Generate some traffic across a slow link and we still see the delayed
acks.
# netstat -s|grep dela
106 delayed acks sent
1 delayed acks further delayed because of locked socket
The issue is that both the "ip route change quickack 1" knob and
the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0.
However at the business end in the __tcp_ack_snd_check() routine,
tcp_in_quickack_mode() checks that both icsk_ack.quick != 0
and icsk_ack.pingpong=0 in order to trigger a quickack. As
icsk_ack.quick is determined by heuristics it can be 0. When
that occurs the icsk_ack.pingpong value is ignored and a delayed
ACK is sent regardless.
This patch moves the RTAX_QUICKACK per dst check into the
tcp_in_quickack_mode() routine which ensures that a quickack is
always sent when the quickack knob is enabled for that dst.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PRR slow start is often too aggressive especially when drops are
caused by traffic policers. The policers mainly use token bucket
to enforce the rate so sending (twice) faster than the delivery
rate causes excessive drops.
This patch changes PRR to the conservative reduction bound
(CRB) mode in RFC 6937 by default. CRB follows the packet
conservation rule to send at most the delivery rate by default.
But if many packets are lost and the pipe is empty, CRB may take N
round trips to repair N losses. We conditionally turn on slow start
mode if all these conditions are made to speed up the recovery:
1) on the second round or later in recovery
2) retransmission sent in the previous round is delivered on this ACK
3) no retransmission is marked lost on this ACK
By using packet conservation by default, this change reduces the loss
retransmits signicantly on networks that deploy traffic policers,
up to 20% reduction of overall loss rate.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the retransmission in CA_Loss is lost again, we should not
continue to slow start or raise cwnd in congestion avoidance mode.
Instead we should enter fast recovery and use PRR to reduce cwnd,
following the principle in RFC5681:
"... or the loss of a retransmission, should be taken as two
indications of congestion and, therefore, cwnd (and ssthresh) MUST
be lowered twice in this case."
This is especially important to reduce loss when the CA_Loss
state was caused by a traffic policer dropping the entire inflight.
The CA_Loss state has a problem where a loss of L packets causes the
sender to send a burst of L packets. So a policer that's dropping
most packets in a given RTT can cause a huge retransmit storm. By
contrast, PRR includes logic to bound the number of outbound packets
that result from a given ACK. So switching to CA_Recovery on lost
retransmits in CA_Loss avoids this retransmit storm problem when
in CA_Loss.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit cd7d8498c9 ("tcp: change tcp_skb_pcount() location") we stored
gso_segs in a temporary cache hot location.
This patch does the same for gso_size.
This allows to save 2 cache line misses in tcp xmit path for
the last packet that is considered but not sent because of
various conditions (cwnd, tso defer, receiver window, TSQ...)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our goal is to touch skb_shinfo(skb) only when absolutely needed,
to avoid two cache line misses in TCP output path for last skb
that is considered but not sent because of various conditions
(cwnd, tso defer, receiver window, TSQ...)
A packet is GSO only when skb_shinfo(skb)->gso_size is not zero.
We can set skb_shinfo(skb)->gso_type to sk->sk_gso_type even for
non GSO packets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upcoming tcp_cdg uses tcp_enter_cwr() to initiate PRR. Export this
function so that CDG can be compiled as a module.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: David Hayes <davihay@ifi.uio.no>
Cc: Andreas Petlund <apetlund@simula.no>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
Taking socket spinlock in tcp_get_info() can deadlock, as
inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
while packet processing can use the reverse locking order.
We could avoid this locking for TCP_LISTEN states, but lockdep would
certainly get confused as all TCP sockets share same lockdep classes.
[ 523.722504] ======================================================
[ 523.728706] [ INFO: possible circular locking dependency detected ]
[ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
[ 523.739202] -------------------------------------------------------
[ 523.745474] ss/18032 is trying to acquire lock:
[ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
[ 523.758129]
[ 523.758129] but task is already holding lock:
[ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
[ 523.774661]
[ 523.774661] which lock already depends on the new lock.
[ 523.774661]
[ 523.782850]
[ 523.782850] the existing dependency chain (in reverse order) is:
[ 523.790326]
-> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
[ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270
[ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
[ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
[ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
[ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
[ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
[ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
[ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
[ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
[ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
[ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420
Lets use u64_sync infrastructure instead. As a bonus, 64bit
arches get optimized, as these are nop for them.
Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After sending the new data packets to probe (step 2), F-RTO may
incorrectly send more probes if the next ACK advances SND_UNA and
does not sack new packet. However F-RTO RFC 5682 probes at most
once. This bug may cause sender to always send new data instead of
repairing holes, inducing longer HoL blocking on the receiver for
the application.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Undo based on TCP timestamps should only happen on ACKs that advance
SND_UNA, according to the Eifel algorithm in RFC 3522:
Section 3.2:
(4) If the value of the Timestamp Echo Reply field of the
acceptable ACK's Timestamps option is smaller than the
value of RetransmitTS, then proceed to step (5),
Section Terminology:
We use the term 'acceptable ACK' as defined in [RFC793]. That is an
ACK that acknowledges previously unacknowledged data.
This is because upon receiving an out-of-order packet, the receiver
returns the last timestamp that advances RCV_NXT, not the current
timestamp of the packet in the DUPACK. Without checking the flag,
the DUPACK will cause tcp_packet_delayed() to return true and
tcp_try_undo_loss() will revert cwnd reduction.
Note that we check the condition in CA_Recovery already by only
calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or
tcp_try_undo_recovery() if snd_una crosses high_seq.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing tight tcp_mem settings, I found tcp sessions could be
stuck because we do not allow even one skb to be received on them.
By allowing one skb to be received, we introduce fairness and
eventuallu force memory hogs to release their allocation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the advent of small rto timers in datacenter TCP,
(ip route ... rto_min x), the following can happen :
1) Qdisc is full, transmit fails.
TCP sets a timer based on icsk_rto to retry the transmit, without
exponential backoff.
With low icsk_rto, and lot of sockets, all cpus are servicing timer
interrupts like crazy.
Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN)
and 500ms (TCP_RESOURCE_PROBE_INTERVAL)
2) Receivers can send zero windows if they don't drain their receive queue.
TCP sends zero window probes, based on icsk_rto current value, with
exponential backoff.
With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in
some cases), sender can abort in less than one or two minutes !
If receiver stops the sender, it obviously doesn't care of very tight
rto. Probability of dropping the ACK reopening the window is not
worth the risk.
Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these
events (but not normal RTO based retransmits)
A followup patch adds a new SNMP counter, as it would have helped a lot
diagnosing this issue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows a server application to get the TCP SYN headers for
its passive connections. This is useful if the server is doing
fingerprinting of clients based on SYN packet contents.
Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
The first is used on a socket to enable saving the SYN headers
for child connections. This can be set before or after the listen()
call.
The latter is used to retrieve the SYN headers for passive connections,
if the parent listener has enabled TCP_SAVE_SYN.
TCP_SAVED_SYN is read once, it frees the saved SYN headers.
The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
headers.
Original patch was written by Tom Herbert, I changed it to not hold
a full skb (and associated dst and conntracking reference).
We have used such patch for about 3 years at Google.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Invoking pkts_acked is currently conditioned on FLAG_ACKED:
receiving a cumulative ACK of new data, or ACK with SYN flag set.
Remove this condition so that CC may get RTT measurements from all SACKs.
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_sacktag_one() always picks the earliest sequence SACKed for RTT.
This might not make sense for congestion control in cases where:
1. ACKs are lost, i.e. a SACK following a lost SACK covers both
new and old segments at the receiver.
2. The receiver disregards the RFC 5681 recommendation to immediately
ACK out-of-order segments.
Give congestion control a RTT for the latest segment SACKed, which is the
most accurate RTT estimate, but preserve the conservative RTT for RTO.
Removes the call to skb_mstamp_get() in tcp_sacktag_one().
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Later patch passes two values set in tcp_sacktag_one() to
tcp_clean_rtx_queue(). Prepare passing them via struct tcp_sacktag_state.
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_mark_lost_retrans is not used when FACK is disabled. Since
tcp_update_reordering may disable FACK, it should be called first
before tcp_mark_lost_retrans.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tracks total number of payload bytes received on a TCP socket.
This is the sum of all changes done to tp->rcv_nxt
RFC4898 named this : tcpEStatsAppHCThruOctetsReceived
This is a 64bit field, and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)
Note that tp->bytes_received was placed near tp->rcv_nxt for
best data locality and minimal performance impact.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Eric Salo <salo@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tracks total number of bytes acked for a TCP socket.
This is the sum of all changes done to tp->snd_una, and allows
for precise tracking of delivered data.
RFC4898 named this : tcpEStatsAppHCThruOctetsAcked
This is a 64bit field, and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)
Note that tp->bytes_acked was placed near tp->snd_una for
best data locality and minimal performance impact.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Eric Salo <salo@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that we either see that the buffer has write space
in tcp_poll() or that we perform a wakeup from the input
side. Did not run into any actual problem here, but thought
that we should make things explicit.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since retransmitted segments are not used for RTT estimation, previously
SACKed segments present in the rtx queue are used. This estimation can be
several times larger than the actual RTT. When a cumulative ack covers both
previously SACKed and retransmitted segments, CC may thus get a bogus RTT.
Such segments previously had an RTT estimation in tcp_sacktag_one(), so it
seems reasonable to not reuse them in tcp_clean_rtx_queue() at all.
Afaik, this has had no effect on SRTT/RTO because of Karn's check.
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fast Open has been using an experimental option with a magic number
(RFC6994). This patch makes the client by default use the RFC7413
option (34) to get and send Fast Open cookies. This patch makes
the client solicit cookies from a given server first with the
RFC7413 option. If that fails to elicit a cookie, then it tries
the RFC6994 experimental option. If that also fails, it uses the
RFC7413 option on all subsequent connect attempts. If the server
returns a Fast Open cookie then the client caches the form of the
option that successfully elicited a cookie, and uses that form on
later connects when it presents that cookie.
The idea is to gradually obsolete the use of experimental options as
the servers and clients upgrade, while keeping the interoperability
meanwhile.
Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fast Open has been using the experimental option with a magic number
(RFC6994) to request and grant Fast Open cookies. This patch enables
the server to support the official IANA option 34 in RFC7413 in
addition.
The change has passed all existing Fast Open tests with both
old and new options at Google.
Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/cmd.c
net/core/fib_rules.c
net/ipv4/fib_frontend.c
The fib_rules.c and fib_frontend.c conflicts were locking adjustments
in 'net' overlapping addition and removal of code in 'net-next'.
The mlx4 conflict was a bug fix in 'net' happening in the same
place a constant was being replaced with a more suitable macro.
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv4 code uses a mixture of coding styles. In some instances check
for non-NULL pointer is done as x != NULL and sometimes as x. x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv4 code uses a mixture of coding styles. In some instances check
for NULL pointer is done as x == NULL and sometimes as !x. !x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
On processing cumulative ACKs, the FRTO code was not checking the
SACKed bit, meaning that there could be a spurious FRTO undo on a
cumulative ACK of a previously SACKed skb.
The FRTO code should only consider a cumulative ACK to indicate that
an original/unretransmitted skb is newly ACKed if the skb was not yet
SACKed.
The effect of the spurious FRTO undo would typically be to make the
connection think that all previously-sent packets were in flight when
they really weren't, leading to a stall and an RTO.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: e33099f96d ("tcp: implement RFC5682 F-RTO")
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 1fb6f159fd ("tcp: add tcp_conn_request"),
tcp_syn_flood_action() is no longer used from IPv6.
We can make it static, by moving it above tcp_conn_request()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ss should display ipv4 mapped request sockets like this :
tcp SYN-RECV 0 0 ::ffff:192.168.0.1:8080 ::ffff:192.0.2.1:35261
and not like this :
tcp SYN-RECV 0 0 192.168.0.1:8080 192.0.2.1:35261
We should init ireq->ireq_family based on listener sk_family,
not the actual protocol carried by SYN packet.
This means we can set ireq_family in inet_reqsk_alloc()
Fixes: 3f66b083a5 ("inet: introduce ireq_family")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When request sock are put in ehash table, the whole notion
of having a previous request to update dl_next is pointless.
Also, following patch will get rid of big purge timer,
so we want to delete a request sock without holding listener lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing last patch series, I found req sock refcounting was wrong.
We must set skc_refcnt to 1 for all request socks added in hashes,
but also on request sockets created by FastOpen or syncookies.
It is tricky because we need to defer this initialization so that
future RCU lookups do not try to take a refcount on a not yet
fully initialized request socket.
Also get rid of ireq_refcnt alias.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 13854e5a60 ("inet: add proper refcounting to request sock")
Signed-off-by: David S. Miller <davem@davemloft.net>
The listener field in struct tcp_request_sock is a pointer
back to the listener. We now have req->rsk_listener, so TCP
only needs one boolean and not a full pointer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once we'll be able to lookup request sockets in ehash table,
we'll need to get access to listener which created this request.
This avoid doing a lookup to find the listener, which benefits
for a more solid SO_REUSEPORT, and is needed once we no
longer queue request sock into a listener private queue.
Note that 'struct tcp_request_sock'->listener could be reduced
to a single bit, as TFO listener should match req->rsk_listener.
TFO will no longer need to hold a reference on the listener.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_reqsk_alloc() is becoming fat and should not be inlined.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
listener socket can be used to set net pointer, and will
be later used to hold a reference on listener.
Add a const qualifier to first argument (struct request_sock_ops *),
and factorize all write_pnet(&ireq->ireq_net, sock_net(sk));
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_oow_rate_limited() is hardly used in fast path, there is
no point inlining it.
Signed-of-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This big helper is called once from tcp_conn_request(), there is no
point having it in an include. Compiler will inline it anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once request socks will be in ehash table, they will need to have
a valid ir_iff field.
This is currently true only for IPv6. This patch extends support
for IPv4 as well.
This means inet_diag_fill_req() can now properly use ir_iif,
which is better for IPv6 link locals anyway, as request sockets
and established sockets will propagate consistent netlink idiag_if.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot to update dccp_v6_conn_request() & cookie_v6_check().
They both need to set ireq->ireq_net and ireq->ir_cookie
Lets clear ireq->ir_cookie in inet_reqsk_alloc()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33cf7c90fe ("net: add real socket cookies")
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot to use write_pnet() in three locations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33cf7c90fe ("net: add real socket cookies")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A long standing problem in netlink socket dumps is the use
of kernel socket addresses as cookies.
1) It is a security concern.
2) Sockets can be reused quite quickly, so there is
no guarantee a cookie is used once and identify
a flow.
3) request sock, establish sock, and timewait socks
for a given flow have different cookies.
Part of our effort to bring better TCP statistics requires
to switch to a different allocator.
In this patch, I chose to use a per network namespace 64bit generator,
and to use it only in the case a socket needs to be dumped to netlink.
(This might be refined later if needed)
Note that I tried to carry cookies from request sock, to establish sock,
then timewait sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Salo <salo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_should_expand_sndbuf() does not expand the send buffer if we have
filled the congestion window.
However, it should use tcp_packets_in_flight() instead of
tp->packets_out to make this check.
Testing has established that the difference matters a lot if there are
many SACKed packets, causing a needless performance shortfall.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that in state ESTABLISHED, where the connection is represented
by a tcp_sock, we rate limit dupacks in response to incoming packets
(a) with TCP timestamps that fail PAWS checks, or (b) with sequence
numbers or ACK numbers that are out of the acceptable window.
We do not send a dupack in response to out-of-window packets if it has
been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
last sent a dupack in response to an out-of-window packet.
There is already a similar (although global) rate-limiting mechanism
for "challenge ACKs". When deciding whether to send a challence ACK,
we first consult the new per-connection rate limit, and then the
global rate limit.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helpers for mitigating ACK loops by rate-limiting dupacks sent in
response to incoming out-of-window packets.
This patch includes:
- rate-limiting logic
- sysctl to control how often we allow dupacks to out-of-window packets
- SNMP counter for cases where we rate-limited our dupack sending
The rate-limiting logic in this patch decides to not send dupacks in
response to out-of-window segments if (a) they are SYNs or pure ACKs
and (b) the remote endpoint is sending them faster than the configured
rate limit.
We rate-limit our responses rather than blocking them entirely or
resetting the connection, because legitimate connections can rely on
dupacks in response to some out-of-window segments. For example, zero
window probes are typically sent with a sequence number that is below
the current window, and ZWPs thus expect to thus elicit a dupack in
response.
We allow dupacks in response to TCP segments with data, because these
may be spurious retransmissions for which the remote endpoint wants to
receive DSACKs. This is safe because segments with data can't
realistically be part of ACK loops, which by their nature consist of
each side sending pure/data-less ACKs to each other.
The dupack interval is controlled by a new sysctl knob,
tcp_invalid_ratelimit, given in milliseconds, in case an administrator
needs to dial this upward in the face of a high-rate DoS attack. The
name and units are chosen to be analogous to the existing analogous
knob for ICMP, icmp_ratelimit.
The default value for tcp_invalid_ratelimit is 500ms, which allows at
most one such dupack per 500ms. This is chosen to be 2x faster than
the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
2.4). We allow the extra 2x factor because network delay variations
can cause packets sent at 1 second intervals to be compressed and
arrive much closer.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One deployment requirement of DCTCP is to be able to run
in a DC setting along with TCP traffic. As Glenn Judd's
NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls
of TCP in the Datacenter" [1] (tba) explains, one way to
solve this on switch side is to split DCTCP and TCP traffic
in two queues per switch port based on the DSCP: one queue
soley intended for DCTCP traffic and one for non-DCTCP traffic.
For the DCTCP queue, there's the marking threshold K as
explained in commit e3118e8359 ("net: tcp: add DCTCP congestion
control algorithm") for RED marking ECT(0) packets with CE.
For the non-DCTCP queue, there's f.e. a classic tail drop queue.
As already explained in e3118e8359, running DCTCP at scale
when not marking SYN/SYN-ACK packets with ECT(0) has severe
consequences as for non-ECT(0) packets, traversing the RED
marking DCTCP queue will result in a severe reduction of
connection probability.
This is due to the DCTCP queue being dominated by ECT(0) traffic
and switches handle non-ECT traffic in the RED marking queue
after passing K as drops, where K is usually a low watermark
in order to leave enough tailroom for bursts. Splitting DCTCP
traffic among several queues (ECN and non-ECN queue) is being
considered a terrible idea in the network community as it
splits single flows across multiple network paths.
Therefore, commit e3118e8359 implements this on Linux as
ECT(0) marked traffic, as we argue that marking all packets
of a DCTCP flow is the only viable solution and also doesn't
speak against the draft.
However, recently, a DCTCP implementation for FreeBSD hit also
their mainline kernel [2]. In order to let them play well
together with Linux' DCTCP, we would need to loosen the
requirement that ECT(0) has to be asserted during the 3WHS as
not implemented in FreeBSD. This simplifies the ECN test and
lets DCTCP work together with FreeBSD.
Joint work with Daniel Borkmann.
[1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd
[2] 8ad8794452
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current behavior only passes RTTs from sequentially acked data to CC.
If sender gets a combined ACK for segment 1 and SACK for segment 3, then the
computed RTT for CC is the time between sending segment 1 and receiving SACK
for segment 3.
Pass the minimum computed RTT from any acked data to CC, i.e. time between
sending segment 3 and receiving SACK for segment 3.
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
With TLP, the peer may reply to a probe with an
ACK+D-SACK, with ack value set to tlp_high_seq. In the current code,
such ACK+DSACK will be missed and only at next, higher ack will the TLP
episode be considered done. Since the DSACK is not present anymore,
this will cost a cwnd reduction.
This patch ensures that this scenario does not cause a cwnd reduction, since
receiving an ACK+DSACK indicates that both the initial segment and the probe
have been received by the peer.
The following packetdrill test, from Neal Cardwell, validates this patch:
// Establish a connection.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.020 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4
// Send 1 packet.
+0 write(4, ..., 1000) = 1000
+0 > P. 1:1001(1000) ack 1
// Loss probe retransmission.
// packets_out == 1 => schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
// In this case, this means: 1.5*RTT + 200ms = 230ms
+.230 > P. 1:1001(1000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10 }%
// Receiver ACKs at tlp_high_seq with a DSACK,
// indicating they received the original packet and probe.
+.020 < . 1:1(0) ack 1001 win 257 <sack 1:1001,nop,nop>
+0 %{ assert tcpi_snd_cwnd == 10 }%
// Send another packet.
+0 write(4, ..., 1000) = 1000
+0 > P. 1001:2001(1000) ack 1
// Receiver ACKs above tlp_high_seq, which should end the TLP episode
// if we haven't already. We should not reduce cwnd.
+.020 < . 1:1(0) ack 2001 win 257
+0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
Credits:
-Gregory helped in finding that tcp_process_tlp_ack was where the cwnd
got reduced in our MPTCP tests.
-Neal wrote the packetdrill test above
-Yuchung reworked the patch to make it more readable.
Cc: Gregory Detal <gregory.detal@uclouvain.be>
Cc: Nandita Dukkipati <nanditad@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ieee802154/fakehard.c
A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.
Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c3ae62af8e ("tcp: should drop incoming frames without ACK
flag set") was created to mitigate a security vulnerability in which a
local attacker is able to inject data into locally-opened sockets by
using TCP protocol statistics in procfs to quickly find the correct
sequence number.
This broke the RFC5961 requirement to send a challenge ACK in response
to spurious RST packets, which was subsequently fixed by commit
7b514a886b ("tcp: accept RST without ACK flag").
Unfortunately, the RFC5961 requirement that spurious SYN packets be
handled in a similar manner remains broken.
RFC5961 section 4 states that:
... the handling of the SYN in the synchronized state SHOULD be
performed as follows:
1) If the SYN bit is set, irrespective of the sequence number, TCP
MUST send an ACK (also referred to as challenge ACK) to the remote
peer:
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
After sending the acknowledgment, TCP MUST drop the unacceptable
segment and stop processing further.
By sending an ACK, the remote peer is challenged to confirm the loss
of the previous connection and the request to start a new connection.
A legitimate peer, after restart, would not have a TCB in the
synchronized state. Thus, when the ACK arrives, the peer should send
a RST segment back with the sequence number derived from the ACK
field that caused the RST.
This RST will confirm that the remote peer has indeed closed the
previous connection. Upon receipt of a valid RST, the local TCP
endpoint MUST terminate its connection. The local TCP endpoint
should then rely on SYN retransmission from the remote end to
re-establish the connection.
This patch lets SYN packets through the discard added in c3ae62af8e,
so that spurious SYN packets are properly dealt with as per the RFC.
The challenge ACK is sent unconditionally and is rate-limited, so the
original vulnerability is not reintroduced by this patch.
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the more common dynamic_debug capable net_dbg_ratelimited
and remove the LIMIT_NETDEBUG macro.
All messages are still ratelimited.
Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
This may have some negative impact on messages that were
emitted at KERN_INFO that are not not enabled at all unless
DEBUG is defined or dynamic_debug is enabled. Even so,
these messages are now _not_ emitted by default.
This also eliminates the use of the net_msg_warn sysctl
"/proc/sys/net/core/warnings". For backward compatibility,
the sysctl is not removed, but it has no function. The extern
declaration of net_msg_warn is removed from sock.h and made
static in net/core/sysctl_net_core.c
Miscellanea:
o Update the sysctl documentation
o Remove the embedded uses of pr_fmt
o Coalesce format fragments
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ueki Kohei reported that when we are using NewReno with connections that
have a very low traffic, we may timeout the connection too early if a
second loss occurs after the first one was successfully acked but no
data was transfered later. Below is his description of it:
When SACK is disabled, and a socket suffers multiple separate TCP
retransmissions, that socket's ETIMEDOUT value is calculated from the
time of the *first* retransmission instead of the *latest*
retransmission.
This happens because the tcp_sock's retrans_stamp is set once then never
cleared.
Take the following connection:
Linux remote-machine
| |
send#1---->(*1)|--------> data#1 --------->|
| | |
RTO : :
| | |
---(*2)|----> data#1(retrans) ---->|
| (*3)|<---------- ACK <----------|
| | |
| : :
| : :
| : :
16 minutes (or more) :
| : :
| : :
| : :
| | |
send#2---->(*4)|--------> data#2 --------->|
| | |
RTO : :
| | |
---(*5)|----> data#2(retrans) ---->|
| | |
| | |
RTO*2 : :
| | |
| | |
ETIMEDOUT<----(*6)| |
(*1) One data packet sent.
(*2) Because no ACK packet is received, the packet is retransmitted.
(*3) The ACK packet is received. The transmitted packet is acknowledged.
At this point the first "retransmission event" has passed and been
recovered from. Any future retransmission is a completely new "event".
(*4) After 16 minutes (to correspond with retries2=15), a new data
packet is sent. Note: No data is transmitted between (*3) and (*4).
The socket's timeout SHOULD be calculated from this point in time, but
instead it's calculated from the prior "event" 16 minutes ago.
(*5) Because no ACK packet is received, the packet is retransmitted.
(*6) At the time of the 2nd retransmission, the socket returns
ETIMEDOUT.
Therefore, now we clear retrans_stamp as soon as all data during the
loss window is fully acked.
Reported-by: Ueki Kohei
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows to set ECN on a per-route basis in case the sysctl
tcp_ecn is not set to 1. In other words, when ECN is set for specific
routes, it provides a tcp_ecn=1 behaviour for that route while the rest
of the stack acts according to the global settings.
One can use 'ip route change dev $dev $net features ecn' to toggle this.
Having a more fine-grained per-route setting can be beneficial for various
reasons, for example, 1) within data centers, or 2) local ISPs may deploy
ECN support for their own video/streaming services [1], etc.
There was a recent measurement study/paper [2] which scanned the Alexa's
publicly available top million websites list from a vantage point in US,
Europe and Asia:
Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side
only ECN") ;)); the break in connectivity on-path was found is about
1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
more common in the negotiation phase (and mostly seen in the Alexa
middle band, ranks around 50k-150k): from 12-thousand hosts on which
there _may_ be ECN-linked connection failures, only 79 failed with RST
when _not_ failing with RST when ECN is not requested.
It's unclear though, how much equipment in the wild actually marks CE
when buffers start to fill up.
We thought about a fallback to non-ECN for retransmitted SYNs as another
global option (which could perhaps one day be made default), but as Eric
points out, there's much more work needed to detect broken middleboxes.
Two examples Eric mentioned are buggy firewalls that accept only a single
SYN per flow, and middleboxes that successfully let an ECN flow establish,
but later mark CE for all packets (so cwnd converges to 1).
[1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
[2] http://ecn.ethz.ch/
Joint work with Daniel Borkmann.
Reference: http://thread.gmane.org/gmane.linux.network/335797
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Challenge ACK is described in RFC 5961, fix typo.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing upcoming Yaogong patch (converting out of order queue
into an RB tree), I hit the max reordering level of linux TCP stack.
Reordering level was limited to 127 for no good reason, and some
network setups [1] can easily reach this limit and get limited
throughput.
Allow a new max limit of 300, and add a sysctl to allow admins to even
allow bigger (or lower) values if needed.
[1] Aggregation of links, per packet load balancing, fabrics not doing
deep packet inspections, alternative TCP congestion modules...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We worked hard to improve tcp_ack() performance, by not accessing
skb_shinfo() in fast path (cd7d8498c9 tcp: change tcp_skb_pcount()
location)
We still have one spurious access because of ACK timestamping,
added in commit e1c8a607b2 ("net-timestamp: ACK timestamp for
bytestreams")
By checking if sk_tsflags has SOF_TIMESTAMPING_TX_ACK set,
we can avoid two cache line misses for the common case.
While we are at it, add two prefetchw() :
One in tcp_ack() to bring skb at the head of write queue.
One in tcp_clean_rtx_queue() loop to bring following skb,
as we will delete skb from the write queue and dirty skb->next->prev.
Add a couple of [un]likely() clauses.
After this patch, tcp_ack() is no longer the most consuming
function in tcp stack.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Most notable changes in here:
1) By far the biggest accomplishment, thanks to a large range of
contributors, is the addition of multi-send for transmit. This is
the result of discussions back in Chicago, and the hard work of
several individuals.
Now, when the ->ndo_start_xmit() method of a driver sees
skb->xmit_more as true, it can choose to defer the doorbell
telling the driver to start processing the new TX queue entires.
skb->xmit_more means that the generic networking is guaranteed to
call the driver immediately with another SKB to send.
There is logic added to the qdisc layer to dequeue multiple
packets at a time, and the handling mis-predicted offloads in
software is now done with no locks held.
Finally, pktgen is extended to have a "burst" parameter that can
be used to test a multi-send implementation.
Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
virtio_net
Adding support is almost trivial, so export more drivers to
support this optimization soon.
I want to thank, in no particular or implied order, Jesper
Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
David Tat, Hannes Frederic Sowa, and Rusty Russell.
2) PTP and timestamping support in bnx2x, from Michal Kalderon.
3) Allow adjusting the rx_copybreak threshold for a driver via
ethtool, and add rx_copybreak support to enic driver. From
Govindarajulu Varadarajan.
4) Significant enhancements to the generic PHY layer and the bcm7xxx
driver in particular (EEE support, auto power down, etc.) from
Florian Fainelli.
5) Allow raw buffers to be used for flow dissection, allowing drivers
to determine the optimal "linear pull" size for devices that DMA
into pools of pages. The objective is to get exactly the
necessary amount of headers into the linear SKB area pre-pulled,
but no more. The new interface drivers use is eth_get_headlen().
From WANG Cong, with driver conversions (several had their own
by-hand duplicated implementations) by Alexander Duyck and Eric
Dumazet.
6) Support checksumming more smoothly and efficiently for
encapsulations, and add "foo over UDP" facility. From Tom
Herbert.
7) Add Broadcom SF2 switch driver to DSA layer, from Florian
Fainelli.
8) eBPF now can load programs via a system call and has an extensive
testsuite. Alexei Starovoitov and Daniel Borkmann.
9) Major overhaul of the packet scheduler to use RCU in several major
areas such as the classifiers and rate estimators. From John
Fastabend.
10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
Duyck.
11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
Dumazet.
12) Add Datacenter TCP congestion control algorithm support, From
Florian Westphal.
13) Reorganize sk_buff so that __copy_skb_header() is significantly
faster. From Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
netlabel: directly return netlbl_unlabel_genl_init()
net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
net: description of dma_cookie cause make xmldocs warning
cxgb4: clean up a type issue
cxgb4: potential shift wrapping bug
i40e: skb->xmit_more support
net: fs_enet: Add NAPI TX
net: fs_enet: Remove non NAPI RX
r8169:add support for RTL8168EP
net_sched: copy exts->type in tcf_exts_change()
wimax: convert printk to pr_foo()
af_unix: remove 0 assignment on static
ipv6: Do not warn for informational ICMP messages, regardless of type.
Update Intel Ethernet Driver maintainers list
bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
tipc: fix bug in multicast congestion handling
net: better IFF_XMIT_DST_RELEASE support
net/mlx4_en: remove NETDEV_TX_BUSY
3c59x: fix bad split of cpu_to_le32(pci_map_single())
net: bcmgenet: fix Tx ring priority programming
...
1/ Step down as dmaengine maintainer see commit 08223d80df "dmaengine
maintainer update"
2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit
7787380336 "net_dma: mark broken"), without reports of performance
regression.
3/ Miscellaneous fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUKDLKAAoJEB7SkWpmfYgC7wwP/iNHqRjf1suMUTBIF3P6Hgbe
VCUwh0IkuujMPDG46WRn6cYzarRxVPLoGaLHLPszgjI6pmGPVv19wqeDOlUxtcmr
0iQWEWv/zqseaAIW+4gj/WYCyMgKil49EUBJKCZCfNmIaad+e0pr8f0uE5yOkHPM
tqWoZERu9A4dlXGr1TjeOZVzdnPrCt92MrLDN6ZZ6tMuJaEc5PauaLxKTeGy5fYj
UB+k1xJQzECbsYfpB+uCVYl5/qPO1rNyuBYS8THCsW+JYmrbbfH2kkF2lo2FaUpO
8Yd50FtzXHKWwAt7BzfIwU2M7x0wRmryrC/xsQi6M+WmVeHYvvHUIpzaA66xRZ5x
fCy3Fu8sEnmnmboAbh2v2c5uTycqRl2xPzbpLAuxglloXIxzi3ckp6ESF/Z4SldH
oxIoEievN7lah3vKgvlHZYcWDzrYr8EKf/EzFe9RqDBQDKtzDzre1H9Uivr387Vm
uFUcGHYG/GXuX47C7EUsMtaSW2UEoR2ytw/HR6CKFPTVXwAzEO6kA9vg0EqL0iIq
2wVLgavlZuwegmaUBgnr+bgVZMvVN7OU7fAIRVe5xNO6itrPKvheSlQthmRiiq9C
uzOu4PS6PexqzHUNPCcJpCsj+lawmCSrE0bxtPzTA/CQInVgWs219V9+W5Gn/0YA
EARN9k6ueX9PZPQrPQLm
=BBBv
-----END PGP SIGNATURE-----
Merge tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine
Pull dmaengine updates from Dan Williams:
"Even though this has fixes marked for -stable, given the size and the
needed conflict resolutions this is 3.18-rc1/merge-window material.
These patches have been languishing in my tree for a long while. The
fact that I do not have the time to do proper/prompt maintenance of
this tree is a primary factor in the decision to step down as
dmaengine maintainer. That and the fact that the bulk of drivers/dma/
activity is going through Vinod these days.
The net_dma removal has not been in -next. It has developed simple
conflicts against mainline and net-next (for-3.18).
Continuing thanks to Vinod for staying on top of drivers/dma/.
Summary:
1/ Step down as dmaengine maintainer see commit 08223d80df
"dmaengine maintainer update"
2/ Removal of net_dma, as it has been marked 'broken' since 3.13
(commit 7787380336 "net_dma: mark broken"), without reports of
performance regression.
3/ Miscellaneous fixes"
* tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
net: make tcp_cleanup_rbuf private
net_dma: revert 'copied_early'
net_dma: simple removal
dmaengine maintainer update
dmatest: prevent memory leakage on error path in thread
ioat: Use time_before_jiffies()
dmaengine: fix xor sources continuation
dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup()
dma: mv_xor: Remove all callers of mv_xor_slot_cleanup()
dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call
ioat: Use pci_enable_msix_exact() instead of pci_enable_msix()
drivers: dma: Include appropriate header file in dca.c
drivers: dma: Mark functions as static in dma_v3.c
dma: mv_xor: Add DMA API error checks
ioat/dca: Use dev_is_pci() to check whether it is pci device
Suggested by Stephen. Also drop inline keyword and let compiler decide.
gcc 4.7.3 decides to no longer inline tcp_ecn_check_ce, so split it up.
The actual evaluation is not inlined anymore while the ECN_OK test is.
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
After Octavian Purdilas tcp ipv4/ipv6 unification work this helper only
has a single callsite.
While at it, convert name to lowercase, suggested by Stephen.
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
DataCenter TCP (DCTCP) determines cwnd growth based on ECN information
and ACK properties, e.g. ACK that updates window is treated differently
than DUPACK.
Also DCTCP needs information whether ACK was delayed ACK. Furthermore,
DCTCP also implements a CE state machine that keeps track of CE markings
of incoming packets.
Therefore, extend the congestion control framework to provide these
event types, so that DCTCP can be properly implemented as a normal
congestion algorithm module outside of the core stack.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The congestion control ops "cwnd_event" currently supports
CA_EVENT_FAST_ACK and CA_EVENT_SLOW_ACK events (among others).
Both FAST and SLOW_ACK are only used by Westwood congestion
control algorithm.
This removes both flags from cwnd_event and adds a new
in_ack_event callback for this. The goal is to be able to
provide more detailed information about ACKs, such as whether
ECE flag was set, or whether the ACK resulted in a window
update.
It is required for DataCenter TCP (DCTCP) congestion control
algorithm as it makes a different choice depending on ECE being
set or not.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a flag to TCP congestion algorithms that allows
for requesting to mark IPv4/IPv6 sockets with transport as ECN
capable, that is, ECT(0), when required by a congestion algorithm.
It is currently used and needed in DataCenter TCP (DCTCP), as it
requires both peers to assert ECT on all IP packets sent - it
uses ECN feedback (i.e. CE, Congestion Encountered information)
from switches inside the data center to derive feedback to the
end hosts.
Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's
algorithm/behaviour slightly diverges from RFC3168, therefore this
is only (!) enabled iff the assigned congestion control ops module
has requested this. By that, we can tightly couple this logic really
only to the provided congestion control ops.
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a cleanup which follows the idea in commit e11ecddf51 (tcp: use
TCP_SKB_CB(skb)->tcp_flags in input path),
and it may reduce register pressure since skb->cb[] access is fast,
bacause skb is probably in a register.
v2: remove variable th
v3: reword the changelog
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our goal is to access no more than one cache line access per skb in
a write or receive queue when doing the various walks.
After recent TCP_SKB_CB() reorganizations, it is almost done.
Last part is tcp_skb_pcount() which currently uses
skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs
3 cache lines in current kernel (skb->head, skb->end, and
shinfo->gso_segs are all in 3 different cache lines, far from skb->cb)
This very simple patch reuses space currently taken by tcp_tw_isn
only in input path, as tcp_skb_pcount is only needed for skb stored in
write queue.
This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags
to get SKBTX_ACK_TSTAMP, which seems possible.
This also speeds up all sack processing in general.
This speeds up tcp_sendmsg() because it no longer has to access/dirty
shinfo.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that tcp_dma_try_early_copy() is gone nothing ever sets
copied_early.
Also reverts "53240c208776 tcp: Fix possible double-ack w/ user dma"
since it is no longer necessary.
Cc: Ali Saidi <saidi@engin.umich.edu>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Neal Cardwell <ncardwell@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
and there is no plan to fix it.
This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
Reverting the remainder of the net_dma induced changes is deferred to
subsequent patches.
Marked for stable due to Roman's report of a memory leak in
dma_pin_iovec_pages():
https://lkml.org/lkml/2014/9/3/177
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: David Whipple <whipple@securedatainnovations.ch>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: <stable@vger.kernel.org>
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In order to make TCP more resilient in presence of reorders, we need
to allow coalescing to happen when skbs from out of order queue are
transferred into receive queue. LRO/GRO can be completely canceled
in some pathological cases, like per packet load balancing on aggregated
links.
I had to move tcp_try_coalesce() up in the file above tcp_ofo_queue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default,
or more if some sysctl (eg tcp_retries2) are changed.
Better use 64bit to perform icsk_rto << icsk_backoff operations
As Joe Perches suggested, add a helper for this.
Yuchung spotted the tcp_v4_err() case.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we no longer rely on having tcp headers for skbs in receive queue,
tcp repair do not need to build fake ones.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_collapse() wants to shrink skb so that the overhead is minimal.
Now we store tcp flags into TCP_SKB_CB(skb)->tcp_flags, we no longer
need to keep around full headers.
Whole available space is dedicated to the payload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can allow a segment with FIN to be aggregated,
if we take care to add tcp flags,
and if skb_try_coalesce() takes care of zero sized skbs.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Input path of TCP do not currently uses TCP_SKB_CB(skb)->tcp_flags,
which is only used in output path.
tcp_recvmsg(), looks at tcp_hdr(skb)->syn for every skb found in receive queue,
and its unfortunate because this bit is located in a cache line right before
the payload.
We can simplify TCP by copying tcp flags into TCP_SKB_CB(skb)->tcp_flags.
This patch does so, and avoids the cache line miss in tcp_recvmsg()
Following patches will
- allow a segment with FIN being coalesced in tcp_try_coalesce()
- simplify tcp_collapse() by not copying the headers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 740b0f1841 ("tcp: switch rtt estimations to usec resolution"),
we no longer need to maintain timestamps in two different fields.
TCP_SKB_CB(skb)->when can be removed, as same information sits in skb_mstamp.stamp_jiffies
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP_SKB_CB(skb)->when has different meaning in output and input paths.
In output path, it contains a timestamp.
In input path, it contains an ISN, chosen by tcp_timewait_state_process()
Lets add a different name to ease code comprehension.
Note that 'when' field will disappear in following patch,
as skb_mstamp already contains timestamp, the anonymous
union will promptly disappear as well.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon timeout, undo (via both timestamps/Eifel and DSACKs) was
disabled if any retransmits were still in flight. The concern was
perhaps that spurious retransmission sent in a previous recovery
episode may trigger DSACKs to falsely undo the current recovery.
However, this inadvertently misses undo opportunities (using either
TCP timestamps or DSACKs) when timeout occurs during a loss episode,
i.e. recurring timeouts or timeout during fast recovery. In these
cases some retransmissions will be in flight but we should allow
undo. Furthermore, we should only reset undo_marker and undo_retrans
upon timeout if we are starting a new recovery episode. Finally,
when we do reset our undo state, we now do so in a manner similar
to tcp_enter_recovery(), so that we require a DSACK for each of
the outstsanding retransmissions. This will achieve the original
goal by requiring that we receive the same number of DSACKs as
retransmissions.
This patch increases the undo events by 50% on Google servers.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix TCP FRTO logic so that it always notices when snd_una advances,
indicating that any RTO after that point will be a new and distinct
loss episode.
Previously there was a very specific sequence that could cause FRTO to
fail to notice a new loss episode had started:
(1) RTO timer fires, enter FRTO and retransmit packet 1 in write queue
(2) receiver ACKs packet 1
(3) FRTO sends 2 more packets
(4) RTO timer fires again (should start a new loss episode)
The problem was in step (3) above, where tcp_process_loss() returned
early (in the spot marked "Step 2.b"), so that it never got to the
logic to clear icsk_retransmits. Thus icsk_retransmits stayed
non-zero. Thus in step (4) tcp_enter_loss() would see the non-zero
icsk_retransmits, decide that this RTO is not a new episode, and
decide not to cut ssthresh and remember the current cwnd and ssthresh
for undo.
There were two main consequences to the bug that we have
observed. First, ssthresh was not decreased in step (4). Second, when
there was a series of such FRTO (1-4) sequences that happened to be
followed by an FRTO undo, we would restore the cwnd and ssthresh from
before the entire series started (instead of the cwnd and ssthresh
from before the most recent RTO). This could result in cwnd and
ssthresh being restored to values much bigger than the proper values.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: e33099f96d ("tcp: implement RFC5682 F-RTO")
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
ordering of incoming connections and teardowns without the need to
hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
the last measured RTO. To do so, we keep the last seen timestamp in a
per-host indexed data structure and verify if the incoming timestamp
in a connection request is strictly greater than the saved one during
last connection teardown. Thus we can verify later on that no old data
packets will be accepted by the new connection.
During moving a socket to time-wait state we already verify if timestamps
where seen on a connection. Only if that was the case we let the
time-wait socket expire after the RTO, otherwise normal TCP_TIMEWAIT_LEN
will be used. But we don't verify this on incoming SYN packets. If a
connection teardown was less than TCP_PAWS_MSL seconds in the past we
cannot guarantee to not accept data packets from an old connection if
no timestamps are present. We should drop this SYN packet. This patch
closes this loophole.
Please note, this patch does not make tcp_tw_recycle in any way more
usable but only adds another safety check:
Sporadic drops of SYN packets because of reordering in the network or
in the socket backlog queues can happen. Users behing NAT trying to
connect to a tcp_tw_recycle enabled server can get caught in blackholes
and their connection requests may regullary get dropped because hosts
behind an address translator don't have synchronized tcp timestamp clocks.
tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled.
In general, use of tcp_tw_recycle is disadvised.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ACK timestamps are generated in tcp_clean_rtx_queue. The TSO datapath
can break out early, causing the timestamp code to be skipped. Move
the code up before the break.
Reported-by: David S. Miller <davem@davemloft.net>
Also fix a boundary condition: tp->snd_una is the next unacknowledged
byte and between tests inclusive (a <= b <= c), so generate a an ACK
timestamp if (prior_snd_una <= tskey <= tp->snd_una - 1).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
in the send() call is acknowledged. It implements the feature for TCP.
The timestamp is generated when the TCP socket cumulative ACK is moved
beyond the tracked seqno for the first time. The feature ignores SACK
and FACK, because those acknowledge the specific byte, but not
necessarily the entire contents of the buffer up to that byte.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit reduces spurious retransmits due to apparent SACK reneging
by only reacting to SACK reneging that persists for a short delay.
When a sequence space hole at snd_una is filled, some TCP receivers
send a series of ACKs as they apparently scan their out-of-order queue
and cumulatively ACK all the packets that have now been consecutiveyly
received. This is essentially misbehavior B in "Misbehaviors in TCP
SACK generation" ACM SIGCOMM Computer Communication Review, April
2011, so we suspect that this is from several common OSes (Windows
2000, Windows Server 2003, Windows XP). However, this issue has also
been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
into spurious retransmissions by lack of timestamps?" from March 2014,
where the receiver was thought to be a BSD box.
Since snd_una would temporarily be adjacent to a previously SACKed
range in these scenarios, this receiver behavior triggered the Linux
SACK reneging code path in the sender. This led the sender to clear
the SACK scoreboard, enter CA_Loss, and spuriously retransmit
(potentially) every packet from the entire write queue at line rate
just a few milliseconds before the ACK for each packet arrives at the
sender.
To avoid such situations, now when a sender sees apparent reneging it
does not yet retransmit, but rather adjusts the RTO timer to give the
receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
that will restore sanity to the SACK scoreboard. If the reneging
persists until this RTO then, as before, we clear the SACK scoreboard
and enter CA_Loss.
A 10ms delay tolerates a receiver sending such a stream of ACKs at
56Kbit/sec. And to allow for receivers with slower or more congested
paths, we wait for at least RTT/2.
We validated the resulting max(RTT/2, 10ms) delay formula with a mix
of North American and South American Google web server traffic, and
found that for ACKs displaying transient reneging:
(1) 90% of inter-ACK delays were less than 10ms
(2) 99% of inter-ACK delays were less than RTT/2
In tests on Google web servers this commit reduced reneging events by
75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
any measurable impact on latency for user HTTP and SPDY requests.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since Yuchung's 9b44190dc1 (tcp: refactor F-RTO), tcp_enter_cwr is always
called with set_ssthresh = 1. Thus, we can remove this argument from
tcp_enter_cwr. Further, as we remove this one, tcp_init_cwnd_reduction
is then always called with set_ssthresh = true, and so we can get rid of
this argument as well.
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The undo code assumes that, upon entering loss recovery, TCP
1) always retransmit something
2) the retransmission never fails locally (e.g., qdisc drop)
so undo_marker is set in tcp_enter_recovery() and undo_retrans is
incremented only when tcp_retransmit_skb() is successful.
When the assumption is broken because TCP's cwnd is too small to
retransmit or the retransmit fails locally. The next (DUP)ACK
would incorrectly revert the cwnd and the congestion state in
tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
may enter the recovery state. The sender repeatedly enter and
(incorrectly) exit recovery states if the retransmits continue to
fail locally while receiving (DUP)ACKs.
The fix is to initialize undo_retrans to -1 and start counting on
the first retransmission. Always increment undo_retrans even if the
retransmissions fail locally because they couldn't cause DSACKs to
undo the cwnd reduction.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes build error introduced by commit 1fb6f159fd (tcp: add
tcp_conn_request):
net/ipv4/tcp_input.c: In function 'pr_drop_req':
net/ipv4/tcp_input.c:5889:130: error: 'struct sock_common' has no member named 'skc_v6_daddr'
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create tcp_conn_request and remove most of the code from
tcp_v4_conn_request and tcp_v6_conn_request.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If there is an MSS change (or misbehaving receiver) that causes a SACK
to arrive that covers the end of an skb but is less than one MSS, then
tcp_match_skb_to_sack() was rounding up pkt_len to the full length of
the skb ("Round if necessary..."), then chopping all bytes off the skb
and creating a zero-byte skb in the write queue.
This was visible now because the recently simplified TLP logic in
bef1909ee3 ("tcp: fixing TLP's FIN recovery") could find that 0-byte
skb at the end of the write queue, and now that we do not check that
skb's length we could send it as a TLP probe.
Consider the following example scenario:
mss: 1000
skb: seq: 0 end_seq: 4000 len: 4000
SACK: start_seq: 3999 end_seq: 4000
The tcp_match_skb_to_sack() code will compute:
in_sack = false
pkt_len = start_seq - TCP_SKB_CB(skb)->seq = 3999 - 0 = 3999
new_len = (pkt_len / mss) * mss = (3999/1000)*1000 = 3000
new_len += mss = 4000
Previously we would find the new_len > skb->len check failing, so we
would fall through and set pkt_len = new_len = 4000 and chop off
pkt_len of 4000 from the 4000-byte skb, leaving a 0-byte segment
afterward in the write queue.
With this new commit, we notice that the new new_len >= skb->len check
succeeds, so that we return without trying to fragment.
Fixes: adb92db857 ("tcp: Make SACK code to split only at mss boundaries")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_fragment can be called from process context (from tso_fragment).
Add a new gfp parameter to allow it to preserve atomic memory if
possible.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Reviewed-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/inetpeer.h
net/ipv6/output_core.c
Changes in net were fixing bugs in code removed in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
This bug is discovered by an recent F-RTO issue on tcpm list
https://www.ietf.org/mail-archive/web/tcpm/current/msg08794.html
The bug is that currently F-RTO does not use DSACK to undo cwnd in
certain cases: upon receiving an ACK after the RTO retransmission in
F-RTO, and the ACK has DSACK indicating the retransmission is spurious,
the sender only calls tcp_try_undo_loss() if some never retransmisted
data is sacked (FLAG_ORIG_DATA_SACKED).
The correct behavior is to unconditionally call tcp_try_undo_loss so
the DSACK information is used properly to undo the cwnd reduction.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e114a710aa ("tcp: fix cwnd limited checking to improve
congestion control") obsoleted in_flight parameter from
tcp_is_cwnd_limited() and its callers.
This patch does the removal as promised.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make tcp_cwnd_application_limited() static and move it from tcp_input.c to
tcp_output.c
Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several spots in the kernel perform a sequence like:
skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);
But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.
Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.
And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.
So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.
Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.
Signed-off-by: David S. Miller <davem@davemloft.net>
All skb in socket write queue should be properly timestamped.
In case of FastOpen, we special case the SYN+DATA 'message' as we
queue in socket wrote queue the two fallback skbs:
1) SYN message by itself.
2) DATA segment by itself.
We should make sure these skbs have proper timestamps.
Add a WARN_ON_ONCE() to eventually catch future violations.
Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/recv.c
drivers/net/wireless/mwifiex/pcie.c
net/ipv6/sit.c
The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.
The two wireless conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the following snmp stats:
TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse
the remote does not accept it or the attempts timed out.
TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
TCPOrigDataSent: number of outgoing packets with original data (excluding
retransmission but including data-in-SYN). This counter is different from
TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
more useful to track the TCP retransmission rate.
Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to
TCPFastOpenPassive.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Lawrence Brakmo <brakmo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RTT may be bogus with tall loss probe (TLP) when a packet
is retransmitted and latter (s)acked without TCPCB_SACKED_RETRANS flag.
For example, TLP calls __tcp_retransmit_skb() instead of
tcp_retransmit_skb(). The skb timestamps are updated but the sacked
flag is not marked with TCPCB_SACKED_RETRANS. As a result we'll
get bogus RTT in tcp_clean_rtx_queue() or in tcp_sacktag_one() on
spurious retransmission.
The fix is to apply the sticky flag TCP_EVER_RETRANS to enforce Karn's
check on RTT sampling. However this will disable F-RTO if timeout occurs
after TLP, by resetting undo_marker in tcp_enter_loss(). We relax this
check to only if any pending retransmists are still in-flight.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upcoming congestion controls for TCP require usec resolution for RTT
estimations. Millisecond resolution is simply not enough these days.
FQ/pacing in DC environments also require this change for finer control
and removal of bimodal behavior due to the current hack in
tcp_update_pacing_rate() for 'small rtt'
TCP_CONG_RTT_STAMP is no longer needed.
As Julian Anastasov pointed out, we need to keep user compatibility :
tcp_metrics used to export RTT and RTTVAR in msec resolution,
so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
to use the new attributes if provided by the kernel.
In this example ss command displays a srtt of 32 usecs (10Gbit link)
lpk51:~# ./ss -i dst lpk52
Netid State Recv-Q Send-Q Local Address:Port Peer
Address:Port
tcp ESTAB 0 1 10.246.11.51:42959
10.246.11.52:64614
cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448
cwnd:10 send
3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559
Updated iproute2 ip command displays :
lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source
10.246.11.51
Old binary displays :
lpk51:~# ip tcp_metrics | grep 10.246.11.52
10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source
10.246.11.51
With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Larry Brakmo <brakmo@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP pacing depends on an accurate srtt estimation.
Current srtt estimation is using jiffie resolution,
and has an artificial offset of at least 1 ms, which can produce
slowdowns when FQ/pacing is used, especially in DC world,
where typical rtt is below 1 ms.
We are planning a switch to usec resolution for linux-3.15,
but in the meantime, this patch removes the 1 ms offset.
All we need is to have tp->srtt minimal value of 1 to differentiate
the case of srtt being initialized or not, not 8.
The problematic behavior was observed on a 40Gbit testbed,
where 32 concurrent netperf were reaching 12Gbps of aggregate
speed, instead of line speed.
This patch also has the effect of reporting more accurate srtt and send
rates to iproute2 ss command as in :
$ ss -i dst cca2
Netid State Recv-Q Send-Q Local Address:Port
Peer Address:Port
tcp ESTAB 0 0 10.244.129.1:56984
10.244.129.2:12865
cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send
463.4Mbps rcv_rtt:1 rcv_space:29200
tcp ESTAB 0 390960 10.244.129.1:60247
10.244.129.2:50204
cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51
send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
Reported-by: Vytautas Valancius <valas@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following are only used in one file:
tcp_connect_init
tcp_set_rto
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix checkpatch errors like:
ERROR: spaces required around that XXX
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Slow start now increases cwnd by 1 if an ACK acknowledges some packets,
regardless the number of packets. Consequently slow start performance
is highly dependent on the degree of the stretch ACKs caused by
receiver or network ACK compression mechanisms (e.g., delayed-ACK,
GRO, etc). But slow start algorithm is to send twice the amount of
packets of packets left so it should process a stretch ACK of degree
N as if N ACKs of degree 1, then exits when cwnd exceeds ssthresh. A
follow up patch will use the remainder of the N (if greater than 1)
to adjust cwnd in the congestion avoidance phase.
In addition this patch retires the experimental limited slow start
(LSS) feature. LSS has multiple drawbacks but questionable benefit. The
fractional cwnd increase in LSS requires a loop in slow start even
though it's rarely used. Configuring such an increase step via a global
sysctl on different BDPS seems hard. Finally and most importantly the
slow start overshoot concern is now better covered by the Hybrid slow
start (hystart) enabled by default.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be.h
drivers/net/netconsole.c
net/bridge/br_private.h
Three mostly trivial conflicts.
The net/bridge/br_private.h conflict was a function signature (argument
addition) change overlapping with the extern removals from Joe Perches.
In drivers/net/netconsole.c we had one change adjusting a printk message
whilst another changed "printk(KERN_INFO" into "pr_info(".
Lastly, the emulex change was a new inline function addition overlapping
with Joe Perches's extern removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
Patch ed08495c3 "tcp: use RTT from SACK for RTO" always re-arms RTO upon
obtaining a RTT sample from newly sacked data.
But technically RTO should only be re-armed when the data sent before
the last (re)transmission of write queue head are (s)acked. Otherwise
the RTO may continue to extend during loss recovery on data sent
in the future.
Note that RTTs from ACK or timestamps do not have this problem, as the RTT
source must be from data sent before.
The new RTO re-arm policy is
1) Always re-arm RTO if SND.UNA is advanced
2) Re-arm RTO if sack RTT is available, provided the sacked data was
sent before the last time write_queue_head was sent.
Signed-off-by: Larry Brakmo <brakmo@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patch ed08495c3 "tcp: use RTT from SACK for RTO" has a bug that
it does not check if the ACK acknowledge new data before taking
the RTT sample from TCP timestamps. This patch adds the check
back as required by the RFC.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tp->lsndtime may not always be the SYNACK timestamp if a passive
Fast Open socket sends data before handshake completes. And if the
remote acknowledges both the data and the SYNACK, the RTT sample
is already taken in tcp_ack(), so no need to call
tcp_update_ack_rtt() in tcp_synack_rtt_meas() aagain.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/qmi_wwan.c
include/net/dst.h
Trivial merge conflicts, both were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
For passive TCP connections, upon receiving the ACK that completes the
3WHS, make sure we set our pacing rate after we get our first RTT
sample.
On passive TCP connections, when we receive the ACK completing the
3WHS we do not take an RTT sample in tcp_ack(), but rather in
tcp_synack_rtt_meas(). So upon receiving the ACK that completes the
3WHS, tcp_ack() leaves sk_pacing_rate at its initial value.
Originally the initial sk_pacing_rate value was 0, so passive-side
connections defaulted to sysctl_tcp_min_tso_segs (2 segs) in skbuffs
made in the first RTT. With a default initial cwnd of 10 packets, this
happened to be correct for RTTs 5ms or bigger, so it was hard to
see problems in WAN or emulated WAN testing.
Since 7eec4174ff ("pkt_sched: fq: fix non TCP flows pacing"), the
initial sk_pacing_rate is 0xffffffff. So after that change, passive
TCP connections were keeping this value (and using large numbers of
segments per skbuff) until receiving an ACK for data.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On receiving an ACK that covers the loss probe sequence, TLP
immediately sets the congestion state to Open, even though some packets
are not recovered and retransmisssion are on the way. The later ACks
may trigger a WARN_ON check in step D of tcp_fastretrans_alert(), e.g.,
https://bugzilla.redhat.com/show_bug.cgi?id=989251
The fix is to follow the similar procedure in recovery by calling
tcp_try_keep_open(). The sender switches to Open state if no packets
are retransmissted. Otherwise it goes to Disorder and let subsequent
ACKs move the state to Recovery or Open.
Reported-By: Michael Sterrett <michael@sterretts.net>
Tested-By: Dormando <dormando@rydia.net>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_pacing_rate is read by sch_fq packet scheduler at any time,
with no synchronization, so make sure we update it in a
sensible way. ACCESS_ONCE() is how we instruct compiler
to not do stupid things, like using the memory location
as a temporary variable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/linux/netdevice.h
net/core/sock.c
Trivial merge issues.
Removal of "extern" for functions declaration in netdevice.h
at the same time "const" was added to an argument.
Two parallel line additions in net/core/sock.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung found following problem :
There are bugs in the SACK processing code, merging part in
tcp_shift_skb_data(), that incorrectly resets or ignores the sacked
skbs FIN flag. When a receiver first SACK the FIN sequence, and later
throw away ofo queue (e.g., sack-reneging), the sender will stop
retransmitting the FIN flag, and hangs forever.
Following packetdrill test can be used to reproduce the bug.
$ cat sack-merge-bug.pkt
`sysctl -q net.ipv4.tcp_fack=0`
// Establish a connection and send 10 MSS.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+.000 bind(3, ..., ...) = 0
+.000 listen(3, 1) = 0
+.050 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+.000 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.001 < . 1:1(0) ack 1 win 1024
+.000 accept(3, ..., ...) = 4
+.100 write(4, ..., 12000) = 12000
+.000 shutdown(4, SHUT_WR) = 0
+.000 > . 1:10001(10000) ack 1
+.050 < . 1:1(0) ack 2001 win 257
+.000 > FP. 10001:12001(2000) ack 1
+.050 < . 1:1(0) ack 2001 win 257 <sack 10001:11001,nop,nop>
+.050 < . 1:1(0) ack 2001 win 257 <sack 10001:12002,nop,nop>
// SACK reneg
+.050 < . 1:1(0) ack 12001 win 257
+0 %{ print "unacked: ",tcpi_unacked }%
+5 %{ print "" }%
First, a typo inverted left/right of one OR operation, then
code forgot to advance end_seq if the merged skb carried FIN.
Bug was added in 2.6.29 by commit 832d11c5cd
("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable fully_acked is only assigned the values true and false.
Change its type to bool.
The simplified semantic patch that find this problem is as
follows (http://coccinelle.lip6.fr/):
@exists@
type T;
identifier b;
@@
- T
+ bool
b = ...;
... when any
b = \(true\|false\)
Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_fixup_sndbuf() is underestimating initial send buffer requirements.
It was not noticed because big GSO packets were escaping the limitation,
but with smaller TSO packets (or TSO/GSO/SG off), application hits
sk_sndbuf before having a chance to fill enough packets in socket write
queue.
- initial cwnd can be bigger than 10 for specific routes
- SKB_TRUESIZE() is a bit under real needs in some cases,
because of power-of-two rounding in kmalloc()
- Fast Recovery (RFC 5681 3.2) : Cubic needs 70% factor
- Extra cushion (application might react slowly to POLLOUT)
tcp_v4_conn_req_fastopen() needs to call tcp_init_metrics() before
calling tcp_init_buffer_space()
Then we realize tcp_new_space() should call tcp_fixup_sndbuf()
instead of duplicating this stuff.
Rename tcp_fixup_sndbuf() to tcp_sndbuf_expand() to be more
descriptive.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As mentioned in commit afe4fd0624 ("pkt_sched: fq: Fair Queue packet
scheduler"), this patch adds a new socket option.
SO_MAX_PACING_RATE offers the application the ability to cap the
rate computed by transport layer. Value is in bytes per second.
u32 val = 1000000;
setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));
To be effectively paced, a flow must use FQ packet scheduler.
Note that a packet scheduler takes into account the headers for its
computations. The effective payload rate depends on MSS and retransmits
if any.
I chose to make this pacing rate a SOL_SOCKET option instead of a
TCP one because this can be used by other protocols.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steinar H. Gunderson <sesse@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dynamic Right Sizing (DRS) is supposed to open TCP receive window
automatically, but suffers from two bugs, presented by order
of importance.
1) tcp_rcv_space_adjust() fix :
Using twice the last received amount is very pessimistic,
because it doesn't allow fast recovery or proper slow start
ramp up, if sender wants to increase cwin by 100% every RTT.
copied = bytes received in previous RTT
2*copied = bytes we expect to receive in next RTT
4*copied = bytes we need to advertise in rwin at end of next RTT
DRS is one RTT late, it needs a 4x factor.
If sender is not using ABC, and increases cwin by 50% every rtt,
then we needed 1.5*1.5 = 2.25 factor.
This is probably why this bug was not really noticed.
2) There is no window adjustment after first RTT. DRS triggers only
after the second RTT.
DRS needs two RTT to initialize, so tcp_fixup_rcvbuf() should setup
sk_rcvbuf to allow proper window grow for first two RTT.
This patch increases TCP efficiency particularly for large RTT flows
when autotuning is used at the receiver, and more particularly
in presence of packet losses.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP receive window handling is multi staged.
A socket has a memory budget, static or dynamic, in sk_rcvbuf.
Because we do not really know how this memory budget translates to
a TCP window (payload), TCP announces a small initial window
(about 20 MSS).
When a packet is received, we increase TCP rcv_win depending
on the payload/truesize ratio of this packet. Good citizen
packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2
This heuristic takes place in tcp_grow_window()
Problem is : We currently call tcp_grow_window() only for in-order
packets.
This means that reorders or packet losses stop proper grow of
rcv_win, and senders are unable to benefit from fast recovery,
or proper reordering level detection.
Really, a packet being stored in OFO queue is not a bad citizen.
It should be part of the game as in-order packets.
In our traces, we very often see sender is limited by linux small
receive windows, even if linux hosts use autotuning (DRS) and should
allow rcv_win to grow to ~3MB.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 0f7cc9a3 "tcp: increase throughput when reordering is high",
it only allows cwnd to increase in Open state. This mistakenly disables
slow start after timeout (CA_Loss). Moreover cwnd won't grow if the
state moves from Disorder to Open later in tcp_fastretrans_alert().
Therefore the correct logic should be to allow cwnd to grow as long
as the data is received in order in Open, Loss, or even Disorder state.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
net/bridge/br_multicast.c
net/ipv6/sit.c
The conflicts were minor:
1) sit.c changes overlap with change to ip_tunnel_xmit() signature.
2) br_multicast.c had an overlap between computing max_delay using
msecs_to_jiffies and turning MLDV2_MRC() into an inline function
with a name using lowercase instead of uppercase letters.
3) stmmac had two overlapping changes, one which conditionally allocated
and hooked up a dma_cfg based upon the presence of the pbl OF property,
and another one handling store-and-forward DMA made. The latter of
which should not go into the new of_find_property() basic block.
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_rcv_established() returns only one value namely 0. We change the return
value to void (as suggested by David Miller).
After commit 0c24604b (tcp: implement RFC 5961 4.2), we no longer send RSTs in
response to SYNs. We can remove the check and processing on the return value of
tcp_rcv_established().
We also fix jtcp_rcv_established() in tcp_probe.c to match that of
tcp_rcv_established().
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.
One part of the problem is that tcp_tso_should_defer() uses an heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.
This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.
This field could be set by other transports.
Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.
For other flows, this helps better packet scheduling and ACK clocking.
This patch increases performance of TCP flows in lossy environments.
A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).
A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.
This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.
sk_pacing_rate = 2 * cwnd * mss / srtt
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The zero value means that tsecr is not valid, so it's a special case.
tsoffset is used to customize tcp_time_stamp for one socket.
tsoffset is usually zero, it's used when a socket was moved from one
host to another host.
Currently this issue affects logic of tcp_rcv_rtt_measure_ts. Due to
incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to
TCP_RTO_MAX.
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The stack currently detects reordering and avoid spurious
retransmission very well. However the throughput is sub-optimal under
high reordering because cwnd is increased only if the data is deliverd
in order. I.e., FLAG_DATA_ACKED check in tcp_ack(). The more packet
are reordered the worse the throughput is.
Therefore when reordering is proven high, cwnd should advance whenever
the data is delivered regardless of its ordering. If reordering is low,
conservatively advance cwnd only on ordered deliveries in Open state,
and retain cwnd in Disordered state (RFC5681).
Using netperf on a qdisc setup of 20Mbps BW and random RTT from 45ms
to 55ms (for reordering effect). This change increases TCP throughput
by 20 - 25% to near bottleneck BW.
A special case is the stretched ACK with new SACK and/or ECE mark.
For example, a receiver may receive an out of order or ECN packet with
unacked data buffered because of LRO or delayed ACK. The principle on
such an ACK is to advance cwnd on the cummulative acked part first,
then reduce cwnd in tcp_fastretrans_alert().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On timeout the TCP sender unconditionally resets the estimated degree
of network reordering (tp->reordering). The idea behind this is that
the estimate is too large to trigger fast recovery (e.g., due to a IP
path change).
But for example if the sender only had 2 packets outstanding, then a
timeout doesn't tell much about reordering. A sender that learns about
reordering on big writes and loses packets on small writes will end up
falsely retransmitting again and again, especially when reordering is
more likely on big writes.
Therefore the sender should only suspect that tp->reordering is too
high if it could have gone into fast recovery with the (lower) default
estimate.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If RTT is not available because Karn's check has failed or no
new packet is acked, use the RTT measured from SACK to estimate
the RTO. The sender can continue to estimate the RTO during loss
recovery or reordering event upon receiving non-partial ACKs.
This also changes when the RTO is re-armed. Previously it is
only re-armed when some data is cummulatively acknowledged (i.e.,
SND.UNA advances), but now it is re-armed whenever RTT estimator
is updated. This feature is particularly useful to reduce spurious
timeout for buffer bloat including cellular carriers [1], and
RTT estimation on reordering events.
[1] "An In-depth Study of LTE: Effect of Network Protocol and
Application Behavior on Performance", In Proc. of SIGCOMM 2013
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take RTT sample if an ACK selectively acks some sequences that
have never been retransmitted. The Karn's algorithm does not apply
even if that ACK (s)acks other retransmitted sequences, because it
must been generated by an original but perhaps out-of-order packet.
There is no ambiguity. In case when multiple blocks are newly
sacked because of ACK losses the earliest block is used to
measure RTT, similar to cummulative ACKs.
Such RTT samples allow the sender to estimate the RTO during loss
recovery and packet reordering events. It is still useful even with
TCP timestamps. That's because during these events the SND.UNA may
not advance preventing RTT samples from TS ECR (thus the FLAG_ACKED
check before calling tcp_ack_update_rtt()). Therefore this new
RTT source is complementary to existing ACK and TS RTT mechanisms.
This patch does not update the RTO. It is done in the next patch.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prefer packet timings to TS-ecr for RTT measurements when both
sources are available. That's because broken middle-boxes and remote
peer can return packets with corrupted TS ECR fields. Similarly most
congestion controls that require RTT signals favor timing-based
sources as well. Also check for bad TS ECR values to avoid RTT
blow-ups. It has happened on production Web servers.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first patch consolidates SYNACK and other RTT measurement to use a
central function tcp_ack_update_rtt(). A (small) bonus is now SYNACK
RTT measurement happens after PAWS check, potentially reducing the
impact of RTO seeding on bad TCP timestamps values.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In previous discussions, I tried to find some reasonable heuristics
for delayed ACK, however this seems not possible, according to Eric:
"ACKS might also be delayed because of bidirectional
traffic, and is more controlled by the application
response time. TCP stack can not easily estimate it."
"ACK can be incredibly useful to recover from losses in
a short time.
The vast majority of TCP sessions are small lived, and we
send one ACK per received segment anyway at beginning or
retransmits to let the sender smoothly increase its cwnd,
so an auto-tuning facility wont help them that much."
and according to David:
"ACKs are the only information we have to detect loss.
And, for the same reasons that TCP VEGAS is fundamentally
broken, we cannot measure the pipe or some other
receiver-side-visible piece of information to determine
when it's "safe" to stretch ACK.
And even if it's "safe", we should not do it so that losses are
accurately detected and we don't spuriously retransmit.
The only way to know when the bandwidth increases is to
"test" it, by sending more and more packets until drops happen.
That's why all successful congestion control algorithms must
operate on explicited tested pieces of information.
Similarly, it's not really possible to universally know if
it's safe to stretch ACK or not."
It still makes sense to enable or disable quick ack mode like
what TCP_QUICK_ACK does.
Similar to TCP_QUICK_ACK option, but for people who can't
modify the source code and still wants to control
TCP delayed ACK behavior. As David suggested, this should belong
to per-path scope, since different pathes may want different
behaviors.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linux sends new unset data during disorder and recovery state if all
(suspected) lost packets have been retransmitted ( RFC5681, section
3.2 step 1 & 2, RFC3517 section 4, NexSeg() Rule 2). One requirement
is to keep the receive window about twice the estimated sender's
congestion window (tcp_rcv_space_adjust()), assuming the fast
retransmits repair the losses in the next round trip.
But currently it's not the case on the first round trip in either
normal or Fast Open connection, beucase the initial receive window
is identical to (expected) sender's initial congestion window. The
fix is to double it.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the receiver supports DSACK, sender can detect false recoveries and
revert cwnd reductions triggered by either severe network reordering or
concurrent reordering and loss event.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon detecting spurious fast retransmit via timestamps during recovery,
use PRR to clock out new data packet instead of retransmission. Once
all retransmission are proven spurious, the sender then reverts the
cwnd reduction and congestion state to open or disorder.
The current code does the opposite: it undoes cwnd as soon as any
retransmission is spurious and continues to retransmit until all
data are acked. This nullifies the point to undo the cwnd because
the sender is still retransmistting spuriously. This patch fixes
it. The undo_ssthresh argument of tcp_undo_cwnd_reductiuon() is no
longer needed and is removed.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor and relocate various functions or variables to prepare the
undo fix. Remove some unused function arguments. Rename tcp_undo_cwr
to tcp_undo_cwnd_reduction to be consistent with the rest of
CWR related function names.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch series fixes an undo bug in fast recovery: the sender
mistakenly undos the cwnd too early but continues fast retransmits
until all pending data are acked. This also multiplies the SNMP
stat PARTIALUNDO events by the degree of the network reordering.
The first patch prepares the fix by consolidating the accounting
of newly_acked_sacked in tcp_cwnd_reduction(), instead of updating
newly_acked_sacked everytime sacked_out is adjusted. Also pass
acked and prior_unsacked as const type because they are readonly
in the rest of recovery processing.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
case TCP_FIN_WAIT1 can also be simplified by reversing tests
and adding breaks;
Add braces after case and move automatic definitions.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
case TCP_SYN_RECV: can have another indentation level removed
by converting
if (acceptable) {
...;
} else {
return 1;
}
to
if (!acceptable)
return 1;
...;
Reflow code and comments to fit 80 columns.
Another pure cleanup patch.
Signed-off-by: Joe Perches <joe@perches.com>
Improved-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove one level of indentation 'introduced' in commit
c3ae62af8e (tcp: should drop incoming frames without ACK flag set)
if (true) {
...
}
@acceptable variable is a boolean.
This patch is a pure cleanup.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge net into net-next because some upcoming net-next changes
build on top of bug fixes that went into net.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a fix for a bug triggering newly_acked_sacked < 0
in tcp_ack(.).
The bug is triggered by sacked_out decreasing relative to prior_sacked,
but packets_out remaining the same as pior_packets. This is because the
snapshot of prior_packets is taken after tcp_sacktag_write_queue() while
prior_sacked is captured before tcp_sacktag_write_queue(). The problem
is: tcp_sacktag_write_queue (tcp_match_skb_to_sack() -> tcp_fragment)
adjusts the pcount for packets_out and sacked_out (MSS change or other
reason). As a result, this delta in pcount is reflected in
(prior_sacked - sacked_out) but not in (prior_packets - packets_out).
This patch does the following:
1) initializes prior_packets at the start of tcp_ack() so as to
capture the delta in packets_out created by tcp_fragment.
2) introduces a new "previous_packets_out" variable that snapshots
packets_out right before tcp_clean_rtx_queue, so pkts_acked can be
correctly computed as before.
3) Computes pkts_acked using previous_packets_out, and computes
newly_acked_sacked using prior_packets.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_timeout_skb() was intended to trigger fast recovery on timeout,
unfortunately in reality it often causes spurious retransmission
storms during fast recovery. The particular sign is a fast retransmit
over the highest sacked sequence (SND.FACK).
Currently the RTO timer re-arming (as in RFC6298) offers a nice cushion
to avoid spurious timeout: when SND.UNA advances the sender re-arms
RTO and extends the timeout by icsk_rto. The sender does not offset
the time elapsed since the packet at SND.UNA was sent.
But if the next (DUP)ACK arrives later than ~RTTVAR and triggers
tcp_fastretrans_alert(), then tcp_timeout_skb() will mark any packet
sent before the icsk_rto interval lost, including one that's above the
highest sacked sequence. Most likely a large part of scorebard will be
marked.
If most packets are not lost then the subsequent DUPACKs with new SACK
blocks will cause the sender to continue to retransmit packets beyond
SND.FACK spuriously. Even if only one packet is lost the sender may
falsely retransmit almost the entire window.
The situation becomes common in the world of bufferbloat: the RTT
continues to grow as the queue builds up but RTTVAR remains small and
close to the minimum 200ms. If a data packet is lost and the DUPACK
triggered by the next data packet is slightly delayed, then a spurious
retransmission storm forms.
As the original comment on tcp_timeout_skb() suggests: the usefulness
of this feature is questionable. It also wastes cycles walking the
sack scoreboard and is actually harmful because of false recovery.
It's time to remove this.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_fixup_rcvbuf() contains a loop to estimate initial socket
rcv space needed for a given mss. With large MTU (like 64K on lo),
we can loop ~500 times and consume a lot of cpu cycles.
perf top of 200 concurrent netperf -t TCP_CRR
5.62% netperf [kernel.kallsyms] [k] tcp_init_buffer_space
1.71% netperf [kernel.kallsyms] [k] _raw_spin_lock
1.55% netperf [kernel.kallsyms] [k] kmem_cache_free
1.51% netperf [kernel.kallsyms] [k] tcp_transmit_skb
1.50% netperf [kernel.kallsyms] [k] tcp_ack
Lets use a 100% factor, and remove the loop.
100% is needed anyway for tcp_adv_win_scale=1
default value, and is also the maximum factor.
Refs: commit b49960a05e
("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add MIB counters for checksum errors in IP layer,
and TCP/UDP/ICMP layers, to help diagnose problems.
$ nstat -a | grep Csum
IcmpInCsumErrors 72 0.0
TcpInCsumErrors 382 0.0
UdpInCsumErrors 463221 0.0
Icmp6InCsumErrors 75 0.0
Udp6InCsumErrors 173442 0.0
IpExtInCsumErrors 10884 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
drivers/net/ethernet/intel/igb/igb_main.c
drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
include/net/scm.h
net/batman-adv/routing.c
net/ipv4/tcp_input.c
The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
cleanup in net-next to now pass cred structs around.
The be2net driver had a bug fix in 'net' that overlapped with the VLAN
interface changes by Patrick McHardy in net-next.
An IGB conflict existed because in 'net' the build_skb() support was
reverted, and in 'net-next' there was a comment style fix within that
code.
Several batman-adv conflicts were resolved by making sure that all
calls to batadv_is_my_mac() are changed to have a new bat_priv first
argument.
Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
rewrite in 'net-next', mostly overlapping changes.
Thanks to Stephen Rothwell and Antonio Quartulli for help with several
of these merge resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
commit bd090dfc63 (tcp: tcp_replace_ts_recent() should not be called
from tcp_validate_incoming()) introduced a TS ecr bug in slow path
processing.
1 A > B P. 1:10001(10000) ack 1 <nop,nop,TS val 1001 ecr 200>
2 B < A . 1:1(0) ack 1 win 257 <sack 9001:10001,TS val 300 ecr 1001>
3 A > B . 1:1001(1000) ack 1 win 227 <nop,nop,TS val 1002 ecr 200>
4 A > B . 1001:2001(1000) ack 1 win 227 <nop,nop,TS val 1002 ecr 200>
(ecr 200 should be ecr 300 in packets 3 & 4)
Problem is tcp_ack() can trigger send of new packets (retransmits),
reflecting the prior TSval, instead of the TSval contained in the
currently processed incoming packet.
Fix this by calling tcp_replace_ts_recent() from tcp_ack() after the
checks, but before the actions.
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/ipip.h
The changes made to ipip.h in 'net' were already included
in 'net-next' before that header was moved to another location.
Signed-off-by: David S. Miller <davem@davemloft.net>