Page to collect the gist of RFCs and general networking ideas.
Active Queue Management (AQM) by Matthew Macy
Active Queue Management is an effort to avoid the latency increases (and the resulting lengthening of the congestion feedback loop) and the bursty losses caused by naive tail drop in intermediate buffering. The concept was introduced, along with a discussion of the queue management algorithm RED (Random Early Detection), in RFC 2309. The most current RFC is 7567.
The usual mix of long, high-throughput and short, low-latency flows places conflicting demands on the queue occupancy of a switch:
- The queue must be short enough that it does not impose excessive latency on short flows.
- The queue must be long enough to allow long flows to saturate the path capacity.
- The queue must be able to absorb traffic bursts without excessive packet loss.
RED:
- The RED algorithm itself consists of two main parts: estimation of the average queue size and the decision of whether or not to drop an incoming packet.
- (a) Estimation of average queue size: RED estimates the average queue size, either in the forwarding path using a simple exponentially weighted moving average (such as presented in Appendix A of [Jacobson88]), or in the background (i.e., not in the forwarding path) using a similar mechanism.
- (b) Drop decision: In the second portion of the algorithm, RED decides whether or not to drop an incoming packet. It is RED's particular algorithm for dropping that results in performance improvement for responsive flows. Two RED parameters, minth (minimum threshold) and maxth (maximum threshold), figure prominently in this decision process. Minth specifies the average queue size *below which* no packets will be dropped, while maxth specifies the average queue size *above which* all packets will be dropped. As the average queue size varies from minth to maxth, packets will be dropped with a probability that varies linearly from 0 to maxp.
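A minimal sketch of the two RED steps described above (EWMA queue estimate, then a linear drop probability between minth and maxth); the parameter values and names here are illustrative, not taken from any particular implementation:

```python
import random

# Illustrative RED parameters (not from any specific implementation).
MIN_TH = 5.0      # minth: below this average queue size, never drop
MAX_TH = 15.0     # maxth: at or above this average queue size, always drop
MAX_P  = 0.1      # maxp: drop probability as the average approaches maxth
WEIGHT = 0.002    # EWMA weight for the average queue size

avg_queue = 0.0

def red_should_drop(current_queue_len):
    """Return True if RED decides to drop the incoming packet."""
    global avg_queue
    # (a) Estimate the average queue size with an EWMA.
    avg_queue = (1.0 - WEIGHT) * avg_queue + WEIGHT * current_queue_len

    # (b) Drop decision based on the averaged queue size.
    if avg_queue < MIN_TH:
        return False                      # never drop below minth
    if avg_queue >= MAX_TH:
        return True                       # always drop above maxth
    # Between the thresholds, the drop probability rises linearly to maxp.
    p = MAX_P * (avg_queue - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p
```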
Recommendations on Queue Management and Congestion Avoidance in the Internet
https://tools.ietf.org/html/rfc2309
IETF Recommendations Regarding Active Queue Management
https://tools.ietf.org/html/rfc7567
https://en.wikipedia.org/wiki/Active_queue_management
Explicit Congestion Notification (ECN) by Matthew Macy
At its core, ECN in TCP allows compliant routers to give compliant senders notification of "virtual drops": a congestion indicator that causes the sender to halve its congestion window. The sender does not have to wait for a retransmit timeout or duplicate ACKs to learn of a congestion event, and the receiver avoids the latency induced by drop/retransmit. ECN relies on some form of AQM in the intermediate routers/switches to decide when to mark the CE (Congestion Experienced) codepoint in the IP header; it is then the receiver's responsibility to set ECE (ECN-Echo) in the TCP header of the subsequent ACK. The receiver will continue to send ACKs marked with the ECE bit until it receives a packet with the CWR (Congestion Window Reduced) bit set. Note that although this last design decision makes ECN robust in the presence of ACK loss (the original ECN specification requires that ACKs / SYNs / SYN-ACKs not be marked as ECN capable, so they are not eligible for marking), it limits the congestion signal to once per RTT. As we'll see later, this leads to interoperability issues with DCTCP.
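A rough sketch of the receiver side of the feedback loop described above, assuming the classic RFC 3168 behavior; the class and flag names are illustrative:

```python
class EcnReceiver:
    """Toy model of classic (RFC 3168 style) receiver-side ECE echoing."""

    def __init__(self):
        self.echo_ece = False    # keep setting ECE on outgoing ACKs while True

    def on_data_packet(self, ce_marked, cwr_set):
        """Process one data segment; return the flags for the next ACK."""
        # CWR from the sender means it has already reduced cwnd, so the
        # receiver stops echoing ECE. This is what limits classic ECN to
        # signalling at most one congestion event per RTT.
        if cwr_set:
            self.echo_ece = False
        # A CE mark set by an AQM hop (re)starts the ECE echo.
        if ce_marked:
            self.echo_ece = True
        return {"ECE": self.echo_ece}

# Example: one CE mark keeps ECE set on every ACK until the sender sets CWR.
rx = EcnReceiver()
print(rx.on_data_packet(ce_marked=True,  cwr_set=False))   # {'ECE': True}
print(rx.on_data_packet(ce_marked=False, cwr_set=False))   # {'ECE': True}
print(rx.on_data_packet(ce_marked=False, cwr_set=True))    # {'ECE': False}
```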
ECN is negotiated at connection time. In FreeBSD it is configured by the sysctl 'net.inet.tcp.ecn.enable':
0 | Disable ECN.
1 | Allow incoming connections to request ECN. Outgoing connections will request ECN.
2 (default) | Allow incoming connections to request ECN. Outgoing connections will not request ECN.
The last time a survey was done, 2.7% of hosts on the Internet would not respond to a SYN negotiating ECN. This isn't fatal, as subsequent SYNs will switch to not requesting ECN; it just adds the default retransmission timeout to connection establishment (3s in FreeBSD, 1s per RFC 6298 - discussed later).
Linux has some sensible configurability improvements. Its ECN knob also takes on _3_ values: 0) no request / no accept, 1) request / accept, 2) no request / accept. The default (2) accepts ECN for those adventurous enough to request it. The route command can also enable ECN per subnet, in effect allowing servers / clients to only use it within a data center or between compliant data centers.
ECN sees very little usage due to continued compatibility concerns. The difficulty of correctly tuning maxth and minth in RED and many other AQM mechanisms is not specific to ECN, but because RED et al. are a prerequisite for ECN, that difficulty adds to the cost of deploying it.
Talks: More Accurate ECN Feedback in TCP (AccECN) - https://www.ietf.org/proceedings/90/slides/slides-90-tcpm-10.pdf
ECN is slow and does not report the extent of congestion, just its existence, and it lacks interoperability with DCTCP. A mechanism is needed for negotiating finer-grained, adaptive congestion notification.
RFCS:
A Proposal to add Explicit Congestion Notification (ECN) to IP
https://tools.ietf.org/html/rfc2481 Initial proposal.
The Addition of Explicit Congestion Notification (ECN) to IP
https://tools.ietf.org/html/rfc3168 Elaboration and further specification of how to tie it in to TCP.
Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets
https://tools.ietf.org/html/rfc5562 Sometimes referred to as ECN+. This extends ECN to SYN/ACK packets. Note that SYN packets are still not covered, being considered a potential security hole.
Problem Statement and Requirements for Increased Accuracy in Explicit Congestion Notification (ECN) Feedback
https://tools.ietf.org/html/rfc7560
(AccECN)
- "A primary motivation for this document is to intervene before each proprietary implementation invents its own non-interoperable handshake, which could lead to _de facto_ consumption of the few flags or codepoints that remain available for standardizing capability negotiation."
Data Center Transmission Control Protocol (DCTCP) by Matthew Macy
This congestion control protocol, developed by Microsoft and Stanford, uses simplified switch RED/ECN CE marking to provide fine-grained congestion notification to senders. RED is enabled in the switch but with minth = maxth = K, where K is an empirically determined constant that is a function of bandwidth and of the desired switch utilization vs. rate of convergence. Common values for K are 5 for 1Gbps and 60 for 10Gbps; the value for 40Gbps is presumably on the order of 240. The sender's congestion window is scaled back once per RTT as a function of (#ECE / #segments in window)/2. In the degenerate case of all segments being marked, the window is scaled back as it would be for a loss in Reno. In the steady state, latencies are much lower than in Reno due to considerably reduced switch occupancy.
There is currently no mechanism for negotiating CC protocols, and DCTCP's reliance on continuous ECE feedback is incompatible with standard ECN's repetition of the same ECE until a CWR is received. In effect, ECN support has to be successfully negotiated when establishing the connection, but the receiver then has to provide one ECE per new CE seen rather than standard ECN behavior.
RFC: Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters
https://tools.ietf.org/pdf/draft-ietf-tcpm-dctcp-00.pdf
The window scaling constant is referred to as 'alpha'. Alpha=0 corresponds to no congestion, alpha=1 corresponds to a loss event in Reno or an ECE mark in standard ECN - resulting in a halving of the congestion window. 'g' is the feedback gain, 'M' is the fraction of bytes marked to bytes sent. Alpha and the congestion window 'cwnd' are calculated as follows:
alpha = alpha * (1 - g) + g * M
cwnd = cwnd * (1 - alpha/2)
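A sketch of the once-per-RTT sender update implied by the two formulas above, using DCTCP's customary default gain g = 1/16; the function and variable names are illustrative:

```python
G = 1.0 / 16.0   # feedback gain 'g'; 1/16 is the commonly used default

def dctcp_update(alpha, cwnd, bytes_marked, bytes_acked):
    """Once-per-RTT DCTCP update from the two formulas above."""
    m = bytes_marked / float(bytes_acked) if bytes_acked else 0.0
    alpha = alpha * (1.0 - G) + G * m     # EWMA of the marked fraction M
    cwnd = cwnd * (1.0 - alpha / 2.0)     # back off in proportion to alpha
    return alpha, cwnd

# Degenerate case: every byte in the window was CE-marked, alpha stays at 1
# and the window is halved, just like a Reno loss event.
alpha, cwnd = dctcp_update(1.0, 100.0, bytes_marked=150000, bytes_acked=150000)
print(alpha, cwnd)    # -> 1.0 50.0
```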
To cope with delayed ACKs, DCTCP specifies a receiver-side state machine. CE refers to DCTCP.CE, a new Boolean TCP state variable, "DCTCP Congestion Encountered", which is initialized to false and stored in the Transmission Control Block (TCB).
The clear implication of this is that if ACKs are delayed across more segments than expected (e.g., mismatched delayed-ACK assumptions between peers) or dropped, the signal can underestimate the level of congestion encountered. None of the literature suggests that this has been a problem in practice.
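As a hedged illustration of the receiver state machine referred to above (my reading of the draft): the receiver sends an immediate ACK whenever the CE state of arriving segments flips, so a single ACK never mixes CE and non-CE segments. The class name and the delayed-ACK factor are illustrative:

```python
class DctcpReceiver:
    """Illustrative DCTCP.CE / delayed-ACK state machine."""

    DELAYED_ACK_SEGS = 2          # segments coalesced per delayed ACK (assumed)

    def __init__(self):
        self.ce = False           # DCTCP.CE from the TCB
        self.pending = 0          # segments received but not yet ACKed

    def on_segment(self, ce_marked):
        """Process one data segment; return the list of ACKs to emit."""
        acks = []
        if ce_marked != self.ce:
            # CE state changed: immediately ACK everything received so far
            # with the *old* CE value, so one ACK never mixes CE and non-CE.
            if self.pending:
                acks.append({"ECE": self.ce, "segments": self.pending})
                self.pending = 0
            self.ce = ce_marked
        self.pending += 1
        # Otherwise normal delayed-ACK behaviour, echoing the current CE state.
        if self.pending >= self.DELAYED_ACK_SEGS:
            acks.append({"ECE": self.ce, "segments": self.pending})
            self.pending = 0
        return acks
```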
[Section 3.4 of RFC] Handling of SYN, SYN-ACK, RST Packets
- [RFC3168] requires that a compliant TCP MUST NOT set ECT on SYN or SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets, but maintains the restriction of no ECT on SYN packets. Both these RFCs prohibit ECT in SYN packets due to security concerns regarding malicious SYN packets with ECT set. These RFCs, however, are intended for general Internet use, and do not directly apply to a controlled datacenter environment. The switching fabric can drop TCP packets that do not have the ECT set in the IP header. If SYN and SYN-ACK packets for DCTCP connections do not have ECT set, they will be dropped with high probability. For DCTCP connections, the sender SHOULD set ECT for SYN, SYN-ACK and RST packets.
[Section 4] Implementation Issues
- The implementation must choose a suitable estimation gain (feedback gain). [DCTCP10] provides a theoretical basis for its selection; in practice it is more practical to select it empirically per network/workload.
- The implementation must decide when to use DCTCP. DCTCP may not be suitable or supported for all peers.
- It is RECOMMENDED that the implementation deal with loss episodes in the same way as conventional TCP.
- To prevent incast throughput collapse, the minimum RTO (MinRTO) should be lowered significantly. The default value of MinRTO is 300ms in Windows, 200ms in Linux, and 233ms in FreeBSD. A lower MinRTO requires a correspondingly lower delayed ACK timeout on the receiver. Thus, it is RECOMMENDED that an implementation allow configuration of lower timeouts for DCTCP connections.
- It is also RECOMMENDED that an implementation allow configuration of restarting the congestion window (cwnd) of idle DCTCP connections as described in [RFC5681].
- [RFC3168] forbids the ECN-marking of pure ACK packets, because of the inability of TCP to mitigate ACK-path congestion and the protocol-wise preferential treatment by routers. However, dropping pure ACKs - rather than ECN marking them - has disadvantages for typical datacenter traffic patterns. Dropping of ACKs causes subsequent retransmissions. It is RECOMMENDED that an implementation provide a configuration knob that forces ECT to be set on pure ACKs.
[Section 5] Deployment Issues
- DCTCP and conventional TCP congestion control do not coexist well in the same network. In DCTCP, the marking threshold is set to a very low value to reduce queueing delay, and a relatively small amount of congestion will exceed the marking threshold. During such periods of congestion, conventional TCP will suffer packet loss and quickly and drastically reduce cwnd. DCTCP, on the other hand, will use the fraction of marked packets to reduce cwnd more gradually. Thus, the rate reduction in DCTCP will be much slower than that of conventional TCP, and DCTCP traffic will gain a larger share of the capacity compared to conventional TCP traffic traversing the same path. It is RECOMMENDED that DCTCP traffic be segregated from conventional TCP traffic. [MORGANSTANLEY] describes a deployment that uses the IP DSCP bits to segregate the network such that AQM is applied to DCTCP traffic, whereas TCP traffic is managed via drop-tail queuing.
- Since DCTCP relies on congestion marking by the switches, DCTCP can only be deployed in datacenters where the entire network infrastructure supports ECN. The switches may also support configuration of the congestion threshold used for marking. The proposed parameterization can be configured with switches that implement RED. [DCTCP10] provides a theoretical basis for selecting the congestion threshold, but as with the estimation gain, it may be more practical to rely on experimentation or simply to use the default configuration of the device. DCTCP will degrade to loss-based congestion control when transiting a congested drop-tail link.
- DCTCP requires changes on both the sender and the receiver, so both endpoints must support DCTCP. Furthermore, DCTCP provides no mechanism for negotiating its use, so both endpoints must be configured through some out-of-band mechanism to use DCTCP. A variant of DCTCP that can be deployed unilaterally and only requires standard ECN behavior has been described in [ODCTCP][BSDCAN], but requires additional experimental evaluation.
[Section 6] Known Issues
- DCTCP relies on the sender's ability to reconstruct the stream of CE codepoints received by the remote endpoint. To accomplish this, DCTCP avoids using a single ACK packet to acknowledge segments received both with and without the CE codepoint set. However, if one or more ACK packets are dropped, it is possible that a subsequent ACK will cumulatively acknowledge a mix of CE and non-CE segments. This will, of course, result in a less accurate congestion estimate. Mitigating considerations:
  - Even with an inaccurate congestion estimate, DCTCP may still perform better than [RFC3168].
  - If the estimation gain is small relative to the ACK loss rate, the estimate may not be too inaccurate.
  - If ACK loss occurs mostly under heavy congestion, most drops will occur during an unbroken string of CE packets, and the estimate will be unaffected.
- The effect of packet drops on DCTCP under real-world conditions has not been analyzed.
- Much like standard TCP, DCTCP is biased against flows with longer RTTs. A method for improving the fairness of DCTCP has been proposed in [ADCTCP], but requires additional experimental evaluation.
Papers: Data Center TCP [DCTCP10]
http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf
The original DCTCP SIGCOMM paper by Stanford and Microsoft Research. It is very accessible even for those of us not well versed in CC protocols.
- reduce minRTO to 10ms.
- suggest that K > (RTT * C)/7, where C is the sending rate in packets per second.
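To make the K > (RTT * C)/7 rule of thumb concrete, a small worked calculation; the link speed is from the discussion above, while the RTT is an assumed value for illustration only:

```python
# Rough sizing of the marking threshold K from K > (RTT * C) / 7.
link_bps    = 10e9        # 10 Gbps link, as in the discussion above
rtt_s       = 500e-6      # assumed 500 microsecond datacenter RTT
packet_bits = 1500 * 8    # full-sized 1500-byte frames

c_pps = link_bps / packet_bits      # C in packets per second (~833,333)
k_min = (rtt_s * c_pps) / 7.0       # lower bound on K, in packets
print(round(c_pps), round(k_min))   # -> 833333 60  (K > ~60 packets)
```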
Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter [MORGANSTANLEY]
https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-judd.pdf
Real world experience deploying DCTCP on Linux at Morgan Stanley.
- reduce minRTO to 5ms.
- reduce the delayed ACK timeout to 1ms.
- Only ToR switches support ECN marking; higher-level switches are purely tail-drop.
- Tests show that DCTCP successfully resorts to loss-based congestion control when transiting a congested drop-tail link.
- Setting ECT on SYN and SYN-ACK packets (referred to as DCTCP+) proved critical to the deployment of DCTCP. Under load, DCTCP would fail to establish network connections in the absence of ECT in SYN and SYN-ACK packets.
- rather than the theoretical 1.4 x TCP.
Per-packet latency (ms):
| | TCP | DCTCP+ |
| Mean | 4.01 | 0.0422 |
| Median | 4.06 | 0.0395 |
| Maximum | 4.20 | 0.0850 |
| Minimum | 3.32 | 0.0280 |
| Sigma | 0.167 | 0.0106 |
Extensions to FreeBSD Datacenter TCP for Incremental Deployment Support [BSDCAN]
https://www.bsdcan.org/2015/schedule/attachments/315_dctcp-bsdcan2015-paper.pdf
Proposes a variant of DCTCP that can be deployed only on one endpoint of a connection, provided the peer is ECN-capable.
ODCTCP changes:
- In order to facilitate one-sided deployment, a DCTCP sender should set the CWR mark after receiving an ECE-marked ACK, once per RTT. This is safe in two-sided deployments, because a regular DCTCP receiver will simply ignore the CWR mark.
- A one-sided DCTCP receiver must watch for incoming packets marked with CWR, which is the only indication of recovery exit.
DCTCP improvements:
- ECE processing: Under standard ECN, an ACK with an ECE mark will trigger congestion recovery; when this happens, a sender stops increasing cwnd for one RTT. For DCTCP there is no reason for this response. ECEs are used not for detecting congestion events but to quantify the extent of congestion and react proportionally. Thus, there is no need to stop cwnd from increasing (cwnd is instead reduced in proportion to the fraction of ECE seen).
- Idle periods: the specification does not say how idle periods apply to alpha. The FreeBSD implementation re-initializes alpha after an idle period longer than the RTO.
- Alpha updates: DCTCP specifies the update interval for alpha as one RTT. To track this, DCTCP compares received ACKs against the sequence numbers of outgoing packets. This is not robust in the face of packet loss. The FreeBSD implementation addresses this by also updating alpha when it detects duplicate ACKs or timeouts.
Data Center TCP (DCTCP)
http://www.ietf.org/proceedings/80/slides/iccrg-3.pdf
Case studies, workloads, latency and flow completion time of TCP vs DCTCP. Interesting set of slides worth skimming.
- Small (10-100KB & 100KB - 1MB) background flows complete in ~45% less time than TCP.
- 99th %ile & 99.9th %ile query flow completion times are roughly 2/3rds and 4/7ths of TCP's, respectively
- large (1-10MB & > 10MB) flows unchanged
- query completion time with 10 to 1 background incast unchanged with DCTCP, ~5x slower with TCP
Analysis of DCTCP: Stability, Convergence, and Fairness [ADCTCP]
http://sedcl.stanford.edu/files/dctcp-analysis.pdf
Follow up mathematical analysis of DCTCP using a fluid model. Contains interesting graphs showing how the gain factor affects the convergence rate between two flows.
- Analyzes the convergence of DCTCP sources to their fair share, obtaining an explicit characterization of the convergence rate.
- Proposes a change that significantly improves DCTCP's RTT-fairness: updating the congestion window continuously rather than once per RTT.
- Shows that when the marking threshold K is a sufficiently large fraction of the bandwidth-delay product, DCTCP achieves 100% throughput, and that even for values of K as small as 1% of the bandwidth-delay product, its throughput is at least 94%.
Using Data Center TCP (DCTCP) in the Internet [ODCTCP]
http://www.ikr.uni-stuttgart.de/Content/Publications/Archive/Wa_GLOBECOM_14_40260.pdf
Investigates what would be needed to deploy DCTCP incrementally outside the data center.
- Proposes finer resolution for the alpha value.
- Allows the congestion window to grow in the CWR state (similar to [BSDCAN]).
- Continuous update of alpha: defines a smaller gain factor (1/2^8 instead of 1/2^4) to permit an EWMA update on every packet. However, g should actually be a function of the number of packets in flight.
- Adjusts the congestion window on the reception of each ECE.
- Considers coexistence between DCTCP and non-DCTCP traffic.
Incast Congestion Control for TCP (ICTCP)
In ICTCP the receiver plays a direct role in estimating the per-flow available bandwidth and actively re-sizes each connection's receive window accordingly.
- http://research.microsoft.com/pubs/141115/ictcp.pdf
Quantized Congestion Notification (QCN) by Matthew Macy
Congestion control in Ethernet. Introduced as part of the IEEE 802.1 Data Center Bridging [DCB] standards effort, motivated by the needs of FCoE. The initial congestion control protocol was standardized as 802.1Qau. Unlike the single bit of congestion information per packet in TCP/ECN, QCN uses 6 bits.
The algorithm is composed of two main parts: Switch or Control Point (CP) Dynamics and Rate Limiter or Reaction Point (RP) Dynamics.
- The CP algorithm runs at the network nodes. Its objective is to maintain the node's buffer occupancy at the operating point 'Beq'. It computes a congestion measure Fb and randomly samples an incoming packet with a probability proportional to the severity of the congestion. The node sends a 6-bit quantized value of Fb back to the source of the sampled packet.
- B: value of the current queue length
- Bold: value of the buffer occupancy when the last feedback message was generated
- w: a non-negative constant, equal to 2 for the baseline implementation
- Boff = B - Beq
- Bd = B - Bold
- Fb = Boff + w*Bd
- This is essentially equivalent to the PI AQM controller: the first term is the offset from the target operating point, and the second term is proportional to the rate at which the queue size is changing.
When Fb < 0, there is no congestion, and no feedback messages are sent. When Fb >= 0, then either the buffers or the link is oversubscribed, and control action needs to be taken.
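A sketch of the CP computation above, using this page's sign convention (Fb >= 0 means congestion) and 6-bit quantization. The sampling step is simplified: here the sampling probability is passed in, whereas QCN actually adapts it to the congestion severity. The class name and the Beq value are illustrative:

```python
import random

BEQ    = 26       # Beq: target buffer occupancy in packets (illustrative)
W      = 2.0      # w: weight on the queue-growth term (baseline value)
FB_MAX = 63       # Fb is quantized to 6 bits

class QcnControlPoint:
    """Sketch of the QCN CP (switch) feedback computation."""

    def __init__(self):
        self.b_old = 0            # queue length at the last feedback message

    def on_packet(self, queue_len, sample_prob):
        """Return a quantized Fb to send to the packet's source, or None."""
        if random.random() > sample_prob:      # sample only some arrivals
            return None
        b_off = queue_len - BEQ                # offset from the operating point
        b_d   = queue_len - self.b_old         # how fast the queue is growing
        fb    = b_off + W * b_d
        if fb < 0:
            return None                        # no congestion, no feedback
        self.b_old = queue_len                 # remember state at feedback time
        return min(int(fb), FB_MAX)            # 6-bit quantized feedback value
```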
- The RP algorithm runs on end systems (NICs) and controls the rate at which Ethernet packets are transmitted. Unlike TCP, the RP algorithm does not get positive ACKs from the network and thus needs alternative mechanisms for increasing its sending rate.
- Current Rate (Rc): the transmission rate of the source
- Target Rate (Rt): the transmission rate of the source just before the arrival of the last feedback message
- Gd: a constant chosen so that the rate can decrease by at most 50%. Only 6 bits are available for feedback, so Fbmax = 64, and thus Gd = 1/128.
- Byte Counter and Timer: used to time rate increases
Rate Decreases:
A rate decrease is only done when a feedback message is received:
- Rt <- Rc
- Rc <- Rc*(1 - Gd*|Fb|)
Rate Increases:
Rate Increase is done in two phases: Fast Recovery and Active Increase.
- Fast Recovery (FR): The source enters the FR state immediately after a rate decrease event, at which point the Byte Counter is reset. FR consists of 5 cycles, in each of which 150KB of data (assuming full-sized regular frames) are transmitted (100 packets of 1500 bytes each), as counted by the Byte Counter. At the end of each cycle, Rt remains unchanged, and Rc is updated as follows:
Rc <- (Rc + Rt)/2
- The rationale being that, when congested, Rate Decrease messages are sent by the CP once every 100 packets. Thus the absence of a Rate Decrease message during this interval indicates that the CP is no longer congested.
- Active Increase (AI): After 5 cycles of FR, the source enters the AI state, in which it probes for extra bandwidth. AI consists of multiple cycles of 50 packets each. Rt and Rc are updated as follows:
- Rt <- Rt + Rai
- Rc <- (Rc + Rt)/2
- Rai: a constant set to 5Mbps by default.
- Timer: lets a source whose sending rate is very low still time its rate increases (the Byte Counter alone would take too long to complete a cycle).
  1) The Timer is reset when a rate decrease message arrives.
  2) The source enters FR and the Timer counts out 5 cycles of T ms duration (T = 10ms in the baseline implementation); in the AI state, each Timer cycle is T/2 ms long.
- Rc is updated whenever either the Byte Counter or the Timer completes a cycle.
- The source is in the AI state when either the Byte Counter or the Timer is in the AI state.
- When both the Byte Counter and the Timer are in the AI state, the source is said to be in Hyper-Active Increase (HAI). In this case, at the completion of the i-th Byte Counter and Timer cycle, Rt and Rc are updated:
  - Rt <- Rt + i*Rhai
  - Rc <- (Rc + Rt) / 2
  - Rhai: 50Mbps in the baseline
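A sketch collecting the RP rate adjustments above (rate decrease, FR, AI, and HAI); the Byte Counter / Timer bookkeeping is omitted, and the constants follow the baseline values quoted in the text:

```python
GD    = 1.0 / 128.0   # Gd: at most a 50% decrease with Fbmax = 64
R_AI  = 5e6           # Rai: Active Increase step, 5 Mbps baseline
R_HAI = 50e6          # Rhai: Hyper-Active Increase step, 50 Mbps baseline

class QcnReactionPoint:
    """Sketch of the QCN RP (NIC rate limiter) dynamics."""

    def __init__(self, rate_bps):
        self.rc = float(rate_bps)   # Current Rate
        self.rt = float(rate_bps)   # Target Rate

    def on_feedback(self, fb):
        """Rate decrease: only happens when a feedback message arrives."""
        self.rt = self.rc
        self.rc = self.rc * (1.0 - GD * abs(fb))

    def fast_recovery_cycle(self):
        """One of the 5 FR cycles (~150KB each): move Rc halfway back to Rt."""
        self.rc = (self.rc + self.rt) / 2.0

    def active_increase_cycle(self):
        """One AI cycle (~50 packets): probe for extra bandwidth."""
        self.rt += R_AI
        self.rc = (self.rc + self.rt) / 2.0

    def hyper_active_increase_cycle(self, i):
        """i-th HAI cycle, when both the Byte Counter and Timer are in AI."""
        self.rt += i * R_HAI
        self.rc = (self.rc + self.rt) / 2.0
```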
[Taken from "Internet Congestion Control" by Subir Varma, ch. 8]