TCP Incast and Cloud application performance

Incast is a many-to-one communication pattern commonly found in cloud data centers running scale-out distributed storage and computing frameworks such as Hadoop, MapReduce, HDFS, and Cassandra — frameworks powering applications such as web search, maps, social networks, data warehousing, and analytics.

Incast may also be referred to more specifically as TCP Incast, because the cloud applications creating this communication pattern rely heavily on TCP.

The basic pattern of Incast begins when a single Parent server places a request for data to a cluster of Nodes, which all receive the request simultaneously. The Nodes may in turn all respond synchronously to that single Parent. The result is a micro burst of many machines simultaneously sending TCP data streams to one machine (many-to-one).

Incast traffic can consist of very short-lived flows (micro bursts), depending on the application. For example, a Parent server might request 80KB of data spread across 40 Nodes. Each Node simultaneously responds with 2KB of data — just two packets from each Node. In a real-world scenario, the Parent could be a database server requesting an 80KB photo of the newest friend added to your social network.
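The arithmetic of the example above can be sketched in a few lines. The numbers (40 Nodes, 2KB responses, a 1460-byte TCP payload per full-size Ethernet frame) are illustrative assumptions, not measurements:

```python
# Back-of-envelope sizing of the Incast micro burst from the example.
NODES = 40
RESPONSE_BYTES = 2 * 1024      # 2KB per Node
MTU_PAYLOAD = 1460             # TCP payload per full-size Ethernet frame

packets_per_node = -(-RESPONSE_BYTES // MTU_PAYLOAD)  # ceiling division
total_bytes = NODES * RESPONSE_BYTES                  # 80KB arriving at once
total_packets = NODES * packets_per_node              # one synchronized burst

print(packets_per_node, total_bytes, total_packets)   # 2 81920 80
```

Eighty packets is a trivial amount of data in aggregate, but all of it converges on one egress port in the same instant — which is the whole problem.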

“TCP Incast” may also be used in a context referring to the detrimental effect on TCP throughput caused by the network congestion that Incast communication patterns create.

This simultaneous many-to-one burst can cause egress congestion at the network port attached to the Parent server, overwhelming the port’s egress buffer. The resulting packet loss requires Nodes to detect the loss (missing ACKs), re-send data (after an RTO), and slowly ramp throughput back up per standard TCP behavior. The application issuing the original request might wait until all data has been received before the job can complete, or simply return partial results after a delay threshold is exceeded. Either way, the speed, quality, and consistency of performance suffer.
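To see why a single retransmission timeout is so damaging here, compare the burst’s transfer time to one RTO. The 1Gb/s link speed and the 200ms minimum RTO (a common default in older TCP stacks) are illustrative assumptions:

```python
# Rough comparison: time to deliver the burst vs. one retransmission timeout.
LINK_BPS = 1_000_000_000        # 1Gb/s port facing the Parent server
BURST_BYTES = 80 * 1024         # the 80KB micro burst from the example
MIN_RTO_S = 0.200               # common minimum RTO in older TCP stacks

transfer_s = BURST_BYTES * 8 / LINK_BPS      # well under a millisecond
with_one_rto_s = transfer_s + MIN_RTO_S      # one dropped packet stalls the job

inflation = round(with_one_rto_s / transfer_s)
print(inflation)    # a single timeout inflates job latency roughly 300x
```

One dropped packet turns a sub-millisecond job into a fifth of a second — a latency the application’s users can actually feel.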

The congestion caused by Incast has the effect of increasing the latency observed by the application and its users. There’s a give and take with application performance as Node cluster sizes grow. Larger clusters provide more distributed processing capacity allowing more jobs to complete faster. However larger clusters also mean a wider fan-in source for Incast traffic, increasing network congestion and lowering TCP throughput per Node.

The “give” of a larger cluster is a lot more obvious than the “take”, as cloud data centers continue to build larger cluster sizes to increase application performance. The question is: How much of the “take” (Incast congestion) is affecting the true performance potential of the application?  And what can be done to measure and reduce the impact of Incast congestion?

Some have proposed a new variation of TCP specifically for the data center, called Data Center TCP (DCTCP). This variation optimizes the way congestion is detected and can quickly minimize the amount of in-flight data at risk during periods of congestion, minimizing packet loss and timeouts. DCTCP optimizations have the effect of reducing buffer utilization on switches, which in theory eases the buffering requirements on switches. However, based on what I’ve read, DCTCP needs longer-lived flows before it can accurately detect and react to congestion.
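The core of DCTCP, as described in the DCTCP paper, is a running estimate of the fraction of ECN-marked packets, which scales the window cut in proportion to how congested the path actually is. A minimal sketch, with the paper’s suggested gain of 1/16 and illustrative numbers:

```python
# Sketch of DCTCP's proportional congestion response (per the DCTCP paper).
G = 1 / 16  # EWMA gain for the marking-fraction estimate

def update_alpha(alpha: float, marked: int, total: int) -> float:
    """Once per window: fold the observed ECN-marked fraction F into alpha."""
    f = marked / total if total else 0.0
    return (1 - G) * alpha + G * f

def dctcp_cwnd(cwnd: float, alpha: float) -> float:
    """On congestion, shrink cwnd by alpha/2 instead of standard TCP's 1/2."""
    return cwnd * (1 - alpha / 2)

# Mild congestion (10% of packets marked) barely dents the window,
# where standard TCP would have halved it from 100 to 50.
alpha = update_alpha(0.0, marked=1, total=10)
print(dctcp_cwnd(100.0, alpha))   # ~99.7
```

The estimator is exactly why DCTCP wants longer-lived flows: alpha only converges after the sender has observed marks across several windows, which a two-packet burst never provides.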

While DCTCP appears to hold promise for overall performance improvements within the cloud data center, it’s not the complete end-all answer, IMHO. Take, for example, our 80KB micro burst Incast to retrieve a friend’s photo. This micro burst of 2 packets per Node is not long-lived enough for DCTCP optimizations to help. (DCTCP evangelists: Feel free to correct me here if I am wrong about this. I’ve been wrong before.)

Another consideration is taking a closer look at the switch architecture. For example, a switch with sufficient egress buffer resources will be better able to absorb the Incast micro bursts, preventing the packet loss and timeouts that hurt performance. Are those buffers dedicated per port? Are they shared? How well does the fabric chip(s) inside the switch detect and manage micro burst congestion?
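The dedicated-versus-shared buffer question reduces to a simple check: can the egress port queue the whole burst before draining it? The buffer sizes below are illustrative assumptions, not any particular switch’s specs:

```python
# Can a given egress buffer absorb the 80KB Incast burst without dropping?
BURST_BYTES = 80 * 1024     # the burst converging on the Parent's port

def absorbs(buffer_bytes: int) -> bool:
    """True if the port can queue the entire burst, so nothing is lost."""
    return buffer_bytes >= BURST_BYTES

per_port_dedicated = 48 * 1024    # e.g. a shallow fixed per-port slice
shared_pool_loan = 512 * 1024     # e.g. what a dynamic shared pool might lend

print(absorbs(per_port_dedicated), absorbs(shared_pool_loan))   # False True
```

A shallow dedicated slice drops packets even though the switch as a whole has plenty of memory; a dynamically shared pool can lend that memory to the one hot port for the duration of the burst.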

With the right switch architecture in place to handle the micro bursts, in combination with optimizations like DCTCP to provide better performance for longer lived flows, the “take” can be minimized and the “give” accentuated.  The result is a more responsive application.

Some will argue about the higher capital costs of better engineered switches. Understood. I get that. But how do you put a price tag on application performance? How do we measure and quantify the performance differences a better engineered switch provides? And how do you place a value on those deltas?

And finally: How do we engineer a series of industry-accepted benchmark tests that provide better visibility into how a switch (or network of switches) will perform under the real-world traffic patterns of a cloud data center (such as Incast micro bursts)? Rather than tests that simply validate a data sheet, wouldn’t it be more valuable to validate and test real-world application performance? I think so. What do you think?






  1. says

    Isn’t PFC supposed to solve exactly this type of problem? What’s wrong with using PFC for TCP traffic? … not that TCP couldn’t use some improvement 😉

    • says

      Hi Ivan,
      Good question. I think the main challenges with using PFC here would be 1) a more unique and perhaps expensive NIC, as PFC is not ubiquitous yet, and 2) head-of-line blocking in a multi-tier network.
      DCTCP is a 30-line code change to TCP (so the paper says). Moreover, DCTCP provides selective feedback per TCP flow, rather than for the entire physical port, as PFC would do.


    • Mitch says

      Actually, we have studied TCP Incast (3 TCP versions) in a CEE fabric, with and without PFC and QCN, respectively. It turns out that PFC helps significantly. However, a bunch of qualifiers apply, all covered in “Short and Fat: TCP Performance in CEE Datacenter Networks” at HOTI 19. If you can’t find it, I can provide the paper. Mitch

  2. says

    When one looks at the logical and physical links, load-balancing algorithms, and failover/HA scenarios, I believe we could do with some better flow diagrams/visualisations to help understand and see where traffic actually goes. A simulator in which one could break components/flows and visually see the deltas would rock! I mean this in regards to flow modelling at a low price point (e.g. not some OpNet $$ package).

    It’s very hard to convey such things to decision makers when we have flows inside hosts, inside hypervisors, inside virtual interfaces, inside virtual pipes, inside VRFs, etc… we are trying to abstract and divorce the devil from the detail, whereupon better flow modelling would allow all partners to build better platforms for customers, e.g. realise the inherent benefits and help protect from human fallibility. (Note: albeit I love the simplexity in your diagrams!)


  3. Dennis Olvany says

    I manage an environment that doesn’t support anything as advanced as Hadoop, but we have recently acquired servers capable of fully utilizing gigabit connectivity. The switch setup is gigabit downlink to two servers and a 2x gigabit etherchannel uplink. With the most granular etherchannel load balancing algorithm the result is similar to the egress buffer condition that you describe. Even at low bandwidth utilization (50Mbps), transmit discards occur at the uplink.

    While it would seem logical that a 2x gigabit etherchannel is greater than or equal to two gigabit links, it just isn’t so. The most granular etherchannel algorithm entails link affinity and the combined pps coming in from the servers is enough to cause transmit discards on one of the uplinks during a burst. This seems to be a limitation of traditional ethernet implementation and I have a difficult time realizing that the solution will be delivered by an upper-layer protocol.
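The link-affinity effect described above can be sketched with a toy hash. The XOR-of-low-address-bits hash mimics common switch load-balancing behavior, and the addresses are illustrative assumptions:

```python
# Why hash-based etherchannel load balancing can overload one member link:
# the hash pins each flow to a single link, so bursts from both servers
# can land on the same uplink while the other member sits idle.
LINKS = 2  # 2x gigabit etherchannel

def member_link(src_octet: int, dst_octet: int) -> int:
    """Toy flow affinity: XOR of low address bits selects the member link."""
    return (src_octet ^ dst_octet) % LINKS

server_a, server_b, gateway = 11, 13, 1   # illustrative low address octets

# Both servers' flows toward the gateway hash onto member link 0.
print(member_link(server_a, gateway), member_link(server_b, gateway))  # 0 0
```

With both flows pinned to one member, the channel's aggregate capacity is irrelevant during a burst — one gigabit link takes the whole hit, exactly as the transmit discards suggest.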

    There are probably a few things that could be done at the ethernet layer to address this issue. A round-robin etherchannel algorithm could go a long way to mitigating transmit discards. A 10GE uplink would provide adequate pps to mitigate the discards. As you mentioned, increasing the buffer length can help sustain a burst. Flow control can chain multiple buffers together to achieve a similar effect (Does anyone use flow control in the datacenter?). It would also be interesting if the queue length were a measure of bytes instead of packets. This would dynamically increase the buffer capacity when dealing with small packets.
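The bytes-versus-packets queue-length idea above is easy to quantify. The limits below are illustrative assumptions:

```python
# A queue capped in packets holds far less data when packets are small;
# a queue capped in bytes admits many more small packets for the same
# nominal capacity.
PKT_LIMIT = 100             # packet-count queue limit
BYTE_LIMIT = 100 * 1500     # the same nominal capacity expressed in bytes

def held_by_packet_limit(limit_pkts: int, pkt_size: int) -> int:
    """Packet-count limit: fixed, regardless of packet size."""
    return limit_pkts

def held_by_byte_limit(limit_bytes: int, pkt_size: int) -> int:
    """Byte limit: capacity in packets grows as packets shrink."""
    return limit_bytes // pkt_size

small = 64  # bytes, e.g. minimum-size frames
print(held_by_packet_limit(PKT_LIMIT, small),
      held_by_byte_limit(BYTE_LIMIT, small))   # 100 vs 2343 small packets
```

For a burst of small packets, the byte-capped queue absorbs over twenty times as many frames within the same memory budget.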

    I would be interested to know how closely transmit discards are monitored in other environments. I have long considered transmit discards the ultimate capacity planning metric and I monitor discards across the entire environment. I am sure most of us expect to see transmit discards at the wan links, but probably not so much within the datacenter. I would hope that this is an issue the vendors are addressing. It just doesn’t seem right to deploy 10GE for lossless 50Mbps.

