TCP Incast and Cloud application performance

Incast is a many-to-one communication pattern commonly found in cloud data centers implementing scale out distributed storage and computing frameworks such as Hadoop, MapReduce, HDFS, Cassandra, etc. – powering applications such as web search, maps, social networks, data warehousing and analytics.

Incast can also more specifically be referred to as TCP Incast, as the cloud applications creating this communication pattern rely heavily on TCP.

The basic pattern of Incast begins when a singular Parent server places a request for data to a cluster of Nodes which all receive the request simultaneously. The cluster of Nodes may in turn all synchronously respond to the singular Parent. The result is a micro burst of many machines simultaneously sending TCP data streams to one machine (many to one).

TCP Incast

The Incast traffic can be very short lived flows (micro burts), depending on the application. For example, you could have a Parent server request 80KB of data across 40 Nodes. Each Node simultaneously responds with 2KB of data. That’s just two packets from each Node. In a real world scenario, the Parent could be a database server requesting a 80KB photo of the newest friend added to your social network.

“TCP Incast” may also be used in a context referring to the detrimental effect on TCP throughput caused by network congestion the Incast communication patterns create.

This simultaneous many-to-one burst can cause egress congestion at the network port attached to the Parent server, overwhelming the port egress buffer. The resulting packet loss requires Nodes to detect the loss (missing ACKs), re-send data (after RTO), and slowly ramp up throughput per standard TCP behavior. The application issuing the original request might wait until all data has been received before the job can complete, or just decide to return partial results after a certain delay threshold is exceeded. Either way, the speed, quality, and consistency of performance suffers.

The congestion caused by Incast has the effect of increasing the latency observed by the application and its users. There’s a give and take with application performance as Node cluster sizes grow. Larger clusters provide more distributed processing capacity allowing more jobs to complete faster. However larger clusters also mean a wider fan-in source for Incast traffic, increasing network congestion and lowering TCP throughput per Node.

The “give” of a larger cluster is a lot more obvious than the “take”, as cloud data centers continue to build larger cluster sizes to increase application performance. The question is: How much of the “take” (Incast congestion) is affecting the true performance potential of the application? And what can be done to measure and reduce the impact of Incast congestion?

Some have proposed a new variation of TCP specifically for the data center computers called Data Center TCP (DCTCP). This variation optimizes the way congestion is detected and can quickly minimize the amount of in flight data at risk during periods of congestion, minimizing packet loss and time outs. DCTCP optimizations have the effect of reducing buffer utilization on switches, which in theory eases the buffering requirement on switches. However, based on what I’ve read, DCTCP needs to have longer lived flows before it can accurately detect and react to congestion.

While DCTCP appears to have promise for overall performance improvements within the cloud data center, it’s not the complete end all answer, IMHO. Take for example our 80KB micro burst Incast to retrieve a friends photo. This micro burst of 2 packets per Node is not long lived enough for DCTCP optimizations to help. (DCTCP evangelists: Feel free to correct me here if I am wrong about this. I’ve been wrong before).

Another consideration is taking a closer look at the switch architecture. For example, a switch with sufficient egress buffer resources will be able to better absorb the Incast micro burts, preventing the performance detrimental packet loss and timeouts. Are those buffers dedicated per port? Are they shared? How well does the fabric chip(s) inside the switch detect and manage micro burst congestion, etc.?

With the right switch architecture in place to handle the micro bursts, in combination with optimizations like DCTCP to provide better performance for longer lived flows, the “take” can be minimized and the “give” accentuated. The result is a more responsive application.

Some will argue about the higher capital costs of better engineered switches. Understood. I get that. To that end, how do you put a price tag on application performance? How do we measure and quantify the performance differences a better engineered switch provides? And how do you place a value on those deltas?

And finally: How do we engineer a series of industry accepted benchmark tests that provides better visibility into how a switch (or network of switches) will perform under the real world traffic patters of a cloud data center (such as Incast micro bursts)? Rather than tests that simply validate a data sheet, wouldn’t it be more valuable to validate and test real world application performance? I think so. What do you think?

Cheers, Brad