Cisco UCS intelligent QoS vs. HP Virtual Connect rate limiting

This article is a simple examination of the fundamental differences in how server bandwidth is handled by the Cisco UCS approach of QoS (quality of service) and the HP Virtual Connect Flex-10 / FlexFabric approach of Rate Limiting. I created the two simple Flash animations shown below to make the comparison.

[SWF] http://bradhedlund.s3.amazonaws.com/2010/qos-vs-rate-limit/ucs-qos-10.swf, 640, 480 [/SWF]

[SWF] http://bradhedlund.s3.amazonaws.com/2010/qos-vs-rate-limit/rate-limit-10.swf, 640, 480 [/SWF]

The animations above each show four virtual adapters sharing a single 10GE physical link to the upstream network switch. In the case of Cisco UCS, the virtual adapters are called vNICs and can be provisioned on the Cisco UCS virtual interface card (aka "Palo"). For HP Virtual Connect, the virtual adapters are called FlexNICs. In either case, the virtual adapters are each provisioned for a certain type of traffic on a VMware host and share a single 10GE physical link to the upstream network. This is a very common design element for 10GE implementations with VMware and blade servers.

When you have multiple virtual adapters sharing a single physical link, the immediate challenge lies in how you guarantee each virtual adapter will have access to physical link bandwidth. The virtual adapters themselves are unaware of the other virtual adapters, and as a result they don't know how to share available bandwidth resources without help from a higher level system function, a referee of sorts, that does know about all the virtual adapters and the physical resources they share. The system referee can define and enforce the rules of the road, making sure each virtual adapter gets a guaranteed slice of the physical link at all times.

There are two approaches to this challenge: Quality of Service (as implemented by Cisco UCS) and Rate Limiting (as implemented by HP Virtual Connect Flex-10 or FlexFabric).

The Cisco UCS QoS approach is based on the concept of minimum guarantees with no maximums, where each virtual adapter has an insurance policy that says it will always get a certain minimum percentage of bandwidth under the worst case scenario (heavy congestion). Under normal conditions, the virtual adapter is free to use as much bandwidth as it possibly can, all 10GE if it's available, for example when the other virtual adapters are idle or using very little of the link. However, if two or more virtual adapters together try to use more than 10GE of bandwidth at any time, the minimum guarantees are enforced and each virtual adapter gets its minimum guaranteed bandwidth, plus any additional bandwidth that may be available.
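To make the "minimum guarantee, no maximum" idea concrete, here is a rough Python sketch of a work-conserving allocator. It is only an illustration, not how Cisco UCS actually implements QoS: the function name, the adapter names, and the rule that spare bandwidth is shared in proportion to each guarantee are all my own assumptions.

```python
def allocate_with_min_guarantees(demands, guarantees, link_gbps=10.0):
    """Share one link among adapters that have minimum guarantees and no caps.

    demands:    adapter name -> offered load in Gbps
    guarantees: adapter name -> guaranteed fraction of the link
                (assumed positive, summing to at most 1.0)
    Assumption: bandwidth left unused by one adapter is handed to the
    others in proportion to their guarantees.
    """
    alloc = {name: 0.0 for name in demands}
    active = {name for name, load in demands.items() if load > 0}
    remaining = link_gbps

    while active and remaining > 1e-9:
        total_weight = sum(guarantees[name] for name in active)
        # Offer each adapter that still wants bandwidth a weighted slice.
        offers = {name: remaining * guarantees[name] / total_weight
                  for name in active}
        satisfied = {name for name in active
                     if demands[name] - alloc[name] <= offers[name] + 1e-9}
        if not satisfied:
            # Congestion: everyone wants more than their slice,
            # so the weighted slices are enforced as-is.
            for name in active:
                alloc[name] += offers[name]
            break
        # Adapters that need less than their slice take only what they
        # need; the leftover goes back into the pool for the next round.
        for name in satisfied:
            remaining -= demands[name] - alloc[name]
            alloc[name] = demands[name]
        active -= satisfied

    return alloc


if __name__ == "__main__":
    guarantees = {"vmotion": 0.25, "vm_data": 0.25,
                  "ip_storage": 0.25, "mgmt": 0.25}
    # Light load: VMotion is free to use far more than its 25% guarantee.
    print(allocate_with_min_guarantees(
        {"vmotion": 8.0, "vm_data": 1.0, "ip_storage": 0.5, "mgmt": 0.1},
        guarantees))
    # Heavy congestion: every adapter falls back to its guaranteed slice.
    print(allocate_with_min_guarantees(
        {name: 10.0 for name in guarantees}, guarantees))
```

In the light-load case VMotion ends up with its full 8 Gbps because nobody else needs it. Only when every adapter asks for more than the link can carry does each one fall back to its guaranteed 2.5 Gbps.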

Cisco UCS provides a 10GE highway where each traffic class is given road signs that designate which lanes are guaranteed to be available for that class of traffic. Between each lane is a spray painted dotted line that allows traffic to merge into other lanes if those lanes are free and have room for driving. There is one simple rule of the road on the Cisco UCS highway: If you are driving in a lane not marked for you, and that lane becomes congested, you must go to another available lane or go back to your designated lane.

The HP Virtual Connect approach of Rate Limiting does roughly the opposite. With HP, the system referee gives each virtual adapter a maximum possible bandwidth that cannot be exceeded, and then ensures that the sum of the maximums does not exceed the physical link speed. For example, four FlexNICs could each be given a maximum bandwidth of 2.5 Gbps. If FlexNIC #1 needed to use the link, it would only be able to use 2.5 Gbps, even if the other 7.5 Gbps of the physical link were unused.
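The same arithmetic can be sketched in a few lines of Python. This is a toy model of the behavior described above, not HP's implementation or configuration interface; the function name and the FlexNIC labels are hypothetical.

```python
def allocate_with_rate_limits(demands, caps_gbps, link_gbps=10.0):
    """Share one link among adapters that each have a fixed maximum rate.

    demands:   adapter name -> offered load in Gbps
    caps_gbps: adapter name -> maximum rate in Gbps
    The caps together must not exceed the physical link speed.
    """
    if sum(caps_gbps.values()) > link_gbps + 1e-9:
        raise ValueError("sum of maximums exceeds the physical link speed")
    # Each adapter gets what it asks for, but never more than its cap,
    # no matter how idle the other adapters are.
    return {name: min(demands[name], caps_gbps[name]) for name in demands}


if __name__ == "__main__":
    caps = {"flexnic1": 2.5, "flexnic2": 2.5, "flexnic3": 2.5, "flexnic4": 2.5}
    demands = {"flexnic1": 8.0, "flexnic2": 0.0,
               "flexnic3": 0.0, "flexnic4": 0.0}
    alloc = allocate_with_rate_limits(demands, caps)
    print(alloc)
    print("unused link bandwidth (Gbps):", 10.0 - sum(alloc.values()))
```

Here FlexNIC #1 wants 8 Gbps but is held to 2.5 Gbps, leaving 7.5 Gbps of the link idle.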

HP Virtual Connect provides a 10GE highway where lanes are designated for each virtual adapter, and each lane is divided from the other lanes by cement barriers. There could be massive congestion in Lane #1, and as the driver stuck in that congestion you might be able to look over the cement barrier and see that Lane #2 is wide open, but you would not be able to do anything about it. How frustrating would that be?

The HP rate limiting approach does the basic job of providing each virtual adapter guaranteed access to link bandwidth, but does so in a way that results in massively inefficient use of the available network I/O bandwidth. Not all bandwidth is available to each virtual adapter from the start, even under normal non-congested conditions. As the administrator of HP Virtual Connect, you need to define the maximum bandwidth for traffic such as VMotion, VM data, IP storage, etc. (something less than 10GE), and from the very start that traffic will not be able to transmit any faster. The consequence is immediate.

The Cisco UCS approach allows efficient use of all available bandwidth with intelligent QoS: all bandwidth is available to all virtual adapters from the start, while each virtual adapter still has a minimum bandwidth guarantee. As the Cisco UCS administrator, you define the minimum guarantees for each virtual adapter through a QoS Policy. Traffic such as VMotion, VM data, IP storage, etc. will have immediate access to all 10GE of bandwidth, so the benefit of maximum bandwidth is immediate. The QoS policy is enforced only during periods of congestion.

UPDATE: Follow-up post: VMware 10GE QoS Design Deep Dive with Cisco UCS, Nexus