Nexus 1000V’s 17 load balancing algorithms, dedicated to Nicholas Weaver at EMC

I wanted to take a few minutes to point out the 17 different load balancing algorithms available when distributing traffic from a VMware vSphere ESX host to a network of clustered upstream switches. Normally I don’t write blogs on such short topics but this one has a little story behind it.

This week I am at VMworld 2010 attending a lot of great sessions, meeting new people, and reconnecting with some really awesome people I know in the industry.

I decided to drop into a session on Virtual Networking by Nicholas Weaver who is a vSpecialist at EMC, whom I know from conversations with him on twitter (@lynxbat) and his blog. One thing I learned today is that if you enter a session taught by Nicholas you had better be prepared to be called out to answer a question or provide commentary.

During the session Nick called me out several times in front of his packed audience to promote my blog. Thank you for that Nick, I really appreciate it, you didn’t need to do that. You did a great job with the session and deserve all the attention for it.

At one point Nick was presenting a slide on the Nexus 1000V and its capabilities to provide very granular load balancing. Nick stated to the audience that he believed the Nexus 1000V had 17 different possible load balancing algorithms, but had called me out again for verification. Caught a little off guard and unprepared, for some reason the number 15 came to my mind, so I responded: “Yeah, pretty close”. Not that a difference of 2 really matters, but nonetheless in such a technical session of paying participants you want to get every detail correct.

After Nick moved on to other slides I pulled out my laptop to double check, and sure enough, Nick was 100% correct: there are 17 different algorithms the Nexus 1000V can use to load balance traffic (when using a port channel uplink).

So, this post is dedicated to Nicholas Weaver and his packed VMworld session. Great job with the session, Nick. I think it's great to see non-Cisco presenters touting the virtues of Nexus 1000V and advocating its deployment.

Below are the *17* different load balancing algorithms you can choose from when using a port channel uplink from Nexus 1000V, preceded by a simple use case diagram. Each algorithm tells the Nexus 1000V which fields to inspect to determine what constitutes a flow; a hash of those fields then decides which physical port channel member link will carry that flow.

•dest-ip-port—Loads distribution on the destination IP address and L4 port.

•dest-ip-port-vlan—Loads distribution on the destination IP address, L4 port, and VLAN.

•destination-ip-vlan—Loads distribution on the destination IP address and VLAN.

•destination-mac—Loads distribution on the destination MAC address.

•destination-port—Loads distribution on the destination L4 port.

•source-dest-ip-port—Loads distribution on the source and destination IP address and L4 port.

•source-dest-ip-port-vlan—Loads distribution on the source and destination IP address, L4 port, and VLAN.

•source-dest-ip-vlan—Loads distribution on the source and destination IP address and VLAN.

•source-dest-mac—Loads distribution on the source and destination MAC address.

•source-dest-port—Loads distribution on the source and destination L4 port.

•source-ip-port—Loads distribution on the source IP address and L4 port.

•source-ip-port-vlan—Loads distribution on the source IP address, L4 port, and VLAN.

•source-ip-vlan—Loads distribution on the source IP address and VLAN.

•source-mac—Loads distribution on the source MAC address.

•source-port—Loads distribution on the source L4 port.

•source-virtual-port-id—Loads distribution on the source virtual port ID.

•vlan-only—Loads distribution on the VLAN only.

The algorithm that I recommend most is source-dest-ip-port, where the Nexus 1000V will look at the source and destination IP addresses and TCP/UDP port numbers to constitute a flow and make a hashing decision. Because it inspects fields up to Layer 4, this generally provides the most granular flow definitions and therefore comes closer to 50/50 "Even Steven" load balancing than the other methods. As shown in the diagram above, both VM1 and VM2 might have multiple flows each with different destination IP addresses or TCP/UDP port numbers, and therefore each flow could be distributed on a separate physical link.

The default algorithm is source-mac. This setting places all flows from a VM (assuming a single source MAC address) on a single physical link within the port channel.
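To make the hashing concept concrete, here is a minimal Python sketch. It is not the Nexus 1000V's actual hash function (Cisco does not document the hash at this level); the CRC32 hash, the two-link port channel, and the field names are all illustrative assumptions. It simply contrasts a source-dest-ip-port style hash with the default source-mac style hash.

Conceptual Example (Python):

# Conceptual sketch only, NOT the Nexus 1000V's actual hash function.
# It shows how the choice of hashed fields changes which port channel
# member link a flow is pinned to.
import zlib

MEMBER_LINKS = 2  # assume two physical uplinks in the port channel

def pick_link(*fields):
    """Hash the selected fields and map the result to a member link."""
    key = "|".join(str(f) for f in fields).encode()
    return zlib.crc32(key) % MEMBER_LINKS

# Two flows from the same VM (same source MAC and IP), different destinations.
flow1 = dict(src_mac="00:50:56:aa:bb:01", src_ip="10.1.1.10",
             dst_ip="10.2.2.20", src_port=33001, dst_port=80)
flow2 = dict(src_mac="00:50:56:aa:bb:01", src_ip="10.1.1.10",
             dst_ip="10.3.3.30", src_port=33002, dst_port=443)

for name, flow in [("flow1", flow1), ("flow2", flow2)]:
    # source-dest-ip-port style: each flow hashes independently
    granular = pick_link(flow["src_ip"], flow["dst_ip"],
                         flow["src_port"], flow["dst_port"])
    # source-mac style (the default): every flow from this VM hashes the same
    coarse = pick_link(flow["src_mac"])
    print(f"{name}: source-dest-ip-port -> link {granular}, "
          f"source-mac -> link {coarse}")

Because both flows share the same source MAC, the source-mac hash necessarily maps them to the same member link, while hashing the IP addresses and L4 ports gives each flow an independent chance of landing on a different link. That is why the more granular algorithms tend to balance closer to 50/50.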

Configuration Example:

Nexus-1000V(config)# port-channel load-balance ethernet source-dest-ip-port
Link to Configuration Documentation

Cisco UCS intelligent QoS vs. HP Virtual Connect rate limiting

This article is a simple examination of the fundamental differences in how server bandwidth is handled between the Cisco UCS approach of QoS (quality of service) and the HP Virtual Connect Flex-10 / FlexFabric approach of Rate Limiting. I created two simple Flash animations, shown below, to make the comparison.

[Two Flash animations appeared here: one showing the Cisco UCS QoS approach, the other showing the HP Virtual Connect rate limiting approach.]

The animations above each show (4) virtual adapters sharing a single 10GE physical link to the upstream network switch. In the case of Cisco UCS the virtual adapters are called vNICs, which can be provisioned on the Cisco UCS virtual interface card (aka "Palo"). For HP Virtual Connect the virtual adapters are called FlexNICs. In either case, the virtual adapters are each provisioned for a certain type of traffic on a VMware host and share a single 10GE physical link to the upstream network. This is a very common design element for 10GE implementations with VMware and blade servers.

When you have multiple virtual adapters sharing a single physical link, the immediate challenge lies in how you guarantee each virtual adapter will have access to physical link bandwidth.  The virtual adapters themselves are unaware of the other virtual adapters, and as a result they don’t know how to share available bandwidth resources without help from a higher level system function, a referee of sorts, that does know about all the virtual adapters and the physical resources they share.  The system referee can define and enforce the rules of the road, making sure each virtual adapter gets a guaranteed slice of the physical link at all times.

There are two approaches to this challenge: Quality of Service (as implemented by Cisco UCS), and Rate Limiting (as implemented by HP Virtual Connect Flex-10 or FlexFabric).

The Cisco UCS QoS approach is based on the concept of minimum guarantees with no maximums, where each virtual adapter has an insurance policy that says it will always get a certain minimum percentage of bandwidth under the worst case scenario (heavy congestion). Under normal conditions, the virtual adapter is free to use as much bandwidth as it possibly can, all 10GE if it's available, for example when the other virtual adapters are not using the link or are using very little. However, if two or more virtual adapters try to use more than 10GE of bandwidth at any time, the minimum guarantee will be enforced and each virtual adapter will get its minimum guaranteed bandwidth, plus any additional bandwidth that may be available.

Cisco UCS provides a 10GE highway where each traffic class is given road signs that designate which lanes are guaranteed to be available for that class of traffic.  Between each lane is a spray painted dotted line that allows traffic to merge into other lanes if those lanes are free and have room for driving.  There is one simple rule of the road on the Cisco UCS highway: If you are driving in a lane not marked for you, and that lane becomes congested, you must go to another available lane or go back to your designated lane.
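This congestion behavior can be modeled in a few lines of code. The Python sketch below is a conceptual illustration only, not Cisco's actual scheduler (which is implemented in hardware, per traffic class, with weighted shares): it assumes four adapters, each with a 2.5 Gbps minimum guarantee on a 10GE link, gives every adapter up to its guarantee first, and then redistributes whatever the others leave unused using a simple equal split.

Conceptual Example (Python):

# Simplified model of minimum-guarantee sharing on a 10GE link.
# Conceptual illustration only, not the Cisco UCS scheduler itself.

LINK_GBPS = 10.0

def ucs_style_allocate(demands, guarantees):
    """demands / guarantees: dicts of adapter name -> Gbps.
    Each adapter first gets min(demand, guarantee); leftover link
    capacity is then shared among adapters that still want more."""
    alloc = {a: min(demands[a], guarantees[a]) for a in demands}
    leftover = LINK_GBPS - sum(alloc.values())
    while leftover > 1e-9:
        hungry = [a for a in demands if demands[a] - alloc[a] > 1e-9]
        if not hungry:
            break
        share = leftover / len(hungry)  # simple equal split of the leftover
        for a in hungry:
            extra = min(share, demands[a] - alloc[a])
            alloc[a] += extra
            leftover -= extra
    return alloc

guarantees = {"vmotion": 2.5, "vm_data": 2.5, "ip_storage": 2.5, "mgmt": 2.5}

# Uncongested: VMotion alone can burst well past its 2.5 Gbps guarantee.
print(ucs_style_allocate(
    {"vmotion": 8.0, "vm_data": 1.0, "ip_storage": 0.5, "mgmt": 0.2},
    guarantees))

# Congested: everyone wants more than the link can carry, so each adapter
# falls back to (at least) its guaranteed minimum.
print(ucs_style_allocate(
    {"vmotion": 8.0, "vm_data": 6.0, "ip_storage": 5.0, "mgmt": 3.0},
    guarantees))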

The HP Virtual Connect approach of Rate Limiting does somewhat the opposite. With HP, the system referee gives each virtual adapter a maximum possible bandwidth that cannot be exceeded, and then ensures that the sum of the maximums does not exceed the physical link speed. For example, (4) FlexNICs could each be given a maximum bandwidth of 2.5 Gbps. If FlexNIC #1 needed to use the link it would only be able to use 2.5 Gbps, even if the other 7.5 Gbps of the physical link is unused.

HP Virtual Connect provides a 10GE highway where lanes are designated for each virtual adapter, and each lane is divided from the other lanes by cement barriers.  There could be massive congestion in Lane #1, and as the driver stuck in that congestion you might be able to look over the cement barrier and see that Lane #2 is wide open, but you would not be able to do anything about it.  How frustrating would that be?
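The same toy model with hard caps instead of minimum guarantees (again a conceptual sketch, not Virtual Connect's actual implementation, reusing the assumed adapter names and numbers from the previous example) shows the stranded bandwidth: each FlexNIC gets the lesser of its demand and its cap, and headroom under an idle FlexNIC's cap simply goes unused.

Conceptual Example (Python):

# Simplified model of per-adapter rate limiting on the same 10GE link.
# Conceptual illustration only, not HP Virtual Connect's actual implementation.

def rate_limit_allocate(demands, caps):
    """Each adapter is held to its configured maximum, regardless of how
    idle the rest of the physical link is."""
    return {a: min(demands[a], caps[a]) for a in demands}

caps = {"vmotion": 2.5, "vm_data": 2.5, "ip_storage": 2.5, "mgmt": 2.5}

# Even with the other three FlexNICs nearly idle, VMotion is capped at
# 2.5 Gbps and roughly 7 Gbps of the physical link sits unused.
print(rate_limit_allocate(
    {"vmotion": 8.0, "vm_data": 0.3, "ip_storage": 0.2, "mgmt": 0.1},
    caps))

Comparing this output with the uncongested case in the previous sketch shows the difference in a nutshell: the cap strands headroom that the minimum-guarantee model would have handed to the busy adapter.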

The HP rate limiting approach does the basic job of providing each virtual adapter guaranteed access to link bandwidth, but does so in a way that results in massively inefficient use of the available network I/O bandwidth. Not all bandwidth is available to each virtual adapter from the start, even under normal non-congested conditions. As the administrator of HP Virtual Connect, you need to define the maximum bandwidth for traffic such as VMotion, VM data, IP storage, etc. (something less than 10GE), and from the very start that traffic will not be able to transmit any faster; there is an immediate consequence.

The Cisco UCS approach allows efficient use of all available bandwidth with intelligent QoS: all bandwidth is available to all virtual adapters from the start, while each virtual adapter is still given a minimum bandwidth guarantee. As the Cisco UCS administrator, you define the minimum guarantees for each virtual adapter through a QoS Policy. Traffic such as VMotion, VM data, IP storage, etc. will have immediate access to all 10GE of bandwidth; there is an immediate benefit of maximum bandwidth. Only under periods of congestion will the QoS policy be enforced.

UPDATE: Follow-up post: VMware 10GE QoS Design Deep Dive with Cisco UCS, Nexus