Aug 31 2010

Nexus 1000V’s 17 load balancing algorithms, dedicated to Nicholas Weaver at EMC

Published by Brad Hedlund under Data Center, Nexus, vmware

I wanted to take a few minutes to point out the 17 different load balancing algorithms available when distributing traffic from a VMware vSphere ESX host to a network of clustered upstream switches. Normally I don’t write blogs on such short topics but this one has a little story behind it.

This week I am at VMworld 2010 attending a lot of great sessions, meeting new people, and reconnecting with some really awesome people I know in the industry.

I decided to drop into a session on Virtual Networking by Nicholas Weaver who is a vSpecialist at EMC, whom I know from conversations with him on twitter (@lynxbat) and his blog. One thing I learned today is that if you enter a session taught by Nicholas you had better be prepared to be called out to answer a question or provide commentary.

During the session Nick called me out several times in front of his packed audience to promote my blog. Thank you for that Nick, I really appreciate it, you didn’t need to do that. You did a great job with the session and deserve all the attention for it.

At one point Nick was presenting a slide on the Nexus 1000V and its capabilities to provide very granular load balancing. Nick stated to the audience that he believed the Nexus 1000V had 17 different possible load balancing algorithms, and called me out again for verification. I was caught a little off guard and unprepared, and for some reason the number 15 came to my mind, so I responded: “Yeah, pretty close”. Not that a difference of 2 really matters, but nonetheless in such a technical session of paying participants you want to get every detail correct.

After Nick moved on to other slides I pulled out my laptop to double check, and sure enough, Nick was 100% correct that it is 17 different algorithms the Nexus 1000V can use to load balance traffic (when using a port channel uplink).

So, this post is dedicated to Nicholas Weaver and his packed VMworld session. Great job with the session Nick. I think it’s great to see non-Cisco presenters touting the virtues of Nexus 1000V and advocating its deployment.

Below are the *17* different load balancing algorithms you can choose from when using a port channel uplink from Nexus 1000V, preceded by a simple use case diagram.  Each algorithm is telling the Nexus 1000V which fields to look at to determine what constitutes a flow and calculate a hash that determines which physical port channel member link will carry that flow.

• dest-ip-port: Distributes load based on the destination IP address and L4 port.

• dest-ip-port-vlan: Distributes load based on the destination IP address, L4 port, and VLAN.

• destination-ip-vlan: Distributes load based on the destination IP address and VLAN.

• destination-mac: Distributes load based on the destination MAC address.

• destination-port: Distributes load based on the destination L4 port.

• source-dest-ip-port: Distributes load based on the source and destination IP address and L4 port.

• source-dest-ip-port-vlan: Distributes load based on the source and destination IP address, L4 port, and VLAN.

• source-dest-ip-vlan: Distributes load based on the source and destination IP address and VLAN.

• source-dest-mac: Distributes load based on the source and destination MAC address.

• source-dest-port: Distributes load based on the source and destination L4 port.

• source-ip-port: Distributes load based on the source IP address and L4 port.

• source-ip-port-vlan: Distributes load based on the source IP address, L4 port, and VLAN.

• source-ip-vlan: Distributes load based on the source IP address and VLAN.

• source-mac: Distributes load based on the source MAC address.

• source-port: Distributes load based on the source L4 port.

• source-virtual-port-id: Distributes load based on the source virtual port ID.

• vlan-only: Distributes load based on the VLAN only.

The algorithm that I recommend most is source-dest-ip-port, where the Nexus 1000V will look at the source and destination IP address and TCP/UDP port numbers to constitute a flow and make a hashing decision.  Given the inspection up to Layer 4, this generally provides the most granular flow definitions and therefore closer to 50/50 Even Steven load balancing than the other methods.  As shown in the diagram above, both VM1 and VM2 might have multiple flows each with different destination IP addresses or TCP/UDP port numbers, and therefore each flow could be distributed on separate physical links.

The default algorithm is source-mac.  This setting will take all flows from a VM (assuming a single source MAC address) for placement on a single physical link within a port channel.
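To make the idea of "fields in, member link out" concrete, here is a minimal Python sketch of how a flow could be hashed onto a port channel member link. It is only an illustration: the field names, the `pick_link` helper, and the use of CRC32 are my assumptions, and the hash the Nexus 1000V really uses is not described in this post. Only three of the 17 algorithms from the list above are modeled.

```python
import zlib

# Fields each algorithm feeds into the hash (three of the 17 listed above).
# The field names are hypothetical labels chosen for this sketch.
ALGORITHMS = {
    "source-mac": ("src_mac",),
    "source-dest-ip-port": ("src_ip", "dst_ip", "src_port", "dst_port"),
    "vlan-only": ("vlan",),
}


def pick_link(flow, algorithm, num_links):
    """Return the index of the member link chosen for a flow."""
    key = "|".join(str(flow[field]) for field in ALGORITHMS[algorithm])
    # CRC32 is only a stand-in for whatever hash the switch really uses.
    return zlib.crc32(key.encode()) % num_links


# Eight flows from one VM (one source MAC), differing only in source port.
flows = [
    {
        "src_mac": "00:50:56:00:00:01",
        "src_ip": "10.0.0.1",
        "dst_ip": "10.0.1.5",
        "src_port": 49152 + n,
        "dst_port": 80,
        "vlan": 10,
    }
    for n in range(8)
]

for name in ("source-mac", "source-dest-ip-port"):
    links = sorted({pick_link(flow, name, 2) for flow in flows})
    print(name, "uses member links", links)
```

With source-mac every flow from the VM produces the same key, so all eight flows land on one member link. With source-dest-ip-port each flow produces a different key, so the flows have a chance to spread across both links.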

Configuration Example:

Nexus-1000V(config)# port-channel load-balance ethernet source-dest-ip-port
Link to Configuration Documentation


Aug 16 2010

Cisco UCS intelligent QoS vs. HP Virtual Connect rate limiting

This article is a simple examination of the fundamental differences in how server bandwidth is handled between the Cisco UCS approach of QoS (quality of service), and the HP Virtual Connect Flex-10 / FlexFabric approach of Rate Limiting.  I created two simple flash animations shown below to make the comparison.



The animations above each show (4) virtual adapters sharing a single 10GE physical link to the upstream network switch. In the case of Cisco UCS the virtual adapters are called vNICs, which can be provisioned on the Cisco UCS virtual interface card (aka “Palo”). For HP Virtual Connect the virtual adapters are called FlexNICs. In either case, the virtual adapters are each provisioned for a certain type of traffic on a VMware host and share a single 10GE physical link to the upstream network. This is a very common design element for 10GE implementations with VMware and blade servers.

When you have multiple virtual adapters sharing a single physical link, the immediate challenge lies in how you guarantee each virtual adapter will have access to physical link bandwidth.  The virtual adapters themselves are unaware of the other virtual adapters, and as a result they don’t know how to share available bandwidth resources without help from a higher level system function, a referee of sorts, that does know about all the virtual adapters and the physical resources they share.  The system referee can define and enforce the rules of the road, making sure each virtual adapter gets a guaranteed slice of the physical link at all times.

There are two approaches to this challenge: Quality of Service (as implemented by Cisco UCS); and Rate Limiting (as implemented by HP Virtual Connect Flex-10 or FlexFabric).

The Cisco UCS QoS approach is based on the concept of minimum guarantees with no maximums, where each virtual adapter has an insurance policy that says it will always get a certain minimum percentage of bandwidth under the worst case scenario (heavy congestion). Under normal conditions, the virtual adapter is free to use as much bandwidth as it possibly can, all 10GE if it’s available, for example if the other virtual adapters are not using the link or using very little. However if two or more virtual adapters together try to use more than 10GE of bandwidth at any time, the minimum guarantee will be enforced and each virtual adapter will get its minimum guaranteed bandwidth, plus any additional bandwidth that may be available.

Cisco UCS provides a 10GE highway where each traffic class is given road signs that designate which lanes are guaranteed to be available for that class of traffic.  Between each lane is a spray painted dotted line that allows traffic to merge into other lanes if those lanes are free and have room for driving.  There is one simple rule of the road on the Cisco UCS highway: If you are driving in a lane not marked for you, and that lane becomes congested, you must go to another available lane or go back to your designated lane.

The HP Virtual Connect approach of Rate Limiting does somewhat the opposite. With HP, the system referee gives each virtual adapter a maximum possible bandwidth that cannot be exceeded, and then ensures that the sum of maximums does not exceed the physical link speed. For example, (4) FlexNICs could each be given a maximum bandwidth of 2.5 Gbps. If FlexNIC #1 needed to use the link it would only be able to use 2.5 Gbps even if the other 7.5 Gbps of the physical link is unused.

HP Virtual Connect provides a 10GE highway where lanes are designated for each virtual adapter, and each lane is divided from the other lanes by cement barriers.  There could be massive congestion in Lane #1, and as the driver stuck in that congestion you might be able to look over the cement barrier and see that Lane #2 is wide open, but you would not be able to do anything about it.  How frustrating would that be?

The HP rate limiting approach does the basic job of providing each virtual adapter guaranteed access to link bandwidth, but does so in a way that results in massively inefficient use of all available network I/O bandwidth. Not all bandwidth is available to each virtual adapter from the start, even under normal non-congested conditions. As the administrator of HP Virtual Connect, you need to define the maximum bandwidth for traffic such as VMotion, VM data, IP storage, etc. (something less than 10GE), and from the very start that traffic will not be able to transmit any faster. There is an immediate consequence.

The Cisco UCS approach allows efficient use of all available bandwidth with intelligent QoS: all bandwidth is available to all virtual adapters from the start, while each virtual adapter still has a minimum bandwidth guarantee. As the Cisco UCS administrator you define the minimum guarantees for each virtual adapter through a QoS Policy. Traffic such as VMotion, VM data, IP Storage, etc. will have immediate access to all 10GE of bandwidth, so there is an immediate benefit of maximum bandwidth. Only under periods of congestion will the QoS policy be enforced.
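The difference between the two approaches is easy to see with a little arithmetic. Below is a rough Python sketch that allocates a 10 Gbps link both ways. The function names and the rule for handing out spare bandwidth (in proportion to each adapter's guarantee) are my own assumptions for illustration. This is not how either product is implemented internally.

```python
def rate_limited(demands, caps):
    """Hard caps: an adapter never gets more than its configured maximum."""
    return [min(demand, cap) for demand, cap in zip(demands, caps)]


def share_with_minimums(demands, guarantees, link=10.0):
    """Minimum guarantees with no maximums (work-conserving sharing).

    Assumes every guarantee is positive and the guarantees sum to no more
    than the link speed. Spare bandwidth is handed to adapters that still
    want more, in proportion to their guarantees (an assumed policy).
    """
    alloc = [min(demand, g) for demand, g in zip(demands, guarantees)]
    spare = link - sum(alloc)
    hungry = [i for i, demand in enumerate(demands) if demand > alloc[i]]
    while spare > 1e-9 and hungry:
        weight = sum(guarantees[i] for i in hungry)
        granted = 0.0
        for i in hungry:
            extra = min(spare * guarantees[i] / weight, demands[i] - alloc[i])
            alloc[i] += extra
            granted += extra
        spare -= granted
        hungry = [i for i in hungry if demands[i] - alloc[i] > 1e-9]
    return alloc


# Only the first adapter is busy; it wants the whole link.
demands = [10, 0, 0, 0]
print(rate_limited(demands, [2.5] * 4))         # first adapter stuck at 2.5
print(share_with_minimums(demands, [2.5] * 4))  # first adapter gets 10.0

# Heavy congestion: every adapter wants 10 Gbps, guarantees are enforced.
print(share_with_minimums([10, 10, 10, 10], [4, 3, 2, 1]))
```

In the first case the rate limited adapter is held to 2.5 Gbps while 7.5 Gbps sits idle, and the adapter with a minimum guarantee uses the whole link. In the congested case each adapter falls back to exactly its guaranteed share.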


Jun 22 2010

Cisco UCS Networking Best Practices (in HD)

This is a presentation I developed covering networking best practices for Cisco UCS, and now have recorded in High Definition for your viewing pleasure! Sweet! :-)

This presentation assumes familiarity with basic networking and server VNIC concepts in UCS, and familiarity with virtual port channels.

This version of the presentation (v2.5) focuses primarily on the Ethernet uplinks. SAN uplinks and VMware networking scenarios are briefly discussed but not covered extensively. Those topics and others such as QoS, the Cisco VIC, and vNIC fabric failover may be included in future versions of this presentation.

Stay tuned for updates! RSS feed: http://bradhedlund.com/feed/


Part 1 – Cisco UCS Networking Overview

In Part 1 we start with a baseline overview of Cisco UCS Networking. At the heart of the system is the Fabric Interconnect (6100) “the Brains of UCS” which provides 10GE & FC networking for all the compute nodes in its domain as well as being the central configuration, management, and policy engine for all automated server and network provisioning.


Part 2 – Switch Mode vs. End Host Mode

Part 2 is an examination of the two different switching modes supported by the Fabric Interconnect, “Switch Mode” and “End Host Mode”. With “Switch Mode”, the Fabric Interconnect behaves like a normal Layer 2 switch on all server ports and uplinks, and therefore attaches to the upstream data center network as a spanning tree enabled “Switch”.

“End Host Mode”, on the other hand, while still providing local Layer 2 switching on the server ports, does not behave like a normal Layer 2 switch on its uplinks.  Instead, server NICs are “pinned” to a specific uplink, and no local switching happens from uplink to uplink.  This allows “End Host Mode” to attach to the network like a “Host” without spanning tree, and all uplinks forwarding on all VLANs.

End Host Mode is the preferred mode, and it’s enabled by default.


Part 3 – End Host Mode – Individual Uplinks

In Part 3 we take a look at how the individual uplinks behave in End Host Mode, and how the system reacts to uplink failures. When an uplink fails, the Fabric Interconnect will move the server NICs to a new uplink in under a second without causing any disruption to the server NIC. This uplink failover process is called dynamic re-pinning.

After the dynamic re-pinning process, the Fabric Interconnect will send Gratuitous ARP messages for all of the MAC addresses that were previously using the failed uplink. This GARP process aids the upstream network in quickly learning the new location of the affected MAC addresses now using the new uplink.
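Here is a small Python sketch of pinning and dynamic re-pinning as described above. It is an illustration under assumptions: the round-robin placement, the function names, and the `mac-a` style labels are all hypothetical, and the Fabric Interconnect's actual pinning logic is not covered here. The point is simply that only the NICs on the failed uplink move, and those are the MAC addresses that need a Gratuitous ARP.

```python
from itertools import cycle


def pin(nics, uplinks):
    """Pin each server NIC to one uplink (round-robin is an assumption)."""
    ring = cycle(uplinks)
    return {nic: next(ring) for nic in nics}


def repin_after_failure(pinning, failed, uplinks):
    """Move only the NICs that were on the failed uplink.

    Returns the new pinning and the list of MACs that need a GARP.
    """
    survivors = [uplink for uplink in uplinks if uplink != failed]
    if not survivors:
        raise RuntimeError("no uplinks left to re-pin to")
    ring = cycle(survivors)
    moved = [nic for nic, uplink in pinning.items() if uplink == failed]
    new_pinning = dict(pinning)
    for nic in moved:
        new_pinning[nic] = next(ring)
    return new_pinning, moved


uplinks = ["uplink-1", "uplink-2", "uplink-3"]
pinning = pin(["mac-a", "mac-b", "mac-c", "mac-d"], uplinks)
pinning, garp_macs = repin_after_failure(pinning, "uplink-1", uplinks)
print(garp_macs)  # ['mac-a', 'mac-d']
print(pinning)
```

NICs pinned to the surviving uplinks are left alone, so the upstream network only has to relearn the location of the MAC addresses that moved.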


Part 4 – Port Channel Uplinks

Here we take a look at the benefits of using Port Channel uplinks with Cisco UCS. The key advantages of port channel uplinks are the minimal impact of a physical link failure and the potential for better overall uplink load balancing. During individual physical link failures there are fewer moving parts required to provide a fast recovery. For example, Gratuitous ARP messages and dynamic re-pinning are not required when an individual physical member link fails in a port channel uplink. Port Channel uplinks are definitely recommended whenever possible.


Part 5 – Virtual Port Channel Uplinks (vPC)

Part 5 covers the advantages of using virtual port channel (vPC) uplinks with Cisco UCS. With vPC uplinks, there is minimal impact of both physical link failures and upstream switch failures. With more physical member links in one larger logical uplink, there is the potential for even better overall uplink load balancing and better high availability than with a standard Port Channel uplink discussed in Part 4. Using a virtual port channel uplink is highly recommended if you have vPC capabilities present in your upstream network switches.


Part 6 – Connecting Cisco UCS to separate networks

In Part 6 we discuss the scenario of connecting a single Cisco UCS system in End Host Mode to separate Layer 2 networks. When the system is in End Host Mode, it expects and assumes that all uplinks are connected to the same common Layer 2 domain. If some uplinks are connected to physically separate networks you will have connectivity problems. The Fabric Interconnect will randomly pick one of its uplinks to process broadcast messages for all VLANs. As a result, only servers associated with the chosen network will be able to see and process broadcast messages on their network. The solution is to create a common Layer 2 network for the Fabric Interconnect in End Host Mode and each of the separate networks to attach to, or to use Switch Mode. If creating a common Layer 2 network or using Switch Mode is not an option, you can always deploy a unique Cisco UCS system per separate network to preserve the existing “silos”.


Part 7 – Inter Fabric Traffic Examples

This is a brief look at some of the common types of traffic that may flow between Fabric-A and Fabric-B within a single Cisco UCS system. With this understanding, the subsequent material will make more sense.


Part 8 – Don’t: Connect Cisco UCS to vPC domains without vPC uplinks

This is a fairly extensive look at the scenario of attaching UCS to upstream switches configured for vPC, without using vPC uplinks. Here we will show that this scenario doesn’t make much sense and in fact can cause some unwanted traffic black holes under some failure scenarios. This is a prelude to Part 9 where we illustrate that if your upstream network is configured for virtual port channel capability (vPC), you should always attach UCS with vPC uplinks.


Part 9 – Do: Connect Cisco UCS to vPC domains with vPC uplinks

This section shows that if you have virtual port channel capabilities in your upstream switches, you have everything to gain and nothing to lose by connecting Cisco UCS with vPC uplinks. You will gain the benefit of the upstream switch locally switching all Fabric-A to Fabric-B traffic, and achieve more bandwidth scalability for inter-fabric traffic because all inter-fabric traffic will travel on the vPC uplinks, rather than on less abundant inter-switch links. Additionally, you will avoid the potential black hole failure scenarios discussed in Part 8, if vPC is already present in the upstream network switches.


Part 10 – Connecting Cisco UCS without vPC

While there are certainly advantages to uplinking Cisco UCS with virtual port channels, vPC is certainly not required. Cisco UCS easily and efficiently connects to any data center network environment with or without vPC. This section discusses best practices for connecting UCS to networks without vPC. The key best practice here is to always dual attach each Fabric Interconnect to two upstream network switches, whether it’s with vPC uplinks or multiple individual uplinks. Another suggested practice is to avoid attaching Cisco UCS to a second tier Layer 2 switch with spanning tree blocking links. A better approach is to either have vPC capabilities at the second tier Layer 2 switch, or connect Cisco UCS directly to the tier 1 switch, avoiding the traffic bottlenecks induced by spanning tree.


Jun 03 2010

Data Center Networking Q&A #1 – starring HP, Nexus 1000V, QoS

I thought it would be fun to pilot a series of posts where I pick out interesting search engine queries that were used to find my blog. Oftentimes these are good questions that deserve a good answer, or other interesting topics that can start a good discussion or fun debate.

This particular series will focus on Data Center Networking.  Another data center computing focused series may be started such as “Cisco UCS Q&A”. Stay tuned.

Each question or search query will be headlined in bold and the text beneath will be my answer, commentary, or general response.

Furthermore, if you have a question you think I might be able to answer please submit your question in the comments section to be considered for immediate answer or highlighted in a future Q&A post.

So here we go, this is the first Data Center Networking Q&A :) I hope you enjoy it!


hp qos marking configuration

I seem to be getting quite a number of these queries finding my site lately.  This person might be trying to find out how to configure their HP blade switch to classify and mark traffic for a QoS policy in their data center.  Well, there are two blade switches made by HP worth discussing: HP Virtual Connect Flex-10, and the HP Procurve 6120XG.

Let’s start with HP Virtual Connect Flex-10. I’ve got some real bad news for you on this one. Flex-10 has no QoS capabilities whatsoever. If you follow my blog and tweets you have heard me point this out several times and I’ll do it here again. Flex-10 is not capable of classifying traffic, not capable of marking traffic, and not capable of giving any special treatment or guaranteed bandwidth to important traffic. If you search for “QoS” in the latest Virtual Connect User Guide it produces zero hits. But don’t take my word for it, here is what HP says about Virtual Connect Flex-10 QoS:

VC does not currently support any user configurable Quality of Service features … these features are on the roadmap for future implementation.

Moving on to HP Procurve 6120XG; this blade switch made by HP actually has what I will describe as “so-so” QoS capabilities, much better than Flex-10 anyway. The Procurve 6120XG can give special treatment to traffic via (4) traffic queues, each with a differing degree of priority: Low, Normal, Medium, and High. The High priority queue will always be serviced before the Medium queue, and so on. This is a simple implementation of a QoS technique called Priority Queueing. The downside to Priority Queueing is that the high priority queue (if busy) can starve all other queues of bandwidth, with no assurance that all traffic will get some portion of the bandwidth. The 6120XG QoS implementation does not provide guaranteed bandwidth; it simply allows some packets to be transmitted before others, based on the queue they are serviced from.
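The starvation problem with strict Priority Queueing can be shown in a few lines of Python. This is a generic sketch of the technique, not the 6120XG's actual scheduler. The queue names follow the four priorities above, and the packet labels and `strict_priority` helper are made up for the example.

```python
from collections import deque

# Highest priority first, matching the four queues described above.
PRIORITIES = ["high", "medium", "normal", "low"]


def strict_priority(queues, slots):
    """Transmit up to `slots` packets, always serving the highest
    non-empty queue first."""
    sent = []
    for _ in range(slots):
        for level in PRIORITIES:
            if queues[level]:
                sent.append(queues[level].popleft())
                break
    return sent


queues = {
    "high": deque(f"H{n}" for n in range(6)),
    "medium": deque(["M0", "M1"]),
    "normal": deque(["N0"]),
    "low": deque(["L0"]),
}

sent = strict_priority(queues, slots=6)
print(sent)  # ['H0', 'H1', 'H2', 'H3', 'H4', 'H5']
print({level: len(q) for level, q in queues.items()})
```

All six transmit opportunities go to the High queue, and the Medium, Normal, and Low packets are still waiting. As long as High stays busy, nothing else is ever sent.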

The Procurve 6120XG can assign traffic to each of the (4) queues based on the incoming 802.1P COS value in the Ethernet header, or the IP DSCP or TOS value in the IP header. One important thing to note here is that the HP Procurve 6120XG expects the packet to already be marked when entering the switch. The Procurve 6120XG is not capable of marking traffic based on MAC, IP, or TCP information. If a packet enters the 6120XG with no marking, the packet cannot be classified and is not given any special treatment: no QoS. The 6120XG can mark traffic based on incoming port, meaning all traffic from a certain port can be given a defined marking. However, this rudimentary port-based classification has little use in a 10G server environment where different types of traffic will be converged on a single 10G server interface.

Given that the HP Procurve 6120XG’s QoS capabilities are only useful when the traffic has already been marked before entering the switch, it’s important to understand if your blade servers are capable of traffic classification and packet marking. This becomes especially relevant with server virtualization hosts, such as with VMware. The VMware vNetwork Standard Switch (VSS) and vNetwork Distributed Switch (VDS) do not have traffic classification or marking capabilities; however, the optional Cisco Nexus 1000V has a comprehensive set of QoS classification and marking capabilities. Hence, the Cisco Nexus 1000V can classify and mark important traffic leaving the VMware host (such as vMotion or management traffic) before it enters the Procurve 6120XG, where it can then be placed into one of the (4) QoS queues based on the marking it received.

The downside to QoS on the HP Procurve 6120XG is that its basic Priority Queueing implementation does not provide any bandwidth guarantees to all traffic types, and the lack of traffic classification capabilities requires that your servers do the classification and marking, hence the need for Cisco Nexus 1000V on the vSphere Host.  This is why I give it a “so-so” rating.

You can find the complete QoS configuration details here in the HP Procurve 6120XG Traffic Management Guide (PDF)

For HP blade servers, my recommendation is to not use the HP Procurve 6120XG, and instead use the HP 10G Passthrough module. This allows you to connect your HP blade servers directly to a Cisco Nexus switching environment, bypassing all of the other mediocre 10G blade switch options from HP. The Cisco Nexus series has a rich set of QoS capabilities that do not just provide basic Priority Queueing, but rather offer an advanced Class Based Weighted Fair Queueing mechanism that can assign minimum guaranteed bandwidth to all traffic types. This behaves in a manner very similar to how VMware provides reservations for CPU and memory to a virtual machine, but without limitations. If more CPU and memory are available, the virtual machine can use them, but there is always a minimum guarantee provided by the reservation. This is similar to how Cisco Nexus switches allocate bandwidth: minimum guarantees without limitations.

The Cisco Nexus switches are also capable of classifying and marking traffic based on MAC, IP, or TCP information.  So if a packet arrives unmarked you can mark it at the switch and provide granular QoS, with or without the Nexus 1000V.  You can later add the Nexus 1000V for all of the security, management, and network visibility reasons, but you can at least get started with rich QoS capabilities with or without the Nexus 1000V.


hp blades cisco nexus 1000v

Great news! You can absolutely run the Cisco Nexus 1000V on an HP blade server.  Furthermore, this can be done with any blade switch.  You can use Virtual Connect Flex-10, the Procurve 6120XG, the 10G Passthrough module, or any other standard Ethernet switch.

There is one particular note about using HP Virtual Connect Flex-10 with Nexus 1000V that I want to discuss. The Flex-10 module can be set up in two different switching modes: mapped mode or tunnel mode. The most common mode is mapped mode because that is the default mode. In mapped mode the Flex-10 administrator defines vNets, which are basically VLANs inside the Virtual Connect domain. The Virtual Connect administrator can decide the VLAN ID used for these vNets independently of the network administrator who might be configuring the physical network switches and the Nexus 1000V. The Flex-10 uplinks to the network with VLANs defined by the network administrator; however, the VLAN IDs on the network uplink are mapped to a VLAN ID of a vNet, which might be a different VLAN ID, or the same.

If the Virtual Connect administrator is using mapped mode and mapping the VLAN ID on the network to a different VLAN ID inside the Virtual Connect domain, this can cause a problem with the Nexus 1000V. If the Nexus 1000V VSM was configured with the VLAN IDs known on the physical network, the Nexus 1000V VEM running on the server is expecting to see the same VLAN IDs that were defined on the VSM. However if the Flex-10 mapped mode configuration is changing (mapping) the VLAN IDs to something different, you will have broken the linkage between Nexus 1000V VSM and VEM.

Tunnel mode, on the other hand, simply takes all VLAN IDs defined on the network uplinks and sends them down to the server ports, without changing or mapping the VLAN IDs to anything different.

The moral of the story here is that if you are going to use Nexus 1000V with Virtual Connect Flex-10, you can; just make sure you are not changing VLAN ID numbers. If using mapped mode, do not map the network VLAN IDs to a different VLAN ID at the server; keep them the same. Or, you can also use tunnel mode, which does not provide any option of changing VLAN ID numbers.
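If you want to sanity check a mapped mode design on paper, the rule boils down to "every VLAN the VSM knows about must arrive at the server with the same ID". Here is a tiny Python sketch of that check. The dictionary representation of the Virtual Connect mapping and the `broken_vlans` helper are hypothetical, made up for this illustration. Nothing here talks to Virtual Connect or the Nexus 1000V.

```python
def broken_vlans(vsm_vlans, vc_mapping):
    """Return the VSM VLAN IDs that are remapped or missing at the server.

    vc_mapping: network VLAN ID -> VLAN ID presented to the server port
    (a hypothetical way to write down a mapped mode configuration).
    """
    return sorted(v for v in vsm_vlans if vc_mapping.get(v) != v)


vsm_vlans = {10, 20, 30}

# Mapped mode where VLAN 20 is changed to 120 inside Virtual Connect.
print(broken_vlans(vsm_vlans, {10: 10, 20: 120, 30: 30}))  # [20]

# Mapped mode with IDs kept the same (tunnel mode behaves like this too).
print(broken_vlans(vsm_vlans, {10: 10, 20: 20, 30: 30}))   # []
```

An empty list means the VEM will see the same VLAN IDs that were defined on the VSM.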

Once you have ensured VLAN ID consistency inside of Virtual Connect you are all set to have a very successful implementation of Nexus 1000V on HP blades, even with Virtual Connect Flex-10.

That’s all for now.

Hope you enjoyed the pilot episode of Data Center Networking Q&A #1

Please remember to submit questions or comments in the comment section below.  Some questions may be answered here or featured in a future installment.


May 07 2010

Setting the stage for TRILL, rethinking data center switching

As data centers become increasingly dynamic and dense with virtualization, how the classic Ethernet switching design adapts to these new models and scales becomes an important and challenging question. Virtualization and cloud based services say that any workload can exist anywhere, at any time, on demand, and move to any location without disruption. This is a major paradigm shift from the old days where a “Server” and the application it supported had a very static location in the network. When the application has a static location you can build walls around it in a very structured manner with minimal trade-offs. In the old “static” Data Center, you could for example provide Layer 3 routing boundaries at the server edge for the very good reasons of robust scalability, minimal or no Spanning Tree, and active/active router-like link load balancing and fast convergence. In today’s dynamic Data Center, the imposition of Layer 3 boundaries no longer works.

The next generation dynamic Data Center requires a pervasive Layer 2 deployment enabling the aforementioned fluid mobility of application workloads. Any VLAN, on any switch, on any port, at any time. As a result, switch makers (in order to remain viable in the data center) must be geared towards enabling pervasive Layer 2 data center fabrics in a manner that is highly scalable (agile), robust, maximizes bandwidth (resources), with plug & play simplicity.

One major step forward in designing next generation data centers is the promising technology which is currently defined in RFC 5556 named TRILL (Transparent Interconnection of Lots of Links). Some switch vendors (such as Cisco) may initially offer the capabilities found in TRILL with additional enhancements as a proprietary system. Therefore, for the time being I am going to use the word TRILL in a generic sense. And where a capability is discussed that is a unique enhancement offered by Cisco (or any other vendor) I will simply cite that with an *, such as TRILL*.

Before we discuss TRILL in great detail I think it’s important first to take a step back and “Set the stage” a little by revisiting the classic Ethernet design principles currently in use today, understanding both the strengths and challenges. Then we’ll look at some alternative approaches that attempt to address these challenges, and where they fall short. As we go through the various areas I will point out where TRILL can make design improvements. Once we have this basis of understanding we will be ready to understand the value of TRILL with more detail in subsequent discussions. Sound cool? Great!

Revisiting Classic Ethernet. What works? What needs improvement?

A fundamental underpinning of Ethernet is the “Plug & Play” simplicity that in no small measure has contributed to the overall tremendous success of Ethernet. When you connect Ethernet switches together they can auto discover the topology and automatically learn about each host’s location on the network, with little to no configuration. Any future evolution of Ethernet must retain this fundamental “Plug & Play” characteristic to be successful. The key enabler of this Plug & Play capability is Flooding Behavior.

Figure 1 - Ethernet flooding

Figure 1 above shows two simple examples of Ethernet flooding behavior. On the left, if an Ethernet switch receives a Unicast frame with a destination that it doesn’t know about, it simply floods that frame out on all ports. This behavior is called Unicast Flooding and it ensures that the destination host receives the frame so long as it is connected to the network, without any special configuration (Plug & Play).

The other flooding behavior shown on the right (Figure 1 above) is a Broadcast message that is intended for all hosts on the network. When the Ethernet switch receives a broadcast frame, it will simply do as told and send a copy of that frame to all active ports. Broadcast messages are tremendously useful for hosts seeking to dynamically discover other hosts connected to the network without any special configuration (Plug & Play).

This default flooding behavior of Ethernet is fundamental to its greatest virtue, Plug & Play simplicity. However, this same flooding behavior also creates design challenges that we will discuss shortly.

The flooding of unknown unicast and broadcast frames also allows for Plug & Play learning of all the hosts and their location in the network, without any special configuration. Once the location of a host is known, all subsequent traffic to that host will be sent only to the ports leading to the host. I will refer to this type of traffic as Known Unicast traffic.

The process of automatically discovering a host’s location on the network is called MAC Learning:

Figure 2 - Classic Ethernet MAC Learning

Figure 2 above shows a simple example of the automatic MAC Learning process. Every time an Ethernet switch receives a frame on a port it looks at the source MAC address of the received frame and records that source MAC, along with the port on which the frame was received, in its forwarding table, aka MAC Table. That’s it! It’s that simple. Any future frames received that are destined to the learned MAC address will be directed only to the port on which it was learned. This process is more specifically described as Source MAC Learning, because only the Source MAC address is examined upon receiving a frame.

Because of the flooding behavior discussed earlier, the Ethernet switch can quickly learn the location of all hosts on the network. Anytime a host sends a broadcast message it will be received by all Ethernet switches where the source MAC address of the sending station will be recorded and learned, as shown in Figure 2.
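Source MAC Learning and flooding fit in a few lines of code. Here is a minimal Python sketch of a single learning switch. The `Switch` class, the port numbers, and the single-letter host names standing in for MAC addresses are all made up for the illustration, and details such as VLANs and MAC aging are left out.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class Switch:
    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # learned source MAC -> port

    def receive(self, src_mac, dst_mac, in_port):
        """Learn the source MAC, then return the ports to send the frame out."""
        self.mac_table[src_mac] = in_port  # Source MAC Learning
        if dst_mac != BROADCAST and dst_mac in self.mac_table:
            out_port = self.mac_table[dst_mac]  # Known Unicast
            # A frame whose destination is on the ingress port is not sent back.
            return [] if out_port == in_port else [out_port]
        # Broadcast or Unknown Unicast: flood on every port except the ingress.
        return [port for port in self.ports if port != in_port]


sw = Switch(ports=[1, 2, 3, 4])
print(sw.receive("A", "B", in_port=1))        # [2, 3, 4]  unknown unicast, flooded
print(sw.receive("B", "A", in_port=2))        # [1]        A was learned on port 1
print(sw.receive("A", "B", in_port=1))        # [2]        B is now known as well
print(sw.receive("C", BROADCAST, in_port=3))  # [1, 2, 4]  broadcast, flooded
print(sw.mac_table)                           # {'A': 1, 'B': 2, 'C': 3}
```

Notice that the switch learned Host C purely from its broadcast, whether or not any conversation with Host C ever passes through it.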

There is a peculiar side effect to Source MAC Learning: All Ethernet switches will inevitably learn about all hosts, needed or not. For example, in Figure 2 above, Host C and Host D are communicating on Switch 4. The Source MAC learning process was useful in establishing a Known Unicast conversation for these two hosts using Switch 4. However, despite the fact that Host A and Host B are not using Switch 4 for any conversations, Switch 4 has still populated its MAC Table with entries for Host A and Host B.

“What’s wrong with that?” you ask? Well, in the old “static” Data Center with small Layer 2 domains this was never a concern. Now imagine this inefficient behavior on a much larger scale in the dynamic Data Center with thousands of virtual hosts in a pervasive Layer 2 domain. The unfortunate side effect is that you will have many unnecessary entries in every Ethernet switch. And each one of these unnecessary entries consumes valuable space in the MAC Table where there is a limited number of entries available. A typical data center class Ethernet switch might support 16,000 MAC entries. Again, not a problem in the “static” Data Center. However, this poses a scalability challenge in the virtualization dense dynamic Data Center. Is this something that can be improved while maintaining the Plug & Play auto learning behavior? The answer is, Yes, this is an area enhanced by TRILL* :-)

Now let’s move on to the design challenge with flooding behavior I mentioned earlier. Remember, the flooding behavior of Ethernet is fundamental to achieving Plug & Play capabilities, so we can’t get rid of it, we need it. The challenge with flooding is there is no mechanism to know when a flooded frame (such as a Broadcast) has already made its way through the network. Every time a Broadcast or Unknown Unicast frame is received it is immediately flooded out on all ports, no questions asked. Even if this is the same frame returning to the switch from a previous flood, there is no way to know. This can become a real problem when you have multiple paths from one switch to another.

Figure 3 - Ethernet flooding loop

In Figure 3 above, Host A sends a Broadcast or Unknown Unicast frame into Switch 3 which is then flooded on the links connecting to Switch 1 and Switch 2. Once received, Switch 1 & 2 will also flood the frame on all of their ports, and so on. Switch 3 ultimately receives the original frame again and the same process repeats. Unlike an IP packet, which carries a TTL (time to live) field that is decremented at every hop, there is no such TTL field or other mechanism in an Ethernet frame that provides information about the frame’s age or history on the network. As a result, the flooding loop repeats infinitely with every new broadcast. It doesn’t take long for the loop to have catastrophic effects on the network (within seconds). Can Ethernet be enhanced with a TTL field just like IP to limit the scope of unwanted loops? The answer is, Yes, this is an area enhanced by TRILL. :-)
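
Here is a rough simulation of that loop in Python. I am assuming a simple triangle of three switches (S1, S2, S3 all linked to each other), which may not match Figure 3 exactly, and the optional hop_limit parameter is purely a hypothetical stand-in for a TTL style field, not a description of any real frame format.

```python
from typing import Dict, List, Optional, Tuple

# Assumed topology: three switches in a triangle (every switch links to the other two).
LINKS: Dict[str, List[str]] = {
    "S1": ["S2", "S3"],
    "S2": ["S1", "S3"],
    "S3": ["S1", "S2"],
}


def frames_in_flight(rounds: int, hop_limit: Optional[int] = None) -> int:
    """Count the copies of one flooded frame still circulating after `rounds` hops."""
    # Each copy is (switch holding it, neighbor it arrived from, hops taken so far).
    # Host A injects the frame at S3, so it did not arrive from any switch.
    in_flight: List[Tuple[str, Optional[str], int]] = [("S3", None, 0)]
    for _ in range(rounds):
        next_round: List[Tuple[str, Optional[str], int]] = []
        for switch, came_from, hops in in_flight:
            # Hypothetical TTL check: a copy that has used up its hops is dropped.
            if hop_limit is not None and hops >= hop_limit:
                continue
            # Classic flooding: send a copy on every link except the one it came in on.
            for neighbor in LINKS[switch]:
                if neighbor != came_from:
                    next_round.append((neighbor, switch, hops + 1))
        in_flight = next_round
    return len(in_flight)


print(frames_in_flight(rounds=1000))               # no hop limit
print(frames_in_flight(rounds=5, hop_limit=4))     # with a hop limit of 4
```

Without a hop limit the copies never go away, no matter how many rounds you run. With one, the loop dies out on its own shortly after the limit is reached.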

This looping challenge above led to the development of a Plug & Play mechanism in Ethernet to detect and prevent loops called Spanning Tree Protocol (STP).

Figure 4 - Classic Ethernet Loop Prevention

In Figure 4 above, the Ethernet switches have auto discovered a redundant path in the network using STP and placed certain interfaces in a “Blocking” state to prevent the disastrous infinite looping of flooded frames. The Spanning Tree protocol is Plug & Play, requiring no configuration work, and because it prevents the disastrous loops, thereby allowing flooding to work properly in a network with redundant paths, you could argue that STP (even with its infamous reputation) is THE reason why Ethernet is so successful today as a mission critical data center network technology. Now, truth be told, STP does require some configuration tuning if you want to have precise control over which links are placed into a “Blocking” state. For example, in Figure 4 above, by defining Switch 1 as the “Root” bridge we can influence which redundant links from Switch 3 & 4 are blocked, providing a balance of bandwidth available to hosts on either Switch 3 or Switch 4, with each switch having 50% of its bandwidth available for hosts.
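
The effect of choosing a Root bridge can be sketched in a few lines of Python. To be clear, this is NOT how STP actually works on the wire (real STP exchanges BPDUs and compares bridge IDs and path costs); it is just a breadth-first search standing in for the end result, a loop free tree rooted at the chosen switch. The link list is my assumption of what the Figure 4 topology looks like.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Set, Tuple

# Assumed Figure 4 topology: S1 & S2 at Tier 1 (linked together),
# S3 & S4 at Tier 2, each with one uplink to S1 and one to S2.
LINKS: List[Tuple[str, str]] = [
    ("S1", "S2"),
    ("S1", "S3"),
    ("S1", "S4"),
    ("S2", "S3"),
    ("S2", "S4"),
]


def blocked_links(links: List[Tuple[str, str]], root: str) -> List[Tuple[str, str]]:
    """Return the links left out of a shortest-path tree built from `root`."""
    neighbors: Dict[str, List[str]] = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited: Set[str] = {root}
    tree: Set[FrozenSet[str]] = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for neighbor in sorted(neighbors[switch]):
            if neighbor not in visited:
                visited.add(neighbor)
                tree.add(frozenset((switch, neighbor)))  # this link stays forwarding
                queue.append(neighbor)

    # Any link not needed to reach a switch from the root would form a loop.
    return [link for link in links if frozenset(link) not in tree]


blocked = blocked_links(LINKS, root="S1")
print(blocked)
```

With S1 as the root, the S2 uplink from each of S3 and S4 ends up blocked, which is the “50% of its bandwidth” outcome described above.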

There is an unfortunate side effect with STP. Remember, it is the Broadcast and Unknown Unicast frames flooding and looping the network that cause the catastrophic effects which we must correct with STP. The non-flooded Known Unicast traffic is not causing the problem. However, when STP blocks a path to close a loop, it is in fact punishing bandwidth availability for ALL traffic, including the Known Unicast traffic, the significant majority of all traffic on the network! That’s not fair! Can we enhance Ethernet to correct this unfair side effect? The answer is, Yes, this is an area enhanced by TRILL. :-)

Given that STP creates a single loop free forwarding topology for all traffic, flooded or non-flooded, it became increasingly important to build loop free topologies while maintaining multiple paths, maximizing bandwidth, without STP blocking any of those valuable paths, especially in 10GE data center networks. In order for STP to not block any of the paths we must first show STP a loop free topology from the start.

Building loop free topologies with multiple paths can be accomplished with the development of a capability generically referred to as Multi Chassis EtherChannel (MCEC), available in some switches today, most notably Cisco switches ;) but other switch vendors have started to implement MCEC as well. Some switch platforms such as the Cisco Nexus family refer to this capability as Virtual Port Channels (vPC).

Figure 5 - Multi Path with Classic Ethernet (MCEC)

As shown in Figure 5 above, Switch 1 and Switch 2 form a special peering relationship with each other that allows them to be viewed as a single switch in the topology, rather than two separate switches. This significant accomplishment allows Switch 3 and Switch 4 to form a single logical link with a single standard Etherchannel to both Switch 1 & 2. STP treats Switch 1 & 2 as a single node on the network, and as a result finds a loop free topology from the start, and no links need to be blocked, all links are active. Virtual Port Channels is a popular design choice today for maximizing bandwidth in new data center network deployments and redesigns.

Accomplishing MCEC or vPC capabilities is not a trivial task. A significant engineering effort is required. For MCEC implementations to behave properly you must engineer lock step synchronization of several different roles and states on each peer switch (Switch 1 & 2). You need to make sure MAC learning is synchronized, any MACs learned on Switch 1 must be made known to Switch 2. You need to make sure the interface states (up/down) are synced and the interface configurations are identical. You also need to determine which switch will process STP messages on behalf of the other. And to top it all off, most importantly, you need to have robust split brain failure detection and determine how each switch will react and assume or relinquish the aforementioned roles and state. All of these different synchronization elements and split brain detection can lead to a complex matrix of failure scenarios that the switch maker must test to ensure software stability.

The significant engineering effort of MCEC is for the simple purpose of providing STP a multi path loop free topology so that no links will be blocked. Will it be possible to build a multi path loop free topology without all of the system complexity of MCEC? The answer is, Yes, this is an enhancement in TRILL :-)

Scaling the next generation Data Center with Classic Ethernet

Now let’s switch gears to scaling a pervasive Layer 2 data center fabric. Let’s start by looking at the scaling options for Tier 1 (the Aggregation layer). First of all, why would you want to scale Tier 1 anyway? Well, the more capacity you have available at Tier 1, the more Tier 2 (Server Access Layer) switches can exist in the Layer 2 domain. Furthermore, the more ports you have at Tier 1, the more aggregate bandwidth you can deliver to a Tier 2 switch. Therefore, the ability to efficiently scale Tier 1 is critical to the overall scaling of size and bandwidth to the server environment.

Scaling out Tier 1

One interesting approach to scaling Tier 1 is to simply scale out by adding more switches horizontally across the Tier. This makes sense for a number of reasons. First of all, if you can connect the Tier 2 switch to an array of switches at Tier 1 you gain the advantage of spreading out risk, much like a RAID array of hard disk drives. For example, when a Tier 2 switch connects to (4) Tier 1 switches, a single uplink or Tier 1 switch failure would result in a 25% loss of available bandwidth, compared to a more significant 50% loss when there are just (2) Tier 1 switches. Second, if you can easily add more Tier 1 switches as you grow, the density of the Tier 1 switch becomes less of a factor in achieving the overall scale you need. For example, when you have the flexibility to eventually grow Tier 1 to (8) or even (16) switches, rather than only being limited to (2), you can achieve respectable scale with an array of smaller low cost Tier 1 switches, or mind boggling scale with a wide array of larger modular switches.
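
The 25% versus 50% figures are simple arithmetic, sketched here in Python. The function name is mine, and the sketch assumes each Tier 2 switch has exactly one uplink of equal speed to every Tier 1 switch, with all uplinks forwarding.

```python
def bandwidth_lost_on_single_failure(tier1_switches: int) -> float:
    """Percent of a Tier 2 switch's uplink bandwidth lost when one Tier 1 switch fails.

    Assumes one equal-speed, actively forwarding uplink per Tier 1 switch.
    """
    return 100.0 / tier1_switches


for count in (2, 4, 8, 16):
    loss = bandwidth_lost_on_single_failure(count)
    print(f"({count}) Tier 1 switches: {loss}% lost on a single failure")
```

The wider the array at Tier 1, the smaller the dent any one failure makes.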

Sounds great! Right? But before we start the high fives, how does scaling out Tier 1 work with the Classic Ethernet network relying on Spanning Tree Protocol for loop prevention? Well, it doesn’t… :-(

Figure 6 - Scaling out Tier 1 with Classic Ethernet

In Figure 6 above, I have attempted to scale out Tier 1 in a Classic Ethernet network. I have added Switches 5 & 6 to Tier 1 and linked my Tier 2 switches to the (4) switch array at Tier 1. Unfortunately though, the only thing I was able to accomplish was creating more loops that must be blocked by Spanning Tree Protocol. In order to maintain a loop free topology for flooded traffic (broadcasts & unknown unicasts), all of the extra links I added from Tier 2 to Tier 1 have been disabled by STP, which if you remember punishes all traffic including the Known Unicast and Multicast traffic. What was the point? This was a futile exercise.

It is for this very reason that having more than (2) Tier 1 switches has never made any sense with Classic Ethernet. This long standing rigid design constraint has led to the density of the Tier 1 switch being a very important criterion in achieving large scale and bandwidth. “How many ports can I shove in one box?” To achieve even respectable density in a modern data center requires a pair of large modular switches positioned in Tier 1, to which you can add modules as you grow. Once the module slots are filled you have hit your scalability wall, adding more Tier 1 switches is not a viable option.

Alright, so if loop prevention with Spanning Tree Protocol is the problem to achieving scale in Classic Ethernet, why not scale out Tier 1 with a design that does not create a looped topology to begin with? Such as with Multi Chassis EtherChannel (MCEC)? A great idea! Right? Well, maybe not…

Figure 7 - Scaling out Tier 1 with MCEC

In Figure 7 above, I have attempted to scale out Tier 1 with (4) switches all jointly participating in a Multi Chassis Etherchannel peering relationship. (First of all, this is a fictitious design, as no switch vendor has engineered this, not even Cisco. But let’s just imagine for a second…) The plan here is to allow each Tier 2 switch to connect to all (4) Tier 1 switches with a single logical Port Channel, thus creating a loop free topology at the onset so Spanning Tree will not block any links. If I can already have (2) switches configured for MCEC peering, why not (4)? Heck, why stop at (4), why not (16)? The problem here of course is extreme complexity. Remember that accomplishing MCEC between just (2) switches is a significant engineering accomplishment. There are many states, roles, and Layer 2 / Layer 3 interactions that must be synchronized and orchestrated for the system to behave properly. On top of that, you must be able to quickly detect and correctly react to split brain failure scenarios. Once the MCEC domain is increased from (2) switches to just (4), you have increased the engineering complexity by an order of magnitude. As a testament to the engineering complexity of MCEC, consider that Cisco is the only major switch vendor to successfully engineer and support MCEC with (2) fully featured Layer2/Layer3 switches. And NO switch vendor, not a single one, has successfully engineered, sold, and supported a (4) switch MCEC cluster. Some switch vendors are hinting at such capabilities as a possible future roadmap in their data sheets. All I have to say about that is … Good Luck!

Is it possible to scale out Tier 1 with (4), (8), or even (16) switches in a loop free design with a lot less engineering complexity? The answer is, Yes! This is an enhancement in TRILL. :-)

Another approach worth discussing is the complete removal of Layer 2 switching and replacing it with Layer 3 IP routing. By removing Layer 2 switching and replacing it with Layer 3 routing the switches behave more like routers with load balancing on multiple paths and no Spanning Tree blocking links. This also allows for scaling out each Tier without any of the Layer 2 challenges in Classic Ethernet as we have been discussing thus far. Sounds pretty good, right? Well, not so fast…

How do you provide pervasive Layer 2 services over a network of Layer 3 IP routing? The IP cloud formed by the Tier 1 & 2 switches would be used to create an MPLS cloud and deploy services such as VPLS (Virtual Private LAN Services) providing virtual Layer 2 circuits (pseudo wires) over the Layer 3 cloud. After a full mesh of VPLS pseudo wires has been configured between all Tier 2 switches you can begin to provide Layer 2 connectivity from any Tier 2 switch to another. Sound complicated? That’s because it is!

Figure 8 - Scaling Tier 1 with IP + MPLS + VPLS

In Figure 8 above, the data center network has been set up as a VPLS-over-MPLS-over-IP cloud. Once that foundation is in place, I need to configure a full mesh of Layer 2 VPLS pseudo wires between all Tier 2 switches. How many pseudo wires do you need to configure? You can use this formula where N equals the number of Tier 2 switches: N * (N-1) / 2. And, for each new Tier 2 switch you add you will need to go back to every other Tier 2 switch and configure a new set of pseudo wires to the newly added switch. Not exactly Plug & Play, is it?
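
To get a feel for how quickly that formula grows, here is a quick Python sketch. The function names and the switch counts are just examples I picked, not numbers from any particular design.

```python
def full_mesh_pseudo_wires(tier2_switches: int) -> int:
    """Pseudo wires needed for a full mesh: N * (N-1) / 2."""
    return tier2_switches * (tier2_switches - 1) // 2


def pseudo_wires_added_by_new_switch(existing_tier2_switches: int) -> int:
    """A newly added switch needs one pseudo wire to every existing switch."""
    return existing_tier2_switches


for n in (4, 10, 20, 50):
    print(f"{n} Tier 2 switches: {full_mesh_pseudo_wires(n)} pseudo wires, "
          f"plus {pseudo_wires_added_by_new_switch(n)} more for the next switch added")
```

The total grows with the square of the number of Tier 2 switches, and every addition means touching every switch that is already there.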

Rather than replacing Layer 2 with Layer 3, and then trying to overlay Layer 2 services over the Layer 3 … wouldn’t it be better to simply evolve Plug & Play Layer 2 switching with more Layer 3 like forwarding characteristics? This is exactly the idea behind TRILL. :-)

Now let’s finish up with a look at how Tier 2 scales under Classic Ethernet. Remember from earlier that having any more than (2) switches at Tier 1 makes no sense in Classic Ethernet, thanks to flooding loops and Spanning Tree. Because of this (2) switch design constraint the density potential of the Tier 1 switch you choose becomes a key factor in determining the scalability of the network.

Figure 9 - Scaling Tier 2 with Classic Ethernet

In Figure 9 above, because Tier 1 cannot have any more than (2) switches there will always be a clear trade off between scaling bandwidth or scaling size. If I choose to give more bandwidth to Tier 2 it means less available capacity for adding more Tier 2 switches. This is largely the result of not being able to scale out Tier 1 horizontally with Classic Ethernet. If the rigid (2) switch design constraint were removed from the equation you would suddenly have a lot more flexibility in how you can scale, and the trade off between bandwidth or size becomes less of a black and white matter. Gaining this valuable flexibility with the data center switching design is a key promise behind the evolution to TRILL.
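
The trade off can be put into rough numbers with a back-of-the-envelope Python sketch. The 128 port figure is a made up example, not a spec for any real switch, and the sketch ignores any ports you would need for links between the Tier 1 switches themselves.

```python
def max_tier2_switches(ports_per_tier1_switch: int, uplinks_per_tier1_switch: int) -> int:
    """How many Tier 2 switches fit when each one takes the same number of
    uplink ports on every Tier 1 switch (hypothetical sizing model)."""
    return ports_per_tier1_switch // uplinks_per_tier1_switch


PORTS = 128  # assumed port count of each Tier 1 switch, for illustration only
for uplinks in (1, 2, 4, 8):
    size = max_tier2_switches(PORTS, uplinks)
    print(f"{uplinks} uplink(s) to each Tier 1 switch -> room for {size} Tier 2 switches")
```

Every doubling of bandwidth to a Tier 2 switch halves the number of Tier 2 switches the pair at Tier 1 can support.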

The stage has been set for the next evolution of data center switching

The next generation dynamic data center needs to have tremendous design flexibility to build highly scalable, agile, and robust Layer 2 domains. To get there, classic Ethernet switching as we know it today needs to evolve in the data center. The most successful solutions will be those that address many of the challenges facing the data center today, not just bandwidth.

What are the challenges that should be addressed?

  • Plug & Play simplicity
  • MAC address scalability*
    • a more efficient method of MAC learning*
    • a hierarchical approach to Layer 2 forwarding*
  • Minimal configuration requirements
  • All links forwarding & load balancing – No Spanning Tree
  • More bandwidth for all traffic types, including Multicast, not just Unicast*
  • Fast convergence
  • Layer 3 virtues of scalability and robustness with the Plug & Play simplicity of Layer 2
  • Flexible and agile scaling out of either Tier 1, or Tier 2.
  • Configuration simplicity for automation with open APIs.

OK. Remember the goal here was to “set the stage” with a basic level understanding of why classic Ethernet switching needs to evolve for the next generation data centers. Please stay tuned for further detailed discussions on data center switching and TRILL.

RSS Feed: http://bradhedlund.com/feed/

Future topics may include:

  • TRILL technical deep dives
    • Conversation based MAC learning
    • Configuration examples
    • Design examples
  • How and where does FCoE and Unified Fabric fit into this picture?
  • Industry news & analysis
  • Suggestions?

Presentation Download

I’m not sure who’s crazier: You (for reading this entire post without falling asleep)? Or Me (for writing such a long post)? Anyway, I have a reward for your time and attention! You get to download the presentation I developed for this post. There are some extra slides that provide a sneak peek into my next posts, an Introduction to TRILL.

PDF: http://internetworkexpert.s3.amazonaws.com/2010/trill1/TRILL-intro-part1.pdf

Original Power Point with Animations: *Please ask your Cisco representative


Disclosure: The author (Brad Hedlund) is an employee of Cisco Systems, Inc. which plans to have TRILL based solutions embedded into the company’s data center switching product line.
Disclaimer: The views and opinions expressed are solely those of the author as a private individual and do not necessarily represent those of the author’s employer, Cisco Systems, Inc. The author is not an official media spokesperson for Cisco Systems, Inc.


Special Thanks to my colleagues at Cisco: Marty Ma and Francois Tallet, both of whom are deeply involved with Cisco’s implementation of TRILL* and took precious time to provide me with a 1:1 education about some of the topics covered here.

© Copyright 2010, Brad Hedlund, INTERNETWORK EXPERT .ORG
