Jun 22 2010

Cisco UCS Networking Best Practices (in HD)

This is a presentation I developed covering networking best practices for Cisco UCS, and now have recorded in High Definition for your viewing pleasure! Sweet! :-)

This presentation assumes familiarity with basic networking and server VNIC concepts in UCS, and familiarity with virtual port channels.

This version of the presentation (v2.5) focuses primarily on the Ethernet uplinks. SAN uplinks and VMware networking scenarios are briefly discussed but not covered extensively. Those topics and others such as QoS, the Cisco VIC, and vNIC fabric failover may be included in future versions of this presentation.

Stay tuned for updates! RSS feed: http://bradhedlund.com/feed/


Part 1 – Cisco UCS Networking Overview

In Part 1 we start with a baseline overview of Cisco UCS Networking. At the heart of the system is the Fabric Interconnect (6100), “the Brains of UCS”, which provides 10GE & FC networking for all the compute nodes in its domain and serves as the central configuration, management, and policy engine for all automated server and network provisioning.


Part 2 – Switch Mode vs. End Host Mode

Part 2 is an examination of the two different switching modes supported by the Fabric Interconnect, “Switch Mode” and “End Host Mode”. With “Switch Mode”, the Fabric Interconnect behaves like a normal Layer 2 switch on all server ports and uplinks, and therefore attaches to the upstream data center network as a spanning tree enabled “Switch”.

“End Host Mode”, on the other hand, while still providing local Layer 2 switching on the server ports, does not behave like a normal Layer 2 switch on its uplinks.  Instead, server NICs are “pinned” to a specific uplink, and no local switching happens from uplink to uplink.  This allows “End Host Mode” to attach to the network like a “Host” without spanning tree, and all uplinks forwarding on all VLANs.

End Host Mode is the preferred mode, and it’s enabled by default.


Part 3 – End Host Mode – Individual Uplinks

In Part 3 we take a look at how the individual uplinks behave in End Host Mode, and how the system reacts to uplink failures. When an uplink fails, the Fabric Interconnect will move the server NICs to a new uplink in under a second without causing any disruption to the server NIC.  This uplink failover process is called dynamic re-pinning.

After the dynamic re-pinning process, the Fabric Interconnect will send Gratuitous ARP messages for all of the MAC addresses that were previously using the failed uplink. This GARP process helps the upstream network quickly learn the new location of the affected MAC addresses now using the new uplink.
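
The pinning, re-pinning, and GARP sequence described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the class name, the "fewest pinned MACs" policy, and the uplink and MAC names are all hypothetical, and none of this reflects how the Fabric Interconnect is actually implemented.

```python
class EndHostModeSketch:
    """Toy model of End Host Mode pinning (hypothetical, not actual UCS code)."""

    def __init__(self, uplinks):
        self.uplinks = list(uplinks)   # uplinks that are currently up
        self.pinning = {}              # server MAC -> uplink it is pinned to

    def _pick_uplink(self):
        # Assumed policy: choose the uplink with the fewest pinned MACs.
        # Assumes at least one uplink is still up.
        return min(
            self.uplinks,
            key=lambda u: sum(1 for p in self.pinning.values() if p == u),
        )

    def pin(self, mac):
        self.pinning[mac] = self._pick_uplink()
        return self.pinning[mac]

    def fail_uplink(self, uplink):
        """Re-pin the affected MACs; return the GARPs to send as (mac, new_uplink)."""
        self.uplinks.remove(uplink)
        affected = [m for m, u in self.pinning.items() if u == uplink]
        for mac in affected:
            del self.pinning[mac]
        # One Gratuitous ARP per moved MAC, sent on its new uplink.
        return [(mac, self.pin(mac)) for mac in affected]


fi = EndHostModeSketch(["uplink1", "uplink2"])
for mac in ["mac-a", "mac-b", "mac-c", "mac-d"]:
    fi.pin(mac)
print(fi.fail_uplink("uplink1"))
# [('mac-a', 'uplink2'), ('mac-c', 'uplink2')]
```

Only the MACs that were pinned to the failed uplink move, and only those MACs generate GARPs; everything pinned to the surviving uplink is untouched.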


Part 4 – Port Channel Uplinks

Here we take a look at the benefits of using Port Channel uplinks with Cisco UCS. The key advantages of port channel uplinks are the minimal impact of a physical link failure and the potential for better overall uplink load balancing. During individual physical link failures, fewer moving parts are required to provide a fast recovery.  For example, Gratuitous ARP messages and dynamic re-pinning are not required when an individual physical member link fails in a port channel uplink.  Port Channel uplinks are definitely recommended whenever possible.
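
To see why a member link failure is a smaller event than an uplink failure, here is a minimal sketch of flow hashing across port channel members. The CRC32 hash over a source/destination MAC pair and the interface names are assumptions chosen for illustration; real hardware uses its own hash inputs and algorithm.

```python
import zlib


def pick_member(src_mac, dst_mac, members):
    """Hash a flow onto one active member link (assumed hash: CRC32 of the MAC pair)."""
    key = f"{src_mac}-{dst_mac}".encode()
    return members[zlib.crc32(key) % len(members)]


members = ["eth1/1", "eth1/2", "eth1/3", "eth1/4"]
before = pick_member("00:25:b5:00:00:0a", "00:00:0c:07:ac:01", members)

# A member link fails: the flow is simply re-hashed onto a surviving member.
survivors = [m for m in members if m != before]
after = pick_member("00:25:b5:00:00:0a", "00:00:0c:07:ac:01", survivors)
print(before, "->", after)
```

The server NIC stays pinned to the same logical uplink throughout, so in this model nothing needs to be re-pinned and no GARP is needed; only the choice of physical member changes.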


Part 5 – Virtual Port Channel Uplinks (vPC)

Part 5 covers the advantages of using virtual port channel (vPC) uplinks with Cisco UCS. With vPC uplinks, there is minimal impact of both physical link failures and upstream switch failures. With more physical member links in one larger logical uplink, there is the potential for even better overall uplink load balancing and better high availability than with a standard Port Channel uplink discussed in Part 4. Using a virtual port channel uplink is highly recommended if you have vPC capabilities present in your upstream network switches.


Part 6 – Connecting Cisco UCS to separate networks

In Part 6 we discuss the scenario of connecting a single Cisco UCS system in End Host Mode to separate Layer 2 networks. When the system is in End Host Mode, it expects and assumes that all uplinks are connected to the same common Layer 2 domain. If some uplinks are connected to physically separate networks you will have connectivity problems.  The Fabric Interconnect will randomly pick one of its uplinks to process broadcast messages for all VLANs.  As a result, only servers associated with the chosen network will be able to see and process broadcast messages on their network.  The solution is to create a common Layer 2 network for the Fabric Interconnect in End Host Mode and each of the separate networks to attach to, or to use Switch Mode.  If creating a common Layer 2 network or using Switch Mode is not an option, you can always deploy a unique Cisco UCS system per separate network to preserve the existing “silos”.
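
The broadcast problem with separate Layer 2 networks can be sketched in a few lines of Python. The function name, the network names, and the server names are hypothetical; the only behavior modeled is the one described above, where a single uplink is chosen to process broadcasts.

```python
def servers_receiving_broadcasts(uplink_network, server_network, listener_uplink):
    """Return the servers whose network matches the network behind the chosen uplink."""
    listener_net = uplink_network[listener_uplink]
    return sorted(s for s, net in server_network.items() if net == listener_net)


# Hypothetical setup: two uplinks attached to two physically separate networks.
uplinks = {"uplink1": "Production", "uplink2": "Backup"}
servers = {"srv1": "Production", "srv2": "Backup", "srv3": "Production"}

print(servers_receiving_broadcasts(uplinks, servers, "uplink1"))
# ['srv1', 'srv3']
print(servers_receiving_broadcasts(uplinks, servers, "uplink2"))
# ['srv2']
```

Whichever uplink is chosen, the servers on the other network miss their broadcasts, which is why a common Layer 2 domain (or Switch Mode) is needed.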


Part 7 – Inter Fabric Traffic Examples

This is a brief look at some of the common types of traffic that may flow between Fabric-A and Fabric-B within a single Cisco UCS system. With this understanding, the subsequent material will make more sense.


Part 8 – Don’t: Connect Cisco UCS to vPC domains without vPC uplinks

This is a fairly extensive look at the scenario of attaching UCS to upstream switches configured for vPC, without using vPC uplinks. Here we will show that this scenario doesn’t make much sense and in fact can cause some unwanted traffic black holes under some failure scenarios. This is a prelude to Part 9 where we illustrate that if your upstream network is configured for virtual port channel capability (vPC), you should always attach UCS with vPC uplinks.


Part 9 – Do: Connect Cisco UCS to vPC domains with vPC uplinks

This section shows that if you have virtual port channel capabilities in your upstream switches, you have everything to gain and nothing to lose by connecting Cisco UCS with vPC uplinks. You will gain the benefit of the upstream switch locally switching all Fabric-A to Fabric-B traffic, and achieve more bandwidth scalability for inter-fabric traffic because all inter-fabric traffic will travel on the vPC uplinks, rather than on less abundant inter-switch links. Additionally, you will avoid the potential black hole failure scenarios discussed in Part 8, if vPC is already present in the upstream network switches.


Part 10 – Connecting Cisco UCS without vPC

While there are certainly advantages to uplinking Cisco UCS with virtual port channels, vPC is not required. Cisco UCS easily and efficiently connects to any data center network environment with or without vPC. This section discusses best practices for connecting UCS to networks without vPC.  The key best practice here is to always dual attach each Fabric Interconnect to two upstream network switches, whether it’s with vPC uplinks or multiple individual uplinks.  Another suggested practice is to avoid attaching Cisco UCS to a second tier Layer 2 switch with spanning tree blocking links.  A better approach is to either have vPC capabilities at the second tier Layer 2 switch, or connect Cisco UCS directly to the tier 1 switch, avoiding traffic bottlenecks induced by spanning tree.


Jun 03 2010

Data Center Networking Q&A #1 – starring HP, Nexus 1000V, QoS

I thought it would be fun to pilot a series of posts where I pick out interesting search engine queries that were used to find my blog.  Often times these are good questions that deserve a good answer, or other interesting topics that can start a good discussion or fun debate.

This particular series will focus on Data Center Networking.  Another data center computing focused series may be started such as “Cisco UCS Q&A”. Stay tuned.

Each question or search query will be headlined in bold and the text beneath will be my answer, commentary, or general response.

Furthermore, if you have a question you think I might be able to answer please submit your question in the comments section to be considered for immediate answer or highlighted in a future Q&A post.

So here we go, this is the first Data Center Networking Q&A :) I hope you enjoy it!


hp qos marking configuration

I seem to be getting quite a number of these queries finding my site lately.  This person might be trying to find out how to configure their HP blade switch to classify and mark traffic for a QoS policy in their data center.  Well, there are two blade switches made by HP worth discussing: HP Virtual Connect Flex-10, and the HP Procurve 6120XG.

Let’s start with HP Virtual Connect Flex-10.  I’ve got some real bad news for you on this one.  Flex-10 has no QoS capabilities whatsoever.  If you follow my blog and tweets you have heard me point this out several times and I’ll do it here again.  Flex-10 is not capable of classifying traffic, not capable of marking traffic, and not capable of giving any special treatment or guaranteed bandwidth to important traffic.  If you search for “QoS” in the latest Virtual Connect User Guide it produces zero hits.  But don’t take my word for it, here is what HP says about Virtual Connect Flex-10 QoS:

VC does not currently support any user configurable Quality of Service features … these features are on the roadmap for future implementation.

Moving on to HP Procurve 6120XG; this blade switch made by HP actually has what I will describe as “so-so” QoS capabilities, much better than Flex-10 anyway.  The Procurve 6120XG can give special treatment to traffic via (4) traffic queues, each with a differing degree of priority: Low, Normal, Medium, and High.  The High priority queue will always be serviced before the Medium queue, and so on.  This is a simple implementation of a QoS technique called Priority Queueing.  The downside to Priority Queueing is that the high priority queue (if busy) can starve bandwidth from all other queues, with no assurance that all traffic will get some fair portion of the bandwidth.  The 6120XG QoS implementation does not provide guaranteed bandwidth, it simply allows some packets to be transmitted before others, based on the queue they are serviced from.

The Procurve 6120XG can assign traffic to each of the (4) queues based on the incoming 802.1P COS value in the Ethernet header, or the IP DSCP or TOS value in the IP header.  One important thing to note here is that the HP Procurve 6120XG expects the packet to already be marked when entering the switch.  The Procurve 6120XG is not capable of marking traffic based on MAC, IP, or TCP information.  If a packet enters the 6120XG with no marking, the packet cannot be classified and is not provided any special treatment, no QoS.  The 6120XG can mark traffic based on incoming port, meaning all traffic from a certain port can be given a defined marking.  However, this rudimentary port-based classification has little use in a 10G server environment where different types of traffic will be converged on a single 10G server interface.
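
The starvation effect of strict Priority Queueing is easy to see in a small simulation. This is a generic sketch of the technique, not the 6120XG's actual scheduler; the function name and packet labels are made up for illustration, and only the four queue names come from the description above.

```python
from collections import deque

# Highest priority first, using the four queue names described above.
PRIORITY_ORDER = ["High", "Medium", "Normal", "Low"]


def strict_priority_schedule(queues, slots):
    """Send up to `slots` packets, always serving the highest non-empty queue first."""
    sent = []
    for _ in range(slots):
        for name in PRIORITY_ORDER:
            if queues[name]:
                sent.append(queues[name].popleft())
                break
        else:
            break  # every queue is empty, nothing left to send
    return sent


queues = {
    "High": deque(["h1", "h2", "h3"]),
    "Medium": deque(["m1"]),
    "Normal": deque(),
    "Low": deque(["l1"]),
}
print(strict_priority_schedule(queues, slots=3))
# ['h1', 'h2', 'h3']
print(list(queues["Low"]))
# ['l1']
```

With only three transmit slots and a busy High queue, the Medium and Low packets never leave: there is ordering, but no bandwidth guarantee.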

Given that the HP Procurve 6120XG’s QoS capabilities are only useful when the traffic has already been marked before entering the switch, it’s important to understand if your blade servers are capable of traffic classification and packet marking.  This becomes especially relevant with server virtualization hosts, such as with VMware.  Neither the VMware vNetwork Standard Switch (VSS) nor the vNetwork Distributed Switch (VDS) has traffic classification or marking capabilities; however, the optional Cisco Nexus 1000V has a comprehensive set of QoS classification and marking capabilities.  Hence, the Cisco Nexus 1000V can classify and mark important traffic leaving the VMware host (such as vMotion or management traffic) before it enters the Procurve 6120XG, where it can then be placed into one of the (4) QoS queues based on the marking it received.

The downside to QoS on the HP Procurve 6120XG is that its basic Priority Queueing implementation does not provide any bandwidth guarantees to all traffic types, and the lack of traffic classification capabilities requires that your servers do the classification and marking, hence the need for Cisco Nexus 1000V on the vSphere Host.  This is why I give it a “so-so” rating.

You can find the complete QoS configuration details here in the HP Procurve 6120XG Traffic Management Guide (PDF).

For HP blade servers, my recommendation is to not use the HP Procurve 6120XG, and instead use the HP 10G Passthrough module.  This allows you to connect your HP blade servers directly to a Cisco Nexus switching environment, bypassing all of the other mediocre 10G blade switch options from HP.  The Cisco Nexus series has a rich set of QoS capabilities that do not just provide basic Priority Queuing, but rather offer an advanced Class Based Weighted Fair Queueing mechanism that can assign minimum guaranteed bandwidth to all traffic types.  This behaves in a manner very similar to how VMware provides reservations for CPU and memory to a virtual machine, but without limitations.  If more CPU and memory is available, the virtual machine can use it, but there is always a minimum guarantee provided by the reservation.  This is similar to how Cisco Nexus switches allocate bandwidth, minimum guarantees without limitations.
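
The idea of "minimum guarantees without limitations" can be sketched as a simple allocation function. To be clear, this is not the scheduler Cisco Nexus switches run; it is a minimal model that assumes each traffic class has a positive guaranteed fraction of the link and a current demand, and the class names and numbers are invented for the example.

```python
def allocate_bandwidth(link_gbps, classes):
    """classes: {name: (guaranteed_fraction, demand_gbps)} -> {name: allocated_gbps}.

    Assumes every guaranteed_fraction is positive and the fractions sum to <= 1.
    """
    # Step 1: every class gets its demand, up to its guaranteed minimum.
    alloc = {n: min(d, g * link_gbps) for n, (g, d) in classes.items()}
    spare = link_gbps - sum(alloc.values())

    # Step 2: hand unused bandwidth to classes that still want more,
    # in proportion to their guaranteed fractions.
    while spare > 1e-9:
        hungry = {n: g for n, (g, d) in classes.items() if d - alloc[n] > 1e-9}
        if not hungry:
            break
        total_weight = sum(hungry.values())
        handed_out = 0.0
        for n, g in hungry.items():
            extra = min(spare * g / total_weight, classes[n][1] - alloc[n])
            alloc[n] += extra
            handed_out += extra
        spare -= handed_out
    return alloc


# Hypothetical 10G link: vMotion is nearly idle, so storage and data borrow its share.
result = allocate_bandwidth(10.0, {
    "vmotion": (0.3, 1.0),
    "storage": (0.5, 8.0),
    "data": (0.2, 6.0),
})
print({name: round(gbps, 2) for name, gbps in result.items()})
# {'vmotion': 1.0, 'storage': 6.43, 'data': 2.57}
```

Storage and data each receive more than their guaranteed 5G and 2G because vMotion is not using its share, yet if vMotion demand rose, its 3G minimum would still be honored.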

The Cisco Nexus switches are also capable of classifying and marking traffic based on MAC, IP, or TCP information.  So if a packet arrives unmarked you can mark it at the switch and provide granular QoS, with or without the Nexus 1000V.  You can later add the Nexus 1000V for all of the security, management, and network visibility reasons, but you can at least get started with rich QoS capabilities with or without the Nexus 1000V.


hp blades cisco nexus 1000v

Great news! You can absolutely run the Cisco Nexus 1000V on an HP blade server.  Furthermore, this can be done with any blade switch.  You can use Virtual Connect Flex-10, the Procurve 6120XG, the 10G Passthrough module, or any other standard Ethernet switch.

There is one particular note about using HP Virtual Connect Flex-10 with Nexus 1000V that I want to discuss.  The Flex-10 module can be set up in two different switching modes: mapped mode or tunnel mode.  The most common mode is mapped mode because that is the default mode.  In mapped mode the Flex-10 administrator defines vNets, which are basically VLANs inside the Virtual Connect domain.  The Virtual Connect administrator can decide the VLAN ID used for these vNets independently of the network administrator who might be configuring the physical network switches and the Nexus 1000V.  The Flex-10 uplinks to the network with VLANs defined by the network administrator, however the VLAN IDs on the network uplink are mapped to a VLAN ID of a vNet, which might be a different VLAN ID, or the same.

If the Virtual Connect administrator is using mapped mode and mapping the VLAN ID on the network to a different VLAN ID inside the Virtual Connect domain, this can cause a problem with the Nexus 1000V.  If the Nexus 1000V VSM was configured with the VLAN IDs known on the physical network, the Nexus 1000V VEM running on the server is expecting to see the same VLAN IDs that were defined on the VSM.  However, if the Flex-10 mapped mode configuration is changing (mapping) the VLAN IDs to something different, you will have broken the linkage between Nexus 1000V VSM and VEM.

Tunnel mode, on the other hand, simply takes all VLAN IDs defined on the network uplinks and sends them down to the server ports, without changing or mapping the VLAN IDs to anything different.

The moral of the story here is that if you are going to use Nexus 1000V with Virtual Connect Flex-10, you can, just make sure you are not changing VLAN ID numbers.  If using mapped mode, do not map the network VLAN IDs to a different VLAN ID at the server, keep them the same.  Or, you can also use tunnel mode which does not provide any option of changing VLAN ID numbers.

Once you have ensured VLAN ID consistency inside of Virtual Connect you are all set to have a very successful implementation of Nexus 1000V on HP blades, even with Virtual Connect Flex-10.
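
A quick sanity check for this rule can be expressed in a couple of lines. The function and the dictionary layout are hypothetical (Virtual Connect does not export its configuration in this form); the point is only to show what "keep them the same" means.

```python
def mismatched_vlans(vnet_mapping):
    """vnet_mapping: {network_vlan_id: vlan_id_seen_by_server}.

    Return only the entries where mapped mode changes the VLAN ID.
    """
    return {net: srv for net, srv in vnet_mapping.items() if net != srv}


# Hypothetical mapped mode configuration: VLAN 20 is rewritten to 120.
print(mismatched_vlans({10: 10, 20: 120, 30: 30}))
# {20: 120}
```

An empty result means the VLAN IDs the VEM sees match what was defined on the VSM.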

That’s all for now.

Hope you enjoyed the pilot episode of Data Center Networking Q&A #1

Please remember to submit questions or comments in the comment section below.  Some questions may be answered here or featured in a future installment.


May 07 2010

Setting the stage for TRILL, rethinking data center switching

As data centers become increasingly dynamic and dense with virtualization, how the classic Ethernet switching design adapts to these new models and scales becomes an important and challenging question. Virtualization and cloud based services say that any workload can exist anywhere, at any time, on demand, and move to any location without disruption. This is a major paradigm shift from the old days where a “Server” and the application it supported had a very static location in the network. When the application has a static location you can build walls around it in a very structured manner with minimal trade-offs. In the old “static” Data Center, you could for example provide Layer 3 routing boundaries at the server edge for the very good reasons of robust scalability, minimal or no Spanning Tree, and active/active router-like link load balancing and fast convergence. In today’s dynamic Data Center, the imposition of Layer 3 boundaries no longer works.

The next generation dynamic Data Center requires a pervasive Layer 2 deployment enabling the aforementioned fluid mobility of application workloads. Any VLAN, on any switch, on any port, at any time. As a result, switch makers (in order to remain viable in the data center) must be geared towards enabling pervasive Layer 2 data center fabrics in a manner that is highly scalable (agile), robust, maximizes bandwidth (resources), with plug & play simplicity.

One major step forward in designing next generation data centers is the promising technology which is currently defined in RFC 5556 named TRILL (Transparent Interconnection of Lots of Links). Some switch vendors (such as Cisco) may initially offer the capabilities found in TRILL with additional enhancements as a proprietary system. Therefore, for the time being I am going to use the word TRILL in a generic sense. And where a capability is discussed that is a unique enhancement offered by Cisco (or any other vendor) I will simply cite that with an *, such as TRILL*.

Before we discuss TRILL in great detail I think it’s important first to take a step back and “Set the stage” a little by revisiting the classic Ethernet design principles currently in use today, understanding both the strengths and challenges. Then we’ll look at some alternative approaches that attempt to address these challenges, and where they fall short. As we go through the various areas I will point out where TRILL can make design improvements. Once we have this basis of understanding we will be ready to understand the value of TRILL with more detail in subsequent discussions. Sound cool? Great!

Revisiting Classic Ethernet. What works? What needs improvement?

A fundamental underpinning of Ethernet is the “Plug & Play” simplicity that in no small measure has contributed to the overall tremendous success of Ethernet. When you connect Ethernet switches together they can auto discover the topology and automatically learn about each host’s location on the network, with little to no configuration. Any future evolution of Ethernet must retain this fundamental “Plug & Play” characteristic to be successful. The key enabler of this Plug & Play capability is Flooding Behavior.

Figure 1 - Ethernet flooding

Figure 1 above shows two simple examples of Ethernet flooding behavior. On the left, if an Ethernet switch receives a Unicast frame with a destination that it doesn’t know about, it simply floods that frame out on all ports. This behavior is called Unicast Flooding and it ensures that the destination host receives the frame so long as it is connected to the network, without any special configuration (Plug & Play).

The other flooding behavior shown on the right (Figure 1 above) is a Broadcast message that is intended for all hosts on the network. When the Ethernet switch receives a broadcast frame, it will simply do as told and send a copy of that frame to all active ports. Broadcast messages are tremendously useful for hosts seeking to dynamically discover other hosts connected to the network without any special configuration (Plug & Play).

This default flooding behavior of Ethernet is fundamental to its greatest virtue, Plug & Play simplicity. However, this same flooding behavior also creates design challenges that we will discuss shortly.

The flooding of unknown unicast and broadcast frames also allows for Plug & Play learning of all the hosts and their location in the network, without any special configuration. Once the location of a host is known, all subsequent traffic to that host will be sent only to the ports leading to the host. I will refer to this type of traffic as Known Unicast traffic.

The process of automatically discovering a host’s location on the network is called MAC Learning:

Figure 2 - Classic Ethernet MAC Learning

Figure 2 above shows a simple example of the automatic MAC Learning process. Every time an Ethernet switch receives a frame on a port it looks at the source MAC address of the received frame and records the source MAC and the port on which it received the frame in its forwarding table, aka MAC Table. That’s it! It’s that simple. Any future frames received that are destined to the learned MAC address will be directed only to the port on which it was learned. This process is more specifically described as Source MAC Learning, because only the Source MAC address is examined upon receiving a frame.

Because of the flooding behavior discussed earlier, the Ethernet switch can quickly learn the location of all hosts on the network. Anytime a host sends a broadcast message it will be received by all Ethernet switches where the source MAC address of the sending station will be recorded and learned, as shown in Figure 2.

There is a peculiar side effect to Source MAC Learning: All Ethernet switches will inevitably learn about all hosts, needed or not. For example, in Figure 2 above, Host C and Host D are communicating on Switch 4. The Source MAC learning process was useful in establishing a Known Unicast conversation for these two hosts using Switch 4. However, despite the fact that Host A and Host B are not using Switch 4 for any conversations, Switch 4 has still populated its MAC Table with entries for Host A and Host B.
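
Source MAC Learning, flooding, and the side effect just described fit in a very small Python model. This is a sketch for illustration only: the class, port numbers, and single-letter MAC names are assumptions, and like a real switch it floods on every port except the one the frame arrived on.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Minimal model of a Classic Ethernet switch (illustrative only)."""

    def __init__(self, name, ports):
        self.name = name
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was learned on

    def receive(self, src_mac, dst_mac, in_port):
        """Learn the source MAC, then return the ports the frame is sent out on."""
        self.mac_table[src_mac] = in_port            # Source MAC Learning
        if dst_mac != BROADCAST and dst_mac in self.mac_table:
            out = self.mac_table[dst_mac]            # Known Unicast
            return [] if out == in_port else [out]
        # Broadcast or Unknown Unicast: flood on all ports except the ingress port.
        return [p for p in self.ports if p != in_port]


sw4 = LearningSwitch("Switch 4", ports=[1, 2, 3])
sw4.receive("A", BROADCAST, in_port=1)   # a broadcast from Host A reaches Switch 4
sw4.receive("C", "D", in_port=2)         # Host C to Host D, D not yet known: flooded
sw4.receive("D", "C", in_port=3)         # Host D replies: now a Known Unicast
print(sw4.mac_table)
# {'A': 1, 'C': 2, 'D': 3}
```

Host A never holds a conversation through Switch 4 in this example, yet it occupies a MAC Table entry there simply because one of its broadcasts passed by.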

“What’s wrong with that?” you ask? Well, in the old “static” Data Center with small Layer 2 domains this was never a concern. Now imagine this inefficient behavior on a much larger scale in the dynamic Data Center with thousands of virtual hosts in a pervasive Layer 2 domain. The unfortunate side effect is that you will have many unnecessary entries in every Ethernet switch. And each one of these unnecessary entries consumes valuable space in the MAC Table where there is a limited number of entries available. A typical data center class Ethernet switch might support 16,000 MAC entries. Again, not a problem in the “static” Data Center. However this poses a scalability challenge in the virtualization dense dynamic Data Center. Is this something that can be improved while maintaining the Plug & Play auto learning behavior? The answer is, Yes, this is an area enhanced by TRILL* :-)

Now let’s move on to the design challenge with flooding behavior I mentioned earlier. Remember, the flooding behavior of Ethernet is fundamental to achieving Plug & Play capabilities, so we can’t get rid of it, we need it. The challenge with flooding is there is no mechanism to know when a flooded frame (such as a Broadcast) has already made its way through the network. Every time a Broadcast or Unknown Unicast frame is received it is immediately flooded out on all ports, no questions asked, even if this is the same frame returning to the switch from a previous flood, there is no way to know. This can become a real problem when you have multiple paths from one switch to another.

Figure 3 - Ethernet flooding loop

In Figure 3 above, Host A sends a Broadcast or Unknown Unicast frame into Switch 3 which is then flooded on the links connecting to Switch 1 and Switch 2. Once received, Switch 1 & 2 will also flood the frame on all of their ports, and so on. Switch 3 ultimately receives the original frame again and the same process repeats. Unlike an IP packet that decrements a TTL field (time to live) with every hop, there is no such TTL field or other mechanism in an Ethernet frame that provides information about the frame’s age or history on the network. As a result, the flooding loop repeats infinitely with every new broadcast. It doesn’t take long for the loop to have catastrophic effects on the network (within seconds). Can Ethernet be enhanced with a TTL field just like IP to limit the scope of unwanted loops? The answer is, Yes, this is an area enhanced by TRILL. :-)
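
The difference a TTL-like field makes can be shown with a small simulation. This sketch assumes the three switches in Figure 3 are connected in a triangle, ignores host-facing ports, and uses a hypothetical `hop_limit` parameter as a stand-in for a TTL or hop count; it is not a model of the TRILL header itself.

```python
def flood(topology, start, rounds, hop_limit=None):
    """Return the number of frame copies in flight at the start of each round."""
    # Each in-flight copy is (switch it arrives at, switch it came from, hops so far).
    in_flight = [(nbr, start, 1) for nbr in topology[start]]
    history = []
    for _ in range(rounds):
        history.append(len(in_flight))
        next_round = []
        for at, came_from, hops in in_flight:
            if hop_limit is not None and hops >= hop_limit:
                continue  # hop count exhausted: the copy is dropped
            for nbr in topology[at]:
                if nbr != came_from:  # flood on every port except the ingress port
                    next_round.append((nbr, at, hops + 1))
        in_flight = next_round
    return history


triangle = {"S1": ["S2", "S3"], "S2": ["S1", "S3"], "S3": ["S1", "S2"]}

print(flood(triangle, "S3", rounds=6))
# [2, 2, 2, 2, 2, 2]   two copies circle the loop forever
print(flood(triangle, "S3", rounds=6, hop_limit=3))
# [2, 2, 2, 0, 0, 0]   the copies die out once the hop limit is reached
```

Without a hop limit, a single broadcast leaves two copies circulating in opposite directions indefinitely, and every additional broadcast adds two more.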

This looping challenge above led to the development of a Plug & Play mechanism in Ethernet to detect and prevent loops called Spanning Tree Protocol (STP).

Figure 4 - Classic Ethernet Loop Prevention

In Figure 4 above, the Ethernet switches have auto discovered a redundant path in the network using STP and placed certain interfaces in a “Blocking” state to prevent the disastrous infinite looping of flooded frames. The Spanning Tree protocol is Plug & Play, requiring no configuration work, and because it prevents the disastrous loops and thereby allows flooding to work properly in a network with redundant paths, you could argue that STP (even with its infamous reputation) is THE reason why Ethernet is so successful today as a mission critical data center network technology. Now, truth be told, STP does require some configuration tuning if you want to have precise control over which links are placed into a “Blocking” state. For example, in Figure 4 above, by defining Switch 1 as the “Root” bridge we can influence which redundant links from Switch 3 & 4 are blocked and provide a balance of bandwidth available to hosts on either Switch 3 or Switch 4, each switch having 50% of its bandwidth available for hosts.

There is an unfortunate side effect with STP. Remember, it is the Broadcast and Unknown Unicast frames flooding and looping the network that cause the catastrophic effects which we must correct with STP. The non-flooded Known Unicast traffic is not causing the problem. However, when STP blocks a path to close a loop, it is in fact punishing bandwidth availability for ALL traffic, including the Known Unicast traffic, the significant majority of all traffic on the network! That’s not fair! Can we enhance Ethernet to correct this unfair side effect? The answer is, Yes, this is an area enhanced by TRILL. :-)

Given that STP creates a single loop free forwarding topology for all traffic, flooded or non-flooded, it became increasingly important to build loop free topologies while maintaining multiple paths, maximizing bandwidth, without STP blocking any of those valuable paths, especially in 10GE data center networks. In order for STP to not block any of the paths we must first show STP a loop free topology from the start.

Building loop free topologies with multiple paths can be accomplished with the development of a capability generically referred to as Multi Chassis EtherChannel (MCEC), available in some switches today, most notably Cisco switches ;) but other switch vendors have started to implement MCEC as well. Some switch platforms such as the Cisco Nexus family refer to this capability as Virtual Port Channels (vPC).
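
The effect of choosing a Root bridge can be approximated with a breadth-first tree. This is a deliberate simplification: real STP elects the root and picks paths using bridge IDs and port costs, while this sketch just keeps the first link that reaches each switch from the root and treats every other link as blocked. The topology is an assumption based on Figure 4 (Switch 1 and 2 interconnected, Switch 3 and 4 each uplinked to both).

```python
from collections import deque


def blocked_links(links, root):
    """Return the links left out of a breadth-first tree rooted at `root`."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    tree = set()
    queue = deque([root])
    while queue:
        sw = queue.popleft()
        for nbr in neighbors[sw]:
            if nbr not in visited:
                visited.add(nbr)
                tree.add(frozenset((sw, nbr)))  # this link stays forwarding
                queue.append(nbr)
    return [link for link in links if frozenset(link) not in tree]


figure4 = [("S1", "S2"), ("S1", "S3"), ("S1", "S4"), ("S2", "S3"), ("S2", "S4")]
print(blocked_links(figure4, root="S1"))
# [('S2', 'S3'), ('S2', 'S4')]
```

With Switch 1 as the root, Switch 3 and Switch 4 each lose their uplink to Switch 2, leaving each with one of two uplinks forwarding, the 50% figure mentioned above.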

Figure 5 - Multi Path with Classic Ethernet (MCEC)

As shown in Figure 5 above, Switch 1 and Switch 2 form a special peering relationship with each other that allows them to be viewed as a single switch in the topology, rather than two separate switches. This significant accomplishment allows Switch 3 and Switch 4 to form a single logical link with a single standard Etherchannel to both Switch 1 & 2. STP treats Switch 1 & 2 as a single node on the network, and as a result finds a loop free topology from the start, and no links need to be blocked, all links are active. Virtual Port Channels is a popular design choice today for maximizing bandwidth in new data center network deployments and redesigns.

Accomplishing MCEC or vPC capabilities is not a trivial task. A significant engineering effort is required. For MCEC implementations to behave properly you must engineer lock step synchronization of several different roles and states on each peer switch (Switch 1 & 2). You need to make sure MAC learning is synchronized, any MACs learned on Switch 1 must be made known to Switch 2. You need to make sure the interface states (up/down) are synced and the interface configurations are identical. You also need to determine which switch will process STP messages on behalf of the other. And to top it all off, most importantly, you need to have robust split brain failure detection and determine how each switch will react and assume or relinquish the aforementioned roles and state. All of these different synchronization elements and split brain detection can lead to a complex matrix of failure scenarios that the switch maker must test to ensure software stability.

The significant engineering effort of MCEC is for the simple purpose of providing STP a multi path loop free topology so that no links will be blocked. Will it be possible to build a multi path loop free topology without all of the system complexity of MCEC? The answer is, Yes, this is an enhancement in TRILL :-)

Scaling the next generation Data Center with Classic Ethernet

Now let’s switch gears to scaling a pervasive Layer 2 data center fabric. Let’s start by looking at the scaling options for Tier 1 (the Aggregation layer). First of all, why would you want to scale Tier 1 anyway? Well, the more capacity you can have available at Tier 1 means more Tier 2 (Server Access Layer) switches that can exist in the layer 2 domain. Furthermore, the more ports you have at Tier 1 means more aggregate bandwidth you can deliver to a Tier 2 switch. Therefore, the ability to efficiently scale Tier 1 is critical to the overall scaling of size and bandwidth to the server environment.

Scaling out Tier 1

One interesting approach to scaling Tier 1 is to simply scale out by adding more switches horizontally across the Tier. This makes sense for a number of reasons. First of all, if you can connect the Tier 2 switch to an array of switches at Tier 1 you gain the advantage of spreading out risk, much like a RAID array of hard disk drives. For example, when a Tier 2 switch connects to (4) Tier 1 switches, a single uplink or Tier 1 switch failure would result in a 25% loss of available bandwidth, compared to a more significant 50% loss when there are just (2) Tier 1 switches. Second, if you can easily add more Tier 1 switches as you grow, the density of the Tier 1 switch becomes less of a factor in achieving the overall scale you need. For example, when you have the flexibility to eventually grow Tier 1 to (8) or even (16) switches, rather than only being limited to (2), you can achieve respectable scale with an array of smaller low cost Tier 1 switches, or mind boggling scale with a wide array of larger modular switches.
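
The 25% versus 50% figures are simple arithmetic, shown here as a small helper. It assumes a Tier 2 switch has one equal-speed uplink to each Tier 1 switch and that all uplinks are forwarding; the function name is made up for the example.

```python
def bandwidth_loss_percent(tier1_switches, failed=1):
    """Percent of a Tier 2 switch's uplink bandwidth lost when `failed` Tier 1 switches fail.

    Assumes one equal-speed, actively forwarding uplink per Tier 1 switch.
    """
    return 100.0 * failed / tier1_switches


for n in (2, 4, 8, 16):
    print(n, "Tier 1 switches:", bandwidth_loss_percent(n), "% lost per failure")
# 2 Tier 1 switches: 50.0 % lost per failure
# 4 Tier 1 switches: 25.0 % lost per failure
# 8 Tier 1 switches: 12.5 % lost per failure
# 16 Tier 1 switches: 6.25 % lost per failure
```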

Sounds great! Right? But before we start the high fives, how does scaling out Tier 1 work with the Classic Ethernet network relying on Spanning Tree Protocol for loop prevention? Well, it doesn’t… :-(

Figure 6 - Scaling out Tier 1 with Classic Ethernet

In Figure 6 above, I have attempted to scale out Tier 1 in a Classic Ethernet network. I have added Switches 5 & 6 to Tier 1 and linked my Tier 2 switches to the (4) switch array at Tier 1. Unfortunately though, the only thing I was able to accomplish was creating more loops that must be blocked by Spanning Tree Protocol. In order to maintain a loop free topology for flooded traffic (broadcasts & unknown unicasts), all of the extra links I added from Tier 2 to Tier 1 have been disabled by STP, which if you remember punishes all traffic including the Known Unicast and Multicast traffic. What was the point? This was a futile exercise.

It is for this very reason that having more than (2) Tier 1 switches has never made any sense with Classic Ethernet. This long standing rigid design constraint has led to the density of the Tier 1 switch being a very important criterion for achieving large scale and bandwidth. “How many ports can I shove in one box?” To achieve even respectable density in a modern data center requires a pair of large modular switches positioned in Tier 1, to which you can add modules as you grow. Once the module slots are filled you have hit your scalability wall; adding more Tier 1 switches is not a viable option.

Alright, so if loop prevention with Spanning Tree Protocol is the problem to achieving scale in Classic Ethernet, why not scale out Tier 1 with a design that does not create a looped topology to begin with? Such as with Multi Chassis EtherChannel (MCEC)? A great idea! Right? Well, maybe not…

Figure 7 - Scaling out Tier 1 with MCEC

In Figure 7 above, I have attempted to scale out Tier 1 with (4) switches all jointly participating in a Multi Chassis EtherChannel peering relationship. (First of all, this is a fictitious design, as no switch vendor has engineered this, not even Cisco. But let's just imagine for a second…) The plan here is to allow each Tier 2 switch to connect to all (4) Tier 1 switches with a single logical Port Channel, thus creating a loop free topology from the outset so Spanning Tree will not block any links. If I can already have (2) switches configured for MCEC peering, why not (4)? Heck, why stop at (4), why not (16)? The problem here of course is extreme complexity. Remember that accomplishing MCEC between just (2) switches is a significant engineering accomplishment. There are many states, roles, and Layer 2 / Layer 3 interactions that must be synchronized and orchestrated for the system to behave properly. On top of that, you must be able to quickly detect and correctly react to split brain failure scenarios. Once the MCEC domain is increased from (2) switches to just (4), you have increased the engineering complexity by an order of magnitude. As a testament to the engineering complexity of MCEC, consider that Cisco is the only major switch vendor to successfully engineer and support MCEC with (2) fully featured Layer 2 / Layer 3 switches. And NO switch vendor, not a single one, has successfully engineered, sold, and supported a (4) switch MCEC cluster. Some switch vendors are hinting at such capabilities as a possible future roadmap in their data sheets. All I have to say about that is … Good Luck!

Is it possible to scale out Tier 1 with (4), (8), or even (16) switches in a loop free design with a lot less engineering complexity? The answer is, Yes! This is an enhancement in TRILL. :-)

Another approach worth discussing is the complete removal of Layer 2 switching and replacing it with Layer 3 IP routing. By removing Layer 2 switching and replacing it with Layer 3 routing the switches behave more like routers with load balancing on multiple paths and no Spanning Tree blocking links. This also allows for scaling out each Tier without any of the Layer 2 challenges in Classic Ethernet as we have been discussing thus far. Sounds pretty good, right? Well, not so fast…

How do you provide pervasive Layer 2 services over a network of Layer 3 IP routing? The IP cloud formed by the Tier 1 & 2 switches would be used to create an MPLS cloud and deploy services such as VPLS (Virtual Private LAN Services) providing virtual Layer 2 circuits (pseudo wires) over the Layer 3 cloud. After a full mesh of VPLS pseudo wires has been configured between all Tier 2 switches you can begin to provide Layer 2 connectivity from any Tier 2 switch to another. Sound complicated? That’s because it is!

Figure 8 - Scaling Tier 1 with IP + MPLS + VPLS

In Figure 8 above, the data center network has been set up as a VPLS-over-MPLS-over-IP cloud. Once that foundation is in place, I need to configure a full mesh of Layer 2 VPLS pseudo wires between all Tier 2 switches. How many pseudo wires do you need to configure? You can use this formula where N equals the number of Tier 2 switches: N * (N-1) / 2. And, for each new Tier 2 switch you add you will need to go back to every other Tier 2 switch and configure a new set of pseudo wires to the newly added switch. Not exactly Plug & Play, is it?
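To get a feel for how quickly the full mesh grows, here is a short sketch of the N * (N-1) / 2 formula, along with the number of new pseudo wires needed each time a Tier 2 switch is added. The function names and the switch counts in the loop are only illustrative.

```python
def full_mesh_pseudowires(tier2_switches: int) -> int:
    """Pseudo wires in a full mesh of N Tier 2 switches: N * (N - 1) / 2."""
    return tier2_switches * (tier2_switches - 1) // 2


def new_pseudowires_for_next_switch(existing_tier2_switches: int) -> int:
    """Adding one Tier 2 switch needs one new pseudo wire to every
    existing Tier 2 switch, so every existing switch must be touched."""
    return existing_tier2_switches


# Illustrative Tier 2 sizes (not taken from the post).
for n in (4, 10, 20, 50):
    print(
        f"{n} Tier 2 switches: {full_mesh_pseudowires(n)} pseudo wires, "
        f"plus {new_pseudowires_for_next_switch(n)} more for switch #{n + 1}"
    )
```

The mesh grows roughly with the square of the number of Tier 2 switches, while the work to add each new switch grows linearly with the number already deployed.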

Rather than replacing Layer 2 with Layer 3, and then trying to overlay Layer 2 services over the Layer 3 … wouldn’t it be better to simply evolve Plug & Play Layer 2 switching with more Layer 3 like forwarding characteristics? This is exactly the idea behind TRILL. :-)

Now let's finish up with a look at how Tier 2 scales under Classic Ethernet. Remember from earlier that having any more than (2) switches at Tier 1 makes no sense in Classic Ethernet, thanks to flooding loops and Spanning Tree. Because of this (2) switch design constraint, the density potential of the Tier 1 switch you choose becomes a key factor in determining the scalability of the network.

Figure 9 - Scaling Tier 2 with Classic Ethernet

In Figure 9 above, because Tier 1 cannot have any more than (2) switches, there will always be a clear trade off between scaling bandwidth and scaling size. If I choose to give more bandwidth to each Tier 2 switch, it means less available capacity for adding more Tier 2 switches. This is largely the result of not being able to scale out Tier 1 horizontally with Classic Ethernet. If the rigid (2) switch design constraint were removed from the equation, you would suddenly have a lot more flexibility in how you can scale, and the trade off between bandwidth and size becomes less of a black and white matter. Gaining this valuable flexibility with the data center switching design is a key promise behind the evolution to TRILL.
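The bandwidth versus size trade off can be sketched with some back-of-the-envelope arithmetic. This is a rough model under stated assumptions, not a design tool: every Tier 1 port is used for a Tier 2 downlink, each Tier 2 switch connects to every Tier 1 switch with the same number of uplinks, and all uplinks are forwarding. The 128-port Tier 1 switch and 10GE links are hypothetical values chosen only for illustration.

```python
def tier2_scale(
    tier1_switches: int,
    ports_per_tier1_switch: int,
    uplinks_per_tier1_switch: int,
    link_gbps: int = 10,
) -> tuple[int, int]:
    """Return (max_tier2_switches, uplink_gbps_per_tier2_switch).

    Assumes each Tier 2 switch consumes `uplinks_per_tier1_switch` ports
    on every Tier 1 switch, and that all of those uplinks forward traffic.
    """
    max_tier2 = ports_per_tier1_switch // uplinks_per_tier1_switch
    bandwidth = tier1_switches * uplinks_per_tier1_switch * link_gbps
    return max_tier2, bandwidth


# Hypothetical 128-port Tier 1 switches with 10GE links.
print(tier2_scale(2, 128, 1))  # (2) at Tier 1, favor size
print(tier2_scale(2, 128, 4))  # (2) at Tier 1, favor bandwidth
print(tier2_scale(8, 128, 1))  # (8) at Tier 1, get both
```

With only (2) Tier 1 switches, quadrupling the bandwidth per Tier 2 switch cuts the number of Tier 2 switches to a quarter. With (8) Tier 1 switches in this model, the same bandwidth is reached without giving up any Tier 2 capacity.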

The stage has been set for the next evolution of data center switching

The next generation dynamic data center needs to have tremendous design flexibility to build highly scalable, agile, and robust Layer 2 domains. To get there, classic Ethernet switching as we know it today needs to evolve in the data center. The most successful solutions will be those that address many of the challenges facing the data center today, not just bandwidth.

What are the challenges that should be addressed?

  • Plug & Play simplicity
  • MAC address scalability*
    • a more efficient method of MAC learning*
    • a hierarchical approach to Layer 2 forwarding*
  • Minimal configuration requirements
  • All links forwarding & load balancing – No Spanning Tree
  • More bandwidth for all traffic types, including Multicast, not just Unicast*
  • Fast convergence
  • Layer 3 virtues of scalability and robustness with the Plug & Play simplicity of Layer 2
  • Flexible and agile scaling out of either Tier 1, or Tier 2.
  • Configuration simplicity for automation with open API’s.

OK. Remember the goal here was to “set the stage” with a basic level understanding of why classic Ethernet switching needs to evolve for the next generation data centers. Please stay tuned for further detailed discussions on data center switching and TRILL.

RSS Feed: http://bradhedlund.com/feed/

Future topics may include:

  • TRILL technical deep dives
    • Conversation based MAC learning
    • Configuration examples
    • Design examples
  • How and where does FCoE and Unified Fabric fit into this picture?
  • Industry news & analysis
  • Suggestions?

Presentation Download

I’m not sure who’s crazier: You (for reading this entire post without falling asleep)? Or Me (for writing such a long post)? Anyway, I have a reward for your time and attention! You get to download the presentation I developed for this post. There are some extra slides that provide a sneak peek into my next post, an Introduction to TRILL.

PDF: http://internetworkexpert.s3.amazonaws.com/2010/trill1/TRILL-intro-part1.pdf

Original PowerPoint with Animations: *Please ask your Cisco representative


Disclosure: The author (Brad Hedlund) is an employee of Cisco Systems, Inc., which plans to have TRILL based solutions embedded into the company’s data center switching product line.
Disclaimer: The views and opinions expressed are solely those of the author as a private individual and do not necessarily represent those of the author’s employer, Cisco Systems, Inc. The author is not an official media spokesperson for Cisco Systems, Inc.


Special Thanks to my colleagues at Cisco, Marty Ma and Francois Tallet, both of whom are deeply involved with Cisco’s implementation of TRILL* and took precious time to provide me with a 1:1 education about some of the topics covered here.

© Copyright 2010, Brad Hedlund, INTERNETWORK EXPERT .ORG


Mar 14 2010

links for 2010-03-14

Published by Brad Hedlund under Bookmarks


Mar 02 2010

The FOLLY in the HP vs Cisco UCS Tolly Group report on bandwidth

Folly: lack of good sense or normal prudence and foresight

Tolly Group: “Clients work with Tolly Group senior personnel to identify the chief marketing message desired

HP: Client of Tolly Group with a desired marketing message of “Cisco UCS bandwidth sucks”, but in fact received an embarrassing Folly. (refund?)

By now you may have read or heard about the recent HP funded Tolly Group report which attempts to position HP Bladesystem as being superior to Cisco UCS for blade-to-blade bandwidth scalability in a single blade chassis. Unfortunately though for HP, The Tolly Group, and You (who wasted your time reading this report), it contains an egregious FOLLY that effectively makes it a useless waste of time.

The report begins with a crucial and fatal misunderstanding about Cisco UCS:

Only one fabric extender module was used as the second is only used for fail-over.

WRONG! This is completely untrue. When two fabric extenders are installed in a Cisco UCS chassis they are both ACTIVE, and provide redundancy. Each fabric extender provides 40 Gbps of I/O to the chassis, so with two active fabrics you have a total of 80 Gbps of active and usable I/O per chassis under normal conditions. In the event one of the fabrics has failed (or is completely missing, as in the Tolly tests), the other fabric will provide non-disruptive I/O for all of the Server vNICs that were using the failed fabric.

Because of this fatal misunderstanding, the HP Tolly Group tests proceeded with the belief that a Cisco UCS chassis only has 40 Gbps of active I/O under normal operations. How could HP and Tolly Group miss this simple fact? After all, the Cisco.com data sheet for the Cisco UCS fabric extender clearly states:

Typically configured in pairs for redundancy, two fabric extenders provide up to 80 Gbps of I/O to the chassis.

http://www.cisco.com/en/US/prod/collateral/ps10265/ps10278/data_sheet_c78-524729_ps10276_Products_Data_Sheet.html

Figure 1 below shows normal operations of Cisco UCS with 80 Gbps ACTIVE/ACTIVE redundant fabrics. Each blue line is 10GE.

Figure 1 - Cisco UCS with 80 Gbps ACTIVE/ACTIVE redundant fabrics

Figure 1 above shows the Cisco recommended configuration for scaling UCS for maximum bandwidth. Servers 1 – 4 can have their vNIC associated to the Fabric A side with 40 Gbps of bandwidth, while Servers 5 – 8 can have their vNIC associated to the Fabric B side, which also has 40 Gbps. The vNIC on each Server can also be configured for failover to the other fabric in a failure condition. This failover happens non-disruptively to the OS. The OS never sees a link down event on the Adapter. During the fabric failure condition, all (8) blades will share the same 40 Gbps of bandwidth on the remaining fabric.

Figure 2 below shows how to select the active fabric for a UCS server vNIC and enable failover

Figure 2 - Selecting the fabric for a vNIC with failover

Under normal operations each blade has full dedicated 10 Gbps of bandwidth. Any server can talk to any server at full line rate 10GE with ZERO oversubscription, ZERO shared bandwidth.

Under a fabric failure condition, each blade shares 10GE with another, resulting in a 2:1 oversubscription.
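The two oversubscription figures follow directly from the numbers already given: (8) blades at 10 Gbps each, and 40 Gbps per fabric extender. A minimal sketch of that arithmetic (the function and parameter names are only for illustration):

```python
def oversubscription(
    blades: int, blade_gbps: int, active_fabrics: int, fabric_gbps: int
) -> float:
    """Ratio of total blade bandwidth demand to available chassis I/O.

    A result of 1.0 means 1:1 (no oversubscription); 2.0 means 2:1.
    """
    demand = blades * blade_gbps
    capacity = active_fabrics * fabric_gbps
    return demand / capacity


normal = oversubscription(blades=8, blade_gbps=10, active_fabrics=2, fabric_gbps=40)
failed = oversubscription(blades=8, blade_gbps=10, active_fabrics=1, fabric_gbps=40)
print(f"Normal operations: {normal:g}:1")
print(f"One fabric failed: {failed:g}:1")
```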

The HP funded Tolly Group tested Cisco UCS in a failed fabric condition, under the false premise of normal operations.

Figure 3 below shows the failed fabric condition as tested by HP and Tolly Group

Figure 3 - Cisco UCS with a failed fabric and 1/2 bandwidth

In the failed fabric condition shown above, (8) blades will share 40 Gbps. More specifically with the HP Tolly Group tests that used 6 servers, Servers 1 & 5 will share the same 10GE link, and Servers 2 & 6 will also share the same 10GE link on the Fabric A side.
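The link sharing described above can be sketched as a simple mapping from server number to fabric extender uplink. The modulo mapping below is an assumption chosen to be consistent with the pairs named in this post (Servers 1 & 5, Servers 2 & 6); it is not taken from Cisco documentation.

```python
from collections import defaultdict

LINKS_PER_FABRIC = 4  # four 10GE links per fabric extender (40 Gbps)


def link_for_server(server: int) -> int:
    """Assumed mapping: servers 1-4 land on links 1-4, and servers 5-8
    wrap around onto the same links 1-4 when only one fabric is present."""
    return (server - 1) % LINKS_PER_FABRIC + 1


def servers_per_link(servers) -> dict[int, list[int]]:
    """Group servers by the single-fabric 10GE link they end up using."""
    groups: dict[int, list[int]] = defaultdict(list)
    for server in servers:
        groups[link_for_server(server)].append(server)
    return dict(groups)


# The 6 server test, with only one fabric available.
for link, servers in servers_per_link(range(1, 7)).items():
    status = "shared" if len(servers) > 1 else "dedicated"
    print(f"Link {link}: servers {servers} ({status})")
```

Under this assumed mapping, two of the four links carry two servers each in the 6 server test, while Servers 3 and 4 keep a link to themselves.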

This is exactly how the Tolly Group tested Cisco UCS under the premise of showing “Bandwidth Scalability” – when in fact they did not provide the full available bandwidth to the Cisco UCS blades. However, the full available bandwidth was provided to the HP blades. Is that a fair test? No way Jose!

What is even more interesting is that even with Cisco UCS tested in a failed fabric condition it still outperformed HP in bandwidth tests using 4 servers:

Aggregate throughput of 4 Servers with HP in normal conditions: 35.83 Gbps

Aggregate throughput of 4 Servers with Cisco UCS under failed fabric conditions: 36.59 Gbps

Cisco UCS with (3) hops outperforms HP with only (1) hop — Ouch! That’s gotta be a tough one for the folks at HP to explain.

The major blow the HP Tolly Report tries to deliver is a test with 6 servers where HP almost doubles the performance of Cisco UCS. Again, this should not come as a surprise to anybody because Cisco UCS was tested while in a failed condition, while HP was tested under normal conditions:

Aggregate throughput of 6 servers with HP in normal conditions: 53.65 Gbps

Aggregate throughput of 6 servers with Cisco UCS under failed fabric conditions: 26.28 Gbps

Cisco UCS with (3) hops and half its fabric missing performs at half the speed of HP with (1) hop and a full fabric. Why is that a shocker?

What would have happened if the Tolly Group actually provided a fair test between HP and Cisco on the 6 server test? Is that something the Tolly Group should figure out? After all, the Tolly Group has what it describes as a Fair Testing Charter that states:

With competitive benchmarks, The Tolly Group strives to ensure that all participants are [tested] fairly

http://www.tolly.com/FTC.aspx

That sure sounds nice, I wonder if this actually means anything? Only the Tolly Group can tell us for sure.

Furthermore, I wonder if HP will continue to mislead the public with this unfair testing? Or will HP do the right thing and insist the Tolly Group re-test under apples-to-apples fair test conditions?

At this point the ball is in their court to either disappoint or impress.

###

Disclaimer: The views and opinions are solely those of the author as a private individual and do not necessarily represent those of the author’s employer (Cisco Systems). The author is not an official spokesperson for Cisco Systems, Inc.

