Setting the stage for TRILL, rethinking data center switching

Filed in FabricPath, Featured, Switching, TRILL on May 7, 2010

As data centers become increasingly dynamic and dense with virtualization, how the classic Ethernet switching design adapts and scales to these new models becomes an important and challenging question. Virtualization and cloud based services demand that any workload can exist anywhere, at any time, on demand, and move to any location without disruption. This is a major paradigm shift from the old days when a “Server” and the application it supported had a very static location in the network. When the application has a static location you can build walls around it in a very structured manner with minimal trade-offs. In the old “static” Data Center, you could, for example, provide Layer 3 routing boundaries at the server edge for the very good reasons of robust scalability, minimal or no Spanning Tree, and active/active router-like link load balancing and fast convergence. In today’s dynamic Data Center, the imposition of Layer 3 boundaries no longer works.

The next generation dynamic Data Center requires a pervasive Layer 2 deployment enabling the aforementioned fluid mobility of application workloads: any VLAN, on any switch, on any port, at any time. As a result, switch makers (in order to remain viable in the data center) must gear their products toward enabling pervasive Layer 2 data center fabrics that are highly scalable (agile), robust, maximize bandwidth (resources), and retain plug & play simplicity.

One major step forward in designing next generation data centers is the promising technology currently defined in RFC 5556, named TRILL (Transparent Interconnection of Lots of Links). Some switch vendors (such as Cisco) may initially offer the capabilities found in TRILL with additional enhancements as a proprietary system. Therefore, for the time being I am going to use the word TRILL in a generic sense, and where a capability is discussed that is a unique enhancement offered by Cisco (or any other vendor) I will simply cite that with an asterisk, such as TRILL*.

Before we discuss TRILL in great detail I think it’s important to first take a step back and “set the stage” a little by revisiting the classic Ethernet design principles currently in use today, understanding both the strengths and the challenges. Then we’ll look at some alternative approaches that attempt to address these challenges, and where they fall short. As we go through the various areas I will point out where TRILL can make design improvements. Once we have this basis of understanding we will be ready to appreciate the value of TRILL in more detail in subsequent discussions. Sound cool? Great!

Revisiting Classic Ethernet. What works? What needs improvement?

A fundamental underpinning of Ethernet is the “Plug & Play” simplicity that in no small measure has contributed to the overall tremendous success of Ethernet. When you connect Ethernet switches together they can auto discover the topology and automatically learn about each host’s location on the network, with little to no configuration. Any future evolution of Ethernet must retain this fundamental “Plug & Play” characteristic to be successful. The key enabler of this Plug & Play capability is Flooding Behavior.

Figure 1 - Ethernet flooding

Figure 1 above shows two simple examples of Ethernet flooding behavior. On the left, if an Ethernet switch receives a Unicast frame with a destination that it doesn’t know about, it simply floods that frame out on all ports (except the one it arrived on). This behavior is called Unicast Flooding and it ensures that the destination host receives the frame so long as it is connected to the network, without any special configuration (Plug & Play).

The other flooding behavior shown on the right (Figure 1 above) is a Broadcast message that is intended for all hosts on the network. When the Ethernet switch receives a broadcast frame, it will simply do as told and send a copy of that frame to all active ports. Broadcast messages are tremendously useful for hosts seeking to dynamically discover other hosts connected to the network without any special configuration (Plug & Play).

This default flooding behavior of Ethernet is fundamental to its greatest virtue, Plug & Play simplicity. However, this same flooding behavior also creates design challenges that we will discuss shortly.

The flooding of Unknown Unicast and Broadcast frames also allows for Plug & Play learning of all the hosts and their locations in the network, without any special configuration. Once the location of a host is known, all subsequent traffic to that host will be sent only to the ports leading to that host. I will refer to this type of traffic as Known Unicast traffic.

The process of automatically discovering a host’s location on the network is called MAC Learning:

Figure 2 - Classic Ethernet MAC Learning

Figure 2 above shows a simple example of the automatic MAC Learning process. Every time an Ethernet switch receives a frame on a port it looks at the source MAC address of the received frame and records that source MAC, along with the port on which it was received, in its forwarding table, aka MAC Table. That’s it! It’s that simple. Any future frames destined to the learned MAC address will be directed only to the port on which it was learned. This process is more specifically described as Source MAC Learning, because only the Source MAC address is examined upon receiving a frame.

Because of the flooding behavior discussed earlier, the Ethernet switch can quickly learn the location of all hosts on the network. Anytime a host sends a broadcast message it will be received by all Ethernet switches, each of which records and learns the source MAC address of the sending station, as shown in Figure 2.
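To make the learning and flooding rules concrete, here is a toy Python sketch of a single switch’s forwarding logic. It is a deliberately simplified model of the behavior described above, not any vendor’s actual implementation, and the MAC addresses and port numbers are made up:

```python
# Toy model of classic Ethernet switching: Source MAC Learning plus
# flood-on-broadcast/unknown-unicast. Greatly simplified; hypothetical
# MAC addresses and port numbers.

class EthernetSwitch:
    BROADCAST = "ff:ff:ff:ff:ff:ff"

    def __init__(self, ports):
        self.ports = set(ports)   # active ports, e.g. {1, 2, 3, 4}
        self.mac_table = {}       # learned MAC address -> port

    def receive(self, in_port, src_mac, dst_mac):
        # Source MAC Learning: remember which port the sender lives behind.
        self.mac_table[src_mac] = in_port

        if dst_mac == self.BROADCAST or dst_mac not in self.mac_table:
            # Broadcast or Unknown Unicast: flood on every port except
            # the one the frame arrived on.
            return self.ports - {in_port}

        # Known Unicast: forward only toward the learned port.
        return {self.mac_table[dst_mac]}


sw = EthernetSwitch(ports=[1, 2, 3, 4])
print(sw.receive(1, "aa:aa", "bb:bb"))  # unknown unicast -> flood to {2, 3, 4}
print(sw.receive(2, "bb:bb", "aa:aa"))  # reply; "aa:aa" already learned -> {1}
print(sw.receive(1, "aa:aa", "bb:bb"))  # now a Known Unicast -> {2}
```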

There is a peculiar side effect to Source MAC Learning: All Ethernet switches will inevitably learn about all hosts, needed or not. For example, in Figure 2 above, Host C and Host D are communicating on Switch 4. The Source MAC learning process was useful in establishing a Known Unicast conversation for these two hosts using Switch 4. However, despite the fact that Host A and Host B are not using Switch 4 for any conversations, Switch 4 has still populated its MAC Table with entries for Host A and Host B.

“What’s wrong with that?” you ask. Well, in the old “static” Data Center with small Layer 2 domains this was never a concern. Now imagine this inefficient behavior on a much larger scale in the dynamic Data Center with thousands of virtual hosts in a pervasive Layer 2 domain. The unfortunate side effect is that you will have many unnecessary entries in every Ethernet switch. And each one of these unnecessary entries consumes valuable space in the MAC Table, where there is a limited number of entries available. A typical data center class Ethernet switch might support 16,000 MAC entries. Again, not a problem in the “static” Data Center. However this poses a scalability challenge in the virtualization-dense dynamic Data Center. Is this something that can be improved while maintaining the Plug & Play auto learning behavior? The answer is, Yes, this is an area enhanced by TRILL* :-)
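To get a feel for the scale, here is a back-of-the-envelope calculation with purely hypothetical rack, server, and VM counts:

```python
# Hypothetical back-of-the-envelope numbers: in a pervasive Layer 2 domain
# every switch eventually learns every MAC, needed or not.
racks, servers_per_rack, vms_per_server = 20, 40, 20
total_macs = racks * servers_per_rack * vms_per_server
mac_table_size = 16_000                   # typical data center switch (per the text)
print(total_macs)                         # 16000 -- already at the table limit
print(total_macs >= mac_table_size)       # True: every switch's table is full
```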

Now let’s move on to the design challenge with flooding behavior I mentioned earlier. Remember, the flooding behavior of Ethernet is fundamental to achieving Plug & Play capabilities, so we can’t get rid of it; we need it. The challenge with flooding is that there is no mechanism to know when a flooded frame (such as a Broadcast) has already made its way through the network. Every time a Broadcast or Unknown Unicast frame is received it is immediately flooded out on all ports, no questions asked. Even if it is the same frame returning to the switch from a previous flood, there is no way to know. This can become a real problem when you have multiple paths from one switch to another.

Figure 3 - Ethernet flooding loop

In Figure 3 above, Host A sends a Broadcast or Unknown Unicast frame into Switch 3, which is then flooded on the links connecting to Switch 1 and Switch 2. Once received, Switch 1 & 2 will also flood the frame on all of their ports, and so on. Switch 3 ultimately receives the original frame again and the same process repeats. Unlike an IP packet, whose TTL (time to live) field is decremented with every hop, an Ethernet frame has no TTL field or other mechanism that provides information about the frame’s age or history on the network. As a result, the flooding loop repeats infinitely with every new broadcast. It doesn’t take long for the loop to have catastrophic effects on the network (within seconds). Can Ethernet be enhanced with a TTL field just like IP to limit the scope of unwanted loops? The answer is, Yes, this is an area enhanced by TRILL. :-)
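To illustrate how a hop count bounds a flooding loop, here is another toy sketch. It models only the concept of a decrementing hop count (it is not the actual TRILL header or encapsulation), using the triangle topology of Figure 3:

```python
# Toy illustration of why a hop count (TTL) bounds a flooding loop.
# Concept only -- this is not the actual TRILL header format.
# Three switches wired in a triangle, as in Figure 3.

neighbors = {1: [2, 3], 2: [1, 3], 3: [1, 2]}

def flood(start_switch, hop_count=None, max_events=50):
    """Count forwarding events for one flooded frame.

    hop_count=None models classic Ethernet (no age field), so the loop
    only stops because the simulation itself is capped at max_events.
    """
    events = 0
    frames = [(start_switch, hop_count)]   # (current switch, remaining hops)
    while frames and events < max_events:
        switch, hops = frames.pop()
        if hops == 0:
            continue                       # hop count expired: frame dropped
        for nxt in neighbors[switch]:
            events += 1
            frames.append((nxt, None if hops is None else hops - 1))
    return events

print(flood(3))               # classic Ethernet: hits the simulation cap (would loop forever)
print(flood(3, hop_count=4))  # with a hop count the flood dies out on its own (30 events)
```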

This looping challenge above led to the development of a Plug & Play mechanism in Ethernet to detect and prevent loops called Spanning Tree Protocol (STP).

Figure 4 - Classic Ethernet Loop Prevention

In Figure 4 above, the Ethernet switches have auto discovered a redundant path in the network using STP and placed certain interfaces in a “Blocking” state to prevent the disastrous infinite looping of flooded frames. The Spanning Tree protocol is Plug & Play, requiring no configuration work, and because it prevents the disastrous loops and thereby allows flooding to work properly in a network with redundant paths, you could argue that STP (even with its infamous reputation) is THE reason why Ethernet is so successful today as a mission critical data center network technology. Now, truth be told, STP does require some configuration tuning if you want precise control over which links are placed into a “Blocking” state. In Figure 4 above, for example, by defining Switch 1 as the “Root” bridge we can influence which redundant links from Switch 3 & 4 are blocked, providing a balance of bandwidth available to hosts on either Switch 3 or Switch 4, each switch having 50% of its bandwidth available for hosts.

There is an unfortunate side effect with STP. Remember, it is the Broadcast and Unknown Unicast frames flooding and looping the network that cause the catastrophic effects which we must correct with STP. The non-flooded Known Unicast traffic is not causing the problem. However, when STP blocks a path to close a loop, it is in fact punishing bandwidth availability for ALL traffic, including the Known Unicast traffic, the significant majority of all traffic on the network! That’s not fair! Can we enhance Ethernet to correct this unfair side effect? The answer is, Yes, this is an area enhanced by TRILL. :-)

Given that STP creates a single loop free forwarding topology for all traffic, flooded or non-flooded, it became increasingly important to build loop free topologies while maintaining multiple paths, maximizing bandwidth, without STP blocking any of those valuable paths, especially in 10GE data center networks. In order for STP to not block any of the paths we must first show STP a loop free topology from the start.

Building loop free topologies with multiple paths can be accomplished with a capability generically referred to as Multi Chassis EtherChannel (MCEC), available in some switches today – most notably Cisco switches ;) – but other switch vendors have started to implement MCEC as well. Some switch platforms such as the Cisco Nexus family refer to this capability as Virtual Port Channels (vPC).

Figure 5 - Multi Path with Classic Ethernet (MCEC)

As shown in Figure 5 above, Switch 1 and Switch 2 form a special peering relationship with each other that allows them to be viewed as a single switch in the topology, rather than two separate switches. This significant accomplishment allows Switch 3 and Switch 4 to form a single logical link with a single standard EtherChannel to both Switch 1 & 2. STP treats Switch 1 & 2 as a single node on the network, and as a result sees a loop free topology from the start; no links need to be blocked and all links are active. Virtual Port Channels is a popular design choice today for maximizing bandwidth in new data center network deployments and redesigns.

Accomplishing MCEC or vPC capabilities is not a trivial task. A significant engineering effort is required. For MCEC implementations to behave properly you must engineer lockstep synchronization of several different roles and states on each peer switch (Switch 1 & 2). You need to make sure MAC learning is synchronized: any MACs learned on Switch 1 must be made known to Switch 2. You need to make sure the interface states (up/down) are synced and the interface configurations are identical. You also need to determine which switch will process STP messages on behalf of the other. And to top it all off, most importantly, you need robust split brain failure detection and a plan for how each switch will react and assume or relinquish the aforementioned roles and state. All of these different synchronization elements and split brain detection lead to a complex matrix of failure scenarios that the switch maker must test to ensure software stability.

The significant engineering effort of MCEC is for the simple purpose of providing STP a multi path loop free topology so that no links will be blocked. Will it be possible to build a multi path loop free topology without all of the system complexity of MCEC? The answer is, Yes, this is an enhancement in TRILL :-)

Scaling the next generation Data Center with Classic Ethernet

Now let’s switch gears to scaling a pervasive Layer 2 data center fabric. Let’s start by looking at the scaling options for Tier 1 (the Aggregation layer). First of all, why would you want to scale Tier 1 anyway? Well, the more capacity you have available at Tier 1, the more Tier 2 (Server Access Layer) switches can exist in the Layer 2 domain. Furthermore, the more ports you have at Tier 1, the more aggregate bandwidth you can deliver to a Tier 2 switch. Therefore, the ability to efficiently scale Tier 1 is critical to the overall scaling of size and bandwidth to the server environment.

Scaling out Tier 1

One interesting approach to scaling Tier 1 is to simply scale out by adding more switches horizontally across the Tier. This makes sense for a number of reasons. First of all, if you can connect the Tier 2 switch to an array of switches at Tier 1 you gain the advantage of spreading out risk, much like a RAID array of hard disk drives. For example, when a Tier 2 switch connects to (4) Tier 1 switches, a single uplink or Tier 1 switch failure would result in a 25% loss of available bandwidth, compared to a more significant 50% loss when there are just (2) Tier 1 switches. Second, if you can easily add more Tier 1 switches as you grow, the density of the Tier 1 switch becomes less of a factor in achieving the overall scale you need. For example, when you have the flexibility to eventually grow Tier 1 to (8) or even (16) switches, rather than being limited to (2), you can achieve respectable scale with an array of smaller low cost Tier 1 switches, or mind boggling scale with a wide array of larger modular switches.
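The failure-impact arithmetic is simple but worth spelling out. Assuming a Tier 2 switch spreads its uplink bandwidth evenly across N Tier 1 switches, losing one uplink (or one Tier 1 switch) costs 1/N of the bandwidth:

```python
# Impact of losing one uplink (or one Tier 1 switch), assuming a Tier 2
# switch spreads its uplink bandwidth evenly across N Tier 1 switches.
for n in (2, 4, 8, 16):
    print(f"{n} Tier 1 switches: one failure costs {100 / n:.1f}% of uplink bandwidth")
# 2 -> 50.0%, 4 -> 25.0%, 8 -> 12.5%, 16 -> 6.2%
```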

Sounds great! Right? But before we start the high fives, how does scaling out Tier 1 work with the Classic Ethernet network relying on Spanning Tree Protocol for loop prevention? Well, it doesn’t… :-(

Figure 6 - Scaling out Tier 1 with Classic Ethernet

In Figure 6 above, I have attempted to scale out Tier 1 in a Classic Ethernet network. I have added Switches 5 & 6 to Tier 1 and linked my Tier 2 switches to the (4) switch array at Tier 1. Unfortunately though, the only thing I was able to accomplish was creating more loops that must be blocked by Spanning Tree Protocol. In order to maintain a loop free topology for flooded traffic (broadcasts & unknown unicasts), all of the extra links I added from Tier 2 to Tier 1 have been disabled by STP, which if you remember punishes all traffic including the Known Unicast and Multicast traffic. What was the point? This was a futile exercise.

It is for this very reason that having more than (2) Tier 1 switches has never made any sense with Classic Ethernet. This long standing rigid design constraint has made the density of the Tier 1 switch a very important criterion for achieving large scale and bandwidth. “How many ports can I shove in one box?” To achieve even respectable density in a modern data center requires a pair of large modular switches positioned in Tier 1, from which you can add modules as you grow. Once the module slots are filled you have hit your scalability wall; adding more Tier 1 switches is not a viable option.

Alright, so if loop prevention with Spanning Tree Protocol is the problem to achieving scale in Classic Ethernet, why not scale out Tier 1 with a design that does not create a looped topology to begin with? Such as with Multi Chassis EtherChannel (MCEC)? A great idea! Right? Well, maybe not…

Figure 7 - Scaling out Tier 1 with MCEC

In Figure 7 above, I have attempted to scale out Tier 1 with (4) switches all jointly participating in a Multi Chassis EtherChannel peering relationship. (First of all, this is a fictitious design, as no switch vendor has engineered this, not even Cisco. But let’s just imagine for a second…) The plan here is to allow each Tier 2 switch to connect to all (4) Tier 1 switches with a single logical Port Channel, thus creating a loop free topology at the onset so Spanning Tree will not block any links. If I can already have (2) switches configured for MCEC peering, why not (4)? Heck, why stop at (4), why not (16)? The problem here of course is extreme complexity. Remember that accomplishing MCEC between just (2) switches is a significant engineering accomplishment. There are many states, roles, and Layer 2 / Layer 3 interactions that must be synchronized and orchestrated for the system to behave properly. On top of that, you must be able to quickly detect and correctly react to split brain failure scenarios. Once the MCEC domain is increased from (2) switches to just (4), you have increased the engineering complexity by an order of magnitude. As a testament to the engineering complexity of MCEC, consider that Cisco is the only major switch vendor to successfully engineer and support MCEC with (2) fully featured Layer2/Layer3 switches. And NO switch vendor, not a single one, has successfully engineered, sold, and supports a (4) switch MCEC cluster. Some switch vendors are hinting at such capabilities as a possible future roadmap in their data sheets. All I have to say about that is … Good Luck!

Is it possible to scale out Tier 1 with (4), (8), or even (16) switches in a loop free design with a lot less engineering complexity? The answer is, Yes! This is an enhancement in TRILL. :-)

Another approach worth discussing is the complete removal of Layer 2 switching in favor of Layer 3 IP routing. With Layer 3 routing, the switches behave more like routers, load balancing across multiple paths with no Spanning Tree blocking links. This also allows for scaling out each Tier without any of the Layer 2 challenges in Classic Ethernet we have been discussing thus far. Sounds pretty good, right? Well, not so fast…

How do you provide pervasive Layer 2 services over a network of Layer 3 IP routing? The IP cloud formed by the Tier 1 & 2 switches would be used to create an MPLS cloud and deploy services such as VPLS (Virtual Private LAN Services) providing virtual Layer 2 circuits (pseudo wires) over the Layer 3 cloud. After a full mesh of VPLS pseudo wires has been configured between all Tier 2 switches you can begin to provide Layer 2 connectivity from any Tier 2 switch to another. Sound complicated? That’s because it is!

Figure 8 - Scaling Tier 1 with IP + MPLS + VPLS

In Figure 8 above, the data center network has been set up as a VPLS-over-MPLS-over-IP cloud. Once that foundation is in place, I need to configure a full mesh of Layer 2 VPLS pseudo wires between all Tier 2 switches. How many pseudo wires do you need to configure? You can use this formula, where N equals the number of Tier 2 switches: N * (N-1) / 2. And, for each new Tier 2 switch you add, you will need to go back to every other Tier 2 switch and configure a new set of pseudo wires to the newly added switch. Not exactly Plug & Play, is it?
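To put some rough numbers on that formula (the switch counts here are hypothetical):

```python
# Full mesh of VPLS pseudo wires between N Tier 2 switches: N * (N - 1) / 2.
def pseudowires(n):
    return n * (n - 1) // 2

for n in (8, 16, 32, 64):
    print(f"{n} Tier 2 switches: {pseudowires(n)} pseudo wires total; "
          f"adding switch #{n + 1} means configuring {n} new ones")
# 8 -> 28, 16 -> 120, 32 -> 496, 64 -> 2016
```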

Rather than replacing Layer 2 with Layer 3, and then trying to overlay Layer 2 services over the Layer 3 … wouldn’t it be better to simply evolve Plug & Play Layer 2 switching with more Layer 3 like forwarding characteristics? This is exactly the idea behind TRILL. :-)

Now let’s finish up with a look at how Tier 2 scales under Classic Ethernet. Remember from earlier that having any more than (2) switches at Tier 1 makes no sense in Classic Ethernet, thanks to flooding loops and Spanning Tree. Because of this (2) switch design constraint, the density potential of the Tier 1 switch you choose becomes a key factor in determining the scalability of the network.

Figure 9 - Scaling Tier 2 with Classic Ethernet

In Figure 9 above, because Tier 1 cannot have any more than (2) switches there will always be a clear trade off between scaling bandwidth and scaling size. If I choose to give more bandwidth to Tier 2 it means less available capacity for adding more Tier 2 switches. This is largely the result of not being able to scale out Tier 1 horizontally with Classic Ethernet. If the rigid (2) switch design constraint were removed from the equation you would suddenly have a lot more flexibility in how you can scale, and the trade off between bandwidth and size becomes less of a black and white matter. Gaining this valuable flexibility in the data center switching design is a key promise behind the evolution to TRILL.
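To put rough, purely hypothetical numbers on that trade off: with a fixed number of ports available across the (2) Tier 1 switches, every additional uplink you give each Tier 2 switch directly reduces how many Tier 2 switches the pair can support:

```python
# Hypothetical example: a pair of Tier 1 switches with 128 ports each
# facing Tier 2 (256 ports total). Uplink bandwidth per Tier 2 switch
# trades directly against the number of Tier 2 switches you can attach.
TIER1_PORTS_TOTAL = 2 * 128

for uplinks_per_tier2 in (2, 4, 8):
    max_tier2 = TIER1_PORTS_TOTAL // uplinks_per_tier2
    print(f"{uplinks_per_tier2} x 10GE uplinks per Tier 2 switch "
          f"-> at most {max_tier2} Tier 2 switches")
# 2 uplinks -> 128 switches, 4 -> 64, 8 -> 32
```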

The stage has been set for the next evolution of data center switching

The next generation dynamic data center needs to have tremendous design flexibility to build highly scalable, agile, and robust Layer 2 domains. To get there, classic Ethernet switching as we know it today needs to evolve in the data center. The most successful solutions will be those that address many of the challenges facing the data center today, not just bandwidth.

What are the challenges that should be addressed?

  • Plug & Play simplicity
  • MAC address scalability*
    • a more efficient method of MAC learning*
    • a hierarchical approach to Layer 2 forwarding*
  • Minimal configuration requirements
  • All links forwarding & load balancing – No Spanning Tree
  • More bandwidth for all traffic types, including Multicast, not just Unicast*
  • Fast convergence
  • Layer 3 virtues of scalability and robustness with the Plug & Play simplicity of Layer 2
  • Flexible and agile scaling out of either Tier 1 or Tier 2
  • Configuration simplicity for automation with open APIs

OK. Remember the goal here was to “set the stage” with a basis level understanding of why classic Ethernet switching needs to evolve for the next generation data centers. Please stay tuned for further detailed discussions on data center switching and TRILL.


Future topics may include:

  • TRILL technical deep dives
    • Conversation based MAC learning
    • Configuration examples
    • Design examples
  • How and where does FCoE and Unified Fabric fit into this picture?
  • Industry news & analysis
  • Suggestions?

Presentation Download

I’m not sure who’s crazier: You (for reading this entire post without falling asleep)? Or Me (for writing such a long post)? Anyway, I have a reward for your time and attention! You get to download the presentation I developed for this post. There are some extra slides that provide a sneak peek into my next posts, an Introduction to TRILL.

PDF: http://internetworkexpert.s3.amazonaws.com/2010/trill1/TRILL-intro-part1.pdf

Original Power Point with Animations: *Please ask your Cisco representative


Disclosure: The author (Brad Hedlund) is an employee of Cisco Systems, Inc. which plans to have TRILL based solutions embedded into the company’s data center switching product line.
Disclaimer: The views and opinions expressed are solely those of the author as a private individual and do not necessarily represent those of the author’s employer, Cisco Systems, Inc. The author is not an official media spokesperson for Cisco Systems, Inc.

Special Thanks to my colleagues at Cisco: Marty Ma, and Francois Tallet. Both of whom are deeply involved with Cisco’s implementation of TRILL* and took precious time to provide me with a 1:1 education about some of the topics covered here.

© Copyright 2010, Brad Hedlund, INTERNETWORK EXPERT .ORG


About the Author

Brad Hedlund (CCIE Emeritus #5530) is an Engineering Architect in the CTO office of VMware’s Networking and Security Business Unit (NSBU). Brad’s background in data center networking begins in the mid-1990s with a variety of experience in roles such as IT customer, value added reseller, and vendor, including Cisco and Dell. Brad also writes at the VMware corporate networking virtualization blog at blogs.vmware.com/networkvirtualization

Comments (46)


  1. Ian says:

    Really, good writeup Brad. Been wanting to do a benefits of TRILL paper myself, but just have not gotten there.

  2. Vijay says:

    Excellent post Brad. I was super excited when I came across MCEC/vPC and didn’t realize it had the limitations you mentioned.

    Not sure, but are the switches in figure 5 (right hand side) labelled incorrectly? Should it be sw3 & sw4 instead of sw3 & sw3?

    Can’t wait to read your future posts on TRILL :-)

  3. Xavier says:

    Hi,

    first thank you for your impressive article.
    You seem to make a clear distinction between vPC and TRILL.

    I thought vPC was a TRILL like protocol, but proprietary.
    As I understood, switches doing vPC are learning the layer 2 topology and communicating it to the other peer.
    Sounds a bit like TRILL, doesn’t it ?

    Maybe vPC is just one step towards the ultimate goal.

    • Brad Hedlund says:

      Xavier,
      vPC and TRILL set out to accomplish a similar goal, that being multiple active paths from an access layer switch with no STP blocking links. However, how the two set out to achieve that goal is very different. From a protocol perspective, there is nothing about vPC that makes it like TRILL. vPC helps two switches to behave as one, hence a vendor specific proprietary protocol between two vPC switches. TRILL, on the other hand, helps each individual switch behave more like a router in its forwarding behavior, participating in a much broader network of other switches operating in a similar manner.

      By the way — as I alluded to at the beginning of this article, I’m using the word “TRILL” here in the generic sense, as we will likely see some vendors (such as Cisco) initially implementing TRILL like capabilities as found in RFC 5556 but with additional enhancements that make it a proprietary implementation.

      Thanks for the comment!

      Cheers,
      Brad

  4. Great write up Brad – Always good to see you publish something.. and the pretty pictures really helped convey the story you were positing!

    We should sit down and discuss TRILL sometime soon :)

  5. Brad, first off, great post with lots of excellent detail.

    Second, I too am a huge proponent of TRILL and am enthusiastic about industry solutions becoming available.

    Lastly, your comment “as no switch vendor has engineered this, not even Cisco” regarding MCEC is, strictly speaking, correct (regarding vPC), but many of these advantages are available in the 3750 line (with up to nine chassis forming a logical switch entity).

    I realize the 3750 is not generally considered front line data center gear, but I have encountered environments where the requirements were conducive to its consideration. In one smallish data center, the oversubscription allowances would have permitted the use of a stack of four 3750s with GEC 10Gb uplinks to the core and 4x1Gb GEC towards each server access switch. In this way having an aggregation layer width of 4x (potentially even 8x, although this starts to get silly) was feasible.

    Not disputing your conclusions, just noting an aside to that one comment. I look forward to the next installment.

  6. Camden Ford says:

    Hey Brad,

    Fair write up on TRILL, but I don’t think you covered enough of the implementation side of things. In theory, all of those benefits sound wonderful, but what is the cost……will I need a supercomputer on each switch to manage all of this? How complicated will the next-generation Ethernet switch chips become? What will the power be?

    It seems to me that Ethernet is becoming a bit like Cisco IOS…..by this I mean that it is a legacy architecture that continues to be enhanced over time. IOS has a ton of great features that had been developed over a 15 year span…..but the basic architecture was defined long ago and ultimately has become somewhat unwieldy…..hence the need to start from scratch with NX-OS (or have I assumed incorrectly as to why NX-OS was necessary).

    Ethernet is a great technology, but as the technology evolves….more and more features, functions, capabilities, standards, etc. are piled on top. Each time, the technology becomes much more complex, power hungry, and complicated to manage (hence the next CCIE exam may be a Doctoral Thesis). Many folks in the industry like to talk about how “Ethernet always wins”. The question is “why”….the answer is that Ethernet won the datacenter because it was cheap and unmanaged. Now, switch and network management are becoming so complex….it is quickly becoming the largest percentage of Opex in the datacenter…..hence the push for automation……when does it become too complex, too high power, too unwieldy, too hard to manage? When is it time for the NX-OS transition for Ethernet?

    • Brad Hedlund says:

      Camden,
      You sound a bit disgruntled with the success of Ethernet. Why is that?
      I’ll go ahead and respond to some of your more “interesting” statements :-)

      will I need a supercomputer on each switch to manage all of this?

      No, not at all. All we are doing here is simply enhancing the way Layer 2 MAC learning and forwarding behaves in the data center environment. We are not proposing that each switch calculate orbits for NASA.

      How complicated will the next-generation Ethernet switch chips become? What will the power be?

      Customers generally are not worried about how difficult it was for the chip designer to make the chip, so why are you? Customers care about value and simplicity, two key reasons why Ethernet is still successful as a data center interconnect. What will the power be? Again, customers are worried about power across the entire data center: servers, storage, network, cooling, etc. If you add all of that up, the power used by an Ethernet chip or an entire switch is chump change. The real question you should be asking is: by implementing an efficient and highly scalable data center fabric that enables high percentages of virtualization, how much power is that saving?

      Many folks in the industry like to talk about how “Ethernet always wins”. The question is “why”….the answer is that Ethernet won the datacenter becuase it was cheap and unmanaged

      Glad you are acknowledging that as fact and asking “why”, rather than debating the premise. It’s true, Ethernet and IP always win, now, and into the foreseeable future. Both get the job done, are easy to innovate, and have a huge ecosystem of vendors and workforce knowledge behind it. Until there is a clear and present problem statement why people should start ripping out Ethernet, it aint gonna happen.

      when does [Ethernet] become too complex, too high power, too unwieldy, too hard to manage?

      I wouldn’t hold your breath, or gamble your paycheck on it ;-)

      Cheers,
      Brad

  7. Camden Ford says:

    Hi Brad, (just a note back to you…no need to post)

    Thanks for posting….wasn’t sure you would. I am not disgruntled in any way….nor am I blinded by any specific technology and I don’t drink the company kool-aid. I always try to look at technology, solutions, and markets from the customer perspective…..as in, what would I do if I were in their shoes……

    History says that the Ethernet market will begin to fragment based on the needs of the solution….I am merely asking the question as to when this will begin to happen. The introduction of CEE is the first shot. Some switches will be CEE and some will not….next comes the flow-based flow control, full QoS for some, etc. The world is a changing….it’s not your father’s Ethernet anymore…..

    All comments are meant in good spirit ….I have many friends at Cisco and have no concerns that you will continue to do well.

  8. Sorry for the late comment, but it’s hard finding anyone willing to discuss TRILL’s strengths or weaknesses. TRILL and OTV (another “recent” (circa 2007) Cisco “invention”) both seem to be steps in the wrong direction. Firstly, neither TRILL nor OTV truly solve Ethernet’s scalability problems, which stem from its “flat identity” address space and automatic data-plane learning.

    1) TRILL/802.1aq creates routable topology for Ethernet bridges. Firstly, the same could have been accomplished by tunneling Ethernet over IP – implementing basic ID/location split by means of routable header addition. Simply enabling MAC address learning on the multipoint IP tunnels (along with IP-multicast based tunneling for ethernet broadcast) could emulate most of TRILL’s functionality.

    2) TRILL/OTV do not solve the problem of flat MAC address space scaling. While TRILL presents some sort of MAC-in-MAC style address stacking, the edge devices’ CAM tables still have to grow in proportion to the number of end-point devices. This is a direct result of “full-mesh” data-plane connectivity in Ethernet.

    3) OTV implements the above described solution for tunneling Ethernet over an IP packet switched network. However, the data-plane learning has been replaced by control-plane broadcasting using a link-state protocol, as opposed to data-plane flooding. While this is promoted as a big benefit, after short consideration this does not appear to be a huge savior. Simply imagine that the good old unknown unicast flooding in the data plane could be either limited or truncated, e.g. by flooding only frame headers.

    4) In OTV/TRILL spanning tree has been removed from the network core but remains at the edge to keep “legacy” L2 segments loop free. The designated forwarder design offered in TRILL/OTV has approximately the same load-balancing issues that the original STP had (e.g. per-VLAN forwarders as equivalent of per-VLAN STP instance). This problem can not be simply resolved as long as the basic concept of “broadcast cable” and “implicit MAC learning” is maintained in Ethernet.

    5) Just when you would think that packet routing has better potential for traffic engineering than spanning tree, recall that TRILL’s traffic engineering is purely metric based, just like any IGP’s. While achieving “close-to-optimal” traffic engineering using IGP weight manipulation is possible, optimal solutions require explicit data-plane traffic engineering approaches such as MPLS TE. Sadly enough, we have to reinvent the things that were created for IP networks like a decade ago and re-implement them on pure Ethernet encapsulation.

    The core problem is that the existing Ethernet model of identity-based addressing and automatic data-plane learning simply does not scale until you rethink some of the underlying mechanics. During the last few years, some research results have been obtained in this area: protocols such as SEATTLE, Smartbridges, ROFL and VRR offer helpful insights into paths to grow Ethernet networks. Unfortunately, IETF and IEEE decided to reinvent the wheel, solving a few problems but mainly keeping the fundamental concepts intact. It is worth noting that a much simpler yet still functional tunneling solution could be used in place of TRILL, if you don’t get stuck in the “VPLS” trap.

    And even with VPLS, numerous optimizations are possible such as using multipoint LSPs and BGP-based discovery (offering perfect control-plane scaling). The full mesh of pseudowires is a direct reflection of Ethernet’s full-mesh connectivity in the data plane. Thinking from a data-plane perspective, maintaining forwarding entries for pseudowires is the same as maintaining forwarding entries in TRILL where MAC addresses are mapped to their egress locations.

    • Brad Hedlund says:

      Hi Petr,

      You frequently put TRILL and OTV together like “TRILL/OTV” or “OTV/TRILL” as if these technologies are one and the same, or interchangeable. The fact is, OTV and TRILL, while complementary, are each intended to solve two very different sets of problems.

      Simply put, TRILL was meant to address a bandwidth problem (STP blocking links). With that, there is also the ancillary benefit of building larger Layer 2 Ethernet domains achieved from horizontally scaling Tier 1, as I describe in this article.

      OTV was designed to solve a Layer 2 extension complexity problem between data centers. There is a very real market opportunity here for Cisco to bring some value to customers.

      neither TRILL nor OTV truly solve Ethernet’s scalability problems

      Again, let’s set OTV aside for a second. OTV was designed to sit at the edge of the data center and easily extend the reach of the L2 domain; it wasn’t intended to replace the core data center switching technology.

      With regards to TRILL, I agree with your statement above. TRILL does not address all of the scalability problems facing large scale data centers. TRILL does address bandwidth and replacement of STP, but it does nothing for the “flat” MAC address learning, the result of auto-learning from broadcasts in the data plane.

      edge devices CAM tables still have to grow proportional to the number of end-point devices

      With regards to the current TRILL specification this is true. If you have 20,000 hosts (virtual or physical) in the data center, every switch will eventually need to store all 20,000 MAC addresses. This is true of Ethernet today and TRILL does not change that.

      Cisco engineers agreed this was a shortcoming of TRILL and designed an enhancement to this behavior in the implementation of TRILL* for Cisco Nexus switches. If you download the PDF linked in this article you will see this explained on the slides titled “MAC Learning – Evolved”.

      With Cisco’s implementation of TRILL*, the MAC learning process has been optimized to learn only the MACs needed for conversations using the switch. The edge switches will not learn all 20,000 MACs from my example above. Rather, the edge switch will only need to store MACs in its CAM table for active flows using the switch, which in most cases will be a very small percentage of the total MACs present in such a large deployment.

      The other scalability problem is the ever increasing burden of broadcasts as the L2 domain scales to thousands and thousands of end hosts, each host issuing ARP broadcasts to resolve IP/MAC pairings.

      The broadcast burden largely comes from how the end host is told to behave, not necessarily the switches themselves. Can things be done differently to reduce the broadcast load an end host places on the network? Sure. But changing end host behavior would be a monumental effort and largely orthogonal to the L2 switching behavior anyway.

      The Cisco implementation of TRILL* does help to minimize the propagation of broadcasts with an Auto VLAN Pruning capability. The Edge switch will tell the Core which VLAN ID’s are active on the switch, and therefore only broadcasts on the active VLANs will be received by that Edge switch.

      Ultimately we are faced with these challenges:

      How can we scale very large Layer 2 domains (large number of end hosts) given the entrenched broadcast based behavior of Ethernet end hosts?

      How can we safely extend the reach of a layer 2 domain between or within data center facilities over any transport with minimal configuration and operational complexity?

      Considering that, I strongly disagree with your leading statement:

      TRILL and OTV both seem to be steps in the wrong direction

      I would certainly argue that Cisco’s implementation of TRILL*, and OTV, both are two very big positive steps in the right direction. ;-)

      Cheers,
      Brad

      • David Coulthart says:

        Brad,

        You state:

        “The Cisco implementation of TRILL* does help to minimize the propagation of broadcasts with an Auto VLAN Pruning capability.”

        Is this indeed specific to Cisco’s TRILL* implementation or is this part of the TRILL standard (or being proposed as part of it) or is this actually MVRP?

        While currently all of our switches are set to VTP transparent mode & we configure VLANs statically, I’m regaining interest in automatic VLAN distribution & pruning thanks to an explosion of VLANs across both our data center & campus networks. But I would hate to see this functionality be limited yet again to a proprietary protocol. I’m also concerned if this is TRILL specific since the 802.1aq vs. TRILL fight has yet to be decided.

        Thanks,
        Dave

        • Brad Hedlund says:

          Dave,
          Cisco’s implementation of TRILL* is now called Cisco Fabric Path and indeed is Cisco proprietary. The Auto VLAN pruning capability in Fabric Path is therefore also proprietary in the same sense.

          Given that Cisco and Radia Perlman are heavily involved in the TRILL (RBridges) effort, compared to no Cisco involvement in 802.1aq, I don’t see TRILL being a loser here. That may sound a little conceited but it’s an obvious point to make. If the #1 switch & router maker is not at all involved in your switching & routing standard, I can’t see how that is a formula for widespread success.

          Cheers,
          Brad

  9. Brad,

    Thank you for your response. Apparently, I made a semantic mistake mixing OTV and TRILL together, but my reason was pointing out an obvious similarity – both OTV and TRILL rely on packet routing in the “core” while maintaining classic Ethernet behavior at the edge.

    My first argument against TRILL was that reinventing routing for Ethernet sounds silly when you already have it implemented for IP networks. With respect to that, OTV sounds a bit more “reasonable” than TRILL as it attempts to reuse existing solutions. Furthermore, even though “routing” definitely scales better than “bridging”, it does not completely solve traffic engineering problems, as could be seen in pure routed networks.

    Now a few words about conversation-based MAC address learning. This “optimization” procedure clearly relies on “special” traffic patterns in the data center. While these are indeed present (there are research papers on this topic, e.g. “The Nature of Datacenter Traffic: Measurements & Analysis”), reliance on such behavior is not universal and is not guaranteed to work in all scenarios. Besides, the same “compartmentalization” functionality could have been enforced e.g. by using community-type private VLANs within the same Layer 2 broadcast domain (this isn’t totally plug and play but at least much more secure in terms of expected behavior).

    Finally, for Ethernet scalability and native broadcast (full-mesh in data plane) optimization, I highly recommend looking into the paper http://www.cs.princeton.edu/~chkim/Research/SEATTLE/seattle.pdf , which provides an example of a highly scalable adaptation of Ethernet based on distributed hash table functionality. The publication also references a lot of other interesting papers, such as ROFL or Smartbridges.

    Thanks for your time,

    Petr

    • Brad Hedlund says:

      Hi Petr,
      Thank you for the comments. The research paper you link to looks interesting, I’ll be sure to read it. No doubt TRILL and OTV are not the end-all final solution for everything. Surely over time Ethernet will continue to evolve to meet customer needs and scalability requirements. That doesn’t mean we shouldn’t do anything until a perfect solution is finally ready. It makes a lot of sense to look at how we can incrementally or perhaps even significantly make improvements with things like TRILL and OTV along the way.

      reinventing routing for Ethernet sounds silly when you already have it implemented for IP networks

      If Cisco’s implementation of TRILL* had set out to adopt all of the same behavior and characteristics of routed IP networks, sure, I could agree completely with the “silliness” you cite. And I doubt Cisco would have thought it was a good idea either, or invested millions in silicon to support it. But the fact is, Cisco’s implementation of TRILL* is not simply reinventing IP routing. Rather, it seeks to gain the much desired multi-path forwarding behavior of IP routing, while preserving the plug & play characteristics of Ethernet (required by customers) that a routed IP network simply cannot deliver. There is also the critical element of configuration simplicity that must not be overlooked. Unfortunately, not every customer has a Petr Lapukhov on staff 24×7 to configure and troubleshoot their network. One of the key design goals in Cisco’s implementation of TRILL* is exactly that, a very straightforward and simple configuration customers are accustomed to in traditional Ethernet. So considering all of that, TRILL* makes a lot of sense, in my humble opinion, as well as Cisco’s ;-)

      It’s the overlaying of L2 VPNs on an IP network with a complex configuration, just to make IP behave more like Ethernet, now THAT sounds silly to me. :-)

      Cheers,
      Brad

  10. dave says:

    Hello Brad,

    Interesting article from a technical perspective.

    But I’d like to pick you up on a couple points:

    your quote on MCEC:

    “As a testament to the engineering complexity of MCEC, consider that Cisco is the only major switch vendor to successfully engineer and support MCEC with (2) fully featured Layer2/Layer3 switches. And NO switch vendor, not a single one, has successfully engineered, sold, and supports a (4) switch MCEC cluster. Some switch vendors are hinting at such capabilities as a possible future roadmap in their data sheets. All I have to say about that is … Good Luck!”
    Now I’m a bit confused here; there’s a product (commercially available and shipping) that claims to support a 10 switch based virtual chassis?

    Your two arguments against MPLS, I think that your discounting of MPLS is not exactly playing fair, you mention:

    1> Firstly you use that old N^2 problem and then dismiss MPLS with “Not exactly Plug & Play, is it?”
    Have you considered BGP based VPLS/Pseudowires as they go some way to solving the plug and play argument.

    2> Your second point against MPLS you indicate that MPLS is a Layer 3 technology, your words “Rather than replacing Layer 2 with Layer 3, and then trying to overlay Layer 2 services over the Layer 3 …”
    The fact is that in an MPLS network L3 is simply the control plane for MPLS, MPLS transport is merely a shim (or demux) header/label.

    Now I’m not saying that MPLS is a better solution for “scaling out tier 1″ but I suggest that solutions should be pushed on their merits, not misinformation on the alternates.

    Regards

    dave

    • Brad Hedlund says:

      Dave,

      Cisco also has a product that can combine 9 switches into a so called “virtual chassis” — it’s called StackWise Plus for the 3750 series Catalyst switches, or 3100 series blade switches. The other product you’re probably thinking of is Juniper’s “Virtual Chassis” capability for their EX 4200 switches. In either case, be it the Juniper EX 4200 or the Catalyst 3750, these are small 1RU fixed configuration switches that are NOT a robust modular platform for Tier 1. For example, try failing power on the master switch of the “virtual chassis”, see how long it takes to fully converge, and then tell me if you think it makes sense to have that technology positioned in Tier 1.

      My statement remains true, but I will be more specific: Cisco is the only major switch vendor to successfully engineer and support MCEC with (2) fully featured *MODULAR* Layer2/Layer3 Tier 1 switches. And NO switch vendor, not a single one, has successfully engineered, sold, and supports a (4) switch MCEC cluster of *MODULAR* Tier 1 class switches.
      Disagree?

      Have you considered BGP based VPLS/Pseudowires as they go some way to solving the plug and play argument.
      The fact is that in an MPLS network L3 is simply the control plane for MPLS, MPLS transport is merely a shim (or demux) header/label.

      So, you’re saying I need to change my Tier 2 access switch from being a standard Layer 2 device, to a Layer 3 device running a routing protocol like BGP? Thanks for proving my point :-)

      Cheers,
      Brad

      • I believe Dave’s point about VPLS was that *with time* the once simple Ethernet bridges will:

        1) By means of TRILL, become a full-scale router with traffic tunneling features (first boost in complexity). You may also add lossless Ethernet and other DC-specific QoS features here. That is not simple already, though the control plane still remains rather “lightweight”.
        2) As networks grow, face the same limitations that high-speed packet networks have had for a long time (ECMP, traffic engineering by means of metric weights, temporary loops, slow reconvergence).
        3) Reinvent the idea of (G)MPLS Traffic Engineering, adding MPLS data plane, then MPLS TE extensions to IS-IS and using MP-BGP (auto)configuration :)
        4) Implement something else from the routing world, such as Loop Free Alternates and IGP-based fast re-routing trying to avoid complexities of GMPLS.
        5) Etc etc

        In the end, in the name of saving the sacred cow of Ethernet plug-and-play capabilities we have to hit the same bumps that routed networks have been hitting since 70s. At the same time, a simple modification of “classic” Ethernet (e.g. addition of explicit login/logout capabilities such as found in Fibre Channel or name directory service) could have saved a lot of “useless” efforts right now and be a step in “better” direction.

        I believe there are always some tradeoffs (you may call it Heisenberg’s inequality, though I’m against applying quantum mechanics to the real world ;). You cannot make something easy to configure and not sacrifice other things in exchange – e.g. scalability or fast convergence. However, I do admit that if there is a clearly defined growth horizon for the modern data centers, we may let them go with some “basic” routing features implemented as “workarounds”. But if we expect the data centers to grow “infinitely”, that’s the same path the Internet and packet networks have gone through for the last 20 years.

      • Brad, and I almost forgot – thanks for keeping this interesting discussion going!

        Petr

      • Rich says:

        Hi Brad,

        I think that I’m going to agree with Dave here that you’re using generalisations to prove your point.

        “In either case, be it the Juniper EX 4200 or the Catalyst 3750, these are small 1RU fixed configuration switches that are NOT a robust modular platform for Tier 1. For example, try failing power on the master switch of the “virtual chassis”, see how long it takes to fully converge, and then tell me if you think it makes sense to have that technology positioned in Tier 1.”

        I typed ‘juniper virtual chassis resiliency’ into Google, and clicked on the first result (http://www.juniper.net/us/en/local/pdf/industry-reports/virtual-chassis-performance.pdf) and saw that network test had made the following comment on page 2:

        “A Virtual Chassis configuration took just 6 microseconds to recover from loss of power to its master switch.”

        To be fair, I’d suggest that 6 microseconds is pretty acceptable to me. Do you disagree?

        Cheers
        Rich

        • Brad Hedlund says:

          Rich,
          Fair enough, it would appear Juniper made some significant improvements to failure recovery since I last checked. Regardless, what difference does it make when you can’t position that technology at Tier 1? This “test” (paid for by the company being tested), again proves my point that Juniper virtual chassis is, as I said: “not a robust modular platform for Tier 1″.

          I wonder what’s taking Juniper so long to deliver this in a modular platform? Cisco has had this technology available since 2007, starting with VSS (Virtual Switching System) in the Catalyst 6500.

          Cheers,
          Brad

      • Peter Bernat says:

        Hi Brad,

        have you heard/read about H3C Intelligent Resilient Framework?

        cheers
        Peter

        • Brad Hedlund says:

          Peter,
          Yes, I am familiar with H3C “IRF” which is essentially H3C’s version of MCEC technology similar to VSS (virtual switching system) in the Cisco Catalyst 6500, and vPC (virtual port channels) in the Cisco Nexus 7000 & 5000 family.

  11. Peter Ashwood-Smith says:

    “Ethernet is a great technology, but as the technology evolves….more and more features, functions, capabilities, standards,etc. are piled on top. ”

    Actually TRILL is not an extension to Ethernet. The data path and OA&M are brand new and the control plane is ISO IS-IS; currently it has no OA&M, which of course Ethernet does have. None of the work is being done at the IEEE; it’s all IETF work, and you need all new ASICs to run TRILL (or NPUs).

    So you are getting your wish of ‘something brand new’ which of course has good and bad consequences.

  12. Paul Unbehagen says:

    Trill is one way of having intelligence to Ethernet forwarding, but so is IEEE 802.1aq Shortest Path Bridging. The IEEE has been working on a link state Ethernet control protocol for a couple of years now.

    It pretty much subsumes the functions of MSTP, RSTP, MMRP, MVRP and numerous others into an IS-IS based version of native Ethernet control. By that I mean it simply controls native Ethernet instead of inventing a whole new encapsulation. And it has very strong OAM ability with 802.1aq and Y.1731, which basically gives SONET-like OAM to Ethernet. A few pre-standard versions are actually deployed in a few places too.

    See here for more info:
    http://www.networkcomputing.com/next-gen-network/shortest-path-bridging-will-rock-your-world.php

  13. Flintstone says:

    Brad, I’m not sure if you know but TRILL sounds like something that Cabletron already used many years ago. Cabletron created something called SecureFast that used VLSP (Virtual Link State Protocol), which uses OSPF at Layer 2. This did away with STP and allowed a fully meshed topology. All the issues you mentioned were overcome and an RFC was even published for ratification. Just goes to show how bad STP really is?

  14. John Scaglietti says:

    Hi Brad

    I read with great interest your post; very well prepared and presented.
    My objective is to learn more about TRILL, having so far mostly had exposure to the IEEE 802.1aq variant.

    But before I come to TRILL I have a couple of comments…

    You state:
    “Cisco is the only major switch vendor to successfully engineer and support MCEC with (2) fully featured *MODULAR* Layer2/Layer3 Tier 1 switches. And NO switch vendor, not a single one, has successfully engineered, sold, and supports a (4) switch MCEC cluster of *MODULAR* Tier 1 class switches.
    Disagree?”

    Well, I realize you work for Cisco and Cisco’s marketing might would have you believe that Cisco invented MCEC (as you refer to it).
    Yet in all fairness MCEC was invented by Nortel back in 2001 under the name Split-MLT, as any of Nortel’s loyal customers will attest. True, Nortel is a failed company but the ethernet switching product line – and SMLT – have been taken over by Avaya who seem to be re-launching that offering with new vigour. So Cisco are definitely not the only ones to offer this (unless of course you consider Cisco the only “major” switch vendor..!)
    Furthermore, SMLT is still superior to vPC in many respects.

    As for no other vendor engineering and supporting a 4 switch MCEC I would expect this to change when Juniper launch their Virtual Chassis on their modular platform.
    Yet, in my experience, the problem you describe in figure 7 is not so much of an issue because if you need additional aggregation capacity you can always deploy additional MCEC clusters which can be interconnected together without introducing Spanning Tree and still using all links; though in your diagram this would mean that SW3 would be only connected to one MCEC cluster and SW4 to another MCEC cluster.

    Coming to TRILL, I’d be interested to take your view on the pros and cons of TRILL vs 802.1aq and why Cisco is going the TRILL route.
    As far as I can tell, both are very similar in that they rely on IS-IS to program the MAC tables (as OSPF does for IP routing tables).
    They both address the plug & play simplicity, eliminate Spanning Tree, flatten the network (and thus reduce latency), deliver fast failure recovery, are scalable and robust.
    The differences seem to be concentrated around the load balancing capabilities and the actual packet encapsulation used and the fact that one is an IETF standard and the other is IEEE.
    Whereas TRILL hashes any and all traffic across all available links, SPB is more conservative and will select a single path across the network for all traffic belonging to the same service instance. But in practice you do achieve load balancing with SPB as you provision additional service instances.
    SPB seems to have a better pedigree in that it is an evolution of other IEEE standards such as 802.1ah (MAC-in-MAC encapsulation) and 802.1ag (CFM), which were developed for the carrier Ethernet space.
    The CFM part means that SPB has powerful OA&M capabilities (much like what carriers get from MPLS and ATM), whereas TRILL seems to have none.
    Essentially, SPB’s more deterministic approach is a necessary trade-off in order to leverage those OA&M capabilities.
    So while the hashing capability of TRILL is very attractive, the downside is that troubleshooting a TRILL network where some conversations suffer performance degradation while other conversations (on the same edge VLANs) don’t is going to be painful.
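
    To put the load-balancing difference in rough pseudocode terms (a sketch of my own; the names and the hash are purely illustrative, not taken from either spec):

        import hashlib

        equal_cost_next_hops = ["core-1", "core-2", "core-3", "core-4"]

        def trill_style_next_hop(src_mac, dst_mac):
            # Hash each conversation across all equal-cost next hops,
            # so different flows on the same VLAN can take different paths.
            digest = hashlib.md5(("%s-%s" % (src_mac, dst_mac)).encode()).hexdigest()
            return equal_cost_next_hops[int(digest, 16) % len(equal_cost_next_hops)]

        def spb_style_next_hop(service_instance_id):
            # Pin all traffic for a given service instance (I-SID) to one
            # deterministic path; load balancing comes from spreading I-SIDs.
            return equal_cost_next_hops[service_instance_id % len(equal_cost_next_hops)]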

    I’m also surprised to learn that hierarchical MAC learning is not an integral part of TRILL, but rather a Cisco proprietary add-on.
    With SPBM (which leverages 802.1ah MACinMAC encapsulation) this is per standard.

    So it seems to me that TRILL is a bit of a messy kludge at the moment, with Cisco’s first implementation adding bits and bobs which certainly make sense but are not part of the standard, and that raises questions as to how interoperable TRILL implementations will be between different vendors implementing it.

    Anyway, I’ll stay tuned for more..
    Thanks again
    John Scaglietti

    • Brad Hedlund says:

      Hi John,
      I was waiting for someone to play the Nortel card :-) I didn’t say Cisco “invented” MCEC. I said: “Cisco is the only major switch vendor to *successfully engineer* MCEC with fully featured Layer2/3 modular switches”. Nortel Split MLT was a good first attempt at MCEC, but because Nortel SMLT was strictly a software-based implementation it wasn’t very robust or scalable. It’s one thing to get MCEC basically working; that’s the easy part. It’s another thing to engineer a solution that can handle many types of failure scenarios for both Layer 2 and Layer 3 protocols. To successfully engineer MCEC you need to have MCEC awareness in both hardware and software, such as with Cisco vPC or VSS. By the way, I’m really interested to hear more specifics on why you think SMLT is superior to vPC. Really, I’m all ears.

      though in your diagram this would mean that SW3 would be connected to only one MCEC cluster and SW4 to another.

      Right, which means you would need yet another Tier of switches above to be the L3 gateway and provide L2 interconnection between the two MCEC clusters. That’s a perfectly valid design but is obsolete now with FabricPath/TRILL.

      I’d be interested to take your view on the pros and cons of TRILL vs 802.1aq and why Cisco is going the TRILL route

      As for TRILL vs. 802.1aq – the TRILL problem statement and the RBridges solution were initially proposed to the IEEE by Radia Perlman; however, the IEEE decided not to adopt Radia’s proposal and instead took a different direction with 802.1aq. My guess is that the IEEE originally did not like the MAC-in-MAC encapsulation proposed in RBridges. Actually, the original 802.1aq drafts did not have MAC-in-MAC encapsulation; it wasn’t until several years later that a MAC-in-MAC variant of 802.1aq was drafted. Somebody at the IEEE must have said: “Oooops, maybe Radia Perlman was right all along”.

      I’m also surprised to learn that hierarchical MAC learning is not an integral part of TRILL, but rather a Cisco proprietary add-on.
      With SPBM (which leverages 802.1ah MACinMAC encapsulation) this is per standard.

      I don’t think you understand the TRILL (RBridges) proposal very well. Maybe you should give it another read. As I mentioned above, it was actually TRILL (RBridges) that has had MAC-in-MAC encapsulation from Day 1 … 802.1aq did not; that came later. So yes, TRILL does have “hierarchical” MAC addressing in the sense that the Core RBridges only need to know about the MAC addresses of the Edge RBridges. The Edge RBridges will learn ALL station MACs. The Cisco enhancement in FabricPath is something called “Conversation Based MAC Learning”, where the Edge FabricPath switch will not learn ALL MACs, but rather only the MACs required for the unicast conversations flowing through that switch.
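
      To make that concrete, here is a very rough sketch of the idea (my own pseudocode for illustration, not FabricPath code):

          class EdgeSwitch:
              def __init__(self):
                  self.local_macs = set()   # hosts attached to this switch's edge ports
                  self.mac_table = {}       # MAC -> edge port or remote switch ID

              def learn_local(self, mac, port):
                  self.local_macs.add(mac)
                  self.mac_table[mac] = port

              def receive_from_fabric(self, src_mac, dst_mac, ingress_switch_id):
                  # Classic flooding-and-learning would always record src_mac here.
                  # Conversation-based learning only records it when the frame is
                  # addressed to a locally attached host, so remote MACs that never
                  # talk to this switch's hosts never consume table space.
                  if dst_mac in self.local_macs:
                      self.mac_table[src_mac] = ingress_switch_id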

      raises questions as to how interoperable TRILL implementations will be between different vendors implementing it.

      Wait a minute, isn’t that the whole point of a *standard*? The vendors implementing a TRILL standard (whenever that might be) will all follow an agreed-upon standard implementation to ensure multi-vendor interoperability? No?

      Cheers,
      Brad

  15. SMLT says:

    Hi Brad,

    Good and interesting post, although I would take offence to the comment

    “As a testament to the engineering complexity of MCEC, consider that Cisco is the only major switch vendor to successfully engineer and support MCEC with (2) fully featured Layer2/Layer3 switches”.

    Nortel invented SMLT (upon which VSS/MCEC is based) back in 2001, and has been successfully building L2/L3 networks on the technology since. I think you need to step out of your Cisco Cloud more often :-)

    • Brad Hedlund says:

      One other reader already brought up Nortel SMLT in the comments above (John Scaglietti).

      Here again is what I said about Nortel SMLT in response to John:

      Hi John,
      I was waiting for someone to play the Nortel card. I didn’t say Cisco “invented” MCEC. I said: “Cisco is the only major switch vendor to *successfully engineer* MCEC with fully featured Layer2/3 modular switches”. Nortel Split MLT was a good first attempt at MCEC, but because Nortel SMLT was strictly a software-based implementation it wasn’t very robust or scalable. It’s one thing to get MCEC basically working; that’s the easy part. It’s another thing to engineer a solution that can handle many types of failure scenarios for both Layer 2 and Layer 3 protocols. To *successfully engineer* MCEC you need to have MCEC awareness in both hardware and software, such as with Cisco vPC or VSS.

  16. SMLT says:

    John/Brad

    SMLT was not software-only based; all Layer 2 and Layer 3 forwarding was carried out on the line cards in hardware, including Layer 3 load balancing for VRRP and OSPF. The MAC learning process, like on all Ethernet switches, was carried out in software, but once the MAC wasn’t learnt all further forwarding was carried out in hardware.

    Alex

    • Brad Hedlund says:

      Alex,

      The MAC learning process on Nortel and other switches might be in software, but that is not the case with Cisco switches. Cisco switches have always performed hardware MAC learning, which is far more robust. Obviously the data plane on a Nortel switch forwards in hardware just like any other switch, but the hardware is oblivious to the logical topology SMLT was building on top of it.

      Simple example: A server is attached via SMLT to two Nortel switches. The server sends a broadcast packet. Nortel switch #1 receives it, sends it to every other port in that VLAN including Nortel switch #2. Because the hardware on Nortel switch #2 is oblivious to the SMLT topology, it forwards (in hardware) the broadcast packet back to the Server that originated it and out to the upstream network again. Nortel SMLT implementations are notorious for creating duplicate packets. Normally this goes unnoticed in a small network, but as the network grows lots of strange performance problems begin to surface.

      • John Scaglietti says:

        Brad

        Sorry, I’m a bit late in replying to your reply to my earlier post…
        I am quite familiar with the Nortel SMLT offering and have spent considerable time evaluating SMLT vs vPC (and VSS, though the latter is not MCEC).
        As Alex has pointed out already, there was no concept of software forwarding (slow path) on any of the Nortel platforms which implemented SMLT. The software populates the hardware forwarding records (MAC tables, ARP entries, IP routes and multicast (S,G) records), but traffic is ultimately always switched in hardware.
        In the original ERS8600 implementation, software was also called upon to modify those hardware forwarding records following a link or node failure in the MCEC cluster, but the solution still consistently delivered sub-second recovery times (except in scaled environments where the tables were very large).

        Now the example you give in your reply to Alex, below, is not correct:

        >”Simple example: A server is attached via SMLT to two Nortel switches. The server sends a broadcast packet. Nortel switch #1 receives it, sends it to every other port in that VLAN including Nortel switch #2. Because the hardware on Nortel switch #2 is oblivious to the SMLT topology, it forwards (in hardware) the broadcast packet back to the Server that originated it and out to the upstream network again.”

        This is not true: the hardware forwarding records are programmed in a way that is consistent with the active SMLT topology. Traffic arriving on the IST (equivalent to the vPC peer link) simply cannot be switched or flooded out of an active SMLT link (equivalent to a vPC member link), where “active” means the corresponding SMLT is also up on the IST peer switch.
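
        A tiny sketch of that rule (my own pseudocode, not Nortel or Cisco code):

            def flood_ports(ingress_port, vlan_ports, ist_port, smlt_ports_active_on_peer):
                # Flood to every port in the VLAN except the one the frame came in on.
                out = [p for p in vlan_ports if p != ingress_port]
                if ingress_port == ist_port:
                    # Frames arriving on the IST / peer link are never sent out of an
                    # SMLT link that is active on the peer switch, because the peer
                    # has already delivered the frame on that link aggregation group.
                    out = [p for p in out if p not in smlt_ports_active_on_peer]
                return out

        So a broadcast relayed over the IST is not flooded back down to a dual-homed server.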

        >”Nortel SMLT implementations are notorious for creating duplicate packets.”

        Really? That’s news to me. If that were true, SMLT would be unusable.

        Still, in comparing SMLT to vPC you are in fact comparing older Nortel platforms (ERS8600) with the Nexus, and that is hardly a fair comparison.
        You should raise your sights to the latest incarnation of SMLT in the Avaya VSP9000 platform, where MAC learning is hardware-assisted and the hardware is now able to re-route traffic following link or node failures in the MCEC cluster, giving sub-50ms recovery times even in scaled environments. I have seen this in a recent bakeoff.

        You were also very keen on hearing why I felt SMLT still had the edge over vPC, and these are the main points I have:
        - On vPC, routing protocol peerings over a vPC VLAN (which is also carried over the vPC peer link) are not supported (even if using the vPC peer-gateway feature, which is equivalent to Nortel’s RSMLT). This is because traffic might arrive on the vPC peer link, and vPC is unable to IP route traffic from the vPC peer link out of an active vPC link. Nortel only had this issue on their lower-end SMLT platform (ERS5500), never on the higher-end platform (ERS8600), where typically SMLT clusters were connected back to back with OSPF enabled on a single OSPF VLAN (RSMLT was required on that VLAN), allowing IP routing between SMLT clusters (Cisco can only do this with VSS today).
        - vPC has issues with single-attached devices in a vPC VLAN; this practice is discouraged by Cisco. This was never an issue with SMLT.
        - IP multicast support over vPC is still not complete (PIM-SSM is not supported) and there is no distribution of IP multicast streams across the vPC peers (only one of the vPC peers, the primary, is selected to forward all streams); failover times for IP multicast are not sub-second. With SMLT, both PIM-SM and PIM-SSM are supported and multicast streams are distributed across both nodes forming the MCEC cluster; failover times are the same as for unicast traffic. On SMLT you can even have a single logical PIM RP running on both nodes forming the cluster (in a kind of anycast RP mode, though MSDP is not required), which ensures no PIM-SM reconvergence in case of RP failure. Re-designing a protocol such as PIM to operate over MCEC is very complex; vPC still has some catching up to do.

        So I have to insist! Cisco is NOT the only vendor to successfully engineer MCEC.

        About TRILL/802.1aq: yes, you were right on both scores.
        The IEEE standard has two separate flavours: SPBV, which does not have any MAC-in-MAC capability, and SPBM, which uses 802.1ah MAC-in-MAC encapsulation. SPBV is a bit of a compromise for hardware which is not MAC-in-MAC capable; to date I have not heard of any vendor implementing SPBV; the real interest is in SPBM, which can be compared to TRILL.
        The recent NANOG event last October did pitch TRILL against 802.1aq, and I found all the info I was looking for there.
        It is interesting that vindicating Ms Perlman is a recurring argument from the TRILL camp. I don’t know exactly why she and the IEEE disagreed, but clearly TRILL’s design goal of providing massive multipathing requires a TTL field for loop mitigation, which in turn requires a new encapsulation.
        Yet that must have been incompatible with the IEEE’s desire to preserve prior IEEE standards such as 802.1ah (MAC-in-MAC) and 802.1ag (OA&M), which relies on 802.1ah encapsulation.

        As far as I can tell, the two standards are substantially similar in what they achieve, and the differences represent different trade-offs. The biggest difference is TRILL’s greater multipathing flexibility, which it pays for with a new encapsulation (requiring new chipsets) and a lack of OA&M; those, in turn, are SPBM’s greatest strengths.

        Incidentally, Cisco’s FabricPath implementation apparently uses a different encapsulation than that defined by TRILL (does the Nexus need to re-spin chipsets to do TRILL?), so initially it is more of a proprietary implementation anyway.

        Anyhow, I think we just have to wait and see how these variants play out in the market.

        Best regards
        John Scaglietti

  17. SMLT says:

    typo above the line:

    “but once the MAC wasn’t learnt all further forwarding was carried out in hardware”

    should read

    “but once the MAC was learnt all further forwarding was carried out in hardware”

  18. Peter Ashwood-Smith says:

    Interesting discussion. I can add a few points from inside the 802.1aq camp.

    802.1aq’s mac-in-mac mode, called SPBM, was first born in my lab in its proprietary form, which was Nortel’s PLSB. It was designed by about 5 individuals and went through a number of flavors before the design settled, and we shipped it in the years leading up to Nortel’s bankruptcy. There are a number of live networks running it, and one has 60+ nodes. There are DC and metro deployments. There was never any question what data path to use; mac-in-mac was in our hardware and thoroughly tested top to bottom, and we certainly did not consider using something new that had no OA&M (and still doesn’t) and no I-SID (and still doesn’t). We certainly did not need to learn about encapsulation from anybody else .. come on, get serious .. mac-in-mac was preceded by a somewhat similar proprietary format, OEL2. The first standard version of IEEE 802.1aq was based on only S-tag encapsulation and no link state (but had shortest paths), and we brought the mac-in-mac mode and link state proposals to the IEEE as proposed solutions. The work was based on many years of real live deployment experience with a proprietary and then a standard encapsulation in our metro products.

    Anyway, speaking of 802.1aq SPBM, we just did interoperability tests on real honest-to-goodness switches .. what a concept. If anybody wants to see the actual network tested (37 nodes, 5 physical, 2 real vendors), the screen shots, the CLI commands executed, the network diagrams, etc., have a look below. This is all the real deal: line-rate forwarding, real ASIC/NPU implementations, equal-cost paths, with .1ag OA&M all up and running.

    http://www.ieee802.org/1/files/public/docs2010/aq-ashwood-interop1-1110-v02.pdf

    A second round will be done in the early new year with many more physical nodes.

    • Kalhas says:

      I always thought Nortel engineers came up with the ideas and implemented them only to lose the starting edge and let other companies take over the proposals and refine them and tune them to the markets.

      Your experience proves my point. So long Nortel!

  19. Rob says:

    Hi Brad,

    Excellent postings!

    I was reading up on FabricPath and realized that the F-series card is capable of 320 Gbps, but the current fabric (46 Gbps x 5) is 230 Gbps per slot. So for a Nexus 7018, at 230 Gbps per slot, that would be 368 line-rate 10GE interfaces, which could theoretically be split in half between uplinks and end-point connections for no over-subscription.
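
    Just to show my working (a rough back-of-the-envelope, assuming the 7018’s 16 I/O slots and 46 Gbps of fabric bandwidth per slot per fabric module):

        fabric_modules  = 5
        gbps_per_module = 46
        per_slot_gbps   = fabric_modules * gbps_per_module   # 230 Gbps per slot
        ports_per_slot  = per_slot_gbps // 10                # 23 line-rate 10GE ports
        io_slots        = 16
        print(per_slot_gbps, ports_per_slot * io_slots)      # 230, 368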

    Is there a cabling methodology that can be used with FabricPath to avoid traversing the chassis fabric, in order to offer 320 Gbps per line card? The reason why I ask is the following document:

    http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9402/data_sheet_c78-605622.html

    [
    230 Gbps in each direction (460 Gbps full duplex) distributed across five Cisco Nexus 7000 46-Gbps/slot fabric modules
    320-Gbps switching capacity, per module, in meshed architectures
    ]

    What is this meshed architecture?

    Is it like 1 port on each F-series line card vertically forming a port channel to each spine, while forming IS-IS adjacencies horizontally?

    This may be off topic, but since it was related to TRILL and Fabric Path…

    Thanks Again for sharing a lot of interesting topics!

    Rob

    • brianG says:

      I have been trying to find this out also, with no luck. Trying to build a Nexus solution is proving to be much harder than I anticipated. Many dependencies.

  20. John says:

    So the concept of having a Layer 2 mesh is pretty cool. I come from the old school of breaking up broadcast domains. What’s good practice for sizing a VLAN under TRILL? Is it one massive VLAN (8-bit mask), several medium ones, or a lot of small ones (24-bit masks)? Or is it all dependent on logically separating your nodes? In the old days we just broke everything up into 24-bit masks regardless, to control broadcasts.

  21. Richard says:

    Hi, this is the best document on TRILL I have ever read. Great job. Thanks.

  22. Joe Smith says:

    Richard, this is indeed a nice write-up, but it has nothing to do with TRILL per se. This is a write-up about Ethernet, its basic operation, and some of its shortcomings.

    Not that I blame Brad; no vendor has TRILL implemented yet, and there is also no white paper on TRILL of any value, since no one seems to understand it too well. The only write-up on TRILL is the IETF document.

  23. kulin shah says:

    Great article Brad. I might be taking a step back here, but the only case where an L3 leaf-spine design falls short is when customers deploy a significant portion of their servers as VMs that they intend to be mobile across racks. From the customer designs I have dealt with, I have yet to see a large virtual deployment in large-scale data centers (I am talking 4000-8000 nodes).
    TRILL of course sounds like the holy grail for achieving scale with simplicity in the data center, but as long as customers don’t move to a substantial VM deployment, L3 with ECMP seems to be working just fine for them.
