Top of Rack vs End of Row Data Center Designs

This article provides a close examination and comparison of two popular data center physical designs: “Top of Rack” and “End of Row”. We will also explore a newer alternative design using Fabric Extenders, and finish off with a quick look at how Cisco Unified Computing might fit into this picture. Let’s get started!

Top of Rack Design


Figure 1 - Top of Rack design

In the Top of Rack design, servers connect to one or two Ethernet switches installed inside the rack. The term “top of rack” has been coined for this design; however, the switch does not necessarily need to be physically located at the top of the rack. Other switch locations could be the bottom or middle of the rack, but the top of the rack is most common due to easier accessibility and cleaner cable management. This design may also be referred to as “In-Rack”. The Ethernet top of rack switch is typically a low profile (1RU-2RU), fixed configuration switch. The key characteristic and appeal of the Top of Rack design is that all copper cabling for servers stays within the rack as relatively short RJ45 patch cables from the server to the rack switch. The switch links the rack to the data center network with fiber running directly from the rack to a common aggregation area, connecting to redundant “Distribution” or “Aggregation” high density modular Ethernet switches.

Each rack is connected to the data center with fiber. Therefore, there is no need for a bulky and expensive infrastructure of copper cabling running between racks and throughout the data center. Large amounts of copper cabling place an additional burden on data center facilities, as bulky copper cable can be difficult to route, can obstruct airflow, and generally requires more racks and infrastructure dedicated to just patching and cable management. Long runs of twisted pair copper cabling can also place limitations on server access speeds and network technology. The Top of Rack design avoids these issues because there is no need for a large copper cabling infrastructure. This is often the key factor in selecting a Top of Rack design over End of Row.

Each rack can be treated and managed like an individual and modular unit within the data center. It is very easy to change out or upgrade the server access technology rack by rack. Any network upgrades or issues with the rack switches will generally only affect the servers within that rack, not an entire row of servers. Given that the server connects with very short copper cables within the rack, there is more flexibility and options in terms of what that cable is and how fast a connection it can support. For example, a 10GBASE-CX1 copper cable could be used to provide a low cost, low power, 10 gigabit server connection. The 10GBASE-CX1 cable supports distances of up to 7 meters, which works fine for a Top of Rack design.

Fiber to each rack provides much better flexibility and investment protection than copper because of the unique ability of fiber to carry higher bandwidth signals at longer distances. Future transitions to 40 gigabit and 100 gigabit network connectivity will be easily supported on a fiber infrastructure. Given the current power challenges of 10 Gigabit over twisted pair copper (10GBASE-T), any future support of 40 or 100 Gigabit on twisted pair will likely have very short distance limitations (in-rack distances). This is another key factor in selecting Top of Rack over End of Row.

Figure 2 - Blade enclosures with integrated Ethernet and FC switching

The adoption of blade servers with integrated switch modules has made fiber connected racks more popular by moving the “Top of Rack” concept inside the blade enclosure itself. A blade server enclosure may contain two, four, or more Ethernet switching modules and multiple FC switches, resulting in an increasing number of switches to manage.

One significant drawback of the Top of Rack design is the increased management domain, with each rack switch being a unique control plane instance that must be managed. In a large data center with many racks, a Top of Rack design can quickly become a management burden by adding many switches to the data center that are each individually managed. For example, in a data center with 40 racks, where each rack contained two “Top of Rack” switches, the result would be 80 switches on the floor just providing server access connections (not counting distribution and core switches). That is 80 copies of switch software that need to be updated, 80 configuration files that need to be created and archived, 80 different switches participating in the Layer 2 spanning tree topology, and 80 different places a configuration can go wrong. When a Top of Rack switch fails, the individual replacing it needs to know how to properly access and restore the archived configuration of the failed switch (assuming it was correctly and recently archived). The individual may also be required to perform some verification testing and troubleshooting. This requires a higher-skilled individual who may not always be available (or, if available, comes at a high price), especially in a remotely hosted “lights out” facility.

The Top of Rack design also typically requires higher port densities in the Aggregation switches. Going back to the 80-switch example, with each switch having a single connection to each redundant Aggregation switch, each Aggregation switch requires 80 ports. The more ports you have in the aggregation switches, the more likely you are to face potential scalability constraints. One of these constraints might be, for example, STP Logical Ports, which is the product of aggregation ports and VLANs. For example, if I needed to support 100 VLANs in a single L2 domain with PVST on all 80 ports of the aggregation switches, that would result in 8,000 STP Logical Ports per aggregation switch. Most robust modular switches can handle this number. For example, the Catalyst 6500 supports 10,000 PVST instances in total, and 1,800 per line card, while the Nexus 7000 supports 16,000 PVST instances globally with no per line card restrictions. Nonetheless, this is something that will need to be paid attention to as the data center grows in numbers of ports and VLANs. Another possible scalability constraint is raw physical ports – does the aggregation switch have enough capacity to support all of the top of rack switches? What about support for 10 Gigabit connections to each top of rack switch – how well does the aggregation switch scale in 10 gigabit ports?
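
To put rough numbers on this, here is a quick back-of-the-napkin calculation – a Python sketch using the 40-rack example above; the PVST limits are the figures quoted in this article and should be verified against current platform documentation:

    # Rough Top of Rack scaling math for the 40-rack example above.
    # Assumes 2 ToR switches per rack, one uplink from every ToR switch to each of
    # two aggregation switches, and PVST (one STP logical port per VLAN per port).
    racks = 40
    tor_switches_per_rack = 2
    vlans = 100

    tor_switches = racks * tor_switches_per_rack   # 80 access switches to manage
    agg_ports = tor_switches                       # 80 downlink ports per aggregation switch
    stp_logical_ports = agg_ports * vlans          # 8,000 STP logical ports per aggregation switch

    # PVST limits quoted above (verify against current documentation):
    limits = {"Catalyst 6500 (total)": 10000, "Nexus 7000 (total)": 16000}

    print(f"ToR switches to manage: {tor_switches}")
    print(f"STP logical ports per aggregation switch: {stp_logical_ports}")
    for platform, limit in limits.items():
        print(f"  {platform}: {stp_logical_ports / limit:.0%} of the PVST limit consumed")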

Summary of Top of Rack advantages (Pros):

  • Copper stays “In Rack”. No large copper cabling infrastructure required.
  • Lower cabling costs. Less infrastructure dedicated to cabling and patching. Cleaner cable management.
  • Modular and flexible “per rack” architecture. Easy “per rack” upgrades/changes.
  • Future proofed fiber infrastructure, sustaining transitions to 40G and 100G.
  • Short copper cabling to servers allows for low power, low cost 10GE (10GBASE-CX1), and 40G in the future.
  • Ready for Unified Fabric today.

Summary of Top of Rack disadvantages (Cons):

  • More switches to manage. More ports required in the aggregation.
  • Potential scalability concerns (STP Logical ports, aggregation switch density).
  • More Layer 2 server-to-server traffic in the aggregation.
  • Racks connected at Layer 2. More STP instances to manage.
  • Unique control plane per 48 ports (per switch); higher skill set needed for switch replacement.

End of Row Design


Figure 3 - End of Row design

Server cabinets (or racks) are typically lined up side by side in a row. Each row might contain, for example, 12 server cabinets. The term “End of Row” was coined to describe a rack or cabinet placed at either end of the “server row” for the purpose of providing network connectivity to the servers within that row. Each server cabinet in this design has a bundle of twisted pair copper cabling (typically Category 6 or 6A) containing as many as 48 (or more) individual cables routed to the “End of Row”. The “End of Row” network racks may not necessarily be located at the end of each actual row. There may be designs where a handful of network racks are placed in a small row of their own collectively providing “End of Row” copper connectivity to more than one row of servers.

For a redundant design there might be two bundles of copper to each rack, each running to opposite “End of Row” network racks. Within the server cabinet the bundle of copper is typically wired to one or more patch panels fixed to the top of the cabinet. The individual servers use a relatively short RJ45 copper patch cable to connect from the server to the patch panel in the rack. The bundle of copper from each rack can be routed through overhead cable troughs or “ladder racks” that carry the dense copper bundles to the “End of Row” network racks. Copper bundles can also be routed underneath a raised floor, at the expense of obstructing cool airflow. Depending on how much copper is required, it is common to have a rack dedicated to patching all of the copper cable adjacent to the rack that contains the “End of Row” network switch. Therefore, there might be two network racks at each end of the row, one for patching, and one for the network switch itself. Again, an RJ45 patch cable is used to link a port on the network switch to a corresponding patch panel port that establishes the link to the server. The large quantity of RJ45 patch cables at the End of Row can cause a cable management problem and, without careful planning, can quickly result in an ugly, unmanageable mess.
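
To illustrate how quickly the copper adds up, here is a rough count using the example numbers above (12 cabinets per row, 48-cable bundles, redundant bundles to opposite ends of the row) – a simple Python illustration, not a cabling design:

    # Rough copper count for one row in the End of Row design described above.
    cabinets_per_row = 12
    cables_per_bundle = 48
    bundles_per_cabinet = 2    # redundant bundles, one to each end of the row

    copper_runs_per_row = cabinets_per_row * cables_per_bundle * bundles_per_cabinet
    runs_per_end = copper_runs_per_row // 2    # each end terminates one bundle per cabinet

    print(f"Copper runs in one row: {copper_runs_per_row}")            # 1152
    print(f"Runs terminating at each end of the row: {runs_per_end}")  # 576, each also needing an RJ45 patch cable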

Another variation of this design can be referred to as “Middle of Row”, which involves routing the copper cable from each server rack to a pair of racks positioned next to each other in the middle of the row. This approach reduces the extreme cable lengths from the far-end server cabinets, but potentially exposes the entire row to a localized disaster at the “Middle of Row” (such as leaking water from the ceiling) that might disrupt both server access switches at the same time.

Figure 4 - Middle of Row variation of End of Row

The End of Row network switch is typically a modular chassis based platform that supports hundreds of server connections. It usually has redundant supervisor engines, redundant power supplies, and overall better high availability characteristics than a “Top of Rack” switch. The modular End of Row switch is expected to have a longer life span of at least 5 to 7 years (or even longer). It is uncommon for the End of Row switch to be frequently replaced; once it’s in, it’s in, and any further upgrades are usually component level upgrades such as new line cards or supervisor engines.

The End of Row switch provides connectivity to the hundreds of servers within that row. Therefore, unlike Top of Rack where each rack is its own managed unit, with End of Row the entire row of servers is treated like one holistic unit or “Pod” within the data center. Network upgrades or issues at the End of Row switch can be service impacting to the entire row of servers. The data center network in this design is managed “per row”, rather than “per rack”.

A Top of Rack design extends the Layer 2 topology from the aggregation switch to each individual rack, resulting in an overall larger Layer 2 footprint and, consequently, a larger Spanning Tree topology. The End of Row design, on the other hand, extends a Layer 1 cabling topology from the “End of Row” switch to each rack, resulting in a smaller and more manageable Layer 2 footprint and fewer STP nodes in the topology.

End of Row is a “per row” management model in terms of the data center cabling. Furthermore, End of Row is also “per row” in terms of the network management model. Given that there are usually two modular switches per row of servers, the result is far fewer switches to manage when compared to a Top of Rack design. In my previous example of 40 racks, let’s say there are 10 racks per row, which would be 4 rows, each with two “End of Row” switches. The result is 8 switches to manage, rather than 80 in the Top of Rack design. As you can see, the End of Row design typically carries an order of magnitude advantage over Top of Rack in terms of the number of individual switches requiring management. This is often a key factor why the End of Row design is selected over Top of Rack.
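
Here is that management-point arithmetic as a quick Python sketch, using the 40-rack example and the per-row assumptions above:

    # Access switches to manage: Top of Rack vs End of Row, 40-rack example above.
    racks = 40
    racks_per_row = 10
    tor_switches_per_rack = 2
    eor_switches_per_row = 2

    tor_switches = racks * tor_switches_per_rack   # 80 individually managed ToR switches
    rows = racks // racks_per_row                  # 4 rows
    eor_switches = rows * eor_switches_per_row     # 8 modular End of Row switches

    print(f"Top of Rack: {tor_switches} switches to manage")
    print(f"End of Row:  {eor_switches} switches to manage ({tor_switches // eor_switches}x fewer)")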

While End of Row has far fewer switches in the infrastructure, this doesn’t necessarily equate to far lower capital costs for networking. For example, the cost of a 48-port line card in a modular End of Row switch can be only slightly lower than (if not similar to) that of an equivalent 48-port “Top of Rack” switch. However, maintenance contract costs are typically lower with End of Row due to the far smaller number of individual switches carrying maintenance contracts.

As was stated in the Top of Rack discussion, the large quantity of dense copper cabling required with End of Row is typically expensive to install, bulky, restrictive to airflow, and brings its share of cable management headaches. The lengthy twisted pair copper cable also poses a challenge for adopting higher speed server network I/O. For example, a 10 gigabit server connection over twisted pair copper (10GBASE-T) is challenging today due to the power requirements of the currently available 10GBASE-T silicon (6-8W per end). As a result, there is also scarce availability of dense and cost effective 10GBASE-T network switch ports. As the adoption of dense compute platforms and virtualization accelerates, servers limited to 1GE network I/O connections will pose a challenge in achieving the wider scale consolidation and virtualization that modern servers are capable of. Furthermore, adopting a unified fabric will also have to wait until 10GBASE-T unified fabric switch ports and CNAs are available (not expected until late 2010).

10GBASE-T silicon will eventually (over the next 24 months) reach lower power levels, and switch vendors (such as Cisco) will have dense 10GBASE-T line cards for modular switches (such as the Nexus 7000). Server manufacturers will also start shipping triple speed 10GBASE-T LOMs (LAN on Motherboard) – 100/1000/10G – and NIC/HBA vendors will have unified fabric CNAs with 10GBASE-T ports. All of this is expected to work on existing Category 6A copper cable. All bets are off, however, for 40G and beyond.

Summary of End of Row advantages (Pros):

  • Fewer switches to manage. Potentially lower switch costs, lower maintenance costs.
  • Fewer ports required in the aggregation.
  • Racks connected at Layer 1. Fewer STP instances to manage (per row, rather than per rack).
  • Longer life, high availability, modular platform for server access.
  • Unique control plane per hundreds of ports (per modular switch); lower skill set required to replace a 48-port line card versus replacing a 48-port switch.

Summary of End of Row disadvantages (Cons):

  • Requires an expensive, bulky, rigid, copper cabling infrastructure. Fraught with cable management challenges.
  • More infrastructure required for patching and cable management.
  • Long twisted pair copper cabling limits the adoption of lower power higher speed server I/O.
  • More future challenged than future proof.
  • Less flexible “per row” architecture. Platform upgrades/changes affect entire row.
  • Unified Fabric not a reality until late 2010.

Top of Rack Fabric Extender

Figure 5 - Fabric Extenders provide the End of Row management model in a Top of Rack design

The fabric extender is a new data center design concept that allows for the “Top of Rack” placement of server access ports as a Layer 1 extension of an upstream master switch. Much like a line card in a modular switch, the fabric extender is a data plane only device that receives all of its control plane intelligence from its master switch. The relationship between a fabric extender and its master switch is similar to the relationship between a line card and its supervisor engine, only now the fabric extender can be connected to its master switch (supervisor engine) over remote fiber connections. This allows you to effectively decouple the line cards of the modular “End of Row” switch and spread them throughout the data center (at the top of each rack), all without losing the management model of a single “End of Row” switch. The master switch and all of its remotely connected fabric extenders are managed as one switch. Each fabric extender simply provides a remote extension of ports (acting like a remote line card) to the single master switch.

Unlike a traditional Top of Rack switch, the top of rack fabric extender is not an individually managed switch. There is no configuration file, no IP address, and no software that needs to be managed for each fabric extender. Furthermore, there is no Layer 2 topology from the fabric extender to its master switch; rather, it’s all Layer 1. Consequently, there is no Spanning Tree topology between the master switch and its fabric extenders, much like there is no Spanning Tree topology between a supervisor engine and its line cards. The Layer 2 Spanning Tree topology only exists between the master switch and the upstream aggregation switch it’s connected to.

The fabric extender design provides the physical topology of “Top of Rack”, with the logical topology of “End of Row”, providing the best of both designs. There are far fewer switches to manage (much like End of Row) with no requirement for a large copper cabling infrastructure, and future proofed fiber connectivity to each rack.

There is a cost advantage as well. Given that the fabric extender does not need the CPU, memory, and flash storage to run a control plane, there are fewer components and therefore lower cost. A fabric extender is roughly 33% less expensive than an equivalent Top of Rack switch.

When a fabric extender fails there is no configuration file that needs to be retrieved and replaced, no software that needs to be loaded. The failed fabric extender simply needs to be removed and a new one installed in its place connected to the same cables. The skill set required for the replacement is somebody who knows how to use a screwdriver, can unplug and plug in cables, and can watch a status light turn green. The new fabric extender will receive its configuration and software from the master switch once connected.


Figure 6 - Top of Rack Fabric Extender linked to Master switch in Aggregation area

In the design shown above in Figure 6, top of rack fabric extenders use fiber from the rack to connect to their master switch (Nexus 5000) somewhere in the aggregation area. The Nexus 5000 links to the Ethernet aggregation switch like any normal “End of Row” switch.

Note: Up to (12) fabric extenders can be managed by a single master switch (Nexus 5000).
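
Applying that limit to the earlier 40-rack example gives a rough sense of the points of management – a quick Python sketch, assuming two fabric extenders per rack, each singly homed to one Nexus 5000 master switch:

    import math

    # Points of management if the 40-rack example used fabric extenders instead of
    # standalone ToR switches. Assumes 2 fabric extenders per rack, each singly homed
    # to one Nexus 5000, and the 12-fabric-extenders-per-master limit noted above.
    racks = 40
    fex_per_rack = 2
    fex_per_master = 12

    fex_total = racks * fex_per_rack                 # 80 fabric extenders (not individually managed)
    masters = math.ceil(fex_total / fex_per_master)  # 7 master switches (deploy 8 for redundant pairs)

    print(f"Fabric extenders at the top of rack: {fex_total}")
    print(f"Master switches to manage: {masters}")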


Figure 7 - Top of Rack Fabric Extender linked to End of Row master switch

In Figure 7 above the top of rack fabric extenders use fiber running from the rack to an “End of Row” cabinet containing the master switch. The master switch, in this case a Nexus 5000, can also provide 10GE unified fabric server access connections.

It is more common for fiber to run from the rack to a central aggregation area (as shown in Figure 6). However, the design shown above in Figure 7, where fiber also runs to the end of a row, may start to gain interest with fabric extender deployments as a way to preserve the logical grouping of “rows” by physically placing the master switch within the row of fabric extenders linked to it.

Summary of Top of Rack Fabric Extender advantages (Pros):

  • Fewer switches to manage. Fewer ports required in the aggregation area. (End of Row)
  • Racks connected at Layer 1 via fiber, extending Layer 1 copper to servers in-rack. Fewer STP instances to manage. (End of Row)
  • Unique control plane per hundreds of ports, lower skill set required for replacement. (End of Row)
  • Copper stays “In Rack”. No large copper cabling infrastructure required. (Top of Rack)
  • Lower cabling costs. Less infrastructure dedicated to cabling and patching. Cleaner cable management. (Top of Rack)
  • Modular and flexible “per rack” architecture. Easy “per rack” upgrades/changes. (Top of Rack)
  • Future proofed fiber infrastructure, sustaining transitions to 40G and 100G. (Top of Rack)
  • Short copper cabling to servers allows for low power, low cost 10GE (10GBASE-CX1), and 40G in the future. (Top of Rack)

Summary of Top of Rack Fabric Extender disadvantages (Cons):

  • New design concept only available since January 2009. Not a widely deployed design, yet.

Link to Learn more about Fabric Extenders

Cisco Unified Computing Pods

The Cisco Unified Computing solution provides a tightly coupled architecture of blade servers, unified fabric, fabric extenders, and embedded management, all within a single cohesive system. A multi-rack deployment is a single system managed by a redundant pair of “Top of Rack” fabric interconnect switches that provide the embedded device level management and provisioning, and link the pod to the data center aggregation Ethernet and Fibre Channel switches.


Figure 8 - Unified Computing System Pod

Above, a pod of 3 racks makes up one system. Each blade enclosure links to the fabric interconnect switches over the unified fabric via fabric extenders, using 10GBASE-CX1 copper or USR (ultra short reach) 10GE fiber optics.

A single Unified Computing System can contain as many as 40 blade enclosures. With such scalability there could be designs where an entire row of blade enclosures is linked to “End of Row” or “Middle of Row” fabric interconnects, as shown below…

Figure 9 - Unified Computing System Row

These are not the only possible designs, rather just a couple of simple examples. Many more possibilities exist as the architecture is as flexible as it is scalable.

Summary of Unified Computing System advantages (Pros):

  • Leverages the “Top of Rack” physical design.
  • Leverages Fabric Extender technology. Fewer points of management.
  • Single system of compute, unified fabric, and embedded management.
  • Highly scalable as a single system.
  • Optimized for virtualization.

Summary of Unified Computing System disadvantages (Cons):

  • Cisco UCS is not available yet. :-( Ask your local Cisco representative for more information.
  • UPDATE: Cisco UCS has been available and shipping to customers since June 2009 :)

Link to Learn more about Cisco Unified Computing System

Deploying Data Center designs into “Pods”

Choosing “Top of Rack” or “End of Row” physical designs is not an all-or-nothing decision. The one thing all of the above designs have in common is that they each link to a common Aggregation area with fiber. The common Aggregation area can therefore service an “End of Row” pod no differently than a “Top of Rack” pod. This allows for flexibility in the design choices made as the data center grows, pod by pod. Some pods may employ End of Row copper cabling, while other pods employ Top of Rack fiber, with each pod linking to the common aggregation area with fiber.

Figure 10 - Data Center Pods

Conclusion

This article is based on a 30+ slide detailed presentation I developed from scratch for Cisco covering data center physical designs, Top of Rack vs. End of Row. If you would like to see the entire presentation with a one-on-one discussion about your specific environment, please contact your local Cisco representative and ask to see “Brad Hedlund’s Top of Rack vs. End of Row data center designs” presentation! What can I say, a shameless attempt at self promotion. ;-)

-Brad Hedlund

© INTERNETWORK EXPERT .ORG – 2009

About the Author

Brad Hedlund is an Engineering Architect with the CTO office of VMware’s Networking and Security Business Unit (NSBU), focused on network & security virtualization (NSX) and the software-defined data center. Brad’s background in data center networking begins in the mid-1990s with a variety of experience in roles such as IT customer, systems integrator, architecture and technical strategy roles at Cisco and Dell, and speaker at industry conferences. CCIE Emeritus #5530.

Comments (36)


  1. Jorge Nolla says:

    Hi Brad,

    This is a great article. As you know, we have gone through this design process in our company, and I would like to share a few thoughts:

    1 – Depending on the platform deployed for EOR, ISSU capabilities are a must. ISSU is not provided by TOR solutions that I’m aware of. This allows for hitless upgrades at the access layer.
    2 – TOR provides no separation between the switch fabric and the route processor.
    3 – TOR switches are usually fixed solutions. Take, for example, the PFC and DFC on the 6500 platform. They provide modularity for future expandability, and allow for a routed access layer.
    4 – Per-row switched fabric. If VSS is deployed, the entire row shares one switch fabric, providing for the lowest latency possible for intra-row communication.
    5 – TOR uplinks. Depending on the number of access switches deployed, assuming we are looking at 10GE redundant uplinks per cabinet, the Aggregation layer must support an immense number of 10GE ports. Current offerings are limited in the number of 10GE ports they support at wire speed; this increases the number of devices required to deploy a TOR 10GE solution.

    Also consider the single point of failure on the Nexus 5000 with the fabric extender. Each extender exponentially increases your failure domain, and no redundancy is provided by the 5000 other than power, that I’m aware of.

    Best Regards

    • Brad Hedlund says:

      Jorge,
      You bring up some really good points. Allow me to address a few things:

      ISSU is not provided by TOR solutions that I’m aware of.

      This is true. The only Ethernet switch that provides true hitless ISSU is Nexus 7000, which is a good choice for End of Row. For Top of Rack, true hitless ISSU is coming for Nexus 5000 and Fabric Extenders.

      TOR provides no separation between the switch fabric and the route processor.

      This depends on the TOR switch. This is true for Catalyst switches, 4948, 4900M, 3750. However the Nexus 5000 has complete control plane and data plane separation (which is why hitless ISSU will be possible).

      If VSS is deployed, the entire row shares one switch fabric, providing for the lowest latency possible for intra-row communication.

      It is true that with End of Row more server-to-server traffic on the same VLAN stays in the row, as does L3 server-to-server traffic if you have a routed access design. However, a design with Fabric Extenders top of rack connected to a Nexus 5000 has similar latencies to a Catalyst 6500. Granted, the Nexus 5000 is Layer 2 only, so you would not deploy a routed access design with Nexus 5000 and fabric extenders – which isn’t a big deal in my opinion, because routed access designs in the Data Center are going away due to virtualization and better L2 topology technologies such as VSS, vPC, and DCE L2 multipathing.

      Current [Aggregation switch] offerings are limited in the number of 10GE ports they support at wire speed; this increases the number of devices required to deploy a TOR 10GE solution.

      Every design varies based on many factors; however, honestly, I rarely see the requirement for non-blocking 10GE ports on every Top of Rack facing link – 2:1 or 4:1 oversubscription facing the top of rack switch is generally acceptable. Yes, it would be great to have all non-blocking ports in the aggregation, but that comes at a very high cost that is often hard to justify. At the Aggregation switch it’s more common to see 1:1 ports used for connecting to the peer Aggregation switch and Core facing ports.

      Also consider the single point of failure on the Nexus 5000 with the fabric extender

      For the Nexus 5000 and Fabric Extender design, the redundancy is provided by the server being redundantly connected to two different fabric extenders, each of which is connected to a different Nexus 5000 (see Figure 5 above). This is very similar to having a Catalyst 6500 with only one supervisor engine, as would be the case with VSS. The Nexus 5000 is the one supervisor engine and its connected fabric extenders are its line cards. Just as you would connect a server to two different VSS chassis, you would also connect a server to two different Fabric Extender / Nexus 5000 pairs. In a VMware deployment, the Multi-Chassis EtherChannel capabilities VSS provides may not be necessary if the VMware design is using Virtual Port ID based VM load balancing. Nonetheless, VSS-like capabilities are coming soon for the Nexus 5000 Fabric Extender design, where the server can connect with an active/active LACP port channel to two different fabric extenders – something that’s never been possible before in a Top of Rack design.

  2. Craig Weinhold says:

    I know you’re trying to keep the design comparisons at a high level, but since the Nexus 2000 is a real, shipping product, I think it’s fair to bring up its disadvantages:

    1. no LACP to servers (today).
    2. no ISSU on N5K (today).
    3. no 10/100 support.
    4. No RSPAN support.
    5. SPAN limitations: N2K ports can’t be SPAN destinations, can’t be discrete sources, and may show duplicate traffic.
    6. Multicast issues: reduced number of (S,G) multicast states and delays in joins/leaves make it unsuitable for demanding multicast environments (e.g., market data).
    7. Takes up to 3 minutes to boot / recover from software upgrade.
    8. Downstream switches are supported, but only with flexlinks or with STP-disabled single-homing. Note that N5K does not support mac-address move update (MMU) to assist with flexlinks.
    9. No local switching — local traffic crosses the uplinks twice, adding 15-20 msec of latency.

    • Brad Hedlund says:

      Craig,
      Platform and architecture decisions for data centers are largely based around CapEx and OpEx factors, requiring a higher level look at the design choices. The CapEx/OpEx advantages of Nexus 2000 and fabric extenders far outweigh some of the things you bring up; however, if you want to jump into the weeds, let’s go there…

      1. no LACP to servers (today).

      Very few designs actually require LACP to the server. Furthermore, LACP to the server is rarely used where HA is important, because the server will have two NICs connected to two different access switches (not the same switch), and it’s never been possible to have LACP to the server from two separate access switches until VSS (Catalyst 6500) and vPC (Nexus 7000) became available – and neither of those switches is used in Top of Rack designs. Let’s also keep in mind that Nexus 2000 is positioned for a Top of Rack design, and LACP to the server from two separate Top of Rack switches has never been possible; however, Nexus 2000 will be the first to deliver this when vPC is available on Nexus 5000 later this year.

      2. no ISSU on N5K (today).

      Why is this a disadvantage? No other Top of Rack Ethernet switch provides full ISSU today, so I don’t see why this matters. Would the fact that the Nexus 2000 doesn’t print money be considered a “disadvantage”?

      3. no 10/100 support.

      True. However I have yet to see this become a show stopping concern as usually all of the gear installed in the rack has GE support, including the various management connections.

      4. No RSPAN support.

      Saying the Nexus 2000 doesn’t support RSPAN doesn’t really apply, because the Nexus 2000 is not a configurable switch – there is no control plane on which to configure a feature like RSPAN. That’s the whole idea behind fabric extenders: minimizing configuration touch points. The SPAN capabilities should be applied at the Nexus 5000, the master switch, which right now supports local SPAN.
      Again, SPAN related capabilities are nice but rarely influence platform discussions and data center architectures.

      7. Takes up to 3 minutes to boot / recover from software upgrade.

      And this is a problem why? A server will be connected to a redundant Nexus 2000 if high availability is important. Furthermore, Cisco switches undergo extensive POST testing when booting up —> which is a good thing, and which ultimately results in better availability. Do you want your servers establishing a link to a switch that hasn’t taken the proper time to fully assess its health? I always chuckle when I hear the complaint about how long it takes a Cisco switch to boot up.

      8. Downstream switches are supported, but only with flexlinks or with STP-disabled single-homing.

      What’s wrong with Flexlinks? One of the key advantages of the fabric extender (Nexus 2000) is that it minimizes the STP footprint within the data center. Most network engineers agree that less STP is a good thing.

      9. No local switching — local traffic crosses the uplinks twice, adding 15-20 msec of latency.

      True, the fabric extenders do not locally switch, but your numbers of 15-20 msec of added latency are just flat out wrong. The Nexus 2000 port to port latency for a standard 1500 byte packet is 15-20 microseconds (not milliseconds). This is roughly equivalent to local switching on a Catalyst 6500. So the fact that the Nexus 2000 does not locally switch is largely irrelevant from a latency perspective.

  3. Craig Weinhold says:

    In the theoretically perfect environment, N2K is fine. But in the real-world, you’ll see singly-homed hosts, load-balanced NIC teams, cascaded switches and transparent appliances, and needs for 10/100 and RSPAN.

    You don’t think N5K’s lack of ISSU is a disadvantage? Maybe you should tell R&D to stop working on it. >:)

    If you don’t think the lack of RSPAN/ERSPAN/Netflow on the N2K/N5K is serious, then consider that those are the most prominent features of the Nexus 1000V.

    And I do realize that 20 microseconds won’t impact gig servers much (sorry for the “msec” versus “µsec” confusion). Regardless, load on the N5K uplinks is still a concern. It’s similar to the issue of 3750 stacks versus 3750E stacks.

    • Brad Hedlund says:

      But in the real-world, you’ll see singly-homed hosts, load-balanced NIC teams, cascaded switches and transparent appliances, and needs for 10/100 and RSPAN.

      Singly homed hosts work fine with N2K. Not sure what your point is there. A singly homed host is at the mercy of the single switch it’s connected to whether that be a Nexus 2000 fabric extender or a traditional top of rack switch.
      As for load balanced NIC teams, most “load balanced” server NIC teams are not performing true active/active bidirectional load balancing. Rather, many of those configurations are using active/active for server-to-network flows, however the network-to-server flows are still active/standby. Some configurations may provide active/active for network-to-server flows by alternating ARP replies for each NIC’s source MAC. Either configuration works fine with N2K because neither configuration is true 802.3ad LACP.
      I don’t see many deployments where 10/100 and RSPAN are hard requirements, but I agree with you that if those are absolutely required the Nexus 2000/5000 will not fit at this time.

      You don’t think N5K’s lack of ISSU is a disadvantage? Maybe you should tell R&D to stop working on it. >:)

      No, it’s not a disadvantage. Let’s define “disadvantage”: when Product A lacks a capability that is readily available in competing Product B. That’s a disadvantage for Product A.
      If Product A lacks a capability that no other product has either, that is not a disadvantage of Product A. If Product A will soon have a capability that no other similar product has (such as ISSU), that is a pending “advantage” of Product A, and is the perfect reason why R&D is working on said capability –> to create an “Advantage”.

      If you don’t think the lack of RSPAN/ERSPAN/Netflow on the N2K/N5K is serious, then consider that those are the most prominent features of the Nexus 1000V.

      Again, yes, it would be great to have RSPAN/ERSPAN/Netflow on the N2K/N5K — however, at this time no other platform in the same class provides those capabilities either, and they rarely top the list of priorities. Such capabilities will come to the Nexus 5000/2000 in time, with the timing largely driven by customer demand and competition, which drive the business case for prioritizing those features.

      To your point, the Nexus 1000V providing SPAN and Netflow capabilities will fill those gaps quite nicely.

      Regardless, load on the N5K uplinks is still a concern.

      Let’s imagine the worst case scenario (which also happens to be a highly unlikely scenario) — all 48 ports of the Nexus 2000 are populated with GE attached servers. Each server is both sending and receiving a full 1GE load to another server on the same N2K. This would result in 48 GE of offered load on the uplinks – of which you can provision 40GE with 4 x 10GE uplinks to the N5K.
      That’s not bad at all. In fact, even in this worst case scenario you could reduce your server count from 48 to 40 (not too much to ask) and achieve 1:1 non-blocking.
      The reality is such a scenario would never exist. In a top of rack design there is usually a significant percentage of traffic that is leaving the rack and is not locally switched, and furthermore not all 48 attached servers will be offering a full 1GE load all at the same time, and with 4 x 10GE uplinks you have plenty of available uplink bandwidth.
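
      To put rough numbers on that worst case (just the arithmetic from the scenario above, sketched in Python):

      # Worst-case arithmetic from the scenario above: 48 GE-attached servers on one
      # Nexus 2000, all offering line-rate traffic at once, with 4 x 10GE uplinks.
      servers = 48
      server_gbps = 1
      uplinks = 4
      uplink_gbps = 10

      offered = servers * server_gbps     # 48 Gbps of offered load
      capacity = uplinks * uplink_gbps    # 40 Gbps of uplink capacity
      ratio = offered / capacity          # 1.2:1 oversubscription in the worst case

      print(f"Offered load: {offered} Gbps, uplink capacity: {capacity} Gbps")
      print(f"Worst-case oversubscription: {ratio:.1f}:1")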

  4. Craig Weinhold says:

    it’s never been possible to have LACP to the server from two separate access switches until VSS (Catalyst 6500) and vPC (Nexus 7000) became available

    3750/3750-E supports LACP in a top-of-rack, dual-supervisor-esque mode of operation.

    many of those configurations are using active/active for server-to-network flows, however the network-to-server flows are still active/standby

    That describes a textbook unicast flooding scenario that I’ve seen crash many a network, and is exactly my counterpoint to the perfect network that the N5K/N2K demands. The real data center is full of garbage traffic and protocols that could potentially be affected by a N5K/N2K deployment. E.g., unicast flooding is most definitely not kosher, but many networks get away with it today. Will that change with N5K/N2K? How about asymmetric switching, Microsoft NLB, or other garbage?

    Don’t get me wrong — Nexus’s defaults are a vast improvement over IOS/CatOS. NX/OS’s design team has made good choices, and features like bridge assurance and per-interface IGP’s are very welcome.

    But, the client-sat issue is that some network admin will need to deal with why Nexus behaves differently than with Catalyst.

    when Product A lacks a capability that is readily available in competing Product B. That’s a disadvantage for Product A.

    I agree that there is no apples-to-apples equivalent to the N5K/N2K. One analogy is the chassis switch, which always supports ISSU or subsecond software upgrades with proper design (as well as RSPAN/ERSPAN, Netflow, etc). Another analogy is the ToR switch where a software upgrade only affects a small portion of the data center at a time.

    With N5K/N2K, a software upgrade will take down all N2K ports for several minutes, requiring all hosts to detect and adjust. Hosts that don’t are dead in the water. I can’t think of any “product B” that exhibits such poor behavior.

    Of course, this will be fixed when N2K supports uplinks to a vPC’d pair of N5K’s. But your justification of the current behavior as being reasonable for a data-center class piece of gear is fairly egregious.

    Let’s imagine the worst case scenario

    Your worst-case scenario neglected to include flows to the servers from other places. However, I agree that worrying about load is farfetched. But, I’m just taking my cue from how Cisco markets other products. E.g., one of the main selling points of 3750E stacking is that it supports local switching while 3750 stacks do not.

    • Brad Hedlund says:

      Craig,
      You are coming at this from a very tactical and operational perspective (rather than strategic and architectural), and that’s fine. However many data center architects are looking for ways to reduce points of management while avoiding a restrictive and expensive copper cabling infrastructure, which is why fabric extenders are gaining tremendous interest among those planning and designing data centers today.

      3750/3750-E supports LACP in a top-of-rack, dual-supervisor-esque mode of operation.

      True. However 3750/3750-E is not a data center class top of rack switch, rather it’s more suited for low end deployments or wiring closets.

      unicast flooding is most definitely not kosher, but many networks get away with it today. Will that change with N5K/N2K? How about asymmetric switching, Microsoft NLB, or other garbage?

      As I said in my previous comment, such configurations and traffic work fine with N2K. The N5K/N2K domain does not operate any differently than any other switch in how it handles asymmetric or unicast flooded traffic. Garbage in, garbage out.

      some network admin will need to deal with why Nexus behaves differently than with Catalyst.

      Understood. There will be some education and learning required. As data center requirements evolve, technology will evolve to meet those requirements, and the skills required to operate and design next generation data center architectures will evolve as well, resulting in career opportunities and professional growth.

      One analogy is the chassis switch, which always supports ISSU or subsecond software upgrades with proper design

      The only chassis based Ethernet switch that performs true ISSU is Nexus 7000, which is not used in top-of-rack designs. So in the context of top of rack designs I would argue that is not a fair analogy.

      Another analogy is the ToR switch where a software upgrade only affects a small portion of the data center at a time.

      True, however that software upgrade needs to be performed for every top of rack switch in the data center. This management burden is what the fabric extender design addresses.

      With N5K/N2K, a software upgrade will take down all N2K ports for several minutes, requiring all hosts to detect and adjust. Hosts that don’t are dead in the water.

      And this is a problem why? Software upgrades are performed during maintenance windows, and any hosts where HA is important will be dual connected to separate N2Ks. What you are leaving out is that the software only needs to be upgraded at one device to upgrade many different top of rack devices all at once (the fabric extenders). Maintenance windows are getting shorter and shorter and upgrading software once, rather than 12 times, makes a big difference in meeting those ever shrinking windows.

      one of the main selling points of 3750E stacking is that it supports local switching while 3750 stacks do not.

      Wrong. The 3750 stacks locally switched as well. 3750E brought more power for high power PoE devices, a 64Gbps ring, and 10G uplinks.

  5. Craig Weinhold says:

    You’re right that I am tactical. A client showed me his draft N5K/N2K integration document which read “N5000s will be set as clients to the core VTP domain.” Another Cat6K->N7K forklift deal was at the 11th hour before the client’s reliance on IPX was finally discussed. Etc. IMHO, Cisco presales is not stressing education/training/compatibility nearly enough. Where are the technical Q&A documents that are so useful for other products? Right now, a potential client has to read between the lines to figure out that the N2K/N5K doesn’t support 10/100.

    3750/3750-E is not a data center class top of rack switch,

    That’s Cisco’s official stance, but we sell an awful lot of them for data center/server room use.

    The only chassis based Ethernet switch that performs true ISSU is Nexus 7000,

    True, but the subsecond upstream outage for a Cat6K upgrade is a far cry from the 3+ minutes total access port outage for the N5K/N2K.

    And this (forcing hosts to deal with NIC outages) is a problem why?

    You’re confident that 100% of physical servers, virtual servers, vswitches, and blade switches are configured properly? Both for the initial failover and for the later recovery?

    Or, put another way, once N5K supports vPC, how often do you think you’ll recommend singly-homing your N2K’s?

    Wrong. The 3750 stacks locally switched as well.

    http://www.cisco.com/en/US/prod/collateral/switches/ps5718/ps7077/prod_qas0900aecd805bbea5.html#wp9000199

    StackWise Plus can locally switch. StackWise cannot. Furthermore, in StackWise, since there is no local switching and since there is source stripping, even locally destined packets must traverse the entire stack ring.

    • Brad Hedlund says:

      Craig,
      I get the feeling you work mostly with mid-to-small size companies? I hear what you are saying about Cisco pre-sales in this space. Cisco doesn’t have enough resources to dive deep into every deal at the millions of SMB customers globally. That is where partners like you can add tremendous value (as it sounds like you do). For the larger enterprise customers Cisco has very focused support and dedicated teams to inspect every intricate detail of every deal and detailed discussions of what each product supports, doesn’t support, and the roadmaps.

      Another Cat6K->N7K forklift deal was at the 11th hour before the client’s reliance on IPX was finally discussed. Etc.

      Glad you caught that. This customer must be using Supervisor 2. Even an upgrade of the existing 6500’s to Supervisor 720 would have been problematic given the presence of IPX.

      True, but the subsecond upstream outage for a Cat6K upgrade is a far cry from the 3+ minutes total access port outage for the N5K/N2K.

      You are comparing the software upgrade of a Catalyst 6500 Core/Distribution switch to an N5K/N2K positioned in the access layer. That’s not an apples-to-apples comparison. You should be comparing access layer to access layer. For example, what happens to the hundreds of servers connected to a Catalyst 6500 end-of-row access layer switch when its software is upgraded? –> All the line cards go down when the switch reboots and the hundreds of servers lose their NIC connections to that switch. –> No different than upgrading an N5K/N2K access layer.

      You’re confident that 100% of physical servers, virtual servers, vswitches, and blade switches are configured properly? Both for the initial failover and for the later recovery?

      Again, customers have been thinking about this for years when upgrading their End of Row access switch such as Catalyst 6500. This is nothing new to N5K/N2K.

      Or, put another way, once N5K supports vPC, how often do you think you’ll recommend singly-homing your N2K’s?

      Quite often, actually. N5K supporting vPC doesn’t mean that singly homed N2K designs will go away forever. There are advantages to both. One advantage of keeping N2Ks singly homed with vPC enabled at the N5K is that it will allow for 802.3ad LACP to the server across separate N2Ks. Another advantage is keeping the architecture simple and deterministic. Once the failure scenarios are explained, most customers are comfortable with N2K singly homed to N5K, in my experience. Keep in mind the fabric extender in Cisco UCS is singly homed to its fabric interconnect, and this is an accepted architecture.

      You are right that 3750 non-E stacks pass all packets around the ring, even packets destined for a local port. Thanks for clarifying that.
      When you said the stack does not locally switch, I thought you were saying the master switch was making all of the forwarding decisions, which is not the case in either 3750 or 3750E stacks. Each switch makes its own local forwarding decisions, granted the 3750 non-E stacks exhibit suboptimal forwarding behavior, as you linked to.

      we sell an awful lot of them [3750’s] for data center/server room use.

      Appreciate you selling an awful lot of Cisco gear, regardless of what it is :-) However, I should point out that the Catalyst 4948 is a better performing switch for data center applications than the 3750E, at a lower price: $15K versus $20K.

      Cheers,
      Brad

  6. Brendan Doorhy says:

    Brad,

    What is the value of the patching shown in all of the network architectures? Do the ToR and UCS architectures eliminate the need for patching?

    Thanks,
    Brendan

    • Brad Hedlund says:

      Brendan,
      Patching is important in both top of rack and end of row architectures; each design has cables to/from the rack that need to be organized and structured.
      With a top of rack architecture, the number of cables and patch fields required can be significantly less than end of row, nonetheless there is still a need for a fiber patch panel in the rack wired to a fiber patch field in an aggregation area.

      Cisco UCS can further reduce patching requirements depending on how it’s deployed. With Cisco UCS you have the option of wiring the blade enclosure to the fabric interconnect with a CX1 SFP+ copper cable with distances up to 7m. The CX1 cable is not patched, rather it is a point-to-point cable run. For example, it is entirely feasible to deploy a 6 rack Cisco UCS pod, where only the middle 2 racks containing the fabric interconnects require a fiber patch field, and all of the enclosures within the 6 rack pod are linked to the fabric interconnects with hand laid CX1 copper cables.

      Some data centers may not have the per rack power and cooling density to closely couple the enclosures with the fabric interconnect to take advantage of CX1 cables, and rather the enclosures will be more spread out at farther distances, thus requiring fiber connectivity from each enclosure to the interconnect, in which case each rack still has a fiber patch panel for the purposes of linking to the fabric interconnect (UCS Manager).

      Cheers,
      Brad

  7. JamesD says:

    Thanks for the useful info. It’s so interesting

  8. r maloney says:

    nice post,

    In the Q & A portion, the left margin is cut off. Perhaps my point of view is skewed or your post is leaning to the “left”.

    http://www.bradhedlund.com/2009/04/05/top-of-rack-vs-end-of-row-data-center-designs/

    excellent details on TOR and EOR – acronyms we throw around but don’t appreciate the nuances of.

    rob

  9. Steve M says:

    Thanks Brad. This is one of the most informative posts I’ve come across with regard to DC design. Very well written – it’s clear that you know your stuff! Thanks.

  10. Craig Weinhold says:

    Or, put another way, once N5K supports vPC, how often do you think you’ll recommend singly-homing your N2K’s?

    I rescind the rhetorical question above. After finally seeing an N2K dual-homed to an N5K vPC and observing that the access port config is not synchronized and/or consistency checked between the two N5Ks, it’s clear to me that no sane person should ever do this.

    • Brad Hedlund says:

      Craig,

      Correct, host port configurations are not synchronized between dual-homed N2Ks; however, configuration consistency is checked and if there is a violation the ports will not come up.
      You can use the command “show vpc consistency interface” to check the configuration consistency status.

      Cheers,
      Brad

  11. Craig Weinhold says:

    configuration consistency is checked and if there is a violation the ports will not come up

    True for the port-channel to the Nexus 2000, but not for the FEX host ports themselves. I.e., this is taken without error:


    ! N5K #1
    int eth 100/1/1
    switchport access vlan 5

    ! N5K #2
    int eth 100/1/1
    switchport access vlan 6

  12. michaelK says:

    Brad,

    I see above in a May post that you mention a 7m cable. Has Cisco certified the 7m cable yet for FEX? Do they now sell a 7m cable? Great website.

    mk

    • Brad Hedlund says:

      Michael,
      We have not certified a passive 7m cable and it doesn’t look like we will anytime soon. Right now we are looking at certifying an active 7m cable. The concern with the passive 7m cable is around bit error rates beyond what would be acceptable for FCoE. For just normal Ethernet traffic the passive 7m cable would be fine. If you think it would be a good idea for Cisco to certify a passive 7m cable for Ethernet only (no FCoE), post your thoughts here.

      Cheers,
      Brad

  13. Vijay Tyagi says:

    Brad it’s a very good article, we really appreciate your efforts.

  14. pedroq says:

    hey brad:
    the Nexus 2K is inflexible and does not fit every datacenter.. your point of view is directed at a datacenter that is new.. some legacy datacenters do need tagging and etherchannels. The Nexus 2K does not allow itself to be repurposed or positioned in another datacenter. It is completely attached to the Nexus platform. There is some value in that as one datacenter gets newer stuff, the 4948 can be re-purposed somewhere else… The Nexus 2148T can only extend to 12 switches or a max of 480 ports… Come on, why not let it scale on the Nexus 5000 to 56 Nexus 2K switches? You can take the same 4948 and attach it to the 5K, up to 56 switches… and therefore allow your redundant 5Ks to scale more..
    instead of buying a pair of 5Ks for every 12 Nexus 2148Ts…

    the only thing you get with a Nexus 5K that the 4948 does not have is stacking. You would have to manage a lot of 4948s vs a single 5K..
    that is it – after that you don’t get any more value… I want to see you argue that…

    so for those of you thinking about the Nexus platform.. take this into account.. The Nexus 2148T sucks… it is missing many features, and the Nexus platform is not there yet.. If you have a datacenter with many technologies and you need that traditional Cisco switch handling and feature set, you don’t want the Nexus 2148T. Cisco will be tying you to an inflexible technology and you can’t use those 2Ks anywhere else but with the Nexus 5K…

    I do see some value in the Nexus 7K and the next version of the 5K.. but I would opt out of the Nexus 2K.

  15. Ijlal Shah says:

    Brad nice article on ToR and EoR.

    Thanks

  16. Manice96 says:

    Good blog…

  17. Naveen says:

    Brad, nice article. Still I am not getting it… where do Fabric Extenders vis-à-vis Fabric Interconnects fit in the entire UCS architecture?

  18. Hello, Brad. Interesting article, just had a few comments to share.

    You state “Distributed architectures are more difficult to manage in large deployments because you would potentially have thousands of switches deployed throughout your data center space. ” This is the most valid problem with the ToR model (and why Cisco’s FEX solution is so different; FEX is really a ToR physical model with a chassis management style in an effort to address this issue).

    However, I think a number of other claims in the article are questionable. The TOR model doesn’t present additional security risks compared with a chassis model, they are about the same. The TOR model also isn’t any more limiting when it comes to expanding into new architectures; this is just as true for the TOR approach as for the other methods listed.

    But my biggest concern is the statements about energy management. You state that the TOR architecture is not energy efficient, since unused ports still consume electricity and produce heat that needs to be cooled. I have to disagree; the ToR model usually consumes much less power than a chassis based EoR solution. Given that the server connects with very short copper cables within the rack, there is more flexibility and options in terms of what that cable is and how fast of a connection it can support. For example, a 10GBASE-CX1 copper cable could be used to provide a low cost, low power, 10 gigabit server connection. The 10GBASE-CX1 cable supports distances of up to 7 meters, which works fine for a Top of Rack design. Likewise, the statement that LAN and switch gear can overheat due to the TOR problem has never occurred in my experience; perhaps this was more common before TOR switches had front to rear cooling.

    Some of this may be the result of trying to compress a complex subject into a short article with bullet points. Often you have to get into the details of a specific design to draw meaningful conclusions, and this doesn’t always generalize. Still, I appreciate your bringing up these points and contributing to the discussion, nice work overall.

    • Brad Hedlund says:

      Casimer,
      Perhaps you are confused? Where did I say the ToR model consumes more power? Where did I say the ToR model has more security risks? Moreover, there is a whole paragraph on low cost, low power 10GBASE-CX1. This article was written to be largely positive on the ToR design.

      I’ll give you the benefit of the doubt, perhaps you read some other article and misplaced your comment here?

      Cheers,
      Brad

  19. Steven King says:

    I just felt the need to drop by and say this is a fantastic article, and I love your well-articulated responses to the “challenges” presented by other commentors on this article. This article has been immensely helpful to me as I’m relatively new to networking and especially new to Nexus/Data Center stuff. Thanks!

  20. hoji says:

    please, standards of data center & rack?

  21. Victor Critelli says:

    Brad,
    Hello, my name is Victor Critelli and I have been reading your data center white papers on the web. Let me just say your white papers are very good. Very clear explanation regarding Top of Rack vs. End of Row data center designs. Would you have any white paper that goes deeper into the protocols that would be used (vPC, OSPF, HSRP, VRRP, STP, Port Channel, and EtherChannel)? How would you lay out a good Top of Rack vs. End of Row data center design based on protocols? I like drawings with a short write-up explaining why you chose one protocol over other protocols. Would you have any white papers on this subject?

    Thanks
    Vic

  22. jan devos says:

    Brad, I would like to hear your thoughts on a variant of EOR. What would you recommend in a setup with two redundant switches in both EORs (so four switches in total)? I dual homed each server to the two redundant Ethernet switches inside the same EOR rack. So the West half of the server racks connect to the 2 switches in the West EOR, and the same for the East. My motivation was: keep the cables short, by avoiding ‘long connects’ crossing the whole row. I’m convinced that the availability is not influenced by not connecting to two different EORs. A rack as a whole will not fail; it is the components inside a rack that may fail. And all these are fully redundant (power circuits, cooling and Ethernet switches). In your EOR drawing, you have only one switch per EOR, so you are obliged to dual home the server racks to both EORs. In my situation, I have the luxury of two redundant switches per EOR, so I do not dual home to the two EORs. And now that all is done and we are about to go into production, the customer is complaining about this. You can imagine my slight reluctance to recable it all. So, further arguments that my cabling approach is OK are welcome. But feel free as well to disagree with my above PoV. Thanks in advance.

    • Brad Hedlund says:

      Sounds fine to me, Jan. If the upside is a tidier cabling system, that’s a nice outcome. For the extremely cautious customer, the only risk would be that some sort of localized disaster happens at one end of the row such as a water leak, fire, or maybe a failed AC unit overheating the rack.

      • jan devos says:

        Thank you, Brad, for your fast reply. Do you see, in the DCs that you were/are exposed to, any deployments like mine, i.e. two EORs populated with two redundant switches each? And if so, what is your observation? Dual connect the server racks to both EORs, or to just one, leveraging the presence of two redundant switches per EOR? I now realize also that a dual MOR merges the best of both worlds: ‘rack redundancy’ + cable run optimization.
