Comparing fabric efficiencies of Fixed vs Chassis based designs, non-blocking

Filed in Fabrics on May 17, 2012

Recently I made the observation that Fixed switches continue to outpace Chassis switches in both power and space efficiency.  Simply put, with Fixed switches you can cram more ports into fewer RUs, and each port will draw less power when compared to Chassis switches.  Those are the indisputable facts.

A common objection when these facts are presented is that when you ultimately go to build a *fabric* of Fixed switches, that fabric will consume more total power and more total RU, and leave you with a lot more switches and cables to manage, when compared to a single Chassis *switch*. For example, one 384-port line-rate chassis switch (Arista 7508) consumes less power and RU than the (10) Dell Force10 Z9000 fixed switches you would need to build a 384-port fabric. While that is true, it is a purely academic comparison with little relevance to the real world. Who in their right mind runs their 384-port non-blocking fabric on *one* switch? Nobody. To carry this flawed logic out a bit further, the largest fabric you could have would be equal to the largest chassis you can find. Nobody wants that kind of scalability limit.

In the real world we build scalable fabrics with more than one switch. So with that in mind, let's look at various non-blocking fabric sizes, starting at 384 ports and going up to 8192 ports. For each fabric size, let's compare the total power and RU of the network switches. I'll make the observation that when you actually construct a real-world non-blocking fabric, designs with all fixed switches consume less power and less space than comparable designs with chassis switches. Another interesting observation is that non-blocking fabrics constructed with the fixed switches available today result in fewer switches and cables to manage, compared to the designs a chassis vendor might propose.

The chassis-based design uses a typical 1RU fixed switch as the Leaf (Arista 7050S-64), connecting to a Chassis switch Spine layer (Arista 7508 or 7504), something Arista would likely propose.

The all-fixed-switch design uses a 2RU switch, the Dell Force10 Z9000, at both the Leaf and Spine layers. The intention here is not to pick on Arista – quite the contrary – I'm using Arista as an example because, of the current crop of monstrous power-sucking chassis switches, Arista's are the most efficient (sucking the least). Kudos to them for that.

To get straight to the point, let's first look at the overall summary charts. The individual designs and data will follow for those interested in nit-picking.

Fabric Power Efficiency

The chart above shows that fully constructed non-blocking fabrics of all fixed switches are more power efficient than the typical design likely proposed by a Chassis vendor.  As the fabric grows the efficiency gap widens.  Given we already know that fixed switches are more power efficient than chassis switches, this data should make sense.

Fabric Space Efficiency

Again, the chart above shows a very similar pattern with space efficiency. A fully constructed non-blocking fabric of all fixed switches consumes less data center space than the typical design of Chassis switches aggregating fixed switches.

Designs & Data

As you look at the designs below, notice that non-blocking fabrics with fixed switches actually have fewer switches and cables to manage than non-blocking fabrics with Chassis switches — contrary to the conventional wisdom.

Above: (6) Leaf fixed switches, (4) Spine fixed switches interconnected with 40G and providing 384 line rate 10G access ports at the Leaf layer, and 96 inter-switch links.  (10) switches total, each with a max rated power consumption of 800W.
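For anyone who wants to reproduce the arithmetic, here is a minimal back-of-the-envelope sketch (Python) of the caption above. It assumes the Z9000 port layout used throughout this post (32 x 40G ports, usable as 128 x 10G) along with the 800W / 2RU figures quoted in the caption; it is a sanity check, not a design tool.

    # 384-port all-fixed design: each Z9000 leaf runs 64 x 10G access ports
    # and 16 x 40G uplinks; each Z9000 spine runs as a 32 x 40G switch.
    access_per_leaf, uplinks_per_leaf = 64, 16
    leaves = 384 // access_per_leaf     # 6 leaf switches
    isls = leaves * uplinks_per_leaf    # 96 x 40G inter-switch links
    spines = 4                          # per the caption: each leaf's 16 uplinks split evenly 4 ways
    switches = leaves + spines          # 10 switches total
    print(switches, isls, switches * 800, switches * 2)
    # -> 10 switches, 96 ISLs, 8000W max, 20RU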

Above: (12) Leaf fixed switches, (2) Spine chassis switches interconnected with 10G.  Each Leaf switch at 220W max power has 32 x 10G uplink, and 32 x 10G downlink for 384 line rate access ports, and 384 inter-switch links (ISL).  The (2) chassis switches are 192 x 10G port Arista 7504 each rated at 7RU and 2500W max power.
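The same sanity check for the chassis-spine version above, again using only the figures from the caption (7050S-64: 32 x 10G down / 32 x 10G up, 220W, 1RU; Arista 7504: 192 x 10G, 2500W max, 7RU):

    # 384-port chassis-spine design: 7050S-64 leaves, 7504 chassis spines.
    leaves = 384 // 32                     # 12 leaf switches, 32 access ports each
    isls = leaves * 32                     # 384 x 10G inter-switch links
    spines = isls // 192                   # 2 chassis at 192 x 10G ports each
    power = leaves * 220 + spines * 2500   # 2640 + 5000 = 7640W max
    space = leaves * 1 + spines * 7        # 12 + 14 = 26RU
    print(leaves, spines, isls, power, space)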

 

Above: (32) Leaf fixed switches, (16) Spine fixed switches interconnected with 40G providing 2048 line rate 10G access ports at the Leaf layer, and 512 inter-switch links.  (48) switches total, each with a max rated power consumption of 800W.

 

Above: (64) Leaf fixed switches each at 220W max power and 1RU, with 32 x 10G inter-switch links, and 32 x 10G non-blocking fabric access ports.  (8) Arista 7508 Spine chassis each with (6) 48-port 10G linecards for uniform ECMP.  Because each 11RU chassis switch is populated with 6 linecards of 8 possible, I’ve factored down the power from the documented max of 6600W, down to 5000W max. (72) total switches.

Above: (64) Leaf fixed switches, (32) Spine fixed switches interconnected with 10G providing 4096 line rate 10G access ports at the Leaf layer, and 4096 inter-switch links.  (96) switches total, each with a max rated power consumption of 800W.

Above: (128) Leaf fixed switches each at 220W max power and 1RU, with 32 x 10G inter-switch links, and 32 x 10G non-blocking fabric access ports.  (16) Arista 7508 Spine chassis each with (6) 48-port 10G linecards for uniform ECMP.  Because each 11RU chassis switch is populated with 6 linecards of 8 possible, I’ve factored down the power from the documented max of 6600W, down to 5000W max. (144) total switches.

Above: (128) Leaf fixed switches, (64) Spine fixed switches interconnected with 10G providing 8192 line rate 10G access ports at the Leaf layer, and 8192 inter-switch links.  (192) switches total, each with a max rated power consumption of 800W.

Above: (256) Leaf fixed switches each at 220W max power and 1RU, with 32 x 10G inter-switch links, and 32 x 10G non-blocking fabric access ports.  (32) Arista 7508 Spine chassis each with (6) 48-port 10G linecards for uniform ECMP.  Because each 11RU chassis switch is populated with 6 linecards of 8 possible, I’ve factored down the power from the documented max of 6600W, down to 5000W max. (288) total switches.
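To tie the larger designs together, here is a sketch that reproduces the totals for the 2048, 4096 and 8192 port fabrics above, again using only the per-box figures quoted in the captions (the 384-port pair, which uses a slightly different split, is worked through earlier). The 5000W number for a 7508 carrying 6 of its 8 possible linecards is my derating assumption from above, carried over unchanged, and the power-of-two spine counts mirror the uniform-ECMP choice noted in the captions.

    # All-fixed design: Z9000 leaves with 64 x 10G access ports each,
    # spine count = half the leaf count (as in the 2048/4096/8192 designs above).
    def fixed_fabric(ports):
        leaves = ports // 64
        spines = leaves // 2
        boxes = leaves + spines
        return boxes, boxes * 800, boxes * 2          # switches, max watts, RU

    # Chassis design: 7050S-64 leaves (32 down / 32 up, 220W, 1RU) plus
    # 7508 spines (6 x 48 = 288 x 10G ports, 5000W derated, 11RU).
    def chassis_fabric(ports):
        leaves = ports // 32
        uplinks = leaves * 32
        spines = 1
        while spines * 288 < uplinks:                 # power-of-two spine count, so the
            spines *= 2                               # 32 uplinks per leaf spread evenly
        return leaves + spines, leaves * 220 + spines * 5000, leaves * 1 + spines * 11

    for ports in (2048, 4096, 8192):
        print(ports, fixed_fabric(ports), chassis_fabric(ports))
    # 2048: fixed (48, 38400W, 96RU)    chassis (72, 54080W, 152RU)
    # 4096: fixed (96, 76800W, 192RU)   chassis (144, 108160W, 304RU)
    # 8192: fixed (192, 153600W, 384RU) chassis (288, 216320W, 608RU)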

Conclusion

When building a non-blocking fabric, a design of all fixed switches scales with better power and space efficiency, and with fewer switches and cables (or at worst the same number), when compared to designs with chassis switches.

Something wrong with my data, designs, or assumptions?  Chime in with a comment.

Follow-up post:

Cheers,
Brad

About the Author

Brad Hedlund is an Engineering Architect with the CTO office of VMware’s Networking and Security Business Unit (NSBU), focused on network & security virtualization (NSX) and the software-defined data center. Brad’s background in data center networking begins in the mid-1990s and spans a variety of roles: IT customer, systems integrator, architecture and technical strategy at Cisco and Dell, and speaker at industry conferences. CCIE Emeritus #5530.

Comments (30)


  1. krisiasty says:

    I think it’s still an academic discussion. If we are talking about real world, do you think you really need such a huge NON-BLOCKING fabric?

    • Brad Hedlund says:

      Do people build non-blocking fabrics in the real world? Absolutely.
      By the way, the data for 3:1 oversubscribed fabrics is not much different (coming in a subsequent post).

      • krisiasty says:

        Yeah, sure they do, but the question is: how large? Do you really need 8 thousand servers sending 10Gbps of traffic to each other? Probably not. Nor 4 thousand, or even 1 thousand. The second question is: even though some people do build large non-blocking fabrics, do they really need them? What for?

  2. Actually, if you stick with the single-Trident-chip 64-port switches like the Arista 7050S-64, you can do the same sort of Clos fabric with just 125W per switch. You’d need 12 edge switches and 6 spine switches. The power works out to <6W per port "typical" according to the Arista spec sheet. Again you have an issue of 32 ports divided by 6 spine switches, but since it's statistically non-blocking, that shouldn't matter too much, and the ECMP won't care if some of the links terminate on the same spine switch. You can argue that there would be a cabling mess, but I would suspect it would be less so than trying to run 384 fibers into one rack from all over the datacenter. With the 1U solution, the edge ports are distributed short copper, without expensive optics. MPO ribbon fibers and patch panels would be used from leaf to spine just as they would with 40G connectors to cut down on the number of cables needed, and you would probably divide the spine switches into just two locations in the DC. The real benefit of the 1U solution is being able to adjust the number of leafs and spines you need based on actual traffic patterns, oversubscribing at the edge where that makes economic sense. You don't have that same flexibility with $350k+ chassis solutions (remember you need *two* of those beasts for redundancy).

  3. … also, it’s a bit misleading to compare # of cables when some gear you use has 40 Gbps uplinks while the other has only 10 Gbps links. But I get it – Z9000 is a great spine switch :D

    • Brad Hedlund says:

      Hi Ivan, :-D
      Point taken. However, the relevant message here is that Fixed switches are faster to market with higher speed interfaces than chassis switches and customers can benefit from that.

  4. Mark Berly (@markberly) says:

    As we have discussed before, fixed switches have a place in network design, as they can offer low power and a small size – there are caveats, as with all platforms/designs. But if you are going to focus on power and footprint, then why not use the platform with the lowest power and highest density, the Arista 7050 series? This would allow you to build networks similar to the ones described above in about half the RU and power.

    • Brad Hedlund says:

      Hi Mark,

      why not use the platform with the lowest power and highest density, the Arista 7050 series? This would allow you to build networks similar to the ones described above in about half the RU and power.

      I’ll have to disagree. For example, let’s look at building a 2048 non-blocking fabric.

      2048 non-blocking fabric w/ all 1RU Arista 7050 switches:
      (192) Arista 7050 switches — 64 ToR + two parallel Leaf/Spine of 64 switches each.
      (4096) inter-switch links
      Arista Max power = 192 x 220W = 42.2KW

      2048 non-blocking fabric w/ all 2RU Dell Force10 Z9000 switches:
      (48) Dell Force10 Z9000 switches
      (512) inter-switch links
      Dell Max power = 48 x 800W = 38.4KW

      Did I get that wrong?

      Cheers,
      Brad

      • Simon Leinen says:

        Well, if it’s really 800W per Z9000, then yes, you got it wrong: 96 x 800W = 76.8kW.

      • Brad Hedlund says:

        Wait a minute … I think I did get that wrong…

        You could have (64) Leafs 7050s connected with 32-way ECMP to (32) Spine 7050s.
        That would give you 2048 non-blocking access ports at the Leafs.
        So, that’s (96) Arista 7050s @ 220W = 21.1KW

      • Mark Berly (@markberly) says:

        As you indicated, you can build a non-blocking 2048-port 10GbE design with Arista 7050s using 96 switches, or a Dell Z9000 design using 48 switches. Which would yield:

        Arista 7050 (power): 96*220W = 21.1KW
        Dell Z9000 (power): 48*800W = 38.4KW

        To my earlier point, the power draw is about half (46% less), but I was wrong on the space required, as they are both 96RU. I guess we split this one: point Brad, point Mark ;-)

  5. krisiasty says:

    There is one thing missing here: as far as I know, Arista switches do not support any form of L2 multipathing except MLAG, which allows you to pair two switches (and consumes at least two ports on each switch for the MLAG peer-link). So have I missed something, or can you not build a single large L2 non-blocking fabric using Arista switches? The same question is valid for Dell’s F10 gear and any other vendor.
    You just can’t grab a bunch of switches and count ports and power usage, ignoring how it works and what functionality is supported…

    • Brad Hedlund says:

      Yes, such a fabric constructed today will be Layer 3. If your application is Big Data, Web 2.0, HPC, etc. no problem. If your application is an IaaS cloud, you can use a network virtualization overlay to provide transparent L2 segments for the VMs — this is here today in early form but still evolving. If an overlay is not your cup of tea, there will be Layer 2 multipathing with TRILL in the coming year or two.

      In a nutshell, the applications and Layer 2 services either work on Layer 3 scale-out networks today or are moving in that direction fast. So this is a somewhat forward-looking post.

      • krisiasty says:

        So in traditional DC networks where you need to connect every server to two different ToR switches (and both must be in the same VLAN) this L3 fabric is useless? Yeah, I know… there will be TRILL… in a year or two… maybe… but what about building scalable fabric NOW?

        • Brad Hedlund says:

          Why not? There’s nothing stopping you from taking two ToR switches, connecting them together at Layer 2, and redundantly attaching your servers or appliances in the rack.

          Slide #12 depicts that here: http://www.bradhedlund.com/2012/05/22/architecting-data-center-networks-in-the-era-of-big-data-and-cloud/

          Cheers,
          Brad

          • krisiasty says:

            But then you have to connect both switches via an L2 trunk… with how many links? And how about peer-links for MLAG? I remind you we are still talking about a non-blocking fabric…
            I can hardly imagine what this kind of network would look like without breaking the “non-blocking” rule… And how many additional switches, links, power and RUs would be needed?

          • Brad Hedlund says:

            If you absolutely need servers attached via MLAG you just reduce the number of server ports at each ToR. When you add requirements to a design sometimes there are tradeoffs. This would be no different.

          • Brad Hedlund says:

            For example, you would take 2 x 10G for the VLT peer-link between a ToR pair, which would leave you with 30 servers per ToR pair (instead of 32). This would be if the S4810 was the ToR — Z9000 does not support VLT (yet).

  6. Mark Berly (@markberly) says:

    Looking at the 8192-port design, your spine layer is not going to work, as the Z9000 does not support 64-way ECMP. With 32-way ECMP you would need to add another layer, so the number of boxes would jump to ~400 to build a multi-tier spine (real quick back-of-the-napkin math), making the power and space numbers 800RU and 320KW (39W per port).

    Disclaimer: I need to sit down and do the actual math to get the number of switches exact

  7. Tarun says:

    Thanks, a good post to understand the different design options & pros & cons of both.

    -Regards,
    -Tarun

  8. Brad,

    great post. I’m a fan of having small boxes and box-to-box redundancy. Some things that came to my mind with fixed switches:
    – with fewer ports per box, the number of devices to take care of rises
    – there is mostly no ISSU capability in fixed switches, but if we build an L3 fabric, it would be nice to use something like the overload bit in IS-IS to ease the upgrade phase, so that the switch is taken out of routing before the reboot. This is still an issue with L2 fabrics, though.

    Cheers,
    Matt

  9. Jason says:

    Very interesting read! I’m looking into networks for MeerKAT (http://en.wikipedia.org/wiki/MeerKAT) which will use an all-Ethernet interconnect (with lots of 40GbE) and we’re just now evaluating the chassis vs 1RU options, and had even identified the same components mentioned here. It’s a relief to see others thinking about non-blocking systems with thousands of ports. The one aspect that isn’t discussed is cost. 40GbE transceivers are pricey and significantly increase the total cost of a distributed fabric solution.

    • Brad Hedlund says:

      Jason,
      If you look at what really counts, price per Gb of bandwidth, 40GbE transceivers actually *lower* the cost of the solution. For example, a typical 10GbE optic is about $1,000. A typical 40GbE optic is about $1,800.

      For 40GbE of bandwidth, (1) 40GbE optic is 55% cheaper than (4) 10GbE optics. In a large fabric those savings add up *fast*, really fast.
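      (Working through those example prices: 40G of bandwidth costs 4 x $1,000 = $4,000 with 10GbE optics versus $1,800 with a single 40GbE optic, i.e. roughly 55% less, or about $2,200 saved per 40G of inter-switch bandwidth.)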

      40GbE fabric for the win!

      ;-)

      Cheers,
      Brad

      • Jason says:

        Yeah, sorry I didn’t make my point clear: you potentially need fewer transceivers (40GbE, 10GbE or any other) in a monolithic chassis solution than in a distributed fabric for some access-port-count configurations. Especially for large datacentres (where the interswitch links need to be optical because 7m SFP copper cables don’t reach between all racks along the cable trays), this is a significant cost. I’m wondering if this was a factor in your decision to go for a distributed rather than chassis solution.

        As an obvious example, considering 10GbE, if you don’t need TOR switches and can afford point-to-point links everywhere then you can get 384 ports on the Arista 7500 without any interswitch links. The same setup with a distributed setup needs 384 additional links… a $800k hike in the total cost of the solution assuming your price above of $1000/transceiver at two transceivers per link (multimode 10GbE here seems closer to $500/transceiver). Sure, you can reduce the interswitch link count by a factor of four by using 40GbE, but…

        You do raise an interesting question WRT cost per Gbps. Pricing here for the 40GBASE-LR4 modules is *EIGHT* times that of the 10GBASE-LR units, making any 40GbE solution twice as expensive as a comparable 10GbE one (ignoring fibre costs). Hopefully this will change as 40GbE uptake increases, because we have other constraints driving us towards 40GbE. We need long range transceivers to each of our antennas, so these are the prices I have in front of me. Is the picture different for multimode? Or perhaps it’d be more cost effective to use 10GBASE-T, where the interconnect is basically free and I’m hopeful that we’ll start seeing LOM onboard 10GbE ports on next-gen servers, which’d mean no additional cost for end nodes either.

        I’m kinda playing devil’s advocate here because I’ve actually come to the same conclusion as you and am leaning strongly towards a scalable, distributed setup mainly because of all the other benefits that it brings (improved reliability with no single point of failure, scalability, flexibility etc etc). I do just want to get all my ducks in a row!

        • Brad Hedlund says:

          Jason,
          I’ll just re-iterate what you said in your last paragraph… If somebody wants to build a 384-port non-blocking fabric, who in their right mind would connect it all to one chassis switch and call it a day? Perhaps if it was just an experimentation lab on a tight budget… But an Enterprise mission-critical app, such as, for example, oil & gas exploration? I don’t think so. You wouldn’t want problems with one switch taking the whole application offline.

          And what happens when your fabric needs that 385th port? Rip out that chassis and put in bigger box? Good luck explaining that one to management.
