Construct a Leaf/Spine design with 40G or 10G? An observation on scaling the fabric.

Filed in Fabrics, Routing, TRILL on January 25, 2012 | 45 Comments

Should you construct a Leaf/Spine fabric with 10G or 40G?

In this post I’ll make the simple observation that using 10G interfaces in your leaf/spine fabric scales to more servers than using 40G interfaces, all with the same hardware, bandwidth, and oversubscription.

Let’s suppose you’ve decided to build a Leaf/Spine fabric for your data center network with the current crop of 10G/40G switches today that have QSFP ports.  Each QSFP port can be configured as a single 1 x 40G interface, or 4 x 10G interfaces (using a breakout cable).   With that option in mind, does it make more sense to construct your Leaf/Spine fabric with N * 40G? Or instead should you use N * (4 x 10G)?  Well, as always, it depends on what you’re trying to accomplish.

Here’s a simple example.  I want to build a data center fabric with my primary goal of having, say, 1200 10G servers in one fabric with 2.5:1 oversubscription.  I also want the ability to seamlessly expand this fabric to over 5000 10G servers as necessary without increasing latency or oversubscription.

For my example I’ll use the Dell Force10 Z9000 as my Spine switch, and the Dell Force10 S4810 as my Leaf switch in the top of rack.  I’ll have 40 servers per rack connected to the S4810, and I’ll use the 4 x QSFP uplink ports to attach upstream to the Z9000 Spine layer of my fabric.  Let’s look at two design choices, one with 40G, and another with 10G.

40G Leaf/Spine Fabric

If I configure each QSFP port in the fabric as a single 40G interface, how wide will I be able to scale in terms of servers?

Each of my Z9000 Spine switches has 32 ports of 40G.  Each S4810 Leaf is attached to the Spine with 4 ports of 40G.  Every Leaf switch is connected to every Spine.  Therefore, the number of connections used for uplinks from each Leaf determines the number of Spine switches I can have, and the number of ports on each Spine switch determines the number of Leaf switches I can have.  In this case, that means a maximum of 4 Spine switches and 32 Leaf switches.

In building this fabric with 40G interfaces, the largest I can go is 1280 10G servers (32 Leaf switches x 40 servers each) at 2.5:1 oversubscription.  That certainly accomplishes my initial scale target of 1200 servers, but I’m stuck there.  Before I can get to my 5000-server stretch goal I’ll need to re-architect my fabric.
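
To make the arithmetic explicit, here’s a quick back-of-the-napkin sketch (in Python, purely illustrative; the numbers come straight from the port counts above):

```python
# 40G Leaf/Spine fabric: rough scale math (illustrative sketch, not a design tool)
spine_ports = 32        # Z9000: 32 x 40G QSFP ports
leaf_uplinks = 4        # S4810: 4 x QSFP uplinks, each run as 1 x 40G
servers_per_leaf = 40   # 40 x 10G servers per rack

max_spines = leaf_uplinks                   # every Leaf connects to every Spine
max_leafs = spine_ports                     # every Spine port feeds one Leaf
max_servers = max_leafs * servers_per_leaf  # total 10G server ports

oversub = (servers_per_leaf * 10) / (leaf_uplinks * 40)  # downlink Gbps / uplink Gbps

print(max_spines, max_leafs, max_servers, oversub)  # 4 32 1280 2.5
```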

Now, before we start re-architecting our fabric, let’s see what would have happened had we decided to configure each QSFP port as 4 x 10G interfaces, rather than our first choice of 1 x 40G.

10G Leaf/Spine Fabric

If I configure each QSFP port in the fabric as 4 x 10G interfaces using an optical breakout cable, how wide will I be able to scale in terms of servers?

Each of my Z9000 Spine switches now has 128 ports of 10G.  Each S4810 Leaf is attached to the Spine with 16 ports of 10G.  That allows for up to 16 Spine switches and 128 Leaf switches.

In building this fabric with 10G, the largest I can go is 5120 10G servers at 2.5:1 oversubscription.  Shazaam!  That did it.  I can initially build this fabric to 1200 servers and seamlessly scale it out to over 5000 servers, all with the same bandwidth, latency, and oversubscription.
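
Same napkin math as before, just with the QSFP ports broken out (again a purely illustrative Python sketch):

```python
# 10G Leaf/Spine fabric: same hardware, QSFP ports broken out to 4 x 10G
spine_ports = 32 * 4     # Z9000: 128 x 10G ports
leaf_uplinks = 4 * 4     # S4810: 16 x 10G uplinks
servers_per_leaf = 40

max_spines = leaf_uplinks                   # 16 Spine switches
max_leafs = spine_ports                     # 128 Leaf switches
max_servers = max_leafs * servers_per_leaf  # 5120 servers

oversub = (servers_per_leaf * 10) / (leaf_uplinks * 10)

print(max_spines, max_leafs, max_servers, oversub)  # 16 128 5120 2.5
```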

The four times greater scalability of this design was enabled by simply choosing to build my Leaf/Spine fabric with 10G interfaces, rather than the obvious choice of 40G.  Compared to the previous 40G design, all of the hardware is the same.  And the bandwidth, latency, and oversubscription are all the same too.

The magic boils down to two simple principles of scaling a Leaf/Spine fabric.  Port count, and port count.

  1. The uplink port count on the Leaf switch determines the max # of Spine switches.
  2. The Spine switch port count determines the max # of Leaf switches.

Each principle works independently.  If you have Leaf switches with lots and lots of uplinks connected to a Spine with a low port count, you can get some scale.  If you have only a handful of uplinks in your Leaf switches connecting to a Spine with lots and lots of ports, you can get some scale there too.

But the two principles work best in combination.  If you have Leaf switches with lots and lots of uplinks connected to Spine switches with lots and lots of ports, you get lots of scale.
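
If you prefer to see the two principles as one formula, here’s a tiny generalized sketch (the function name and signature are mine, purely for illustration) that reproduces both designs above:

```python
def fabric_scale(leaf_uplinks, uplink_gbps, spine_ports, servers_per_leaf, server_gbps=10):
    """Two-stage Leaf/Spine: the Leaf uplink count bounds the Spines,
    the Spine port count bounds the Leafs (and therefore the servers)."""
    max_spines = leaf_uplinks
    max_leafs = spine_ports
    max_servers = max_leafs * servers_per_leaf
    oversub = (servers_per_leaf * server_gbps) / (leaf_uplinks * uplink_gbps)
    return max_spines, max_leafs, max_servers, oversub

print(fabric_scale(4, 40, 32, 40))    # 40G design:  (4, 32, 1280, 2.5)
print(fabric_scale(16, 10, 128, 40))  # 10G design: (16, 128, 5120, 2.5)
```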

Housekeeping and Caveats

Using an optical breakout cable to get 4 x 10G from a single QSFP port will likely reduce the supported distance of each 10G cable run.  A normal 10G SFP+ link on multi-mode fiber can go 300 meters, but you may only be able to go 100 meters with QSFP and optical breakout cables. Be sure to verify that fact and see how it may impact the max distance you can have between your Leaf and Spine switches.  This fact alone may put a limit on your fabric scalability, be it 10G or 40G.

Yep, these Leaf/Spine fabrics today are Layer 3.  The switches form a standard routing protocol relationship with each other, such as BGP or OSPF.  Today, that works well for applications such as Hadoop, Web and Media applications, HPC, or perhaps an IaaS cloud using network virtualization with overlays.  Moving forward, you will start to see network vendors supporting the TRILL standard, at which point you’ll be able to build the same Leaf/Spine architecture to support a Layer 2 topology between racks.  With TRILL, you’ll have the freedom to choose different network vendors at the Leaf and Spine layers, rather than being locked in with a vendor-specific proprietary protocol or architecture (e.g. Cisco FabricPath, Brocade VCS, and Juniper QFabric).

You can also scale the server count in a Leaf/Spine design by using the Leaf as a connection point for your top of rack layer, rather than using the Leaf itself as the top of rack.

Yep, in the 10G fabric you have 4 x more interfaces to configure, 4 x more cables, 4 x more routing protocol adjacencies, 4 x more infrastructure subnets, and so forth.  For you, that might be a problem or no big deal at all.

Why would you build a Leaf/Spine design anyway?  Well, because you might like the fact that your fabric “Core” is striped across lots of individually insignificant pizza boxes (think RAID), rather than the typical approach of anchoring everything on two expensive, mainframe-like, power-sucking, monstrous chassis.

Have something you want to add?  Chime in with a comment.

Cheers,
Brad


Disclaimer: The author is an employee of Dell, Inc. However, the views and opinions expressed by the author do not necessarily represent those of Dell, Inc. The author is not an official media spokesperson for Dell, Inc.

About the Author

Brad Hedlund is an Engineering Architect with the CTO office of VMware’s Networking and Security Business Unit (NSBU), focused on network & security virtualization (NSX) and the software-defined data center. Brad’s background in data center networking begins in the mid-1990s with a variety of experience in roles such as IT customer, systems integrator, architecture and technical strategy roles at Cisco and Dell, and speaker at industry conferences. CCIE Emeritus #5530.

Comments (45)


  1. colin says:

    The other major benefit of staying with 10G for all ports is that there are no speed mismatches within the fabric, which allows cut-through forwarding to work, lowering the latency.

  2. Richard says:

    That’s a LOT of cable to run. Add a leaf switch 100m away (10G x 16)? I wonder how long that will take to get up and running; probably a lot less time than adding a spine (10G x 128).

  3. Nice write up, Brad.

    One other point I’ll bring up that will impact overall cost.

    Overall cost of the 10G vs. 40G optics. I can’t speak for Dell pricing, but for Cisco a 10G MMF SFP+ (SFP-10G-SR) is $1495 and a 40G QSFP (QSFP-40G-SR4) is $3995. Both are list prices. So, when deploying a network there would be roughly a 50% increase in cost to use 10G optics on each uplink (~$6k for 4 x SFP+ vs. $4k for one QSFP). This could be significant when talking about 128 leafs and 16 spines. Suppose there will be trade-offs as usual – cost vs. scalability, etc.

    Just another housekeeping point ;)

    Regards,
    Jason

  4. Petr Lapukhov says:

    Brad,

    You play a nice Jedi mind trick here, by taking the SAME box and effectively reducing the port count from 128 to 32, thus limiting the number of podsets (leafs) in the fabric ;) So I believe it would be more accurate to put a disclaimer that this is a BRCM Trident-specific design “limitation” (4x10G->40G).

    Now imagine that you have a 128 40G port box (chassis, of course :) – and all port-based scaling limitations have gone away. Not here right now? I believe something like that would come out this year, as fabric capacities for “large” boxes permit that :)

    Of course, this naturally brings us back to the old discussion of large crossbars vs. Clos :)

  5. David Rodgers says:

    I’m not sure the idea that distributed pizza boxes are less power hungry than “power sucking monstrous chassis” holds up when you’re talking about the 144 pizza box switches required to do the same number of 10G ports as 12 chassis.

    • Brad Hedlund says:

      David,
      Are you saying that you can build a fabric supporting 5120 10G servers with just 12 chassis? Really?
      How would you propose cabling all 5120 servers to your “12 chassis”?

  6. Mark says:

    Thank you Brad for the article. I am confused on one point. I think I am missing something, because it appears to me that you can also have the scaling benefits without using 10gb/breakout between the leaf and the spine. I am thinking if you have 16 spines and 128 leafs, you would still be able to scale to a guest count of 5120. My math:

    16 spines with 32 40gb ports = 512 40gb ports available for leafs.
    512 40gb ports available for leafs divided by 4 40gb ports on each leaf = 128 leafs.
    40 guests per leaf times 128 leafs = 5120 guests.

    Help me find what I’m missing here please :)

    • Brad Hedlund says:

      Hi Mark,
      A fabric should provide more than just connectivity for all hosts, it should go a step further in providing uniform latency and bandwidth across the fabric from any host to any host. This enables the flexibility to place workloads anywhere in the fabric without concern for network performance. That’s one difference between a “fabric”, and a “network”.

      Your design is more like a “network”. Yes, it provides connectivity. But it doesn’t provide the uniform latency and bandwidth properties of a real “fabric”.

      Each Leaf is only connected to 4 of the 16 Spines. As a result, you’ll have some hosts that will need to make several Leaf-Spine-Leaf hops to communicate, whereas other hosts will only be one Spine hop away. Non-uniform latency.

      Similarly, you’ll have some rack pairs that can only communicate through one Leaf uplink, while other rack pairs can communicate with more than one Leaf uplink. Non-uniform bandwidth.

      As a result, application performance will vary depending on where workloads are placed in the network, complicating the provisioning model and partitioning the resources.

      To provide the uniform bandwidth and latency people will expect from your fabric, make sure your Leafs are connected to all Spines with the same amount of bandwidth.

  7. John G. says:

    If we zoom out a bit on the 40G uplink scenario specifically, we also have constraints for the L3 hop out of this fabric and this is even more limiting.

  8. Alex says:

    Brad, are you able to deliver uniform latency and bandwidth across the fabric with a Z9000 as the Spine switch? Is the Z9000 not a Trident network in a box, meaning port-to-port latency will not be consistent across all 32/128 ports of each Spine switch?

    • Brad Hedlund says:

      Yep, in the Z9000 there will be a small difference in latency depending on the ingress/egress port pairs. That’s no different than a chassis switch having lower latency within a linecard vs. between linecards. When you’re building a fabric that can scale, this is generally acceptable. As for bandwidth, the Z9000 is line rate on all ports.

      • Garry Shtern says:

        The alternative to the Z9000, which is power hungry, takes up 2U of rack space, and has variable latency depending on which ports are communicating, is the Mellanox SX1036 switch. It is a 1U switch that has 36 40G ports and is rated at 200W. Granted, you can’t break it out to 144 10G ports (PHY limitations), and there’s no L3 support, but for the price (32k MSRP), you can’t beat it.

        Using the same Trident/Trident+ leafs, which you can get from Extreme, Cisco, Arista, Juniper or IBM, you can scale your setup to 1,728 nodes at substantial cost savings.

        Also, using optics for uplinks is not an ideal approach. Since you are within the same data center (presumably), your leafs and spines are within 150m of each other. If so, you can just get pre-terminated QSFP+ cables from Mellanox (or equivalent). The pre-terminated cable will run you less than $1,000, whereas the 40G LR/SR can easily cost 3k, which means you are saving at least 5k per each 40G uplink.

        • Brad Hedlund says:

          Garry,
          Without L3 or TRILL in the Mellanox SX1036, I don’t see how you can possibly use that switch in the Spine or Leaf layer. So, yeah, the price is great but it’s worthless if you can’t use it to build the network you want.

          • Garry Shtern says:

            I don’t see a need for L3 in the spine at all. All of your leafs function as routs, with spine being used as a L2 transit. The simplest solution is to use BGP dynamic peer-groups to avoid configuring individual peers, but with a little ingenuity, one can use OSPF or EIGRP (in case of Cisco), as well. ECMP takes care of distributing the load between your multiple spine connections, so TRILL is not necessary, either.

            I grant you that this might tax the CPU of your leafs a bit, considering each will have to maintain multiple sessions to every other leaf, but nothing substantial.

          • Brad Hedlund says:

            Yeah, I suppose you *could* do that. At the cost of additional complexity, and limited scalability. Each Leaf will need 140 routing adjacencies, and if you want fast convergence you’ll need sub-second timers for each one — if the ToR control plane CPU can even handle it. If your spine had routing capabilities (i.e. L3 or TRILL) you could add a third tier if needed (ToR > Leaf > Spine), but this would not work in your L2-only spine running STP. Other than that, yeah, I accept your argument — you don’t *need* routing in the Spine.

          • Simon says:

            Does the Mellanox SX1036 now support Layer 3?

  9. Anton N says:

    Hi Brad, nice reading.
    Could you please clarify the term “2.5:1 oversubscription”? If all leaf/spine switches are L3, can a leaf box simultaneously use all 16 10G uplinks to route traffic to another leaf box? That’s the case where all servers of one leaf want to talk with all servers of another single leaf. If I see it right, oversubscription is more a “statistical” term than a “real” one in this example.
    Thanks in advance!

    • Brad Hedlund says:

      Anton,
      The oversubscription here is calculated from the fact that a Leaf switch has 40 x 10G server ports that will share 16 x 10G uplinks to the rest of the fabric.
      At the Leaf, there are 2.5 server ports for every 1 uplink.

  10. Derwin Warren says:

    While the 10G design scales to support more servers, it still requires a lot of cable runs, and when dealing in the container realm, this can be a major headache. Personally, I want vendors to support massive numbers of 40G ports in the spine. Say, a modular chassis (9-12 slots) with at least 32-port 40G modules, though I would love upwards of 64-port 40G modules. If we can get this today in a 1RU footprint, this should be easy. I was dealing with a large scale Hadoop containerized solution (920 nodes per container across four containers at 1:1 over-subscription…eventually recommended 3:1) dealing with 10G ports and it was a pain from a cabling perspective. I was looking at a minimum of 128 outbound 10G cables per container (for 3:1). Massive 40G port counts in the spine, and in the future 100G (once the price decreases), are the answer. Still, very good post.

    • Brad Hedlund says:

      Derwin,
      Thanks for sharing your experience and perspective. I’m in full agreement with you here. The good news is the 40G port density you want is not that far out. The commercial silicon vendors such as Broadcom, Fulcrum, Marvell, and others are already surpassing in-house silicon development (Cisco, Brocade, Juniper).

      Cheers,
      Brad

  11. Marius Purice says:

    Hi, Brad,

    What if we turn all 40G ports into 4x10G ports and we use only Z9000 switches to build the leaf-spine network? This would allow for 64 spine switches, 128 leafs and 8192 non-blocking 10G ports for the servers. Could you, please, comment on the advantages/drawbacks of such a deployment? I know cabling would be a real challenge :).

    Regards,

    Marius Purice

    • Brad Hedlund says:

      Hi Marius,

      Absolutely, that is a perfectly valid design (Z9000 top of rack, Z9000 spine).
      Couple of things to keep in mind:

      1) With Z9000 top of rack you’ll need to use a 5m QSFP-to-SFP copper breakout cable for the 10G server connections. Just make sure the 5m distance is not going to be an issue.

      2) The Z9000 is about 5x the list price of the S4810, for 2x the port density. This means there is a finite window of fabric size where Z9000 at the top of rack is lower cost than S4810 at the top of rack. And that fabric window is 4096 to 8192 *non-blocking* server ports. If your fabric is either larger or smaller than that, it makes more financial sense to have S4810s at the top of rack. Inside that window, Z9000 top of rack works financially because it keeps the fabric to two stages (Leaf-Spine). When the fabric gets larger than 8192 non-blocking ports you’ll need a three stage fabric no matter what (ToR-Leaf-Spine), and having Z9000-Z9000-Z9000 at all three stages will cost more than having S4810-Z9000-Z9000.

      Make sense?

      Cheers,
      Brad

  12. Ryan Malayter says:

    Is Dell/Force10 shipping or beta-testing standards-compliant TRILL for FTOS yet? I know the S4810 and Z9000 support it in hardware, as do all other Trident+ switches. But nobody seems to actually have the software part of TRILL working yet ;-)

    • Brad Hedlund says:

      Is Dell shipping TRILL in FTOS? No. As for when Dell plans to do so, I can’t disclose that.
      It’s *my opinion* (not necessarily Dell’s) that TRILL will be largely irrelevant anyway by the time its commercially available. You have Layer 2 becoming less relevant in the data center, and you have higher density switches with which you can build pretty large L2 domains using well understood multi chassis LAG technology (VLT in FTOS).

      • Ryan Malayter says:

        I agree L2 should become less important over time. Hopefully VXLAN/NVGRE/whatever may help with this issue, letting us do all layer-3 on the physical network. But they make little sense right now as they are single-vendor (or in the case of NVGRE, no-vendor) options that really only work in PowerPoint.

        But the elimination of spanning tree and all of its nightmares is a noble goal. STP-related outages still crop up even when you “do it right” with BPDU guard and other stability features. See http://blog.ioshints.info/2012/04/stp-loops-strike-again.html

        Even if the L2 domain were just 4 switches, implementing TRILL would still be a very valuable tool.

  13. Carlos Ribeiro says:

    I love the leaf and spine concept. But in my opinion, L2 will still be necessary for the time being. TRILL, however, would not be necessary if we just agreed on a simple EoIP (Ethernet over IP) standard.

    Packing an Ethernet frame inside IP may seem foolish. But consider this:

    1. It can be done very efficiently on hardware (both packing *and* unpacking).
    2. It allows one to leverage current L3 routing protocols, right now.
    3. You could map VLANs or individual MACs to some specific IP address.

    The end result would be similar in some ways to an MPLS network, without the same advanced traffic engineering capabilities, but also without all the extra complexity, and using only well known tools. I may be missing something, but I honestly don’t know why it wasn’t done before. There are a few potential issues – for example, one would need a few extra bytes to handle jumbo frames in the core (fragmentation couldn’t be allowed, so as not to kill performance). But besides that there aren’t any explicit technical reasons why *not* to do it. It seems more like a design philosophy or marketing issue.

  14. Sebastian Maniak says:

    Hi Brad,

    Have you built leaf and spine and introduced virtualscale on the S4810s before? Do you see any issues with doing this?

    • Brad Hedlund says:

      Hi Sebastian,
      Did you mean VLT on the S4810 Leaf nodes? Yes, you can do that. Best practice would be to have all servers attached with LAG to the VLT. The spine will deliver traffic to any one of the two Leaf nodes in a VLT — hence you’ll want each Leaf node to have a direct connection to the destination, to avoid sub-optimal forwarding across the VLT peer-link.

      • Terry says:

        Hi Brad,

        Related to the VLT question: most of the networks I have worked on (Enterprise) require server NIC teaming or LACP bonding split across two switches for resilience, so I am curious why links between leaf pairs never appear on the diagrams. Is spine & leaf not suited to designs requiring layer 2 connectivity between access switch pairs for NIC resilience (whether it be plain active/standby or via VLT/MLAG/vPC)? Does it make the case less convincing in some way? Clearly it will burn some ports that could otherwise be used for uplinks or servers. Or is it that most deployments connect servers into a single ToR and obtain resilience some other way?

        • Brad Hedlund says:

          Hi Terry,
          You certainly *can* take two Leaf switches, connect them with mLAG/vPC etc., and connect the servers to both for NIC bonding HA. You would still have L3 uplinks from each Leaf, and each Leaf pair would be advertising the same in-rack subnets. Nothing wrong with doing that.
          Some people have large enough environments where rack-level HA is good enough, so the extra configuration/troubleshooting intensity of mLAG per rack only makes things worse, not better.

          Cheers,
          Brad
