Construct a Leaf Spine design with 40G or 10G? An observation in scaling the fabric.

Should you construct a Leaf/Spine fabric with 10G or 40G?

In this post I’ll make the simple observation that using 10G interfaces in your leaf/spine fabric scales to more servers than using 40G interfaces, all with the same hardware, bandwidth, and oversubscription.

Let’s suppose you’ve decided to build a Leaf/Spine fabric for your data center network with the current crop of 10G/40G switches today that have QSFP ports.  Each QSFP port can be configured as a single 1 x 40G interface, or 4 x 10G interfaces (using a breakout cable).   With that option in mind, does it make more sense to construct your Leaf/Spine fabric with N * 40G? Or instead should you use N * (4 x 10G)?  Well, as always, it depends on what you’re trying to accomplish.

Here’s a simple example.  I want to build a data center fabric with my primary goal of having, say, 1200 10G servers in one fabric with 2.5:1 oversubscription.  I also want the ability to seamlessly expand this fabric to over 5000 10G servers as necessary without increasing latency or oversubscription.

For my example I’ll use the Dell Force10 Z9000 as my Spine switch, and the Dell Force10 S4810 as my Leaf switch in the top of rack.  I’ll have 40 servers per rack connected to the S4810, and I’ll use the 4 x QSFP uplink ports to attach upstream to the Z9000 Spine layer of my fabric.  Let’s look at two design choices, one with 40G, and another with 10G.
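As a quick sanity check on the 2.5:1 figure, here is a minimal sketch of the arithmetic for a single Leaf switch. The function name is made up for illustration; the port counts come from the paragraph above.

```python
def oversubscription(server_ports, server_gbps, uplink_ports, uplink_gbps):
    """Ratio of server-facing bandwidth to uplink bandwidth on one Leaf."""
    downstream = server_ports * server_gbps
    upstream = uplink_ports * uplink_gbps
    return downstream / upstream

# 40 servers at 10G, with the four QSFP uplinks run as 4 x 40G
ratio_40g = oversubscription(40, 10, 4, 40)

# Same four QSFP uplinks broken out into 16 x 10G
ratio_10g = oversubscription(40, 10, 16, 10)

print(ratio_40g, ratio_10g)  # 2.5 2.5
```

Either way the Leaf has 400G of server bandwidth sharing 160G of uplink bandwidth, so the choice of 40G or 10G uplinks does not change the oversubscription.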

40G Leaf/Spine Fabric

If I configure each QSFP port in the fabric as a single 40G interface, how wide will I be able to scale in terms of servers?

Each of my Z9000 Spine switches has 32 ports of 40G.  Each S4810 Leaf is attached to the Spine with 4 ports of 40G.  Every Leaf switch is connected to every Spine.  Therefore, the number of connections used for uplinks from each Leaf determines the number of Spine switches I can have.  And the number of ports on each Spine switch determines the number of Leaf switches I can have.

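With 4 uplinks per Leaf, I can have at most 4 Spine switches.  With 32 ports per Spine, I can have at most 32 Leaf switches.  At 40 servers per Leaf, the largest I can go in building this fabric with 40G interfaces is 32 x 40 = 1280 10G servers at 2.5:1 oversubscription.  That certainly accomplishes my initial scale target of 1200 servers, but I’m stuck there.  Before I can get to my 5000 servers stretch goal I’ll need to re-architect my fabric.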

Now, before we start re-architecting our fabric, let’s see what would have happened had we decided to configure each QSFP port as 4 x 10G interfaces, rather than our first choice of 1 x 40G.

10G Leaf/Spine Fabric

If I configure each QSFP port in the fabric as 4 x 10G interfaces, using an optical breakout cable, how wide will I be able to scale in terms of servers?

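Each of my Z9000 Spine switches now has 128 ports of 10G.  Each S4810 Leaf is attached to the Spine with 16 ports of 10G.  Every Leaf switch is still connected to every Spine, so I can now have up to 16 Spine switches and up to 128 Leaf switches, each Leaf with 40 servers.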

In building this fabric with 10G, the largest I can go is 5120 10G servers at 2.5:1 oversubscription.  Shazaam!  That did it.  I can initially build this fabric to 1200 servers and seamlessly scale it out to over 5000 servers, all with the same bandwidth, latency, and oversubscription.

The four times greater scalability of this design was enabled by simply choosing to build my Leaf/Spine fabric with 10G interfaces, rather than the obvious choice of 40G.  Compared to the previous 40G design, all of the hardware is the same.  And the bandwidth, latency, and oversubscription are all the same too.

The magic boils down to two simple principles of scaling a Leaf/Spine fabric.  Port count, and port count.

  1. The uplink port count on the Leaf switch determines the max # of Spine switches.
  2. The Spine switch port count determines the max # of Leaf switches.
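The two principles can be sketched as a small calculation. This is only an illustration under the assumptions used in this post (every Leaf has exactly one link to every Spine, and 40 servers per Leaf); the function name is hypothetical.

```python
def max_fabric(leaf_uplinks, spine_ports, servers_per_leaf):
    """Return (max Spines, max Leafs, max servers) for a Leaf/Spine fabric.

    Assumes every Leaf has exactly one link to every Spine.
    """
    max_spines = leaf_uplinks   # principle 1: one Leaf uplink per Spine
    max_leafs = spine_ports     # principle 2: one Spine port per Leaf
    return max_spines, max_leafs, max_leafs * servers_per_leaf

# QSFP ports configured as 1 x 40G: 4 uplinks per Leaf, 32 ports per Spine
fabric_40g = max_fabric(leaf_uplinks=4, spine_ports=32, servers_per_leaf=40)

# QSFP ports configured as 4 x 10G: 16 uplinks per Leaf, 128 ports per Spine
fabric_10g = max_fabric(leaf_uplinks=16, spine_ports=128, servers_per_leaf=40)

print(fabric_40g)  # (4, 32, 1280)
print(fabric_10g)  # (16, 128, 5120)
```

Breaking each QSFP port out into 4 x 10G multiplies both port counts by four, but the server count only depends on the number of Leafs, which is why the fabric grows by 4x rather than 16x.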

Each principle works independently.  If you have Leaf switches with lots and lots of uplinks connected to a Spine with a low port count, you can get some scale.  If you have only a handful of uplinks in your Leaf switches connecting to a Spine with lots and lots of ports, you can get some scale there too.

But when combined, the two principles work better together.  If you have Leaf switches with lots and lots of uplinks connected to Spine switches with lots and lots of ports, you get lots of scale.

Housekeeping and Caveats

Using an optical breakout cable to get 4 x 10G from a single QSFP port will likely reduce the supported distance of each 10G cable run.  A normal 10G SFP+ link on multi-mode fiber can go 300 meters, but you may only be able to go 100 meters with QSFP and optical breakout cables. Be sure to verify that fact and see how it may impact the max distance you can have between your Leaf and Spine switches.  This fact alone may put a limit on your fabric scalability, be it 10G or 40G.

Yep, these Leaf/Spine fabrics today are Layer 3.  The switches form a standard routing protocol relationship with each other, such as with BGP or OSPF.  Today, that works well for applications such as Hadoop, Web and Media applications, HPC, or perhaps an IaaS cloud using network virtualization with overlays.  Moving forward, you will start to see network vendors supporting the TRILL standard, at which point you’ll be able to build the same Leaf/Spine architecture to support a Layer 2 topology between racks.  With TRILL, you’ll have the freedom to choose different network vendors at the Leaf and Spine layers, rather than being locked in with a vendor specific proprietary protocol or architecture (e.g. Cisco FabricPath, Brocade VCS, and Juniper QFabric).

You can also scale the server count in a Leaf/Spine design by using the Leaf as a connection point for your top of rack layer, rather than using the Leaf itself as the top of rack.

Yep, in the 10G fabric you have 4 x more interfaces to configure, 4 x more cables, 4 x more routing protocol adjacencies, 4 x more infrastructure subnets, and so forth.  For you, that might be a problem or no big deal at all.
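To put rough numbers on that 4x, here is a sketch that counts Leaf-to-Spine links for the same 32 racks (1280 servers) in both designs. It assumes every link is a routed point-to-point link with one adjacency and one infrastructure subnet per link; that assumption and the function name are for illustration only, and your addressing plan may differ.

```python
def leaf_spine_links(leafs, spines):
    """Each Leaf has one link to each Spine, so links = leafs * spines."""
    return leafs * spines

LEAFS = 32  # 32 racks x 40 servers = 1280 servers in both designs

links_40g = leaf_spine_links(LEAFS, spines=4)    # 4 x 40G uplinks per Leaf
links_10g = leaf_spine_links(LEAFS, spines=16)   # 16 x 10G uplinks per Leaf

for name, links in (("40G", links_40g), ("10G", links_10g)):
    # Assumption: each link needs two interface configs (one per end),
    # one routing adjacency, and one infrastructure subnet.
    print(f"{name}: {links} links, {2 * links} interfaces, "
          f"{links} adjacencies, {links} subnets")

print(links_10g // links_40g)  # 4
```

At the full 128-Leaf, 16-Spine build-out the same formula gives 2048 links, so this is worth automating either way.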

Why would you build a Leaf/Spine design anyway?  Well, because you might like the fact that your fabric “Core” is striped across lots of individually insignificant pizza boxes (think RAID), rather than the typical approach of anchoring everything on two expensive, mainframe-like, power-sucking, monstrous chassis.

Have something you want to add?  Chime in with a comment.

Cheers,
Brad


Disclaimer: The author is an employee of Dell, Inc. However, the views and opinions expressed by the author do not necessarily represent those of Dell, Inc. The author is not an official media spokesperson for Dell, Inc.

Comments

  1. colin says:

    The other major benefit of staying with 10G for all ports is that there are now no speed mismatches within the fabric, which allows cut-through forwarding to work and lowers the latency.

  2. Richard says:

    That’s a LOT of cable to run. Add a leaf switch 100m away (10G x 16)? I wonder how long that will take to get up and running, probably a lot less time than adding a spine (10G x 128).

  3. Nice write up, Brad.

    One other point I’ll bring up that will impact overall cost.

    Overall cost of the 10G vs. 40G optics. I can’t speak for Dell pricing, but from Cisco a 10G MMF SFP (SFP-10G-SR) is $1495 and a 40G QSFP (QSFP-40G-SR4) is $3995. Both are list prices. So, when deploying a network there would be a 50% increase in cost to use 10G optics (~$6k for four 10G optics vs. ~$4k for one 40G optic). This could be significant when talking about 128 leafs and 16 spines. I suppose there will be trade-offs as usual – cost vs. scalability, etc.

    Just another house keeping point ;)

    Regards,
    Jason

  4. Petr Lapukhov says:

    Brad,

    You play a nice Jedi mind trick here, by taking the SAME box and effectively reducing the port count from 128 to 32, thus limiting the number of podsets (leafs) in the fabric ;) So I believe it would be more accurate to put a disclaimer that this is a BRCM Trident-specific design “limitation” (4x10G->40G).

    Now imagine that you have a 128 40G port box (chassis, of course :) – and all port-based scaling limitations have gone away. Not here right now? I believe something like that would come out this year, as fabric capacities for “large” boxes permit that :)

    Of course, this naturally brings us back to the old discussion of large crossbars vs clos :)

  5. David Rodgers says:

    I question the idea that distributed pizza boxes are less power hungry than “power sucking monstrous chassis” when you’re talking about 144 pizza box switches required to do the same number of 10g ports as 12 chassis.

    • Brad Hedlund says:

      David,
      Are you saying that you can build a fabric supporting 5120 10G servers with just 12 chassis? Really?
      How would you propose cabling all 5120 servers to your “12 chassis”?

  6. Mark says:

    Thank you Brad for the article. I am confused on one point. I think I am missing something, because it appears to me that you can also have the scaling benefits without using 10gb/breakout between the leaf and the spine. I am thinking if you have 16 spines and 128 leafs, you would still be able to scale to a guest count of 5120. My math:

    16 spines with 32 40gb ports = 512 40gb ports available for leafs.
    512 40gb ports available for leafs divided by 4 40gb ports on each leaf = 128 leafs.
    40 guests per leaf times 128 leafs = 5120 guests.

    Help me find what I’m missing here please :)

    • Brad Hedlund says:

      Hi Mark,
      A fabric should provide more than just connectivity for all hosts, it should go a step further in providing uniform latency and bandwidth across the fabric from any host to any host. This enables the flexibility to place workloads anywhere in the fabric without concern for network performance. That’s one difference between a “fabric”, and a “network”.

      Your design is more like a “network”. Yes, it provides connectivity. But it doesn’t provide the uniform latency and bandwidth properties of a real “fabric”.

      Each Leaf is only connected to 4 of the 16 Spines. As a result, you’ll have some hosts that will need to make several Leaf-Spine-Leaf hops to communicate, whereas other hosts will only be one Spine hop away. Non-uniform latency.

      Similarly, you’ll have some rack pairs that can only communicate through one Leaf uplink, while other rack pairs can communicate with more than one Leaf uplink. Non-uniform bandwidth.

      As a result, application performance will vary depending on where workloads are placed in the network, complicating the provisioning model and partitioning the resources.

      To provide the uniform bandwidth and latency people will expect from your fabric, make sure your Leafs are connected to all Spines with the same amount of bandwidth.

  7. John G. says:

    If we zoom out a bit on the 40G uplink scenario specifically, we also have constraints for the L3 hop out of this fabric and this is even more limiting.

  8. Alex says:

    Brad, are you able to deliver uniform latency and bandwidth across the fabric with a Z9000 as the Spine switch? Isn’t the Z9000 a Trident network in a box, meaning port-to-port latency will not be consistent across all 32/128 ports of each spine switch?

    • Brad Hedlund says:

      Yep, in the Z9000 there will be a small difference in latency depending on the ingress/egress port pairs. That’s not any different than a chassis switch having lower latency on the linecard vs. between linecards. When you’re building a fabric that can scale, this is generally acceptable. As for bandwidth, the Z9000 is line rate on all ports.

      • Garry Shtern says:

        The alternative to Z9000 which is power hungry, takes up 2U of rack space and has variable latency depending on which ports are communicating, is Mellanox SX1036 switch. It is 1U switch, that has 36 40G ports and is rated at 200W. Granted, you can’t break it out to 144 10G ports (PHY limitations), and no L3 support but for the price (32k MSRP), you can’t beat it.

        Using the same Trident/Trident+ leafs, which you can get from Extreme, Cisco, Arista, Juniper or IBM, you can scale your setup to 1,728 nodes at substantial cost savings.

        Also, using optics for uplinks is not an ideal approach. Since you are within the same data center (presumably), your leafs and spines are within 150m of each other. If so, you can just get pre-terminated QSFP+ cables from Mellanox (or equivalent). The pre-terminated cable will run you less than $1,000, whereas the 40G LR/SR can easily cost 3k, which means you are saving at least 5k per each 40G uplink.

        • Brad Hedlund says:

          Garry,
          Without L3 or TRILL in the Mellanox SX1036, I don’t see how you can possibly use that switch in the Spine or Leaf layer. So, yeah, the price is great but it’s worthless if you can’t use it to build the network you want.

  9. Anton N says:

    Hi Brad, nice reading.
    Could you please clarify the term “2.5:1 oversubscription”? If all leaf/spine switches are L3, can a leaf box simultaneously use all 16 10G uplinks to route traffic to another leaf box? Consider the case when all servers on one leaf want to talk to all servers on another single leaf. If I see it right, oversubscription is more of a “statistical” term than a “real” one in this example.
    Thanks in advance!

    • Brad Hedlund says:

      Anton,
      The oversubscription here is calculated from the fact that a Leaf switch has 40 x 10G server ports that will share 16 x 10G uplinks to the rest of the fabric.
      At the Leaf, there are 2.5 server ports for every 1 uplink.

  10. Derwin Warren says:

    While the 10G design scales to support more servers, it still requires a lot of cable runs and when dealing in the container realm, this can be a major headache. Personally, I want vendors to support massive 40G ports in the spine. Say, a modular chassis (9-12 slots) with at least 32-port 40G modules but would love upwards of 64-port 40G modules. If we can get this today in a 1RU footprint, this should be easy. I was dealing with a large scale Hadoop containerized solution (920 nodes per container across four containers at 1:1 over-subscription…eventually recommended 3:1) dealing with 10G ports and it was a pain from a cabling perspective. I was looking at a minimum 128 outbound 10G cables per container (for 3:1). Massive 40G ports in the spine and in the future 100G (once the price decreases) is the answer. Still, very good post.

    • Brad Hedlund says:

      Derwin,
      Thanks for sharing your experience and perspective. I’m in full agreement with you here. The good news is the 40G port density you want is not that far out. The commercial silicon vendors such as Broadcom, Fulcrum, Marvell, and others are already surpassing in-house silicon development (Cisco, Brocade, Juniper).

      Cheers,
      Brad
