Construct a Leaf/Spine design with 40G or 10G? An observation on scaling the fabric.

Filed in Fabrics, Routing, TRILL on January 25, 2012 | 45 Comments

Should you construct a Leaf/Spine fabric with 10G or 40G?

In this post I’ll make the simple observation that using 10G interfaces in your leaf/spine fabric scales to more servers than using 40G interfaces, all with the same hardware, bandwidth, and oversubscription.

Let’s suppose you’ve decided to build a Leaf/Spine fabric for your data center network with the current crop of 10G/40G switches today that have QSFP ports.  Each QSFP port can be configured as a single 1 x 40G interface, or 4 x 10G interfaces (using a breakout cable).   With that option in mind, does it make more sense to construct your Leaf/Spine fabric with N * 40G? Or instead should you use N * (4 x 10G)?  Well, as always, it depends on what you’re trying to accomplish.

Here’s a simple example.  I want to build a data center fabric with my primary goal of having, say, 1200 10G servers in one fabric with 2.5:1 oversubscription.  I also want the ability to seamlessly expand this fabric to over 5000 10G servers as necessary without increasing latency or oversubscription.

For my example I’ll use the Dell Force10 Z9000 as my Spine switch, and the Dell Force10 S4810 as my Leaf switch in the top of rack.  I’ll have 40 servers per rack connected to the S4810, and I’ll use the 4 x QSFP uplink ports to attach upstream to the Z9000 Spine layer of my fabric.  Let’s look at two design choices, one with 40G, and another with 10G.

40G Leaf/Spine Fabric

If I configure each QSFP port in the fabric as a single 40G interface, how wide will I be able to scale in terms of servers?

Each of my Z9000 Spine switches has 32 ports of 40G.  Each S4810 Leaf is attached to the Spine with 4 ports of 40G.  Every Leaf switch is connected to every Spine.  Therefore, the number of connections used for uplinks from each Leaf determines the number of Spine switches I can have, and the number of ports on each Spine switch determines the number of Leaf switches I can have.  In this case, that means a maximum of 4 Spine switches and 32 Leaf switches.

In building this fabric with 40G interfaces, the largest I can go is 1280 10G servers (32 Leaf switches x 40 servers each) at 2.5:1 oversubscription.  That certainly accomplishes my initial scale target of 1200 servers, but I’m stuck there.  Before I can get to my 5000-server stretch goal I’ll need to re-architect my fabric.
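
To make the arithmetic explicit, here’s a quick back-of-the-napkin sketch (in Python, purely illustrative; the numbers come straight from the port counts above):

```python
# 40G Leaf/Spine fabric: rough scale math (illustrative sketch, not a design tool)
spine_ports = 32        # Z9000: 32 x 40G QSFP ports
leaf_uplinks = 4        # S4810: 4 x QSFP uplinks, each run as 1 x 40G
servers_per_leaf = 40   # 40 x 10G servers per rack

max_spines = leaf_uplinks                   # every Leaf connects to every Spine
max_leafs = spine_ports                     # every Spine port feeds one Leaf
max_servers = max_leafs * servers_per_leaf  # total 10G server ports

oversub = (servers_per_leaf * 10) / (leaf_uplinks * 40)  # downlink Gbps / uplink Gbps

print(max_spines, max_leafs, max_servers, oversub)  # 4 32 1280 2.5
```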

Now, before we start re-architecting our fabric, let’s see what would have happened had we decided to configure each QSFP port as 4 x 10G interfaces, rather than our first choice of 1 x 40G.

10G Leaf/Spine Fabric

If I configure each QSFP port in the fabric as 4 x 10G interfaces using an optical breakout cable, how wide will I be able to scale in terms of servers?

Each of my Z9000 Spine switches now has 128 ports of 10G.  Each S4810 Leaf is attached to the Spine with 16 ports of 10G.  That allows for up to 16 Spine switches and 128 Leaf switches.

In building this fabric with 10G, the largest I can go is 5120 10G servers at 2.5:1 oversubscription.  Shazaam!  That did it.  I can initially build this fabric to 1200 servers and seamlessly scale it out to over 5000 servers, all with the same bandwidth, latency, and oversubscription.
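
Same napkin math as before, just with the QSFP ports broken out (again a purely illustrative Python sketch):

```python
# 10G Leaf/Spine fabric: same hardware, QSFP ports broken out to 4 x 10G
spine_ports = 32 * 4     # Z9000: 128 x 10G ports
leaf_uplinks = 4 * 4     # S4810: 16 x 10G uplinks
servers_per_leaf = 40

max_spines = leaf_uplinks                   # 16 Spine switches
max_leafs = spine_ports                     # 128 Leaf switches
max_servers = max_leafs * servers_per_leaf  # 5120 servers

oversub = (servers_per_leaf * 10) / (leaf_uplinks * 10)

print(max_spines, max_leafs, max_servers, oversub)  # 16 128 5120 2.5
```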

The four times greater scalability of this design was enabled by simply choosing to build my Leaf/Spine fabric with 10G interfaces, rather than the obvious choice of 40G.  Compared to the previous 40G design, all of the hardware is the same.  And the bandwidth, latency, and oversubscription are all the same too.

The magic boils down to two simple principles of scaling a Leaf/Spine fabric.  Port count, and port count.

  1. The uplink port count on the Leaf switch determines the max # of Spine switches.
  2. The Spine switch port count determines the max # of Leaf switches.

Each principle works independently.  If you have Leaf switches with lots and lots of uplinks connected to a Spine with a low port count, you can get some scale.  If you have only a handful of uplinks in your Leaf switches connecting to a Spine with lots and lots of ports, you can get some scale there too.

But the two principles work best in combination.  If you have Leaf switches with lots and lots of uplinks connected to Spine switches with lots and lots of ports, you get lots of scale.
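
If you prefer to see the two principles as one formula, here’s a tiny generalized sketch (the function name and signature are mine, purely for illustration) that reproduces both designs above:

```python
def fabric_scale(leaf_uplinks, uplink_gbps, spine_ports, servers_per_leaf, server_gbps=10):
    """Two-stage Leaf/Spine: the Leaf uplink count bounds the Spines,
    the Spine port count bounds the Leafs (and therefore the servers)."""
    max_spines = leaf_uplinks
    max_leafs = spine_ports
    max_servers = max_leafs * servers_per_leaf
    oversub = (servers_per_leaf * server_gbps) / (leaf_uplinks * uplink_gbps)
    return max_spines, max_leafs, max_servers, oversub

print(fabric_scale(4, 40, 32, 40))    # 40G design:  (4, 32, 1280, 2.5)
print(fabric_scale(16, 10, 128, 40))  # 10G design: (16, 128, 5120, 2.5)
```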

Housekeeping and Caveats

Using an optical breakout cable to get 4 x 10G from a single QSFP port will likely reduce the supported distance of each 10G cable run.  A normal 10G SFP+ link on multi-mode fiber can go 300 meters, but you may only be able to go 100 meters with QSFP and optical breakout cables. Be sure to verify that fact and see how it may impact the max distance you can have between your Leaf and Spine switches.  This fact alone may put a limit on your fabric scalability, be it 10G or 40G.

Yep, these Leaf/Spine fabrics today are Layer 3.  The switches form a standard routing protocol relationship with each other, such as BGP or OSPF.  Today, that works well for applications such as Hadoop, Web and Media applications, HPC, or perhaps an IaaS cloud using network virtualization with overlays.  Moving forward, you will start to see network vendors supporting the TRILL standard, at which point you’ll be able to build the same Leaf/Spine architecture to support a Layer 2 topology between racks.  With TRILL, you’ll have the freedom to choose different network vendors at the Leaf and Spine layers, rather than being locked in with a vendor-specific proprietary protocol or architecture (e.g. Cisco FabricPath, Brocade VCS, and Juniper QFabric).

You can also scale the server count in a Leaf/Spine design by using the Leaf as a connection point for your top of rack layer, rather than using the Leaf itself as the top of rack.

Yep, in the 10G fabric you have 4 x more interfaces to configure, 4 x more cables, 4 x more routing protocol adjacencies, 4 x more infrastructure subnets, and so forth.  For you, that might be a problem or no big deal at all.

Why would you build a Leaf/Spine design anyway?  Well, because you might like the fact that your fabric “Core” is striped across lots of individually insignificant pizza boxes (think RAID), rather than the typical approach of anchoring everything on two expensive, mainframe-like, power-sucking, monstrous chassis.

Have something you want to add?  Chime in with a comment.

Cheers,
Brad


Disclaimer: The author is an employee of Dell, Inc. However, the views and opinions expressed by the author do not necessarily represent those of Dell, Inc. The author is not an official media spokesperson for Dell, Inc.

About the Author

Brad Hedlund is an Engineering Architect with the CTO office of VMware’s Networking and Security Business Unit (NSBU), focused on network & security virtualization (NSX) and the software-defined data center. Brad’s background in data center networking begins in the mid-1990s with a variety of experience in roles such as IT customer, systems integrator, architecture and technical strategy roles at Cisco and Dell, and speaker at industry conferences. CCIE Emeritus #5530.

Comments (45)


  1. colin says:

    The other major benefit of staying with 10G for all ports is that there are no speed mismatches within the fabric, which allows cut-through forwarding to work, lowering the latency.

  2. Richard says:

    That’s a LOT of cable to run. Add a leaf switch 100m away (10G x 16)? I wonder how long that will take to get up and running; probably a lot less time than adding a spine (10G x 128).

  3. Nice write up, Brad.

    One other point I’ll bring up that will impact overall cost.

    Overall cost of the 10G vs. 40G optics. I can’t speak for Dell pricing, but for Cisco a 10G MMF SFP+ (SFP-10G-SR) is $1495 and a 40G QSFP (QSFP-40G-SR4) is $3995. Both are list prices. So, when deploying a network there would be roughly a 50% increase in cost to use 10G optics on each uplink (~$6k for 4 x SFP+ vs. $4k for one QSFP). This could be significant when talking about 128 leafs and 16 spines. Suppose there will be trade-offs as usual – cost vs. scalability, etc.

    Just another housekeeping point ;)

    Regards,
    Jason

  4. Petr Lapukhov says:

    Brad,

    You play a nice Jedi mind trick here, by taking the SAME box and effectively reducing the port count from 128 to 32, thus limiting the number of podsets (leafs) in the fabric ;) So I believe it would be more accurate to put a disclaimer that this is a BRCM Trident-specific design “limitation” (4x10G->40G).

    Now imagine that you have a 128 40G port box (chassis, of course :) – and all port-based scaling limitations have gone away. Not here right now? I believe something like that would come out this year, as fabric capacities for “large” boxes permit that :)

    Of course, this naturally brings us back to the old discussion of large crossbars vs. Clos :)

  5. David Rodgers says:

    I’m not sure the idea that distributed pizza boxes are less power hungry than “power sucking monstrous chassis” holds up when you’re talking about the 144 pizza box switches required to do the same number of 10G ports as 12 chassis.

    • Brad Hedlund says:

      David,
      Are you saying that you can build a fabric supporting 5120 10G servers with just 12 chassis? Really?
      How would you propose cabling all 5120 servers to your “12 chassis”?

  6. Mark says:

    Thank you Brad for the article. I am confused on one point. I think I am missing something, because it appears to me that you can also have the scaling benefits without using 10gb/breakout between the leaf and the spine. I am thinking if you have 16 spines and 128 leafs, you would still be able to scale to a guest count of 5120. My math:

    16 spines with 32 40gb ports = 512 40gb ports available for leafs.
    512 40gb ports available for leafs divided by 4 40gb ports on each leaf = 128 leafs.
    40 guests per leaf times 128 leafs = 5120 guests.

    Help me find what I’m missing here please :)

    • Brad Hedlund says:

      Hi Mark,
      A fabric should provide more than just connectivity for all hosts, it should go a step further in providing uniform latency and bandwidth across the fabric from any host to any host. This enables the flexibility to place workloads anywhere in the fabric without concern for network performance. That’s one difference between a “fabric”, and a “network”.

      Your design is more like a “network”. Yes, it provides connectivity. But it doesn’t provide the uniform latency and bandwidth properties of a real “fabric”.

      Each Leaf is only connected to 4 of the 16 Spines. As a result, you’ll have some hosts that will need to make several Leaf-Spine-Leaf hops to communicate, whereas other hosts will only be one Spine hop away. Non-uniform latency.

      Similarly, you’ll have some rack pairs that can only communicate through one Leaf uplink, while other rack pairs can communicate with more than one Leaf uplink. Non-uniform bandwidth.

      As a result, application performance will vary depending on where workloads are placed in the network, complicating the provisioning model and partitioning the resources.

      To provide the uniform bandwidth and latency people will expect from your fabric, make sure your Leafs are connected to all Spines with the same amount of bandwidth.

  7. John G. says:

    If we zoom out a bit on the 40G uplink scenario specifically, we also have constraints for the L3 hop out of this fabric and this is even more limiting.

  8. Alex says:

    Brad, are you able to deliver uniform latency and bandwidth across the fabric with a Z9000 as the Spine switch? Is the Z9000 not a Trident network in a box, meaning port-to-port latency will not be consistent across all 32/128 ports of each Spine switch?

    • Brad Hedlund says:

      Yep, in the Z9000 there will be a small difference in latency depending on the ingress/egress port pairs. That’s no different than a chassis switch having lower latency within a linecard vs. between linecards. When you’re building a fabric that can scale, this is generally acceptable. As for bandwidth, the Z9000 is line rate on all ports.

      • Garry Shtern says:

        The alternative to the Z9000, which is power hungry, takes up 2U of rack space, and has variable latency depending on which ports are communicating, is the Mellanox SX1036 switch. It is a 1U switch that has 36 40G ports and is rated at 200W. Granted, you can’t break it out to 144 10G ports (PHY limitations), and there’s no L3 support, but for the price (32k MSRP), you can’t beat it.

        Using the same Trident/Trident+ leafs, which you can get from Extreme, Cisco, Arista, Juniper or IBM, you can scale your setup to 1,728 nodes at substantial cost savings.

        Also, using optics for uplinks is not an ideal approach. Since you are within the same data center (presumably), your leafs and spines are within 150m of each other. If so, you can just get pre-terminated QSFP+ cables from Mellanox (or equivalent). The pre-terminated cable will run you less than $1,000, whereas the 40G LR/SR can easily cost 3k, which means you are saving at least 5k per each 40G uplink.

        • Brad Hedlund says:

          Garry,
          Without L3 or TRILL in the Mellanox SX1036, I don’t see how you can possibly use that switch in the Spine or Leaf layer. So, yeah, the price is great but it’s worthless if you can’t use it to build the network you want.

          • Garry Shtern says:

            I don’t see a need for L3 in the spine at all. All of your leafs function as routs, with spine being used as a L2 transit. The simplest solution is to use BGP dynamic peer-groups to avoid configuring individual peers, but with a little ingenuity, one can use OSPF or EIGRP (in case of Cisco), as well. ECMP takes care of distributing the load between your multiple spine connections, so TRILL is not necessary, either.

            I grant you that this might tax the CPU of your leafs a bit, considering each will have to maintain multiple sessions to every other leaf, but nothing substantial.

          • Brad Hedlund says:

            Yeah, I suppose you *could* do that. At the cost of additional complexity, and limited scalability. Each Leaf will need 140 routing adjacencies, and if you want fast convergence you’ll need sub-second timers for each one — if the ToR control plane CPU can even handle it. If your spine had routing capabilities (i.e. L3 or TRILL) you could add a third tier if needed (ToR > Leaf > Spine), but this would not work in your L2-only spine running STP. Other than that, yeah, I accept your argument — you don’t *need* routing in the Spine.

          • Simon says:

            Does the Mellanox SX1036 now support Layer 3?

  9. Anton N says:

    Hi Brad, nice reading.
    Could you please clarify the term “2.5:1 oversubscription”? If all leaf/spine switches are L3, can a leaf box simultaneously use all 16 10G uplinks to route traffic to another leaf box? That’s the case where all servers of one leaf want to talk with all servers of another single leaf. If I see it right, oversubscription is more a “statistical” term than a “real” one in this example.
    Thanks in advance!

    • Brad Hedlund says:

      Anton,
      The oversubscription here is calculated from the fact that a Leaf switch has 40 x 10G server ports that will share 16 x 10G uplinks to the rest of the fabric.
      At the Leaf, there are 2.5 server ports for every 1 uplink.

  10. Derwin Warren says:

    While the 10G design scales to support more servers, it still requires a lot of cable runs, and when dealing in the container realm, this can be a major headache. Personally, I want vendors to support massive numbers of 40G ports in the spine. Say, a modular chassis (9-12 slots) with at least 32-port 40G modules, though I would love upwards of 64-port 40G modules. If we can get this today in a 1RU footprint, this should be easy. I was dealing with a large scale Hadoop containerized solution (920 nodes per container across four containers at 1:1 over-subscription…eventually recommended 3:1) dealing with 10G ports and it was a pain from a cabling perspective. I was looking at a minimum of 128 outbound 10G cables per container (for 3:1). Massive 40G port counts in the spine, and in the future 100G (once the price decreases), are the answer. Still, very good post.

    • Brad Hedlund says:

      Derwin,
      Thanks for sharing your experience and perspective. I’m in full agreement with you here. The good news is the 40G port density you want is not that far out. The commercial silicon vendors such as Broadcom, Fulcrum, Marvell, and others are already surpassing in-house silicon development (Cisco, Brocade, Juniper).

      Cheers,
      Brad

  11. Marius Purice says:

    Hi, Brad,

    What if we turn all 40G ports into 4x10G ports and we use only Z9000 switches to build the leaf-spine network? This would allow for 64 spine switches, 128 leafs and 8192 non-blocking 10G ports for the servers. Could you, please, comment on the advantages/drawbacks of such a deployment? I know cabling would be a real challenge :).

    Regards,

    Marius Purice

    • Brad Hedlund says:

      Hi Marius,

      Absolutely, that is a perfectly valid design (Z9000 top of rack, Z9000 spine).
      Couple of things to keep in mind:

      1) With Z9000 top of rack you’ll need to use a 5m QSFP-to-SFP copper breakout cable for the 10G server connections. Just make sure the 5m distance is not going to be an issue.

      2) The Z9000 is about 5x the list price of the S4810, for 2x the port density. This means there is a finite window of fabric size where Z9000 at the top of rack is lower cost than S4810 at the top of rack. And that fabric window is 4096 to 8192 *non-blocking* server ports. If your fabric is either larger or smaller than that, it makes more financial sense to have S4810s at the top of rack. Inside that window, Z9000 top of rack works financially because it keeps the fabric to two stages (Leaf-Spine). When the fabric gets larger than 8192 non-blocking ports you’ll need a three stage fabric no matter what (ToR-Leaf-Spine), and having Z9000-Z9000-Z9000 at all three stages will cost more than having S4810-Z9000-Z9000.

      Make sense?

      Cheers,
      Brad

  12. Ryan Malayter says:

    Is Dell/Force10 shipping or beta-testing standards-compliant TRILL for FTOS yet? I know the S4810 and Z9000 support it in hardware, as do all other Trident+ switches. But nobody seems to actually have the software part of TRILL working yet ;-)

    • Brad Hedlund says:

      Is Dell shipping TRILL in FTOS? No. As for when Dell plans to do so, I can’t disclose that.
      It’s *my opinion* (not necessarily Dell’s) that TRILL will be largely irrelevant anyway by the time its commercially available. You have Layer 2 becoming less relevant in the data center, and you have higher density switches with which you can build pretty large L2 domains using well understood multi chassis LAG technology (VLT in FTOS).

      • Ryan Malayter says:

        I agree L2 should become less important over time. Hopefully VXLAN/NVGRE/whatever may help with this issue, letting us do all layer-3 on the physical network. But they make little sense right now as they are single-vendor (or in the case of NVGRE, no-vendor) options that really only work in PowerPoint.

        But the elimination of spanning tree and all of its nightmares is a noble goal. STP-related outages still crop up even when you “do it right” with BPDU guard and other stability features. See http://blog.ioshints.info/2012/04/stp-loops-strike-again.html

        Even if the L2 domain were just 4 switches, implementing TRILL would still be a very valuable tool.

  13. Carlos Ribeiro says:

    I love the leaf and spine concept. But in my opinion, L2 will still be necessary for the time being. TRILL, however, would not be necessary if we just agreed on a simple EoIP (Ethernet over IP) standard.

    Packing an Ethernet frame inside IP may seem foolish. But consider this:

    1. It can be done very efficiently on hardware (both packing *and* unpacking).
    2. It allows one to leverage current L3 routing protocols, right now.
    3. You could map VLANs or individual MACs to some specific IP address.

    The end result would be similar in some ways to an MPLS network, without the same advanced traffic engineering capabilities, but also without all the extra complexity, and using only well known tools. I may be missing something, but I honestly don’t know why it wasn’t done before. There are a few potential issues – for example, one would need a few extra bytes to handle jumbo frames in the core (fragmentation couldn’t be allowed, so as not to kill performance). But besides that there aren’t any explicit technical reasons why *not* to do it. It seems more like a design philosophy or marketing issue.

  14. Sebastian Maniak says:

    Hi Brad,

    Have you built leaf and spine and introduced virtualscale on the S4810s before? Do you see any issues with doing this?

    • Brad Hedlund says:

      Hi Sebastian,
      Did you mean VLT on the S4810 Leaf nodes? Yes, you can do that. Best practice would be to have all servers attached with LAG to the VLT. The spine will deliver traffic to any one of the two Leaf nodes in a VLT — hence you’ll want each Leaf node to have a direct connection to the destination, to avoid sub-optimal forwarding across the VLT peer-link.

      • Terry says:

        Hi Brad,

        Related to the VLT question: most of the networks I have worked on (Enterprise) require server NIC teaming or LACP bonding split across two switches for resilience, so I am curious why links between leaf pairs never appear on the diagrams. Is spine & leaf not suited to designs requiring layer 2 connectivity between access switch pairs for NIC resilience (whether it be plain active/standby or via VLT/MLAG/vPC)? Does it make the case less convincing in some way? Clearly it will burn some ports that could otherwise be used for uplinks or servers. Or is it that most deployments connect servers into a single ToR and obtain resilience some other way?

        • Brad Hedlund says:

          Hi Terry,
          You certainly *can* take two Leaf switches, connect them with mLAG/vPC etc., and connect the servers to both for NIC bonding HA. You would still have L3 uplinks from each Leaf, and each Leaf pair would be advertising the same in-rack subnets. Nothing wrong with doing that.
          Some people have large enough environments where rack-level HA is good enough, so the extra configuration/troubleshooting intensity of mLAG per rack only makes things worse, not better.

          Cheers,
          Brad
