On “Why TRILL won’t work for the data center”

Filed in Fabrics, FUD, QFabric, TRILL by Brad Hedlund on June 15, 2011

Today I came across “Why TRILL won’t work for data center network architecture” by Anjan Venkatramani of Juniper. Anjan’s article makes a few myopic and flawed arguments in slamming TRILL, setting up a sale for QFabric. The stated problems with TRILL include FCoE, L3 multi-pathing, VLAN scale, and large failure domains. The one and only Ivan Pepelnjak has already tackled the flawed FCoE argument (be sure to read that), so I’ll opine here on the L3, VLAN scale, and failure domain arguments.

Anjan writes this about L3 gateways in a TRILL network:

While TRILL solves multi-pathing for Layer 2, it breaks multi-pathing for Layer 3. There is only one active default router with Virtual Router Redundancy Protocol (VRRP), which means that there is no multi-pathing capability at Layer 3.

This is a bit shortsighted and assumes that we are simply stuck with existing L3 gateway protocols of today like VRRP, and therefore you just have to use those in a TRILL network. Why? As the L2 technology evolves, it makes perfect sense to look at how L3 protocols should evolve with it. For example, it’s entirely possible that a simple Anycast method could be used for the L3 gateway in TRILL. In short, each L3 TRILL switch would have the same IP address and same MAC address for the L3 default gateway. The server resolves the L3 gateway to this MAC address, which is available on ALL links, because each TRILL spine switch is originating it as if it were its own. The L2 edge switch now makes a simple ECMP hash calculation to decide which L3 switch and link receives the flow. Simple, right? The same Anycast concept can also be used for services, such as load balancers and firewalls.
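
To make that concrete, here is a minimal sketch (my own illustration, not any shipping implementation) of the edge-switch side of the idea: hash the flow’s 5-tuple and pick one of several equal-cost uplinks, each of which leads to a spine originating the same anycast gateway IP and MAC. The uplink names and the choice of hash function are purely hypothetical.

```python
# A minimal, hypothetical sketch of ECMP flow placement toward anycast L3
# gateways. Every TRILL spine advertises the same gateway IP/MAC, so the
# edge switch only needs to hash each flow onto one of its uplinks; any
# uplink it picks reaches a valid default gateway.

import hashlib

# Hypothetical uplinks, one per spine switch originating the shared
# (anycast) default gateway address.
UPLINKS = ["uplink-to-spine-1", "uplink-to-spine-2",
           "uplink-to-spine-3", "uplink-to-spine-4"]

def pick_uplink(src_ip, dst_ip, proto, src_port, dst_port, uplinks=UPLINKS):
    """Deterministically map one flow (5-tuple) to one uplink."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()  # any decent hash will do
    return uplinks[int.from_bytes(digest[:4], "big") % len(uplinks)]

if __name__ == "__main__":
    # Two flows from the same server can land on different spines, yet
    # both reach "the" default gateway, because every spine answers for
    # the same anycast IP/MAC.
    print(pick_uplink("10.1.1.10", "172.16.5.20", "tcp", 49152, 80))
    print(pick_uplink("10.1.1.10", "172.16.9.40", "tcp", 49153, 443))
```

Real switches do this hashing in hardware, of course; the point is simply that once every spine answers for the same gateway address, no VRRP-style single active gateway is needed.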

Anjan’s L3 gateway argument is a setup for arguing that fewer VLANs should be used with TRILL, thus requiring more hosts on each VLAN (to reduce the L3 switching bottlenecks), thereby adding to scalability problems. Therefore, any subsequent argument related to having too many hosts on one VLAN can be dismissed as FUD based on a shortsighted premise. There’s no reason to change the number of VLANs you deploy with TRILL, or the number of hosts per VLAN.

Anjan continues about TRILL failure domains:

Security and failure isolation are a real concern in the TRILL architecture. Both issues stem from being artificially forced into large broadcast domains. Flapping interfaces, misbehaving or malicious applications and configuration errors can cause widespread damage and in a worst case scenario result in a data center meltdown.

Again, the “large broadcast domains” can be dismissed as myopic FUD. There would be no reason to have larger than normal broadcast domains in a TRILL deployment.

Now, let’s talk about “configuration error” and the resulting “widespread damage”. Coming from Juniper, let’s acknowledge the somewhat obvious ulterior motive of selling their (still in slideware) QFabric architecture. Given that, the assumption Juniper would like you to make is that QFabric would be much less vulnerable to widespread damage from configuration error than a TRILL network. But how can that be possible? On what basis?

The QFabric architecture resembles that of one big proprietary 128-slot switch. One configuration change affects the entire architecture, for better or worse. How is Juniper proposing their architecture is any less vulnerable to disastrous configuration mistakes? If anything, a single network-wide configuration input, such as you get with QFabric, only increases this risk. No? Why not? Furthermore, why would “security and failure isolation” be less of a concern with Juniper QFabric, compared to any other standards-based architecture such as TRILL?

Would any of the Juniper folks like to state their case, and enlighten me? :-)

Cheers,
Brad


Disclaimer: The author is an employee of Cisco Systems, Inc. However, the views and opinions expressed by the author do not necessarily represent those of Cisco Systems, Inc. The author is not an official media spokesperson for Cisco Systems, Inc.

About the Author

Brad Hedlund (CCIE Emeritus #5530) is an Engineering Architect in the CTO office of VMware’s Networking and Security Business Unit (NSBU). Brad’s background in data center networking begins in the mid-1990s with a variety of experience in roles such as IT customer, value added reseller, and vendor, including Cisco and Dell. Brad also writes at the VMware corporate networking virtualization blog at blogs.vmware.com/networkvirtualization

Comments (27)


Sites That Link to this Post

  1. Your Momma Is So Proprietary « The Data Center Overlords | October 20, 2011
  1. Troy Levin says:

    Brad….. as usual, great post!! With the TRILL standard officially moved from draft to proposed standard in the IETF, vendors and customers can confidently begin deploying TRILL-compliant software implementations. E.g. Cisco FabricPath, which includes enhancements for active/active HSRP that can leverage ECMP now, and anycast, as you mention, in the future.

  2. Ravi patil says:

    Well said, definitely TRILL is the future for the data center.

  3. Jonathan says:

    I think we need to wait and see before jumping to any conclusions about Q-Fabric vs. TRILL; once we get our hands on some Juniper kit we can decide which is best…. What gets me thinking, though, is if Cisco have nothing to fear from Juniper’s Q-Fabric, why did Cisco see the need to employ David Yen? http://www.datacentremanagement.org/2011/05/cisco-nabs-juniper-qfabric-architect-david-yen/

    • Brad Hedlund says:

      Jonathan,
      Cisco hires from competitors all the time, and vice versa. That said, one could easily spin your question around and ask: “Why is the top QFabric guy leaving Juniper?”

      Cheers,
      Brad

      • BT says:

        That said, one could easily spin your question around and ask: “Why is the top QFabric guy leaving Juniper?”
        >> This top QFabric guy leaving Juniper has nothing to do with technology. He just didn’t get the promotion he wanted. Now at Cisco, he is playing a bigger role.

  4. EtherealMind says:

    Some of this argument is specious.

    Cisco is asking customers to deploy FabricPath – which is intended to be TRILL with a set of Cisco proprietary extensions that make a Cisco network an equivalent closed system to Juniper QFabric. And we are to believe that the software for TRILL is mission-critical and proven from day one, which seems unlikely given Cisco’s software development history (though there are signs of improvement lately).

    Second, the fact that QFabric offers a linear pricing model, compared to Cisco’s capital-intensive purchasing model of buying heavy switches in the core, is also attractive. A “128-slot switch” can be a better price model than the “128 individual switches” that Cisco is so keen to sell. You should note that QFabric has a target network size of 500 x 10GbE ports – not a market that Cisco is readily able to attack with existing products.

    And finally, there is the confusion of the Nexus 2000 line, which uses proprietary technologies to make a “dumb hub” switch known as a Fabric Extender so as to lower the price. Is this a short-term play as Cisco attempts to saturate the market?

    Cisco has great technology in the NX7K, but Juniper QFabric is a different approach that also has merit. I agree that the whitepaper is clumsy marketing, but Juniper’s approach has merit in my view.

    • Brad Hedlund says:

      Hi Greg,

      While, yes, FabricPath is Cisco proprietary, I will disagree with your statement that FabricPath is “an equivalent closed system to QFabric”.
      For a few simple reasons:

      1) Cisco FabricPath is based on TRILL and hardware compatible with TRILL. It will only take a software upgrade to run standard TRILL.

      2) With Cisco being the main contributor to TRILL, it’s inevitable that the current non-standard things about FabricPath will find their way into future versions of the TRILL standard.

      3) The customer can choose the ports they want to run FabricPath, while other ports can connect to standard Ethernet hosts or other non-Cisco switches. No such flexibility exists with the QFabric Interconnect.

      To be fair, 500 x 10GE L2/L3 line rate ports in a single switch is not something Juniper can do with existing products either.

      Thanks for the comment!

      Cheers,
      Brad

      • Derick Winkworth says:

        There are several examples of Cisco-driven standards that never included Cisco proprietary features. Time-Based Anti-Replay is a great example of that. I’m not sure that #2 is a convincing argument.

        Can you clarify #3? I believe the uplink ports on the qfabric edge devices will be multipurpose… The core devices themselves of course will not be. Is that what you are referring to?

        I’m not sure that’s a benefit. FabricPath from a forwarding plane perspective looks just like MPLS. FabricPath nodes then are effectively P nodes and IMHO… P nodes should never also be PE nodes… I imagine the same will hold true for FabricPath designs.

        • Brad Hedlund says:

          Hi Derick,
          On point #3, I’m referring to the QFabric core device (Interconnect) being a closed QFabric only device. That matters because it determines how you integrate 3rd party or legacy equipment during a migration. With Cisco FabricPath, you can connect any standard Ethernet access switch to the FabricPath core layer and run those ports as classical Ethernet.

          With QFabric, on the other hand, you’ll have no other choice but to connect your classical Ethernet switches to the edge device, the QF-Node. So now you have an access switch connected to another access switch. That amounts to adding layers to the network, not removing them, which, by the way, runs counter to the QFabric 3-2-1 marketing hype.

      • EtherealMind says:

        It’s my understanding that post-TRILL standard, Cisco will develop a number of proprietary extensions. These may offer customers valuable features (in the same way that EIGRP offered useful features in 2001) but ultimately result in a closed data center network – just like Juniper QFabric. Thus the outcome remains the same.

        I will continue to perceive FabricPath, per se, as a closed system in the same way that Juniper QFabric / Brocade VCS is closed – vendor-differentiated features. Any finger-pointing about non-standards compliance should be done with care, since each company’s own products are all culpable.

        greg

        • Brad Hedlund says:

          Greg,
          The big difference here is that customers will have a choice between proprietary Cisco FabricPath and standard TRILL with a software change, all on the same hardware investment.
          Such choice and flexibility does not exist with closed solutions (QFabric) which are not rooted in any standards.

          Cheers,
          Brad

          • EtherealMind says:

            I agree with that. However, I would also point out that QFabric is a different approach. Instead of many individual control planes acting as a fabric as we have today, Juniper QFabric looks to make a single control/management plane out of many switches. Therefore the interoperability isn’t a requirement since they all act as a single “borg” style fabric.

            That said, the QFNode switches can act as conventional switches with TRILL if you want to design them that way.

          • Brad Hedlund says:

            Of course interoperability isn’t a requirement between the switches in a closed architecture. It comes down to how you integrate with existing equipment without adding layers to the network.

  5. Juan Lage says:

    Greg,

    I’d like some insight about something you wrote: “QFabric offers a linear pricing model compared to Cisco’s capital intensive purchasing”. I really can’t see much difference in capital cost, for an equivalent deployment, between a leaf/spine model deployed with a closed proprietary approach (QFabric) and other vendors’ (Cisco included) deployments based on TRILL. Maybe you are assuming the QFabric Interconnect will be much less expensive, which is possible, but I have not seen any pricing guidance yet …

    In any case, and to add to Brad’s point, Cisco’s FP approach is also more “open” in that it does not lock you in at all layers. You can build a 2-tier network with Nexus switches as leaf and spine. Once this is done, you can connect other vendor switches to a Nexus 7K spine if you so wish, or you can use Nexus leaf switches with another vendor as spine. This would rely on future TRILL interop, yes, but even today you can do that with LACP. You can also rely on using GE, 10GE, and in the future 40GE and 100GE, for such connections. With QFabric, what can you connect? Nothing but JNPR devices and, for what we know, always using only 40Gbps interfaces.

    FabricPath is proprietary, but not closed. Qfabric is both proprietary and closed. :-)

    Cheers,

    Juan

    P.S. … on the pricing, don’t forget to also add the OOB management network (redundant), which is a must-have with QFabric … :-)

    • EtherealMind says:

      QFabric consists of three elements, QFNode, QFManager and QFInterconnect.

      The QFInterconnect is a multistage Clos silicon fabric. Conceptually similar to the silicon fabric cards of the Nexus 7000 – except that the entire chassis is devoted to silicon fabric cards (rear) and 40GbE line cards (front) that act as “backplane connections”.

      I’ve previously made some attempt to describe this technology here (hope it’s ok to link, Brad) : http://etherealmind.com/controller-based-networks-for-data-centres/

      and made some more detailed observations here:

      http://etherealmind.com/juniper-qfabric-my-speculation-too/

      The QFInterconnect has no management or control plane. As such it’s much cheaper to buy. Compare with the NX7K, which requires the purchase of the supervisors, line cards, and silicon fabrics in the first phase – in effect purchasing a large chunk of your forwarding capacity in a single, capital-intensive process. The Juniper QFNodes contain the control plane, and the addition of each edge switch adds more performance to the overall system. While the QFInterconnect will require more line cards and fabric cards, these are simpler devices that cost much less than equivalent Nexus devices.

      However, the potential negative is that the economic sweet spot is at 500 10GbE ports – that’s where the QFabric really stacks up. That’s a lot of ports, and not many enterprises need that right now, therefore it’s solving a different problem to the Nexus family.

      You might be interested in a podcast I recorded with Juniper that was published today:

      http://packetpushers.net/show-51-juniper-qfabric/

      The OOB network uses standard EX switches; they aren’t expensive compared to, say, a Nexus 2K – more like a C3750.

      Hope this helps.

      greg

      • Juan Lage says:

        Hi Greg, I’ve listened to your podcast about Qfabric indeed, very good :-)

        I understand well the concepts and components that make up QFabric. And I fully understand that the QFInterconnect isn’t a switch. I’m not entirely sure it has absolutely no management plane, because the hardware needs to be programmed somehow, and I presume it will have some sort of management modules that connect to the OOB network. But I take it that such management modules could perhaps be built with less CPU and memory resources … at any rate, you assume they will be much less expensive than a comparable N7K with future 40GE linecards. Possible. We will need to wait and see.

        But physically the QFInterconnect is a modular box that sits in the place in your DC where you have laid your fiber to. A very strategic place from a cabling perspective. And all you can connect to this device is QFNodes, and only using 40Gbps interfaces (not 40GE), afaik. This is capital-expensive for customers who do not require so much bandwidth (just on the optics and the fiber). If you have blade chassis with embedded switches you can’t connect them directly; you need to add a ToR layer.

        Also, if you want to connect the QFabric to the rest of the network you need to connect via a QFX3500, a device built for server access with only 9MB of shared memory (afaik again) – clearly not built for high speed core connections where more buffering is usually nice to have.

        On the pricing, I can’t really say whether EX switches are less or more expensive than “comparable” Nexus or Catalyst solutions. All I am saying is that if you build a TRILL- or FabricPath-based network, you don’t need to build the OOB network. :-)

        So all in all, without knowing the pricing, I am not sure we can assume that it will be less capital intensive.

        thanks for the good info :-)

        cheers,

        Juan

  6. Omar Sultan says:

    So, if we are going to get into pricing, it’s also helpful to note that FabricPath has much greater granularity. If you are an N7K customer, you can create a “flat network” for the price of a couple of F1 modules. This gives you an easy way to try flattening your network, and also gives you the ability to flatten the portions of the network where it actually makes sense while not having to mess with the portions of the network that are happily running as-is. Needless to say, this is also a lot simpler and less risky than introducing an entirely new switch architecture into your DC.

    On the pricing front, in case folks don’t realize, the Nexus 55xx is also FP-capable (support via an NX-OS update is forthcoming), so you have a couple of choices for your leaf nodes.

    Regards,

    Omar

    Omar Sultan (@omarsultan)
    Cisco

    • Stefan says:

      @omarsultan: and which version of NX-OS would you recommend running with the F1 cards? We are presently ready for testing, but our bug scrubbing of 5.1 (we only know of 5.1(3) as production-supported) has revealed no fewer than 326 issues, of which things like CSCtg43396, 13963, 83899, 59485, 78583 (!!!) make us very, very worried about production deployment.

  7. Jaime C. says:

    Good points in everyone’s favor, but don’t miss the point… what the DC needs to evolve towards Cloud is lower latency, fewer hops and simpler management, of course being as green as possible. At the end, customers need to weigh trade-offs and make a choice. For the near future, I guess Jnpr is leading the way wisely. I bet Cisco will re-architect to a true fabric in the next few months.

  8. joe smith says:

    Brad, this should be another thread…

    I am very curious to get your take on security in the cloud. To be more specific, how does one go about providing IDS services in a virtual environment?

    A requirement I am starting to see more often is to run an IDS service such that VM-to-VM traffic can be monitored. The traffic flow can be between two VMs on the same blade, two VMs on two separate blades in the same chassis, or two VMs on two separate chassis…

    In that case, I see 3 traffic flows off the bat…

    same blade: vm-to-vm traffic is switched by a hypervisor switch (1000v or vmware vDS).
    different blades in same chassis: vm-to-vm traffic will leave blade and be switched by chassis hardware switch (chassis I/O blade).
    different chassis: vm-to-vm traffic will have to go to ToR (maybe even end-of-row).

    NOTE: if VMs are on different VLANs, traffic will always go to end-of-row/agg switches (the L3/L2 boundary).

    So given all those possible flows, what is the best way to go about deploying an IDS service? Placement? Virtual or physical? etc….

    Looking forward to getting your insight!

  9. Craig Weinhold says:

    Cisco’s GLBP is an FHRP algorithm that supports multipath L3 routing very well. And with OTV, Cisco has legitimized the concept of filtering FHRP messages to create anycast-like VRRP/HSRP.

    But the L2/L3 scalability discussion misses the point. Assuming the network is competently designed and operated, the main technical reason for VLAN segmentation is to control traffic flow — to force different VLANs through different policies — firewalls, IPS, load-balancers, packet capture, etc. And policy enforcement is going to be the chokepoint no matter what the underlying switching technology is.

    VLANs still have some intrinsic value of their own, but they aren’t the mosh pits they once were. Traffic is far more orderly and controlled, thanks to Private VLANs, ARP security, MAC security, IGMP snooping, the dominance of IP, etc. Unicast flooding is even on the table, having been removed from Cisco’s OTV and UCS without much impact.

  10. Null says:

    Recently I had an opportunity to evaluate presentations from different vendors, e.g. Cisco, Brocade, HP and Juniper, with regards to the data center. In terms of presentation, Brocade & Juniper were quite impressive. I liked what Brocade has to offer: each top-of-rack switch has its own brain, unlike Cisco’s one master switch with a brain and the rest just dumb, so if the master fails then the other switches have no brains (e.g. Nexus). Only concern: new technology, and they don’t support Cisco proprietary EIGRP.
    The Junos feature of automatic configuration rollback after 5-10 min if not saved was interesting &, again, they don’t support Cisco proprietary traditional EIGRP.

    • Brad Hedlund says:

      Thanks for sharing your thoughts here. Yeah, if you require Cisco EIGRP, then considering other vendors is just torturing yourself with things you cannot have.
      Get that EIGRP out of there, ASAP. :-)

  11. Guillaume BARROT says:

    Hi,

    Great post, and that L2 anycast is a great idea. Not only for the gateway by the way.
    Is there a draft of this in the TRILL IETF working group, or is this just sci-fi for the moment?
