massively scalable data centers

Lately I’ve been thinking a lot about Massively Scalable Data Centers. These are the emerging data centers built by cloud service providers that may contain tens or even hundreds of thousands of physical server machines. Throw a modest amount of server virtualization on top and you could have more than a million end nodes in a single brick and mortar facility, or perhaps dispersed across a collection of internet connected pods. Examples: Microsoft, Google, and Amazon.

What’s interesting to me about this space is not just the massive scale (that too) but mostly the cloud driven economics, new applications, changing traffic patterns, and the market disruptive architectures that will inevitably surface as a result. New start-ups might emerge with unique solutions for these very specific customers. Open source projects combined with a dumbed down network may lead to the full scale commoditization of the network switch. Or perhaps a more intelligent cloud aware network will enable service providers to offer new products and services, expanding into new markets.

Let’s start with the economics. Why on earth would anyone build and maintain a facility with a hundred thousand or more servers? That must be incredibly expensive, right? It certainly is! James Hamilton of Amazon points out that the majority of overall data center costs come from acquiring servers with short amortization cycles, not from the network. This is probably the case for data centers of all sizes: big, medium, and small. And therein lies the economic opportunity for both the cloud service provider and their customers.

For the IT Manager or Software Developer, why buy a bunch of servers for “Application X”, absorbing all of that cost and months of deployment effort, when you can instead run your application today on somebody else’s servers? You pay only for what you really need (CPU, Bandwidth, Storage), and only for the amount you use. What was unthinkable (or at best laughable) just five years ago is now a real option worth serious consideration. Given the obvious security concerns, not every application can be considered today, but other server hungry data crunching apps could be perfect candidates. With time, the security protections offered by the service provider will only get better, creating opportunities for more applications.

To make a compelling case, the cloud service provider sets up an infrastructure that can run Application X orders of magnitude cheaper, instantly provisioned, and backed by the performance and elasticity of a massive data center. The consumer is saving time and lowering costs, while the service provider is making money. It’s a win-win. Sounds nice, but how is this possible? And what does this have to do with massive scale?

What makes this work is the universal truth that a server which is idle and unused is a server wasting money. Think of the server as an “Opportunity” to achieve a “Result”. As in life, every opportunity has an associated cost, an “opportunity cost”, which is a prerequisite to realizing results. Furthermore, the opportunity cost is decoupled from the subsequent results achieved. For example, if you buy a business for $10,000, whether you subsequently make $1 million or go bankrupt (the result) doesn’t change the fact that your opportunity cost was $10,000. The same fundamental principle is true for data centers and servers. Every cost required to get a server in a rack, powered up, cooled, connected, managed, and ready to begin its first transaction represents the opportunity cost incurred, regardless of whether the server is subsequently 10% utilized or 100% utilized (the result). The server with 10% utilization has wasted 90% of its opportunity cost (money). Higher opportunity costs mean more money wasted.

I’ll refer to this as the “Actual Efficiency”, which would be calculated as follows:

Result / Opportunity Cost = Actual Efficiency

For the IT Manager or Software Developer who chooses to purchase and deploy their own servers for Application X, maximizing Actual Efficiency means perfectly predicting the number of servers required such that all servers are at or near 100% utilization. The more time and engineering resources spent calculating these server requirements, the higher the opportunity costs and the lower the Actual Efficiency of the effort. Provisioning extra servers as a “cushion” to handle spikes in load also increases opportunity costs and lowers Actual Efficiency. To execute perfectly would require guessing with little effort and 100% accuracy, which of course is not humanly possible. But what if you could eliminate the need to perfectly predict server requirements? What about that costly “cushion”? Can you remove both from the efficiency equation and make that somebody else’s problem? Yes. That’s where the cloud service provider steps in.
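To put rough numbers on the ratio, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function name, the $5,000 all-in cost per server, the 40 busy servers, and the $50,000 planning effort. The result is simply valued at cost, so a perfectly sized fleet scores 1.0.

```python
def actual_efficiency(avg_busy_servers, servers_deployed,
                      cost_per_server, planning_cost=0.0):
    """Result / Opportunity Cost.

    Result: the work actually done, valued at what it cost to provide.
    Opportunity cost: everything spent to have the servers ready,
    plus the engineering time spent sizing the deployment.
    """
    if servers_deployed <= 0:
        raise ValueError("servers_deployed must be positive")
    result = avg_busy_servers * cost_per_server
    opportunity_cost = servers_deployed * cost_per_server + planning_cost
    return result / opportunity_cost


# Hypothetical numbers: Application X keeps 40 servers busy on average.
perfect_guess = actual_efficiency(40, 40, 5_000.0)
with_cushion = actual_efficiency(40, 50, 5_000.0)
with_cushion_and_planning = actual_efficiency(40, 50, 5_000.0,
                                              planning_cost=50_000.0)

print(f"perfect guess:          {perfect_guess:.2f}")              # 1.00
print(f"10 server cushion:      {with_cushion:.2f}")               # 0.80
print(f"cushion plus planning:  {with_cushion_and_planning:.2f}")  # 0.67
```

The cushion and the planning effort each drag the ratio down even though the application does exactly the same amount of useful work in all three cases.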

The IT Manager or Software Developer (Consumer) can choose to run Application X on the cloud service provider’s infrastructure, radically simplifying the opportunity costs down to the actual consumption of commodities: CPU, Bandwidth, and Storage. The cloud service provider takes on the complexity and costs of calculating and deploying server requirements. Underutilized servers represent money wasted by the service provider, not the consumer.

Sounds great, right? However if the service provider is not able to do a much better job of Actual Efficiency than the consumer, remaining profitable would require passing on higher costs to the consumer, which would ultimately defeat the whole purpose. We know that this business model is working today, so how is this possible? Is the service provider that much smarter than the consumer, super-human or something, perfectly guessing with 100% accuracy? Not necessarily. Rather, the service provider has some distinct advantages and financial motivations.

  1. Excess CPU capacity can be sold on a global market (advantage)
  2. The economies of massive scale (advantage)
  3. Vested interest in efficiency (motivation)
  4. Efficiency competition among other service providers (motivation)

For the service provider, it’s all about Actual Efficiency; the business model depends on it. To remain profitable requires highly utilized servers and low opportunity costs. In achieving those outcomes, the service provider data center has a distinct advantage over the typical consumer data center. Server CPU resources are available to a global market of consumers, rather than a singular consumer. Furthermore, a global market of consumers provides the opportunity to build larger, massively scalable data centers, thereby leveraging economies of scale. More servers can be managed by fewer people, power can be converted more efficiently at Megawatts of scale, and bulk purchasing power can negotiate lower equipment costs.
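A small sketch of why a global market of consumers helps utilization. The three demand series below are invented for illustration (servers needed per time period); the only assumption that matters is that the consumers peak at different times. Each consumer buying for its own peak is compared against one provider buying for the peak of the combined load.

```python
# Hypothetical demand, in servers, over six time periods.
demand = {
    "retail":    [2, 2, 3, 9, 3, 2],
    "analytics": [8, 3, 2, 2, 2, 3],
    "media":     [2, 3, 2, 2, 4, 9],
}

periods = 6
total_work = sum(sum(series) for series in demand.values())

# Each consumer builds its own data center, sized for its own peak.
dedicated_capacity = sum(max(series) for series in demand.values())

# One provider serves everyone, sized for the peak of the combined load.
combined_load = [sum(period) for period in zip(*demand.values())]
pooled_capacity = max(combined_load)

dedicated_utilization = total_work / (dedicated_capacity * periods)
pooled_utilization = total_work / (pooled_capacity * periods)

print(f"dedicated: {dedicated_capacity} servers, "
      f"{dedicated_utilization:.0%} utilized")  # 26 servers, 40% utilized
print(f"pooled:    {pooled_capacity} servers, "
      f"{pooled_utilization:.0%} utilized")     # 14 servers, 75% utilized
```

The same work gets done either way; the pooled fleet just needs far fewer servers to do it, because one consumer’s idle hours absorb another’s peak.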

Meanwhile, new application platforms provide the ability to process huge amounts of data in shorter periods of time using a distributed computing model. Rather than processing data serially on a single machine (Application > OS > Server), application platforms used in the service provider data center, such as Hadoop with its MapReduce computational paradigm, allow applications to process large amounts of data in parallel on many machines at once (Application > Platform > Lots of Servers). The result is incredible performance across an array of low cost throw-away commodity servers. This means a larger data center with a larger network. More importantly, given that arrays of servers now collectively behave as a single CPU, there is significantly more server-to-server (east-west) traffic when compared to the classic serial processing model.
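To make the MapReduce idea concrete, here is a toy word count in plain Python. It is not the Hadoop API and it runs in a single process; the function names are made up for this sketch. The point is the shape of the computation: independent map calls that could each run on a different server, followed by a shuffle that regroups intermediate results by key. In a real cluster that shuffle is where much of the server-to-server traffic comes from.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Shuffle: regroup all intermediate pairs by key.

    On a cluster this step moves data between servers over the network.
    """
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each map_phase call is independent of the others, so a platform
    # could hand every chunk to a different machine.
    mapped = [map_phase(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


chunks = ["the network is the computer", "the cloud is the network"]
counts = word_count(chunks)
print(counts["the"], counts["network"], counts["cloud"])  # 4 2 1
```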

So if it’s true that massively scalable data centers and highly utilized servers are the way forward, how do we build the network for this new paradigm? Do we build it the same old way it’s always been done before, just bigger? I’m afraid it’s not going to be that easy. Given that the applications, the traffic patterns, and the scale have all changed, it’s time for the network architecture to change as well.

Before we start to define what this new architecture might look like, let’s take a look at some of the stated goals and requirements of the new massively scalable data center.

  • Profitability: Highly utilized servers
  • Agility: Any service on any server, at any time. Workload mobility.
  • Uniform Capacity: consistent high bandwidth and low latency between servers, regardless of physical location and proximity
  • Manageability: Plug & Play, zero touch provisioning, minimal configuration burden, minimal human interaction.
  • Modularity: easily expand or shrink with modular building blocks of repeatable pods
  • Scalability: millions of IP and MAC addresses from hundreds of thousands of servers each with virtual machines

Profitability requires highly utilized servers. Having highly utilized servers requires extreme amounts of Agility in workload placement and mobility. Any pockets of underutilized servers should be able to relieve load from any other servers with overwhelming load. A server’s location in the network should be irrelevant to its ability to run any service at any time. Achieving such Agility assumes transparency from the network, not only in the server’s address (IP/MAC) but, just as importantly, in performance (bandwidth, latency). A server’s location in the network should be irrelevant to the performance it can deliver to any assigned service: Uniform Capacity.

In other words, first and foremost, the network should not get in the way of the cloud achieving peak performance and profitability. That part should be table stakes for any architecture moving forward.

Today’s architectures fall way short of that goal. James Hamilton of Amazon writes: Data center networks are in my way. James perfectly describes, from a cloud service provider point of view, how today’s data center networks at large scale result in compromises of Agility, which ultimately affect his Profitability.

While I’m not ready yet to accept James’s portrayal of the network as a dying mainframe business model in need of commoditization, I understand the argument, and I accept the challenge. This point of view, I believe, comes from the belief that the network has no special value to the cloud other than just ‘getting out of the way’. I’m a little bit more optimistic than that.

Above and beyond the table stakes discussed earlier of Agility, Uniform Capacity, etc., the network may be able to add value to the cloud in areas that enhance the service offerings, expanding into new markets, having a positive effect on Profitability. Such as:

  • Quality of Service (QoS), service differentiation as a service
  • Service metering and monitoring, accountability
  • Security as a service
  • Service Level Agreements (SLA’s) as a service
  • TCP/HTTP performance optimization as a service (user to cloud)

If it’s true these areas would be valuable to the service provider and consumer, the question becomes: where and how best to implement these services? The network? The server? Coordination between the two? Secondly, who is best placed to deliver these services in a way that’s both reliable and cost effective? Will it be the open source software and commodity hardware approach? Or will it be the network vendor with a tight coupling of unique hardware and software, building networks with standards based protocols? I find these questions to be quite fascinating and will thoroughly enjoy watching this unfold. Or better yet, participating in the outcome ;-) But I digress.

Data center networks today are built upon the same common Ethernet/IP protocols (e.g. STP, OSPF) regardless of scale. Therefore the same design fundamentals used in smaller networks are also applied to larger networks. At scale, fundamentals such as STP rooted tree topologies limit Layer 2 domain sizes and constrain available Layer 2 uplink bandwidth, limiting Agility. OSPF areas help to scale larger Layer 3 domains and increase available bandwidth; however this comes at the cost of configuration complexity and limited workload mobility, limiting both Agility and Manageability.

There’s no denying the time has come to implement new design fundamentals that better address the specific needs of the massively scalable data center. I think everybody can agree on that. It’s important to first build a solid foundation of scalability, agility, and performance. From that solid foundation we can then look to layer on value added services on top of or within the solution.

So far much of the thought leadership in re-creating the foundation has been provided by the academic and research community, which has published several papers identifying the problems, describing many of these requirements, and proposing solutions. Two such proposals of particular interest to me are PortLand and VL2. Of the many proposals, I bring up these two because each is fairly recent and together they provide an interesting juxtaposition, with a clear dichotomy of Server vs. Network and Layer 2 vs. Layer 3.

PortLand represents the network based solution. It takes the current problems and solves them using a Layer 2 network with new forwarding behaviors on top of familiar tree based topologies. Side note: There is a lot of talk among some network vendors about “flattening the network”. While that might achieve high bandwidth, how well does it really scale? The PortLand proposal doesn’t get caught up in the “flatten the network” hype (neither does VL2). Rather, the foundation for massive scale is provided in an Edge, Aggregation, Core hierarchy leveraged by the auto-discovery protocols and new forwarding behaviors. How well PortLand actually scales is unknown, but at least the foundation is there.

Because PortLand is based on Layer 2 switching it inherently provides workload mobility and minimizes switch configuration requirements. The typical Layer 2 scalability problem with millions of arbitrary MAC addresses is removed by leveraging a hierarchical MAC addressing scheme based on a switch’s awareness of its location in the topology. The server’s actual MAC address (which could be anything) is never shown to the rest of the network. The PortLand switch re-writes the server’s MAC address (as it enters the network) to something that is hierarchical, summarized, and location specific. A service called Fabric Manager provides IP ARP resolution (among other things) such that servers will learn the PortLand assigned MAC address (PMAC) of a server, not its actual MAC. When transmitting packets to the PMAC, the PortLand switches can use load sharing techniques just like an IP router, using equal cost paths to a summarized destination. I recommend you read the proposal to gain a complete understanding.
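A minimal sketch of the hierarchical addressing idea in Python. As I read the PortLand paper, a PMAC is laid out as pod.position.port.vmid using 16, 8, 8, and 16 bits; treat that layout as my recollection rather than a spec. The `EdgeSwitch` class, its methods, and the sample addresses are hypothetical and only illustrate the re-write at the edge.

```python
def encode_pmac(pod, position, port, vmid):
    """Pack pod(16).position(8).port(8).vmid(16) into a 48-bit MAC string."""
    fields = (("pod", pod, 16), ("position", position, 8),
              ("port", port, 8), ("vmid", vmid, 16))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    packed = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{byte:02x}" for byte in packed.to_bytes(6, "big"))


def decode_pmac(pmac):
    """Recover the location fields from a PMAC string."""
    packed = int(pmac.replace(":", ""), 16)
    return {
        "pod": packed >> 32,
        "position": (packed >> 24) & 0xFF,
        "port": (packed >> 16) & 0xFF,
        "vmid": packed & 0xFFFF,
    }


class EdgeSwitch:
    """Hypothetical edge switch that hides actual MACs behind PMACs."""

    def __init__(self, pod, position):
        self.pod = pod
        self.position = position
        self.actual_to_pmac = {}
        self._next_vmid = {}  # next free vmid, tracked per port

    def learn(self, actual_mac, port):
        """Assign a PMAC the first time an actual MAC is seen on a port."""
        if actual_mac not in self.actual_to_pmac:
            vmid = self._next_vmid.get(port, 1)
            self._next_vmid[port] = vmid + 1
            self.actual_to_pmac[actual_mac] = encode_pmac(
                self.pod, self.position, port, vmid)
        return self.actual_to_pmac[actual_mac]


switch = EdgeSwitch(pod=2, position=1)
pmac = switch.learn("00:19:b9:fa:88:e2", port=0)
print(pmac)  # 00:02:01:00:00:01
print(decode_pmac(pmac)["pod"])  # 2
```

Because the pod number sits in the leading bits, a switch higher in the tree only needs one forwarding entry per pod rather than one per server, which is the summarization described above.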

PortLand is a network based solution, and therefore no changes are required on the hundreds of thousands of server machines; the solution is transparent to the connecting devices. However, changes are obviously required on the network equipment, so it will take some time before we see commercially available implementations.

On the other hand we have the VL2 proposal, which is very much a server based solution that overlays itself on top of an existing, untouched, Layer 3 network. The VL2 proposal begins by pointing out that this solution can be implemented today, with no need to wait for new unproven network technologies. VL2 is based on the fact that the server is a programmable end point (whereas the network switch typically is not). The programmability of the server provides the opportunity to implement highly customizable and feature rich network functionality in the server itself. We have already witnessed this happening with the Cisco Nexus 1000V.

VL2 proposes network scale with an Edge, Aggregation, Intermediate hierarchical network topology similar to PortLand’s. Network throughput, Uniform Capacity, is largely provided by overlaying VL2 onto a Layer 3 network configuration using well known IP routing equal cost multi path (ECMP) forwarding. To achieve workload mobility, Agility, VL2 proposes inserting a shim in the server’s TCP/IP stack that decouples the server’s network interface IP address, the “Location Address”, from the IP address the upper layer application or service is using, the “Application Address”. Applications on two servers will believe they are communicating with each other’s Application Address, while the VL2 shim encapsulates (tunnels) that traffic using the appropriate Location Addresses. The Layer 3 network delivers traffic based on the outer Location Address in the IP header. To map an Application Address to a Location Address, the VL2 shim uses a directory service to provide the resolution (similar to PortLand’s Fabric Manager). IP ARP may still be present for the Location Address to find its default gateway, but that’s immaterial to VL2. No IP ARP is required for one service to communicate with another. The VL2 shim and directory service allow the Application Address to be located anywhere, providing workload mobility and Agility. Again, I encourage you to read the proposal to gain a full understanding.
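The shim and directory interaction can be sketched in a few lines of Python. This is a toy model, not VL2’s implementation: the `Packet`, `Directory`, and `Shim` names and all of the IP addresses are made up, and things a real shim would need (caching, invalidation, failure handling) are left out. It only shows the two ideas from the paragraph above: the outer header carries location addresses, and moving a service means updating the directory rather than the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    payload: object


class Directory:
    """Hypothetical directory: the address an application uses -> location."""

    def __init__(self):
        self._location_of = {}

    def register(self, app_addr, location_addr):
        self._location_of[app_addr] = location_addr

    def lookup(self, app_addr):
        return self._location_of[app_addr]


class Shim:
    """Sits below the application and tunnels its packets."""

    def __init__(self, my_location, directory):
        self.my_location = my_location
        self.directory = directory

    def send(self, inner):
        # The Layer 3 network only ever routes on the outer addresses.
        outer_dst = self.directory.lookup(inner.dst)
        return Packet(src=self.my_location, dst=outer_dst, payload=inner)

    def receive(self, outer):
        # Strip the outer header; the application sees its own addresses.
        return outer.payload


directory = Directory()
directory.register("20.0.0.7", "10.1.3.1")
sender = Shim(my_location="10.1.1.1", directory=directory)

inner = Packet(src="20.0.0.5", dst="20.0.0.7", payload="hello")
before_move = sender.send(inner)

# The service moves to another rack: only the directory entry changes.
directory.register("20.0.0.7", "10.4.2.1")
after_move = sender.send(inner)

print(before_move.dst, after_move.dst)  # 10.1.3.1 10.4.2.1
print(sender.receive(after_move) == inner)  # True
```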

In my opinion, VL2 has some shortcomings. It doesn’t address the Manageability challenges of a large scale Layer 3 network. The network switch still needs a Layer 3 network configuration, which could differ based on its physical location. VL2 does not have the same plug and play potential as PortLand. The network switch still needs a more intense configuration and the server needs to have the shim installed. We can debate whether automation and scripting solve those problems, but the point is that’s another problem to tackle, not one inherently addressed in VL2. Furthermore, VL2’s IP-in-IP encapsulation will make it more difficult for the network to have awareness of and visibility into the services carried within. Treating the network like a dumb transport may pose a significant challenge in deploying the value added services mentioned earlier: QoS, Security, SLA’s, etc.

VL2’s biggest strength is in leveraging server programmability, which I tend to believe is a powerful enough point that it should not be ignored. We have already seen this model achieve great success in the server virtualization space (e.g. VMware vSwitch, Cisco Nexus 1000V). It’s hard to imagine an ultimate solution for the massively scalable data center that doesn’t somehow leverage server programmability in one way or another.

Other Layer 3 network based technologies such as LISP may be able to address specific problems such as workload mobility; however, I’m not sure configuring LISP on every switch would be a manageable solution either. In my humble opinion, LISP would be a great fit for the Layer 3 internet gateways at the edge of the massively scalable DC, providing Inter-DC workload Agility, while something else like VL2 or PortLand handles the Intra-DC mobility beneath.

Call me crazy, or naive, but I’m dreaming of a solution that combines the intelligence of the network (à la PortLand) with the flexibility of server programmability (à la VL2), with something like LISP sitting on top of it all.

By the way, guess which company has experience in combining both network programmability and server programmability into a single intelligent system?

Just say’n ;-)


Disclaimer: The views and opinions expressed are those of the author, and not necessarily the views and opinions of the author’s employer. The author is not an official media spokesperson for Cisco Systems, Inc. For design guidance that best suits your needs, please consult your local Cisco representative.