Emergence of the Massively Scalable Data Center

Lately I’ve been thinking a lot about Massively Scalable Data Centers.  These are the emerging data centers built by cloud service providers that may contain tens or even hundreds of thousands of physical server machines.  Throw a modest amount of server virtualization on top and you could have more than a million end nodes in a single brick-and-mortar facility, or perhaps dispersed across a collection of internet-connected pods.  Examples: Microsoft, Google, Amazon.

What’s interesting to me about this space is not just the massive scale (that too) but mostly the cloud-driven economics, new applications, changing traffic patterns, and the market-disruptive architectures that will inevitably surface as a result.  New start-ups might emerge with unique solutions for these very specific customers.  Open source projects combined with a dumbed-down network may lead to the full-scale commoditization of the network switch.  Or perhaps a more intelligent, cloud-aware network will enable service providers to offer new products and services, expanding into new markets.

Let’s start with the economics.  Why on earth would anyone build and maintain a facility with a hundred thousand or more servers?  That must be incredibly expensive, right?  It certainly is!  James Hamilton of Amazon points out that the majority of overall data center costs come from acquiring servers with short amortization cycles — not the network.  The fact is, this is probably the case for data centers of all sizes: large, medium, and small.  And therein lies the economic opportunity for both the cloud service provider and their customers.

For the IT Manager or Software Developer, why buy a bunch of servers for “Application X”, absorbing all of that cost and months of deployment effort, when you can instead run your application today on somebody else’s servers, paying only for what you really need (CPU, Bandwidth, Storage), and only for the amount you use?  What was once unthinkable (or at best laughable) just five years ago is now a real option worth serious consideration.  Given the obvious security concerns, not every application is a candidate today, but server-hungry, data-crunching apps could be perfect candidates.  With time, the security protections offered by the service provider will only get better, creating opportunities for more applications.

To make a compelling case, the cloud service provider sets up an infrastructure that can run Application X orders of magnitude cheaper, instantly provisioned, and backed by the performance and elasticity of a massive data center.  The consumer is saving time and lowering costs, while the service provider is making money.  It’s a win-win.  Sounds nice, but how is this possible?  And what does this have to do with massive scale?

What makes this work is the universal truth that a server which is idle and unused is a server wasting money.  Think of the server as an “Opportunity” to achieve a “Result”.  As in life, every opportunity has an associated cost, an “opportunity cost”, which is a prerequisite to realizing results.  Furthermore, the opportunity cost is decoupled from the subsequent results achieved.  For example, if you buy a business for $10,000 – whether you subsequently make $1 million or go bankrupt (the result), your opportunity cost was still $10,000.  The same fundamental principle is true for data centers and servers.  Every cost required to get a server in a rack, powered up, cooled, connected, managed, and ready to begin its first transaction represents the opportunity cost, incurred regardless of whether the server is subsequently 10% utilized or 100% utilized (the result).  The server with 10% utilization has wasted 90% of its opportunity cost (money).  Higher opportunity costs mean more money wasted.

I’ll refer to this as the “Actual Efficiency”, which would be calculated as follows:

Result / Opportunity Cost = Actual Efficiency

For the IT Manager or Software Developer who chooses to purchase and deploy their own servers for Application X, maximizing Actual Efficiency means perfectly predicting the number of servers required such that all servers are at or near 100% utilization.  The more time and engineering resources spent calculating these server requirements, the higher the opportunity costs, and the lower the Actual Efficiency of the effort.  Provisioning extra servers as a “cushion” to handle spikes in load also increases opportunity costs and lowers Actual Efficiency.  To execute perfectly would require guessing with little effort and 100% accuracy, which of course is not humanly possible.  But what if you could eliminate the need to perfectly predict server requirements?  What about that costly “cushion”?  Can you remove both from the efficiency equation and make that somebody else’s problem?  Yes.  That’s where the cloud service provider steps in.
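To make the trade-off concrete, here is a minimal sketch (Python, with made-up utilization and margin numbers) comparing the Actual Efficiency of a self-provisioned deployment against a pay-per-use cloud deployment:

```python
def actual_efficiency(result, opportunity_cost):
    """Actual Efficiency = Result / Opportunity Cost."""
    return result / opportunity_cost

# Self-hosted: buy 10 servers' worth of capacity up front (opportunity cost),
# but average utilization is only 30%, so the realized result covers just
# 3 servers' worth of work.
self_hosted = actual_efficiency(result=3.0, opportunity_cost=10.0)

# Cloud: pay only for what is consumed, so cost closely tracks the result
# (plus the provider's margin, assumed here to be 20%).
cloud = actual_efficiency(result=3.0, opportunity_cost=3.0 * 1.2)

print(f"self-hosted: {self_hosted:.0%}, cloud: {cloud:.0%}")
```

Even after paying the provider’s margin, the consumer’s efficiency improves, because the idle “cushion” is no longer part of their opportunity cost.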

The IT Manager or Software Developer (Consumer) can choose to run Application X on the cloud service provider’s infrastructure, radically simplifying the opportunity costs down to the actual consumption of commodities: CPU, Bandwidth, and Storage.  The cloud service provider takes on the complexity and costs of calculating and deploying server requirements.  Underutilized servers represent money wasted by the service provider, not the consumer.

Sounds great, right?  However, if the service provider is not able to do a much better job on Actual Efficiency than the consumer, remaining profitable would require passing higher costs on to the consumer, which would ultimately defeat the whole purpose.  We know that this business model is working today, so how is this possible?  Is the service provider that much smarter than the consumer, super-human or something, perfectly guessing with 100% accuracy?  Not necessarily.  Rather, the service provider has some distinct advantages and financial motivations.

1) Excess CPU capacity can be sold on a global market (advantage)

2) The economies of massive scale (advantage)

3) Vested interest in efficiency (motivation)

4) Efficiency competition among other service providers (motivation)

For the service provider, it’s all about Actual Efficiency; the business model depends on it.  To remain profitable requires highly utilized servers and low opportunity costs.  In achieving those outcomes, the service provider data center has a distinct advantage over the typical consumer data center.  Server CPU resources are available to a global market of consumers, rather than a single consumer.  Furthermore, a global market of consumers provides the opportunity to build larger, massively scalable data centers, thereby leveraging economies of scale.  More servers can be managed by fewer people, power can be converted more efficiently at megawatt scale, and bulk purchasing power can negotiate lower equipment costs.

Meanwhile, new application platforms provide the ability to process huge amounts of data in shorter periods of time using a distributed computing model.  Rather than processing data in a serial fashion on a single machine (Application > OS > Server), application platforms leveraged in the service provider data center, such as Hadoop with computational paradigms like MapReduce, allow applications to process large amounts of data in parallel, on many machines at once (Application > Platform > Lots of Servers), providing incredible performance across an array of low-cost, throw-away commodity servers.  This means a larger data center with a larger network.  More importantly, given that arrays of servers now collectively behave as a single CPU, the result is significantly more server-to-server (east-west) traffic when compared to the classic serial processing model.
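As a toy illustration of the parallel model (plain Python standing in for what Hadoop distributes across thousands of machines; the shard data is invented), a MapReduce-style word count looks like this:

```python
from collections import defaultdict

# Map phase: each "server" emits (word, 1) pairs for its shard of the data.
def map_phase(shard):
    return [(word, 1) for word in shard.split()]

# Shuffle: group intermediate pairs by key, as the platform would do across
# the network -- the source of all that east-west traffic.
def shuffle(mapped):
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return groups

# Reduce phase: aggregate each key's values, again in parallel across servers.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

shards = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle([map_phase(s) for s in shards]))
print(counts["the"])  # each shard would run on a different machine in practice
```

The shuffle step is the key point here: in a real cluster, every map output potentially crosses the network to reach its reducer, which is exactly the east-west traffic pattern described above.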

So if it’s true that massively scalable data centers and highly utilized servers are the way forward, how do we build the network for this new paradigm?  Do we build it the same old way it’s always been done before, just bigger?  I’m afraid it’s not going to be that easy.  Given that the applications, traffic patterns, and scale have all changed, it’s time for the network architecture to change as well.

Before we start to define what this new architecture might look like, let’s take a look at some of the stated goals and requirements of the new massively scalable data center.

  • Profitability: Highly utilized servers
  • Agility: Any service on any server, at any time.  Workload mobility.
  • Uniform Capacity: consistent high bandwidth and low latency between servers, regardless of physical location and proximity
  • Manageability: Plug & Play, zero touch provisioning, minimal configuration burden, minimal human interaction.
  • Modularity: easily expand or shrink with modular building blocks of repeatable pods
  • Scalability: millions of IP and MAC addresses from hundreds of thousands of servers each with virtual machines

Profitability requires highly utilized servers.  To have highly utilized servers requires extreme amounts of Agility in workload placement and mobility.  Any pocket of underutilized servers should be able to relieve load from any other servers under overwhelming load.  A server’s location in the network should be irrelevant to its ability to run any service at any time.  To achieve such Agility assumes transparency from the network, not only in the server’s address (IP/MAC), but just as importantly in performance (bandwidth, latency).  A server’s location in the network should be irrelevant to the performance it can deliver to any assigned service; that is Uniform Capacity.

In other words, first and foremost, the network should not get in the way of the cloud achieving peak performance and profitability.  That part should be table stakes for any architecture moving forward.

Today’s current architectures fall way short of that goal.  James Hamilton of Amazon writes: “Data center networks are in my way.”  James perfectly describes, from a cloud service provider’s point of view, how today’s data center networks at large scale result in compromises of Agility, which ultimately affect his Profitability.

While I’m not ready yet to accept James’s portrayal of the network as a dying mainframe business model in need of commoditization, I understand the argument, and I accept the challenge.  This point of view, I believe, comes from the notion that the network has no special value to the cloud other than just ‘getting out of the way’.  I’m a little bit more optimistic than that.

Above and beyond the table stakes discussed earlier (Agility, Uniform Capacity, etc.), the network may be able to add value to the cloud in areas that enhance the service offerings and expand into new markets, having a positive effect on Profitability.  Such as:

  • Quality of Service (QoS), service differentiation as a service
  • Service metering and monitoring, accountability
  • Security as a service
  • Service Level Agreements (SLAs) as a service
  • TCP/HTTP performance optimization as a service (user to cloud)

If it’s true these areas would be valuable to the service provider and consumer, the question becomes: where and how best to implement these services?  The network?  The server?  Coordination between the two?  Secondly, who is best positioned to deliver these services in a way that’s both reliable and cost effective?  Will it be the open source software and commodity hardware approach?  Or will it be the network vendor with a tight coupling of unique hardware and software, building networks with standards-based protocols?  I find these questions to be quite fascinating and will thoroughly enjoy watching this unfold.  Or better yet, participating in the outcome 😉 But I digress.

Data center networks today are built upon the same common Ethernet/IP protocols (e.g. STP, OSPF) regardless of scale.  Therefore the same design fundamentals used in smaller networks are also applied to larger networks.  At scale, fundamentals such as STP rooted-tree topologies limit Layer 2 domain sizes and constrain available Layer 2 uplink bandwidth, limiting Agility.  OSPF areas help to scale larger Layer 3 domains and increase available bandwidth; however, this comes at the cost of configuration complexity and limited workload mobility, limiting both Agility and Manageability.

There’s no denying the time has come to implement new design fundamentals that better address the specific needs of the massively scalable data center.  I think everybody can agree on that.  It’s important to first build a solid foundation of scalability, agility, and performance.  From that solid foundation we can then look to layer on value added services on top of or within the solution.

So far much of the thought leadership in re-creating the foundation has been provided by the academic and research community, which has published several papers identifying the problems, describing many of these requirements, and proposing solutions.  Two such proposals of particular interest to me are PortLand and VL2.  Of the many proposals, I bring up these two because each is fairly recent, and together they provide an interesting juxtaposition along the clear dichotomy of Server vs. Network, Layer 2 vs. Layer 3.

PortLand represents the network-based solution.  It takes the current problems and solves them using a Layer 2 network with new forwarding behaviors on top of familiar tree-based topologies.  Side note: there is a lot of talk among some network vendors about “flattening the network”.  While that might achieve high bandwidth, how well does it really scale?  The PortLand proposal doesn’t get caught up in the “flatten the network” hype (neither does VL2).  Rather, the foundation for massive scale is provided by an Edge, Aggregation, Core hierarchy leveraged by auto-discovery protocols and new forwarding behaviors.  How well PortLand actually scales is unknown, but at least the foundation is there.

Because PortLand is based on Layer 2 switching, it inherently provides workload mobility and minimizes switch configuration requirements.  The typical Layer 2 scalability problem with millions of arbitrary MAC addresses is removed by leveraging a hierarchical MAC addressing scheme based on a switch’s awareness of its location in the topology.  The server’s actual MAC address (which could be anything) is never shown to the rest of the network.  The PortLand switch re-writes the server’s MAC address (as it enters the network) to something that is hierarchical, summarized, and location specific.  A service called the Fabric Manager provides IP ARP resolution (among other things) such that servers will learn the PortLand-assigned MAC address (PMAC) of a server, not its actual MAC.  When transmitting packets to the PMAC, the PortLand switches can use load-sharing techniques just like an IP router, using equal-cost paths to a summarized destination.  I recommend you read the proposal to gain a complete understanding.
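A rough sketch of the PMAC idea (Python; the 48-bit pod:position:port:vmid field layout follows the PortLand paper, while the example values are invented for illustration):

```python
def pmac(pod, position, port, vmid):
    """Pack a hierarchical PMAC as pod:position:port:vmid
    (16:8:8:16 bits, per the layout described in the PortLand paper)."""
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    # Render the 48-bit value as a standard colon-separated MAC string.
    return ":".join(f"{(value >> s) & 0xFF:02x}" for s in range(40, -1, -8))

# A switch at pod 2, position 1 rewrites the source MAC of a host on its
# port 5 (first VM) to a location-specific, summarizable address:
print(pmac(pod=2, position=1, port=5, vmid=1))  # → 00:02:01:05:00:01
```

Because the high-order bytes encode location, upstream switches can forward on a summarized prefix (effectively “all of pod 2”) instead of learning millions of flat MAC addresses.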

PortLand is a network based solution, and therefore no changes are required on the hundreds of thousands of server machines – the solution is transparent to the connecting devices.  However, obviously, changes are required on the network equipment which will take some time before we see commercially available implementations.

On the other hand we have the VL2 proposal, which is very much a server-based solution that overlays itself on top of an existing, untouched, Layer 3 network.  The VL2 proposal begins by pointing out that this solution can be implemented today – no need to wait for new, unproven network technologies.  VL2 is based on the fact that the server is a programmable end point (whereas the network switch typically is not).  The programmability of the server provides the opportunity to implement highly customizable and feature-rich network functionality in the server itself.  We have already witnessed this happening with the Cisco Nexus 1000V.

VL2 proposes network scale with a similar Edge, Aggregation, Intermediate hierarchical network topology as PortLand.  Network throughput, Uniform Capacity, is largely provided by overlaying VL2 onto a Layer 3 network configuration using well-known IP routing equal-cost multi-path (ECMP) forwarding.  To achieve workload mobility, Agility, VL2 proposes inserting a shim into the server’s TCP/IP stack that decouples the server’s network interface IP address, the “Location Address”, from the IP address the upper-layer application or service is using, the “Actual Address”.  Applications on two servers will believe they are communicating with each other’s “Actual Addresses”, while the VL2 shim encapsulates (tunnels) that traffic using the appropriate Location Addresses.  The Layer 3 network delivers traffic based on the outer Location Address in the IP header.  To provide the appropriate mapping of Actual Address to Location Address, the VL2 shim uses a directory service for resolution (similar to PortLand’s Fabric Manager).  IP ARP may still be present for the Location Address to find its default gateway, but that’s immaterial to VL2.  No IP ARP is required for one service to communicate with another.  The VL2 shim and directory service allow the “Actual Address” to be located anywhere, providing workload mobility and Agility.  Again, I encourage you to read the proposal to gain a full understanding.
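The shim’s behavior can be sketched roughly like this (Python; the addresses, and the dictionary standing in for the directory service, are invented for illustration):

```python
# Hypothetical directory service mapping Actual Address -> Location Address.
# In VL2 this lookup is a network service; a dict stands in for it here.
directory = {
    "10.0.0.5": "192.168.1.10",   # service AA -> its current LA (made-up values)
    "10.0.0.9": "192.168.2.20",
}

def encapsulate(payload, dst_aa):
    """Sketch of the VL2 shim on the sending host: wrap the application's
    packet (addressed to the AA) in an outer header carrying the LA that
    the underlying Layer 3 network actually routes on."""
    la = directory[dst_aa]   # directory lookup takes the place of ARP
    return {"outer_dst": la, "inner_dst": dst_aa, "payload": payload}

frame = encapsulate(b"app data", "10.0.0.5")
# The L3 fabric forwards on outer_dst; the receiving shim strips the outer
# header and delivers the payload to the service at inner_dst.
print(frame["outer_dst"])  # → 192.168.1.10
```

Workload mobility then reduces to updating the directory entry: the AA stays constant while the LA changes with the service’s physical location.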

In my opinion, VL2 has some shortcomings.  It doesn’t address the Manageability challenges of a large scale Layer 3 network.  The network switch still needs a Layer 3 configuration, which could differ based on its physical location.  VL2 does not have the same plug-and-play potential as PortLand.  The network switch still needs a more intensive configuration, and the server needs to have a shim installed.  We can debate whether automation and scripting solve those problems, but the point is that’s another problem to tackle, not one inherently addressed by VL2.  Furthermore, VL2’s IP-in-IP encapsulation will make it more difficult for the network to have awareness and visibility into the services carried within.  Treating the network like a dumb transport may pose a significant challenge in deploying the value-added services mentioned earlier: QoS, Security, SLAs, etc.

VL2’s biggest strength is in leveraging server programmability, which I tend to believe is a powerful enough point that it should not be ignored.  We have already seen this model achieve great success in the server virtualization space (e.g. VMware vSwitch, Cisco Nexus 1000V).  It’s hard to imagine an ultimate solution for the massively scalable data center that doesn’t somehow leverage server programmability in one way or another.

Other Layer 3 network-based technologies such as LISP may be able to address specific problems such as workload mobility; however, I’m not sure configuring LISP on every switch would be a manageable solution either.  In my *humble* opinion, LISP would be a great fit for the Layer 3 internet gateways at the edge of the massively scalable DC, providing Inter-DC workload Agility, while something else like VL2 or PortLand handles the Intra-DC mobility beneath.

Call me crazy, or naive, but I’m dreaming of a solution that collectively combines the intelligence of the network (ala PortLand) with the flexibility of server programmability (ala VL2), with something like LISP sitting on top of it all.

By the way, guess which company has experience in combining both network programmability and server programmability into a single intelligent system?

Just say’n  😉


Disclaimer:  The views and opinions expressed are those of the author, and not necessarily the views and opinions of the author’s employer.  The author is not an official media spokesperson for Cisco Systems, Inc.  For design guidance that best suits your needs, please consult your local Cisco representative.


  1. says

    Hi Brad,

    It was only about 20-30 years ago that we tried scaling MPP systems in a very similar manner :) The now-popular Fat-Tree topology dates back to the 80s, and many modern routing scaling proposals are related to compact routing concepts developed for MPP systems.  Modern DC networks are simply what used to be the MPP interconnects, though we see a “special” trend where local memory resources are being decoupled from processors and separated into an entity of their own (storage).  For an interesting comparison I would suggest reviewing the Cray T3D architecture, which is one of the most classical among MPPs.

    As for maxing out resource utilization – in your context, the physical machines – I would like to point out there are certain limits to this.  Firstly, the modern data-center virtualization trend is nothing more than the old stochastic multiplexing principle, applied in a way where virtual machines are dynamically multiplexed (mapped) onto a set of hardware resources (physical servers).  vMotion, say, is just a way to balance the load across physical machines.  Just like with any stochastic multiplexing system, the goal is to maximize physical resource utilization.  The trade-off is well known too – by maximizing utilization we quickly lose precious QoS guarantees as system utilization climbs above 50-60%.  In fact, queueing theory states that the queue depth for requests to be served by a resource grows approximately as:

    Q ≈ U / (1 − U)

    where U is resource utilization (U=1 means 100% utilization).  Thus, reaching higher utilization (e.g. 80% yields Q=4) results in a dramatic increase in wait times, especially for “slower” resources.  This effect is well known in packet-switched networks, where links are commonly over-provisioned to keep average utilization well below the “critical” level, and QoS policies are used to handle the occasional peak-rate utilization bursts.  Of course, not every task is time sensitive, though many business-critical applications are.  Through the years we learned that using priority queue scheduling we could let important tasks utilize the resource capacity they need while handling the “low priority” tasks in the background using some form of fair scheduling.  The problem is that the resource capacity you can allocate to “priority” tasks is limited to approximately 30-40% due to contention among the critical tasks – this could be calculated based on desired QoS guarantees.  In fact, it might be even less if there is a requirement for real-time processing.

    This quick example illustrates the fact that a stochastically multiplexed resource has some ‘maximum’ utilization that cannot be exceeded if some sort of guarantee is required.  This pretty much forces any data center that intends to handle time-sensitive applications to be “over-provisioned”.  Even if such a requirement does not exist, reaching maximum utilization makes a stochastically multiplexed system unstable due to a dramatic increase in wait times and potential feedback loops.
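    A quick numerical sketch of that growth (illustrative Python using the Q = U / (1 − U) approximation described above):

```python
def queue_depth(u):
    """Approximate queue depth Q = U / (1 - U) for a stochastically
    multiplexed resource at utilization U (M/M/1-style approximation)."""
    return u / (1.0 - u)

for u in (0.5, 0.6, 0.8, 0.9):
    print(f"U={u:.0%}  Q={queue_depth(u):.1f}")
# Wait times grow gently up to ~50-60% utilization, then explode:
# U=80% gives Q=4, U=90% gives Q=9.
```

    The steep knee in that curve is exactly why links (and servers) carrying time-sensitive work are kept well below full utilization.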

    I have a few more comments to throw in, but not much time to post them right now 😀

    Thanks for your blog,


  2. says

    Brad, what do you think about Xsigo?

    It seems like an extremely efficient and flexible way of providing I/O consolidation without all the cost and complications of FCoE.

    Would love to hear your thoughts on this.


  3. joe smith says


    This is a series of deep-dive questions about the interaction between PFC and ETS, buffer allocation and bandwidth allocation as implemented in the Nexus 5000. I’d rather have a vendor-neutral discussion, but using the Nexus 5000 could prove to be a good tool toward understanding the standards themselves and the architectures and primitives they suggest.

    I hope you can help. Perhaps starting a new thread altogether is in order since this discussion is very specific. Moreover, from my experience, there are VERY few people out there who really understand the architecture and operation of these protocols, including CCIEs and CCIE-caliber engineers that I have posed these questions to.

    1.) Each class of traffic that you assign to a no-drop system class will get its own set of buffers, thereby eliminating contention for buffer resources from the drop system class.  So, for the default setup, in which FCoE is the only class of traffic that is no-drop, does that mean that FCoE gets half the 480KB interface buffer space, and that the drop system class gets the rest?  In other words, is it a 50-50 split of the buffers?

    1a) Corollary: What if another traffic class is added to the no-drop system class – let’s say it’s iSCSI traffic – will the iSCSI traffic get its equal share of the buffer space, too, such that now FCoE gets 33%, iSCSI 33%, and the rest of the traffic in the drop class 33% of the interface buffer?

    2.) Using the language found in the IEEE ETS standard itself, they use the term “priority group” (PG) to differentiate the different types of traffic.  Do those priority groups map to Cisco’s system classes?  I think they do.  Cisco’s PFC implementation speaks of traffic that belongs to the drop and no-drop System Classes, and each of those classes can include traffic types from several different priority classes (CoS values).  This is much the same way that ETS refers to priority groups, in which each group can consist of traffic belonging to several different classes of service, too.  Thoughts?

    3.) If you leave FCoE in the no-drop system class and leave everything else in the default drop system class, does this mean that, from the perspective of ETS, each system class is automatically assigned a minimum bandwidth of 50% – with the ability, of course, to leverage any unused bandwidth that the other system class may not be using at any given time?  Moreover, is that bandwidth allotment configurable?  For example, can you assign only 40% to the no-drop class and 60% to the drop class?  And once again, let’s say you configure another lossless traffic type, like iSCSI, and add it to the no-drop system class – will ETS distribute bandwidth to iSCSI such that FCoE gets 33%, iSCSI gets 33%, and the rest gets 33%?

    4.) This isn’t a question but more of a statement that I need validated by you. It seems that even though ETS can allow several different traffic priorities (CoS values) to be assigned to a single priority group, the bandwidth allocation mechanism of ETS does NOT assign bandwidth on a CoS level, but only on a priority group level, which is why the standard recommends that traffic classes with the same traffic congestion management requirements be placed in the same priority group.

    • says

      Hi Joe,
      In a nutshell, the amount of bandwidth you can guarantee to a traffic class is not directly related to the amount of buffers allocated to it. The bandwidth allotment is configurable.
      As you have found, the ETS standard, like many others, doesn’t tell the switch vendor exactly how to implement the feature. Rather it just provides a general framework.
      For the Nexus 5000, you can define priority groups based on CoS values and configure individual bandwidth guarantees per CoS value(s), if you want.  On other platforms, such as the Nexus 5500, you can classify “priority groups” based on DSCP (among other ways), if you want; the hardware is capable of that.  So, the switch vendor is free to implement the standard in ways that meet the framework while providing additional value to the customer and differentiation among competitors.
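      To illustrate the guaranteed-minimum-plus-borrowing behavior ETS describes (an illustrative Python model, not any vendor’s actual scheduler; the percentages are invented):

```python
def ets_share(demands, guarantees):
    """Sketch of ETS-style sharing: each priority group is guaranteed its
    configured minimum bandwidth percentage, and bandwidth a group leaves
    unused is redistributed to groups that can still use it."""
    granted = {g: min(demands[g], guarantees[g]) for g in demands}
    spare = 100 - sum(granted.values())
    hungry = [g for g in demands if demands[g] > granted[g]]
    while spare > 0 and hungry:
        per = spare / len(hungry)  # fair split of leftover bandwidth
        spare = 0
        for g in list(hungry):
            extra = min(per, demands[g] - granted[g])
            granted[g] += extra
            spare += per - extra   # anything a group can't use stays spare
            if demands[g] <= granted[g]:
                hungry.remove(g)
    return granted

# FCoE guaranteed 40%, everything else 60%; FCoE only offers 10% of load,
# so the drop class can borrow the unused 30%:
print(ets_share({"fcoe": 10, "drop": 90}, {"fcoe": 40, "drop": 60}))
```

      The key property is that guarantees are floors, not ceilings: a lightly loaded no-drop class never strands bandwidth that the drop class could use.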


  4. joe smith says

    Brad, you must be super busy with your new gig because I haven’t seen you engage anyone on your blog in weeks. Where for art thou, young man??? :-)

  5. says

    Unfortunately, Brad, you make some big assumptions about the economics of data centers – as does James Hamilton (with whom I have corresponded).  And equally unfortunate, these assumptions are to the detriment of Cisco.

    The average large-scale enterprise data centers – call them the Global 5000, Cisco’s customers – run thousands of applications on their servers, most with unique requirements and SLAs.  Amazon doesn’t come close to managing that type or level of complexity.  Theirs is a different sort of complexity, though Amazon’s and Google’s approach does have a certain sex appeal.

    But let’s be clear on one thing. Nobody is calling up the CEO of Amazon in the middle of the night because a series of settlement transactions got screwed up and 50 companies now, inadvertently, have $200M of your company’s money.

    Now onto the facts about what really matters to Cisco’s enterprise customers.

    First, the big compute cost in enterprise-class data centers is not servers.  It is software.  And these customers pay a boatload of money to MS, IBM, Oracle, etc., for the software.  It runs 3-15 times the cost of the server for complex stuff.  And that stuff and those costs are not going away anytime soon.

    So my questions are these: How much does Cisco make on database software? How much UCS is Cisco selling to Facebook or Amazon?

    My observation is Cisco could be selling a lot more UCS to existing clients if Cisco better targeted their solution offering to these enterprise class customers. CFOs don’t give a hoot about tech talk about Amazon.

    Most of those companies hold onto their servers 3-5 years. In the Classic Virtualization model – why is that still the case?

    When software is 3-15 times the cost of the server, and server capacity or speed grows 25-40% per year, the financially savvy thing to do is refresh those servers far more often.  That means Cisco sells more high-end servers!

    Who loses? Oracle, IBM, MS. Enterprise data centers don’t have to purchase more very expensive licenses. IBM is not going to bring a solution like that to a customer. Why? Their margins are better on software. Better to sell more software than servers.

    Since Cisco doesn’t have a server software franchise to protect, why is it Cisco lets IT managers spend budget on more software without push-back? Cisco brings the productivity gains via servers, but MS, Oracle and IBM make the money.

    BTW, I have been a CIO and CFO running large-scale outsourcing. That is why we formed Ravello Analytics. If Cisco wants to lower their customer’s compute costs, they need to sell them more servers – more often – so those customers can keep their license costs low.

  6. Ming says

    This is great writing.  I really enjoyed reading it.

    However, I find a major hole in the PortLand paper. In section 3.3 Proxy-based ARP, it describes how VM migration is handled by saying:

    “There is one additional detail for supporting VM migration. Upon completing migration from one physical machine to another, the VM sends a gratuitous ARP with its new IP to MAC address mapping. This ARP is forwarded to the fabric manager in the normal manner. Unfortunately, any hosts communicating with the migrated VM will maintain that host’s previous PMAC in their ARP cache and will be unable to continue communication until their ARP cache entry times out. To address this limitation, the fabric manager forwards an invalidation message to the migrated VM’s previous switch. This message sets up a flow table entry to trap handling of subsequent packets destined to the invalidated PMAC to the switch software. The switch software transmits a unicast gratuitous ARP back to any transmitting host to set the new PMAC address in that host’s ARP cache.”

    My challenge is to the last sentence in this statement.  How does the switch know which hosts it needs to transmit a unicast gratuitous ARP to?  And if there are hundreds of thousands of VMs communicating with the moved VM (a real requirement in some large data centers), how can that large number of gratuitous ARPs be transmitted to each of those VMs?  Since VM mobility is a key feature in server virtualization, this issue must be addressed in any valid proposal.


    • says

      Hi Ming,
      Implementing large scale Layer 2 in the physical network to support live migration of VM’s is a daunting task, and it doesn’t scale well. To your point, the PortLand paper is no exception.
      Hence, there is a lot of interest now in implementing Layer 2 in the virtual network, with software switching and overlays, leaving the physical network as a standard Layer 3 fabric.  Check out the NVo3 problem statement draft.  In my opinion, software switching overlays are inevitable.  And while they’re not perfect, they certainly provide the best solution yet to large scale Layer 2.

