The “appliance” approach to Big Data and Private Cloud

Filed in Big Data, Fabrics, Hadoop, OpenStack, VMware on August 27, 2012

In the past I’ve written about the basics of Hadoop network traffic, 10GE Hadoop clusters, and Leaf/Spine fabrics, and talked about how you might construct a fabric for a Big Data and private cloud environment.  In this post I’ll continue on that theme with a high-level discussion of linking these environments to the rest of the world via the existing monolithic IT data center network.

The conventional wisdom and default thinking usually runs along one of two lines:

  1. Expand the existing monolithic IT data center network to absorb the new big data or private cloud environment.
  2. Build a unique fabric for the new environment, then link that fabric to the existing IT network via switch-to-switch connections and networking protocols.

Rather than bore you with the conventional wisdom, I’ll break from it and instead discuss a third approach, one where we treat the big data / cloud fabric and its machines collectively as an “appliance”.  To link this appliance to the outside world we use x86 Linux based gateway machines.  This is something I’ve encountered several times now in my travels, listening to customers talk about their plans or production environments.

In the appliance model we recognize that not all of the machines in our big data cluster or private cloud need a direct network path to the outside world.  Many of these machines are worker “nodes” holding and processing sensitive data, or hosting virtual machines.  A worker node communicates directly with other worker nodes or with the master coordination nodes in the cluster; this is where we observe so-called “East-West” traffic.

We also recognize that outside access to the data or VMs residing on the worker nodes is often facilitated through another machine providing a gateway function for the cluster.  Two examples of such a gateway machine are the “Client” machines (sometimes called “Edge Nodes”) in Hadoop, and the Quantum networking machines providing L3/NAT services in OpenStack.  Another example might be machines in a VMware vCloud running instances of vShield Edge.

The gateway machines are the interface point and pathway to our “appliance”, having one leg (NIC) on the inside connected to the cluster fabric, and another leg (NIC) on the outside connected to the existing data center network.  Outside of the cluster we have users who want to access the data or applications inside our “appliance”, a typical Client/Server relationship; this is where we observe so-called “North-South” traffic.
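To make the gateway role concrete, here is a minimal sketch of how such a dual-homed Linux machine might provide routed, NAT’d access between its two legs, in the spirit of the OpenStack Quantum L3/NAT example above. The interface names, the cluster subnet, and the idea of driving iptables from a small Python script are all assumptions for illustration, not a prescription from any particular product.

```python
import subprocess

# Hypothetical interface names and addressing; adjust to your environment.
INSIDE_IF = "eth0"             # leg connected to the cluster fabric
OUTSIDE_IF = "eth1"            # leg connected to the existing data center network
CLUSTER_NET = "10.10.0.0/16"   # illustrative cluster fabric subnet

def run(cmd):
    """Run a shell command and raise if it fails."""
    subprocess.check_call(cmd, shell=True)

def enable_gateway():
    # Let the kernel forward packets between the two legs.
    run("sysctl -w net.ipv4.ip_forward=1")

    # Hide cluster addressing from the outside network (source NAT),
    # much like the L3/NAT function a Quantum network node provides.
    run("iptables -t nat -A POSTROUTING -s %s -o %s -j MASQUERADE"
        % (CLUSTER_NET, OUTSIDE_IF))

    # Forward cluster-originated traffic out, and only return traffic back in.
    run("iptables -A FORWARD -i %s -o %s -j ACCEPT" % (INSIDE_IF, OUTSIDE_IF))
    run("iptables -A FORWARD -i %s -o %s -m state --state ESTABLISHED,RELATED -j ACCEPT"
        % (OUTSIDE_IF, INSIDE_IF))
    run("iptables -P FORWARD DROP")

if __name__ == "__main__":
    enable_gateway()
```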

How does this differ from the aforementioned conventional approaches?

Simple

The connection point between our cluster “appliance” and the existing IT data center network is like that of any other server.  The IT data center network administrator need only provide the appropriate server access connectivity to the “outside” leg of the gateway machine(s).  There are no IP routing protocols or spanning-tree connections to engineer between the East-West cluster fabric and the North-South data center network.  Even more interesting, because the East-West cluster fabric is purpose-built and isolated from the main data center network, the cluster administrator can implement tools that orchestrate and simplify the cluster network switch provisioning specific to the needs of the cluster nodes, one step closer to a self-contained, appliance-like deployment model.
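As a rough sketch of that last point, the snippet below renders a per-port configuration for one cluster leaf (top of rack) switch from a single description of the attached nodes. The config syntax, port names, and VLAN number are hypothetical placeholders; the only point is that an isolated, purpose-built fabric lets the cluster administrator template and generate switch settings without coordinating routing or spanning-tree with the main data center network team.

```python
# Hypothetical node-to-port mapping for one leaf switch in the cluster fabric.
CLUSTER_VLAN = 100
NODE_PORTS = {
    "worker-01": "Ethernet1",
    "worker-02": "Ethernet2",
    "gateway-01": "Ethernet48",   # the gateway's inside leg lands here
}

# Illustrative, switch-OS-agnostic config stanza.
PORT_TEMPLATE = """interface {port}
  description {node}
  switchport access vlan {vlan}
"""

def render_leaf_config(node_ports, vlan):
    """Render a per-port config stanza for every cluster node on this leaf."""
    return "\n".join(
        PORT_TEMPLATE.format(port=port, node=node, vlan=vlan)
        for node, port in sorted(node_ports.items())
    )

if __name__ == "__main__":
    # In practice this text would be pushed by whatever provisioning tool
    # the cluster administrator has chosen; here we simply print it.
    print(render_leaf_config(NODE_PORTS, CLUSTER_VLAN))
```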

Secure

With no direct network path between the cluster fabric and the outside world, access to our cluster resources must traverse the x86 Linux based gateway machines, which have far more security controls available than a typical network switch.  Security-hardened Linux kernels (SELinux) and firewalls (iptables) are two examples of freely available and well understood Linux based security.  Additionally, any disruption or instability event in the main data center network will not cascade (via network protocols) into the cluster fabric.
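As a minimal example of what that looks like in practice, the sketch below uses iptables on a gateway to expose only a couple of services on its outside leg. The interface name and the allowed ports are assumptions for illustration (50070 being the Hadoop 1.x NameNode web UI); an SELinux policy would typically be layered on top of this.

```python
import subprocess

OUTSIDE_IF = "eth1"               # leg facing the existing data center network
ALLOWED_TCP_PORTS = [22, 50070]   # illustrative: SSH and the Hadoop 1.x NameNode web UI

def run(cmd):
    subprocess.check_call(cmd, shell=True)

def lock_down_outside_leg():
    # Always allow return traffic for connections the gateway itself initiated.
    run("iptables -A INPUT -i %s -m state --state ESTABLISHED,RELATED -j ACCEPT"
        % OUTSIDE_IF)

    # Permit only the explicitly published services on the outside leg.
    for port in ALLOWED_TCP_PORTS:
        run("iptables -A INPUT -i %s -p tcp --dport %d -j ACCEPT"
            % (OUTSIDE_IF, port))

    # Everything else arriving on the outside leg is dropped.
    run("iptables -A INPUT -i %s -j DROP" % OUTSIDE_IF)

if __name__ == "__main__":
    lock_down_outside_leg()
```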

Scalable

With our cluster fabric purpose-built and isolated, the cluster administrator is now free to design and scale the fabric specific to the needs of the cluster, independently of the main data center network.  A network design that works well for scaling the cluster fabric, such as Leaf/Spine, may not be the design that works well for the existing applications on the main data center network.  A purpose-built and isolated cluster fabric allows each administrator to make the best network design choice for their specific environment.
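To put some rough numbers behind that design freedom, here is a small sketch that summarizes the port budget and oversubscription ratio of a hypothetical two-tier Leaf/Spine cluster fabric. The port counts and speeds are placeholders for illustration, not a recommendation.

```python
def leaf_spine_summary(leaf_count, downlinks_per_leaf, uplinks_per_leaf,
                       downlink_gbps=10, uplink_gbps=40):
    """Summarize a simple two-tier Leaf/Spine fabric (illustrative numbers only)."""
    server_ports = leaf_count * downlinks_per_leaf
    downlink_bw = downlinks_per_leaf * downlink_gbps   # per-leaf server-facing bandwidth
    uplink_bw = uplinks_per_leaf * uplink_gbps         # per-leaf spine-facing bandwidth
    oversubscription = downlink_bw / float(uplink_bw)
    return server_ports, oversubscription

if __name__ == "__main__":
    # e.g. 16 leaves, each with 48 x 10GE server ports and 4 x 40GE uplinks
    ports, ratio = leaf_spine_summary(16, 48, 4)
    print("%d server ports at roughly %.0f:1 oversubscription" % (ports, ratio))
    # -> 768 server ports at roughly 3:1 oversubscription
```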

Cost effective

This is about placing the right features, at the right cost, at the right places in the network.  With our cluster fabric designed independently from the data center network, we can choose networking equipment with the feature set and performance that best meets the needs of our cluster fabric designed for East-West traffic (Leaf/Spine) — nothing more, nothing less.  On the other hand, our main data center network usually needs to accommodate a more complex feature set suited for the North-South traffic (such as VRF, MPLS, LISP, etc.) — so it makes sense to pay for those features where you need them, and not where you don’t.

Enabling maximum automation

With our cluster fabric and its machines encapsulated into a unique administrative boundary, including the x86 Linux gateway machines, we are ready to introduce automation tools specifically designed for our particular deployment of big data or cloud, with comprehensive and coordinated control over the servers and network settings of our cluster “appliance”.
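A hedged sketch of what that coordinated control could look like inside the administrative boundary: one authoritative record per cluster node drives both the server-side steps and the matching switch-port settings. Every name, role, and task here is a made-up placeholder rather than a reference to any particular tool.

```python
# One authoritative record per node, driving both server and switch provisioning.
CLUSTER_NODES = [
    {"name": "worker-01", "role": "datanode", "leaf": "leaf-1", "port": "Ethernet1"},
    {"name": "worker-02", "role": "datanode", "leaf": "leaf-1", "port": "Ethernet2"},
    {"name": "gateway-01", "role": "gateway", "leaf": "leaf-1", "port": "Ethernet48"},
]

def server_tasks(node):
    """Illustrative server-side provisioning steps for a node."""
    if node["role"] == "datanode":
        return ["install hadoop datanode packages", "start datanode"]
    if node["role"] == "gateway":
        return ["install gateway packages", "apply iptables policy"]
    return []

def switch_tasks(node, vlan=100):
    """Illustrative switch-side provisioning steps for the same node."""
    return ["%s: configure %s as access vlan %d for %s"
            % (node["leaf"], node["port"], vlan, node["name"])]

if __name__ == "__main__":
    for node in CLUSTER_NODES:
        for task in server_tasks(node) + switch_tasks(node):
            print(task)
```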

Maybe you’re worried about not having the necessary in-house skills in, say, Hadoop, VMware vCloud, or OpenStack?  Perhaps this is where you hand that administrative boundary (and its necessary skills and responsibilities) over to a 3rd party, perhaps the vendor of a comprehensive, turn-key big data or cloud “solution”.  On the other hand, you might have uber-skilled folks in-house capable of building homegrown cluster deployment tools, in which case the maximum innovation potential will be achieved with an authoritative domain that includes both the cluster servers and the cluster network, along with a well-defined, secure interface to the North-South data center network domain via the x86 gateway machines.

How have you engineered this interface between big data & private cloud fabrics, and the data center network?  Chime in with a comment.

Cheers,
Brad

 

About the Author

Brad Hedlund is an Engineering Architect in the CTO office of VMware’s Networking and Security Business Unit (NSBU), focused on network & security virtualization (NSX) and the software-defined data center.  Brad’s background in data center networking began in the mid-1990s and includes roles as an IT customer, a systems integrator, and in architecture and technical strategy at Cisco and Dell, as well as speaking at industry conferences.  CCIE Emeritus #5530.

Comments (10)


  1. Brad,

    Interesting idea. Isn’t it comparable to having a POD with an L3 boundary (e.g. a router or, preferably, a firewall)?
    I would build this with a firewall cluster in front of the POD and no direct connection to the outside world except through the FW (NAT or rules), so I can separate the networks just as well.
    I don’t get why I should use a Linux appliance. Having said that, I’m also thinking that the automation part for routers and firewalls is already available.
    Just an idea :-)

    Cheers,
    Matt

    • Brad Hedlund says:

      Matt,
      The x86 Linux based gateway machines may already be there as part of the private cloud or big data cluster architecture. If you want to throw another set of expensive firewalls in front of them, just because you can, go right ahead. :-)

      • Tim Rider says:

        But the network switches are also “already there” as part of the private cluster. Why not also designate a pair of “spine” switches as “gateway devices”? I fail to see much point or benefit in turning the x86 Linux boxes into packet forwarding devices.

        • Brad Hedlund says:

          Tim,
          A switch doesn’t have the security controls of a Linux gateway machine. Also, when a switch is acting as the “gateway” to another switched network, you need to engineer a network routing protocol configuration between those two environments, and as such, any time you make changes to the fabric network you need to be mindful of how that will affect the other network. Contrast that with an x86 machine acting as the “gateway”, where you simply connect that gateway to the next network like a standard server, and the fabric network behind it can be changed and scaled independently of any other network. The “x86 Linux boxes” acting as “packet forwarding devices” are already there in the cloud architecture; Hadoop client machines and instances of vShield Edge are the most basic examples. It’s not a matter of “turning” a Linux box into something that wasn’t there before.

        • Naveen says:

          Tim,

          SDN is all about using low-cost / open-source devices and software to build a dynamic cloud.

          Using Cisco/Juniper gear would defeat the purpose…

          You need to start visualizing a network without Cisco/Juniper… and with (usually) open-source technology…

          I see the disappointment on your face…

          • Tim Rider says:

            Naveen – there is no disappointment on my face whatsoever, just puzzlement. What do SDN and “Virtualizing the Network” have to do with the above discussion? Nothing whatsoever. We are talking here about the best way to ring-fence the private compute environment from the rest of the data center.

            You sound like someone who just likes throwing buzzwords around, without a real understanding of the problem at hand.

  2. Durai says:

    Hi Brad,

    Very interesting articles. Thanks for the very valuable information and insights on infrastructure, especially for Big Data environments. I have a fundamental question: should we choose non-blocking switches over regular switches for Hadoop clusters? What’s actually recommended? What factors should we base the decision on?
    Thanks in advance for your response.
    Cheers,
    Durai

    • Brad Hedlund says:

      Durai,
      It’s common to have non-blocking switches deployed at the rack interconnect layer (aka “Spine”). For 1GE Hadoop deployments, you can make the top of rack layer non-blocking too (if you want) for a reasonable price increase over the more common 2:1 oversubscribed design. For 10GE Hadoop, however, it’s almost a given that the top of rack layer will be oversubscribed (3:1 is common); it’s prohibitively expensive to build a fully non-blocking 10GE fabric that scales. And you have to consider that Hadoop was designed with the assumption of an oversubscribed top of rack layer: it will locate workloads on the machine that has the data, or on a machine in the same rack as the data, all to avoid the need for a machine to copy data over the network before it can begin working. That’s the typical SAN approach that Hadoop was designed to work around.
