Demo: End to end, hop by hop, physical and virtual network flow visibility with NSX

You’ve probably heard it before.  The myth goes something like this:  “With software-based overlays, troubleshooting in real time where a flow is going, with ECMP hashing on the fabric, is going to be a real problem.”  The implied message is that this is only possible with special hardware in a new proprietary fabric switch.

I’ve heard this one a number of times, usually while seated comfortably in a session presented by a vendor who’s invested in the failure of software-centric network virtualization such as VMware NSX.  As if this person has never heard of Netflow?  Or maybe they assume you won’t bother to do the research, connect the dots, and discover all that is in fact possible.

Well, guess what? I decided to do the research :-) And I put together a short demo showing just how simple it is to get this troubleshooting capability with generally available software, using any standard network switch, in any standard fabric design (routed Leaf/Spine, L2 with MLAG, etc.).

I presented this demo to the VMworld TV crew and embedded it here for your convenience:

How does it work?

It’s really simple, actually.  Here’s what I explain in the video:

The virtual switch encapsulates traffic into VXLAN and exports Netflow (IPFIX) data, for every flow, to a collector of your choice.

The virtual switch also exports a template to the collector that allows it to share a lot of additional VXLAN-related information for each flow, above and beyond the standard flow information.  This includes things such as the outer VTEP IP addresses, and the VXLAN UDP port numbers used to transmit each flow across the fabric.  Note: The UDP source port will be unique for each flow (the sketch after these steps illustrates why).

The physical switches also export Netflow, IPFIX, or sFlow data as they observe these VXLAN flows on the fabric.  Any decent switch worth its price tag is capable of doing this.

The flow collector is receiving detailed VXLAN flow data from the virtual and physical switches.

At this point you can go to your collector and pick any flow, in real time or historically, and see where it went on the virtual and physical switches, hop by hop.
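That unique UDP source port deserves a little more explanation, because it’s what ties a virtual flow to a physical path.  A VTEP typically derives the outer source port from a hash of the inner flow’s headers, giving every flow its own ECMP entropy while keeping all of its packets on one path.  Here’s a rough Python sketch of the idea; the actual hash an ESXi VTEP uses is implementation specific, so treat this purely as an illustration.

```python
import hashlib

def vxlan_source_port(src_ip, dst_ip, proto, sport, dport):
    """Illustration only: derive the outer VXLAN UDP source port from a
    hash of the inner 5-tuple. The real VTEP hash differs, but the effect
    is the same: each inner flow gets a stable, (nearly) unique outer
    source port, so the fabric's ECMP hash pins it to one path."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    digest = int(hashlib.sha1(key).hexdigest(), 16)
    return 49152 + (digest % 16384)   # keep it in the ephemeral range

# Two flows between the same pair of VMs still get different outer ports,
# which is what lets a collector tell them apart on the physical fabric.
print(vxlan_source_port("10.0.1.10", "10.0.2.20", "tcp", 33412, 443))
print(vxlan_source_port("10.0.1.10", "10.0.2.20", "tcp", 33413, 443))
```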

To make it easy to search this data quickly, I decided to use a collector that can aggregate all of that Netflow data and convert it into Syslog messages.  This capability is provided by Netflow Integrator, from Netflow Logic.
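To make the conversion step concrete, here’s roughly what “a flow record as a Syslog message” means.  The field names and message layout below are my own invention for illustration, not the actual Netflow Integrator output format.

```python
from datetime import datetime, timezone

def flow_to_syslog(flow):
    """Illustration only: flatten a decoded flow record (a dict) into a
    key=value syslog-style line. Field names are hypothetical; the real
    Netflow Integrator schema will differ. In practice the line would be
    shipped to the syslog engine over UDP/TCP 514 rather than printed."""
    ts = datetime.now(timezone.utc).strftime("%b %d %H:%M:%S")
    body = " ".join(f"{k}={v}" for k, v in flow.items())
    return f"<134>{ts} nfi-demo flow: {body}"   # facility local0, severity info

print(flow_to_syslog({
    "exporter": "leaf1", "in_iface": "Eth1/3",
    "src": "10.0.1.10", "dst": "10.0.2.20",
    "vtep_src": "192.168.50.11", "vtep_dst": "192.168.50.12",
    "vxlan_udp_sport": 52133, "bytes": 18734, "packets": 21,
}))
```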

With all of my virtual and physical switch flow data now in Syslog, I can easily search and analyze it from Splunk, or VMware Log Insight, or something else.

For example, I can type in queries that narrow in on the flows between any two IP addresses, and pick my time range.

I can see the end-to-end byte and packet count for each flow, bidirectionally, and quickly tell if any packets were lost in the fabric by looking for identical byte and packet counts on each end (hypervisor to hypervisor).

If I want to see where a flow went on the physical network, I can simply query the VXLAN source UDP port used for that flow, and I’ll see every switch and interface that observed that flow.
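In the demo these are Splunk/Log Insight searches, but the logic behind them is simple enough to sketch in a few lines of Python over the same key=value records (again assuming the hypothetical message format sketched earlier): filter on the two endpoint IPs, compare the counters reported at each end, and group by exporter when chasing one outer UDP source port.

```python
import re

def parse(line):
    """Pull key=value pairs out of a syslog line (hypothetical format above)."""
    return dict(kv.split("=", 1) for kv in re.findall(r"\S+=\S+", line))

def flows_between(lines, ip_a, ip_b):
    """All flow records between two IPs, in either direction."""
    recs = (parse(l) for l in lines)
    return [r for r in recs if {r.get("src"), r.get("dst")} == {ip_a, ip_b}]

def end_to_end_ok(hypervisor_records):
    """Loss check: the byte/packet counters exported by the source and
    destination hypervisors for the same flow should match."""
    counts = {(r["bytes"], r["packets"]) for r in hypervisor_records}
    return len(counts) == 1

def trace_path(lines, vxlan_sport):
    """Every switch and interface that observed a given outer UDP source
    port, i.e. the hop-by-hop path one flow took across the fabric."""
    recs = (parse(l) for l in lines)
    return sorted({(r["exporter"], r.get("in_iface", "?"))
                   for r in recs if r.get("vxlan_udp_sport") == str(vxlan_sport)})
```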

All of the necessary data is there for analysis by humans, or a machine.  Today, I’m typing in queries at a Syslog engine.  Tomorrow, it might be a network analysis tool looking at the same data, drawing a nice picture for me, looking for any anomalies, and perhaps making correlations to other events found in the same Syslog data from other IT equipment.

If you take a step back and think about it, Syslog is the perfect means to converge all IT event and troubleshooting data.  Dare I say, Big Data.  Every flow on your network should be considered an event.  Why not?  And if properly stored in a common data repository, you have the opportunity to give analytic tools a broad view of what’s happening in your data center, what’s likely to happen, and how to plan for it.  Capacity planning.  Cross functional troubleshooting.  Security forensics.  Just to name a few.  There’s a lot more to troubleshooting application performance than simply counting packets on the network.


An introduction to Zero Trust virtualization-centric security

This post is the first in a series that examines what I think are some of the powerful security capabilities of the VMware NSX platform and their implications for the data center network architecture.  In this post we’ll look at the concepts of Zero Trust (as opposed to Trust Zones) and virtualization-centric grouping (as opposed to network-centric grouping).

Note: Zero Trust as a guiding principle for enterprise-wide security is inspired by Forrester’s “Zero Trust Network Architecture”.

What are we trying to accomplish?

We want to be able to secure all traffic in the data center without compromising performance (user experience) or introducing unmanageable complexity.  Most notably, with the proliferation of East-West traffic, we want to secure traffic between any two VMs, or between any VM and physical host, with the best possible security controls and visibility — per flow, per packet, stateful inspection with policy actions, and detailed logging — in a way that’s both economical to obtain and practical to deploy.

Trust Zones of Insecurity

Until now, it hasn’t been possible (much less economically feasible or even practical) to directly connect every virtual machine to its own port on a firewall.  Because of this, the firewall has always been a “thing” (a physical piece of iron, or a virtual machine) that we need to bolt on top of the network.  First, you need a network to connect, aggregate, and group machines.  After that you can connect the firewall to a port on that network-centric grouping (a virtual switch Port Group and/or VLAN).  Meanwhile, the network construct establishing the group provides unfettered connectivity within the group.  In other words, the firewall has no visibility or security control over the East-West traffic between machines in a given group.  The result is a “Trust Zone”.  We “trust” (read: hope), but can’t verify, that one machine in the zone will not laterally infect or attack the other zone members.

Unsecured Trust Zones

Network-centric grouping in a virtual environment

Groups form the basis of a security policy.  Machines that are similar from a policy standpoint are placed into a group, at which point a policy governs how traffic into, out of, and within that group is handled.  How these groups are defined and where they exist can make a big difference in a virtualized data center.  For example, when groups are defined by a networking construct, and then pushed into a virtual environment (vSphere), the security policy attached to a virtual machine is determined by its connection to a specific network-centric grouping object, with the most minimal granularity being a Port Group.  Taking a network-centric approach in a virtual environment presents a number of challenges.

First, this approach can quickly create a large quantity of networking objects to deal with — a morass of Port Groups cluttering the virtual network inventory.  For example, let’s say you have 100 applications, each with three distinct tiers of policy groups (Web, App, DB); this would result in 300 Port Groups to choose from in your distributed virtual switch.

Second, the virtual administrator needs to correctly choose, and manually attach, the specific Port Group for each virtual machine network interface when it’s deployed.  With an inventory of hundreds or thousands of virtual machines and Port Groups to choose from, applying the wrong security policy through human error is something to contend with.  The manual aspects can be mitigated, however, if there is good integration with upstream automation software, namely vCloud Automation Center (vCAC); the clutter of Port Groups remains.

Third, a Port Group is an object that’s specific to one distributed virtual switch (DVS). If the security policy for a virtual machine depends on its connection to a specific Port Group, the mobility domain for that virtual machine is limited to one DVS. Migrating outside of the DVS would involve a cold stop/start operation, and manually attaching the virtual machine to a different and specific Port Group in a new DVS.

Fourth, there are no security controls for East-West traffic within the Port Group that establishes a group.  It’s just another “Trust Zone”.  Only traffic between groups can be secured, which might lead to an effort to obtain more granularity by creating more and more Port Groups.

Zero Trust transparent security

In the Zero Trust model, we take the usual firewall-bolted-on-top approach and turn it upside down.  Every virtual machine is first connected to a transparent, in-kernel, stateful firewall filtering engine (with logging) before it’s even connected to the network.  This means that any traffic to or from a virtual machine can be secured, regardless of the network construct it’s attached to.  Because the firewall is below the network, directly adjacent to the things we want to protect, there is never an unfettered “Trust Zone”.  Security is omnipresent — per flow, per packet, stateful inspection with policy actions and detailed logging, per virtual machine, per virtual NIC.  The network constructs still exist, of course, but only to provide connectivity (not security).  The Zero Trust model is also referred to as Micro Segmentation.

Zero Trust transparent security
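The mechanics are easier to picture with a toy model.  The sketch below is emphatically not the NSX data plane; it’s just an illustration of the idea that each vNIC gets its own stateful filter, with its own connection state and its own log, so even two VMs sitting in the same Port Group never exchange an uninspected packet.

```python
from collections import namedtuple

Flow = namedtuple("Flow", "src dst dport proto")

class VnicFirewall:
    """Toy model of a per-vNIC stateful filter (not the NSX data plane)."""
    def __init__(self, vm_name, rules):
        self.vm_name = vm_name
        self.rules = rules            # ordered list of (match_fn, action)
        self.established = set()      # flows already allowed

    def filter(self, flow):
        if flow in self.established:              # stateful fast path
            return "allow"
        for match, action in self.rules:
            if match(flow):
                if action == "allow":
                    self.established.add(flow)
                print(f"{self.vm_name}: {action} {flow}")   # detailed logging
                return action
        print(f"{self.vm_name}: drop {flow} (default deny)")
        return "drop"

# Even inside the "Web" zone, lateral SSH between web servers is dropped
# at the very first hop, before the packet ever reaches the virtual switch.
web01 = VnicFirewall("PROD-web-01", [
    (lambda f: f.dport == 443, "allow"),
    (lambda f: f.dport == 22,  "drop"),
])
web01.filter(Flow("10.0.1.20", "10.0.1.10", 443, "tcp"))   # allow
web01.filter(Flow("10.0.1.20", "10.0.1.10", 22, "tcp"))    # drop
```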

Virtualization-centric grouping in a virtual environment

A security policy works with the basic concept of a group: similar objects are placed into groups, and policy is applied based on group names.  In the network-centric model these groups were represented by Port Groups in a distributed virtual switch.  In contrast, another approach is to employ a virtualization-centric grouping model, as implemented by VMware NSX, where the groups that form the basis of your security policy are decoupled from the network and are simply abstract objects called “Security Groups” that exist in the virtualization layer.  There are a number of advantages to this approach in a virtual environment (e.g. vSphere).

First, the virtual network inventory remains simple and uncluttered.  Creating a Security Group does not require creating a corresponding Port Group, so the virtual network inventory remains constant as the environment grows.  For example, this time your 100 applications, each with distinct tiers of policy groups (Web, App, DB), can be deployed with only one Port Group and VLAN providing the network connectivity.

Second, the virtual environment can dynamically attach virtual machines to the appropriate Security Group based on virtualization-relevant context, tags, and business logic.  As a simple example, in the diagram above, any VMs with the name “PROD-web” are placed in the “Web” Security Group automatically.  Another scenario might be: if VMs are deployed by members of the “Engineering” Active Directory group, tag them as “Engineering”, and based on that tag dynamically add them to the “Dev/Test” Security Group and isolate them from “Prod”.  It doesn’t matter which Port Group the VMs are attached to.  An incorrect Port Group assignment might only break network connectivity, not security policy.  (There’s a short sketch of this kind of membership rule after these points.)

Third, mobility is not artificially limited to a network-centric object such as a single distributed virtual switch.  Security Groups are not coupled to a distributed virtual switch (DVS), or any network construct for that matter.  It doesn’t matter which Port Group your virtual machine connects to, and by consequence it also doesn’t matter which DVS your virtual machines are connected to.  This means you can live migrate virtual machines from one DVS to another; and someday soon, between vCenter instances — all while maintaining consistent security policy.

And finally, as previously discussed, there are no insecure Trust Zones with virtualization-centric grouping.  Even traffic within a Security Group can be subject to policy controls and stateful inspection with detailed logging.  The highest degree of granularity is provided from the outset (per virtual machine, per virtual NIC).
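Here’s a minimal sketch of what virtualization-centric membership looks like in principle.  The VM attributes and group rules are made up for illustration (this isn’t the NSX Security Group API); the point is only that membership is computed from what the VM is, its name and tags, never from which Port Group it happens to be plugged into.

```python
# Hypothetical inventory: note the two web VMs sit on different Port Groups.
vms = [
    {"name": "PROD-web-01",  "tags": [],              "port_group": "PG-A"},
    {"name": "PROD-web-02",  "tags": [],              "port_group": "PG-B"},
    {"name": "dev-tools-01", "tags": ["Engineering"], "port_group": "PG-A"},
]

# Membership rules expressed against VM attributes, not network objects.
security_groups = {
    "Web":      lambda vm: vm["name"].startswith("PROD-web"),
    "Dev/Test": lambda vm: "Engineering" in vm["tags"],
}

membership = {group: [vm["name"] for vm in vms if rule(vm)]
              for group, rule in security_groups.items()}
print(membership)
# {'Web': ['PROD-web-01', 'PROD-web-02'], 'Dev/Test': ['dev-tools-01']}
# Both web VMs land in the same Security Group despite different Port Groups;
# connectivity and security policy are fully decoupled.
```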

Architecture implications

A transparent firewall underneath the network, as opposed to bolted on top, has implications for the data center network architecture.  The result, I contend, will be virtual and physical topology simplification.

When the firewall is bolted on top, the network substrate needs to be designed in such a way that it correctly implements the security policy — selectively steering traffic from a virtual machine to some physical or virtual firewall several hops away.  The more granularity you attempt, the more complex the design becomes, with a quagmire of network-centric traffic steering and isolation tools like Port Groups, VLANs, ACLs, and VRFs.  Meanwhile, more and more East-West traffic needs to be detoured several hops to a firewall, impacting performance (user experience).  And in the end, you’re still left with unsecured Trust Zones, because you can never realistically obtain per-VM granularity.

With virtualization-centric VMware NSX, on the other hand, policy is applied underneath the network, in the virtualization layer.  Throw away that East-West traffic-detouring bag of tricks.  Security is applied, transparently, before the packets even arrive at the first virtual network port.  Latency-sensitive East-West traffic is free to travel directly to its destination, taking the lowest-latency path, having already been secured from the outset.

The network architecture is simply designed for connectivity: that might be a handful of VLAN-backed Port Groups in an L2 fabric you’re already using today, or a migration toward full network virtualization with VXLAN-backed Logical Switches, Logical Routers, and simple L3 fabrics.  You can start with the former and gradually move to the latter.

Some points of differentiation

When evaluating options and comparing the security capabilities of VMware NSX for vSphere to other solutions, here are some points of differentiation to keep in mind.

Headless operation — The VMware NSX for vSphere distributed firewall does not rely on some other virtual machine for the data plane to function. Rules are centrally programmed by the NSX Manager and each host is able to inspect and enforce security policy for every flow and packet on its own, without the Manager (including headless vMotion).

Mobility — Your virtual machines are not constrained to a single distributed virtual switch. Security policy is consistent irrespective of the DVS or Port Group providing the connectivity, and virtual machine live migration is not artificially constrained to a single DVS.

Zero Trust — Even traffic within the most minimal grouping construct is secured.  East-West traffic within a Security Group can be subject to policy, stateful inspection, and logging.  There are no insecure Trust Zones.

Automation — The virtual environment can automatically attach virtual machines to the appropriate Security Group and subsequent policy based on virtualization relevant context.  The virtual administrator doesn’t need to correctly choose and manually assign virtual machines to a specific Port Group.  And when a host is added to a cluster, all of the required software is automatically installed.

Dynamic security — Just as the virtual environment can automatically assign a virtual machine to a Security Group based on context, it can also change the Security Group (and policy) dynamically, based on changing context or context provided by a third party, such as a malware or vulnerability assessment solution (Rapid7, McAfee, Symantec, Trend Micro).

Distributed platform for NGFW — One of the policy actions you can apply to a Security Group is selectively redirecting traffic to a local user-space service virtual machine on each host.  For example, third-party firewall providers can leverage this platform to add NGFW inspection to the environment in a distributed manner.  Palo Alto Networks has already leveraged this capability with its VM-Series NGFW that integrates with VMware NSX for vSphere.

Quick Video Demonstration

Finally, here’s a quick video demonstrating the scenario depicted in the diagrams above.  I will show how a Security Group is created, how virtual machines are automatically assigned to a group, how East-West traffic within this group can be filtered by the NSX stateful firewall, and how the logs can be viewed and analyzed.




Three reasons why Networking is a pain in the IaaS, and how to fix it

In this post I share the slides, audio recording, and short outline of a presentation I gave at the Melbourne VMUG conference (Feb 2014) called “Three reasons why Networking is a pain in the IaaS, and how to fix it”.

As network technologists we know that when the compute architecture changes, the network architecture changes with it.  Consider the precedent.  The transition from mainframe to rack servers brought about Ethernet and top-of-rack switches.  Blade servers introduced the blade switch and a cable-less network.  And of course the virtual server necessitated the software virtual switch and a hardware-less network.  At each iteration, we observe the architecture change occurring at the edge, directly adjacent to compute.

We can look at this superficially and say, “yes, the network architecture changed”.  However, if you think about it, the catalyzing change in each shift was the operational model, with the intent of increasing agility and reducing costs.  The architecture change followed as a consequence.

Without compute, there is no reason for a network.  Networking, both as a profession and technology, exists as a necessary service layer for computing.  Without a network, computing is practically useless.  As such, the capabilities of the network will either enable or impede computing. Viewed in that light, when an organization decides to change the operational model of computing (virtualization, IaaS), the operational model of the network must evolve with it.  If not, the “Network” becomes the impediment to the organization, not an enabler. (Hint: you don’t want to be on the receiving end of that).

  • Static compute > Static network
  • Virtual compute > Virtual network
  • Infrastructure as a Service > Networking as a Service

 Audio Recording (MP3) 44 min


Click here to download the MP3


Three reasons: Outline

1) Impedance Mismatch

Deploying legacy, non-virtual networking with virtual computing creates an operational impedance mismatch.  Virtual computing provides instant provisioning, mobility, and template-based deployments.  Despite these advances, the virtual compute is still coupled to network services that are slow to provision, anchored to specific physical equipment, and manually deployed at the risk of configuration drift and human error.  The full potential of virtualization and IaaS cannot be realized.  Simply creating virtual machine equivalents of firewalls and load balancers doesn’t change the operational model of network services; it only changes the form factor.

The solution is to bring the same operational model of virtual computing to the network — network virtualization.  Networking services should be instantly provisioned from a capacity pool, decoupled from specific hardware, made equally mobile, and deployed by machines using templates.

2) Lost in Translation (Scripting)

Attempting network “automation” or “orchestration” by scripting against individual device interfaces is untenable.  Some third-party scripting tool has the difficult job of providing an upstream interface that both accepts desired network state and displays the real-time network state.  This requires translation and coordination across many different autonomous devices and interfaces (languages).

The solution is to deploy a virtual networking platform (like a virtual chassis switch) where many different devices connect to the platform, like virtual line cards, using the platform API.  The virtual networking platform can then expose a single API endpoint to an upstream automation tool (e.g. OpenStack or VMware vCloud Automation Center).  All of the complexities around deploying desired network state and gathering the real-time state are removed from the automation tool and assumed by the virtual networking platform.  The individual device interfaces (languages) still remain for operational tasks (code upgrades), but are out of the way in terms of service provisioning.
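To make the contrast concrete, here’s a rough sketch of the two operational models, with entirely hypothetical device names and API endpoint; neither is a real NSX or vendor interface.  In the first model the automation tool owns a translation layer per device dialect; in the second it hands one declarative request to the platform, which fans it out to the devices registered against it.

```python
import json

# All names and the URL below are hypothetical, used only to show the shape
# of the two models; the stubs just print what a real integration would push.
def push_device_config(device, state):
    print(f"translate {state!r} into {device['dialect']} syntax for {device['name']}")

def http_post(url, payload):
    print(f"POST {url} body={payload}")

def per_device_scripting(devices, desired_state):
    # Model 1: the scripting tool translates and tracks state per device,
    # one dialect (CLI, NETCONF, REST flavor) at a time.
    for dev in devices:
        push_device_config(dev, desired_state)

def via_platform_api(desired_state):
    # Model 2: one declarative call to the platform; the platform fans the
    # change out to its "virtual line cards" and reports real-time state back.
    http_post("https://netplatform.example/api/services", json.dumps(desired_state))

desired = {"tenant": "blue", "service": "load-balancer", "vip": "10.0.9.5"}
per_device_scripting([{"name": "lb1", "dialect": "vendor-a-cli"},
                      {"name": "fw1", "dialect": "vendor-b-netconf"}], desired)
via_platform_api(desired)
```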

Examples: VMware NSX + F5 (tech preview video), and VMware NSX + Palo Alto Networks (PDF)

3) Choke points

In many cases firewalls are required to handle east-west traffic between compute instances, or between different trust zones.  If the firewall is a “box”, be it a physical piece of iron or even a virtual machine, it’s a single “device” somewhere in the network through which traffic must be forced so that it can be inspected against a policy.  This is a choke point catching packets.  Performance of east-west traffic suffers, and the choke point (several layers removed from the source of traffic) has no real meaningful visibility into where the traffic came from, who sent it, or where it’s going.  The choke point is merely inspecting IP packet headers against an access list.  This means the IP addresses of the workloads are critical to the applied security policy.  This is not what we want in a highly agile Infrastructure as a Service.  Security policy should be attached to the applications and workloads, not the IP addresses.  And there should be no choke points that impede performance.

The solution is to centrally define the security policy and physically distribute it across the virtual switching layer in the hypervisor kernel.  Every virtual port attached to a virtual machine is not just the access port; it’s the stateful firewall too.  The security policy is applied to the virtual machine, not the IP address, and enforced at the very first hop — no more choke points.  And your policy can trigger on a large set of semantics such as user identity, operating system, security posture, or any arbitrary and hierarchical grouping of virtual machines (applications).

Example: VMware NSX Distributed Firewall

The rest of the presentation covers some example multi-tenant topologies you can deploy in your IaaS with NSX, and how to introduce NSX into your existing environment and make a gradual migration.  Listen to the full audio, and stay tuned for more blogs on these topics and more. :-)