<p><strong>Brad Hedlund</strong>: blog on enterprise networking</p>
<h1><a href="http://bradhedlund.com/2021/07/12/demo-overlapping-ip-firewalls-segmentation">Demo: Cloud Networking with Overlapping CIDR, L7 Firewalls, Segmentation, and Flow Visibility</a></h1>
<p><em>2021-07-12</em></p>
<p>I created a live <a href="https://youtu.be/XkOIYfI7EL0?start=1020"><strong>demo</strong></a> showing some cool capabilities of the <a href="https://aviatrix.com/cloud-network-platform/"><strong>Aviatrix Cloud Networking Platform</strong></a>. In this demo I play the role of a SaaS provider that onboards new customers via VPN and needs to meet the following requirements:</p>
<ul>
<li>
<p>Easily onboard new customers even if their IP addressing overlaps with the SaaS provider.</p>
</li>
<li>
<p>Provide secure segmentation and isolation between customers.</p>
</li>
<li>
<p>Easily insert next-generation firewalls between the customers and the SaaS environment for deep packet inspection and threat analysis.</p>
</li>
<li>
<p>Have complete flow-level visibility of customer network traffic, and operational tools to diagnose and troubleshoot problems.</p>
</li>
<li>
<p>Provide end-to-end encryption to secure sensitive data in flight.</p>
</li>
<li>
<p>And be able to meet all of these requirements using any cloud provider.</p>
</li>
</ul>
<p>In the demo I show how easy it is to meet requirements like this using Aviatrix. And best of all, no matter which cloud provider(s) you’re using, the solution and architecture are exactly the same. This SaaS provider can use the services and global footprint of any or all cloud providers, and do it with a consistent, repeatable architecture.</p>
<div class="video">
<figure>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/XkOIYfI7EL0?start=1020" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</figure>
</div>
<p>You can leave comments on this post <a href="https://www.linkedin.com/pulse/demo-cloud-networking-overlapping-cidrs-l7-firewalls-flow-hedlund/">where I posted it on LinkedIn</a>.</p>
<p>Is there a particular scenario you want to see in a demonstration? Connect and send me a message on <a href="https://www.linkedin.com/in/bradhedlund/">LinkedIn</a>.</p>
<p>Cheers,<br />
Brad</p>
<h1><a href="http://bradhedlund.com/notes/aviatrix">Notes on Aviatrix</a></h1>
<p><em>2021-04-09</em></p>
<p>Miscellaneous notes on Aviatrix.<br />
Usually updated on Fridays.<br />
<em>New and updated notes are placed at the top.</em></p>
<hr />
<p><strong>Updating the Aviatrix Controller IAM Policy:</strong><br />
When deploying the Aviatrix controller in AWS for the first time, the AWS CloudFormation template that launched your controller may not have the most current IAM policy definitions for the IAM roles it creates for the controller to use. To remedy this, right after your controller is launched and you’ve logged on for the first time, do the following:<br /></p>
<ol>
<li>Define your Primary access account. Go to Onboarding > AWS > Create Primary Access Account. This is the AWS account that your controller lives in.</li>
<li>Now go to Accounts > Access Accounts. Highlight the Primary access account you just created and click “Update Policy”. This will update the IAM policy applied to the IAM roles your controller will be using to the latest and greatest.</li>
</ol>
<hr />
<p><strong>How to use an AWS ACM Certificate with your Aviatrix controller:</strong><br />
To apply an ACM public certificate to your UI sessions with the Aviatrix controller you’ll need to use a Load Balancer and attach your certificate to it. Here’s what I did:<br /></p>
<ol>
<li>Create a Network Load Balancer (NLB)</li>
<li>Create a TLS:443 listener on your NLB and attach your ACM certificate.</li>
<li>Create a target group and add your Aviatrix controller EC2 instance as an instance target.</li>
<li>Associate your target group with the listener you just created.</li>
<li>Create a DNS entry for your Aviatrix controller (one that comports with your ACM certificate) and point it to your NLB with an A-alias record in Route 53.</li>
</ol>
<p>You should now be able to log on to your Aviatrix controller UI without seeing any security warnings from your browser.</p>
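For those automating this setup, the same five steps might be sketched in Terraform roughly as follows. This is a hypothetical sketch, not a verified configuration: the resource names, variables, and the referenced VPC/subnet/instance resources are all assumptions you would adapt to your environment.

```hcl
# Hypothetical sketch of the NLB + ACM + Route 53 setup described above.

resource "aws_lb" "controller" {
  name               = "aviatrix-controller-nlb"
  load_balancer_type = "network"
  subnets            = [aws_subnet.public.id]   # assumed subnet resource
}

resource "aws_lb_target_group" "controller" {
  name        = "aviatrix-controller-tg"
  port        = 443
  protocol    = "TCP"
  target_type = "instance"
  vpc_id      = aws_vpc.main.id                 # assumed VPC resource
}

resource "aws_lb_target_group_attachment" "controller" {
  target_group_arn = aws_lb_target_group.controller.arn
  target_id        = aws_instance.controller.id # the controller EC2 instance
  port             = 443
}

# TLS listener terminates with the ACM certificate, then forwards to the controller.
resource "aws_lb_listener" "tls" {
  load_balancer_arn = aws_lb.controller.arn
  port              = 443
  protocol          = "TLS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = var.acm_certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.controller.arn
  }
}

# DNS name that comports with the ACM certificate, aliased to the NLB.
resource "aws_route53_record" "controller" {
  zone_id = var.zone_id
  name    = "controller.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.controller.dns_name
    zone_id                = aws_lb.controller.zone_id
    evaluate_target_health = true
  }
}
```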
<hr />
<p><strong>Aviatrix controller IAM permission errors despite correct IAM policy:</strong><br />
If the Aviatrix controller is unable to perform a task due to IAM AccessDenied errors for an action that it does in fact have IAM permission to perform, there’s a good chance that your AWS Organization has a service control policy (SCP) installed that is overriding your IAM policy and denying the action. Check the SCPs in your AWS Organization setup for any conflicting policies.</p>
<hr />
<p><strong>Terraform and Aviatrix</strong></p>
<ul>
<li>Terraform documentation page for the <a href="https://registry.terraform.io/providers/AviatrixSystems/aviatrix/latest/docs">Aviatrix provider</a></li>
<li>Aviatrix documentation on using Terraform <a href="https://docs.aviatrix.com/HowTos/tf_aviatrix_howto.html#">is located here</a></li>
</ul>
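For orientation, a minimal Terraform configuration for the Aviatrix provider looks roughly like this. A sketch only: the variable names are assumptions, and the provider documentation linked above is the authority on the current argument schema.

```hcl
terraform {
  required_providers {
    aviatrix = {
      source = "AviatrixSystems/aviatrix"
    }
  }
}

# Controller credentials; source these from variables or a secrets store,
# never hard-code them in version control.
provider "aviatrix" {
  controller_ip = var.controller_ip
  username      = "admin"
  password      = var.controller_password
}
```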
<hr />
<h1><a href="http://bradhedlund.com/2021/04/02/enterprise-cloud-networking">It’s time for Enterprise Cloud Networking</a></h1>
<p><em>2021-04-02</em></p>
<p>It’s time to get things cranking here again and a big topic is going to be <strong>enterprise cloud networking</strong>. What I mean by that in simple terms is how an enterprise can use the networking services of cloud providers to build, migrate, and run their most important applications in any cloud.</p>
<p>Over the last 6 years a lot has happened in the shift to public cloud. I don’t need to explain that to you. We already know that building and migrating applications in/to the cloud is what the world is doing – and for reasons that no longer need explaining.</p>
<p>What’s more interesting now is that the term <strong>“the cloud”</strong> used to mean one thing: Amazon Web Services. Six years ago, when you said to somebody, “Yeah, so, we are going to migrate this application to the cloud.” – nobody asked what cloud you were talking about and why.</p>
<p>And in the very same stride <strong>“cloud networking”</strong> implied AWS Networking. If you told somebody that you were a cloud network architect, nobody questioned that either. It meant that you knew AWS VPC, Direct Connect, Route 53, NAT Gateways, Security Groups, VPC subnets and route tables, the various AWS instance sizes and their network performance, and all that goodness. And if anybody questioned your credentials you would flash them your shiny AWS Advanced Networking Specialty certification.</p>
<p>This is how that conversation goes today:</p>
<p><strong>You:</strong> <em>“Yeah, so, we’re migrating this application to the cloud and I need to setup the network for that.”</em></p>
<p><strong>Them:</strong> <em>“Cool. Which cloud are we talking about here? AWS, Azure, GCP?”</em></p>
<p><strong>You:</strong> <em>“Oracle, actually.”</em></p>
<p><br /></p>
<p><img src="https://storage.googleapis.com/bradhedlund/blog/enterprise-cloud-network/3-19x.gif" alt="cloud network platform" /></p>
<p><br /></p>
<p>So as a network expert you need to be ready to take your company to any cloud. You’ll need to know the various building blocks that each cloud provider has and how to build an architecture with that toolkit. If it’s an enterprise cloud network it will have granular security and segmentation controls, it will selectively insert L4-L7 services, and it will provide traffic visibility and troubleshooting tools for you to continually tune and optimize with.</p>
<p>As with anything in this industry there are going to be multiple ways to do this. My preference is to use a platform-based approach. Let me express to the platform what I want, and let the platform go build that desired state for me on top of any underlying infrastructure, any cloud. It’s a proven approach that has worked very well in the past. Need I point out the success of <strong>Cisco UCS</strong> or <strong>VMware NSX</strong>?</p>
<p>For the enterprise cloud network the platform I believe in is <a href="https://aviatrix.com/cloud-network-platform/"><strong>Aviatrix</strong></a>.</p>
<p>I’ve joined them as a Principal Solution Architect and will write about this space with an unapologetic bias. I will examine other ways to build the enterprise cloud network (when there’s something to write about there), but through the lens of how I feel it compares to the platform-based approach and Aviatrix.</p>
<p><br />
<em>You can leave comments and feedback on this post <a href="https://www.linkedin.com/feed/update/urn:li:activity:6783897765865930752/">here on LinkedIn</a></em>
<br />
<br /></p>
<p><em>Disclaimer: the views and opinions expressed are the author’s alone and do not necessarily reflect or represent the views of any company or entity that the author may be affiliated with.</em></p>
<h1><a href="http://bradhedlund.com/2015/02/06/going-over-the-edge-with-your-vmware-nsx-and-cisco-nexus">Going Over the Edge with your VMware NSX and Cisco Nexus</a></h1>
<p><em>2015-02-06</em></p>
<p>What could possibly be more fun than connecting your awesome new NSX gear to your Cisco Nexus gear? For the life of me I really don’t know. All right then. Let’s do it!</p>
<p>Let’s kick things off with this email question I received from a reader.</p>
<blockquote>
<p>Hi Brad,
In our environment we have two prevailing server standards, rackmounts and UCS. I read your excellent NSX on UCS and 7K design guide and the section on not running routing protocols over the VPC links makes sense. My related question concerns how we can achieve a routing adjacency from the NSX Distributed Router to the N7K with a rack mount with 2x10gbe interfaces connecting to 2x7Ks via VPC? (we don’t use the NSX Edge Router).</p>
</blockquote>
<p>This reader has politely pointed out that my <a href="http://bradhedlund.com/2014/02/24/new-design-guide-vmware-nsx-with-cisco-ucs-and-nexus-7000/">VMware NSX on Cisco UCS and Nexus 7000 design guide</a> could have provided a bit more detail on NSX Edge design. I totally agree. There’s no time like the present, so let’s dive into that <em>now</em> and stir up some content that might end up in the next version of the guide.</p>
<p>All right. We won’t worry too much about the form factor of the servers right now. Whether it’s a blade or a rack mount doesn’t matter; let’s just generalize that we have <em>servers</em>. And to make things extra difficult, these servers will only have <strong>2 x 10GE interfaces</strong> – no more, no less. Those interfaces will [ultimately] connect either to a vPC-enabled VLAN or to a normal non-vPC VLAN. Working with this baseline of 2 x 10GE also helps to keep everything easily applicable to either blades or rack mounts.</p>
<p>I’m going to present the <strong>logical topology</strong> of three different designs. How these translate into a physical topology is something I’ll leave for the time being to your own expertise and imagination.</p>
<p>Any design discussion can have a number of variables and permutations, especially here, and especially in the bottom-half section depicting On Demand virtual networks. “What about inserting service X for this or that tenant?” etc. If I attempted to discuss all such nuances in completeness this post would get way off topic. Let’s keep it simple for now and focus on the edge topology. At a later time we’ll come back to the various flavors of On Demand virtual networks you can lay down underneath the Pre-Created edge topology of your choice.</p>
<h2 id="lost-your-edge">Lost Your Edge</h2>
<p>We’ll start with the scenario posed in the opening question; “we don’t use the NSX Edge Router”. Thus, the only NSX router is the distributed logical router (running in kernel on your ESX compute hosts) which is directly adjacent to your virtual machines (naturally); and it’s also directly adjacent to your Nexus 7000s on a vPC enabled <strong>VLAN</strong>. The latter constitutes the “Uplink” of the distributed router and allows for the possibility of running a dynamic routing protocol with an upstream router.</p>
<p>The motivation for the Lost Your Edge design might be simplicity, wherein you don’t want – or feel that you don’t need – an additional layer of NSX Edge virtual machines to worry about and manage.</p>
<p><img src="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/lost-your-edge.png" alt="Lost Your Edge" title="Lost Your Edge" /></p>
<p>Notice that we’ve laid this topology down on all vPC-enabled VLANs. Remember the NSX distributed router is running in-kernel on your ESX compute hosts, and I presume that you <em>do</em> want your ESX compute hosts attached via vPC. As a result our NSX distributed router is also vPC attached. By consequence, this prevents us from running a dynamic routing protocol between the NSX distributed router and the Nexus 7000s. The reason for this I have <a href="http://bradhedlund.com/2010/12/16/routing-over-nexus-7000-vpc-peer-link-yes-and-no/">explained here</a>.</p>
<p>We can most definitely do the Lost Your Edge design with <strong>static routing</strong>. Your NSX distributed router would have a simple default route pointing to the Nexus 7000s’ <strong>HSRP</strong> address on the “Edge VLAN”. Meanwhile, the Nexus 7000s will have a static aggregate route (e.g. 10.1.0.0/16) pointing to the NSX distributed router’s <strong>forwarding address</strong>. Later on, the individual subnets you create (On Demand) behind the NSX distributed router will of course fall into that aggregate route. The only thing left to do now is redistribute this static route into your enterprise backbone with BGP or OSPF.</p>
<p>One thing to be aware of in the Lost Your Edge design is the need for a <strong>Designated Instance</strong> (DI) on the NSX distributed router for the Uplink logical interface on the “Edge VLAN” facing the Nexus 7000s.</p>
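On the Nexus 7000 side, the static design might be sketched like this. All VLAN IDs, addresses, and names here are hypothetical, and the matching default route on the distributed router is configured in NSX itself:

```
feature interface-vlan
feature hsrp

! SVI on the "Edge VLAN" facing the distributed router uplink
interface Vlan100
  ip address 192.168.100.2/24
  hsrp 100
    ip 192.168.100.1
  no shutdown

! Aggregate covering all On Demand subnets behind the distributed router,
! with the DLR forwarding address as next hop
ip route 10.1.0.0/16 192.168.100.10

! Redistribute the static aggregate into the backbone (OSPF shown)
ip prefix-list NSX-AGG seq 5 permit 10.1.0.0/16
route-map NSX-STATICS permit 10
  match ip address prefix-list NSX-AGG
router ospf 1
  redistribute static route-map NSX-STATICS
```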
<p><img src="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/lost-your-edge-traffic.png" alt="Lost Your Edge Traffic" title="Lost Your Edge Traffic" /></p>
<p>When the NSX distributed router has an interface on a VLAN, one of the ESX hosts will be <em>designated</em> as responsible for ARP handling and forwarding for the distributed router’s forwarding MAC address on that VLAN. By consequence, that <strong>one host will receive all traffic</strong> coming from other devices on that VLAN (like the Nexus 7Ks). Once received, the designated host will locally route traffic to the proper VXLAN (or VLAN) containing the destination, and send it as a logical Layer 2 flow to the host where the VM resides (which might be on another host, or the same host).</p>
<p>The DI host is elected by the NSX Controller cluster. This is not something that you can easily influence or predict, any host could be elected DI. And when it fails, a new one needs to be re-elected. The failure detection and recovery of a new DI can take as long as <strong>45-60 seconds</strong>. This is something you might want to test for yourself in a lab.</p>
<p>The other important thing to point out about Lost Your Edge is that you’re missing an opportunity to apply services like NAT, VPN, or perimeter Firewall inspection as traffic enters or exits the NSX domain.</p>
<p>In designs to follow you’ll see how we can obtain faster failure recovery, services, better traffic distribution, and even dynamic routing.</p>
<h2 id="on-the-edge">On the Edge</h2>
<p>Let’s assume for the moment that you’re fine with static routing (or maybe you’re stuck with all vPC VLANs in your physical design). Maybe it’s the failure recovery and ingress choke of the Designated Instance that you’re not cool with (heck, I don’t blame you). No problem. In this On the Edge design we’ll introduce the NSX Edge routing VMs and see what happens.</p>
<p><img src="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/on-the-edge.png" alt="on the edge" title="on the edge" /></p>
<p>Nothing has changed with the Nexus 7000s and the physical VLAN setup. We still have all vPC enabled VLANs, and we still have the previously discussed static aggregate route. The difference lies in the NSX topology. Our first hop into NSX is now an NSX Edge Router VM which we’ve protected by a <strong>state-synced</strong> shadow VM. Second, we’ve introduced a VXLAN Transit Logical Switch that will sit between our NSX Edge and NSX distributed router.</p>
<p>All of our hosts are still attached via vPC with 2 x 10GE NICs. Some of these hosts should be designated as Edge hosts and placed in an Edge Cluster for the purpose of running your NSX Edge VMs. This (must read) <a href="https://communities.vmware.com/docs/DOC-27683">VMware NSX Design Guide 2.1</a> covers that approach quite thoroughly as a design best practice. That said, in a lab you can certainly mingle your NSX Edge with compute hosts just for the fun of it.</p>
<p><img src="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/on-the-edge-traffic.png" alt="On The Edge Traffic" title="On the Edge Traffic" /></p>
<p>For our distributed router, the concept of a Designated Instance does not apply on any VXLAN segment (such as our Transit Logical Switch) where traffic is flowing from an NSX Edge VM to the distributed router, and vice versa. When traffic arrives at the NSX Edge VM from the Nexus 7000, the Edge host machine also happens to be running the NSX distributed router in its kernel. Therefore, the next hop (.2) is <strong>always local</strong> to every Edge machine, along with the Logical Switches attached to that distributed router. In a nutshell, the Edge host machine is able to route traffic from the Nexus 7000 directly to the ESX compute host where the destination VM resides. How cool is that?</p>
<p>You can see the On the Edge design – when compared to Lost Your Edge – has the same (if not better) traffic flow properties, faster failover (6 seconds), and the opportunity to add services like NAT, VPN, and perimeter Firewall. Not bad for a day’s work.</p>
<h2 id="on-the-upgraded-edge">On the Upgraded Edge</h2>
<p>Now let’s assume that you do have some flexibility in your physical design to vPC attach some hosts, and not others. With that luxury we’ll take the Edge hosts running the NSX Edge VMs and have those non-vPC attached. Meanwhile we’ll leave the compute hosts with their optimal vPC attachment. By doing this, we’ll be able to <em>upgrade</em> the On the Edge design with <strong>dynamic routing</strong>. Just as a reminder, this is an exercise specific to the Cisco Nexus 7000. Other platforms may be able to handle dynamic routing on vPC or MLAG connections just fine.</p>
<p><img src="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/on-the-upgraded-edge.png" alt="On the Upgraded Edge" title="On the Upgraded Edge" /></p>
<p>From the diagram above you will notice that we’ve made the “Edge VLAN” a non-vPC VLAN and our Edge hosts will attach to it. You might also observe that we’ve added a second VTEP VLAN that is non-vPC, and we will attach our Edge host VXLAN vmkernel interfaces to it. Our Edge hosts are completely non-VPC attached while our compute hosts remain attached to all vPC enabled VLANs.</p>
<p>With our NSX Edge hosts free from vPC attachment, we are able to run dynamic routing protocols, such as BGP, with the Nexus 7000 without issue. Every new subnet created on the NSX distributed router will be advertised to the NSX Edge, and in turn will be advertised by the NSX Edge to the upstream Nexus 7000s (or whatever) with BGP. Pretty cool, right?</p>
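As a sketch of what that peering could look like from the Nexus 7000 side (ASNs, addresses, and descriptions are hypothetical; the NSX Edge side of the peering is configured through NSX):

```
feature bgp

router bgp 65000
  address-family ipv4 unicast
  ! one neighbor per NSX Edge uplink on the non-vPC "Edge VLAN"
  neighbor 192.168.100.11 remote-as 65001
    description nsx-edge-1
    address-family ipv4 unicast
```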
<p><img src="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/on-the-upgraded-edge-traffic.png" alt="On the Upgraded Edge Traffic" title="On the Upgraded Edge Traffic" /></p>
<p>The traffic flow here is very similar to the previous design, only now our VXLAN traffic between Edge and Compute hosts will take a Layer 3 hop through the Nexus 7000 (before it was Layer 2). No biggie. Depending on your physical design and host placement, this might mean an extra hop through the Nexus 7000, or not. Such as with N7K-N2K (no difference) vs. N7K-N5K-N2K (maybe) or N7K-UCS (maybe). Keep in mind, Edge to Compute host traffic is North/South in nature and generally bottle-necked by some other smaller link further upstream. Fair enough?</p>
<h2 id="over-the-edge">Over the Edge</h2>
<p>Up to this point we’ve been placing one NSX Edge VM on that “Edge VLAN” to send/receive all traffic to/from our NSX distributed router. Well and good. A single NSX Edge VM can easily route 10Gbps of traffic. But you want more? No problem. We’ll just 8-way ECMP that mofo and call it a day. Check it out.</p>
<p><img src="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/over-the-edge.png" alt="Over the Edge" title="Over the Edge" /></p>
<p>What we’ve done here is deploy up to eight NSX Edge VMs on that “Edge VLAN”, placed them on separate hosts, and enabled ECMP. We also went to our NSX distributed router and enabled ECMP there as well. Our Nexus 7000s see dynamic routing updates coming from 8 equal cost next hops and perform per-flow hashing, placing each unique flow on one of our eight NSX Edge VMs (each capable of 10Gbps). The reverse applies as well. If you had up to eight Nexus 7000s on the Edge VLAN (seriously?) each NSX Edge VM would install eight equal cost next hops for each route upstream.</p>
<p>The same magic applies to our NSX distributed router. Each compute host sending traffic northbound will perform eight way per-flow hashing (in-kernel), picking a NSX Edge for each unique flow. If for whatever reason a NSX Edge drops off the network, only 13% of the traffic will be affected (in theory), and only for the period of time it takes routing protocol timeouts to detect and remove the failed next hop (3 seconds or so).</p>
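The per-flow hashing idea can be illustrated with a small Python sketch. This is not the actual in-kernel hash used by ESX or the Nexus 7000 (both are implementation-specific); it just shows how hashing a 5-tuple pins every packet of a flow to one next hop while spreading distinct flows across all eight Edges:

```python
import hashlib

def pick_edge(src_ip, dst_ip, src_port, dst_port, proto, n_edges=8):
    """Per-flow ECMP: hash the 5-tuple so every packet of a given flow
    always takes the same next hop, while distinct flows spread across
    all available NSX Edge next hops."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_edges

# A flow is deterministically pinned to one Edge; if an Edge fails,
# only the flows hashed to that next hop are affected until reroute.
flow = ("10.1.1.5", "172.16.0.9", 49152, 443, "tcp")
assert pick_edge(*flow) == pick_edge(*flow)
print("flow hashed to Edge", pick_edge(*flow))
```

Losing one Edge removes one of eight next hops, which is where the roughly 13% (one eighth) figure above comes from.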
<p><img src="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/over-the-edge-traffic.png" alt="Over the Edge Traffic" title="Over the Edge Traffic" /></p>
<p>When you’re letting it rip with ECMP there’s no guarantee that both directions of a flow will traverse the same NSX Edge. Because of that we need to turn off stateful services like NAT, VPN, and perimeter Firewall. That’s the only bummer. Not much we can do about that right now with ECMP. But if you need lots of bandwidth (more than one Edge) with stateful services, you can always horizontally scale Edge and distributed router in pairs. For example, Edge1+DR1, Edge2+DR2, and so on.</p>
<h2 id="design-poster">Design Poster</h2>
<p><a href="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/NSX_Over_the_Edge_poster.pdf"><strong>CLICK HERE</strong></a> to download your own copy of this 48 x 36 design poster containing all of the cool diagrams from this post. Print that bad boy out and hang it up next to your <a href="http://storage.googleapis.com/bradhedlund/vmw-blog/NSX-UCS-vSphere-DG1/NSX_vSphere_UCS_Nexus_7000_poster.pdf">NSX + UCS + Nexus 7000 poster</a>.</p>
<p><img src="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/over-the-edge-poster.PNG" alt="Over the Edge Poster" /></p>
<h2 id="slides">Slides</h2>
<p><a href="http://storage.googleapis.com/bradhedlund/blog/over-the-edge/Over_the_Edge_with_NSX_and_Cisco_Nexus.pdf"><strong>CLICK HERE</strong></a> to download your own copy of these diagrams in PDF slides.</p>
<h1><a href="http://bradhedlund.com/2014/12/03/a-tale-of-two-perspectives-it-operations-with-nsx">A tale of two perspectives: IT Operations with NSX</a></h1>
<p><em>2014-12-03</em></p>
<p>This year I had the honor and privilege to co-present a session at VMworld 2014 with my esteemed colleague <a href="http://blog.scottlowe.org/">Scott Lowe</a>. As many of you know, Scott is a celebrity at VMworld, one of the most famous virtualization bloggers, and the author of many best-selling books on VMware vSphere.</p>
<p>In this session Scott and I pretended to be colleagues at a company that decided to deploy VMware NSX for their software-defined data center. I played the role of the “Network Guy”, and of course Scott played the role of the “Server Guy”. So then, how do we work together in this environment?</p>
<ul>
<li>How do we gain operational visibility into our respective disciplines using <em>existing</em> tools?</li>
<li>How do we preserve existing roles and responsibilities?</li>
<li>What opportunities exist to converge operational data for cross-functional troubleshooting?</li>
<li>How does the Network team gain <a href="http://bradhedlund.com/2014/09/02/demo-end-to-end-hop-by-hop-physical-and-virtual-network-flow-visibility-with-nsx/">hop-by-hop visibility across virtual and physical</a> switches?</li>
<li>How can the Network and Server teams work together to troubleshoot issues?</li>
</ul>
<p>These are just some of the questions we attempt to role-play and answer in this 35-minute session:</p>
<p><code class="language-plaintext highlighter-rouge">***Update: this VMworld session video was removed from YouTube by VMware and is no longer available.***</code></p>
<h1><a href="http://bradhedlund.com/2014/11/03/on-choosing-vmware-nsx-or-cisco-aci">On choosing VMware NSX or Cisco ACI</a></h1>
<p><em>2014-11-03</em></p>
<p>Are you stuck in the middle of a battle to choose VMware NSX or Cisco ACI? In this post I’ll attempt to bring some clarity and strategic guidance in first choosing the right path, then propose how the two technologies can co-exist. I’ll start with the message below from a reader asking for my opinion on the matter:</p>
<blockquote>
<p>Hi Brad,</p>
</blockquote>
<blockquote>
<p>I’m involved in a new Data Center networking project where Cisco is proposing the Cisco ACI solution. I am starting to dig into the technology, but my immediate “gut reaction” is to use Cisco for a standard Clos-type Leaf and Spine switch network and use NSX for providing Layer 3 to Layer 7 services.</p>
</blockquote>
<blockquote>
<p>I am interested in hearing your opinion about Cisco ACI versus VMware NSX, since you have worked for both companies. If you have time, it would be great to share your views on this subject.</p>
</blockquote>
<blockquote>
<p>As you can imagine, this is a highly political discussion and our network team is Cisco-centric and resisting my ideas. We are a VMware/Cisco shop and I want the best fit for our SDDC strategy.</p>
</blockquote>
<p>For the sake of discussion, let’s assume that your IT organization wants to optimize for better efficiency across all areas, and embark on a journey to “the promised land”. More specifically, you want to obtain template driven self-service automation for application delivery, as well as configuration automation for the physical switches and servers. Let’s also assume that you would like to preserve the familiar model of buying your hardware from Cisco, and your software from VMware. E.g. “We are a VMware/Cisco shop”.</p>
<p>Before I begin, it should be obvious that I’ll approach this with a bias for <a href="http://vcdx133.com/2014/10/05/nsx-link-o-rama/">VMware NSX</a>; the result of a thoughtful decision I made two years ago to join the VMware NSX team instead of the other hardware switch vendor opportunities available at the time. The choice was easy for the simple reason that VMware is the most capable pure <em>software</em> company in the networking business. It was apparent to me then (and still is now), that in the new world of hybrid cloud and self-service IT, the winners will be the ones who can produce the best <em>software</em>.</p>
<h2 id="choosing-a-path-forward-rooted-in-software">Choosing a path forward rooted in software</h2>
<p>Any way you slice it, your virtual machines will be connected to a <em>software</em> virtual switch. This is the domain of a <strong>fluid virtual environment</strong> that will exist whether or not you decide to use VMware NSX, or go all-in with Cisco ACI. Either direction will require that you do something special with the software virtual switch before you can proceed down the chosen path to the promised land. This isn’t opinion or theory, it’s a universally accepted fact. If the solution isn’t able to gain programmatic control of the fluid network within the software-centric virtual environment, it’s a total non-starter – like buying a fancy television without a remote control. It’s not optional or even a matter that’s up for discussion. Everybody agrees this is a necessary function. Well then, what does that tell us?</p>
<p>To explore that thought a bit further, let’s consider the hardware-centric point of view. Any way you slice it, your hypervisors and non-virtual machines will be connected to a hardware physical switch. This is the domain of a <strong>static environment</strong> that will exist whether or not you decide to use VMware NSX or Cisco ACI. One of the two directions requires that you also do something special with hardware switches before you can even proceed with the (above) unanimous requirement for special software virtual switches (e.g. Cisco’s software virtual switch for ACI doesn’t even function without special hardware switches). However, nothing special needs to be done with hardware in the NSX direction. You’re already well down the path of VMware NSX when (above) you did something special with software virtual switches.</p>
<p>I can proceed to argue that nothing special with hardware will ever need to be done. The moment you gained programmatic control over the fluid software environment you’ve done everything necessary, and then pose the question; “Why do you need programmatic control over this static non-virtual environment anyway?” The point here is not to have the debate, but that the debate is there to be had. This is still a matter of opinion and theory. Suppose you bought an adjustable TV stand to go with that fancy new television; does it need a remote control too?</p>
<p>For the sake of argument, let’s presume you accept the theory that there needs to be some programmatic control over the static environment. Hey, it sounds nice, so why not? Maybe you <em>do</em> want a remote control to adjust your TV stand, “just in case”. For the Cisco ACI path to make sense, the next argument you need to make is that the fancy television should only function when it’s placed on an adjustable TV stand; and only if the TV stand can be adjusted by the same remote control that operates the television. And finally, you’ll need to convince people that your fancy television and adjustable stand must be designed by the same company – one that specializes in building television stands. Otherwise, they’d better wait and stick with the same old worn-out TV.</p>
<p>In contrast, for the VMware NSX path to make sense, you’ll need to make the argument that a fancy television should be able to work on any stand you can rest it on. If you can place it on an adjustable stand, well that would be nice. And if the adjustable stand came with a remote control, Wow, even better. You’ll also need to convince people that it makes more sense to buy televisions from an electronics company; and television stands should be bought from a television stand company.</p>
<p>Analogies aside, what this tells us is that software is the more important choice, and the hardware is secondary. There are two primary reasons for this. First, to realize the benefits of the fluid data center with fast provisioning and low OpEx requires tight integration with the overall orchestration framework. This is a function of software. Second, the first hop any packet will see is a software virtual switch, and this is where security policy and other important functionality will reside. Hardware is still important, but overall it accounts for fewer ports and has less of the necessary intelligence.</p>
<blockquote>
<p>“Networking is a software industry. To succeed in a software market, you need to be a software company.” – <a href="https://twitter.com/appenz">Guido Appenzeller</a></p>
</blockquote>
<blockquote>
<p>“Who do you think is going to make better software, a software company or a hardware company?” – <a href="https://twitter.com/smullaney">Steve Mullaney</a></p>
</blockquote>
<p>In other words, Cisco makes great hardware switches. And of course you still need a well-engineered physical network to construct the static environment (the television stand). There are other good choices available, but if you prefer Cisco Nexus 9000 physical switches (either in NX-OS or ACI mode) that’s perfectly fine. However, that decision does not imply that Cisco is also the best fit for the <strong>fluid virtual environment</strong>, because that is a world of pure <em>software</em>.</p>
<p>The best example of this is <strong>security</strong>. Consider the distributed firewall available in VMware NSX, which provides true <a href="http://bradhedlund.com/2014/06/11/an-introduction-to-zero-trust-virtualization-centric-security/">Zero Trust micro-segmentation</a> with per-virtual-machine <strong>stateful</strong> security, full auditing via syslog, and partner integration such as Palo Alto Networks, all with no choke points (because it’s built in to the vSphere kernel). In contrast, this capability provided by VMware NSX does not exist in Cisco ACI. One problem is that switching hardware is simply not yet capable of providing granular per-virtual-machine stateful security. However, this can easily be accomplished in software, as it’s done today in the NSX-enabled VMware distributed virtual switch. Similarly, there’s no technical reason why this same level of security couldn’t be available in Cisco’s ACI-enabled Nexus 1000V <em>software</em> virtual switch (AVS), but it’s not there. The point here is that critical network services like security work best in software and virtual switches. And it’s clearly evident that a pure software company has the focus on software to execute better and faster in providing these features than a hardware company.</p>
<h2 id="on-the-nsx-path-where-does-aci-fit">On the NSX path, where does ACI fit?</h2>
<p>Let’s assume you’ve decided to follow VMware’s lead in software to the promised land, and begin utilizing NSX and the <a href="http://www.vmware.com/products/vrealize-suite/features.html">vRealize Suite</a> for your SDDC self-service automation and policy based application delivery. In that scenario, VMware NSX and Cisco ACI are not at all mutually exclusive, because they’re each fulfilling different roles. One is a network and security virtualization platform for your SDDC (NSX), the other is a well-engineered fabric (ACI). They go together, like a television and its adjustable stand (or wall-mount, whichever you prefer).</p>
<p>Your well-engineered fabric can certainly have its own automation interfaces for the purposes of constructing the static environment in a way that’s, well, automated. The presence of NSX doesn’t prevent that. If you want to deploy Cisco Nexus 9K physical switches – great. The fabric can be deployed in NX-OS mode with a familiar Cisco CLI, and automation through either the <a href="http://keepingitclassless.net/2014/02/cisco-aci-nexus-9000-nxapi/">NX-API</a> or <a href="http://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus9000/sw/6-x/programmability/guide/b_Cisco_Nexus_9000_Series_NX-OS_Programmability_Guide/b_Cisco_Nexus_9000_Series_NX-OS_Programmability_Configuration_Guide_chapter_01.html">Python API</a>. Or the fabric can be deployed in ACI mode (with no CLI), and automation available through the ACI-API. Either way, automation is obtainable. Your Cisco fabric APIs manage the static environment (connections for hypervisors and non-virtual hosts), while the NSX API manages the fluid virtual environment (network services and security for virtual machines).</p>
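<p>As a rough sketch of what automation against the NX-API looks like, here’s how a script might assemble the JSON-RPC body that NX-API accepts. The commands and interface names are illustrative assumptions, not a complete provisioning workflow:</p>

```python
import json

def nxapi_cli_payload(commands):
    """Build an NX-API JSON-RPC body that runs one or more CLI commands.

    NX-API accepts a list of JSON-RPC objects POSTed to the switch's
    /ins endpoint; each object wraps a single CLI command.
    """
    return [
        {
            "jsonrpc": "2.0",
            "method": "cli",
            "params": {"cmd": cmd, "version": 1},
            "id": i + 1,
        }
        for i, cmd in enumerate(commands)
    ]

# Example: configure a leaf port for a newly racked hypervisor host.
payload = nxapi_cli_payload([
    "configure terminal",
    "interface Ethernet1/10",
    "switchport mode trunk",
])
print(json.dumps(payload, indent=2))
```

<p>In practice you’d POST this body to the switch’s <code>/ins</code> endpoint over HTTPS with proper authentication. The point is simply that the static environment is drivable by scripts and orchestration tools, with or without ACI.</p>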
<p>For example, when it’s time to establish network connectivity for a new rack of hypervisor hosts, you’ll use the Cisco APIs for that. In the case of ACI-API, one example would be an application network profile, where the “application” in question is the vSphere hosts running NSX. This ACI profile would contain End Point Groups that establish connectivity policy for the various hypervisor vmkernel interfaces supporting vMotion, Management, vSAN, and NSX. The workflow to provision a new rack of hypervisors would include an API call to Cisco APIC requesting that it assign this profile to the appropriate physical switch ports.</p>
<p>Now it’s time to provision applications in minutes from a self-service portal, complete with network and security services. That’s when your vRealize Suite (or maybe <a href="http://www.vmware.com/products/openstack">VMware Integrated OpenStack</a>) will call upon the VMware NSX API. You simply point vRealize orchestration <em>software</em> at your vCenter and NSX Manager as its API end points; and from there you proceed to create full application blueprints complete with templates for compute, storage, security, and full L2-L7 network services. You can do all of this today with NSX, vRealize, and your Cisco Nexus 9K fabric.</p>
<p>Later, if Cisco provides integration for ACI with the vRealize Suite, you might decide to create some application blueprints using the NSX networking services model, others using the ACI model – just for the fun of it – and then compare the two side by side. “Which model provides better security, better performance, etc?” But we’ll have to wait for that, which brings me back to my original point on the winners producing the best <em>software</em> in a timely manner.</p>
<p>In the meantime, I hope to see you in your awesome new SDDC sometime soon donning your new <a href="http://mylearn.vmware.com/mgrReg/plan.cfm?plan=48389&ui=www_edu?src=vmw_so_vex_rvand_772">VMware NSX certifications</a>!</p>
<p>Cheers,
Brad</p>Brad HedlundAre you stuck in the middle of a battle to choose VMware NSX or Cisco ACI? In this post I’ll attempt to bring some clarity and strategic guidance in first choosing the right path, then propose how the two technologies can co-exist. I’ll start with the message below from a reader asking for my opinion on the matter:Demo: End to end, hop by hop, physical and virtual network flow visibility with NSX2014-09-02T12:22:13-05:002014-09-02T12:22:13-05:00http://bradhedlund.com/2014/09/02/demo-end-to-end-hop-by-hop-physical-and-virtual-network-flow-visibility-with-nsx<p>You’ve probably heard it before. The myth goes something like this: “With software based overlays, troubleshooting in real-time where a flow is going with ECMP hashing on the fabric is going to be a real problem.” The implied message being that this can only be possible with special hardware in a new proprietary fabric switch.</p>
<p>I’ve heard this one a number of times, usually while seated comfortably in a session presented by a vendor who’s invested in the failure of software-centric network virtualization such as VMware NSX. As if this person has never heard of Netflow? Or maybe they assume you won’t bother to do the research, connect the dots, and in fact discover all that is possible.</p>
<p>Well, guess what? I decided to do the research :-) And I put together a short demo showing you just how simple it is to get this troubleshooting capability with generally available software, using any standard network switch, constructed in any standard fabric design (routed Leaf/Spine, L2 with MLAG, etc).</p>
<p>I presented this demo to the VMworld TV crew and embedded it here for your convenience:</p>
<div class="video"><figure><iframe width="640" height="480" src="//www.youtube.com/embed/wRL47AmFAUU" frameborder="0" allowfullscreen=""></iframe></figure></div>
<h2 id="how-does-it-work">How does it work?</h2>
<p>It’s really simple, actually. Here’s what I explain in the video:</p>
<p>The virtual switch encapsulates traffic into VXLAN and exports Netflow (IPFIX) data, for every flow, to a collector of your choice.</p>
<p>The virtual switch also exports a template to the collector that allows it to share a lot of additional VXLAN-related information for each flow, above and beyond the standard flow fields. This includes things such as the outer VTEP IP addresses, and the VXLAN UDP port numbers used to transmit each flow across the fabric. Note: the UDP source port will be unique for each flow.</p>
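<p>Why is the outer UDP source port unique per flow? VTEP implementations typically derive it from a hash of the inner headers, so that fabric ECMP spreads flows across links and each flow becomes identifiable by its outer source port. The exact hash is implementation-specific; the sketch below is only an illustration of the idea:</p>

```python
import zlib

VXLAN_DST_PORT = 4789  # IANA-assigned VXLAN destination port

def vxlan_source_port(src_ip, dst_ip, proto, sport, dport):
    """Derive an outer UDP source port from the inner flow 5-tuple.

    Real VTEPs vary in the hash they use, but the principle is the same:
    hash the inner headers into the ephemeral port range so fabric ECMP
    spreads flows, and so each flow is trackable by its outer source port.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    h = zlib.crc32(key)
    return 49152 + (h % 16384)  # ephemeral range 49152-65535

# The same inner 5-tuple always yields the same outer source port,
# so every packet of a flow takes the same ECMP path.
p = vxlan_source_port("10.0.1.5", "10.0.2.7", 6, 41000, 443)
print(p)
```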
<p>The physical switches also export Netflow, IPFIX, or sFlow data as they observe these VXLAN flows on the fabric. Any decent switch worth its price tag is capable of doing this.</p>
<p>The flow collector is receiving detailed VXLAN flow data from the virtual and physical switches.</p>
<p>At this point you can go to your collector and pick any flow, in real time or historically, and see where it went on the virtual and physical switches, hop by hop.</p>
<p><img src="http://storage.googleapis.com/bradhedlund/blog/end-to-end-visibility/end-to-end-network-visibility-vmware-nsx.PNG" alt="end to end visibility vmware nsx" /></p>
<p>To make it easy to search this data quickly, I decided to use a collector that can aggregate all of that Netflow data and convert it into Syslog messages. This capability is provided by <a href="http://vmware.netflowlogic.com/products/"><strong>Netflow Integrator</strong></a>, from Netflow Logic.</p>
<p>With all of my virtual and physical switch flow data now in Syslog, I can easily search and analyze it from Splunk, or VMware Log Insight, or something else.</p>
<p>For example, I can type in queries that narrow in on the flows between any two IP addresses, and pick my time range.</p>
<p>I can see the end-to-end byte and packet count for each flow, bidirectionally, and quickly tell if any packets were lost in the fabric by looking for identical byte and packet counts on each end (hypervisor to hypervisor).</p>
<p>If I want to see where a flow went on the physical network, I can simply query the VXLAN source UDP port used for that flow, and I’ll see every switch and interface that observed that flow.</p>
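<p>To make the correlation concrete, here’s a minimal sketch of what the collector (or a script over the syslog data) is doing: filter every exporter’s flow records on the outer VXLAN source port to reconstruct the hop-by-hop path, and compare end-to-end packet counts to spot loss. The record fields and device names are illustrative assumptions:</p>

```python
# Illustrative flow records, as each exporter (virtual switch, leaf,
# spine) might hand them to the collector / syslog pipeline.
records = [
    {"exporter": "esx-host-01", "vxlan_sport": 52311, "in_if": "vnic1",   "out_if": "vtep0",   "pkts": 24},
    {"exporter": "leaf-1a",     "vxlan_sport": 52311, "in_if": "Eth1/10", "out_if": "Eth1/49", "pkts": 24},
    {"exporter": "spine-2",     "vxlan_sport": 52311, "in_if": "Eth2/3",  "out_if": "Eth2/7",  "pkts": 24},
    {"exporter": "leaf-2b",     "vxlan_sport": 52311, "in_if": "Eth1/49", "out_if": "Eth1/12", "pkts": 24},
    {"exporter": "esx-host-07", "vxlan_sport": 52311, "in_if": "vtep0",   "out_if": "vnic3",   "pkts": 23},
    {"exporter": "leaf-1a",     "vxlan_sport": 60021, "in_if": "Eth1/11", "out_if": "Eth1/50", "pkts": 9},
]

def flow_path(records, sport):
    """Every device and interface pair that observed this outer source port."""
    return [(r["exporter"], r["in_if"], r["out_if"])
            for r in records if r["vxlan_sport"] == sport]

def lost_packets(records, sport):
    """Ingress hypervisor count minus egress hypervisor count."""
    hops = [r["pkts"] for r in records if r["vxlan_sport"] == sport]
    return hops[0] - hops[-1]

print(flow_path(records, 52311))
print(lost_packets(records, 52311))  # 1 packet lost somewhere in the fabric
```

<p>This is essentially the query I type at the Syslog engine in the demo, expressed as code.</p>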
<p>All of the necessary data is there for analysis by humans, or a machine. Today, I’m typing in queries at a Syslog engine. Tomorrow, it might be a network analysis tool looking at the same data, drawing a nice picture for me, looking for any anomalies, and perhaps making correlations to other events found in the same Syslog data from other IT equipment.</p>
<p>If you take a step back and think about it, Syslog is the perfect means to converge all IT event and troubleshooting data. Dare I say, Big Data. Every flow on your network should be considered an event. Why not? And if properly stored in a common data repository, you have the opportunity to give analytic tools a broad view of what’s happening in your data center, what’s likely to happen, and how to plan for it. Capacity planning. Cross functional troubleshooting. Security forensics. Just to name a few. There’s a lot more to troubleshooting application performance than simply counting packets on the network.</p>Brad HedlundYou’ve probably heard it before. The myth goes something like this: “With software based overlays, troubleshooting in real-time where a flow is going with ECMP hashing on the fabric is going to be a real problem.” The implied message being that this can only be possible with special hardware in a new proprietary fabric switch.An introduction to Zero Trust virtualization-centric security2014-06-11T16:47:51-05:002014-06-11T16:47:51-05:00http://bradhedlund.com/2014/06/11/an-introduction-to-zero-trust-virtualization-centric-security<p>This post will be the first in a series that examine what I think are some of the powerful security capabilities of the VMware NSX platform and the implications to the data center network architecture. In this post we’ll look at the concepts of <strong>Zero Trust</strong> (as opposed to Trust Zones), and <strong>virtualization-centric</strong> grouping (as opposed to network-centric grouping).</p>
<p>Note: <strong>Zero Trust</strong> as a guiding principle to enterprise wide security is inspired by Forrester’s “<strong><a href="http://www.forrester.com/No+More+Chewy+Centers+Introducing+The+Zero+Trust+Model+Of+Information+Security/fulltext/-/E-RES56682">Zero Trust Network Architecture</a></strong>”.</p>
<h2 id="what-are-we-trying-to-accomplish">What are we trying to accomplish?</h2>
<p>We want to be able to secure all traffic in the data center without compromising performance (user experience) or introducing unmanageable complexity. Most notable is the proliferation of East-West traffic: we want to secure traffic between <em>any</em> two VMs, or between <em>any</em> VM and physical host, with the best possible security controls and visibility – per flow, per packet, stateful inspection with policy actions, and detailed logging – in a way that’s both economical to obtain and practical to deploy.</p>
<h2 id="trust-zones-of-insecurity">Trust Zones of Insecurity</h2>
<p>Until now, it hasn’t been possible (much less economically feasible or even practical) to directly connect every virtual machine to its own port on a firewall. Because of this, the firewall has always been a “thing” (a physical piece of iron, or virtual machine) that we need to bolt on top of the network. First, you need a network to connect, aggregate, and group machines. After that you can connect the firewall to a port on that <strong>network-centric</strong> grouping (a virtual switch Port-Group and/or VLAN). Meanwhile, the network construct establishing the group provides unfettered connectivity within the group. In other words, the firewall has no visibility or security control over the East-West traffic between machines in a given group. The result is a “Trust Zone”. We “trust” (read: hope), but can’t verify, that one machine in the zone will not laterally infect/attack the other zone members.</p>
<p><img src="http://storage.googleapis.com/bradhedlund/blog/zero-trust/unsecured-trust-zones.PNG" alt="Unsecured Trust Zones" title="Unsecured Trust Zones" /></p>
<h2 id="network-centric-grouping-in-a-virtual-environment">Network-centric grouping in a virtual environment</h2>
<p>Groups form the basis of a security policy. Similar machines from a policy standpoint are placed into a group, at which point a policy governs how traffic is handled in to, out of, and within that group. How these groups are defined and where they exist can make a big difference in a virtualized data center. For example, when groups are defined by a networking construct, and then pushed into a virtual environment (vSphere), the security policy attached to a virtual machine is determined by its connection to a specific network-centric grouping object, with the most minimal granularity being a Port Group. Taking a network-centric approach in a virtual environment presents a number of challenges.</p>
<p>First, this approach can quickly create a large quantity of networking objects to deal with – a morass of Port Groups cluttering the virtual network inventory. For example, lets say you have 100 applications, each with three distinct tiers of policy groups (Web, App, DB); this would result in 300 Port Groups to choose from in your distributed virtual switch.</p>
<p>Second, the virtual administrator needs to correctly choose, and <em>manually</em> attach, the specific Port Group for each virtual machine network interface when it’s deployed. With an inventory of hundreds or thousands of virtual machines and Port Groups to choose from, human error in applying the wrong security policy is something to contend with. Despite the clutter of Port Groups, the manual aspects can be mitigated, however, if there is good integration with upstream automation software, namely vCloud Automation Center (vCAC).</p>
<p>Third, a Port Group is an object that’s specific to one distributed virtual switch (DVS). If the security policy for a virtual machine depends on its connection to a specific Port Group, the mobility domain for that virtual machine is limited to one DVS. Migrating outside of the DVS would involve a cold stop/start operation, and manually attaching the virtual machine to a different and specific Port Group in a new DVS.</p>
<p>Fourth, there are no security controls for East-West traffic within the Port Group that establishes a group. It’s just another “Trust Zone”. Only traffic between groups can be secured; which might lead to an effort to obtain more granularity by creating more and more Port Groups.</p>
<h2 id="zero-trust-transparent-security">Zero Trust transparent security</h2>
<p>In the Zero Trust model, we take the usual Firewall-bolted-on-top approach and turn it upside down. Every virtual machine is first connected to a <em>transparent</em> in-kernel <strong>stateful</strong> firewall filtering engine (with logging) before it’s even connected to the network. This means that <em>any</em> traffic to or from a virtual machine can be secured, regardless of the network construct it’s attached to. Because the firewall is below the network, directly adjacent to the things we want to protect, there is never an unfettered “Trust Zone”. Security is omnipresent – per flow, per packet, stateful inspection with policy actions and detailed logging, per virtual machine, per virtual NIC. The network constructs still exist, of course, but only to provide connectivity (not security). The Zero Trust model is also referred to as <a href="http://blogs.vmware.com/networkvirtualization/2014/06/micro-segmentation-vmware-nsx.html">Micro Segmentation</a>.</p>
<p><img src="http://storage.googleapis.com/bradhedlund/blog/zero-trust/transparent-zero-trust.PNG" alt="Zero Trust transparent security" title="Zero Trust transparent security" /></p>
<h2 id="virtualization-centric-grouping-in-a-virtual-environment">Virtualization-centric grouping in a virtual environment</h2>
<p>A security policy works with the basic concept of a group, comprised of similar objects, to which you then apply a policy based on group names. In the network-centric model these groups were represented by Port Groups in a distributed virtual switch. In contrast, another approach is to employ a <strong>virtualization-centric</strong> grouping model, as implemented by VMware NSX, where the groups that form the basis of your security policy are decoupled from the network, and are simply an abstract object called a “Security Group” existing in the virtualization layer. There are a number of advantages to this approach in a virtual environment (e.g. vSphere).</p>
<p>First, the virtual network inventory remains simple and uncluttered. For every Security Group created there is no requisite and corresponding Port Group to create. The virtual network inventory remains constant as the environment grows. For example, this time your 100 applications, each with distinct tiers of policy groups (Web, App, DB), can be deployed with only one Port Group and VLAN providing the network connectivity.</p>
<p>Second, the virtual environment can dynamically attach virtual machines to the appropriate Security Group based on virtualization relevant context, tags, and business logic. As a simple example, in the diagram above, any VMs with the name “PROD-web” are placed in the “Web” Security Group automatically. Another scenario might be; if VMs are deployed by members of the “Engineering” active directory group, tag them as “Engineering”, and based on that tag dynamically add them to the “Dev/Test” Security Group, and isolate them from “Prod”. It doesn’t matter which Port Group the VMs are attached to. An incorrect Port Group assignment might only break network connectivity, not security policy.</p>
<p>Third, mobility is not artificially limited to a network-centric object such as a single distributed virtual switch. Security Groups are not coupled to a distributed virtual switch (DVS), or any network construct for that matter. It doesn’t matter which Port Group connects to your virtual machine, and by consequence it also doesn’t matter which DVS your virtual machines are connected to either. This means you can live migrate virtual machines from one DVS to another; and someday soon, between vCenter instances – all while maintaining consistent security policy.</p>
<p>And finally, as previously discussed, there are no insecure Trust Zones with virtualization-centric grouping. Even traffic within a Security Group can be subject to policy controls and stateful inspection with detailed logging. The highest degree of granularity is provided at the onset (per virtual machine, per virtual nic).</p>
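<p>The dynamic grouping described above can be reduced to a simple sketch: rules evaluated against virtualization context (names, tags, business logic) rather than network attachment. The names, tags, and rules here are illustrative assumptions:</p>

```python
def classify(vm):
    """Assign a VM to Security Groups from its context, not its Port Group.

    Mirrors the examples above: name-based matching into "Web", and a tag
    (applied, say, from the deployer's AD group) driving "Dev/Test" isolation.
    """
    groups = set()
    if "PROD-web" in vm["name"]:
        groups.add("Web")
    if "Engineering" in vm.get("tags", []):
        groups.add("Dev/Test")
    return groups or {"Quarantine"}  # unmatched VMs land in a safe default

vms = [
    {"name": "PROD-web-01", "tags": []},
    {"name": "build-agent-9", "tags": ["Engineering"]},
    {"name": "mystery-vm", "tags": []},
]
for vm in vms:
    print(vm["name"], classify(vm))
```

<p>Note that nothing in this logic references a Port Group or a DVS; an incorrect Port Group assignment could break connectivity, but not the security policy.</p>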
<h2 id="architecture-implications">Architecture implications</h2>
<p>With a transparent firewall underneath the network, as opposed to bolted on top, this will have implications to the data center network architecture. The result, I contend, will be virtual and physical <strong>topology simplification.</strong></p>
<p>When the firewall is bolted on top, the network substrate needs to be designed in such a way that correctly implements a security policy — selectively steering traffic from a virtual machine to some physical or virtual firewall several hops away. The more granularity you attempt, the more complex the design becomes with a quagmire of network-centric traffic steering and isolation tools like Port Groups, VLANs, ACLs, and VRFs. Meanwhile, more and more East-West traffic needs to be detoured several hops to a firewall, impacting performance (user experience). And in the end, you’re still left with unsecured Trust Zones, as you can never realistically obtain per-VM granularity.</p>
<p>With virtualization-centric VMware NSX, on the other hand, policy is applied underneath the network, in the virtualization layer. Throw away that East-West traffic detouring bag of tricks. Security is applied, transparently, before the packets even arrive at the first virtual network port. Latency sensitive East-West traffic is free to travel directly to its destination, taking the lowest latency path, having already been secured at the onset.</p>
<p>The network architecture is simply designed for connectivity; whether that might be a handful of VLAN backed Port Groups in an L2 fabric that you’re already using today; or migrating toward full network virtualization with VXLAN backed Logical Switches, Logical Routers, and simple L3 fabrics. You can start with the former and gradually move to the latter.</p>
<h2 id="some-points-of-differentiation">Some points of differentiation</h2>
<p>When evaluating options and comparing the security capabilities of VMware NSX for vSphere to other solutions, here are some points of differentiation to keep in mind.</p>
<p><strong>Headless operation</strong> – The VMware NSX for vSphere distributed firewall does not rely on some other virtual machine for the data plane to function. Rules are centrally programmed by the NSX Manager and each host is able to inspect and enforce security policy for every flow and packet on its own, without the Manager (including headless vMotion).</p>
<p><strong>Mobility</strong> – Your virtual machines are not constrained to single distributed virtual switch. Security policy is consistent irrespective of the DVS or Port Group providing the connectivity, and virtual machine live migration is not artificially constrained to a single DVS.</p>
<p><strong>Zero Trust</strong> – Even traffic within the most minimal grouping construct is secured. East-West traffic within a Security Group can be subject to policy, stateful inspection, and logging. There are no insecure Trust Zones.</p>
<p><strong>Automation</strong> – The virtual environment can automatically attach virtual machines to the appropriate Security Group and subsequent policy based on virtualization relevant context. The virtual administrator doesn’t need to correctly choose and manually assign virtual machines to a specific Port Group. And when a host is added to a cluster, all of the required software is automatically installed.</p>
<p><strong>Dynamic security</strong> – Just as the virtual environment can automatically assign a virtual machine to a Security Group, based on context, it can also change the Security Group (and policy) dynamically, based on changing context, or context provided from a third party, such as a malware or vulnerability assessment solution (Rapid7, McAfee, Symantec, Trend Micro).</p>
<p><strong>Distributed platform for NGFW</strong> – One of the policy actions you can apply to a Security Group is selectively redirecting traffic to a local user space service virtual machine on each host. For example, 3rd party firewall providers can leverage this platform to add NGFW inspection to the environment in a distributed manner. Palo Alto Networks has already leveraged this capability with their VM-Series NGFW firewall that integrates with VMware NSX for vSphere.</p>
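<p>Taken together, these points amount to a per-vNIC, default-deny rule evaluation. Here’s a minimal sketch of that logic (the groups, ports, and rule format are illustrative assumptions, not the NSX rule model):</p>

```python
RULES = [
    # Evaluated top-down at every vNIC; first match wins, default is deny.
    {"src": "Web", "dst": "App", "port": 8443, "action": "allow"},
    {"src": "App", "dst": "DB",  "port": 3306, "action": "allow"},
    {"src": "Web", "dst": "Web", "port": None, "action": "deny"},  # no lateral traffic inside the group
]

def evaluate(src_group, dst_group, port):
    """Return the first matching rule's action; Zero Trust means deny by default."""
    for rule in RULES:
        if rule["src"] == src_group and rule["dst"] == dst_group:
            if rule["port"] is None or rule["port"] == port:
                return rule["action"]
    return "deny"

print(evaluate("Web", "App", 8443))  # allow
print(evaluate("Web", "Web", 22))    # deny, even within the Security Group
print(evaluate("DB", "Web", 80))     # deny, nothing matched
```

<p>The key property is the last line of <code>evaluate</code>: anything not explicitly allowed is denied, at the first virtual port the packet ever touches.</p>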
<h2 id="quick-video-demonstration">Quick Video Demonstration</h2>
<p>Finally, here’s a quick video demonstrating the scenario depicted in the diagrams above. I will show how a Security Group is created, how virtual machines are automatically assigned to a group, how East-West traffic within this group can be filtered by the NSX stateful firewall, and how the logs can be viewed and analyzed.</p>
<div class="video"><figure><iframe width="640" height="480" src="//www.youtube.com/embed/jJtCOAe2OT0" frameborder="0" allowfullscreen=""></iframe></figure></div>Brad HedlundThis post will be the first in a series that examine what I think are some of the powerful security capabilities of the VMware NSX platform and the implications to the data center network architecture. In this post we’ll look at the concepts of Zero Trust (as opposed to Trust Zones), and virtualization-centric grouping (as opposed to network-centric grouping).Three reasons why Networking is a pain in the IaaS, and how to fix it2014-03-26T20:50:13-05:002014-03-26T20:50:13-05:00http://bradhedlund.com/2014/03/26/three-reasons-why-networking-is-a-pain-in-the-iaas-and-how-to-fix-it<p>In this post I share the slides, audio recording, and short outline of a presentation I gave at the Melbourne VMUG conference (Feb 2014) called “Three reasons why Networking is a pain in the IaaS, and how to fix it”.</p>
<p>As network technologists we know that when the compute <strong>architecture</strong> changes, the network architecture changes with it. Consider the precedent. The transition from mainframe to rack servers brought about Ethernet and top-of-rack switches. Blade servers introduced the blade switch and a cable-less network. And of course the virtual server necessitating the software virtual switch and a hardware-less network. At each iteration, we observe the architecture change occurring at the edge, directly adjacent to compute.</p>
<p>We can look at this superficially and say, “yes, the network architecture changed”. However if you think about it, the catalyzing change in each shift was the <strong>operational model</strong>, with intent to increase agility and reduce costs. The architecture change was consequential.
Without compute, there is no reason for a network. Networking, both as a profession and technology, exists as a necessary <strong>service layer</strong> for computing. Without a network, computing is practically useless. As such, the capabilities of the network will either enable or impede computing. Viewed in that light, when an organization decides to change the operational model of computing (virtualization, IaaS), the operational model of the network must evolve with it. If not, the “Network” becomes the impediment to the organization, not an enabler. (Hint: you don’t want to be on the receiving end of that).</p>
<ul>
<li>Static compute > Static network</li>
<li>Virtual compute > Virtual network</li>
<li>Infrastructure as a Service > <a href="http://www.bradhedlund.com/2014/03/05/networking-is-a-service-and-you-are-the-service-provider/"><strong>Networking as a Service</strong></a></li>
</ul>
<h2 id="audio-recording-mp3-44-min">Audio Recording (MP3) 44 min</h2>
<p><a href="http://storage.googleapis.com/bradhedlund/blog/three-reasons/Melbourne_VMUG_2014_User_Conf-Brad_Hedlund_NSX_3.mp3">Click here to download the MP3</a></p>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/1AxEOpQVJIudrB" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
<h2 id="three-reasons-outline">Three reasons: Outline</h2>
<p><strong>1) Impedance Mismatch</strong></p>
<p>Deploying legacy non-virtual networking with virtual computing creates an operational <strong>impedance mismatch</strong>. Virtual computing provides instant provisioning, mobility, and template based deployments. Despite these advances, the virtual compute is still coupled to network services that are slow to provision, anchored to specific physical equipment, and manually deployed at the risk of configuration drift and human error. The full potential of virtualization and the IaaS cannot be realized. Simply creating virtual machine equivalents of Firewalls and Load Balancers doesn’t change the operational model of network services, it only changes the form factor.</p>
<p>The solution is to bring the same operational model of virtual computing to the network – <strong>network virtualization</strong>. Networking services should be instantly provisioned from a capacity pool, decoupled from specific hardware, made equally mobile, and deployed by machines using templates.</p>
<p><strong>2) Lost in Translation (Scripting)</strong></p>
<p>Attempting network “automation” or “orchestration” by scripting against individual device interfaces is untenable. Some 3rd party scripting tool has the difficult job of providing both an upstream interface with which to accept desired network state, and display the real time network state. This requires translation and coordination across many different autonomous devices and interfaces (languages).</p>
<p>The solution is to deploy a virtual networking <strong>platform</strong> (like a virtual chassis switch) where many different devices connect to the platform like a virtual line card using the platform API. The virtual networking platform can then expose a single API endpoint to an upstream automation tool (e.g. OpenStack or VMware vCloud Automation Center). All of the complexities around deploying desired network state and gathering the real-time state are removed from the automation tool and assumed by the virtual networking platform. The individual device interfaces (languages) still remain for operational tasks (code upgrade), but are out of the way in terms of service provisioning.</p>
<p>Examples: <a href="http://www.youtube.com/watch?v=ybUWal6KgjU">VMware NSX + F5 (tech preview video)</a>, and <a href="http://www.vmware.com/files/pdf/products/nsx/vmw-nsx-palo-alto-networks.pdf">VMware NSX + Palo Alto Networks (PDF)</a></p>
<p><strong>3) Choke points</strong></p>
<p>In many cases Firewalls are required to handle east-west traffic between compute instances, or between different trust zones. If the firewall is a “box”, be it a physical piece of iron, or even a virtual machine, it’s a single “device” somewhere in the network through which traffic must be forced so that it can be inspected against a policy. This is a choke point catching packets. Performance of east-west traffic suffers, and the choke point (several layers removed from the source of traffic) <a href="http://blogs.vmware.com/networkvirtualization/2014/03/goldilocks-zone-security-sddc.html">has no real meaningful <strong>visibility</strong></a> into where the traffic came from, who sent it, or where it’s going. The choke point is merely inspecting IP packet headers against an access list. This means IP addresses of the workloads are critical to the applied security policy. This is not what we want in a highly agile Infrastructure as a Service. Security policy should be attached to the applications and workloads, not the IP addresses. And there should be no choke points that impede performance.</p>
<p>The solution is to centrally define and physically distribute the security policy across the virtual switching layer in the hypervisor kernel. Every virtual port attached to a virtual machine is not just the access port, it’s the stateful firewall too. The security policy is applied to the virtual machine, not the IP address, and enforced at the very first hop – no more choke points. And your policy can trigger on a large set of semantics such as user identity, operating system, <a href="https://community.rapid7.com/community/nexpose/blog/2014/02/23/rapid7-nexpose-vmware-nsx-integration"><strong>security posture</strong></a>, or any arbitrary and hierarchical grouping of virtual machines (applications).</p>
<p>Example: <a href="http://blog.ipspace.net/2014/02/distributed-in-kernel-firewalls-in.html"><strong>VMware NSX Distributed Firewall</strong></a></p>
<p>The rest of the presentation covers some example multi-tenant topologies you can deploy in your IaaS with NSX, and how to introduce NSX into your existing environment and make a gradual migration. Listen to the full audio, and stay tuned for more blogs on these topics and more.</p>
<p>Cheers,
Brad</p>

Brad Hedlund

In this post I share the slides, audio recording, and short outline of a presentation I gave at the Melbourne VMUG conference (Feb 2014) called “Three reasons why Networking is a pain in the IaaS, and how to fix it”.

Networking is a Service, and you are the Service Provider
2014-03-05T14:01:36-06:00
http://bradhedlund.com/2014/03/05/networking-is-a-service-and-you-are-the-service-provider

<p>The status quo approach to Networking is the biggest barrier to realizing the full potential of Virtualization and the private, public, or hybrid cloud. We must re-think how Networking <strong>Services</strong> are delivered, in a way that comports with automation, decoupling, pooling, and abstractions. I would argue the solution is a more software-centric approach – Network Virtualization. But more importantly, we must re-think how we view Networking as a career skill set and the value we bring to an organization.</p>
<p>This was the message of two keynote talks I recently gave at the Sydney &amp; Melbourne VMUG user conferences. The title of the talk was <a href="http://www.bradhedlund.com/2014/03/26/three-reasons-why-networking-is-a-pain-in-the-iaas-and-how-to-fix-it/"><strong>Three reasons why Networking is a pain in the IaaS, and how to fix it</strong></a>. I will share the slides and a brief summary of that talk in a subsequent post. But before I do that, please indulge me in a heart-to-heart chat from one long-time Networking professional (me) to another (you):</p>
<p>I emphasize the word <strong><em>services</em></strong> because if you really think about it, that is what Networking really is – <strong>Networking is a Service</strong>. It always has been, and will always continue to be a service – a service that will always be needed. To some, that may seem like an obvious statement. But to others, Networking is still viewed as a set of hardware boxes with ports and features.</p>
<p>What box should I buy? What features does it have? How fast is it? How do I configure that box? I better buy a box with all the features, just in case I might need them. I better buy a box with lots of ports, just in case I might need them. And so on. And you begin to associate your career value with the knowledge you have in evaluating, configuring, and managing these boxes and their complex feature sets. At this point, the mere thought of a software-centric approach to Networking can be quite unsettling. If networking moves to software (read: x86 machines, hypervisors, SDN), well, that makes me less relevant and/or I don’t have the skills for that. And to appeal to your anxieties, the hardware box vendors serve up a healthy plate of Fear, Uncertainty, and Doubt (FUD) assuring you that software-centric networking will fail, keeping you comfortably stuck in your Networking-is-a-hardware-box comfort zone. Meanwhile, the organization continues to see your value associated with the efficient operations and deployment of its infrastructure <em>hardware</em>. When the platform changes, and it will, where does that leave you?</p>
<p><img src="http://storage.googleapis.com/bradhedlund/blog/networking-is-a-service/networking-is-a-service1.png" alt="Your service on any platform" /></p>
<p>Contrast that to a mindset where you view Networking as a <strong><em>service</em></strong> – a service that can be fulfilled by any underlying platform, architecture, or another service (hardware, software, external providers). You know that the ideal platform <strong>will change</strong> over time, because it always does (Client-Server, Virtualization, Cloud, Everything as a Service). You make it your job to recognize when those changes are starting to occur and prepare both yourself, and the organization. You’re able to comfortably adapt to these architecture changes because you own the service of networking – <em>you</em> are a Service Provider. Things such as Connectivity, Routing, Security, High Availability, Access, Performance, Analytics, Reporting, just to name a few; these services are perpetual and platform independent. You’ve put yourself in a position to help the organization navigate the ever changing landscape of applications and IT architecture, keeping the business one step ahead of its competitor that’s still stuck on legacy platforms and architectures.</p>
<p>Your value to the organization is much different now. It’s no longer a situation of “I need this person to configure and manage that gear over there”. Rather, it’s now in the realm of “I need this person to keep the business competitive and relevant in an ever changing technology landscape”.</p>
<p>I believe Network Virtualization (e.g. VMware NSX) really enables this shift in platform, architecture, and career value. Networking <em>services</em> (the things we really care about) are finally abstracted and decoupled from infrastructure, and become portable across a variety of architectures, platforms, and for that matter, service providers. It makes it easier to provide a clean separation of the (more interesting) services that provide value, from the (less interesting) infrastructure that supports it.</p>
<p>Over time, everything will change – both the services and the infrastructure, but probably not at the same pace. The decoupling of services from infrastructure, provided by Network Virtualization, allows us to:</p>
<ul>
<li>Change, add, and optimize services quickly – without changing infrastructure</li>
<li>Change, add, and optimize infrastructure – without changing the services</li>
</ul>
<p>It’s that basic freedom that allows Networking to be elevated and identified as a perpetual and discrete service to which the organization can associate tangible business value. And the person who owns that service is linked to that value. There’s a hero waiting to be made here. Is it going to be you, or someone else? If you ask me, there’s no more exciting time in Networking than right now. The opportunity at hand now will not come around again.</p>
<p>Cheers,
Brad</p>

Brad Hedlund