Distributed virtual and physical routing in VMware NSX for vSphere

Filed in Network Virtualization, NSX, Routing, VMware, VXLAN on November 20, 2013

This post is intended to be a primer on distributed routing in VMware NSX for vSphere, using a basic scenario of L3 forwarding between both virtual and physical subnets. I’m not going to bore you with all of the laborious details, just the stuff that matters for the purpose of this discussion.

In VMware NSX for vSphere there are two different types of NSX routers you can deploy in your virtual network.

  1. The NSX Edge Services Router (ESR)
  2. The NSX Distributed Logical Router (DLR)

Both the ESR and DLR can run dynamic routing protocols, or not.  They can just have static/default routes if you like.

The ESR is a router in a VM (it also does other L4-L7 services like FW, LB, NAT, VPN, if you want).  Both the control and data plane of the ESR router are in the VM.  This VM establishes routing protocol sessions with other routers and all of the traffic flows through this VM.  It’s like a router, but in a VM.  This should be straightforward, not requiring much explanation.

The ESR is unique because it’s more than just a router.  It’s also a feature-rich firewall, load balancer, and VPN device.  Because of that, it works well as the device handling the North-South traffic at the perimeter of your virtual network.  You know, the traffic coming from and going to the clients, other applications, other tenants.  And don’t be fooled.  Just because it’s a VM doesn’t mean the performance is lacking.  Layer 4 firewall and load balancer operations can reach and exceed 10 Gbps throughput, with high connections per second (cps).  Layer 7 operations also perform well compared to hardware counterparts.  And because it’s a VM, well, you can have virtually unlimited ESRs running in parallel, each establishing the secure perimeter for their own “tenant” enclave.

The DLR is a different beast.  With the DLR the data plane is distributed in kernel modules at each vSphere host, while only the control plane exists in a VM.  And that control plane VM also relies on the NSX controller cluster to push routing updates to the kernel modules.

The DLR is unique because it enables each vSphere hypervisor host to perform L3 routing between virtual and physical subnets in the kernel at line rate.  The DLR is configured and managed like one logical router chassis, where each hypervisor host is like a logical line card.  Because of that the DLR works well as the “device” handling the East-West traffic in your virtual network.  You know, the traffic between virtual machines, the traffic between virtual and physical machines, all of that backend traffic that makes your application work.  We want this traffic to have low latency and high throughput, so it just makes sense to do this as close to the workload as possible, hence the DLR.
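
If it helps to picture that split, here is a rough Python sketch of the model (purely conceptual, not NSX code or APIs; the class names are invented). The control VM learns routes, the controller cluster pushes an identical copy to every host's kernel module, and each host then forwards locally:

```python
# Conceptual sketch only: the DLR as one logical router whose forwarding state
# is replicated to every vSphere host. Class names and values are invented.

class DlrControlPlane:
    """Stands in for the DLR control VM: learns routes, hands them to the controllers."""
    def __init__(self):
        self.routes = {}                      # prefix -> next hop (or "connected")

    def learn_route(self, prefix, next_hop):
        self.routes[prefix] = next_hop


class ControllerCluster:
    """Stands in for the NSX controller cluster: pushes routing updates to host kernels."""
    def __init__(self, hosts):
        self.hosts = hosts

    def push_updates(self, control_plane):
        for host in self.hosts:
            host.kernel_routes = dict(control_plane.routes)   # identical state everywhere


class HostKernelModule:
    """Stands in for the per-host DLR kernel module: performs the data-plane lookup."""
    def __init__(self, name):
        self.name = name
        self.kernel_routes = {}

    def lookup(self, prefix):
        return self.kernel_routes.get(prefix)


hosts = [HostKernelModule("H1"), HostKernelModule("H2")]
control_vm = DlrControlPlane()
control_vm.learn_route("10.1.1.0/24", "connected")    # LIF1 subnet
control_vm.learn_route("10.1.2.0/24", "connected")    # LIF2 subnet
ControllerCluster(hosts).push_updates(control_vm)

print(hosts[0].lookup("10.1.2.0/24"))   # "connected" on H1...
print(hosts[1].lookup("10.1.2.0/24"))   # ...and the same answer on H2
```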

The ESR and DLR are independent.  You can deploy both in the same virtual network, just one, or none.

Now that we’ve established the basic difference and autonomy between the ESR and DLR, in this blog we’ll focus on the DLR.  Let’s look at a simple scenario where we have just the DLR and no ESR.

Let’s assume a simple situation where our DLR is running on two vSphere hosts (H1 and H2) and has three logical interfaces:

  • Logical Interface 1: VXLAN logical network #1 with VMs (LIF1)
  • Logical Interface 2: VXLAN logical network #2 with VMs (LIF2)
  • Logical Interface 3: VLAN physical network with physical hosts or routers/gateways (LIF3)

Routers have interfaces with IP addresses and the DLR is no different.  Each vSphere host running the DLR has an identical instance of these three logical interfaces, with identical IP and MAC addresses (with the exception of LIF3’s MAC address, as noted below).

  • The IP address and MAC address on LIF1 is the same on all vSphere hosts (vMAC)
  • The IP address and MAC address on LIF2 is the same on all vSphere hosts (vMAC)
  • The IP address on LIF3 is the same on all vSphere hosts; however, the MAC address on LIF3 is unique per vSphere host (pMAC)

LIFs attached to physical VLAN subnets will have unique MAC addresses per vSphere host.

Side note: the pMAC cited here is not the physical NIC MAC.  It’s different.
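
To make the vMAC/pMAC difference concrete, here is a small conceptual sketch with made-up IP and MAC values (the real values are assigned by NSX). The two VXLAN LIFs are identical on every host; only the VLAN LIF's MAC differs per host:

```python
# Conceptual sketch only: the same three LIFs as seen by two hosts.
# All IPs and MACs below are made up; the real vMAC/pMAC values are assigned by NSX.

VMAC = "02:00:00:00:00:01"   # identical virtual MAC on every host (made-up value)

def lifs_for_host(host_pmac):
    return {
        "LIF1": {"type": "vxlan", "ip": "10.1.1.1/24",     "mac": VMAC},
        "LIF2": {"type": "vxlan", "ip": "10.1.2.1/24",     "mac": VMAC},
        "LIF3": {"type": "vlan",  "ip": "192.168.10.1/24", "mac": host_pmac},
    }

h1_lifs = lifs_for_host("02:00:00:00:01:01")   # H1's pMAC (not its physical NIC MAC)
h2_lifs = lifs_for_host("02:00:00:00:01:02")   # H2's pMAC

assert h1_lifs["LIF1"] == h2_lifs["LIF1"]                 # VXLAN LIFs identical everywhere
assert h1_lifs["LIF3"]["ip"] == h2_lifs["LIF3"]["ip"]     # VLAN LIF shares the IP...
assert h1_lifs["LIF3"]["mac"] != h2_lifs["LIF3"]["mac"]   # ...but the pMAC is per host
```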

The DLR kernel modules will route between VXLAN subnets.  If, for example, VM1 on Logical Network #1 wants to communicate with VM2 on Logical Network #2, VM1 will use the IP address on LIF1 as its default gateway, and the DLR kernel module will route the traffic between LIF1 and LIF2 directly on the vSphere host where VM1 resides.  The traffic will then be delivered to VM2, which might be on the same vSphere host, or perhaps another vSphere host, in which case VXLAN encapsulation on Logical Network #2 will be used to deliver the traffic to the hypervisor host where VM2 resides.  Pretty straightforward.
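
A minimal sketch of that east-west decision, using the invented subnets from above and a VM-location table assumed to have been learned already. The point is that the routing hop always happens on the source VM's own host:

```python
# Conceptual sketch only: routing VM1 (10.1.1.10 on LIF1) to VM2 (10.1.2.20 on LIF2).
# The VM-location table is assumed to have been learned already (e.g. from the controllers).

import ipaddress

LIF_SUBNETS = {
    "LIF1": ipaddress.ip_network("10.1.1.0/24"),
    "LIF2": ipaddress.ip_network("10.1.2.0/24"),
}

VM_LOCATION = {"10.1.2.20": "H2"}      # which host VM2 lives on

def route_east_west(src_host, dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    egress_lif = next(lif for lif, net in LIF_SUBNETS.items() if dst in net)
    dst_host = VM_LOCATION[dst_ip]
    if dst_host == src_host:
        return f"{src_host}: routed onto {egress_lif}, delivered locally"
    return f"{src_host}: routed onto {egress_lif}, VXLAN-encapsulated and sent to {dst_host}"

# VM1 lives on H1, so the routing hop happens on H1 no matter where VM2 is.
print(route_east_west("H1", "10.1.2.20"))
```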

[Figure: VMware NSX Distributed Logical Router for vSphere]

The DLR kernel modules can also route between physical and virtual subnets.  Let’s see what happens when a physical host PH1 (or router) on the physical VLAN wants to deliver traffic to a VM on a VXLAN logical network.

PH1 either has a route or default gateway pointing at the IP address of LIF3.
PH1 issues an ARP request for the IP address present on LIF3.
Before any of this happened, the NSX controller cluster picked one vSphere host to be the Designated Instance (DI) for LIF3.

  • The DI is only needed for LIFs attached to physical VLANs.
  • There is only one DI per LIF.
  • The DI host for one LIF might not be the same DI host for another LIF.
  • The DI is responsible for ARP resolution.

Let’s presume H1 is the vSphere host selected as the DI for LIF3, so H1 responds to PH1’s ARP request, replying with its own unique pMAC on its LIF3.
PH1 then delivers the traffic to the DI host, H1.
H1 then performs a routing lookup in its DLR kernel module.
The destination VM may or may not be on H1.
If so, the packet is delivered directly. (i)
If not, the packet is encapsulated in a VXLAN header and sent directly to the destination vSphere host, H2. (ii)
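
Sketching that ingress flow with the same made-up values (again conceptual, not NSX internals): the DI answers the ARP for LIF3 with its own pMAC, and the routing lookup then chooses between local delivery (i) and VXLAN encapsulation to another host (ii):

```python
# Conceptual sketch only: ingress traffic from physical host PH1 on the LIF3 VLAN.
# The DI answers ARP for LIF3; the routing lookup then picks case (i) or (ii).

DI_FOR_LIF = {"LIF3": "H1"}                    # chosen by the NSX controller cluster
PMAC = {"H1": "02:00:00:00:01:01"}             # per-host pMACs (made-up values)
VM_LOCATION = {"10.1.1.10": "H1", "10.1.2.20": "H2"}

def arp_reply_for_lif3(arping_host="PH1"):
    di = DI_FOR_LIF["LIF3"]
    return f"{di} answers {arping_host}'s ARP with its own pMAC {PMAC[di]}"

def ingress_from_ph1(dst_vm_ip):
    di = DI_FOR_LIF["LIF3"]                    # PH1's frame lands on the DI host
    vm_host = VM_LOCATION[dst_vm_ip]
    if vm_host == di:
        return f"(i) {di} routes and delivers to {dst_vm_ip} locally"
    return f"(ii) {di} routes, VXLAN-encapsulates, and sends to {vm_host}"

print(arp_reply_for_lif3())
print(ingress_from_ph1("10.1.1.10"))    # case (i): the VM lives on the DI host
print(ingress_from_ph1("10.1.2.20"))    # case (ii): the VM lives on H2
```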

For (ii) return traffic, the vSphere host with the VM (H2 in this case) will perform a routing lookup in its DLR kernel module and see that the output interface to reach PH1 is its own LIF3.  Yes, if a DLR has a LIF attached to a physical VLAN, each vSphere host running the DLR had better be attached to that VLAN.

Each LIF on the DLR has its own ARP table.  As a consequence, each vSphere host running the DLR carries an ARP table for each LIF.
The DLR ARP table for LIF3 may be empty or not contain an entry for PH1, and because H2 is not the DI for LIF3, it’s not allowed to ARP.  So instead H2 sends a UDP message to the DI host (H1) asking it to perform the ARP.

Note: The NSX controller cluster, upon picking H1 as the DI, informed all hosts in the DLR that H1 was the DI for LIF3.

The DI host for LIF3 (H1) issues an ARP request for PH1 and subsequently sends a UDP response back to H2 containing the resolved information. H2 now has an entry for PH1 on its LIF3 ARP table and delivers the return traffic directly from the VM to PH1.  The DI host (H1) is not in the return data path.
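
Here is a conceptual sketch of that proxy-ARP exchange, with an invented message format standing in for the real UDP messages. The non-DI host asks the DI to resolve the address, caches the answer in its per-LIF ARP table, and then forwards directly:

```python
# Conceptual sketch only: H2 is not the DI for LIF3, so it asks the DI (H1)
# to resolve PH1's MAC and caches the answer in its own per-LIF ARP table.

class DiHost:
    """Stands in for the DI host (H1). The real DI ARPs on the physical VLAN."""
    def resolve(self, lif, ip):
        return {"lif": lif, "ip": ip, "mac": "00:11:22:33:44:55"}   # made-up answer

class NonDiHost:
    """Stands in for H2: one ARP table per LIF, no ARPing on the VLAN itself."""
    def __init__(self, di):
        self.di = di
        self.arp_tables = {"LIF3": {}}

    def send_return_traffic(self, lif, dst_ip):
        entry = self.arp_tables[lif].get(dst_ip)
        if entry is None:                       # miss: ask the DI (the "UDP message")
            entry = self.di.resolve(lif, dst_ip)
            self.arp_tables[lif][dst_ip] = entry
        # Data goes straight from H2 to PH1; the DI is not in the return data path.
        return f"send directly to {dst_ip} ({entry['mac']}) out {lif}"

h2 = NonDiHost(di=DiHost())
print(h2.send_return_traffic("LIF3", "192.168.10.50"))   # first packet: resolve via DI
print(h2.send_return_traffic("LIF3", "192.168.10.50"))   # later packets: local cache hit
```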

All of that happened with just a DLR and static/default routes (no routing protocols).

The DLR can also run IP routing protocols — both OSPF and BGP.

In the case where the DLR is running routing protocols with an upstream router, the DLR will consume two IP addresses on that subnet. One for the LIF in the DLR kernel module in each vSphere host, and one for the DLR control VM.  The IP address on the DLR control VM is not a LIF, it’s not present in the DLR kernel modules of the vSphere hosts, it only exists on the control VM and will be used for establishing routing protocol sessions with other routers — this IP address is referred to as the “Protocol Address”.

The IP address on the LIF will be used for the actual traffic forwarding between the DLR kernel modules and the other routers — this IP address is referred to as the “Forwarding Address” — and is used as the next-hop address in routing advertisements.
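
In other words, one uplink subnet costs the DLR two addresses with two distinct jobs. A tiny sketch with made-up addresses:

```python
# Conceptual sketch only: one uplink subnet, two DLR addresses (values made up).

dlr_uplink = {
    # Lives in every host's kernel module; advertised as the next hop for traffic.
    "forwarding_address": "192.168.10.1",
    # Lives only on the DLR control VM; sources the OSPF/BGP sessions.
    "protocol_address": "192.168.10.2",
}

# An upstream router therefore peers with .2 but forwards traffic to .1:
print(f"peer with {dlr_uplink['protocol_address']}, "
      f"use {dlr_uplink['forwarding_address']} as the next hop")
```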

When the DLR has a routing adjacency with another router on a physical VLAN, the same process described earlier concerning Designated Instances happens when the other router ARPs for the DLR’s next-hop forwarding address.  Pretty straightforward.

If, however, the DLR has a routing adjacency with the “other” router on a logical VXLAN network — such as a router VM (e.g. the ESR) running on a vSphere host that is also running the DLR — then no Designated Instance process is needed, because the DLR LIF with the Forwarding Address will always be present on the same host as the “other” router VM.  How’s that for a brain twister? ;)

The basic point here is that the DLR provides optimal routing between virtual and physical subnets, and can establish IP routing sessions with both virtual and physical routers.

One example where this would work might be a three-tier application where each tier is its own subnet.  The Web and App tiers might be virtual machines on VXLAN logical networks, whereas the Database machines might be non-virtualized physical hosts on a VLAN.  The DLR can perform optimal routing between these three subnets, virtual and physical, as well as dynamically advertise new subnets to the data center WAN or Internet routers using OSPF or BGP.
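
Sketching that example with invented prefixes (the actual advertisement is done by the control VM speaking OSPF or BGP; this just shows what gets advertised and with which next hop):

```python
# Conceptual sketch only: three connected subnets on one DLR (prefixes invented),
# advertised upstream with the Forwarding Address as the next hop.

DLR_CONNECTED = {
    "web": {"prefix": "10.1.1.0/24",     "lif_type": "vxlan"},   # Web tier VMs
    "app": {"prefix": "10.1.2.0/24",     "lif_type": "vxlan"},   # App tier VMs
    "db":  {"prefix": "192.168.20.0/24", "lif_type": "vlan"},    # physical DB hosts
}

def advertisements(connected, forwarding_address="192.168.10.1"):
    """Roughly what the upstream routers end up learning from the control VM."""
    return [f"{tier['prefix']} via {forwarding_address}" for tier in connected.values()]

for route in advertisements(DLR_CONNECTED):
    print(route)
```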

Pretty cool, right?

Stay tuned.  More to come…

Cheers,
Brad

About the Author

Brad Hedlund is an Engineering Architect with the CTO office of VMware’s Networking and Security Business Unit (NSBU), focused on network & security virtualization (NSX) and the software-defined data center. Brad’s background in data center networking dates back to the mid-1990s and spans roles as an IT customer, a systems integrator, architecture and technical strategy roles at Cisco and Dell, and a speaker at industry conferences. CCIE Emeritus #5530.

Comments (21)


  1. Andrea says:

    Very good overview Brad, now I’m beginning to understand what NSX is.
    One question: the best DLR (and ESR too, I suppose) deployment needs separate LANs for physical and virtual servers, doesn’t it?

    Let me clarify: in some deployments a physical LAN could span both the physical and the virtual world. For example, a dedicated database LAN which holds a few database VMs and a big database silo. Should the optimal design for the DLR separate the physical database silo from all the database VMs?

    • Brad Hedlund says:

      Hi Andrea,
      The DLR is a router. So, yes, using your example, if you wanted the DLR to provide connectivity between the database VMs (1) and the database physical machines (2), you would have (1) and (2) on different IP subnets. (1) would be a VXLAN, (2) would be a VLAN.

      If you wanted (1) and (2) to be on the same IP subnet being L2 adjacent, then you would not use the DLR for that. You would use L2 bridging capabilities of the NSX Edge (different topic for a different day).

      Cheers,
      Brad

  2. William Caban says:

    Great explanation thanks! Here some questions:

    Questions on the ESR:
    – Are the L4-L7 services of the ESR all or nothing? Meaning, can the FW or LB be replaced by a third party?
    – If so, is this integration over APIs or does the user have to force the path of the traffic over that third party for the specific functionality?
    – If not, if we use a third party service solution (FW, LB, VPN), can we still use the ESR?

    Questions on the DLR:
    – Are the LIFs tied to a special type of port/port-group/vmkernel port, or do we have to assign a physical port for them to use as uplinks (like we do with the vSS)?
    – When there is a designated instance in play, how does the system handle failover of the physical NIC or of the host with the DI?
    – When a host has multiple uplinks (think side A and side B), does the DLR load-balance/load-share across the uplinks? If so, is it a fixed load distribution (like MAC pinning) or does it account for BW utilization?
    – What type of OSPF areas does it support?
    – Does it support any BGP extension (like communities) to publish and/or enforce QoS over the traffic?

    Can a host using the DLR still use the vDS and/or any third-party vDS?

    Does this support regular vCenter+vSphere Ent Plus deployments or does it require vCD or vCAC?

    • Brad Hedlund says:

      Hi William,

      Answers on the ESR:
      – You can pick and choose each individual service you want to enable on the ESR. If you want the ESR to just be a router and no FW & LB, you can do that. If you want the ESR to do routing and LB, but no Firewall, you can do that too.
      – NSX provides a platform for 3rd party vendors to seamlessly integrate their services into the virtual network, whereby they can integrate at the NSX API.
      – For Example (above): Palo Alto Networks NGFW already integrates with VMware NSX for vSphere (pdf): http://www.vmware.com/files/pdf/products/nsx/vmw-nsx-palo-alto-networks.pdf

      Answers on the DLR:
      – The DLR in NSX for vSphere works with the VDS. So whatever physical uplinks the VDS owns will be the same uplinks used by the DLR.
      – The NSX controller cluster will handle failures of the Designated Instance. After a failure, the NSX controllers will choose a new DI and inform the other hosts of the DI change.
      – When traffic egresses the DLR, the VDS it’s installed on will provide the load balancing across multiple physical uplinks. LACP, load based teaming, MAC pinning, active/standby, are all possibilities.
      – You can configure the DLR to participate in normal OSPF areas, or stub areas (NSSA).
      – There is no support for BGP communities.
      – The host running the DLR can also run other vswitches, such as the VSS or N1K, but the DLR will not operate on these.
      – You don’t need vCD or vCAC. You can install, operate, and consume NSX entirely within vSphere if you want.

      Cheers,
      Brad

  3. YaoJinYuan says:

    Questions on the DLR:
    When the DLR receives a packet (from VM1 10.1.1.1 to VM2 10.1.1.2 within the same VXLAN, but VM1 and VM2 are not on the same host), it looks up the routing table. How does the DLR know where VM2 is? The DLR only knows the ARP entry for VM2, but it does not know which host VM2 is on.

    • Brad Hedlund says:

      The DLR would not receive those packets, because VM1 is sending directly to VM2’s MAC address. This traffic is handled by the NSX Logical Switch to which both VMs are attached (as is the DLR). If VM1 is sending packets to a destination on a subnet other than its own, the VM will ARP for the MAC address of its default gateway, which will be the DLR — and that is how the DLR “receives” traffic to perform a routing lookup.

      Cheers,
      Brad

  4. Bhargav says:

    Hi Brad,

    Great writeup.

    Routing between VMs on vSphere seems direct, but routing between physical & virtual does not appear trivial. Wanted to highlight a few points:

    1) The exchange of UDP messages for ARP resolution seems quite uncomfortable to me at this point. Instead, why can’t NSX program the ARP info for that LIF on the other hosts as well?

    2) It seems to me that one is required to configure VLANs on all the vSphere hosts and the underlying network if it is required to talk to a physical host. Is this not a provisioning issue? Would a dedicated VXLAN physical router not solve the problem?

    3) Will the DI take care of all the policy thingy for inter-subnet routing?

    4) It is still not clear to me what the use of the DLR control VM is. Can you elaborate more on this?

    -Bhargav

    • Brad Hedlund says:

      Hi Bhargav,

      1) The NSX controllers do not manage ARP state on physical VLANs.

      2) There are several ways virtual machines in NSX can talk to a physical host on a VLAN, not all of which require configuring that VLAN to every vSphere host. As far as provisioning, physical VLANs for physical hosts are usually static and infrequently created. It’s the virtual subnets that are more dynamic and ephemeral. So in the interest of provisioning speed, you definitely want the virtual subnets to be created efficiently, which is what you have in software with the NSX routers. A centralized VxLAN router is one of the other methods using the NSX Edge Services Router virtual machine, as it can route from a VXLAN subnet to a physical VLAN at 10+ Gbps. A physical VxLAN router would be possible too, but the trick is provisioning the ephemeral virtual subnets in sync with NSX, which is something that we might see someday using the same techniques as the physical L2 top of rack VXLAN gateway switches that were demoed at VMworld 2013.

      3) Not sure what you mean by “policy thingy”. Can you elaborate?

      4) Think of the DLR like a router chassis (Cisco 7600, etc.), where the kernel modules on the host are the line cards forwarding packets, and the Control VM is the supervisor engine establishing routing protocol sessions with other routers and programming routing updates to the line cards.

      Cheers,
      Brad

  5. Bhargav says:

    Hi Brad,

    Thanks for the clarification.

    #1) Got it, and nicely done too. This mechanism seems similar to a standard router where the system requests the ARP manager to resolve ARP for a directly connected host.

    #2) Agree on the static part of physical VLANs. One could throw multiple ESRs running at 10+ Gbps for physical storage.

    #3) The example you have considered is inter-subnet routing for the same tenant. How does it work across different tenants? Consider an example of tenant-A and tenant-B. B would probably see the public IP of A; will the DLR take care of routing between different tenants? What kind of policy would be applied at the DLR to take care of such scenarios?

    #4) Understood. So, the DLR VM would summarize the virtual routes of each tenant and advertise them to the ESR using BGP. So, does this BGP support VRFs? Does the DLR support BGP/OSPF over the VXLAN transport?

    Additionally,

    5) Is the DLR part of the DVS, or is it a separate module? How do the DVS & DLR work together?

    -Bhargav

    • Brad Hedlund says:

      Hi Bhargav,

      3) The DLR kernel module on the hypervisor host can have thousands of isolated instances of a DLR. So, if you want, tenant-A and tenant-B can each have their own DLR. You could think of the DLR as a replacement for what would otherwise be a VRF on a physical L3 switch/router. Also, if you want, each tenant can have their own Edge Services Router (after all, the ESR is just a VM). This would provide total IP addressing multi-tenancy between tenant-A and tenant-B, with the exception of the IP address on the external DMZ-facing interface of their ESR. If you just wanted security isolation between tenants, each tenant can share the same DLR, and you simply apply policy on the Distributed Firewall that says tenant-A can’t talk to tenant-B, as all packets pass through the distributed firewall before they even touch the first virtual switch port.

      4) See #3 above. The inherent multi-tenancy provided by parallel instances of DLR and ESR is providing the functionality of what you know as a VRF. If you want tenant-A’s logical network to be connected to a physical VRF for tenant-A in the WAN/Campus, that can be done too. Just place the external interface of tenant-A’s ESR on a physical VLAN that maps to tenant-A’s physical VRF.

      5) The DLR is provided as an add on kernel module automatically installed on a cluster of hosts that you decide to run NSX on.

      Cheers,
      Brad

      • Bhargav says:

        Hi Brad,

        3) Interesting. There are 2 logical views here.
        Option-1) A DLR for each customer with its own ESR. Something like its own router chassis for each customer.
        Option-2) With the distributed firewall, a single DLR is shared by every customer.

        With Option-1, inter-customer (inter-VRF) packets would traverse through the ESRs’ DMZ, while with Option-2 it will be done locally.

        With Option-1, the DLR kernel (& underlying network transporting VxLAN packets) need not learn about the public IPs of other customers, while with Option-2 the DLR kernels may have to learn about them.

        -Bhargav

  6. nEIL bARNETT says:

    Brad, how would you consume DLR features from a vCloud deployment perspective (OrgVDC / ProVDC)? From my understanding this isn’t possible within the current / future release.

  7. nEIL bARNETT says:

    Brad, working on an NSX for vSphere design. Trying to determine engineering tradeoffs (multicast vs. unicast vs. hybrid). The customer is possibly going unicast, but I’m worried that the unicast design may limit future VXLAN scalability and/or cross data center / cross VDC transport movement. (Am I off track?)

  8. AndreyO says:

    Hello Brad, many thanks for your explanations of NSX’s internals.
    But there are a couple of nuances that need clarification:
    1) You write “In the case where the DLR is running routing protocols with an upstream router, the DLR will consume two IP addresses on that subnet. One for the LIF in the DLR kernel module in each vSphere host, and one for the DLR control VM. The IP address on the DLR control VM is not a LIF, it’s not present in the DLR kernel modules of the vSphere hosts, it only exists on the control VM and will be used for establishing routing protocol sessions with other routers…”
    This sounds slightly confusing: does NSX need an IP address for each routed subnet on each NSX host or not? If it does, there must be more IP addresses (specifically, #hosts + #DLR_control-VMs).
    2) When a VXLAN-attached VM located on a non-DI host sends traffic to a physical host’s VLAN, the traffic traverses the DLR, correct? Is this traffic also encapsulated in VXLAN? If so, must the ingress traffic for the physical host be decapsulated on a VTEP? How is that done?
    Thank you

    • Brad Hedlund says:

      Hi Andrey,

      1) I probably could have worded those two sentences a bit better. Here’s a simple example that might provide some clarity. If you have a DLR with 10 interfaces (9 subnets for virtual machines, 1 subnet for uplink) you would only need 11 IP addresses for this DLR, regardless of how many hosts you have. First, each of the 9 interfaces for the virtual machine subnets would only need 1 IP address, and that same IP/MAC address is simultaneously present on all hosts — the default gateway for each VM is local, no matter which host it’s on. Second, on the uplink subnet the DLR will need two IP addresses, one for the address that upstream routers will use as the next-hop, the other for the IP address that the DLR control VM will use to speak BGP/OSPF to routers on that uplink subnet.
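
      A quick back-of-the-envelope check of that count (note the host count never appears):

      ```python
      # 10 DLR interfaces -> 11 IP addresses, independent of how many hosts run the DLR
      vm_facing_lifs = 9     # one IP each, the same IP present on every host
      uplink_forwarding = 1  # the uplink LIF IP, used as the routing next-hop
      uplink_protocol = 1    # the control VM's address for the OSPF/BGP sessions
      print(vm_facing_lifs + uplink_forwarding + uplink_protocol)  # 11
      ```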

      2) The DLR can directly route between both normal VLAN based subnets as well as VXLAN based subnets. If a VM wants to speak with a physical host on a VLAN, the VM will send a packet to its default gateway (the DLR) which is directly in the host’s kernel, and thus no VXLAN encapsulation happens between the VM and the DLR. At this point the DLR routes the packet directly on to the VLAN, because it has a logical interface on that VLAN, so again, no VXLAN encapsulation happens between the DLR and the physical host. As you can see, traffic from the VM to the physical host takes the most direct path and there is no VXLAN encapsulation. The return traffic, from physical host to VM, will enter through the host elected as the DI for that VLAN. This may or may not be the host where the VM resides. The DI host will receive the packet and route it to the VM subnet. If the DI host is where the VM resides, the packet will be delivered directly to the VM with no VXLAN encapsulation. If the VM is on a different host, the DI host (after routing) will VXLAN encapsulate the packet and deliver it to the host where the VM resides.

      Cheers,
      Brad
