Distributed virtual and physical routing in VMware NSX for vSphere

Filed in Network Virtualization, NSX, Routing, VMware, VXLAN on November 20, 2013

This post is intended to be a primer on distributed routing in VMware NSX for vSphere, using a basic scenario of L3 forwarding between both virtual and physical subnets. I’m not going to bore you with all of the laborious details, just the stuff that matters for the purpose of this discussion.

In VMware NSX for vSphere there are two different types of NSX routers you can deploy in your virtual network.

  1. The NSX Edge Services Router (ESR)
  2. The NSX Distributed Logical Router (DLR)

Both the ESR and DLR can run dynamic routing protocols, or not.  They can just have static/default routes if you like.

The ESR is a router in a VM (it also does other L4-L7 services like FW, LB, NAT, VPN, if you want).  Both the control and data plane of the ESR router are in the VM.  This VM establishes routing protocol sessions with other routers and all of the traffic flows through this VM.  It’s like a router, but in a VM.  This should be straightforward, not requiring much explanation.

The ESR is unique because it’s more than just a router.  It’s also a feature-rich firewall, load balancer, and VPN device.  Because of that, it works well as the device handling the North-South traffic at the perimeter of your virtual network.  You know, the traffic coming from and going to the clients, other applications, other tenants.  And don’t be fooled.  Just because it’s a VM doesn’t mean the performance is lacking.  Layer 4 firewall and load balancer operations can reach and exceed 10 Gbps throughput, with high connections per second (cps).  Layer 7 operations also perform well compared to hardware counterparts.  And because it’s a VM, well, you can have virtually unlimited ESRs running in parallel, each establishing the secure perimeter for their own “tenant” enclave.

The DLR is a different beast.  With the DLR the data plane is distributed in kernel modules at each vSphere host, while only the control plane exists in a VM.  And that control plane VM also relies on the NSX controller cluster to push routing updates to the kernel modules.

The DLR is unique because it enables each vSphere hypervisor host to perform L3 routing between virtual and physical subnets in the kernel at line rate.  The DLR is configured and managed like one logical router chassis, where each hypervisor host is like a logical line card.  Because of that the DLR works well as the “device” handling the East-West traffic in your virtual network.  You know, the traffic between virtual machines, the traffic between virtual and physical machines, all of that backend traffic that makes your application work.  We want this traffic to have low latency and high throughput, so it just makes sense to do this as close to the workload as possible, hence the DLR.
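
If it helps to picture that split, here is a rough Python sketch of the model (purely conceptual, not NSX code or APIs; the class names are invented). The control VM learns routes, the controller cluster pushes an identical copy to every host's kernel module, and each host then forwards locally:

```python
# Conceptual sketch only: the DLR as one logical router whose forwarding state
# is replicated to every vSphere host. Class names and values are invented.

class DlrControlPlane:
    """Stands in for the DLR control VM: learns routes, hands them to the controllers."""
    def __init__(self):
        self.routes = {}                      # prefix -> next hop (or "connected")

    def learn_route(self, prefix, next_hop):
        self.routes[prefix] = next_hop


class ControllerCluster:
    """Stands in for the NSX controller cluster: pushes routing updates to host kernels."""
    def __init__(self, hosts):
        self.hosts = hosts

    def push_updates(self, control_plane):
        for host in self.hosts:
            host.kernel_routes = dict(control_plane.routes)   # identical state everywhere


class HostKernelModule:
    """Stands in for the per-host DLR kernel module: performs the data-plane lookup."""
    def __init__(self, name):
        self.name = name
        self.kernel_routes = {}

    def lookup(self, prefix):
        return self.kernel_routes.get(prefix)


hosts = [HostKernelModule("H1"), HostKernelModule("H2")]
control_vm = DlrControlPlane()
control_vm.learn_route("10.1.1.0/24", "connected")    # LIF1 subnet
control_vm.learn_route("10.1.2.0/24", "connected")    # LIF2 subnet
ControllerCluster(hosts).push_updates(control_vm)

print(hosts[0].lookup("10.1.2.0/24"))   # "connected" on H1...
print(hosts[1].lookup("10.1.2.0/24"))   # ...and the same answer on H2
```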

The ESR and DLR are independent.  You can deploy both in the same virtual network, just one, or none.

Now that we’ve established the basic difference and autonomy between the ESR and DLR, in this blog we’ll focus on the DLR.  Let’s look at a simple scenario where we have just the DLR and no ESR.

Let’s assume a simple situation where our DLR is running on two vSphere hosts (H1 and H2) and has three logical interfaces:

  • Logical Interface 1: VXLAN logical network #1 with VMs (LIF1)
  • Logical Interface 2: VXLAN logical network #2 with VMs (LIF2)
  • Logical Interface 3: VLAN physical network with physical hosts or routers/gateways (LIF3)

Routers have interfaces with IP addresses and the DLR is no different.  Each vSphere host running the DLR has an identical instance of these three logical interfaces, with identical IP and MAC addresses (with the exception of LIF3’s MAC address, as noted below).

  • The IP address and MAC address on LIF1 is the same on all vSphere hosts (vMAC)
  • The IP address and MAC address on LIF2 is the same on all vSphere hosts (vMAC)
  • The IP address on LIF3 is the same on all vSphere hosts; however, the MAC address on LIF3 is unique per vSphere host (pMAC)

LIFs attached to physical VLAN subnets will have unique MAC addresses per vSphere host.

Side note: the pMAC cited here is not the physical NIC MAC.  It’s different.
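
To make the vMAC/pMAC difference concrete, here is a small conceptual sketch with made-up IP and MAC values (the real values are assigned by NSX). The two VXLAN LIFs are identical on every host; only the VLAN LIF's MAC differs per host:

```python
# Conceptual sketch only: the same three LIFs as seen by two hosts.
# All IPs and MACs below are made up; the real vMAC/pMAC values are assigned by NSX.

VMAC = "02:00:00:00:00:01"   # identical virtual MAC on every host (made-up value)

def lifs_for_host(host_pmac):
    return {
        "LIF1": {"type": "vxlan", "ip": "10.1.1.1/24",     "mac": VMAC},
        "LIF2": {"type": "vxlan", "ip": "10.1.2.1/24",     "mac": VMAC},
        "LIF3": {"type": "vlan",  "ip": "192.168.10.1/24", "mac": host_pmac},
    }

h1_lifs = lifs_for_host("02:00:00:00:01:01")   # H1's pMAC (not its physical NIC MAC)
h2_lifs = lifs_for_host("02:00:00:00:01:02")   # H2's pMAC

assert h1_lifs["LIF1"] == h2_lifs["LIF1"]                 # VXLAN LIFs identical everywhere
assert h1_lifs["LIF3"]["ip"] == h2_lifs["LIF3"]["ip"]     # VLAN LIF shares the IP...
assert h1_lifs["LIF3"]["mac"] != h2_lifs["LIF3"]["mac"]   # ...but the pMAC is per host
```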

The DLR kernel modules will route between VXLAN subnets.  If, for example, VM1 on Logical Network #1 wants to communicate with VM2 on Logical Network #2, VM1 will use the IP address on LIF1 as its default gateway, and the DLR kernel module will route the traffic between LIF1 and LIF2 directly on the vSphere host where VM1 resides.  The traffic will then be delivered to VM2, which might be on the same vSphere host, or perhaps another vSphere host, in which case VXLAN encapsulation on Logical Network #2 will be used to deliver the traffic to the hypervisor host where VM2 resides.  Pretty straightforward.
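
A minimal sketch of that east-west decision, using the invented subnets from above and a VM-location table assumed to have been learned already. The point is that the routing hop always happens on the source VM's own host:

```python
# Conceptual sketch only: routing VM1 (10.1.1.10 on LIF1) to VM2 (10.1.2.20 on LIF2).
# The VM-location table is assumed to have been learned already (e.g. from the controllers).

import ipaddress

LIF_SUBNETS = {
    "LIF1": ipaddress.ip_network("10.1.1.0/24"),
    "LIF2": ipaddress.ip_network("10.1.2.0/24"),
}

VM_LOCATION = {"10.1.2.20": "H2"}      # which host VM2 lives on

def route_east_west(src_host, dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    egress_lif = next(lif for lif, net in LIF_SUBNETS.items() if dst in net)
    dst_host = VM_LOCATION[dst_ip]
    if dst_host == src_host:
        return f"{src_host}: routed onto {egress_lif}, delivered locally"
    return f"{src_host}: routed onto {egress_lif}, VXLAN-encapsulated and sent to {dst_host}"

# VM1 lives on H1, so the routing hop happens on H1 no matter where VM2 is.
print(route_east_west("H1", "10.1.2.20"))
```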

[Figure: VMware NSX Distributed Logical Router for vSphere]

The DLR kernel modules can also route between physical and virtual subnets.  Let’s see what happens when a physical host PH1 (or router) on the physical VLAN wants to deliver traffic to a VM on a VXLAN logical network.

PH1 either has a route or default gateway pointing at the IP address of LIF3.
PH1 issues an ARP request for the IP address present on LIF3.
Before any of this happened, the NSX controller cluster picked one vSphere host to be the Designated Instance (DI) for LIF3.

  • The DI is only needed for LIFs attached to physical VLANs.
  • There is only one DI per LIF.
  • The DI host for one LIF might not be the same DI host for another LIF.
  • The DI is responsible for ARP resolution.

Let’s presume H1 is the vSphere host selected as the DI for LIF3, so H1 responds to PH1’s ARP request, replying with its own unique pMAC on its LIF3.
PH1 then delivers the traffic to the DI host, H1.
H1 then performs a routing lookup in its DLR kernel module.
The destination VM may or may not be on H1.
If so, the packet is delivered directly. (i)
If not, the packet is encapsulated in a VXLAN header and sent directly to the destination vSphere host, H2. (ii)
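
Sketching that ingress flow with the same made-up values (again conceptual, not NSX internals): the DI answers the ARP for LIF3 with its own pMAC, and the routing lookup then chooses between local delivery (i) and VXLAN encapsulation to another host (ii):

```python
# Conceptual sketch only: ingress traffic from physical host PH1 on the LIF3 VLAN.
# The DI answers ARP for LIF3; the routing lookup then picks case (i) or (ii).

DI_FOR_LIF = {"LIF3": "H1"}                    # chosen by the NSX controller cluster
PMAC = {"H1": "02:00:00:00:01:01"}             # per-host pMACs (made-up values)
VM_LOCATION = {"10.1.1.10": "H1", "10.1.2.20": "H2"}

def arp_reply_for_lif3(arping_host="PH1"):
    di = DI_FOR_LIF["LIF3"]
    return f"{di} answers {arping_host}'s ARP with its own pMAC {PMAC[di]}"

def ingress_from_ph1(dst_vm_ip):
    di = DI_FOR_LIF["LIF3"]                    # PH1's frame lands on the DI host
    vm_host = VM_LOCATION[dst_vm_ip]
    if vm_host == di:
        return f"(i) {di} routes and delivers to {dst_vm_ip} locally"
    return f"(ii) {di} routes, VXLAN-encapsulates, and sends to {vm_host}"

print(arp_reply_for_lif3())
print(ingress_from_ph1("10.1.1.10"))    # case (i): the VM lives on the DI host
print(ingress_from_ph1("10.1.2.20"))    # case (ii): the VM lives on H2
```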

For (ii) return traffic, the vSphere host with the VM (H2 in this case) will perform a routing lookup in its DLR kernel module and see that the output interface to reach PH1 is its own LIF3.  Yes, if a DLR has a LIF attached to a physical VLAN, each vSphere host running the DLR had better be attached to that VLAN.

Each LIF on the DLR has its own ARP table.  As a consequence, each vSphere host running the DLR carries an ARP table for each LIF.
The DLR ARP table for LIF3 may be empty or not contain an entry for PH1, and because H2 is not the DI for LIF3, it’s not allowed to ARP.  So instead H2 sends a UDP message to the DI host (H1) asking it to perform the ARP.

Note: The NSX controller cluster, upon picking H1 as the DI, informed all hosts in the DLR that H1 was the DI for LIF3.

The DI host for LIF3 (H1) issues an ARP request for PH1 and subsequently sends a UDP response back to H2 containing the resolved information. H2 now has an entry for PH1 on its LIF3 ARP table and delivers the return traffic directly from the VM to PH1.  The DI host (H1) is not in the return data path.
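
Here is a conceptual sketch of that proxy-ARP exchange, with an invented message format standing in for the real UDP messages. The non-DI host asks the DI to resolve the address, caches the answer in its per-LIF ARP table, and then forwards directly:

```python
# Conceptual sketch only: H2 is not the DI for LIF3, so it asks the DI (H1)
# to resolve PH1's MAC and caches the answer in its own per-LIF ARP table.

class DiHost:
    """Stands in for the DI host (H1). The real DI ARPs on the physical VLAN."""
    def resolve(self, lif, ip):
        return {"lif": lif, "ip": ip, "mac": "00:11:22:33:44:55"}   # made-up answer

class NonDiHost:
    """Stands in for H2: one ARP table per LIF, no ARPing on the VLAN itself."""
    def __init__(self, di):
        self.di = di
        self.arp_tables = {"LIF3": {}}

    def send_return_traffic(self, lif, dst_ip):
        entry = self.arp_tables[lif].get(dst_ip)
        if entry is None:                       # miss: ask the DI (the "UDP message")
            entry = self.di.resolve(lif, dst_ip)
            self.arp_tables[lif][dst_ip] = entry
        # Data goes straight from H2 to PH1; the DI is not in the return data path.
        return f"send directly to {dst_ip} ({entry['mac']}) out {lif}"

h2 = NonDiHost(di=DiHost())
print(h2.send_return_traffic("LIF3", "192.168.10.50"))   # first packet: resolve via DI
print(h2.send_return_traffic("LIF3", "192.168.10.50"))   # later packets: local cache hit
```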

All of that happened with just a DLR and static/default routes (no routing protocols).

The DLR can also run IP routing protocols — both OSPF and BGP.

In the case where the DLR is running routing protocols with an upstream router, the DLR will consume two IP addresses on that subnet. One for the LIF in the DLR kernel module in each vSphere host, and one for the DLR control VM.  The IP address on the DLR control VM is not a LIF, it’s not present in the DLR kernel modules of the vSphere hosts, it only exists on the control VM and will be used for establishing routing protocol sessions with other routers — this IP address is referred to as the “Protocol Address”.

The IP address on the LIF will be used for the actual traffic forwarding between the DLR kernel modules and the other routers — this IP address is referred to as the “Forwarding Address” — and is used as the next-hop address in routing advertisements.
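
In other words, one uplink subnet costs the DLR two addresses with two distinct jobs. A tiny sketch with made-up addresses:

```python
# Conceptual sketch only: one uplink subnet, two DLR addresses (values made up).

dlr_uplink = {
    # Lives in every host's kernel module; advertised as the next hop for traffic.
    "forwarding_address": "192.168.10.1",
    # Lives only on the DLR control VM; sources the OSPF/BGP sessions.
    "protocol_address": "192.168.10.2",
}

# An upstream router therefore peers with .2 but forwards traffic to .1:
print(f"peer with {dlr_uplink['protocol_address']}, "
      f"use {dlr_uplink['forwarding_address']} as the next hop")
```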

When the DLR has a routing adjacency with another router on a physical VLAN, the same process described earlier concerning Designated Instances happens when the other router ARPs for the DLR’s next-hop forwarding address.  Pretty straightforward.

If, however, the DLR has a routing adjacency with the “other” router on a logical VXLAN network — such as a router VM (e.g. the ESR) running on a vSphere host that is also running the DLR — then no Designated Instance process is needed, because the DLR LIF with the Forwarding Address will always be present on the same host as the “other” router VM.  How’s that for a brain twister? ;)

The basic point here is that the DLR provides optimal routing between virtual and physical subnets, and can establish IP routing sessions with both virtual and physical routers.

One example where this would work might be a three-tier application where each tier is its own subnet.  The Web and App tiers might be virtual machines on VXLAN logical networks, whereas the Database machines might be non-virtualized physical hosts on a VLAN.  The DLR can perform optimal routing between these three subnets, virtual and physical, as well as dynamically advertise new subnets to the data center WAN or Internet routers using OSPF or BGP.
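
Sketching that example with invented prefixes (the actual advertisement is done by the control VM speaking OSPF or BGP; this just shows what gets advertised and with which next hop):

```python
# Conceptual sketch only: three connected subnets on one DLR (prefixes invented),
# advertised upstream with the Forwarding Address as the next hop.

DLR_CONNECTED = {
    "web": {"prefix": "10.1.1.0/24",     "lif_type": "vxlan"},   # Web tier VMs
    "app": {"prefix": "10.1.2.0/24",     "lif_type": "vxlan"},   # App tier VMs
    "db":  {"prefix": "192.168.20.0/24", "lif_type": "vlan"},    # physical DB hosts
}

def advertisements(connected, forwarding_address="192.168.10.1"):
    """Roughly what the upstream routers end up learning from the control VM."""
    return [f"{tier['prefix']} via {forwarding_address}" for tier in connected.values()]

for route in advertisements(DLR_CONNECTED):
    print(route)
```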

Pretty cool, right?

Stay tuned.  More to come…

Cheers,
Brad

About the Author

Brad Hedlund is an Engineering Architect with the CTO office of VMware’s Networking and Security Business Unit (NSBU), focused on network & security virtualization (NSX) and the software-defined data center. Brad’s background in data center networking dates back to the mid-1990s and spans roles as an IT customer, a systems integrator, architecture and technical strategy roles at Cisco and Dell, and a speaker at industry conferences. CCIE Emeritus #5530.

Comments (21)


  1. Andrea says:

    Very good overview Brad, now I’m beginning to understand what NSX is.
    One question: the best DLR (and ESR too, I suppose) deployment needs separate LANs for physical and virtual servers, doesn’t it?

    Let me clarify: in some deployments a physical LAN could span both the physical and the virtual world. For example, a dedicated database LAN which holds a few database VMs and a big database silo. Should the optimal design for the DLR separate the physical database silo from all the database VMs?

    • Brad Hedlund says:

      Hi Andrea,
      The DLR is a router. So, yes, using your example, if you wanted the DLR to provide connectivity between the database VMs (1) and the database physical machines (2), you would have (1) and (2) on different IP subnets. (1) would be a VXLAN, (2) would be a VLAN.

      If you wanted (1) and (2) to be on the same IP subnet being L2 adjacent, then you would not use the DLR for that. You would use L2 bridging capabilities of the NSX Edge (different topic for a different day).

      Cheers,
      Brad

  2. William Caban says:

    Great explanation thanks! Here some questions:

    Questions on the ESR:
    – Are the L4-L7 services of the ESR all or nothing? Meaning, can the FW or LB be replaced by a third party?
    – If so, is this integration over APIs or does the user have to force the path of the traffic over that third party for the specific functionality?
    – If not, if we use a third party service solution (FW, LB, VPN), can we still use the ESR?

    Questions on the DLR:
    – Are the LIFs tied to a special type of port/port-group/vmkernel port, or do we have to assign a physical port for them to use as uplinks (like we do with the vSS)?
    – When there is a designated instance in play, how does the system handle failover of the physical NIC or of the host with the DI?
    – When a host has multiple uplinks (think side A and side B), does the DLR load-balance/load-share across the uplinks? If so, is it a fixed load distribution (like MAC pinning) or does it account for BW utilization?
    – What type of OSPF areas does it support?
    – Does it support any BGP extension (like communities) to publish and/or enforce QoS over the traffic?

    Can a host using the DLR still use the vDS and/or any third-party vDS?

    Does this support regular vCenter+vSphere Ent Plus deployments or does it require vCD or vCAC?

    • Brad Hedlund says:

      Hi William,

      Answers on the ESR:
      – You can pick and choose each individual service you want to enable on the ESR. If you want the ESR to just be a router and no FW & LB, you can do that. If you want the ESR to do routing and LB, but no Firewall, you can do that too.
      – NSX provides a platform for 3rd party vendors to seamlessly integrate their services into the virtual network, whereby they can integrate at the NSX API.
      – For Example (above): Palo Alto Networks NGFW already integrates with VMware NSX for vSphere (pdf): http://www.vmware.com/files/pdf/products/nsx/vmw-nsx-palo-alto-networks.pdf

      Answers on the DLR:
      – The DLR in NSX for vSphere works with the VDS. So whatever physical uplinks the VDS owns will be the same uplinks used by the DLR.
      – The NSX controller cluster will handle failures of the Designated Instance. After a failure, the NSX controllers will choose a new DI and inform the other hosts of the DI change.
      – When traffic egresses the DLR, the VDS it’s installed on will provide the load balancing across multiple physical uplinks. LACP, load based teaming, MAC pinning, active/standby, are all possibilities.
      – You can configure the DLR to participate in normal OSPF areas, or stub areas (NSSA).
      – There is no support for BGP communities.
      – The host running the DLR can also run other vswitches, such as the VSS or N1K, but the DLR will not operate on these.
      – You don’t need vCD or vCAC. You can install, operate, and consume NSX entirely within vSphere if you want.

      Cheers,
      Brad

  3. YaoJinYuan says:

    Questions on the DLR:
    When the DLR receives a packet (from VM1 10.1.1.1 to VM2 10.1.1.2 within the same VXLAN, but VM1 and VM2 are not on the same host), it looks up the routing table. How does the DLR know where VM2 is? The DLR only knows the ARP entry for VM2, but it does not know which host VM2 is on.

    • Brad Hedlund says:

      The DLR would not receive those packets, because VM1 is sending directly to VM2’s MAC address. This traffic is handled by the NSX Logical Switch to which both VMs are attached (as is the DLR). If VM1 is sending packets to a destination on a subnet other than its own, the VM will ARP for the MAC address of its default gateway, which will be the DLR — and that is how the DLR “receives” traffic to perform a routing lookup.

      Cheers,
      Brad

  4. Bhargav says:

    Hi Brad,

    Great writeup.

    Routing between VMs on vSphere seems direct, but routing between physical & virtual does not appear trivial. Wanted to highlight a few points:

    1) The exchange of UDP messages for ARP resolution seems quite uncomfortable to me at this point. Instead, why can’t NSX program the ARP info for that LIF on the other hosts as well?

    2) It seems to me that one is required to configure VLANs on all the vSphere hosts and the underlying network if it is required to talk to a physical host. Is this not a provisioning issue? Would a dedicated VXLAN physical router not solve the problem?

    3) Will the DI take care of all the policy thingy for inter-subnet routing?

    4) It is still not clear to me what the use of the DLR control VM is. Can you elaborate more on this?

    -Bhargav

    • Brad Hedlund says:

      Hi Bhargav,

      1) The NSX controllers do not manage ARP state on physical VLANs.

      2) There are several ways virtual machines in NSX can talk to a physical host on a VLAN, not all of which require configuring that VLAN to every vSphere host. As far as provisioning, physical VLANs for physical hosts are usually static and infrequently created. It’s the virtual subnets that are more dynamic and ephemeral. So in the interest of provisioning speed, you definitely want the virtual subnets to be created efficiently, which is what you have in software with the NSX routers. A centralized VxLAN router is one of the other methods using the NSX Edge Services Router virtual machine, as it can route from a VXLAN subnet to a physical VLAN at 10+ Gbps. A physical VxLAN router would be possible too, but the trick is provisioning the ephemeral virtual subnets in sync with NSX, which is something that we might see someday using the same techniques as the physical L2 top of rack VXLAN gateway switches that were demoed at VMworld 2013.

      3) Not sure what you mean by “policy thingy”. Can you elaborate?

      4) Think of the DLR like a router chassis (Cisco 7600, etc.), where the kernel modules on the host are the line cards forwarding packets, and the Control VM is the supervisor engine establishing routing protocol sessions with other routers and programming routing updates to the line cards.

      Cheers,
      Brad

  5. Bhargav says:

    Hi Brad,

    Thanks for the clarification.

    #1) Got it, and nicely done too. This mechanism seems similar to a standard router where the system requests the ARP manager to resolve ARP for a directly connected host.

    #2) Agree on the static part of physical VLANs. One could throw multiple ESRs running at 10+ Gbps for physical storage.

    #3) The example you have considered is inter-subnet routing for the same tenant. How does it work across different tenants? Consider an example of tenant-A and tenant-B. B would probably see the public IP of A; will the DLR take care of routing between different tenants? What kind of policy would be applied at the DLR to take care of such scenarios?

    #4) Understood. So, the DLR VM would summarize the virtual routes of each tenant and advertise them to the ESR using BGP. So, does this BGP support VRFs? Does the DLR support BGP/OSPF over the VXLAN transport?

    Additionally,

    5) Is the DLR part of the DVS, or is it a separate module? How do the DVS & DLR work together?

    -Bhargav

    • Brad Hedlund says:

      Hi Bhargav,

      3) The DLR kernel module on the hypervisor host can have thousands of isolated instances of a DLR. So, if you want, tenant-A and tenant-B can each have their own DLR. You could think of the DLR as a replacement for what would otherwise be a VRF on a physical L3 switch/router. Also, if you want, each tenant can have their own Edge Services Router (after all, the ESR is just a VM). This would provide total IP addressing multi-tenancy between tenant-A and tenant-B, with the exception of the IP address on the external DMZ-facing interface of their ESR. If you just wanted security isolation between tenants, each tenant can share the same DLR, and you simply apply policy on the Distributed Firewall that says tenant-A can’t talk to tenant-B, as all packets pass through the distributed firewall before they even touch the first virtual switch port.

      4) See #3 above. The inherent multi-tenancy provided by parallel instances of DLR and ESR is providing the functionality of what you know as a VRF. If you want tenant-A’s logical network to be connected to a physical VRF for tenant-A in the WAN/Campus, that can be done too. Just place the external interface of tenant-A’s ESR on a physical VLAN that maps to tenant-A’s physical VRF.

      5) The DLR is provided as an add on kernel module automatically installed on a cluster of hosts that you decide to run NSX on.

      Cheers,
      Brad

      • Bhargav says:

        Hi Brad,

        3) Interesting. There are 2 logical views here.
        Option-1) A DLR for each customer with its own ESR. Something like its own router chassis for each customer.
        Option-2) With the distributed firewall, a single DLR is shared by every customer.

        With Option-1, inter-customer (inter-VRF) packets would traverse through the ESRs’ DMZ, while with Option-2 it will be done locally.

        With Option-1, the DLR kernel (& underlying network transporting VxLAN packets) need not learn about the public IPs of other customers, while with Option-2 the DLR kernels may have to learn about them.

        -Bhargav

  6. nEIL bARNETT says:

    Brad, how would you consume DLR features from a vCloud deployment perspective (OrgVDC / ProVDC)? From my understanding this isn’t possible within the current / future release.

  7. nEIL bARNETT says:

    Brad, working on an NSX for vSphere design. Trying to determine engineering tradeoffs (multicast vs. unicast vs. hybrid). The customer is possibly going unicast, but I’m worried that the unicast design may limit future VXLAN scalability and/or cross data center / cross VDC transport movement. (Am I off track?)

  8. AndreyO says:

    Hello Brad, many thanks for your explanations of NSX’s internals.
    But there are a couple of nuances that need clarification:
    1) You write “In the case where the DLR is running routing protocols with an upstream router, the DLR will consume two IP addresses on that subnet. One for the LIF in the DLR kernel module in each vSphere host, and one for the DLR control VM. The IP address on the DLR control VM is not a LIF, it’s not present in the DLR kernel modules of the vSphere hosts, it only exists on the control VM and will be used for establishing routing protocol sessions with other routers…”
    This sounds slightly confusing: does NSX need an IP address for each routed subnet on each NSX host or not? If it does, there must be more IP addresses (specifically, #hosts + #DLR_control-VMs).
    2) When a VXLAN-attached VM located on a non-DI host sends traffic to a physical host’s VLAN, the traffic traverses the DLR, correct? Is this traffic also encapsulated in VXLAN? If so, must the ingress traffic for the physical host be decapsulated on a VTEP? How is that done?
    Thank you

    • Brad Hedlund says:

      Hi Andrey,

      1) I probably could have worded those two sentences a bit better. Here’s a simple example that might provide some clarity. If you have a DLR with 10 interfaces (9 subnets for virtual machines, 1 subnet for uplink) you would only need 11 IP addresses for this DLR, regardless of how many hosts you have. First, each of the 9 interfaces for the virtual machine subnets would only need 1 IP address, and that same IP/MAC address is simultaneously present on all hosts — the default gateway for each VM is local, no matter which host it’s on. Second, on the uplink subnet the DLR will need two IP addresses, one for the address that upstream routers will use as the next-hop, the other for the IP address that the DLR control VM will use to speak BGP/OSPF to routers on that uplink subnet.
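
      A quick back-of-the-envelope check of that count (note the host count never appears):

      ```python
      # 10 DLR interfaces -> 11 IP addresses, independent of how many hosts run the DLR
      vm_facing_lifs = 9     # one IP each, the same IP present on every host
      uplink_forwarding = 1  # the uplink LIF IP, used as the routing next-hop
      uplink_protocol = 1    # the control VM's address for the OSPF/BGP sessions
      print(vm_facing_lifs + uplink_forwarding + uplink_protocol)  # 11
      ```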

      2) The DLR can directly route between both normal VLAN based subnets as well as VXLAN based subnets. If a VM wants to speak with a physical host on a VLAN, the VM will send a packet to its default gateway (the DLR) which is directly in the host’s kernel, and thus no VXLAN encapsulation happens between the VM and the DLR. At this point the DLR routes the packet directly on to the VLAN, because it has a logical interface on that VLAN, so again, no VXLAN encapsulation happens between the DLR and the physical host. As you can see, traffic from the VM to the physical host takes the most direct path and there is no VXLAN encapsulation. The return traffic, from physical host to VM, will enter through the host elected as the DI for that VLAN. This may or may not be the host where the VM resides. The DI host will receive the packet and route it to the VM subnet. If the DI host is where the VM resides, the packet will be delivered directly to the VM with no VXLAN encapsulation. If the VM is on a different host, the DI host (after routing) will VXLAN encapsulate the packet and deliver it to the host where the VM resides.

      Cheers,
      Brad
