Cisco UCS Fabric Failover: Slam Dunk? or So What?

Fabric Failover is a unique capability found only in Cisco UCS that allows a server adapter to have a highly available connection to two redundant network switches without any NIC teaming drivers or any NIC failover configuration required in the OS, hypervisor, or virtual machine.  In this article we will take a brief look at some common use cases for Fabric Failover and the UCS Manager software versions that support each implementation.

With Fabric Failover, the intelligent network provides the server adapter with a virtual cable that (because its virtual) can be quickly and automatically moved from one upstream switch to another.  Furthermore, the upstream switch that this virtual cable connects to doesn’t have to be the first switch inside a blade chassis, rather it can be extended through a “fabric extender” and connected to a common system wide upstream switch (in UCS this would be the fabric interconnect). Shown in the diagram below.

Cisco UCS Fabric Failover

When the intelligent network moves a virtual cable from one physical switch to another, the virtual cable is moved to an identically configured virtual Ethernet port with the exact same numerical interface identifier.  Additionally,  we should also consider other state information that should move along with the virtual cable, such as the adapters network MAC address.  In Cisco UCS, the adapters network MAC address is implicitly known (meaning, it does not need to be learned) because the adapter MAC address and network port it connects to were both provisioned by UCS Manager.  When Fabric Failover is enabled, the implicit MAC address of the adapter is synchronized with the second fabric switch in preparation for a failure.  This capability has been present since UCS Manager version 1.0.  Below is an example of a Windows or Linux OS loaded on the bare metal server with Fabric Failover enabled.  The OS has a simple redundant connection to the network with a single adapter and no requirement for a NIC Teaming configuration.

Bare Metal OS with single adapter and Fabric Failover

In addition to synchronization of the implicit MAC, there may be other information we may need to migrate depending on our implementation.  For example, if our server is running a hypervisor it may be hosting many virtual machines behind a software switch, with each VM having its own MAC address, and each VM using the same server adapter and virtual cable for connectivity outside of the server.  When the virtual machines are connected to a software based hypervisor switch, the MAC addresses from the individual virtual machines are not implicitly known because the VM MAC was provisioned by vCenter, not UCS Manager.  As a result, each VM MAC will need to be learned by the virtual Ethernet port on the upstream switch (UCS Fabric Interconnect).  If a fabric failure should occur we would want these learned MAC addresses to be ready on the second fabric as well.  This capability of synchronizing the learned MAC address between fabrics is not currently available in UCS Manager version 1.3, rather this capability is planned in UCSM version 1.4.

The diagram below shows the use of Fabric Failover in conjunction with a hypervisor switch, running UCS Manager v1.3 or below.  As you can see, the implicit MAC of the NIC used by hypervisor is in sync, but the more important learned MACs of the virtual machines and Service Console is not.  Hence, in a failed fabric scenario the 6100-B would need to re-learn these MAC addresses, which could take an unpredictable amount of time, depending on how long it takes for these end points to send another packet (preferably a packet that travels to the upstream L3 switch for full bi-directional convergence).  Therefore, the design combination shown below is not supported in UCSM version 1.3 or below.

Fabric Failover with Hypervisor switch, UCSM v1.3 or below

With the release of UCS Manager v1.4, support will be added for synchronizing not only the implicit MACs, but also any learned MACs.  This will enable the flexibility to use Fabric Failover in any virtualization deployment scenario, such as when a hypervisor switch is present.  Should a fabric failure occur, all affected MACs will already be available on the second fabric, and the failure event will trigger the second UCS Fabric Interconnect to send Gratuitous ARPs upstream to aid in the upstream networks fast re-learning of the new location to reach the affected MACs.  The Fabric Failover implementation with a hypervisor switch and UCSM v1.4 is shown in the diagram below.

Fabric Failover with Hypervisor switch and UCS Manager v1.4

Note: With the hypervisor switch examples shown above, it would be certainly possible to provide a redundant connection to the hypervisor management console with a single adapter.  However, some hypervisors, most notably VMware ESX, will complain and bombard you with warnings you that you only have a single adapter provisioned to the management network and therefore no redundancy.  This is perfectly understandable because the hypervisor has no awareness of the Fabric Failover services provisioned to it.  After all, that is the goal of Fabric Failover, to be transparent.  Furthermore, if you wish to use hypervisor switch NIC load sharing mechanisms such as VMware’s IP Hash or Load Based Teaming, or Nexus 1000V’s MAC Pinning or vPC Host-Mode, these technologies fundamentally require multiple adapters provisioned to the Port Group or Port Profile to operate and inherently provide redundancy themselves.  In a nutshell, Fabric Failover with a hypervisor switch is certainly an intriguing design choice, but it may not be a slam dunk in every situation.

There is one very intriguing virtualization based design where Fabric Failover IS a slam dunk every time, that being Hypervisor Passthrough or Hypervisor Bypass designs capable in Cisco UCS (commonly known as HW VN-Link).  When Cisco UCS is configured for HW VN-link it presents itself to VMware vCenter as a distributed virtual switch (DVS), much like the Nexus 1000V does.  Furthermore, Cisco UCS will dynamically provision virtual adapters in the Cisco VIC for each VM as they are powered on or moved to a host server.  These virtual adapters are basically a hardware implementation of a DVS port.  One method of getting the VM connected to its hardware DVS port is by passing its packets through the hypervisor with a software implementation of a pass through module (rather than a switch).  The software pass through module requires fewer CPU cycles than a software switch, yet the hypervisor still has visibility to network I/O and therefore vMotion works as expected.  This unparalleled integration with VMware has been in UCS Manager since version 1.1.

So where does Fabric Failover have a slam dunk role?  Simply put, Fabric Failover provides a redundant network connection to the VM without the need to provision a second NIC in the VM settings.  Most VM’s are provisioned with 1 NIC simply because the hypervisor switch provided the redundancy.  With the hypervisor switch gone, Fabric Failover provides the redundancy to the VM without requiring any changes to the VM configuration.  This makes it very easy to consider UCS HW VN-Link as a design choice, because minimal changes afford an easy migration.

In a design with hypervisor pass-through (HW VN-Link), every MAC address is implicitly known, including those of the virtual machines.  This is the result of the programmatic API connection between Cisco UCS and vCenter.  Every time a new VM is provisioned in vCenter, and powered on in the DVS, Cisco UCS knows about it, including the MAC address provisioned to the VM.  Therefore, the MAC addresses are implicitly known through a programming interface and assigned to a unique virtual Ethernet port for every VM.  Because UCS Fabric Failover has worked with implicit MACs since version 1.0, there are no caveats as discussed above in the hypervisor switch designs.

The diagram below depicts Fabric Failover in conjunction with hypervisor pass-through.

Fabric Failover with Hypervisor Pass through

Below is a variation of HW VN-Link that uses a complete hypervisor bypass approach, rather than pass through shown above.  This approach requires no CPU cycles to get the VM’s I/O to the network because the VM’s network driver writes directly to the physical adapter.  Given that network I/O is completely bypassing all hypervisor layers there is an immediate benefit in throughput, lower latency, and fewer (if any) CPU cycles required for networking.  The challenge with this approach is getting vMotion to work.  Ed Bugnion (CTO of Cisco’s SAVBU) presented a solution to this challenge at VMworld 2010 in which he showed temporarily moving the VM to pass-through mode for the duration of the vMotion, then back to bypass at completion.  But I digress.

Again, Fabric Failover has a slam dunk role to play here as well.  Rather than retooling every VM with a second NIC (just for the sake of redundancy), Fabric Failover will take care it :-)

Fabric Failover with Hypervisor Bypass

Fabric Failover is so unique to Cisco UCS that in order for HP, IBM, or Dell to implement the same capability would require a fundamental overhaul of their architecture into a more combined network + compute integrated system like that of UCS.

Without Fabric Failover, how are the other compute vendors going to offer a hypervisor pass-through or bypass design choice without asking you to retool all of your VM’s with a second NIC?

Even if you did add a second NIC to every VM, how are you going to insure consistency?  How will you know for sure that each VM has a second NIC for redundancy, and the second NIC is properly associated to a redundant path?

How long will it be before the others (HP, IBM, Dell) have an adapter that can provision enough virtual adapters to assign one to each VM?

When that adapter is finally available, will the virtual adapters be dynamically provisioned via VMware vCenter? Or will you need to statically configure each adapter?  How will vMotion work with statically configured adapters?

Of course none of this is a concern with Cisco UCS, because all of this technology is already there, built in, ready to use. :-)  Whether or not you use hypervisor bypass in your design is certainly your choice — but at least with Cisco UCS you had that choice to make!

What do you think?  Is Fabric Failover a “Slam Dunk”? Or a “So What”?

VMware 10GE QoS Design Deep Dive with Cisco UCS, Nexus

Last month I wrote a brief article discussing the intelligent QoS capabilities of Cisco UCS in a VMware 10GE scenario, accompanied by some flash animations for visual learners like me.  In that article I (intentionally) didn’t get too specific with design details or technical nuance.  Rather, I wanted to set the stage first with a basic presentation of what Quality of Service (QoS) is, how its applicable to data center virtualization,  and how QoS compares to other less intelligent approaches to bandwidth management.

In this article we build upon the previous and take a closer look, dive a little deeper, and discuss some of the designs that could be implemented with Cisco UCS or Nexus that can provide intelligent QoS in VMware deployments with 10GE networking.  We will look at many different scenarios including designs with Cisco UCS using fancy adapters (Cisco VIC), designs with Cisco UCS using more standard adapters (e.g. Emulex or Qlogic), and finally we will also explore designs without Cisco UCS where you may have non-Cisco servers connected to a Cisco Nexus switch infrastructure.  The commonality and key enabler in all of the designs discussed here is the intelligent Cisco network.

With the release of vSphere 4.1, one of the major improvements VMware is proud to discuss is in the area of network I/O performance.  For example, vMotion has been improved to allow 8 concurrent vMotion flows per Host, up from 2.  Furthermore, one individual vMotion flow can reach 8Gbps I/O performance.  If you add another vMotion flow you can quickly saturate a 10GE adapter. See this post by Aaron Delp from ePlus who wrote about the potential for network I/O saturation and the need for intelligent QoS, based on the testing of his colleague Don Mann.  Keep an eye on Aaron’s blog for more content this subject.

Saturating a 10GE adapter with network bandwidth is not the problem.  In fact, that’s a good thing.  This means you are making the most of the infrastructure resources and driving higher levels of efficiency.  Similar to how higher server CPU utilization is generally considered a good thing, the same goes for network resources.  The problem that can arise with saturating a 10GE network adapter is the performance impact on the many different applications using the server & adapter when there is no policy based enforcement of fair bandwidth sharing.

If you just “let it rip” with no governance at all, the biggest bully is going to take away all of the bandwidth from the little guy, or as much as it can.  Vmotion traffic is a great example of one of those bandwidth bullies.  Its a brute force transfer of lots and lots of large data packets from one Host to another, coming at a fast and frequent rate.  The more infrequent and smaller packet size transactional traffic from the Web server taking orders for your online business can easily get stomped and kicked to the bandwidth curb by the big vMotion bullies … now I think we can agree that’s NOT a good thing.

VMware showed that they understand this challenge by introducing a capability in vSphere 4.1 called Network I/O Control (NetIOC), which provides a means for the ESX hypervisor to enforce a fair sharing of bandwidth for network traffic leaving the hypervisor. (Notice I did not say: “leaving the server” or “leaving the adapter”)  NetIOC uses the familiar “Shares” concept for scheduling bandwidth similar to how VMware has always scheduled sharing of CPU and Memory resources.  The concept of bandwidth “Shares” is also very similar to how intelligent QoS operates in a Cisco network.  My esteemed Cisco colleague M. Sean McGee recently wrote about these similarities in an article titled: Great Minds Think Alike – Cisco and VMware Agree on Sharing vs. Limiting

You might be thinking: “Great, problem solved! I’ll use NetIOC and I’m done.”  Well, not so fast. (More on that later).  A comprehensive solution to bandwidth management is one that is able to coordinate “Shares” of network bandwidth at both the server adapter AND network switch infrastructure, not just one or the other.

Lets go ahead and take a look at some specific examples of 10GE VMware designs that provide the aforementioned coordinated server & network intelligent QoS (bandwidth “Shares”).  First, we will start with some examples of Cisco UCS using the Cisco VIC (virtual interface card) as the server adpater.  To begin, lets start with a brief overview of the Cisco VIC depicted below.

The Cisco VIC, also known as “Palo”

One of the areas that sets the Cisco VIC apart from other adapters is in its powerful interface virtualization capabilities.  With a single physical Cisco VIC adapter you can effectively fool the server into believing it has many adapters installed (currently as many as 58), all being presented from one physical (dual port) 10GE adapter.  You will see how we can use this powerful capability as a flexible design tool for intelligent QoS.  You can see from the diagram above we have provisioned (10) virtual adpaters; (2) FC HBA’s; and (8) Ethernet NIC’s.  Each virtual Ethernet NIC is found as a “vmnic”  to the VMware host, and from there we can map different types of traffic to different virtual adapters.

Another area that sets the Cisco VIC apart from other adapters is in its powerful QoS capabilities.  The Cisco VIC can provide a fair sharing of bandwidth for all of its virtual adapters, and the policy engine that defines how bandwidth is shared on the Cisco VIC is conveniently centrally defined by the overarching system-wide UCS Manager.  Below is a diagram that shows how the Cisco VIC schedules bandwidth for its virtual adapters.

The Cisco VIC “Palo” QoS capabilities

In the example above we again have our (8) virtual Ethernet adapters and (2) virtual Fibre Channel adapters the Cisco VIC is presenting to the VMware host as “real” adapters.  The Cisco VIC itself has (8) traffic queues, each based on a COS value (the purple boxes numbered 0 thru 7).  COS stands for “Class of Service”, and when used is typically marked on the Ethernet packets as they traverse the network for special treatment at each hop.  In this example we have defined each virtual adapter to be associated to a specific COS value (this was centrally configured at the UCS Manager).  Therefore, any traffic received from the server by a virtual adapter will be assigned to and marked with a specific COS value and placed into its associated queue for transmission scheduling (only if there is congestion, otherwise mark and immediately send).  Another policy that was centrally defined in UCS Manager was setting the minimum percentage of bandwidth that each COS queue will always be entitled to (if there is congestion).  For example, vNIC4 and vNIC8 are assigned to COS 5, of which the Cisco VIC will always provide 40% of the 10GE link during periods of congestion.  Similarly, each vHBA has 40% guaranteed bandwidth, vNIC1 and VNIC2 are each sharing a 10% bandwidth guarantee, and vNIC3 has it’s own 10%.

With intelligent QoS, the sum of the minimum bandwidths add up to 100%.  This is a big difference from other less intelligent approaches that use “Rate-Limiting” or sometimes also called “Shaping” where the sum of maximum bandwidths must not exceed 100% (example: HP Flex-10 / FlexFabric).  Sorry, couldn’t resist :-)

If two or more virtual adapters are assigned to the same COS value (such as vNIC1 and vNIC2), the Cisco VIC will provide Round Robin placement of packets from each virtual adapter into the COS queue, to be serviced by the Bandwidth Scheduler.  You can also define maximum transmit bandwidth limits for any virtual adapter, such as vNIC1 being limited to 1GE transmit. If you decide to do that, the Cisco VIC will enforce the rate limit prior to any necessary Round Robin servicing.

Side note: There are two different QoS modes available for each virtual adapter — 1) Host control NONE, or 2) Host control FULL.  The NONE mode is stating that COS markings from the server will be ignored and overwritten with the virtual adpater’s COS setting in UCS Manager.  The FULL mode is stating that COS markings from the server will be preserved and used to determine which COS queue each packet is serviced from on the Cisco VIC.  The graphic above is a depiction of the NONE mode, where each virtual adapter has a solid black line to a specific COS queue.  The FULL mode would have dotted lines from each virtual adapter to each COS queue.

COS 7 is a reserved strict priority queue for UCS Manager to manage the Cisco VIC settings and policies.  You can, for example, change bandwidth settings on the fly for a virtual adapter while the server is running without needing to reboot the server!

All FC virtual adapters have “No Drop” service and COS 3 and 50% minimum bandwidth defined by default.  You can change the COS value and minimum bandwidth if you like, but you cannot change the “No Drop” setting (why would you do that anyway?).  The “No Drop” policy provides a lane for lossless Ethernet normally used by FCoE traffic.

The key takeaway here is that the Cisco VIC is capable of enforcing a centrally managed bandwidth policy for up to 8 different traffic classes enforced on the adapter for any traffic transmitted by the server.  The bandwidth policy set in UCS Manager defines the minimum bandwidth available to each Class of Service during periods of congestion, and it can also set maximum bandwidths on a per virtual adapter basis.  Furthermore, the Cisco VIC bandwidth management is able to account for ALL traffic using the adapter, including FCoE, something VMware Network I/O Control is lacking. (More on that in a future article).

As if that wasn’t enough, the very same policy in UCS Manager that defines how the Cisco VIC manages bandwidth is the very same policy that defines how the network fabric will manage bandwidth when transmitting to the servers!  This provides an unmatched level of server + network integrated bandwidth management from a single policy engine, UCS Manager.

OK, now that we have established the comprehensive Server + Network bandwidth management of Cisco UCS, and the Cisco VIC (Palo) unique capabilities of presenting the server many different virtual adapters, as well as providing each virtual adapter a guaranteed minimum bandwidth, lets see how we can use this powerful combination in (4) Cisco UCS + VMware 10GE design examples with intelligent QoS.

Before we get started with the designs, I am going to first categorize possible traffic types on a typical VMware host into two main categories: Required & Optional.

Required traffic: VM Data (guest virtual machines), Management (service console / vmkernel), (1) Means of central access to storage (FCoE, NFS, iSCSI, etc.), and finally: vMotion (DRS)  … I know, I know, some of you will say “vMotion isn’t required!”.  For the context of this discussion, it is. Deal with it :-)

Optional traffic: (1) or more additional means of central storage access (e.g. NFS datastores), Fault Tolerance, Real Time traffic (voice/video, market data).

The first design example (Design #1) is going to account for only the Required traffic.  We are using Cisco UCS (either rackmount or blade servers, or both) with the Cisco VIC adapter installed.  On the vSphere host we are using the standard vSwitch or the vDistributed Switch (vDS).

Design #1 — Cisco UCS + VIC QoS Design, Required traffic only

Design #1 above starts out with just the Required traffic profile.  We have Management, VM Data, vMotion, and block based central storage access via FCoE.  We are using Cisco UCS rack mount or blade servers with the Cisco VIC installed and presenting the VMware host with (8) virtual adapters; (2) FC HBA’s; and (6) 10GE Ethernet NICs.  Furthermore, the VMware hypervisor is running either the standard vSwitch or vDS.  There are (3) Port Groups defined; Management, VM Data, and vMotion. Each Port Group has its own pair of virtual adapters.

It’s important to observe that each virtual adapter has its own virtual Ethernet port on the UCS Fabric Interconnect, there is a 1:1 correlation between virtual adapter and virtual network interface.  The cable connecting the virtual adapter to its virtual ethernet interface is a virtual cable in the form of a VN-Tag.  This 1:1 correlation is import because the virtual network interface on the Fabric Interconnect will inherit the exact same minimum bandwidth policy as the virtual adapter it is assigned to.

Side note: Because the vSwitch or vDS is not capable of intelligently classifying and marking traffic with COS settings, we are using the mode “Host Control NONE” on these virtual adapters and allowing the Cisco VIC to apply the markings depending on which virtual adapter services the traffic.

Each virtual adapter in a pair is associated to a different fabric for the obvious reasons of redundancy, and the Port Group using the adapter pair is configured for Active/Standby teaming.  As shown in the diagram we can also alternate whether a Port Group is actively using Fabric A or Fabric B.  This is an important design tool we will use for deterministic traffic placement in a shared bandwidth environment.

Each virtual adapter pair is assigned to a different Class of Service (COS) within UCS Manager, of which there are Six to choose from:  Fibre Channel, Best Effort, Bronze, Silver, Gold, Platinum.  Each service level can be assigned to a different COS value (0-6), and more importantly you assign a bandwidth weight to each service level.

UCS Manager QoS System Class settings for Design #1

In Design example #1 above; the adpaters used by the MGMT Port Group have been assigned a minimum bandwidth of 10% (and transmit rate limited to 1GE), the virtual adapters for VM Data have been assigned a minimum of 40% (no max), the vMotion virtual adapters have been assigned a 10% minimum bandwith (no max).  The Fibre Channel class was adjusted from its default of 50% to 40% (no max).  The minimums add up to 100% (per Fabric).  The upstream network fabric is also applying the same bandwidth policy when transmitting traffic to the servers.  I dont need to be concerned with vMotion traffic received from other servers saturating the link and affecting other important traffic, because the upstream network (Fabric Interconnect) will insure other traffic has fair access to bandwidth based on the same minimum bandwidth policy mentioned above.

Assign QoS policy to vNIC Template

For the traffic transmitted from Host to the Network, or Network to Host, vMotion bullies will be quarantined to their 10% of link bandwidth IF there is saturation & congestion.  If no congestion exists, vMotion has full access to all 10GE of bandwidth.  Furthermore, it is import to observe that vMotion and VM Data traffic have been separated onto different fabrics.  Hence under normal operations, when both fabrics are operational, vMotion and VM Data traffic will not be sharing any bandwidth, rather they will each have access to 10GE simultaneously on their respective fabrics.  Only under a rare failed fabric condition would VM Data and vMotion traffic share link bandwidth, and only during periods of congestion when one fabric has failed would the intelligent QoS policy need to enforce bandwidth shares between vMotion and VM Data.

Design #2 — Cisco UCS + VIC QoS Design, Required + 1 Optional

Design example #2 above is very similar to Design #1, only now we have added a Port Group for NFS traffic.  You may wish to have your VMware hosts boot from an ESX image on the SAN via Fibre Channel, however once the host is fully booted you many want to use NFS data stores for the VM’s to access their virtual disks.  To accommodate the NFS traffic we have provisioned two more virtual adapters at the Cisco VIC, for a total of (8) virtual Ethernet adapters.  The NFS Port Group is assigned to its own pair of virtual adapters and using Active/Standby teaming.

The Active/Standby teaming configuration of each Port Group is set up such that VM Data will have a fabric to itself while vMotion, NFS, and MGMT share the other fabric.

Because our bandwidth minimums cannot exceed 100% (per fabric), we will make a slight modification from Design #1, where the VM Data Port Group now has 30% minimum bandwidth (rather than 40%).  In return the NFS Port Group has been given 10% minimum bandwidth via the “Gold” class of service COS 4. Everything else stays the same.  Below is the UCS Manager QoS System Class settings used to set up Design #2.  Take special note of the MTU settings for both the Silver and Gold classes.

UCS Manager QoS System Class settings for Design #2

Once you have defined the QoS System Class settings in UCS Manager (as shown above) you then create a “QoS Policy” and give it a name.  You can see in the lower left of the above screen shot that I have created a number of these QoS policies already.  One of the policies was named “IP_Storage” which is the one we will assign to the virtual adapters used for NFS traffic.  The QoS Policy defines a few things; 1) the name of the policy; 2) the class of service this policy is associated to; 3) any rate limits you might want to enforce; 4) whether or not this policy will overwrite or preserve COS markings from the server (remember: Host control FULL or NONE).  Below is a screen shot of the QoS policy “IP_Storage” created for NFS traffic.

QoS Policy for NFS Traffic

After you have created the QoS Policy, you then assign it to the vNIC.  You can manually assign the policy to each vNIC of your choice, or better yet, you can do it once with a vNIC Template.  The vNIC Template can be used every time you create a new service profile or service profile template.  vNIC Templates can also be defined as “Updating Templates”, which means if you decide to make a change to the template (such as VLAN or QoS settings) all servers and adapters created from that template will be updated to match the template.

See the Design #1 section above for a screen shot showing a QoS Policy being assigned to a vNIC Template.

Note: You do not need to create a QoS Policy for the Fibre Channel (FCoE) traffic and assign it to the vHBAs.  That is already there by default.

Design #3 — Cisco UCS + VIC QoS, Required + 2 Optional + Realtime

For Design example #3 lets work with a much more difficult scenario where we have the Required traffic, (2) Optional traffic classes, and just for the fun of it (because I like to torture myself) lets throw in Realtime traffic too!  Realtime traffic could be a multicast application serving a live video stream, a voice application such as Cisco Unified Communications (caveat*), or perhaps a multicast feed of market data.  Whatever it is, lets assume that its important traffic, but as is typical with realtime traffic the payload is not large and bulky, rather a consistent flow of small to mid size traffic where lost packets are noticed immediately by the data consumer.

Cisco UCS + VIC QoS Design #3

In Design #3 above we have built upon Design #2 by adding Port Groups and virutal adapter pairs for Fault Tolerance and Realtime traffic.  We now have a total of (12) virtual adapters; (10) 10GE Ethernet NICs; and (2) FC HBAs.

Realtime traffic from a voice application, for example, usually expects to have special treatment in the network.  It’s customary for network infrastructure engineers to configure special treatment for packets marked with COS 5 in the presence of voice applications.  Hence, the network upstream from this Cisco UCS fabric may very well be configured that way, so it makes sense for us to assign that traffic to the “Platinum” COS 5 service level, such that when the traffic leaves Cisco UCS it will be properly marked with COS 5.

Cisco UCS, by default, provides losses Ethernet service for FCoE traffic.  In addition to that, there is one more lane of losses Ethernet service available for us to configure and assign to any class of service.  In Design #3 we will assign the “Platinum” service to losses Ethernet, thereby preventing any packet loss within the UCS fabric.  If we know that this application will be multicast based (such as live streaming video) we can also assign the “Platinum” class with optimized multicast treatment.  See below.

UCS Manager QoS System Class settings for Design #3

The VMware Fault Tolerance traffic has been given its own pair of virtual adapters, assigned to the “Gold” class of service with a minimum bandwidth weighting of 20%.  Again we are using Active/Standby adapter teaming to precisely control which traffic types are shared with others during normal operations when both fabrics are operational.  Notice that Fault Tolerance and vMotion traffic are together on Fabric B, separated from VM Data traffic on Fabric A.

The Fault Tolerance traffic has been a higher 20% weighting over other traffic (excluding FCoE).  This setting is inline with guidance provided by VMware’s Network I/O Control Best Practices document which describes the importance of Fault Tolerance traffic to the performance of VM’s, and hence they recommend giving it a higher weighting.

Another interesting observation to make is that vMotion and MGMT are sharing the same pair of virtual adapters.  Remember, I can currently configure Six different traffic classes in Cisco UCS, yet I have 7 classes of traffic to work with in Design #3.  So, that leaves me no choice but to assign two different traffic types to the same class of service.  I just happened to choose vMotion and MGMT, you may choose something different in your designs, there is no black and white – right or wrong answer.  Now, just because two different traffic types are sharing the same class of service doesn’t mean they need to share the same virtual adapters.  I could have easily provisioned another pair of virtual adapters just for MGMT.

Design #4 — QoS Design with (2) 10GE adapters

Up to this point we have been looking at designs strictly based on Cisco UCS with the fancy Cisco VIC.  Now we will look at another design approach that could be used with Cisco UCS and Cisco VIC, or it could also be implemented with Cisco UCS without the Cisco VIC, or it could be non-Cisco servers connected to a Cisco Nexus 5000 network.  Furthermore, up to this point I’m sure the Nexus 1000V product management team has been a little upset that I haven’t mentioned the Nexus 1000V in a single design yet (heh).  No offense guys, but one of the main points I wanted to show here is the inherent QoS capabilities built in to the Cisco UCS architecture, with or without Nexus 1000V.  Having said that, lets dive into the Nexus 1000V with Design #4 to show its flexible deployment and value in many types of scenarios, Cisco UCS or not. 😉

QoS Design #4 with (2) 10GE adapters

In Design #4 we are using the same traffic profile of Required + (2) Optional + Realtime that was used in Design #3.  Furthermore, if this is a Cisco UCS deployment, we can use the same UCS Manager QoS System Class settings from Design #3.  The main difference here is the fact that our server (either Cisco or non-Cisco) has been provisioned with (2) 10GE adapters visible to the VMware host.  Remember from previous designs that we used the Cisco VIC virtual adapter capabilities to mark and schedule traffic.  The first step of classifying traffic was done via manual steering of Port Groups to a specific set of virtual adapters.

With the Nexus 1000v you can both classify and mark traffic directly from the hypervisor switch (before it reaches the physical adapter), and with Nexus 1000V version 1.4 you can also provide weighted minimum bandwidth scheduling (Shares) as the packets egress the hypervisor switch,  much like VMware Network I/O Control (NetIOC).

With Nexus 1000V we no longer need to provision adapters (virtual or physical) for the sake of QoS classification, as we needed to do in the previous designs with the standard vSwitch or vDS.  This allows for a more streamlined configuration for the server administrator who can now provision a VMware Host with (2) adapters, rather than (8) or (12).  Furthermore, you have the flexibility to use this approach if you have the Cisco VIC, or with any other standard CNA or 10GE adapter.

N1K adapter QoS Policy

For deployments with Cisco UCS, even though we are not leveraging the adapter virtualization capabilities of the Cisco VIC to a great degree, the Cisco VIC still provides an important role in its QoS capabilities.  The class of service based bandwidth polices defined in UCS Manager are still enforced right on the adapter.  The Nexus 1000V classifies and marks the traffic, and the Cisco VIC will place the packets into one if its (8) COS based queues for scheduled transmission.  Remember there are two modes we can define in the Cisco VIC: Host conrol FULL, or Host control NONE.  If Design #4 was leveraging the Cisco VIC we would use the “Host control FULL” setting that will tell the Cisco VIC to utilize and preserve the COS marking made by the Nexus 1000V. See the example to the right.

Enforcing a QoS policy on the physical CNA (Cisco VIC) certainly has its advantages, however QoS can still be accomplished (albeit somewhat less holistically) if you have a standard CNA or 10GE adapter that provides no special QoS capabilities.  Lets assume you have a CNA adapter (I wont mention names) that does basic first-in-first-out FIFO queuing on the adapter for standard Ethernet traffic.  With the Nexus 1000V classifying and marking traffic you can still provide QoS enforcement at the network switch port (Nexus 5000) or UCS Fabric Interconnect.  In this scenario we are relying on the fact that the network switch will drop some traffic that has exceeded the policy during periods of congestion.  When those packets are dropped, the TCP stack of the server or VM will react to that and throttle back its sending rate.  The TCP send rate will gradually increase until a packet is lost again, and the cycle repeats.  This results in the TCP based application finding its bandwidth ceiling provided by the QoS policy enforced at the network switch.

The TCP based bandwidth throttling of course only works with TCP flows.  If you have a heavily used UDP based application, your switch will still enforce the QoS policy but your sending machine may not react to it.  The result is the transmit utilization on the link between your server and switch becoming out of policy and possibly affecting other well behaved TCP applications on that server.  This would not happen with an adapter like the Cisco VIC as it can throttle any flows during congestion, TCP or not, right on the adapter before transmitting on the wire.

Hypervisor based QoS

Rather than relying on TCP based throttling, another approach you can use with non-QoS standard adapters is the Nexus 1000V weighted fair queuing (WFQ) (version 1.4) or VMware Network I/O Control.  Remember these capabilities are based in the hypervisor switch and can provide Cisco VIC like QoS capabilities in software at the hypervisor level.

Some important caveats with hypervisor based QoS worth discussing:

1) With either VMware NetIOC or Nexus 1000V WFQ, there is no visibility into FCoE traffic utilization on a CNA adapter.  The hypervisor switch has no visibility (yet) into the underlying hardware of the CNA to determine FCoE bandwidth utilization at any given moment. This can result in a miscalculation of available bandwidth.  The hypervisor switch might believe the full 10GE of bandwidth is not being used and therefore determine no policy needs to be enforced, however once the traffic is delivered to the CNA for transmission it finds a congested link.  Therefore, if you are using FCoE you should proceed with caution.  These technologies will work better where storage access is on a separate FC adapter or where all Ethernet storage is IP based (iSCSI or NFS) and fully managed by, and visible to, the hypervisor switch.

2) Assuming you are not subject to caveat #1, its important to understand how well this technology will integrate with a broader network wide QoS policy.  For example, the Nexus 1000V can apply a bandwidth transmission policy while at the same time marking the traffic for consistent treatment in the upstream network.  In contrast VMware Network I/O Control does not mark any traffic.  As a result the upstream network must either independently re-classify the traffic or not do anything at all and just “let it rip”.  If you do the later “let it rip” approach you will have a situation in which there are no controls in how the network transmits traffic to a server.  As a result you can either overwhelm your server with receive traffic and defeat the purpose of implementing NetIOC to begin with. Or, you can implement receive based rate-limiters on every VMware host in fear of excessive traffic and again defeat the purpose of intelligent bandwidth sharing that NetIOC sets out to accomplish.

Bottom line: Hypervisor based QoS should be implemented in coordination with a broader network based QoS policy. In the absence of intelligent QoS capable adapters, hypervisor based QoS can help but its not the complete answer.

Design #5 — QoS Design with (2) 10GE adapters, no Cisco VIC, no Nexus 1000V

Here is the last design we will briefly discuss.  Lets imagine you don’t have Cisco UCS – the horror! :-) (just kidding) – Since this could be a standard Cisco C-Series rack mount server you might or might not have the Cisco VIC.  And, you don’t have the Nexus 1000V.  What can we do in terms of implementing QoS in this scenario?  Lets take a look.

Design #5 -- (2) 10GE adapters, no Cisco UCS, no Nexus 1000V

Without the Nexus 1000V or Cisco VIC virtual adapters we know that classifying and marking traffic within the hypervisor or at the adapter is not possible.  If you have the Cisco VIC installed lets assume for now that we don’t (at the moment) have the capability to provision virtual adapters given this implementation is Nexus 5000 and not UCS (not commenting on the future right now).  With a Cisco VIC installed, and without virtual adapters or the Nexus 1000v we wont be able to leverage the (8) COS based queues in an effective way.  So at this point it could be the Cisco VIC or any other standard CNA adapter (Emulex or Qlogic) and we have the same design challenges.

In this scenario we need to an intelligent network switch that can classify traffic as traffic enters.  Once classified the intelligent switch should be able to provide minimum bandwidth shares to each traffic class when transmitted to the destination.  That switch of course would be the Nexus 5000 😉  The burden of traffic classification and QoS has been entirely put upon the physical network switch.

Understanding that, how will the traffic from the server be managed or react to the QoS policy in the network?  Again we have the two choices of hypervisor based QoS or TCP based throttling.  If there was very little or no FCoE in use, one solid approach might be to use VMware Network I/O Control, and then try to replicate the NetIOC QoS policy as best you can in the intelligent network switch.

If you have FCoE implemented, as shown in the diagram, the solid approach would be to leave the server configuration as-is, do not enable NetIOC, and let the upstream network switch selectively drop packets during congestion based on its QoS policy.  The dropped TCP packets will have the result of TCP based applications finding their defined bandwidth ceiling during the periods of congestion.  If you have network intense UDP based applications you may want to experiment with VMware NetIOC, understanding the potential for miscalculations in the presence of FCoE.  VMware Network I/O Control will be able to throttle UDP applications based on its policy at the hypervisor level.

Call to Action

It’s important to understand that this article is not at all intended to say “This is how you should do it” to every exact technical detail.  My main goal here was to show you some examples of what is merely possible with the technology choices you have.  There are a lot of nerd knobs in these designs that could be turned either way.

Having said that, I’m just one man shouting into my little megaphone here.  I would really like to hear from some VCDX folks, either employed by VMware or not, to contribute to this discussion.  It could be in the comments section below or in your own forum.  I’m hoping we can come to agreement on some of the solutions presented here or variations thereof.

Presentation Download

You made it this far? It’s good to know I’m not the only one who’s crazy here.

I would like to thank you for your time and attention! Download a PDF of the original slides, here:

VMware 10GE Design Deep Dive with Cisco UCS, Nexus – PDF

Disclaimer: The author is an employee of Cisco Systems, Inc. The views and opinions expressed by the author do not necessarily represent those of Cisco Systems, Inc. The author is not an official media spokesperson for Cisco Systems, Inc. The designs described in this article are not Cisco official Validated Designs. Please consult your local Cisco representative for design assistance.