VMware 10GE QoS Design Deep Dive with Cisco UCS, Nexus

Last month I wrote a brief article discussing the intelligent QoS capabilities of Cisco UCS in a VMware 10GE scenario, accompanied by some flash animations for visual learners like me.  In that article I (intentionally) didn’t get too specific with design details or technical nuance.  Rather, I wanted to set the stage first with a basic presentation of what Quality of Service (QoS) is, how its applicable to data center virtualization,  and how QoS compares to other less intelligent approaches to bandwidth management.

In this article we build upon the previous and take a closer look, dive a little deeper, and discuss some of the designs that could be implemented with Cisco UCS or Nexus that can provide intelligent QoS in VMware deployments with 10GE networking.  We will look at many different scenarios including designs with Cisco UCS using fancy adapters (Cisco VIC), designs with Cisco UCS using more standard adapters (e.g. Emulex or Qlogic), and finally we will also explore designs without Cisco UCS where you may have non-Cisco servers connected to a Cisco Nexus switch infrastructure.  The commonality and key enabler in all of the designs discussed here is the intelligent Cisco network.

With the release of vSphere 4.1, one of the major improvements VMware is proud to discuss is in the area of network I/O performance.  For example, vMotion has been improved to allow 8 concurrent vMotion flows per Host, up from 2.  Furthermore, one individual vMotion flow can reach 8Gbps I/O performance.  If you add another vMotion flow you can quickly saturate a 10GE adapter. See this post by Aaron Delp from ePlus who wrote about the potential for network I/O saturation and the need for intelligent QoS, based on the testing of his colleague Don Mann.  Keep an eye on Aaron’s blog for more content this subject.

Saturating a 10GE adapter with network bandwidth is not the problem.  In fact, that’s a good thing.  This means you are making the most of the infrastructure resources and driving higher levels of efficiency.  Similar to how higher server CPU utilization is generally considered a good thing, the same goes for network resources.  The problem that can arise with saturating a 10GE network adapter is the performance impact on the many different applications using the server & adapter when there is no policy based enforcement of fair bandwidth sharing.

If you just “let it rip” with no governance at all, the biggest bully is going to take away all of the bandwidth from the little guy, or as much as it can.  Vmotion traffic is a great example of one of those bandwidth bullies.  Its a brute force transfer of lots and lots of large data packets from one Host to another, coming at a fast and frequent rate.  The more infrequent and smaller packet size transactional traffic from the Web server taking orders for your online business can easily get stomped and kicked to the bandwidth curb by the big vMotion bullies … now I think we can agree that’s NOT a good thing.

VMware showed that they understand this challenge by introducing a capability in vSphere 4.1 called Network I/O Control (NetIOC), which provides a means for the ESX hypervisor to enforce a fair sharing of bandwidth for network traffic leaving the hypervisor. (Notice I did not say: “leaving the server” or “leaving the adapter”)  NetIOC uses the familiar “Shares” concept for scheduling bandwidth similar to how VMware has always scheduled sharing of CPU and Memory resources.  The concept of bandwidth “Shares” is also very similar to how intelligent QoS operates in a Cisco network.  My esteemed Cisco colleague M. Sean McGee recently wrote about these similarities in an article titled: Great Minds Think Alike – Cisco and VMware Agree on Sharing vs. Limiting

You might be thinking: “Great, problem solved! I’ll use NetIOC and I’m done.”  Well, not so fast. (More on that later).  A comprehensive solution to bandwidth management is one that is able to coordinate “Shares” of network bandwidth at both the server adapter AND network switch infrastructure, not just one or the other.

Lets go ahead and take a look at some specific examples of 10GE VMware designs that provide the aforementioned coordinated server & network intelligent QoS (bandwidth “Shares”).  First, we will start with some examples of Cisco UCS using the Cisco VIC (virtual interface card) as the server adpater.  To begin, lets start with a brief overview of the Cisco VIC depicted below.

The Cisco VIC, also known as “Palo”

One of the areas that sets the Cisco VIC apart from other adapters is in its powerful interface virtualization capabilities.  With a single physical Cisco VIC adapter you can effectively fool the server into believing it has many adapters installed (currently as many as 58), all being presented from one physical (dual port) 10GE adapter.  You will see how we can use this powerful capability as a flexible design tool for intelligent QoS.  You can see from the diagram above we have provisioned (10) virtual adpaters; (2) FC HBA’s; and (8) Ethernet NIC’s.  Each virtual Ethernet NIC is found as a “vmnic”  to the VMware host, and from there we can map different types of traffic to different virtual adapters.

Another area that sets the Cisco VIC apart from other adapters is in its powerful QoS capabilities.  The Cisco VIC can provide a fair sharing of bandwidth for all of its virtual adapters, and the policy engine that defines how bandwidth is shared on the Cisco VIC is conveniently centrally defined by the overarching system-wide UCS Manager.  Below is a diagram that shows how the Cisco VIC schedules bandwidth for its virtual adapters.

The Cisco VIC “Palo” QoS capabilities

In the example above we again have our (8) virtual Ethernet adapters and (2) virtual Fibre Channel adapters the Cisco VIC is presenting to the VMware host as “real” adapters.  The Cisco VIC itself has (8) traffic queues, each based on a COS value (the purple boxes numbered 0 thru 7).  COS stands for “Class of Service”, and when used is typically marked on the Ethernet packets as they traverse the network for special treatment at each hop.  In this example we have defined each virtual adapter to be associated to a specific COS value (this was centrally configured at the UCS Manager).  Therefore, any traffic received from the server by a virtual adapter will be assigned to and marked with a specific COS value and placed into its associated queue for transmission scheduling (only if there is congestion, otherwise mark and immediately send).  Another policy that was centrally defined in UCS Manager was setting the minimum percentage of bandwidth that each COS queue will always be entitled to (if there is congestion).  For example, vNIC4 and vNIC8 are assigned to COS 5, of which the Cisco VIC will always provide 40% of the 10GE link during periods of congestion.  Similarly, each vHBA has 40% guaranteed bandwidth, vNIC1 and VNIC2 are each sharing a 10% bandwidth guarantee, and vNIC3 has it’s own 10%.

With intelligent QoS, the sum of the minimum bandwidths add up to 100%.  This is a big difference from other less intelligent approaches that use “Rate-Limiting” or sometimes also called “Shaping” where the sum of maximum bandwidths must not exceed 100% (example: HP Flex-10 / FlexFabric).  Sorry, couldn’t resist :-)

If two or more virtual adapters are assigned to the same COS value (such as vNIC1 and vNIC2), the Cisco VIC will provide Round Robin placement of packets from each virtual adapter into the COS queue, to be serviced by the Bandwidth Scheduler.  You can also define maximum transmit bandwidth limits for any virtual adapter, such as vNIC1 being limited to 1GE transmit. If you decide to do that, the Cisco VIC will enforce the rate limit prior to any necessary Round Robin servicing.

Side note: There are two different QoS modes available for each virtual adapter — 1) Host control NONE, or 2) Host control FULL.  The NONE mode is stating that COS markings from the server will be ignored and overwritten with the virtual adpater’s COS setting in UCS Manager.  The FULL mode is stating that COS markings from the server will be preserved and used to determine which COS queue each packet is serviced from on the Cisco VIC.  The graphic above is a depiction of the NONE mode, where each virtual adapter has a solid black line to a specific COS queue.  The FULL mode would have dotted lines from each virtual adapter to each COS queue.

COS 7 is a reserved strict priority queue for UCS Manager to manage the Cisco VIC settings and policies.  You can, for example, change bandwidth settings on the fly for a virtual adapter while the server is running without needing to reboot the server!

All FC virtual adapters have “No Drop” service and COS 3 and 50% minimum bandwidth defined by default.  You can change the COS value and minimum bandwidth if you like, but you cannot change the “No Drop” setting (why would you do that anyway?).  The “No Drop” policy provides a lane for lossless Ethernet normally used by FCoE traffic.

The key takeaway here is that the Cisco VIC is capable of enforcing a centrally managed bandwidth policy for up to 8 different traffic classes enforced on the adapter for any traffic transmitted by the server.  The bandwidth policy set in UCS Manager defines the minimum bandwidth available to each Class of Service during periods of congestion, and it can also set maximum bandwidths on a per virtual adapter basis.  Furthermore, the Cisco VIC bandwidth management is able to account for ALL traffic using the adapter, including FCoE, something VMware Network I/O Control is lacking. (More on that in a future article).

As if that wasn’t enough, the very same policy in UCS Manager that defines how the Cisco VIC manages bandwidth is the very same policy that defines how the network fabric will manage bandwidth when transmitting to the servers!  This provides an unmatched level of server + network integrated bandwidth management from a single policy engine, UCS Manager.

OK, now that we have established the comprehensive Server + Network bandwidth management of Cisco UCS, and the Cisco VIC (Palo) unique capabilities of presenting the server many different virtual adapters, as well as providing each virtual adapter a guaranteed minimum bandwidth, lets see how we can use this powerful combination in (4) Cisco UCS + VMware 10GE design examples with intelligent QoS.

Before we get started with the designs, I am going to first categorize possible traffic types on a typical VMware host into two main categories: Required & Optional.

Required traffic: VM Data (guest virtual machines), Management (service console / vmkernel), (1) Means of central access to storage (FCoE, NFS, iSCSI, etc.), and finally: vMotion (DRS)  … I know, I know, some of you will say “vMotion isn’t required!”.  For the context of this discussion, it is. Deal with it :-)

Optional traffic: (1) or more additional means of central storage access (e.g. NFS datastores), Fault Tolerance, Real Time traffic (voice/video, market data).

The first design example (Design #1) is going to account for only the Required traffic.  We are using Cisco UCS (either rackmount or blade servers, or both) with the Cisco VIC adapter installed.  On the vSphere host we are using the standard vSwitch or the vDistributed Switch (vDS).

Design #1 — Cisco UCS + VIC QoS Design, Required traffic only

Design #1 above starts out with just the Required traffic profile.  We have Management, VM Data, vMotion, and block based central storage access via FCoE.  We are using Cisco UCS rack mount or blade servers with the Cisco VIC installed and presenting the VMware host with (8) virtual adapters; (2) FC HBA’s; and (6) 10GE Ethernet NICs.  Furthermore, the VMware hypervisor is running either the standard vSwitch or vDS.  There are (3) Port Groups defined; Management, VM Data, and vMotion. Each Port Group has its own pair of virtual adapters.

It’s important to observe that each virtual adapter has its own virtual Ethernet port on the UCS Fabric Interconnect, there is a 1:1 correlation between virtual adapter and virtual network interface.  The cable connecting the virtual adapter to its virtual ethernet interface is a virtual cable in the form of a VN-Tag.  This 1:1 correlation is import because the virtual network interface on the Fabric Interconnect will inherit the exact same minimum bandwidth policy as the virtual adapter it is assigned to.

Side note: Because the vSwitch or vDS is not capable of intelligently classifying and marking traffic with COS settings, we are using the mode “Host Control NONE” on these virtual adapters and allowing the Cisco VIC to apply the markings depending on which virtual adapter services the traffic.

Each virtual adapter in a pair is associated to a different fabric for the obvious reasons of redundancy, and the Port Group using the adapter pair is configured for Active/Standby teaming.  As shown in the diagram we can also alternate whether a Port Group is actively using Fabric A or Fabric B.  This is an important design tool we will use for deterministic traffic placement in a shared bandwidth environment.

Each virtual adapter pair is assigned to a different Class of Service (COS) within UCS Manager, of which there are Six to choose from:  Fibre Channel, Best Effort, Bronze, Silver, Gold, Platinum.  Each service level can be assigned to a different COS value (0-6), and more importantly you assign a bandwidth weight to each service level.

UCS Manager QoS System Class settings for Design #1

In Design example #1 above; the adpaters used by the MGMT Port Group have been assigned a minimum bandwidth of 10% (and transmit rate limited to 1GE), the virtual adapters for VM Data have been assigned a minimum of 40% (no max), the vMotion virtual adapters have been assigned a 10% minimum bandwith (no max).  The Fibre Channel class was adjusted from its default of 50% to 40% (no max).  The minimums add up to 100% (per Fabric).  The upstream network fabric is also applying the same bandwidth policy when transmitting traffic to the servers.  I dont need to be concerned with vMotion traffic received from other servers saturating the link and affecting other important traffic, because the upstream network (Fabric Interconnect) will insure other traffic has fair access to bandwidth based on the same minimum bandwidth policy mentioned above.

Assign QoS policy to vNIC Template

For the traffic transmitted from Host to the Network, or Network to Host, vMotion bullies will be quarantined to their 10% of link bandwidth IF there is saturation & congestion.  If no congestion exists, vMotion has full access to all 10GE of bandwidth.  Furthermore, it is import to observe that vMotion and VM Data traffic have been separated onto different fabrics.  Hence under normal operations, when both fabrics are operational, vMotion and VM Data traffic will not be sharing any bandwidth, rather they will each have access to 10GE simultaneously on their respective fabrics.  Only under a rare failed fabric condition would VM Data and vMotion traffic share link bandwidth, and only during periods of congestion when one fabric has failed would the intelligent QoS policy need to enforce bandwidth shares between vMotion and VM Data.

Design #2 — Cisco UCS + VIC QoS Design, Required + 1 Optional

Design example #2 above is very similar to Design #1, only now we have added a Port Group for NFS traffic.  You may wish to have your VMware hosts boot from an ESX image on the SAN via Fibre Channel, however once the host is fully booted you many want to use NFS data stores for the VM’s to access their virtual disks.  To accommodate the NFS traffic we have provisioned two more virtual adapters at the Cisco VIC, for a total of (8) virtual Ethernet adapters.  The NFS Port Group is assigned to its own pair of virtual adapters and using Active/Standby teaming.

The Active/Standby teaming configuration of each Port Group is set up such that VM Data will have a fabric to itself while vMotion, NFS, and MGMT share the other fabric.

Because our bandwidth minimums cannot exceed 100% (per fabric), we will make a slight modification from Design #1, where the VM Data Port Group now has 30% minimum bandwidth (rather than 40%).  In return the NFS Port Group has been given 10% minimum bandwidth via the “Gold” class of service COS 4. Everything else stays the same.  Below is the UCS Manager QoS System Class settings used to set up Design #2.  Take special note of the MTU settings for both the Silver and Gold classes.

UCS Manager QoS System Class settings for Design #2

Once you have defined the QoS System Class settings in UCS Manager (as shown above) you then create a “QoS Policy” and give it a name.  You can see in the lower left of the above screen shot that I have created a number of these QoS policies already.  One of the policies was named “IP_Storage” which is the one we will assign to the virtual adapters used for NFS traffic.  The QoS Policy defines a few things; 1) the name of the policy; 2) the class of service this policy is associated to; 3) any rate limits you might want to enforce; 4) whether or not this policy will overwrite or preserve COS markings from the server (remember: Host control FULL or NONE).  Below is a screen shot of the QoS policy “IP_Storage” created for NFS traffic.

QoS Policy for NFS Traffic

After you have created the QoS Policy, you then assign it to the vNIC.  You can manually assign the policy to each vNIC of your choice, or better yet, you can do it once with a vNIC Template.  The vNIC Template can be used every time you create a new service profile or service profile template.  vNIC Templates can also be defined as “Updating Templates”, which means if you decide to make a change to the template (such as VLAN or QoS settings) all servers and adapters created from that template will be updated to match the template.

See the Design #1 section above for a screen shot showing a QoS Policy being assigned to a vNIC Template.

Note: You do not need to create a QoS Policy for the Fibre Channel (FCoE) traffic and assign it to the vHBAs.  That is already there by default.

Design #3 — Cisco UCS + VIC QoS, Required + 2 Optional + Realtime

For Design example #3 lets work with a much more difficult scenario where we have the Required traffic, (2) Optional traffic classes, and just for the fun of it (because I like to torture myself) lets throw in Realtime traffic too!  Realtime traffic could be a multicast application serving a live video stream, a voice application such as Cisco Unified Communications (caveat*), or perhaps a multicast feed of market data.  Whatever it is, lets assume that its important traffic, but as is typical with realtime traffic the payload is not large and bulky, rather a consistent flow of small to mid size traffic where lost packets are noticed immediately by the data consumer.

Cisco UCS + VIC QoS Design #3

In Design #3 above we have built upon Design #2 by adding Port Groups and virutal adapter pairs for Fault Tolerance and Realtime traffic.  We now have a total of (12) virtual adapters; (10) 10GE Ethernet NICs; and (2) FC HBAs.

Realtime traffic from a voice application, for example, usually expects to have special treatment in the network.  It’s customary for network infrastructure engineers to configure special treatment for packets marked with COS 5 in the presence of voice applications.  Hence, the network upstream from this Cisco UCS fabric may very well be configured that way, so it makes sense for us to assign that traffic to the “Platinum” COS 5 service level, such that when the traffic leaves Cisco UCS it will be properly marked with COS 5.

Cisco UCS, by default, provides losses Ethernet service for FCoE traffic.  In addition to that, there is one more lane of losses Ethernet service available for us to configure and assign to any class of service.  In Design #3 we will assign the “Platinum” service to losses Ethernet, thereby preventing any packet loss within the UCS fabric.  If we know that this application will be multicast based (such as live streaming video) we can also assign the “Platinum” class with optimized multicast treatment.  See below.

UCS Manager QoS System Class settings for Design #3

The VMware Fault Tolerance traffic has been given its own pair of virtual adapters, assigned to the “Gold” class of service with a minimum bandwidth weighting of 20%.  Again we are using Active/Standby adapter teaming to precisely control which traffic types are shared with others during normal operations when both fabrics are operational.  Notice that Fault Tolerance and vMotion traffic are together on Fabric B, separated from VM Data traffic on Fabric A.

The Fault Tolerance traffic has been a higher 20% weighting over other traffic (excluding FCoE).  This setting is inline with guidance provided by VMware’s Network I/O Control Best Practices document which describes the importance of Fault Tolerance traffic to the performance of VM’s, and hence they recommend giving it a higher weighting.

Another interesting observation to make is that vMotion and MGMT are sharing the same pair of virtual adapters.  Remember, I can currently configure Six different traffic classes in Cisco UCS, yet I have 7 classes of traffic to work with in Design #3.  So, that leaves me no choice but to assign two different traffic types to the same class of service.  I just happened to choose vMotion and MGMT, you may choose something different in your designs, there is no black and white – right or wrong answer.  Now, just because two different traffic types are sharing the same class of service doesn’t mean they need to share the same virtual adapters.  I could have easily provisioned another pair of virtual adapters just for MGMT.

Design #4 — QoS Design with (2) 10GE adapters

Up to this point we have been looking at designs strictly based on Cisco UCS with the fancy Cisco VIC.  Now we will look at another design approach that could be used with Cisco UCS and Cisco VIC, or it could also be implemented with Cisco UCS without the Cisco VIC, or it could be non-Cisco servers connected to a Cisco Nexus 5000 network.  Furthermore, up to this point I’m sure the Nexus 1000V product management team has been a little upset that I haven’t mentioned the Nexus 1000V in a single design yet (heh).  No offense guys, but one of the main points I wanted to show here is the inherent QoS capabilities built in to the Cisco UCS architecture, with or without Nexus 1000V.  Having said that, lets dive into the Nexus 1000V with Design #4 to show its flexible deployment and value in many types of scenarios, Cisco UCS or not. 😉

QoS Design #4 with (2) 10GE adapters

In Design #4 we are using the same traffic profile of Required + (2) Optional + Realtime that was used in Design #3.  Furthermore, if this is a Cisco UCS deployment, we can use the same UCS Manager QoS System Class settings from Design #3.  The main difference here is the fact that our server (either Cisco or non-Cisco) has been provisioned with (2) 10GE adapters visible to the VMware host.  Remember from previous designs that we used the Cisco VIC virtual adapter capabilities to mark and schedule traffic.  The first step of classifying traffic was done via manual steering of Port Groups to a specific set of virtual adapters.

With the Nexus 1000v you can both classify and mark traffic directly from the hypervisor switch (before it reaches the physical adapter), and with Nexus 1000V version 1.4 you can also provide weighted minimum bandwidth scheduling (Shares) as the packets egress the hypervisor switch,  much like VMware Network I/O Control (NetIOC).

With Nexus 1000V we no longer need to provision adapters (virtual or physical) for the sake of QoS classification, as we needed to do in the previous designs with the standard vSwitch or vDS.  This allows for a more streamlined configuration for the server administrator who can now provision a VMware Host with (2) adapters, rather than (8) or (12).  Furthermore, you have the flexibility to use this approach if you have the Cisco VIC, or with any other standard CNA or 10GE adapter.

N1K adapter QoS Policy

For deployments with Cisco UCS, even though we are not leveraging the adapter virtualization capabilities of the Cisco VIC to a great degree, the Cisco VIC still provides an important role in its QoS capabilities.  The class of service based bandwidth polices defined in UCS Manager are still enforced right on the adapter.  The Nexus 1000V classifies and marks the traffic, and the Cisco VIC will place the packets into one if its (8) COS based queues for scheduled transmission.  Remember there are two modes we can define in the Cisco VIC: Host conrol FULL, or Host control NONE.  If Design #4 was leveraging the Cisco VIC we would use the “Host control FULL” setting that will tell the Cisco VIC to utilize and preserve the COS marking made by the Nexus 1000V. See the example to the right.

Enforcing a QoS policy on the physical CNA (Cisco VIC) certainly has its advantages, however QoS can still be accomplished (albeit somewhat less holistically) if you have a standard CNA or 10GE adapter that provides no special QoS capabilities.  Lets assume you have a CNA adapter (I wont mention names) that does basic first-in-first-out FIFO queuing on the adapter for standard Ethernet traffic.  With the Nexus 1000V classifying and marking traffic you can still provide QoS enforcement at the network switch port (Nexus 5000) or UCS Fabric Interconnect.  In this scenario we are relying on the fact that the network switch will drop some traffic that has exceeded the policy during periods of congestion.  When those packets are dropped, the TCP stack of the server or VM will react to that and throttle back its sending rate.  The TCP send rate will gradually increase until a packet is lost again, and the cycle repeats.  This results in the TCP based application finding its bandwidth ceiling provided by the QoS policy enforced at the network switch.

The TCP based bandwidth throttling of course only works with TCP flows.  If you have a heavily used UDP based application, your switch will still enforce the QoS policy but your sending machine may not react to it.  The result is the transmit utilization on the link between your server and switch becoming out of policy and possibly affecting other well behaved TCP applications on that server.  This would not happen with an adapter like the Cisco VIC as it can throttle any flows during congestion, TCP or not, right on the adapter before transmitting on the wire.

Hypervisor based QoS

Rather than relying on TCP based throttling, another approach you can use with non-QoS standard adapters is the Nexus 1000V weighted fair queuing (WFQ) (version 1.4) or VMware Network I/O Control.  Remember these capabilities are based in the hypervisor switch and can provide Cisco VIC like QoS capabilities in software at the hypervisor level.

Some important caveats with hypervisor based QoS worth discussing:

1) With either VMware NetIOC or Nexus 1000V WFQ, there is no visibility into FCoE traffic utilization on a CNA adapter.  The hypervisor switch has no visibility (yet) into the underlying hardware of the CNA to determine FCoE bandwidth utilization at any given moment. This can result in a miscalculation of available bandwidth.  The hypervisor switch might believe the full 10GE of bandwidth is not being used and therefore determine no policy needs to be enforced, however once the traffic is delivered to the CNA for transmission it finds a congested link.  Therefore, if you are using FCoE you should proceed with caution.  These technologies will work better where storage access is on a separate FC adapter or where all Ethernet storage is IP based (iSCSI or NFS) and fully managed by, and visible to, the hypervisor switch.

2) Assuming you are not subject to caveat #1, its important to understand how well this technology will integrate with a broader network wide QoS policy.  For example, the Nexus 1000V can apply a bandwidth transmission policy while at the same time marking the traffic for consistent treatment in the upstream network.  In contrast VMware Network I/O Control does not mark any traffic.  As a result the upstream network must either independently re-classify the traffic or not do anything at all and just “let it rip”.  If you do the later “let it rip” approach you will have a situation in which there are no controls in how the network transmits traffic to a server.  As a result you can either overwhelm your server with receive traffic and defeat the purpose of implementing NetIOC to begin with. Or, you can implement receive based rate-limiters on every VMware host in fear of excessive traffic and again defeat the purpose of intelligent bandwidth sharing that NetIOC sets out to accomplish.

Bottom line: Hypervisor based QoS should be implemented in coordination with a broader network based QoS policy. In the absence of intelligent QoS capable adapters, hypervisor based QoS can help but its not the complete answer.

Design #5 — QoS Design with (2) 10GE adapters, no Cisco VIC, no Nexus 1000V

Here is the last design we will briefly discuss.  Lets imagine you don’t have Cisco UCS – the horror! :-) (just kidding) – Since this could be a standard Cisco C-Series rack mount server you might or might not have the Cisco VIC.  And, you don’t have the Nexus 1000V.  What can we do in terms of implementing QoS in this scenario?  Lets take a look.

Design #5 -- (2) 10GE adapters, no Cisco UCS, no Nexus 1000V

Without the Nexus 1000V or Cisco VIC virtual adapters we know that classifying and marking traffic within the hypervisor or at the adapter is not possible.  If you have the Cisco VIC installed lets assume for now that we don’t (at the moment) have the capability to provision virtual adapters given this implementation is Nexus 5000 and not UCS (not commenting on the future right now).  With a Cisco VIC installed, and without virtual adapters or the Nexus 1000v we wont be able to leverage the (8) COS based queues in an effective way.  So at this point it could be the Cisco VIC or any other standard CNA adapter (Emulex or Qlogic) and we have the same design challenges.

In this scenario we need to an intelligent network switch that can classify traffic as traffic enters.  Once classified the intelligent switch should be able to provide minimum bandwidth shares to each traffic class when transmitted to the destination.  That switch of course would be the Nexus 5000 😉  The burden of traffic classification and QoS has been entirely put upon the physical network switch.

Understanding that, how will the traffic from the server be managed or react to the QoS policy in the network?  Again we have the two choices of hypervisor based QoS or TCP based throttling.  If there was very little or no FCoE in use, one solid approach might be to use VMware Network I/O Control, and then try to replicate the NetIOC QoS policy as best you can in the intelligent network switch.

If you have FCoE implemented, as shown in the diagram, the solid approach would be to leave the server configuration as-is, do not enable NetIOC, and let the upstream network switch selectively drop packets during congestion based on its QoS policy.  The dropped TCP packets will have the result of TCP based applications finding their defined bandwidth ceiling during the periods of congestion.  If you have network intense UDP based applications you may want to experiment with VMware NetIOC, understanding the potential for miscalculations in the presence of FCoE.  VMware Network I/O Control will be able to throttle UDP applications based on its policy at the hypervisor level.

Call to Action

It’s important to understand that this article is not at all intended to say “This is how you should do it” to every exact technical detail.  My main goal here was to show you some examples of what is merely possible with the technology choices you have.  There are a lot of nerd knobs in these designs that could be turned either way.

Having said that, I’m just one man shouting into my little megaphone here.  I would really like to hear from some VCDX folks, either employed by VMware or not, to contribute to this discussion.  It could be in the comments section below or in your own forum.  I’m hoping we can come to agreement on some of the solutions presented here or variations thereof.

Presentation Download

You made it this far? It’s good to know I’m not the only one who’s crazy here.

I would like to thank you for your time and attention! Download a PDF of the original slides, here:

VMware 10GE Design Deep Dive with Cisco UCS, Nexus – PDF

Disclaimer: The author is an employee of Cisco Systems, Inc. The views and opinions expressed by the author do not necessarily represent those of Cisco Systems, Inc. The author is not an official media spokesperson for Cisco Systems, Inc. The designs described in this article are not Cisco official Validated Designs. Please consult your local Cisco representative for design assistance.


  1. says

    Great blog post! I attended the UCS QoS session at Cisco Live and it really opened my eyes to how awesome the putting together of UCS, Palo, ESX, and Nexus 1000v using QoS can be. We are currently working on moving all of our UCS blades to ESX host to 4.1, with Nexus 1KV, Palo with PowerPath/VE and should be there by the end of year. The main thing is issues with the Palo drivers not working with PowerPath/VE. Thanks,

    • Bobby Bosch says

      Thanks again Brad for some enlightenment once again. I just happen to be n the middle of setting up 1000v on my UCS.

      After all of this, I think I am going to go with your Option #4. Here is why.

      1.) though i could setup the 1000v by hand, UCS has a basic integration (VM Tab). I figure if i stick with this eventually it will grow (in future releases) to support more networking options. Correct me if i’m wrong but you can’t create any Uplink interfaces if you go pure UCS plugin … yet… I could however make multiple vDS’s, connect the appropriate QOS vnics to them and use the whole switch similar to option 3.

      2.) A single UCS blade itself really can only bind to a single 10gig port on each fabric extender anyways, even if you have all 4 per IO Module x2 modules, you’ll only get one from each. So why not just assume that by only making the 2 adapters anyways. Then doing all your QOS at the port level sounds right… which i might add is very easy to configure with UCS side (VM Tab). The only issue is what the effects of adding QOS to the VEM are (cpu/load wise).

      Mike, i too am stuck with Palo drivers (fnic) not working right with my Clariion. Would be nice to see this fixed asap.

      Cisco IT

      • says

        Good points. I did not discuss a QoS design with UCS VN-Link (the VM Tab) in this post. This post was long enough just covering the basic designs. I’m thinking about a follow on post dedicated to the design with UCS Hardware VN-Link and its QoS capabilities.
        It’s great to hear from Cisco IT. Thanks for stopping by!


      • Jeff S says


        What issues are you seeing with palo and the Clariion? I just setup a UCS, CX4-240 (Flare 30), ESXi 4.1, boot from SAN, PP/VE 5.4Sp2, Clariion set to ALUA (fail mode 4), Nexus 1K, and it all seems to be working perfectly.


  2. says

    Wow! Great Stuff! I just read through it once and I need to sit back a digest a bit. You have given me some things to think about for my future posts as well.

    I would really like to see some kind of decision matrix (pro’s/con’s) around each solution to help others make their decisions as well “cook books” on some recommendations on the settings for each. If yourself (and others like Steve Chambers and Sean McGee) would like to collaborate, I’m in!


  3. says

    Brad – excellent post! This is exactly the level of detail I was looking for to address the QOS concerns in ESX environments. I don’t blog – but I did all the testing on ESX 4.1 for the information that was in Aaron’s blog. A lot of this information came from my VMworld presentation – TA8440 – 10Gb & FCoE Real World Design Considerations. In that presentation – I was more focused on the concern of vMotion saturating a 10Gb link, but we discussed options of Palo, 1000v, Flex10, and others. I really liked what I thought Palo could provide, but needed these details you have provided here to fully address the concerns I saw in ESX.

    This solution allows traffic prioritization (vs shaping) while fully integrating with a datacenter QOS strategy ensuring that you are also prioritizing traffic on uplinks leaving blade chassis, etc. It also covers inbound and outbound traffic management.

    Great stuff – thanks for the detailed post.


  4. says

    Brad – Just another note. I really don’t deserve credit for the discovery. I credit Don Mann at ePlus for all his hard work in preparation for his VMworld session: 10GB and FCoE Design Considerations. I urge everyone to go catch the replay when it is made available.

  5. Ryan B says

    You touched on NFS use case and specify a single adapter for IP storage traffic. I’m working on a build-out right now and had settled on creating two separate port groups for IP storage (unique VLANs), and each port group would use a different active adapter. This seems somewhat analogous to the FCoE use case where both 10G links are actively carrying storage traffic.

    Can you expand on your thoughts behind using one vs. two links for IP storage? If the best-practice is to only use one active link, why wouldn’t that also apply to FCoE?

    We will have a mix of iSCSI (DR for legacy VM’s) and NFS. iSCSI can use NMP RR to utilize both links, and for NFS datastores we will manually round-robin between subnets when mounting different datastores.

    Thank you!

    • says

      You could absolutely setup multiple NFS port groups with alternating active fabrics. Just understand that if one fabric fails you will have both NFS port groups sharing the same minimum bandwidth class of service on the remaining fabric. As I said there are many different nerd knobs and tweaks that can be made to these design to fit your specific deployment. Thanks for stopping by and leaving a comment!


  6. says


    Be aware of how little IO you are probably doing for NFS or any storage for that matter. In my research – we took 1000 VM’s and added all the IO throughput the average throughput to disk was around 3Mbps per VM. Most ESX host have between 20 and 60 VM’s. That would be around 180Mbps. IMO opinion – you should worry more about fast convergence and failover times and only add the complexity of an additional datastore to leverage the second link if you need it (which I have not found a single person who does on 10Gb).

    Just a thought. I found the whole kick about multiple NFS datastores to be more of a checkbox because people think they need multiple active links. Immediate or fast failover (etherchannel) is more critical IMO.


  7. Kris says

    So if we have Menlo adapters with Nexus 1kV then we need to do Design 4?

    What would the QoS policy need to look like on the vNICs in the UCSM or is there any point in applying one?

    Also do you recommend using the failover option with the Menlo in the UCSM or just let vmware manage the failover?

    • says

      Yes, with Nexus 1000V and Menlo based adapters your applicable design is Design #4. You wouldn’t really need to assign a QoS policy to the Menlo vNICs because those adapters are basically implementing FIFO queuing with no advanced QoS to leverage. You will still want to have your QoS System Class configured for proper treatment once the packets (marked by Nexus 1000V) reach the Fabric Interconnects.
      As for the failover option, take a look at this new post and let me know if this helps answer that question:


  8. Clément says

    Hi all,

    I’am insterested in one this about this solution: What is the amount of memory we really have available on the VIC for queuing?


    • says

      The size of each egress COS queue on the Cisco VIC itself is 16KB. However that is rather insignificant because if that queue is filled the Cisco VIC will not drain packets from the Host memory, so its the amount of Host memory that is effectively available to buffer data when an egress flow is controlled.


  9. Gyan says

    Brad, Thanks for an excellent excellent writeup. Finally something in English. One point I could not understand. You seem concerned about software based QoS schemes since they may not have visibility on FCoE bandwidth and may miscalculate the available bandwidth. In UCS environment aren’t the FEX uplink bandwidth shared with all the blades and if so even hardware based QoS provided by Cisco VIC cannot predict the available upstream bandwidth. That would depend on what other blades are doing at that time.

    • says

      In Cisco UCS, the FEX is transparent from a QoS perspective. If the FEX uplinks become congested it will begin sending PFC flow control messages to the Cisco VIC, applying back-pressure to the adapter, where the Cisco VIC can begin to queue and apply the QoS policy. The FEX is basically a bump-in-the-wire that does not drop packets, it just pushes congestion back to the source. The next hop up from the FEX, the Fabric Interconnect, is also applying the same QoS policy on its ports that the Cisco VIC is applying at the adapter.

      In the logical view the FEX is transparent and the Cisco VIC is directly connected to the Fabric Interconnect, both of which have consistent QoS driven from the same central policy defined in UCS Manager.

      Make sense?

      Thanks for reading and leaving a comment!


      • Cheuk says

        Brad, What flow control exists between the Cisco VIC and the Nexus 1000V in Design #4? When a vNIC is backed up, does it back-pressure the flows in the 1000V? Thanks!

        • says

          The Cisco VIC only drains data from the Host if it has room to queue and transmit that data. Therefore, if the Cisco VIC is congested it will not drain packets from the vSphere networking kernel (Nexus 1000V). If packets are not drained from the hypervisor networking kernel, the kernel will in turn cease to drain data from the virtual machine’s networking stack, providing back-pressure up to the application which will wait for available memory.


          • Cheuk says

            Thanks, Brad! A further question: Suppose two vNICs form a port channel to drain traffic from a 1000V uplink. When one of the vNICs is congested, does it affect traffic that is supposed to go to the second vNIC?

            I assume the answer is no? Then it will result in the 1000V dropping packets assigned to the congested vNIC while the other vNIC still passes through traffic normally. If that is the case, will the load balancer in the 1000V kick in and redistribute the load?

  10. says

    Great post!! We are Busy optimizing our democenter with UCS and all other parts. This Saves us hours of searching for info how to implement QoS.

  11. says


    A comment to test my understanding and then a question to follow-up…

    The UCS ecosystem leverages a port aggregation solution to chassis I/O, namely, the FEX modules.

    The FEX modules are not fully featured switches. Nor do they possess any forwarding policy intelligence at all. Instead, the FEX modules deploy a “pinning” approach in which downlinks (those that face the blade server’s NIC’s, LOMs, mezzanine cards) are mapped to an uplink port (those that face a 6100 Fabric Interconnect) to form what can be described as an aggregator group.

    The result is a simplified approach to blade I/O in which the traffic patterns are predictable and failover is deterministic. Moreover, there is no need to configure STP because the ports are uplinked in a manner as to preclude any possibility of a bridging loop.

    This having been said, is there some merit to the argument that this port aggregation design creates a hole — a discontinuity — in the middle of a QoS deployment because the scheduling of packets on the uplink ports facing the 6100 Fabric Interconnect is not performed in a manner that recognizes priority? In other words, no QoS on the FEX.

    To elaborate a bit more, one can have a VMware deployment and leverage NetIOC or perhaps configure QoS on a 1000v switch (whose uplink ports are mapped to a port on the Palo VIC) and configure QoS on the VIC, and then on the 6100 Fabric Interconnect. But, since the FEX is not scheduling traffic to the 6100 Fabric Interconnect according to any priority, the QoS deployment has a hole in the middle, so to speak.


  12. says


    Good article.

    I like the QoS prospective of VMware and Cisco in relation to shares and limits and setting maximums as well.

    2 questions

    1) Does the UCS solution offer any means of providing FC QoS or plans for the future like this year ? I know the MDS supports QoS but I wasn;t sure end to end. This isn’t related to the article. I am not asking Switch to Switch QoS pure end to end., SAN Array to VM. I am not sure possibilities here with FCoE and how that work. Brocade currently though has a solution for this end to end using their HBAs and obviously their switches its interesting.

    2) In a situation without UCS and using NetIOC with different Vlans, Port Group, Egress traffic shapping (what I showed last week) couldn’t you just do your QoS on the Vlans at your upstream switch easily? I see your point though not as pretty cause it isn’t end to end.


  13. jerry says


    I’ve borrowed some of your ideas above to build my QoS classes and apply them as in your first example. In an attempt to verify the functionality of QoS I intentionally forced all traffic through one FEX uplink and blasted line-rate flows from multiple hosts/VMs. and performed vmotions I can physically see my apps struggling (or not) depending on how I carve up the policy so I know that something is going on but I want to be able to point to some counters to prove what’s happening. On the FEX all I can see are pause frames, but what I expected was to see drops back on the Palo (under vNIC stats).

    Can you point me to where I can some per-vnic data or perhaps propose a POC type test to help me feel comfortable with my setup before trusting it in production?

    Thanks Again!

  14. says

    Hi Brad,

    Been reading this post a couple of times now and I’m wondering whether you can clear this up.
    In design #4 you assign on the Cisco 1000V the QoS policies on port-profiles. This policy tags each packet according to its priority, but what happens if you do this and there is no switch upstream, that does QoS? There will be no QoS happening right?

    thanks for your response,

  15. says


    Let’s say the scenario is a blade chassis with four (4) 10G CEE uplinks to a ToR FCF…. or even UCS, it doesn’t matter.

    In such a converged network environment at the server edge, how does one determine how much uplink bandwidth to dedicate to FCoE traffic? Is there a standard recommendation to dedicate bandwidth using some sort of policing or rate limiting mechanism or ….what?

    I have read a lot about FCoE lately, but I have never seen anything about this.

    Any help would be greatly appreciated.


  16. Matt Faiello says

    Hey Brad, Great Blog….I think in Design #3, Auto-correction on the spelling in your design converted “loss-less” into losses.

  17. ashnay says

    is there any design with cisco 4507 Single switch considering 10gb modules.
    1x c7000 chasis with 3xbl460c g6 blades: Vsphere 4.1 (6 Application Virtual hosts including vCenter+SQL database

    2xbl460c g6 blades:( non vmware i.e for Physical aplication (symantec backup blade with rhel5 and application server blade

    c7000 chasis consist of 2x flex10 VC and 2x MDS9124e SAN switch

    1xnetapp 3140 dual controller with 2x Shelf with 24x600GB SAS HDD
    1xnetapp 3140 single controller with 1x shelf with 24 x 2TB SATA

    disk to diskto tape backup

  18. CJ says

    Hey this was a very helpful article. How would a Nexus 1010 fit in to a 10Gb scheme like the one shown in the dual 10G Nexus 1000V/5000 diagram above? Have about 40 Cisco B servers to use as ESXi hosts so interested in better switching mgmt.

  19. says

    Hiya, love your work. Anyone got CVD (full config) for mutating NetApp COS (NFS) and applying QOS priority/no-drop/queueing end-to-end on 7ks and 1ks? (UCS outbound is a no-brainer!)

  20. Ankit says

    Hi Brad,

    Great Blog.. and very informative indeed.

    I have a question here. My ESXi host on UCS blade will boot from SAN, will have FC SAN datastore and network trafiic types will be Management, vMotion and VM network.

    I would like to keep the Boot from SAN traffic and the FC Datastore separate and therefore plan to provision 4 vHBA, two for each traffic type. These will also have different zoning configurations.

    Now, I understand that the Boot from SAN traffic will not be that significant (ESXi hardly reboot in current environment and in offline hrs during maintenance ) as FC datastore, so can we apply a different CoS to those vHBAs with a lower rate limit? For the vHBAs used for FC datastore we will apply the fc CoS with default bandwith weight.

    Please suggest.


  21. Mukund Pohakar says

    Hello Brad,

    This is very elaborate and precise guideline to implement VMware switching .

    I have a scenario:

    How do you deploy jumbo and non-jumbo in VMWARE where you have just 2 x 10G Nics? What are guidelines to deploy VMware VDS and Cisco Nexus Switching line

    From all documentation I can see JUMBO should be enabled end to end . Mixing non-JUMBO traffic is not option. I have multiple VLANS on servicing NAS, Data traffic, Heart-beat, Oracle RAC, etc

    Kindly help to work out the solution

    Best Regards


Leave a Reply

Your email address will not be published. Required fields are marked *