After the feedback on my recent post Host Swapping (another Case Study), I decided to share another real-world troubleshooting scenario. This case study involves a small VDI cluster (VMware View) with approximately 120 virtual desktops and 6 hosts (an average of 20 VMs per host at this stage). Users were complaining of slow performance at certain times of the day.
Administrators also noticed that RDP sessions were being disconnected during VMotion, so they reduced the aggressiveness of DRS to avoid VMotion. Additionally, HA was disabled because of a known bug (KB1016262, patch released on January 5, 2010).
I’ll focus here on the performance issues perceived by end users when using virtual desktops, and leave the disconnected sessions, DRS and HA issues for another post, although the lack of DRS could directly impact performance.
To put in context:
- Cluster with 6 ESXi hosts, with DRS aggressiveness set to the lowest level
- VMs have VMware Tools installed and TPS is enabled
- No memory reservations are assigned to VMs
- Storage is a Hitachi SAN with 2Gb HBAs
Looking at host and cluster memory usage, followed by CPU usage, I noticed that VMs had high %IDLE times, with peaks of up to 300. One of the VMs had an average CPU Ready Time of 3986 milliseconds with a peak of 9679ms. This means that the vCPU in that VM spent almost 9.7 seconds waiting instead of executing guest instructions.
%IDLE – Percentage of time the resource pool, virtual machine, or world was idle. Subtract this percentage from %WAIT to see the percentage of time the resource pool, virtual machine, or world was waiting for some event. The difference, %WAIT- %IDLE, of the VCPU worlds can be used to estimate guest I/O wait time.
%RDY – Percentage of time the resource pool, virtual machine, or world was ready to run, but was not provided CPU resources on which to execute.
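The %WAIT minus %IDLE estimate described above can be sketched in a few lines. This is only an illustration of the arithmetic: the function name and the sample readings are hypothetical and were not captured in this environment.

```python
def estimate_io_wait_pct(wait_pct: float, idle_pct: float) -> float:
    """Estimate guest I/O wait for one vCPU world as %WAIT - %IDLE."""
    # %WAIT includes %IDLE, so the difference approximates time blocked on
    # some event (typically I/O). Clamp at zero to absorb sampling noise.
    return max(0.0, wait_pct - idle_pct)


# Hypothetical esxtop readings for a single vCPU world (illustrative only).
sample = {"wait": 85.0, "idle": 60.0}
print(estimate_io_wait_pct(sample["wait"], sample["idle"]))
```

A consistently large difference on the vCPU worlds is a hint to look at storage rather than at CPU scheduling.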
It is interesting to note in this picture that CPU usage does not necessarily follow CPU Ready Time. The VM was waiting for IO operations at moments that did not coincide with its highest CPU usage. The obvious conclusion would be that the host is somehow CPU or memory constrained.
In most of my engagements this scenario would point me to memory or CPU contention, but in this case the cluster and hosts did not show signs of memory (50%) or CPU (15%) contention. (I should have a picture of that but…).
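It’s important to remember that a vCPU that is not running may be held up by more than CPU scheduling. %RDY counts the time it was ready but waiting for a physical CPU, while %WAIT minus %IDLE estimates the time it was waiting for IO, so memory and disk have to be checked as well.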
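To put the CPU Ready figures above in perspective, a millisecond summation can be converted into a percentage of the sampling interval. The sketch below assumes the values came from a vCenter real-time chart with 20-second samples and a single vCPU; neither is stated in the data I have, so treat both as assumptions.

```python
REALTIME_INTERVAL_S = 20  # assumption: vCenter real-time chart sample length


def ready_ms_to_pct(ready_ms: float,
                    interval_s: float = REALTIME_INTERVAL_S,
                    vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (ms) into a per-vCPU percentage of the interval."""
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms / (interval_s * 1000.0 * vcpus) * 100.0


# Average and peak CPU Ready values reported for the VM in this case study.
for label, ms in (("average", 3986), ("peak", 9679)):
    print(f"{label}: {ready_ms_to_pct(ms):.1f}% of the interval")
```

With a longer chart interval (for example the historical roll-ups) the same millisecond value represents a much smaller percentage, so the interval matters when reading these numbers.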
After looking at disk performance on a few hosts we were able to identify high disk IO latency. I like to use Veeam Monitor to consolidate disk usage and latency for all hosts in a single chart. The free version of the product allows you to consolidate up to 24 hours, and that should suffice for the analysis. Also, consider changing vCenter logging levels to provide more information during the troubleshooting process.
Some LUNs were facing severe contention, with an average disk latency of 304.68ms. VMware recommended practice stipulates that disk latency should average no more than 10ms, and no more than 20ms at peak times. In this environment we could see peaks of 606ms over the past 24 hours. Drilling down into the vCenter performance graphs, I was actually able to see moments when peak latency reached 3500ms. That is 3.5 seconds from the moment the IO command was issued until the storage array returned data to the host and guest OS.
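As a rough illustration of how far these readings are from the guidance, the comparison can be written as a small check. The function and dictionary keys are hypothetical; the thresholds are the 10ms average and 20ms peak figures quoted above.

```python
AVG_LIMIT_MS = 10.0   # average latency guidance quoted above
PEAK_LIMIT_MS = 20.0  # peak latency guidance quoted above


def check_latency(avg_ms: float, peak_ms: float) -> dict:
    """Compare observed disk latency with the average and peak guidance."""
    return {
        "avg_over_limit": avg_ms > AVG_LIMIT_MS,
        "peak_over_limit": peak_ms > PEAK_LIMIT_MS,
        "avg_ratio": avg_ms / AVG_LIMIT_MS,     # how many times over the average limit
        "peak_ratio": peak_ms / PEAK_LIMIT_MS,  # how many times over the peak limit
    }


# Average and worst peak latency observed in this case study.
result = check_latency(avg_ms=304.68, peak_ms=3500.0)
print(result)
```

The same check can be run per LUN to rank which datastores to raise with the storage team first.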
From this point onwards we were speaking to the storage admins, but at first glance they could not see contention at the storage level. That’s when vscsiStats is a handy tool to have in your arsenal.
Note: vscsiStats is now included in ESXi 4.0 U2 as part of the image, so no additional download is required. Thanks to @virtualirfan. Visit him at http://virtualscoop.org/. If you want more information on vscsiStats please read this and this.
[vscsiStats output. Histogram: latency of IOs in Microseconds (us), per virtual machine worldGroupID. Columns: Frequency, Histogram Bucket Limit.]
For this VM most IOs completed in under 5ms, but a few were taking a considerable amount of time to be returned from the SAN. Have a look at the max! Looking at HBA throughput, we noticed that it was minimal. This information helped the storage engineers get to the bottom of the problem and work towards a solution.
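The "most IOs under 5ms, a few very slow ones" reading can be quantified from a latency histogram with a short sketch like the one below. The bucket values here are hypothetical and only mimic the Frequency / Histogram Bucket Limit layout; they are not the real figures for this VM.

```python
def share_at_or_below(buckets, limit_us):
    """Return the fraction of I/Os in buckets whose upper limit is <= limit_us.

    buckets is a list of (bucket_limit_us, frequency) pairs.
    """
    total = sum(freq for _, freq in buckets)
    if total == 0:
        return 0.0
    fast = sum(freq for bucket_limit, freq in buckets if bucket_limit <= limit_us)
    return fast / total


# Hypothetical histogram: (bucket limit in microseconds, frequency).
hist = [(100, 50), (500, 300), (1000, 400), (5000, 200),
        (15000, 30), (30000, 15), (100000, 5)]

fast_share = share_at_or_below(hist, 5000)  # 5000 us = 5 ms
print(f"{fast_share:.0%} of I/Os completed within 5 ms")
print(f"{1 - fast_share:.0%} landed in the slow tail")
```

A small slow tail like this is easy to miss in averaged storage reports, which is one reason a per-VM histogram is useful evidence when talking to storage engineers.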
The issue ended up being that the RAID groups in the storage subsystem serving the VDI infrastructure did not have enough spindles to provide the number of IOPS the VDI environment required.
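A back-of-the-envelope spindle estimate shows why this kind of sizing matters. Only the desktop count comes from this case study; the per-desktop IOPS, write ratio, RAID write penalty and per-disk IOPS below are assumptions for illustration, not measurements from this environment.

```python
import math


def spindles_needed(frontend_iops: float, write_fraction: float,
                    raid_write_penalty: int, iops_per_disk: float) -> int:
    """Estimate the disks required once the RAID write penalty is applied."""
    reads = frontend_iops * (1.0 - write_fraction)
    writes = frontend_iops * write_fraction
    # Each front-end write costs several back-end I/Os, depending on RAID level.
    backend_iops = reads + writes * raid_write_penalty
    return math.ceil(backend_iops / iops_per_disk)


DESKTOPS = 120            # from the case study
IOPS_PER_DESKTOP = 10     # assumption: steady-state load per desktop
WRITE_FRACTION = 0.5      # assumption: half of the I/O is writes
RAID5_WRITE_PENALTY = 4   # common rule of thumb for RAID 5
IOPS_PER_DISK = 150       # assumption: rough figure for a single fast spindle

disks = spindles_needed(DESKTOPS * IOPS_PER_DESKTOP, WRITE_FRACTION,
                        RAID5_WRITE_PENALTY, IOPS_PER_DISK)
print(f"Estimated spindles required: {disks}")
```

Changing any of the assumed inputs moves the result considerably, so real sizing should start from measured per-desktop IOPS and the array vendor's figures.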
There are two lessons to be learnt here in my opinion:
- It is important to use your whole arsenal of tools during a troubleshooting process. One piece of information may lead you to a different analysis that helps you understand the issue. Sometimes one issue creates another, and it’s not always straightforward to understand the real causes. There are a number of free tools available on the blogosphere, and you may start with this good article from Kendrick Coleman.
- Design, plan and understand the technology. Engagement with key stakeholders is essential in virtual environments, where several different technologies MUST work in harmony.