Disk Latency (another Case Study)

After the feedback on my recent post Host Swapping (another Case Study), I decided to share another real-world troubleshooting scenario. This case study involves a small VDI cluster (VMware View) with approximately 120 virtual desktops on 6 hosts (an average of 20 VMs per host at this stage). Users were complaining of slow performance at certain times of the day.

Administrators also noticed that RDP sessions were being disconnected during VMotion. Because of that, they reduced the aggressiveness of DRS to avoid VMotion. Additionally, HA was disabled because of a known bug (KB1016262; patch released on January 5, 2010).

I’ll focus here on the performance issues perceived by end users of the virtual desktops and leave the disconnected sessions, DRS and HA issues for another post, although the lack of DRS could directly impact performance.

To put this in context:

  • Cluster with 6 ESXi hosts, with DRS aggressiveness set to its lowest level
  • VMs have VMware Tools and TPS is enabled
  • No memory reservations are assigned to VMs
  • Disks are on a Hitachi SAN with 2Gb HBAs

Looking at host and cluster memory usage, followed by CPU usage, I noticed that VMs had high %IDLE times, with peaks of up to 300. One of the VMs had an average CPU Ready Time of 3986 milliseconds with a peak of 9679 ms. This means that the CPU in the VM was waiting for almost 9.7 seconds for IO instructions to be returned to the guest OS before it could resume operations.

%IDLE – Percentage of time the resource pool, virtual machine, or world was idle. Subtract this percentage from %WAIT to see the percentage of time the resource pool, virtual machine, or world was waiting for some event. The difference, %WAIT- %IDLE, of the VCPU worlds can be used to estimate guest I/O wait time.
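As a rough illustration of the %WAIT minus %IDLE estimate described above, here is a minimal Python sketch. The function name and the sample readings are hypothetical and are not taken from this environment; only the subtraction itself comes from the counter definition.

```python
def estimate_guest_io_wait(pct_wait: float, pct_idle: float) -> float:
    """Estimate guest I/O wait for one vCPU world as %WAIT - %IDLE.

    The result is clamped at zero because sampling jitter can make
    %IDLE read marginally higher than %WAIT.
    """
    return max(0.0, pct_wait - pct_idle)


# Hypothetical esxtop readings for two vCPU worlds (illustrative values only).
samples = [
    {"world": "vcpu-0", "pct_wait": 95.0, "pct_idle": 60.0},
    {"world": "vcpu-1", "pct_wait": 80.0, "pct_idle": 78.5},
]

io_wait = {
    s["world"]: estimate_guest_io_wait(s["pct_wait"], s["pct_idle"])
    for s in samples
}

for world, pct in io_wait.items():
    print(f"{world}: estimated I/O wait {pct:.1f}%")
```

A world whose %WAIT is much higher than its %IDLE is spending that difference blocked on some event, which is a hint to go and look at storage next.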

%RDY – Percentage of time the resource pool, virtual machine, or world was ready to run, but was not provided CPU resources on which to execute.
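The ready times quoted earlier are in milliseconds, while %RDY is a percentage. A minimal sketch of the conversion follows. It assumes the millisecond values come from a vCenter real-time chart, which samples every 20 seconds; that interval is my assumption here, and other chart ranges use different intervals. The function name is hypothetical.

```python
def ready_ms_to_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU Ready summation (ms) into a percentage of the interval.

    interval_s defaults to 20 seconds, the sampling interval of vCenter
    real-time charts (an assumption about where the numbers came from).
    """
    return ready_ms / (interval_s * 1000.0) * 100.0


# Values quoted in this post: average and peak CPU Ready Time for one VM.
for label, ready_ms in (("average", 3986), ("peak", 9679)):
    print(f"{label}: {ready_ms} ms -> {ready_ms_to_percent(ready_ms):.1f}% ready")
```

For a VM with several vCPUs the summation is the total across all of them, so the result should be divided by the vCPU count to get a per-vCPU figure.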

CPU Ready

It is interesting to note in this picture that CPU Usage does not necessarily follow CPU Ready Time. The VM was waiting for IO operations at moments that did not coincide with the highest CPU usage. The obvious conclusion would be that the host is somehow CPU or memory constrained.

In most of my engagements this scenario would point me to memory or CPU contention, but in this case the cluster and hosts showed no signs of memory (50%) or CPU (15%) contention. (I should have a picture of that but…).

It’s important to remember that CPU Ready Time demonstrates the percentage of time the vCPU was ready and waiting for IO, any IO. That could be CPU, Memory or Disk.

After looking at disk performance on a few hosts we were able to identify high disk IO latency. I like to use Veeam Monitor to consolidate disk usage and latency for all hosts in a single chart. The free version of the product allows you to consolidate up to 24 hours, and that should suffice for the analysis. Also, consider changing vCenter logging levels to provide more information during the troubleshooting process.

Disk Latency

Some LUNs were facing severe contention, with an average disk latency of 304.68 ms. VMware recommended practice stipulates that disk latency should be no more than 10 ms on average and 20 ms during peak times. In this environment we could see peaks of 606 ms over the past 24 hours. Drilling down into the vCenter performance graphs, I was able to see moments when peak latency reached 3500 ms. That is 3.5 seconds between the IO command being issued and the storage array returning data to the host and guest OS.
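To make the gap between the guideline and what we measured concrete, here is a small sketch that compares latency figures against the 10 ms average and 20 ms peak limits mentioned above. The function and constant names are hypothetical; the latency values are the ones quoted in this post.

```python
AVG_LIMIT_MS = 10.0   # guideline for average disk latency
PEAK_LIMIT_MS = 20.0  # guideline for peak disk latency


def assess_latency(avg_ms: float, peak_ms: float) -> list:
    """Return a list of findings for values that exceed the guideline."""
    findings = []
    if avg_ms > AVG_LIMIT_MS:
        findings.append(
            f"average {avg_ms:.2f} ms is {avg_ms / AVG_LIMIT_MS:.1f}x the limit"
        )
    if peak_ms > PEAK_LIMIT_MS:
        findings.append(
            f"peak {peak_ms:.2f} ms is {peak_ms / PEAK_LIMIT_MS:.1f}x the limit"
        )
    return findings


# Figures from this case study: 24-hour average and peak, then the worst
# peak seen in the vCenter performance graphs.
for avg_ms, peak_ms in ((304.68, 606.0), (304.68, 3500.0)):
    for finding in assess_latency(avg_ms, peak_ms):
        print(finding)
```

A healthy LUN returns an empty list, so the same check could be run across all LUNs to shortlist the ones worth taking to the storage team.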

From this point onwards we were speaking to the storage admins, but at first glance they could not see contention at the storage level. That’s when vscsiStats proves to be a handy tool to have in your arsenal.

Note: vscsiStats is now included in ESXi 4.0 U2 as part of the image, so no additional download is required. Thanks to @virtualirfan. Visit him at http://virtualscoop.org/. If you want more information on vscsiStats please read this and this.

Using vscsiStats it’s possible to collect real-time information about virtual machines and create graphs from it. Below is a quick sample of the information collected for IO latency.
Histogram: latency of IOs in Microseconds (us) virtual machine worldGroupID
min      111
max      813013
mean     9410
count    34407

Frequency    Histogram Bucket Limit
        0    1
        0    10
        0    100
      845    500
      151    1000
    24586    5000
     6258    15000
     1289    30000
      344    50000
      451    100000
      483    100000+
Histogram in Microseconds (us)

For this VM most IOs (roughly three quarters of the 34,407 sampled) completed in under 5 ms, but a few were taking a considerable amount of time to be returned from the SAN. Have a look at max: 813,013 microseconds, or about 813 ms! Looking at the HBA throughput, we noticed that it was minimal. This information helped the storage engineers get to the bottom of the problem and work towards a solution.
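The histogram above can be summarised with a few lines of Python. This is only a sketch: the bucket data is copied from the sample output, while the function name and the choice of cut-off points are mine.

```python
# (bucket upper limit in microseconds, number of IOs); None marks the
# open-ended "100000+" bucket. Values copied from the sample output above.
HISTOGRAM = [
    (1, 0), (10, 0), (100, 0), (500, 845), (1000, 151), (5000, 24586),
    (15000, 6258), (30000, 1289), (50000, 344), (100000, 451), (None, 483),
]


def share_at_or_below(histogram, limit_us):
    """Fraction of IOs that landed in buckets whose limit is <= limit_us."""
    total = sum(count for _, count in histogram)
    within = sum(
        count for limit, count in histogram
        if limit is not None and limit <= limit_us
    )
    return within / total


total_ios = sum(count for _, count in HISTOGRAM)
fast = share_at_or_below(HISTOGRAM, 5000)          # 5 ms or less
slow = 1.0 - share_at_or_below(HISTOGRAM, 15000)   # more than 15 ms
very_slow = HISTOGRAM[-1][1] / total_ios           # more than 100 ms

print(f"total IOs:      {total_ios}")
print(f"<= 5 ms:        {fast:.1%}")
print(f"> 15 ms:        {slow:.1%}")
print(f"> 100 ms:       {very_slow:.1%}")
```

The long tail is small as a percentage, but a desktop user feels every one of those IOs, which is why the mean and the bulk of the distribution look acceptable while the users still complain.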

Disk Usage

The issue ended up being that the RAID groups in the storage subsystem serving the VDI infrastructure did not have enough spindles to provide the number of IOPS the VDI environment required.
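A back-of-the-envelope spindle calculation shows how easily this can happen. The sketch below uses the common rule of thumb that each front-end write costs several back-end IOs depending on the RAID level. Every input except the 120 desktops is an assumption for illustration (IOPS per desktop, write ratio, RAID write penalty, IOPS per spindle); none of these were measured in this environment, and the function names are hypothetical.

```python
import math


def backend_iops(frontend_iops: float, write_ratio: float,
                 raid_write_penalty: int) -> float:
    """Back-end IOPS the disks must deliver for a given front-end load."""
    reads = frontend_iops * (1 - write_ratio)
    writes = frontend_iops * write_ratio
    return reads + writes * raid_write_penalty


def spindles_needed(frontend_iops: float, write_ratio: float,
                    raid_write_penalty: int, iops_per_spindle: float) -> int:
    """Minimum number of spindles, rounded up to a whole disk."""
    required = backend_iops(frontend_iops, write_ratio, raid_write_penalty)
    return math.ceil(required / iops_per_spindle)


desktops = 120            # from this case study
iops_per_desktop = 10     # assumption: steady-state load per desktop
write_ratio = 0.5         # assumption: half of the IOs are writes
raid_write_penalty = 4    # assumption: RAID 5 rule of thumb
iops_per_spindle = 150    # assumption: one fast FC/SAS disk

frontend = desktops * iops_per_desktop
print("front-end IOPS:", frontend)
print("back-end IOPS: ", backend_iops(frontend, write_ratio, raid_write_penalty))
print("spindles:      ", spindles_needed(frontend, write_ratio,
                                         raid_write_penalty, iops_per_spindle))
```

Changing the write ratio or the RAID level moves the answer a lot, which is exactly why these numbers need to be agreed with the storage team at design time.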

There are two lessons to be learnt here in my opinion:

– It is important to use your whole arsenal of tools during a troubleshooting process. One piece of information may lead you to a different analysis that will help you understand the issue. Sometimes an issue will create another issue, and it’s not always straightforward to understand the real causes. There are a number of free tools available in the blogosphere and you may start with this good article from Kendrick Coleman.

– Design, plan and understand the technology. Engagement with key stakeholders is essential in virtual environments where several different technologies MUST work in harmony.

9 comments

  1. You mention that the “storage monkeys” could not see anything wrong at their end. If they had HDS Tuning Manager they could have picked up on that before the View admin and user land noticed it.

    Tuning Manager has some good alarms which can be bound to that VDI RAID group to alert the team before it goes bad.

    Once again always surprised to see people not using the tools that come with the storage arrays.

    David

  2. @David Francis
    You are spot on in mentioning HDS Tuning Manager. Unfortunately the tool was not available as part of the storage kit.

    In regards to the “storage monkeys”, not being a native English speaker I didn’t know the impact this would have. I have respectfully changed that to ‘Storage Admins’.

    Thanks for your comments.

    • PiroNet on 06/29/2010 at 12:39 am

    Excellent post! I would also add to the list of things to check: alignment, especially if your desktops run XP 🙂

    People at PQR.NL have posted a great article about Windows XP and VDI -> http://www.virtuall.eu/creating-a-vdi-template

    Rgds,
    Didier

  3. Thanks for your comment! I have published a Mastering VDI Templates document with several important customizations. http://myvirtualcloud.net/?p=929

  4. Thx for the link to your Mastering VDI Templates document, extremely valuable!

  5. “VMware recommended practice stipulates that on average disk latency should be no more than 10ms and 20ms during peak time”

    do you have a pointer to that best practice document, or is it tribal knowledge ?

    also +1 on the alignment comment 🙂

  6. John, I knew those numbers off the top of my head and started to doubt myself after your question.
    However, after a quick search I found not only references on the blogosphere, but also found reference in a performance document from VMware.
    You will find the Performance Troubleshooting for VMware vSphere 4 at http://www.vmware.com/resources/techresources/10066. Page 20.

  7. Thanks Andrew,
    I was worried that my 20ms response time I was using in my blog posts was set too high for vmware best practices. Anecdotally sub 20ms average disk latencies typically result in happy users and applications. I’ll dig through this document now.

    Regards
    John

    • Umesh on 06/29/2012 at 12:53 am

    Thanks for this post. I am also facing such an issue, where I am getting 300 ms latency from disk. This post is pointing me in the right direction to troubleshoot the issue.
