How to offload Write IOs from VDI deployments

One of the major concerns in a VDI solution is the end-user experience. Poor user experience = Low user acceptance. I have discussed this paradigm in my article VDI USER EXPERIENCE and USER ACCEPTANCE.

Behind the scenes, hidden in the supporting physical infrastructure, there are a number of factors contributing to poor user experience. The most common pain points are network and storage.

Networking is commonly an issue when dealing with remote and branch offices. However, storage may well be the most common bottleneck in a VDI implementation if not designed properly.

In the past I have discussed storage IO patterns, read/write skews, boot storms, login storms, dedicated replica datastores, IO splits, etc. I particularly recommend reading The biggest Linked Clone “IO” Split Study for detailed information on IO behavior in a VDI environment.

The graph below shows the number of read and write IOs on Linked Clone disks. The old saying “VDI is write intensive” can be observed in this graph.

IOPS and RAID type define the storage layout and the number of disks required. The RAID type sets the number of spindles needed to support the workload based on the amount of IOPS and the read/write ratio – especially write IO, given that RAID adds a write penalty that depends on the RAID type chosen. RAID5 typically adds a write penalty of 4, whilst RAID10 adds a write penalty of 2.

VM IO = VM Read IO + (VM Write IO * RAID Penalty)
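As a quick sanity check, the formula can be turned into a small calculator. A minimal sketch in Python; the desktop count and read/write split below are illustrative assumptions, not figures from this article:

```python
def backend_iops(read_iops, write_iops, raid_penalty):
    """Backend (spindle-facing) IOPS: reads pass straight through,
    writes are amplified by the RAID write penalty."""
    return read_iops + write_iops * raid_penalty

# Illustrative profile: 500 desktops at 10 IOPS each,
# 20% reads / 80% writes ("VDI is write intensive").
reads, writes = 1000, 4000

print(backend_iops(reads, writes, raid_penalty=4))  # RAID5  -> 17000
print(backend_iops(reads, writes, raid_penalty=2))  # RAID10 -> 9000
```

Note how the same 5000 front-end IOPS turn into nearly twice the backend load on RAID5 versus RAID10, which is why the write ratio matters so much when sizing spindles.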


What practices can be used to offload write IOs from the storage array? IOs can be reduced at the source, de-duplicated, single-instanced, or served from cache.

Storage vendors have been providing solutions that help reduce the number of IOs hitting the spindles, therefore driving down the number of disks required. Those solutions range from flash drives for caching to automated storage tiering, flash RAM and others. The objective of this article is to discuss techniques that help reduce or prevent IOs from hitting the drives (spindles).


1. Windows Customization

Windows customization not only helps to reduce memory footprint and CPU cycles, but also reduces the number of IOs. There are a number of resources available to help you customize master images.

2. Application Virtualization

Application virtualization tools such as ThinApp, App-V or XenApp are a superb way to offload IOs – mostly read IOs – from the storage array. Application virtualization also provides operational benefits, including ease of management and easier application upgrades and rollouts.

Many of the application virtualization products also allow for single instancing. This means that the application exists only once in storage, as opposed to being installed in each virtual desktop.

Single instancing by itself does not reduce the number of IOs required to serve applications, since all users will still be accessing that single-instanced application. However, most storage arrays are able to cache heavily accessed blocks in DRAM. Single-instanced applications allow storage arrays to keep the application in DRAM cache or extended cache, serving data much faster.

Picture (A) demonstrates storage cache utilization without application virtualization; picture (B) demonstrates utilization of storage cache with application single instancing.




3. .vswp offload to local storage

When designing VDI solutions on vSphere, one of the many ways to reduce shared storage consumption is to allow VM Swap files (.vswp) placement on host local storage. A .vswp file is automatically created by ESXi when the desktop is Powered On, and deleted when the VM is Powered Off.

Benefits of offloading .vswp file to local storage are the reduction of the storage footprint and the offload of read and write IOs from shared storage to local storage.

When offloading .vswp file to local storage it is important to make sure that local storage on the host is capable of providing the required number of IOs to support the virtual desktops on the host. It is recommended to use Flash drives for .vswp files on local storage.
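Since a .vswp file is sized as the VM's configured memory minus its memory reservation, the local flash capacity needed for swap files can be estimated with a quick sketch. The per-host desktop count and memory figures below are hypothetical:

```python
def vswp_size_gb(configured_mem_gb, reservation_gb):
    """A .vswp file is sized as configured memory minus the memory reservation."""
    return max(configured_mem_gb - reservation_gb, 0)

# Hypothetical host: 100 desktops, 2GB configured, 0.5GB reserved each.
desktops = 100
per_vm = vswp_size_gb(2.0, 0.5)              # 1.5 GB per desktop
needed = desktops * per_vm
print(f"{needed:.0f} GB of local flash needed for .vswp files")  # 150 GB
```

Setting a full memory reservation shrinks the .vswp files to zero, which is the other common way to eliminate this footprint entirely.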

If you are interested in how to implement .vswp offload read Save [VDI] Storage using VM Swap File Host Placement.


4. Antivirus IO Offloading (vShield Endpoint)

vSphere introspection capabilities through vShield Endpoint considerably reduce CPU cycles, memory consumption and storage IO. vShield Endpoint plugs directly into vSphere and consists of a virtual appliance (delivered by VMware partners), a driver for virtual machines to offload file events, and VMware Endpoint security (EPSEC) to link the first two components at the hypervisor layer.


Both the McAfee and Symantec solutions require that a separate instance of the AV agent run in each virtual machine. Trend Micro Deep Security requires one instance of its virtual appliance per host. Numbers and graph from Tolly Test Report #211101.



Note: McAfee MOVE does not use introspection, but achieves similar results by using the network layer to offload AV operations.


5. Storage Level Caching and Write Buffering

Read caching makes sure highly active data is served from flash drives or RAM. EMC FAST Cache technology dynamically absorbs unpredicted spikes by buffering the most accessed blocks in DRAM and flash drives. The movement of data is dynamic, near real-time, and happens in 64KB chunks, which is ideal for bursty data. Other storage vendors have different technologies that achieve similar results.

The biggest storage problem is write IO, aka the write penalty. Write buffering can be done in a couple of different ways – at the host level or at the storage level. EMC FAST Cache also helps to accommodate write IOs in an extended cache when spindles are too busy to deal with them. This technique alleviates the latency responsible for poor end-user experience and, AFAIK, is the only shared-storage-side caching solution that also alleviates write IOs.

Tests with persistent desktops demonstrate that 9 out of 10 IOs may be served from cache using FAST Cache. If you are interested in finding out more about EMC FAST Cache technology, read this detailed paper.
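The impact of that hit rate on the spindles can be sketched in a couple of lines; the 5000 front-end IOPS figure below is illustrative, only the 90% hit rate comes from the tests above:

```python
def spindle_iops(total_iops, cache_hit_ratio):
    """IOs that miss cache and must still be served by spindles."""
    return total_iops * (1 - cache_hit_ratio)

# 5000 front-end IOPS with 9 out of 10 IOs served from FAST Cache.
print(round(spindle_iops(5000, 0.9)))  # 500 -> a 10x reduction in spindle load
```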




6. Host Level Caching and De-duplication

A few start-ups have been creating industry furor with their technology and their ability to boost VDI performance beyond what shared storage arrays can provide today. One of the products I have been paying attention to is Atlantis ILIO.

Atlantis ILIO promises up to 90% reduction in IO load on the storage infrastructure. ILIO achieves those numbers by re-sequencing read/write operations from small random IOs into large sequential IOs, processing IO in real time from host memory, and instantly characterizing IO based on Windows NTFS file system characteristics.

ILIO is a software virtual appliance that sits on each host, or in a “top-of-rack” configuration where hosts connect to a single Atlantis ILIO virtual appliance. The storage is then presented to hosts via NFS or iSCSI.

The base ILIO appliance uses 22GB of RAM per host and supports up to 65 virtual desktops. Additional desktops may be added at a cost of 150MB of RAM each. Because of the RAM consumed, the VDI consolidation ratio is reduced, increasing the costs associated with RAM, hosts, and VDI licensing, depending on the solution in use.
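Using the sizing figures above, the per-host RAM overhead can be estimated with a quick sketch (only the 22GB base, 65-desktop threshold, and 150MB increment come from the text; the 100-desktop example is mine):

```python
def ilio_ram_gb(desktops, base_gb=22, base_desktops=65, extra_mb=150):
    """Estimated ILIO appliance RAM per host: 22GB base covers up to
    65 desktops, each additional desktop costs 150MB."""
    extra = max(desktops - base_desktops, 0)
    return base_gb + extra * extra_mb / 1024

print(ilio_ram_gb(65))             # 22.0
print(round(ilio_ram_gb(100), 1))  # 22 + 35*150MB ~= 27.1
```

At roughly 27GB for 100 desktops, the appliance overhead is on the order of 270MB of RAM per desktop, which is the number to weigh against the cost of extra spindles.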

To be cost-effective, Atlantis ILIO licensing and host RAM costs, plus any additional hosts, blade enclosures, and VDI licenses required, would have to cost substantially less than adding spindles to the storage infrastructure. I have yet to do this calculation.

I have done some ILIO lab tests that look promising from a technology and performance standpoint. Does it make sense from a $/performance standpoint? I will be publishing the results in my next blog post.


Next Steps

For most VDI deployments the first five options will considerably reduce the number of IOs, consequently improving storage response times, reducing latency and increasing consolidation ratios. Option 6 can be investigated in extreme cases where the number of IOPS per virtual desktop is very high.

For the past few weeks I have been collecting data from my VDI Read/Write Ratio Challenge. Please, take a minute to help us understand the overall IO pattern of VDI deployments.



    • Matt Cowger on 11/07/2011 at 2:55 pm


    Couple comments:

    1) You say that “RAID5 adds a write penalty of 4”. This is only true for RAID5 3+1, which is not always the case. Many arrays support other levels – look at VNX, 3PAR, Clariion, etc. – and as a result suffer different levels of write amplification depending on the RAID level. Assuming 3+1 isn’t even a good choice – the top array vendor for VMware (EMC) uses 4+1 as best practice. A proper calculation would take into account the RAID stripe size.

    2) Using cache to service a write IO doesn’t offload the write from the array. It still has to be handled by the controller/service processor, move through cache, and eventually hit the disks. As you know, in most scenarios, given the performance of SSDs, it’s the array processor that gets pegged before the SSDs. FAST Cache and related technologies help a lot by offloading the IOs from the spinning disks (the slowest piece), but not from the array.

    • Andy on 11/08/2011 at 4:10 pm

    Matt – can you explain the different RAID 5 write amplification penalties? I have only seen a 4:1 penalty referenced before. Can you provide further reading? Nothing obvious on Google to support otherwise.

    • Andy on 11/08/2011 at 5:55 pm

    After more searching, still nothing correlating the quantity of disks in a RAID 5 parity set and the write penalty.

    Here is a great summarization of where the 4 penalty I/O’s come from, I have added (quantity) to highlight the I/Os:

    “when the data is changed the old data (1) and parity data (2) are read, XOR’d to remove the old data’s parity contribution, then the new strip is XOR’d with that to get the new parity calculation, then the new data strip (3) and parity strip (4) are written to the disk. ”

  1. Looking forward to the Atlantis results. I have been utilizing it in our lab and doing some admittedly unscientific tests, but am seeing good results. In the last one I did, I saw a peak of 300 IOPS during login and heavy web-surfing for 5 minutes be reduced to 35 on the backend VNX datastore. This is one of the most intriguing storage-related startups I’ve seen to date. The only concern I still have is the same one you do: have they priced it appropriately to make it compelling?

  2. @Hoosier Storage Guy
    I am currently running some medium test workloads with approximately 100 VMs on ILIO. As soon as I am able to consolidate the results I should be able to write a post and share them. If you are running a VNX, it may be worth looking at your read and write FAST Cache ratios. They REALLY help to offload IOs from the spindles.


