Nov 28 2011

EMC FAST Cache effectiveness with VDI

It is not often that a technology amazes me so much that it turns me from disbeliever into evangelizer. EMC’s FAST Cache is worth every single penny you pay for an EMC storage solution.

Before I go on I need to make a disclosure here – I work for EMC. If you follow my blog and articles you know that I try as much as possible to be vendor neutral and focus specifically on the technology – and that’s what I am doing here in the VDI context.

Most intelligent storage arrays have a small amount of L1 DRAM cache that is not only responsible for caching read and write IOs but also holds pre-fetched sequential IOs. L1 DRAM is an expensive type of memory, therefore storage arrays ship with a limited amount of it. As an example, the EMC VNX5300 has 15GB, the VNX5500 has 24GB, the VNX5700 has 36GB and the VNX7500 has 48GB. When I looked at NetApp ECC L1 memory the numbers were somewhat similar: the FAS3140 has 4GB, the FAS3160 has 16GB and the FAS3170 has 32GB (the numbers above are for dual-controller configurations).

 

What is FAST Cache or EFD (Enterprise Flash Drives) cache?

The EFD cache sits below the L1 DRAM cache and above the HDDs, and contains copies of logical blocks resident on the HDDs. FAST Cache serves the following main purposes:

  1. To extend the functionality of the DRAM cache by mapping frequently accessed data to EFDs, which are an order of magnitude faster than HDDs.
  2. To provide a much larger, scalable cache by virtue of using EFD drives that offer larger data capacities per device.
  3. To improve the benefits of write hits, write coalescing, and write ordering by deferring host writes destined for the HDDs as long as possible.
  4. To decrease the response time of HDDs on read cache misses by managing workloads through buffering in cache.

[Figure: FAST Cache positioned between the DRAM cache and the HDDs]

 

How does it work?

The storage system’s primary read/write cache optimally coalesces write I/Os to perform full stripe writes for sequential writes, and prefetches for sequential reads. However, this operation is generally performed in conjunction with slower mechanical storage. FAST Cache monitors the storage processors’ I/O activity for blocks that are being read or written multiple times from storage, and promotes those blocks into FAST Cache.

Once a block has been promoted, FAST Cache handles the I/O to and from that block. FAST Cache reduces both read and write activity to the backend. It also allows the storage processor’s write cache to flush faster, because it is flushing to high-speed flash drives. This allows the primary cache to absorb a greater number of non-FAST Cache write I/Os. These optimizations reduce the load on mechanical drives and as a result improve overall storage system performance.
(1) Source: H8268-vnx-block-best-practices

In simple technical terms – FAST Cache dynamically analyzes I/O in near real time and promotes 64KB blocks once they have been accessed three or more times. From there the promoted block is served out of FAST Cache, delivering the data much faster. If the block being accessed is already in FAST Cache then write operations are also handled by FAST Cache, and only later flushed to the slower SAS and NL-SAS drives.
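To make the promotion mechanics concrete, here is a minimal Python sketch of the behavior described above: a 64KB block is promoted after three accesses, and subsequent IO to that block is served from flash. It is purely illustrative – the names, data structures and the absence of any eviction logic are my simplifications, not EMC’s implementation.

```python
# Minimal sketch of the FAST Cache promotion idea described above.
# NOT EMC's implementation; names and structures are illustrative only.
# Assumptions: 64KB promotion granularity, promotion after 3 accesses.

BLOCK_SIZE = 64 * 1024          # 64KB promotion granularity
PROMOTE_THRESHOLD = 3           # accesses before a block is promoted

access_counts = {}              # 64KB-aligned block -> access counter
fast_cache = set()              # blocks currently promoted to EFD

def handle_io(lba: int) -> str:
    """Return where a 64KB-aligned block would be served from."""
    block = lba - (lba % BLOCK_SIZE)
    if block in fast_cache:
        return "FAST Cache (EFD)"           # reads and writes served from flash
    access_counts[block] = access_counts.get(block, 0) + 1
    if access_counts[block] >= PROMOTE_THRESHOLD:
        fast_cache.add(block)               # copy the block from HDD to EFD
    return "HDD (SAS/NL-SAS)"               # not yet promoted

# Example: the third access triggers promotion; the fourth is a cache hit.
for i in range(4):
    print(i + 1, handle_io(1_000_000))
```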

 

Workload

In VDI deployments, read IO, write IO, IO pattern, IO size, RAID type, pool type, etc. are all extremely important to correctly architect the solution.

I am lucky enough to have access to a multitude of hardware where I can run tests and validations. Among the many tests I was able to execute in the lab, I am presenting here a VDI deployment with 200 concurrent virtual desktops managed by VMware View with View Composer.

The workload was generated by LoginVSI with the default heavy profile for approximately 1.5 hours. The simulated boot storm was created with 3 LoginVSI launchers, each starting a session every 5 seconds.
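As a quick back-of-the-envelope check (my arithmetic only, assuming the three launchers run in parallel at a constant rate), those launcher settings imply a ramp-up of roughly five and a half minutes to get all 200 sessions started:

```python
# Back-of-the-envelope ramp-up time for the simulated storm described above.
# Assumption (mine): the three launchers run in parallel, each starting one
# session every 5 seconds, until all 200 desktops have an active session.

desktops = 200
launchers = 3
interval_s = 5                                  # seconds between sessions per launcher

sessions_per_second = launchers / interval_s    # 0.6 new sessions per second
ramp_up_s = desktops / sessions_per_second      # ~333 seconds

print(f"~{ramp_up_s:.0f} s (~{ramp_up_s / 60:.1f} min) to start all sessions")
```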

The storage array used was an EMC VNX5500 with 12GB of L1 DRAM (single controller) and 2 x 100GB EFD drives for FAST Cache in a RAID 1 configuration. Connectivity to/from the hosts was established using the Fibre Channel protocol.

The VMware View Replica disk was placed in a dedicated pool of Flash Drives with FAST Cache set to Disabled, and Linked Clones were placed in a RAID 5 set with only 5 (five) 15K SAS drives with FAST Cache set to Enabled. The picture below demonstrates the logical architecture.

 

[Figure: logical architecture of the test environment]

Note 1: No technology other than L1 DRAM Cache and FAST Cache was in use during the tests.
Note 2: All the numbers presented below are Linked Clone specific. The data for the Replica disks will be discussed in a future article. Replica disks were not served by FAST Cache.

 

Results

There are numerous reference architectures from EMC for arrays with FAST Cache, but I wanted to run my own tests. My first test objective was to understand the total throughput in operations per second (IOPS) required to max out a set of 5 drives in RAID 5 with FAST Cache in use. I was able to max out the RAID set with a 60-second averaged peak of 3,848 IOPS (the collector averages every 60 seconds; therefore the effective peak number is probably a lot higher).

In a production environment it is recommended to keep average utilization at or below 70%. This environment had an average utilization of 59.69%. Therefore, not considering peaks during logon/logoff storms, it is possible to see that 5 disks in RAID 5 with FAST Cache are able to handle a large production workload. The more FAST Cache is available, the more IOPS can be delivered.
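To put the headroom argument in numbers, here is a rough illustration (my own arithmetic, assuming utilization scales more or less linearly with desktop count, which is only a first-order approximation):

```python
# Quick check of the utilization numbers above against the 70% guideline.
# Assumption (mine): utilization scales roughly linearly with desktop count,
# which is only a rough first-order estimate, not a measured result.

avg_utilization = 59.69          # % observed during the 200-desktop test
target_utilization = 70.0        # % recommended ceiling for production
desktops = 200

headroom_pct = target_utilization - avg_utilization
est_max_desktops = desktops * (target_utilization / avg_utilization)

print(f"Headroom to the 70% guideline: {headroom_pct:.2f} percentage points")
print(f"Rough linear estimate of desktops at 70% utilization: ~{est_max_desktops:.0f}")
```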

 

[Figure: throughput and disk utilization during the test]

 

The second very important part of this analysis is to understand the response time during the workload. Storage latency is responsible for most performance issues in VDI deployments. High latency in VDI is directly perceived by the end user as slowness, lag, or long screen refresh times.

The recommendation from VMware for virtual workloads is to never have averages above 10 ms, or peaks above 20 ms. The graph below demonstrates that response time had an averaged peak of 3.17 ms. In fact, the total average never went beyond 2 ms.

 

[Figure: response time during the workload]

 

Nonetheless, these results are not meaningful unless a number of other variables are clearly disclosed. All of these variables are essential when estimating storage capacity and performance; a quick sanity check based on them follows the list below.

  • Average Read Size: 14 KB
  • Average Write Size: 12 KB
  • Average Read Throughput: 1.34 IOPS per VM
  • Average Write Throughput: 7.58 IOPS per VM
  • Read/Write IOPS Ratio: 15%/85%
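Here is the quick sanity check mentioned above. Multiplying the per-VM figures out to 200 desktops reproduces the 15%/85% read/write split, and applying the textbook RAID 5 write penalty of 4 (a rule of thumb, not something measured in this test) shows what the back-end load would look like if every write had to land directly on the 5-drive RAID 5 set:

```python
# Rough front-end and back-end load implied by the per-VM numbers above.
# The RAID 5 write penalty of 4 is the standard rule of thumb, not a value
# taken from this test; it illustrates why offloading writes matters.

vms = 200
read_iops_per_vm = 1.34
write_iops_per_vm = 7.58
raid5_write_penalty = 4                        # back-end I/Os per host write on RAID 5

front_end_reads = vms * read_iops_per_vm       # ~268 IOPS
front_end_writes = vms * write_iops_per_vm     # ~1516 IOPS
front_end_total = front_end_reads + front_end_writes

# Back-end load if every write had to land directly on the 5-drive RAID 5 set:
back_end_no_cache = front_end_reads + front_end_writes * raid5_write_penalty

print(f"Front-end average: ~{front_end_total:.0f} IOPS "
      f"({front_end_writes / front_end_total:.0%} writes)")
print(f"Back-end average without write offload: ~{back_end_no_cache:.0f} IOPS")
```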

The most interesting aspect of this validation is what happens to the SP Cache (L1 DRAM) and FAST Cache during the simulated workload. Without FAST Cache the DRAM cache would be utilized to its limit (12GB for this VNX5500 with a single controller), not allowing for workload growth unless a new array with more L1 DRAM is used or new spindles are added to handle the required IO. The graph below demonstrates SP Cache behavior in a different simulation with 100 VMs and without FAST Cache.

 

[Figure: SP Cache utilization with 100 VMs and no FAST Cache]

 

In my simulated test with 200 VMs and 100GB of FAST Cache the opposite starts to happen as the workload is absorbed by the array. The graph below demonstrates how FAST Cache frees up resources in L1 DRAM. With more L1 DRAM available, the array is able to use it for IO hits that are not being covered by FAST Cache.

 

[Figure: SP Cache and FAST Cache utilization and hit ratios with 200 VMs]

The graphs above are expressed as a fraction: 1 = 100%.

 

The graphs above also demonstrate the benefit FAST Cache is delivering. FAST Cache hit ratios for both reads and writes are close to 100%. This means that the vast majority of IOs are being served from flash drives instead of the slower 15K SAS drives.

FAST Cache consumes some memory from the L1 DRAM cache when it is created. However, FAST Cache reduces the overall number of dirty pages thanks to faster flushing, which counteracts the reduction in overall write cache pages.

Because FAST Cache is seen by the array as an extension of L1 DRAM, its maximum size is also tied to the amount of L1 DRAM cache available on the array. The VNX7500 with 48GB of L1 DRAM can support up to 2TB of FAST Cache.

By now you are probably asking why I went from disbeliever to evangelizer. Well, if you are an avid reader you probably know that I am a big advocate of non-persistent desktop pools for VDI. It just makes all the sense in the world (with a few exceptions).

 

Two things to consider:

  1. VDI is all about write IOs (in this example, 85%)
  2. In non-persistent pools the desktops are often deleted or refreshed after use.

So, how does the FAST Cache caching mechanism help when every new VM created in a non-persistent pool has a completely new set of blocks, inherently different from the previous VMs?

How does FAST Cache pre-cache blocks before a write is committed by the virtual desktop if those blocks belong to a new VM?

That was my lack of familiarity with FAST Cache talking… and thanks to Aaron Patten for putting me on the right track.

FAST Cache operates at the LBA block level, not at the VMFS or NTFS level. If an LBA block was promoted to FAST Cache during previous operations with an old VM, the block will still be in cache for the new VM, allowing faster read and write IO transactions.
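A tiny sketch of that point, with purely illustrative names: because the cache is keyed on the LUN and LBA rather than on any file or VM, a freshly created clone that lands on the same addresses inherits the blocks already promoted by its predecessor.

```python
# Sketch of the point above: FAST Cache tracks LBAs on the LUN, not files or
# VMs, so blocks promoted while an old linked clone lived at a given address
# stay hot when a new clone reuses that address. Names are illustrative only.

promoted = {("LUN_7", 0x0040_0000)}            # block promoted by the old VM

def serve_io(lun: str, lba: int, vm: str) -> str:
    """The cache does not know which VM (or filesystem) owns the block."""
    if (lun, lba) in promoted:
        return f"{vm}: hit, served from EFD"
    return f"{vm}: miss, served from HDD"

print(serve_io("LUN_7", 0x0040_0000, "old-clone-01"))   # hit
# Pool refresh: old-clone-01 is deleted, new-clone-17 lands on the same blocks.
print(serve_io("LUN_7", 0x0040_0000, "new-clone-17"))   # still a hit
```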

There are, of course, a number of nuances around how effective FAST Cache is for different workloads. Data locality, active data set size, and pool tier sizes are some of the variables. However, for VDI workloads it works a treat!

     

If you are about to size or purchase a storage array for your VDI solution, make sure the chosen solution is able to intelligently handle write IOs (remember, it’s all about the write IOs) in a manner that will allow you to scale your environment without having to replace the entire array in the future.

To the best of my knowledge, there are many solutions on the market that promise to offload read IOs, but not many can optimally handle write IOs. I’ll leave the vendor and product research to you.

     

FAST Cache is a great solution, but it should not be the only tool you rely on to reduce IO impact and improve performance. In my article How to offload Write IOs from VDI deployments I give examples of ways to architect VDI solutions for lower IOPS.

I have also been looking at 3rd-party solutions such as Atlantis Computing ILIO, and I will soon share my test results here.

8 comments

4 pings

    1. Tobias

      Big problems with EMC fast cache in Sweden :/

    2. Peter Wilson

      I guess you could always mention IBM EasyTier which provides the same function….

      Pete

    3. Andre Leibovici

      @Peter Wilson
      I did a bit of googling but could not fully understand if IBM EasyTier is a fully automated storage tiering solution that moves blocks across tiers over time during scheduled windows (overnight), or if it moves data dynamically and in real time to SSD during the day when required. Would you know?

      EMC FAST Cache will dynamically, and in real time, move 64KB sub-slices to SSD from SAS and NL-SAS when required during production hours. EMC FAST VP will move bigger chunks of data (1GB) across different tiers as the blocks get hotter or colder.

      Andre

    4. Dave

      Andre,

      Have you ever looked at the Whiptail Technologies solution? They make an appliance full of SSDs that is purpose built for VDI. They claim 250,000 IOPS at a fraction of the cost of doing this using solutions from companies like NetApp, EMC, etc.

      Dave

    5. Andre Leibovici

      @Dave
      Whiptail is an SSD-only storage appliance solution. It’s very fast due to its SSD nature, but when the discussion turns to capacity the $ will skyrocket very fast. In non-persistent VDI deployments performance is key and capacity is secondary. In this case Whiptail, or any other similar solution, will provide you with enough IOPS.

      The storage appliance deployment model makes sense for non-persistent pools if you are comfortable dedicating storage arrays to VDI only.

      Drawbacks IMO

      Who is Whiptail? A start-up no one had heard of 2 years ago. What is their ability to support enterprise deployments in large organizations? How are they funded?
      The solution has no block dedup, no inline dedup, no L2 DRAM cache.
      There are other players in the startup market doing more intelligent things.

      My take – large storage vendors are playing catch-up here, but I have seen what they have in their delivery pipeline and I think in many cases I would stick to the traditional deployment mechanics.

      Andre

    6. Rehan

      @Andre Leibovici
      Hi Andre,

      Although this is an old post and hopefully you have already found the answer to your question, yes, EasyTier does move data continuously 24×7; it always has. The period used to analyse extents for promotion and demotion is 24 hours, based on an intelligent algorithm.

    7. Aya saito

      Great post! Let me confirm that cm full clone would not be effective with FAST Cache?

    8. Andre Leibovici

      Aya, thanks for your comment. I have been away from this technology for far too long to make any comments. However, I would suggest looking at hyper-converged platforms. They easily solve the performance and scalability challenges in VDI.

    1. - Cliff Davies

      […] to serve a larger number of IOPs without having to utilize an all SSD based array. Find out more here. […]

    2. - Cliff Davies

      […] If you would like to know more about FAST Cache I suggest reading my article EMC FAST Cache effectiveness with VDI. […]

    3. myvirtualcloud.net » View Storage Accelerator Performance Benchmark

      […] write intensive workloads. For write intensive offloading I recommend you to look at solutions like EMC FAST Cache and Atlantis […]

    4. Un-architecting VDI with XtremIO « PROJECT: Virtual

      […] intelligent caching.  In a great article (http://myvirtualcloud.net/?p=2502) , Andre Leivbovici does a great job in laying out why putting in some FAST Cache from EMC in front […]
