It is not often that a technology amazes me so much that it makes me go from disbeliever to evangelizer. EMC’s FAST Cache is worth every single penny you pay for an EMC storage solution.
Before I go on I need to make a disclosure here – I work for EMC. If you follow my blog and articles you know that I try as much as possible to be vendor neutral and focus specifically on the technology – and that’s what I am doing here in the VDI context.
Most intelligent storage arrays have a small amount of expensive L1 DRAM cache that is not only responsible for caching read and write IOs but also holds pre-fetched sequential IOs. L1 DRAM is an expensive type of memory, therefore storage arrays carry a limited amount of it. As an example, the EMC VNX5300 has 15GB, the VNX5500 has 24GB, the VNX5700 has 36GB and the VNX7500 has 48GB. When I looked at NetApp ECC L1 memory the numbers were somewhat similar; the FAS3140 has 4GB, the FAS3160 has 16GB and the FAS3170 has 32GB (the numbers above are for dual-controller configurations).
What is FAST Cache or EFD (Enterprise Flash Drives) cache?
The EFD cache is placed below the L1 DRAM cache and above the HDD drives, and contains copies of logical blocks resident on HDDs. FAST Cache will serve the following main purposes:
- To extend the functionality of the DRAM cache by mapping frequently accessed data to EFDs, which are an order of magnitude faster than HDDs.
- To provide a much larger, scalable cache by virtue of using EFD drives that can provide larger data capacities per device.
- To improve the benefits of write hits, write coalescing, and write ordering by deferring host writes destined for the HDDs as long as possible.
- To decrease the response time of HDDs on read cache misses by managing workloads through buffering in cache.
How does it work?
The storage system’s primary read/write cache optimally coalesces write I/Os to perform full stripe writes for sequential writes, and prefetches for sequential reads. However, this operation is generally performed in conjunction with slower mechanical storage. FAST Cache monitors the storage processors’ I/O activity for blocks that are being read or written multiple times from storage, and promotes those blocks into the FAST Cache.
Once a block has been promoted, FAST Cache handles the I/O to and from that block. FAST Cache reduces read activity to the backend as well as writes. It also allows the storage processor’s write cache to flush faster, because it’s flushing to high-speed flash drives. This allows the primary cache to absorb a greater number of non-FAST Cache write I/Os. These optimizations reduce the load on mechanical drives and as a result improve overall storage system performance.
(1) Source: H8268-vnx-block-bext-practices
In simple technical words – in near real time, FAST Cache dynamically analyzes I/O and promotes 64KB blocks once they have been accessed three or more times. From there on, the promoted block is served out of FAST Cache, delivering the data much faster. If a block being accessed is already in FAST Cache then write operations are also handled by FAST Cache, and only then flushed to the lower-performing SAS and NL-SAS drives.
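The promotion behavior just described can be sketched as a toy model: count accesses per 64KB extent, promote an extent after its third touch, and serve promoted extents from flash. The threshold, extent size, and all names below are illustrative only; the array’s real algorithm is internal to the VNX and far more sophisticated.

```python
EXTENT_SIZE = 64 * 1024      # promotion granularity in bytes (64KB)
PROMOTE_THRESHOLD = 3        # accesses before an extent is promoted

class FastCacheModel:
    """Highly simplified sketch of the promote-on-third-access policy."""

    def __init__(self):
        self.access_counts = {}   # extent id -> accesses seen so far
        self.promoted = set()     # extents currently served from flash

    def access(self, lba_byte_offset):
        """Record one read/write and report where it would be served from."""
        extent = lba_byte_offset // EXTENT_SIZE
        if extent in self.promoted:
            return "flash"        # hit: served from EFD
        self.access_counts[extent] = self.access_counts.get(extent, 0) + 1
        if self.access_counts[extent] >= PROMOTE_THRESHOLD:
            self.promoted.add(extent)
        return "hdd"              # miss: still served from spinning disk

cache = FastCacheModel()
# The first three touches of the same extent go to HDD; later ones hit flash.
results = [cache.access(128 * 1024) for _ in range(5)]
print(results)   # ['hdd', 'hdd', 'hdd', 'flash', 'flash']
```

Note that the promoting access itself is still served from disk in this sketch; only subsequent touches hit flash.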
In VDI deployments, Read IO, Write IO, IO pattern, IO size, RAID type, pool type, etc. are all extremely important inputs to correctly architect a VDI solution.
I am lucky enough to have access to a multitude of hardware on which I can run tests and validations. Among the large number of tests I was able to execute in the lab, I am presenting a VDI deployment with 200 concurrent virtual desktops managed by VMware View with View Composer.
The workload was generated by LoginVSI with the default heavy profile for approximately one and a half hours. The simulated boot storm was created with 3 LoginVSI launchers, each starting a session every 5 seconds.
The storage array used was an EMC VNX5500 with 12GB of L1 DRAM (single controller) and 2 x 100GB EFD drives for FAST Cache in a RAID 1 configuration. Connectivity to/from the hosts was established over the Fibre Channel protocol.
The VMware View Replica disk was placed in a dedicated pool of Flash Drives with FAST Cache set to Disabled – and Linked Clones were placed in a RAID 5 set with only 5 (five) 15K SAS drives with FAST Cache set to Enabled. The picture below demonstrates the logical architecture.
Note: No technology other than L1 DRAM Cache and FAST cache was in use during the tests.
Note 1: All the numbers presented below are Linked Clone specific. The data for the Replica disks will be discussed in a future article. Replica disks were not served by FAST Cache.
There are numerous reference architectures from EMC for arrays with FAST Cache, but I wanted to run my own tests. My first test objective was to understand the total throughput, in operations per second, required to max out a set of 5 drives in RAID 5 with FAST Cache in use. I was able to max out the RAID set at a 60-second averaged peak of 3848 IOPS (the collector averages every 60s; the effective instantaneous peak is probably a good deal higher).
In a production environment it is recommended to keep average utilization at or below 70%. This environment ran at an average utilization of 59.69%. Therefore, not considering peaks during logon/logoff storms, it is possible to see that 5 disks in RAID 5 with FAST Cache are able to handle a large production workload. The more FAST Cache available, the more IOPS can be delivered.
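The back-of-the-envelope math behind those two paragraphs is worth making explicit. The measured figures come from the test run above; the 70% ceiling is the stated production rule of thumb.

```python
vms = 200
peak_iops_60s = 3848          # 60-second averaged peak for the RAID 5 set
avg_utilization = 0.5969      # measured average utilization of the set
target_utilization = 0.70     # recommended production ceiling

# Peak load per desktop during the 60s averaged peak
peak_iops_per_vm = peak_iops_60s / vms

# How much the workload could still grow before hitting the 70% target
headroom = target_utilization / avg_utilization

print(f"Peak IOPS per desktop: {peak_iops_per_vm:.2f}")        # 19.24
print(f"Growth headroom to the 70% ceiling: {headroom:.2f}x")  # ~1.17x
```

In other words, at steady state the 5-disk RAID 5 set still has roughly 17% of growth room before reaching the recommended utilization ceiling.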
The second very important part of this analysis is understanding response time during the workload. Storage latency is responsible for most performance issues in VDI deployments. High latency in VDI is directly perceived by the end user as slowness, lag, or long screen refresh times.
The recommendation from VMware for virtualized workloads is to never let average latency exceed 10 ms, or peaks exceed 20 ms. The graph below demonstrates that response time had an averaged peak of 3.17 ms; in fact, the overall average never went beyond 2 ms.
Nonetheless, these results are not valid unless a number of other variables are clearly disclosed. All of these variables are essential when estimating storage capacity and performance.
- Average Read Size: 14 KB
- Average Write Size: 12 KB
- Average Read Throughput: 1.34 IOPS per VM
- Average Write Throughput: 7.58 IOPS per VM
- Read/Write IOPS Ratio: 15%/85%
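The per-VM numbers above can be sanity-checked against each other: the read/write split they imply should match the stated 15%/85% ratio, and multiplying out by 200 desktops gives the steady-state load the array had to absorb.

```python
vms = 200
read_iops_per_vm = 1.34
write_iops_per_vm = 7.58

total_per_vm = read_iops_per_vm + write_iops_per_vm   # 8.92 IOPS per VM
total_iops = total_per_vm * vms                       # 1784 IOPS overall
read_ratio = read_iops_per_vm / total_per_vm          # ~0.15

print(f"Steady-state array load: {total_iops:.0f} IOPS")
print(f"Read share: {read_ratio:.0%} (writes: {1 - read_ratio:.0%})")
```

The figures agree: reads come out at 15% of total IOPS, confirming the 15%/85% ratio, and the 200-desktop steady-state load of roughly 1784 IOPS sits comfortably under the 3848 IOPS averaged peak measured earlier.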
The most interesting aspect of this validation is what happens to the SP Cache (L1 DRAM) and FAST Cache during the simulated workload. Without FAST Cache, the DRAM cache would be utilized to its limits (12GB for this single-controller VNX5500), not allowing for workload growth unless a new array with more L1 DRAM is used or new spindles are added to handle the required IO. The graph below demonstrates SP Cache behavior in a different simulation with 100 VMs without FAST Cache.
In my simulated test with 200 VMs and 100GB of FAST Cache, the opposite starts to happen as the workload is absorbed by the array. The graph below demonstrates how FAST Cache frees up resources in L1 DRAM. With more L1 DRAM available, the array is able to use it for IO hits that are not being covered by FAST Cache.
The graphs above are scaled as fractions: 1 = 100%.
The graphs above also demonstrate the greatness FAST Cache delivers. FAST Cache hit ratios for both reads and writes are close to 100%, meaning the large majority of IOs are being served from Flash Drives instead of the slower 15K SAS drives.
FAST Cache will normally consume memory from L1 DRAM Cache when FAST Cache is created. However, FAST Cache reduces the overall number of dirty pages due to faster flushing. That counteracts the reduction in overall write cache pages.
Because FAST Cache is seen by the array as an extension of L1 DRAM, its size is also limited by the amount of L1 DRAM cache available on the array. The VNX7500, with 48GB of L1 DRAM, can support up to 2TB of FAST Cache.
By now you are probably asking why I went from disbeliever to evangelizer. Well, if you are an avid reader you probably know that I am a big advocate of Non-Persistent Desktop Pools for VDI. It just makes all the sense in the world (with a few exceptions).
Two things to consider:
- VDI is all about Write IOs (in this example, 85%)
- In Non-Persistent Pools, desktops are often deleted or refreshed after use
So, how does FAST Cache’s caching mechanism help when every new VM created in a Non-Persistent Pool gets a completely new set of blocks, inherently different from those of the previous VMs?
How does FAST Cache pre-cache blocks before a write is committed by the virtual desktop if those blocks belong to a new VM?
The answer came down to my lack of familiarity with FAST Cache – and thanks to Aaron Patten for putting me on the right track.
FAST Cache works at the LBA (logical block address) level, not at the VMFS or NTFS level. If an LBA block was promoted to FAST Cache during previous operations with an old VM, the block will still be in cache for the new VM, allowing faster Read and Write IO transactions.
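The LBA-level point can be illustrated with the same kind of toy promotion model: extents promoted while an old linked clone was active stay promoted, so a refreshed clone whose writes land on the same LBAs hits flash on its very first access. The 3-access threshold and all names here are illustrative, not the array’s actual internals.

```python
PROMOTE_THRESHOLD = 3   # illustrative promote-on-third-access policy

class ToyCache:
    """Minimal sketch: promotion state is keyed by LBA extent only."""

    def __init__(self):
        self.counts, self.promoted = {}, set()

    def access(self, extent):
        if extent in self.promoted:
            return "flash"
        self.counts[extent] = self.counts.get(extent, 0) + 1
        if self.counts[extent] >= PROMOTE_THRESHOLD:
            self.promoted.add(extent)
        return "hdd"

cache = ToyCache()
for _ in range(4):        # the "old" linked clone warms up extent 42
    cache.access(42)
# The clone is now refreshed. The new clone's IO lands on the same LBA
# extent, and its very first access is already a flash hit -- nothing in
# the cache needs to know that the VM on top of those blocks changed.
print(cache.access(42))   # flash
```

This is why non-persistent pools, which constantly recycle the same regions of the LUN, play so well with an LBA-level cache.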
There are, of course, a number of nuances to how effective FAST Cache is for different workloads. Data locality, active data set size, and pool tier sizes are some of the variables. For VDI workloads, however, it works a treat!
If you are about to size or purchase a storage array for your VDI solution make sure that the chosen solution is able to intelligently handle Write IOs (remember, it’s all about the Write IOs) in an effective manner that will allow you to scale your environment without having to replace the entire array in the future.
To the best of my knowledge, there are many solutions on the market that promise to offload Read IOs, but not many that can optimally handle Write IOs. I’ll leave the vendor and product research to you.
FAST Cache is a great solution, but it should not be the only or foremost approach to reducing IO impact and improving performance. In my article How to offload Write IOs from VDI deployments I give examples of ways to architect VDI solutions for lower IOPS.
I have also been looking at 3rd-party solutions such as Atlantis Computing ILIO and will soon share my test results here.