If you have been following my recent posts, you probably noticed that a few of them relate to EMC storage arrays with FAST Cache and FAST VP implementations. If you would like to know more about how FAST Cache works, I suggest reading my article EMC FAST Cache effectiveness with VDI.
No, I am not becoming a storage blogger. However, I have been doing a good number of tests and simulations to validate certain conditions for VDI workloads. FAST Cache and FAST VP are such powerful features on EMC arrays that in many cases they are underestimated.
Make sure you watch the following two videos:
As a result of the 2011 EMC VDI/EUC Team Summit held in Santa Clara, CA, the team members of this group decided that one of the summit outcomes would be the validation of a different approach to sizing storage for VDI deployments.
This new approach would have to be easily scalable and replicable. The team deliberated for a few days on a number of subjects such as capacity, IOPS, data locality, linked clones, lazy clones, datastore sizes, FAST Cache utilization, FAST VP utilization, and Storage Processor utilization, amongst many other metrics and components.
The model validated is rather simple and applicable to most VDI implementations of any size when used in conjunction with FAST Cache and FAST VP:
- 3% EFD
- 25% SAS
- 72% NL-SAS
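To make the building block concrete, here is a minimal sketch of how the 3/25/72 mix nets out against a usable-capacity target. The helper name and the ~4.8TB example figure are illustrative assumptions, not part of any EMC sizing tool:

```python
# Illustrative only: split a usable-capacity target across the three tiers
# using the 3% EFD / 25% SAS / 72% NL-SAS model described above.
TIER_MIX = {"EFD": 0.03, "SAS": 0.25, "NL-SAS": 0.72}

def tier_capacities(total_usable_gb):
    """Return the usable GB each tier contributes under the model."""
    return {tier: round(total_usable_gb * pct) for tier, pct in TIER_MIX.items()}

# ~4.8TB usable, roughly the pool size used later in this article
print(tier_capacities(4800))  # {'EFD': 144, 'SAS': 1200, 'NL-SAS': 3456}
```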
Please note that the architecture mentioned in this article is not an EMC blueprint, nor is it an approved reference architecture. The conclusions below may NOT be valid from an EMC supportability standpoint, and they would have to be validated and approved by EMC technical marketing and/or the solutions group.
I was in charge of tests and validations. To make sure I was successful I invited a storage guru on the vSpecialist team to help me out with the analysis – Joel Sprouse was instrumental in getting all the tests accomplished. I keep telling him that he should start his own blog – hopefully this will get him started.
To validate the model 8 ESX servers with VMware View 5.0 were utilized, along with Nexus switches, and an EMC VNX5500 with no other workload competing for storage resources.
For this setup, a single FAST pool with only (9) drives, (3) from each tier of storage (EFD, SAS, and NL-SAS), was configured for a 500-user VDI deployment with non-persistent desktops. Additionally, FAST Cache was configured with 2 x 100GB EFD.
Specifically, it was important to validate that the following scenarios would behave properly during the workload:
- Initial provisioning of 500 desktops
- Login storm of 500 users
- Steady state “medium” user compute workload with 500 concurrent users
- VNX O/E for block: 05.31.000.5.509
- VNX O/E for file: 126.96.36.199
- NFS VAAI Plugin: 1.0-10
- View Composer: 2.7.0-481620
- NX-OS: 5.0(3)N2(2a)
- Processor Sockets: 2
- Cores per Socket: 4
- Processor Model: Intel Xeon E5640
- Processor Speed (GHz): 2.67
- Memory Capacity (GB): 96
- Network Connectivity: 2 x 10Gb
- Connectivity (FILE): 2 x 10Gb (active/standby)
- Connectivity (BLOCK): n/a (not currently cabled)
The testbed was configured using (6) NFS volumes:
- 1 x 100GB for replicas
- 5 x 600GB for linked clones
The NFS filesystems were provisioned using a RAID-5 FAST Pool with 3 x 100GB EFD, 3 x 300GB SAS, and 3 x 2TB NL-SAS drives. (4:13:83 tier capacity ratio)
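As a quick sanity check on that ratio, the raw drive capacities can be netted out. This is a trivial sketch for the reader, not part of the original test harness:

```python
# Raw capacity per tier in the test pool: (3) drives of each class.
raw_gb = {"EFD": 3 * 100, "SAS": 3 * 300, "NL-SAS": 3 * 2000}
total_gb = sum(raw_gb.values())  # 7,200GB raw

for tier, gb in raw_gb.items():
    print(f"{tier}: {100 * gb / total_gb:.1f}%")
# EFD: 4.2%, SAS: 12.5%, NL-SAS: 83.3% -- i.e. the 4:13:83 ratio quoted above
```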
Storage Setup: an additional (2) x 100GB EFDs were configured as FAST Cache:
For the initial 500-user VDI test, a single RAID-5 mixed disk storage pool was created using (3) drives for each of the (3) classes of available storage EFD, SAS, and NL-SAS.
Using the FAST pool, (10) equally sized LUNs were created and mapped for file provisioning:
As seen below, the mapped FAST pool properties passed directly through for file provisioning:
Using the mapped FAST pool, (6) filesystems were created using thin provisioning with auto-extension:
The filesystems for linked clones (aka viewlab-osdisk-xx) were configured using an initial size of 50GB with autoextension enabled to support a size of up to 600GB if necessary (the high water mark was reduced from the default of 90% down to 60%):
The replica filesystem (viewlab-replica-01) was configured identically to the linked clone filesystems, except it used an initial size of 20GB with autoextension enabled to support a size of up to 100GB (note that the screenshot below indicates that this filesystem has extended to the current size of 39.5GB).
The filesystems were mounted using the uncached, noscan, and nonotify mount options:
And presented to a storage network via a single 10Gb interface (fail-safe networking is used for redundancy).
The NFS datastores were mounted on (8) ESXi servers, and a floating pool of 500 desktops was created using VMware View 5.0 and Composer 2.7:
Creation of 500 User Desktops
The first step in the testing process was to create 500 user desktops. A floating desktop pool was built using the settings designated above. Once the provisioning process started, we observed an increase in IO to the storage system, as expected. The increase in write activity was consistent, averaging about 2,000 NFS write operations per second. Conversely, the read activity was staggered in a zig-zag pattern, ranging between 2,000 and 18,000 read operations per second, as seen below:
And when we looked at throughput, we again found that write activity is pretty consistent at about 17MB/sec while read throughput varies substantially ranging between 25MB/sec and 400MB/sec at peak.
What’s interesting to note with this workload is the read cache hit ratio on the NFS server itself. Looking at the graph below note that the read throughput from the storage processor complex is steady at typically less than 5MB/sec while the NFS output regularly bounces at or above 200MB/sec.
NFS Server and Storage Processor utilization are well below the 70% threshold.
And the read/write latencies are well below the generally accepted level of 15,000 microseconds. Read response times were typically serviced in under a millisecond (as would be consistent with the high datamover read cache hit ratio) and writes were consistently serviced under 3 milliseconds (high cache hit ratio from the storage processor complex).
Desktop Provisioning Conclusion:
The VNX5500 as configured, using a single FAST pool and FAST Cache for replica and OS-disk storage, performs sufficiently to support the “out of the box” provisioning settings of VMware View and View Composer. Given that only a limited number of create and power-on operations may occur concurrently, the provisioning aspect of this configuration should scale linearly to support any number of desktops. That said, the linear-scaling conclusion applies only to the provisioning piece of this exercise, and it holds precisely because of the way VMware View limits provisioning concurrency by default.
Login VSI Testing – 500 concurrent users (medium compute workload)
Login VSI is a powerful benchmarking tool designed to simulate end-user computing workloads. For the purposes of this test, a “medium” compute profile was used, which included end-user simulation of Microsoft Outlook, Adobe Acrobat, Microsoft Word, and Internet browsing, to name a few of the application workloads executed by the tool.
The Login VSI workload is quite different from the provisioning workload. The initial spike at the beginning of the graph is caused by 500 users logging into their desktops through the View environment. The login storm lasts approximately 20 minutes (Login VSI was configured to start a new session every 2 seconds) and then subsides into what we’ll call the steady state workload.
There’s also a dramatic shift in the read:write profile between the login storm and the steady state workload. During the storm, read and write IOs are roughly even at 50:50. But after the workload transitions to steady state, you can see the ratio shift to something closer to 10:90 in favor of writes. This behavior is consistent with how we expect active linked clone guests to work, given that they need to grow to absorb the differences from the parent guest.
The response times measured during the same period are shown below. Notice that the write response time was higher during the login storm, but only by a few milliseconds. The net result is that NFS latencies were consistently measured below 5ms for the duration of the testing (well below the generally accepted 15ms watermark).
Response times are healthy during the 500-user login storm and steady state testing.
NFS server read cache is benefiting this workload, especially during the login storm. The blue line below is network traffic passed out of the NFS server, while the red line is the actual data being passed across the fibre-channel backend from the storage processor complex.
Based on the data reviewed, we are confident that the storage configuration is sufficient to support the 500 user workload – and that’s goodness! But there are a few other questions that we’d like to answer.
- How many more users will this configuration support?
- Does performance degrade over time?
To answer the first question, we review the different component bottlenecks along the IO path (simplified using core utilization metrics only for this analysis as we’re assuming wide open pipes for network and SAS connectivity).
- NFS Server Utilization (CPU)
- Storage Processor Utilization (CPU per component)
- Physical Disk Utilization
- FAST Cache Utilization
From a “head” perspective, there is plenty of expansion available in terms of available CPU:
From a physical disk perspective, the utilization metric gets busy during the login storm for the SAS tier of storage in the FAST Pool:
Reviewing the actual drive IOPS during that same period confirms that the SAS drives in the pool were nearing saturation. Typically, a conservative rule-of-thumb recommendation is to plan no more than 120 IOPS for a 10K SAS drive.
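To put that rule of thumb in context, here is a back-of-the-envelope sketch of what the (3) SAS drives can absorb once the standard RAID-5 write penalty is factored in. The read fractions below are the approximate mixes observed in this testing, but the model is a simplification (in practice, FAST Cache absorbs much of the write load before it ever reaches the spindles):

```python
DRIVE_IOPS = 120          # conservative rule of thumb for a 10K SAS drive
SAS_DRIVES = 3            # SAS drives in the FAST pool
RAID5_WRITE_PENALTY = 4   # each host write costs ~4 backend IOs on RAID-5

def host_iops_capacity(read_fraction):
    """Approximate host IOPS the SAS tier can service for a given read mix."""
    backend_iops = DRIVE_IOPS * SAS_DRIVES
    write_fraction = 1.0 - read_fraction
    return backend_iops / (read_fraction + RAID5_WRITE_PENALTY * write_fraction)

print(round(host_iops_capacity(0.5)))  # ~144 at the 50:50 login-storm mix
print(round(host_iops_capacity(0.1)))  # ~97 at the 10:90 steady-state mix
```

With only three spindles in the tier, saturation comes quickly, which lines up with the drive utilization observed during the login storm.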
And then there’s FAST Cache utilization. Using a tool called Unified Block Locality Analyzer (aka UBLA), we were able to measure the data “skew” at the storage pool slice level. After inputting the appropriate data, the tool determined that the skew rate on this workload is approximately 95:5.
The skew value suggests that 95% of the storage system IO is being serviced by just 5% of the allocated capacity (the working set). In other words, using simple round numbers, ~250GB of storage is servicing 95% of the IO workload for the VDI testing.
Let’s net the calculation out:
- EFD in FAST Cache: 2 x 100GB, RAID-1 = ~100GB protected
- EFD in FAST Pool: 3 x 100GB, RAID-5 = ~200GB usable
- SAS in FAST Pool: 3 x 300GB, RAID-5 = ~600GB usable
- NL-SAS in FAST Pool: 3 x 2000GB, RAID-5 = ~4,000GB usable
Because FAST Cache can’t be used to store data (it’s a caching tier), we exclude that from the calculation of “usable storage”:
200GB + 600GB + 4,000GB = 4,800GB
4,800GB * 5% = 240GB
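The same arithmetic in a few lines of Python, for anyone who wants to plug in their own drive mix. This is just a sketch of the calculation above, nothing more:

```python
# Usable capacity per tier; FAST Cache is excluded because it is a caching
# tier and does not store data.
usable_gb = {"EFD": 200, "SAS": 600, "NL-SAS": 4000}

total_usable_gb = sum(usable_gb.values())
hot_working_set_gb = total_usable_gb * 0.05  # 95:5 skew -> 5% of capacity is hot

print(total_usable_gb, round(hot_working_set_gb))  # 4800 240
```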
Based on the calculation above, assuming the FAST policy engine is able to optimize the majority of slice data and FAST Cache is able to buffer the remaining I/O, our working set of 4.8TB has the potential to be serviced almost entirely by EFD given the current workload (~300GB of usable EFD storage is available between the FAST Pool and FAST Cache). In reality, that hasn’t quite happened because the “default tier” of allocation for slice data is SAS and not direct to EFD (this setting is configurable in the storage pool settings).
As a result, the SAS tier definitely warrants consideration if this configuration is to be scaled past the current number of users.
As for the performance degradation question: assuming floating desktops with a logoff refresh policy, there is no reason to expect performance degradation over time. The same filesystem blocks will be reused, and blocks already promoted to FAST Cache and the EFD tier should remain there.
I/O Skew in VDI deployments is such an important topic that I will probably dedicate a whole article to talk about it in the future.
A single FAST pool configured using 3x100GB EFD, 3x300GB SAS, and 3x2TB NL-SAS drives (4:13:83 tier capacity ratio) supplemented with 2x100GB EFD for FAST Cache is a viable solution for a 500-user VDI desktop configuration. This assumes a VDI use case that takes advantage of floating desktops configured to refresh after user logoff.
Testing confirmed that the VNX5500 configuration can handle the following stress tests with latency measurements consistently below 5ms:
- Initial View provisioning of 500 desktops
- Login storm of 500 users
- Steady state “medium” user compute workload with 500 concurrent users
The configuration has room for expansion, but it is unlikely that many additional users could be safely added without considering the addition of more physical drives.
From a modeling and building-block perspective, it seems that the 3% EFD, 25% SAS, 72% NL-SAS mix will work just fine for most VDI implementations. FAST Cache might need to be doubled for every 500 users.
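Reading that rule of thumb as one 2 x 100GB FAST Cache increment per 500-user building block, a sizing sketch might look like the following. This is my interpretation of the scaling, not a validated EMC sizing formula:

```python
import math

def fast_cache_drives(users, users_per_block=500, drives_per_block=2):
    """100GB EFDs needed for FAST Cache, assuming 2 drives (RAID-1) per 500 users."""
    return math.ceil(users / users_per_block) * drives_per_block

for users in (500, 1000, 5000):
    print(users, fast_cache_drives(users))
# 500 -> 2 drives, 1000 -> 4 drives, 5000 -> 20 drives
```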
Next, we are set to test the model under 1,000 and 5,000 user workloads.