Recently there’s been an exchange between industry luminaries on data reduction, data locality, and data protection. Howard Marks wrote a thoughtful piece <here> that expands on VSAN’s approach to data locality. Josh Odgers, the brilliant blogger now at Nutanix, responded with some notes <here> that in my opinion struggled to close the issue. They both seem to be reaching for a silver bullet that’s not there.
The objective of this blog post is to demonstrate that data locality is essential for application performance, and to explain how the data-locality and management-complexity dilemmas can be solved seamlessly while still delivering high performance and data reduction benefits.
When it comes to Performance, just get out of the way of Intel.
If you can let server hardware serve applications as fast as possible and remain stateless, performance will be as good as possible, and what’s left is making administration simple.
When the Datrium team set out to build the best converged system possible, we considered many of these issues. On top of all of them, one overriding concern we had was simplicity – figuring out which features to enable, and when, is a complete waste of time. If a feature is toggled per-VM or per-some-group-of-objects, the complexity truly becomes unmanageable. It is simply not possible to track 1000s of VMs and figure out what needs to be enabled and when. So: All features must be On, all the time.
Incidentally, this is one area where modern arrays like Pure Storage nailed it, but most HCI vendors have checkboxes galore. If you can also add capacity or bandwidth/IOPS at will, then you will have solved a real problem at scale.
Let’s look at how Datrium figures in each one of these angles.
The fundamentals: a true log-structured filesystem
The fundamental technology that enables features such as data reduction, data locality, and data protection in a Datrium DVX system is a Log Structured Filesystem, first described by Mendel Rosenblum (who is incidentally one of our investors, and one of the founders of VMware).
In a nutshell, an LFS works by treating the entire filesystem as a log: objects are written exactly once to an append-only log and are never overwritten in place; space is reclaimed later by garbage collection. Implementing a distributed, scale-out LFS is a difficult problem – especially when you consider how to reclaim space – but it gives you several excellent properties.
- It lets us handle variable-sized objects – if a 4K block compresses to 3,172 bytes, saving that ~23% of space is a Good Thing. “Normal” 4K-based block allocation will not let you save that space, but appending a 3.1K object to a log is as simple as appending a 4K object; there is no difference.
- Garbage Collection, an inherent part of any LFS, is also a natural opportunity for deduplication. As live data is copied forward to reclaim space, duplicate as well as dead data can be left behind. Moreover, the more data that can be left behind, the more efficient the process.
- You can compute parity to tolerate N failures once, and write a whole stripe from the get go and never modify the stripe – you do not have to incur the complexity or cost of reading cold data, erasure coding the data, and re-writing stripes. Unless you have a true append-only LFS, doing distributed erasure coding can get very challenging, which is why most HCI systems do not have always-on EC.
- Flash is great at random reads but not random writes. LFS converts random writes to sequential writes, which is ideal for both flash and disk.
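To make the variable-sized-append point concrete, here is a toy sketch (illustrative only, not Datrium’s code) of an append-only log where an object costs exactly the bytes it occupies, and every write lands sequentially at the tail:

```python
class AppendOnlyLog:
    def __init__(self):
        self.log = bytearray()
        self.index = {}  # object_id -> (offset, length)

    def append(self, object_id, data: bytes):
        # Writes are strictly sequential: new data always lands at the tail,
        # which is ideal for both flash and disk.
        offset = len(self.log)
        self.log.extend(data)
        self.index[object_id] = (offset, len(data))

    def read(self, object_id) -> bytes:
        offset, length = self.index[object_id]
        return bytes(self.log[offset:offset + length])

log = AppendOnlyLog()
log.append("block-a", b"x" * 3172)  # variable-sized object, no padding to 4K
log.append("block-b", b"y" * 4096)
assert log.read("block-a") == b"x" * 3172
assert len(log.log) == 3172 + 4096  # space used is exactly the bytes written
```

Note that nothing is ever overwritten: an update to “block-a” would simply append a new version and re-point the index, leaving the old bytes for garbage collection to reclaim.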
Of course, these are fundamental properties of a filesystem. It is next to impossible to change a filesystem at such fundamental levels once it is implemented. You have to figure out most of the requirements up front, and most current HCI products were not built for storage efficiency from the beginning – erasure coding, dedupe, compression, and encryption were layered on (or sometimes not) with bad side effects.
Compression ratios are of course workload dependent. Once you have an LFS, compression is in fact quite straightforward to implement: you are only appending variable-sized objects to a log. The trick is to find a compression algorithm that quickly gives up if the data is incompressible. We use a modified version of Google’s Snappy. It is very CPU efficient thanks to new Intel instructions and has a reasonable compression ratio.
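As a rough sketch of the “give up quickly” idea – using Python’s stdlib zlib as a stand-in for the modified Snappy the post mentions, and a hypothetical sample-first heuristic of our own invention:

```python
import os
import zlib

def maybe_compress(block: bytes, sample=512, min_gain=0.05):
    """Compress a block only if it pays off (illustrative heuristic, not
    the actual DVX algorithm)."""
    # Cheap bail-out: if a small sample doesn't shrink, assume the block is
    # incompressible (e.g. encrypted or media data) and skip the full pass.
    if len(zlib.compress(block[:sample], 1)) >= sample:
        return block, False
    compressed = zlib.compress(block, 1)  # level 1: favor speed over ratio
    if len(compressed) <= len(block) * (1 - min_gain):
        return compressed, True
    return block, False

data = b"ABCD" * 1024               # highly compressible 4K block
out, was_compressed = maybe_compress(data)
assert was_compressed and len(out) < len(data)

rand = os.urandom(4096)             # incompressible random block
out2, was_compressed2 = maybe_compress(rand)
assert not was_compressed2 and out2 == rand
```

Either way the result is just appended to the log, so the filesystem never cares whether a given object ended up as 4,096 bytes or 3,172.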
Here’s where skeptics come out and say “but you are taking up CPU even if the workload is incompressible”. The answer is – Who Cares? We measured workloads with an internal option that we specifically added for measurement purposes. The difference is truly in the noise – maybe 1-2% CPU savings. That is it! What do you care about, managing options on 1000s of objects or a couple of percent of CPU utilization, which Intel will make up shortly after this article is written?
Here’s how deduplication works at a high level: if there are multiple references to the same piece of data, the filesystem keeps one copy of the data. All references to the data will then use this same copy.
Dedup is very workload dependent. With VDI, you get 10X+ deduplication. With a log-processing workload, you get almost no dedupe except the OS image. Deduplication systems may have some overhead in terms of fingerprinting data, but much more in keeping deduplication tables up to date; and that is the fundamental reason some vendors only implement post-process deduplication.
With content addressing, a data block is immutable (new data = new content = new fingerprint), and it does not belong to any particular object. This saves lots of bookkeeping. Replicating at an object granularity is irrelevant in such a system – there’s just a pool of content shared by whatever file object wants to use it, including snapshots and clones. DVX doesn’t have refcounts and that also greatly simplifies both cloning and snapshotting. In particular, DVX doesn’t have to update a bunch of ref counts just to create a clone.
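A minimal sketch of content addressing (names and structure are ours, not the DVX on-disk format): the fingerprint is the block’s identity, so duplicate writes collapse automatically and no refcounts are needed to share data between objects, snapshots, and clones:

```python
import hashlib

class ContentStore:
    def __init__(self):
        self.blocks = {}  # fingerprint -> data

    def put(self, data: bytes) -> str:
        # New data -> new content -> new fingerprint; identical data
        # always maps to the same fingerprint, so duplicates are stored
        # exactly once and writers just hold the fingerprint.
        fp = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(fp, data)
        return fp

store = ContentStore()
vm1 = store.put(b"windows-os-block")
vm2 = store.put(b"windows-os-block")  # a clone's identical write dedupes away
assert vm1 == vm2
assert len(store.blocks) == 1  # one physical copy, two logical references
```

Cloning a VM in such a model is just copying a list of fingerprints – no per-block bookkeeping updates are required.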
Suffice it to say that given the nature of our filesystem, deduplication is Always-On. Computing a fingerprint is almost free with new Intel instructions, especially if you can do it in one pass along with compressing the data. After computing the fingerprint, the cost is the same whether a piece of data has duplicates or not. There is little to no performance loss because of any of this, as is proven by a few array vendors. With deduplication and compression always enabled, Pure is killing it with good performance in the Tier-0/Tier-1 market.
We will publish a detailed DVX performance white paper later this year, and you can see the numbers for yourself. We have already published read IOPS/bandwidth numbers, though: with deduplication and compression enabled, using undedupable random data, we achieve about 140K IOPS (4K random read) and 1.5 GB/s of bandwidth (32K random read) from a single host. We used undedupable data in that benchmark so that there is zero gaming of the results. Note that in a DVX system, read bandwidth scales with the number of hosts, since reads are local thanks to data locality.
As you can see, there is no reason to tweak four kinds of knobs for performance and data reduction depending on workload, if you have the right kind of filesystem.
Data durability in the presence of failures is table stakes for any storage system. Failure tolerance is achieved by redundancy in some fashion. One way to achieve redundancy is with mirroring. You can mirror 2-way (RF=2, FTT=1) or 3-way (RF=3, FTT=2). At any scale and seriousness, you have to do 3-way replication, or you are rolling the dice on data loss.
The reason is not that you will lose 2 disk drives at the same time. What is much, much more common is the following scenario: 1 drive fails, and the system starts re-mirroring data from the remaining drive. For this re-mirroring, you have to read from the remaining drive. All it takes is a sector read error on the remaining drive, and you have now lost data. NetApp has published an extensive study (linked at the end of this post) demonstrating this problem. The summary: 5% to 20% of all disks in the study had at least one latent sector error. So, if you are mirroring, choose RF=3/FTT=2 if you care about your data.
The problem with 3-way mirroring is that you now have 3X the overhead. Enter Erasure Coding. At a high level, Erasure Coding tolerates 2 drive failures (or 1 drive failure and a sector read error in the remaining drive). This is achieved by computing Error Correcting Codes that tolerate 2 failures. With a good implementation, you can tolerate 2 drive failures with an overhead of 25% or so. Which is way better than 3X.
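The capacity arithmetic, assuming a hypothetical 8-data + 2-parity stripe geometry (real stripe widths vary by system):

```python
def overhead(data_units: int, redundancy_units: int) -> float:
    """Extra raw capacity consumed per byte of user data."""
    return redundancy_units / data_units

mirror_3way = overhead(1, 2)   # every byte stored 3 times
ec_8_plus_2 = overhead(8, 2)   # 8 data + 2 parity, also tolerates 2 failures

assert mirror_3way == 2.0      # 200% overhead -> 3X total footprint
assert ec_8_plus_2 == 0.25     # 25% overhead -> 1.25X total footprint
print(f"3-way mirror: {mirror_3way:.0%} overhead; "
      f"8+2 erasure code: {ec_8_plus_2:.0%} overhead")
```

Same two-failure tolerance, at well under half the raw capacity of 3-way mirroring.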
Many HCI vendors offer this as a checkbox with various caveats, because doing distributed Erasure Coding is a hard problem. As Howard Marks points out, Erasure Coding messes with locality. Also, HCI vendors’ implementations of EC work only on write-cold data or on all-flash systems, because there might be a read-modify-write involved, which tanks performance. In some cases, even if you enable the “Erasure Code” checkbox, the system cannot actually erasure-code the data – which means that you cannot bank on a 20% overhead; it might in fact be 3X overhead even with the box checked. This is one aspect that a few array vendors like Pure nailed: Erasure Coding is on by default, always on, and the overhead for durability is 20-30%.
DVX is similar to arrays in this regard – the only data durability method offered, which is not an option, is 2-drive failure tolerance using Erasure Coding. All data is always Erasure Coded and stored in a data node cluster. The DVX software computes the codes and writes the parity stripes directly from the host to the data nodes. Because of the write-once nature of LFS, there is no read-modify-write issue. Hosts are thus stateless. The data node itself has no single point of failure, so you are covered there as well. As a nice side-effect, when a host fails, there are none of the re-replication nightmares.
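To illustrate the no-read-modify-write property, here is a toy full-stripe write with single XOR parity (the DVX tolerates two failures, which requires a second, Reed-Solomon-style code not shown here): parity is computed exactly once at write time, the whole stripe ships out, and the stripe is never modified afterward.

```python
from functools import reduce

def xor_parity(chunks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def write_stripe(data_chunks):
    # Parity is computed from the data already in hand - no need to read
    # cold data back, erasure-code it, and rewrite the stripe.
    parity = xor_parity(data_chunks)
    return data_chunks + [parity]

def recover(stripe, lost_index):
    # XOR of all surviving chunks (data + parity) rebuilds the lost one.
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return xor_parity(survivors)

stripe = write_stripe([b"\x01\x02", b"\x03\x04", b"\x05\x06"])
assert recover(stripe, 1) == b"\x03\x04"  # rebuild a lost chunk from the rest
```

The append-only log is what makes this possible: since a stripe is immutable once written, there is never a partial-stripe update to patch.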
On to data locality and how it interacts with Erasure Coding.
As Howard Marks points out, 3-way replication is expensive, and Erasure Coding in an HCI system smears data all over the nodes, which makes data locality arguments problematic. Data locality is another place where the DVX system is fundamentally different from both arrays and traditional HCI. By data locality, we mean the data residing on flash in the host where a VM is running.
The key word here is host flash. In a DVX, we hold all data in use on a host on flash on the host. Moreover, we guide customers to size host flash to hold all data for the VMDKs. With always-on dedupe/compression for host flash as well, this is totally feasible – with just 2TB flash on each host and 3X-5X data reduction you can have 6-10TB of effective flash. (DVX supports up to 16TB of raw flash on each host). Experience proves this is in fact what our customers do: by and large, our customers configure sufficient flash on the host and get close to 100% hit rate on the host flash.
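The effective-flash math above, spelled out (reduction ratios are the 3X-5X assumed in the text; actual ratios are workload dependent):

```python
def effective_flash_tb(raw_tb: float, reduction_ratio: float) -> float:
    # Dedupe/compression on host flash multiplies usable capacity.
    return raw_tb * reduction_ratio

for ratio in (3, 5):
    print(f"2 TB raw x {ratio}X reduction = "
          f"{effective_flash_tb(2, ratio)} TB effective flash")
```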
With any array, you have to traverse the network, and with any modern SSD, the network latency can be an order of magnitude higher than the device access latency. Flash really does belong in the host, especially if you are talking about NVMe drives with sub-50µs latency. IOPS and throughput can be improved by buying bigger and bigger controllers, but there is no way out for latency: you need the flash on the host or suffer the consequences.
What about vMotion (live migration)? When a VM is vMotioned to another host, the DVX uses a technology we call F2F (flash-to-flash): the destination host fetches data from the source host’s flash and moves the VM’s data over. You lose data locality for the period during which this move happens, but it is restored reasonably quickly as the workload continues to run on the destination host. And VMware DRS does not do vMotions every few minutes – even at its most aggressive level DRS has hysteresis, and VMs move on average once or twice a day at most. This means that in the common case, data locality really does reduce network traffic and latency hugely.
In the uncommon case, i.e., right after a vMotion, DVX performance will be more like an array (or like other HCI systems without data locality). That is, the DVX worst case is someone else’s best case. Note that we preserve data locality even though data protection uses erasure coding – this is the key point.
Finally, because all data is fingerprinted and globally deduplicated, when a VM moves between servers there is a very high likelihood that most data blocks for similar VMs (Windows/Linux OS, SQL Servers, etc.) are already present on the destination server and data movement will not be necessary for those blocks.
With the right approach, you can solve all constraints: Datrium has always-on Data Reduction features like arrays, always-on Erasure Coding like arrays, Data Locality like the best HCI systems out there, and incremental scaling of capacity and IOPS/bandwidth like no one. I’m sorry if this reads like a commercial, but it is in fact true 🙂
Thanks to Ganesh Venkitachalam, Sazzala Reddy, Hugo Patterson and Brian Biles for helping craft and review this post.
Latent Sector Read Errors: http://research.cs.wisc.edu/wind/Publications/latent-sigmetrics07.pdf
This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net.