Last month Datrium announced the first part of the DVX 3.0 payload, which I blogged about here, and today we are announcing the second part. In the first announcement, Datrium took us where no HCI vendor has been before, offering not only a multi-hypervisor platform (VMware vSphere, Red Hat Virtualization and open-source KVM on CentOS hosts), but also adding support for bare-metal containers on Linux hosts, both with granular data management.
- Red Hat Virtualization (RHV) support
- Linux RHEL and CentOS hosts running bare-metal containers
- Full Data Services for Containers & KVM virtual machines
Today we are introducing some real awesomeness with mind-boggling performance!
- Split Provisioning (128 servers and 10 data nodes)
- Cloud Scale (18 Million IOPS and 200GB/s)
- Instant, Application Consistent Snapshots (Zero VM stun)
Allow me to provide some background to Open Convergence and the problems that we are effectively solving with the new architecture.
The SAN proposition
In legacy storage arrays, all the CPU intensive data management (deduplication, compression, erasure coding, etc.) is carried out on the array side, by the controllers. These controllers are often sized for maximum performance, but as the solution scales with more servers, each host will get fewer IOPS and less storage capacity.
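To make the dilution effect concrete, here is a minimal sketch that assumes a fixed controller budget shared evenly across hosts. The 500,000 IOPS ceiling and the function name are illustrative assumptions, not figures from any vendor.

```python
def per_host_iops(controller_iops: int, hosts: int) -> float:
    """Even share of a fixed controller IOPS budget (hypothetical model)."""
    if hosts < 1:
        raise ValueError("hosts must be >= 1")
    return controller_iops / hosts


# Assumed controller ceiling; not a figure from the article.
CONTROLLER_IOPS = 500_000

# The controller budget stays constant while the host count grows.
shares = {hosts: per_host_iops(CONTROLLER_IOPS, hosts) for hosts in (8, 32, 128)}

for hosts, share in shares.items():
    print(f"{hosts:>3} hosts -> {share:,.0f} IOPS per host")
```

Real arrays rarely divide performance this evenly, but the direction is the same: the budget is set by the controllers, not by the number of servers attached.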
Scaling storage arrays can be done either by attaching multiple disk shelves to the same controllers, which bottlenecks the controllers, or by doing a controller head-swap, which requires downtime and is expensive. Some storage arrays use a scale-out approach with multiple controllers, but because data management happens at those beefy controllers, it quickly becomes a costly proposition.
Finally, another option is to adopt a multi array strategy, leading to storage silos, complex management, lack of global deduplication and coordination failures.
The HCI proposition
HCI places compute and storage together, and as you scale one dimension, you also scale the other (lockstep provisioning), not allowing for independent scaling. Some vendors provide storage only nodes, but those also come with additional and unnecessary computing.
Additionally, HCI vendors do not allow different hardware vendors as part of the same solution or cluster, not allowing the reuse of existing servers and also not allowing the re-purpose of existing storage investments – all that equates to vendor lock-in.
When it comes to the data path and IO traffic, data being written always goes across servers and the network, and there is a lot of traffic between servers (east <> west), in many instances creating noisy neighbor issues when heavy workloads impact lighter workloads.
Finally, at scale, multiple clusters are formed due to the requirement to create multiple failure domains, and also due to the cost of creating additional replicas of the data for resiliency – the larger the cluster, the higher the possibility of a double or triple failure.
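The relationship between cluster size and concurrent failures can be sketched with a simple binomial model. This assumes every node fails independently with the same probability at any given moment, which is a simplification, and the 0.1% figure is a made-up value for illustration only.

```python
from math import comb


def prob_at_least_k_failures(nodes: int, p_fail: float, k: int) -> float:
    """P(at least k of `nodes` are down at once), assuming independent failures."""
    # Sum the probabilities of seeing 0 .. k-1 failures, then take the complement.
    p_fewer = sum(
        comb(nodes, i) * p_fail**i * (1 - p_fail) ** (nodes - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


# Assumed per-node probability of being down at any instant (illustrative only).
P_FAIL = 0.001

double_failure = {n: prob_at_least_k_failures(n, P_FAIL, 2) for n in (8, 32, 128)}

for n, p in double_failure.items():
    print(f"{n:>3} nodes -> P(2 or more down) = {p:.6f}")
```

Under these assumptions the chance of a double failure climbs steadily with node count, which is why large HCI deployments are often split into several smaller clusters.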
The Open Convergence (OCI) proposition
DVX is a scale-out system where capacity can be scaled by adding Data Nodes and performance can be scaled by adding Compute Nodes. With the DVX 3.0 payload, we now support a maximum of 10 data nodes. This translates to more than 1PB of effective usable capacity (300 TB of usable capacity before data reduction). This hyperscale approach eases administrative tasks and reduces the cost for private clouds at scale.
With the DVX 3.0 payload, we support a single Datastore spanning all data nodes. This creates a single global namespace and a single deduplication storage pool. Each data node has dual NVRAM, dual controllers, redundant data network links (2-4 depending on the model), redundant power supplies, etc., so that there is no single point of hardware failure within the data node cluster.
The DVX architecture is server powered, with all data management functions (replication, encryption, fingerprinting, compression, space reclamation, erasure coding, drive rebuild) being carried out on Compute Nodes. The compute available for storage data services scales out as more compute nodes are added. You can add data nodes to add capacity to the system. As more data nodes are added, write performance also increases linearly, since we get more disks and more network links, resulting in more storage pool write bandwidth. The increased pool read bandwidth also helps with increased space reclamation performance and increased drive rebuild performance.
Split Provisioning Architecture – No Bottlenecks
Each data node brings drives, which are pooled together into a single drive pool. Disks are uniformly distributed across data nodes, and NVRAM bandwidth for VMs increases because the system aggregates NVRAM across data nodes.
Data stripes are broken down into data chunks that are distributed to the storage pool using a Layout Map. The Layout Map solves two problems: (1) it makes sure that the data is distributed evenly across all data nodes, and (2) in the event of a disk failure and during a rebuild, it makes sure the load is distributed across all disks and hosts.
When a new data node is added, and more disks are added to the drive pool, the data is rebalanced, and the rebalancing also scales as data nodes are added. The data stripe is written via distributed erasure coding, making sure that there are two EC parity chunks to tolerate 2 simultaneous drive failures in the drive pool. More parity chunks may be added in the future, if there is a need to tolerate more concurrent drive failures. Rebuild times decrease linearly as data nodes are added.
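The idea behind a layout map can be sketched as follows. This is not Datrium's algorithm: the rotating placement, the stripe width of 8 data chunks, and the 12 drives per data node are all assumptions made for illustration. Only the two parity chunks per stripe and the 10 data nodes come from the text above. The erasure-coding math itself is omitted; the sketch only tracks where chunks land and how many can be lost.

```python
from collections import Counter

DATA_CHUNKS = 8     # assumed stripe width; the article does not state it
PARITY_CHUNKS = 2   # from the article: two EC parity chunks per stripe


def build_drives(data_nodes: int, drives_per_node: int) -> list:
    """List (node, drive) slots, interleaved so neighbours sit on different nodes."""
    return [
        (node, drive)
        for drive in range(drives_per_node)
        for node in range(data_nodes)
    ]


def layout(stripe_id: int, drives: list) -> list:
    """Hypothetical layout map: rotate the starting slot for each stripe."""
    width = DATA_CHUNKS + PARITY_CHUNKS
    if width > len(drives):
        raise ValueError("not enough drives for one stripe")
    start = (stripe_id * width) % len(drives)
    return [drives[(start + i) % len(drives)] for i in range(width)]


def survives(placement: list, failed_drives: set) -> bool:
    """A stripe is recoverable while lost chunks do not exceed the parity count."""
    lost = sum(1 for slot in placement if slot in failed_drives)
    return lost <= PARITY_CHUNKS


# 10 data nodes (from the article) with an assumed 12 drives each.
drives = build_drives(data_nodes=10, drives_per_node=12)

# Count how many chunks each drive receives over 1,200 stripes.
chunks_per_drive = Counter(
    slot for stripe_id in range(1200) for slot in layout(stripe_id, drives)
)
print("min/max chunks per drive:",
      min(chunks_per_drive.values()), max(chunks_per_drive.values()))
```

Because every drive holds a share of every region of the pool, a failed drive can be rebuilt by reading from all the remaining drives at once, which is the property the rebuild claims rely on.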
Datrium DVX now uses the mDNS multicast protocol to support ZeroConf. ZeroConf is a set of technologies where, when a device is plugged into a network, a unique IP is assigned automatically, and it will resolve to a known hostname within that local subnet. ZeroConf can also be extended to provide a means to discover services available on each device.
With Datrium, ZeroConf is used for initial deployment, to connect to a DVX on the local network with a public hostname and to configure the system. It also provides node discovery in the cluster and lists all the available nodes in the local network.
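For readers who have not looked at mDNS on the wire, here is a minimal sketch of a one-shot discovery query using only the Python standard library. It uses the generic DNS-SD service enumeration name from RFC 6763, not any Datrium-specific service name (the names DVX advertises are not stated here), and the helper function names are my own.

```python
import socket
import struct

MDNS_GROUP = "224.0.0.251"  # IPv4 mDNS multicast address (RFC 6762)
MDNS_PORT = 5353
# Generic DNS-SD meta-query (RFC 6763); not a Datrium-specific name.
SERVICE_ENUMERATION = "_services._dns-sd._udp.local"


def encode_name(name: str) -> bytes:
    """Encode a dotted name as length-prefixed DNS labels."""
    out = b""
    for label in name.strip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError(f"invalid label: {label!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_ptr_query(name: str) -> bytes:
    """Build a DNS query with one PTR question (ID 0, no flags)."""
    header = struct.pack("!6H", 0, 0, 1, 0, 0, 0)  # id, flags, qd, an, ns, ar
    question = encode_name(name) + struct.pack("!2H", 12, 1)  # PTR, class IN
    return header + question


def send_query(name: str = SERVICE_ENUMERATION, timeout: float = 2.0) -> list:
    """Multicast the query and collect the addresses of hosts that answer."""
    responders = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 255)
        sock.sendto(build_ptr_query(name), (MDNS_GROUP, MDNS_PORT))
        try:
            while True:
                _, addr = sock.recvfrom(9000)
                responders.append(addr[0])
        except socket.timeout:
            pass  # no more answers within the timeout window
    return responders
```

Calling send_query() on a network with mDNS responders should return the addresses of the devices that answered. A production implementation would parse the responses and follow up with SRV and A queries, which this sketch leaves out.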
Ok, here is where things start to get really interesting. The Datrium architecture has been architected from the ground up for thousands of drives and hundreds of hosts. With the DVX 3.0 payload, the solution achieves mind-boggling numbers, scaling 10X, up to 128 compute nodes, 10 data nodes, 1.7 Petabyte data pool, 18 Million IOPS, and more than 8GB/sec write throughput.
The system-wide performance is a combination of the number of compute nodes and data nodes in the platform. The numbers below demonstrate some of the incredible internal benchmarks done with 2x-compressible, undedupable data.
What does Datrium look like compared to…
- To put into perspective, using a 70:30 read to write split ratio with 8K block sizes would give us a direct comparison to XtremIO. Using this configuration, DVX will do 3.3M IOPS; 3.7x better performance than the largest XtremIO. XtremIO Specifications (here).
- You may compare with 32KB reads where DVX will deliver 6.25M IOPS; 17x better than the largest Pure Storage FlashArray. Pure Storage Specifications (here).
- You may compare with nominal read IOPS, likely 4KB, where DVX will deliver 18M IOPS; 2.7x better than the largest EMC VMAX All Flash. EMC VMAX All Flash Specifications (here).
- You may compare with nominal read IOPS, likely 4KB, where DVX will deliver 18M IOPS; 1.8x better than the largest SolidFire All Flash array. SolidFire All Flash Specifications (here).
In all honesty, the only thing that is in our league might be EMC ScaleIO, and performance is their only metric. But if you care about data services like VM awareness, deduplication, compression, snapshots, cloning, erasure coding and replication, Datrium DVX is the only solution that can get the job done.
Bear in mind that Datrium is a hybrid platform, not an All Flash system like the above arrays, and Datrium data nodes use 7,200 RPM hard disks for durable storage. That is truly mind-boggling! The secret sauce comes from designing a brand-new log-structured file system (LFS) that works by treating the entire filesystem as a log: erasure-coded objects are written exactly once in an append-only log, never overwritten, and eventually deleted. It is a difficult problem to implement a distributed, scale-out LFS, especially when you consider how to reclaim space, but it gives you several excellent properties. Read more in this post.
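The write-once, append-only idea can be illustrated with a toy in-memory log. This is a generic log-structured sketch, not Datrium's file system: the class and method names are hypothetical, there is no erasure coding or distribution, and space reclamation is reduced to copying live entries into a fresh log.

```python
class AppendOnlyLog:
    """Toy log-structured store: entries are appended once and never modified."""

    def __init__(self):
        self._entries = []  # the log: (key, payload) tuples in write order
        self._index = {}    # key -> position of the current live version

    def put(self, key: str, payload: bytes) -> int:
        """Append a new version; older versions stay in the log as garbage."""
        self._entries.append((key, payload))
        position = len(self._entries) - 1
        self._index[key] = position
        return position

    def get(self, key: str) -> bytes:
        return self._entries[self._index[key]][1]

    def delete(self, key: str) -> None:
        """Drop the reference only; the log itself is left untouched."""
        del self._index[key]

    def dead_entries(self) -> int:
        return len(self._entries) - len(self._index)

    def reclaim(self) -> int:
        """Space reclamation: copy live entries to a new log, drop the rest."""
        live_positions = sorted(self._index.values())
        new_log = [self._entries[p] for p in live_positions]
        freed = len(self._entries) - len(new_log)
        self._entries = new_log
        self._index = {key: i for i, (key, _) in enumerate(new_log)}
        return freed


log = AppendOnlyLog()
log.put("vm1/disk0", b"version 1")
log.put("vm2/disk0", b"other data")
log.put("vm1/disk0", b"version 2")  # logical update appends, never overwrites
log.delete("vm2/disk0")
print("dead entries before reclaim:", log.dead_entries())
freed = log.reclaim()
print("entries freed:", freed)
```

Writes are always sequential appends, which is what suits spinning disks, while the cost moves to the reclaim step. Doing that step across many nodes is the hard part the paragraph above refers to.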
If any of the vendors mentioned above disagree, or if you have a more current specs sheet, please let me know, and I will happily correct the disparity.
HOW ABOUT 100% RANDOM WRITE LARGE SPAN WORKLOADS?
(This is what separates the men from the boys)
We could not find a single vendor that would publish such workload numbers, because it’s a very difficult workload. Our performance engineering teams have been hard at work trying to push the system as much as they can. The picture below demonstrates the DVX 100% random-write performance without gaming the results, like writing to a small file in NVRAM or such.
- 490 VMs
- 100% 32KB random writes
- 2.1x compressible
- Undedupable data
- 35 compute hosts
- 10 data nodes
- Large span (1TB per VM) with roughly half a petabyte of logical LBA span
Those with any enterprise storage experience will agree that 8.5 GBps of Random-Write Throughput with 1.5ms application latency is an astonishing achievement, especially if one considers that all data is being checksummed, deduplicated, compressed and erasure coded inline to disk.
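A quick back-of-the-envelope breakdown of the test configuration above helps put the headline number in context. The inputs come from the list and the paragraph above. The unit conventions (decimal gigabytes for throughput, binary kilobytes for block size) and the even spread across VMs, hosts, and data nodes are my assumptions.

```python
# Inputs from the benchmark description above.
THROUGHPUT_BYTES_PER_S = 8.5e9  # 8.5 GBps, assuming decimal gigabytes
BLOCK_BYTES = 32 * 1024         # 32KB writes, assuming binary kilobytes
VMS = 490
COMPUTE_HOSTS = 35
DATA_NODES = 10
SPAN_TB_PER_VM = 1

# Derived figures, assuming load is spread evenly.
write_iops = THROUGHPUT_BYTES_PER_S / BLOCK_BYTES
iops_per_vm = write_iops / VMS
vms_per_host = VMS // COMPUTE_HOSTS
per_host_mb_s = THROUGHPUT_BYTES_PER_S / COMPUTE_HOSTS / 1e6
per_data_node_mb_s = THROUGHPUT_BYTES_PER_S / DATA_NODES / 1e6
total_span_tb = VMS * SPAN_TB_PER_VM

print(f"aggregate write IOPS : {write_iops:,.0f}")
print(f"write IOPS per VM    : {iops_per_vm:,.0f}")
print(f"VMs per compute host : {vms_per_host}")
print(f"MB/s per compute host: {per_host_mb_s:,.1f}")
print(f"MB/s per data node   : {per_data_node_mb_s:,.1f}")
print(f"total logical span   : {total_span_tb} TB")
```

The 490 TB total span matches the "roughly half a petabyte" in the configuration list, and each data node absorbs about 850 MB/s of random writes under these assumptions.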
Please note that I refer to application latencies as seen by ESX because this is the latency that is measured from the ESX level perspective, all the way through the network and storage stack, and back. Storage arrays normally report only one component of this latency, namely internal array latency that excludes the network and client storage protocol overheads.
Another interesting point is that the bigger the solution, the more throughput and NVRAM are available, and the faster the drive rebuilds. The time to rebuild drive failures decreases more than linearly.
Instant, Application Consistent Snapshots
VSS (Volume Shadow Copy Service) is a Microsoft Windows service that allows backup and snapshot applications to “quiesce” guest applications (i.e., put them in a consistent on-disk state) before taking a backup. Datrium has created its own VSS Provider to implement native, instant app-consistent snapshot capabilities for Microsoft workloads.
This first release of the VSS Provider supports Windows Server 2008 R2 onwards and Microsoft SQL Server 2005 onwards. Additional Microsoft applications such as Exchange and AD Controllers will be enabled in the future, after testing.
Due to the native integration with the Datrium platform, where both the VM metadata and the data lives, Datrium can eliminate VM stun times and drastically reduce the application performance dip. As a result, IT admins can take VM-level pause-less snapshots of applications with high change rates and at greater frequencies for more granular recovery.
Datrium DVX 3.0 supports up to 1.2M snapshots in a 10 data node configuration, and 300K with a single data node.
Below is an example of Datrium VSS provider quiescing a Microsoft SQL Server and snapshotting the VM.
- 16 core, 64GB, 10 virtual disks (2x500GB, 8x 40GB)
Number of VM stuns:
- VMware VSS – 3 VM stuns
- Datrium VSS – No VM stun
Duration of Application Performance Dip:
- VMware VSS – 8-10 minutes
- Datrium VSS – up to 10 seconds.
Peer Cache Mode
Read about peer cache mode here.
There is a lot more goodness coming from Datrium in the next few months, and all types of organizations are already realizing the benefits of Open Convergence, and at the same time recognizing the shortcomings and disadvantages of legacy architectures, both SAN and HCI.
Thanks to Ganesh Venkitachalam, Sazzala Reddy, and Tushar Agrawal for helping review this post.
This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net.