I recently published a couple of blog posts covering the features made available in the latest Datrium DVX software release. This is an aggregation of those posts for easy reading.
- Red Hat Virtualization (RHV) support
- Linux Bare-Metal (RHEL and CentOS) support
- Docker Persistent Volumes (Virtualized and Bare-Metal)
- Full Data Services for Containers
- Split Provisioning (128 servers and 10 data nodes)
- Cloud Scale (18 Million IOPS and 200GB/s)
- Instant, Application Consistent Snapshots (Zero VM stun)
Do you know Datrium Open-Convergence?
From an architectural perspective, the best way to describe this game-changing tech is to visualize all active data, both VMs and Containers, serviced with data locality using internal flash (SSD and NVMe) on each server. At the same time, a protection copy of the data is hosted in clustered data nodes with distributed erasure coding. Each server runs the DVX hyperdriver software, responsible for IO processing and enterprise data services.
One of the advantages of the architecture is that servers are stateless, and losing any given amount of servers doesn’t impact data protection, availability, or SLAs. On the other hand, data nodes are highly available and protected with active/standby controllers, mirrored NVRAM, and hot-plug drives.
Lastly, when applications move between servers or when a failover happens, the DVX software instantly fetches the data to the target server. The DVX software uses other servers as the source before pulling data from the data cluster, guaranteeing flash-to-flash performance whenever possible. Moreover, because of native global deduplication, it is likely that most fingerprinted data is already available on the target server.
For official information on features and timeframes, refer to the official Datrium Press Release (here).
Red Hat Virtualization (RHV)
Datrium customers now can deploy Red Hat Virtualization (RHV) and inherently get data service benefits, including Flash and NVMe IO acceleration and end-to-end blanket encryption.
Red Hat is the world’s leading provider of open source solutions and has been named a Visionary in Gartner’s 2016 Magic Quadrant for x86 Server Virtualization Infrastructure.
Besides enabling the use of data services, one of the biggest benefits of Datrium’s multi-hypervisor implementation is the ability to use the same DVX system to support RHV and VMware vSphere deployments concurrently.
Datrium is now certified by Red Hat. By providing support for RHV, we are not only giving customers choice but also paving a path to support the entire Red Hat stack and application partner ecosystem, including OpenStack, OpenShift, and CloudForms, with a unified and consistent set of management capabilities across:
- Red Hat Virtualization, VMware vRealize, and Microsoft Hyper-V(*).
- Private cloud platforms based on OpenStack®.
- Public cloud platforms like Amazon Web Services and Microsoft Azure.
While Datrium works independently of CloudForms, it does enable multiple virtualization platforms to run on the same DVX system, eliminating silos and complexity, and in some cases allowing easy workload migration between hypervisors.
An interesting fact about RHV is that it has record-setting SPECvirt_sc2013 benchmark results, including highest overall performance and the highest number of well-performing VMs on a single server.
(*) Microsoft Hyper-V is not supported at this point in time.
Download Red Hat Virtualization Runs on Datrium Solution Brief (here)
Linux Bare-Metal (RHEL and CentOS)
DVX 3.0 enables customers to deploy the DVX hyperdriver on Linux bare-metal servers and inherently enjoy all enterprise data services benefits from the DVX platform, including Flash and NVMe IO acceleration and end-to-end encryption.
Linux administrators see datastores as local NFS mounts, and the mounts are backed by the DVX hyperdriver (manually installed on each server in the 3.0 release), which is responsible for enabling IO acceleration and data services.
With this release, Datrium provides support for KVM and Containers, but other use cases may be supported in upcoming releases, including Splunk, SAP, Hadoop, and more.
Containers Persistent Volumes (Bare-Metal and Virtualized)
Containers are ephemeral, and files and services running inside a Container will not exist beyond its lifetime. However, many applications require the ability to persist user session activity, making some aspects of the application stateful. Enterprises want persistent storage for Containers, and they also want to use the same infrastructure to manage dockerized and traditional workloads across the application lifecycle, from development to production.
Datrium’s native Container implementation, via a Docker Volume Plugin, enables customers to seamlessly implement Continuous Integration/Delivery and microservices solutions as part of the delivery infrastructure, while still leveraging Datrium data services such as deduplication, compression, erasure coding, encryption, replication, snaps, clones, etc.
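For context on what a Docker Volume Plugin does, Docker defines a small JSON-over-HTTP protocol with endpoints such as /VolumeDriver.Create and /VolumeDriver.Mount. Below is a minimal in-memory sketch of those handlers; it is not the Datrium plugin, and the mount-path prefix is a hypothetical example.

```python
# In-memory sketch of the Docker volume-plugin protocol handlers (a real
# plugin serves these as JSON over HTTP). Illustration only, not Datrium code.

MOUNT_ROOT = "/mnt/dvx-volumes"  # hypothetical mount-path prefix

class VolumeDriver:
    def __init__(self):
        self.volumes = {}  # volume name -> creation options

    def create(self, req):   # /VolumeDriver.Create
        self.volumes[req["Name"]] = req.get("Opts") or {}
        return {"Err": ""}

    def mount(self, req):    # /VolumeDriver.Mount
        if req["Name"] not in self.volumes:
            return {"Err": "no such volume"}
        return {"Mountpoint": f"{MOUNT_ROOT}/{req['Name']}", "Err": ""}

    def remove(self, req):   # /VolumeDriver.Remove
        self.volumes.pop(req["Name"], None)
        return {"Err": ""}

    def list(self, req):     # /VolumeDriver.List
        return {"Volumes": [{"Name": n} for n in self.volumes], "Err": ""}

driver = VolumeDriver()
driver.create({"Name": "pgdata", "Opts": {"size": "10G"}})
print(driver.mount({"Name": "pgdata"})["Mountpoint"])
```

With a real plugin installed, the workflow would be the familiar `docker volume create -d <driver-name> pgdata` followed by `docker run -v pgdata:/data ...`, with the driver handling placement and data services behind the scenes.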
No more choosing between bare-metal and hypervisor
Containers and VMs used together provide a great deal of flexibility in deploying and managing apps.
Organizations usually start their Container journey running apps in VMs for the added flexibility provided by virtualization stacks. However, as soon as the application lifecycle and methodology are fully defined, organizations move their production Container environments to bare metal to harvest additional performance, eliminating the 9-15% CPU overhead created by the virtualization stack.
Datrium supports Docker persistent volumes for both virtualized and bare-metal deployments, while still providing IO optimization, acceleration, and data services, including end-to-end encryption, snaps, replication, and more. Using Datrium’s approach to Containers, the development lifecycle is streamlined and automated much more easily because the drift between environments (Dev, QA, Staging, Pre-Prod, and Prod) is minimal.
Image courtesy of Docker Website
Data Services and Protection for Containers
Although some may argue that Containers should remain ephemeral, in my experience working with enterprises there is a clear need to maintain persistence across sessions for some applications and datasets, and an enormous need to protect the data in persistent volumes.
With Datrium, persistent volumes cloned on one server can be immediately used on another, across both virtual and bare-metal deployments.
A significant challenge with Containers, however, is that they represent an order of magnitude more objects to manage than virtual machines, especially when persistent volumes are implemented. DVX 3.0 addresses this challenge with a combination of powerful search capabilities, the ability to create logical groups of Containers (called Protection Groups) aligned to applications, and the assignment of protection policies to those groups for instant recovery, archive, DR, and more.
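To make the Protection Group idea concrete, here is a hypothetical sketch: container volumes are matched into named groups by pattern, and each group carries a snapshot/retention policy. The class, field names, and policies are illustrative assumptions, not the DVX API.

```python
import fnmatch

# Hypothetical sketch of protection groups: volumes are matched into groups
# by name pattern, and each group carries a snapshot/retention policy.
# All names and fields here are illustrative, not the actual DVX interface.

class ProtectionGroup:
    def __init__(self, name, pattern, snap_every_min, retain):
        self.name, self.pattern = name, pattern
        self.policy = {"snap_every_min": snap_every_min, "retain": retain}

    def members(self, volumes):
        # Pattern matching keeps grouping manageable at container scale.
        return [v for v in volumes if fnmatch.fnmatch(v, self.pattern)]

volumes = ["web-frontend-01", "web-frontend-02", "db-orders", "db-users"]
groups = [
    ProtectionGroup("web", "web-*", snap_every_min=60, retain=24),
    ProtectionGroup("databases", "db-*", snap_every_min=15, retain=96),
]
for g in groups:
    print(g.name, g.members(volumes), g.policy)
```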
In other words, all data services typically used with virtual machine workloads, such as snaps, cloning, replication, and blanket encryption, are now also available at the granular Container level, and the Datrium GUI makes them easy to understand and monitor.
Download the Docker Containers on Datrium DVX Solution Brief (here)
Allow me to provide some background to Open Convergence and the problems that we are effectively solving with the new architecture.
The SAN proposition
In legacy storage arrays, all the CPU-intensive data management (deduplication, compression, erasure coding, etc.) is carried out on the array side, by the controllers. These controllers are often sized for maximum performance, but as the solution scales with more servers, each host gets fewer IOPS and less storage capacity.
Scaling storage arrays means either attaching multiple disk shelves to the same controllers, thereby bottlenecking the controllers, or performing a controller head-swap, which requires downtime and is expensive. Some storage arrays use a scale-out approach with multiple controllers, but because data management happens on those beefy controllers, it quickly becomes a costly proposition.
Finally, another option is to adopt a multi-array strategy, leading to storage silos, complex management, lack of global deduplication, and coordination failures.
The HCI proposition
HCI places compute and storage together, and as you scale one dimension, you also scale the other (lockstep provisioning), not allowing for independent scaling. Some vendors provide storage-only nodes, but those also come with additional, unnecessary compute.
Additionally, HCI vendors do not allow different hardware vendors as part of the same solution or cluster, preventing the reuse of existing servers and the repurposing of existing storage investments; all of which equates to vendor lock-in.
When it comes to the data path and IO traffic, data being written always traverses servers and the network, and there is substantial east-west traffic between servers, in many instances creating noisy-neighbor issues where heavy workloads impact lighter workloads.
Finally, at scale, multiple clusters are formed due to the requirement to create multiple failure domains, and also due to the cost of creating additional replicas of the data for resiliency – the larger the cluster, the higher the possibility of a double or triple failure.
The Open Convergence (OCI) proposition
DVX is a scale-out system where capacity can be scaled by adding Data Nodes and performance can be scaled by adding Compute Nodes. With the DVX 3.0 payload, we now support a maximum of 10 data nodes. This translates to more than 1PB of effective usable capacity (300 TB of usable capacity before data reduction). This hyperscale approach eases administrative tasks and reduces the cost for private clouds at scale.
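The capacity arithmetic behind those figures can be checked quickly. The data-reduction factor below is an assumption I chose to match the stated ">1PB effective from 300 TB usable" claim; actual reduction varies by workload.

```python
# Back-of-the-envelope check of the capacity claim: 10 data nodes providing
# 300 TB usable before data reduction. The ~3.4x global dedupe+compression
# factor is an assumed value that matches the ">1 PB effective" figure.
usable_tb = 300
reduction = 3.4  # assumed global data-reduction factor
effective_tb = usable_tb * reduction
print(f"{effective_tb:.0f} TB effective (~{effective_tb / 1000:.1f} PB)")
```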
With the DVX 3.0 payload, we support a single datastore spanning all data nodes. This creates a single global namespace and a single deduplication storage pool. Each data node has dual NVRAM, dual controllers, redundant data network links (2-4 depending on the model), redundant power supplies, and so on, so that there is no single point of hardware failure within the data node cluster.
The DVX architecture is server powered, with all data management functions (replication, encryption, fingerprinting, compression, space reclamation, erasure coding, drive rebuild) being carried out on Compute Nodes. The compute available for storage data services scales out as more compute nodes are added. You can add data nodes to add capacity to the system. As more data nodes are added, write performance also increases linearly, since we get more disks and more network links resulting in more storage pool write bandwidth. The increased pool read bandwidth also helps with increased space reclamation performance and increased drive rebuild performance.
Split Provisioning Architecture – No Bottlenecks
Each data node brings drives that are pooled together in a single drive pool, and data is uniformly distributed across data nodes, increasing the NVRAM bandwidth available to VMs because the system aggregates NVRAM across data nodes.
Data stripes are broken down into data chunks that are distributed to the storage pool using a Layout Map. The Layout Map solves two problems: (1) ensuring that data is distributed evenly across all data nodes, and (2) ensuring that, in the event of a disk failure, the rebuild load is distributed across all disks and hosts.
When a new data node is added, and more disks are added to the drive pool, the data is rebalanced, and the rebalancing also scales as data nodes are added. The data stripe is written via distributed erasure coding, making sure that there are two EC parity chunks to tolerate two simultaneous drive failures in the drive pool. More parity chunks may be added in the future if there is a need to tolerate more concurrent drive failures. Rebuild times decrease linearly as data nodes are added.
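A heavily simplified sketch of the layout idea: chunks are placed deterministically over every disk in the pool, so adding a data node extends the pool and a rebuild reads from almost all surviving disks. Real DVX placement and erasure coding are far more sophisticated; here parity is plain XOR (one parity chunk instead of two) and placement is just a hash, both stand-ins for illustration.

```python
import hashlib
from functools import reduce

# Simplified sketch of a layout map and stripe write. NOT the DVX algorithm:
# parity here is plain XOR (a stand-in for real erasure coding) and placement
# is a hash over the whole drive pool.

def layout(chunk_id: str, disks: list) -> str:
    # Deterministic spread of chunks over every disk in the pool, so a
    # rebuild after a disk failure reads from (almost) all remaining disks.
    h = int(hashlib.sha256(chunk_id.encode()).hexdigest(), 16)
    return disks[h % len(disks)]

def write_stripe(stripe_id: str, data_chunks: list, disks: list) -> dict:
    # One XOR parity chunk appended to the stripe (real EC would compute two).
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), data_chunks)
    placement = {}
    for i, _chunk in enumerate(data_chunks + [parity]):
        placement[f"{stripe_id}/c{i}"] = layout(f"{stripe_id}/c{i}", disks)
    return placement

# 2 data nodes x 4 disks each; adding a node simply extends the pool.
disks = [f"node{n}/disk{d}" for n in range(2) for d in range(4)]
print(write_stripe("s1", [b"\x01" * 4, b"\x02" * 4, b"\x04" * 4], disks))
```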
Datrium DVX now uses the mDNS multicast protocol to support ZeroConf. ZeroConf is a set of technologies where, when a device is plugged into a network, a unique IP is assigned automatically and resolves to a known hostname within the local subnet. ZeroConf can also be extended to provide a means to discover services available on each device.
With Datrium, ZeroConf is used during initial deployment to connect to a DVX on the local network via a well-known hostname and configure the system. It also provides node discovery in the cluster, listing all available nodes on the local network.
Ok, here is where things start to get interesting. The Datrium architecture was architected from the ground up for thousands of drives and hundreds of hosts. With the DVX 3.0 payload, the solution achieves mind-boggling numbers, scaling 10X: up to 128 compute nodes, 10 data nodes, a 1.7 petabyte data pool, 18 million IOPS, and more than 8GB/sec write throughput.
The system-wide performance is a combination of the number of compute nodes and data nodes in the platform. The numbers below demonstrate some of the incredible internal benchmarks done with 2x-compressible, un-dedupable data.
How does Datrium compare to…
To put this into perspective, a 70:30 read-to-write ratio with 8K block sizes gives us a direct comparison to XtremIO. Using this configuration, DVX will do 3.3M IOPS; 3.7x better performance than the largest XtremIO. XtremIO Specifications (here).
You may compare with 32KB reads where DVX will deliver 6.25M IOPS; 17x better than the largest Pure Storage FlashArray. Pure Storage Specifications (here).
You may compare with nominal read IOPS, likely 4KB, where DVX will deliver 18M IOPS; 2.7x better than the largest EMC VMAX All Flash. EMC VMAX All Flash Specifications (here).
You may compare with nominal read IOPS, likely 4KB, where DVX will deliver 18M IOPS; 1.8x better than the largest SolidFire All Flash array. SolidFire All Flash Specifications (here).
In all honesty, the only thing that is in our league might be EMC ScaleIO, and performance is its only metric. But if you care about data services like VM awareness, deduplication, compression, snapshots, cloning, erasure coding, and replication, Datrium DVX is the only solution that can get the job done.
Bear in mind that Datrium is a hybrid platform, not an All Flash system like the above arrays, and Datrium data nodes use 7,200 RPM hard disks for durable storage. That is truly mind-boggling! The secret sauce comes from designing a brand-new log-structured file system that treats the entire filesystem as a log: erasure-coded objects are written exactly once to an append-only log, never overwritten in place, and reclaimed only when deleted. Implementing a distributed, scale-out LFS is a difficult problem, especially when you consider how to reclaim space, but it gives you several excellent properties. Read more in this post.
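The core LFS idea, minus the distribution and erasure coding, fits in a few lines. Here is a toy sketch of an append-only store whose cleaner reclaims space by copying live objects forward; this is my own illustration of the general technique, not Datrium's implementation.

```python
# Toy sketch of a log-structured store: records are only ever appended; a
# cleaner reclaims space by copying live (latest-version) records forward.
# Illustrates the general LFS idea only; the real DVX filesystem is
# distributed and erasure-coded.

class LogStore:
    def __init__(self):
        self.log = []    # append-only list of (key, value) records
        self.index = {}  # key -> position of the latest version

    def put(self, key, value):
        self.index[key] = len(self.log)  # never overwrite: append + repoint
        self.log.append((key, value))

    def get(self, key):
        return self.log[self.index[key]][1]

    def clean(self):
        # Space reclamation: copy only the latest version of each key into
        # a fresh log, dropping dead (superseded) records.
        live = [(k, self.log[pos][1]) for k, pos in sorted(self.index.items())]
        self.log, self.index = [], {}
        for k, v in live:
            self.put(k, v)

s = LogStore()
s.put("a", 1); s.put("b", 2); s.put("a", 3)  # "a" rewritten; old copy is dead
print(len(s.log))          # 3 records before cleaning
s.clean()
print(len(s.log), s.get("a"))  # 2 records after cleaning; latest "a" kept
```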
If any of the vendors mentioned above disagree, or if you have a more current specs sheet, please let me know, and I will happily correct the disparity.
HOW ABOUT 100% RANDOM WRITE LARGE SPAN WORKLOADS?
(This is what separates the men from the boys)
We could not find a single vendor that would publish such workload numbers because it’s a very difficult workload. Our performance engineering teams have been hard at work trying to push the system as much as they can. The picture below demonstrates the DVX 100% random-write performance without gaming the results, like writing to a small file in NVRAM or such.
- 490 VMs
- 100% 32KB random writes
- 2.1x compressible
- Undedupable data
- 35 compute hosts
- 10 data nodes
- Large span (1TB per VM) with roughly half petabyte of logical LBA span
Those with any enterprise storage experience will agree that 8.5 GBps of Random-Write Throughput with 1.5ms application latency is an astonishing achievement, especially if one considers that all data is being checksummed, deduplicated, compressed and erasure coded inline to disk.
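For a quick sanity check, the throughput converts to IOPS as follows. I am assuming decimal GB and binary KB (32 x 1024 bytes) here; the exact unit conventions in the benchmark are not stated.

```python
# Sanity check: 8.5 GB/s of 100% 32 KB random writes across 490 VMs.
# Assumes decimal GB (1e9 bytes) and binary KB (1024 bytes); these unit
# conventions are an assumption, not stated in the benchmark.
throughput = 8.5e9   # bytes/sec
io_size = 32 * 1024  # bytes per IO
iops = throughput / io_size
print(f"~{iops:,.0f} write IOPS (~{iops / 490:,.0f} per VM)")
```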
Please note that I refer to application latencies as seen by ESX because this is the latency that is measured from the ESX level perspective, all the way through the network and storage stack, and back. Storage arrays normally report only one component of this latency, namely internal array latency that excludes the network and client storage protocol overheads.
Another interesting point is that the bigger the solution, the more throughput and NVRAM are available, and the faster the drive rebuilds. The time to rebuild failed drives decreases more than linearly.
Instant, Application Consistent Snapshots
VSS (Volume Shadow-Copy Service) is a Microsoft Windows service that allows backup and snapshot applications to “quiesce” guest applications (i.e., put them in a consistent on-disk state) before taking a backup. Datrium has created its own VSS Provider to implement native, instant, app-consistent snapshot capabilities for Microsoft workloads.
This first release of the VSS Provider supports Windows Server 2008 R2 onwards and Microsoft SQL Server 2005 onwards. Additional Microsoft applications such as Exchange and AD Controllers will be enabled in the future, after testing.
Due to the native integration with the Datrium platform, where both the VM metadata and the data live, Datrium can eliminate VM stun times and drastically reduce the application performance dip. As a result, IT admins can take VM-level, pause-less snapshots of applications with high change rates, and at greater frequencies for more granular recovery.
Datrium DVX 3.0 supports up to 1.2M snapshots in a 10 data node configuration, and 300K with a single data node.
Below is an example of Datrium VSS provider quiescing a Microsoft SQL Server and snapshotting the VM.
- 16 cores, 64GB RAM, 10 virtual disks (2x 500GB, 8x 40GB)
Number of VM stuns:
- VMware VSS – 3 VM stuns
- Datrium VSS – No VM stun
Duration of Application Performance Dip:
- VMware VSS – 8-10 minutes
- Datrium VSS – up to 10 seconds.
Peer Cache Mode
In DVX, we hold all data in use on flash on the host. Moreover, we guide customers to size host flash to hold all data for the VMDKs. With always-on dedupe/compression for host flash as well, this is feasible – with just 2TB flash on each host and 3X-5X data reduction you can have 6-10TB of effective flash. (DVX supports up to 16TB of raw flash on each host). Experience proves this is in fact what our customers do: by and large, our customers configure sufficient flash on the host and get close to 100% hit rate on the host flash.
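The sizing claim above is simple arithmetic: raw host flash multiplied by the data-reduction factor gives the effective cache capacity.

```python
# Effective host-flash capacity: raw flash times the 3x-5x data-reduction
# range quoted above. Simple arithmetic, matching the 6-10 TB claim.
raw_flash_tb = 2
for reduction in (3, 5):
    print(f"{raw_flash_tb} TB raw x {reduction}x -> "
          f"{raw_flash_tb * reduction} TB effective")
```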
However, in most instances, due to data reduction benefits, customers decide to have only one or two flash devices on each server, because that’s more than enough from a capacity and performance standpoint. With previous releases of DVX, if the last available flash device failed, the workload would stop, and applications would have to be restarted, manually or via HA, on a different host.
With DVX 3.0 we are introducing Peer Cache: the ability to use the flash devices on other hosts to keep a workload running even if the last available local flash device fails, without drastically impacting application performance until new flash devices are installed. Read IO now has to traverse the network, introducing east-west traffic and some additional latency instead of being served locally; in this mode, DVX works much like a traditional SAN.
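The read-path priority described throughout this post (local flash first, then peer flash, then the data-node pool) can be sketched as a simple tiered lookup. Tier names and the repopulation behavior here are my illustration of the concept, not the actual DVX code path.

```python
# Sketch of a tiered read path with Peer Cache: local flash first, then peer
# hosts' flash, then the data-node pool. Illustrative only; tier names and
# repopulation behavior are assumptions.

def read_block(fingerprint, local_cache, peer_caches, data_pool):
    if fingerprint in local_cache:
        return local_cache[fingerprint], "local-flash"
    for peer in peer_caches:                 # east-west hop: extra latency
        if fingerprint in peer:
            data = peer[fingerprint]
            local_cache[fingerprint] = data  # repopulate local flash
            return data, "peer-flash"
    data = data_pool[fingerprint]            # slowest path: data nodes
    local_cache[fingerprint] = data
    return data, "data-node"

local = {}
peers = [{"fp1": b"hot"}]
pool = {"fp1": b"hot", "fp2": b"cold"}
print(read_block("fp1", local, peers, pool))  # served from a peer
print(read_block("fp1", local, peers, pool))  # now served locally
print(read_block("fp2", local, peers, pool))  # falls through to data nodes
```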
This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net