How VMware Intends to Leverage Datrium DRaaS

Breaking: VMware Announces Intent to Acquire Datrium to Provide Disaster Recovery-as-a-Service for Hybrid Cloud Environments https://lnkd.in/gREXsSD

If you are curious to understand how VMware intends to leverage Datrium Disaster Recovery-as-a-Service with VMware Cloud, watch this video.

1 minute and 18 sec … to DR a 13 Terabyte Microsoft SQL Server to VMware Cloud with Datrium

Yesterday I published Microsoft SQL Server Performance, Consistency at Peak, and The Art of Possible, demonstrating Microsoft SQL Server with a HammerDB TPC-C workload running on Datrium.

I thought: if I can run Microsoft SQL Server this fast and consistently, what would happen if my datacenter were suddenly gone and I used Datrium DR-as-a-Service with VMware Cloud? How long would it take for my 13 TB VM to come back online and be ready for my applications? (See my previous articles for additional information on the VM and HammerDB.)

First things first: Datrium DRaaS with VMware Cloud on AWS is a comprehensive cloud-based backup and disaster recovery service for the protection of VMware workloads on-premises and in the cloud. Datrium DRaaS includes Cloud Backup, DR plan configuration and orchestration, and the ability to live mount snapshots on VMware Cloud on AWS. It’s available as a downloadable virtual appliance.

To protect the Microsoft SQL Server VM, I created a protection group in our on-premises vSphere environment and set policies to take snapshots and replicate them to the Cloud Backup repository on AWS every 30 minutes (that is my current RPO).
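
To picture what such a policy amounts to, here is a minimal sketch of a protection group with a 30-minute snapshot-and-replicate policy. The names (ProtectionGroup, SnapshotPolicy, the VM and repository labels) are my own illustrative assumptions, not the actual Datrium DRaaS interface.

```python
from dataclasses import dataclass

@dataclass
class SnapshotPolicy:
    """Hypothetical snapshot/replication policy (names are illustrative only)."""
    interval_minutes: int   # how often to snapshot -> effective RPO
    replicate_to: str       # target Cloud Backup repository
    retention_days: int     # how long snapshots are kept

@dataclass
class ProtectionGroup:
    """Hypothetical protection group tying a set of VMs to one policy."""
    name: str
    vm_names: list[str]
    policy: SnapshotPolicy

# A 30-minute interval means the worst-case data loss (RPO) is 30 minutes.
mssql_group = ProtectionGroup(
    name="mssql-prod",
    vm_names=["sql-hammerdb-01"],
    policy=SnapshotPolicy(interval_minutes=30,
                          replicate_to="cloud-backup-aws",
                          retention_days=30),
)

print(f"RPO for {mssql_group.name}: {mssql_group.policy.interval_minutes} minutes")
```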

Then I created a DR plan. A DR plan includes a set of recovery steps that capture ordering constraints and action-sequencing instructions for DR operations. These are the ordered instructions that run when the plan is executed.
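
To make "ordering constraints" concrete, here is a tiny sketch of recovery steps as an ordered list, where a dependent VM only starts after the database it needs. The step names and the second VM are hypothetical examples for illustration, not Datrium's actual plan format.

```python
# Hypothetical DR plan: an ordered list of (action, target) steps.
# Ordering matters: the database VM comes up before anything that depends on it.
recovery_order = [
    ("recover_vm", "sql-hammerdb-01"),           # database first
    ("wait_for_guest_tools", "sql-hammerdb-01"),
    ("recover_vm", "app-server-01"),             # hypothetical dependent VM
    ("run_script", "smoke-test-transactions"),
]

for position, (action, target) in enumerate(recovery_order, start=1):
    print(f"step {position}: {action} -> {target}")
```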

The figure below shows the sites and replication topology: the on-premises Microsoft SQL Server VM is running on site DVX110, replicating to the Cloud Backup repository on AWS, and set to fail over to the VMware Cloud SDDC ‘Solutions’.

Sites and Replication Topology

DR plan recovery steps apply to the plan itself and control the recovery workflow. For example, a planned failover creates a new workflow of recovery operations based on the recovery steps defined in the plan: steps are executed on the source site (power off VMs, replicate the last snapshot) and on the destination site (recover VMs in the predefined order). An unplanned failover creates a different workflow based on the same recovery steps defined in the plan.
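
A rough way to picture how one set of recovery steps yields two different workflows: a planned failover can include the source-site steps, while an unplanned failover skips them because the source site is assumed to be gone. This is a conceptual sketch of that idea, not the actual orchestration logic.

```python
# Conceptual sketch: one plan, two workflows (illustrative only).
recovery_steps = [
    {"site": "source",      "action": "power_off_vms"},
    {"site": "source",      "action": "replicate_last_snapshot"},
    {"site": "destination", "action": "recover_vms_in_order"},
]

def build_workflow(steps, planned: bool):
    """Planned failover runs the source-site steps first; unplanned failover
    assumes the source site is unavailable and only runs destination steps."""
    if planned:
        return steps
    return [s for s in steps if s["site"] == "destination"]

print("planned:  ", [s["action"] for s in build_workflow(recovery_steps, planned=True)])
print("unplanned:", [s["action"] for s in build_workflow(recovery_steps, planned=False)])
```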

1 MINUTE AND 18 SECONDS

The total amount of data before deduplication is 13 terabytes (1 x 500 GB vDisk for the guest OS and 5 x 2,500 GB vDisks for data). Yet upon executing the DR plan failover, the Microsoft SQL Server VM is ready to start serving transactions in just 1 minute and 18 seconds, as shown in the picture below.

The reason for such an incredible RTO is the ability to live mount the backups as a secure NFS mount on the ESXi hosts in the VMware Cloud SDDC; this capability is unique to Datrium.

After the Microsoft SQL Server VM is up and running, the solution starts a low-priority background process to relocate the data from Cloud DVX to the VMware SDDC local storage; this process takes 5 hours and 25 seconds. During that time, however, Microsoft SQL Server is online and serving transactions normally.
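
As a back-of-the-envelope sanity check on why live mount matters, the arithmetic below contrasts the near-instant RTO with how long the full data set takes to copy. It assumes the entire 13 TB (pre-deduplication) figure is moved over roughly five hours; the bytes actually relocated are likely smaller after dedupe and compression.

```python
# Figures from this article (assuming the full pre-dedupe data set is relocated).
guest_os_gb = 500
data_vdisks_gb = 5 * 2500
total_gb = guest_os_gb + data_vdisks_gb          # 13,000 GB ~= 13 TB

rto_seconds = 1 * 60 + 18                        # live mount: 1 min 18 s
relocation_seconds = 5 * 3600 + 25               # background copy: ~5 hours

avg_relocation_mb_s = (total_gb * 1000) / relocation_seconds
print(f"total data set: {total_gb} GB")
print(f"RTO with live mount: {rto_seconds} s")
print(f"background relocation: {relocation_seconds / 3600:.1f} h "
      f"(~{avg_relocation_mb_s:.0f} MB/s if the full set were copied)")
```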

At this point, DRaaS starts tracking the new data and changes that occur on the VM, synchronizing them back to the Cloud Backup repository and making them available for failback.

As you can see, IT and SQL administrators can take full advantage of Datrium DRaaS to fail over Microsoft SQL Server and other applications to the VMware Cloud while achieving near-instant RTO, even for large databases, at marginal cost.

This and other gems are topics in an upcoming Datrium paper.

Microsoft SQL Server Performance, Consistency at Peak, and The Art of Possible with Datrium

HammerDB is a free load-testing and benchmarking tool for databases. For this HammerDB test, I chose the TPC-C schema to build a roughly 500 GB database (to start with) of 200 warehouses and used 100 concurrent users to drive the database transactions with no think time enabled.
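
For readers who want to reproduce a similar schema build, the sketch below drives the HammerDB CLI from Python to create a 200-warehouse TPC-C schema. Treat the dict parameter names (mssqls_count_ware, mssqls_num_vu, the connection setting) and the non-interactive invocation as assumptions from memory; verify them against your HammerDB version and documentation before running.

```python
import subprocess
import textwrap

# Hypothetical wrapper: write a HammerDB CLI script and run it with
# "hammerdbcli auto". Parameter names are assumptions; check your release.
build_script = textwrap.dedent("""\
    dbset db mssqls
    dbset bm TPC-C
    diset connection mssqls_server my-sql-host
    diset tpcc mssqls_count_ware 200
    diset tpcc mssqls_num_vu 16
    buildschema
""")

with open("build_tpcc.tcl", "w") as f:
    f.write(build_script)

# Assumes hammerdbcli is on PATH; "auto" runs the script non-interactively.
subprocess.run(["hammerdbcli", "auto", "build_tpcc.tcl"], check=True)
```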

TPC-C involves a mix of five concurrent transactions of different types and complexity either executed on-line or queued for deferred execution. The database consists of nine types of tables with a wide range of records and population sizes. TPC-C is measured in transactions per minute (tpmC). While the benchmark portrays the activity of a wholesale supplier, TPC-C is not limited to the activity of any particular business segment, but rather represents an industry that must manage, sell, or distribute a product or service.

TPC-C simulates online transaction processing (OLTP) workloads, which are typically characterized by a database receiving both requests for data and multiple changes to the data from user transactions. HammerDB simulated roughly a 70:30 split of read/write transactions at an 8 KB block size in most of the test runs.

The virtual machine was configured with 16 vCPUs and 64 GB of RAM. The server is a 2016-model PowerEdge R930 with E7-8890 v4 processors @ 2.20 GHz and 8 x 1.9 TB SATA Samsung GC57 drives (MZ7LM240HMHQ0D3/2016). The Datrium data node is an F12X2.

I’m basically using a 4-year-old hardware configuration.

Performance and Consistency at Peak

The figure below demonstrates the performance achieved during a steady-state run with the resources made available through the virtual machine configuration. We define steady state as running with data services turned on (the Datrium default), which includes erasure coding (three-way replication), compression, deduplication, space reclamation, and snapshots.

The system achieved over 1.07 million transactions per minute (TPM) with the HammerDB TPC-C workload in steady state. Measured by the hypervisor, the maximum read latency was 1.3 ms and the maximum write latency was 1.1 ms, during a nearly 100% write burst.

While performance is important for business-critical applications, consistency during peak workloads is just as important, if not more so. The figure below shows that Datrium maintained consistent IO and throughput throughout the HammerDB workload.

Bottlenecks and The Art of Possible

The particular workload above was bottlenecked by the amount of CPU available to the virtual machine and Microsoft SQL Server. The benchmark in the next section demonstrates the art of possible when bottlenecks are eliminated.

Increasing the virtual machine CPU count from 16 to 64 more than doubled the number of transactions per minute (TPM), while Datrium was still able to maintain consistent IO and throughput.

From a performance perspective, even this workload barely scratches the surface of what Datrium is capable of producing. On average, Datrium can sustain 1,500 MiB/s per host, while this HammerDB workload produced an average of 430 MiB/s.
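
To put those throughput figures in perspective, a quick conversion at the ~8 KB block size mentioned earlier gives a rough IOPS equivalent. This is plain arithmetic on the numbers above, not additional measured data.

```python
# Rough conversion of the quoted throughput figures to IOPS at an
# 8 KiB block size (the size used in most of the test runs).
BLOCK_KIB = 8

def mib_per_s_to_iops(mib_per_s: float, block_kib: int = BLOCK_KIB) -> float:
    return mib_per_s * 1024 / block_kib

workload_mib_s = 430            # average produced by this HammerDB run
per_host_ceiling_mib_s = 1500   # average per-host capability quoted above

print(f"~{mib_per_s_to_iops(workload_mib_s):,.0f} IOPS at {workload_mib_s} MiB/s")
print(f"~{mib_per_s_to_iops(per_host_ceiling_mib_s):,.0f} IOPS at {per_host_ceiling_mib_s} MiB/s")
```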

Further gains are possible by optimizing the host hardware in use, for example with faster processors or NVMe devices. However, most enterprise-grade workloads will likely fall well below what Datrium can deliver even with older CPU generations and just a couple of older flash drives.

As you can see, IT and SQL administrators can take full advantage of the performance gains of the Datrium solution, deploy production databases, and leverage the advances brought by newer CPUs, local flash storage access, and data protection management to meet the needs of today’s modern datacenters.

This and other gems, including a Microsoft SQL Server with HammerDB Disaster Recovery to VMware Cloud on AWS, are topics in an upcoming Datrium paper.

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net
