World’s 1st Just-in-Time Cloud DR (VMware Cloud)… Everything Techies Need to Know….

No, the title of this post isn't clickbait.

Organizations spend hundreds of thousands of dollars per year to keep Disaster Recovery (DR) datacenters operational, waiting for disaster to strike so that their huge upfront and ongoing investment pays off, while at the same time crossing their fingers and hoping the tragedy never happens. In many ways, DR datacenters are like insurance with huge upfront premiums.

Cloud DR where you only pay when needed has been an unfulfilled promise by all sorts of vendors, especially for those running applications on the VMware ecosystem.

There are just too many datacenter components that need to be in place before any IT organization could even start considering Cloud DR; things like networks, resource pools, folders, datastores, IP addresses, VMs, replication, etc. Further, if the cloud does not support VMware, those VMs would need to be converted into a new disk format. And yes, it's unbelievable that some vendors actually offer such VM conversions; I'm not even going to get into how cumbersome these conversions really are.

However, even if the VM and infrastructure conversions succeed and you manage to fail over your entire datacenter, you then need to worry about the failback process. How do you replicate all the changes back, and how do you do so without your organization being charged exorbitant egress costs for a full private datacenter re-seeding?

Cloud DR with VM Conversion and no Failback is akin to a Helicopter with an Ejection Seat. Good Luck!

Honestly, there are just so many things that could go wrong in this process that most IT organizations would not even dare to start thinking about the cloud as a way to eliminate DR costs.

How wonderful would it be if VMware workloads could reliably use VMware Cloud on AWS for DR? Today Datrium is announcing a groundbreaking new service and another piece of the Automatrix platform.

DRaaS, or DR-as-a-Service.

At a high level, DRaaS is a complete DR solution offered as a subscription. It leverages VMware Cloud and native Datrium cloud-based backups to deliver fully orchestrated, low-cost, low-RPO, and low-RTO disaster recovery for the VMware ecosystem. OK, there's a lot to unpack here.

First things first… follow the numbers and things will start to make sense.

1) Datrium DVX on-premises provides high-performance storage with built-in backup. Individual, granular native VM snapshots can be kept on the system for many years. Universal dedupe and compression make the whole solution very cost-effective.

2) All backups are replicated to Cloud DVX, an as-a-service backup vault on AWS. Cloud DVX uses S3 as a repository of backups stored in a native compressed and deduplicated form. On-premises DVX systems send native forever-incremental backups to Cloud DVX. Global deduplication ensures that data is transferred only when needed. The built-in Datrium VPN ensures there are no transfer charges during backup.
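The post doesn't disclose Cloud DVX internals, but the "transferred only when needed" behavior of global deduplication can be sketched with a toy content-addressed store. Everything here (the chunk size, the class, the method names) is a hypothetical illustration, not Datrium's actual design:

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # hypothetical fixed chunk size

class DedupVault:
    """Toy content-addressed store standing in for the Cloud DVX S3 vault."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def backup(self, data: bytes) -> int:
        """Store data; return how many bytes actually crossed the wire."""
        transferred = 0
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:  # only chunks the vault hasn't seen
                self.chunks[fp] = chunk
                transferred += len(chunk)
        return transferred

vault = DedupVault()
base = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
first = vault.backup(base)                              # first backup: all chunks are new
second = vault.backup(base + bytes([9]) * CHUNK_SIZE)   # incremental: only the new chunk moves
```

The second backup transfers a single chunk even though the full dataset was presented again, which is the property that keeps both steady-state transfer and S3 storage costs low.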

3) ControlShift is a DR orchestration service that executes DR plans and provisions and monitors SDDCs in VMware Cloud on AWS. DR plans and states are stored in a highly-available Plan database replicated across multiple availability zones. Built-in self-healing ensures that, in the event of public cloud unavailability, all affected Datrium services automatically migrate to a healthy availability zone without any data loss.

All Datrium services are deployed as AMIs into a Datrium-created VPC and Subnet. VPC endpoints used to access all other external services required by ControlShift and Cloud DVX are created automatically. All components are monitored and restarted for high availability and resilience. All required state is replicated to ensure resilience.

There are a lot of things happening under the covers, but as a VMware administrator you only need to know and understand vCenter, as all cloud constructs are completely abstracted away from you.

4) VMware Cloud SDDC provides a vSphere-based execution environment for a cloud DR target. An SDDC can be provisioned on-demand via ControlShift. A provisioned SDDC incurs hourly charges. Upon DR test completion, the SDDC can be decommissioned via the ControlShift UI. ControlShift performs automated network configurations for both AWS and VMware Cloud to make S3 backups from Cloud DVX available for spin-up in SDDC. The SDDC is managed via the familiar vCenter interface.

5) Failback is efficient, with minimal AWS egress charges, because only changed, globally deduplicated data is transferred. Like failover, failback is fully automated. Data changes that occur while executing in the VMware Cloud are captured and stored as a Cloud DVX snapshot in S3.

6) ControlShift also orchestrates the transfer back to the on-premises data center, which includes just the data that changed. Cloud egress charges are minimized by delta transfers, which only occur when deltas are not already present in the on-premises data center.
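Taken together, the numbered steps describe a plan lifecycle: fail over into an on-demand SDDC, run in the cloud, then fail back and tear the SDDC down. That lifecycle can be modeled as a small state machine; the state and action names below are illustrative only, since ControlShift's real plan model is not documented in this post:

```python
# Toy state machine for a DR plan's lifecycle. States, actions, and the
# DRPlan class are hypothetical sketches, not Datrium's actual API.

TRANSITIONS = {
    ("ready", "failover"): "failed_over",   # provision SDDC, boot VMs from S3 backups
    ("failed_over", "failback"): "ready",   # replicate deltas on-premises, tear down SDDC
    ("ready", "test"): "testing",           # non-disruptive DR test in a throwaway SDDC
    ("testing", "teardown"): "ready",       # decommission the test SDDC, stop hourly charges
}

class DRPlan:
    def __init__(self, name: str):
        self.name = name
        self.state = "ready"
        self.history = []

    def execute(self, action: str) -> None:
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} from state {self.state}")
        self.state = TRANSITIONS[key]
        self.history.append(action)

plan = DRPlan("prod-datacenter")
plan.execute("failover")   # DR event: run in VMware Cloud
plan.execute("failback")   # changes synced back, SDDC torn down
```

The point of the sketch is that every path returns the plan to "ready": whether it was a real disaster or a DR test, the SDDC exists only for the duration of the run.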


The example above is the Just-in-Time deployment mode, but DRaaS also supports Ahead-of-Time Deployment and Pilot-Light with Cloud Burst deployment.

Just-in-Time deployment – This mode eliminates any upfront infrastructure CAPEX and drastically cuts OPEX. You only pay for VMware Cloud during a disaster event. However, when DR is triggered, you may need to wait for the SDDC creation, which takes approximately 90 minutes. After use, the changes are synchronized back and the SDDC is torn down.

Ahead-of-Time Deployment – This mode is most similar to a hot DR site: the SDDC is created upfront and is readily available to use, with a very low RTO. In this mode, you keep paying for all the hosts in the SDDC until a DR event happens.

Pilot-Light with Cloud Burst – This mode is a compromise between the two options above. ControlShift creates an SDDC on VMware Cloud with a minimal number of hosts to fail over the most critical VMs with very low RTO. Then, on demand, new hosts are added to the SDDC to complete the failover of less essential VMs. In this mode, you pay for just a minimal number of hosts until DR is triggered, and for full capacity only when DR is in full effect.
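One way to read the three modes is as a trade-off between standby capacity you pay for and the RTO you get. The sketch below makes that explicit; the host counts and RTO figures are hypothetical placeholders (only the ~90-minute SDDC creation time comes from the text above), not actual Datrium or VMware Cloud numbers:

```python
# Illustrative standby-cost vs. RTO trade-off for the three DRaaS modes.
# All figures are made-up placeholders except the ~90-minute SDDC build time.

MODES = {
    "just_in_time":  {"standby_hosts": 0,  "rto_minutes": 90},  # SDDC built on demand
    "pilot_light":   {"standby_hosts": 3,  "rto_minutes": 30},  # minimal SDDC, burst later
    "ahead_of_time": {"standby_hosts": 10, "rto_minutes": 5},   # full SDDC always running
}

def cheapest_mode(max_rto_minutes: int) -> str:
    """Pick the mode with the fewest standby hosts that still meets the RTO."""
    candidates = [
        (m["standby_hosts"], name)
        for name, m in MODES.items()
        if m["rto_minutes"] <= max_rto_minutes
    ]
    if not candidates:
        raise ValueError("no mode meets this RTO")
    return min(candidates)[1]
```

Loose RTO targets land on Just-in-Time (zero standby hosts), while single-digit-minute targets force Ahead-of-Time, with Pilot-Light covering the middle ground.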

Using VMware Cloud in combination with Datrium DRaaS solves a massive problem for organizations that require bulletproof DR but would also like to reduce costs. The picture below demonstrates each state of a DR plan with the existing vendor offerings (in red) vs. Datrium DRaaS (in green).

DRaaS offers autonomous deployment, configuration, maintenance, upgrades, and healing from component failures; IT organizations don't need to touch or understand AWS or VMware Cloud. Datrium monitors and supports all components of the system, including AWS and VMware Cloud SDDC, aided by partnerships with both companies.

This article was first published by Andre Leibovici (@andreleibovici) at

VMware Cloud + Datrium = Awesome Recovery!!

VMware is the de-facto standard for the enterprise datacenter, and AWS has the broadest set of cloud services with global scale and reach. If you are not aware, the two companies have jointly engineered a cloud solution that delivers the best of VMware and AWS to customers. VMware Cloud on AWS has a massive global presence.

Datrium DRaaS with VMware Cloud on AWS is a comprehensive cloud-based backup and disaster recovery service for the protection of Datrium DVX on-premises systems. It encompasses Cloud DVX backup, ControlShift orchestration, as well as VMware Cloud on AWS. DRaaS dramatically reduces costs, keeps data safe and secure, and delivers enterprise-grade failover and failback. It enables organizations to eliminate physical DR sites, provides integrated management, and because it’s delivered as a SaaS solution, it eliminates the complexity of packaged software.

Just-in-Time Deployment

As the name suggests, Just-in-Time deployment instantiates a VMware Cloud SDDC from scratch and migrates backup data (already in Cloud DVX on AWS) to VMware Cloud at the click of a button. This deployment model enables customers to have a DR site, with networking and everything else needed, in one hour or less, without incurring any upfront costs. There are also other DR modes with lower RTO; check them here.

Yes, it sounds too good to be true, and that is the reaction we get from customers when they see a demo for the first time. It’s really groundbreaking!

During steady-state operation, backups are sent to the cloud backup site, and after some processing, land in a cost-effective compressed and deduplicated form in an S3 bucket. In the just-in-time mode of deployment, a cloud DR site is created only following a disaster. VMware Cloud SDDC is deployed only immediately before executing a DR plan.

To make this possible, DRaaS leverages the space and cost efficiencies of Cloud DVX. The protected site replicates VMs or protection groups in their forever incremental format to Cloud DVX, which in turn stores them in a compressed and deduplicated native format within the low-cost S3. During steady-state operation, the costs of data protection are limited to the costs of the Cloud DVX backup service and the cost of the S3 media.

Following a DR event, ControlShift deploys a new SDDC and orchestrates the failover to this SDDC as part of a DR plan execution. This process uses a fast high-bandwidth network link from VMware Cloud SDDC to AWS S3 to get access to backups. The recurring charges for the Cloud DR site start accumulating only after the SDDC deployment.

After the DR event, DRaaS replicates unique data back to the cloud backup site, fails all VMs back on-premises, and tears down the SDDC.

Visit the DRaaS website for more info.


What I have learned from 100 Oracle Benchmarks on Datrium

I have recently completed a series of benchmarks with Oracle on Datrium, and unsurprisingly, Oracle is a perfect fit for Datrium's host flash acceleration and integrated data protection. I have documented the tests, findings, and recommendations in a document published here, but below you will find a few highlights and lessons learned.

  • [OEL + ASM FTW] XFS with CentOS/7 provides the highest number of Read IO/sec in a single sample, and ASM + OEL (Oracle Enterprise Linux) produced the highest number of Write IO/sec. However, the highest average IO/sec for both Read and Write IOs across all benchmarks was provided by Oracle ASM + OEL, making ASM with Oracle Enterprise Linux the best option in terms of performance and manageability.
  • [READ OR MIXED?] Linux LVM and Oracle ASM provide excellent performance with their default configuration. The difference between 512KB and 4MB block sizes is marginal, but larger block sizes should be used for read-intensive workloads with Large Read I/O.
  • [FLASH + LVM] – When using Linux LVM, the XFS file system provides the best performance when paired with high-speed disks, such as SSDs, which can take real advantage of parallelized access and multi-threaded designs.
  • [SCREAMING PERFORMANCE] Datrium performance outshines any published, comparable benchmark by HCI vendors that I have been able to find. This picture demonstrates 3 hosts delivering 165K IOPS and 1.3GB/s throughput, with an average VM-level latency of 1.7ms. Considering Datrium's upper system limit of 128 hosts, and assuming the same workload per host, we can expect approximately 7M IOPS and 55.4GB/s throughput at the same VM-level latency of 1.7ms. (source blog at the end)
  • [ZERO IMPACT DATA PROTECTION] Datrium Protection Groups and snapshots have zero impact during high-performance workloads, unlike most HCI solutions. The picture below demonstrates two 12-hour Oracle burn-in tests: the first without snapshots, and the second with native snapshots taken every 10 minutes and retained for 24 hours. (source blog at the end)
  • [ZERO IMPACT ENCRYPTION] Datrium provides consistent performance and near-zero impact when applying end-to-end encryption along with all enterprise data services. The picture below demonstrates three 1-hour Oracle burn-in tests: the first without encryption, the second with Approved mode encryption, and the third with Validated mode encryption. (source blog at the end)
  • [RMAN IS BONUS] Use Datrium crash-consistent protection for Oracle DB with instant recovery from local or remote snapshots. Where app-consistent data protection is required, complement it with Oracle RMAN and Data Guard.
  • [MAX OUT PVSCSI] – Use four paravirtual SCSI (PVSCSI) controllers for high I/O load. Multiple PVSCSI controllers allow the execution of several parallel I/O operations inside the guest OS.
  • [KEEP SCSI LEAN] Multiple disks evenly distributed across paravirtual SCSI (PVSCSI) controllers provide enhanced parallel I/O operations inside the guest operating system. My tests demonstrate that peak performance (at least with Datrium) is achieved with 3 data disks per controller, but 2 data disks per controller offer similar performance results while keeping the solution simpler.
  • [IT’S FLASH BABY] Quantity, capacity, and performance are all important. My tests demonstrate that for Oracle databases, four or more cheaper and slower SSDs will deliver lower latency and higher IOPS than two fast, expensive SSDs, due to I/O parallelization and multithreading.
  • [KEEP DB SIMPLE] Datrium does not require separating Redo/Log I/O traffic from data file I/O traffic through separate virtual SCSI controllers, but some Oracle administrators may still want to segment them for database hygiene.
  • [PACKET SIZES] – The default MTU of 1500 between Compute Nodes and Data Nodes provides excellent performance. Jumbo frames with an MTU of 9000 are not a requirement.
  • [DB_BLOCK_CHECKSUM is OFF] – DB_BLOCK_CHECKSUM determines whether the direct loader calculates a checksum and stores it in the cache header of every data block when writing it to disk. Datrium already natively calculates data block checksums and guarantees data integrity.
  • [REDUNDANCY INCLUDED] Datrium natively provides 3-way replication using Erasure Coding with Parity 2, so Oracle ASM should simply be set to External Redundancy. With external redundancy, the underlying disks in the disk group must provide the redundancy.
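The PVSCSI advice in the list above (four controllers, with two or three data disks spread evenly across them) amounts to a simple round-robin layout. A quick sketch, with illustrative disk names and a hypothetical helper:

```python
# Round-robin data disks across PVSCSI controllers, per the guidance above.
# vSphere supports up to four SCSI controllers per VM; names are illustrative.

MAX_PVSCSI_CONTROLLERS = 4

def distribute_disks(num_disks: int, num_controllers: int = MAX_PVSCSI_CONTROLLERS) -> dict:
    """Return a controller-index -> list-of-disk-names layout."""
    layout = {c: [] for c in range(num_controllers)}
    for disk in range(num_disks):
        layout[disk % num_controllers].append(f"disk{disk}")
    return layout

# Eight data disks over four controllers gives two per controller, the
# "similar performance but simpler" configuration from the tests above.
layout = distribute_disks(8)
```

Twelve disks over four controllers would give the three-per-controller peak configuration instead; either way, no controller ends up with more than one extra disk.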

There have been a lot of lessons learned, including finding out that a few vendors lie, hide, skew, and deceive customers about their benchmark numbers. As I mentioned, you will find all the details of my tests in a document published here.


