Data Integrity Should Be Taken Very Seriously

By now it is pretty clear that Datrium is one of the few convergence and storage vendors that has been entirely open about performance, neither hiding numbers nor gaming performance benchmarks.

Check out this article by Ganesh and Dhazi on our real performance numbers, What Separates The Storage Industry’s Men From Boys? Random Writes!, or read the white paper Real-World Microsoft SQL Server Performance On Datrium DVX.

 

Now Let’s Be Open and Talk About Data Integrity

 

Datrium takes data integrity very seriously and has worked hard to build a system that delivers the highest levels of data integrity and durability. To achieve this, Datrium has pursued a three-pronged strategy: architecture, in-line integrity checks, and rigorous testing.

 

Architecture

The Datrium DVX is designed to minimize the risk of systemic or software issues impacting data integrity. Architectural features incorporated for this purpose include:

 

– Content-Addressed Data

All data is addressed by a cryptographic-strength fingerprint very early in the write pipeline (prior to encryption), uniquely identifying each grouping of data. This is the strongest possible check that the content of the data matches what was stored. In addition, such data never changes; any modification produces new data with a new and unique fingerprint. Thus, errors resulting from races to update a particular location in the storage system are eliminated, because there are no such updates: all newly written data is written to a new location.
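
To make the idea concrete, here is a minimal sketch in Python (illustrative names only, with SHA-256 standing in for whatever fingerprint the DVX actually uses): the address of a block is a hash of its contents, so anything read back can be re-verified against the very address used to fetch it, and changed content can only ever land at a new address.

import hashlib

# Minimal sketch of content addressing: the address of a block is a
# cryptographic fingerprint of its contents, so stored bytes can always
# be re-verified against the address used to fetch them.
class ContentStore:
    def __init__(self):
        self._blocks = {}  # fingerprint -> immutable payload

    def put(self, payload: bytes) -> str:
        fingerprint = hashlib.sha256(payload).hexdigest()
        # Writes never update in place: identical content maps to the
        # same fingerprint, new content gets a brand-new address.
        self._blocks.setdefault(fingerprint, payload)
        return fingerprint

    def get(self, fingerprint: str) -> bytes:
        payload = self._blocks[fingerprint]
        # The strongest possible check: re-fingerprint on read and
        # compare with the address the caller asked for.
        if hashlib.sha256(payload).hexdigest() != fingerprint:
            raise IOError("content does not match its fingerprint")
        return payload

store = ContentStore()
addr = store.put(b"guest VM block contents")
assert store.get(addr) == b"guest VM block contents"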

 

 

– Log-Structured Filesystem (LFS)

The log-structured layout ensures that new data is always written to new locations. Even in the case of an overwrite, the old data remains in the system until the space reclamation process runs, so no needed data is ever inadvertently overwritten or lost. In addition, writes are batched into whole, erasure-coded stripes, which eliminates any chance of data loss due to a partially updated stripe.

Learn more about Datrium Log-Structured Filesystem (LFS) here.
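
As a rough illustration of the log-structured idea (a toy Python sketch, not the DVX on-disk format), every write appends to the log and an overwrite merely repoints an index at the newer version; the older version survives untouched until space reclamation runs.

# Toy log-structured store: all writes append; "overwrites" simply add a
# newer version and repoint the index, so old data survives until a
# separate space-reclamation (garbage collection) pass runs.
class LogStructuredStore:
    def __init__(self):
        self.log = []      # append-only list of (key, payload) records
        self.index = {}    # key -> offset of the latest version in the log

    def write(self, key: str, payload: bytes) -> None:
        self.log.append((key, payload))          # always a new location
        self.index[key] = len(self.log) - 1      # point at the new version

    def read(self, key: str) -> bytes:
        return self.log[self.index[key]][1]

    def reclaim_space(self) -> None:
        # Only here do superseded versions disappear; until this runs,
        # nothing has been destroyed in place.
        live = [(k, p) for off, (k, p) in enumerate(self.log)
                if self.index[k] == off]
        self.log = live
        self.index = {k: off for off, (k, _) in enumerate(self.log)}

s = LogStructuredStore()
s.write("blk0", b"v1")
s.write("blk0", b"v2")       # the old b"v1" still sits in the log
assert s.read("blk0") == b"v2"
s.reclaim_space()            # now the superseded version is dropped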

 

– Double Fault Tolerance Erasure Coding

The system is designed to protect against two simultaneous disk failures. Thus, even when one disk fails, the system maintains redundancy that can be used to recover from any integrity fault discovered by the In-Line Integrity Verification described below.
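
The exact erasure code is not spelled out in this post, but the principle is the classic dual-parity one: with two parity symbols per stripe, any two lost members can be rebuilt. The toy RAID-6-style P/Q sketch below (Python over GF(2^8), purely illustrative and not Datrium's actual code) shows the arithmetic for recovering two lost data symbols.

# Toy double-parity sketch over GF(2^8) (RAID-6 style P/Q), illustrating
# why a stripe with two parity symbols survives the loss of any two
# data symbols. This is not Datrium's actual erasure code.
GF_EXP, GF_LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    GF_EXP[i], GF_LOG[x] = x, i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D            # reduce by the field polynomial
for i in range(255, 512):
    GF_EXP[i] = GF_EXP[i - 255]

def gf_mul(a, b):
    return 0 if 0 in (a, b) else GF_EXP[GF_LOG[a] + GF_LOG[b]]

def gf_inv(a):
    return GF_EXP[255 - GF_LOG[a]]

def parity(data):
    # P is plain XOR; Q weights each symbol by g**i with generator g = 2.
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(GF_EXP[i], d)
    return p, q

def recover_two(data, x_pos, y_pos, p, q):
    # Rebuild the two missing data symbols at positions x_pos and y_pos.
    for i, d in enumerate(data):
        if i not in (x_pos, y_pos):
            p ^= d
            q ^= gf_mul(GF_EXP[i], d)
    gx, gy = GF_EXP[x_pos], GF_EXP[y_pos]
    dx = gf_mul(q ^ gf_mul(gy, p), gf_inv(gx ^ gy))
    return dx, p ^ dx

stripe = [17, 42, 99, 200, 7, 63]         # six data symbols (one byte each)
p, q = parity(stripe)
dx, dy = recover_two(stripe, 1, 4, p, q)  # pretend symbols 1 and 4 are lost
assert (dx, dy) == (stripe[1], stripe[4])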

 

– In-line Integrity Verification and Healing

All data stored in the DVX is encapsulated in data structures that identify which data it is and include a data integrity checksum. On every read from the system, these fields are double-checked to ensure that the data returned from the storage device is the requested data and that its integrity is intact. If the data is found to be incorrect, the system will, in-line, use the redundancy provided by erasure coding to rebuild the missing data and deliver it to the requesting VM. If the data integrity check fails when reading from the cache on the Compute Node, the hyperdriver will request the correct data from the Storage Pool on the Data Node cluster.
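
A rough sketch of such a verify-and-heal read path is shown below (Python, with hypothetical names, and a simple mirror copy standing in for the erasure-coded redundancy or the re-fetch from the Data Node pool).

import hashlib
from typing import Callable, Dict, Tuple

# Sketch of a verify-and-heal read path. Each stored block carries its own
# identity and a checksum; reads re-check both and fall back to redundancy
# when the primary copy fails verification. The "rebuild" callback stands in
# for recovery from the erasure-coded stripe.
Block = Tuple[str, str, bytes]   # (block_id, checksum, payload)

def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def read_block(block_id: str,
               primary: Dict[str, Block],
               rebuild: Callable[[str], bytes]) -> bytes:
    stored_id, stored_sum, payload = primary[block_id]
    if stored_id == block_id and checksum(payload) == stored_sum:
        return payload                      # identity and integrity intact
    healed = rebuild(block_id)              # reconstruct from redundancy
    primary[block_id] = (block_id, checksum(healed), healed)  # heal in place
    return healed

# Usage: corrupt the primary copy and watch the read heal itself.
good = b"database page 1234"
primary = {"blk-1234": ("blk-1234", checksum(good), b"flipped bits!")}
mirror = {"blk-1234": good}
assert read_block("blk-1234", primary, lambda bid: mirror[bid]) == good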

 

A traditional array can only protect the integrity of the data it receives; it will dutifully safeguard data that may already have been corrupted by intervening network or host-side problems. DVX, by contrast, protects your data before storing it locally or sending it over the network.

 

Rigorous Stress Testing

Datrium also uses a battery of automated tests to make sure every code change maintains the high standard for data integrity. Our engineers run some of these tests before code check-in, and the continuous integration and test infrastructure runs the full battery against the full body of code on a non-stop basis. The in-line integrity verification described above flags any data integrity issue that crops up during these runs so that we can find and fix the underlying bug.
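
For a flavor of what such an automated check looks like (illustrative only, not Datrium's actual test suite): write a batch of random blocks, remember their fingerprints, then read everything back and verify byte for byte.

import hashlib
import os

# Flavor of an automated integrity pass: write random blocks, remember
# their fingerprints, then read everything back and verify each one.
def integrity_pass(store_put, store_get, blocks=1000, block_size=4096):
    expected = {}
    for i in range(blocks):
        payload = os.urandom(block_size)
        store_put(f"blk-{i}", payload)
        expected[f"blk-{i}"] = hashlib.sha256(payload).hexdigest()
    for key, digest in expected.items():
        assert hashlib.sha256(store_get(key)).hexdigest() == digest, key

# Example against a trivial in-memory store standing in for a real target.
memory = {}
integrity_pass(memory.__setitem__, memory.__getitem__)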

In addition to the suite of automated tests, there is a set of what we call soak tests. Throughout development, and for an additional week after development completes but before the new code is released, the software is subjected to a set of soak tests designed to place the system under long-term stress resembling the different kinds of workloads found in a production environment.

These soak tests include the following:

 

1 – Oracle RAC

  • SLOB across 3 large hosts with four Oracle VMs, running a 70/30 read/write ratio at approximately 90K IOPS.

 

2 – Exchange Testing with JetStress

  • 4k mailboxes, 0.07 IOPS/mailbox, 32 MB mailbox size, 16 threads with replication.
  • 8k mailboxes, 0.13 IOPS/mailbox, 1024 MB mailbox size, 16 threads.

 

3 – SQL Server with HammerDB plus an In-House Load Generator

  • 250 users and 5,000 warehouses running the TPC-C schema with 0 user delay.

 

4 – VMware Horizon VDI

  • 1,000 VDI users with a rotating set of LoginVSI worker profiles (task worker, power worker, etc.) plus 5,000 additional users simulated with VDBench on 128 physical hosts.

 

5 – Citrix VDI

  • 200 VDI LoginVSI Knowledge Worker users with VSS and Datrium Replication.

 

6 – Veeam Backup

  • Backup of Windows and Linux VMs across 3 DVX systems, with concurrent replication of VSS-enabled SQL guests.

 

7 – General Stress

  • Several DVX systems with hundreds of guest VMs running workloads such as cache torture, boot and login storms, a VDI workload based on the SNIA evaluation of VDI workloads, FIO, VDBench, and in-house tools.

 

There can never be enough testing and validation when dealing with data integrity, but it is comforting to know that the Datrium engineering team treats it with the utmost seriousness and has designed automated stress tests that run continuously against the severest scenarios.

 

Thanks to Hugo Patterson (the man with 42 patents to his name) for contributing to this article.

 

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

 
