PostgreSQL Benchmark on Datrium – 4.3 Million TPS with 1 GB RAM – and some Frustration!

Cutting to the chase, I want to share some benchmark numbers I have been able to run in our Solutions lab and demonstrate how Datrium DVX compares to other published figures. While some may claim that benchmarks can be gamed (and they can), I tried to stick to a simple formula that can be easily repeated by anyone on any platform for comparable results. Furthermore, the more hardware you throw at the problem, the more performance you will get, but generally if you fix as many variables as possible, the results should be within a reasonable margin.

This blog post is about PostgreSQL performance on Datrium, but I do make direct comparisons with results published by other vendors. If you don’t like reading competitive comparisons, stop here. You have been warned!

I read through a few benchmarks, and the one I felt was most transparent about configuration and the commands executed was the EDB Postgres™ Advanced Server Performance on EMC XtremIO paper, so I followed a similar formula.

Host
Supermicro X10DRi
Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
Intel(R) NVMe DC P3608 2 x 1.6 TB
VMware ESXi, 6.0.0, 3620759
Datrium Hyperdriver Agent – 17.5 GB RAM

Datrium
Data Node – 1 x F12X2

If you don’t know how Datrium architecture works, I recommend watching this video from Clint Wyckoff. In a Datrium system, data nodes are used for storing durable data, while a copy of the data is stored on the host flash. All read IO is local to the host with intrinsic data locality, while write IO is stored on the host flash and also on the data node(s) using Erasure Coding (N+2 parity). Furthermore, all IO operations are compressed and deduplicated, by default – no check boxes.

VM
16 vCPU
1 GB RAM **
CentOS 3.10.0-514.10.2.el7.x86_64 (default install)

** PostgreSQL utilizes all allocated memory and uses shared_buffers to cache as much data as possible. Since we’re aiming to demonstrate storage performance, I limited VM memory to 1 GB to force PostgreSQL to hit the storage device as much as possible.

PostgreSQL 9.2
shared_buffers = 32MB          # (default)
fsync = on                     # turns forced synchronization on or off
synchronous_commit = on        # synchronization level
full_page_writes = on          # recover from partial page writes

** These PostgreSQL parameters can be changed to improve performance; however, doing so makes it possible to lose data on a sudden shutdown. Some vendors that perform their own data integrity checks recommend turning these settings off for better performance. I chose NOT to turn them off during this benchmark.
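As an optional sanity check (not part of the original write-up), the active values can be confirmed from psql before starting the run; a minimal sketch, connecting as the postgres user:

# psql -U postgres -c "SELECT name, setting FROM pg_settings WHERE name IN ('fsync','synchronous_commit','full_page_writes');"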

pgbench 9.2 was used to create the database and run the benchmark. pgbench results are reported as TPS, or transactions per second. In the XtremIO paper, they executed a read-only workload and an OLTP-B mixed (read/write) workload. I decided to skip the read-only benchmark because it says little about production environments. I used the same commands as the XtremIO white paper to produce the benchmark. The commands are as follows:

Create database instance using psql:
# CREATE DATABASE foobar OWNER postgres TABLESPACE foobar;
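Note that the tablespace referenced above must exist before the database can be created in it. A minimal sketch, assuming an illustrative directory /pgdata/foobar owned by the postgres user (the actual path was not documented):

# CREATE TABLESPACE foobar OWNER postgres LOCATION '/pgdata/foobar';   -- path is illustrative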

Run the pgbench database initialization. The following command loads a pgbench database using a scale factor of 7500, vacuums the resulting data, and then indexes it. It will create a database of approximately 113 GB in size:
# pgbench -i -s 7500 --index-tablespace=foobar --tablespace=foobar foobar
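For context, each pgbench scale-factor unit adds 100,000 rows to the pgbench_accounts table, so a scale factor of 7500 produces roughly 750 million account rows, which is consistent with the ~113 GB figure above.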

Run the pgbench read/write workload for 30 minutes using the following command:
# pgbench -s 7500 -c 100 -r -N -T 1800 foobar
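For reference: -c 100 runs 100 concurrent client connections, -N skips updates to the pgbench_tellers and pgbench_branches tables (producing the mixed read/write profile), -r reports per-statement latencies, and -T 1800 runs the test for 30 minutes.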

During the benchmark the VM was running at about 70% CPU utilization.

How does Datrium DVX compare?

I have not seen a single vendor benchmark that executed pgbench and reported the real end-to-end application latency. Every paper I have found reports array controller latencies, and there’s a big reason for that: there is an enormous difference depending on where latency is measured. Application latency, measured by the application itself, is what matters at the end of the day, so I’m not hiding it.

Latencies shown by XtremIO are not real application latencies, but rather the latency measured at the array controller. Moreover, I found a gotcha in their performance numbers.

           TPS       AVG Read Latency (ms)   AVG Write Latency (ms)
XtremIO    7,642     ~0.2                    ~0.4 (not real)
Datrium    10,673    ~0.2                    ~4.1

Granted, I chose to compare to XtremIO because it’s probably the lowest-latency storage solution for raw performance in a single-host deployment. Also, the white paper does not specify the data protection RAID level used, which makes me wonder whether they were actually using RAID 6 (disk striping with double parity). Finally, as with any SAN, the more hosts and VMs you add, the less performance each application gets.

The Gotcha!

The XtremIO paper states the following: “We ran the following pgbench command to generate a mixed workload with a 2:1 read/write ratio” (page 16). However, the results table (page 19) shows read IOPS 4.7x higher than write IOPS, which is roughly 80R:20W!

Where is the 2:1?
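To put numbers on it: if read IOPS is 4.7x write IOPS, the mix works out to 4.7 / (4.7 + 1) ≈ 82% reads, roughly 80R:20W, whereas a true 2:1 ratio would be about 67% reads to 33% writes.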

I want to believe that this is a genuine mistake in the report and that the authors were not trying to game the results. Either way, it’s only fair to say that the stated latency numbers are not real or valid.

Table from XtremIO paper

When I ran the same pgbench command on Datrium, the results were consistently 70R:30W. We can clearly see that XtremIO handled ~8,000 write IOPS at peak, while Datrium absorbed 16,523 write IOPS at peak, more than double the amount (see below).

This other paper for the VMAX 250F All Flash with 32 SSDs achieved 11,757 TPS in a RAID 5 (3+1) configuration with a VM that had 96 GB of memory. The paper does not clarify whether compression was enabled during the tests, but no serious enterprise SAN array promotes RAID 5 for data protection nowadays. Lower data resiliency and heavy memory caching both help in performance benchmarks. Moreover, latencies are again measured at the array controller.

Datrium always uses N+2 parity erasure coding to tolerate any two simultaneous drive or block failures, while still providing compression and deduplication.

How about HyperConverged?

I would love to compare Datrium to Converged or HyperConverged solutions, but vendors seem hesitant to report their real performance numbers, and when they do, they do not provide enough information for a decent comparison.

I did, however, find Nutanix numbers (here) provided by user jcytam that I used as general guidance. I replicated the pgbench benchmark as closely as I could, using the same VM configuration (8 vCPU and 24 GB RAM), the same pgbench command described in the post, and the same pgbench major release. Unfortunately, the Replication Factor (akin to RAID) was not specified.

In Nutanix, warm read IO comes from SSD/RAM and write IO goes to SSD. That said, this is not an official Nutanix benchmark and should not be treated as official numbers; many factors can influence a benchmark.

Further down in this post I measure Datrium DVX with Samsung PMA SSDs.


I could not find pgbench benchmarks for VMware vSAN, HyperFlex, or SimpliVity.

Benchmark Tuning

The XtremIO comparison above was done without any tuning on my side, and the XtremIO paper does not indicate whether any PostgreSQL, VMware, or Linux tuning was applied. So, I decided to do some simple tuning while keeping all the declared configuration the same; that means no change to VM memory or CPU.

Note that I have also run pgbench with lots of memory, many CPU cores, and a higher shared_buffers setting, and I got to many hundreds of thousands of TPS. However, that means nothing because it doesn’t demonstrate the storage performance capability.

Here is my PostgreSQL and Linux tuning:

vi ./data/postgresql.conf
fsync = off                # turns forced synchronization on or off
synchronous_commit = off   # synchronization level; on, off, or local
full_page_writes = off     # recover from partial page writes

ext4 mount options changed to (defaults,nodiratime,noatime,data=writeback,barrier=0,discard)
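As a minimal sketch of how these options could be applied via /etc/fstab, assuming an illustrative data volume /dev/sdb1 mounted at /pgdata (device and mount point are placeholders, not from the original setup):

# illustrative entry – device and mount point are placeholders
/dev/sdb1  /pgdata  ext4  defaults,nodiratime,noatime,data=writeback,barrier=0,discard  0 0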

I also implemented the changes recommended by PgTune according to my environment.
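The exact PgTune output is not reproduced here; purely as an illustration of the kind of parameters it adjusts for a small-memory OLTP profile (values below are indicative only, not the ones used in this test):

# illustrative values only – not the actual PgTune output used
effective_cache_size = 768MB
work_mem = 8MB
maintenance_work_mem = 64MB
checkpoint_segments = 32
checkpoint_completion_target = 0.9
default_statistics_target = 100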

Let’s look at the new results.

                    TPS       AVG Read Latency (ms)   AVG Write Latency (ms)
XtremIO             7,642     ~0.2                    ~0.4 (not real)
Datrium (default)   10,673    ~0.2                    ~4.1
Datrium (tuned)     14,504    ~0.3                    ~5.7

Just to reinforce the point: the latencies shown by XtremIO are not real application latencies, but rather latency measured at the array controller. In the image below, from the Datrium benchmark, disk latency at the vSphere VM level never goes above 4 ms (lower than the ~5.7 ms seen at the application level) and for the most part stays below 3 ms.

If I were to measure latency at the data node, it would be much lower still, but it would be meaningless. I suggest that vendors always provide real application latencies; it’s only fair to customers.

Scalability

Based on the workload generated by pgbench using a single VM and a single host, Datrium DVX tells me that I could add another 29 servers with the equivalent workload and results before needing to add another data node to the pool, for a total of 435,120 TPS (see image below).

Up to 10 data nodes can be part of a data pool, in which case we would have approximately 4.3 million TPS.
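The math is straightforward: 14,504 TPS per host × 30 hosts = 435,120 TPS per data node, and 435,120 TPS × 10 data nodes ≈ 4.35 million TPS.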

When it comes to scaling performance, no vendor can beat Datrium – look at this third-party IOmark-audited benchmark with 128 servers, or the review by Storage Review.

SATA SSD vs NVMe

As NVMe approaches price parity with SATA SSDs, we will start seeing greater adoption of the technology, and Datrium is well positioned to support NVMe – customers have been utilizing NVMe on hosts for over a year.

Since I ran the benchmark on a host with two NVMe SSDs, I decided to run the same workload on another host with two SATA SSDs to understand the difference, because I expected readers would ask about it.

This host has an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, and the SSDs are inexpensive ($0.5/GB) Samsung PMA drives. Looking at the results, it’s clear that the lower-grade SSDs don’t deliver the same performance as the NVMe devices, and we also notice a bump in write latency.

That said, the performance numbers are outstanding for pgbench running with 1 GB of RAM on cheap commodity flash with N+2 parity erasure coding, while still providing compression and deduplication. Repeat that 30 times until you reach the data node boundary, and then add more data nodes, up to ten.

                           TPS       AVG Read Latency (ms)   AVG Write Latency (ms)
Datrium NVMe (tuned)       14,504    ~0.3                    ~5.7
Datrium SATA SSD (tuned)   10,319    ~0.3                    ~8.6

I can’t stress enough that all the performance numbers presented in this blog post were generated on a single Supermicro server with two flash devices and a Datrium data node (F12X2) with 12 SSDs. The list price for a data node and a host license is under $150K, and it scales to 435,120 TPS based on this same workload.

Conclusion

We have to remember that if we throw memory, host flash, and CPU at the problem, and increase shared_buffers on PostgreSQL, it is possible to get hundreds of thousands of TPS from a single VM with the same pgbench workload. I could have added up to 16 NVMe devices to the host to distribute the load and get more parallelism, but it is too easy, and too costly, to solve performance problems by throwing hardware at them.

I didn’t run this benchmark to prove that PostgreSQL does an outstanding job of caching and managing data in memory, or that newer Intel processors are faster, but to show Datrium’s raw storage performance.

I also know that comparing benchmarks can lead to endless debates, so I invite vendors to run this very benchmark and share their numbers. I can provide the source VM, pre-configured; you just run the benchmark. I also invite vendors to demonstrate their numbers with erasure coding (or equivalent data protection), with deduplication and compression ENABLED, like Datrium.

To me, the exciting part is seeing how well storage systems handle benchmarks when all parts are moderately equal. Datrium is on par with any enterprise-grade Tier 1 storage solution, providing industrial-strength data resiliency, data reduction, and scalability. Datrium’s scalability, up to 18 million IOPS and 256 GB/s of random write throughput, is unmatched in the industry.

My Rant – Over the last few days, I’ve spent many hours poring over storage benchmarks from various vendors, and honestly, what’s up with benchmarks that do not use production-grade conditions to demonstrate performance numbers? Some papers appear to purposely hide details to prevent others from replicating their benchmark, while others game their numbers to look good. As an industry, we need to be better than that!

As a next step, I am planning to run the same benchmark with Red Hat Enterprise Virtualization. I will also run a scale-out pgbench benchmark with VMs on multiple servers, adding up to 2,000 snapshots per VM. pgbench-tools is also an option.

If you would like to see a specific benchmark on Datrium, let me know and we will do everything possible to run it; that is part of my team’s charter at Datrium, and we will not hide or misrepresent performance numbers.

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

The Single Reason Why HCI Vendors Do Not Like to Provide Best-in-Class Protection for Your Data

 

 

It makes HCI expensive and not cost competitive!

 

Now that we have that out of the way, let me explain. It is not that HCI solutions cannot provide higher levels of availability (most of them can), but vendors frequently steer customers toward a resiliency factor that makes the solution look cost-effective from a financial standpoint.

Data durability in the presence of failures is table stakes for any organization, and failure tolerance is achieved by data redundancy in some fashion. One way to achieve redundancy is with mirroring. You can mirror 2-way or 3-way.

At any scale, and for anything serious, you have to do 3-way replication or you are rolling the dice on data loss. The reason is not so much that you will lose two drives at the same time. A much more common scenario is this: one drive fails, and the system starts re-mirroring data from the remaining copy. All it takes is a sector read error (also known as a Latent Sector Error, or LSE), and you have lost data.

Over the past two decades, most of the industry has moved beyond 1FT (i.e., tolerance of a single drive failure). Examples of 1FT are RAID 5 in SAN arrays, RF2 from Nutanix, and FTT=1 from VMware vSAN. No serious enterprise SAN array promotes 1FT. However, most HCI vendors still recommend 1FT by default. These HCI vendors have regressed on resiliency in order to make their products look financially viable.

You can’t argue with the math: if you do not have 2-drive failure tolerance, the chance of data loss is an astonishing 0.5% per year. Gartner also recommends 3-way replication.

“Use traditional three-way mirroring with RF3 when best performance, best availability and best reprotection time of data are all equally important.” [1]

 

THE REAL REASON: THE COST!

Implementing HCI with 3-way replication incurs over 300% overhead due to the capacity required to protect and re-protect data. Furthermore, there is a higher minimum number of hosts required for cluster protection.

Here is how the capacity math works out. Take a 7-host cluster that must tolerate two host failures (N+2), leaving 5 hosts’ worth of capacity to count on, and that needs to provide ~118 TB usable. To determine the capacity required per host with 3-way mirroring: 118 TB [usable required across the cluster] / 5 [hosts counted] * 3 [three-way mirroring overhead] = ~71 TB per host.

At this point, you can start to see the relationship between raw capacity and the usable capacity required to ensure that data can always be fully re-protected.

A 3x overhead is typically bandied about when talking about FTT=2 or RF3; however, that assumes 100% utilization and no ability to re-protect data. In reality, the system requires 497 TB (71 TB * 7 hosts) of total capacity to provide 100 TB of usable capacity, an overhead of 4.97x.
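Working through the numbers from the example above:

118 TB usable / 5 counted hosts        = ~23.6 TB of unique data per host
23.6 TB x 3 (three-way mirror)         = ~71 TB of raw capacity per host
71 TB x 7 hosts in the cluster         = 497 TB of raw capacity overall
497 TB raw / 100 TB usable             = 4.97x effective overhead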

On the host count issue, there is a minimum number of hosts required to provide the additional availability and re-protection with a 3-way mirror, and it is higher than with a 2-way mirror. HCI vendors have different architectures, but the math works out similarly.

The additional cost of servers and storage capacity is often seen as a deal breaker by organizations considering HCI.

 

3-WAY MIRRORING

Tolerating two failures (2FT), whether by storing three copies of data or by using erasure-coding techniques that survive two failures, improves reliability significantly. Examples include RAID 6 for arrays, RF3 for Nutanix, and FTT=2 for VMware vSAN.

To lose data in a system that tolerates two drive failures, there needs to be either three simultaneous drive failures, two drive failures plus an LSE, or one drive failure plus LSEs in both redundant copies of the same chunk. All of these are very improbable events, and the chance of data loss is reduced by many orders of magnitude.

Datrium has done exhaustive studies with its own data, as well as with public studies and data provided by Google, Facebook, Nutanix, NetApp, and Backblaze. These studies were done by engineers with PhDs on the topic of disk failures; this is serious work. Furthermore, the math was recently extended to flash drives, and the results do not look any better: SSDs encounter Latent Sector Errors (also called uncorrectable errors) at an alarming rate.

I would not be writing all this if Datrium did not implement 2FT by default. Datrium uses double-fault-tolerant erasure coding, a Log-Structured Filesystem (LFS), and in-line integrity verification and healing. I also wrote an article discussing our data integrity methodology, but the most important point is that the system protects customer data with higher levels of resiliency at a highly competitive price point, even against HCI RF2 implementations.

 

CONCLUSION

This article is not meant to say that HCI is not good, nor is it picking on any specific vendor; rather, it points at any storage vendor that offers lower levels of data resiliency in exchange for a better solution cost.

HCI provides numerous benefits to enterprises, but with power comes responsibility, and IT teams are responsible for the data in their organizations.

I believe that anyone reading this article will agree that 3-way mirroring is better than 2-way mirroring, so as an industry we should all be advocating for better resiliency, even if the solution costs a little more. At this point we are probably entering the realm of risk management: “you pay your money, you take your chances.”

My recommendation to you:

If you are considering Datrium, Nutanix, VMware vSAN, or any Tier-1 storage solution, always ensure that you are comparing apples to apples and that your data is going to be protected with best-of-breed resiliency and data integrity.

Hey, I’m just one opinion here. Do you agree? Disagree? Let me know what you think.

 

[1] Key Differences Between Nutanix, VxRail and SimpliVity HCIS Appliances – Architecture and Storage I/O Published: 26 April 2017 ID: G00319293

 

For comments, please use the article version posted on LinkedIn.
https://www.linkedin.com/pulse/single-reason-why-hci-vendors-do-like-provide-your-data-leibovici/?

 

This article was first published by Andre Leibovici (@andreleibovici) at LinkedIn

 

Zero RTO Application Restores – Myth or Reality?

Most of my life I have been closer to applications and primary storage, and later, when managing technology teams, I always counted on well-trained data protection professionals to handle such a critical role in organizations. One thing was always clear to me: backup is equivalent to insurance. It is a tool that is there for when sh*t hits the fan (pardon the expression).

Other than Data Domain adding deduplication, and more recently backup vendors using an HCI approach to store backups in scale-out clusters, not much has changed since my customer days. Some vendors are now starting to tinker with the cloud as a tape replacement, but for the most part the cost is still prohibitive due to the high capacity requirements for incremental, daily, weekly, and monthly full backups.

I may not be a data protection authority, but I do understand RPO, RTO and the business impacts.

 

RPO (Recovery Point Objective) refers to the amount of data at risk. It is determined by the amount of time between data protection events and reflects the amount of data that could potentially be lost during disaster recovery.

RTO (Recovery Time Objective) is related to downtime. The metric refers to the amount of time it takes to recover from a data loss event and return to service; in other words, the amount of time the system’s data is unavailable or inaccessible, preventing normal service.

Descriptions by David Vellante (2008)

Zero RTO!

The team at Datrium is knee-deep in data protection technologies; in fact, the Data Domain founding team is part of Datrium’s founding team. They made data protection an integral part of Datrium DVX, alongside primary storage (by the way, the fastest and most scalable in history), and the most exciting part is Zero RTO!

Zero RTO means that when an application or VM restore is needed, the restore is instantaneous: a single click and ZERO waiting to restore the application to a consistent state. Please note that I am not talking about reverting a snapshot that is co-located on the same storage subsystem the application is running on (like VMware vSphere and HCI snapshots); this is a restore from data at rest on different media than your primary storage. Let’s have a look at how this works!

There are two important things to know about how Datrium DVX works:

 

  1. The DVX Hyperdriver Software on each compute node (hypervisor) manages all active data for the VMs within that host. It provides scalable IO performance, availability and data management capabilities.
  2. The DVX Data pool provides persistence and resiliency for a durable copy of all data in the cluster. In normal operation it is write-only, but it also provides streaming read performance for flash uploads as well as cluster coordination for simple management.

 


 


 

The DVX maintains two distinct “namespaces” (filesystem metadata about VMs and datastore-files).

The “datastore” contains the current, live version of all VMs and files. This is what the hypervisor “sees.” The hypervisor management tool (vCenter for VMware, or RHEV Manager for Red Hat) allows you to browse the contents of the live datastore at any time. The contents of these files always reflect the most recently written data.

The “snapstore,” on the other hand, contains point-in-time snapshots of the live datastore as it existed previously. Every time a protection group causes a snapshot to be taken, entries are made in the snapstore with the contents of every file in the live datastore at that instant in time.

Datrium uses a “redirect on write” (ROW) technique to store incoming data. New data is always written to new locations (versus copy-on-write techniques, which can introduce delays as changes are copied). Because only changes are stored in a snapshot, and because DVX only stores compressed and deduplicated data, snapshots consume relatively little capacity.

If, for example, a given VM’s protection policy causes snapshots to be taken every hour and retained for two days, the snapstore would contain up to 48 different versions of this VM’s files (24 snapshots per day × 2 days). These policies can be overlapped, delivering an RPO of 10 minutes and retaining up to 2,000 snapshots per VM.

Datrium DVX provides two uses for these point-in-time copies of VMs and the datastore: restoring/reverting VMs/files in the live datastore, and creating net new VMs/files (cloning).

 

  • Restoring/reverting replaces the state of live VMs or datastore-files with the state from the point in time when the snapshot was taken. It instantly “rolls back” VMs or datastore-files to a previous point in time.
  • Cloning is the process of taking a point-in-time snapshot and creating a net new VM that is immediately populated with the state contained in the snapshot. It is an instantaneous way to create one or more copies of existing VMs and applications.

In contrast, conventional backup tools that understand VMs (many don’t know VM-level objects) would need to restore/copy the data from an external repository (such as a NAS) via a proxy server. Because the primary storage and the backup tool operate disjointly and don’t share global deduplication awareness, the entire VM dataset has to be restored, causing the system to transfer large amounts of data and taking hours, sometimes days, to complete restores, depending on storage and link performance.

It is crucial to understand that Datrium DVX global deduplication means the system only has to upload differential data from the data pool to the compute nodes when a restore is triggered, and that happens both synchronously and asynchronously, allowing applications to restart instantly.

The ability to instantly “roll back” or clone a VM from the data pool is what delivers Zero RTO, and that is what matters at the end of the day when trying to restore systems. In many cases, even a thirty-minute window to restore systems can cost organizations millions of dollars in damages.

Please note that there are situations where you may need to work with both Datrium and the leading backup vendors, especially with heterogeneous on-premises storage or sub-5-minute remote-site RPO requirements.

Check out the paper on data protection and secondary storage (here).

 

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

 
