
Determining Scope for a Storage Performance Benchmark

Building an effective system performance benchmark requires a structured approach to systematically evaluate a system's capabilities under well-defined conditions. A comprehensive benchmarking process begins with one of the most critical steps: Determining the Scope. 

The first step in creating a meaningful storage performance benchmark is to determine precisely what aspects of the system you want to evaluate. This requires a detailed analysis of the system's specifications, including hardware components, software components, and configurations. A thorough understanding of the system will help you identify potential bottlenecks—areas where performance might be constrained under certain conditions. 

Different components and component types have different performance tradeoffs. Spinning hard drives tend to offer the best capacities and relatively good streaming throughput, but their high seek latency means certain workloads will have abysmal performance. SSDs, with their lack of moving parts, overcome much of the latency problem and can saturate their SATA or SAS bus while being only moderately more expensive. NVMe overcomes the limitations of the SATA/SAS bus by connecting directly to the CPU via PCI Express. It offers extremely high throughput and even lower latency, while the improved protocol allows much higher concurrency than SATA or SAS SSDs. 

If you are designing a benchmark of hard drives, a small-block random read workload can be expected to deliver only a few megabytes per second, because what you are actually measuring is seek latency and IOPS, not throughput. At the same time, if your requirement is to saturate a 10-gigabit network link with streaming video, that goal can be met for significantly less money with HDDs than with NVMe drives. 
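
As a rough illustration, those two contrasting workloads could be driven with a benchmarking tool such as fio (used throughout the rest of this article) from a small Python wrapper. This is only a sketch: the test file path, sizes, and runtimes are placeholders to be adjusted for your environment, and fio must already be installed on the system under test.

    import subprocess

    # Hypothetical test file on the pool under test; adjust to your environment.
    TEST_FILE = "/mnt/testpool/fio-testfile"

    def run_fio(name, rw, block_size):
        """Run a single fio job and return its raw JSON output."""
        cmd = [
            "fio",
            f"--name={name}",
            f"--filename={TEST_FILE}",
            f"--rw={rw}",          # randread vs. read (sequential)
            f"--bs={block_size}",  # 4k stresses seek latency/IOPS, 1M stresses streaming throughput
            "--size=10G",
            "--direct=1",          # request O_DIRECT where the platform honors it
            "--runtime=60",
            "--time_based",
            "--output-format=json",
        ]
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    # Small-block random reads: on HDDs, expect only a few megabytes per second.
    random_result = run_fio("randread-4k", "randread", "4k")

    # Large-block sequential reads: expect results dominated by streaming throughput.
    streaming_result = run_fio("seqread-1m", "read", "1M")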

Testing Conditions 

Once potential bottlenecks are identified, they must be carefully considered when planning the testing procedure. For instance, a storage subsystem with slower I/O might skew overall performance results, making it crucial to design tests that either account for or work around such limitations. Failing to address bottlenecks in advance can result in benchmarks that do not reflect the system's true performance or its behavior under real-world workloads. 

Aligning Benchmark Tests with Real-World Usage

The benchmark tests should be meticulously designed to replicate the actual usage scenarios of the system as closely as possible. This involves creating benchmarks that mirror the types of workloads and operations the system will handle in production. For example: 

  • Database Server. Tests should focus on query execution times, synchronous transaction rate, and concurrent user handling. 
  • Data Analytics System. Tests should emphasize processing large datasets and performing complex calculations. 
  • Media Editing Workloads. Streaming throughput and 99th percentile latency are the key metrics. 

Ensuring that tests align with real-world usage will provide more accurate insights and actionable results. 
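
One way those scenarios might translate into fio job parameters is sketched below. The values are illustrative starting points only, not prescriptions; block sizes, queue depths, and job counts should be tuned to match the actual application before you trust the numbers.

    # Illustrative starting points only; tune to match your actual application.
    WORKLOAD_PROFILES = {
        # Database server: small, mixed random I/O with synchronous writes and many clients.
        "database":  {"rw": "randrw", "bs": "8k", "iodepth": 16, "numjobs": 8, "fsync": 1},
        # Data analytics: large sequential reads over big datasets.
        "analytics": {"rw": "read",   "bs": "1M", "iodepth": 32, "numjobs": 4},
        # Media editing: streaming reads and writes where tail latency matters;
        # fio reports completion-latency percentiles, including the 99th.
        "media":     {"rw": "rw",     "bs": "1M", "iodepth": 8,  "numjobs": 2},
    }

    def profile_to_args(profile):
        """Convert a profile dictionary into fio command-line arguments."""
        return [f"--{key}={value}" for key, value in profile.items()]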

To look at how to investigate and resolve bottlenecks, see our article on Managing and Tracking Storage Performance.

Testing Under Worst-Case Scenarios

Additionally, the benchmark should be designed for the worst possible case: entirely random reads with no pattern, all caching disabled, and so on. This ensures that the result of the storage performance benchmark is the minimum performance that can be expected in production, under the worst possible circumstances. Knowing that the system will perform “at least this well” is much more beneficial to capacity planning than knowing that under ideal circumstances, the system can handle twice that amount of load. Production rarely, if ever, experiences ideal circumstances. 
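
In fio terms, a worst-case configuration might include options like the following. Whether the operating system and filesystem fully honor them (for example, O_DIRECT on ZFS) depends on platform and version, so treat this as a sketch rather than a guarantee.

    # Options that push fio toward the worst case.
    WORST_CASE_ARGS = [
        "--rw=randread",    # entirely random reads
        "--randrepeat=0",   # do not reuse the same pseudo-random sequence on every run
        "--norandommap",    # pick fresh random offsets instead of tracking prior coverage
        "--direct=1",       # request O_DIRECT to bypass the page cache
        "--invalidate=1",   # drop cached pages for the test file before the run starts
    ]

    # On ZFS, you may also want to limit ARC caching for the test dataset, for
    # example `zfs set primarycache=metadata pool/bench` (hypothetical dataset
    # name), so that reads are actually served by the underlying devices.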

To validate the benchmark, it’s vital to standardize the testing conditions, such as consistent environmental factors, software configurations, and system load. This consistency ensures that the benchmark is reproducible and that results are not skewed by external variables. 

Verifying Results  

As stated in our prior article Key Considerations for Benchmarking Network Storage Performance, when running a performance testing tool like fio, you should verify the results with multiple other tools. In the case of testing a ZFS storage system, iostat, gstat, nfsstat, and zpool iostat are all great options to confirm that the performance observed in the benchmark matches the performance of the underlying devices. If the benchmark reports a throughput of 12 gigabytes/second but the underlying storage media is almost entirely idle, a caching layer somewhere between the hardware and the benchmark is influencing your results. Conversely, if the benchmark is unexpectedly slow and the hardware reports doing significantly more work than the results suggest, read or write amplification may be skewing your results.
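
A simple way to make that comparison is to sample the device-level statistics for the duration of the run. The sketch below assumes a ZFS pool named tank and a pre-written fio job file called benchmark.fio; both names are placeholders.

    import subprocess

    POOL = "tank"  # hypothetical pool name; substitute your own

    # Start the benchmark in the background.
    bench = subprocess.Popen(
        ["fio", "benchmark.fio", "--output-format=json", "--output=results.json"]
    )

    # Sample per-vdev activity every 5 seconds while the benchmark runs, so the
    # device-level numbers can be compared against what fio reports afterwards.
    with open("zpool-iostat.log", "w") as log:
        monitor = subprocess.Popen(["zpool", "iostat", "-v", POOL, "5"], stdout=log)
        bench.wait()
        monitor.terminate()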

Ensuring Repeatability in a Storage Performance Benchmark

Controlled Environment 

In order to get more reproducible and reliable measurements, it is important to eliminate as much noise and as many external factors as possible: stop unnecessary processes, periodic tasks, and anything else that might disrupt the system. This can often be best accomplished by running the system in single-user mode, where most services are disabled. If the network is not required as part of the benchmark, disabling it can ensure that other traffic doesn’t skew the results of your test.  

Using A/B Testing for Performance Modifications

When measuring the impact of performance tuning on a system, it is crucial to conduct testing both before and after each individual change. This approach enables A/B testing for each modification, allowing you to precisely assess the impact of each change. While it may be tempting to implement multiple changes simultaneously and then evaluate performance, this practice obscures the specific contribution of each adjustment and may prevent you from identifying changes that have a negative impact. 

Although there are scenarios where multiple adjustments must be made together due to dependencies or system requirements, testing remains essential. A thorough performance evaluation should include testing each change in isolation as well as in combination. This method ensures you have clean, reliable data to understand the individual and collective effects of the modifications. 
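
A small sketch of how such a test plan might be enumerated, covering the baseline, each change in isolation, and every combination (the change names here are placeholders, not recommended tunables):

    from itertools import combinations

    # Hypothetical tunables under evaluation; the names are placeholders.
    CHANGES = ["recordsize_1M", "compression_lz4", "atime_off"]

    # Build a test plan: baseline first, then each change alone, then combinations.
    test_plan = [()]
    for r in range(1, len(CHANGES) + 1):
        test_plan.extend(combinations(CHANGES, r))

    for changes in test_plan:
        label = "+".join(changes) if changes else "baseline"
        print(f"benchmark configuration: {label}")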

Repeatability: Ensuring Consistency in Storage Performance Testing

It is equally important to run multiple iterations of the tests for each change. System variability, such as environmental effects (temperature), hidden background processes (TRIM and garbage collection on SSD and NVMe devices), or other unpredictable factors, can influence the results of a single test, leading to potentially misleading conclusions. For example, if you are benchmarking different ZFS record sizes (e.g., 4K, 8K, 16K, 32K) across various datasets, it is insufficient to conduct a single test per record size. Instead, running multiple tests (at least three) for each configuration allows you to detect outliers and assess the consistency of the results. 
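
A sketch of such a test matrix, assuming a ZFS dataset named tank/bench mounted at /tank/bench (both placeholders) and fio driven from Python, might look like this:

    import subprocess
    from pathlib import Path

    DATASET = "tank/bench"      # hypothetical dataset; substitute your own
    MOUNTPOINT = "/tank/bench"  # and its mountpoint
    RECORD_SIZES = ["4K", "8K", "16K", "32K"]
    ITERATIONS = 3              # at least three runs per configuration

    for recordsize in RECORD_SIZES:
        # Apply the configuration under test.
        subprocess.run(["zfs", "set", f"recordsize={recordsize}", DATASET], check=True)

        # One directory per configuration keeps results easy to locate later.
        outdir = Path("results") / f"recordsize-{recordsize}"
        outdir.mkdir(parents=True, exist_ok=True)

        for i in range(1, ITERATIONS + 1):
            subprocess.run([
                "fio", f"--name=rs{recordsize}-run{i}",
                f"--filename={MOUNTPOINT}/fio-testfile",
                "--rw=randrw", "--bs=8k", "--size=10G",
                "--runtime=120", "--time_based",
                "--output-format=json",
                f"--output={outdir}/run{i}.json",
            ], check=True)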

By identifying and accounting for outliers, you reduce the risk of basing decisions on inaccurate data. This iterative and methodical approach provides a more accurate understanding of how each change affects system performance, ultimately leading to better optimization and more reliable results. 

The reproducibility of results is essential for reliable performance analysis. Conducting multiple tests and employing A/B testing methodologies ensures that results can be consistently replicated in the future. An approach like this is particularly valuable when evaluating performance across multiple systems or comparing different hardware configurations. By systematically testing and documenting variations, you gain the ability to identify patterns, confirm findings, and draw meaningful conclusions about system performance under diverse conditions. 

Data Collection 

All data collected during testing should be preserved and ideally be backed up to a secondary location to ensure its safety and accessibility. Specifically, for FIO testing, we recommend utilizing the JSON output option and organizing all results into separate directories named according to the specific changes or scenarios being tested. This structure helps maintain clarity and ensures that test results are easy to locate and reference in the future. 

One of the key advantages of using FIO’s JSON output is its flexibility for future analysis. While the initial focus of your evaluation might be on metrics such as bandwidth and IOPS, retaining the JSON files allows you to revisit previous tests and analyze additional performance aspects, such as read and write latency. This ability to perform retrospective analysis is invaluable for gaining deeper insights and identifying trends or anomalies that may not have been apparent during the initial review. 

Moreover, the JSON output from FIO is inherently well-suited for processing with Python, thanks to its structured and machine-readable format. Python's robust JSON libraries make it straightforward to parse and manipulate the data, enabling detailed analyses and custom reporting. By adopting this approach, you not only ensure comprehensive data preservation but also enhance your ability to extract meaningful insights from past performance tests. 
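
As a minimal example, pulling bandwidth, IOPS, and 99th-percentile completion latency back out of a saved result might look like the following. The field names follow fio's JSON layout and can differ slightly between fio versions, and the result path shown is a placeholder.

    import json
    from pathlib import Path

    def summarize(path):
        """Print a few headline metrics from a saved fio JSON result."""
        data = json.loads(Path(path).read_text())
        for job in data["jobs"]:
            read = job["read"]
            p99_ns = read["clat_ns"]["percentile"]["99.000000"]
            print(
                f"{job['jobname']}: "
                f"{read['bw'] / 1024:.1f} MiB/s, "  # fio reports bw in KiB/s
                f"{read['iops']:.0f} IOPS, "
                f"p99 latency {p99_ns / 1e6:.2f} ms"
            )

    summarize("results/recordsize-8K/run1.json")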

Only by preserving the results of previous benchmarks can the performance of a system be compared over time. There are multiple factors that can change the performance of a system as it ages: mechanical degradation, fragmentation, and the different performance characteristics of a full filesystem. 

Analysis in Storage Performance Benchmarking

During ZFS audits conducted for our customers, we utilize Python to process all FIO result data, which we then export for further analysis and visualization. This approach provides flexibility and compatibility with a wide range of data analysis tools, enabling both custom Python scripts and third-party applications to interact seamlessly with the data. By structuring the data in this way, we can efficiently aggregate results, apply statistical methods to calculate averages and percentiles, and identify trends in performance metrics. 
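
A simplified version of that aggregation step, reading the per-configuration result directories produced earlier and flagging runs that sit far from the mean, could look like this. The two-standard-deviation threshold is an arbitrary example, not a rule.

    import json
    import statistics
    from pathlib import Path

    def read_iops(result_dir):
        """Gather total read IOPS from every saved fio JSON file for one configuration."""
        values = []
        for path in sorted(Path(result_dir).glob("*.json")):
            data = json.loads(path.read_text())
            values.append(sum(job["read"]["iops"] for job in data["jobs"]))
        return values

    for config_dir in sorted(Path("results").iterdir()):
        iops = read_iops(config_dir)
        if len(iops) < 2:
            continue
        mean = statistics.mean(iops)
        stdev = statistics.stdev(iops)
        # Runs far from the mean are candidate outliers worth correlating with
        # system logs before drawing any conclusions.
        outliers = [v for v in iops if abs(v - mean) > 2 * stdev]
        print(f"{config_dir.name}: mean {mean:.0f} IOPS, stdev {stdev:.0f}, outliers {outliers}")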

This structured analysis goes beyond basic insights. By identifying outliers within the data and correlating these anomalies with system logs, we can pinpoint potential issues that require further investigation or resolution. This comprehensive approach not only provides a clear picture of the system's performance but also helps in diagnosing underlying problems, ensuring a more reliable and optimized ZFS deployment for our customers. 

Interpreting the Data 

When conducting a performance analysis, it is essential to approach the process with an open mind, free from rigid, preconceived conclusions about the outcomes. While having a hypothesis or an initial expectation is natural and can provide direction, maintaining objectivity is critical to ensure an unbiased interpretation of the results. 

One of the performance investigation methodologies we use, called the “WHY Method”, involves seeking to understand the reason behind the results of a storage performance benchmark. It is basically a recursive loop of asking “but why” as if imitating a particularly annoying 5-year-old.  

“Why are we only getting 200 IOPS when there are multiple disks, each capable of that level of performance?” The answer may be: “Because the pool is configured in a RAID-Z topology.” Taking it to the next step, “Why does RAID-Z only provide the IOPS of one disk?” To which the answer is: “The record is split up across the members of the vdev, so they seek in lock step with each other. This can provide additional throughput but does NOT provide additional IOPS.” On from there, we might ask why use RAID-Z instead of mirrors, or why HDDs instead of faster media, and so on. 
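
As a rough back-of-the-envelope illustration of that answer, using the common rule of thumb that a pool's random IOPS scale with the number of vdevs rather than the number of disks (the figures below are examples, not measurements):

    DISK_IOPS = 200   # roughly what a single spinning disk can sustain for random I/O
    DISKS = 12        # hypothetical pool size

    raidz_vdevs = 1             # all twelve disks in a single RAID-Z vdev
    mirror_vdevs = DISKS // 2   # the same disks arranged as six two-way mirrors

    # Rule of thumb: pool random IOPS ~ per-disk IOPS x number of vdevs.
    print(f"RAID-Z pool:  ~{raidz_vdevs * DISK_IOPS} IOPS")    # ~200
    print(f"Mirror pool:  ~{mirror_vdevs * DISK_IOPS} IOPS")   # ~1200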

If the results deviate from your expectations, it is not a cause for dismissal but an opportunity for inquiry and deeper understanding. Ask probing questions:  

  • Why do the results differ?  
  • Are there hidden factors or overlooked variables influencing the performance? 
  • Could there be errors in the methodology or assumptions made?  

By embracing unexpected findings with curiosity rather than resistance, you allow for a more thorough and accurate analysis that can lead to valuable insights and improvements. 

Benchmarking with Confidence 

Effective storage benchmarking requires a structured approach—defining scope, designing realistic tests, and ensuring repeatability. By verifying results with multiple tools and eliminating external noise, you can trust that your storage performance benchmark reflects actual system performance. 

A methodical process helps uncover bottlenecks, validate optimizations, and guide informed decisions. Retaining structured data ensures long-term comparability, while an investigative mindset turns unexpected results into valuable insights for improving storage efficiency. 
