
Understanding Common Mistakes in ZFS Storage Benchmarks

Performance testing is a complex process. Care and attention must be given to every variable and idiosyncrasy, both in the system being tested and in the testing software itself, to ensure accurate and reliable results. Without that care, it is easy to end up with unreliable or misleading conclusions about a system's true performance capabilities.

ZFS has many capabilities and features you need to account for when testing so that the results reflect the target system's actual performance. Overlooking even one detail can produce wildly incorrect numbers. Something as simple as mismatched recordsize tuning, unexpected caching, misalignment, interference from prefetch, device temperature, or enabled compression can cause results that defy reality.

#1 Misaligned Recordsize

One key element of getting accurate results is alignment: you must ensure that reads and writes are aligned with the recordsize of the dataset under test. Depending on your production workload, the recordsize of a dataset can have massive implications for performance, so it is important to consider when testing. The storage benchmark should mirror the real-world use case of your system, and the ZFS configuration should be tuned to match it as well.

Testing improperly can produce incorrect benchmarks due to an effect called read/write amplification, which happens when the block size used by the benchmark does not align with the ZFS recordsize. For example, when writing 64 KiB blocks into a dataset with a 128 KiB recordsize, ZFS has to modify each 128 KiB record twice: once when writing the first 64 KiB, and again when writing the second. By contrast, writing a single 128 KiB block into a dataset with a 128 KiB recordsize completes as a single IO.
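Before running a benchmark, it is worth confirming what the dataset under test is actually configured for. A minimal sketch, assuming a hypothetical dataset named tank/bench:

    # Check the current recordsize of the dataset under test
    zfs get recordsize tank/bench

    # Tune it to match the benchmark's IO size before writing the test data
    zfs set recordsize=4K tank/bench

Note that changing recordsize only affects blocks written after the change; existing files keep their original record layout, so set it before creating the test files.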

Recordsize mistakes

One of the most common mistakes we see is someone using fio (the flexible IO tester) to test one of the main measurements of system performance, 4 KiB random read IOPS. With the default ZFS configuration of 128 KiB recordsize, reading 1000 randomly selected 4 KiB blocks causes ZFS to read and checksum 125 MiB of data. Although the original requests total less than 4 MiB, ZFS processes much more due to the larger recordsize. Even if the disk backing ZFS can do 250 MiB/sec, the storage benchmark will report a maximum result of 8 MiB/sec.

The same effect applies to writes, and there it is even worse. If you randomly write 4 KiB blocks across a large file stored as 128 KiB records, ZFS needs to write a full 128 KiB record for each 4 KiB change. Before doing so, it must first read the original 128 KiB record, modify the 4 KiB of changed data, and then write the updated 128 KiB record back.

As with our earlier example of writing 1000 random blocks, we now must: 1) read 125 MiB of data, and 2) write 125 MiB of data, all to change just 4 MiB of data. If ZFS were configured correctly for this workload, with a recordsize of 4 KiB, it would simply write the 4 MiB of data and never have to read at all.
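Here is a minimal fio sketch of the corrected setup; the dataset name and mountpoint are illustrative:

    # Match the dataset's recordsize to the 4 KiB random workload
    zfs set recordsize=4K tank/bench

    # 4 KiB random writes with a block size that matches the recordsize
    fio --name=randwrite4k --directory=/tank/bench --rw=randwrite \
        --bs=4k --size=4g --ioengine=psync --end_fsync=1

With --bs matching the recordsize, every write maps onto exactly one record, so no read-modify-write cycle inflates the IO.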

#2 Caching

Caching is a useful performance enhancement. However, when benchmarking ZFS storage, it is often more useful to measure the 'worst possible case', where the cache is entirely ineffective: the read workload is purely random and never reads the same block twice. This means that once the system is in production, you know it should perform at least as well as the storage benchmark, and when it does better thanks to caching, that is a win. If you test the performance of the cache instead, you may get unrealistic results that leave you disappointed with the real-world performance of the system.

Impact of ZFS Caching on Performance Testing

The configuration of the ZFS caching features can drastically change the results of performance testing. As mentioned above, when testing disk IO it is recommended that caching be reduced to metadata only or disabled entirely. The only alternative is to ensure the storage benchmark's working set is several times larger than the cache, but this can make the benchmarks extremely slow to run.
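On a per-dataset basis this is controlled with the primarycache property. A minimal sketch, again assuming a hypothetical dataset named tank/bench:

    # Cache only metadata in the ARC for the dataset under test
    zfs set primarycache=metadata tank/bench

    # Or bypass the ARC for this dataset entirely
    zfs set primarycache=none tank/bench

Remember to restore primarycache=all once testing is complete; the production workload almost certainly wants the cache back.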

A simple example of this would be a server with 256 GiB of RAM running random read tests over 128 GiB of data backed by six SAS drives. Each drive can read/write at around 200 MiB/s, so the theoretical maximum performance of all six drives being read at the same time is about 1.2 GiB/s.

If the performance testing cycle reads the data multiple times, or reads it randomly with caching enabled, each block will only be read from the disks once. Because the entire 128 GiB working set fits in the server's 256 GiB of RAM, ZFS serves every subsequent read from memory, bypassing the disks. Since RAM has much higher bandwidth, the test may report data being read at 24 GiB/s, obviously far beyond what the disks in the system can provide. In this example, a tester would conclude that their system can deliver 20X the performance they will actually see in real-world use.

#3 Compression and Zero Elimination

Another way that performance results can be skewed is by ZFS’s transparent compression feature, or its zero elimination feature for saving space on “sparse files”. In the case of compression, if you are reading or writing highly compressible data, the testing software will report the logical amount of data as the performance figure. Depending on the compressibility of the data and the compression algorithm used, this can give the appearance that the disks are performing many times faster than they really are.

Storage benchmark challenges

A similar situation exists with sparse files. ZFS is smart and will realize that large sections of the file are all zeroes, and it will instead store this data as a “hole” in the file. As a result, ZFS will not write the zeros to disk, but rather just flag that record in the file as being a hole. When read back, ZFS will generate the zeros without needing to read from the disk.

This means that if you are reading from a sparse file, ZFS can perform at speeds of 10s of GiB/sec because it is just feeding a stream of zeros; it doesn’t have to read from the disk at all.
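You can observe this behavior directly; a quick sketch with illustrative paths:

    # Create a 1 GiB sparse file: logically 1 GiB, but nothing is written to disk
    truncate -s 1G /tank/bench/sparse.dat

    # Compare the logical size to the blocks actually allocated
    ls -lh /tank/bench/sparse.dat    # reports 1.0G
    du -h  /tank/bench/sparse.dat    # reports next to nothing: the file is one big hole

Reading that file back will benchmark little more than ZFS's ability to synthesize zeros in memory.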

A good ZFS benchmarking tool will use incompressible data, again to test the “worst possible case”, so that any compression you get under real-world workloads is a gain. That way, you know the system will always do at least as well as the benchmark.
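fio fills its buffers with pseudo-random data by default, but it is worth being explicit about it. A minimal sketch, with an illustrative dataset path:

    # --refill_buffers regenerates random data for every IO, so neither
    # compression nor deduplication can shrink what actually reaches the disks
    fio --name=nocompress --directory=/tank/bench --rw=write \
        --bs=128k --size=4g --refill_buffers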

fio has settings to create partly compressible data, but they do not work the way you might expect. If you ask for blocks that are 30% compressible, fio simply fills the last 30% of each block with zeroes, rather than generating a block that is genuinely 30% compressible. The same is true of fio’s deduplication testing facilities.

#4 Write Behind

Another complication exists when testing asynchronous writes. If you perform a simple test using a tool such as dd, it will exit and report its results as soon as the last write() system call finishes, even though the data may not actually be safe on disk yet. This means the MiB/sec figure reported by dd is not accurate: it measured only the speed of writing the data into ZFS’s write-behind buffer, not how long it took to write that data to the underlying disks.
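A sketch of the difference, using GNU dd with illustrative paths; /dev/urandom is used so that compression and zero elimination, discussed above, cannot skew the result:

    # Reports as soon as the final write() returns: measures the write-behind buffer
    dd if=/dev/urandom of=/tank/bench/dd.dat bs=1M count=4096

    # conv=fsync flushes to stable storage before dd reports its throughput
    dd if=/dev/urandom of=/tank/bench/dd.dat bs=1M count=4096 conv=fsync

Be aware that reading /dev/urandom can itself become the bottleneck on fast pools; pre-generating a random input file avoids that.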

ZFS will attempt to buffer writes in order to write the data to disk as efficiently as possible. This includes eliminating overwrites that happen within the same transaction group and aggregating many small writes into fewer large ones. Depending on the testing software used and how it interprets what ZFS reports, this can be presented as overly fast or overly slow writes: the software reports the speed at which it hands data to ZFS, which does not reflect what is actually happening on the disks. What the storage benchmark needs to do, as it finishes, is make sure the write-behind buffer is flushed to disk, and then factor that additional time into the final throughput measurement.

Not configuring the performance testing tool

A performance testing tool can attempt to mitigate this by issuing many fsyncs. However, this will lead to worse performance than what ZFS would normally achieve because it forces excessive pauses to flush the write-behind buffer, negating the performance gains from aggregation.

With fio, the best way to achieve this is the --end_fsync=1 flag, which causes fio to issue one fsync() of the entire file at the end of the benchmark. An alternative is to have fio call fsync() every N writes; for example, --fsync=10000 issues one fsync() call every 10,000 writes.
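Put together, a sketch of an asynchronous write test that accounts for the flush, with an illustrative dataset path:

    # One fsync() at the end; the flush time is included in the reported throughput
    fio --name=asyncwrite --directory=/tank/bench --rw=write \
        --bs=128k --size=8g --end_fsync=1

    # Or flush periodically instead of all at once
    fio --name=periodic --directory=/tank/bench --rw=write \
        --bs=128k --size=8g --fsync=10000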

#5 Not Validating the Results

With all of the issues identified above, the best way to be sure the ZFS storage benchmark is measuring the true performance of the hardware is to cross-check the results. While the system is under test, validate the throughput being reported by using tools like iostat -d (on any UNIX-like OS) or gstat (on FreeBSD), and compare the figures reported by the storage benchmark to the activity actually observed on the hardware.
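A minimal sketch of watching the disks while the benchmark runs:

    # Linux: per-device throughput and utilization, refreshed every second
    iostat -dx 1

    # FreeBSD: live GEOM statistics, physical providers only
    gstat -p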

If the benchmark says the system is reading at 1.2 GiB/sec, but the sum of all the disks in the system is only reading 200 MiB/sec, then caching or compression is likely skewing your results. Conversely, if your 4 KiB random write performance appears terrible but iostat shows the disks constantly writing at their maximum throughput, write amplification is the cause, and you should check the configured recordsize of the dataset being tested.

Each of these potential pitfalls can contribute to incorrect performance metrics for your systems. It is always recommended to validate any results from performance testing software against the system itself using other utilities.

Interpreting ZFS Storage Benchmark Results: Understanding the "Why"

The most important thing to keep in mind when looking at benchmark results is to ask yourself “why”. The storage benchmark says the throughput is 1.2 GiB/sec, but why is it 1.2 GiB/sec? Does that figure match the expected throughput of the hardware after considering that the parity disks won’t contribute to throughput?
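As a quick sanity check of the arithmetic involved (the vdev layout here is an assumed example, not taken from any particular system):

    8-disk RAIDZ2 vdev, each disk sustaining ~200 MiB/s:
      data disks       = 8 total - 2 parity = 6
      expected ceiling ≈ 6 × 200 MiB/s = 1200 MiB/s ≈ 1.2 GiB/s

A result far above that ceiling suggests caching or compression is helping; a result far below it suggests amplification or misalignment is hurting.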

If the numbers don’t make sense, you need to dig deeper. Of course, you are not expected to master the science of benchmarking yourself. Klara offers a ZFS Performance Analysis Solution, where our team of experts will benchmark your system and dig into the why of each result, unlocking more performance, explaining why the system behaves the way it does, and showing how to change it to achieve the results you want.
