
As much as we think of resources like CPU and RAM as the main performance components of a system, storage is the bottleneck for most workloads. The team at Klara regularly undertakes ZFS performance optimization, performance audits, and bug investigations related to storage. Today, we’ll look back at a few examples of interesting performance issues we have investigated and resolved.

The System That Spent A Lot Of Time Doing Nothing, Repeatedly

Backups are critically important, but they tend to generate a lot of data, and sometimes that data doesn’t change all that much. For backing up a system that is not capable of tracking which data has changed, ZFS has a very powerful feature: NOPWrite.

ZFS stores a checksum (hash) of the data as part of its metadata for each block on disk. It uses this checksum to ensure the data is correct each time it’s read back. When a block is to be overwritten in-place, the NOPWrite feature uses this hash to determine whether the “new” data is actually different from the old data. If the data hasn’t changed, the write operation for that block returns success without actually overwriting the old copy of the block.
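
To make the idea concrete, here is a minimal Python sketch of the NOPWrite decision. This is a conceptual model, not the OpenZFS implementation; the maybe_write() helper and the block sizes are our own illustration.

    import hashlib

    def checksum(block: bytes) -> bytes:
        # Stand-in for the dataset's checksum algorithm (e.g. SHA256).
        return hashlib.sha256(block).digest()

    def maybe_write(stored_checksum: bytes, new_block: bytes) -> bool:
        # Hypothetical NOPWrite-style decision: return True if the block
        # actually needs to be written, False if the write can become a no-op.
        if checksum(new_block) == stored_checksum:
            return False   # identical data: skip the write, keep the old block
        return True        # data changed: write a new copy of the block

    # Overwriting a block with identical contents is skipped entirely.
    old_block = b"A" * 128 * 1024
    stored = checksum(old_block)
    print(maybe_write(stored, old_block))           # False -> write skipped
    print(maybe_write(stored, b"B" * 128 * 1024))   # True  -> write required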

One use case that benefits from the NOPWrite feature is overwriting a VM image with a newer copy. Without NOPWrite, overwriting a 1 TiB VM image requires 1 TiB of writes. With NOPWrite, the operation only writes a fraction of the data since most of it remains unchanged.

But in today’s “war story,” the customer was seeing an odd issue. The number of writes to disk remained low as expected, but reads and CPU usage were unusually high. Performance profiling revealed that the checksum algorithm, SHA256, was consuming the most CPU time.

Addressing Performance Bottlenecks with ZFS Optimization

When overwriting a file in place, NOPWrite calculates the hash of the new block and compares it to the stored hash of the existing block. If the old hash matches the new hash, then the write can be skipped. This customer was using SHA256 hashes with a CPU that did not support any SHA-NI acceleration. Klara recommended switching to the SHA512 checksum. (Counterintuitively, SHA512 hashes can be calculated around 50% faster than SHA256 hashes on 64-bit x86 CPUs.)
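
If you want to see the difference on your own hardware, a quick measurement along these lines (using Python’s hashlib, which relies on the system crypto library) will typically show SHA512 ahead on a CPU without SHA acceleration; on a CPU with SHA-NI, SHA256 may well win instead.

    import hashlib
    import timeit

    buf = bytes(128 * 1024)   # one 128 KiB record's worth of data

    for algo in ("sha256", "sha512"):
        hasher = getattr(hashlib, algo)
        # Hash the same 128 KiB buffer 2,000 times and report the total time.
        seconds = timeit.timeit(lambda: hasher(buf).digest(), number=2000)
        print(f"{algo}: {seconds:.3f} s for 2000 x 128 KiB")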

The change in hash algorithm provided a significant ZFS performance boost, but it did not explain the amount of CPU overhead, or the reads from disk.

Resolving Misalignment Issues for ZFS Efficiency

Further analysis revealed the problem: the overwrites were misaligned. The incoming random access writes were in 64 KiB blocks. However, on disk, the data was stored in 128 KiB blocks (the default value of the recordsize property). This required ZFS to perform a read/modify/write cycle on each record. 

When the first 64 KiB was written, ZFS would first need to read the old 128 KiB record and its checksum. Next, it would recalculate the checksum and compare the freshly-calculated checksum to the one read from disk, to verify the correctness of the data. Then it would construct a new 128 KiB block. This block consisted of the new 64 KiB of data and the unmodified 64 KiB it read in from the existing block on-disk. ZFS would then calculate the new checksum, determine it to be the same as the one found in the original block, and skip the write.

Optimizing ZFS to Reduce CPU Usage and Latency

A second 64 KiB write would then arrive from NFS, for the latter half of the same 128 KiB block on disk. This forced the entire read-modify-write cycle to happen twice on the same on-disk block. Adding insult to injury, this dataset was configured with the ZFS property primarycache=metadata, meaning ZFS would not cache the records it read. As a result, ZFS also needed to read the entire original 128 KiB from disk a second time! To summarize, for each 128 KiB record ZFS was (see the sketch after this list):

  • performing two 128 KiB reads from disk that were not required
  • calculating the checksum of the same data four times
  • writing nothing new to disk
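
The sketch below is a back-of-the-envelope cost model, not ZFS code; it simply counts the work described above for overwriting one 128 KiB record, first in misaligned 64 KiB pieces with primarycache=metadata, then with the write size matched to the recordsize.

    RECORDSIZE = 128 * 1024

    def overwrite_cost(write_size: int, cache_data: bool) -> dict:
        # Count the I/O implied by overwriting one full record in write_size chunks.
        reads = checksums = writes = 0
        for _ in range(RECORDSIZE // write_size):
            if write_size < RECORDSIZE:
                reads += 1        # read the old 128 KiB record (read-modify-write)
                checksums += 1    # verify the record that was just read
            checksums += 1        # checksum the rebuilt/new record
            # NOPWrite: the new checksum matches the old one, so nothing is written.
        if cache_data and reads > 1:
            reads = 1             # primarycache=all would keep the record in the ARC
        return {"reads": reads, "checksums": checksums, "writes": writes}

    print(overwrite_cost(64 * 1024, cache_data=False))    # {'reads': 2, 'checksums': 4, 'writes': 0}
    print(overwrite_cost(64 * 1024, cache_data=True))     # {'reads': 1, 'checksums': 4, 'writes': 0}
    print(overwrite_cost(128 * 1024, cache_data=False))   # {'reads': 0, 'checksums': 1, 'writes': 0}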

Setting primarycache=all would at least have cut the number of full block reads in half. However, the better fix, matching the NFS block size to the ZFS recordsize property, resolved the issue entirely. In this case, we reconfigured NFS to use 128 KiB blocks. With NFS’ block size and ZFS’ block size aligned, no read-modify-write cycle was necessary, and only a single metadata read (the checksum of the existing block) needed to be performed.

We saved two full block reads and eliminated three unnecessary checksum operations per 128 KiB block. This significantly reduced CPU usage, disk utilization, and latency, allowing ZFS to ingest incoming data as fast as the network could deliver it.

   | For a deeper dive into how ZFS caching mechanisms like the Adaptive Replacement Cache (ARC) improve system performance, check out our article on applying the ARC algorithm to ZFS.

(Storage) Ghosts of the Past

During another ZFS performance optimization investigation, we found a storage pool where one particular vdev was much slower than its peers. The system had 49 mirror sets of spinning disks, and a mirrored LOG vdev to accelerate synchronous writes. 

Our team observed that one particular mirror set would take much longer to make new allocations during heavy write operations. This delay was bringing the performance of the entire pool down. Upon initial investigation, we noticed that the fragmentation level on that vdev was much higher than the rest. So, we tuned ZFS to avoid allocation attempts to overly fragmented metaslabs. This quickly improved the write performance for this system.

Reducing Fragmentation for ZFS Performance Gains

Although performance improved, this raised an important question: why did that one vdev have much higher levels of fragmentation than the others? Probing the ZFS internal data structures, we found that this pair of disks had a much higher number of metaslabs than the other identical mirrors. 

Researching the history of the pool (and this particular vdev) revealed an all-too-common story. Many years earlier, the operator had made a mistake when attempting to add the mirror pair of SSDs to act as a LOG vdev. They’d forgotten to use the “log” keyword, accidentally adding the SSDs as a 49th data mirror instead. 

If this happened today, you’d use ZFS’s “device evacuation” feature to remove the mistakenly added vdev from the pool—but this mistake happened before that feature was developed. Without the ability to remove a vdev, the operator’s only option was to replace the SSDs with another pair of spinning disks in the existing vdev, then add another new vdev for the SLOG using the SSDs. This worked, and the systems appeared to function normally for years.

Leveraging Modern ZFS Features for Performance Optimization

To understand why this led to massively increased fragmentation several years down the road, we need to talk briefly about how ZFS stores data on-disk at a very low level. When a vdev is created, ZFS carves it into roughly 200 individual data structures called metaslabs. The size of the metaslabs is determined by the size of the vdev itself.

When the problem vdev in this story was created, the size of each metaslab was only 1 GiB due to the relatively small size of the SSDs used to create it. Replacing the small SSDs with large spinning disks expanded the size of the vdev, but it didn’t change the size of the metaslabs themselves, resulting in a vdev with many thousands of metaslabs instead of only 200.
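
As a rough, illustrative calculation (the sizes here are hypothetical, and real metaslab sizing is more involved than this), the effect of growing a vdev without changing its metaslab size looks like this:

    TARGET_METASLABS = 200      # ZFS aims for roughly this many metaslabs per vdev

    def metaslab_size(vdev_bytes: int) -> int:
        # Simplified model: pick a power-of-two metaslab size yielding ~200 metaslabs.
        size = 16 * 1024 ** 2   # illustrative 16 MiB floor
        while vdev_bytes // size > TARGET_METASLABS:
            size *= 2
        return size

    GiB = 1024 ** 3
    TiB = 1024 ** 4

    original_vdev = 200 * GiB   # hypothetical small SSD mirror
    ms = metaslab_size(original_vdev)
    print(ms // GiB)            # 1 -> 1 GiB metaslabs chosen at creation time

    grown_vdev = 10 * TiB       # same vdev after the SSDs were replaced with big HDDs
    print(grown_vdev // ms)     # 10240 metaslabs instead of ~200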

When ZFS needed to allocate space from this vdev, it would search through the first metaslab and find that it did not contain enough contiguous free space. It would then unload that metaslab, load another, and continue the process until it found enough free space. This is usually not that big of a problem. However, because there were thousands of tiny metaslabs instead of a couple hundred larger ones, the process could take a long time. This in turn delayed allocations, held up the entire transaction group, and resulted in very poor performance.
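
To see why that hurts, here is a toy model of the allocator’s walk. The lists are purely illustrative, and ZFS’s real metaslab selection is weighted rather than strictly in-order, but the shape of the cost is the same.

    def allocation_cost(metaslab_has_space: list) -> int:
        # metaslab_has_space[i] is True if metaslab i still has a large enough
        # contiguous free range.  Walk the metaslabs in order, "loading" each
        # one, until the allocation can be satisfied; return the number of loads.
        loads = 0
        for has_space in metaslab_has_space:
            loads += 1           # load this metaslab's space map from disk
            if has_space:
                return loads     # found enough contiguous free space
            # otherwise unload it and move on to the next metaslab
        return loads

    # Healthy vdev: ~200 metaslabs, most of them with plenty of space.
    healthy = [False] * 3 + [True] + [False] * 196
    # Problem vdev: 10,000 tiny metaslabs, almost all too fragmented to use.
    problem = [False] * 9990 + [True] * 10

    print(allocation_cost(healthy))   # 4 loads
    print(allocation_cost(problem))   # 9991 loads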

Enhancing Efficiency

Newer versions of ZFS (0.8 and later) have the “spacemap_histogram” feature. It gives ZFS a much less expensive way to see which metaslabs contain large contiguous ranges of free space, and how fragmented they are, allowing it to find a suitable metaslab much more quickly.
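
Conceptually, each metaslab keeps a histogram of its free-segment sizes, so the allocator can rule metaslabs in or out without loading their full space maps. Here is a hedged sketch of the idea; the bucket layout and numbers are illustrative, not the on-disk format.

    # Bucket i counts free segments whose size falls in [2**i, 2**(i+1)) bytes.
    metaslab_histograms = {
        "ms_0": {16: 5000, 17: 12},   # badly fragmented: mostly 64-256 KiB holes
        "ms_1": {20: 300, 24: 4},     # a few 16-32 MiB holes remain
        "ms_2": {27: 2},              # two 128-256 MiB holes
    }

    def pick_metaslab(alloc_size: int):
        # A segment in bucket b is at least 2**b bytes long, so any bucket at or
        # above the allocation's size class can satisfy the request.
        need_bucket = (alloc_size - 1).bit_length()
        for name, hist in metaslab_histograms.items():
            if any(bucket >= need_bucket and count > 0 for bucket, count in hist.items()):
                return name    # suitable metaslab found without loading its space map
        return None

    print(pick_metaslab(8 * 1024 * 1024))   # "ms_1" -- ms_0 is skipped outright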

For this customer, the ultimate resolution was to evacuate the data and recreate the pool, to resolve this and other configuration errors. After it was recreated, the pool offered much higher performance.

Dude, Where's My Data?

[Image: Data, the android officer from Star Trek: The Next Generation, in his yellow and black Starfleet uniform aboard the USS Enterprise.]

One of the things that makes ZFS such a scalable filesystem is the fact that everything is dynamically allocated. Metadata structures that most filesystems pre-allocate statically—such as inodes—are replaced in ZFS with dynamically allocated structures. This allows the filesystem to grow and scale to unimaginable sizes.

However, this dynamic allocation does have a cost. Although fragmentation is a serious potential problem in any filesystem, ZFS’s dynamic allocation of metadata adds complexity. The interconnected nature of its metadata can increase both the seek distance and the number of metadata reads needed for a given operation.

Enhancing Backup Systems With ZFS Performance Tuning

One of our current customers came to our team looking for advice on how to improve the performance of their large-scale backup system. This system is constantly ingesting backups from hundreds of different sources, often using tools which require heavy metadata operations.

Many backup tools (such as borg or rsync) attempt to optimize the incremental backup case by only copying the bits of files that have changed. These tools begin by examining the modified times, file sizes, and other parameters of the files on source and destination. If this metadata is different for the same file on source and target, the tool then inspects the contents of the file on each end in order to determine what blocks need to be updated on the target.
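
As a rough illustration of that first, metadata-only pass (the helper and paths here are hypothetical, and real tools like rsync or borg do considerably more), the per-file decision looks something like this:

    import os

    def needs_content_check(src_path: str, dst_path: str) -> bool:
        # First pass of an rsync-style incremental backup: metadata only.
        # Every call stats both sides -- on the backup target this becomes a
        # small random metadata read for every file, even when nothing changed.
        try:
            dst = os.stat(dst_path)
        except FileNotFoundError:
            return True                      # new file: must be copied
        src = os.stat(src_path)
        # Same size and modification time: assume unchanged, skip the content diff.
        return not (src.st_size == dst.st_size
                    and int(src.st_mtime) == int(dst.st_mtime))

    # Hypothetical usage over an entire backup set:
    # for rel in file_list:
    #     if needs_content_check(f"/data/{rel}", f"/backup/{rel}"):
    #         copy_changed_blocks(rel)       # only then read file contents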

This means that these non-filesystem-aware backup tools generate a lot of load even when nothing has changed at all. All those stat calls to check times and sizes are small-blocksize random reads, one of the most difficult storage workloads. These random reads are especially slow on HDD-backed pools!

Based on the Klara ZFS engineering team’s recommendations, the customer augmented their pool with a mirror of three high-endurance SSDs allocated as a ZFS “special” vdev. The special vdev type is dedicated to storing metadata, and optionally small blocks that are inefficient to store on wide RAIDz vdevs, such as those in this customer’s pool.

Long-Term ZFS Performance Optimization with Special vdevs

However, it’s important to understand that the special vdev is not a cache—it is the designated storage device for the pool’s metadata. If the special vdev is corrupted or fails, the entire pool will become unreadable.

Since losing the pool’s metadata means losing the pool itself, the special vdev should always be a mirror, and ideally at least three devices deep. Because the flash in the mirrored devices wears evenly, it is recommended to replace its members on a staggered cycle so that they don’t all wear out at once.

Some of the pools we have designed purposely use a mix of different SSD models/manufacturers/sizes to further reduce the likelihood that multiple devices in the mirror will fail at the same time.

Improved Pool Efficiency Through Metadata Management

ZFS will not migrate the metadata for existing files when a special vdev is added to the pool; as files are modified and updated, their metadata is written to the new dedicated devices. As our customer’s metadata migrated to the special vdev over time, the overall performance of the system improved. (We expected this, as the latency for a random read from an SSD is often measured in tens or hundreds of microseconds, as opposed to the 4-30 milliseconds of a spinning disk.)

The performance gain wasn’t limited to lower latency of stat calls, though. The performance of the pool’s spinning disks increased as well, since they no longer had to service seek-heavy random metadata reads. This resulted in a marked increase in the average throughput from the HDDs, as they could spend more time reading data, and less time seeking.

As data continued to churn through the system, performance continued to increase as the IOPS load from metadata reads shifted from the HDDs to the new special vdev.

Although a special vdev can store small data blocks as well as metadata blocks, we decided not to enable that feature after some initial testing. For this particular customer, enabling small data block storage filled the special vdevs without providing the same kind of performance gains: each backup operation required reading the metadata, while the data blocks themselves were read much less often.

Is a special vdev right for your workload? A Klara Storage Performance Audit will recommend the changes that will provide the most benefit to your workload and ensure you are getting all of the performance your system is capable of.

Conclusions

Our team has consistently helped customers across many industries investigate and improve the performance of their storage through ZFS performance optimization, resolving pathologies that were impacting their end users. If your storage isn’t performing as fast as you think it should, reach out to the Klara team for a Storage Performance Audit. We’ll get to the bottom of the issue and deliver the results your system is truly capable of.

