ZFS Optimization Success Stories

OpenZFS Series

Although we tend to think of CPU and RAM as the main performance components of a system, storage is usually the bottleneck for most workloads. The team at Klara regularly undertakes performance audits and bug investigations related to storage. Today, we’ll look back at a few examples of interesting performance issues we have investigated and resolved.


The system that spent a lot of time doing nothing, repeatedly

Backups are critically important, but they tend to generate a lot of data—and often, that data doesn’t change all that much between runs. When the system being backed up cannot track which of its data has changed, ZFS offers a very powerful feature: NOPWrite.

ZFS stores a checksum (hash) of the data as part of its metadata for each block on disk, which it uses to ensure the data is correct each time it’s read back. When a block is to be overwritten in-place, the NOPWrite feature uses this hash to determine whether the “new” data is actually different from the old data—and if the data hasn’t changed, the write operation for that block returns success without actually overwriting the old copy of the block.
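
In rough pseudocode, the NOPWrite decision looks something like the sketch below. This is a simplified Python illustration, not the actual OpenZFS C implementation—the real check also requires a cryptographically strong checksum (such as SHA256 or SHA512) and compatible dataset settings.

```python
import hashlib

def overwrite_block(new_data: bytes, stored_checksum: bytes) -> bool:
    """Decide whether an in-place overwrite really needs to hit the disk.

    Simplified NOPWrite logic: hash the incoming data and compare it to the
    checksum already recorded for the existing block. Returns True if a new
    block must be written, False if the write can be skipped entirely.
    """
    new_checksum = hashlib.sha256(new_data).digest()
    if new_checksum == stored_checksum:
        return False   # identical data: report success without writing anything
    return True        # data changed: allocate and write a new block
```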

Overwriting a VM image with a newer copy is one use case which can make excellent use of the NOPWrite feature. Without NOPWrite, overwriting a 1TiB VM image would require 1TiB of writes—but with NOPWrite, the same operation will likely only actually write out a small fraction of the total data, since most of it presumably has not changed between versions.

But in today’s “war story,” the customer was seeing an odd issue—the number of writes to disk was remaining low as expected, but reads and CPU usage were quite high. Performance profiling revealed that the biggest consumer of CPU time was the checksum algorithm, SHA256. 

When overwriting a file in place, NOPWrite calculates the hash of each new block and compares it to the stored hash of the existing block. If the old hash matches the new hash, the write can be skipped. This customer was using SHA256 hashes on a CPU which did not support any SHA-NI acceleration, so Klara recommended switching to the SHA512 checksum. (Counterintuitively, SHA512 hashes can be calculated around 50% faster than SHA256 hashes on 64-bit x86 CPUs, because SHA512 operates on 64-bit words.)
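
The switch itself is a one-line property change (zfs set checksum=sha512 on the affected dataset), and it only affects newly written blocks—existing blocks keep the checksum they were written with. You can get a rough feel for the speed difference with a quick, unscientific benchmark like the one below. Note that Python’s hashlib delegates to OpenSSL, so on a CPU that does have SHA extensions the SHA256 numbers will look much better—this sketch only illustrates the software-only case described above.

```python
import hashlib
import time

def hash_throughput(algorithm: str, size: int = 256 * 1024 * 1024) -> float:
    """Hash `size` bytes of zeroes and return the throughput in MiB/s."""
    data = bytes(size)
    start = time.perf_counter()
    hashlib.new(algorithm, data).digest()
    elapsed = time.perf_counter() - start
    return size / (1024 * 1024) / elapsed

for algo in ("sha256", "sha512"):
    print(f"{algo}: {hash_throughput(algo):7.0f} MiB/s")
```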

The change in hash algorithm provided a significant performance boost, but it didn’t explain why the overhead was so high in the first place—or why the system was reading so much from disk.

Further analysis revealed the problem: the overwrites were misaligned. The incoming random writes—arriving over NFS—were 64 KiB each, but on disk the data was stored in 128 KiB records (the default value of the recordsize property). This forced ZFS to perform a read/modify/write cycle on each record.

When the first 64 KiB was written, ZFS would first need to read the old 128 KiB record and its checksum. Next, it would recalculate the checksum and compare the freshly calculated checksum to the one read from disk, to verify the correctness of the data. Then it would construct a new 128 KiB block, consisting of the new 64 KiB of data and the unmodified 64 KiB read in from the existing block on disk. ZFS would then calculate the new checksum, determine it to be the same as the one found in the original block, and skip the write.

Then a second 64 KiB write would arrive from NFS, for the latter half of the same 128 KiB block on disk, forcing the entire read-modify-write cycle to happen twice on the same on-disk block. Adding insult to injury, this dataset was configured with the ZFS property primarycache=metadata, meaning ZFS would not cache the records it read—so ZFS also needed to read the entire original 128 KiB from disk a second time! To summarize:

  • Two unnecessary 128 KiB reads from disk
  • Four checksum calculations over the same data
  • Nothing new written to disk

Setting primarycache=all would at least cut the number of full block reads in half, but there was a much better fix—matching the NFS configuration to the ZFS recordsize property resolved the issue entirely. In this case, we reconfigured NFS to use 128 KiB blocks—and with NFS’ block size and ZFS’ block size aligned, no read-modify-write cycle was necessary, and only a single metadata read (the checksum of the existing block) needed to be performed.

This saved two full block reads and three unnecessary checksum operations per 128 KiB block, significantly reducing CPU usage, disk utilization, and observed latency, and allowing ZFS to ingest incoming data as fast as the network could deliver it.
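
As a rough back-of-the-envelope model (illustrative numbers only, not output from any ZFS tool), here is how the per-record cost compares across the three configurations discussed above:

```python
def cost_per_record(write_size: int, recordsize: int, record_cached: bool) -> dict:
    """Approximate per-record work for overwriting one full record.

    Illustrative model of the scenario above: partial (record-misaligned)
    overwrites force a read-modify-write cycle, whole-record overwrites do not.
    """
    if write_size >= recordsize:
        # Aligned overwrite: NOPWrite only needs the checksum of the new
        # record plus the checksum already stored in the block pointer.
        return {"full_block_reads": 0, "checksum_ops": 1}

    partial_writes = recordsize // write_size          # 2 in this story
    # Each partial overwrite of an uncached record re-reads the whole old
    # record; with the record cached, only the first overwrite reads it.
    full_block_reads = partial_writes if not record_cached else 1
    # Every read of the old record from disk is checksum-verified, and every
    # rebuilt record is checksummed for the NOPWrite comparison.
    checksum_ops = full_block_reads + partial_writes
    return {"full_block_reads": full_block_reads, "checksum_ops": checksum_ops}

KIB = 1024
print(cost_per_record(64 * KIB, 128 * KIB, record_cached=False))   # primarycache=metadata
print(cost_per_record(64 * KIB, 128 * KIB, record_cached=True))    # primarycache=all
print(cost_per_record(128 * KIB, 128 * KIB, record_cached=False))  # NFS aligned to recordsize
```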

(Storage) Ghosts of the Past

During another investigation, we found a storage pool where one particular vdev was much slower than its peers. The system had 49 mirror sets of spinning disks, and a mirrored LOG vdev to accelerate synchronous writes. 

Our team observed that one particular mirror set would take much longer to make new allocations during heavy write operations, and this was dragging down the performance of the entire pool. Upon initial investigation, we noticed that the fragmentation level on that vdev was much higher than on the rest, so we tuned ZFS to avoid attempting allocations from overly fragmented metaslabs. This quickly improved write performance on the system.

Although performance improved, this raised an important question: why did that one vdev have much higher levels of fragmentation than the others? Probing the ZFS internal data structures, we found that this pair of disks had a much higher number of metaslabs than the other identical mirrors. 

Researching the history of the pool (and this particular vdev) revealed an all-too-common story. Many years earlier, the operator had made a mistake when attempting to add the mirror pair of SSDs to act as a LOG vdev—they’d forgotten to use the “log” keyword, accidentally adding the SSDs as a 49th data mirror instead. 

If this happened today, you’d use ZFS’s device removal feature to evacuate and remove the mistakenly added vdev from the pool—but this mistake happened before that feature was developed. Without the ability to remove a vdev, the operator’s only option was to replace the SSDs with another pair of spinning disks within the existing vdev, then add a new LOG vdev using the SSDs. This worked, and the system appeared to function normally for years.

To understand why this led to massively increased fragmentation several years down the road, we need to talk briefly about how ZFS stores data on disk at a very low level. When a vdev is created, ZFS carves it into roughly 200 individual data structures called metaslabs—which means the size of each metaslab is determined by the size of the vdev at creation time.

When the problem vdev in this story was created, the size of each of the metaslabs was only 1 GiB due to the relatively small size of the SSDs used to create it. Replacing the small SSDs with large spinning disks expanded the size of the vdev—but it didn’t change the size of the metaslabs themselves, resulting in a vdev with many thousands of metaslabs instead of only 200.
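
A little arithmetic shows how lopsided this gets. The sizes below are hypothetical, and the target of roughly 200 metaslabs is simplified (the real code rounds metaslab sizes to powers of two and applies minimum and maximum limits), but the effect is the same:

```python
GIB = 1024 ** 3
TIB = 1024 ** 4
TARGET_METASLABS = 200   # ZFS aims for roughly this many per vdev at creation

def metaslab_size(vdev_size_at_creation: int) -> int:
    """Metaslab size is fixed when the vdev is first created (simplified)."""
    return vdev_size_at_creation // TARGET_METASLABS

# Hypothetical sizes: a ~200 GiB SSD mirror later grown to a 12 TiB HDD mirror.
ms = metaslab_size(200 * GIB)
print(f"metaslab size: {ms // GIB} GiB")                      # 1 GiB
print(f"metaslabs after growing the vdev: {12 * TIB // ms}")  # ~12,288 instead of ~200
```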

When ZFS needed to allocate space from this vdev, it would search through the first metaslab and find it did not contain enough contiguous free space—then unload that metaslab, load another, and continue the process until it found enough free space. This is usually not that big of a problem—but since there were thousands of tiny metaslabs instead of a couple hundred larger ones, the process could take a long time. This in turn delayed allocations, held up the entire transaction group, and resulted in very poor performance.

Newer versions of ZFS (0.8 and later) have the “spacemap_histogram” feature, which keeps a much cheaper record of which metaslabs contain large contiguous ranges of free space and how fragmented they are, allowing ZFS to find a suitable metaslab much more quickly.

For this customer, the ultimate resolution was to evacuate the data and recreate the pool, to resolve this and other configuration errors. After it was recreated, the pool offered much higher performance.

Dude, Where’s My Data?

One of the things that makes ZFS such a scalable filesystem is the fact that everything is dynamically allocated. Metadata structures that most filesystems pre-allocate statically—such as inodes—are replaced in ZFS with dynamically allocated structures that allow the filesystem to grow and scale to unimaginable sizes.

However, this dynamic allocation does have a cost. Although fragmentation is a serious potential problem in any filesystem, the dynamic allocation of metadata in ZFS—and the interconnected nature of its metadata, which we’ll get into more detail about later—can increase both the seek distance and the number of metadata reads necessary for a given operation. 

One of our current customers came to our team looking for advice on how to improve the performance of their large-scale backup system. This system constantly ingests backups from hundreds of different sources, often using tools which require heavy metadata operations.

Many backup tools (such as borg or rsync) attempt to optimize the incremental backup case by only copying the bits of files that have changed. These tools begin by examining the modified times, file sizes, and other parameters of the files on source and destination. If this metadata is different for the same file on source and target, the tool then inspects the contents of the file on each end in order to determine what blocks need to be updated on the target.
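In simplified form, the “has anything changed?” quick-check these tools perform looks roughly like this (an illustrative Python sketch, not the actual rsync or borg code):

```python
import os

def needs_transfer(src_path: str, dst_path: str) -> bool:
    """Quick-check in the style of rsync's default behaviour: compare file
    size and modification time before ever reading the file contents."""
    try:
        dst = os.stat(dst_path)          # metadata read on the backup target
    except FileNotFoundError:
        return True                      # new file: must be copied in full
    src = os.stat(src_path)              # metadata read on the source
    if src.st_size != dst.st_size:
        return True                      # size differs: contents must differ
    if int(src.st_mtime) != int(dst.st_mtime):
        return True                      # mtime differs: inspect the contents
    return False                         # metadata matches: skip this file
```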

This means that these non-filesystem-aware backup tools generate a lot of load even when nothing has changed at all—all those stat calls to check times and sizes are small-blocksize random reads, one of the most difficult storage workloads. These random reads are especially slow on HDD-backed pools!

Based on the Klara ZFS engineering team’s recommendations, the customer augmented their pool with a three-way mirror of high-endurance SSDs added as a ZFS “special” vdev. The special vdev type is dedicated to storing metadata—and, optionally, small blocks that are inefficient to store on wide RAIDz vdevs, such as those in this customer’s pool.

However, it’s important to understand that the special vdev is not a cache—it is the designated storage location for the pool’s metadata. If the special vdev is corrupted or fails, the entire pool becomes unreadable.

Since losing the pool’s metadata means losing the pool itself, the special vdev should always be a mirror, and ideally at least 3 disks deep. Because every member of the mirror receives the same writes, their flash will wear at the same rate—so it is wise to replace members on a staggered cycle so they don’t all wear out at once.

Some of the pools we have designed purposely use a mix of different SSD models/manufacturers/sizes to further reduce the likelihood that multiple devices in the mirror will fail at the same time.

ZFS will not migrate the metadata for existing files when a special vdev is added to the pool; as files are modified and updated, their metadata is written to the new dedicated devices. As our customer’s metadata migrated to the special vdev over time, the overall performance of the system improved. (This was expected, as the latency of a random read from an SSD is typically measured in tens or hundreds of microseconds, versus 4–30 milliseconds for a spinning disk.)
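
To put those latency figures in perspective, here is a crude single-threaded estimate. The file count and per-read latencies below are assumptions for illustration only; real scans overlap I/O and benefit from caching, but the orders of magnitude are the point:

```python
# Rough, single-queue-depth estimate of a metadata-only backup scan.
FILES = 1_000_000          # hypothetical number of files in the backup set
HDD_READ_S = 8e-3          # ~8 ms per random read on a spinning disk
SSD_READ_S = 100e-6        # ~100 us per random read on a typical SSD

print(f"HDD metadata scan: {FILES * HDD_READ_S / 3600:.1f} hours")   # ~2.2 hours
print(f"SSD metadata scan: {FILES * SSD_READ_S / 60:.1f} minutes")   # ~1.7 minutes
```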

The performance gain wasn’t limited to lower latency of stat calls, though—the performance of the pool’s spinning disks increased as well, since they no longer had to service seek-heavy random metadata reads. This resulted in a marked increase in the average throughput from the HDDs, as they could spend more time reading data, and less time seeking.

As data continued to churn through the system, performance continued to improve as the IOPS load from metadata reads shifted from the HDDs to the new special vdev.

Although a special vdev can store small data blocks as well as metadata blocks, we decided not to enable that feature after some initial testing. For this particular customer, enabling small data block storage filled the special vdev without providing the same kind of performance gains—the metadata had to be read for every backup operation, while the data blocks themselves were read much less frequently.

Is a special vdev right for your workload? A Klara Storage Performance Audit will recommend the changes that will provide the most benefit to your workload and ensure you are getting all of the performance your system is capable of.

Conclusions

Our team has consistently helped customers across many industries investigate and improve the performance of their storage and resolve pathologies that were impacting their end users. If your storage isn’t as fast as you feel it should be, reach out to the Klara team for a Storage Performance Audit—we’ll get to the bottom of the issue and get you the results your system is capable of.

Meet the Author: Allan Jude

ZFS Engineering Manager at Klara Inc., Allan has been a part of the FreeBSD community since 1999 (that’s more than 20 years ago!), and an active participant in the ZFS community since 2013. In his free time he enjoys baking, and obviously a good episode of Star Trek: DS9.
