Klara

Extending ZFS Performance Without Hardware Upgrades 

OpenZFS is famously configurable, with tunables and parameters that can be modified to fit nearly any conceivable workload. But most users, we find, bounce off all that complexity and use OpenZFS in its default, untuned configuration.

There’s absolutely nothing wrong with that, of course–there’s no point in tuning systems for extra performance if they’re already offering all the performance you need. However, a system that met your needs several years ago may start to struggle under an ever-increasing workload as you discover more things you can do with it. 

In this article, we’ll explore approaches to ZFS performance tuning that don’t break the bank or require additional hardware. 

Recordsize 

Recordsize is, frankly, the OpenZFS tunable to master–effective ZFS performance tuning starts with understanding it. Recordsize sets the size of OpenZFS’s fundamental storage unit, the block. This setting is per dataset, meaning you can tune a single pool for multiple workloads by putting different workloads in differently tuned datasets! 

By default, each ZFS block is 128KiB in size–a middle-of-the-road setting that is acceptable for most workloads but not optimal for any specific one.  

For most use cases, you want recordsize to be much larger–1MiB if you’re running on mirrors (or a single drive) or 4MiB if you’re running wider RAIDz2 VDEVs. Setting recordsize to large values minimizes the potential for fragmentation and conserves IOPS on the underlying devices. 

The exception here is any dataset that contains databases or VM images–large files which see a lot of small random I/O within them. If you set a large recordsize on these types of workloads, you significantly increase read amplification and write latency. 

For databases, you typically want recordsize=16K–the default page size for MySQL’s InnoDB engine. (PostgreSQL uses smaller 8KiB pages, so recordsize=8K is a closer match there.) VM images usually perform best with a recordsize of 64K–unless they’re mostly running databases, in which case the dataset for those disks should have the same 16K recordsize you’d give to a database running directly on the host itself. 
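As a quick sketch–the pool and dataset names here are hypothetical–tuning per-workload datasets looks like this:

```shell
# Create per-workload datasets with appropriate recordsizes
# (pool name "tank" is an example; substitute your own).
zfs create -o recordsize=16K tank/mysql      # database: match the page size
zfs create -o recordsize=64K tank/vm-images  # general-purpose VM disks
zfs create -o recordsize=1M  tank/media      # bulk file storage on mirrors

# Confirm the properties took effect.
zfs get recordsize tank/mysql tank/vm-images tank/media
```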

Home directories can also be problematic–although the Documents, Downloads, and Pictures subdirectories are typically used for bulk data storage and benefit from larger recordsizes, various apps tend to put SQLite, Berkeley DB (Sleepycat), and other flat-file databases directly inside users’ home directories. If you need to tune recordsize for home directories, consider making separate datasets: a smaller recordsize for the home folder itself, and a larger recordsize for the Downloads, Pictures, and Documents datasets beneath it. 

Warning: although recordsize is dynamically tunable per dataset, changing the value will not rewrite existing files in that dataset–and neither will simply opening those files for modification later. If you need existing files stored at the new recordsize, you’ll have to rewrite them–for example, by copying each file and replacing the original–after the recordsize value for the dataset has been changed! 
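A rough sketch of that rewrite process–dataset and path names are examples, and you should test on non-critical data first. Note that any existing snapshots will continue to pin the old blocks:

```shell
# Raise recordsize on the dataset; this affects newly written blocks only.
zfs set recordsize=1M tank/media

# Rewrite each existing file so it is stored at the new recordsize.
# Copy-and-replace briefly doubles each file's space usage.
cd /tank/media
for f in *; do
    cp -a "$f" "$f.tmp" && mv "$f.tmp" "$f"
done
```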

Compression 

Depending on the version of OpenZFS on your system, compression may be off by default, and we see a lot of people leaving it that way–which is a mistake. 

You might think that turning compression on would save disk space at the cost of performance–but it typically improves performance while saving disk space. This is because your CPU can compress or decompress data faster than your drive can read or write it–even if you’ve got solid-state drives and a lower-end CPU. 

Like recordsize, inline compression is a per-dataset tunable–so you can set different compression levels and algorithms on different workloads. Let’s talk about those compression levels, because the choice absolutely matters: 

compression=lz4: this is currently the algorithm you get if you simply set “compression=on” as well as if you select it directly. LZ4 is an extremely fast streaming algorithm with moderately low compression ratios but extremely low CPU impact. We typically see a performance penalty of roughly 5% when writing completely incompressible data to an LZ4-compressed dataset, and performance bonuses of 25-30% for most real-world tasks (e.g., copying a Windows Server installation). 

compression=gzip: this was an excellent algorithm to use for datasets you are certain will only have plain text or other extremely compressible data written to them. Plain text (such as server logfiles) frequently compresses down to 20% or less of its original size. Be careful, though–gzip is a much more CPU-hungry algorithm than LZ4, and enabling gzip compression for workloads not ideally suited for it can result in performance problems to go along with your space savings. 

compression=zstd: Zstd is a newer compression algorithm that aims to offer the speed of LZ4 with much of the compression of gzip. This algorithm is tunable, allowing you to set compression=zstd-1 through compression=zstd-19. Recommended for those willing to actually test their workload, but not recommended for folks who want to just slap something in place–if you choose the wrong zstd compression level, your performance will suffer compared to using LZ4. Any place where you used gzip in the past, zstd will offer similar compression but with a lower performance cost. 

compression=zle: Zero Length Encoding is a special algorithm that only compresses runs of zeroes. This is handy for datasets containing mostly incompressible data–because even when the data itself is incompressible, the slack space (unused bytes in the final block of a file) is quite compressible indeed. ZFS doesn’t waste space on small files–a file smaller than one record is stored in a single block sized to fit it. Once a file grows beyond one record, however, every record is the full recordsize–so with recordsize=1MiB, ZLE compression can save as much as 1,016KiB of slack per file! 

compression=lzjb: LZJB was the original default compression algorithm used by OpenZFS. It’s still available for legacy purposes, but we do not recommend its use for newly created datasets–LZ4 outperforms it in every way. 

For the vast majority of potential use-cases, we recommend setting LZ4 compression without thinking twice about it. It’s better to set LZ4 as a default–even if it occasionally chews on incompressible data–than to leave compression off entirely because tuning it seems too daunting! 

If you feel the need to dive a little further down into the weeds–but only a little–then set compression=zle for datasets with incompressible data (pictures, music, movies) and compression=lz4 for everything else. 

Warning: just like recordsize, compression may be adjusted on-the-fly per dataset, but does not affect data that’s already been written. If you need to change the compression method of data that’s already been written to disk, you’ll need to copy that data and delete the originals after changing the compression setting. 
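A minimal sketch of per-dataset compression tuning, with hypothetical pool and dataset names:

```shell
# A safe default for the whole pool; child datasets inherit it.
zfs set compression=lz4 tank

# Mostly-incompressible media: only squeeze out slack space.
zfs set compression=zle tank/media

# Highly compressible logs: trade some CPU for better ratios.
zfs set compression=zstd-3 tank/logs

# See how well each dataset is actually compressing.
zfs get -r compressratio tank
```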

Topology 

We see an awful lot of OpenZFS users who didn’t really understand the performance implications of pool topology when they first got started, and just dumped all the drives they had into a single wide RAIDz1 or RAIDz2 VDEV. 

If you’re not picky about performance, this generally works okay–and when your system isn’t terribly busy, you can even see some big satisfying numbers when doing a single large file read or write, making you feel good about all those drives in your system. 

Unfortunately, the performance of a wide RAIDz VDEV–or any other simple striped array, such as conventional RAID6–falls to pieces under IOPS-heavy workloads. That single ten-wide Z2 that worked like gangbusters for a simple NAS hosting a bunch of media files might turn into a real liability several years later, when your workloads get more complex (especially databases, bittorrent, VMs, or other random-access-heavy operations). 

There are too many perfect surfaces, spherical vacuums, and frictionless chickens to count here–we’re offering a rule of thumb, not a perfect calculator–but for the most part, ZFS performance scales with VDEV count, not with raw disk count! 

So, let’s say you naively created a pool consisting of a single 10-wide RAIDz2 VDEV, and now that you’ve got more users and heavier workloads, you’re frequently seeing performance no better than a single disk would offer. How can you improve that? 

First of all, you’re going to need a full backup–you can’t reshape a pool on-the-fly; you’ll need to destroy it, rebuild it, and restore your data. But that’s okay, you should have had a full backup anyway! 
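One way to take that backup is ZFS replication itself–this sketch assumes a second pool named “backup” with enough capacity:

```shell
# Snapshot the entire pool recursively.
zfs snapshot -r tank@migrate

# Replicate everything, properties included, to the backup pool.
# (This could also be piped through ssh to a remote machine.)
zfs send -R tank@migrate | zfs receive -F backup/tank

# Verify the replica before destroying anything!
zfs list -r backup/tank
```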

Now that you’ve verified your backups are current and valid, tearing down that 10-wide Z2 and replacing it with a pool of 5 two-wide mirrors will increase your performance significantly. Again, this is a rule of thumb, not a perfect calculator–but you can expect 3-4x the performance of that single wide Z2 on most workloads–even simple file sharing workloads, let alone databases.
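Rebuilding those ten drives as mirrors might look like this–the device names are placeholders, and on real systems you should use stable /dev/disk/by-id paths:

```shell
# Destroy the old pool (only after verifying your backup!).
zpool destroy tank

# Recreate it as five 2-wide mirror VDEVs.
zpool create tank \
    mirror sda sdb \
    mirror sdc sdd \
    mirror sde sdf \
    mirror sdg sdh \
    mirror sdi sdj
```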

With all that said, we should give you a warning: a pure NAS in a smaller environment will typically bottleneck on the 1Gbps LAN throughput long before hitting any local storage constraints.  

This sort of tuning is usually more applicable to application servers (and desktops, and VM hosts), which move far more data to and from local storage than they do over the network. 

Prefetch 

One of OpenZFS’s lesser-known performance tunables, prefetch–known in some other storage systems as “readahead”–allows the filesystem to pre-emptively fetch additional blocks belonging to a file which has been opened for reading. 

By default, file-level prefetch is enabled, and will grab more and more as-yet-unrequested blocks from an open file as the ones it already fetched are read from cache. Say you open a file and read the first block–prefetch goes ahead and grabs another block. When you read that block, prefetch grabs several more blocks. When you read those, it gets even more aggressive, eventually fetching as many as 256 extra blocks at a time. 

Prefetch is great for sequential reads of very large files–it minimizes IOPS consumed for the same volume of data read, and minimizes experienced latency as well, in the applications whose storage needs are successfully predicted ahead-of-time by the prefetch algorithm. 

Okay, if it’s so great, and it’s already on by default, why are we talking about it? Because it’s not great for heavily random access workloads–especially databases. The odds that you want the next handful of 16KiB blocks out of a MySQL table that you just opened for read are not very good–and fetching data that you didn’t need and won’t use consumes IOPS and increases latency unnecessarily, rather than the other way around. 

In older versions of OpenZFS, prefetch was unfortunately a system-wide tunable–it could not be enabled and disabled selectively per dataset. As of OpenZFS 2.3, it is a per-dataset property, allowing you to disable prefetch for databases and VMs without impacting other workloads. Think long and hard before disabling prefetch system-wide with zfs_prefetch_disable=1… but if your system is a dedicated host running big databases, this could be the performance enhancer that you’re looking for. 
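A sketch of both approaches–the dataset name is hypothetical, and you should consult zfsprops(7) on your release to confirm the per-dataset property is available:

```shell
# OpenZFS 2.3+: disable prefetch for just the database dataset.
zfs set prefetch=none tank/mysql

# Older releases on Linux: the system-wide switch only.
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
```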

Conclusion 

We’ve only scratched the surface of OpenZFS’s total tuning and configuration capabilities here today–if you’ve got an enterprise workload that needs careful analysis and the maximum possible efficient configuration, we’d recommend reaching out for a ZFS Performance Analysis to get help uncovering your specific performance needs, and the configuration to unlock that performance. 

With that said, what we covered today–topology, recordsize, compression, and prefetch–is enough for a thoughtful storage admin to greatly improve their OpenZFS performance from its default configuration.

With the right approach to ZFS performance tuning, you can significantly improve performance without investing in new hardware. Just remember–all workloads are different, including yours. In order to improve performance for your workload, you must first understand your workload so that you can tune for it. 
