Tuning recordsize in OpenZFS

March 30, 2022

Matching recordsize to your application can provide a large performance boost over the conservative default settings. Learn how to match your dataset to your workload.

For many people, tuning OpenZFS isn’t really necessary—performance on the conservative default settings is more than ample to get what they need done. But for those who need to eke more performance out of their file system, OpenZFS offers an unusual array of tunables, such as inline compression.

While basic advice for compression is simple—enable it!—recordsize is a more challenging topic. Before we can begin discussing how to tune it, let’s run through a quick refresher on what recordsize actually means.

But first, let’s talk about sectors

Sector size in OpenZFS must be a power of two, and is set with the ashift property. For example, ashift=9 corresponds to sectors 2^9 bytes wide, and ashift=12 corresponds to sectors 2^12 bytes wide—512 bytes and 4,096 bytes, respectively.
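The relationship between ashift and sector size is nothing more than a power of two, and a few lines of Python make it concrete. This is only an illustrative sketch: the helper name sector_size is ours and is not part of any ZFS tooling.

```python
def sector_size(ashift: int) -> int:
    """Return the sector size in bytes implied by an ashift value."""
    if ashift < 0:
        raise ValueError("ashift must be non-negative")
    # ashift is the base-2 exponent of the sector size.
    return 1 << ashift


for ashift in (9, 12, 13):
    print(f"ashift={ashift}: {sector_size(ashift)} bytes")
```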

Ashift is a vdev-wide setting, must be set when creating each vdev, and is immutable once set—it cannot be manipulated afterward, or on a per-dataset basis. If a vdev makes it into a pool with a bad ashift value—for example, a Samsung SSD which lies to ZFS and claims to use 512-byte sectors, and an admin who doesn’t manually specify -o ashift=13 when creating a vdev with it—the only recourse is destroying the pool and rebuilding it from scratch.

Records, blocks, and volblocks

Now that we understand sectors, let’s talk about the next OpenZFS unit—the block. A block is a collection of one or more sectors, and is treated by OpenZFS as an immutable, indivisible unit. Once written, a block can only be read in full or unlinked (deleted)—it cannot be either partially read or modified in-place.

If you’re using RAIDz vdevs, blocks are striped across the member disks of the vdev. For example, a 1MiB block on a 10-wide RAIDz2 vdev will be split into eight 128KiB data pieces and two 128KiB parity pieces, with one piece stored on each disk of the vdev.
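The arithmetic behind that RAIDz example can be sketched as follows. This is an idealized model that assumes the block divides evenly across the data disks; real allocations also involve sector rounding and padding, which we ignore here. The function name raidz_pieces is hypothetical.

```python
KIB = 1024


def raidz_pieces(block_bytes: int, width: int, parity: int) -> tuple:
    """Idealized split of one block across a RAIDz vdev.

    Returns (data_pieces, parity_pieces, bytes_per_piece).
    Assumption: the block divides evenly across the data disks.
    """
    data_disks = width - parity
    if data_disks <= 0:
        raise ValueError("width must be greater than parity")
    if block_bytes % data_disks:
        raise ValueError("block does not divide evenly in this simplified model")
    return data_disks, parity, block_bytes // data_disks


# The example from the text: a 1MiB block on a 10-wide RAIDz2 vdev.
data, par, piece = raidz_pieces(1024 * KIB, width=10, parity=2)
print(f"{data} data pieces and {par} parity pieces of {piece // KIB}KiB each")
```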

You might wonder why we’re spending so much time talking about “blocks” when the article is about the recordsize property—the answer boils down to inconsistent nomenclature. The recordsize property sets the maximum logical size of blocks in an OpenZFS dataset, while the volblocksize property does the same for blocks in a zvol—but to OpenZFS developers, and within the OpenZFS codebase, the relevant unit is referred to as a block in either case.

Volblocksize is fixed, but recordsize is dynamic

Now that we understand what a “block” is, we can talk about actually tuning its size. Whether we’re tuning volblocksize for zvols or recordsize for datasets, the value must be a power of 2. Currently, volblocksize defaults to 8KiB (16KiB in future versions), and recordsize defaults to 128KiB.

So far, this seems pretty simple: set volblocksize=64K, you get 64KiB blocks in your zvol, and that’s that. But recordsize is a bit trickier: the blocks in a dataset are dynamically sized, and recordsize sets the maximum size for blocks in that dataset—not a fixed size.

Understanding the dynamic nature of blocks in a dataset is crucial to tuning it properly, as we’ll see in the next section.

Tune recordsize to match random I/O workload

Now that we understand what recordsize does, how should we tune it? Those who have significant experience with storage likely already know that small-block random I/O is the most punishing form of I/O, and may therefore be tempted to set recordsize arbitrarily low. If the pain lies at 4KiB, one might reason, then one should tune for 4KiB. After all, isn’t your home directory absolutely littered with tiny dotfiles?

First, let’s talk about the relationship between blocks and files inside a dataset. A block can never store data for more than one file, but a file may consist of many blocks. The blocks of an individual file will always be the same size, set when the first block is written. That size will be the lowest power of 2 that will fit all of the data, up to the maximum, which is the recordsize. This means that ten tiny files will be stored in ten individual tiny blocks, and those blocks may be as small as a single sector each, regardless of how large recordsize is.

A file slightly larger than 20KiB will be stored with a block size of 32KiB. If that file later grows to 60KiB, it will be rewritten as a single 64KiB block. You don’t have to worry about files that start small and eventually grow: the first block will be rewritten each time the file grows until it reaches the recordsize, and then a second block will be added. (This is the dynamic aspect of recordsize that we foreshadowed in the previous section.)
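To make the sizing rule easier to reason about, here is a small Python model of it. It follows the rule exactly as described in this article (smallest power of 2 that fits, never below one sector, never above recordsize) and is not taken from the OpenZFS source. The function name and the 4KiB sector default are our assumptions.

```python
KIB = 1024


def block_size_for(file_bytes: int, recordsize: int = 128 * KIB, sector: int = 4 * KIB) -> int:
    """Model of the block size chosen for a file, per the rule described in the text.

    Assumptions: sector and recordsize are powers of 2, and sector <= recordsize.
    """
    size = sector
    # Keep doubling until the data fits or we hit the recordsize ceiling.
    while size < file_bytes and size < recordsize:
        size *= 2
    return size


for label, nbytes in (("100-byte dotfile", 100),
                      ("21KiB file", 21 * KIB),
                      ("60KiB file", 60 * KIB),
                      ("300KiB file", 300 * KIB)):
    print(f"{label}: {block_size_for(nbytes) // KIB}KiB blocks")
```

In this model the tiny dotfile lands in a single-sector block no matter how large recordsize is, while the 300KiB file is capped at the recordsize and spread over several blocks.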

Therefore, for simple file sharing you should typically set recordsize=1M—tiny files take care of themselves by being stored in tiny blocks, requiring no additional tuning. But your large files get the benefit of higher compression ratios, fewer I/O operations required to read or write the file, fewer indirect blocks, and less impact from on-disk fragmentation.
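A quick back-of-the-envelope calculation shows why large files benefit from a large recordsize. The sketch below only counts data blocks for a hypothetical 4GiB file; it ignores compression, indirect blocks, and everything else, so treat the numbers as illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def blocks_needed(file_bytes: int, recordsize: int) -> int:
    """Number of recordsize-sized data blocks needed to hold a large file."""
    return -(-file_bytes // recordsize)  # ceiling division


file_bytes = 4 * GIB  # hypothetical large file
for recordsize in (16 * KIB, 128 * KIB, 1 * MIB):
    count = blocks_needed(file_bytes, recordsize)
    print(f"recordsize={recordsize // KIB}K: {count} blocks")
```

Fewer blocks means fewer I/O operations to read the whole file and fewer block pointers to track.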

This raises the question, when should you tune recordsize smaller—perhaps even smaller than the default 128K? You use small recordsize when you have small-blocksize random I/O inside larger files.

Let’s say that you have a MySQL database sitting in a dataset we’ll call tank/mysql. By default, the InnoDB storage engine uses 16KiB pages, which means MySQL wants to perform random reads and writes inside massive individual files, 16KiB at a time. This in turn means we should perform zfs set recordsize=16K tank/mysql for optimal performance, which avoids both read and write amplification within the MySQL I/O tasks.
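Because a block can only be read or rewritten in full, the cost of a mismatched recordsize is easy to estimate. The following is a simplified model, assuming block-aligned I/O, no caching, and no compression; the function name amplification is ours.

```python
KIB = 1024


def amplification(io_bytes: int, block_bytes: int) -> float:
    """Bytes the filesystem must handle per byte the application asked for.

    Simplified model: the I/O starts on a block boundary, nothing is cached,
    and compression is ignored.
    """
    blocks_touched = -(-io_bytes // block_bytes)  # ceiling division
    return blocks_touched * block_bytes / io_bytes


PAGE = 16 * KIB  # default InnoDB page size, as noted in the text
for recordsize in (16 * KIB, 128 * KIB, 1024 * KIB):
    factor = amplification(PAGE, recordsize)
    print(f"recordsize={recordsize // KIB}K: {factor:.0f}x amplification per 16KiB page")
```

Under these assumptions, a 16KiB page access against the default 128K recordsize moves eight times more data than the database requested.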

However, we want to make sure that only the MySQL data storage is affected by this setting—because as discussed earlier, small files take care of themselves, and recordsize=16K is drastically lower than optimal for general-purpose storage.

VM image files are another excellent example of a workload benefiting from tuning. If you’re using the Linux KVM hypervisor with file-based storage, the default qcow2 cluster_size is 64KiB—and you should perform zfs set recordsize=64K on the dataset holding those files to match.

Modifying recordsize doesn’t modify existing data

You now know the majority of what you need in order to properly tune recordsize to match your workload. In short, general-purpose file sharing demands large recordsizes, even when individual files are small, but random I/O within files demands recordsize tuned to the typical I/O operation within those files.

However, merely setting the recordsize on a dataset won’t change the structure of blocks which were already written to it. The block size of a file is set when the first block is written. In order to change the recordsize of existing files, you must actually re-write the existing files. Using simple tools like cp or rsync is generally the easiest way to force blocks to be resized.
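If you would rather script the rewrite than reach for cp or rsync, the same idea can be sketched in Python: copy the file to a temporary sibling, then rename the copy over the original. This is a minimal sketch with a hypothetical helper name, not a hardened tool. It assumes nothing has the file open (stop the database or VM first), it breaks hard links, it does not preserve ownership, and any existing snapshots will keep referencing the old blocks.

```python
import os
import shutil
import tempfile


def rewrite_file(path: str) -> None:
    """Rewrite a file by copying it and renaming the copy over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary copy in the same directory so the final rename
    # stays within one dataset.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".rewrite-")
    os.close(fd)
    try:
        shutil.copy2(path, tmp_path)  # copies data, mode, and timestamps
        os.replace(tmp_path, path)    # rename the copy over the original
    except BaseException:
        # Clean up the partial copy if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Example usage (the path is hypothetical):
# rewrite_file("/tank/share/big-video.mkv")
```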

In particular, you should be aware that OpenZFS replication will not resize blocks for you. Replication occurs on a per-block basis, and will not coalesce smaller blocks into larger ones or vice versa along the way.

A (very) brief note on volblocksize

As discussed earlier, volblocksize is to zvols what recordsize is to datasets. A zvol is a ZFS block-level device which can be directly formatted with another file system (e.g., ext4, NTFS, exFAT, and so forth).

A zvol can also be used as direct storage for applications which make use of “raw” unformatted drives. In general, datasets can be thought of as “ZFS file systems” and zvols can be thought of as “ZFS virtual disk devices.”

For the most part, tuning advice for volblocksize matches tuning advice for recordsize—match the block size to the typical random I/O operation size expected. However, volblocksize is completely fixed rather than dynamic. Typically, you still want to tune for the applications being hosted—so, 16KiB for a MySQL InnoDB store, or 8KiB for a PostgreSQL store, not some larger value.

Conclusion

Now that you know how to properly tune recordsize, we encourage you to look over your existing pools and datasets and ask yourselves where setting it explicitly might benefit you.

General rules of thumb:

1MiB for general-purpose file sharing/storage
1MiB for BitTorrent download folders—this minimizes the impact of fragmentation!
64KiB for KVM virtual machines using qcow2 file-based storage
16KiB for MySQL InnoDB
8KiB for PostgreSQL

And finally, consider when and where to split your existing storage into logically organized datasets—such as moving MySQL InnoDB storage into its own dataset with its own recordsize, rather than forcing it and your “normal” non-random-access files to share a single setting that isn’t optimal for either. MySQL binary logs should live in their own dataset, so they can have a large recordsize.
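The rules of thumb above can be captured in a small lookup table that prints the matching zfs set commands for review. The workload labels and dataset names below are hypothetical placeholders, and the script only prints commands; it does not run them.

```python
# Rules of thumb from this article, keyed by hypothetical workload labels.
RECORDSIZE_RULES = {
    "general": "1M",
    "bittorrent": "1M",
    "kvm-qcow2": "64K",
    "mysql-innodb": "16K",
    "postgresql": "8K",
}


def recordsize_command(dataset: str, workload: str) -> str:
    """Build the zfs set command for a dataset and workload label."""
    return f"zfs set recordsize={RECORDSIZE_RULES[workload]} {dataset}"


# Hypothetical dataset layout, with one dataset per workload.
layout = {
    "tank/share": "general",
    "tank/torrents": "bittorrent",
    "tank/vm": "kvm-qcow2",
    "tank/mysql": "mysql-innodb",
    "tank/pgsql": "postgresql",
}

for dataset, workload in layout.items():
    print(recordsize_command(dataset, workload))
```

Remember that these commands only affect newly written blocks, so existing files still need to be rewritten afterward.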

For a broader look at the risks of relying on AI for tuning decisions, here’s why we recommend caution with AI-based ZFS tuning.
