Safe ZFS Tuning Practices for Production Databases
Databases–particularly relational databases–are both some of the most important and least widely understood storage loads. When you combine a relational DB storage load with a copy-on-write filesystem like OpenZFS, there are even more things you can get right–or, conversely, horribly wrong.
Today, we’re going to dive into the fundamentals of tuning OpenZFS for production database workloads.
Understand Your Database Engine
All modern relational database engines have a few things in common: they’re organized into tables, columns, and rows; different column types take up different amounts of space per row (sometimes static, sometimes dynamic); and they all have a native page and/or extent size.
In order to maximize database performance, we will typically need to prioritize latency–not throughput. A relational database is, at heart, a radically random-access storage workload which benefits very little from the types of “hacks” built into filesystems to accelerate typical filesystem operations–which overwhelmingly tend to revolve around the reading and writing of individual files in their entirety, not random access within those files.
There are a few things we’ll want to make sure that we know about our specific database engine. First of all, is it a journaling engine? If not, it won’t be crash safe, meaning the database could become corrupted any time it isn’t shut down cleanly. Nearly all modern DB engines are journaling and crash safe–including MSSQL, PostgreSQL, and MySQL’s InnoDB, but not MySQL’s MyISAM.
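If you’re not sure which engine your MySQL tables are actually using, it’s easy to check from the shell (the database name here is a placeholder):

  # List each table's storage engine; look for any lingering MyISAM tables
  mysql -e "SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES WHERE TABLE_SCHEMA='mydb';"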
The next thing you want to know is the page and/or extent size of your DB engine. Although this is a configurable parameter for most DB engines, we very rarely see it modified–so you can usually pretty safely assume older PostgreSQL instances will have 8KiB pages, MySQL instances (and the absolute newest PostgreSQL instances) will have 16KiB pages, and MSSQL databases will usually operate in 64KiB extents.
If you’re using an application that bundles its own installer for a DB engine, it’s worth checking to make sure that the application didn’t mess with its DB’s default page/extent size. But, again, this is quite rare in our experience.
The reason we want to know our page or extent size is so that we can tune the recordsize of the dataset containing the database’s binary storage. The best-performing values will typically be half of, equal to, or double the database page or extent size.
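As a quick sketch, assuming a hypothetical dataset named tank/db/mysql, you can verify the engine’s page size and then match recordsize to it:

  # PostgreSQL: report the compiled-in page size (typically 8192 bytes)
  psql -c "SHOW block_size;"
  # MySQL/InnoDB: report the page size (typically 16384 bytes)
  mysql -e "SHOW VARIABLES LIKE 'innodb_page_size';"
  # Match the dataset's recordsize to the 16KiB InnoDB page size
  zfs set recordsize=16K tank/db/mysql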
Understand Your Database-Backed Application
So far, we’ve talked about tuning for a database engine’s default page or extent size. This is typically the correct approach–but in some cases, you may need or want to consider the behavior of the application using that database.
For example, perhaps we’re developing a special photo-hosting application called GalleryChain, which focuses on maintaining the chain of custody of images as they are first shot and uploaded, and later modified.
Our hypothetical GalleryChain application has a primary table, a metadata table, and a photos table. The primary table has nothing but a primary ID; the metadata table has a many-to-one relationship with the primary table, and the photos table has a one-to-one relationship with the metadata table.
When an image is initially imported, a row is created for it in all three tables. Each table has a SIGHASH column, and each time an authorized user uploads a new image or modifies an existing one, the total contents of the row in question are signed with that user’s private key and the resulting hash is stored in that row’s SIGHASH column.
Understanding this workflow, and the basic architecture of the database application, tells us that for maximal performance, we actually want these tables in different datasets. We’ll need quite a lot of fast querying of the metadata and primary tables–each of which can likely fit an entire row into a single database page–so we want a very small recordsize for those tables; typically 16KiB or 32KiB.
The photos table, however, is an entirely separate beast. We wanted to tune recordsize very small for the primary and metadata tables, because we don’t want read or write amplification. But the photos table hosts BLOBs of the photos themselves, with a typical size ranging from 4-16MiB per photo.
Therefore, for the photos table we want to approximate a much larger operation size–we’ll never be reading or writing a partial photo, so we want recordsize=1M, or potentially even larger. This keeps our photos from getting unnecessarily over-fragmented as they’re written, and minimizes IOPS (and queue depth) as they’re read back in as well.
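If GalleryChain actually existed, the resulting layout might look something like this (pool and dataset names are hypothetical, and placing each table on its own dataset depends on your engine supporting per-table data directories or tablespaces):

  # Small recordsize for the hot, randomly-accessed tables
  zfs create -o recordsize=16K tank/gallerychain/primary
  zfs create -o recordsize=16K tank/gallerychain/metadata
  # Large recordsize for the whole-object BLOB table
  zfs create -o recordsize=1M tank/gallerychain/photos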
Understand Your Topology
Storage layout is one of the most critical decisions when tuning OpenZFS for production database workloads. One of the worst possible mistakes you can make when architecting a database is hosting it on a striped RAID implementation and expecting high performance.
It’s a very common human failing to look for the biggest number you can find, rather than the most applicable number. When it comes to striped parity RAID–conventional or ZFS–one way we see this failing repeatedly manifest is using dd to produce a really big sequential throughput number and assuming this means you’ve got a “high performance array.”
In practice, striped RAID–whether RAID5/6, or RAIDz1/2/3–generally performs worse with heavily random-access workloads than a single disk of the same class used in the array would perform.
To understand why, let’s think back to those basics of database architecture–page or extent sizes in the 8KiB, 16KiB, or 64KiB range. If we assume 4kn disks and MySQL with default page size, that’s only four sectors per page.
Now, imagine we’ve got a ten-wide RAID6 or RAIDz2 array. There are only two possible choices to make when saving a 16KiB page to a ten-wide dual-parity stripe–you can either write the stripe undersized but complete (as OpenZFS does), or you can implement a read/modify/write cycle (as conventional RAID6 does).
With a conventional RAID6 array, you’ll typically have a fixed chunk size–the amount of data saved per drive in the array–and that chunk size will typically default to somewhere around 256KiB to 512KiB, depending on your specific RAID manager. This means every time you write a new 16KiB database page, you’re going to have to light up all ten disks: first you read the existing stripe from the eight data drives, then you modify that stripe in memory with the new 16KiB of data, then you recalculate parity for the stripe, then you light up all ten drives to write the modified stripe and its updated parity.
OpenZFS and that 10-wide RAIDz2 will store your four-sector page on six total drives–four data and two parity. The good news is, you don’t have to worry about read-modify-write. The slightly bad news is, you aren’t getting the storage efficiency you expected–66%, not 80%. You’re only lighting up six of ten disks on each read or write–so you’re not getting the performance of all ten drives either. And you’re reading and writing in single-sector per-disk operations, absolutely the most pathological of all possible workloads for any storage device we have ever encountered.
Although the two different RAID systems struggle with the workload differently, they do both struggle with it. Highly random access, small blocksize storage operations do not perform well at all under striped parity arrays!
For databases–even more than for most workloads–the most performant topology is mirrors. Let’s say we need to write the same 16KiB database page to the same ten drives, but this time they’re five two-wide mirrors–now only two drives of our ten total need to be lit up, and they’re lit up in four-sector-wide (16KiB) per-disk operations, not single-sector-wide (4KiB) per-disk operations.
This means that we’re getting better performance per disk with mirrors than we did with Z2, and we’re also getting the ability to handle up to 5x as many concurrent operations as the Z2 can. This is not a subtle difference!
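To make this concrete, a pool of five two-wide mirrors might be created like this (pool name and device names are hypothetical):

  zpool create tank \
    mirror /dev/ada0 /dev/ada1 \
    mirror /dev/ada2 /dev/ada3 \
    mirror /dev/ada4 /dev/ada5 \
    mirror /dev/ada6 /dev/ada7 \
    mirror /dev/ada8 /dev/ada9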
Sync Writes Are The Frenemy
There are two basic types of storage write–asynchronous, and synchronous. The vast majority of storage write operations are committed asynchronously–meaning the operation sits in a cache in RAM until the storage stack decides the time is optimal to commit that operation to the metal.
The throughput wins from asynchronous commits are truly extraordinary–but the approach does come with its drawbacks. Since asynchronous commits may be committed out of order, they can leave a database in an inconsistent (corrupt) state quite easily. Write operations committed only to a volatile cache–such as the system RAM–are also vulnerable to being lost entirely, should a power outage or other system crash happen before those writes have been safely committed to metal.
For this reason, database engines typically write almost entirely synchronously. With sync writes, the function call does not return until the requested operation has been safely committed to non-volatile storage. This prevents writes from being committed out of sequence, and minimizes the window of time in which data might be lost.
Remember when we talked about how extraordinary the performance wins are for async writes? The flip side of that coin is just as valid–a conventional drive capable of delivering 150MiB/sec of asynchronous writes may well fail to achieve one MiB/sec throughput on a fully synchronous workload.
We have three possible ways of accelerating sync writes in the OpenZFS world–we can use powerloss-safe drives (which return from sync write calls instantly, because their capacitor-backed DRAM cache is effectively non-volatile), we can add a LOG vdev, or we can simply disable sync directly.
Use Crash-Safe Drives
Just using powerloss-safe drives is, by far, the highest performance mitigation we have available to us. This allows the drives’ DRAM cache to be treated as non-volatile, so sync writes can be handled essentially the exact same way that async writes are.
Add a LOG VDEV
The next highest-performance mitigation we can apply is adding a LOG VDEV to our pool. OpenZFS commits sync writes to the ZFS Intent Log, a simple on-disk ring buffer, so that it can return from them more rapidly than it could if it had to wait for the next TXG to be committed to disk. A LOG VDEV simply provides a separate, hopefully higher-performance place for OpenZFS to store the ZIL.
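Adding one is a single online operation. A minimal sketch, assuming hypothetical NVMe device names (mirroring the LOG VDEV protects you against losing in-flight sync writes if a LOG drive dies):

  zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1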
You do need to be careful when selecting drives for a LOG VDEV–they need to be extremely high endurance and low latency. 3D XPoint drives like Intel’s Optane are ideal for this task; but with the disappearance of Optane from the small business and consumer market, we recommend MLC SSDs with powerloss protection as the next best thing.
Finally, make sure you’re considering write endurance when selecting an SSD model for use in LOG VDEVs. Although the ZIL itself is very small, you will probably need a large SSD in order to get a decent amount of write endurance–remember, the LOG VDEV will have every single sector of the entire pool’s workload written to it.
Perhaps you store your database on a pool of five 2-wide mirror VDEVs, with a single disk (or mirror) LOG vdev. That LOG VDEV is going to experience writes at five hundred percent of the rate at which the drives in the main storage VDEVs do! So figure your daily volume of writes, check the DWPD rating of the drive you’re considering, and do a little simple math (see the example below) to figure out how large a model you need in order to stay performant for the number of years you expect the LOG VDEV to last you.
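As a hypothetical worked example: suppose your databases commit 800GiB of sync writes per day. Spread across five mirrors, each data drive absorbs roughly 160GiB of that per day, but the LOG VDEV absorbs all 800GiB. A 480GB SSD rated for 1 DWPD would be running at roughly 1.7 drive writes per day and would wear out well before its warranty period; a 960GB model from the same line would sit comfortably under its rating, at roughly 0.83 drive writes per day.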
Ideally–and for hopefully obvious reasons–you want the drives in a LOG VDEV to be powerloss-safe. That doesn’t make your data any safer–the ZIL already guarantees that–but it does significantly improve the throughput of the LOG VDEV itself (by decreasing the latency of the individual operations committed to it).
zfs set sync=disabled
The final mitigation we can apply to a sync write workload is simple, blindingly fast, and dangerous: we can simply turn off sync entirely, on a per-dataset basis.
Disabling sync means that sync writes do not go to the ZIL at all, and are handled precisely the same way async writes are. Although this is definitely the highest-performance mitigation for slow sync writes, we do not typically recommend this setting for anything beyond testing.
Although both OpenZFS itself and (hopefully) your database engine of choice are quite crash-safe, that crash safety assumes writes are committed in order. Removing that guarantee via zfs set sync=disabled means that if your system crashes while DB writes are happening, you’re disturbingly likely to end up with application-level corruption.
Remember our theoretical GalleryChain application? If we set sync=disabled, and the power goes out as we’re importing a new photo, we could easily wind up with (for example) a new entry in the metadata table with no associated row in the primary table. That would result in “hidden” photos that take up space in the database, but can’t be found via the application itself (which begins by querying the primary table to find primary IDs, then loads related records–records with the same primary ID–from the metadata table).
The occasional “hidden photo” might not sound so bad, but that’s one of the most innocuous types of application-level corruption. This type of corruption can easily render applications unable to even start, when they encounter back-end data that does not meet that application’s expectations.
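If you do want to experiment with disabling sync on a scratch dataset, the property is per-dataset and instantly reversible (dataset name is hypothetical):

  zfs set sync=disabled tank/scratchdb
  # ...run your tests, then restore the default behavior:
  zfs set sync=standard tank/scratchdb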
Compression Is Your Friend
You may find this extremely counter-intuitive–but for most database applications, the right compression algorithm is a distinct performance win.
This is because most systems bottleneck on storage long before they bottleneck on CPU–and because most databases store extremely compressible data (e.g., employee names, phone numbers, and email addresses in plain text).
In most cases, we would advise compress=lz4 for databases. If we were trying to tune our hypothetical GalleryChain application stack as tightly as possible, we’d use lz4 for the primary and metadata tables, and ZLE (which only compresses slack space and other sources of repeating zeroes) for the photos table, since the photos are presumably already-compressed JPEG data.
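Continuing the hypothetical GalleryChain layout from earlier, that tuning might look like:

  zfs set compression=lz4 tank/gallerychain/primary
  zfs set compression=lz4 tank/gallerychain/metadata
  # ZLE only collapses runs of zeroes; cheap for already-compressed JPEGs
  zfs set compression=zle tank/gallerychain/photos

Once representative data is loaded, zfs get compressratio will show how much space you actually saved.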
If you know precisely what kind of data your database will be storing and have the time for testing, it’s worth actually running scientific tests using that data and your typical access pattern, to find the best performing compression algorithm for your use case. In some cases, ZSTD or even gzip might outperform lz4, as well as out-compressing it–but in others, they can introduce a new and punishing bottleneck, so do be careful here.
Prefetch and Databases Don’t Mix
OpenZFS offers, by default, a file-level prefetch algorithm. The way this works is pretty simple: if you ask for the first block of a file, the prefetch algorithm figures you’re probably going to want the next block too, so it goes ahead and fetches it immediately instead of waiting to be asked.
When managing a standard filesystem, the vast majority of the storage workload is reading and writing files in their entirety–so prefetch is a pretty big win, significantly reducing latency / improving throughput in workloads which revolve around whole-file access.
Databases, unfortunately, are very different beasts. Databases typically read very small pieces of very large files in an order that has absolutely nothing to do with the ordering of the sectors inside that file–so in a heavily loaded DB environment, prefetch operations are almost always going to be wasted IOPS.
Unfortunately, prior to OpenZFS 2.3.0, prefetch could only be configured system-wide using a tunable. If you want to disable prefetch that way, you’ll be doing so in sysctl.conf (FreeBSD) or zfs.conf (Linux), and you’ll be doing so not only for every dataset but also for every pool imported by that system. In OpenZFS 2.3 and later, there is a per-dataset property to manage the prefetch setting, as shown below.
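A sketch of all three approaches; the exact tunable names can vary slightly between OpenZFS versions, so verify against your platform’s documentation before relying on them:

  # FreeBSD, pre-2.3: add to /etc/sysctl.conf
  vfs.zfs.prefetch_disable=1
  # Linux, pre-2.3: add to /etc/modprobe.d/zfs.conf
  options zfs zfs_prefetch_disable=1
  # OpenZFS 2.3+: per-dataset property (dataset name hypothetical)
  zfs set prefetch=none tank/db/mysql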
We do not recommend disabling prefetch for the entire system without performing actual testing–although it can be helpful in extremely heavily-loaded databases, the negative impact of prefetch in a database context is generally quite minimal compared to its positive impact on other filesystem operations.
atime=off
We find that, in general, almost nothing actually uses atime–a file’s access time, meaning the timestamp recording when that file was most recently read.
Database engines especially don’t rely on atime–what’s the point of updating the atime on a single multiple-GiB file several thousand times per hour?
In practice, we do not often see atime updates causing significant problems–but every time you burn IOPS on a uselessly-updated atime, a storage operation you actually need to happen waits in the wings.
You may set atime=off on a per-dataset basis–so if, for instance, GalleryChain saved its photos to individual discrete files rather than dumping them as BLOBs into a table, you might decide you wanted to leave atime on in the photos dataset, but turn it off in the MySQL database’s dataset.
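The property itself is a one-liner (dataset name hypothetical):

  zfs set atime=off tank/db/mysql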
To rewrite, or not to rewrite?
Since admins frequently get confused about which settings take effect immediately and retroactively and which do not, we’re going to give you a handy list of the settings that will require re-writing your data before you reap the benefit.
Changing storage topology will obviously require a rewrite, since you’re upending the entire storage stack beneath the data. Replication is usually the best way to manage this–back up your entire pool via replication, destroy and rebuild it with your preferred topology, then replicate back onto the new pool to restore.
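A minimal sketch of that round trip, assuming hypothetical pools named tank and backup:

  zfs snapshot -r tank@migrate
  zfs send -R tank@migrate | zfs receive -F backup/tank
  # destroy tank, re-create it with the new topology, then restore:
  zfs send -R backup/tank@migrate | zfs receive -F tank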
Changing recordsize to better match or complement underlying DB page or extent size (or application data flow patterns) will unfortunately require a brute-force rewrite of existing data. Although you can set the recordsize on a dataset dynamically, doing so does not rewrite existing data–and replication won’t do the trick either; you’ll need a brute force cp, rsync, or similar operation to force OpenZFS to re-write the entire file using the new block size.
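That brute-force rewrite can be as simple as a copy and a rename, performed while the database is shut down (paths are hypothetical):

  # the copy writes fresh blocks at the dataset's new recordsize
  cp /tank/db/mysql/ibdata1 /tank/db/mysql/ibdata1.rewrite
  mv /tank/db/mysql/ibdata1.rewrite /tank/db/mysql/ibdata1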
Changing compression may or may not demand a re-write, depending on your expectations. Compression happens on a per-block basis, so OpenZFS will begin following an updated compression setting immediately, even on existing files–but it won’t re-write existing data in those files. If you’re fine with your old data remaining as-is, you don’t need to re-write your data after updating compression settings; but if you expect all data in the database to perform similarly, you’ll need the same brute-force cp, rsync, or what have you that you needed after updating recordsize.
Adding a LOG vdev does not require rewriting data or rebooting the system. (Removing a LOG vdev can also be done at any time.)
Disabling sync writes via zfs set sync=disabled is cheap, quick, and easy–and so is turning it back on again, once you’ve satisfied your danger fetish!
Disabling prefetch doesn’t need a rewrite, and usually doesn’t need a reboot either: the pre-2.3 kernel tunable can generally be changed at runtime (via sysctl on FreeBSD, or /sys/module/zfs/parameters on Linux), and the 2.3+ per-dataset property takes effect immediately.
Disabling atime can be done at any time, for any reason and can be re-enabled just as quickly and easily.
Conclusions
Although there are some unique challenges involved in tuning a copy-on-write filesystem like OpenZFS for maximal performance with relational databases, a properly tuned OpenZFS can perform very well indeed.
We must stress that tuning is always optional. Some admins see that something can be tuned, and feel it must be tuned–even when they aren’t experiencing any actual performance problems. In our experience, the vast majority of real-world database-backed applications perform quite well under OpenZFS with no additional tuning required at all.
For those with heavier-duty workloads, Tuning OpenZFS for Production can unlock significant performance gains when applied thoughtfully and based on real workloads. If you’re not sure which knobs should be turned for your workload or want to validate your current configuration, this is where Klara’s ZFS Performance Audit can help. By taking a holistic look at your hardware, workload patterns, and ZFS configuration, the analysis identifies real bottlenecks and provides clear, actionable guidance tailored to your system.
The result isn’t just better benchmark numbers, it’s a storage platform that behaves predictably under real production load, with the performance and reliability your applications depend on.

Jim Salter
Jim Salter (@jrssnet) is an author, public speaker, mercenary sysadmin, and father of three—not necessarily in that order. He got his first real taste of open source by running Apache on his very own dedicated FreeBSD 3.1 server back in 1999, and he's been a fierce advocate of FOSS ever since. He's the author of the Sanoid hyperconverged infrastructure project, and co-host of the 2.5 Admins podcast.