One of the most important decisions in the design of a ZFS storage pool is the VDEV layout. The VDEV configuration you choose has a deep impact on the performance and reliability of the pool, as well as on its flexibility for future expansion.
If you are not aware of the different VDEV types in ZFS and their use cases, you might want to first read Understanding ZFS VDEV Types and Choosing the right ZFS pool layout to understand the concepts we will be discussing.
Capacity vs Performance
Some configurations will always perform better than others, but there is usually a trade-off to be made. Mirrors will provide more IOPS and more read throughput. However, they offer far less capacity and therefore increase the overall cost of the deployment. A standard 2-way mirrored configuration may also not provide the level of redundancy required, and going to 3-way or 4-way mirroring greatly increases cost without providing any additional usable storage capacity.
Consider a standard installation of 24 theoretical 10 TB disks; whether they are HDDs, SSDs, or NVMe drives doesn't matter for this exercise.
Configuration | Read IOPS | Write IOPS | Usable Capacity |
---|---|---|---|
12x 2-way Mirror | 24x | 12x | 109 TiB |
8x 3-way Mirror | 24x | 8x | 72 TiB |
4x 6-wide RAID-Z1 | 4x | 4x | 174 TiB |
3x 8-wide RAID-Z1 | 3x | 3x | 183 TiB |
4x 6-wide RAID-Z2 | 4x | 4x | 145 TiB |
3x 8-wide RAID-Z2 | 3x | 3x | 155 TiB |
2x 12-wide RAID-Z2 | 2x | 2x | 166 TiB |
3x 8-wide RAID-Z3 | 3x | 3x | 124 TiB |
2x 12-wide RAID-Z3 | 2x | 2x | 158 TiB |
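As a concrete illustration, here is roughly how two of these layouts would be created. The pool name tank and the disk device names (da0 through da23) are placeholders for this example; substitute your own.

```sh
# 12x 2-way mirror: highest IOPS, ~109 TiB usable (device names are illustrative)
zpool create tank \
  mirror da0 da1   mirror da2 da3   mirror da4 da5   mirror da6 da7 \
  mirror da8 da9   mirror da10 da11 mirror da12 da13 mirror da14 da15 \
  mirror da16 da17 mirror da18 da19 mirror da20 da21 mirror da22 da23

# 3x 8-wide RAID-Z2: ~155 TiB usable, but only 3 VDEVs worth of IOPS
zpool create tank \
  raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
  raidz2 da8 da9 da10 da11 da12 da13 da14 da15 \
  raidz2 da16 da17 da18 da19 da20 da21 da22 da23
```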
Workload Considerations
The workload may also dictate the selection of the VDEV type, taking the decision mostly out of your hands. For workloads that generate a large volume of small-block writes (particularly synchronous writes), the nature of RAIDZ striping means it will not be possible to achieve the levels of performance that mirroring can deliver. Much of the space efficiency provided by RAIDZ may be lost to padding and parity overhead.
Assuming the default block size of 16 KiB for a MySQL/MariaDB database, or for virtual machines, the usable capacity of RAIDZ can be significantly reduced by the requirement that each allocation be padded out to a multiple of the parity level plus one sectors, to avoid orphaned space. For example, with 4 KiB sectors, a 16 KiB block on RAID-Z1 occupies 4 data sectors, 1 parity sector, and 1 padding sector, for a space efficiency of only 67% no matter how wide the VDEV is.
If we take the examples from above, which assumed the default 128 KiB block size for regular files, and instead apply these configurations to this database/virtualization workload:
Configuration | Original Capacity | Actual Capacity | Difference |
---|---|---|---|
12x 2-way Mirror | 109 TiB | 109 TiB | 0% |
8x 3-way Mirror | 72 TiB | 72 TiB | 0% |
4x 6-wide RAID-Z1 | 174 TiB | 145 TiB | -20% |
3x 8-wide RAID-Z1 | 183 TiB | 145 TiB | -26% |
Even accepting the performance loss of RAIDZ for this workload, its capacity advantage over mirrors is reduced from 67% to 33% by the padding and the lower space efficiency of the smaller block size.
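To make sure the on-disk block size actually matches such a workload, the record size is set per dataset (or per zvol, at creation time). The dataset and zvol names below are only examples.

```sh
# Match the ZFS record size to InnoDB's 16 KiB page size (dataset name is an example)
zfs create -o recordsize=16K tank/mysql

# Virtual machine zvols use volblocksize instead, which can only be set at creation
zfs create -V 100G -o volblocksize=16K tank/vm0
```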
Special Devices
Many storage systems have more than a single use case or workload: for example, a single system that is responsible for hosting a busy database while also storing a large archive of video footage.
ZFS can adapt to that, using the “special” VDEV class to optimize the way blocks are laid out. The first effect of adding a “special” class device to your pool is that the pool metadata (information such as indirect blocks, file permissions, extended attributes, and most other things that are not user data) is routed to this specialized VDEV rather than to the standard VDEVs. These smaller blocks can often be a major contributor to fragmentation, since they have such a different lifecycle compared to the user data they describe.
Redundancy and Resiliency Considerations
The first consideration when adding a special/metadata VDEV is ensuring it has sufficient redundancy. It should, at a minimum, match the redundancy of the other VDEVs. If you are using RAID-Z2, which can withstand the concurrent loss of 2 drives and continue to operate, then your metadata VDEV should be a 3-way mirror to match. In ZFS, each VDEV is responsible for its own redundancy, and the loss of any VDEV results in catastrophic data loss, usually of the entire pool.
The special VDEV is effectively the map to where the data for each block in each file lives on disk. If you lose the map, there is no way to tell where one file ends and the next begins. This means you might also consider additional resiliency mechanisms, like using a mix of different models of storage media. One Klara customer purposely makes their 4-way metadata mirrors out of NVMe drives from 2 different manufacturers, to ensure that they won't wear out at the same time or otherwise face the same firmware or hardware bugs.
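As a minimal sketch, adding a metadata VDEV with redundancy to match a RAID-Z2 pool might look like this; the device names are placeholders, and the mixed-vendor approach described above simply means the devices in the mirror are not all the same model.

```sh
# 3-way mirrored special VDEV to match the 2-disk fault tolerance of RAID-Z2
# (nvd0, nvd1, nvd2 are placeholder device names, ideally not all identical models)
zpool add tank special mirror nvd0 nvd1 nvd2
```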
Metadata Performance and Read Path Optimization
When storing a larger file, ZFS will break the file into chunks of the recordsize. An array of pointers to these chunks is stored in the metadata. As the number of chunks grows, it will no longer fit in a single metadata block, so ZFS creates an additional level of indirect blocks, each of which contains an array of pointers to lower-level metadata blocks that in turn point to where the data is on disk.
This means that to read the block at a 2 GiB offset in a file made of 128 KiB chunks, there is a Level 2 block containing an array of pointers to Level 1 blocks, each of which contains pointers to up to 1024 Level 0 blocks, the 128 KiB data blocks themselves. With an entirely cold cache, to read that one block in the middle of the file, you need to read the root object for the file, read the L2 block to find out where the L1 block is, read the L1 block to find out where the L0 block is, and finally read the actual block of data.
With HDD-backed storage, each of these reads will take 4 to 20 milliseconds, meaning you are looking at 16 to 80 milliseconds just to read that 128 KiB. Obviously, the cache can help. If you are reading the next 128 KiB block, the root, L2, and L1 blocks are already handy, so only the L0 data block needs to be read, and the prefetcher might have already done that.
Consider the same situation, but with all of the metadata stored on an NVMe device that takes 100 microseconds per read. Now you have 300 microseconds of reads from the NVMe, and 4 milliseconds from the HDD, for a total of 4.3 milliseconds, compared to as much as 80 milliseconds. You could get as much as a 20x improvement, and free up more IOPS on the HDDs, with just a small amount of flash storage.
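If you want to see this indirect block tree for yourself, zdb can dump a file's block pointers. This is a diagnostic sketch: the dataset name and object number are examples, and on most platforms the inode number reported by ls -i corresponds to the file's ZFS object number.

```sh
# Find the object number of a file (the inode number generally maps to the ZFS object)
ls -i /tank/data/bigfile.bin

# Dump that object's block pointers, including the L2/L1 indirect blocks and L0 data blocks
zdb -ddddd tank/data 12345
```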
In production, we typically see metadata ratios that range from less than 0.1% to 5% of data volume, making it fairly inexpensive to accelerate an HDD-based pool.
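To estimate how large a special VDEV would need to be for an existing pool, one approach is zdb's block statistics, which break down allocated space by object type; note that this walks the entire pool and can take a long time on a large system.

```sh
# Per-type breakdown of allocated space (LSIZE/PSIZE/ASIZE); broadly speaking,
# everything other than "ZFS plain file" and "zvol object" data is metadata
zdb -bb tank
```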
Small Blocks
The “special” VDEV class can also be used to deal with specific blocks that will not be well served by the general VDEVs. For a dataset that is used for a database, setting the special_small_blocks property to 16K will cause all blocks 16 KiB and smaller to be written to the “special” VDEV class instead of the normal storage class. If this dataset is a small enough portion of your total storage demand, this can allow you to take advantage of the improved performance of small blocks on mirrors, while still having access to the capacity advantages of large blocks stored on HDDs.
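As a sketch, assuming a dataset named tank/mysql that holds the database:

```sh
# Send blocks of 16 KiB and smaller from this dataset to the special VDEV class
zfs set special_small_blocks=16K tank/mysql

# Note: if special_small_blocks is greater than or equal to the dataset's recordsize,
# every block in the dataset will land on the special VDEVs
zfs get recordsize,special_small_blocks tank/mysql
```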
This does require careful consideration though: if you fill up the “special” VDEV with small blocks, then future small blocks and your metadata will have to fall back to the normal class VDEVs, and you will have hampered your performance gains. You can't migrate the small blocks off of the special VDEV, other than by deleting and rewriting the files.
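To keep an eye on how full the special class is getting, per-VDEV allocation is visible in the pool listing:

```sh
# Show capacity and allocation for each VDEV, including the special mirror
zpool list -v tank
```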
Hybrid Pools
Whether it is a SLOG to lower latency for synchronous writes, or a special VDEV to accelerate metadata and free up IOPS, using a bit of flash to accelerate specific parts of the workload gives ZFS a lot of flexibility to maximize what your hardware can do. The way ZFS was designed, with per-dataset configuration and tuning, allows a system to adapt to multiple different concurrent workloads without having to sacrifice the performance of one workload to benefit another.
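Putting it together, a hybrid pool for the mixed database-plus-archive example might look roughly like the sketch below; the pool, device, and dataset names are all illustrative.

```sh
# HDD capacity VDEVs, plus flash for metadata and synchronous writes
zpool create tank \
  raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
  raidz2 da8 da9 da10 da11 da12 da13 da14 da15 \
  special mirror nvd0 nvd1 nvd2 \
  log mirror nvd3 nvd4

# Per-dataset tuning lets each workload get what it needs from the same pool:
# 16 KiB records for the database, 1 MiB records for the video archive
zfs create -o recordsize=16K tank/mysql
zfs create -o recordsize=1M tank/videos
```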
The same is true at the pool level: using the right mix of VDEVs can make all the difference in designing a system that will be able to handle the ever-evolving storage workloads of the modern enterprise.
To make sure you get the most out of your existing hardware, or your next build, partner with Klara to design and implement modern hybrid storage systems.

Allan Jude
Principal Solutions Architect and co-Founder of Klara Inc., Allan has been a part of the FreeBSD community since 1999 (that’s more than 25 years ago!), and an active participant in the ZFS community since 2013. In his free time, he enjoys baking, and obviously a good episode of Star Trek: DS9.