
One of the most important aspects of designing a ZFS storage pool is the VDEV layout. The VDEV configuration you choose has a deep impact on the performance and reliability of the pool, as well as on its flexibility for future expansion.

If you are not aware of the different VDEV types in ZFS and their use cases, you might want to first read Understanding ZFS VDEV Types and Choosing the right ZFS pool layout to understand the concepts we will be discussing.  

Capacity vs Performance 

Some configurations will always perform better than others, but there is usually a trade-off to be made. Mirrors will provide more IOPS and more read throughput, but they offer far less capacity and therefore increase the overall cost of the deployment. A standard 2-way mirrored configuration may also not provide the level of redundancy required, forcing you to 3-way or 4-way mirroring, which greatly increases cost without providing any additional usable storage capacity.

Consider a standard installation of 24 theoretical disks of 10 TB each; whether they are HDDs, SSDs, or NVMe drives does not matter for this exercise.

| Configuration | Read IOPS | Write IOPS | Usable Capacity |
|---|---|---|---|
| 12x 2-way Mirror | 24x | 12x | 109 TiB |
| 8x 3-way Mirror | 24x | 8x | 72 TiB |
| 4x 6-wide RAID-Z1 | 4x | 4x | 174 TiB |
| 3x 8-wide RAID-Z1 | 3x | 3x | 183 TiB |
| 4x 6-wide RAID-Z2 | 4x | 4x | 145 TiB |
| 3x 8-wide RAID-Z2 | 3x | 3x | 155 TiB |
| 2x 12-wide RAID-Z2 | 2x | 2x | 166 TiB |
| 3x 8-wide RAID-Z3 | 3x | 3x | 124 TiB |
| 2x 12-wide RAID-Z3 | 2x | 2x | 158 TiB |
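
The capacity column can be approximated with a short calculation. The sketch below is a simplified model, not the exact accounting ZFS performs: it assumes 4 KiB sectors (ashift=12), the default 128 KiB recordsize, 10 TB drives counted as 10^12 bytes (roughly 9.09 TiB), and the RAIDZ allocation rule of one parity sector per stripe of data sectors plus padding to a multiple of the parity level plus one; slop space and other minor overheads are ignored.

```python
# Approximate the IOPS multiples and usable capacity of the layouts above.
# Assumptions (not exact ZFS accounting): 4 KiB sectors, 128 KiB records,
# and "10 TB" drives counted as 10^12 bytes (~9.09 TiB).
from math import ceil

SECTOR = 4096                      # 4 KiB sectors (ashift=12)
DRIVE_TIB = 10e12 / 2**40          # a 10 TB drive is ~9.09 TiB
DISKS = 24

def raidz_alloc_sectors(block_bytes, width, parity):
    """Sectors consumed by one block on a RAIDZ vdev, including parity and padding."""
    data = ceil(block_bytes / SECTOR)
    total = data + parity * ceil(data / (width - parity))   # one parity sector per stripe
    return ceil(total / (parity + 1)) * (parity + 1)        # pad to a multiple of parity+1

def raidz_capacity(vdevs, width, parity, block_bytes=128 * 1024):
    data = ceil(block_bytes / SECTOR)
    efficiency = data / raidz_alloc_sectors(block_bytes, width, parity)
    return vdevs * width * DRIVE_TIB * efficiency

def mirror_capacity(vdevs):
    return vdevs * DRIVE_TIB       # one drive's worth of capacity per mirror vdev

# (layout, usable TiB, read IOPS multiple, write IOPS multiple)
# Mirrors: every disk can serve reads, writes scale with the number of vdevs.
# RAIDZ: both reads and writes scale with the number of vdevs.
layouts = [
    ("12x 2-way Mirror",   mirror_capacity(12),       DISKS, 12),
    ("8x 3-way Mirror",    mirror_capacity(8),        DISKS,  8),
    ("4x 6-wide RAID-Z1",  raidz_capacity(4, 6, 1),   4, 4),
    ("3x 8-wide RAID-Z1",  raidz_capacity(3, 8, 1),   3, 3),
    ("4x 6-wide RAID-Z2",  raidz_capacity(4, 6, 2),   4, 4),
    ("3x 8-wide RAID-Z2",  raidz_capacity(3, 8, 2),   3, 3),
    ("2x 12-wide RAID-Z2", raidz_capacity(2, 12, 2),  2, 2),
    ("3x 8-wide RAID-Z3",  raidz_capacity(3, 8, 3),   3, 3),
    ("2x 12-wide RAID-Z3", raidz_capacity(2, 12, 3),  2, 2),
]

for name, tib, read_x, write_x in layouts:
    print(f"{name:20s} read {read_x:2d}x  write {write_x:2d}x  {int(tib)} TiB usable")
```

Running this reproduces the usable-capacity column of the table above to within a TiB.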

Workload Considerations 

The workload may also dictate the selection of the VDEV type, taking the decision mostly out of your hands. For workloads with a large volume of small block writes (particularly synchronous writes), the nature of RAIDZ striping means it will not be possible to achieve the levels of performance that mirroring can deliver, and much of the space efficiency provided by RAIDZ may be lost to padding and parity overhead.

Assuming the default block size of 16 KiB for a MySQL/MariaDB database, or for virtual machines, the usable capacity of RAIDZ can be significantly reduced by the requirement that each allocation be padded out to a multiple of the parity level plus one sectors, to avoid orphaned space.

If we take the examples from above, which assumed the default 128 KiB block size for regular files, and apply these configurations to this database/virtualization workload:

| Configuration | Original Capacity | Actual Capacity | Difference |
|---|---|---|---|
| 12x 2-way Mirror | 109 TiB | 109 TiB | 0% |
| 8x 3-way Mirror | 72 TiB | 72 TiB | 0% |
| 4x 6-wide RAID-Z1 | 174 TiB | 145 TiB | -20% |
| 3x 8-wide RAID-Z1 | 183 TiB | 145 TiB | -26% |
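
Re-running the same approximation with 16 KiB blocks shows where the capacity goes: at 4 KiB sectors, a 16 KiB block is only 4 data sectors, so RAID-Z1 adds 1 parity sector and 1 padding sector regardless of how wide the VDEV is, capping space efficiency at 4 out of every 6 sectors. This is a sketch under the same assumptions as before (4 KiB sectors, 10 TB drives counted as 10^12 bytes), not exact ZFS accounting.

```python
# Compare RAID-Z1 space efficiency at 128 KiB vs 16 KiB blocks.
from math import ceil

SECTOR, DRIVE_TIB = 4096, 10e12 / 2**40

def raidz_efficiency(block_bytes, width, parity):
    data = ceil(block_bytes / SECTOR)
    total = data + parity * ceil(data / (width - parity))
    total = ceil(total / (parity + 1)) * (parity + 1)       # padding to parity+1
    return data / total

for width in (6, 8):
    for block_kib in (128, 16):
        eff = raidz_efficiency(block_kib * 1024, width, parity=1)
        usable = int(24 * DRIVE_TIB * eff)
        print(f"{width}-wide RAID-Z1 @ {block_kib:3d} KiB blocks: "
              f"{eff:.0%} space efficient, {usable} TiB usable across 24 disks")
```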

Even if you accept the performance loss of RAIDZ for this workload, its storage capacity advantage over mirrors shrinks from 67% to 33% because of the padding and lower space efficiency of the smaller block size.

Special Devices 

Many storage systems have more than a single use case or workload: for example, a single system that is responsible for hosting a busy database while also storing a large archive of video footage.

ZFS can adapt to that, using the “special” VDEV class to optimize the way blocks are laid out. The first effect of adding a “special” class device to your pool is that the pool metadata (information such as indirect blocks, file permissions, extended attributes, and most other things that are not user data) is routed to this specialized VDEV rather than the standard VDEVs. These smaller blocks can often be a major contributor to fragmentation, since they have such a different lifecycle compared to the user data they describe.

Redundancy and Resiliency Considerations 

The first consideration when adding a special/metadata VDEV is ensuring it has sufficient redundancy. It should, at a minimum, match the redundancy of the other VDEVs. If you are using RAID-Z2, which can withstand the concurrent loss of 2 drives and continue to operate, then your metadata VDEV should be a 3-way mirror to match. In ZFS, each VDEV is responsible for its own redundancy, and the loss of any VDEV results in catastrophic data loss, usually of the entire pool.

The special VDEV is effectively the map to where the data for each block in each file lives on disk. If you lose the map, there is no way to tell where one file ends and the next begins. This means you might also consider additional resiliency mechanisms, like using a mix of different models of storage media. One Klara customer purposely builds their 4-way metadata mirrors out of NVMe drives from 2 different manufacturers, to ensure that they won't wear out at the same time or otherwise face the same firmware or hardware bugs.

Metadata Performance and Read Path Optimization 

When storing a larger file, ZFS will break the file into chunks of the recordsize. An array of pointers to these chunks is stored in the metadata. As the number of chunks grows, the array will no longer fit in a single metadata block, so ZFS creates an additional level of indirect blocks, each of which contains an array of pointers to the lower-level blocks that describe where the data is on disk.

This means that to read the block at the 2 GiB offset of a file made of 128 KiB chunks, there is a Level 2 indirect block containing an array of pointers to Level 1 indirect blocks, each of which in turn points to 1024 Level 0 (data) blocks of 128 KiB each. With an entirely cold cache, to read that one block in the middle of the file, you need to read the root object for the file, then the L2 block to find out where the L1 block is, then the L1 block to find out where the L0 block is, and finally the actual block of data.
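
As a rough illustration of that walk, the sketch below models the indirection arithmetic using the common defaults of 128 KiB records, 128 KiB indirect blocks, and 128-byte block pointers (so each indirect block addresses 1024 children); it is a simplified model, not the actual ZFS on-disk code.

```python
# Simplified model of ZFS indirection: how many levels a file needs, and which
# pointer is followed at each level to reach a given byte offset.
RECORDSIZE = 128 * 1024
BLKPTR = 128
PTRS_PER_INDIRECT = RECORDSIZE // BLKPTR     # 1024 pointers per indirect block

def indirect_levels(file_size):
    """Number of indirect levels needed above the data (L0) blocks."""
    blocks = max(1, -(-file_size // RECORDSIZE))   # ceiling division
    levels = 0
    while blocks > 1:
        blocks = -(-blocks // PTRS_PER_INDIRECT)
        levels += 1
    return levels

def walk(offset):
    """Slots followed at each level to reach the data block for a byte offset."""
    block = offset // RECORDSIZE                   # L0 (data) block index
    l1_slot = block % PTRS_PER_INDIRECT            # slot within the L1 block
    l2_slot = block // PTRS_PER_INDIRECT           # slot within the L2 block
    return l2_slot, l1_slot, block

print(indirect_levels(4 * 2**30))   # a 4 GiB file needs 2 indirect levels (L2 and L1)
print(walk(2 * 2**30))              # offset 2 GiB -> L2 slot 16, L1 slot 0, data block 16384
# Cold read = root object + L2 + L1 + the L0 data block = 4 reads in total.
```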

With HDD-backed storage, each of these reads will take 4-20 ms, meaning you are looking at 16 to 80 milliseconds just to read that 128 KiB. Obviously, the cache can help: if you are reading the next 128 KiB block, we already have the root, L2, and L1 blocks handy, so we just need to read the L0 block, and the prefetcher might have already done that.

Consider this same situation, but with all of the metadata stored on an NVMe device that takes 100 microseconds per read. Now you have 300 microseconds of reads from the NVMe, plus 4 milliseconds from the HDD, for a total of 4.3 milliseconds, compared to as much as 80 milliseconds. You could get as much as a 20x improvement, and free up more IOPS on the HDDs, with just a small amount of flash storage.
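
The latency arithmetic is easy to reproduce; the figures below are the illustrative numbers from the text, not measurements of any particular hardware.

```python
# Back-of-the-envelope latency for the cold read described above.
HDD_READ_MS = (4, 20)      # per-read latency range for an HDD
NVME_READ_MS = 0.1         # ~100 microseconds per read on the special VDEV
METADATA_READS = 3         # root object + L2 + L1 indirect blocks
DATA_READS = 1             # the 128 KiB data block itself

all_hdd = [(METADATA_READS + DATA_READS) * t for t in HDD_READ_MS]
hybrid = [METADATA_READS * NVME_READ_MS + DATA_READS * t for t in HDD_READ_MS]

print(f"all-HDD cold read:   {all_hdd[0]:.1f} - {all_hdd[1]:.1f} ms")   # 16.0 - 80.0 ms
print(f"metadata on NVMe:    {hybrid[0]:.1f} - {hybrid[1]:.1f} ms")     #  4.3 - 20.3 ms
print(f"worst all-HDD vs best hybrid: {all_hdd[1] / hybrid[0]:.1f}x")   # ~20x
```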

In production, we typically see metadata ratios ranging from less than 0.1% to 5% of data volume, making it fairly inexpensive to accelerate an HDD-based pool.

Small Blocks 

Additionally, the “special” VDEV class can also be used to deal with specific blocks that will not be well served by the general VDEVs. For a dataset that is used for a database, setting the special_small_blocks property to 16K will cause all blocks 16 KiB and smaller to be written to the “special” VDEV class instead of the normal storage class. If this dataset is a small enough portion of your total storage demand, this allows you to take advantage of the improved performance of small blocks on mirrors, while still having access to the capacity advantages of large blocks stored on HDDs.

This does require careful consideration though: if you fill up the “special” VDEV with small blocks, then future small blocks and your metadata will have to fall back to the normal class VDEVs, and you will have hampered your performance gains. You can't migrate the small blocks off of the special VDEV, other than by deleting the files.
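
As a simplified illustration of that placement behavior (not the actual OpenZFS allocator logic), the decision for each block roughly follows a rule like the one below, where the special_full flag stands in for the free-space checks ZFS actually performs:

```python
# Simplified model of where a block lands, per the behavior described above.
def allocation_class(block_size, special_small_blocks, is_metadata, special_full):
    if special_full:
        return "normal"      # special VDEVs full: metadata and small blocks spill over
    if is_metadata:
        return "special"     # metadata prefers the special class
    if special_small_blocks and block_size <= special_small_blocks:
        return "special"     # small blocks opt in per dataset via special_small_blocks
    return "normal"

# With special_small_blocks=16K on the database dataset:
print(allocation_class(16 * 1024, 16 * 1024, False, False))    # special
print(allocation_class(128 * 1024, 16 * 1024, False, False))   # normal
print(allocation_class(16 * 1024, 16 * 1024, False, True))     # normal (special is full)
```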

Hybrid Pools 

Whether it is a SLOG to lower latency for synchronous writes or a special VDEV to accelerate metadata and free up IOPS, using a bit of flash to accelerate specific parts of the workload gives ZFS a lot of flexibility to maximize what your hardware can do. Because ZFS was designed with per-dataset configuration and tuning, a system can adapt to multiple different concurrent workloads without having to sacrifice the performance of one workload to benefit another.

The same is true at the pool level: using the right mix of VDEVs can make all the difference in designing a system that will be able to handle the ever-evolving storage workloads of the modern enterprise.

To make sure you get the most out of your existing hardware, or your next build, partner with Klara to design and implement modern hybrid storage systems. 
