Confirm ZFS is the Right Choice 

ZFS provides a highly reliable, scalable, and performant filesystem and can be made to perform extremely well for many different workloads. However, ZFS was designed with data safety as its top priority, not performance. When these two critical aspects of file systems come into tension, ZFS will always favour safety over speed. There are rare cases where that is not the desired outcome, such as when backing applications where the loss of data is acceptable because additional copies remain elsewhere. In those rare circumstances, a different file system that does less heavy lifting to provide data safety will often outperform ZFS. 

In the majority of cases though, data safety is job one, and so the trade-offs made by ZFS are to the benefit of the systems built atop this rock-solid storage platform. 

Understanding the ZFS Architecture 

ZFS vs Traditional File Systems

In order to understand and optimize ZFS for any specific workload, it helps to understand how ZFS differs from a typical file system, and the additional work it has to perform to achieve its unprecedented scale and reliability.

Unlike a typical file system such as EXT4, UFS, or NTFS, ZFS does not have preallocated resources. In those competing file systems, the number of files that can be created is predetermined when the filesystem is created: a certain amount of space is set aside to hold the “inodes”, one of which stores the details of each file. This comes with a number of tradeoffs, limiting the file system’s ability to scale to extremely large files, or extremely large numbers of unique files. ZFS does things differently: all of its resources are allocated dynamically as they are needed, and each object can scale with more and more levels of indirection to address truly huge files. 
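
To make the contrast concrete, here is a back-of-the-envelope sketch in Python, assuming ext4’s common 16 KiB bytes-per-inode default (the exact ratio is chosen at mkfs time and varies by distribution):

```python
# With preallocated inode tables (e.g. ext4), the maximum number of files
# is fixed when the filesystem is created, regardless of how it is used.
fs_size = 4 * 2**40          # a 4 TiB filesystem
bytes_per_inode = 16384      # common mke2fs default (tunable with -i)

max_files = fs_size // bytes_per_inode
print(f"{max_files:,} files maximum")   # 268,435,456 -- a hard ceiling
```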

Metadata Traversal and Data Integrity

The tradeoff to this approach of as-needed allocation of scalable data structures is that reading a specific part of a file may require iterative reads of multiple different metadata blocks. First the parent block for the specific object is read, which contains an array of indirect blocks pointing to each range within the file. Following this tree through as many levels of indirection as required based on the size of the file and the ZFS recordsize will provide the actual location on disk where the data is stored. Once the location is known, the data is read, but then ZFS starts its extra work.
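
As a rough sketch of how that tree grows, the following assumes the OpenZFS defaults of 128 KiB indirect blocks holding 1024 block pointers of 128 bytes each, and counts only the indirect levels (not the dnode read itself, so figures that include the dnode will read one higher):

```python
import math

def indirection_levels(file_size: int, recordsize: int,
                       ptrs_per_block: int = 1024) -> int:
    """Estimate the levels of indirect blocks needed to address a file."""
    blocks = math.ceil(file_size / recordsize)
    levels = 0
    while blocks > 1:                     # walk up until a single block
        blocks = math.ceil(blocks / ptrs_per_block)  # pointer covers it all
        levels += 1
    return levels

for rs in (4 * 2**10, 16 * 2**10, 128 * 2**10):
    print(f"16 GiB file, recordsize={rs // 1024}K: "
          f"{indirection_levels(16 * 2**30, rs)} indirect levels")
```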

Next the checksum of the data must be computed and compared to the one stored in the metadata block, which was updated when this data was last modified. If the checksum does not match, ZFS must read any additional copies it might have, or reconstruct the data from parity blocks, and repair the damaged copy on disk. Then ZFS must transform the data back into its original form, applying decryption, decompression, and reversing any other transformations that were applied before the data was written to the physical media. 
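
This read-side pipeline can be sketched as a toy model in Python (illustrative only; the real code path lives in the OpenZFS ZIO layer, with checksums such as fletcher4 or sha256 stored in the block pointer):

```python
import hashlib, zlib

def write_block(payload: bytes):
    """Transform (compress) first, then checksum the on-disk bytes."""
    ondisk = zlib.compress(payload)
    return ondisk, hashlib.sha256(ondisk).digest()

def read_block(ondisk: bytes, stored_checksum: bytes, other_copies):
    """Verify the checksum; fall back to another copy if it fails."""
    if hashlib.sha256(ondisk).digest() != stored_checksum:
        for copy in other_copies:            # ditto blocks, or a parity
            if hashlib.sha256(copy).digest() == stored_checksum:  # rebuild
                ondisk = copy                # real ZFS would also repair
                break                        # the damaged copy on disk
        else:
            raise IOError("unrecoverable checksum error")
    return zlib.decompress(ondisk)           # undo the transforms last

good, cksum = write_block(b"important data" * 100)
corrupted = b"X" + good[1:]                   # simulate on-disk bit rot
assert read_block(corrupted, cksum, [good]) == b"important data" * 100
```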

RAID-Z and Performance Implications

This is all before you consider the additional steps ZFS takes in its role as volume manager, on top of its role as the file system. If the data is stored using RAID-Z, then your single logical block (typically 128 KiB in size) is actually broken up and spread over a number of drives, to ensure the loss of any one (or two, or three) of the drives will not render your data unretrievable. Reading the data from across these drives and reassembling it is not an extraordinary amount of additional work, but it does mean that the read operation is not complete until every drive has completed its portion of the task.
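
A simplified sketch of that split, ignoring sector padding and the rotation of parity across columns:

```python
# One 128 KiB logical block striped across a 12-wide RAID-Z2 vdev.
block_kib, width, parity = 128, 12, 2

data_disks = width - parity
per_disk_kib = block_kib / data_disks
print(f"~{per_disk_kib:.1f} KiB of data on each of {data_disks} disks, "
      f"plus parity on {parity} more")
# The read is only complete once every one of those disks has answered.
```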

ZFS even optimizes this by avoiding reading the parity unless the checksum indicates the data is incorrect and the parity is required to reconstruct it. If one drive is overloaded or slow, it can delay the completion of the entire read, and if the read is of metadata that contains the location of the next block in the chain, then that operation cannot begin until the current operation completes. 

In the worst case, the performance of any one RAID-Z vdev in the pool is the equivalent of the IOPS of the single slowest disk in that group. When using HDDs where random IOPS typically range from 150 to 250 per second, that means a twelve-wide RAID-Z2 would only be able to manage 250 or fewer IOPS. When constructing pools, this is an important consideration, as opting for two six-wide RAID-Z2s instead, while providing less usable space, would have double the random-IOPS performance.  
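
The arithmetic, using illustrative HDD numbers (a simplification: real pools see some variation between drives and workloads):

```python
HDD_IOPS = 250                 # optimistic random IOPS for one HDD

def raidz_pool_iops(vdevs: int, disk_iops: int = HDD_IOPS) -> int:
    # Each RAID-Z vdev delivers roughly the IOPS of its slowest member,
    # so random IOPS scale with the number of vdevs, not disks.
    return vdevs * disk_iops

def usable_disks(vdevs: int, width: int, parity: int) -> int:
    return vdevs * (width - parity)

print(raidz_pool_iops(1), usable_disks(1, 12, 2))  # 250 IOPS, 10 data disks
print(raidz_pool_iops(2), usable_disks(2, 6, 2))   # 500 IOPS,  8 data disks
```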

Of course, if random IOPS are the most important performance metric for your workload, then RAID-Z is likely not the right choice. ZFS’s mirror vdevs can provide the full read IOPS of all of the member disks, while on the write side the aggregation of writes into single larger operations will mitigate the fact that each write must be performed by every member of the vdev, which limits write IOPS to those of the slowest member. 
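
Under the same illustrative assumptions, the same twelve disks arranged as mirrors look quite different:

```python
disks, disk_iops = 12, 250

# 6 x 2-way mirrors: any member can serve any read independently.
mirror_read_iops = disks * disk_iops             # ~3000
# Every write goes to both members of a vdev, so writes scale per vdev.
mirror_write_iops = (disks // 2) * disk_iops     # ~1500
# Versus two 6-wide RAID-Z2 vdevs from the previous example:
raidz_read_iops = 2 * disk_iops                  # ~500

print(mirror_read_iops, mirror_write_iops, raidz_read_iops)
```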

Workload Types 

To tune ZFS for a specific workload, you need to understand the nature and the requirements of that workload. In general, workloads will be most demanding in one of the three major categories of performance: IOPS, Latency, and Throughput. While all of these metrics are related to each other, they provide a useful way to look at how to optimize ZFS. 

Throughput 

Throughput is the easiest performance metric to achieve. ZFS excels at streaming workloads where a continuous stream of data is being read or written in a linear fashion: storing and retrieving large files such as high resolution video, medical imagery, scientific data, and packet captures. 

ZFS optimizes writes by aggregating asynchronous writes into large batches that are written out as part of the transaction group, avoiding the seeks that would slow down HDD-based storage. 

On the read side, to make up for the Copy-on-Write nature of ZFS, and the fact that metadata may be spread out and intermixed with the large runs of file data, ZFS has an intelligent prefetcher that is able to detect access patterns and have the next blocks that a user or application will ask for primed into the cache for faster access. 

For these types of files and workloads, ZFS can be optimized by tuning to a larger record size, reducing the levels of indirection required when accessing these large files, and by increasing the chunk sizes when reads are spread across wide RAID-Z arrays. 
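
For example, raising the record size (OpenZFS accepts up to 1 MiB by default, e.g. zfs set recordsize=1M pool/dataset) dramatically cuts the number of records, and therefore block pointers, needed to index a large file. A quick illustration:

```python
file_size = 50 * 2**30                 # a 50 GiB video capture

for rs_kib in (128, 1024):             # default 128K vs recordsize=1M
    records = file_size // (rs_kib * 2**10)
    print(f"recordsize={rs_kib}K: {records:,} records to index")
# recordsize=128K: 409,600 records; recordsize=1024K: 51,200 records
```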

Tuning the prefetcher to ramp up more quickly when it succeeds in predicting the application’s next action (for example, via the OpenZFS zfetch tunables such as zfetch_max_distance) can ensure that your storage infrastructure is able to maximize the available bandwidth of your network to deliver data to users.  

IOPS 

The workloads that cause the most trepidation when tuning a file system are those backing virtual machines and databases. These workloads usually consist of one or more large files (per VM or table), but those files are not like other large files, such as high resolution video. Instead of being read from start to finish, and rarely if ever being updated, a VM or database will be subject to near constant modification. Worse, these modifications are not in nice, large, throughput-enhancing chunks, but rather tiny, performance-destroying slivers. 

In the case of a VM, the large file usually represents a virtual disk, and so the operating system inside the VM expects to be able to modify it with the same per-sector granularity as a real physical disk. The ZFS default of dividing files into 128 KiB records will perform quite poorly, as each of these 4 KiB sector-sized modifications will require reading the existing 128 KiB record, modifying just 4 KiB of it, and writing out the now-modified 128 KiB (even though only 4 KiB actually changed). This can be ameliorated by adjusting the record size to match the application’s expectations (4 KiB), but this has some tradeoffs. Firstly, the same sized file will now need more levels of indirection, as each block now only covers 4 KiB of the file, instead of 128 KiB. This means more metadata to read to find any particular 4 KiB block, and more metadata to update each time a virtual sector changes. It also means that if you are using RAID-Z, the overhead of the parity increases, and depending on the width of your vdev, padding may cause unexpected inflation of the allocated space. Lastly, because ZFS cannot compress a block to less than one sector, using a record size of 4 KiB effectively disables compression. 

There may be a performance advantage to selecting a point in the middle: if you use 16 KiB as the record size, the read and write inflation is limited to 4x, compared to 32x at the 128 KiB default. With a compression ratio of 2:1, that would lower the inflation further, and also the space required to store the VMs. It will also reduce the IOPS requirements by the inflation factor, plus some additional savings from the reduced metadata and fewer levels of indirection. 
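
The inflation arithmetic for a 4 KiB write, including the effect of compression, can be sketched as follows (approximate: it ignores metadata, parity, and the one-sector minimum block size):

```python
def write_inflation(recordsize_kib: int, write_kib: int = 4,
                    compress_ratio: float = 1.0) -> float:
    """Bytes physically rewritten per byte logically changed."""
    return (recordsize_kib / compress_ratio) / write_kib

print(write_inflation(128))                    # 32.0x at the default
print(write_inflation(16))                     #  4.0x at 16K
print(write_inflation(16, compress_ratio=2))   #  2.0x at 16K with 2:1
print(write_inflation(4))                      #  1.0x, but compression is
                                               #  effectively disabled
```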

For a database, things are even trickier. Each database engine has a different page size that determines the record size that will achieve the lowest inflation. MySQL/MariaDB generally use 16 KiB pages, PostgreSQL uses 8 KiB, and SQLite 4 KiB. While these values are often adjustable, if the application using the database assumes the defaults, this can further complicate things. Databases also tend to be even more highly compressible, so a larger record size that compresses better may counter-intuitively improve performance by reducing the load on the physical storage. 
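
As a starting point, assuming each engine is running with its default page size (always verify the actual configured value before tuning):

```python
DEFAULT_PAGE_KIB = {
    "MySQL/MariaDB (InnoDB)": 16,  # innodb_page_size default
    "PostgreSQL": 8,               # compile-time BLCKSZ default
    "SQLite": 4,                   # page_size default in modern releases
}

for engine, page in DEFAULT_PAGE_KIB.items():
    print(f"{engine}: start with recordsize={page}K, then benchmark "
          f"larger sizes if the data compresses well")
```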

If the records in the database tend to be written once and then only read (such as financial transactions), or are read on a regular basis while only changing on occasion, a larger record size that maximizes compression and minimizes IOPS may make frequent report generation and other workflows faster, even if it makes updating a particular record a little slower. 

Latency 

Tuning a file system for lower latency is the most difficult. The time scales are minuscule, often thousandths or millionths of a second. No two operations are likely to complete in the same amount of time, and performance will often be dictated by a combination of factors such as the number of threads, the temperature of the drive, and the locality of the data. 

This is where we tend to use histograms or other mechanisms that group the latencies of the operations into buckets: “90% of requests complete in under 10 milliseconds,” or “our 99.99th percentile latency is 37.5 milliseconds.” 

The dynamically allocated nature of ZFS can make extremely low latency reads trickier to achieve. The iterative nature of looking up a series of indirect blocks before being able to issue a request for the data block means that the request is subject to the average media latency multiple times over. 

Achieving low read latency on ZFS requires a mix of caching, prefetch, and specific data placement. The ZFS ARC will keep the indirect blocks for both recently and frequently accessed data in memory, to avoid the latency of going to the storage media. ZFS will also look for access patterns, and when it detects them, try to presciently fetch the next few blocks of data before the application asks for them. When this prescient prefetch succeeds, ZFS increases how far ahead of the application it prefetches, to maximize the benefit. Lastly, ZFS can make use of faster media, such as NVMe devices, to store the blocks that are most impacted by latency, such as ZFS’s metadata (indirect blocks, etc.) or very small data blocks. In a system where the primary storage is HDDs, this can make an extremely large difference. 

If the average seek time for an HDD is 4 milliseconds, and an application is accessing a single block in a large file that has 3 or 4 levels of indirection (a 16 GiB file with a 16 KiB record size will use 3 levels of indirect blocks), the total access time with a completely cold cache could be 4 levels * 4 milliseconds each, plus another 4 milliseconds to access the data block, for a total of 20 milliseconds. And that is the AVERAGE seek time, not the worst case, and it assumes the drive is otherwise idle. 

Now consider the case where those indirect blocks are stored on an NVMe drive, with an access time of 100 microseconds (0.1 milliseconds). 4 levels of indirection now take 0.4 milliseconds, plus the 4 milliseconds for the HDD, for a total of 4.4 milliseconds. Additionally, since the HDD did not have to seek for the 4 indirect blocks, the chances are that the drive will be less busy, and that the data blocks of the file will be more sequential, further improving the throughput of the HDD.  
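
The arithmetic from the two scenarios above, in one place. (On OpenZFS this placement is typically achieved with a “special” allocation class vdev on NVMe, optionally combined with the special_small_blocks dataset property; the numbers below are the illustrative averages used in the text.)

```python
def cold_read_ms(indirect_levels: int, metadata_ms: float,
                 data_ms: float = 4.0) -> float:
    """Total latency to read one block with a completely cold cache."""
    return indirect_levels * metadata_ms + data_ms

print(cold_read_ms(4, metadata_ms=4.0))   # all on HDD:          20.0 ms
print(cold_read_ms(4, metadata_ms=0.1))   # metadata on NVMe:     4.4 ms
```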

Principles of Performance 

With a better understanding of the fundamental factors that impact performance, you are now in a better position to consider how tuning ZFS will affect the performance of your workloads. Deciding which factors of performance to optimize for (throughput, IOPS, or latency) then leads to considering hardware choices and how they impact the outcome. 

The key to performance tuning is to first understand “why” the system is performing the way it is, then to consider how changing various settings should impact that performance, and then to repeat the benchmarks to confirm the expected effect, iterating until the desired result is achieved. 

Tuning ZFS can be a daunting task, but you don’t have to undertake it alone. Klara’s ZFS Performance Analysis Solution gives you access to our dedicated team of ZFS tuning experts, who will work with you to design realistic benchmarks, perform in-depth analysis of your systems and the benchmark results, suggest and apply appropriate tuning, and confirm the results. You will end up not only with a better-performing ZFS system, but with benchmarks that provide a clear picture of how much headroom your system has to grow, and a plan for how to scale when you outgrow your current storage system. 

 
