ZFS at Scale
ZFS in production offers near limitless upwards scalability, able to address immense volumes of data and manage resiliency in the face of hardware with physical reliability limitations. ZFS is able to meet the requirements of especially demanding workloads, all without additional license costs or vendor lock-in.
However, when operating at scale, storage systems can be incredibly unforgiving of design shortcuts, operational drift over time, and incorrect assumptions. Many production incidents attributed to “ZFS bugs” are, in reality, the delayed consequences of the misconceptions we will discuss in this article.
Deployment Patterns
The following deployment patterns reflect how ZFS in production is most commonly implemented, and where early design decisions have long-term operational consequences.
Fileserver
ZFS is frequently deployed as a file server, often using NFS or SMB to share files with many users concurrently. Success in this role depends more on dataset design and operational policy than on raw ZFS features. Many file server problems blamed on the protocol or client behavior are rooted in ZFS layout decisions made early and never revisited.
One Dataset per Purpose
A frequent anti-pattern is placing all files in the root of the pool, or all shares under a single dataset “for simplicity.” In production, this quickly becomes unmanageable. ZFS derives much of its power and flexibility from being able to configure datasets for different use cases. Tuning to match the workload increases performance and is an important part of maintaining reliability over the long term.
It is best practice to create separate datasets for each share or workload, as this enables workload-specific tuning, configuration of quotas and reservations, and proper delegation of permissions. Being able to exclude certain data from snapshots and replication, or to use different frequencies depending on the type of data, greatly increases your ability to meet RTO and RPO requirements for the most critical data.
Breaking data up into datasets also makes it easier to monitor and account for, and to define a data lifecycle. Being able to track which data is used most, and to move unused data to an archive, can become critical as the volume of data continues to grow every day.
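As a sketch of this approach, the commands below create one dataset per workload, each with its own tuning, quota, and delegation. The pool name `tank`, the dataset names, and the `appteam` user are all illustrative; the `com.sun:auto-snapshot` property is a convention honored by some snapshot tools, not a core ZFS feature.

```shell
# One dataset per workload, each tuned for its use case
zfs create -o recordsize=1M -o compression=zstd tank/media   # large sequential files
zfs create -o recordsize=16K -o logbias=throughput tank/db   # database files
zfs create -o quota=500G tank/home                           # capped home directories

# Exclude scratch data from tool-driven snapshots (tool convention, not core ZFS)
zfs create -o com.sun:auto-snapshot=false tank/scratch

# Delegate snapshot management for one dataset to its owning team
zfs allow -u appteam snapshot,mount,destroy tank/db
```

With this layout, snapshot and replication policy can differ per dataset, and `zfs list -o space` gives per-workload accounting for free.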
Virtualization
ZFS in production is widely used as the storage backend for virtualization platforms, an area in which it continues to gain traction. Virtualization workloads are some of the most demanding in the storage ecosystem. VM storage combines random I/O, synchronous writes, high snapshot churn, and unpredictable growth. Designs that work well for file servers or backups often perform poorly once placed under a hypervisor.
Successful ZFS virtualization deployments are well-thought-through designs that resist “premature optimization,” and employ continuous monitoring to respond to changes in demand while maintaining high efficiency.
Latency is the Key Metric
Most virtualization workloads issue synchronous writes more frequently than operators expect. Applications, databases, and operating systems depend on reliable storage, and they achieve that reliability by asking for persistence guarantees from the storage system.
These guarantees come at a cost: latency. Unlike asynchronous writes, the storage system cannot defer, re-order, and aggregate the writes into more efficient operations; it must instead service them as quickly as possible. This can be accelerated in ZFS with a SLOG (Separate Log Device), but the device must be fast enough to achieve the write latencies demanded by the application while maintaining the durability guarantees. Most prosumer-grade SSDs lack the power-loss protection required to serve as effective and safe SLOG devices, often degrading reliability rather than improving it.
Worse, some operators resort to disabling “sync” (synchronous writes), causing ZFS to report operations as complete even when the data is not yet safe from a power loss. While this may appear to “fix” performance, it merely converts a power loss or host crash from a minor inconvenience into a full data-loss event.
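A brief sketch of the two paths, assuming a hypothetical pool `tank` with a VM dataset `tank/vmstore` and illustrative NVMe device names:

```shell
# Inspect how the dataset currently handles synchronous writes
zfs get sync,logbias tank/vmstore

# The safe accelerator: add a mirrored SLOG on devices with
# real power-loss protection (device paths are illustrative)
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# The dangerous shortcut: sync=disabled acknowledges writes before
# they are durable. Avoid in production.
# zfs set sync=disabled tank/vmstore
```

Mirroring the SLOG protects against losing in-flight synchronous writes if a log device fails at the same moment as a crash.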
Backup and Archive
ZFS is able to reach huge capacities, limited only by the hardware available today, while at the same time achieving the lowest possible cost per terabyte owing to its open-source license.
The workload for backups is also well suited to ZFS and HDDs: large contiguous writes that can be efficiently spread across a RAID-Z array. Using datasets to separate different backups, and to control access to them, also provides strong separation between workloads. ZFS can also use its native encryption capabilities to create offsite backups that allow all of the required maintenance operations to happen without access to the decryption keys, ensuring the confidentiality of data is maintained even if the backup host is compromised.
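A minimal sketch of this workflow, with `tank`, the snapshot name, and the `offsite` host all hypothetical:

```shell
# Create an encrypted dataset for backup data
zfs create -o encryption=on -o keyformat=passphrase tank/backups

# Replicate it raw (-w): blocks remain encrypted in transit and at
# rest on the target, which never needs the decryption key
zfs snapshot tank/backups@2024-06-01
zfs send -w tank/backups@2024-06-01 | ssh offsite zfs receive pool/backups
```

The offsite host can scrub, snapshot, replicate onward, and prune old snapshots without ever loading the key; only a host that runs `zfs load-key` can read the data.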
Data Reduction
Another place where ZFS stands out is its ability to reduce the amount of storage that data requires. Transparent compression reduces the volume of data while maintaining performance and the usability of the data. The new fast deduplication feature in ZFS automatically detects duplicate data and safely stores only a single copy.
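Both features are enabled per dataset. A sketch, with the pool and dataset names illustrative, and the version requirement hedged (fast dedup landed in recent OpenZFS releases):

```shell
# Enable transparent compression and check the achieved savings
zfs set compression=zstd tank/data
zfs get compressratio tank/data

# Enable deduplication only where duplicates are actually expected,
# e.g. a dataset of VM templates (fast dedup requires a recent OpenZFS)
zfs set dedup=on tank/vm-templates
```

Compression is almost always a net win; deduplication should be scoped narrowly, since its metadata still consumes memory and I/O.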
The new block cloning feature, in addition to cloning files (including VM images) to avoid duplicate blocks, also allows the construction of “synthetic full” backups: combining multiple incremental backups with the most recent full backup to make a new full backup without re-copying the data. This not only saves space and network (possibly internet) bandwidth, but also reduces the RPO by eliminating the time that would be spent writing data that can instead reference an existing copy.
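On Linux with a recent OpenZFS (2.2 or later, and depending on whether block cloning is enabled via the `zfs_bclone_enabled` module setting), cloning is exposed through reflink-style copies. The file names below are illustrative:

```shell
# Clone a file: the copy shares blocks with the original instead of
# duplicating them (requires block cloning to be enabled)
cp --reflink=always template.qcow2 vm42.qcow2

# A "synthetic full" follows the same idea: clone the previous full
# backup, then overwrite only the regions changed by the incrementals;
# unchanged blocks are referenced, not rewritten.
```
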
Common Pitfalls
Design Oversights
“ZFS scales linearly if you just add more disks”
As with many misconceptions, this one is based on a kernel of truth, with a heaping spoonful of assumption sprinkled on top. ZFS can scale nearly linearly by adding additional VDEVs, but simply increasing disk count will quickly degrade performance and increase risk. Growing by expanding existing VDEVs, or by adding fewer but wider VDEVs, increases the failure blast radius. Extremely wide RAIDZ VDEVs may look efficient on paper, but they introduce longer rebuild windows and higher risk during degraded operation. A greater number of narrower VDEVs will almost always perform far better, while decreasing the risk of an unrecoverable error. At scale, the unit of growth is the VDEV, not the disk. Designs that ignore this reality often end up with pools that are technically functional but operationally fragile. See our article: Can You Have Too Many VDEVs? A Practical Guide to ZFS Scaling.
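To make the contrast concrete, here are two ways to deploy the same twelve disks (pool and device names are illustrative):

```shell
# One wide 12-disk RAIDZ2 vdev: maximum raw capacity, but one large
# failure domain and long resilvers
zpool create wide raidz2 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12

# Two 6-disk RAIDZ2 vdevs: roughly double the IOPS, a smaller failure
# blast radius, and faster resilvers per vdev, at the cost of two
# extra parity disks
zpool create narrow raidz2 d1 d2 d3 d4 d5 d6 raidz2 d7 d8 d9 d10 d11 d12
```

ZFS stripes writes across all top-level VDEVs, so the second layout is the one that actually scales as more VDEVs are added.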
As pools grow, the amount of metadata that must be kept warm in the cache can scale non-linearly, highly dependent on the type of workload. The time it takes to repair a disk failure (resilver) depends on the number of disks in the impacted VDEV, rather than the total disks in the pool. On the other hand, the time taken to verify the contents of the entire pool (scrub) is impacted more by the number of files and the performance of the storage devices than by the total volume of data.
“More parity always means more safety”
RAIDZ2 or RAIDZ3 are often chosen for the additional “safety” they provide, without considering workload or recovery behavior. While additional parity protects against simultaneous disk failures, it also increases write amplification, reduces IOPS, and lengthens resilvers.
In many production environments, mirrored VDEVs provide higher availability and faster recovery, even if raw capacity efficiency is lower. Safety is a function of time-to-recovery and fault isolation, not just parity math. RAIDZ has its place, but all factors must be considered when designing a pool.
In the article Designing a Storage Pool: RAIDZ, Mirrors, and Hybrid Configurations, we discuss the trade-offs between mirrors and RAIDZ, including cases where most of the storage capacity gained by choosing RAIDZ over mirrors is lost again to the requirements of the workload.
“ZFS will prevent data loss automatically”
ZFS can detect corruption, but it does not guarantee that it can recover from it. Checksums determine that the data is wrong, not how to fix it when there is insufficient redundancy or the replicas are already compromised. Checksums protect against accidentally treating corrupt data as valid, but they only serve to trigger reconstruction and to confirm it has completed correctly; they cannot repair the data alone.
At scale, silent data corruption, firmware bugs, and human error are more common than dramatic disk failures. ZFS can surface these issues, but only if operators are watching, running regular scrubs, and responding to alerts in a timely fashion. Unmonitored ZFS systems fail quietly until they fail loudly.
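The operational loop this implies is simple to automate. A sketch, with the pool name `tank` hypothetical (many distributions already ship a scrub timer; otherwise cron or a systemd timer works):

```shell
# Kick off a periodic integrity verification of the whole pool
zpool scrub tank

# Review progress, and any read, write, or checksum errors found
zpool status -v tank

# "zpool status -x" reports only pools with problems, which makes it
# convenient to wrap in an alerting script
zpool status -x
```

The key is not running the scrub, but ensuring someone (or something) reads the result and acts on checksum errors before redundancy is exhausted.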
Operational Mistakes
“Snapshots are cheap, so take lots of them”
Snapshots are space-efficient initially, but not free. Excessively large numbers of snapshots (many thousands per dataset) increase metadata pressure, can significantly slow down management operations, and complicate replication and deletion workflows.
At scale, snapshot sprawl becomes an operational tax. Deletes take longer, replication risks losing its shared baseline, and simple administrative actions turn into long-running tasks. Snapshot policy must be properly designed, consistently enforced, and periodically audited.
ZFS is able to scale to huge numbers of total snapshots without any issues, but the number of snapshots per individual dataset runs into challenges as it exceeds roughly 2,000, as ZFS must shift to a less performant but more scalable mechanism for tracking snapshots.
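Auditing snapshot counts per dataset is a one-liner worth scheduling. A sketch against a hypothetical pool `tank`:

```shell
# Count snapshots per dataset to spot sprawl before it hurts
# (-d 1 limits the listing to each dataset's own snapshots)
for ds in $(zfs list -H -o name -r tank); do
  n=$(zfs list -H -o name -t snapshot -d 1 "$ds" | wc -l)
  echo "$n $ds"
done | sort -rn | head
```

Datasets creeping toward the thousands are candidates for a tighter retention policy or a pruning pass.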
Human Factors
“ZFS eliminates the need for operational discipline”
While ZFS reduces the likelihood of certain types of failure, operating ZFS in production does not eliminate the need for operational discipline and documentation.
The single greatest source of human error related to ZFS is insufficient labeling, both physical and in software. Disconnecting the wrong cable, connecting an enclosure to the wrong head, and swapping the wrong drive are among the failures our team has recently helped customers recover from. In all of these cases, a little more documentation and a label maker would have entirely prevented the outage.
Every change to the system, be it enabling a new feature flag, changing kernel versions, updating firmware (on NICs, HBAs, drives, or motherboards), or adjusting tuning, carries risk. These risks can be mitigated with proper testing, fallback plans, checkpoints, and documentation. Having a plan to immediately restore functionality when a problem is detected can be the difference between applying and reverting a change within a maintenance window, and a prolonged outage requiring expert assistance to recover from.
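Pool checkpoints are the built-in mechanism for exactly this kind of fallback plan. A sketch, with the pool name `tank` hypothetical:

```shell
# Before a risky change, capture a rewindable point-in-time state
# of the entire pool
zpool checkpoint tank

# If the change goes wrong, export and rewind to the checkpoint
zpool export tank
zpool import --rewind-to-checkpoint tank

# Once the change is validated, discard the checkpoint
# (while it exists, it pins freed space)
zpool checkpoint -d tank
```

Unlike snapshots, a checkpoint covers pool-level state, including property changes and vdev-layout mistakes, which is what makes it valuable during maintenance windows.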
At scale, the most dangerous failures are slow ones: configuration drift, temporary or uncommitted changes, forgotten assumptions, and undocumented decisions that surface years later during an incident.
Conclusion
Running ZFS in production provides scalability, supporting immense data volumes while maintaining resilience against hardware failure. While ZFS provides powerful tools for resiliency, performance, and data integrity, achieving reliable outcomes depends on thoughtful dataset design, appropriate vdev layouts, disciplined snapshot policies, and ongoing monitoring as systems evolve.
Successfully operating ZFS requires more than a strong initial deployment; it requires sustained attention and expertise. With Klara’s ZFS Design Solution, you can validate your design choices, identify emerging risks, and maintain healthy production systems as workloads grow and requirements change. Gain access to our experienced ZFS engineers, so you or your team can operate ZFS with greater confidence and avoid costly pitfalls.

Allan Jude
Principal Solutions Architect and co-Founder of Klara Inc., Allan has been a part of the FreeBSD community since 1999 (that’s more than 25 years ago!), and an active participant in the ZFS community since 2013. In his free time, he enjoys baking, and obviously a good episode of Star Trek: DS9.