Pool and VDEV topology for Proxmox workloads
ZFS has become a strategic building block for modern infrastructure teams that want enterprise-class data services without inheriting the classic enterprise vendor lock-in. It delivers the core primitives that allow us to out-class proprietary arrays and hyperconverged stacks—checksummed storage, snapshots, clones, inline data services, and replication—and exposes them as open, tunable components rather than opaque features hidden behind licensing tiers.
Klara continues to work to turn ZFS from "powerful but sharp-edged" into a predictable, engineered platform that can be used confidently to achieve your desired storage outcomes. ZFS gives architects extremely strong guarantees, but only if the pool layout, VDEV topology, SLOG, ARC/L2ARC, and recordsize are matched to workload profiles.
Proxmox is the logical virtualization substrate in this architecture because it treats ZFS as a first-class storage backend rather than an afterthought. VM disks can be provisioned directly on ZFS datasets or zvols, Proxmox snapshots are implemented as native ZFS snapshots, and replication jobs are literally zfs send/zfs receive pipelines driven by the cluster.
That means the same OpenZFS pool can simultaneously serve as the hypervisor datastore, a snapshot/clone factory for Dev/Test, and the replication fabric for DR, without new licensing or hardware dependencies. In that context, optimizing the pool and VDEV topology for VM workloads stops being an internal storage concern and becomes an infrastructure-level design decision that directly defines performance, resiliency, and operability.
Why Pool and VDEV Topology Matter for Proxmox
ZFS organizes storage into pools (zpool) composed of one or more VDEVs; each VDEV is a redundancy group (mirror, RAID-Z1/2/3, dRAID) or an accelerator (special, log, cache). At the pool level, performance is essentially the aggregate of all VDEVs, while reliability is bounded by the weakest VDEV. For Proxmox, where each dataset or zvol typically backs a VM disk, that means:
- IOPS and latency are dominated by VDEV type and count.
- Failure domains are defined at the VDEV level, not at individual disks.
- Expansion happens by adding VDEVs, not by "growing a RAID group" in place (RAID-Z expansion exists, but it is a higher-complexity operation).
In other words, pool design is the storage architecture for your Proxmox cluster; everything else (recordsize, compression, caching) is secondary.
Mirrors vs RAID-Z for Proxmox Workloads
The starting point is choosing between mirror VDEVs and RAID-Z for the primary VM pool.
Mirror VDEVs
Mirror VDEVs (e.g., mirror sdb sdc, mirror sdd sde, …) are generally preferred for high-IOPS, latency-sensitive Proxmox workloads:
- Each mirror VDEV can service read requests from either side, so read IOPS scale roughly with the number of disks and mirror groups.
- Write IOPS are bounded by the slowest member of each mirror, but writes can be spread across multiple mirror VDEVs, delivering predictable scaling as you add VDEVs.
- Rebuilds (resilvering) are fast and localized, since data is copied from the surviving mirror member rather than reconstructed from parity across the whole pool.
A typical design for a mixed VM cluster node with six or eight SSDs is a pool like:
zpool create tank mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde mirror /dev/sdf /dev/sdg
This gives three mirror VDEVs, good random I/O characteristics, and simple expansion: add another mirror pair when you add more drives.
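Expansion of that layout can be sketched as follows. The device names are placeholders; these commands assume a live pool and should be adapted before use:

```shell
# Grow the pool by one more mirror VDEV (devices /dev/sdh and
# /dev/sdi are illustrative; prefer /dev/disk/by-id/ paths in
# production so device names survive reboots and re-cabling).
zpool add tank mirror /dev/sdh /dev/sdi

# Confirm the new VDEV appears alongside the existing mirrors.
zpool status tank
```

ZFS stripes new writes across all VDEVs, so the added pair contributes IOPS immediately, though existing data is not rebalanced.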
RAID-Z VDEVs
RAID-Z (Z1/Z2/Z3) is more capacity-efficient but delivers fewer random IOPS per spindle and slower rebuild behavior, making it better suited to:
- Capacity-oriented workloads: bulk VM hosting, archives, or backup targets.
- Sequential-heavy workloads: large streaming reads/writes, backup jobs, media workloads.
A classic pool for capacity might look like:
zpool create bulk raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
In a Proxmox context, Klara typically recommends aligning with your workload:
- Mirror pools for "hot" VM and database storage.
- RAID-Z pools for bulk VM repositories, ISO/templates, and backup/archival datasets.
On Proxmox, you then register each pool as a separate storage class (e.g., fast-vmstore, bulk-vmstore) so administrators can place workloads according to latency and capacity needs.
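A minimal sketch of that registration, using the Proxmox `pvesm` tool; the storage IDs and dataset names here are illustrative assumptions, not fixed conventions:

```shell
# Register the two pools as distinct Proxmox storage classes.
# "fast-vmstore" / "bulk-vmstore" and the dataset paths are
# placeholders; adjust them to your own pool layout.
pvesm add zfspool fast-vmstore --pool tank/vmstore --content images,rootdir
pvesm add zfspool bulk-vmstore --pool bulk/vmstore --content images,rootdir

# Verify both storages are visible and active.
pvesm status
```

With both classes defined, disk placement becomes an explicit choice at VM creation time rather than a post-hoc migration.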
Topology Patterns: OS vs Data and Special VDEVs
A robust Proxmox design usually separates the host OS from the VM storage pool:
- A small mirror of SSDs (or even a RAID-1 boot device) for Proxmox VE itself.
- Independent ZFS pool(s) for VM/datastore workloads.
That isolation reduces risk during upgrades and simplifies lifecycle operations.
On top of that, ZFS supports specialized VDEV classes that can be used carefully in Proxmox designs:
- Special VDEV: stores metadata and small blocks on fast NVMe, accelerating metadata-heavy operations and freeing up IOPS on the primary data VDEVs. For VM workloads with many small blocks and snapshots, this can significantly improve responsiveness, but it introduces a critical dependency: losing a special VDEV destroys the entire pool.
- Dedup VDEV: places the deduplication table (DDT) on a dedicated device, so the DDT gets that device's full IOPS and does not contend with other workloads. The DDT is memory-hungry and can be redirected to a dedup (or, failing that, special) VDEV. Even so, Klara generally discourages dedup for generic VM stores: unless there is a very strong, proven dedup use case, the modest storage savings rarely justify the cost in RAM and CPU.
These advanced topologies require explicit operational acceptance of the new failure modes; they are powerful tools, not defaults.
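A special VDEV, if adopted, might be added along these lines; the device names and the 16K small-block threshold are assumptions to adapt:

```shell
# Add a mirrored special VDEV for metadata and small blocks.
# Mirroring is essential: losing the special VDEV loses the pool.
zpool add tank special mirror /dev/nvme2n1 /dev/nvme3n1

# Optionally route small data blocks (here, <=16K) to the special
# VDEV as well; the default of 0 stores metadata only. Keep this
# threshold below the dataset recordsize, or all data blocks will
# land on the special VDEV.
zfs set special_small_blocks=16K tank/vmstore
```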
Integrating SLOG and Synchronous I/O Semantics
For Proxmox workloads, the critical behavior is how ZFS treats synchronous writes, which are common when:
- The guest filesystem is mounted with sync semantics.
- Databases, NFS servers, or journaling filesystems inside VMs issue fsync()/O_DSYNC aggressively.
ZFS writes synchronous I/O to the ZIL (ZFS Intent Log). By default, ZIL records live on the main pool VDEVs; adding a SLOG (separate log VDEV) moves them to a dedicated device.
Key points for Proxmox designs:
- A SLOG only affects synchronous writes; asynchronous writes bypass it.
- For HDD-backed mirror or RAID-Z pools that host sync-heavy VM workloads, a mirrored, power-loss-protected NVMe SLOG is effectively mandatory for acceptable latency.
- A typical configuration:
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
This isolates ZIL traffic onto very low-latency devices, reducing write stalls and minimizing fragmentation on the main data VDEVs.
Klara's guidance is to treat SLOG devices as critical infrastructure: use only enterprise-grade NVMes with power-loss protection, mirror them, and size them to the peak synchronous write rate rather than raw capacity (since ZIL is a log, not a full data copy).
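Whether the SLOG is actually absorbing sync traffic can be checked under load. The dataset name below is hypothetical:

```shell
# Watch per-VDEV I/O: under a sync-heavy workload, write activity
# should appear against the "logs" section rather than data VDEVs.
zpool iostat -v tank 5

# For a controlled experiment, force all writes on a scratch
# dataset through the ZIL. Do not leave sync=always enabled in
# production unless the workload's durability model requires it.
zfs set sync=always tank/slogtest
```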
ARC, L2ARC, and VDEV-Aware Capacity Planning
Although ARC/L2ARC are not "topology" in the VDEV sense, they are tightly coupled to VDEV design because they determine how much I/O reaches the disks:
- ARC lives in RAM and will opportunistically consume memory; in a Proxmox node, you must leave headroom for KVM guests while still giving ZFS enough cache to avoid thrashing.
- L2ARC extends ARC onto SSD/NVMe cache devices added as cache VDEVs. Each record stored in L2ARC requires a small header entry in ARC, so L2ARC must be sized relative to available memory; otherwise you sacrifice fast RAM to index slower secondary storage.
From a topology standpoint:
- For small and medium Proxmox nodes, Klara typically prefers more mirror VDEVs of SSDs and a well-sized ARC over aggressive L2ARC, because L2ARC entries consume ARC metadata and can backfire if RAM is constrained.
- L2ARC becomes attractive when:
- You have abundant RAM (so ARC hit ratios are already high).
- Working sets are larger than RAM but remain cache-friendly.
- You can dedicate fast devices purely to read caching.
Example configuration:
zpool add tank cache /dev/nvme2n1
The sizing rule of thumb is to keep L2ARC to a modest multiple of ARC size and to treat it as an optimization to a sound pool design, not a fix for an underprovisioned VDEV layout.
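As a back-of-the-envelope check on that rule of thumb, the ARC memory consumed by L2ARC headers can be estimated from the cache size and average block size. The ~70-byte-per-record figure below is an assumption; the exact overhead varies by OpenZFS release:

```shell
# Rough estimate of ARC memory consumed by L2ARC headers,
# assuming ~70 bytes of header per cached record.
l2arc_bytes=$((512 * 1024 * 1024 * 1024))   # 512 GiB cache device
avg_block=$((32 * 1024))                    # 32K average block size
records=$((l2arc_bytes / avg_block))
overhead_bytes=$((records * 70))
echo "L2ARC records:       $records"
echo "ARC header overhead: $((overhead_bytes / 1024 / 1024)) MiB"
```

Roughly 1 GiB of ARC headers to index 512 GiB of L2ARC at 32K blocks: tolerable on a 256 GiB node, painful on a 32 GiB one.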
Recordsize and Its Interaction with VDEVs
Recordsize is a dataset-level property, but its value interacts with topology:
- Small records on RAID-Z amplify parity overhead and can fragment I/O patterns; large records under random-I/O workloads waste bandwidth and capacity through read/write amplification.
- For Proxmox VM images on mirror pools, a 32K recordsize is a good general default:
zfs set recordsize=32K tank/vmstore
- For database datasets (either raw zvols or filesystems on ZFS shared to guests), smaller recordsize (e.g., 16K) aligns with common page sizes and improves efficiency, without sacrificing compression:
zfs set recordsize=16K tank/pgdata
- For sequential/archival datasets on RAID-Z pools, 1M recordsize maximizes throughput:
zfs set recordsize=1M tank/archive
Klara's practice is to partition Proxmox storage into multiple ZFS datasets with recordsize tuned per workload type, all sitting on top of a pool whose VDEV topology is chosen first for reliability and baseline I/O behavior.
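That partitioning can be sketched as a handful of datasets on the pools above; the dataset names are illustrative:

```shell
# One pool per tier, multiple datasets, recordsize tuned per
# workload (dataset names are placeholders).
zfs create -o recordsize=32K -o compression=lz4 tank/vmstore
zfs create -o recordsize=16K -o compression=lz4 tank/pgdata
zfs create -o recordsize=1M  -o compression=lz4 bulk/archive

# Confirm the effective properties per dataset.
zfs get recordsize,compression tank/vmstore tank/pgdata bulk/archive
```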
Why Proxmox + ZFS + Klara Is a Coherent Stack Choice
In combination, Proxmox, ZFS, and Klara's engineering methodology form a coherent alternative to proprietary HCI and storage array stacks:
- Proxmox provides the virtualization and cluster control plane, with native awareness of ZFS snapshots and replication via zfs send/receive.
- ZFS provides the data plane, including integrity, RAID, snapshots, clones, compression, deduplication, and replication.
- Klara provides the design and implementation expertise: mapping VDEV topologies, SLOG and cache devices, and dataset parameters to Proxmox workload profiles, and documenting the operational playbooks for resilvering, expansion, and disaster recovery.
The result is an infrastructure where "pool and VDEV topology" is not an afterthought but a consciously chosen architecture element, delivering predictable behavior under failure and load, and an open platform that customers can understand, audit, and evolve over time.

Allan Jude
Principal Solutions Architect and co-Founder of Klara Inc., Allan has been a part of the FreeBSD community since 1999 (that’s more than 25 years ago!), and an active participant in the ZFS community since 2013. In his free time, he enjoys baking, and obviously a good episode of Star Trek: DS9.