This is part of our article series published as “OpenZFS in Depth”.
Isaac Huang’s talk at the 2017 OpenZFS Developer Summit introduced work that extends the resilience of ZFS storage for large installations. dRAID, or distributed RAID, is a new vdev type that complements existing ZFS data protection capabilities for very large storage arrays. Starting with RAID-Z-like underpinnings, dRAID permutes, or mixes, disk blocks so that accesses are spread evenly across all the drives. Fast rebuilds after a spindle failure are accomplished by involving every member of the vdev: pre-allocated virtual spares are spread evenly over all the spindles. Contributors include Intel, Lawrence Livermore Labs, and HP Enterprise, which have a material interest in storage at datacenter scale and high reliability. The OpenZFS user community is the beneficiary of this enhancement, if we apply it well.
Avoiding the Death Spiral
Admins will often use wide RAID stripes to maximize usable storage for a given number of spindles. RAID-Z deployments with large stripe widths, ten or larger, are subject to poor resilver performance for a number of reasons. Resilvering a full vdev means reading from every healthy disk and continuously writing to the new spare. This saturates the replacement disk with writes while scattering seeks over the rest of the vdev. For 14-wide RAID-Z2 vdevs using 12TB spindles, rebuilds can take weeks. Resilver I/O is deprioritized unless the system has been idle for a minimum period. Full zpools get fragmented and require additional I/Os to reconstruct data during resilvering. A pool can degenerate into a never-ending cycle of rebuilds, or be lost outright, a.k.a. the Death Spiral.
As spindles age together, disks may fail in groups, because defect counts and mechanical failures are not independent random processes with respect to age. SSDs further complicate this math: their wear-leveling endurance will be very closely matched, and clusters of devices under identical load may fail together. A manufacturer-provided mean time between failures (MTBF) is a forward-looking statement and is not suitable for replenishment planning. One manufacturer claims a 1.2 million hour MTBF: a dubious 137-year commitment to quality. It’s poor planning to assume that any given drive won’t pick today to fail dramatically.
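The arithmetic behind that 137-year figure can be sketched in a few lines of Python. This is only an illustration: the constant-failure-rate (exponential) model and the function names are assumptions made here, and, as argued above, real drives that age together do not follow such a model.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def mtbf_to_years(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that one drive fails within a year.

    Assumes a constant failure rate (exponential model), which is an
    idealization and not something the datasheet guarantees.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


mtbf = 1_200_000  # hours, the figure quoted above
print(f"MTBF as years: {mtbf_to_years(mtbf):.0f}")  # prints 137
print(f"Implied annual failure rate: {annualized_failure_rate(mtbf):.2%}")
```

Read this way, the datasheet number is better understood as a small per-drive annual failure probability than as a lifetime, and even that reading assumes failures are independent.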
dRAID is an option providing rapid parity rebuild that can mitigate the death spiral behaviour of wide RAID-Z stripes, but as reflected in its default width setting of eight, it does not encourage wide stripes. Dedicating sufficient parity increases the durability of the ZFS pool, and the investment in parity should be informed by the risk of losing the pool.
Spare disks are a way of keeping a disk warm and ready to replace a failed member. Usually, a spare’s life is leisurely idle until it is scrambled into action during a rebuild. That idleness is a wasted opportunity to do useful work. There are no specific spare disks in a dRAID. Rather, enough blocks are allocated throughout the vdev to act as spares. The distributed spare is a clever redistribution of work so that all disks are always in use. A disk failure precipitates a rebuild into that reserved space. After replacement disks are available, the vdev can be rebalanced to return the spare blocks and put the replacement disk into use.
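A rough way to see why spreading the spare matters is to compare how much data each rebuild target has to absorb. The Python sketch below is a best-case estimate under loud assumptions: the 150 MB/s sustained write rate is a hypothetical figure, the function name is invented for illustration, and the model ignores read bandwidth, seeks, fragmentation, and competing workload, all of which make real rebuilds slower.

```python
def ideal_rebuild_hours(rebuild_tb: float, write_mb_per_s: float,
                        target_disks: int = 1) -> float:
    """Best-case hours to write `rebuild_tb` of reconstructed data.

    The data is assumed to be split evenly over `target_disks`, each
    sustaining `write_mb_per_s` of sequential writes. Reads, seeks, and
    foreground I/O are deliberately ignored.
    """
    bytes_per_target = rebuild_tb * 1e12 / target_disks
    seconds = bytes_per_target / (write_mb_per_s * 1e6)
    return seconds / 3600


# One full 12TB drive rebuilt onto a single dedicated spare.
dedicated = ideal_rebuild_hours(12, 150, target_disks=1)

# The same data written into spare space spread over 13 surviving disks
# (a 14-disk vdev minus the failed member).
distributed = ideal_rebuild_hours(12, 150, target_disks=13)

print(f"dedicated spare:   {dedicated:.1f} h")
print(f"distributed spare: {distributed:.1f} h")
```

Even in this idealized model the single spare is the bottleneck; in practice the gap depends on how full and how fragmented the pool is.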
Fixed Stripe Width
Unlike RAID-Z, an entire stripe in dRAID is allocated at once, no matter how many disk blocks are needed to store the object. The stripe width is determined by the disk sector size multiplied by the number of data drives in the RAID group.
RAID-Z has a method of optimizing block layout to minimize block allocations for small files. dRAID, however, prioritizes the speed of rebuilding parity and does not make the same space-preserving attempt. If your files are a small fraction of the stripe size, dRAID will not be able to use all the disk blocks fully. For example, a default dRAID vdev has a stripe width of 32k (4k per disk, 8 data disks); any allocation will require at least 32k. Internal padding is allocated to fill out the stripe width after the requested object is stored. Using a smaller stripe width or providing a special mirror vdev will suit smaller allocations and improve drive utilization.
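The padding cost described above is easy to estimate before committing to a geometry. The following Python sketch rounds an allocation up to whole stripes using the 4k sector and 8 data disk defaults from the example. It is a simplified model, not OpenZFS code: the function name is hypothetical, and parity, compression, and metadata are left out.

```python
def draid_allocation(size_bytes: int, data_disks: int = 8,
                     sector_bytes: int = 4096) -> tuple:
    """Return (allocated_bytes, padding_bytes) for one object.

    Models only the data portion of a stripe: the allocation is rounded
    up to a whole number of stripes of data_disks * sector_bytes.
    """
    stripe = data_disks * sector_bytes
    stripes = max(1, -(-size_bytes // stripe))  # ceiling division
    allocated = stripes * stripe
    return allocated, allocated - size_bytes


for size in (4096, 32768, 40000):
    allocated, padding = draid_allocation(size)
    print(f"{size:>6} B stored -> {allocated:>6} B allocated, "
          f"{padding:>6} B padding ({size / allocated:.0%} efficient)")
```

Running the same function over a sample of real file sizes from the intended workload gives a quick feel for whether a narrower stripe or a special vdev is worth the trouble.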
A Tale of Two Resilvers
After a failure, the real or distributed spare is written to in sequence, following only the parity layout in the space map to rebuild the drive according to parity data. Sequential reconstruction can be accomplished rapidly by issuing large I/O blocks, reducing seeks, and avoiding tree indirection overhead. The rebuilt disk’s contents are not necessarily consistent with the Merkle tree that proves the zpool’s data is intact. It’s important to reconstruct this bitwise copy of the disk first, allowing the system to return to a mostly intact state and return to service. That is to say, the sequential reconstruction process restores the redundancy level of the pool, but without being able to verify the checksums of the data. The advantage is that it can be completed much more quickly, reducing the window during which additional disk failures might put the pool at risk.
A healing resilver is triggered automatically after a sequential resilver. It is a final operation that verifies, via block pointer traversal, that all the contents of the drives match their recorded checksums. The healing resilver has a number of optimizations to quickly find and reconstruct writes to the failed disk. When a replacement drive can be added to the pool, the rebalance operation is another sequential resilver followed by a healing resilver.
A scrub is the gold standard for pool health; however, a scrub might be a prohibitive amount of work, since it visits every block allocated in the pool. The healing resilver allows a practical return to operation in an environment where failures must be repaired routinely.
“Are We There Yet? When Can I Play With it?”
According to a report from the January OpenZFS leadership meeting, OpenZFS 2.1 will support dRAID in early 2021. If you must have it now, the head branch of OpenZFS builds against recently supported operating systems: FreeBSD 12.1+, Linux 5.10+, Illumos, NetBSD et al. The OpenZFS regression test suite, ztest, is a good indication that dRAID satisfies the ZFS commitment to data protection. Corporate customers at IBM and Panasas have been flogging other distributed RAID systems for more than ten years. It’s a mature concept that complements the ZFS tool set.
There is no better way to learn software than to run headlong into mistakes.
We’ll install ZFS head from source and gin up some ‘md’ file-backed disks.
‘zpool create r2dRAID draid2:3d:14c:1s /dev/md1 /dev/md2 … /dev/md13 /dev/md14’
There it is, a zpool with a dRAID vdev, ready to go to work.
The OpenZFS wiki has a good description of dRAID care, resilvering, and rebalancing.
Following the life cycle of failure and replacement in the documentation is recommended before those skills are tested in production.
Let’s decode the nomenclature that describes the geometry of a dRAID vdev. A string such as “draid2:3d:14c:1s” encodes the following about a dRAID vdev.
-parity: Required, the number of spindles to use to store parity information. E.g., a dRAID3 can survive until a fourth disk failure without losing data. Parity may be 1, 2, or 3.
-[d] data (spindles per RAID group): Determines the width of the data stripe; 8 is the default. Larger values will increase the stripe width and reduce total parity.
-[c] children: This parameter should match the number of device entries that you feed to the vdev. A helpful check will warn you if you don’t get the right number of disks named correctly: “invalid number of dRAID children; 14 required but 13 provided”
-[s] spares: The number of disk areas to mix in as distributed spares. No spares are created by default; a maximum of four are welcome. Each spare will remove a fraction of space from every disk.
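To make the nomenclature concrete, here is a minimal Python sketch that decodes such a string into its four parts and mimics the children check quoted above. It is an illustration only and not how zpool parses the vdev specification: the class and function names are hypothetical, the defaults (8 data, 0 spares) are taken from the list above, and accepting mixed-case spellings such as “dRAID2” is a convenience of this sketch.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DraidGeometry:
    parity: int              # parity spindles per RAID group (1, 2 or 3)
    data: int                # data spindles per RAID group
    children: Optional[int]  # expected device count, None if not given
    spares: int              # number of distributed spares


def parse_draid_spec(spec: str) -> DraidGeometry:
    """Decode a string such as 'draid2:3d:14c:1s' (illustrative parser)."""
    head, *fields = spec.lower().split(":")
    head_match = re.fullmatch(r"draid([123])", head)
    if head_match is None:
        raise ValueError(f"expected draid1, draid2 or draid3, got {head!r}")
    # Defaults described in the list above: 8 data spindles, no spares.
    values = {"d": 8, "c": None, "s": 0}
    for field in fields:
        field_match = re.fullmatch(r"(\d+)([dcs])", field)
        if field_match is None:
            raise ValueError(f"unrecognized field {field!r}")
        values[field_match.group(2)] = int(field_match.group(1))
    return DraidGeometry(int(head_match.group(1)),
                         values["d"], values["c"], values["s"])


def check_children(geometry: DraidGeometry, devices: List[str]) -> None:
    """Raise if the device list does not match the declared child count."""
    if geometry.children is not None and len(devices) != geometry.children:
        raise ValueError(
            f"invalid number of dRAID children; {geometry.children} "
            f"required but {len(devices)} provided")


geometry = parse_draid_spec("dRAID2:3d:14c:1s")
print(geometry)

devices = [f"/dev/md{i}" for i in range(1, 15)]  # md1 through md14
check_children(geometry, devices)                # passes: 14 devices
```

Because each field carries its own suffix letter, this sketch accepts the fields in any order, so “draid2:3d:1s:14c” decodes to the same geometry.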
dRAID offers a solution for large arrays; vdevs with fewer than 20 spindles will see limited benefit from the new option. For small numbers of spindles, performance and resilver results will be similar to RAID-Z. Installations with many spindles will see the best results with regard to performance, fast spare activation, and replacement. The benefits come with the associated cost of whole-stripe-at-a-time allocation for small objects in the pool. This overhead should be calculated in the design of the pool, before it becomes an operational surprise.
There is no free lunch with dRAID when it comes to saving on parity or spare drives; they are your defense against data loss. As drives increase in size, their time to resilver increases and the amount of data they can destroy increases.