
Setting up a ZFS pool involves a number of permanent decisions that will affect the performance, cost, and reliability of your data storage, so you want to understand all the options at your disposal in order to make the right choices from the beginning.

 


It all depends on the kind of data you will be dealing with and its intended use. Once this is well-defined, it's just a matter of matching the right hardware with the appropriate ZFS options. The basic decision you need to make is how to group your disks to achieve the desired performance, reliability, and cost-effectiveness. In ZFS, disks are typically grouped into 'virtual devices' (vdevs), which are then combined into a pool, offering a high degree of flexibility to set everything up as needed.

However, it is worth mentioning that a vdev can be a single disk, a partition or even a regular file (for testing purposes); you could just create a stripe of several disks like this:

root@geroda:~ # zpool create testpool da1 da2 da3

The available space of such a pool would be that of the 3 disks combined, but of course it would have no redundancy at all: if a single disk fails, all the pool's data is lost. This is because ZFS distributes the data among all the available vdevs for performance reasons, so stripes like this have limited practical use. Each vdev is responsible for providing its own redundancy.
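
If you want to double-check how such a stripe is laid out and how much combined space it offers, the standard status and listing commands (run here against the test pool created above) are enough:

root@geroda:~ # zpool status testpool
root@geroda:~ # zpool list -v testpool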

It's true that in such cases you can set the ZFS “copies” property, which keeps several copies of each file in the datasets you choose, making an effort to place the additional copies on different disks, but this is not guaranteed:

root@geroda:~ # zfs create testpool/thrash
root@geroda:~ # zfs create testpool/importantstuff 
root@geroda:~ # zfs set copies=3 testpool/importantstuff 

With this setup, any file written to testpool/importantstuff will have 3 copies of each data block, distributed amongst the 3 disks comprising the pool, but this only protects your data against failing sectors on some disk, not against a complete disk failure. If a disk dies in a striped pool, the pool cannot be imported because one of its top-level vdevs is missing, and your data will be lost.
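
A quick way to verify which datasets keep extra copies (using the datasets created above) is to query the property recursively:

root@geroda:~ # zfs get -r copies testpool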

For real redundant data storage, you need to choose between the mirror, RAID-Z, or dRAID vdev types.

Let's have a look at them in detail.

 

Mirror vdevs

Mirror vdevs have many good points; their main drawback is that, compared to other vdev types, you need more disks to hold the same amount of data. Because mirrors keep identical copies of the data on several disks, they provide the best IOPS, the number of read and/or write operations that can be performed per second. ZFS distributes writes amongst the top-level vdevs, so the more vdevs in the pool, the more IOPS are available. It is even possible to shrink a pool comprised of several mirrors by removing one of them, so the flexibility mirrors provide is an important factor to consider when deciding which type of vdev to choose.

Let's see some practical examples; first we create a simple pool with 2 disks:

root@geroda:~ # zpool create testpool mirror da0 da1

If we need more space, we just add more pairs of disks:

root@geroda:~ # zpool add testpool mirror da2 da3

Now the pool looks like this:

root@geroda:~ # zpool status testpool
  pool: testpool
 state: ONLINE
config:
     NAME        STATE     READ WRITE CKSUM
     testpool    ONLINE       0     0     0
       mirror-0  ONLINE       0     0     0
         da0     ONLINE       0     0     0
         da1     ONLINE       0     0     0
       mirror-1  ONLINE       0     0     0
         da2     ONLINE       0     0     0
         da3     ONLINE       0     0     0

It is also possible to attach additional disks to a mirror, increasing its IOPS (mostly read performance):

root@geroda:~ # zpool attach testpool da1 da4
root@geroda:~ # zpool attach testpool da3 da5
root@geroda:~ # zpool status testpool
  pool: testpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since [...]
config:
NAME        STATE     READ WRITE CKSUM
testpool    ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    da0     ONLINE       0     0     0
    da1     ONLINE       0     0     0
    da4     ONLINE       0     0     0  (resilvering)
  mirror-1  ONLINE       0     0     0
    da2     ONLINE       0     0     0
    da3     ONLINE       0     0     0
    da5     ONLINE       0     0     0  (resilvering)

If for some reason our needs change, we can easily remove a mirror vdev from the pool:

root@geroda:~ # zpool remove testpool mirror-1
root@geroda:~ # zpool status testpool
  pool: testpool
 state: ONLINE
  scan: resilvered 60K in 00:00:02 with 0 errors on Mon May 17 13:10:34 2021
remove: Removal of vdev 1 copied 39.5K in 0h0m, completed on Mon May 17 13:20:11 2021
    264 memory used for removed device mappings
config:
NAME          STATE     READ WRITE CKSUM
testpool      ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    da0       ONLINE       0     0     0
    da1       ONLINE       0     0     0
    da4       ONLINE       0     0     0

We could also remove the third disk from the above mirror:

root@geroda:~ # zpool detach testpool da4 

Of course, disks can only be detached from a mirror if enough redundancy remains. If only one disk in a mirror vdev remains, it ceases to be a mirror and becomes a single, non-redundant disk, risking the entire pool if that disk fails. So, when setting up vdevs, never forget that a single vdev failing makes its whole pool fail.

The fact that we can grow and shrink mirrored pools and vdevs makes it possible to manage disks in very efficient ways, moving them from one pool to another or between vdevs as circumstances require. For example, if one of the mirror vdevs in a pool needs a disk replacement but no spare is available, we could simply remove the degraded mirror from the pool (perhaps after relocating some data), gaining an extra spare disk in the process.

Another example: we could run three-disk mirrors during normal operation and degrade them to two-disk mirrors, freeing those extra disks for temporary purposes during an upgrade or migration (sketched below). Note that this ability to shrink a pool is not available when using RAID-Z or dRAID vdevs.
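
As a rough sketch of that second scenario, assuming three-way mirrors built from hypothetical disks da0/da1/da4 and da2/da3/da5, the temporary downgrade could look like this:

root@geroda:~ # zpool detach testpool da4
root@geroda:~ # zpool detach testpool da5
root@geroda:~ # zpool create temppool mirror da4 da5

Once the upgrade or migration is finished, temppool can be destroyed and the disks attached back to their original mirrors with zpool attach.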

Regarding performance, data is striped across the mirror vdevs, so a pool with two mirror vdevs behaves like a RAID 10, striping writes across two sets of mirrors. This roughly doubles write performance and quadruples read performance compared to a single disk. Also keep in mind that space is allocated in such a way that each vdev reaches 100% full at the same time, so there is a performance penalty if the vdevs have different amounts of free space, as a disproportionate amount of the data is written to the emptier vdev.
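
If you suspect such an imbalance, the per-vdev space figures reported by zpool list are an easy way to check it (shown here for the example pool above):

root@geroda:~ # zpool list -v testpool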

 

RAID-Z vdevs

RAID-Z is an improved RAID-5 that offers better distribution of data and parity. RAID-Z vdevs require at least three disks but provide more usable space than mirrors, and they also solve the RAID-5 "write hole" problem, in which data and parity can become inconsistent after a crash.

Each RAID-Z vdev can have single, double, or triple parity (called raidz/raidz1, raidz2 and raidz3), meaning it can withstand one, two, or three disks failing at the same time; generally speaking, the more disks a vdev is comprised of, the higher the parity level you should use.
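
As a minimal sketch (the pool and disk names are just placeholders), the three parity levels are created like this:

root@geroda:~ # zpool create pool1 raidz1 da1 da2 da3
root@geroda:~ # zpool create pool2 raidz2 da4 da5 da6 da7 da8 da9
root@geroda:~ # zpool create pool3 raidz3 da10 da11 da12 da13 da14 da15 da16 da17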

However, for best performance and reliability you need to take into account several other parameters before creating a RAID-Z pool:

  1. How many disks to put in each RAID-Z vdev: this depends on your IOPS needs and the amount of space you want to use for parity. To increase IOPS, use fewer disks per vdev (and more vdevs); to increase space, use more disks per vdev. If we have 12 disks available, we would get better IOPS with 4 raidz1 vdevs of 3 disks each than with 2 raidz2 vdevs of 6 disks each, even if in both cases we are roughly using 8 disks for data and 4 for parity. This happens because the data is distributed among all available vdevs (see the example commands after this list).
  2. Hardware distribution: if you have several groups of disks attached to different controllers or shelves, it can make sense to define the vdevs using one disk from each of them, not just for increased reliability but also to avoid potential bottlenecks. For example, with 6 shelves of 24 disks each, you could use 24 raidz2 vdevs of 6 disks each, one from each shelf. This arrangement would tolerate up to 2 shelves failing at the same time (or 1 shelf plus 1 disk elsewhere failing).
  3. Space efficiency: to maximize space availability, use more disks per vdev in proportion to parity level. For example: 12 disks as raidz2 instead of raidz3 can decrease the space used for parity from 38% to 22%, although the actual numbers vary with the block size used (see below).
  4. ZFS lz4 compression: this isn't enabled by default, but, in many cases, it can improve performance and space availability more than any tweak of the vdev and disk layout.
  5. Padding, disk sector size and recordsize setting: in RAID-Z, parity information is associated with each block rather than with specific stripes as in RAID-5, so each data allocation must be a multiple of p+1 (parity+1) to avoid freed segments becoming too small to be reused. If the allocated data isn't a multiple of p+1, 'padding' is used, which is why RAID-Z requires a bit more space for parity and padding than RAID-5. This is a complex issue, but in short: to avoid poor space efficiency, keep the ZFS recordsize much bigger than the disk sector size; you could use recordsize=4K or 8K with 512-byte sector disks, but if you are using 4K-sector disks then the recordsize should be several times that (the default 128K would do) or you could end up losing too much space.
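
To make the first point more concrete, here are the two 12-disk layouts mentioned above, written as alternative zpool create commands for the same (hypothetical) set of disks:

root@geroda:~ # zpool create fastpool raidz1 da1 da2 da3 raidz1 da4 da5 da6 \
                raidz1 da7 da8 da9 raidz1 da10 da11 da12
root@geroda:~ # zpool create bigpool raidz2 da1 da2 da3 da4 da5 da6 \
                raidz2 da7 da8 da9 da10 da11 da12

The first layout gives four vdevs worth of IOPS but each vdev only survives a single disk failure; the second gives just two vdevs worth of IOPS, but each vdev survives two simultaneous disk failures.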

To avoid issues with these settings, it's wise to test your setup before going into production, and for that a simple script like this can be helpful:

root@geroda:~ # cat test_recordsize.sh
#! /bin/sh
# Write files of several sizes and report how much space ZFS actually allocates.
file=test.tmp
rm $file 2> /dev/null
# Sleep slightly longer than the transaction group timeout so the on-disk
# allocation has settled before measuring it with du.
ST=`sysctl -n vfs.zfs.txg.timeout`
ST=`expr $ST + 1`
for size in 1 2 3 4 5 6 7 8 9 11 12 13 15 16 17 23 24 25 31 32 33 63 \
        64 65 127 128 129 254 255 256 257; do
    dd if=/dev/random bs=1k count=$size of=$file 2> /dev/null
    sleep $ST
    alloc=`du -k $file | awk '{print $1}'`
    rm $file
    echo "$size K size -> $alloc K alloc"
done

Thus, you can test different pool configurations and settings. Let's try it with a pool comprised of 2 raidz2 vdevs of 6 disks (512-byte sectors) each:

root@geroda:~ # zpool create testpool raidz2 da1 da2 da3 da4 da5 da6 \
                raidz2 da7 da8 da9 da10 da11 da12

Let's see how it looks with the default ZFS recordsize of 128K:

root@geroda:~ # cd /testpool/ 
root@geroda:/testpool # /root/test_recordsize.sh
1 K size -> 3 K alloc 
2 K size -> 3 K alloc
3 K size -> 5 K alloc
4 K size -> 5 K alloc
5 K size -> 7 K alloc 
6 K size -> 7 K alloc 
7 K size -> 9 K alloc 
8 K size -> 9 K alloc 
[...]
127 K size -> 129 K alloc
128 K size -> 129 K alloc
129 K size -> 260 K alloc
254 K size -> 260 K alloc
255 K size -> 260 K alloc
256 K size -> 260 K alloc
257 K size -> 388 K alloc 

You can see the “jump” when passing from a 128K to a 129K file: the 129K file uses twice the allocated space of the 128K file because a whole additional 128K record is needed to hold the extra 1K. If that dataset held plenty of 129K files, you would end up losing a huge amount of space.

Let's try now with compression enabled:

root@geroda:/testpool # zfs set compression=lz4 testpool
root@geroda:/testpool # /root/test_recordsize.sh 
1 K size -> 3 K alloc
2 K size -> 3 K alloc
3 K size -> 5 K alloc
4 K size -> 5 K alloc
5 K size -> 7 K alloc 
[...]
127 K size -> 129 K alloc 
128 K size -> 129 K alloc 
129 K size -> 135 K alloc 
254 K size -> 260 K alloc 
255 K size -> 260 K alloc 
256 K size -> 260 K alloc 
257 K size -> 262 K alloc 

Now things look much better, don’t they? Enabling compression is good in most cases. The same test can be done while changing the recordsize. Let's try a small one just to see how a single setting can affect space efficiency in a dramatic way. I will disable compression to make the numbers easier to understand:

root@geroda:/testpool # zfs set compression=off testpool 
root@geroda:/testpool # zfs set recordsize=4K testpool
root@geroda:/testpool # /root/test_recordsize.sh 
1 K size -> 3 K alloc 
2 K size -> 3 K alloc 
3 K size -> 5 K alloc 
4 K size -> 5 K alloc 
5 K size -> 13 K alloc 
6 K size -> 13 K alloc 
7 K size -> 13 K alloc 
8 K size -> 13 K alloc 
9 K size -> 17 K alloc 
23 K size -> 29 K alloc 
24 K size -> 29 K alloc 
25 K size -> 33 K alloc 
31 K size -> 37 K alloc 
129 K size -> 137 K alloc 
254 K size -> 264 K alloc 
255 K size -> 264 K alloc 
256 K size -> 264 K alloc 

As you can see, too much space is wasted. Except for files 4K in size or smaller, the bigger the file, the smaller the percentage of wasted space, but even so it's quite noticeable. So, although some databases do use 4K logical block sizes, it may not be such a good idea to set recordsize=4K just for that reason; it can be much better to keep the default recordsize (128K) and enable lz4 compression.
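
In practice, for a dataset like that, a reasonable starting point (the dataset name is just an example) is to leave the recordsize at its default and simply enable compression:

root@geroda:~ # zfs create testpool/db
root@geroda:~ # zfs set compression=lz4 testpool/db
root@geroda:~ # zfs get recordsize,compression testpool/db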

Bottom line, keep in mind which kind of data the pool will hold and how it will be used, otherwise you may have an unpleasant surprise once the server gets into production at full throttle. As you can see, much thinking can go into defining RAID-Z based pools, but the simplest rule of thumb is: enable compression and choose a sensible parity level according to the number of disks comprising each vdev.

An important detail to take into account is that growing a RAID-Z pool requires adding an entire new vdev, typically between 4 and 12 disks, and you can't shrink it afterwards as you could with a pool comprised of several mirrors, either by removing a single disk from a vdev or a whole vdev from the pool. In that sense, RAID-Z pools are less flexible than mirror-based pools.
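
For example, growing the 2 x 6-disk raidz2 pool used above would mean adding another complete 6-disk raidz2 vdev (the new disk names are hypothetical):

root@geroda:~ # zpool add testpool raidz2 da13 da14 da15 da16 da17 da18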

 

Distributed RAID (dRAID) vdevs

In short, dRAID pools are intended for environments where degraded performance during resilvering can be a problem: typically very large storage arrays where there is a real possibility of several disks failing before a faulty one has been replaced and resilvered. dRAID also offers a shorter resilvering time compared to RAID-Z.

dRAID is similar to RAID-Z but provides integrated distributed hot spares, which allow for faster resilvering when a disk fails. These integrated hot spares are not idle physical disks waiting for some other disk to die; they are better described as “pre-allocated virtual spares” spread over all the vdev's disks, so the more virtual spares you define, the less usable space you get from the vdev.

Overall, IOPS are not so different from RAID-Z, because every read must access all of the data disks; it is during resilvering that dRAID performs much better in comparison. Regarding parity, as with RAID-Z, we can choose between single, double, or triple parity, defined by the vdev types draid/draid1, draid2 and draid3.

However, unlike RAID-Z, dRAID uses a fixed stripe width to allow fully sequential resilvering. This must be taken into account when creating a pool, because the number of data disks together with their block size defines a minimum allocation size for the vdev, and this can have an impact on performance depending on the type of data stored and the compression used.

For example: if we have a dRAID with 8 data disks of 4K sector size, the minimum allocation size would be 32K, and this could be too big if the pool is going to hold mostly smaller blocks, either because of the data itself or because of ZFS compression. That's why, when using dRAID pools, the volblocksize and recordsize properties should be adjusted to account for the dRAID allocation size, and sometimes it can make sense to add a mirrored special vdev for storing smaller blocks, as sketched below.
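
As a sketch of that last idea (the disk names and the 32K threshold are just examples), a mirrored special vdev can be added and small blocks directed to it like this:

root@geroda:~ # zpool add mypool special mirror da10 da11
root@geroda:~ # zfs set special_small_blocks=32K mypool

Keep in mind that, like any other top-level vdev in a pool containing RAID-Z or dRAID vdevs, a special vdev cannot be removed later, so it should have redundancy of its own.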

Let's see some practical examples.

To create a default draid2:

root@geroda:~ # zpool create mypool draid2 da1 da2 da3 da4 da5 da6
root@geroda:~ # zpool status
  pool: mypool
 state: ONLINE 
config:
NAME                 STATE     READ WRITE CKSUM
mypool               ONLINE       0     0     0
  draid2:4d:6c:0s-0  ONLINE       0     0     0
    da1              ONLINE       0     0     0
    da2              ONLINE       0     0     0
    da3              ONLINE       0     0     0
    da4              ONLINE       0     0     0
    da5              ONLINE       0     0     0
    da6              ONLINE       0     0     0

Note that by default we have no distributed hot spare (the “0s” in draid2:4d:6c:0s-0), but we could define one if we want:

root@geroda:~ # zpool create mypool draid2:6d:9c:1s da1 da2 da3 da4 da5 da6 da7 da8 da9
root@geroda:~ # zpool status
  pool: mypool
 state: ONLINE
config:
NAME                 STATE     READ WRITE CKSUM
mypool               ONLINE       0     0     0
  draid2:6d:9c:1s-0  ONLINE       0     0     0
    da1              ONLINE       0     0     0
    da2              ONLINE       0     0     0
    da3              ONLINE       0     0     0
    da4              ONLINE       0     0     0
    da5              ONLINE       0     0     0
    da6              ONLINE       0     0     0
    da7              ONLINE       0     0     0
    da8              ONLINE       0     0     0
    da9              ONLINE       0     0     0
spares
  draid2-0-0         AVAIL

Now let's see what happens when a disk fails:

root@geroda:~ # zpool status
  pool: mypool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid.  Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:01:07 with 0 errors on Thu May 20 02:38:20 2021
config:
NAME                 STATE     READ WRITE CKSUM
mypool               DEGRADED     0     0     0
  draid2:6d:9c:1s-0  DEGRADED     0     0     0
    da1              ONLINE       0     0     0
    da2              ONLINE       0     0     0
    da3              ONLINE       0     0     0
    da4              ONLINE       0     0     0
    da5              UNAVAIL      0     0     0 corrupted data
    da6              ONLINE       0     0     0
    da7              ONLINE       0     0     0
    da8              ONLINE       0     0     0
    da9              ONLINE       0     0     0
spares
  draid2-0-0         AVAIL

Once the situation is assessed, we could either replace the physical disk or tell ZFS to replace it with the “virtual spare”. The advantage of dRAID virtual spares is that the resilver process reads from every remaining disk, and writes to every remaining disk, rather than writing to only the single replacement disk:

root@geroda:~ # zpool replace mypool da5 draid2-0-0
root@geroda:~ # zpool status
  pool: mypool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu May 20 02:40:12 2021
816M scanned at 35.5M/s, 292M issued at 12.7M/s, 816M total
54.3M resilvered, 35.77% done, 00:00:41 to go
config:
NAME                 STATE     READ WRITE CKSUM
mypool               DEGRADED     0     0     0
  draid2:6d:9c:1s-0  DEGRADED     0     0     0
    da1              ONLINE       0     0     0  (resilvering)
    da2              ONLINE       0     0     0  (resilvering)
    da3              ONLINE       0     0     0  (resilvering)
    da4              ONLINE       0     0     0  (resilvering)
    spare-4          DEGRADED     0     0     0
      da5            UNAVAIL      0     0     0  corrupted data
      draid2-0-0     ONLINE       0     0     0  (resilvering)
    da6              ONLINE       0     0     0  (resilvering)
    da7              ONLINE       0     0     0  (resilvering)
    da8              ONLINE       0     0     0  (resilvering)
    da9              ONLINE       0     0     0  (resilvering)
spares
  draid2-0-0         INUSE     currently in use

Once the resilvering with the virtual spare is done, we could replace the physical disk:

root@geroda:~ # zpool detach mypool da5
root@geroda:~ # zpool replace mypool draid2-0-0 da5 

Note that after replacing the failed physical disk, the whole vdev will be resilvered again before returning to its original state, with one “virtual spare disk” available. In a way, this setup (9 disks in draid2 with 1 spare) could be seen as a draid3 made of 9 disks with no spare, the main difference being that the spare-disk setup gives us more control over when and how the resilvering takes place, letting us decide between using the virtual spare or replacing the faulty disk right away. This second resilver, to the replacement disk, is slower, but it runs while the pool is not missing any parity, because the virtual spare already contains all of the data. The virtual spare thus shrinks the window during which the pool is vulnerable to further failures, and allows the operator to proceed with the slower disk replacement once that risk window is behind them.
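
If you want to script this two-step process, newer OpenZFS releases (2.0 and later) can block until the resilver finishes before the physical replacement is started; a minimal sketch, reusing the pool and disk names from the example above:

root@geroda:~ # zpool wait -t resilver mypool
root@geroda:~ # zpool detach mypool da5
root@geroda:~ # zpool replace mypool draid2-0-0 da5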

If you want to learn more about dRAID, check out this article we wrote.

 
