Klara

This is part of our article series published as "OpenZFS in Depth". Subscribe to our article series to find out more about the secrets of OpenZFS


Zpool iostat is one of the most essential tools in any serious ZFS storage admin's toolbox—and today, we're going to go over a bit of the theory and practice of using it to troubleshoot performance.

If you're familiar with the iostat command—a core tool on FreeBSD, and a part of the optional sysstat package on Debian-derived Linuxes—then you already know most of what you need to know about zpool iostat, which is simply "iostat, but specifically for ZFS." For those who are less familiar, both iostat and zpool iostat are very low level tools which offer concrete information about the throughput, latency, and some of the usage patterns of individual disks.

The zpool iostat command has been greatly expanded in recent years, with OpenZFS 2.0 (coming in FreeBSD 13.0) offering new flags, including -l to monitor latency, and the ability to filter to a specific disk. OpenZFS 2.0 is also available on FreeBSD 12.1 and later via the openzfs-kmod port/package.

A quick overview of zpool iostat output

If run without any arguments, zpool iostat will show you a brief overview of all pools on your system, with average usage statistics across that period. For example:

 root@banshee:~# zpool iostat
               capacity     operations     bandwidth 
 pool        alloc   free   read  write   read  write
 ----------  -----  -----  -----  -----  -----  -----
 banshee      774G  1.01T    174    110  17.9M  3.79M
 data        2.80T   844G    263     14  65.4M   164K
 ----------  -----  -----  -----  -----  -----  ----- 

This system has two pools, one comprised of fast but relatively small SSDs, and one with larger but relatively slow rust disks. We can also see the capacity and used/free information for each pool, along with statistics for how many operations per second and how much data per second each pool has averaged since the last boot.

The system above has only been up for 25 hours, and each pool received a scheduled scrub during those 25 hours—so we see much higher reads than writes, particularly on the slower bulk storage pool, data. The faster SSD pool includes home directories and VMs, so it's received a much closer to even number of reads and writes, despite the scrub—normally, it would be weighted heavily towards writes.

This is good information to have, but we can go much deeper, and also more granular—we can find more information about individual disks, and we can also monitor output in near-realtime, not just as averaged statistics since the most recent boot.

Monitoring individual disks

One of the ways zpool iostat is most commonly used is to look for potential problems with individual disks. In general, we expect the disks of an individual vdev to receive roughly similar numbers of read and write operations. Let's take a look at our demonstration system in a little greater depth:

 root@banshee:~# zpool iostat -v
                                     capacity     operations     bandwidth 
 pool                              alloc   free   read  write   read  write
 --------------------------------  -----  -----  -----  -----  -----  -----
 banshee                            774G  1.01T    172    110  17.7M  3.79M
   mirror                           774G  1.01T    172    110  17.7M  3.79M
     wwn-0x5002538e4982fef1-part5      -      -     86     55  8.85M  1.90M
     wwn-0x5002538e4982ff33-part5      -      -     86     55  8.85M  1.90M
 --------------------------------  -----  -----  -----  -----  -----  -----
 data                              2.81T   836G    260     14  64.6M   347K
   mirror                          1.40T   422G    184      8  32.3M   184K
     wwn-0x50014ee20d8e5e7f            -      -    152      4  16.1M  92.0K
     wwn-0x50014ee25d2510d4            -      -     31      4  16.1M  92.0K
   mirror                          1.41T   414G     76      6  32.4M   163K
     wwn-0x50014ee206fd9549            -      -     31      3  16.2M  81.6K
     wwn-0x5000c500c5dc8eca            -      -     44      2  16.2M  81.6K
 --------------------------------  -----  -----  -----  -----  -----  ----- 

On banshee, our solid state pool, we can see exactly even usage on both disks of the pool's single mirror vdev—read or write, IOPS or throughput, both sides receive the same load.

But on data—the rust pool—things aren't quite so neat and pretty.

In particular, we can see that the first disk of the first mirror vdev—the one with WWN ending in 5e7f—has received considerably more read operations than the other disk in the same vdev. That isn't necessarily a big problem—an individual read only goes to one side of a mirror vdev, not to both—but it could be an indication that the other disk in the vdev is having problems, forcing the vdev to read from 5e7f while 10d4 is unavailable or busy.

The other of data's two mirror vdevs is much more evenly loaded. It's still not perfect, the way the two SSDs in banshee are—but on rust disks, a little bit of load mismatch is expected.

Since we identified a potential problem with one of data's vdevs, we'll need to leave the cumulative statistics behind, and go near-realtime to investigate further.

Going near-realtime with zpool iostat

Since we saw what could be a problematic access pattern in our cumulative statistics, the next step is to generate artificial load on the pool, and monitor its access patterns in near-realtime. To do that, we'll create an uncached dataset, fire up an fio instance inside that dataset, and then monitor fio's progress from a second shell.

root@banshee:~# zfs create data/fio
root@banshee:~# zfs set primarycache=none data/fio
root@banshee:~# fio --directory=/data/fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=1M --size=256M --numjobs=32 --time_based --runtime=300

This fio command will put some solid read stress on the data pool—it creates 32 256MiB files, and each of 32 concurrent fio processes reads random 1MiB blocks from them. Since we set primarycache=none, every read will go straight to the disks themselves, rather than being served from ARC—even if we have an inordinately large amount of RAM on the system.

Since we used --time_based and --runtime=300, we've got five minutes to play with here. The next step is to open another console window (or screen session) and check zpool iostat from there.

 root@banshee:~# zpool iostat -vy 1 1
                                     capacity     operations     bandwidth 
 pool                              alloc   free   read  write   read  write
 --------------------------------  -----  -----  -----  -----  -----  -----
 banshee                            774G  1.01T      0     15      0   495K
   mirror                           774G  1.01T      0     15      0   495K
     wwn-0x5002538e4982fef1-part5      -      -      0      7      0   247K
     wwn-0x5002538e4982ff33-part5      -      -      0      7      0   247K
 --------------------------------  -----  -----  -----  -----  -----  -----
 data                              2.82T   828G    809      0   137M      0
   mirror                          1.40T   418G    322      0  65.4M      0
     wwn-0x50014ee20d8e5e7f            -      -    131      0  27.2M      0
     wwn-0x50014ee25d2510d4            -      -    190      0  38.3M      0
   mirror                          1.41T   410G    486      0  72.1M      0
     wwn-0x50014ee206fd9549            -      -    293      0  40.0M      0
     wwn-0x5000c500c5dc8eca            -      -    193      0  32.0M      0
 --------------------------------  -----  -----  -----  -----  -----  ----- 

The new arguments we fed zpool iostat are -y 1 1. The -y asks zpool iostat to discard its usual output of cumulative statistics entirely. The first 1 means to poll the system every second, and the second 1 means to only show a single poll's statistics.

What we can see here is that our very heavily read-loaded pool is distributing reads roughly evenly to both mirror vdevs within data—and within each mirror vdev, roughly evenly between each disk. This is what we expect to see, and is a good indication our disks are in pretty good shape.

In particular, it's worth noting that the imbalance in read load between 5e7f and 10d4 are reversed from what they were in our cumulative statistics—so the initial, weak indication that 10d4 might be having problems isn't holding up to more rigorous testing.

We could also ask for a 30 second average, specifically while running this fio test:

root@banshee:~# zpool iostat -vy 30 1
                                     capacity     operations     bandwidth 
 pool                              alloc   free   read  write   read  write
 --------------------------------  -----  -----  -----  -----  -----  -----
 banshee                            774G  1.01T      0    209  1.87K  9.44M
   mirror                           774G  1.01T      0    209  1.87K  9.44M
     wwn-0x5002538e4982fef1-part5      -      -      0    106  1.60K  4.72M
     wwn-0x5002538e4982ff33-part5      -      -      0    103    273  4.72M
 --------------------------------  -----  -----  -----  -----  -----  -----
 data                              2.82T   828G    789     62   139M   659K
   mirror                          1.40T   418G    352     40  67.9M   427K
     wwn-0x50014ee20d8e5e7f            -      -    139     20  31.2M   213K
     wwn-0x50014ee25d2510d4            -      -    212     20  36.8M   213K
   mirror                          1.41T   410G    437     22  71.1M   233K
     wwn-0x50014ee206fd9549            -      -    238     11  38.0M   116K
     wwn-0x5000c500c5dc8eca            -      -    198     10  33.1M   116K
 --------------------------------  -----  -----  -----  -----  -----  ----- 

In this longer test, we can see near-perfect distribution of read load between disks within each vdev. There are no indications of significant problems here—but should we wish, we can still drill a bit deeper.

Monitoring device latency with zpool iostat

If we want to see truly detailed information about each disk, the next step is to add the -l flag, for latency. You'll need a full-screen terminal window to make much use of this data, since the format is extremely wide!

We have six new column groups to play with here— the first four are total_wait, disk_wait, syncq_wait, asyncq_wait; each of these is further divided into read and write specifically.

Asyncq_wait and syncq_wait are, respectively, the time that data (sync or async) spends waiting to commit to disk in TXGs, and the time sync data spent waiting to commit to the ZIL. Neither of these columns include disk service time.

Disk_wait is the average disk I/O time spent reading/writing to the disk. Total_wait is the average total I/O time, including both queuing and disk I/O time. Finally, scrub wait and trim wait are exactly what they sound like: time spent in either scrub or trim queues.

Now that we know what the columns mean, we fire up our fio load engine again, and begin sampling data using the new -l flag:


 root@banshee:~# zpool iostat -vly 1 1
                                     capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
 pool                              alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
 --------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
 banshee                            774G  1.01T      0     15      0   431K      -    1ms      -    1ms      -    1us      -      -      -      -
   mirror                           774G  1.01T      0     15      0   431K      -    1ms      -    1ms      -    1us      -      -      -      -
     wwn-0x5002538e4982fef1-part5      -      -      0      7      0   216K      -    1ms      -    1ms      -    2us      -      -      -      -
     wwn-0x5002538e4982ff33-part5      -      -      0      7      0   216K      -    1ms      -    1ms      -    1us      -      -      -      -
 --------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
 data                              2.82T   828G    904      0   145M      0   64ms      -   39ms      -   23ms      -      -      -      -      -
   mirror                          1.40T   418G    374      0  72.0M      0  107ms      -   50ms      -   54ms      -      -      -      -      -
     wwn-0x50014ee20d8e5e7f            -      -    153      0  35.4M      0  173ms      -   59ms      -  106ms      -      -      -      -      -
     wwn-0x50014ee25d2510d4            -      -    220      0  36.6M      0   61ms      -   43ms      -   18ms      -      -      -      -      -
   mirror                          1.41T   410G    529      0  72.7M      0   33ms      -   31ms      -    2ms      -      -      -      -      -
     wwn-0x50014ee206fd9549            -      -    265      0  36.6M      0   32ms      -   30ms      -    2ms      -      -      -      -      -
     wwn-0x5000c500c5dc8eca            -      -    264      0  36.1M      0   35ms      -   33ms      -    2ms      -      -      -      -      -
 --------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  ----- 

In the above output, we can see new columns giving us wait times—the amount of time for an operation to sync to disk—for each pool, each of its vdevs, and each of the individual vdevs as well. Once we drill down to this final level of detail, we can get the clue that finally explains the discrepancy in reads we saw initially, in which the disk 5e7f serviced so many more reads than its companion disk 10d4.

The answer turns out to have been fairly misleading, although we had some additional clues when we began. Looking back at our first, simplest tests, we can see that although 5e7f serviced more read requests than its companion disk 10d4 did, the two had actually read almost identical amounts of data.

This implied failed requests from that disk, rather than that disk servicing more of the actual load—but without the additional data that the -l flag offered us, it's easy to miss or to misinterpret. But now that we're looking at a real-time sample of data which includes latency statistics, we can confirm that 5e7f is noticeably slower to return read requests (as shown in the read column of total_wait, 173ms) than either its companion disk, or the two disks in the pool's second vdev.

Although the degree of the slowness does fluctuate if we take several samples, 5e7f frequently takes more than three times as long to service a read request as any of the other three rust disks do. This is an indication of a disk that is, however slowly, beginning to fail.

Many admins rely on the output of smartctl -a to check for disk health—but many more admins, myself included, will warn you that many if not most failures don't show up after any warning sign that smartctl offers. Smartctl on this disk does not show any reallocated sectors, seek errors, or any other sign of a hard problem.

So, now that we know that 5e7f isn't in the best of health, what do we do about it? That can depend on many factors. Although 5e7f is clearly bottlenecking our pool—as we can see by the increased service time for its parent mirror vdev, as compared to the other mirror vdev—this particular pool isn't performing slowly enough to cause "seat-of-the-pants" problems, in which users actively complain about a slow system.

5e7f is also partnered with a healthy disk, and—more importantly—the system as a whole is backed up via both daily and hourly ZFS replication to an entirely separate system. Finally, the pool and snapshot health of both this system and the backup server is automatically monitored... so in this particular case, the sysadmin responsible is deciding to just ride 5e7f until it drops, or until the next scheduled capacity upgrade.

In a situation which demands the maximum performance, or maximum uptime, the smarter move would be to replace 5e7f proactively, and do so immediately.

Continuously updated zpool iostat

So far, we've used iostat with two numerical arguments—eg zpool iostat -y 1 1, for a single poll of one second's statistics, or zpool iostat 30 1, for a single poll of 30 seconds' statistics. We can advance the second argument—for example, zpool iostat -y 30 5—in order to get, in that case, 5 polls of 30 seconds' duration apiece.

We can also omit the second argument entirely—zpool iostat -y 1 will give us a running batch of statistics, output once every second. But it's difficult to use well in real time, because the output simply scrolls up the screen. For a more top-like experience, we can pair the zpool iostat command with watch (on Linux) or cmdwatch (on FreeBSD).

For example, cmdwatch -n 1 zpool iostat -vy 1 1 will give us the detailed output from the last section, in an updated in-place format on the screen rather than scrolling. For truly interactive use, this is the superior way to check iostat on an ongoing basis.

The only real disadvantage to using watch or cmdwatch to monitor zpool iostat in realtime is that when you ctrl-C out of its loop, it clears the screen one last time, rather than leaving the last set of numbers visible. This is a problem if you need to capture the output—and it's why for this article, we saved it for last.

In reality, I used watch -n 1 zpool iostat -vly 1 1 to monitor the latency of the drives in my data pool in real time, which allowed me to quickly spot both the greater latency on 5e7f and to verify how consistent the difference between it and its companion disk was from one second to the next—but then I ctrl-C'd out of it, and used a single zpool iostat -vly 1 1 to capture one second's data in a format that could be copied and pasted here.

Back to Articles

More on This Topic

Getting expert ZFS advice is as easy as reaching out to us!

At Klara, we have an entire team dedicated to helping you with your ZFS Projects. Whether you’re planning a ZFS project or are in the middle of one and need a bit of extra insight, we are here to help!