Improve the way you make use of ZFS in your company.
Did you know you can rely on Klara engineers for anything from a ZFS performance audit to developing new ZFS features to ultimately deploying an entire storage system on ZFS?
ZFS Support ZFS DevelopmentAdditional Articles
Here are more interesting articles on ZFS that you may find useful:
- GPL 3: The Controversial Licensing Model and Potential Solutions
- ZFS High Availability with Asynchronous Replication and zrep
- 8 Open Source Trends to Keep an Eye Out for in 2024
- OpenZFS Storage Best Practices and Use Cases – Part 3: Databases and VMs
- OpenZFS Storage Best Practices and Use Cases – Part 2: File Serving and SANs
This is part of our article series published as "OpenZFS in Depth". Subscribe to our article series to find out more about the secrets of OpenZFS
Zpool iostat is one of the most essential tools in any serious ZFS storage admin's toolbox—and today, we're going to go over a bit of the theory and practice of using it to troubleshoot performance.
If you're familiar with the iostat command—a core tool on FreeBSD, and a part of the optional sysstat package on Debian-derived Linuxes—then you already know most of what you need to know about zpool iostat, which is simply "iostat, but specifically for ZFS." For those who are less familiar, both iostat and zpool iostat are very low level tools which offer concrete information about the throughput, latency, and some of the usage patterns of individual disks.
The zpool iostat command has been greatly expanded in recent years, with OpenZFS 2.0 (coming in FreeBSD 13.0) offering new flags, including -l to monitor latency, and the ability to filter to a specific disk. OpenZFS 2.0 is also available on FreeBSD 12.1 and later via the openzfs-kmod port/package.
A quick overview of zpool iostat output
If run without any arguments, zpool iostat will show you a brief overview of all pools on your system, with average usage statistics across that period. For example:
root@banshee:~# zpool iostat capacity operations bandwidth pool alloc free read write read write ---------- ----- ----- ----- ----- ----- ----- banshee 774G 1.01T 174 110 17.9M 3.79M data 2.80T 844G 263 14 65.4M 164K ---------- ----- ----- ----- ----- ----- -----
This system has two pools, one comprised of fast but relatively small SSDs, and one with larger but relatively slow rust disks. We can also see the capacity and used/free information for each pool, along with statistics for how many operations per second and how much data per second each pool has averaged since the last boot.
The system above has only been up for 25 hours, and each pool received a scheduled scrub during those 25 hours—so we see much higher reads than writes, particularly on the slower bulk storage pool, data. The faster SSD pool includes home directories and VMs, so it's received a much closer to even number of reads and writes, despite the scrub—normally, it would be weighted heavily towards writes.
This is good information to have, but we can go much deeper, and also more granular—we can find more information about individual disks, and we can also monitor output in near-realtime, not just as averaged statistics since the most recent boot.
Monitoring individual disks
One of the ways zpool iostat is most commonly used is to look for potential problems with individual disks. In general, we expect the disks of an individual vdev to receive roughly similar numbers of read and write operations. Let's take a look at our demonstration system in a little greater depth:
root@banshee:~# zpool iostat -v capacity operations bandwidth pool alloc free read write read write -------------------------------- ----- ----- ----- ----- ----- ----- banshee 774G 1.01T 172 110 17.7M 3.79M mirror 774G 1.01T 172 110 17.7M 3.79M wwn-0x5002538e4982fef1-part5 - - 86 55 8.85M 1.90M wwn-0x5002538e4982ff33-part5 - - 86 55 8.85M 1.90M -------------------------------- ----- ----- ----- ----- ----- ----- data 2.81T 836G 260 14 64.6M 347K mirror 1.40T 422G 184 8 32.3M 184K wwn-0x50014ee20d8e5e7f - - 152 4 16.1M 92.0K wwn-0x50014ee25d2510d4 - - 31 4 16.1M 92.0K mirror 1.41T 414G 76 6 32.4M 163K wwn-0x50014ee206fd9549 - - 31 3 16.2M 81.6K wwn-0x5000c500c5dc8eca - - 44 2 16.2M 81.6K -------------------------------- ----- ----- ----- ----- ----- -----
On banshee, our solid state pool, we can see exactly even usage on both disks of the pool's single mirror vdev—read or write, IOPS or throughput, both sides receive the same load.
But on data—the rust pool—things aren't quite so neat and pretty.
In particular, we can see that the first disk of the first mirror vdev—the one with WWN ending in 5e7f—has received considerably more read operations than the other disk in the same vdev. That isn't necessarily a big problem—an individual read only goes to one side of a mirror vdev, not to both—but it could be an indication that the other disk in the vdev is having problems, forcing the vdev to read from 5e7f while 10d4 is unavailable or busy.
The other of data's two mirror vdevs is much more evenly loaded. It's still not perfect, the way the two SSDs in banshee are—but on rust disks, a little bit of load mismatch is expected.
Since we identified a potential problem with one of data's vdevs, we'll need to leave the cumulative statistics behind, and go near-realtime to investigate further.
Going near-realtime with zpool iostat
Since we saw what could be a problematic access pattern in our cumulative statistics, the next step is to generate artificial load on the pool, and monitor its access patterns in near-realtime. To do that, we'll create an uncached dataset, fire up an fio instance inside that dataset, and then monitor fio's progress from a second shell.
root@banshee:~# zfs create data/fio
root@banshee:~# zfs set primarycache=none data/fio
root@banshee:~# fio --directory=/data/fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=1M --size=256M --numjobs=32 --time_based --runtime=300
This fio command will put some solid read stress on the data pool—it creates 32 256MiB files, and each of 32 concurrent fio processes reads random 1MiB blocks from them. Since we set primarycache=none, every read will go straight to the disks themselves, rather than being served from ARC—even if we have an inordinately large amount of RAM on the system.
Since we used --time_based and --runtime=300, we've got five minutes to play with here. The next step is to open another console window (or screen session) and check zpool iostat from there.
root@banshee:~# zpool iostat -vy 1 1
capacity operations bandwidth
pool alloc free read write read write
-------------------------------- ----- ----- ----- ----- ----- -----
banshee 774G 1.01T 0 15 0 495K
mirror 774G 1.01T 0 15 0 495K
wwn-0x5002538e4982fef1-part5 - - 0 7 0 247K
wwn-0x5002538e4982ff33-part5 - - 0 7 0 247K
-------------------------------- ----- ----- ----- ----- ----- -----
data 2.82T 828G 809 0 137M 0
mirror 1.40T 418G 322 0 65.4M 0
wwn-0x50014ee20d8e5e7f - - 131 0 27.2M 0
wwn-0x50014ee25d2510d4 - - 190 0 38.3M 0
mirror 1.41T 410G 486 0 72.1M 0
wwn-0x50014ee206fd9549 - - 293 0 40.0M 0
wwn-0x5000c500c5dc8eca - - 193 0 32.0M 0
-------------------------------- ----- ----- ----- ----- ----- -----
The new arguments we fed zpool iostat are -y 1 1. The -y asks zpool iostat to discard its usual output of cumulative statistics entirely. The first 1 means to poll the system every second, and the second 1 means to only show a single poll's statistics.
What we can see here is that our very heavily read-loaded pool is distributing reads roughly evenly to both mirror vdevs within data—and within each mirror vdev, roughly evenly between each disk. This is what we expect to see, and is a good indication our disks are in pretty good shape.
In particular, it's worth noting that the imbalance in read load between 5e7f and 10d4 are reversed from what they were in our cumulative statistics—so the initial, weak indication that 10d4 might be having problems isn't holding up to more rigorous testing.
We could also ask for a 30 second average, specifically while running this fio test:
root@banshee:~# zpool iostat -vy 30 1
capacity operations bandwidth
pool alloc free read write read write
-------------------------------- ----- ----- ----- ----- ----- -----
banshee 774G 1.01T 0 209 1.87K 9.44M
mirror 774G 1.01T 0 209 1.87K 9.44M
wwn-0x5002538e4982fef1-part5 - - 0 106 1.60K 4.72M
wwn-0x5002538e4982ff33-part5 - - 0 103 273 4.72M
-------------------------------- ----- ----- ----- ----- ----- -----
data 2.82T 828G 789 62 139M 659K
mirror 1.40T 418G 352 40 67.9M 427K
wwn-0x50014ee20d8e5e7f - - 139 20 31.2M 213K
wwn-0x50014ee25d2510d4 - - 212 20 36.8M 213K
mirror 1.41T 410G 437 22 71.1M 233K
wwn-0x50014ee206fd9549 - - 238 11 38.0M 116K
wwn-0x5000c500c5dc8eca - - 198 10 33.1M 116K
-------------------------------- ----- ----- ----- ----- ----- -----
In this longer test, we can see near-perfect distribution of read load between disks within each vdev. There are no indications of significant problems here—but should we wish, we can still drill a bit deeper.
Monitoring device latency with zpool iostat
If we want to see truly detailed information about each disk, the next step is to add the -l flag, for latency. You'll need a full-screen terminal window to make much use of this data, since the format is extremely wide!
We have six new column groups to play with here— the first four are total_wait, disk_wait, syncq_wait, asyncq_wait; each of these is further divided into read and write specifically.
Asyncq_wait and syncq_wait are, respectively, the time that data (sync or async) spends waiting to commit to disk in TXGs, and the time sync data spent waiting to commit to the ZIL. Neither of these columns include disk service time.
Disk_wait is the average disk I/O time spent reading/writing to the disk. Total_wait is the average total I/O time, including both queuing and disk I/O time. Finally, scrub wait and trim wait are exactly what they sound like: time spent in either scrub or trim queues.
Now that we know what the columns mean, we fire up our fio load engine again, and begin sampling data using the new -l flag:
root@banshee:~# zpool iostat -vly 1 1
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim
pool alloc free read write read write read write read write read write read write wait wait
-------------------------------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
banshee 774G 1.01T 0 15 0 431K - 1ms - 1ms - 1us - - - -
mirror 774G 1.01T 0 15 0 431K - 1ms - 1ms - 1us - - - -
wwn-0x5002538e4982fef1-part5 - - 0 7 0 216K - 1ms - 1ms - 2us - - - -
wwn-0x5002538e4982ff33-part5 - - 0 7 0 216K - 1ms - 1ms - 1us - - - -
-------------------------------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
data 2.82T 828G 904 0 145M 0 64ms - 39ms - 23ms - - - - -
mirror 1.40T 418G 374 0 72.0M 0 107ms - 50ms - 54ms - - - - -
wwn-0x50014ee20d8e5e7f - - 153 0 35.4M 0 173ms - 59ms - 106ms - - - - -
wwn-0x50014ee25d2510d4 - - 220 0 36.6M 0 61ms - 43ms - 18ms - - - - -
mirror 1.41T 410G 529 0 72.7M 0 33ms - 31ms - 2ms - - - - -
wwn-0x50014ee206fd9549 - - 265 0 36.6M 0 32ms - 30ms - 2ms - - - - -
wwn-0x5000c500c5dc8eca - - 264 0 36.1M 0 35ms - 33ms - 2ms - - - - -
-------------------------------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
In the above output, we can see new columns giving us wait times—the amount of time for an operation to sync to disk—for each pool, each of its vdevs, and each of the individual vdevs as well. Once we drill down to this final level of detail, we can get the clue that finally explains the discrepancy in reads we saw initially, in which the disk 5e7f serviced so many more reads than its companion disk 10d4.
The answer turns out to have been fairly misleading, although we had some additional clues when we began. Looking back at our first, simplest tests, we can see that although 5e7f serviced more read requests than its companion disk 10d4 did, the two had actually read almost identical amounts of data.
This implied failed requests from that disk, rather than that disk servicing more of the actual load—but without the additional data that the -l flag offered us, it's easy to miss or to misinterpret. But now that we're looking at a real-time sample of data which includes latency statistics, we can confirm that 5e7f is noticeably slower to return read requests (as shown in the read column of total_wait, 173ms) than either its companion disk, or the two disks in the pool's second vdev.
Although the degree of the slowness does fluctuate if we take several samples, 5e7f frequently takes more than three times as long to service a read request as any of the other three rust disks do. This is an indication of a disk that is, however slowly, beginning to fail.
Many admins rely on the output of smartctl -a to check for disk health—but many more admins, myself included, will warn you that many if not most failures don't show up after any warning sign that smartctl offers. Smartctl on this disk does not show any reallocated sectors, seek errors, or any other sign of a hard problem.
So, now that we know that 5e7f isn't in the best of health, what do we do about it? That can depend on many factors. Although 5e7f is clearly bottlenecking our pool—as we can see by the increased service time for its parent mirror vdev, as compared to the other mirror vdev—this particular pool isn't performing slowly enough to cause "seat-of-the-pants" problems, in which users actively complain about a slow system.
5e7f is also partnered with a healthy disk, and—more importantly—the system as a whole is backed up via both daily and hourly ZFS replication to an entirely separate system. Finally, the pool and snapshot health of both this system and the backup server is automatically monitored... so in this particular case, the sysadmin responsible is deciding to just ride 5e7f until it drops, or until the next scheduled capacity upgrade.
In a situation which demands the maximum performance, or maximum uptime, the smarter move would be to replace 5e7f proactively, and do so immediately.
Continuously updated zpool iostat
So far, we've used iostat with two numerical arguments—eg zpool iostat -y 1 1, for a single poll of one second's statistics, or zpool iostat 30 1, for a single poll of 30 seconds' statistics. We can advance the second argument—for example, zpool iostat -y 30 5—in order to get, in that case, 5 polls of 30 seconds' duration apiece.
We can also omit the second argument entirely—zpool iostat -y 1 will give us a running batch of statistics, output once every second. But it's difficult to use well in real time, because the output simply scrolls up the screen. For a more top-like experience, we can pair the zpool iostat command with watch (on Linux) or cmdwatch (on FreeBSD).
For example, cmdwatch -n 1 zpool iostat -vy 1 1 will give us the detailed output from the last section, in an updated in-place format on the screen rather than scrolling. For truly interactive use, this is the superior way to check iostat on an ongoing basis.
The only real disadvantage to using watch or cmdwatch to monitor zpool iostat in realtime is that when you ctrl-C out of its loop, it clears the screen one last time, rather than leaving the last set of numbers visible. This is a problem if you need to capture the output—and it's why for this article, we saved it for last.
In reality, I used watch -n 1 zpool iostat -vly 1 1 to monitor the latency of the drives in my data pool in real time, which allowed me to quickly spot both the greater latency on 5e7f and to verify how consistent the difference between it and its companion disk was from one second to the next—but then I ctrl-C'd out of it, and used a single zpool iostat -vly 1 1 to capture one second's data in a format that could be copied and pasted here.
officeklara
Learn About KlaraGetting expert ZFS advice is as easy as reaching out to us!
At Klara, we have an entire team dedicated to helping you with your ZFS Projects. Whether you’re planning a ZFS project or are in the middle of one and need a bit of extra insight, we are here to help!