FreeBSD iostat – A Quick Glance
iostat provides a window into the i/o effort of the storage subsystem. You can use it to determine usage patterns, bottlenecks and poor behavior at a glance. It can produce data to support conclusions and suggest further avenues of investigation when used judiciously. In this article, we will dissect its output and introduce disk subsystem troubleshooting using statistical output from iostat.
With no other arguments, iostat produces the following output:
```
       tty            nvd0             ada0              da0            cpu
 tin  tout  KB/t  tps  MB/s  KB/t  tps  MB/s  KB/t  tps  MB/s  us ni sy in id
   0   112 15.89    0  0.00 27.28    1  0.04 34.56    0  0.00   0  0  0  0 100
```
Let’s decode the output:
Each device column (nvd0, ada0, da0) is titled with the device name; the statistics grouped vertically beneath it belong to that device.
- tin: tty characters in (see the ‘-d’ flag in the Reduce Cognitive Load section below)
- tout: tty characters out
- KB/t: average (mean) size of transaction in KB
- tps: transaction frequency (per second)
- MB/s: total throughput in megabytes/sec (Both reads and writes)
cpu: percentage of time spent, split into the following:
- us: user mode
- ni: nice
- sy: system (kernel)
- in: interrupt (hardware)
- id: idle
The CPU statistics can be used to verify that the system was under load while you were observing the storage. These stats must be interpreted with an understanding of the workload. A system with high user CPU utilization may not be i/o bound; however, if the system shows mostly kernel and interrupt time, there may be an i/o throughput limitation worth investigating further.
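The three per-device columns are arithmetically related: throughput is the average transaction size multiplied by the transaction rate. A quick sanity check in Python, using hypothetical values:

```python
# MB/s is (roughly) KB/t * tps / 1024; the values below are hypothetical.
kb_per_transfer = 128.0   # KB/t column
transfers_per_sec = 80    # tps column

mb_per_sec = kb_per_transfer * transfers_per_sec / 1024
print(mb_per_sec)  # 10.0
```

Keep in mind that iostat rounds each column independently, so recomputing one column from the other two will not always match the displayed value exactly.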
Extended device data is available with the ‘-x’ switch:
```
                       extended device statistics
device     r/s   w/s   kr/s   kw/s  ms/r  ms/w  ms/o  ms/t  qlen  %b
ada0         0     0   28.8    7.6     0     0     1     0     0   0
da0          0     0    0.3    0.1     3    10     0     6     0   0
...
da41         0     0    0.6    3.7     3    12   786    17     0   0
```
This view displays extended statistics for the entire system, including every disk device available on the system. These statistics can reveal the health of specific disks.
The fields explained:
- r/s: read transactions / second
- w/s: write transactions / second
- kr/s: kilobytes read / second
- kw/s: kilobytes written / second
- ms/r: mean time (milliseconds) / read
- ms/w: mean time (milliseconds) / write
- ms/o: mean time (milliseconds) / other transaction (no data transfer, such as a flush or delete)
- ms/t: mean time (milliseconds) / transaction (all types)
- qlen: the depth of the transaction queue
- %b: percentage of time the device was busy
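These fields are related to one another: ms/t is approximately the count-weighted mean of the per-type latencies. A sketch of that relationship in Python, with hypothetical numbers (and ignoring "other" transactions for simplicity):

```python
# Hypothetical per-second counts and mean latencies for one device.
r_s, ms_r = 400, 2.0   # r/s and ms/r: many fast reads
w_s, ms_w = 100, 12.0  # w/s and ms/w: fewer, slower writes

# ms/t is roughly the count-weighted mean across all transactions.
ms_t = (r_s * ms_r + w_s * ms_w) / (r_s + w_s)
print(ms_t)  # 4.0
```

This is why a healthy-looking ms/t can coexist with a painful ms/w: a flood of fast reads drags the combined mean down.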
The r/s and w/s are indicators for the use of the device. The figure should correlate to the workload’s demands.
For example, an OLTP application will burst reads and writes and will not tolerate high queue depths. Alternately, a data recording application will show a continuous stream of writes while being able to tolerate higher queue depths in effort to improve throughput.
kr/s and kw/s should similarly track the workload’s needs. A mismatch between the observed and expected values requires investigation. If write transactions per second (w/s) are high, yet the (kw/s) throughput is suspiciously low, there might be an application using a poor i/o pattern such as single-byte writes. This ‘tinygram’ anti-pattern would be grounds for modifying the application to use larger writes.
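The difference is easy to demonstrate. The Python sketch below, using a hypothetical 64 KiB payload, issues the same data first as thousands of single-byte writes and then as one large write; on a real system, iostat would show the first pattern as a high w/s with a tiny KB/t:

```python
import os
import tempfile

data = b"A" * 65536  # hypothetical 64 KiB payload

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name

# Tinygram anti-pattern: one kernel write call per byte.
fd = os.open(path, os.O_WRONLY)
tiny_writes = 0
for i in range(len(data)):
    os.write(fd, data[i:i + 1])
    tiny_writes += 1
os.close(fd)

# Better: a single large write retires the same payload in one transaction.
fd = os.open(path, os.O_WRONLY)
os.write(fd, data)
batched_writes = 1
os.close(fd)
os.unlink(path)

print(tiny_writes, batched_writes)  # 65536 1
```

Both versions deliver identical bytes to disk, but the first one pays per-transaction overhead 65,536 times.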
The mean time grouping (ms/r, ms/w, ms/t) reveals the drive’s record of retiring requests. The value for a single transaction should be near the stated performance of the media: under 3ms for SSDs and under 20ms for spinning disks. Large departures from these values suggest a welfare check on the specific device. qlen and %b provide a snapshot of how heavily loaded a device is; the queue length and the percentage busy are correlated with each other. The implementation of tagged command queuing and native command queuing allows a storage system to improve performance by collecting requests and optimally ordering them for execution. SSD and NVMe drives are comfortable with queue depths exceeding twenty, while spinning disks may struggle beyond eight outstanding transactions.
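Queue length, transaction rate, and latency are tied together by Little’s law: the average number of outstanding requests is roughly the arrival rate multiplied by the mean time in service. A back-of-the-envelope check with hypothetical numbers:

```python
# Little's law: qlen ~= tps * mean latency (converted to seconds).
tps = 200    # transactions per second (hypothetical)
ms_t = 20    # mean milliseconds per transaction (spinning-disk territory)

qlen = tps * ms_t / 1000
print(qlen)  # 4.0
```

If the observed qlen is persistently far above what this estimate predicts, requests are arriving faster than the device can retire them.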
If a device reveals sustained high queue lengths or high busy values, it might be the bottleneck in your workload. Investigate why that device is hot-spotted: perhaps it is an opportunity to split the work across less heavily loaded peers, or perhaps it is delaying operations. Excessive delayed transactions are an indication that the disk is heading toward failure. The drive firmware will retry operations in the hope of hiding underlying faults, but those repeated operations may be blocking all the other i/o for your workload. If the drive has poor latency, investigate with a low-level tool such as smartctl. Replacing a disk before it fails turns an emergency into routine maintenance.
iostat will repeat its output on demand with the ‘-w <seconds>’ flag. A large delay between reports produces an overview that hides bursts and troughs in favor of a broad indication of throughput. Decreasing the value below 1.0 allows finer time steps; for example, ‘-w 0.050’ will produce a report every 50 milliseconds. This resolution may be helpful if you want to see bursts of i/o or are looking for fine-grained i/o patterns. At the opposite time scale, ‘-I’ provides the cumulative totals since boot.
Reduce Cognitive Load
It’s easy to get torrents of numbers out of iostat, but resist the urge to collect more data than you need. Use filters to list only the data that is of interest. For example, the -t parameter allows you to specify the device classes you are interested in (SCSI, IDE, tty, …). By default, iostat includes the CPU and tty classes, which are interesting but not directly probative when diagnosing storage subsystem behavior. Adding ‘-d’ to the command line will mute these, as it is unlikely you are troubleshooting the teletype subsystem. iostat will also display the ‘pass’ devices associated with ‘da’ SCSI-like devices if asked for all devices, but they are not relevant to throughput or health analysis. Provide ‘-c <count>’ to limit the number of reports and prevent overloading your terminal session. Naming devices at the end of the command selects them individually; shell-style globbing patterns are not supported.
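Putting these filters together, a focused invocation might look like the following sketch (the device names are examples; adjust for your system):

```shell
# Extended per-device stats only (-d mutes tty and CPU), one-second
# samples, ten reports, limited to the two devices under investigation.
iostat -d -x -w 1 -c 10 da0 da1

# iostat itself does not glob, but the shell can expand a device list
# for you before iostat ever sees it.
iostat -d -x -w 1 -c 10 $(cd /dev && echo da[0-9])
```

The shell-expansion workaround is a convenience, not an iostat feature; double-check what the glob matched before trusting the report.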
iostat is like the top command: it provides indicators at a glance. However, a glance is insufficient to fully characterize a complex system; don’t use a single glance at top or iostat to make critical decisions.
iostat reports the statistical mean as a primary indicator. The mean is infamous for hiding outliers and blurring modal distributions. If iostat reports a value that is unusual, investigate further with a tool that produces better statistical indicators. A histogram of latency is more useful than a mean; tools such as dtrace can generate these indicators.
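A small illustration of how the mean misleads: the hypothetical latency sample below mixes fast cache hits with slow retried operations. The mean suggests a mildly slow device, while even a coarse histogram exposes the bimodal reality:

```python
# Hypothetical latency sample: 95 fast operations and 5 slow retries.
latencies_ms = [0.5] * 95 + [200.0] * 5

mean = sum(latencies_ms) / len(latencies_ms)
print(mean)  # 10.475 -- "mildly slow", hiding the 200 ms outliers

# A coarse histogram makes the bimodal shape obvious.
buckets = {"<1ms": 0, "1-50ms": 0, ">=50ms": 0}
for latency in latencies_ms:
    if latency < 1:
        buckets["<1ms"] += 1
    elif latency < 50:
        buckets["1-50ms"] += 1
    else:
        buckets[">=50ms"] += 1
print(buckets)  # {'<1ms': 95, '1-50ms': 0, '>=50ms': 5}
```

No real device has a mean latency of 10ms here; it has a fast mode and a failing mode, which is exactly what a dtrace latency histogram would reveal.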
iostat output varies in format across platforms, so interpret it in context. Other tools complement iostat’s view of the system:
- vmstat: details CPU use, interrupts, and memory
- gstat: live visual representation of GEOM-level operations
- smartctl: displays low-level statistics about disks, such as delayed operations
- systat: FreeBSD text display that can produce a live graph of storage effort with the ‘-iostat’ flag
- zpool iostat: See our article about zpool iostat https://klarasystems.com/articles/openzfs-using-zpool-iostat-to-monitor-pool-perfomance-and-health/