There are many different ways that media can be attached to a machine to provide storage, but in all cases the administrator needs to be able to monitor and manage those devices to ensure their health and to facilitate their replacement when they eventually fail.
In this article we will discuss some strategies and tools to make managing disk arrays on FreeBSD (and related platforms like TrueNAS Core) much easier. These concepts also apply to other operating systems, but the tools might differ slightly.
Understanding Storage Protocols
There are many types of storage devices, with the most popular being magnetic (“spinning rust”) hard drives, and solid state flash devices. These can be broken down further by the protocol used to connect them to the computer.
Serial ATA (SATA) is the familiar interface used for non-enterprise storage, and is an extension of the original ATA interface dating from the 1980s. SATA was introduced in the early 2000s, along with the Advanced Host Controller Interface (AHCI). SATA+AHCI improved data transfer speeds, simplified host communication, and added capabilities we now take for granted, such as “hot swap” and command queueing. This has long been the interface bus used by most home users to connect their hard drives, and is supported by nearly every motherboard.
Serial Attached SCSI (SAS) is the most common interface for enterprise storage, first appearing in 2004. It too was an extension of an existing interface bus, offering greatly improved performance. SAS can also support SATA devices with some limitations. SAS provides many more features than SATA does—including full duplex operations, advanced error recovery, multipath, and disk reservations. SAS disk reservations provide the ability to connect to the disk redundantly—or even across multiple machines—while ensuring it is only used by one of them at a time.
Non-Volatile Memory Express (NVMe) is a newer storage interface that is becoming very popular for flash storage devices. NVMe connects storage devices directly to the PCIe bus, offering extremely low latency and high throughput. It also overcomes one of the primary limitations of SATA and SAS: the inability to perform more than a single command at a time.
While both SATA and SAS allow multiple commands to be issued at once to the device, these commands cannot actually be executed concurrently—instead, they are queued for sequential operation. NVMe, on the other hand, supports multiple queues (often 64 queues, but the official specification allows for up to 65,536 queues), allowing many commands to run concurrently. This both greatly reduces latency and increases maximum throughput. NVMe storage comes in many form factors, from small M.2 devices to U.2 and other hot-swappable formats intended for servers.
The NVMe interface is also extensible to operate over a network (where it is known as NVMe over Fabrics, or NVMe-oF). NVMe-oF allows storage devices and arrays in remote chassis to be connected to local machines.
Other interfaces for remote storage include iSCSI, Fibre Channel, InfiniBand, RoCE, and others, but those specialized solutions are beyond the scope of this article.
Types of Disk Arrays
When building a storage system, there are many different ways the disks might be connected to the system. We are going to focus on some of the most popular for SATA and SAS drives.
AHCI Attached
For smaller numbers of drives, and for most home systems, the most common way the disks are attached is to the SATA controllers built into the motherboard. SATA disks plugged directly into the motherboard use an interface called AHCI which does not provide much in the way of advanced management features. But, if the number of ports on the motherboard is sufficient to your needs, this is the easiest way to connect the drives to the system. Often you may only have a single HDD activity LED for the entire system, with no other status information, but this is typically sufficient for builds of this scale.
Direct Attached
At somewhat larger scales, a number of drives can be connected directly to a SAS (or SATA) controller PCIe card. Some server motherboards contain an embedded controller of this type.
Direct Attached deployments require a bit more hardware and cabling. Direct attachment of disks usually requires breakout cables, which split each connector on the controller into individual SAS (or SATA) connections, one per lane (typically four lanes per connector, with two connectors per card).
It is also possible to use a direct attachment backplane. In these configurations, your system may or may not support features like individual “locate” and “fault” LEDs. When using SATA disks, direct attach may be preferred to using expanders (see the next section) as it avoids some potential problems with a failing disk causing issues for all of the drives that share the same communication lane back to the controller.
Most common direct attach controllers—such as the popular LSI 9207-8i or 9300-8i—only feature two connectors for a total of 8 lanes. An eight lane controller can only directly attach 8 disks, so connecting more drives requires additional controllers (consuming additional PCIe slots).
SAS Expanders
For chassis with larger numbers of drives, or when connecting external JBOD chassis, it is common for the drives to connect to a specialized board that provides power and routing for the SATA/SAS signals to the controller.
These special boards, called SAS Expanders, reduce the total cabling required to provide power and signal pathways to all connected disks. A typical SAS connector carries four lanes, each of which can directly attach a single drive, but with an expander up to 255 devices can share those lanes. The total throughput available to the connected disks is still limited by the number of lanes back to the controller, but this is likely the best approach in systems with more than a dozen disks.
Best Practices
Though a truism, it bears emphasizing that with a little planning, management and maintenance of storage systems can be made easier and safer. Keeping an accurate inventory of your storage media, knowing which disks are in which slots, their models and serial numbers, their remaining warranty durations, and other information of this type will save you from confusion, annoyance, and needless extra effort when a disk inevitably fails.
The first step is to map out the relationship between the physical chassis where the disks reside, and the logical devices enumerated by the operating system. Below we will discuss exactly how to do this with FreeBSD’s sesutil or the management tools for your HBA. In these examples, we are going to assume you are using ZFS (because why wouldn’t you be?).
While the operating system typically provides device aliases based on the disk’s serial number, WWN, or some other static identifier, this does not provide all of the information you might want. Professionals tend to prefer something like: e<enclosure#>s<slot#>-<serial#> as this gives us all of the information we need to locate, confirm the identity of, and replace the failing disk, while still being concise and easy to read. See the sesutil section later in this article for details on how to find this information and create such labels.
Experienced enterprise storage managers also keep extensive notes including the model number, SKU and/or URL for reordering, purchase order information, warranty end date, warranty URL, and any other useful information about each drive. Klara recommends embedding these details directly into the ZFS vdev properties of each disk—a feature Klara created, which will become generally available in the upcoming OpenZFS 2.2 release.
zpool set systems.klara:disk-model="ST8000VN004-2M21" mypool e0s06-WQP46GLG
zpool set systems.klara:disk-warranty="2024-12-01" mypool e0s06-WQP46GLG
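On a release with vdev property support (OpenZFS 2.2 or later), these values can be read back whenever you need them, for example when ordering a replacement:
zpool get systems.klara:disk-model,systems.klara:disk-warranty mypool e0s06-WQP46GLG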
You may also want to label the hot-swap bay itself with the serial number to make identification even easier—which is good practice, as long as you make sure the labels don’t impede airflow.
Another important aspect of managing your storage system is configuring notifications. If you rely on manually checking on your storage periodically, you will regret it. If you’d feel safer with a team of experts monitoring your storage, consider a ZFS Support Subscription. Configuring your system to notify you when a disk has errors, or when the filesystem reports a degraded device, will ensure your system gets prompt attention when something goes wrong.
At a minimum, configure the daily ZFS status check by adding this line in /etc/periodic.conf:
daily_status_zfs_enable="YES"
And ensure that mail directed to the local root user is forwarded to your inbox by editing the corresponding line in /etc/aliases. Once you’ve done so, you must test delivery to your “real” inbox—you don’t want to learn that delivery isn’t working after your storage has already become unavailable!
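As a minimal sketch, the relevant entry in /etc/aliases might look like this (the destination address is a placeholder):
root: storage-admins@example.com
If you are using the default sendmail MTA, run newaliases afterwards and send yourself a test message, for example echo test | mail -s "root alias test" root, to confirm it reaches your real inbox.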
You should also configure smartd to monitor your disks and send you alerts, which may give you advance notice when a drive is starting to fail.
pkg install smartmontools
service smartd enable
cp /usr/local/etc/smartd.conf.sample /usr/local/etc/smartd.conf
Then edit /usr/local/etc/smartd.conf: comment out the DEVICESCAN line and add an entry for each disk you want monitored:
/dev/ada1 -d removable -a -n never -m email@address
/dev/da0 -d removable -a -n never -m email@address
service smartd start
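Before relying on it, confirm that smartd can actually deliver mail. The -M test directive sends a one-time test message for each monitored device when the daemon starts; add it temporarily, restart smartd, and remove it once delivery is confirmed:
/dev/da0 -d removable -a -n never -m email@address -M test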
Better yet, consider configuring some kind of active monitoring with push notifications, rather than relying on often-unreliable email delivery (and your ability to keep on top of your inbox). A classic Nagios installation is ideal for this, especially when paired with the aNag app or something like PagerDuty.
With disk metadata embedded into your pool and monitoring in place to notify you when there is an issue, you can keep track of your disks and may be able to replace them before they fail, getting the most out of whatever warranty they might have.
SES
Many backplanes include support for SCSI Enclosure Services (SES). SES provides a mechanism to query information from the enclosure, including temperature, fan speed, and status of power supplies. It also provides information about each slot in the enclosure (even if empty), including a flag to indicate if the device has recently been swapped.
In addition to the above query types, SES also supports a number of commands, including activating the “locate” and “fault” LEDs if present, and the ability to individually power off drives.
Of course, all of this chassis management technology isn’t very effective without tools to make it usable. Rather than being subject to the whims of the vendor, you can use the tool built into FreeBSD, called sesutil.
SESUtil
FreeBSD’s sesutil is a tool to interface with the SES devices on your system. It features the main commands you might need: map, show, fault, locate, and status.
sesutil map
The map command displays all of the SES devices and each element (this is the nomenclature in SES) connected to them.
Looking at a few items from the output below, we can see the device names (/dev/da0 and /dev/da7 respectively) of the disks in Slot00 and Slot07. We can also see that the disk in Slot07 was recently swapped, and that Slot08 does not contain a disk and its locate LED is activated.
We can also examine the various sensors for temperature and voltage included in the backplane.
ses0:
    Enclosure Name: SMC SC846P 0c1f
    Enclosure ID: 500304801820593f
    Element 0, Type: Array Device Slot
        Status: Unsupported (0x00 0x00 0x00 0x00)
        Description: ArrayDevicesInSubEnclsr0
    Element 1, Type: Array Device Slot
        Status: OK (0x01 0x00 0x00 0x00)
        Description: Slot00
        Device Names: da0,pass0
    Element 8, Type: Array Device Slot
        Status: OK (0x11 0x00 0x00 0x00)
        Description: Slot07
        Device Names: da7,pass7
        Extra status:
        - Swapped
    Element 9, Type: Array Device Slot
        Status: OK (0x01 0x00 0x00 0x00)
        Description: Slot08
        Extra status:
        - LED=locate
    Element 60, Type: Temperature Sensor
        Status: OK (0x01 0x00 0x57 0x00)
        Description: ChipDie
        Extra status:
        - Temperature: 67 C
sesutil show
The sesutil show subcommand provides an easy to read summary of the most commonly desired information:
ses3: <LSI SAS2X28 0e12>; ID: 5003048000b40b3f
Desc      Dev    Model                  Ident      Size/Status
Slot 01   da36   ATA ST4000DM005-2DP1   ZGY0XH87   4T
Slot 02   da35   ATA ST4000DM000-2AE1   ZGY07YC3   4T
Slot 03   da34   ATA ST4000DM000-2AE1   ZGY07VS1   4T
Slot 04   da37   ATA ST4000DM000-2AE1   ZGY06NB8   4T
Slot 05   da38   ATA ST4000DM005-2DP1   ZGY1YT0C   4T,LED=locate
Slot 06   da23   ATA ST8000VN004-2M21   WQP46GLG   8T,Swapped
Labeling with GEOM Multipath
We can now use this information to label our disks. Each SAS Expander will present as a new /dev/ses# device, so your system may have more than one. FreeBSD supports a number of different ways to label the disk, depending on your use case.
If your system has multipath SAS, each disk will be present more than once, and you should use the gmultipath command both to deduplicate your disks and to label them.
gmultipath label e3s02-ZGY07YC3 da199
true > /dev/da324
This will write a GEOM Multipath label to the last sector of the disk. Running the no-op true command against the other paths to that disk causes GEOM to re-"taste" those devices, see the label, and automatically add the additional paths to the existing multipath.
You can also reboot, and GEOM will pick up the multipath when it first tastes the disks during boot. The device will be accessible as /dev/multipath/e3s02-ZGY07YC3.
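You can confirm that GEOM has picked up all of the paths at any time with:
gmultipath status
which lists each multipath device along with its component providers and their state.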
Labeling with GUID Partition Table (GPT)
If you are going to partition the disks, you can use GPT labels:
gpart create -s gpt da36
gpart add -s 4g -a 1m -t freebsd-swap da36
gpart add -a 1m -t freebsd-zfs -l e3s01-ZGY0XH87 da36
This example creates a new GPT partition scheme on da36, creates a 4 GiB swap partition aligned to 1 MiB boundaries, and then adds a ZFS partition with the label e3s01-ZGY0XH87 using the remainder of the space on the disk. That partition will now be accessible as /dev/gpt/e3s01-ZGY0XH87.
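To double-check your work, gpart show -l prints the partition table with the GPT labels in place of the partition types:
gpart show -l da36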
Labeling with GEOM Labels
Lastly, you can use the GEOM Label system, similar to multipath, to store a small chunk of data at the end of the disk to persistently identify it:
glabel label e3s06-WQP46GLG da23
That disk will now be accessible as /dev/label/e3s06-WQP46GLG.
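Running glabel status lists every active label and the device it maps to, which is a quick way to audit your labeling scheme across the whole system:
glabel status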
sesutil locate
sesutil can also be used to locate a disk in the physical array. While the SES data tells us that there is an 8 TB disk in Slot 06, it does not tell us which physical bay in the chassis corresponds to slot 06. Some chassis count from 0, others from 1, and there is not even a set standard for whether bays are numbered left to right, top to bottom, or bottom to top.
You can avoid any uncertainty by enabling the “locate” or “fault” LED for the drive you mean to replace. This greatly reduces the chance of getting it wrong when you (or the datacenter technician) physically pulls the disk.
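For example, to light the locate LED on the recently swapped 8 TB disk (da23) from the earlier sesutil show output:
sesutil locate da23 on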
To disable the locate LED that is already activated on Slot 05 from above:
sesutil locate da38 off
However, if a disk has died entirely, or a slot is empty, it might not have a device name. Unnamed devices can be specified by their specific SES device and element number. These can be found with the sesutil map command.
Note that the element number usually is different than the slot number:
sesutil fault -u /dev/ses0 9 on
This will activate the fault LED for element 9 (Slot 08) on the first SES device. (Note: some chassis have separate LEDs for fault and locate, while others use a single LED with different color or blink patterns for the two different conditions.)
sesutil status
For a quick overview, the status command can be used to tell if there is anything that requires further investigation. This makes sesutil status a great summary to connect to your monitoring system.
# sesutil status
ses0: OK
ses1: INFO
ses2: OK
ses3: CRITICAL
If we examine ses3 more closely:
# sesutil map -u /dev/ses3
    Element 38, Type: Power Supply
        Status: Not Available (0x47 0x80 0x00 0x20)
        Description: Power Supply 2
        Extra status:
        - Predicted Failure
We see it is predicting the failure of its number 2 power supply.
Whereas ses1 is just informing us that one of the locate LEDs is on:
# sesutil map -u /dev/ses1
    Element 1, Type: Array Device Slot
        Status: OK (0x01 0x00 0x02 0x00)
        Description: Slot 01
        Device Names: da44,pass50
        Extra status:
        - LED=locate
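As a minimal sketch of how sesutil status might be wired into monitoring, the script below follows the Nagios plugin convention of exiting non-zero when attention is needed (the script name and exact matching are illustrative; adapt them to your environment):
#!/bin/sh
# check_ses.sh -- illustrative wrapper around sesutil status
# Exit 0 if every enclosure reports OK or INFO, 2 (CRITICAL) otherwise.
out=$(sesutil status)
if echo "$out" | grep -qvE ': (OK|INFO)$'; then
        echo "CRITICAL: enclosure problem detected"
        echo "$out"
        exit 2
fi
echo "OK: all enclosures healthy"
exit 0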
sesutil JSON output
As with a number of tools in FreeBSD, sesutil supports outputting JSON via the libxo library. When combined with a JSON parser like jq, this can be used to automate tasks for each disk.
Consider the following example:
sesutil show --libxo json,pretty | jq '
.sesutil.enclosures[] | .enc as $enc |
.elements[] | select(.type == "device_slot" and .model != "") |
$enc + ":" + .description + ":" +
.serial + ":" + .device_names' | \
sed 's/ //g' | cut -d '"' -f 2 | \
sed -E 's/ses([[:digit:]]+):Slot([[:digit:]]+)/e\1s\2/g' | \
sh -c 'for line in $(cat -); do \
substring=${line#*:}; \
slot=${line%%:*}; \
serial=${substring%%:*}; \
disk=${line##*:}; \
gpart create -s gpt $disk; \
gpart add -t efi -s 256m -a 4k $disk; \
gpart add -t freebsd-swap -s 6g -a 1m $disk; \
gpart add -t freebsd-zfs -l $slot-$serial $disk; \
done'
This partitions each disk and labels the ZFS partition with the enclosure, slot, and serial number of the corresponding disk.
Note: each enclosure is different, and you will likely need to make minor modifications to this example pipeline before it works for your specific configuration.
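Once all of the disks are labeled this way, you can build the pool from the label paths so that zpool status itself shows the enclosure, slot, and serial number of every member. A hypothetical example using the disks from the earlier sesutil show output (the pool name and RAID layout are placeholders):
zpool create mypool raidz2 \
    gpt/e3s01-ZGY0XH87 gpt/e3s02-ZGY07YC3 gpt/e3s03-ZGY07VS1 \
    gpt/e3s04-ZGY06NB8 gpt/e3s05-ZGY1YT0C gpt/e3s06-WQP46GLG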
mpsutil / mprutil
If your system uses an LSI/Avago/Broadcom SAS controller supported by the FreeBSD mps (SAS2xxx chip) or mpr (SAS3xxx chip) driver, then you can use the corresponding tool to manage your disks even without an SES device:
# mpsutil show adapters
Device Name Chip Name Board Name Firmware
/dev/mps0 LSISAS2308 SAS9207-8i 14000700
/dev/mps1 LSISAS2308 SAS9207-8i 14000700
/dev/mps2 LSISAS2308 SAS9207-8i 14000700
If you have multiple adapters, mpsutil will default to the first logical adapter. You should specify which adapter to operate on:
# mpsutil -u 2 show all
Adapter:
mps2 Adapter:
    Board Name: SAS9207-8i
    Chip Name: LSISAS2308
Devices:
B____T    SAS Address       Handle  Parent  Device       Speed  Enc   Slot
00   11   5000cca25323147d  0009    0001    SAS Target   6.0    0001  03
00   09   5000cca253253cf1  000a    0002    SAS Target   6.0    0001  01
00   08   5000cca253254e1d  000b    0003    SAS Target   6.0    0001  00
00   15   5000cca253252a45  000c    0004    SAS Target   6.0    0001  07
00   14   5000cca2532527d1  000d    0005    SAS Target   6.0    0001  06
00   13   5000cca253252ddd  000e    0006    SAS Target   6.0    0001  05
00   12   5000cca25315a53d  000f    0007    SAS Target   6.0    0001  04
00   10   5000cca253254fd1  0010    0008    SAS Target   6.0    0001  02
Enclosures:
Slots  Logical ID          SEPHandle  EncHandle  Type
  08   500605b009d01dc0               0001       Direct Attached SGPIO
So, to activate the LED for the first disk displayed above, we first need to determine the enclosure handle number (0001), and then the slot number of the disk (03). The status field is a bitmask supporting a number of different options, but the main ones we care about are 1 (OK), and 2 (FAULTED). Set enclosure 1, disk 3 to the faulted (2) status:
# mpsutil slot set status 1 3 2
Successfully set slot status
On my system, this command produces a bright red LED lit for that slot, physically highlighting the correct drive to replace. Setting the status back to 1 returns the activity light back to normal—on this system, a blinking blue.
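Once the replacement drive is installed, set the slot back to the OK (1) status to clear the fault LED:
# mpsutil slot set status 1 3 1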
mpsutil and mprutil can also be used to upgrade the firmware on the HBA from within the FreeBSD operating system, saving you from dealing with the horror that is megacli, the hassle of creating a USB image that pretends to be a floppy disk, and/or the pseudo-MSDOS of the EFI shell.
First, create a backup of the old firmware revision:
mprutil -u 1 flash save firmware file.img
Then, flash the new version to the HBA:
mprutil -u 1 flash upload firmware SAS9300_8i_IT.bin
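After the next reboot, the controller should report the new firmware revision, which you can verify with:
mprutil -u 1 show adapter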
If you need more advanced functionality than mpsutil provides, LSI offers its native sas2ircu and sas3ircu tools for FreeBSD. Although they provide additional functionality, they are less user-friendly and require agreeing to a lengthy EULA.
Conclusion
Monitoring and maintaining your storage media is one of the most important parts of keeping your data safe.
With the tools presented here, the reader is well armed to react to failed disks and ensure that the wrong disk isn’t accidentally pulled.
With these types of best practices, we eliminate the confusion or even chaos that might cause a storage administrator to disconnect the wrong drive—which happens with disturbing frequency, and can result in degrading a redundant array beyond its ability to recover.
When dealing with critical data, you only get one chance to do it right. Experience is invaluable—so if you are unsure, consult an expert to make sure you get it right the first time, as that may well be the only chance you get.
Allan Jude
ZFS Engineering Manager and co-Founder of Klara Inc., Allan has been a part of the FreeBSD community since 1999 (that’s more than 20 years ago!), and an active participant in the ZFS community since 2013. In his free time he enjoys baking, and obviously a good episode of Star Trek: DS9.