
Storage servers, whether commercially made or custom-built, are appliances. As a result, there's often an expectation that they should simply function reliably without issues. It is easy to deprioritize, or even entirely forget about, the needs of your NAS (Network Attached Storage) when it is cleverly tucked out of sight (or, more likely, earshot). A well-used, well-designed open source FreeBSD-based NAS may operate flawlessly for several years without a hint of maintenance.[1] This is like not changing the oil in your car for 15,000 miles: it will likely continue to function, but it is a bad idea. It will almost certainly have permanent ramifications for longevity and reliability, not to mention the effects on day-to-day performance.

Just as your car needs that oil change, a laundry machine needs the occasional self-cleaning cycle, and you need a haircut, your NAS needs routine attention. Performing proactive replacements and reactive repairs from time to time will keep it healthy and reliable. The good news: proper maintenance of your FreeBSD-based NAS in almost every case takes little time, effort, or expense.

Klara recommends performing routine servicing tasks on your lightly- to moderately-tasked FreeBSD-based NAS every 4 to 8 weeks.[2] Heavily tasked servers, or a crucial office server, would naturally call for a more aggressive maintenance cycle. Maintenance involves looking, in turn, at the health of your storage devices, the filesystem(s), and your operating system and computer hardware.

Your Drives’ Physical Health

While NVMe storage is growing in popularity, most people managing open source FreeBSD NAS servers still provision main storage with traditional SATA or SAS drives designed and optimized for NAS use, and these will likely remain popular for some time. These drives contain electronics, motors, controllers, and rapidly spinning magnetic media, all in a delicate balance of precision parts. Your first line of defense is seeing to these drives' physical health.

Historically, a typical drive would see a relatively high number of failures early in its life, followed by relatively few failures from about 18 months through 4 years, and then a rapid increase in failure rates as the drive became elderly. The term for this pattern is the "bathtub curve."

Happily, the past ten years have seen several changes to what we can expect, as can be appreciated in this Backblaze report. In particular, drives are now routinely fully stress tested at the factory, eliminating many of the early-life failures. Drive technology and industrial quality control also seem to have improved, pushing end-of-life failures several additional years into the future. A NAS drive, properly deployed and maintained, is now rarely considered truly elderly until 50,000 or even 60,000 service hours. The oldest drives in my oldest FreeBSD-based server have nearly 77,000 hours on them, and there is not even a whisper of an indication that they will fail any time soon. This means that properly maintained NAS hard drives currently cost less than $5 per terabyte of raw storage per year of expected service life,[3] a number that would have been inconceivable not long ago.

Hard drive failures can be catastrophic and sudden. However, this is quite rare in typical storage servers, since their drives are rarely subject to the conditions (i.e., abuse) that often lead to catastrophic failures. Instead, failures tend to announce themselves slowly but clearly, giving you time to intercede.

Let’s look at the drives in a system with camcontrol(8), a utility present in every FreeBSD system since 1998:

root@daneel:~ # camcontrol devlist 
 	at scbus0 target 0 lun 0 (pass0,ada0) 
	at scbus2 target 0 lun 0 (pass1,ada1) 
	at scbus3 target 0 lun 0 (pass2,ada2) 
	at scbus4 target 0 lun 0 (pass3,ada3) 
	at scbus5 target 0 lun 0 (pass4,ada4) 
	at scbus6 target 0 lun 0 (pass5,ada5) 
	at scbus8 target 0 lun 0 (ses0,pass6) 
     	at scbus9 target 0 lun 1 (nda0,pass7) 
   	at scbus10 target 0 lun 1 (nda1,pass8) 

This system contains a motley mix of a PNY SSD as ada0 (presumably for boot), five Western Digital Red 4TB NAS drives as ada[1-5] (presumably the main NAS storage pool), a Crucial NVMe drive as nda0, and an Intel NVMe drive as nda1 of unknown purpose. You should have S.M.A.R.T. monitoring (smartd(8)) installed from the smartmontools package or port.[4] Let us use that to obtain some basic health information from a drive. In this case, we'll focus on ada1, though this should be done for each drive.[5] Running smartctl(8) with -A [device] will provide, among other things, a list of the S.M.A.R.T. attributes for the drive, including its hours in service and indications of impending failure. Here is an output excerpt:

root@daneel:~ # smartctl -A /dev/ada1 
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p2 amd64] (local build) 
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org 

=== START OF READ SMART DATA SECTION === 
SMART Attributes Data Structure revision number: 16 
Vendor Specific SMART Attributes with Thresholds: 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   178   163   021    Pre-fail  Always       -       6058
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       153
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   025   025   000    Old_age   Always       -       55101
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       153
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       80
193 Load_Cycle_Count        0x0032   188   188   000    Old_age   Always       -       36436
194 Temperature_Celsius     0x0022   122   113   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

While there can be variance between brands, models, and eras, the most commonly used S.M.A.R.T. attributes are fairly consistent across all hard drives. You will always have attribute #9, which indicates power-on hours: how many hours the drive has been fully powered. In a 24/7 NAS, these accrue at 8,760 hours per year. Always bear in mind how old the drives in your NAS are, and be more open to the idea of proactive replacement with every 10,000-hour milestone beyond the first 30,000 youthful hours. This drive is 55,101 hours old (a little over six years of continuous service) and is thus in "middle age".

Attribute #194 (and sometimes also a second attribute) indicates the current drive temperature,[6] in Celsius. Unless it is in heavy use (or undergoing a ZFS scrub), a properly ventilated internal drive should sit somewhere around 5 to 12 degrees Celsius above room temperature. In the room housing this example server, the air temperature is 21.1C, so the reported value of 28C is as expected. Active hard drives, hard drives with higher RPMs, or hard drives in tighter cases or configurations will run hotter. There are hundreds of opinions on the matter of acceptable hard drive temperature ranges,[7] but sensible people try to keep their NAS drives as close to ambient as they can, and most certainly below 50C, as higher temperatures, even when "in spec", prove detrimental to longevity in study after study.
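If you want a quick overview of age and temperature across several drives at once, a small shell loop does the job. This is just a sketch: it assumes a Bourne-style shell and the ada1 through ada5 device names from the devlist output above.

root@daneel:~ # for d in ada1 ada2 ada3 ada4 ada5; do
>   echo -n "${d}: "
>   smartctl -A /dev/${d} | awk '/Power_On_Hours|Temperature_Celsius/ {printf "%s=%s  ", $2, $10}'
>   echo ""
> done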

The attributes that track media degradation, such as #5 Reallocated_Sector_Ct, #196 Reallocated_Event_Count, #197 Current_Pending_Sector, and #198 Offline_Uncorrectable, will give you a warning that your hard drive's spinning media is beginning to degrade. When a hard drive is in excellent health, all of these numbers should be 0, as you see in the above case. While you can (and should) look up the meaning of these attributes, non-zero values here are usually the first sign that something may be wrong. Depending on your own tolerances, you may want to immediately replace the drive the minute any of these is non-zero (which I do). Alternatively, you might carefully monitor these numbers: if they are very small and not rising regularly, then the drive may not be in any immediate danger.

Automating S.M.A.R.T. Monitoring for Your FreeBSD-Based NAS

Ultimately, you will be relying on smartd(8) to automatically monitor the S.M.A.R.T. information for your drives. This service is very easy to configure, and a couple of lines like these in /usr/local/etc/smartd.conf are all you need:

DEFAULT -a -n never -W 0,34,39 -m you@example.com
DEVICESCAN
  • DEFAULT indicates that, unless otherwise specified, the settings apply to every device entry that follows it; the DEVICESCAN line beneath it then picks up all of the drives smartd can find.
  • -a enables a number of standard checks that any reasonable person would want for general hard drives. "Tell me if any of the crucial attributes, for example 197 and 198, go above zero" is one of them; "tell me if any self-test fails" is another.
  • -n never tells smartd to check the drive at its regular interval regardless of its power mode, without exception (this is appropriate for a 24/7 NAS).
  • -W [t1,t2,t3] sets various warning temperatures in Celsius. The first number is a temperature delta (0 means off), the second is a first warning, and the third is a more shrill warning. You should adjust these to be some comfortable amount above your expected temperatures.
  • -m, of course, is the email address that warnings should go to (consider installing ssmtp from ports or packages to enable simple forwarding of system email to, say, your Gmail).
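For completeness, here is a minimal sketch of installing, enabling, and starting the daemon, assuming you use binary packages and the stock rc.conf mechanism:

root@daneel:~ # pkg install smartmontools
root@daneel:~ # sysrc smartd_enable="YES"
root@daneel:~ # service smartd start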

The S.M.A.R.T. system also supports various types of self-test sequences for your hard drives. Primarily, the end user will be interested in three of these. The conveyance test (usually takes less than ten minutes), if your drive supports it, is typically run once upon receiving a shipped hard drive. It does a manufacturer-specified set of simple tests that verify that mechanical and electrical elements most likely to fail during shipping have not been damaged, and checks a few typical or easily damaged sectors.

A short test might take only 1-2 minutes. It checks the drive's electronics and basic controller function, and samples a few physical locations on the hard drive to check for errors. A long test thoroughly exercises all of the drive's functions and all of its surface locations. A long test will take hours, and for particularly large or slow hard drives, might take a whole day. You can trigger a test manually with a command like:

root@daneel:~ # smartctl -t long /dev/ada3
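The results of completed self-tests, including any that smartd starts on a schedule, can be reviewed afterwards in the drive's self-test log:

root@daneel:~ # smartctl -l selftest /dev/ada3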

In general, we simply configure smartd.conf to kick off short tests and long tests from time to time. The recommended frequency of such tests varies greatly depending on who is asked. Know that no one will fault you for weekly short tests and monthly or bi-monthly long tests. Here is a nice article from well-known FreeBSD expert Dan Langille that details how to configure smartd.conf for automatic tests. If any of these tests turns up something awry, emailing you about it is covered by the -a option in your smartd.conf file that we discussed earlier.
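As one concrete illustration (the schedule itself is an assumption, not taken from the linked article), adding an -s directive to the DEFAULT line shown earlier runs a short test every day at 2 AM and a long test every Saturday at 3 AM, with the DEVICESCAN line left unchanged below it:

DEFAULT -a -n never -W 0,34,39 -m you@example.com -s (S/../.././02|L/../../6/03)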

With these measures in place, the physical and mechanical health of your hard drives is monitored relatively aggressively by system tools that should warn you before things go wrong, and certainly once they actually do go wrong.

Filesystem Health

Your filesystem offers ways to check its health. We assume you are dealing with a FreeBSD-based system utilizing ZFS.[8] In ZFS, files are distributed across the storage pool, along with checksum[9] values computed for each block. Whether a block has been read correctly is easily ascertained by comparing a real-time computation of its checksum with the stored value. Unless you are using ZFS in a single-disk configuration, there should be enough data redundancy present to reconstruct any damaged data. As you read your ZFS data from day to day, it is constantly and automatically checked, refreshed, and repaired if a failure has caused the data on disk to degrade.

However, if you never read some of the data, its integrity never has any reason to be verified. And it sits there; maybe it’s fine, maybe it’s not. Over time, the electromagnetic signal of your data may degrade, or errors on the media may silently erode your data’s integrity.

If you don't touch it for some period of time (say, a few years), by then your data may be "rotten" (bit rot is the term of art) and not recoverable. The ZFS scrub is one of the central features of the ZFS filesystem. It traverses every allocated storage block in the filesystem, reads it, and verifies its contents, repairing as necessary. The scrub also verifies the parity data, which is normally only used when a problem is detected. With a scrub performed at least monthly, there is little chance of any undetected issues. You can see the last time a scrub was completed, how long it took,[10] and how many pieces of data had to be repaired (hopefully 0) with zpool-status(8):

root@daneel:~ # zpool status tank 
 pool: tank 
 state: ONLINE 
 scan: scrub repaired 0B in 06:37:00 with 0 errors on Sun Jan 14 07:52:08 2024
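For a quick yes-or-no answer across every imported pool, zpool status also accepts -x, which prints details only for pools that have a problem:

root@daneel:~ # zpool status -x
all pools are healthy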

Scrubs can be manually initiated with zpool-scrub(8):

root@daneel:~ # zpool scrub [poolname]

More commonly, they are handled with scripting. A current FreeBSD installation includes a simple script for this purpose, whose code you can peruse at /etc/periodic/daily/800.scrub-zfs, but it is not run until enabled by the user. The command grep scrub /etc/defaults/periodic.conf will give you some idea of which items must be set to run the script.[11]
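A minimal sketch of what enabling it looks like, assuming the knob names found in a recent /etc/defaults/periodic.conf (verify the exact names on your release), placed in /etc/periodic.conf:

# /etc/periodic.conf
daily_scrub_zfs_enable="YES"              # run /etc/periodic/daily/800.scrub-zfs nightly
daily_scrub_zfs_pools=""                  # empty means every imported pool
daily_scrub_zfs_default_threshold="35"    # only scrub if the last scrub is at least this many days old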

The Operating System and Other Components

Whether your server is based on FreeBSD, Linux, or another operating system, your file server will benefit from functionality and security updates and upgrades to both the kernel and userland components. While specific instructions for keeping an operating system up to date are beyond the scope of this article,[12] you cannot expect your file server and its filesystem to be operating in their most reliable and secure configuration without an up-to-date operating system. In a FreeBSD system, binary updates to operating system components (including ZFS) are handled by the freebsd-update(8) utility for RELEASE versions (STABLE users build from source instead). Generally speaking, users following the RELEASE or STABLE channels of FreeBSD on personally managed file servers should perform upgrades as they become available.
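On a RELEASE system, for example, fetching and applying the latest binary patches for your current version is a two-step affair (followed by a reboot if the kernel was updated):

root@daneel:~ # freebsd-update fetch
root@daneel:~ # freebsd-update install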

After certain operating system updates, the ZFS version may change. In general, more recent ZFS versions are backward compatible with storage pools created under previous versions, yet an upgraded pool will almost never be readable by an earlier ZFS version. When a pool's ZFS version lags behind the operating system's, zpool-status(8) will prompt you to upgrade:

  pool: zroot 
 state: ONLINE 
status: Some supported features are not enabled on the pool. The pool can 
        still be used, but some features are unavailable. 
action: Enable all features using 'zpool upgrade'. Once this is done, 
        the pool may no longer be accessible by software that does not
        support the features. See zpool-features(5) for details. 

There are ramifications to upgrading a pool's ZFS version: the process cannot be reversed. Most importantly, as we have already said, such a pool will likely no longer be readable by earlier versions of ZFS, should the pool ever be moved to another system. A little research will help you determine whether you should immediately upgrade the ZFS version of your pools, or delay it.
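When you do decide to proceed, running zpool upgrade with no arguments lists the pools that have features available but not yet enabled, and naming a pool performs the (irreversible) upgrade:

root@daneel:~ # zpool upgrade
root@daneel:~ # zpool upgrade zroot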

Other components in your system will affect your fileserver and should be maintained:

  • Case, CPU, and PSU fans should be kept clear of dust, dirt, and debris. A hotter internal temperature of your server translates immediately to a hotter temperature for your storage media, and that is never good.
  • Servers should not be suddenly powered off. It is recommended that you use, and maintain, an uninterruptible power supply for your file server. Network UPS Tools (sysutils/nut) can gracefully shut down your server when power is lost. The batteries in UPS systems are notorious for sudden failure after a couple of years.
  • Network cables often go bad. One way this manifests is that the network connection to your file server is renegotiated to a slower speed like 100 Mbps or even 10 Mbps, 10 to 100 times slower than the 1 Gbps (or faster) that is typical these days. A downgraded connection to a file server affects every system that relies on it, and you may not realize what is happening. ifconfig(8) will indicate what speed was negotiated (see the quick checks after this list). The output line media: Ethernet autoselect (1000baseT) indicates, for example, that 1000 Mbps (1 Gbps) Ethernet is active, as expected.
  • SATA or SAS drive cables can go bad or wiggle loose from their connections, and the resulting errors can masquerade as problems with the drive itself. Here is an indication that appeared in the author's logs in December 2023, after a cable had gone bad following over a decade of trouble-free use:
Dec 21 02:22:04  daneel kernel: (ada3:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 a8 e0 64 1b 40 58 01 00 02 00 00 
Dec 21 02:22:04  daneel kernel: (ada3:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error 
Dec 21 02:22:04  daneel kernel: (ada3:ahcich4:0:0:0): Retrying command, 3 more tries remain 
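Both of those last two symptoms are easy to confirm from the shell. A quick sketch, where the em0 interface name is an assumption and CAM errors are assumed to land in the standard system log:

root@daneel:~ # ifconfig em0 | grep media
root@daneel:~ # grep "CAM status" /var/log/messages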

Concluding the Open Source FreeBSD NAS Best Practices

Running your own file storage appliance can be a very rewarding experience. With modern hardware, filesystems, and operating systems, all that is needed is relatively infrequent routine maintenance. The information above is more than enough to overachieve as your own file server administrator. For those who need extra help, the internet contains a nearly limitless supply of documentation, to say nothing of the extraordinarily detailed resources published by both the FreeBSD and ZFS communities. For the personal touch, there are experts available in forums, on IRC, and on podcasts.[13]

Resources

[1] The author built a FreeBSD-based NAS for an associate who has not performed a single maintenance task, software upgrade (it still runs FreeBSD 10.1), or hardware upgrade in over seven years. Not one bit has been out of place, and every day his NAS drives dutifully show up in Windows. But that doesn't make it a good idea.

[2] Many of us FreeBSD hobbyists find time to do it several times per week out of sheer joy, but this is more often than necessary!

[3] At least in the North American market…

[4] If not, then do it. Right now. Stop reading this, and do it now.

[5] This part assumes you are using directly connected SATA drives. More complicated setups will have different ways of accessing S.M.A.R.T. data. You probably already know how to do that if you have one of those more complex setups!

[6] In certain models, this field will use its high bytes to record other temperature data in addition to the current temperature. In that case, however, the low bytes always contain the current temperature.

[7] I start getting upset at 35C, but this is considered relatively extreme.

[8] Mostly because if you are not, then almost certainly you should be.

[9] "Checksum" is used loosely in this context: most often the algorithm ZFS uses is Fletcher's checksum, which more resembles a CRC than a checksum.

[10] There are many factors that affect scrub time. For typical small users, it will be between 30 and 90 minutes per TB of stored files on spinning hard disks, depending on hardware and filesystem configuration.

[11] In a higher-use or office environment, it might be desirable to only run scrubs at, for example, 2:00 AM on Saturdays, and only if a scrub has not been run for 30 days. This requires a bit more scripting, often built into commercial appliances.

[12] But see Chapter 26 of the FreeBSD Handbook.

[13] The author recommends, for example, the 2.5 Admins podcast, which features Klara's own Allan Jude; each episode ends with "free consulting" that is more often than not relevant to someone maintaining their own file server.

More Information on Building a NAS

This article is the fourth installment in our ongoing series on building an open source NAS. To learn more about building, tuning, and configuring a NAS, see the earlier installments in the series.
