Let’s Talk OpenZFS Snapshots
If you have been following this series, you may have already discovered how easy it is to create and manage OpenZFS snapshots. If you haven’t used snapshots yet, give them a try! We’re confident you’ll quickly wonder how you ever got along without them.
If you’re just getting started using snapshots, you’ve probably wondered: How many snapshots is too many? Since there’s no such thing as infinite storage capacity, your available disk space is an obvious limiting factor. But at what point will snapshots result in a performance hit?
Unlike some other filesystems, OpenZFS performs the same day-to-day whether a dataset has one snapshot or a thousand: reading and writing files is unaffected either way. However, the performance of administrative operations, such as listing and deleting snapshots, is affected by the number of snapshots that exist in each dataset.
If you’ve got the storage capacity, is it OK to have hundreds of snapshots? What about having thousands or tens of thousands of snapshots? In our experience, keeping too many snapshots per dataset starts to cause significant performance issues when listing, creating, replicating, and destroying snapshots.
The performance impact depends not on the total number of snapshots on the system, but on the number of snapshots on each dataset and, to some degree, on the total RAM available to the system.
A hundred datasets each with one hundred snapshots will typically see no performance impact on listing, even on relatively lightweight systems. On the other hand, a single dataset with 2000 snapshots may take many seconds—or even minutes—to return the list of snapshots.
An internet search won’t give a definitive answer to how many snapshots is too many; answers range from “don’t worry about it” to “it depends”. While that isn’t satisfying, the crux of the matter is that there is no single answer, because every storage system and workload is different.
This article introduces some questions to ask yourself. Your answers will help you better understand how you use snapshots, and you can then use that information to settle on a snapshot creation and pruning schedule that fits your needs without introducing a performance hit.
What does your storage workload look like?
In order to understand when and how often it makes sense to create snapshots, you need to understand your storage workload. Ideally, you want to create snapshots that matter and deliver the most value.
As an example: consider a web server where the content changes only when there’s a new product launch, software release for an existing product, or a periodic sweep to refresh and improve content. Ideally, you want to snapshot such a server both before and after any of these relatively infrequent changes to its data.
In this case, the number of snapshots is minimal, they are stored for a long time, and depending upon the amount of content changes, there may be quite a few differences between snapshots.
On the other hand, a file server which stores users’ home directories—or a personal workstation that you work on all day—presents a very different workload. These use cases tend to benefit from automated snapshots on a regular, frequent schedule—especially during work hours.
In this case, the server gets a lot of snapshots whose value tends to quickly diminish over time—but its frequent, regular snapshot schedule minimizes the amount of data potentially lost, despite the chaotic and somewhat unpredictable nature of when writes occur.
The key here is determining how frequently users make changes to files, how important each change is, and how long it might take to identify a problem that can be resolved by restoring data from a snapshot.
If a system administrator is making changes to config files, there is great value in keeping previous changes, at least until the changes are validated—but the admin can probably determine for themselves when to take the snapshot, and how long it should be retained.
By contrast, if a user is making continual small changes to a spreadsheet, it’s impractical to ask that user to constantly think about when each change is made, and for how long it should be captured. In this case, periodic snapshots are more appropriate—but if taken too infrequently, they may not catch a specific change the user wishes to isolate.
This brings us to the question: which applications are your users running? There are several points to consider here:
- Local snapshots can’t capture changes made in apps using remote, cloud-based storage.
- Users often rely on built-in file version history, in applications which support it—but these aren’t always completely reliable, and ZFS snapshots can make an outstanding fallback.
- Most developers use a revision system, and are taught the mantra “commit early and often”—but if the versioning system goes haywire (or an entire repository is accidentally deleted), ZFS snapshots can still save the day.
Only you can understand what applications your users are using, if they are taking advantage of built-in history/revision systems, and if they are bugging you for file restores because they aren’t using revisioning applications or keep forgetting to commit or save versions.
You also know which systems are under your control and what type of data is important enough to warrant keeping previous versions using OpenZFS snapshots.
What is the cost of storing snapshots?
If you have lots of storage capacity, the cost of archiving snapshots can be low. However, scheduled snapshots do add up.
Consider the math: taking one snapshot of a dataset every hour results in 168 snapshots per week, so six weeks on that schedule would produce 1,000+ snapshots per dataset, along with a significant performance hit for snapshot-related operations.
For this example, one would want to consider if a snapshot was needed every hour of every day, as well as when to start pruning older snapshots.
Ask yourself: is there value in keeping a snapshot of a dataset at 10:00 am and 11:00 am from 3 months ago? 1 month ago? Last week?
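The hourly-snapshot arithmetic above is easy to check. This minimal Python sketch (the function names are our own, for illustration only) counts how many snapshots accumulate on a single dataset when nothing is ever pruned:

```python
import math

HOURS_PER_WEEK = 24 * 7  # 168


def snapshots_after(weeks, per_hour=1):
    """Snapshots that pile up on one dataset if nothing is ever pruned."""
    return weeks * HOURS_PER_WEEK * per_hour


def weeks_until(threshold, per_hour=1):
    """Whole weeks of unpruned snapshots needed to reach `threshold`."""
    return math.ceil(threshold / (HOURS_PER_WEEK * per_hour))


print(snapshots_after(1))   # 168
print(snapshots_after(6))   # 1008
print(weeks_until(1000))    # 6
```

Doubling the frequency (per_hour=2) halves the time it takes to cross the same threshold.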
What is the cost of deleting snapshots?
This is the other side of the previous question. Will it be a big deal if you delete that snapshot of the filesystem at 10:00 am from 5 weeks ago? If not, how far back do you need to go to still have snapshots of value?
Perhaps your snapshots are activity-based rather than schedule-driven. If so, do you still need to access data from 3 pkg-updates ago?
Ask yourself: how much will it cost you in time and effort if a specific revision is no longer available?
When using scheduled snapshots, it’s usually a good idea to maintain multiple tiers based on period—for example, you might choose to keep 30 hourly snapshots, 30 dailies, and 3 monthlies.
This kind of staggered retention scheme reaches back about three months, just as keeping every hourly snapshot for three months would, but with only 63 total snapshots instead of 2,232. You still get three months of archive depth, at a fraction of the capacity and performance cost.
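To make the tiered scheme concrete, here is a hedged Python sketch. It assumes each tier takes its own snapshots (an hourly one every hour, a daily one at midnight, a monthly one at midnight on the 1st), and it prunes each tier to its newest N. The schedule, the start date, and the function names are hypothetical; only the 30/30/3 counts come from the example above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Retention counts from the example in the text: keep this many
# snapshots of each tier, newest first.
POLICY = {"hourly": 30, "daily": 30, "monthly": 3}


def simulate_schedule(start, hours):
    """Yield (tier, timestamp) pairs for a hypothetical schedule: an
    hourly snapshot every hour, a daily one at midnight, and a monthly
    one at midnight on the 1st of each month."""
    for h in range(hours):
        ts = start + timedelta(hours=h)
        yield ("hourly", ts)
        if ts.hour == 0:
            yield ("daily", ts)
            if ts.day == 1:
                yield ("monthly", ts)


def prune(snapshots, policy):
    """Split snapshots into (keep, destroy) lists, keeping the newest
    policy[tier] snapshots of each tier and marking the rest for deletion."""
    by_tier = defaultdict(list)
    for tier, ts in snapshots:
        by_tier[tier].append(ts)
    keep, destroy = [], []
    for tier, stamps in by_tier.items():
        stamps.sort(reverse=True)  # newest first
        limit = policy.get(tier, len(stamps))  # tiers without a policy are never pruned
        keep += [(tier, ts) for ts in stamps[:limit]]
        destroy += [(tier, ts) for ts in stamps[limit:]]
    return keep, destroy


start = datetime(2024, 1, 1)
snaps = list(simulate_schedule(start, hours=93 * 24))  # 93 days, i.e. three 31-day months
keep, destroy = prune(snaps, POLICY)

print(sum(1 for tier, _ in snaps if tier == "hourly"))  # 2232
print(len(keep))                                          # 63
```

Pruning each tier independently keeps the logic simple; tiers that are missing from the policy are left alone rather than deleted, which errs on the side of keeping data.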
How much space is being used by snapshots?
By now, you should have a better idea of what data is important to snapshot and how often you want to capture that data. Next, you’ll want to determine if you have enough storage capacity to maintain the desired number of snapshots. If capacity becomes a concern, you can decide if it is worthwhile to add more capacity or to reconsider your snapshot pruning schedule.
In the simplest terms, a snapshot doesn’t cost you any storage apart from data unique to that snapshot. Whether the parent dataset contains 1TiB of data or 1GiB, the snapshot itself is nearly “free” when first taken—it only begins to “cost” you storage as you overwrite or delete the data it captured, causing the snapshot to diverge from both newer snapshots, and the live filesystem itself.
That’s not to say that a snapshot consumes no space on disk, though—taking a snapshot forces ZFS to create a new TXG (Transaction Group), which eats a few MiB of drive space even if none of the data in the snapshot is unique. This isn’t usually enough to notice, but on a system which takes hundreds or thousands of snapshots per day, it can add up.
There is also a higher cost to more frequent snapshots, in most use cases. Twenty-four hourly snapshots of an active dataset will tend to consume much more space than a single daily snapshot covering the same time period, as the hourlies capture twenty-four times as much “churn”—ephemeral data such as lockfiles, temporary dotfiles created by applications, and so forth—as the single daily would.
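A toy model helps show why hourlies capture more churn than a single daily. This is a simplification, not how ZFS accounts for space: it tracks whole files rather than blocks, and the files, sizes, and times below are hypothetical.

```python
# Toy model only: ZFS tracks blocks, not whole files, but the effect is similar.
# Each file is (created_hour, deleted_hour, size_mib). A snapshot taken at
# hour t captures every file alive at that moment (created <= t < deleted),
# and a captured file's data stays on disk after deletion for as long as
# that snapshot exists.


def pinned_by_snapshots(files, snapshot_hours):
    """MiB of since-deleted data held on disk by the given snapshots."""
    total = 0
    for created, deleted, size in files:
        if any(created <= t < deleted for t in snapshot_hours):
            total += size
    return total


# Hypothetical day: a 100 MiB file that already existed at midnight and is
# deleted at noon, plus a 10 MiB temporary file written every hour from
# 09:30 to 16:30 and removed 45 minutes later.
files = [(-2, 12, 100)] + [(h + 0.5, h + 1.25, 10) for h in range(9, 17)]

hourly = range(24)  # a snapshot at the top of every hour
daily = [0]         # a single snapshot at midnight

print(pinned_by_snapshots(files, hourly))  # 180
print(pinned_by_snapshots(files, daily))   # 100
```

The daily snapshot never sees the short-lived temporary files, while the hourlies each catch one, so the hourly schedule pins an extra 80 MiB of churn in this example.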
To get an idea of how much space your existing snapshots consume, start by running zfs list -o space, where space is a shortcut that selects a set of space-accounting columns. Here is a snipped example of the tank pool on my laptop:
```
zfs list -o space
NAME               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
tank                270G  69.2G         0     88K              0      69.2G
tank/ROOT           270G  44.4G         0     88K              0      44.4G
tank/ROOT/mar26     270G  41.6G     18.7G   23.0G              0          0
tank/usr/home/dru   270G  4.34G     1.17G   3.17G              0      2.36M
```
The columns in this listing contain this information:
- NAME: the name of the filesystem (pool or dataset)
- AVAIL: available storage capacity
- USED: total amount being used, equal to the sum of USEDSNAP, USEDDS, USEDREFRESERV, and USEDCHILD (as with any filesystem, OpenZFS performance will start to suffer when it gets close to capacity; typically you want to stay below 80%, and consider adding more capacity as the system starts to approach 90%)
- USEDSNAP: amount consumed by snapshots of this filesystem
- USEDDS: amount being used by this filesystem itself, not counting its snapshots or children
- USEDREFRESERV: amount of space used by a refreservation set on this filesystem (a refreservation guarantees the filesystem a minimum amount of space)
- USEDCHILD: amount being used by children of this filesystem
In this example, there is still plenty of storage capacity on this system. It is interesting to note that over 25% of the space used by dru’s home directory (1.17G of 4.34G) is consumed by snapshots.
On a system with many snapshots, this type of listing gives a quick glance of which filesystems are consuming the most snapshot space as well as an overall view of how much capacity is still available on the specified pool.
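On a pool with many datasets, a small script can rank them by snapshot usage. This Python sketch is an illustration, not a finished tool. It assumes the zfs CLI is on the PATH and uses tank only because that is the pool in the example. It relies on zfs list’s -H (tab-separated, no header), -p (exact byte values), -r (recursive), and -t options, plus the usedbysnapshots property. The helper names are hypothetical.

```python
import subprocess


def read_space(pool="tank"):
    """Return zfs list output lines of 'name<TAB>used<TAB>usedbysnapshots'
    (in bytes) for every filesystem and volume in `pool`."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-p", "-r", "-t", "filesystem,volume",
         "-o", "name,used,usedbysnapshots", pool],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()


def snapshot_share(lines):
    """Parse tab-separated 'name, used, usedbysnapshots' lines and return
    (name, snapshot_bytes, percent_of_used) tuples, largest snapshot usage first."""
    rows = []
    for line in lines:
        name, used, usedsnap = line.split("\t")
        used, usedsnap = int(used), int(usedsnap)
        pct = 100.0 * usedsnap / used if used else 0.0
        rows.append((name, usedsnap, pct))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

A typical call would be snapshot_share(read_space("tank")). For dru’s home directory in the listing above, the same calculation gives 1.17G / 4.34G, or roughly 27%.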
You can also zero in on a particular dataset. The previous command used zfs list to show a set of space columns for every dataset in the pool, while this one uses zfs get to retrieve a single property of a single dataset.
This time, I’ll get the usedbysnapshots property of my home directory dataset:
```
zfs get usedbysnapshots tank/usr/home/dru
NAME               PROPERTY         VALUE  SOURCE
tank/usr/home/dru  usedbysnapshots  1.17G  -
```
As expected, the space used by snapshots matches the 1.17G seen in the previous listing.
While the usedbysnapshots property gives an idea of how much space is consumed by snapshots, as well as how much space would be freed if all the snapshots in a dataset were destroyed, it does not indicate how much space you’ll get back if you start pruning only some of the snapshots.
Due to its copy-on-write (COW) nature, OpenZFS can’t free blocks that are still being referred to: a block cannot be freed until neither the live filesystem nor any snapshot references it.
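A toy model makes this concrete. Destroying one snapshot frees only the blocks unique to it, which is what its USED value reports. Blocks shared by two snapshots are freed only when both are gone. That is why the USED values of the snapshots in the listing below add up to roughly 207M (176M + 12.6M + 18.1M), well short of the 1.17G usedbysnapshots total. In this sketch, the snapshot names, block IDs, and function name are hypothetical, and real OpenZFS does not store explicit sets like these.

```python
# Toy model only: each snapshot, and the live dataset, is represented as the
# set of block IDs it references.


def reclaimable(destroy, snapshots, live):
    """Return the blocks freed by destroying the snapshots named in `destroy`:
    blocks they reference that neither the live dataset nor any surviving
    snapshot still references."""
    still_referenced = set(live)
    for name, blocks in snapshots.items():
        if name not in destroy:
            still_referenced |= blocks
    freed = set()
    for name in destroy:
        freed |= snapshots[name] - still_referenced
    return freed


# Hypothetical history: blocks 3 and 4 were overwritten in the live dataset
# after snapshot "b" was taken, so only "a" and "b" still reference them.
snapshots = {
    "a": {1, 2, 3, 4, 5},
    "b": {3, 4, 5, 6},
}
live = {5, 7}

print(len(reclaimable({"a"}, snapshots, live)))       # 2  (blocks 1, 2)
print(len(reclaimable({"b"}, snapshots, live)))       # 1  (block 6)
print(len(reclaimable({"a", "b"}, snapshots, live)))  # 5  (blocks 1, 2, 3, 4, 6)
```

Destroying both snapshots frees five blocks, more than the two plus one that each snapshot’s USED value would suggest on its own.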
As an example, I’ll create a listing that shows the NAME, WRITTEN, REFER, and USED columns (in that order) of just the snapshots in my home directory:
```
zfs list -t all -o name,written,refer,used | grep dru@
tank/usr/home/dru@test-backup  2.71G  2.71G   176M
tank/usr/home/dru@homedir       176M  2.71G  12.6M
tank/usr/home/dru@homedir-mod  18.5M  2.71G  18.1M
```
The written property is useful for understanding snapshot growth. For a snapshot, it represents the amount of referenced space written to the dataset between the previous snapshot and this one; for the oldest snapshot, that is everything it references, which is why test-backup shows the same value for WRITTEN and REFER. The used column indicates how much of the data is unique to that snapshot; in other words, how much space will be freed if that particular snapshot is deleted.
Performing a verbose dry-run (-nv) will show the amount of space that would be reclaimed by destroying the specified snapshot. The amount will match the used column seen in the listing above:
```
zfs destroy -nv tank/usr/home/dru@test-backup
would destroy tank/usr/home/dru@test-backup
would reclaim 176M
zfs destroy -nv tank/usr/home/dru@homedir
would destroy tank/usr/home/dru@homedir
would reclaim 12.6M
zfs destroy -nv tank/usr/home/dru@homedir-mod
would destroy tank/usr/home/dru@homedir-mod
would reclaim 18.1M
```
Putting it all together
Understanding which data benefits from being in a snapshot and how long it makes sense to keep snapshots will help you get the most out of OpenZFS snapshots. Pruning snapshots to just the ones you need will make it easier to find the data you want to restore, save disk capacity, and prevent performance bottlenecks on your OpenZFS system.