Tackling business management issues with OpenZFS tools

Today, let’s talk a little bit less about technology itself, and a little bit more about business management. There are a couple of key management terms that every system administrator and IT professional should know and love—RPO and RTO, which expand to Recovery Point Objective and Recovery Time Objective. 

Once we understand the meaning and importance of RTO and RPO, we will take a look at two ZFS technologies—snapshots and replication—which greatly ease their management. 

Recovery Point Objective (RPO)

After a disaster occurs, the first question asked by panicked users and managers is—do we have a backup? The second question, almost as important, is when that backup was taken—in other words, how much data will we have lost by needing to recover from that backup? 

The proper way to manage disasters is not to wait until after one happens to answer those questions, but to define them—and their answers—beforehand, in concrete policy which can be both understood and tested. The Recovery Point Objective is, simply stated, the maximum amount of data, measured in hours rather than bytes, which may acceptably be lost in the event of a full disaster recovery operation.

If an organization takes a full backup daily, this puts the RPO at around 24 hours. In the absolute worst-case scenario, the last backup completed at midnight, and a catastrophic disaster strikes at midnight the next day, losing 24 hours' worth of work. This is a little naive, of course: it does not take into account the time required to complete the backup itself.

If we look at a business backing up with traditional tape drives, a system backup might easily require two or three hours. This extends our RPO to 27 hours—now, our worst case is a catastrophic disaster at 3AM, after the current backup has begun, but just before it has finished. 

Recovery Time Objective (RTO)

There was a time when the RPO was considered the only important objective—if a disaster happened, the organization was expected to limp along without computers until the systems were available again. 

This is no longer practical for nearly any business or organization. If the IT systems are down, the workflow of the organization is not merely slowed, it's generally halted entirely. It is important to understand this impact from a business perspective, not just an IT perspective. 

Which brings us to a related question: how much do the applications themselves protect users' data? Many modern desktop applications and operating systems provide built-in file version history; most developers use revision control and are taught the mantra "commit early and often"; and many business applications run online or in an external cloud, often with a version history of their own. Data loss, in other words, is frequently mitigated at the application layer; downtime is not.

It is therefore crucially important to define the maximum acceptable amount of time that IT systems are unavailable after a disaster, in addition to the amount of data that may be lost. This is the Recovery Time Objective.

Let us say that a small engineering firm has only one server, and that all ongoing work is saved on that server, via mapped drives. What happens if the motherboard fails on that server? Let us further assume that the drives themselves are undamaged, and there is no filesystem corruption—the only issue, although we may not know it before troubleshooting, is the failed motherboard. 

It is reasonable to assume that our hypothetical firm would need a full business day to diagnose the problem effectively, acquire a replacement motherboard, install the replacement motherboard, and perhaps even reconfigure the operating system to deal with changed hardware. 

If we assume that the business is open for eight hours a day, we have lost eight hours of work. If ten employees depended on that fileserver, we can estimate the direct payroll cost of having to pay them without getting any work done. 

Let us say the average annual payroll cost for each of them is $52K. That comes out to $1,000 per week, or $200 per day (assuming five-day work weeks). With all ten of them idle for a day, the business has lost $2,000 in payroll alone—let alone office overhead, lost business opportunities, potential damage to reputation, and so forth. 

So, when it comes to disaster recovery, RTO is just as important as RPO. Different organizations will have different tolerance levels for each—but they must be understood, and they must have acceptable limits defined by policy ahead of time. 

A Simple Overview of ZFS Snapshots

The ZFS snapshot is a very compelling feature for sufficiently paranoid IT professionals and managers alike. ZFS is a copy-on-write filesystem, with a tree structure identifying all storage blocks used by the current filesystem. 

As new data is written to the filesystem, new entries are made linking the newly written blocks. As old data is deleted, the blocks those data were stored on are unlinked. When an application asks to modify existing data in-place, what really happens is the modified versions of those blocks are linked in, and the original versions are unlinked. 

When we take a snapshot, all we do is make a copy of that tree structure of used blocks. This is effectively instantaneous, since no actual data must be copied—only the tree structure itself. A single block may be referenced by any number of snapshots, and/or by the "live" filesystem itself. 

So long as any single snapshot—or the live filesystem—references a block, that block remains immutable. It only becomes "free space" available to be overwritten once the last reference to it has been unlinked. 
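You can see this block sharing directly. In a snapshot listing, the USED column shows only the space unique to that snapshot—blocks no longer referenced by anything else—while REFER shows the full size of the filesystem it captures. A quick check, with a hypothetical dataset name:

root@box1:~# zfs list -t snapshot -o name,used,refer pool1/mydata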

If we have a snapshot of a ZFS dataset or zvol available, we can browse it as a read-only filesystem just as it is. We can also roll the entire dataset or zvol back to the point in time of the snapshot—this is done simply by replacing the current uberblock (the root of that "tree structure" we discussed earlier) with the desired snapshot. 

Neither browsing nor rollbacks require any setup or execution time to speak of—even on datasets containing tens of TiB of data, a rollback can be accomplished in a few seconds. 
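For example, with a hypothetical dataset named pool1/mydata, browsing and rolling back might look like this (snapshot and path names are illustrative):

root@box1:~# zfs snapshot pool1/mydata@before-change
root@box1:~# ls /pool1/mydata/.zfs/snapshot/before-change
root@box1:~# zfs rollback pool1/mydata@before-change

The .zfs/snapshot directory—hidden by default, but still navigable—exposes every snapshot as a read-only copy of the filesystem, while zfs rollback reverts the live dataset to its most recent snapshot.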

How Snapshots Enable ZFS Replication

Now that we understand what a snapshot is—a tree of block pointers referencing used blocks in a ZFS dataset or zvol—we can understand ZFS replication as well. 

Let's say that we have a one TiB dataset, and we wish to replicate it from one ZFS system to another. The first step is to make a snapshot of that dataset. Let's say we perform the following command at precisely noon: 

root@box1:~# zfs snapshot pool1/mydataset@12pm

We now have a full snapshot of the dataset "mydataset" as it existed at the instant we took that snapshot. This snapshot is atomic, meaning it takes place between any two other operations, and all data within the snapshot is crash-consistent.

Next, we use ZFS replication to pull that snapshot from box1 to box2: 

root@box2:~# ssh root@box1 'zfs send pool1/mydataset@12pm' | zfs receive pool2/mydataset

This is a full ZFS replication. The zfs send command run on box1 packages up both a copy of the block tree for pool1/mydataset@12pm, and the contents of each of the actual blocks referenced by that snapshot. The zfs receive command, operating on box2, writes that block tree and each of the actual blocks into pool2/mydataset.

At this point, we have two exact copies of 'mydataset', each representing its exact condition at 12pm: one on pool1, stored on box1, and another on pool2, stored on box2. We can mount 'mydataset' on box2 and interact with it in any way that we would have on its original host, box1.
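If the replica doesn't mount automatically, or we'd rather keep it clear of the production path, we can give it its own mountpoint on box2 (the path here is illustrative):

root@box2:~# zfs set mountpoint=/mnt/mydataset-replica pool2/mydataset
root@box2:~# zfs mount pool2/mydataset    # if it isn't mounted already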

Unfortunately, since this dataset holds 1TiB of data, actually moving the data takes a significant amount of time. Let's assume we have a gigabit LAN supporting an average throughput of 110MiB/sec: that leaves us with a bit less than three hours needed to replicate.

This is unavoidable on a first replication, but the next time we replicate, we can do so incrementally. While the first replication was in progress, let's say we took another snapshot each hour, on the hour:

root@box1:~# zfs snapshot pool1/mydataset@1pm
root@box1:~# zfs snapshot pool1/mydataset@2pm
root@box1:~# zfs snapshot pool1/mydataset@3pm

So, our first replication finished at around 2:40 PM, and we've since acquired three more hourly snapshots. Let's assume we saved around 100MiB of data per hour. How do we get those changes from box1 to box2? 

root@box2:~# ssh root@box1 'zfs send -I pool1/mydataset@12pm pool1/mydataset@3pm' | zfs receive pool2/mydataset

Notice that we've passed box1's zfs send command the -I argument, and two snapshots rather than one—this means we're looking for an incremental snapshot stream. 

The first snapshot argument passed to zfs send -I specifies the common snapshot that the stream will be based on. The second snapshot argument specifies the last snapshot to include.

Each snapshot in between is treated as a sort of "patch set." We don't need to send the actual blocks in pool1/mydataset@12pm—but we do need to make note of any new blocks in pool1/mydataset@1pm, and any blocks which were unlinked between @12pm and @1pm. Now, zfs send -I does the same thing for newly added and newly deleted blocks in @2pm and @3pm. 

The data that zfs send -I actually produces, then, is roughly 300MiB in size. It's the 100MiB per hour which was added to mydataset, plus perhaps some instructions on blocks which should be unlinked in between. 
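If we want to verify that estimate before committing to the transfer, zfs send offers a dry-run mode which prints the estimated stream size without actually sending anything:

root@box1:~# zfs send -nv -I pool1/mydataset@12pm pool1/mydataset@3pm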

So, for this incremental replication, the 300MiB of data passes from box1 to box2 in roughly three seconds—and, building from the mydataset@12pm snapshot which box2 held in common with box1, the incremental stream is rapidly applied. Now, both box1 and box2 have mydataset@12pm, mydataset@1pm, mydataset@2pm, and mydataset@3pm. 
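Exact output will vary, but a quick listing on either machine confirms that both now hold the same chain of snapshots:

root@box2:~# zfs list -t snapshot -r pool2/mydataset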

On each system, although full browsable snapshots exist for four separate times, there are no duplicated blocks—the vast majority of 'mydataset' is referenced in all four snapshots, from @12pm to @3pm, with only a few new or unlinked blocks changing in between the four snapshots. 

Our network usage was just as efficient as our storage usage. We performed two replications covering four snapshots of the full 1TiB-and-change dataset, but rather than moving that 1TiB of data twice or even four times, the majority of it only had to cross the network once, in our first full replication.

If we continue creating 100MiB of new data per hour, we can also continue to enjoy system-to-system replication that takes only a second or so each time. Even if we have a banner hour with a full GiB of new data, we'll need only about ten seconds to replicate it, which makes not just hourly snapshots, but hourly over-the-network backups to entirely redundant hardware, not only possible but trivial.

Optimizing RTO and RPO with ZFS Snapshots and Replication

Now that we understand both the business problems and the technological resources available, we can set an effective, achievable disaster recovery policy. ZFS snapshots may be taken instantaneously, with effectively no increased load on the storage system—we can, for the most part, take them as frequently as we want. Each snapshot represents the entire dataset at the atomic instant that snapshot was taken, which greatly simplifies the question "are we backing up all the things," particularly if entire VM images are being stored on datasets or zvols.

Incremental ZFS replication places slightly higher demands on the system, but still well within reason. On a system which stores 100MiB of new data per hour, and with both storage and network capable of 110MiB/sec throughput, an hour's data may be replicated in about a second, and ten hours' data in about ten seconds.

"Soft" disasters

First, let's consider "soft disasters": errors, caused by humans or software, which corrupt data but do not affect the host system itself. This might be an application which spontaneously corrupted a file, an employee who accidentally deleted a folder of critical data, or a ransomware attack which encrypted all data on the server.

Any disaster in this category may be recovered from using ZFS snapshots alone. So, a reasonable Recovery Point Objective policy for "soft disasters" might be set to one hour. If we cause the system to take one snapshot every hour, we lose a maximum of one hour's work if we must roll back to the most recent snapshot.
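A minimal sketch of such a schedule, written as an /etc/crontab entry with our hypothetical dataset name (in practice, purpose-built tools such as sanoid or zfs-auto-snapshot handle snapshot naming and pruning for you):

# snapshot every hour, on the hour; note that % must be escaped in crontab
0 * * * * root zfs snapshot pool1/mydataset@auto-$(date +\%Y\%m\%d-\%H00)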

It's worth noting, of course, that we don't have to roll back entirely for problems with a smaller scope—perhaps a software bug caused a single important report to become corrupt. We might instead choose to "cherry-pick" that individual file out of the most recent hourly snapshot, and replace the corrupt copy on the live filesystem with the good one from the snapshot.
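Assuming a hypothetical file path, that cherry-pick is nothing more than a copy out of the snapshot directory:

root@box1:~# cp /pool1/mydataset/.zfs/snapshot/2pm/reports/q3-summary.pdf /pool1/mydataset/reports/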

In either case, we may reasonably assume that the Recovery Time Objective—how much time passes between the failure and recovery—is ten minutes or less. That ten minutes is almost entirely the time it takes a human administrator to review and assess the problem, log in, and type the commands—actually rolling back a dataset is near-instantaneous, even if many TiB of data are affected. 

With an RPO of one hour and an RTO of ten minutes, for "soft disasters" we are light years ahead of traditional tape backup systems. A traditional tape backup puts an inordinate amount of load on the storage system—even if it's only performing an incremental backup, it must traverse the filesystem looking for changed data. For many businesses, this storage load limits tape backup operations to "overnight" hours when staff are either not working, or working at lower capacity. 

Similarly, recovery from tape tends to be an arduous procedure. The tape must be mounted and catalogued, and only then may backup data be streamed back out, with the process typically taking anywhere from half an hour to several hours. Realistically, tape is probably going to offer no better than an RPO of 24 hours and an RTO of 6 hours.

"Hard" disasters

A "hard disaster" does more than merely corrupt or delete data at the application level—it takes out the entire system beneath it. Examples of "hard" disaster include motherboard or disk controller failure, hardware theft, and even site-wide disasters such as fire or flooding. 

To recover from hard disaster, we must have a replacement available for both the data and the hardware it's stored on. For ZFS, the most common solution is one or more redundant servers. For example, a business might have one production server, and one offsite disaster recovery server. 

ZFS replication is key to recovering from this category of disaster. As we saw in earlier sections, it makes hourly replication between two systems on the same network trivial. It also makes offsite replication simple and affordable.

Let us assume the same typical 100MiB/hour of new data stored that we used in earlier examples, and now consider a very inexpensive, asymmetrical internet connection with a 10Mbps upload. A daily offsite replication must move either 800MiB or 2400MiB of data, depending on how many hours per day the business is active and generating data.

Let's further assume the pessimistic case of 2400MiB of data generated per day. We do not need to spend any time traversing the filesystem "looking" for changed data, as tools such as rsync must; we can begin streaming data down the pipe immediately. Returning to our napkin math, 2400MiB of data is 19,200 Mbits, and at 10Mbps we'll move that data in about thirty-two minutes.

A thirty-two-minute backup operation for 24 hours' worth of data is easily achievable, and if the business has off-hours, we can schedule it to take place then, so as not to interfere with normal internet usage.
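A minimal sketch of that nightly job, run from the offsite server's side with hypothetical host and dataset names (in practice, tools such as syncoid automate this snapshot bookkeeping):

#!/bin/sh
# take tonight's snapshot on the production box, then pull the increment
TODAY=$(date +%Y%m%d)
YESTERDAY=$(date -d yesterday +%Y%m%d)  # GNU date; use 'date -v-1d' on FreeBSD
ssh root@prod "zfs snapshot pool1/mydataset@offsite-$TODAY"
ssh root@prod "zfs send -I pool1/mydataset@offsite-$YESTERDAY pool1/mydataset@offsite-$TODAY" | zfs receive pool2/mydataset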

Now, let's assume a "hard" disaster happens—the entire server in the office catches on fire, is stolen, or an angry ex-employee pounds it into gravel with a sledgehammer. Not only is our data safe in the offsite server, but we also have the option of simply bringing that server in and operating from it directly if necessary—the backed-up data isn't squirreled away on a tape somewhere, it's mounted and accessible directly on that hardware.

This puts our RPO at 24.5 hours—the time between our backup operations, plus the time required to complete a backup operation—and our RTO at somewhere around an hour, assuming that we're in a big hurry, our offsite server is nearby, and we're willing to bring it onsite and work with it directly.  

This isn't bad—but we might want to introduce a third server, to bring that RTO and RPO even lower for most "hard" disasters. What we've actually planned for here is a site-wide disaster, that takes out all infrastructure. What about simple hardware failure, such as a motherboard or disk controller failing? 

If we keep an onsite hot spare server as well as an offsite disaster recovery server, we can bring both RTO and RPO sharply down for this more common type of failure. We can replicate hourly to the onsite hot spare, and daily to the offsite disaster recovery system. 

Now, if we have a disaster that's isolated to the production system itself, we can simply take the production server down, and "promote" the hot spare server to production duty. For example, if our data is stored in VM images, we can boot those VMs directly on the hot spare hardware. 

Since the hot spare is on the same network broadcast domain as the production server was, when the VMs come online, users will "see" them in the same place—that is, on the same IP address—as they were before. From the users' perspective, the only thing that happened is a "time hop" backwards to the most recent hourly snapshot—no reconfiguration is necessary, things "just work." 

With a hot spare in place, and assuming a production-limited disaster, this puts the RPO at one hour, and RTO typically at ten minutes or less—assuming this sort of recovery has been practiced and planned for by IT staff, and the VMs don't take more than a few moments to boot. 

Conclusions

Using ZFS atomic snapshots and incremental replication, Recovery Time Objectives and Recovery Point Objectives can be whittled down to a tenth or less of what they are with traditional backup schemes. 

Properly planning for and managing RTO and RPO is critical to keeping businesses productive and healthy for years or decades at a time. Disasters cannot be entirely prevented, but they can be planned for and managed—and ZFS can help business owners plan for and manage disasters with far less infrastructure expense. 

This article is a relatively simple overview, and there are more features we haven't covered—including ZFS raw send, which allows a user or organization to send an encrypted snapshot stream to a remote server, without needing to send that server the key which decrypts the data. 

With ZFS encryption and raw send, it becomes possible to leverage all the convenience and utility of ZFS replication, without needing to trust the remote party with access to your data—protecting sensitive data from "insider attack" at, for example, a third-party cloud vendor serving as offsite backup target. 
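A minimal sketch, assuming an encrypted dataset named pool1/secure: the -w (raw) flag tells zfs send to transmit the blocks exactly as they sit on disk, still encrypted, so the receiving side stores them without ever holding the decryption key:

root@box1:~# zfs send -w pool1/secure@monday | ssh root@offsite 'zfs receive pool2/secure'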

The most important consideration, of course, remains the actual planning and policy creation—disaster recovery policies are not one size fits all, and different businesses will have different needs and budget constraints. But in nearly all cases, the use of ZFS snapshots and replication makes better, and more consistent, RTO and RPO possible than would otherwise be the case.

The experts at Klara can help you with your OpenZFS support, so you can focus on other areas of your business.
