ZFS High Availability with Asynchronous Replication and zrep
Replication improves data safety while offering both high availability and disaster recovery, with RPOs and RTOs not possible with other solutions.


With the zfs send and receive commands, data can be synchronized to a replica system and kept up-to-date with incremental changes. When using such a redundant replica to provide highly available storage services, it is useful for failovers between the two systems to be quick and reliable.

To start, we discuss the advantages and disadvantages of leveraging ZFS replication to provision a highly available system and consider some alternatives.

There are a number of tools for simplifying the use of zfs send and receive, and in this article we show how to use zrep, which is particularly focused on the failover use case. We will also cover important practical ZFS send/receive considerations like keeping quotas and properties in sync, handling changes to the dataset structure, and interactions with other tools that use snapshots and holds.

Providing Highly Available Storage

For the purposes of this article, we will consider designs where separate, redundant systems are used to provide highly available storage. The focus is not on hardware redundancy within a single machine, but on full-system redundancy.

In the article Achieving RPO/RTO Objectives with ZFS we discussed the terms RPO (Recovery Point Objective) and RTO (Recovery Time Objective). These terms encapsulate the key questions of how old the backup data is in the case of a failure and how long it takes to recover systems to full operation.

Assessing acceptable values for RPO and RTO serves as part of a more formalized approach towards designing a system. While these terms are usually associated with disaster recovery, we aren’t dealing solely with what might be considered a “disaster”.
Often it is useful to be able to move services over to a standby replica to enable routine engineering work, whether that be security updates to operating systems, electrical safety tests, physical hardware moves or any of the myriad other reasons that come up when managing systems. Often such activities can be scheduled in advance to take advantage of times where the effects of any disruption are minimized. When planning for serious disasters, an online replica is not a substitute for offline data backups.

With ZFS replication, a snapshot is taken at a particular point in time and then transferred to the remote system. Subsequent runs will complete more quickly, since ZFS inherently knows what has changed, and therefore only sends the changes—so there is less data to catch up on each time.

When running successive replication jobs, a dataset with moderate levels of churn can settle down with each send taking no more than a couple of seconds. On a heavily loaded system, my typical worst case still sees datasets transferring in roughly one minute. While you should take your own measurements, this does provide a rough idea of the general orders of magnitude you might expect for the RPO metric.

For some use cases, even a one-minute RPO could be unacceptable. There are other approaches to synchronized storage that keep a replica more closely in sync. The hastd daemon on FreeBSD and DRBD on Linux take the approach of replicating data at block level to a separate machine over a network. Another approach employs mirroring across storage exported from distinct disk arrays, using protocols such as iSCSI or Fibre Channel. With these approaches, the operating system will only report writes as being completed when they are written to all replicas.

An advantage of ZFS replication is that it is fundamentally very simple. The standby system just contains your files and has no complex services that are trying to communicate back to the production system. This can pay dividends when it comes to the worst case for the RTO metric.

Complex systems fail in complex ways, and high availability clusters can work beautifully with failovers going practically unnoticed—but when they do fail, getting even a degraded service running again can be non-trivial.

It can be worth considering whether a human manually performing a failover when needed is sufficient for your requirements. With automatic failovers, you typically need quorum systems to avoid split-brain scenarios (more than one system thinks it has control) and further mechanisms to prevent services repeatedly bouncing—these mechanisms are not perfect and tend to leave ugly messes behind when they misbehave. Having a human in the loop avoids these issues.

Aside from transferring control of the data, a failover typically also involves transfer of associated services. Many services can be simply stopped on one system, and once a final sync has completed, the service can be started on the standby. Restarting a service in this way may lose local state data that isn’t persisted, much as it would for a restart on a single system. For an NFS server it is worth being aware that, if you use ZFS replication, the underlying file system ID will be different on a replica. This means that, following a failover, all mounts will be stale on client systems regardless of whether NFSv4 state data is synced.

Deploying zrep

zrep simplifies ZFS replication for the use-case we’re demonstrating. However, we’ll also discuss many concepts and aspects of ZFS that are not specific to zrep. zrep itself is not available from ports, but it only consists of a single shell script. The stable version of zrep needs ksh as its operating shell, though the newer version from GitHub can also use bash. The script needs to be installed in a directory that is included in $PATH on both systems. On FreeBSD, you may want to change the first line to point to the correct location of ksh or bash in /usr/local/bin.
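As a rough installation sketch on FreeBSD, assuming the script is fetched from the zrep GitHub repository and that bash from packages will be used as the operating shell (the URL and install location shown are assumptions to adapt):

# fetch https://raw.githubusercontent.com/bolthole/zrep/master/zrep
# sed -i '' '1s|.*|#!/usr/local/bin/bash|' zrep     # point the first line at the installed shell
# install -m 0755 zrep /usr/local/sbin/zrep         # place it in a directory on root's PATH

Repeat the installation on the second system so that both ends have the script available.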

Initially, it may also be useful to set up keys to allow password-less ssh logins for root between the two systems. This was covered in the article Introduction to ZFS Replication. You may later prefer an alternative to ssh, and perhaps use zfs allow to avoid operating as root, but a familiar tool like ssh is convenient.
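A minimal sketch of that setup, run as root on the first host and assuming sshd on fs2 permits key-based root logins, could be:

# ssh-keygen -t ed25519          # accept the defaults; optionally set a passphrase
# ssh-copy-id root@fs2           # install the public key on the standby
# ssh root@fs2 zfs list          # confirm the password-less login works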

For an initial test, we create an empty ZFS dataset and then run zrep init specifying the dataset, the remote hostname and the remote dataset:

# zfs create tank/zrep-test
# zrep init tank/zrep-test fs2 tank/zrep-test
Setting zrep properties on tank/zrep-test
Creating snapshot tank/zrep-test@zrep_000000
Sending initial replication stream to fs2:tank/zrep-test
Initialization copy of tank/zrep-test to fs2:tank/zrep-test complete
Filesystem will not be mounted

Any changes made to the dataset on the first host can now be synchronized to the second system:

# zrep sync all
sending tank/zrep-test@zrep_000001 to fs2:tank/zrep-test
Also running expire on fs2:tank/zrep-test now…
Expiring zrep snaps on tank/zrep-test

This has created a snapshot—tank/zrep-test@zrep_000001—and sent it to the other system, then finished off by clearing up old snapshots.
By default, zrep keeps the last five snapshots. You can see these with e.g. zfs list -t snapshot tank/zrep-test. The state is held in ZFS properties, so it can be listed:

# zfs get -o property,value all tank/zrep-test|grep zrep:
zrep:savecount 5
zrep:dest-host fs2
zrep:src-host fs1
zrep:src-fs tank/zrep-test
zrep:master yes
zrep:dest-fs tank/zrep-test

Various aspects of zrep can be adjusted by modifying shell variables that are initialized in the script. The script starts by sourcing /etc/default/zrep so you can either create that file or modify the script directly. Those variables that one might want to tweak are clearly marked by comments.

We’ll come to some of these variables later. For example, one variable allows additional options to be passed to zfs send. If you use ZFS compression, it typically makes sense to retain that compression by setting ZREP_SEND_FLAGS=-c. This both minimizes the size of the send stream and avoids the remote system needing to recompress the data.
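For instance, a small /etc/default/zrep might contain nothing more than the variable just discussed (check the comments in your copy of the script before relying on variable names):

# /etc/default/zrep: sourced by the zrep script at startup
ZREP_SEND_FLAGS="-c"     # pass -c to zfs send to keep blocks compressed on the wire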

Automating Syncs

In a typical deployment it is useful to automate the running of syncs. To keep the replica as up-to-date as possible, a script can run syncs continuously in a loop. With multiple datasets, running syncs in parallel further helps performance, since zfs send tends to produce stream data in bursts, which doesn't make optimal use of a network connection.
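As a minimal sketch (the dataset list and pause are illustrative, and error handling is omitted), a loop along these lines keeps several datasets syncing continuously, with the individual syncs running in parallel:

#!/bin/sh
# Continuously replicate a set of zrep-managed datasets, syncing them in parallel.
while : ; do
    for fs in tank/zrep-test tank/home tank/mail; do
        zrep sync "$fs" &
    done
    wait        # let all of the parallel syncs finish
    sleep 5     # brief pause before starting the next round
done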

If you prefer, syncs can instead be run periodically on a schedule. While zrep does use locking to prevent overlapping syncs, it is prudent to avoid the potential for them in the first place. With FreeBSD's cron, an "@" symbol followed by a number specifies an interval in seconds, measured from the completion of the previous invocation; see crontab(5). On Linux, systemd timer units are a good choice, and Solaris' SMF similarly supports periodic services.
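With FreeBSD's cron, for example, an /etc/crontab entry along these lines starts a new sync 60 seconds after the previous one has finished (the path to zrep is an assumption based on where you installed it):

# run zrep syncs, 60 seconds after the previous invocation completes
@60     root    /usr/local/sbin/zrep sync all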

Failover

To have the two servers exchange roles, we need only to run zrep failover tank/zrep-test on the current production system. This performs a final sync and then swaps the ZFS properties around. In particular, it sets the dataset to be readonly on the new standby and writable on the new production system.

In actual cases, performing a failover is usually somewhat more involved: first, services that might create snapshots or perform automatic runs of zrep sync need to be disabled on the former primary. Any network services that you want to migrate must also be configured properly on both ends, along with primary ownership of the data. You may also have a particular IP address or hostname associated with the current production system that needs to move. zrep also has facilities for handling unplanned failover scenarios where the standby system needs to take over control. This can be important if a production system fails and can't be contacted to ensure a clean handover.
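Putting the planned case together, a failover might look roughly like this sketch, once any automated sync jobs have been disabled (the NFS service, interface and address are illustrative):

(on the current production system)
# service nfsd stop                        # stop services using the dataset
# zrep failover tank/zrep-test             # final sync, then swap the master/readonly roles
# ifconfig em0 inet 192.0.2.10 -alias      # release the shared service address

(on the new production system, formerly the standby)
# ifconfig em0 inet 192.0.2.10/24 alias    # take over the service address
# service nfsd start                       # bring the service back up on the replica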

Handling Changes with Child Datasets

If you have datasets organized in a nested structure, then setting ZREP_R=-R is important to sync the datasets recursively. This will also pick up any new child datasets that are created on the production system and keep them in sync. However, if you remove a dataset, the removal will not propagate to the standby system. In this case, it is necessary to run zfs destroy on both systems.
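For example, to retire a child dataset (the name tank/zrep-test/scratch is illustrative), it has to be destroyed on each system in turn:

# zfs destroy -r tank/zrep-test/scratch           # on the production system
# ssh fs2 zfs destroy -r tank/zrep-test/scratch   # and again on the standby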

Properties and Quotas

In addition to the data comprising the content of files, a filesystem maintains metadata about the files and directories in it. With ZFS replication, this metadata is kept in sync. ZFS also maintains properties for datasets, but the situation with regard to replication of these is somewhat more involved.

The ZFS properties for datasets are maintained independently on the production and standby systems; zrep relies on this for properties like zrep:master, which only has the value "yes" on the production system. However, there are occasions where it would be convenient to keep the properties in sync.

For example, after setting refquota=1G on the production system and running zrep sync all, the property remains at its default value of “none” on the standby system. There are two pieces missing to make synchronization of this property work.

First, we need to ensure that zfs send includes properties in the stream. If you used the ZREP_R option to include child datasets recursively, then zfs send is already invoked with -R, which includes properties. Otherwise, it is necessary to add -p to ZREP_SEND_FLAGS.
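In /etc/default/zrep, that could look like the following, assuming the variable's contents are passed straight through to zfs send:

ZREP_SEND_FLAGS="-c -p"    # keep compression and include dataset properties in the stream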

The second step we need to take is to tell the standby system to use the received value for the property, by using the -S option to zfs inherit. (The command’s name is zfs inherit because it normally allows a dataset to inherit a value from its parent dataset.) Following that step—and a fresh sync from the production system—zfs get should both show the value and indicate “received” as the source for the property:

# zfs inherit -S refquota tank/zrep-test
# zfs get refquota tank/zrep-test
NAME            PROPERTY  VALUE     SOURCE
tank/zrep-test  refquota  1G        received

When failing over to the standby system, zrep leaves the properties unchanged. It is, therefore, necessary to swap the states manually. For the new standby system, that implies using zfs inherit -S, possibly also using -r to act on child datasets recursively. On the new production system, we need to exchange the received values into locally set ones.

There isn’t a simple way to do this, but you may be able to adapt the following command. This command retrieves the existing values and, where there is a value set, uses zfs set to apply the same value explicitly:

# zfs get -H -o name,received -t filesystem -r refquota tank | \
  awk '$2 != "-" { system("zfs set refquota=" $2 " " $1) }'

ZFS per-user quotas are also set as properties. For example, zfs set userquota@bob=100M tank/zrep-test limits files owned by the user bob to 100 megabytes. Similar properties exist for groups and projects. These properties behave somewhat differently: their values propagate with zfs send and, after a failover, changes propagate back. It is not even possible to set their source to received with zfs inherit -S. So it is not necessary to do anything special to keep user quotas in sync.

Interaction with Other Snapshots

By default, zrep uses the zfs send -I option to send all the intervening snapshots. As can be seen from earlier examples, zrep prefixes its snapshots with the tag “zrep” to distinguish them.

If you have other tools creating snapshots, those snapshots will also be synced to the remote system—for example, it is common to have snapshots automatically created on a fixed schedule. Such tools typically also clean up old snapshots when they are no longer needed.

However, if the snapshots are being synced to the standby system they might collect up there until a disk fills, a quota is exceeded or performance is affected by the sheer quantity of snapshots. One way to avoid this problem is to have the cleanup component of the tool also run on the standby system. This can work well where there are regular monthly, daily or hourly snapshots that are expired according to a well-defined policy.

In some cases, you may wish to set ZREP_INC_FLAG=-i in order to configure zrep not to send intermediate snapshots (those in between the common snapshot and the newest snapshot). It can also be useful to use -i temporarily when recovering after failed syncs. Leaving out the other snapshots can significantly reduce the time a synchronization takes.

ZFS also has a concept of holds which prevent snapshots from being removed. This can be useful when the same set of snapshots are used from different scripts. For example, if you have a script that runs a malware scanner or backup, it can be useful to use a snapshot rather than operate on a filesystem that is subject to continuous changes. If a hold is placed on one of zrep’s snapshots, it will be retained for as long as the hold is present and once released, a later zrep sync will expire the old snapshot. However, putting a hold on a snapshot that is in-transit can lead to zfs receive failing. For this reason it can be a good idea not to pick the very latest snapshot, but to choose a slightly older one instead. Note that holds are not included in a zfs send stream—a snapshot with a hold on it on the primary system will not get a hold automatically placed on it when replicated onto a target system.
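As a sketch of how a backup or scanning job might use holds (the hold tag and snapshot name are illustrative, and the dataset is assumed to be mounted at /tank/zrep-test), a hold is placed before the job reads from the snapshot and released afterwards:

# zfs hold backup_job tank/zrep-test@zrep_000012      # protect the snapshot while it is in use
(read the data from /tank/zrep-test/.zfs/snapshot/zrep_000012)
# zfs release backup_job tank/zrep-test@zrep_000012   # allow zrep to expire it again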

Troubleshooting

Occasionally, a replication can fail and manual intervention is required. ZFS has improved noticeably over time—problems occurred significantly more often on early Solaris than they do on OpenZFS today. A disadvantage of using a tool such as zrep is that the underlying commands are hidden in a black box. Some familiarity with the underlying workings can be very helpful when recovering from problems.

One possible scenario occurs where we’ve used a forced takeover from the standby system to deal with the production system being fully down. When the broken system does come up again, the ZFS properties will still identify it as the master and the dataset will be writable. This can be manually rectified on the formerly broken system by putting the properties into the correct state for a standby system as follows:

# zfs inherit zrep:master tank/zrep-test
# zfs set readonly=on tank/zrep-test

Any attempts to perform syncs with zrep will potentially still fail because changes have occurred to the dataset on both sides. While zrep is generally able to cope automatically, it is useful to know how to roll back the datasets on the standby system to the last successfully sent snapshot so that zrep can work again. This is a solution for nearly all failure scenarios.

For a rollback, the first step is to identify the last snapshot that was sent successfully. To keep track of which snapshots were sent, zrep sets the property zrep:sent on snapshots. On the production system, we can take advantage of this to check which snapshots were fully sent with, for example, zfs get -H -o name -r -s local zrep:sent tank/zrep-test; the last snapshot listed is likely a good choice for a rollback.
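The output is simply a list of snapshot names; the names below are illustrative and match the rollback example that follows:

# zfs get -H -o name -r -s local zrep:sent tank/zrep-test
tank/zrep-test@zrep_0000ad
tank/zrep-test@zrep_0000ae
tank/zrep-test@zrep_0000af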

With zrep, you can force a rollback on the standby system by naming a snapshot with zrep sync. Note that this won't recursively roll back nested datasets. If you're syncing child datasets, it is best to roll all of them back together. For example, you might use:

# zfs list -o name -t all -r tank/zrep-test | grep zrep_0000af | xargs -n 1 zfs rollback -R

If you resort to doing manual rollbacks, it is important to take care to only do them on the standby system. The original error messages from zfs send and receive are always included in zrep's output. If you get an error message stating that the “most recent snapshot” [on the standby system] “does not match incremental source”, consider checking that zrep is using the correct snapshot as a base for the incremental send. If this is the case, it can generally be fixed by setting or removing the zrep:sent property on snapshots. Another option is to use zfs send and zfs receive manually.
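For example, if the newest snapshot is marked as sent but never actually arrived on the standby, clearing the property lets zrep fall back to an earlier snapshot as the incremental base (the snapshot name is illustrative):

# zfs inherit zrep:sent tank/zrep-test@zrep_0000af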

Conclusion: Replication for Data Resilience

ZFS replication enables consistent backups, shorter recovery times, and improved service availability. Take advantage of all that ZFS has to offer with Klara’s ZFS support subscription. With Klara’s expert guidance, businesses can unlock unprecedented potential for stability and data protection.
