
OpenZFS 2.2 was a milestone release that brought several long-anticipated features to everyone’s favorite filesystem. We’re going to talk about automatically deduplicated copy operations via the new Block Reference Table feature, also known as the BRT or File Cloning. 

The addition of the BRT feature allows OpenZFS to attack the traditionally thorny problem of deduplication from a different angle than it has in the past—by offering an API directly to traditional copy utilities like cp.

A Different Kind of Deduplication

Let’s not bury the lede—this article isn’t about traditional OpenZFS deduplication at all.  BRT is an entirely new technology added in the OpenZFS 2.2.0 release in late 2023. But since BRT copies and traditional dedup share some of the same goals, let’s talk briefly about the latter.

Historically, OpenZFS deduplication has gotten a pretty bad reputation. Although deduplication sounds like a great idea in theory, in practice it generally results in horrifically bad write performance.

This big performance roadblock is due to OpenZFS’s synchronous implementation. With dedup enabled, a write can’t complete until its hash has been checked against the hash of every block already written to the pool, which is tracked in the dedup table (DDT), so the technology just doesn’t scale well at all.
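For context, traditional dedup is an opt-in, per-dataset property, and the DDT it builds can be inspected at the pool level. A minimal sketch (reusing the dataset and pool names from our examples below; we do not enable this for the BRT tests):

# Traditional dedup is enabled per dataset:
zfs set dedup=on zroot/images/demo

# zpool status -D prints dedup table (DDT) statistics for the pool;
# every deduplicated write must consult this table.
zpool status -D zroot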

There is no such performance bottleneck inherent to performing BRT copies. OpenZFS already knows every single block you’re touching is a duplicate, so it can simply add new entries to the BRT table for them all. This action is much less storage-intensive than actually copying the blocks.

BRT copies in action on FreeBSD 14.1-RELEASE

The FreeBSD project directly incorporates OpenZFS into its default environment. This makes FreeBSD an excellent choice of operating system (OS) for those who want the freshest OpenZFS releases. That’s exactly what we did, using fresh installations of FreeBSD 14.1-RELEASE.

First, let’s see what it looks like making a traditional, non-BRT copy of a 21GiB VM image:

root@fbsd14-prod0:/zroot/images/demo # ls -lh sample.blob
-rw-r--r-- 1 root wheel 21G Dec 2 18:11 sample.blob

root@fbsd14-prod0:/zroot/images/demo # time cp sample.blob sample.blob.copy
27.94 real 0.00 user 9.39 sys

Our test platform isn’t exactly slow. With a brute-force copy operation taking 28 seconds to copy 21GiB of data, that works out to about 750MiB/sec. The impact of this operation on the system’s I/O latency was enormous, because the disks were entirely saturated for those 28 seconds.

What if we used the new BRT copying feature instead? To do so, we must first ensure it is enabled via the sysctl tunable:

root@fbsd14-prod0:~# sysctl -w vfs.zfs.bclone_enabled=1 
vfs.zfs.bclone_enabled: 0 -> 1
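That sysctl only affects the running system. If you want block cloning to stay enabled after a reboot, the usual FreeBSD mechanism applies; a minimal sketch:

# Persist the tunable across reboots via /etc/sysctl.conf:
echo 'vfs.zfs.bclone_enabled=1' >> /etc/sysctl.conf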

Now that we’ve enabled block cloning via the BRT, FreeBSD will automatically use that facility whenever copy_file_range() is called, such as when the cp command is invoked. The impact on storage load isn’t just obvious; it’s downright dramatic:

root@fbsd14-prod0:/zroot/images/demo # time cp sample.blob sample.blob.brtcopy 
        0.25 real         0.00 user         0.25 sys

We went from 28 seconds to complete, along with extreme storage and system load, to 0.25 seconds with very little impact on the underlying storage. Not too shabby!

Saving space with BRT

Of course, we are also saving quite a lot of storage space in addition to reducing storage load. Let’s look at our test system after making lots of BRT copies of that 21GiB VM image:

root@fbsd14-prod0:/zroot/images/demo # zpool list 
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT 
zroot  97.5G  44.2G  53.3G        -         -     0%    45%  1.00x    ONLINE  -

Our pool size is 97.5GiB, with 44.2GiB allocated and 53.3GiB free. Now, look at the state of a dataset with plenty of large BRT copies in it:

root@fbsd14-prod0:/zroot/images/demo # ls -lh 
total 266564531 
-rw-r--r--  1 root wheel   21G Dec  2 18:11 sample.blob 
-rw-r--r--  1 root wheel   21G Jan  4 18:53 sample.blob.brtcopy.0 
-rw-r--r--  1 root wheel   21G Jan  4 18:53 sample.blob.brtcopy.1 
-rw-r--r--  1 root wheel   21G Jan  4 18:53 sample.blob.brtcopy.2 
-rw-r--r--  1 root wheel   21G Jan  4 18:53 sample.blob.brtcopy.3 
-rw-r--r--  1 root wheel   21G Jan  4 18:53 sample.blob.brtcopy.4 
-rw-r--r--  1 root wheel   21G Jan  4 18:53 sample.blob.brtcopy.5 
-rw-r--r--  1 root wheel   21G Jan  4 18:53 sample.blob.brtcopy.6 
-rw-r--r--  1 root wheel   21G Jan  4 18:53 sample.blob.brtcopy.7 
-rw-r--r--  1 root wheel   21G Jan  4 18:53 sample.blob.brtcopy.8 
-rw-r--r--  1 root wheel   21G Jan  4 18:53 sample.blob.brtcopy.9 
-rw-r--r--  1 root wheel   21G Jan  4 18:03 sample.blob.copy 
-rwxr-xr-x  1 root wheel  141B Jan  4 18:53 test-brt 
 
root@fbsd14-prod0:/zroot/images/demo # zfs list zroot/images/demo 
NAME                USED  AVAIL  REFER  MOUNTPOINT 
zroot/images/demo   254G  43.6G   254G  /zroot/images/demo 
 
root@fbsd14-prod0:/zroot/images/demo # du -hs . 
254G	. 

As you can see, we’ve managed to fit an apparent 254GiB of data into this dataset. The entire pool, however, has less than half that much total capacity.

We can investigate a bit further by querying relevant zpool properties:

root@fbsd14-prod0:/zroot/images/demo # zpool get all | grep bclone 
zroot  bcloneused                     21.2G                          - 
zroot  bclonesaved                    212G                           - 
zroot  bcloneratio                    11.00x                         -

This pool has 21.2GiB of BRT cloned blocks, which saved 212GiB of on-disk allocatable space. Each cloned block is referenced an average of 11.00 times. In this case, it’s exactly 11 references per block: the original plus the ten BRT copies we made of that 21GiB image, none of which have diverged yet.
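As an aside, there’s no need to grep through zpool get all; the three block-cloning properties can be queried directly, and the ratio appears to work out to (bcloneused + bclonesaved) / bcloneused:

# Query only the block-cloning properties:
zpool get bcloneused,bclonesaved,bcloneratio zroot

# (21.2G + 212G) / 21.2G ≈ 11.00x, matching the reported bcloneratio.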

Notice how these numbers change a bit if we force those BRT copies to diverge by changing their data:

root@fbsd14-prod0:/zroot/images/demo # dd if=/dev/urandom bs=1M count=4096 conv=notrunc of=./sample.blob 
4096+0 records in 
4096+0 records out 
4294967296 bytes transferred in 6.502231 secs (660537437 bytes/sec) 

root@fbsd14-prod0:/zroot/images/demo # dd if=/dev/urandom bs=1M count=4096 conv=notrunc of=./sample.blob.brtcopy.0 
4096+0 records in 
4096+0 records out 
4294967296 bytes transferred in 6.479870 secs (662816872 bytes/sec) 

root@fbsd14-prod0:/zroot/images/demo # dd if=/dev/urandom bs=1M count=4096 conv=notrunc of=./sample.blob.brtcopy.1 
4096+0 records in 
4096+0 records out 
4294967296 bytes transferred in 6.654832 secs (645390745 bytes/sec)

We’ve caused sample.blob to diverge by 4GiB from each of its BRT copies by overwriting its first 4GiB of data with pseudorandom garbage. Then, we did the same to two of the BRT copies themselves. This means we needed to store 12GiB of new data in total, split between those three files. But the remainder stays deduplicated, and even that first 4GiB is still deduplicated across eight of our ten BRT copies!

root@fbsd14-prod0:/zroot/images/demo # zpool list 
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT 
zroot  97.5G  56.2G  41.3G        -         -     0%    57%  1.00x    ONLINE  -

Our pool went from 44.2GiB ALLOC to 56.2GiB ALLOC—in other words, we allocated 12GiB of new data. We see the same 12GiB delta going from 53.3GiB to 41.3GiB FREE, but that side won’t always be so clean.

What if we take a peek at zfs list?

root@fbsd14-prod0:/zroot/images/demo # zfs list zroot/images/demo 
NAME                USED  AVAIL  REFER  MOUNTPOINT 
zroot/images/demo   266G  31.6G   254G  /zroot/images/demo 

Examining the results at the dataset level, we see that although our BRT copies don’t take up extra space, they still count toward the dataset’s REFER and USED.

As always, the USED column refers to blocks occupied by both the dataset and all of its snapshots. REFER only tracks blocks occupied by the current version of the dataset.

Note: the difference between USED and REFER here is 12GiB. That's the firehose of pseudorandom data we aimed at the first 4GiB of three of our files (the original and two of its BRT copies).
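If you want to see exactly where that accounting lands, the usedby* properties break USED down into its components; a quick sketch (output omitted):

# Break USED down into its components for the dataset:
zfs get used,referenced,usedbydataset,usedbysnapshots zroot/images/demo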

BRT’s critical drawback

So far, the new BRT is awesome. We get drastically decreased storage load and faster completion times on certain frequent file operations, along with the potential to get more use out of the same amount of hardware storage. What’s not to love?

Unfortunately, BRT does have an Achilles’ heel that limits its usefulness in many environments: BRT deduplication does not survive a replication chain. BRT works by referencing the on-disk location of the original data for the additional copies, so those references can’t be translated across a replication stream. We can see this below, where we attempt to replicate our test dataset, with its new crop of BRT copies, to a partner system:

root@fbsd14-dr0:/home/jrs # ssh fbsd14-prod0 zfs send -I zroot/images/demo@ten zroot/images/demo@after-brtcopy | pv | zfs receive zroot/images/demo 
72.4GiB 0:01:54 [ 647MiB/s] [           <=>                   ] 
cannot receive incremental stream: out of space 

root@fbsd14-dr0:/home/jrs # zfs list zroot/images/demo 
NAME                USED  AVAIL  REFER  MOUNTPOINT 
zroot/images/demo  21.2G  72.3G  21.2G  /zroot/images/demo

We actually tried this before writing the extra 12GiB of data mentioned earlier. If BRT clones could be maintained across a send/receive link, the replication would have required no significant extra space and would have completed nearly instantly.

Instead, we saturated the storage on both ends for two straight minutes before ultimately crashing the receive process due to ENOSPC (not enough space on disk). This drawback will unfortunately limit some otherwise exciting potential use cases.
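Until BRT-aware replication exists, it’s worth checking how large a stream will actually be before committing to it. zfs send has a dry-run mode that estimates the stream size without transferring anything; a sketch reusing the snapshot names from the attempt above:

# -n = dry run, -v = verbose: prints the estimated stream size
# without sending any data.
zfs send -n -v -I zroot/images/demo@ten zroot/images/demo@after-brtcopy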

When to use BRT, and when not to use BRT

Since BRT clones expand when replicated, we recommend caution and careful thought before deciding to put BRT into use on production systems. Injudicious use of BRT on a production source could rapidly result in broken replication chains to a target, with no easy way to recover.

This especially means that using BRT copies for “nondestructive rollbacks” is a dangerous idea. It might be tempting to simply enable BRT and then cp a huge VM image from an earlier snapshot into its current dataset, “rolling back” the VM in production without destroying the history that accumulated since the “rollback”. But doing so would also instantly double the ALLOC of that image on the backup target, which the target might not even have the space to support.

Although using BRT in place of rollbacks is risky, there is one common storage operation which leapt to our mind as an excellent use case for it: virtual machine and/or container deployment. A workflow in which large sets of data are copied, then expected to diverge, can be seeded drastically more efficiently using BRT.

Remember our opening test, when copying 21GiB of data took nearly half a minute with a standard copy, but only 0.25 seconds with BRT copying enabled? Imagine needing to deploy ten copies of the same VM gold image:

root@fbsd14-prod0:/zroot/images/demo # cat test-brt 
#!/bin/sh 
sysctl vfs.zfs.bclone_enabled 

for i in 0 1 2 3 4 5 6 7 8 9 ; do 
time cp sample.blob sample.blob.brtcopy.$i
zpool sync
done

root@fbsd14-prod0:/zroot/images/demo # ./test-brt
vfs.zfs.bclone_enabled: 1 -> 1
        0.24 real         0.00 user         0.23 sys
        0.33 real         0.00 user         0.33 sys
        0.33 real         0.00 user         0.33 sys
        0.34 real         0.00 user         0.34 sys
        0.35 real         0.00 user         0.35 sys
        0.35 real         0.00 user         0.35 sys
        0.35 real         0.00 user         0.35 sys
        0.37 real         0.00 user         0.37 sys
        0.32 real         0.00 user         0.32 sys
        0.33 real         0.00 user         0.32 sys

Deploying those ten new VMs in about 3 seconds instead of 5+ minutes is pretty awesome. The space savings and the proportional reduction in storage stress are welcome bonuses, too.

Reasons not to use zfs clone instead of BRT

We imagine some readers are wondering why we’d advocate BRT cloned copies of a golden image, rather than a zfs clone of a “golden snapshot.”

The major problem with deploying anything using zfs clone is the permanent parent/child relationship the clone has with its source snapshot. You can never destroy that snapshot without destroying the clone along with it, even when the clone has diverged 100% from the snapshot (not a single shared block between the two remains).

Sure, you can use zfs promote to change which side of the parent/child relationship a clone is on—but you can’t break that relationship, which can get problematic and annoying.
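To make that concrete, here’s a sketch using hypothetical dataset names:

# Hypothetical names: zroot/golden holds the golden image.
zfs snapshot zroot/golden@deploy
zfs clone zroot/golden@deploy zroot/vm1

# The origin snapshot can't be destroyed while the clone exists;
# this fails with "snapshot has dependent clones":
zfs destroy zroot/golden@deploy

# zfs promote reverses which dataset depends on which,
# but the dependency itself never goes away:
zfs promote zroot/vm1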

By contrast, BRT copied files don’t have any special relationship with one another. As long as they remain largely convergent (sharing the same blocks), they even improve read workloads, since the ARC only needs to cache a single copy of any shared block. And you can destroy either file without impacting the other in any way at all.

   | For a deeper understanding of deduplication in ZFS, including the latest advances in 'fast dedup,' explore our in-depth article, Introducing OpenZFS Fast Dedup.

Unexpected behavior

A BRT copy of a file takes essentially no additional room on disk—but your system doesn’t really know that until after the copy has been made.

When we first tried making lots of BRT copies of our sample.blob in quick succession, we found this out the hard way:

root@fbsd14-prod0:/zroot/images/demo # for i in 0 1 2 3 4 5 6 7 ; do cp sample.blob sample.blob.$i ; done 
cp: sample.blob: No space left on device 
cp: sample.blob.5: No space left on device 
cp: sample.blob.6: No space left on device 
cp: sample.blob.7: No space left on device 

root@fbsd14-prod0:/zroot/images/demo # zpool list 
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT 
zroot  97.5G  56.3G  41.2G        -         -     0%    57%  1.00x    ONLINE  - 

The problem is that the system doesn’t know how much space the BRT saves until the TXGs (ZFS transaction groups) containing the BRT operations have been committed. Until its TXG commits, each pending 21GiB copy is charged against free space at its full size, so when we made several copies in a row in a tight loop, we hit ENOSPC even though plenty of free space was actually still available.

As mentioned in the prior section, we worked around that unexpected behavior easily enough by inserting a zpool sync into our loop after each BRT copy operation.
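Applied to the one-liner that failed above, the workaround looks like this:

# Same loop as before, with a zpool sync after each copy so the TXG
# commits (and the space accounting catches up) before the next copy starts:
for i in 0 1 2 3 4 5 6 7 ; do
    cp sample.blob sample.blob.$i
    zpool sync
done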

In the original version of this article, which was based on FreeBSD 14.0, we encountered a bug during testing. zpool sync was actually our second workaround attempt: our first used the sync command rather than zpool sync to force the TXGs to commit, and it exposed a nasty bug which we promptly reported:

root@fbsd14-prod0:/zroot/images/demo # ./test-brt 
vfs.zfs.bclone_enabled: 1 
        0.23 real         0.00 user         0.23 sys 
        0.32 real         0.00 user         0.32 sys

When we used sync instead of zpool sync, our system panicked on the second copy attempt. At the time, sync opened a brief window that exposed a BRT copying bug: the kernel tried to memcpy something invalid, crashing the system:

panic: vm_fault_lookup: fault on nofault entry, addr: 0xfffffe008e87a000 
cpuid = 0 
time = 1704410329 
KDB: stack backtrace: 
#0 0xffffffff80b9002d at kdb_backtrace+0x5d 
#1 0xffffffff80b43132 at vpanic+0x132 
#2 0xffffffff80b42ff3 at panic+0x43 
#3 0xffffffff80eb28b5 at vm_fault+0x15c5 
#4 0xffffffff80eb1220 at vm_fault_trap+0xb0 
#5 0xffffffff8100ca39 at trap_pfault+0x1d9 
#6 0xffffffff80fe3828 at calltrap+0x8 
#7 0xffffffff82128d57 at zil_lwb_write_issue+0xe7 
#8 0xffffffff821260db at zil_commit_impl+0x1db 
#9 0xffffffff81f889d1 at zfs_sync+0x71 
#10 0xffffffff80c30e08 at kern_sync+0xc8 
#11 0xffffffff80c30eb9 at sys_sync+0x9 
#12 0xffffffff8100d119 at amd64_syscall+0x109 
#13 0xffffffff80fe413b at fast_syscall_common+0xf8 

Uptime: 12m4s
Dumping 1338 out of 4055 MB:..2%..11%..21%..32%..41%..51%..61%..71%..81%..91%
Dump complete

We reported the bug upstream at OpenZFS #15768, and it was fixed in OpenZFS 2.2.3 and FreeBSD 14.1.

Conclusions

It’s early days for the BRT. Although some of its most exciting potential use cases might not be practical for most of us yet, it’s still a technology that offers significant improvements to many common workloads.

We advise caution about using BRT with the expectation of long-term storage space savings. You might very well experience those savings in production, but you will not experience them on any backups that don’t have their own (usually quite expensive, computationally if not monetarily) deduplication system to re-deduplicate data as it arrives. 

Even if your disaster recovery storage is vastly larger than your production storage, you need to be careful that BRT doesn’t wedge you into a corner. You might be able to BRT clone a huge image ten times in production, and you might be able to save ten fully-expanded copies in disaster recovery. But what happens when you need to restore to small, fast production from your large, slow backups? 

For a deeper understanding of ZFS architecture and storage configurations, it’s important to know how different vdev types function. Read our previous article to learn more about ZFS vdev types in detail. 
