OpenZFS: Understanding Transparent Compression

Transparent (inline) configurable compression is one of OpenZFS' many compelling features, but it is also one of the most frequently misunderstood. Although ZFS itself currently defaults to leaving compression off, we believe you should almost always use LZ4 compression on your datasets and zvols. This idea is so widely accepted that the default compression setting is likely to change in a future release of OpenZFS.

Compression is a per-dataset (or per-zvol) property, not a pool-level feature. This means that you can configure compression differently for groups of data in the same pool. For example, you might use zfs set compression=lz4 tank to set an inheritable default for the pool, but then use zfs set compression=gzip tank/textfiles to get better compression ratios in a particular dataset which only contains highly compressible data.

ZFS has no problem dealing with a mix of uncompressed, LZ4 compressed, and gzip compressed records sitting side by side in the same dataset. ZFS does not need to immediately rewrite existing data if the setting changes.

After changing the compression value of a dataset or zvol, you may either let things continue as they are—in which case old data is stored under the old compression scheme, but new data is written using the new one—or you may recopy the existing data, then delete the original copies.

So, operating ZFS compression is simple—but before we move on, we should talk a little bit about how it actually works.

To do that, we'll need to dive into some low-level basics about ZFS read and write operations.

Explaining ashift and recordsize / volblocksize

The two most important tunables for ZFS storage are ashift, a per-vdev option which should be set to match the actual hardware block size of the underlying devices, and recordsize (or volblocksize, if we're talking about zvols).

recordsize is a bit more difficult to explain than ashift. The smallest individually-operable block of data on a ZFS dataset is one record. This means that when you change a single byte of data within one record, ZFS makes a new copy of the entire record with your one byte change, and writes that newly modified record to disk.

With the newly modified record written to disk, ZFS unlinks the old record from the current version of the filesystem and replaces it with a link to the newly-written, modified record. ZVOLs work essentially the same way, only with volblocks—whose size is controlled by the volblocksize property—instead.
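The copy-on-write behaviour described above can be sketched with a toy model. This is only an illustration, not how ZFS is implemented: the names RECORDSIZE and modify_byte are hypothetical, a Python list stands in for the filesystem's links to its records, and immutable bytes objects stand in for records on disk.

```python
RECORDSIZE = 128 * 1024  # bytes per record; assumes recordsize=128K


def modify_byte(records, offset, value):
    """Return a new record list with one byte changed, copy-on-write style.

    `records` is a list of immutable `bytes` objects standing in for on-disk
    records. The old record is never modified in place; only the list entry
    (the "link") for the affected record is replaced.
    """
    index, within = divmod(offset, RECORDSIZE)
    old = records[index]
    # Build a complete new copy of the record containing the one-byte change.
    new = old[:within] + bytes([value]) + old[within + 1:]
    updated = list(records)   # the new "current version" of the file
    updated[index] = new      # relink to the newly written record
    return updated


# A three-record file filled with zero bytes.
before = [bytes(RECORDSIZE) for _ in range(3)]
after = modify_byte(before, RECORDSIZE + 5, 0xFF)

print(len(after[1]))              # size of the rewritten record
print(after[0] is before[0])      # untouched records are shared, not copied
print(before[1][5], after[1][5])  # old record unchanged, new record modified
```

Changing one byte produces a full new record, while the records on either side are simply reused by the new version of the file.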

One final piece of the puzzle: ZFS can write undersized records if necessary. If you save a 1KiB text file in a dataset with recordsize=1M set, it does not consume the entire 1MiB maximum record size; your file is saved in an undersized, single-block record. With ashift=12, corresponding to a hardware block size of 2¹² bytes (4KiB), that means your 1KiB file takes up 4KiB on disk, not 1MiB!
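As a rough sketch of that rounding, the helper below (a hypothetical name, not a ZFS API) rounds a record's data up to whole hardware blocks. It assumes the simplified model used in this article and ignores metadata, parity, and padding.

```python
def allocated_bytes(data_bytes, ashift):
    """Space one record's data occupies, rounded up to whole blocks."""
    block = 1 << ashift                        # ashift=12 -> 4096-byte blocks
    blocks = max(1, -(-data_bytes // block))   # ceiling division, at least one block
    return blocks * block


print(allocated_bytes(1024, ashift=12))   # 1KiB file on 4KiB blocks -> 4096
```

The recordsize does not appear in the calculation at all: for a file smaller than one record, only the block size set by ashift determines the space used.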

How OpenZFS Compression Works

Now that we understand the difference between ashift and recordsize (or volblocksize), we can more easily discuss how compression works. Let's say you've got a dataset with OpenZFS' current default of recordsize=128K, sitting atop a pool constructed of vdevs with ashift=12.

Our hardware blocksize is 4KiB, and our desired maximum record size is 128KiB - so a typical record, uncompressed, will be stored in thirty-two 4KiB blocks. What if we've got compression=lz4 set, and the LZ4 algorithm can achieve a raw 1.32 compression ratio, compressing that 128KiB of data down to 97KiB?

You need twenty-five total 4KiB blocks to store 97KiB of data - so while we're still looking at a single 128KiB record, that record now occupies twenty-five blocks on disk, not thirty-two—an on-disk compressratio of 1.28.

Assuming we've correctly set ashift to match our underlying hardware's specifications, that means fetching the record takes 25 block reads instead of 32, or about 22% fewer. That 22% saving means lower latency when reading or writing the record, and better overall throughput if it's one of an entire stream of records we're reading or writing at once.
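The arithmetic in this example can be reproduced in a few lines. The constant names are illustrative only, and the 1.32 raw ratio is the figure assumed in the text rather than a measurement.

```python
import math

RECORD_KIB = 128   # recordsize=128K
BLOCK_KIB = 4      # ashift=12 -> 2**12 bytes = 4KiB
RAW_RATIO = 1.32   # raw LZ4 ratio assumed in the text

uncompressed_blocks = RECORD_KIB // BLOCK_KIB
compressed_kib = RECORD_KIB / RAW_RATIO
# A partially filled block still costs a whole block, so round up.
compressed_blocks = math.ceil(compressed_kib / BLOCK_KIB)

on_disk_ratio = uncompressed_blocks / compressed_blocks
blocks_saved = 1 - compressed_blocks / uncompressed_blocks

print(f"{compressed_kib:.0f}KiB -> {compressed_blocks} blocks")
print(f"on-disk ratio {on_disk_ratio:.2f}x, {blocks_saved:.1%} fewer blocks")
```

The gap between the raw ratio (1.32) and the on-disk ratio (1.28) comes entirely from rounding the compressed size up to the next whole block.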

Higher recordsize Means Better Compression Ratios

The larger your records or volblocks are, the better compression ratios you'll get. The first reason is that the compression dictionary is per-record, so larger records can compress more efficiently.
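A small experiment can illustrate the effect of compressing each record independently. This sketch makes two loud assumptions: Python's standard zlib module stands in for LZ4 (which is not in the standard library), and fixed-size chunks of a synthetic byte string stand in for records. The word list and sizes are arbitrary, and the absolute ratios will not match what ZFS reports.

```python
import random
import zlib


def compressed_total(data, chunk_size):
    """Compress each chunk independently and sum the compressed sizes."""
    return sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )


# Deterministic, moderately compressible sample data (hypothetical word list).
rng = random.Random(0)
words = [b"pool", b"vdev", b"dataset", b"record",
         b"block", b"snapshot", b"zvol", b"ashift"]
data = b" ".join(rng.choice(words) for _ in range(200_000))

small = compressed_total(data, 4 * 1024)     # many small "records"
large = compressed_total(data, 128 * 1024)   # fewer, larger "records"

print(f"4KiB chunks:   {len(data) / small:.2f}x")
print(f"128KiB chunks: {len(data) / large:.2f}x")
```

Because every chunk starts with no knowledge of the data before it, the smaller chunks pay the start-up cost far more often and end up with a lower overall ratio.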

The second reason, and the more important one, is the relationship between ashift and recordsize. Remember, we store each record in blocks whose size is determined by ashift, and if compression can't decrease the number of blocks required to store the record, you won't get any compression on that record.

To demonstrate this, let's examine a dataset with recordsize=16K sitting on a pool with vdevs set to ashift=13, or 8K hardware block size. If LZ4 gives us the same 1.32 raw compression ratio that we had in the last example on each record, we get no compression at all! Wait, what?

16KiB of data, compressed at a raw ratio of 1.32, shrinks to roughly 75.8% of its original size: about 12.1KiB. Unfortunately, 12.1KiB of data still requires two 8KiB hardware blocks on disk, which would result in an on-disk compressratio of 1.00x, meaning no savings at all. So, ZFS won't actually store this record compressed, since there would be nothing to gain.

It is still possible to achieve usable on-disk compression with recordsize=16K and ashift=13, of course—but only for individual records which can achieve raw compression ratios of 2.00 or better.
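The same block-rounding model shows why 2.00 is the threshold here. The function below is a hypothetical helper, not part of ZFS. It assumes a full-sized record and the simplified rule from this section: a record is only stored compressed if that saves at least one whole block.

```python
import math


def on_disk_ratio(recordsize, ashift, raw_ratio):
    """On-disk ratio for one full record under the block-rounding model."""
    block = 1 << ashift
    full_blocks = recordsize // block
    compressed_blocks = math.ceil(recordsize / raw_ratio / block)
    # If compression saves no whole block, the record is stored uncompressed.
    return full_blocks / min(full_blocks, compressed_blocks)


print(on_disk_ratio(16 * 1024, 13, 1.32))   # still two 8KiB blocks
print(on_disk_ratio(16 * 1024, 13, 1.99))   # just short of fitting one block
print(on_disk_ratio(16 * 1024, 13, 2.00))   # fits in a single 8KiB block
```

With only two blocks per record, the on-disk ratio can only ever be 1.00x or 2.00x for any individual full-sized record.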

Compression Ratios on Small recordsize Datasets

Consider a MySQL database, which we want to store on a pool with vdevs set to ashift=13. Many commonly used high-performance SSDs perform best with this high ashift value, and MySQL itself defaults to a tablespace page size of 16KiB.

This leaves us where we were in the previous section—hardware block size 8KiB, and recordsize of 16KiB. But in this scenario, we often see an on-disk compressratio higher than 1.00x, but (much) lower than 2.00x.

Although the MySQL database's data may only be compressible to 1.32x when considered as a whole, individual 16KiB chunks of it may very well achieve 2.00x or better raw compression. Any individual record which compresses at 2.00x or better can be stored in a single 8KiB block.

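So, if we have a 1GiB MySQL database in a dataset that reports a 1.1x compressratio, we can estimate how many records hit that mark. Uncompressed, the database's 65,536 records of 16KiB each would occupy 131,072 blocks of 8KiB; a 1.1x ratio means only about 119,000 blocks were actually used. The difference, roughly 12,000 records (a little under one in five), compressed to 2.00x or better and could therefore be stored in single 8KiB hardware blocks.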

On the other hand, what if we had recordsize=8K and ashift=13? In this case, no on-disk compression is possible - because an uncompressed record already only requires a single block, and therefore can't be stored in fewer blocks, no matter how high its compressratio.
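To compare the configurations discussed so far, the sketch below computes the smallest raw ratio at which a full-sized record would save at least one block. The function name is hypothetical, and the model is the simplified one used in this article; a real implementation may apply additional thresholds that are not modelled here.

```python
def min_useful_raw_ratio(recordsize, ashift):
    """Smallest raw ratio that saves at least one block, or None if impossible."""
    blocks = recordsize // (1 << ashift)
    if blocks <= 1:
        return None   # already a single block; nothing left to save
    # The record must shrink enough to fit in (blocks - 1) blocks.
    return blocks / (blocks - 1)


for recordsize, ashift in [(128 * 1024, 12), (16 * 1024, 13), (8 * 1024, 13)]:
    ratio = min_useful_raw_ratio(recordsize, ashift)
    label = "no savings possible" if ratio is None else f"{ratio:.3f}x"
    print(f"recordsize={recordsize // 1024}K ashift={ashift}: {label}")
```

The more blocks a record spans, the lower the bar compression has to clear before it starts saving space.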

ZFS Compression, Incompressible Data and Performance

You may be tempted to set compression=off on datasets which primarily hold incompressible data, such as folders full of video or audio files. We generally recommend against this. For one thing, ZFS is smart enough not to keep trying to compress incompressible data, and it never stores data compressed if doing so wouldn't save any on-disk blocks.
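The "store compressed only if it saves blocks" decision can be sketched as follows. This is an illustration under stated assumptions rather than ZFS's actual code path: zlib stands in for LZ4, the store function and its return values are hypothetical, and the rule is the simplified one described above with recordsize=128K and ashift=12.

```python
import math
import random
import zlib

RECORDSIZE = 128 * 1024
BLOCK = 1 << 12   # ashift=12 -> 4KiB blocks


def blocks(n_bytes):
    """Whole blocks needed to hold n_bytes."""
    return math.ceil(n_bytes / BLOCK)


def store(record):
    """Return ("compressed" or "uncompressed", blocks used) for one record."""
    packed = zlib.compress(record)   # zlib stands in for LZ4 here
    if blocks(len(packed)) < blocks(len(record)):
        return "compressed", blocks(len(packed))
    # No whole block saved, so keep the original bytes.
    return "uncompressed", blocks(len(record))


rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(RECORDSIZE))   # incompressible
text = (b"highly compressible text\n" * RECORDSIZE)[:RECORDSIZE]

print(store(noise))
print(store(text))
```

The random record is kept as-is at its full thirty-two blocks, so the only cost of leaving compression enabled for such data is the compression attempt itself.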

You might still not be convinced - so far, all we're talking about is mitigating a performance decrease. Isn't it better to avoid the decrease in the first place? Probably not. The performance "penalty" for working with incompressible data is minuscule, and the potential gains in working with compressible data are much larger.

root@lab:/data/test# pv < in.rnd > lz4-compressed/out.rnd
7.81GB 0:00:22 [ 359MB/s] [==================================>] 100%

root@lab:/data# zfs get compressratio data/test/lz4-compressed
NAME                      PROPERTY       VALUE  SOURCE
data/test/lz4-compressed  compressratio  1.00x  -

In the above example, we take previously-generated incompressible data stored on one high-performance SSD, and write it to another high-performance SSD, with compression=lz4 and the default recordsize=128K. The source data is in the system cache, so the write speed is the only bottleneck here.

The data streams at 359MiB/s—but would it go any faster, if written to a dataset with compression=off?

root@lab:/data# pv < in.rnd > uncompressed/out.rnd
7.81GB 0:00:21 [ 378MB/s] [==================================>] 100%

This time, we got 378MiB/sec. So while we did see a performance penalty—(378-359)/378=5.0%—it's certainly not an enormous one. Now, what kind of benefits can we see in real-world scenarios, with a mix of compressible and incompressible data?

root@lab:/data/test# pv < win2012r2-gold.raw > lz4/win2012r2-gold.raw
8.87GB 0:00:17 [ 515MB/s] [==================================>] 100%

root@lab:/data# zfs get compressratio data/test/lz4-compressed
NAME                      PROPERTY       VALUE  SOURCE
data/test/lz4-compressed  compressratio  1.48x  -

In the above example, we're saving a Windows Server 2012 R2 gold image to an LZ4 compressed dataset. The gold image is a fully-installed Windows Server 2012 R2 VM, with drivers and a few client-specific applications installed, which has then been sysprepped for deployment to new virtual machines.

This is obviously not a "perfect" candidate for compression, like a directory full of text files. But it still achieved a 1.48x compression ratio—and a whopping (515-378)/378=36% performance improvement.
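Both percentages are measured against the same baseline, the 378MB/s uncompressed write. A minimal sketch of that calculation, using the throughput figures from the pv output above (the helper name is illustrative):

```python
def relative_change(new, baseline):
    """Fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


UNCOMPRESSED = 378       # MB/s, incompressible data, compression=off
LZ4_INCOMPRESSIBLE = 359 # MB/s, incompressible data, compression=lz4
LZ4_GOLD_IMAGE = 515     # MB/s, Windows gold image, compression=lz4

penalty = -relative_change(LZ4_INCOMPRESSIBLE, UNCOMPRESSED)
gain = relative_change(LZ4_GOLD_IMAGE, UNCOMPRESSED)

print(f"worst-case penalty: {penalty:.1%}")
print(f"gain on mixed data: {gain:.1%}")
```

On these figures the gain on mixed data is roughly seven times the worst-case penalty.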

Conclusions

The potential performance benefits of compression are significant, and the worst-case penalties are quite small. There are almost always better places to focus a sysadmin or storage architect's attention than picking and choosing which datasets to disable compression on, so we recommend defaulting to LZ4 by setting compression=lz4 at the pool root and leaving that value inheritable.

For datasets or zvols with extremely small record sizes that require the lowest possible latency, it may be worthwhile to set compression=off, but even then the differences will likely be very small, even on heavily CPU-bottlenecked workloads.

We specifically chose examples using very fast solid-state storage here. The performance benefits of OpenZFS inline compression will be significantly larger on slow, conventional hard drives—and the penalties for attempting to compress incompressible data even smaller.
