This is part of our article series published as “OpenZFS in Depth”.
Transparent (inline) configurable compression is one of OpenZFS’ many compelling features, but it is also one of the more frequently misunderstood. Although ZFS itself currently defaults to leaving compression off, we believe you should almost always use LZ4 compression on your datasets and zvols. This idea is so widely accepted that the default compression setting is likely to change in a future release of OpenZFS.
Compression is a per dataset/zvol property, not a pool level feature. This means that you can configure compression differently for groups of data in the same pool. For example, you might use “zfs set compression=lz4 tank” to set an inheritable default for the pool, but then “zfs set compression=gzip tank/textfiles” to get better compression ratios in a particular dataset which only contains highly compressible data.
ZFS has no problem dealing with a mix of uncompressed, LZ4 compressed, and gzip compressed records sitting side by side in the same dataset. ZFS does not need to immediately rewrite existing data if the setting changes.
After changing the “compression” value of a dataset or zvol, you may either let things continue as they are—in which case old data is stored under the old compression scheme, but new data is written using the new one—or you may recopy the existing data, then delete the original copies.
So, operating ZFS compression is simple—but before we move on, we should talk a little bit about how it actually works.
To do that, we’ll need to dive into some low-level basics about ZFS read and write operations.
Explaining ashift and recordsize / volblocksize
The two most important tunables for ZFS storage are “ashift”, a per-vdev option which should be set to correspond with the actual hardware block size of the underlying devices, and “recordsize” (or “volblocksize”, if we’re talking about ZVOLs).
recordsize is a bit more difficult to explain than ashift. The smallest individually-operable block of data on a ZFS dataset is one record. This means that when you change a single byte of data within one record, ZFS makes a new copy of the entire record with your one byte change, and writes that newly modified record to disk.
With the newly modified record written to disk, ZFS unlinks the old record from the current version of the filesystem and replaces it with a link to the newly-written, modified record. ZVOLs work essentially the same way, only with volblocks—whose size is controlled by the “volblocksize” property—instead.
One final piece to the puzzle: ZFS can write undersized records if necessary. If you save a 1KiB text file in a dataset with “recordsize=1M” set, it does not consume the entire 1MiB maximum record size; your file is saved in an undersized, single-block record. With “ashift=12”, corresponding to a hardware block size of 2^12 bytes, or 4KiB, that means your 1KiB file takes up 4KiB on disk, not 1MiB!
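The rounding described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not ZFS code: the function name `allocated_bytes` is hypothetical, and the sketch ignores metadata, RAIDZ parity and padding, and any other allocation overhead.

```python
def allocated_bytes(data_bytes: int, ashift: int) -> int:
    """Round a payload up to whole hardware blocks of 2**ashift bytes.

    Simplified model (an assumption of this sketch): one small file,
    stored in a single undersized record, with no metadata counted.
    """
    block = 1 << ashift                 # ashift=12 -> 4096-byte blocks
    blocks = -(-data_bytes // block)    # ceiling division
    return blocks * block


# The 1KiB text file from the example above, on ashift=12 vdevs:
print(allocated_bytes(1024, 12))        # 4096 bytes, not 1MiB
```

Note that “recordsize” does not appear in the calculation at all: for a file smaller than one record, only the hardware block size determines the space consumed.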
How OpenZFS Compression Works
Now that we understand the difference between “ashift” and “recordsize” (or “volblocksize“), we can more easily discuss how compression works. Let’s say you’ve got a dataset with OpenZFS’ current default of “recordsize=128K“, sitting atop a pool constructed of vdevs with “ashift=12“.
Our hardware blocksize is 4KiB, and our desired maximum record size is 128KiB – so a typical record, uncompressed, will be stored in thirty-two 4KiB blocks. What if we’ve got “compression=lz4” set, and the LZ4 algorithm can achieve a raw 1.32 compression ratio, compressing that 128KiB of data down to 97KiB?
You need twenty-five total 4KiB blocks to store 97KiB of data – so while we’re still looking at a single 128KiB record, that record now occupies twenty-five blocks on disk, not thirty-two—an on-disk “compressratio” of 1.28.
Assuming we’ve correctly set “ashift” to match our underlying hardware’s specifications, that means we can fetch that record about 22% more efficiently than we could have otherwise. That extra 22% we saved on this record means lower latency reading or writing it, and provides better overall throughput if it’s one of an entire stream of blocks we’re reading or writing at once.
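The block arithmetic from this example can be checked with a short Python sketch. The helper name `blocks_needed` is hypothetical, and the model assumes a record simply occupies whole 2^ashift-byte blocks, with no other overhead.

```python
def blocks_needed(size_bytes: int, ashift: int) -> int:
    """Number of 2**ashift-byte hardware blocks needed to hold size_bytes."""
    block = 1 << ashift
    return -(-size_bytes // block)      # ceiling division


KIB = 1024
full = blocks_needed(128 * KIB, 12)     # uncompressed 128KiB record
packed = blocks_needed(97 * KIB, 12)    # same record after LZ4, 97KiB

on_disk_ratio = full / packed           # what "compressratio" reflects
saving = 1 - packed / full              # fraction of blocks no longer needed

print(full, packed)                     # 32 25
print(f"{on_disk_ratio:.2f}x, {saving:.1%} fewer blocks")
```

The raw ratio (128/97, about 1.32) and the on-disk ratio (32/25, or 1.28) differ because the last partially filled block still has to be allocated in full.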
Higher recordsize Means Better Compression Ratios
The larger your records or volblocks are, the better compression ratios you’ll get. The first reason is that the compression dictionary is per-record, so larger records can compress more efficiently.
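The effect of compressing each record independently can be illustrated with a small experiment. This sketch uses Python's standard `zlib` module as a stand-in compressor (it is not LZ4, and it is not how ZFS invokes its compressors), and the sample data is synthetic log-like text generated by the hypothetical helper `make_sample`. It only shows the general trend: the same data, split into larger independent chunks, compresses to fewer total bytes.

```python
import random
import zlib


def make_sample(n_lines: int = 20000, seed: int = 0) -> bytes:
    """Build repetitive, log-like text so there is something to compress."""
    rng = random.Random(seed)
    words = ["GET", "POST", "/index.html", "/api/items",
             "200", "404", "tank", "dataset"]
    lines = (" ".join(rng.choice(words) for _ in range(8))
             for _ in range(n_lines))
    return "\n".join(lines).encode()


def compressed_total(data: bytes, record_size: int) -> int:
    """Compress each record-sized chunk on its own and sum the sizes."""
    return sum(
        len(zlib.compress(data[i:i + record_size]))
        for i in range(0, len(data), record_size)
    )


sample = make_sample()
for size_kib in (4, 16, 128):
    total = compressed_total(sample, size_kib * 1024)
    print(f"{size_kib:>3}KiB chunks: raw ratio {len(sample) / total:.2f}")
```

The exact ratios depend entirely on the data and the compressor, so treat the numbers this prints as illustrative rather than as a prediction for any real dataset.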
The second reason is the relationship between ashift and recordsize, which matters even more than the raw “recordsize” itself. Remember, we store each record in blocks whose size is determined by “ashift”, and if compression can’t decrease the number of blocks required to store the record, you won’t get any on-disk savings for that record.
To demonstrate this, let’s examine a dataset with “recordsize=16K” sitting on a pool with vdevs set to “ashift=13“, or 8K hardware block size. If LZ4 gives us the same 1.32 raw compression ratio that we had in the last example on each record, we get no compression at all! Wait, what?
16KiB of data, compressed at a raw ratio of 1.32 to roughly 75.8% of its original size, becomes about 12.1KiB of data. Unfortunately, 12.1KiB of data still requires two 8KiB hardware blocks on disk, which would result in an on-disk “compressratio” of 1.00x, meaning completely uncompressed. So ZFS won’t actually store this record compressed at all, since there would be no savings.
It is still possible to achieve usable on-disk compression with “recordsize=16K” and “ashift=13“, of course—but only for individual records which can achieve raw compression ratios of 2.00 or better.
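The threshold can be computed for any combination of record size and ashift. The function below is a hypothetical helper, not part of any ZFS tooling, and it models only the "must save at least one whole block" rule described here; any additional minimum-savings threshold that ZFS may apply internally is ignored.

```python
from typing import Optional


def min_useful_ratio(recordsize: int, ashift: int) -> Optional[float]:
    """Smallest raw compression ratio that frees at least one block.

    Returns None when the uncompressed record already fits in a single
    block, because no ratio can reduce the block count any further.
    """
    block = 1 << ashift
    full_blocks = -(-recordsize // block)       # ceiling division
    if full_blocks <= 1:
        return None
    # The compressed record must fit into one block fewer than before.
    return recordsize / ((full_blocks - 1) * block)


KIB = 1024
print(min_useful_ratio(16 * KIB, 13))           # 2.0
print(round(min_useful_ratio(128 * KIB, 12), 3))  # 1.032
```

With 128KiB records on 4KiB blocks, almost any compressible record saves something, while 16KiB records on 8KiB blocks need to halve in size before a single block is freed.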
Compression ratios on small recordsize datasets
Consider a MySQL database, which we want to store on a pool with vdevs set “ashift=13“. Many commonly-used high-performance SSDs perform best with this high “ashift” value—and MySQL itself defaults to a tablespace page size of 16KiB.
This leaves us where we were in the previous section: a hardware block size of 8KiB, and a recordsize of 16KiB. In this scenario, however, we often see an on-disk “compressratio” higher than 1.00x, but (much) lower than 2.00x.
Although the MySQL database’s data may only be compressible to 1.32x when considered as a whole, individual 16KiB chunks of it may very well achieve 2.00x or better raw compression. Any individual record which compresses at 2.00x or better can be stored in a single 8KiB block.
So, if we have a 1GiB MySQL database in a dataset that reports 1.1x “compressratio”, that means that around 11,900 of the 65,536 ZFS records in the database (roughly 18%) compressed to 2.00x or better, and thus could be stored in single 8KiB hardware blocks instead of two.
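The link between the number of records that shrink to a single block and the dataset-wide ratio can be sketched as follows. This is a simplified model with a hypothetical function name: it assumes every record is exactly 16KiB, occupies two 8KiB blocks when uncompressed and one when compressed, and that the reported ratio is simply uncompressed blocks divided by stored blocks.

```python
def dataset_ratio(total_records: int, single_block_records: int) -> float:
    """On-disk ratio when each record takes either two blocks or one."""
    uncompressed_blocks = total_records * 2
    stored_blocks = uncompressed_blocks - single_block_records
    return uncompressed_blocks / stored_blocks


records = (1024 ** 3) // (16 * 1024)    # 1GiB of 16KiB records = 65,536

print(dataset_ratio(records, 0))                    # 1.0: nothing compressed
print(round(dataset_ratio(records, records // 2), 3))  # 1.333: half compressed
print(dataset_ratio(records, records))              # 2.0: everything halved
```

Because each record can save at most one block, the dataset-wide ratio in this layout can never exceed 2.00x, no matter how compressible the data is.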
On the other hand, what if we had “recordsize=8K” and “ashift=13“? In this case, no on-disk compression is possible – because an uncompressed record already only requires a single block, and therefore can’t be stored in fewer blocks, no matter how high its “compressratio“.
ZFS Compression, Incompressible Data and Performance
You may be tempted to set “compression=off” on datasets which primarily have incompressible data on them, such as folders full of video or audio files. We generally recommend against this – for one thing, ZFS is smart enough not to keep trying to compress incompressible data, and to never store data compressed, if doing so wouldn’t save any on-disk blocks.
You might still not be convinced – so far, all we’re talking about is mitigating a performance decrease. Isn’t it better to avoid the decrease in the first place? Probably not. The performance “penalty” for working with incompressible data is minuscule, and the potential gains in working with compressible data are much larger.
root@lab:/data/test# pv < in.rnd > lz4-compressed/out.rnd
7.81GB 0:00:22 [ 359MB/s] [==================================>] 100%
root@lab:/data# zfs get compressratio data/test/lz4-compressed
NAME PROPERTY VALUE SOURCE
data/test/lz4-compressed compressratio 1.00x -
In the above example, we take previously-generated incompressible data stored on one high-performance SSD, and write it to another high-performance SSD, with “compression=lz4” and the default “recordsize=128K“. The source data is in the system cache, so the write speed is the only bottleneck here.
The data streams at 359MiB/s—but would it go any faster, if written to a dataset with “compression=off“?
root@lab:/data# pv < in.rnd > uncompressed/out.rnd
7.81GB 0:00:21 [ 378MB/s] [==================================>] 100%
This time, we got 378MiB/sec. So while we did see a performance penalty—(378-359)/378=5.0%—it’s certainly not an enormous one. Now, what kind of benefits can we see in real-world scenarios, with a mix of compressible and incompressible data?
root@lab:/data/test# pv < win2012r2-gold.raw > lz4/win2012r2-gold.raw
8.87GB 0:00:17 [ 515MB/s] [==================================>] 100%
root@lab:/data# zfs get compressratio data/test/lz4-compressed
NAME PROPERTY VALUE SOURCE
data/test/lz4-compressed compressratio 1.48x -
In the above example, we’re saving a Windows Server 2012 R2 gold image to an LZ4 compressed dataset. The gold image is a fully-installed Windows Server 2012 R2 VM, with drivers and a few client-specific applications installed, which has then been `sysprep`ed for deployment to new virtual machines.
This is obviously not a “perfect” candidate for compression, like a directory full of text files. But it still achieved a 1.48x compression ratio—and a whopping (515-378)/378=36% performance improvement.
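For reference, both percentages quoted in this section come from the same calculation, using the uncompressed write speed as the baseline. The sketch below simply reproduces that arithmetic from the three throughput figures reported by `pv` above; the function name is hypothetical.

```python
def relative_change(baseline: float, measured: float) -> float:
    """Change of `measured` relative to `baseline`, as a fraction."""
    return (measured - baseline) / baseline


uncompressed = 378      # incompressible data, compression=off
lz4_random = 359        # incompressible data, compression=lz4
lz4_image = 515         # Windows gold image, compression=lz4

penalty = relative_change(uncompressed, lz4_random)
gain = relative_change(uncompressed, lz4_image)

print(f"worst case: {penalty:+.1%}")    # -5.0%
print(f"gold image: {gain:+.1%}")       # +36.2%
```

Keep in mind that these are single runs on one test system, so the figures indicate the rough size of the effect rather than a precise benchmark.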
The potential performance benefits of compression are significant, and the worst-case penalties are quite small. There are almost always better places to focus a sysadmin or storage architect’s attention than picking and choosing which datasets to disable compression on, so we recommend defaulting to LZ4 by setting “compression=lz4” at the pool root and leaving that value inheritable.
For datasets or ZVOLs with an extremely small record size that require the lowest possible latency, it may be worthwhile to set “compression=off”, but even then the difference will likely be very small, even on heavily CPU-bottlenecked workloads.
We specifically chose examples using very fast solid-state storage here. The performance benefits of OpenZFS inline compression will be significantly larger on slow, conventional hard drives—and the penalties for attempting to compress incompressible data even smaller.