OpenZFS: All about the cache vdev or L2ARC
This is part of our article series published as “OpenZFS in Depth”.
Today we’re going to talk about one of the well-known support vdev classes under OpenZFS: the CACHE vdev, better (and rather misleadingly) known as L2ARC.
The first thing to know about the “L2ARC” is the most surprising—it’s not an ARC at all. ARC stands for Adaptive Replacement Cache, a complex caching algorithm that tracks both the blocks in cache, and blocks recently evicted from cache, to figure out which are the “hottest” and therefore should be the most difficult to displace.
A block which is frequently read from the ARC will be difficult to displace simply because new data is read. A block which has recently been evicted from ARC—but then had to be read right back into it again—will similarly be difficult to displace. This increases ARC hit ratios, and decreases “cache thrash,” by comparison with more naive caching algorithms.
Conventional filesystems—including but not limited to FreeBSD UFS, Linux ext4 and XFS, and Windows NTFS—use the operating system’s kernel page cache facility as a filesystem read cache. These are all LRU caches, or Least Recently Used caches.
In an LRU cache, each time a block is read it goes to the “top” of the cache, whether the block was already cached or not. Each time a new block is added to the cache, all blocks below it are pushed one block further toward the “bottom”. A block all the way at the “bottom” of the cache gets evicted the next time a new block is added to the “top.”
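The LRU policy described above can be sketched in a few lines of Python. This is a toy illustration of the algorithm only, not how any kernel page cache is actually implemented; the class and method names are ours:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: the most recently read block sits at the 'top'."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # insertion order doubles as recency order

    def read(self, key, fetch):
        if key in self.blocks:
            # Cache hit: move the block back to the top of the cache.
            self.blocks.move_to_end(key)
            return self.blocks[key]
        # Cache miss: fetch the block, insert it at the top,
        # and evict the bottom (least recently used) block if we're full.
        value = fetch(key)
        self.blocks[key] = value
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)
        return value
```

Note that a hit and a miss both push every other block one step closer to the bottom, which is exactly the behavior described above.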
Now that we understand both common LRU caches and the ARC reasonably well, we can talk about the L2ARC—which is neither ARC nor LRU cache. The L2ARC is actually a relatively simple ring buffer—first in, first out. This allows for extremely efficient write operations, at the expense of hit ratios.
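A ring buffer of this kind can be sketched just as briefly—again a toy illustration with hypothetical names, not the actual L2ARC code:

```python
class RingBufferCache:
    """Toy ring buffer: writes always go to the next slot, wrapping around
    and overwriting whatever was there before. No recency tracking at all."""
    def __init__(self, nslots):
        self.slots = [None] * nslots
        self.hand = 0  # next write position, like the 'l2arc write hand'

    def write(self, block):
        self.slots[self.hand] = block
        self.hand = (self.hand + 1) % len(self.slots)  # wrap around

    def read(self, block):
        # A block is a hit only if it hasn't been overwritten yet.
        return block if block in self.slots else None
```

Writes are purely sequential, which is why the ring buffer is so cheap to feed—but the oldest cached blocks get overwritten regardless of how hot they are, which is why hit ratios suffer.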
A final note before we move on—if it’s not already clear, the L2ARC will very rarely have a hit ratio as high as the ARC’s. This is expected, not a bug—L2ARC is a simpler, cheaper way of caching more data than the ARC can fit. The hottest blocks always stay in ARC, and the L2ARC just catches some of the marginal blocks. Hence, lower hit ratios.
How the L2ARC receives data
Another common misconception about the L2ARC is that it’s fed by ARC cache evictions. This isn’t quite the case. When the ARC is near full, and we are evicting blocks, we cannot afford to wait while we write data to the L2ARC, as that will block forward progress reading in the new data that an application actually wants right now. Instead we feed the L2ARC with blocks when they get near the bottom, making sure to stay ahead of eviction.
Brendan Gregg wrote a 2008 blog post—and comment block in zfs/arc.c—which illustrates the L2ARC feed mechanism well:
 *             head -->                        tail
 *              +---------------------+----------+
 *    ARC_mfu  |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
 *              +---------------------+----------+ |    o L2ARC eligible
 *    ARC_mru  |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
 *              +---------------------+----------+ |
 *                        15.9 Gbytes  ^ 32 Mbytes |
 *                                  headroom       |
 *                                        l2arc_feed_thread()
 *                                                 |
 *                        l2arc write hand <--[oooo]--'
 *                                |           8 Mbyte
 *                                |          write max
 *                                V
 *              +==============================+
 *    L2ARC dev |####|#|###|###|    |####| ... |
 *              +==============================+
 *                        32 Gbytes
In short, the L2ARC is continually fed with blocks that are near eviction from ARC. If the ARC evicts blocks faster than L2ARC is willing to write them, those blocks simply don’t get cached at all—which is probably fine, since extremely rapid ARC evictions generally mean a streaming workload is underway, and those blocks likely aren’t crucial to keep cached in the first place. Caching only a fraction of what is being evicted ensures we don’t write data to the cache just to replace it with different data in a few minutes.
L2ARC feed rates
By default, the L2ARC feed rate is throttled very heavily. This serves several important but perhaps not readily apparent purposes.
First and foremost, the CACHE vdevs that the L2ARC feeds are generally going to be solid state disks that have write endurance limitations. An unthrottled L2ARC feed thread might be capable of utterly wrecking its CACHE vdevs in a matter of months.
Secondly, if we overwhelm the L2ARC’s CACHE vdevs with rapid writes, it won’t be of any use as a read cache in the first place! Remember, the whole point here is to provide the system with lower latency and higher throughput responses to requests for blocks which are no longer present in ARC.
Since storage is half duplex—it can store data, or it can return data, but it can’t do both at once—it stands to reason that saturating L2ARC with writes reduces its potential value as a read cache—to the point that it could conceivably even introduce bottlenecks, rather than mitigate them. Constantly writing to a solid state disk at a high rate can also trigger long garbage collection pauses, giving very uneven performance.
Now that we understand why we’d want to throttle the L2ARC’s feed thread, let’s look at the tunables which control it, and their default values:
- l2arc_write_max 8,388,608
- l2arc_write_boost 8,388,608
- l2arc_noprefetch 1
- l2arc_headroom 2
- l2arc_feed_secs 1
l2arc_write_max is the standard L2ARC feed throttle, and defaults to 8MiB/sec—an absurdly low value, on the surface of it, for a modern SSD. But let’s do some back-of-the-napkin math, and figure out what it means to continually feed a CACHE vdev at 8MiB/sec:
8MiB/sec * 60sec/min * 60min/hr * 24hr/day == 691,200MiB/day == 675GiB/day
A typical consumer SSD has a write endurance level well under 1PiB—let’s say 500TiB, to give ourselves a little breathing room. At 500TiB writes, a consumer TLC or pro-grade MLC SSD won’t be ready to fall over dead just yet, but it will have begun exhibiting significant performance and reliability degradation. Going back to the napkin:
500TiB * 1024GiB/TiB == 512,000GiB 512,000GiB / 675GiB/day == 758.5 days
So if we feed a single CACHE vdev at the default throttle rate of 8MiB/sec, we can expect it will need replacement in a little over two years. By contrast, if we upped that feed rate to a seemingly more reasonable 64MiB/sec, we’d only get three months out of the CACHE vdev before it was due for replacement!
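The napkin math above is easy to turn into a script. This is a sketch with a function name of our own choosing, and the 500TiB endurance figure is the article’s working assumption, not a vendor specification:

```python
def cache_vdev_lifetime_days(feed_rate_mib_per_sec, endurance_tib=500):
    """Days until an SSD hits its write-endurance budget at a constant feed rate."""
    gib_written_per_day = feed_rate_mib_per_sec * 86400 / 1024  # MiB/s -> GiB/day
    endurance_gib = endurance_tib * 1024
    return endurance_gib / gib_written_per_day

print(round(cache_vdev_lifetime_days(8), 1))    # default 8 MiB/s throttle: ~758.5 days
print(round(cache_vdev_lifetime_days(64), 1))   # 64 MiB/s: ~94.8 days
```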
With that said—most environments won’t feed the L2ARC at its maximum rate 24/7/365. It’s also worth noting that if you’ve got multiple CACHE vdevs, they split the workload—so a system with two separate CACHE vdevs would have twice the service life, with the same L2ARC feed rate. But a wise admin will exercise caution here, especially in a busy system.
Moving on from l2arc_write_max, let’s talk about l2arc_write_boost. In a “cold” system—one with an empty L2ARC—the maximum feed rate is l2arc_write_max + l2arc_write_boost—so, by default, 16MiB/sec. Once the L2ARC is full, l2arc_write_boost turns off, and we return to the standard feed rate of l2arc_write_max alone.
For right now, l2arc_write_boost is critically important, because the L2ARC empties on each system reboot (or ZFS kernel module reload). But with OpenZFS 2.0, George Amanakis’ L2ARC persistence patch will finally allow the contents of L2ARC to be reloaded across reboots.
l2arc_noprefetch—which is on by default—prevents prefetched blocks from landing in L2ARC. These are blocks that ZFS predicted you might need, but which you may never actually end up reading. Meanwhile, l2arc_feed_secs controls how frequently the L2ARC feed thread is allowed to write to its CACHE vdevs (by default, once per second).
The last value to talk about is l2arc_headroom. The description of this tunable in the man page is a little opaque, and could use some expansion.
To understand l2arc_headroom, you have to remember how the L2ARC is fed in the first place—from the tail end of the ARC. l2arc_headroom controls how far from the absolute tail end of the ARC the feeder is allowed to scan for new blocks, expressed as a multiplier of the l2arc_write_max throttle itself.
With default values, l2arc_write_max=8MiB, and l2arc_headroom=2, so the L2ARC feed is allowed to scan for new blocks up to 16MiB away from the tail (eviction) end of the ARC. If you increase l2arc_write_max to 16MiB, this doubles the size of the feed area as well, to 32MiB. If you then double l2arc_headroom, you double it again—so with l2arc_write_max==16MiB, and l2arc_headroom=4, your l2arc feed area becomes 64MiB deep.
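The relationship is a simple multiplication, which we can sketch with a hypothetical helper function to make the three cases above concrete:

```python
def l2arc_feed_area_mib(l2arc_write_max_mib, l2arc_headroom):
    """Depth of the scan window, measured from the tail (eviction) end of the ARC."""
    # The feed thread scans up to write_max * headroom bytes from the ARC tail.
    return l2arc_write_max_mib * l2arc_headroom

print(l2arc_feed_area_mib(8, 2))    # defaults: 16 MiB feed area
print(l2arc_feed_area_mib(16, 2))   # doubled write_max: 32 MiB
print(l2arc_feed_area_mib(16, 4))   # doubled headroom as well: 64 MiB
```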
When should I use L2ARC?
For most users, the answer to this question is simple—you shouldn’t. The L2ARC needs system RAM to index it—which means that L2ARC comes at the expense of ARC. Since ARC is an order of magnitude or so faster than L2ARC and uses a much better caching algorithm, you need a rather large and hot working set for L2ARC to become worth having.
In general, if you have budget which could be spent either on more RAM or on CACHE vdev devices—buy the RAM! You shouldn’t typically consider L2ARC until you’ve already maxed out the RAM for your system.
With that said, let’s take a look at exactly how much RAM is necessary to index the L2ARC, using this formula:
(L2ARC size in kilobytes) / (typical recordsize -- or volblocksize -- in kilobytes) * 70 bytes = ARC header size in RAM
So, let’s say we’re using a 512GiB SSD as a CACHE vdev, and we’re operating a typical fileserver—recordsize=1M, and most files at or well over 1MiB, so the majority of our records are 1024KiB. Returning to our mathematical napkin, we get:
512GiB * 1024MiB/GiB * 1024 KiB/MiB / 1024 KiB/record * 70 header bytes/record == 36700160 L2ARC header bytes == 35MiB L2ARC headers
Well, that’s pretty trivial! But what happens if we use the same 512GiB SSD as CACHE vdev for a system devoted to MySQL databases, with recordsize=16K?
512GiB * 1024MiB/GiB * 1024 KiB/MiB / 16KiB/record * 70 header bytes/record == 2348810240 L2ARC header bytes == 2.2GiB L2ARC headers
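Both napkin calculations follow directly from the formula given earlier; here is a small sketch (the function name is ours) that reproduces them:

```python
def l2arc_header_ram_bytes(l2arc_size_gib, record_size_kib, header_bytes=70):
    """Approximate RAM consumed by the ARC headers that index the L2ARC."""
    l2arc_kib = l2arc_size_gib * 1024 * 1024        # GiB -> KiB
    records = l2arc_kib / record_size_kib           # blocks the L2ARC can hold
    return records * header_bytes

print(l2arc_header_ram_bytes(512, 1024))  # fileserver, 1MiB records: 36,700,160 bytes (~35 MiB)
print(l2arc_header_ram_bytes(512, 16))    # MySQL, 16KiB records: 2,348,810,240 bytes (~2.2 GiB)
```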
On a system with 128GiB total RAM, 64GiB of which is devoted to ARC, this is still a relatively low cost. But on a system with 16GiB of RAM, and only 8GiB dedicated to ARC, it’s ruinously expensive—very few workloads will benefit more from 512GiB of ring buffer than they would from an extra 25% capacity in primary ARC.
Monitoring ARC statistics
Sysadmins who are contemplating the addition of CACHE vdevs should be monitoring their ARC stats, both before and after adding CACHE.
On Linux systems, this can be checked via /proc:
root@banshee:~# head -n12 /proc/spl/kstat/zfs/arcstats
12 1 0x01 98 26656 9166421145 697584396977032
name                            type data
hits                            4    224069782
misses                          4    3862016
demand_data_hits                4    20896888
demand_data_misses              4    193906
demand_metadata_hits            4    202477605
demand_metadata_misses          4    33841
prefetch_data_hits              4    488249
prefetch_data_misses            4    255731
prefetch_metadata_hits          4    207040
prefetch_metadata_misses        4    3378538
The first 12 lines of arcstats give us the major ARC statistics, from which we can determine hit ratios. For the ARC overall on this example system (my workstation, with both a rust pool containing bulk storage, and an SSD pool containing user home directories, VM images, and more), we can do some more back-of-the-napkin math to arrive at the following conclusions:
- hits / (hits+misses) == 98.3% overall ARC hit ratio
- demand_data_hits / (demand_data_hits + demand_data_misses) == 99.1% ARC data hit ratio
- demand_metadata_hits / (demand_metadata_hits + demand_metadata_misses) == 99.98% ARC metadata hit ratio
- prefetch_data_hits / (prefetch_data_hits + prefetch_data_misses) == 65.6% ARC prefetch data hit ratio
- prefetch_metadata_hits / (prefetch_metadata_hits + prefetch_metadata_misses) == 5.8% ARC prefetch metadata hit ratio
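These ratios can be computed straight from arcstats with a short script—a sketch with a function name of our own, assuming the stock three-column kstat format shown above:

```python
def arc_hit_ratios(arcstats_text):
    """Compute hit ratios (in percent) from /proc/spl/kstat/zfs/arcstats text."""
    stats = {}
    for line in arcstats_text.splitlines():
        fields = line.split()
        # Data lines look like: name, kstat type (always 4), value.
        if len(fields) == 3 and fields[1] == "4":
            stats[fields[0]] = int(fields[2])
    ratios = {}
    for prefix in ("", "demand_data_", "demand_metadata_",
                   "prefetch_data_", "prefetch_metadata_"):
        hits, misses = stats[prefix + "hits"], stats[prefix + "misses"]
        ratios[prefix + "hit_ratio"] = 100.0 * hits / (hits + misses)
    return ratios

# e.g.: ratios = arc_hit_ratios(open("/proc/spl/kstat/zfs/arcstats").read())
```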
We can draw several conclusions from these hit ratios. First and foremost—the ARC is amazing.
But beyond that, the extremely high hit ratio in my ARC tells me that I probably don’t need either more ARC, or an L2ARC—there just aren’t enough cache misses occurring to make it worth chasing them.
It’s also worth taking a look at the prefetch statistics here—a 65.6% hit rate on prefetch data tells me I’m probably increasing throughput from my applications’ perspective considerably by leaving prefetch enabled. If that rate fell below 25%, I’d probably want to disable prefetch entirely, by adding options zfs zfs_prefetch_disable=1 to /etc/modprobe.d/zfs.conf (or setting vfs.zfs.prefetch_disable="1" in /boot/loader.conf, under FreeBSD) and restarting the system.
We can view the L2ARC’s statistics in sysctl kstat.zfs.misc.arcstats or /proc/spl/kstat/zfs/arcstats:
root@freebsd:~# sysctl kstat.zfs.misc.arcstats | egrep 'l2_(hits|misses)'
kstat.zfs.misc.arcstats.l2_misses: 29549340042
kstat.zfs.misc.arcstats.l2_hits: 1893762538

root@banshee:~# egrep 'l2_(hits|misses)' /proc/spl/kstat/zfs/arcstats
l2_hits                         4    490701
l2_misses                       4    3366016
In general, if an admin has one or more CACHE vdevs installed, he or she should be looking for an l2 hit ratio (l2_hits / (l2_hits+l2_misses)) of at least 25%.
If the L2ARC’s hit ratio is lower than desired, it might be worth experimenting with increasing l2arc_write_max in order to keep it better fed—but remember, increased l2arc_write_max comes at the expense of decreased CACHE vdev life expectancy, and quite possibly higher read latency / lower read throughput from the CACHE vdev as well.