Managing Cache and DirectIO for Databases on ZFS 

Databases are Different 

Database workloads stress storage in very different ways than general-purpose file storage. Most file reads are sequential: they start at the beginning of the file and proceed incrementally until the required data is found or the end of the file is reached. Writing to files on a file server tends to either rewrite the file entirely or append to the end of the existing data. Contrast this with databases, which tend to have very large files that are both read and written in small segments scattered throughout the file. 

The guarantees required by these workloads also differ. If the system loses power while copying a file, it usually doesn’t matter if the last few blocks of the file made it to disk before the interruption. With a database, it does. Changes to a database need to be atomic: they either complete in their entirety, or they do not happen at all. This means that after every change, the database must ask the filesystem for a guarantee that the data it has just written will survive a power failure or other crash. This is done with the fsync() system call, which does not return until all of the indicated data is safely on persistent storage. Regular file-serving workloads either never ask for this guarantee at all, or ask only once, when the entire file has been written out. 
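As a rough illustration of that pattern, here is a minimal Python sketch of a durable append. The function name `durable_append`, the file name, and the record contents are hypothetical and only stand in for what a real database engine does internally.

```python
import os
import tempfile


def durable_append(path: str, record: bytes) -> None:
    """Append a record and return only after fsync() reports it is on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        # os.write may write fewer bytes than requested, so loop until done.
        view = memoryview(record)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Block until the filesystem reports the data is on persistent storage.
        os.fsync(fd)
    finally:
        os.close(fd)


# Hypothetical usage: a tiny "commit log" in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
durable_append(log_path, b"txn-1 committed\n")
```

Every call pays the full latency of the fsync(), which is why the storage behind a database is so sensitive to how quickly the filesystem can make small writes durable.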

This means that a database workload is fundamentally different, and the filesystem must act differently to support it, and be tuned differently to perform well at this specialized workload. In this article we will walk through the specific considerations for running databases atop ZFS and leveraging the new Direct IO feature. 

The Buffer Cache 

Most filesystems have some form of buffer cache, a way to keep frequently requested data in memory to avoid the high latency of reading data from disk. This memory is usually limited and in high demand, so there is a mechanism to evict older data in favour of newer data that is more likely to be requested again soon. 

In most filesystems, this buffer cache uses a simple algorithm, LRU, or Least Recently Used, to evict the oldest data to make room for the newest incoming data. Any time a file is read or written, that new data is kept in the cache since it is the most likely to be used in the near future. 
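To make the eviction behaviour concrete, the following is a toy LRU cache in Python. It is a sketch for illustration only, not how any particular filesystem implements its buffer cache; the class name, the capacity, and the block numbers are all made up. It shows how a single pass over cold data can push every hot block out of a plain LRU.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used block."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks: "OrderedDict[int, bytes]" = OrderedDict()

    def access(self, block_id: int) -> bool:
        """Touch a block; return True on a cache hit, False on a miss."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # now the most recently used
            return True
        # Miss: pretend to read the block from disk, then cache it.
        self.blocks[block_id] = b""
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        return False


cache = LRUCache(capacity=4)
for block in (1, 2, 3, 4):       # warm the cache with "hot" blocks
    cache.access(block)
hot_hits_before = sum(cache.access(b) for b in (1, 2, 3, 4))

for block in range(100, 110):    # one sequential scan over cold blocks
    cache.access(block)
hot_hits_after = sum(cache.access(b) for b in (1, 2, 3, 4))

print(hot_hits_before, hot_hits_after)  # prints "4 0"
```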

In ZFS, this cache uses a more advanced algorithm, ARC, or Adaptive Replacement Cache. It combines an MRU (Most Recently Used) list and an MFU (Most Frequently Used) list to balance the cache and provide some resilience against scans that churn the cache and reduce its effectiveness. 

Most database engines also have their own buffer cache, but rather than buffering the raw block contents of the file, they cache the rows or pages in the database and often have more detailed knowledge of the structure of the data. 

What is Direct IO 

Direct IO was originally introduced in the XFS filesystem on IRIX for database workloads. It allows the database to bypass the filesystem buffer cache to prevent unnecessary IO operations (e.g. readahead) while preventing contention for system memory between the database and filesystem caches.  

Since its original introduction it has become a common way to indicate the desire to bypass the cache on the majority of UNIX filesystems.  

One reason you might want to bypass the cache when reading some data is that you are relatively certain it will not be required again in the near future. By setting the O_DIRECT flag on that particular read operation, you ask the filesystem to NOT add that data to the buffer cache. If you do end up reading it again, it will need to be read from disk, which will be slower, but the least recently used data was not evicted from the cache to make room for new data that is not expected to be needed again. Using Direct IO this way avoids evicting data unnecessarily and polluting the cache with data that is not likely to be requested again, so in the end the cache is more effective. 

The same scenario can exist on the write side. When new data is written, it would ordinarily be added to the buffer cache. Setting the O_DIRECT flag on the write causes the data to not be added to the buffer cache, again avoiding the eviction of otherwise more useful data to make room for data we do not expect to read again soon. 
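On Linux and FreeBSD the flag is passed when the file is opened, and every request made through that descriptor then carries the Direct IO hint. The sketch below shows one way this might look from Python. The helper names and the file name are hypothetical, the page size is assumed to be a suitable alignment unit, and the code falls back to ordinary buffered I/O when the platform or filesystem refuses O_DIRECT.

```python
import mmap
import os
import tempfile

BLOCK = mmap.PAGESIZE  # alignment unit assumed for this sketch


def _read_block(path: str, offset: int, extra_flags: int) -> bytes:
    fd = os.open(path, os.O_RDONLY | extra_flags)
    try:
        # Anonymous mmap memory is page aligned, as O_DIRECT typically requires.
        with mmap.mmap(-1, BLOCK) as buf:
            n = os.preadv(fd, [buf], offset)
            return buf[:n]
    finally:
        os.close(fd)


def read_block_uncached(path: str, offset: int = 0) -> bytes:
    """Read one block with O_DIRECT, falling back to buffered I/O if refused."""
    direct_flag = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    try:
        return _read_block(path, offset, direct_flag)
    except OSError:
        return _read_block(path, offset, 0)


# Hypothetical usage: write two blocks normally, then read the second one back.
demo_path = os.path.join(tempfile.mkdtemp(), "table.dat")
with open(demo_path, "wb") as f:
    f.write(b"A" * BLOCK + b"B" * BLOCK)
    f.flush()
    os.fsync(f.fileno())

second = read_block_uncached(demo_path, offset=BLOCK)
```

Note that the buffer address, the file offset, and the request length are all multiples of the page size here. Writes follow the same pattern with a descriptor opened for writing.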

Another case where Direct IO can be useful is when there is more than one cache. When a database engine has its own internal buffer cache, it doesn’t make sense for the filesystem to also keep a copy of the same data in its buffer cache: we end up with two copies of the data where we could instead have cached twice as much. 

ZFS Considerations 

There are some special considerations when using Direct IO with ZFS. Any read requests that are not page aligned will return EINVAL. If the requested data is already in the ARC (ZFS buffer cache) the read will be serviced from that cache. 

For write operations the requirements are stricter. In addition to being page aligned (unaligned requests result in EINVAL), the request must be aligned to the ZFS record size (128 KiB by default); otherwise the request will take the normal buffered I/O path. If the record that is being written is already present in the ARC, it will be evicted to ensure any future reads of that data retrieve the latest version from disk. 
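The write rules in the paragraph above can be sketched as a small decision function. This is only a model of the prose, not OpenZFS code: the function names are hypothetical, and we assume "aligned" means that both the offset and the length are multiples of the unit in question.

```python
RECORD_SIZE = 128 * 1024  # default ZFS record size cited above


def is_aligned(offset: int, length: int, unit: int) -> bool:
    """Assumption: aligned means offset and length are both multiples of unit."""
    return length > 0 and offset % unit == 0 and length % unit == 0


def classify_direct_write(offset: int, length: int,
                          page_size: int = 4096,
                          record_size: int = RECORD_SIZE) -> str:
    """Return how an O_DIRECT write would be handled under the rules above."""
    if not is_aligned(offset, length, page_size):
        return "EINVAL"      # not page aligned: the request is rejected
    if not is_aligned(offset, length, record_size):
        return "buffered"    # page aligned, but not a whole record
    return "direct"          # whole, record-aligned write bypasses the ARC


print(classify_direct_write(0, 128 * 1024))   # prints "direct"
print(classify_direct_write(4096, 8192))      # prints "buffered"
print(classify_direct_write(100, 512))        # prints "EINVAL"
```

One practical consequence is that a database writing 8 KiB or 16 KiB pages would only take the direct path if the dataset's record size is set to match the page size.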

OpenZFS 2.4 added the Uncache I/O feature, which eliminates the EINVAL errors and instead uses an optimized I/O path that involves more copying than Direct I/O, but less than the normal buffered I/O. This ensures applications that cannot meet the strict alignment requirements can still get some advantage from using Direct IO. 

If a file is accessed via mmap(), then all requests will have to be buffered, and Direct IO on that file will not be possible while the file is mapped. This is to ensure that all readers and writers get a consistent view of the file, and that changes made via the buffered APIs are visible even if another application requests an unbuffered read. 

Filesystem Cache vs Database Buffer Cache 

While in many cases it might seem obvious that the database management system (DBMS) buffer cache will better serve the needs of the database, that is not always the case. 

When the buffer cache is relatively small, the DBMS is likely to perform the best, given its intimate knowledge of the structure of the database contents. 

When the overall available memory on the entire system is small, the ZFS buffer cache is likely to win out, as it can make decisions based on what is best for the entire system, rather than only the database, and trade off available memory between workloads as demand dictates. 

One place where ZFS can outshine the DBMS buffer cache is compression. The transparent compression in ZFS allows data to consume less space on disk. That benefit is obvious, and some DBMSs even support storage compression themselves, but the big win is that the ZFS ARC caches the compressed copy of the data. This means that if your database compresses 4:1, the same amount of cache given to ZFS would hold four times as much data as a cache of the same size that stores the data uncompressed. If the database contents are highly compressible, and benefit from caching, this can provide performance that is unachievable even with the most expensive NVMe drives. 
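The arithmetic is simple enough to sketch. The function below is a back-of-the-envelope model, and the 200 GiB database and 64 GiB cache are made-up numbers chosen only to illustrate the 4:1 case.

```python
def cached_fraction(dataset_gib: float, cache_gib: float,
                    compression_ratio: float = 1.0) -> float:
    """Fraction of the logical dataset that fits in a cache of the given size."""
    logical_capacity = cache_gib * compression_ratio
    return min(1.0, logical_capacity / dataset_gib)


# Hypothetical 200 GiB database with 64 GiB of memory available for caching.
print(cached_fraction(200, 64))        # prints "0.32" (uncompressed cache)
print(cached_fraction(200, 64, 4.0))   # prints "1.0"  (4:1 compressed cache)
```

In this model an uncompressed cache holds roughly a third of the database, while the same memory holding 4:1 compressed blocks covers all of it. Real workloads will differ, since compression ratios vary by table and decompression costs some CPU time on every hit.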

Latency consistency vs throughput optimization 

There are times when predictability is more important than performance. Having data in fast memory and being able to access it in microseconds is extremely useful, but if some data is that fast while other data has much higher latency, the application can seem to jerk and stutter. 

When raw throughput is less important than consistent latency, using Direct IO to bypass the cache can prove the better trade-off. It also ensures that the cache remains populated with the data that is useful, avoiding any potential latency spikes caused by hot data being evicted in favor of cold data and having to be read in from slower storage devices. 

NVMe and Databases 

NVMe drives can provide extremely low latency, as well as high levels of concurrency and throughput. When using many NVMe drives together, this performance can strain the ZFS ARC, to the point where the CPU time and memory bandwidth spent copying data to and from the cache outweigh the benefits of caching the data in fast memory. 

For those cases, deploying Direct IO with ZFS to read and write the data directly on the NVMe media, without going through the ARC, can deliver lower latency and higher throughput, and leave more memory bandwidth for the application workloads. 

How to configure and validate Direct IO 

As with much of the configuration of ZFS, Direct IO can be tuned per dataset, and the settings are automatically inherited by child datasets. 

The `direct` dataset property has three possible values: 

  • Standard: Any request that is flagged with O_DIRECT and is properly aligned will bypass the ARC. Requests without the O_DIRECT flag are treated normally, leveraging the ARC. Any unaligned requests will use Uncache I/O, bypassing the ARC but still requiring a single memory copy to be properly buffered. 
  • Always: Regardless of what the application asks for, if the request is properly aligned, it is run via Direct IO and is not cached. Unaligned requests will fall back to the normal ARC I/O path. 
  • Disabled: Any request with the O_DIRECT flag set is treated as if the flag was not set, and runs via the ARC I/O path. This matches the behavior of older versions of OpenZFS, before Direct IO support was introduced in 2.3.0. 
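The three values above can be summarized as a small decision table. The sketch below models the list as written. It is not OpenZFS code, the function name and the path labels are hypothetical, and the values are spelled in lowercase as they are when set from the command line (for example `zfs set direct=always tank/db`, where `tank/db` is a made-up dataset name).

```python
def io_path(direct_prop: str, o_direct: bool, aligned: bool) -> str:
    """Map a request to the I/O path described in the list above."""
    if direct_prop == "disabled":
        return "arc"                      # O_DIRECT is ignored entirely
    if direct_prop == "always":
        return "direct" if aligned else "arc"
    if direct_prop == "standard":
        if not o_direct:
            return "arc"                  # the application did not ask
        return "direct" if aligned else "uncached"
    raise ValueError(f"unknown value for the direct property: {direct_prop!r}")


# Print the path for every combination of property value and request type.
for prop in ("standard", "always", "disabled"):
    for o_direct in (True, False):
        for aligned in (True, False):
            print(prop, o_direct, aligned, "->", io_path(prop, o_direct, aligned))
```

A table like this is a handy reference when validating a configuration: if a benchmark shows ARC hit counters moving for a dataset you expected to bypass the cache, the request is probably landing in one of the "arc" rows.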

Is Direct IO right for your Workload? 

As with most sufficiently technical questions, the answer is “it depends”. The factors that determine if Direct IO is a good fit for your workload include: 

  • DBMS being used 
  • Storage engine within the DBMS 
  • Type of data stored in the database 
  • How compressible the data is 
  • Type of storage hardware 
  • If existing rows are frequently modified 
  • Type of queries being serviced 

In some environments, leveraging the ZFS ARC can dramatically improve performance and reduce storage costs through compressed caching. In others, bypassing the cache with Direct IO can deliver more predictable latency, lower CPU overhead, and better throughput on high-performance NVMe infrastructure. The challenge is determining where your workload falls on that spectrum, and validating that assessment with real-world testing rather than assumptions. 

That’s where Klara can help. Our team specializes in OpenZFS and database performance, helping organizations analyze workload behavior, tune caching strategies, optimize storage architecture, and validate performance improvements across production environments. If you’d like to evaluate whether Direct IO would benefit your database infrastructure, get in touch with the team at Klara to book your database performance analysis. 
