Exploring Swap on FreeBSD

Free Memory is Wasted Memory or How to Make The Best Use of Swap

In operating system technology, the term "swapping" means paging data in RAM out to disk, from where it can be paged back in on demand. The page-out activity occurs in response to a lack of free memory in the system: the kernel tries to identify pages of memory that probably will not be accessed in the near future, and copies their contents to a disk for safekeeping until they are needed again. When an application attempts to access memory that has been swapped out, it blocks while the kernel fetches that saved memory from the swap disk, and then resumes execution as if nothing had happened.

All of the above might sound perfectly sensible. After all, disks are typically much larger than RAM, so why not use them to cache infrequently accessed pages of memory? But many experienced system administrators treat swapping as an abnormal activity, a sign of something amiss. This is justifiable: even extremely fast NVMe solid state disks are orders of magnitude slower than RAM.

This means that an application which needs data paged back in from swap might wait for milliseconds, while a regular RAM access would complete in nanoseconds. Heavy usage of swapped-out memory can easily ruin the performance of a system, so swap activity is often taken as a sign that the system needs more RAM, or that memory usage needs to be tuned. A common "solution" is to disable swapping entirely, forcing the operating system to resort to other means to free up memory when necessary.
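To get a feel for how quickly swap-ins dominate, a rough back-of-the-envelope model helps. The sketch below computes the average memory access time when some fraction of accesses fault on swapped-out pages. The latency constants (100 ns for RAM, 1 ms for a swap-in) and the function names are illustrative assumptions, not measurements.

```python
# Assumed, illustrative latencies; real values depend on the hardware.
RAM_ACCESS_NS = 100          # ordinary RAM access
SWAP_IN_NS = 1_000_000       # one swap-in, taken here as 1 millisecond


def average_access_ns(swap_fault_fraction):
    """Average access time when the given fraction of accesses hit swap."""
    if not 0.0 <= swap_fault_fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return ((1.0 - swap_fault_fraction) * RAM_ACCESS_NS
            + swap_fault_fraction * SWAP_IN_NS)


def slowdown(swap_fault_fraction):
    """How many times slower the average access is than pure RAM access."""
    return average_access_ns(swap_fault_fraction) / RAM_ACCESS_NS
```

Under these assumptions, a single swap-in per thousand accesses already makes the average access roughly eleven times slower than RAM alone.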

Although swap is tremendously slower than RAM, it can still be a valuable tool. So let's revisit how swapping works in FreeBSD, and try to provide some insight into frequently raised issues.

Swapping: when and how much?

Computer systems have a fixed amount of RAM. It is up to the operating system to optimize its usage. Ideally, operating systems would be able to peer into the future to see which data is about to be accessed; with this information they could ensure that the data is available in RAM before it is accessed. Being constrained to the real world, however, they use a set of heuristics which try to predict future memory accesses.

An effective and commonly used heuristic is to cache recently accessed data in memory, since it is likely to be accessed again. When no free memory is available and an application accesses uncached data, FreeBSD determines which memory was least recently accessed and evicts its contents to make room for new data. This algorithm is called Least-Recently Used (LRU).
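As a point of reference, exact LRU can be sketched in a few lines. This is a toy model of the algorithm itself, not of FreeBSD's implementation; the class and attribute names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Toy exact-LRU cache that tracks which pages are resident."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.pages = OrderedDict()   # least recently used page comes first
        self.evicted = []            # eviction history, oldest first

    def access(self, page):
        """Touch a page; return True on a hit and False on a miss."""
        if page in self.pages:
            self.pages.move_to_end(page)   # now the most recently used
            return True
        if len(self.pages) >= self.capacity:
            victim, _ = self.pages.popitem(last=False)
            self.evicted.append(victim)
        self.pages[page] = True
        return False
```

With a capacity of two, the access sequence a, b, a, c evicts b: a was touched more recently, so b is the least recently used page when c arrives.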

Note that the kernel cannot simply throw away data[1]. File system cache can be evicted without reservation, since its data can simply be reloaded from disk. But the kernel doesn't know what application data is or is not available on disk, so the contents of memory allocated by user programs must first be paged out to disk before they are evicted from RAM. Thus, swap activity is closely tied to LRU.

Implementing LRU precisely comes with a lot of undesirable overhead, so FreeBSD implements an approximation of LRU: it tries to find memory that has not been accessed in a long time, and evicts that. As part of this implementation, FreeBSD partitions the system's memory into a set of queues: the active, inactive and laundry queues[2]. The size of each of these queues is shown by top(1):

  Mem: 2591M Active, 6576M Inact, 1389M Laundry, 4155M Wired, 1543M Buf, 1130M Free
  Swap: 8192M Total, 1623M Used, 6569M Free, 19% Inuse
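For scripted monitoring, lines like these can be picked apart with a small parser. The sketch below is a minimal example that assumes every size carries an "M" suffix, as in the sample above; top(1) may also print other units, which this sketch does not handle. The function names are hypothetical.

```python
import re

SWAP_LINE = "Swap: 8192M Total, 1623M Used, 6569M Free, 19% Inuse"


def parse_top_line(line):
    """Return (label, {field: megabytes}) for a 'Mem:' or 'Swap:' line."""
    label, _, rest = line.strip().partition(":")
    sizes = {}
    # Only "<digits>M <Name>" fields are collected; "19% Inuse" is skipped.
    for amount, name in re.findall(r"(\d+)M (\w+)", rest):
        sizes[name] = int(amount)
    return label, sizes


def swap_percent_in_use(sizes):
    """Whole-number percentage of swap in use, rounded down."""
    return 100 * sizes["Used"] // sizes["Total"]
```

For the sample line, 1623 of 8192 megabytes is about 19.8%, which rounds down to the 19% shown.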

"Wired" pages are not eligible to be paged out and thus do not participate in LRU. Active pages are frequently referenced; typically, they are mapped into one or more processes' address spaces. For example, memory returned by malloc(3) will initially be resident in the active queue. To determine which active pages are no longer being referenced, a kernel process called the "page daemon" periodically examines the recent access history of each page. Unreferenced pages are aged out of the active queue and into the inactive queue.

The inactive queue contains pages that have not been accessed recently. Such pages are good candidates for reuse if the kernel needs to handle a shortage of free memory. The queue helps order pages by recency of access: newly inactive pages are inserted at the tail of the queue, and pages are reclaimed from the head of the queue.

Earlier we noted that a well-behaved operating system must not throw away data when handling a free memory shortage. If some data is in a page of memory and no copy of that data exists in stable storage, then the page is said to be "dirty" and before it can be reused its contents must be paged out to storage. Otherwise, the page is "clean." For example, if one searches a file using grep(1), that file's data must be loaded into memory, but because grep(1) merely reads that data, that memory will be clean and can be reused at any point.

The active and inactive queues contain both clean and dirty pages. When reclaiming memory to alleviate a shortage, the page daemon will free clean pages from the head of the inactive queue. Dirty pages must first be cleaned by paging them out to swap or a file system. This is a lot of work, so the page daemon moves them to the laundry queue for deferred processing. The laundry queue is managed by a dedicated thread, the laundry thread, which is responsible for deciding when and how much to page out. The relationship between these queues is depicted here:

[Figure: relationship between the active, inactive and laundry queues]

To summarize:

- The page daemon migrates unreferenced pages from the active queue to the inactive queue (1).

- To free up memory, the page daemon scans pages at the head of the inactive queue (2), frees clean pages (6), and moves dirty pages to the tail of the laundry queue (3).

- When the laundry thread decides to clean some dirty pages (4), it hands them to a pager, which writes their contents to stable storage and places the cleaned pages in the inactive queue (5).

- If a page is referenced after it is placed in the inactive or laundry queues, it will lazily be moved back to the active queue.
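The transitions summarized above can be modelled with three queues and two per-page flags. The following is a deliberately simplified sketch, not the kernel's data structures: the class, its method names, and the use of sets for the "dirty" and "referenced" bits are all assumptions made for illustration.

```python
from collections import deque


class PageQueues:
    """Toy model of the active, inactive and laundry queues."""

    def __init__(self):
        self.active = deque()
        self.inactive = deque()    # head is the left end, tail the right
        self.laundry = deque()
        self.free = []
        self.dirty = set()         # pages with no valid copy in storage
        self.referenced = set()    # pages touched since the last scan

    def add_active(self, page, dirty=False):
        self.active.append(page)
        if dirty:
            self.dirty.add(page)

    def reference(self, page):
        self.referenced.add(page)

    def _reactivate_if_referenced(self, page):
        """Lazily move a referenced page back to the active queue."""
        if page in self.referenced:
            self.referenced.discard(page)
            self.active.append(page)
            return True
        return False

    def age_active(self):
        """Move unreferenced active pages to the tail of the inactive queue."""
        still_active = deque()
        while self.active:
            page = self.active.popleft()
            if page in self.referenced:
                self.referenced.discard(page)   # reset for the next scan
                still_active.append(page)
            else:
                self.inactive.append(page)
        self.active = still_active

    def scan_inactive(self, count):
        """Free clean pages from the inactive head; defer dirty ones."""
        for _ in range(min(count, len(self.inactive))):
            page = self.inactive.popleft()
            if self._reactivate_if_referenced(page):
                continue
            if page in self.dirty:
                self.laundry.append(page)
            else:
                self.free.append(page)

    def launder(self, count):
        """Clean pages from the laundry head and return them to inactive."""
        for _ in range(min(count, len(self.laundry))):
            page = self.laundry.popleft()
            if self._reactivate_if_referenced(page):
                continue
            self.dirty.discard(page)   # stands in for the pager's write
            self.inactive.append(page)
```

In this model a dirty page has to travel through the laundry queue and back to the inactive queue before a later scan can free it, whereas a clean page is freed on the first scan that reaches it.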

One possible strategy for the laundry thread is to just do nothing, relying on reclamation of clean pages to satisfy demand for free memory. Indeed, this is effectively what happens when you disable swap altogether. However, pages in the laundry queue are inactive by definition, and unused memory is wasted memory. Another possible strategy is to launder pages as soon as they enter the queue, but this can result in unnecessary I/O.

The policy used by the laundry thread makes use of several signals:

1) The ratio of the lengths of the inactive and laundry queues,

2) the number of clean pages reclaimed since the last set of page-outs, and

3) the size of the inactive queue relative to its target (minimum) size.

The laundry thread uses the first two signals to control "background" laundering, while the third is used to drive "shortfall" laundering.

The idea behind background laundering is to try to ensure that some dirty pages are paged out before a shortage of clean inactive pages occurs. When the system is out of both free pages and clean inactive pages, applications that need free memory are effectively stuck waiting for some page-outs to swap to complete. The laundry thread therefore tries to ensure that the laundry queue does not grow too large: the larger the ratio (1) of queue sizes, the more frequently the laundry thread will perform page-outs of dirty memory. Because it is a waste of I/O bandwidth to page out dirty memory when there is no shortage of free memory, the laundry thread monitors the activity of the page daemon to determine how frequently it should perform page-outs.
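To make the interplay of the three signals concrete, here is a toy decision function. It is emphatically not the kernel's formula: the function name, the way the signals are combined, and the `background_threshold` default are all invented for illustration. It only mirrors the qualitative behaviour described above.

```python
def choose_laundering(inactive, laundry, inactive_target, clean_reclaimed,
                      background_threshold=4096):
    """Return 'shortfall', 'background' or 'idle' for this toy model.

    All arguments are page counts. background_threshold is a made-up
    tuning knob, not a real FreeBSD parameter.
    """
    if laundry == 0:
        return "idle"                      # nothing to launder
    # Signal 3: the inactive queue has fallen below its target size.
    if inactive < inactive_target:
        return "shortfall"
    # Signals 1 and 2: weigh the clean pages reclaimed since the last
    # page-outs by the laundry-to-inactive ratio. A larger ratio means
    # less page daemon activity is needed to start background laundering.
    ratio = laundry / max(inactive, 1)
    if clean_reclaimed * ratio >= background_threshold:
        return "background"
    return "idle"
```

Note that with no page daemon activity (`clean_reclaimed` of zero) the model never launders in the background, matching the observation that paging out dirty memory without demand for free memory wastes I/O bandwidth.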

Why is my system using so much swap space?

When a dirty page's contents have been paged out to swap, the page is marked clean and becomes eligible for reclamation. At this point, the page's contents exactly match the copy saved to the swap device. (The page could be dirtied again, in which case it was recently accessed and belongs back in the active queue.) Suppose the page is freed, and later an application tries to read the data. A fresh page will be allocated and the data is paged back in from swap, at which point the application can run again and use that data. At this point the page is still clean - only if the data is written to will the page be marked dirty - so the copy in the swap device is still valid and there is no reason to discard it.

More generally, a write-once-read-many access pattern can be common for some types of data. A long-lived process may allocate and write to a region of memory during startup and thereafter only read from that memory, for example. If that memory is paged out, FreeBSD will retain the copy in swap so long as it remains valid. Otherwise, in order to reclaim that memory, it would have to perform another expensive page-out operation.
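The benefit of keeping the swap copy can be seen by counting I/O operations in a small state machine. This is a sketch of the idea only; the class and its counters are hypothetical and model a single page.

```python
class SwapBackedPage:
    """Toy model of one anonymous page and its copy in swap."""

    def __init__(self):
        self.resident = True
        self.dirty = True          # freshly written by the application
        self.swap_valid = False    # no copy in swap yet
        self.page_outs = 0
        self.page_ins = 0

    def _fault_in(self):
        if not self.resident:
            self.page_ins += 1     # contents are read back from swap
            self.resident = True   # page is clean; swap copy stays valid

    def read(self):
        self._fault_in()

    def write(self):
        self._fault_in()
        self.dirty = True
        self.swap_valid = False    # the swap copy is now stale

    def evict(self):
        if not self.resident:
            return
        if self.dirty:
            self.page_outs += 1    # contents must be written to swap first
            self.dirty = False
            self.swap_valid = True
        # A clean page with a valid swap copy is freed without any I/O.
        self.resident = False
```

Evicting and re-reading the page three times costs three page-ins but only a single page-out, because the swap copy written the first time is still valid. A later write invalidates that copy, so the next eviction has to page out again.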

It is thus common to see moderate swap space usage even when plenty of free memory is available[3]: at some point in the past, demand for free memory triggered page-outs to swap, and the swapped-out data remained valid.

Why is the kernel killing my processes?

In some scenarios, shortfall laundering may not be enough to alleviate a shortage of free memory. A process may have a runaway memory leak or the system may be oversubscribed to the point where it becomes completely unresponsive. The laundry thread may be paging out memory as quickly as possible but cannot meet demand, or the swap device may be full. At this point the kernel has little choice but to try to kill processes to reclaim memory and restore stability to the system - the dreaded OOM (out-of-memory) kill.

FreeBSD will trigger OOM kills in a couple of scenarios. First, if the page daemon repeatedly fails to reclaim _any_ pages from the inactive queue, it will eventually trigger OOM kills. If the swap device is full, the laundry thread will be unable to move pages from the laundry queue to the inactive queue, so this condition may trigger an OOM kill, and consequently free some swap space.

FreeBSD will also trigger OOM kills if it detects that a thread is stuck in the page fault handler. Handling a hard page fault requires the allocation of some memory, and if applications are failing to make progress in this fundamental operation then the kernel will begin killing processes. This helps catch situations where a slow trickle of reclaimable pages prevents the first heuristic from kicking in, but the system is nevertheless unresponsive.

To find a process to kill, the kernel estimates the memory usage of each runnable process and selects the one with the largest usage. This heuristic works well when the OOM kill was triggered by a user-space memory leak since it will identify the process(es) that are leaking memory. However, the target process may be blocked such that killing it does not immediately free up any memory at all. In this situation the kernel may perform several back-to-back OOM kills and will eventually resort to killing important system processes.

Privileged processes can request immunity from the OOM killer in a couple of ways:

- The code can use madvise(MADV_PROTECT) to prevent the OOM killer from selecting the calling process. This protection is not inherited by child processes. Many essential daemons in the FreeBSD base system, such as syslogd and sshd, use this method.

- The protect(1) program can be used to start a process with OOM protection enabled. For services started using rc, one can set the ${name}_oomprotect rc.conf variable to "YES" to run the process with OOM protection enabled.
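The victim selection heuristic, together with OOM protection, boils down to "largest unprotected process wins". The sketch below illustrates only that rule; the `Proc` record and its fields are hypothetical stand-ins for the kernel's own bookkeeping, and details such as how memory usage is estimated are left out.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    estimated_pages: int      # stand-in for the kernel's usage estimate
    protected: bool = False   # e.g. set via madvise(MADV_PROTECT)


def pick_oom_victim(procs):
    """Return the unprotected process with the largest estimated usage."""
    candidates = [p for p in procs if not p.protected]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.estimated_pages)
```

A protected process is skipped even when it is the largest consumer, so the choice falls on the next largest process instead.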

Should I enable swap in 2021?

It is difficult to provide general advice. On FreeBSD-based appliances, such as embedded routers or NAS devices, it is quite common to see swap disabled. Such systems may not have any disks suitable for swapping[4], or their designers may not have been willing to accept potential application latency caused by faults on paged-out memory. In versions of FreeBSD before 11.0, the page daemon was responsible both for freeing clean inactive pages and laundering dirty inactive pages, so large amounts of paging activity could delay reclamation of memory and trigger freezes.

Many system designers are against enabling swap because of the risk of unbounded latency caused by excessive swapping. Much of this aversion relates to experience using slow HDDs as swap devices, especially when shared with a busy file system. We believe it is worthwhile for like-minded FreeBSD users to explore enabling swapping again. NVMe drives are commonplace and have access latencies in the tens of microseconds, several orders of magnitude smaller than what was standard just a decade ago.

FreeBSD's performance under memory pressure has continued to improve and the kernel is strenuously stress-tested on a continuous basis. Applications especially sensitive to memory access latency can be wired using mlock(2), while the system as a whole may benefit from the improved memory efficiency that swapping can provide.
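As an illustration of wiring memory from user space, the following sketch calls mlock(2) through ctypes on an anonymous mapping. It assumes a POSIX system whose C library can be loaded with `ctypes.CDLL(None)`; the helper name `try_mlock` is hypothetical. The call can legitimately fail, for example when the locked-memory resource limit is exceeded, so the helper reports success or failure rather than raising.

```python
import ctypes
import mmap


def try_mlock(buf):
    """Try to wire the pages backing a writable mmap object.

    Returns True if mlock(2) succeeded and False otherwise.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    mlock = getattr(libc, "mlock", None)
    if mlock is None:
        return False               # this C library does not export mlock
    # Declare the prototype so the address is passed as a full pointer.
    mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    mlock.restype = ctypes.c_int
    # Borrow the mapping's memory to learn its address; the borrowed view
    # is released when this function returns.
    view = (ctypes.c_char * len(buf)).from_buffer(buf)
    return mlock(ctypes.addressof(view), len(buf)) == 0
```

A caller might create a mapping with `mmap.mmap(-1, mmap.PAGESIZE)`, pass it to `try_mlock`, and fall back to unlocked memory if the result is False. Unmapping the region, for instance by closing the mmap object, also removes the lock.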

Footnotes:

[1] The madvise(MADV_FREE) system call lets applications tell the kernel to do just that.

[2] On NUMA systems, there is one set of queues per memory domain. Each domain's queues are managed by separate page daemon and laundry threads.

[3] Swap device usage can be monitored using swapinfo(8) or top(1).

[4] SD cards do not work well as swap devices. They have low durability and can have very high I/O latency.
