If you don’t follow transport protocol developments closely, you might not know that FreeBSD has more than one TCP stack and that TCP stacks are pluggable at run time. Since version 9.0, FreeBSD has had support for pluggable TCP congestion control algorithms, modelled on Linux’s pluggable congestion control. The framework was first released in 2007 by James Healy and Lawrence Stewart whilst working on the NewTCP research project at Swinburne University of Technology’s Centre for Advanced Internet Architectures in Melbourne, Australia.
Pluggable congestion control has enabled a lot of experimentation and advancement in how TCP reacts to changing network conditions, making it possible to test new congestion control algorithms without committing to them. Rather than making large, sweeping changes to the TCP stack, it is possible to load a new FreeBSD kernel module that reacts to congestion events in new and different ways. Differing behaviors can be loaded dynamically rather than being selected by compile-time conditionals, which has made the code much cleaner and created a lot of flexibility for testing in development and in production.
Since FreeBSD 12, FreeBSD has supported pluggable TCP stacks, and today we will look at the RACK TCP stack. The FreeBSD RACK stack takes this pluggable TCP feature to an extreme: rather than just swapping the congestion control algorithm, FreeBSD now supports dynamically loading an entirely separate TCP stack. With the RACK stack loaded, TCP flows can be handled either by the default FreeBSD TCP stack or by the RACK stack.
A Little Background on TCP and Congestion Control
RACK, or more properly RACK-TLP, is a pair of algorithms designed to improve the performance of TCP when there is packet loss.
When sending, TCP must measure the network link to determine how much traffic the network can carry. This varies over time, and the size of the estimate is governed by the length of the path (the latency) and the path’s capacity, or bandwidth. A classical TCP sender will exponentially increase the amount of traffic it sends into the network until the receiver indicates that the network is at capacity. This fast exponential growth period is, somewhat confusingly, called TCP’s ‘slow start’ phase.
Once the end of slow start has been detected, a TCP sender saves this value as a threshold up to which future slow starts may run, and moves to a much slower-growing additive increase. TCP reacts to most recoverable loss signals by halving its sending rate, but to non-recoverable loss by reducing its congestion window down to one segment.
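To make the shape of this behaviour concrete, here is a minimal sketch of classic window growth and loss reactions. This is illustrative pseudologic in Python, not FreeBSD stack code; the function names and the starting values are my own.

```python
# Illustrative sketch (not FreeBSD code): classic TCP window growth.
# cwnd doubles each round trip during slow start, then grows by one
# segment per round trip (additive increase) past ssthresh.

def next_cwnd(cwnd, ssthresh):
    """Return the congestion window (in segments) after one round trip."""
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh)   # slow start: exponential growth
    return cwnd + 1                      # congestion avoidance: additive

def on_fast_retransmit(cwnd):
    """Recoverable loss (e.g. triple DupAck): halve the sending rate."""
    ssthresh = max(cwnd // 2, 2)
    return ssthresh, ssthresh            # new (cwnd, ssthresh)

def on_rto(cwnd):
    """Non-recoverable loss (RTO): collapse to one segment and restart."""
    ssthresh = max(cwnd // 2, 2)
    return 1, ssthresh                   # new (cwnd, ssthresh)

cwnd, ssthresh = 1, 64
for _ in range(8):
    cwnd = next_cwnd(cwnd, ssthresh)
print(cwnd)   # 66: six doubling rounds to reach 64, then +1 per round
```

The asymmetry is the point: recoverable loss costs half the window, while an RTO throws away the whole bandwidth estimate and restarts the search from one segment.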
TCP uses packet loss, reordering, and duplicate acknowledgements as signals that indicate network congestion. As the internet is a best-effort delivery network, when the network is full it is designed to drop packets and so TCP uses this loss metric and stand-ins (duplicate acknowledgements) as signals that the network is too busy.
Packet loss can occur for reasons other than congestion – network links are not perfect, one-in-a-billion events are quite common, and these can cause packet corruption leading to packets being dropped. Some link technologies, such as radio links, are more susceptible to loss. Real loss tends to occur in bursts, and it is common for a couple of packets in the middle of a flow to be lost. Because TCP is a reliable, in-order stream protocol, two things happen when losses occur. Firstly, data after the missing packets cannot be delivered to the application until the gap in the stream is filled, which stalls the transfer of data from the TCP stack up to the application. Secondly, the sender will adapt to the loss by greatly reducing its sending rate. This means that a single packet loss can lead to a stall while the sender detects the loss and arranges to retransmit the data, and that the sending rate can be hurt by the loss.
TCP uses acknowledgments as a mechanism to detect loss. When a packet is lost in the middle of a flow, the sender starts to receive duplicate ACKs, which signal that data is arriving but the furthest contiguous part of the stream isn’t advancing. However, when the loss is at the end of a transmission – near the end of the connection, or after a chunk of video has been sent – the receiver won’t receive any more segments that would generate ACKs. When this sort of tail loss occurs, a lengthy retransmission timeout (RTO) must fire before the final segments of data can be resent.
TCP SACK (selective ACK) was standardized in RFC 2018 (way back in 1996) to help deal with cases where loss happens in the middle of a stream. SACK allows the receiver to indicate the gaps in the data stream it has detected. SACK is a great performance improvement for TCP, but the number of SACK ranges it can signal is limited by the TCP option space available in the header.
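The option-space limit is simple arithmetic, which the following sketch works through. TCP leaves at most 40 bytes for all options; a SACK block is two 32-bit sequence numbers, and the SACK option itself costs two bytes of kind/length overhead. The 12 bytes used below for the timestamp option (10 bytes plus padding) is the commonly seen layout.

```python
# Illustrative arithmetic (not stack code): why SACK can only report a
# few holes at a time. TCP leaves at most 40 bytes for all options.

MAX_OPTION_BYTES = 40
SACK_OVERHEAD = 2        # kind + length bytes of the SACK option itself
SACK_BLOCK_BYTES = 8     # two 32-bit sequence numbers per reported range

def max_sack_blocks(other_option_bytes=0):
    """How many SACK ranges fit alongside other options in one header."""
    free = MAX_OPTION_BYTES - other_option_bytes - SACK_OVERHEAD
    return free // SACK_BLOCK_BYTES

print(max_sack_blocks())    # 4 ranges with no other options present
print(max_sack_blocks(12))  # 3 ranges alongside the timestamp option
```

So a receiver can describe at most three or four holes per ACK, which is why SACK alone struggles when loss patterns are complex.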
RACK-TLP uses per-segment transmit timestamps and selective acknowledgments (SACKs) and has two parts. Recent Acknowledgment (RACK) starts fast recovery quickly using time-based inferences derived from acknowledgment (ACK) feedback, and Tail Loss Probe (TLP) leverages RACK and sends a probe packet to trigger ACK feedback to avoid retransmission timeout (RTO) events. Compared to the widely used duplicate acknowledgment (DupAck) threshold approach, RACK-TLP detects losses more efficiently when there are application-limited flights of data, lost retransmissions, or data packet reordering events. It is intended to be an alternative to the DupAck threshold approach.
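The core of RACK’s time-based inference can be sketched in a few lines. This is a simplified illustration of the idea in RFC 8985, not the FreeBSD implementation; the class and field names are my own, and a real stack tracks the reordering window adaptively rather than taking it as a parameter.

```python
# Illustrative sketch of RACK's time-based loss inference (simplified
# from RFC 8985). A segment is deemed lost when a segment sent *after*
# it has been SACKed and more than (rtt + reordering window) has
# elapsed since the segment's own transmission.

from dataclasses import dataclass

@dataclass
class Segment:
    seq: int              # starting sequence number
    xmit_time: float      # per-segment transmit timestamp (seconds)
    sacked: bool = False  # has the receiver SACKed this segment?

def rack_detect_losses(segments, now, min_rtt, reo_wnd):
    """Return sequence numbers inferred lost at time `now`."""
    latest_sacked_xmit = max(
        (s.xmit_time for s in segments if s.sacked), default=None)
    if latest_sacked_xmit is None:
        return []
    lost = []
    for s in segments:
        if (not s.sacked and s.xmit_time < latest_sacked_xmit
                and now - s.xmit_time > min_rtt + reo_wnd):
            lost.append(s.seq)
    return lost

segs = [Segment(1000, 0.00),
        Segment(2000, 0.01, sacked=True),
        Segment(3000, 0.02, sacked=True)]
# Segment 1000 was sent before SACKed segments and has aged out:
print(rack_detect_losses(segs, now=0.06, min_rtt=0.04, reo_wnd=0.01))
```

Because the decision is driven by elapsed time rather than a count of duplicate ACKs, a single isolated loss can be repaired as soon as one later segment is SACKed and the reordering window expires.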
RACK-TLP is two algorithms that work together to improve performance for application-limited workloads. A workload is called application limited when its sending performance is not restricted by the network bandwidth estimate that TCP makes, but instead by the rate at which the application sends data down to the socket layer.
Streaming web video is a typical example of application-limited traffic. When streaming video operates in a DASH or HLS style mode, there is a digest that describes where the different chunks of video are. The video player downloads this digest and then downloads segments of video from the server. Based on how quickly each chunk of video arrives, the player can select a different quality for playback. This allows the player to dynamically change the resolution of the video to best suit the network performance.
From the server side, the connection sending the video is only used intermittently: it will send a chunk of video as fast as it can, but once that chunk has made it to the video player, the server’s connection will be idle until the next chunk of video is requested. This video server is application limited and, due to the nature of its bursty sends, it is a great candidate to see performance improvements through the use of RACK-TLP.
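The player’s quality decision described above can be sketched as a simple rate-based rule. This is a hypothetical, heavily simplified adaptive-bitrate policy of my own, not any real player’s algorithm, and the bitrate ladder is invented for illustration.

```python
# Illustrative sketch (not a real player): DASH/HLS-style quality
# selection. The player picks the highest rung of the bitrate ladder
# that the measured throughput of the previous chunk can sustain.

LADDER_KBPS = [400, 1200, 2500, 5000]   # hypothetical rungs from the manifest

def pick_bitrate(chunk_bits, download_seconds, headroom=0.8):
    """Choose the next chunk's bitrate from the last chunk's timing."""
    throughput_kbps = chunk_bits / download_seconds / 1000
    budget = throughput_kbps * headroom   # keep some safety margin
    candidates = [r for r in LADDER_KBPS if r <= budget]
    return candidates[-1] if candidates else LADDER_KBPS[0]

# A ~10 Mb chunk that arrived in 2.5 s implies a ~4 Mb/s link:
print(pick_bitrate(10_000_000, 2.5))   # 2500
```

Note how the whole feedback loop depends on chunk delivery time, which is exactly what a tail loss followed by an RTO inflates.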
The Recent ACK (RACK) algorithm allows connections to respond to isolated loss quickly, using time-based inferences to send retransmissions, and this avoids large reductions in the congestion window (which governs the peak sending rate) when these events occur. RACK allows our video server to detect losses quickly and respond, allowing the video player to keep playing back chunks of video.
Tail Loss Probe (TLP) allows connections to discover losses at the end of a transfer, or a portion of a transfer (tail losses), quickly, without having to rely on lengthy retransmission timeouts (RTOs). RACK is able to infer losses from ACKs, but there are times when ACKs are sparse. TLP helps in these situations by sending tail data probes intended to stimulate acknowledgements from the receiver. Without TLP, when the tail of a transmission is lost, the connection has to wait for the RTO timer to fire; when that happens, the congestion window is reduced down to one segment, as the network is indicating congestion, and the sender must restart the exponential searching phase again. With TLP, the sender can instead send a probe of already-sent data, or new data if available, to stimulate feedback from the receiver. The TLP timer runs at a much shorter interval than the RTO timer and triggers a faster recovery mechanism rather than closing down the connection’s congestion window and building it up again.
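The timer gap is easy to see numerically. The sketch below compares a simplified RFC 6298-style RTO against a simplified RFC 8985-style probe timeout; the exact formulas in a real stack carry more terms and clamping, and the delayed-ACK pad value here is an assumption.

```python
# Illustrative comparison (simplified from RFC 6298 / RFC 8985): the
# TLP probe timeout fires far sooner than the conservative RTO, so a
# tail loss is repaired without collapsing the congestion window.

def rto(srtt, rttvar, min_rto=1.0):
    """RFC 6298-style retransmission timeout, with a 1 second floor."""
    return max(srtt + 4 * rttvar, min_rto)

def tlp_pto(srtt, delack=0.2, one_segment_in_flight=False):
    """RFC 8985-style probe timeout: roughly two smoothed RTTs,
    padded for delayed ACKs when only one segment is outstanding."""
    return 2 * srtt + (delack if one_segment_in_flight else 0)

srtt, rttvar = 0.05, 0.01          # a 50 ms path
print(rto(srtt, rttvar))           # 1.0: the floor dominates
print(tlp_pto(srtt))               # 0.1: probe fires ten times sooner
```

On this hypothetical 50 ms path the probe fires in a tenth of a second, while waiting out the RTO costs a full second plus the window collapse that follows it.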
Enabling the RACK stack on FreeBSD
RACK has been developed as an alternate TCP stack for FreeBSD. The project to develop RACK and BBR also involved a lot of clean-up and locking changes to the FreeBSD TCP code, and incorporating RACK as a second TCP stack was a much less disruptive change.
tcp(4) documents the sysctl nodes that describe the available TCP stacks and how to change the defaults:
functions_available
        List of available TCP function blocks (TCP stacks).

functions_default
        The default TCP function block (TCP stack).

functions_inherit_listen_socket_stack
        Determines whether to inherit the listen socket's TCP stack
        or use the current system default TCP stack, as defined by
        functions_default.  Default is true.
If we look at the sysctl net.inet.tcp.functions_available on a host we can see the TCP stacks the host can use:
root@freebsd # sysctl net.inet.tcp.functions_available
Stack D Alias PCB count
freebsd * freebsd 3
On this machine there is only one TCP stack available: the freebsd stack. The sysctl tells us what each stack is called, which one is the default (marked with an asterisk in the D column), and the number of protocol control blocks (PCBs), i.e. connections, using each stack.
To use the RACK stack, we need to rebuild the kernel with the WITH_EXTRA_TCP_STACKS=1 flag, and we need to include options TCPHPTS – I also include options RATELIMIT. The first flag builds support for extra TCP stacks, and TCPHPTS includes support for the TCP high-precision timestamps that RACK requires. The RATELIMIT option enables offloading pacing to network cards that support it.
For this example, I have created a new kernel configuration file called RACK in /usr/src/sys/amd64/conf/RACK which contains:
include GENERIC
ident RACK
options TCPHPTS
options RATELIMIT
To build with extra stacks, we need to create an /etc/src.conf containing WITH_EXTRA_TCP_STACKS=1:

WITH_EXTRA_TCP_STACKS=1
With those files in place, we can build and install our kernel with extra stack support:
# make -j 16 KERNCONF=RACK buildkernel
# make installkernel KERNCONF=RACK KODIR=/boot/kernel.rack
# reboot -k kernel.rack
Once we have built, installed, and rebooted into the new kernel, we need to load the RACK kernel module tcp_rack.ko and make rack the default stack:

# kldload tcp_rack
# sysctl net.inet.tcp.functions_default=rack
functions_available now tells us that rack is the default stack, but none of our existing connections have been moved to it. If we now create some new TCP sessions – with a TCP tool such as iperf3, for example – they will start using rack.
RACK represents a serious improvement to TCP performance for certain application workloads. It was developed to support a streaming video workload that spends a large portion of its time application limited, but there have been reports that RACK on FreeBSD has also offered large performance gains for bulk transfer applications. You will need to evaluate whether the RACK stack helps with your workload, and FreeBSD makes this easy to deploy and experiment with. You can enable the RACK stack on a host and then selectively enable it for particular applications or time periods, making it reasonably safe to experiment with in production.
Meet the author: Tom Jones
Tom Jones is an Internet Researcher and FreeBSD developer who works on improving the core protocols that drive the Internet. He is a contributor to open standards in the IETF and is enthusiastic about using FreeBSD as a platform to experiment with new networking ideas as they progress towards standardisation.