Using the FreeBSD RACK TCP Stack
If you don’t follow transport protocol developments closely, you might not know that FreeBSD has more than one TCP stack, and that TCP stacks are pluggable at run time. Since version 9.0, FreeBSD has had support for pluggable TCP congestion control algorithms, modelled on Linux’s pluggable congestion control. The framework was first released in 2007 by James Healy and Lawrence Stewart whilst working on the NewTCP research project at Swinburne University of Technology’s Centre for Advanced Internet Architectures, Melbourne, Australia.
Pluggable congestion control has enabled a lot of experimentation and advancement in how TCP reacts to changing network conditions, by making it possible to test new congestion control algorithms in a non-binding way. Rather than having to make large, sweeping changes to the TCP stack, it is instead possible to load a FreeBSD kernel module that reacts to congestion events in new and different ways. Differing behaviors can be loaded dynamically rather than being controlled by compile-time if statements. This has made the code much cleaner and has created a lot of flexibility for testing, both in development and in production.
Since FreeBSD 12, FreeBSD has supported pluggable TCP stacks, and today we will look at the RACK TCP stack. The FreeBSD RACK stack takes this pluggable TCP feature to an extreme: rather than just swapping the congestion control algorithm, FreeBSD now supports dynamically loading an entirely separate TCP stack. With the RACK stack loaded, TCP flows can be handled either by the default FreeBSD TCP stack or by the RACK stack.
A Little Background on TCP and Congestion Control
RACK, or more properly RACK-TLP, is a pair of algorithms designed to improve the performance of TCP when there is packet loss.
When sending, TCP must measure the network link to determine how much traffic the network can carry. This varies over time, and the size of the estimate is governed by the length of the path (the latency) and the path’s capacity, or bandwidth. A classical TCP sender will exponentially increase the amount of traffic it sends into the network until it receives signals that the network is at capacity. This fast exponential-growth period is, frustratingly, called TCP’s ‘slow start’ phase.
Once the end of slow start has been detected, a TCP sender will save this value as a threshold up to which future slow starts may run (the slow start threshold), and move to a much slower-growing additive increase. TCP will react to most recoverable loss signals by halving its sending rate, but to non-recoverable loss by reducing its congestion window down to 1 segment.
TCP uses packet loss, reordering, and duplicate acknowledgements as signals that indicate network congestion. As the internet is a best-effort delivery network, it is designed to drop packets when the network is full, and so TCP uses this loss metric, and stand-ins for it (duplicate acknowledgements), as signals that the network is too busy.
Packet loss can occur for reasons other than congestion: network links are not perfect, and one-in-a-billion events are quite common at modern transfer rates; these can corrupt packets, leading to packets being dropped. Some link technologies, such as radio links, are more susceptible to loss. Real loss tends to occur in bursts, and it is common for a couple of packets in the middle of a flow to be lost. Because TCP is a reliable, in-order stream protocol, two things happen as a consequence of loss. Firstly, data after the missing packets cannot be delivered to the application until the gap in the stream is filled, which stalls the transfer of data from the TCP stack up to the application. Secondly, the sender will adapt to the loss by greatly reducing its sending rate. This means a single lost packet can both stall the stream, while the sender detects the loss and arranges to retransmit the data, and hurt the sending rate.
TCP uses acknowledgments as a mechanism to detect loss. When a packet is lost in the middle of a flow, the sender starts to receive duplicate ACKs, which signal that data is arriving but the furthest contiguous part of the stream isn’t advancing. However, when the loss is at the end of a transmission, near the end of the connection or after a chunk of video has been sent, the receiver won’t receive more segments that would generate ACKs. When this sort of tail loss occurs, a lengthy retransmission timeout (RTO) must fire before the final segments of data can be resent.
TCP SACK (selective ACK) was standardized in RFC 2018 (way back in 1996) to help deal with cases where loss happens in the middle of a stream. TCP SACK allows the receiver to indicate gaps it has detected in the data stream. SACK is a great performance improvement for TCP, but it is limited in the number of SACK ranges it can signal by the TCP option space available in the header.
RACK-TLP is documented by RFC8985 and the abstract gives quite a nice introduction to what RACK-TLP changes to increase performance:
RACK-TLP uses per-segment transmit timestamps and selective acknowledgments (SACKs) and has two parts. Recent Acknowledgment (RACK) starts fast recovery quickly using time-based inferences derived from acknowledgment (ACK) feedback, and Tail Loss Probe (TLP) leverages RACK and sends a probe packet to trigger ACK feedback to avoid retransmission timeout (RTO) events. Compared to the widely used duplicate acknowledgment (DupAck) threshold approach, RACK-TLP detects losses more efficiently when there are application-limited flights of data, lost retransmissions, or data packet reordering events. It is intended to be an alternative to the DupAck threshold approach.
RACK-TLP is a pair of algorithms that work together to improve performance for application-limited workloads. Workloads are called application limited when their sending performance is not restricted by the network bandwidth estimate that TCP makes, but instead by the rate at which the application sends data down to the socket layer.
Streaming web video is a typical example of application-limited traffic. When streaming video works in a DASH or HLS style mode, there is a digest that describes where the different chunks of video are. The video player downloads this digest and then downloads segments of video from the server. Based on how quickly each chunk of video arrives, the player can then select a different quality for playback. This allows the player to dynamically change the resolution of the video to best suit the network performance.
From the server side, the connection sending the video is only in use some of the time: it will send a chunk of video as fast as it can, but once that chunk has made it to the video player, the server’s connection will be idle until the next chunk of video is requested. This video server is application limited and, due to the nature of its bursty sends, is a great candidate to see performance improvements through the use of RACK-TLP.
The Recent ACK (RACK) algorithm allows connections to respond to isolated loss quickly, by using time-based inferences to send retransmissions, and this avoids large reductions in the congestion window (which governs the peak sending rate) when these events occur. RACK allows our video server to detect losses quickly and respond, so the video player can play back the chunk of video without stalling.
The Tail Loss Probe (TLP) allows connections to quickly discover losses at the end of a transfer, or of a portion of a transfer (tail losses), without having to rely on lengthy retransmission timeouts (RTOs). RACK is able to infer losses from ACKs; however, there are times when ACKs are sparse. TLP helps in these situations by sending tail data probes intended to stimulate acknowledgements from the receiver. Without TLP, when the tail of a transmission is lost, the connection has to wait for the RTO timer to fire. When this happens, the congestion window is reduced down to 1 segment, as the network is indicating congestion, and the sender must restart the exponential searching phase again. With TLP, the sender can instead send a probe of already-sent data, or of new data (if available), to stimulate feedback from the receiver. The TLP timer runs at a much shorter interval than the RTO timer and is able to trigger a faster recovery mechanism, rather than closing down the connection’s congestion window and building it up again.
Enabling the RACK stack on FreeBSD
RACK has been developed as an alternate TCP stack for FreeBSD. The project to develop RACK and BBR also involved a lot of clean-up and locking changes to the FreeBSD TCP code, and incorporating RACK as a second TCP stack was a much less disruptive change.
tcp(4) documents the sysctl nodes that describe the available TCP stacks and how to change the defaults:
functions_available
        List of available TCP function blocks (TCP stacks).
functions_default
        The default TCP function block (TCP stack).
functions_inherit_listen_socket_stack
        Determines whether to inherit the listen socket's TCP stack or use the current system default TCP stack, as defined by functions_default. Default is true.
If we look at the sysctl net.inet.tcp.functions_available on a host we can see the TCP stacks the host can use:
root@freebsd # sysctl net.inet.tcp.functions_available
net.inet.tcp.functions_available:
Stack                           D Alias                            PCB count
freebsd                         * freebsd                          3
On this machine there is only one TCP stack available: the freebsd stack. The sysctl tells us what each stack is called, which stack is the default (the D column), and the number of protocol control blocks (PCBs), that is, connections, using each stack.
The RACK stack is not yet enabled by default. On the latest version of FreeBSD (13 at the time of writing) you will need to build a FreeBSD kernel with a couple of extra options. Please see this article on building the FreeBSD kernel if you haven’t done so before.
To use the RACK stack, we need to rebuild the kernel with the WITH_EXTRA_TCP_STACKS=1 flag and we need to include the option TCPHPTS; I also include the option RATELIMIT. The first flag builds support for extra TCP stacks, and the TCPHPTS option includes support for TCP high-precision timestamps, which RACK requires. The RATELIMIT option enables offloading pacing to network cards that support it.
For this example, I have created a new kernel configuration file called RACK in /usr/src/sys/amd64/conf/RACK which contains:

include GENERIC
ident   RACK
options RATELIMIT
options TCPHPTS
To build with extra stacks we need to create a src.conf with WITH_EXTRA_TCP_STACKS=1:
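A minimal version of that file is just the one line; I am assuming here that it lives at the standard /etc/src.conf location (create the file if it does not already exist):

```shell
# /etc/src.conf
WITH_EXTRA_TCP_STACKS=1
```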
With those files in place, we can build and install our kernel with extra stack support:
# make -j 16 KERNCONF=RACK buildkernel
# make installkernel KERNCONF=RACK KODIR=/boot/kernel.rack
# reboot -k kernel.rack
Once we have built, installed and rebooted to the new kernel we need to load the RACK kernel module tcp_rack.ko:
root@freebsd # kldload /boot/kernel.rack/tcp_rack.ko
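To have the module loaded automatically at boot, one option (an assumption on my part, using the standard rc.conf kld_list mechanism, with the full path because we installed the kernel to /boot/kernel.rack) is:

```shell
# /etc/rc.conf – load the RACK TCP stack module at every boot
kld_list="/boot/kernel.rack/tcp_rack.ko"
```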
Now, with the module loaded, functions_available reports two TCP stacks: the freebsd stack remains the default, and the rack stack is also available.
root@freebsd:~ # sysctl net.inet.tcp.functions_available
net.inet.tcp.functions_available:
Stack                           D Alias                            PCB count
freebsd                         * freebsd                          3
rack                              rack                             0
To experiment with the rack stack host wide we can change the default:
root@freebsd # sysctl net.inet.tcp.functions_default=rack
net.inet.tcp.functions_default: freebsd -> rack
root@freebsd # sysctl net.inet.tcp.functions_available
net.inet.tcp.functions_available:
Stack                           D Alias                            PCB count
freebsd                           freebsd                          3
rack                            * rack                             0
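To keep rack as the default across reboots, the usual place is /etc/sysctl.conf; note that the tcp_rack module must already be loaded by the time this setting is applied, or it will fail:

```shell
# /etc/sysctl.conf – new TCP connections default to the RACK stack
net.inet.tcp.functions_default=rack
```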
functions_available now tells us that rack is the default stack, but none of our existing connections have been moved to it. If we now create new TCP sessions, using a TCP tool such as iperf3, they will start using rack.
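As a quick sanity check, something like the following should show new connections counted against the rack stack. This is a sketch that assumes iperf3 is installed from ports or packages; the PCB counts are a point-in-time snapshot, so run the sysctl while the transfer is still active:

```shell
iperf3 -s -D                              # start an iperf3 server, daemonized, on this host
iperf3 -c 127.0.0.1 -t 30 &               # open a new TCP connection to it in the background
sysctl net.inet.tcp.functions_available   # rack's PCB count now includes the new connections
```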
RACK represents some serious improvements to TCP performance for certain application workloads. It was developed to support a streaming video workload that spends a large portion of its time application limited, but there have been reports that RACK on FreeBSD has also offered large performance gains for bulk transfer applications. You will need to evaluate whether using the RACK stack helps with your workload, and FreeBSD makes this easy to deploy and experiment with. You can enable the RACK stack on a host and then selectively enable it for applications or time periods, making it reasonably safe to experiment with in production.
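The tcp(4) excerpt earlier suggests one way to do that selective, per-application enablement from the shell: because listen sockets inherit whichever stack is the system default when they are created (functions_inherit_listen_socket_stack defaults to true), you can flip the default, restart just the one daemon you want on rack, and flip it back. A sketch, using nginx purely as a hypothetical stand-in for whatever service you want to test:

```shell
sysctl net.inet.tcp.functions_default=rack     # make rack the default briefly
service nginx restart                          # hypothetical daemon; its new listen socket inherits rack
sysctl net.inet.tcp.functions_default=freebsd  # restore the default for everything else
# Connections accepted by nginx now use rack; other new connections use freebsd.
```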