Using the FreeBSD RACK TCP Stack

If you don’t follow transport protocol developments closely, you might not know that FreeBSD has more than one TCP stack and that TCP stacks are pluggable at run time. Since version 9.0, FreeBSD has had support for pluggable TCP congestion control algorithms, modelled on Linux’s pluggable congestion control. The framework was first released in 2007 by James Healy and Lawrence Stewart whilst working on the NewTCP research project at Swinburne University of Technology’s Centre for Advanced Internet Architectures in Melbourne, Australia.

Pluggable congestion control has enabled a lot of experimentation and advancement in how TCP reacts to changing network conditions, making it possible to test new congestion control algorithms in a non-binding way. Rather than making large, sweeping changes to the TCP stack, it is possible to load a new FreeBSD kernel module that reacts to congestion events in new and different ways. Differing behaviours can be loaded dynamically rather than being selected by compile-time if statements. This has made the code much cleaner and has created a lot of flexibility when it comes to testing in development and in production.
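As a quick sketch of what this looks like in practice, congestion control algorithms can be listed and swapped at run time with kldload and sysctl (the node and module names here follow mod_cc(4); the exact set of available algorithms depends on your FreeBSD version):

```shell
# List the congestion control algorithms currently available
sysctl net.inet.tcp.cc.available

# Load an additional algorithm as a kernel module, e.g. CUBIC
kldload cc_cubic

# Make CUBIC the default for new connections
sysctl net.inet.tcp.cc.algorithm=cubic
```

Existing connections keep the algorithm they were created with; only new connections pick up the changed default.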

Since FreeBSD 12, FreeBSD has supported pluggable TCP stacks, and today we will look at the RACK TCP stack. The FreeBSD RACK stack takes this pluggable TCP feature to an extreme: rather than just swapping the congestion control algorithm, FreeBSD now supports dynamically loading an entirely separate TCP stack. With the RACK stack loaded, TCP flows can be handled either by the default FreeBSD TCP stack or by the RACK stack.

A Little Background on TCP and Congestion Control

RACK, or more properly RACK-TLP, is a pair of algorithms designed to improve the performance of TCP when there is packet loss.

When sending, TCP must measure the network link to determine how much traffic the network can carry. This varies over time, and the estimate is governed by the length of the path (the latency) and the path’s capacity, or bandwidth. A classical TCP sender will exponentially increase the amount of traffic it sends into the network until the receiver indicates that the network is at capacity. This fast, exponential growth period is, frustratingly, called TCP’s ‘slow start’ phase.

Once the end of slow start has been detected, a TCP sender saves this value as a threshold up to which future slow starts may run, and moves to a much slower-growing additive increase. TCP reacts to most recoverable loss signals by halving its sending rate, but to non-recoverable loss by reducing its congestion window down to one segment.

TCP uses packet loss, reordering, and duplicate acknowledgements as signals of network congestion. As the internet is a best-effort delivery network, it is designed to drop packets when the network is full, and so TCP uses this loss metric and its stand-ins (duplicate acknowledgements) as signals that the network is too busy.

Packet loss can occur for reasons other than congestion: network links are not perfect, and one-in-a-billion events are quite common, causing packet corruption that leads to packets being dropped. Some link technologies, such as radio links, are more susceptible to loss. Real loss tends to occur in bursts, and it is common for a couple of packets in the middle of a flow to be lost. Because TCP is a reliable, in-order stream protocol, two things happen when losses occur. Firstly, data after the missing packets cannot be delivered to the application until the gap in the stream is filled, which stalls the transfer of data from the TCP stack up to the application. Secondly, the sender adapts to the loss by greatly reducing its sending rate. This means that a single packet loss can lead to a stall while the sender detects the loss and arranges to retransmit the data, and that the sending rate suffers as a consequence.

TCP uses acknowledgements as a mechanism to detect loss. When a packet is lost in the middle of a flow, the sender starts to receive duplicate ACKs, which signal that data is arriving but that the furthest contiguous part of the stream isn’t advancing. However, when the loss is at the end of a transmission — near the end of the connection, or after a chunk of video has been sent — the receiver won’t receive any more segments that would generate ACKs. When this sort of tail loss occurs, a lengthy retransmission timeout (RTO) must fire before the final segments of data can be resent.

TCP SACK (selective ACK) was standardized in RFC 2018 (way back in 1996) to help deal with cases where loss happens in the middle of a stream. TCP SACK allows the receiver to indicate the gaps it has detected in the data stream. SACK is a great performance improvement for TCP, but it is limited in the number of SACK ranges it can signal by the TCP option space available in the header.
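On FreeBSD, SACK is enabled by default; as a sketch, its state can be inspected (and toggled for experimentation) through sysctl:

```shell
# Check whether TCP SACK is enabled (1 = enabled, the default)
sysctl net.inet.tcp.sack.enable

# It can be disabled for experimentation, though doing so
# usually hurts recovery performance:
# sysctl net.inet.tcp.sack.enable=0
```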

RACK-TLP

RACK-TLP is documented by RFC8985 and the abstract gives quite a nice introduction to what RACK-TLP changes to increase performance:

RACK-TLP uses per-segment transmit timestamps and selective
acknowledgments (SACKs) and has two parts. Recent Acknowledgment (RACK)
starts fast recovery quickly using time-based inferences derived from
acknowledgment (ACK) feedback, and Tail Loss Probe (TLP) leverages RACK
and sends a probe packet to trigger ACK feedback to avoid
retransmission timeout (RTO) events. Compared to the widely used
duplicate acknowledgment (DupAck) threshold approach, RACK-TLP detects
losses more efficiently when there are application-limited flights of
data, lost retransmissions, or data packet reordering events. It is
intended to be an alternative to the DupAck threshold approach.

RACK-TLP comprises two algorithms that work together to improve performance for application-limited workloads. Workloads are called application limited when their sending rate is not restricted by the network bandwidth estimate that TCP makes, but instead by the rate at which the application sends data down to the socket layer.

Web streaming video is a typical example of application-limited traffic. When streaming video works in a DASH or HLS style mode, there is a digest that describes where the different chunks of video are. The video player downloads this digest and then downloads segments of video from the server. Based on how quickly each chunk of video arrives, the player can select a different quality for playback. This allows the player to dynamically change the resolution of the video to best suit the network performance.

From the server side, the connection sending the video is only in use some of the time: it will send a chunk of video as fast as it can, but once that chunk has made it to the video player, the server’s connection will be idle until the next chunk of video is requested. This video server is application limited and, due to the bursty nature of its sends, it is a great candidate to see performance improvements through the use of RACK-TLP.

The Recent ACK (RACK) algorithm allows connections to respond to isolated loss quickly by using time-based inferences to send retransmissions, avoiding large reductions in the congestion window (which governs the peak sending rate) when these events occur. RACK allows our video server to detect losses quickly and respond, letting the video player play back the chunk of video.

Tail Loss Probe (TLP) allows connections to discover losses at the end of a transfer, or of a portion of a transfer (tail losses), quickly, without having to rely on lengthy retransmission timeouts (RTOs). RACK is able to infer losses from ACKs; however, there are times when ACKs are sparse. TLP helps in these situations by sending probes intended to stimulate acknowledgements from the receiver. Without TLP, when the tail of a transmission is lost, the connection has to wait for the RTO timer to fire. When this happens, the congestion window is reduced down to one segment, as the network is indicating congestion, and the sender must restart the exponential searching phase again. With TLP, the sender can instead send a probe of already-sent data, or of new data if available, to stimulate the receiver into providing feedback. The TLP timer runs at a much shorter interval than the RTO timer and can trigger a faster recovery mechanism, rather than closing down the connection’s congestion window and building it up again.

Enabling the RACK stack on FreeBSD

RACK has been developed as an alternate TCP stack for FreeBSD. The project to develop RACK and BBR also involved a lot of clean-up and locking changes to the FreeBSD TCP code, and it was much less disruptive to incorporate RACK as a second TCP stack.

tcp(4) documents the sysctl nodes that describe the available TCP stacks and how to change the defaults:

functions_available
    List of available TCP function blocks
    (TCP stacks).
functions_default
    The default TCP function block (TCP
    stack).
functions_inherit_listen_socket_stack
    Determines whether to inherit listen
    socket's tcp stack or use the current
    system default tcp stack, as defined
    by functions_default.  Default is
    true.

If we look at the sysctl net.inet.tcp.functions_available on a host we can see the TCP stacks the host can use:

root@freebsd # sysctl net.inet.tcp.functions_available
net.inet.tcp.functions_available: 
Stack                           D Alias                            PCB count
freebsd                         * freebsd                          3

On this machine there is only one TCP stack available, the freebsd stack. The sysctl tells us what the stack is called, which one is the default stack and the number of protocol control blocks (PCBs) or connections that are using this stack.

The RACK stack is not yet enabled by default. On the latest version of FreeBSD (13 at the time of writing) you will need to build a FreeBSD kernel with a couple of extra options. Please see this article on building the FreeBSD kernel if you haven’t done so before.

To use the RACK stack, we need to rebuild the kernel with the WITH_EXTRA_TCP_STACKS=1 flag, and we need to include options TCPHPTS – I also include options RATELIMIT. The first flag builds with support for extra TCP stacks, and the TCPHPTS option includes support for the TCP high-precision timer system, which RACK requires. The RATELIMIT option enables offloading packet pacing to network cards that support it.

For this example, I have created a new kernel configuration file called RACK in /usr/src/sys/amd64/conf/RACK which contains:

include GENERIC
ident RACK

options RATELIMIT
options TCPHPTS

To build with extra stacks, we need to create /etc/src.conf containing WITH_EXTRA_TCP_STACKS=1:

WITH_EXTRA_TCP_STACKS=1

With those files in place, we can build and install our kernel with extra stack support:

# make -j 16 KERNCONF=RACK buildkernel
# make installkernel KERNCONF=RACK KODIR=/boot/kernel.rack
# reboot -k kernel.rack

Once we have built, installed, and rebooted into the new kernel, we need to load the RACK kernel module tcp_rack.ko:

root@freebsd # kldload /boot/kernel.rack/tcp_rack.ko 
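To have the module loaded automatically at boot, a line can be added to /boot/loader.conf — a sketch, assuming the usual FreeBSD `<module>_load` naming convention for the tcp_rack module:

```shell
# /boot/loader.conf – load the RACK TCP stack module at boot
tcp_rack_load="YES"
```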

Now, with the module loaded, functions_available reports two TCP stacks: the freebsd stack remains the default, and there is a rack stack too.

root@freebsd:~ # sysctl net.inet.tcp.functions_available
net.inet.tcp.functions_available:
Stack                           D Alias                            PCB count
freebsd                         * freebsd                          3
rack                              rack                             0

To experiment with the rack stack host wide we can change the default:

root@freebsd # sysctl net.inet.tcp.functions_default=rack
net.inet.tcp.functions_default: freebsd -> rack
root@freebsd # sysctl net.inet.tcp.functions_available
net.inet.tcp.functions_available:
Stack                           D Alias                            PCB count
freebsd                           freebsd                          3
rack                            * rack                             0

functions_available now shows that rack is the default stack, but none of our existing connections have been moved to it. If we now create new TCP sessions, using a TCP tool such as iperf3, they will start using rack.
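A quick way to confirm that new connections land on the rack stack — a sketch, assuming iperf3 is installed and the default stack has been switched as above:

```shell
# Start an iperf3 server in the background and run a short local transfer
iperf3 -s -D
iperf3 -c 127.0.0.1 -t 5 &

# While the transfer runs, the rack stack's PCB count should be non-zero
sysctl net.inet.tcp.functions_available
```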

Conclusion

RACK represents some serious improvements to TCP performance for certain application workloads. It was developed to support a streaming video workload that spends a large portion of its time application limited, but there have been reports that RACK on FreeBSD has also offered large performance gains for bulk-transfer applications. You will need to evaluate whether using the RACK stack helps with your workload, and FreeBSD makes this easy to deploy and experiment with. You can enable the RACK stack on a host and then selectively enable it for applications or time periods, making it reasonably safe to experiment with in production.

Meet the Author: Tom Jones

Tom Jones is an Internet Researcher and FreeBSD developer who works on improving the core protocols that drive the Internet. He is a contributor to open standards in the IETF and is enthusiastic about using FreeBSD as a platform to experiment with new networking ideas as they progress towards standardisation.
