FreeBSD Documentation: Papers We Love To Read
UNIX, from the beginning, was a research project, even if it started as a secret side project to give its creators more unbilled time to play computer games. Its birthplace was Bell Labs, AT&T's research institution. In 1975, when Ken Thompson brought the Version 6 UNIX tapes across the US from New Jersey to California, UNIX again landed in a research institution, this time a university: the University of California at Berkeley.
Research is codified in papers, and here we are spoiled for choice when it comes to putting together a BSD and UNIX reading list. For most of the features you see in a modern FreeBSD Operating System, there is a corresponding paper written during development or after inclusion to document the addition. Some of these ideas were only ever implemented in the lab and never saw the light of day, while others are cornerstones of modern computing.
In this article, we are going to cover some of our favourite FreeBSD papers and try to highlight their contribution to the FreeBSD Operating System.
Finding Papers about FreeBSD
If you have an academic background then you are probably overwhelmed by the number of papers on your to-read list. For those outside academia it might be hard to figure out how to track down interesting and useful papers about the FreeBSD Operating System.
Thankfully, the FreeBSD Project has made a consistent effort to collect information about FreeBSD, and there are a couple of places where papers have been gathered. The FreeBSD source tree might seem like an unlikely place, but for a research Operating System it makes sense to include copies of papers of interest to the research, developer, and user communities. A small selection of papers ships in share/doc/papers:
```
[user@computer] $ ls /usr/src/share/doc/papers
beyond4.3          devfs        kernmalloc  newvm
bsdreferences.bib  diskperf     kerntune    relengr
bufbio             fsinterface  Makefile    sysperf
contents           jail         malloc      timecounter
```
Most of these are now quite old, but the nice thing about important developments is that the earlier presentations of the work tend to be very clear in their goals and can be great introductions to how features work.
New users to FreeBSD are constantly surprised by the quality of the man pages that ship with the system. As you do development, you see more and more how helpful it is to have online documentation for how things are supposed to work.
FreeBSD man pages frequently have EXAMPLES sections near the bottom. These sections can make it easy to get started with a new tool or API, and they often link to in-depth articles and papers covering usage and background.
```
SEE ALSO
     jail(2), kvm(3), EVENTHANDLER(9), KASSERT(9), sysctl(9)

     Marko Zec, Implementing a Clonable Network Stack in the FreeBSD
     Kernel, USENIX ATC'03, June 2003, Boston
```
If we look at the man page for a kernel internal API like the section of VNET(9) reproduced above we might be lucky enough to have a populated SEE ALSO or HISTORY section that contains references to relevant papers.
A final place to look for papers on FreeBSD is the project's papers.freebsd.org archive. Some FreeBSD developers realised that, while there are many conferences a year covering FreeBSD, there was no single archive of all the talks and papers written about the project. papers.freebsd.org was created in 2018 to provide a single place for all the papers and talks about the FreeBSD Operating System. The project is on GitHub and accepts contributions from anyone, so long as the talk or paper submitted covers FreeBSD.
Papers We Love
Let’s read two papers that presented core features of FreeBSD today. The first covers the implementation of Jails; the second introduces a more obscure topic, VIMAGE. Jails are probably the best known FreeBSD-specific feature. Their invention and use have been a distinguishing feature of FreeBSD for two decades, and they are frequently pointed to as the original idea behind cloud technologies such as Docker. VIMAGE is much less well known, but it is a technique that enables further lightweight virtualization of the FreeBSD kernel, making pseudo virtual machines more powerful.
Jails: Confining the omnipotent root – Poul-Henning Kamp and Robert N. M. Watson
The story of jails in FreeBSD is quite well known. Poul-Henning Kamp (phk@) had a client that wanted to securely offer web services to multiple customers on a single machine. Jails were invented to improve chroot to make this possible, enabling the shared web hosting model that served the web of the late 90s and 2000s. The Jails paper was published at the 2nd SANE conference in 2000 held in Maastricht (the original conference website is a great read).
Jails aimed to solve two problems to create a stable and secure environment for shared hosting. First, chroot was not designed as a security separation mechanism. It was implemented to make producing BSD releases easier, by giving the build scripts a clean environment to run in. The original design meant that there were several well documented exploits that could allow a process to escape its chroot environment.
Second was the difficulty of implementing shared computing using access control mechanisms. Locking down processes via access controls is a nightmare: each permission must be weighed and considered. Access control systems at the time added a lot of complexity for the administrator, and even today, 21 years later, their combinations can be a source of security bugs.
Instead, Jails implement a different approach. Rather than a fine grained access control system, the entire jail environment is partitioned from the host and marked with a jail id. This allows the existing UNIX security model to remain with multiple users as well as root in each jail, while still containing these users to the jail environment.
For normal users this creates an environment that is hard to distinguish from a non-jail environment; it was possible to run entire application stacks inside a jail without software changes. For the root user, things get a bit weird. Many of the device nodes and controls that root would expect to exist on a system are not available, and calls that almost always succeed can now fail with an access error.
Jails were made possible by adding some new data structures and system calls to the FreeBSD 4 kernel. Now, when a process's privileges are checked, the kernel also checks whether the process is running inside a jail. A fine-grained access control system requires a check each time a subsystem is accessed; with Jails, checks are only needed in the few places in the kernel where a jailed process is allowed to run privileged functionality.
The Jails mechanism has some limitations. Networking is ‘strange’: all socket requests from a jail are redirected and rewritten to use the single IP address assigned to the jail, and another hack is used to deal with problems resulting from localhost (127.0.0.1) addresses.
Jails, as described in the paper, are very similar to the jails that we use today. There have been additions and extensions (especially for networking) that allow the jail to look more like a host and in some cases have a lot of the host’s functionality delegated to them.
If we look at the ‘Future Directions’ section of the paper we can see some hints of features that would be great to have in jails today:
Management of jail environments is currently somewhat ad hoc--creating and starting jails is a well-documented procedure, but day-to-day management of jails, as well as special case procedures such as shutdown, are not well analysed and documented.
Today, jail management is handled by vanilla FreeBSD through rc.conf and jail.conf, and if you need more control or more powerful features, there are a number of jail managers available.
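As a small illustration of the base-system tooling, a minimal jail.conf entry might look like the following. The jail name, path, hostname, and address are placeholders chosen for the example:

```
# /etc/jail.conf
www {
    path = "/usr/jail/www";
    host.hostname = "www.example.org";
    ip4.addr = "192.0.2.10";
    exec.start = "/bin/sh /etc/rc";
    exec.stop = "/bin/sh /etc/rc.shutdown";
    mount.devfs;
}
```

With jail_enable="YES" in rc.conf, jails defined this way are started at boot; contrast that with the ad hoc procedures the paper describes, and the gap the authors pointed at has clearly been filled.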
There has been substantial interest in improving interface virtualisation, allowing one or more addresses to be assigned to an interface, and removing the requirement that the address be an IPv4 address, allowing the use of IPv6. ... Another area of great interest to the current consumers of the jail code is the ability to limit the impact of one jail on the CPU resources available for other jails. Specifically, this would require that the jail of a process play a role in its scheduling parameters.
Jail-like mechanisms have become the core of much of the technology driving the web today. This paper is a good read as the foundation of process separation technologies that actually saw widespread deployment and use.
Implementing a Clonable Network Stack in the FreeBSD Kernel – Marko Zec
The second paper we want to discuss covers a feature that is less well known, but we believe it introduced a very powerful extension to jails and it shows a different track that research projects can follow.
Frequently when we talk about Virtual Machines we are talking about hardware features that enable a hypervisor environment. These hardware features were not as common in 2003, but there was still a desire to share the resources of a single host as much as possible while keeping the sub-workloads as separate as possible. An alternative to full virtualization is to create a lightweight or pseudo virtual machine: the operating system kernel hosts multiple ‘virtual’ environments that are separated from each other. The virtual machine boundary exists at the system call layer, with the OS resources shared by all the pseudo virtual machines.
Marko’s paper introduced a concept called Virtual Images, or VIMAGE. A virtual image is a way to extend the functionality of a pseudo virtual machine by modifying kernel internal structures to make them clonable. The clones are then isolated from the rest of the system and only interact with the process attached to the clone.
The paper presents a general idea, expanding the utility of shared resources by further isolating them, while giving a concrete example of a virtualised or shared network stack in a real operating system, the FreeBSD kernel.
This proposed method virtualizes and shares the FreeBSD network stack and makes the imaged stacks creatable and assignable at a process level in a hierarchical manner. Once a process has been moved into a VIMAGE, its view of the network is only what is present in that VIMAGE, and it is completely isolated from interacting with the host.
VIMAGE clearly states two design goals:
- Virtualize the entire network stack
- Preserve the complete functionality of the base OS
These goals led to the requirement to add code to the FreeBSD kernel. Marko does touch on the idea of modifying userspace code to pass an extra parameter, a VIMAGE id, with each call, but this has the limitation of requiring user code to behave, which is not the best thing for security. Instead, VIMAGE is implemented by converting the base network stack into a root VIMAGE and adding a mechanism for further network stacks to be created below this root in a hierarchy.
Rather than adding a new system for moving a process into a VIMAGE, process isolation is achieved by leveraging the existing FreeBSD jail mechanism. This has the benefit of reusing existing isolation code and greatly simplifies the implementation.
To use a VIMAGE, a process elects that one of its children be moved into the virtualised network stack. Once moved inside, the child process cannot decide to leave. VIMAGE creation leads the kernel to allocate memory for the new network stack, create a loopback address for use in the VIMAGE, and call the initialization routines for all of the network protocols it provides, just as a system would do on boot. The network stack is then isolated entirely from the host's network.
VIMAGE is implemented by modifying the global data structures in the FreeBSD kernel so that they are accessed behind macro expansions. On use, the macros expand to refer to the current VIMAGE, which might be the root default VIMAGE or the process's own VIMAGE if one exists.
On creation the VIMAGE looks like a host with no network interfaces installed. To reach the outside world, network interfaces must be moved from the root network stack into a child. If a special back-to-back interface is used (what we have as epair today), traffic can be allowed to pass out of one VIMAGE and into another. Otherwise, network interfaces (real or virtual) can be moved into the VIMAGE, disappearing from the host and entering the isolated world.
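The descendant of this design is what FreeBSD calls VNET jails today. As a rough illustration (a root session on a VNET-enabled FreeBSD system; the jail name and interface numbering are examples), moving an epair endpoint into a jail looks something like this:

```
# jail -c name=demo vnet persist path=/
# ifconfig epair create          # creates the pair epair0a / epair0b
# ifconfig epair0b vnet demo     # epair0b disappears from the host...
# jexec demo ifconfig epair0b    # ...and reappears inside the jail
```

The host keeps epair0a, the jail gets epair0b, and traffic between them behaves like a crossover cable between two machines, which is exactly the model the paper sketched.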
Network isolation is just the example that Marko implemented to explore this concept. In the future work section there is an interesting piece about further ideas that could be explored:
In parallel with bringing the code in closer sync with the FreeBSD –current branch, the original concept of partitioning the OS in virtual images could be further extended by virtualizing other system resources, such as real and virtual memory, network and disk bandwidth, etc. An interesting option for further development could certainly be the reimplementation of virtual images as a modular resource container-type facility. Each resource instance (network stack, process group, CPU, memory etc.) would be represented by its own data structure, and struct vimage would only contain pointers to such structures. In such an environment, the system administrator could freely combine only the desired virtualized system resources in a virtual image, depending on the specific environment and application requirements.
In reality, landing VIMAGE in FreeBSD was a lot of work, and it took a long time. However, some of the ideas here are still interesting, and with 18 more years of experience they might make interesting FreeBSD research projects today.
The VIMAGE paper is quite different from the Jails paper; in both content and style it is a more typical example of a paper that would be published today. It contains not only the core idea but also a lot of useful background. The paper shows how technologically advanced new features can be brought into FreeBSD from sources outside the project. It is a good read today, both as a historical example of FreeBSD development and as a source of ideas that can still be explored in the FreeBSD Operating System.
What do we learn?
From these two papers we can get a good idea of how new features and ideas get incorporated into FreeBSD.
The Jails paper discussed a feature that had been added to FreeBSD 4.0 and was already available to use. The idea had been developed by a FreeBSD committer and was heavily tested in production. This meant that the paper got to talk about reality, a feature that had arrived.
VIMAGE was able to leverage Jails to experiment with new ideas, enhancing the concept with further pseudo virtualization precisely because the jail code was already shipping in the OS. The VIMAGE paper is more typical of an academic presentation: Marko presents an idea, and he is the first to admit that while there is a functioning prototype, it is not in FreeBSD, and a lot of work is still required to land it or keep it alive out of tree.
VIMAGE was eventually incorporated into FreeBSD, but as an off-by-default feature. Off by default is a way to preserve POLA (the principle of least astonishment), a development approach that has both strengths and weaknesses.
It was difficult to move VIMAGE from off by default to on by default; when it landed in the tree it was not completely stable. It was through the hard work of FreeBSD developers that it was eventually turned on in FreeBSD 11, 13 years after the code was first made public.
The papers that we love show the development of modern FreeBSD features, and reading back through them helps us understand how our beloved Operating System got to where it is. If you sampled 10 different developers for their favourite 3 papers, we are sure you would get 31 responses and a lot of jokes about off-by-one errors. Our favourites are normally the ones we are most grateful someone wrote.
Now would be a great time to look up the man page for your favourite FreeBSD subsystem, scroll down to the HISTORY section, and jump on any papers you see mentioned.
Why not share your favourite papers with us at Klara via twitter and we might even have a deep dive into their contributions to FreeBSD in the future?