FreeBSD arm64 Performance – Getting more out of your FreeBSD Deployment
With the release of FreeBSD 13, ARM64 has been elevated to Tier 1 support status. FreeBSD support for ARM64 has grown steadily since the architecture was incorporated in 2015, based on work supported by the community, Arm, and Marvell (previously known as Cavium).
Many developers have provided important components of support for the architecture. Andrew Turner (andrew@) performed the initial port to arm64 and has remained involved, Emmanuel Vadot (manu@) has been key in making many different SBCs work on FreeBSD, and countless others have implemented device drivers, gotten ports working, and tested the architecture to bring it to the stable, high-performance state it is in today.
We are now seeing the fruits of this work: developer interest and support have reached a level sufficient to make a long-term commitment to the platform, as reflected by the Tier 1 status.
FreeBSD arm64 targets hardware ranging from tiny single board computers all the way up to servers with more than 100 processor cores, and this can make it hard to draw solid conclusions about the performance implications of using or moving to the arm64 architecture.
The performance of any computer is defined by a number of factors and how they interplay. When building an AMD64 desktop or server you need to consider the processor, cache sizes, and memory capability in terms of speed and channels, and you need to ensure that the motherboard these components are placed on supports the further expansion you require, whether for storage devices, graphics, or network interfaces.
ARM64 machines don’t differ from other architectures here, but the computers built around ARM64 processors tend to have less flexibility in their configurations. The range of choices is small compared to AMD64 machines, where components are typically modular and upgradable. While we are seeing more flexible and powerful hardware with ARM64 processors, you are typically locked into a vendor’s board design and the trade-offs it makes. We are not yet at the point of specifying motherboards and expecting expansions to plug in and just work without some planning.
When considering an ARM64 machine’s performance you need to think about specifying and evaluating the following components:
- Storage interfaces
- Network interfaces
- Expansion Options
As ARM64 hardware varies in the scale between single board computer devices, through desktop and workstation hardware to high performance multi-socket servers, it is important to have a good understanding of what the capabilities of the hardware are, and where bottlenecks might appear.
ARM64 has two quite distinct tiers when it comes to hardware and the performance you can expect. Most of the early ARM64 systems were an evolution from embedded use cases with 32-bit ARM processors. Many of these systems were Single Board Computers (SBCs) designed for a specific purpose and built around a System on a Chip (SOC) that was designed for a couple of applications.
Did you know?
Until the 30th of June 2021, AWS is offering a free trial of the Graviton 2 t4g.micro instance size, with 750 hours of free time available each month of the trial.
SBCs such as the Raspberry Pi 3 and 4, the ROCK64, and laptops like the Pinebook Pro are built on ARM64 SOCs like these, and the continuation from earlier 32-bit ARM embedded devices is clear. This is not to say that these systems cannot manage exceptional performance; modern phone SOCs have been compared in performance to the latest generation of processors from Intel. It does, however, mean that the SOCs tended to be designed for specific applications, such as use in a phone or tablet. The SBCs designed around these SOCs lack the expansion options that you would expect from a desktop or server, and can be very constrained in terms of network and storage interfaces.
The scale of the hardware can also dictate the available interfaces for expansion and storage, and can govern network capability and expandability.
The lower end of SBCs will have limits on the available hardware interfaces and speeds; many such boards only have USB 2.0 controllers available for expansion. Sometimes these SOCs natively support and integrate gigabit network interfaces, and other times the network interface shares the USB bus with any other peripherals.
The advantages of these boards are frequently lower power consumption, low price, and the ability to drive physical devices through interfaces such as General Purpose IO (GPIO). Many of these boards support daughterboards that add expansion hardware, such as small OLED screens and buttons, which isn’t often found on AMD64 motherboards of similar capabilities.
SBCs tend to suffer from a lack of available interfaces for storage; they will typically support booting from SD cards or MMC interfaces. USB boot support is becoming more common and is available on the Raspberry Pi 3 (over USB 2.0), but SATA and M.2 support is less common. The lack of high speed storage will have an impact on any workload that requires frequent or sustained disk access.
From the launch of the ARMv8 architecture, ARM was looking at server applications and standardized interfaces that would make server implementations more straightforward than SBCs. The ARM Server Base System Architecture (SBSA) defines a common target architecture for ARM64 server machines and incorporates technologies that will be familiar from the AMD64 world: it uses ACPI for device discovery and offers UEFI as a firmware layer.
Desktop-equivalent boards such as the SolidRun HoneyComb workstation are much more comparable to AMD64 desktops. These Mini-ITX boards offer 16 Cortex-A72 cores, support up to 64GB of dual-channel DDR4 memory, have multiple SATA ports for storage as well as M.2, and have expansion options using PCIe and USB3. The HoneyComb workstations are becoming a favored ARM64 development machine among FreeBSD developers due to the high performance and smooth workflow they enable.
Server grade hardware is much closer to what you would expect from an AMD64 system. Server systems such as the Ampere Altra-based systems offer the ability to install terabytes of RAM, more than 100 cores, PCIe Gen 4, and high performance storage. Most ARM64 server systems come with high price tags, and there is not yet an entry-level tier of server hardware comparable to the market for lower spec AMD64 servers.
If you are running FreeBSD ARM64 in the cloud then hardware decisions have been made for you, and the EC2 machines available have high performance storage and network devices. Burstable instance types such as t4g do have a drawback: their peak performance is much higher than the sustained performance the platform will allow, unless you pay for the extra usage. This article discusses running FreeBSD in EC2 on ARM64.
ARM64 supports CPU scaling as an option to reduce power consumption and to reduce thermal pressure on the processors on the SOCs. On many single board computers, FreeBSD will not attempt to rescale the CPU performance, and you may see boards default to low clock rates. The cpufreq(4) device offers an interface for managing CPU clock speeds.
You can investigate the CPU clock rate by looking at the hw.cpufreq sysctl:
    $ sysctl hw.cpufreq
    hw.cpufreq.temperature: 51540
    hw.cpufreq.voltage_sdram_p: 1200000
    hw.cpufreq.voltage_sdram_i: 1200000
    hw.cpufreq.voltage_sdram_c: 1200000
    hw.cpufreq.voltage_core: 1200000
    hw.cpufreq.turbo: 0
    hw.cpufreq.sdram_freq: 400000000
    hw.cpufreq.core_freq: 250000000
    hw.cpufreq.arm_freq: 600000000
In this example from a Raspberry Pi 3, hw.cpufreq.arm_freq: 600000000 tells us that the processor is running at 600 MHz, which is much lower than the 1.2 GHz the processor can run at. On the Raspberry Pi we can modify this by editing a boot parameter in config.txt stored on the SD card (see this page for more information).
We can also control the clock frequency of the CPU through sysctl:
    $ sysctl dev.cpu.0
    dev.cpu.0.temperature: 51.5C
    dev.cpu.0.freq_levels: 1200/-1 600/-1
    dev.cpu.0.freq: 600
    dev.cpu.0.%parent: cpulist0
    dev.cpu.0.%pnpinfo: name=cpu@0 compat=arm,cortex-a53
    dev.cpu.0.%location:
    dev.cpu.0.%driver: cpu
    dev.cpu.0.%desc: Open Firmware CPU
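The dev.cpu.0.freq_levels sysctl tells us the available clock frequencies the processor will allow us to configure. Each entry is a frequency in MHz followed by a power estimate, where -1 means the power figure is unknown.
We can change the CPU frequency by using the sysctl command: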
    # sysctl dev.cpu.0.freq=1200
    dev.cpu.0.freq: 600 -> 1200
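Rather than hard-coding 1200, the highest available level can be picked out of freq_levels. The following is a minimal sketch, assuming the freq_levels format shown above (space-separated frequency/power pairs); the helper name max_freq_level is hypothetical and not part of FreeBSD.

```sh
#!/bin/sh
# Hypothetical helper: print the highest frequency (in MHz) found in a
# freq_levels string such as "1200/-1 600/-1". Only the part before the
# slash of each entry is compared.
max_freq_level() {
    printf '%s\n' "$1" | tr ' ' '\n' | cut -d/ -f1 | sort -n | tail -n 1
}

# Example with the value reported on the Raspberry Pi 3 above.
max_freq_level "1200/-1 600/-1"

# On a FreeBSD host, as root, the result could be applied like this
# (left commented out so the sketch is safe to run anywhere):
# sysctl dev.cpu.0.freq="$(max_freq_level "$(sysctl -n dev.cpu.0.freq_levels)")"
```

A setting made this way lasts only until the next reboot, and the thermal caveats of running at the higher clock rate still apply.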
We can evaluate the difference the higher clock rate makes with a CPU-bound md5 benchmark:
    # sysctl dev.cpu.0.freq=1200
    dev.cpu.0.freq: 600 -> 1200
    # md5 -t
    MD5 time trial. Digesting 100000 10000-byte blocks ... done
    Digest = 766a2bb5d24bddae466c572bcabca3ee
    Time = 6.209081 seconds
    Speed = 153.593475 MiB/second
    # sysctl dev.cpu.0.freq=600
    dev.cpu.0.freq: 1200 -> 600
    # md5 -t
    MD5 time trial. Digesting 100000 10000-byte blocks ... done
    Digest = 766a2bb5d24bddae466c572bcabca3ee
    Time = 12.146159 seconds
    Speed = 78.516533 MiB/second
With the increased clock rate, this naive benchmark completes in roughly half the time: 6.2 seconds instead of 12.1 seconds. Many machines will ship in a ‘safe’ configuration. In the case of the Pi 3, it is running at a CPU frequency that is unlikely to generate thermal issues when there isn’t any additional cooling installed. If a proper heat sink and ventilation are added, it is probably going to be fine to run at these higher speeds.
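To put a number on "half the time", the two elapsed times from the md5 runs can be divided. This is a small sketch using awk for the floating point arithmetic; the function name speedup is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: ratio of two elapsed times (slow / fast),
# printed to two decimal places.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# Times taken from the md5 -t runs at 600 MHz and 1200 MHz.
speedup 12.146159 6.209081   # prints 1.96
```

A ratio of 1.96 against a clock ratio of 2.0 suggests this workload scales almost linearly with CPU frequency on this board, which is what we would expect from a benchmark that barely touches memory or storage.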
ARM64 supports a processor topology called Big.Little. In this arrangement, fast, high power cores are paired with slower, lower power cores. This configuration works very well in mobile devices where the device is idle most of the time and the fast, power-hungry cores can be turned off, reducing the total system power consumption. When there is serious processing to do, the system is able to use both the high power cores and the lower power cores to increase its computational capacity.
There have been issues running on boards with Big.Little where, for stability, certain cores have been disabled by default. If your target hardware has a Big.Little configuration, it is worth investigating if FreeBSD is detecting and using all of the available cores.
The FreeBSD scheduler is not yet Big.Little aware and will schedule work to all of the cores as if they were equal. If you have single-threaded workloads and are running on a Big.Little system it might be worth investigating if you get higher performance by binding your worker thread to a single Big core by using cpuset(1).
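One rough way to find out which CPU numbers are the fast cores is to run the md5 time trial pinned to each core in turn and compare the speeds. The sketch below assumes a FreeBSD host with cpuset(1) and md5(1) and the md5 -t output format shown earlier in this article; the function names are hypothetical, and which CPU numbers map to Big cores depends on the hardware.

```sh
#!/bin/sh
# Hypothetical helper: read `md5 -t` output on stdin and print the
# MiB/second figure from the "Speed = ..." line.
md5_speed() {
    awk '/^Speed/ { print $3 }'
}

# Hypothetical helper (FreeBSD only): run the time trial pinned to each
# CPU in turn and print "cpu speed" pairs, one per line.
bench_each_cpu() {
    ncpu=$(sysctl -n hw.ncpu)
    cpu=0
    while [ "$cpu" -lt "$ncpu" ]; do
        printf '%s %s\n' "$cpu" "$(cpuset -l "$cpu" md5 -t | md5_speed)"
        cpu=$((cpu + 1))
    done
}

# Usage on the target machine (not run here):
#   bench_each_cpu
# Then bind a single-threaded worker to one of the faster cores, where
# CPU 4 and ./worker are placeholders:
#   cpuset -l 4 ./worker
```

Whether pinning actually helps should be measured with the real workload, since a short md5 run says little about memory or I/O behaviour.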
Big.Little offers an opportunity for much lower power consumption on hosts that have dynamic workloads. We are now starting to see Intel and AMD implementing hybrid architectures similar to Arm’s. The recently announced i.MX8 processors combine Big.Little topologies with a Cortex series microcontroller, enabling a wide range of real time workloads.
FreeBSD ARM64 support has matured very rapidly, and support for the architecture has grown as hardware has moved from being experimental lab machines to very high performance servers. There are different considerations at all scales with ARM64 and there is likely to be more and more information available to deal with the performance issues that are unique to the architecture. As we see more machines become available and wider deployments, we are going to see FreeBSD ARM64 usage grow greatly.