
The IT infrastructure ecosystem continues to mature, moving away from the idea that each different role should be handled by a dedicated machine. However, when sharing resources – or even just sharing a context – it can be difficult to ensure different users and applications do not interfere with each other and remain isolated.

The growing container ecosystem is designed to solve this problem by creating a unique context for each application or workload, while still sharing the resources of a single compute node. This ensures freedom from conflicting dependencies and avoids potential interference between applications. With the help of an orchestration framework and/or a cluster scheduler, containers can be deployed across multiple compute nodes to maximize resource utilization.

For today’s topic, we will look at how best to leverage ZFS in such an environment to maximize resource utilization and ensure applications are isolated, all while giving greater control over how storage is managed.

Building Blocks

The first building block of our multi-tenant or multi-workload configuration is OpenZFS. ZFS provides pooled storage to combine the capacity and performance of multiple storage devices while adding additional reliability. It then allows the administrator or framework to create multiple virtual filesystems on top of that pool, sharing the available free space and resources subject to configured constraints.
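
As a minimal sketch (the pool name and device paths below are placeholders, not part of the walkthrough later in this article), creating a pool and layering datasets on top of it looks like this:

# create a mirrored pool named "tank" from two hypothetical disks
sudo zpool create tank mirror /dev/sdb /dev/sdc

# create nested filesystems that all share the pool's free space
sudo zfs create tank/projects
sudo zfs create tank/projects/alpha
sudo zfs create tank/projects/beta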

Each ZFS virtual file system (dataset) has a set of inherited properties that control various aspects, such as transparent compression, encryption, block/record size, quotas, reservations, and the directory where the contents are mounted. Compared to traditional file systems, this already offers more flexibility and control, but it is the ability to delegate a dataset entirely to a container that truly makes ZFS the best choice for containers.
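
Continuing the hypothetical tank/projects example from above, per-dataset properties are set (and inherited by child datasets) like this:

# enable transparent compression for everything under tank/projects
sudo zfs set compression=lz4 tank/projects

# cap one tenant at 50 GiB and guarantee another at least 20 GiB
sudo zfs set quota=50G tank/projects/alpha
sudo zfs set reservation=20G tank/projects/beta

# relocate a dataset's mountpoint and review the resulting properties
sudo zfs set mountpoint=/srv/alpha tank/projects/alpha
zfs get -r compression,quota,reservation,mountpoint tank/projects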

ZFS has supported containers since its early days at Sun Microsystems through its integration with Solaris Zones. When ZFS was ported to FreeBSD, this concept was adapted to work with Jails, where it has been widely used for over a decade. Support for Linux containers arrived more recently as part of the OpenZFS 2.2 release through patches written and upstreamed by Klara for a customer.

Namespaces

The Linux concept of containers is significantly different from the containers on Solaris and FreeBSD. Rather than a single top-level entity that acts as the container and provides isolation, Linux uses the concept of Namespaces to provide a separate context for each different resource.

Some of the basic namespaces that are supported on Linux are:

  • Mount: When a new mount namespace is created, it inherits the mounts of its parent, but any newly created mounts exist only within the new namespace. This namespace provides isolation at the filesystem level.
  • PID: The PID namespace keeps processes in different containers separate from each other and allows conflicting PIDs to exist. The first process created in a new PID namespace is given PID 1, along with the special properties this implies.
  • Net: The network namespace contains only a loopback adapter by default, has its own routing table, and can use IP addresses that conflict with those in other network namespaces, since they cannot see each other.
  • IPC: A separate shared memory namespace ensures that applications communicating via shared memory can only reach applications in the same namespace/container, preventing any cross-talk.
  • UTS: The UNIX Time Sharing namespace allows each container to have its own hostname and other basic settings, distinguishing it from the host machine.
  • UID: The UID namespace is one of the more complex aspects of Linux namespaces, particularly in relation to the filesystem. Each container gets a range of UIDs that are valid from the perspective of the host machine (e.g., 100,000 – 165,535), which are mapped to appear as 0 – 65,535 inside the namespace. So, each namespace has its own root user with UID 0, but these different users can still be uniquely identified by the host OS.
  • CGroup: The last namespace is the cgroup namespace, which isolates each container's view of the control groups that Linux uses to implement resource controls.

By combining these different namespaces, a container becomes very similar to a lightweight virtual machine, but at a lower resource cost. With the filesystem, processes, network, and users isolated, applications and workloads can share the host machine’s resources without conflicting with each other.
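
These namespaces can also be explored directly with the unshare(1) utility from util-linux, without any container runtime involved. The following is only a quick sketch of the idea and is separate from the LXD-based walkthrough below:

# start a shell in new mount, PID, UTS, network, and IPC namespaces;
# --fork makes the shell PID 1 inside the new PID namespace, and
# --mount-proc remounts /proc so tools only see that namespace
sudo unshare --mount --pid --fork --mount-proc --uts --net --ipc bash

# inside the new namespaces:
ps -ef          # only the processes in this PID namespace are visible
hostname demo   # changes the hostname within this UTS namespace only
ip link         # shows just the loopback interface

# from another terminal on the host, list the namespaces in use
lsns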

Linux Namespaces in Action

As a practical example of Linux namespaces, the following steps will turn a fresh Ubuntu 24.04 machine into a multi-tenant application hosting server, where each tenant has full control over their sub-dataset of a ZFS pool.

Install ZFS

user@host:~$ sudo apt install zfsutils-linux

Confirm ZFS is loaded

user@host:~$ zfs version
zfs-2.2.2-0ubuntu9.1
zfs-kmod-2.2.2-0ubuntu9.1

Note: You will need OpenZFS 2.2 or later to support the Linux namespace delegation feature. If you need a newer version of OpenZFS on Ubuntu 20.04 or 22.04, consider a Klara ZFS Support Subscription, which includes access to an up-to-date APT repository with newer versions of ZFS.
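
If you also want to confirm that the kernel module itself is loaded, and not just the userland tools, something along these lines works (output will vary by system):

user@host:~$ lsmod | grep -w zfs
user@host:~$ cat /sys/module/zfs/version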

Create an unprivileged user

user@host:~$ sudo adduser unpriv
user@host:~$ sudo usermod -a -G lxd unpriv

Determine the UID/GID mapping for the new user:

user@host:~$ cat /etc/subuid
unpriv:165536:165536
user@host:~$ cat /etc/subgid
unpriv:165536:165536

Set up LXD

user@host:~$ sudo su - unpriv
unpriv@host:~$ lxd init
Installing LXD snap, please be patient.
Would you like to use LXD clustering? (yes/no) [default=no]:
Do you want to configure a new storage pool? (yes/no) [default=yes]:
Name of the new storage pool [default=default]: storage
Name of the storage backend to use (lvm, powerflex, zfs, btrfs, ceph, dir) [default=zfs]: zfs
Create a new ZFS pool? (yes/no) [default=yes]:
Would you like to use an existing empty block device (e.g. a disk or partition)? (yes/no) [default=no]: yes
Path to the existing block device: /dev/sdb
Would you like to connect to a MAAS server? (yes/no) [default=no]:
Would you like to create a new local network bridge? (yes/no) [default=yes]:
What should the new bridge be called? [default=lxdbr0]:
What IPv4 address should be used? (CIDR subnet notation, "auto" or "none") [default=auto]:
What IPv6 address should be used? (CIDR subnet notation, "auto" or "none") [default=auto]:
Would you like the LXD server to be available over the network? (yes/no) [default=no]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]:
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]:

Confirm you now have a ZFS pool

unpriv@host:~$ zpool list
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storage  99.5G   590K  99.5G        -         -     0%     0%  1.00x    ONLINE  -
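
For a closer look at the pool LXD just created, zpool status shows the vdev layout and health (output omitted here):

unpriv@host:~$ zpool status storage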

Create the ZFS datasets

Switch back to your normal privileged user:

unpriv@host:~$ exit
user@host:~$ sudo zfs create storage/customers
user@host:~$ sudo zfs create -o mountpoint=/unpriv storage/customers/unpriv
user@host:~$ sudo zfs create -p storage/customers/unpriv/child/gchild
user@host:~$ sudo chown -R 1000000:1000000 /unpriv
user@host:~$ sudo zfs set zoned=on storage/customers/unpriv
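
A quick sanity check from the host confirms the dataset layout and the zoned property before the container gets involved (output omitted here; the hierarchy should match what the container sees later):

user@host:~$ zfs list -r storage/customers
user@host:~$ zfs get -r zoned,mountpoint storage/customers/unpriv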

Start your first container

user@host:~$ sudo su - unpriv
unpriv@host:~$ lxc launch ubuntu:24.04 unpriv
Creating unpriv
Starting unpriv
unpriv@host:~$ lxc stop unpriv
unpriv@host:~$ lxc config set unpriv raw.lxc="lxc.apparmor.profile=unchanged"
unpriv@host:~$ lxc config set unpriv security.nesting=true
unpriv@host:~$ lxc config device add unpriv zfs unix-char path=/dev/zfs mode=0666
unpriv@host:~$ lxc start unpriv
unpriv@host:~$ lxc exec unpriv -- bash

Install ZFS inside the container

root@unpriv:~# apt install zfsutils-linux
root@unpriv:~# zfs version
zfs-2.2.2-0ubuntu9.1
zfs-kmod-2.2.2-0ubuntu9.1

root@unpriv:~# zfs list
no datasets available

The container does not yet have access to any datasets, so it sees nothing. Return to the host system:

root@unpriv:~# exit
exit
unpriv@host:~$ exit
exit
user@host:~$

Delegate a dataset to the container

user@host:~$ lxc info unpriv
Name: unpriv
Status: RUNNING
Type: container
Architecture: x86_64
PID: 5215
…
user@host:~$ PID=$( lxc info unpriv | awk '/PID:/ {print $NF}' )
user@host:~$ sudo zfs zone /proc/${PID}/ns/user storage/customers/unpriv
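
Note that the delegation is attached to that specific user namespace, so it typically needs to be repeated if the container is recreated or restarted. It can also be reversed with the matching unzone subcommand (shown here for reference only, not as part of this walkthrough):

user@host:~$ sudo zfs unzone /proc/${PID}/ns/user storage/customers/unpriv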

Return to the container

user@host:~$ sudo su - unpriv
unpriv@host:~$ lxc exec unpriv -- bash
root@unpriv:~# zfs list
NAME                                    USED  AVAIL  REFER  MOUNTPOINT
storage                                 502M  95.9G    24K  legacy
storage/customers                        96K  95.9G    24K  legacy
storage/customers/unpriv                 72K  95.9G    24K  /unpriv
storage/customers/unpriv/child           48K  95.9G    24K  /unpriv/child
storage/customers/unpriv/child/gchild    24K  95.9G    24K  /unpriv/child/gchild

root@unpriv:~# zfs create -o mountpoint=/testing \
	storage/customers/unpriv/testing
root@unpriv:~# zfs mount -a
root@unpriv:~# zfs list

NAME                                    USED  AVAIL  REFER  MOUNTPOINT
storage                                 502M  95.9G    24K  legacy
storage/customers                        20K  95.9G    24K  legacy
storage/customers/unpriv                 96K  95.9G    24K  /unpriv
storage/customers/unpriv/child           48K  95.9G    24K  /unpriv/child
storage/customers/unpriv/child/gchild    24K  95.9G    24K  /unpriv/child/gchild
storage/customers/unpriv/testing         24K  95.8G    24K  /testing

root@unpriv:~# ls -ld /testing
drwxr-xr-x 2 root root 2 Jan 15 02:34 /testing

As you can see, even the root user inside the container cannot see the datasets outside of “storage/customers/unpriv”, except for the immediate parent datasets. This ensures that different customers, applications, and workloads are entirely unaware of each other.
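
Because the whole subtree is delegated, the tenant can also manage it without any help from the host administrator. As an illustrative sketch (not part of the captured session above), the container’s root user could take snapshots and tune properties on its own datasets:

root@unpriv:~# zfs snapshot storage/customers/unpriv/testing@before-upgrade
root@unpriv:~# zfs set compression=lz4 storage/customers/unpriv/testing
root@unpriv:~# zfs set quota=10G storage/customers/unpriv/child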

Conclusion: Leveraging Linux Namespaces

Leveraging the dataset delegation features of ZFS, combined with the native container features of Linux or FreeBSD, improves multi-tenancy. It allows a single storage server to host multiple tenants or applications without each being aware of the others. This ensures that applications cannot conflict, enhances customer data privacy, and prevents the list of customers or workloads from being casually exposed.

There is ongoing work in upstream OpenZFS to extend its feature set for the multi-tenant environment, including the forthcoming hierarchical rate limiting feature. This feature allows the administrator to apply IOPS and read/write rate limits to each dataset and its children. It ensures that one container or workload does not become a disruptive “noisy neighbour” while also allowing hosting providers to charge for additional provisioned performance.

If you are considering deploying OpenZFS in a multi-tenant or multi-application environment, consider a design consultation with Klara’s OpenZFS team.
