Getting Started with OpenZFS 2.0
FreeBSD 13.0 imported OpenZFS 2.0, replacing the bespoke port that had served since 2007. This article introduces new users to ZFS and covers some of the new features in the upgrade.
The FreeBSD installer can set up ZFS as the root file system, producing a bootable FreeBSD system on ZFS. Selecting the guided root-on-ZFS install permits menu-driven selection of the disks to include in a pool. This is an easy way to explore ZFS features without an extensive hardware investment.
The hierarchy of disks, vdevs, pools, datasets:
Now that you have some disks and are ready to try ZFS, you should understand the hierarchy of the ZFS storage architecture. Disks are combined into vdevs, or virtual devices, which are the leaves of the storage hierarchy. A vdev is responsible for providing data protection, allowing the pool to suffer individual disk losses without losing data.
- Stripe: a vdev built from a single drive, called a stripe, offers no redundancy; the loss of that disk results in the loss of the entire pool.
- Mirror: mirror vdevs provide the gold standard of redundancy and high speed by duplicating data across pairs of disks. They resilver faster than RaidZ and are more flexible, including allowing the pool to shrink.
- RaidZ: uses distributed parity to allow a fixed number of disks, up to three, to fail before data is lost; that number is appended to the name, giving RaidZ1, RaidZ2, and RaidZ3. RaidZ vdevs are fixed once created: disks cannot be added or removed. There is ongoing work to address this limitation, but it is not complete yet.
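The vdev types above map directly to 'zpool create' syntax. A minimal sketch, assuming hypothetical FreeBSD device names da0 through da5 and pool names of your choosing:

```shell
# Single-disk stripe vdev -- no redundancy:
zpool create scratch da0

# Mirror vdev -- survives the loss of one of the two disks:
zpool create tank mirror da0 da1

# RaidZ2 vdev -- survives the loss of any two of six disks:
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# Inspect the resulting layout:
zpool status tank
```

These commands require root privileges and real (or file-backed) devices; run 'zpool destroy' on a test pool before trying a different layout.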
RaidZ is related to raid5 parity configurations, but the underlying implementation differs, ensuring it does not suffer from the partial-write problem that can corrupt data. It is tempting to put all the disks into a single large RaidZ to produce the largest vdev possible for the number of disks you have, but resist the urge: replacing a disk in a wide RaidZ vdev on a full system results in very long resilver times.
The durability of your data will be limited by the amount of parity or redundancy you can dedicate to your vdevs; the more parity, the better. Other exotic vdev types exist to serve performance or reliability needs.
Pools are built from one or more vdevs:
The pool allocator distributes data across the vdevs, favoring vdevs that are empty. Pools with many vdevs perform better by operating them in parallel. Unlike other modular raid systems, you may not nest vdevs in the way that a raid 0+1 might; the top-level vdevs must each have the redundancy to mitigate disk failure. It is possible to expand a pool at any time by adding more vdevs. They should be added in the quantity the vdev type requires: 2 disks for a mirror, or perhaps 5 for a RaidZ2. While it is possible to remove a vdev and relocate its data, it is not ideal and carries a long-term memory cost.
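Growing and shrinking a pool at the vdev level looks like this; a sketch assuming a hypothetical pool named tank built from mirror vdevs on devices da0 through da3:

```shell
# Expand the pool with a second mirror vdev;
# the allocator favors the new, empty vdev for writes:
zpool add tank mirror da2 da3

# Top-level vdev removal evacuates the data, but leaves a
# permanent in-memory remapping table -- the long-term memory cost:
zpool remove tank mirror-1
```

'zpool status' names the top-level vdevs (for example mirror-0, mirror-1), which is how you identify the vdev to remove.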
The pool has some administrative properties that control its portability and overall behavior. Pools are portable: you might export a pool on an external storage array and physically connect it to another host. As long as the destination is running the same or a later version of ZFS, even on a different OS, it can be imported and used. Unlike other filesystems, this portability even extends across hosts with different native byte orders (endianness).
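Moving a pool between hosts is a two-command operation. A sketch, again assuming a hypothetical pool named tank:

```shell
# On the source host, cleanly detach the pool:
zpool export tank

# After moving the disks, on the destination host
# (same or later OpenZFS, any OS or endianness),
# list importable pools, then import by name:
zpool import
zpool import tank
```

If a pool was not exported cleanly (for example after a host failure), 'zpool import -f' can force the import.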
Datasets are the abstraction for managing user data, analogous to a volume:
Datasets are the user-facing abstraction for consuming the space that the pool provides, and for managing the configuration properties of the filesystem. A dataset can be a directory with files and sub-directories suitable for mounting into the host’s file system, or it can be a zvol: block storage that you might hand over to a virtualization system or export as an iSCSI LUN. Dataset properties control a myriad of attributes of each filesystem. Some essential, familiar properties are: mount point, compression, encryption, and quota. They are accessible with the ‘zfs get’ and ‘zfs set’ commands. Properties are inherited by child datasets by default, so policies can be set at the root and overridden as needed. An investment in planning data layout will ease future maintenance. Use smaller separate datasets to facilitate easy replication and management.
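The property inheritance described above works like this in practice; a sketch using a hypothetical tank/home layout:

```shell
# A possible layout: per-purpose child datasets under one root:
zfs create tank/home
zfs create tank/home/alice

# Set policy at the root; children inherit it by default:
zfs set compression=zstd tank/home

# Override where needed, e.g. a quota on one user's dataset:
zfs set quota=50G tank/home/alice

# Inspect a property; the SOURCE column shows whether the
# value is local, inherited, or a default:
zfs get -o name,property,value,source compression tank/home/alice
```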
Boot environments are an interesting use of datasets. Configuring boot environments allows the user to select from multiple root filesystems at boot time. Try an operating system upgrade experiment, and if it turns out badly, revert to a previous boot environment.
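On FreeBSD, boot environments are managed with bectl(8). A sketch of the upgrade-and-revert workflow, with a hypothetical environment name:

```shell
# Capture the current root into a new boot environment
# before attempting the upgrade:
bectl create pre-upgrade

# List environments; the Active column shows which is
# running now (N) and which will boot next (R):
bectl list

# If the upgrade goes badly, activate the old environment
# and reboot into it:
bectl activate pre-upgrade
```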
A snapshot is an immutable and consistent version of the dataset:
ZFS uses a copy-on-write discipline for every transaction committed to the file system. Blocks aren’t actually overwritten; rather, a new block is written and the file system updates its references to point at the most current blocks. The old version of the block then becomes free space. Snapshots bring an element of time travel to file systems. Capturing a single point in time, a snapshot ensures that all of the block versions it references are not freed, even if they are overwritten or deleted in the live filesystem.
It is possible to go back to any version of a file that a snapshot references. ZFS snapshots are inexpensive to create and store; systems can keep thousands of snapshots without ill effect. ZFS snapshots have no impact on read and write performance; the only cost that grows as you create more snapshots is the amount of work required to list and manage them. A number of scripts automate snapshot creation and management, providing easy access and defining a retention policy through regular creation and deletion of snapshots.
While the snapshot itself takes up minimal disk space, the data that it references is protected on disk and can’t be freed to write more data. Destroying the snapshots that reference that data, typically starting with the oldest, is the only way to release the space back to the pool. Imagine crypto-locker ransomware scrambling an entire file system. With snapshots it’s possible to go back to a previous point in time within the retention window and recover the data. Rolling back to a snapshot takes mere seconds, undoing serious damage very quickly.
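The snapshot lifecycle described above is a handful of commands; a sketch using a hypothetical tank/home dataset and snapshot names:

```shell
# Take a snapshot; creation is nearly instant:
zfs snapshot tank/home@before-cleanup

# List snapshots and the space each holds exclusively:
zfs list -t snapshot tank/home

# Recover a single file via the hidden .zfs directory,
# without disturbing the live dataset:
cp /tank/home/.zfs/snapshot/before-cleanup/report.txt /tank/home/

# Or roll the whole dataset back, discarding every change
# made since the snapshot:
zfs rollback tank/home@before-cleanup

# Release the referenced space by destroying the snapshot:
zfs destroy tank/home@before-cleanup
```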
Snapshots are completely immutable: once created, it’s not possible to selectively delete or redact their contents; one would have to destroy every snapshot containing the unwanted data to eliminate it. Consider that before asking for a recursive snapshot of the entire pool. You might need to actually remove files for compliance reasons, and having to burn down all the snapshots on the system is an inconvenient solution. This is where breaking your data into separate datasets comes in handy. If each customer has their own dataset, removing their data only requires destroying their snapshots, not all snapshots.
The data durability of ZFS is very good, but raid is not a backup. Regularly sending data offsite is required as mitigation against a catastrophic event at a storage array site. ZFS can send datasets efficiently, much faster than a copy or rsync. The ‘zfs send’ command pushes whole or incremental datasets over UNIX pipes, and ‘zfs receive’ on the remote system reconstructs the dataset with a full guarantee of consistency. Several software packages are available to orchestrate the replication in a user-friendly way. Setting up a remote ZFS pool as a replication target offers an economical alternative to traditional offline tape backups.
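A minimal replication sketch over ssh, assuming hypothetical dataset, pool, and host names (tank/home, backup, backuphost):

```shell
# Initial full send of a snapshot to the remote pool:
zfs snapshot tank/home@monday
zfs send tank/home@monday | ssh backuphost zfs receive backup/home

# Later, send only the blocks that changed since the
# last common snapshot (-i = incremental):
zfs snapshot tank/home@tuesday
zfs send -i @monday tank/home@tuesday | ssh backuphost zfs receive backup/home
```

Both sides must keep the most recent common snapshot for the incremental stream to apply; replication tools automate exactly this bookkeeping.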
Be suspicious of hardware raid controllers:
ZFS data durability relies on direct access to disks without an intermediate controller ‘helping’. Raid controller failure is an alarmingly common failure mode for storage systems. When a raid controller fails, it often leaves all the drives scrambled, and replacing the controller will not recover the data. While it is possible to pass each drive through to ZFS pools in ‘JBOD’ – Just a Bunch of Disks – mode, the controller is still a single point of failure. ZFS is built to manage hardware failure, but the failure domains must be small enough to allow continued operation without any particular part. It is possible to engineer systems that will survive the loss of a controller, a NIC, a whole JBOD, interconnect cables, or even a whole server.
Error-correcting (ECC) RAM is nice to have for systems that demand high reliability; however, it is not absolutely critical for data protection. An ECC error that triggers a machine check report would likely have caused a crash on non-ECC hardware. While annoying, ZFS will recover after a hardware-initiated crash.
OpenZFS 2.0 changes
The OpenZFS 2.0 import is a logical combination of the careful work of the OpenZFS team and the demand for new features in the FreeBSD user base. The new features available include Sequential Resilver, ZStandard compression, Persistent L2ARC, and Native Encryption.
Before this release, FreeBSD users relied on geli(8) to encrypt the underlying disks, encrypting the entire pool. The old technique used a thread pool model that scaled poorly and under-performed when used with many disks. The new native encryption performs much better and can be managed per dataset. This allows users to encrypt sensitive data selectively, adding flexibility when planning storage allocation.
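Per-dataset encryption is enabled at dataset creation time; a sketch with a hypothetical tank/secrets dataset:

```shell
# Create an encrypted dataset protected by a passphrase:
zfs create -o encryption=on -o keyformat=passphrase tank/secrets

# After a reboot, the key must be loaded before mounting:
zfs load-key tank/secrets
zfs mount tank/secrets

# Confirm that only this dataset is encrypted:
zfs get encryption tank/secrets
```

Other datasets in the pool remain unencrypted, which is exactly the selective flexibility described above.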
The L2ARC allows a fast device to cache the most frequently read items from a pool, extending the in-memory Adaptive Replacement Cache (ARC). Previously, the L2ARC was discarded across reboots. This caused a cold cache at boot time and reduced performance until the ARC ‘relearned’ the most frequently read data. Now systems boot with the cache intact, speeding the return to full-speed service.
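Attaching an L2ARC device is a single command; a sketch assuming a hypothetical NVMe device nvd0 and pool tank:

```shell
# Add a fast device as a cache (L2ARC) vdev:
zpool add tank cache nvd0

# Inspect per-vdev activity, including the cache device:
zpool iostat -v tank
```

With OpenZFS 2.0, the contents of this cache device persist across reboots with no further configuration.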
It is usually faster to compress data before committing it to disk. This finding is a bit astonishing, but it makes sense to spend cycles on fast CPUs rather than wait on comparatively slow disks. If ZFS attempts to compress something that is genuinely incompressible, such as audio or random data, the compression aborts after an initial attempt and the block is stored uncompressed. ZFS learned this lesson early and supported other compression algorithms with great success; compression is now enabled by default. The ZStandard compression algorithm is a modern addition in 2.0 that provides better performance and smaller results.
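Switching a dataset to ZStandard and checking the result looks like this; a sketch with a hypothetical tank/home dataset:

```shell
# Enable ZStandard compression (inherited by child datasets);
# only newly written blocks are compressed:
zfs set compression=zstd tank/home

# Optionally pick an explicit level, trading CPU for ratio:
zfs set compression=zstd-9 tank/home

# Observe the achieved ratio on the data written so far:
zfs get compressratio tank/home
```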
Recovery after drive failure is an essential capability. Resilvering is the process where a new disk added to a vdev is filled by computing its contents from the other disks. This process got a massive improvement with the sequential resilver feature: disks can now be rebuilt in a fraction of the time of the previous technique. However, data consistency is not verified during the rebuild. A ‘healing’ pool scrub should be scheduled afterward to assert that the data is free from errors.
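The rebuild-then-verify sequence looks like this; a sketch assuming a hypothetical pool tank, failed disk da2, and replacement da6:

```shell
# Replace the failed disk; -s requests a sequential
# (rebuild) resilver rather than the traditional healing one:
zpool replace -s tank da2 da6

# Watch rebuild progress:
zpool status tank

# Sequential resilver skips checksum verification, so follow
# up with a scrub to confirm the data is intact:
zpool scrub tank
```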
ZFS has matured into a comprehensive data protection toolkit. Getting started with OpenZFS 2.0 in FreeBSD is now possible with a few menu selections in the installer, and the techniques learned on a small system apply equally to a multi-petabyte storage array.