This article is part of our “OpenZFS in Depth” series. Subscribe to the series to find out more about the secrets of OpenZFS.
If you are just joining us, be sure to check out our first post, which covers the history of the OpenZFS Developer Summit and the morning of the first day. After the lunch break and some more socializing, we resumed the packed schedule with a double-length talk about ZFS replication. Each breakout room was led by a prominent developer from one of the platforms supported by OpenZFS, and one room was reserved for the previous speaker to continue answering questions. The webinar format also allowed the audience to submit written questions to be answered at the end.
Send/Receive Performance Enhancements by Matt Ahrens
Watch the stream here.
Matt Ahrens has been working to improve the throughput of ZFS replication with small block sizes, which are very common for databases and virtual machine backing stores. While large blocks can easily saturate a 10-gigabit link, small blocks carry much higher per-block overhead and could only manage a third of that throughput on his test machine. The goal was to improve performance enough that ZFS could saturate a 5-gigabit cloud direct-connect tunnel. Over the course of his series of changes, he improved the speed of sending 4 KB records by 87% (to 6.7 gigabits/second on his test machine) and receiving them by 90% (to 5.9 gigabits/second), while also improving the speed of all other record sizes.
In addition to explaining the changes that were made, Matt also described his workflow and the tools he used to find bottlenecks and measure the differences as he made changes. He made extensive use of on-CPU and off-CPU flamegraphs. While examining the send process, he determined that a lot of CPU time was being wasted prefetching blocks into the ARC, then evicting older blocks to make room to prefetch more. Making ZFS replication bypass the ARC and access the blocks directly increased performance significantly.
The other main performance bottleneck was the repeated calls to pipe_write(): two for each 4 KB block. By creating a separate writer thread and batching the writes to the pipe into 1 MB blocks, the overhead of pipe_write(), the locking, and the sleep/wakeup cycle were each reduced to 1/512th of the original.
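The batching idea can be illustrated in a few lines. This is a minimal sketch, not the ZFS implementation: a dedicated writer thread drains small records from a queue and flushes them in roughly 1 MB writes, so the per-write overhead (syscall, locking, sleep/wakeup) is paid once per megabyte instead of once per 4 KB record. All names here are invented for the example.

```python
import io
import queue
import threading

BATCH_SIZE = 1 << 20  # 1 MB, matching the batch size described in the talk

def batching_writer(q: queue.Queue, out):
    """Drain records from the queue and flush them in ~1 MB writes."""
    buf = bytearray()
    while True:
        rec = q.get()
        if rec is None:  # sentinel: flush whatever is left and stop
            if buf:
                out.write(bytes(buf))
            break
        buf += rec
        if len(buf) >= BATCH_SIZE:
            out.write(bytes(buf))
            buf.clear()

# Usage: a producer pushes 4 KB records; one thread issues the large writes.
out = io.BytesIO()
q = queue.Queue()
t = threading.Thread(target=batching_writer, args=(q, out))
t.start()
for _ in range(600):
    q.put(b"\x00" * 4096)
q.put(None)
t.join()
```

The same structure works in reverse on the receive side, with a reader thread pulling large chunks from the pipe.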
Of course, all of this is of limited usefulness if the receive side is still the bottleneck, limited to only about half the speed of the improved send code. Much like the send case, the receive code was spending a lot of time dealing with the ARC, so it was also made to bypass the ARC and to create lighter-weight write transactions.
It was also observed that many of the 4 KB transactions being created were for contiguous blocks in the same file; batching these together into single transactions of up to 1 MB dropped the CPU usage for transactions from 26% to 1%. Similar to the solution for send, creating a separate reader thread to pull from the pipe in 1 MB chunks reduced the CPU time spent on the pipe. However, this solution has a downside: since a 1 MB pipe_read() removes the data from the pipe buffer, it breaks replication streams that contain multiple snapshots (-R or -I streams). The solution for this will be a backwards-incompatible change to the send stream format.
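The coalescing step can be sketched as follows. This is an illustrative example, not the ZFS code: runs of contiguous fixed-size block writes are merged into fewer, larger transactions, capped at 1 MB as described in the talk.

```python
BLOCK = 4096
MAX_TXN = 1 << 20  # 1 MB transaction cap, as described in the talk

def coalesce(offsets):
    """Group sorted block offsets into (start, length) runs of <= MAX_TXN."""
    runs = []
    for off in offsets:
        if runs:
            start, length = runs[-1]
            # Extend the previous run only if this block is contiguous
            # with it and the run stays under the 1 MB cap.
            if off == start + length and length + BLOCK <= MAX_TXN:
                runs[-1] = (start, length + BLOCK)
                continue
        runs.append((off, BLOCK))
    return runs

# 512 contiguous 4 KB blocks collapse into just two 1 MB transactions.
offsets = [i * BLOCK for i in range(512)]
print(coalesce(offsets))  # → [(0, 1048576), (1048576, 1048576)]
```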
Improving “zfs diff” Performance With Reverse-name Lookup by S.Bagewadi and D.Chen
Watch the stream here.
Sanjeev Bagewadi and David Chen presented their work on improving the speed of the zfs diff command. This command allows the user to view a summary of what has changed between two snapshots, or between a snapshot and the live filesystem. It is a quick way to determine which files have changed, been added, removed, or renamed.
The source of the problem is finding the filenames of the changed objects. ZFS can very easily generate a list of which objects have changed between two points in time; this is the key to its fast incremental replication. However, finding the filename of each object can be a slow, iterative process. Each object knows the object ID of its parent, which is usually a directory, and the directory contains a key-value table (a ZAP) mapping the name of each child to the child's object ID. So to find the full path of an object, we must find its parent, do a linear search of the ZAP to find the filename, then find the grandparent object and do a linear search to find the name of the parent, and so on until we reach the root of the filesystem.
To solve this, they introduced a new system attribute, SA_ZPL_LINKNAME_HASH, which stores the hash of the ZAP entry in the parent directory that contains this object's name. This eliminates the linear search of the entire ZAP for the entry matching the object ID, allowing ZFS to jump directly to the correct offset and read the name.
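The slow path being optimized can be sketched with toy data structures (these are illustrative, not the real ZFS on-disk formats): each directory "ZAP" maps child name to object ID, and an object records only its parent's ID, so reconstructing a path means linearly scanning each ancestor's ZAP for our own object ID.

```python
# Hypothetical filesystem: object 1 is the root directory.
zaps = {
    1: {"home": 2},          # root's ZAP
    2: {"alice": 3},         # /home
    3: {"notes.txt": 4},     # /home/alice
}
parent = {4: 3, 3: 2, 2: 1}  # each object's parent object ID

def path_linear(obj):
    """Walk toward the root, linearly scanning each parent ZAP for obj."""
    parts = []
    while obj in parent:
        p = parent[obj]
        # The linear search SA_ZPL_LINKNAME_HASH avoids: scan every
        # entry in the parent's ZAP looking for our own object ID.
        name = next(n for n, child in zaps[p].items() if child == obj)
        parts.append(name)
        obj = p
    return "/" + "/".join(reversed(parts))

print(path_linear(4))  # → /home/alice/notes.txt
```

With SA_ZPL_LINKNAME_HASH, the object carries the hash of its own entry in the parent's ZAP, so each step reads the name directly at that offset instead of scanning every entry.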
Another interesting aspect of their work was making zfs diff able to track hardlinks. Normally, each time a new hardlink is added, the parent ID is updated to the most recent parent directory; if that hardlink is then removed, the parent ID points to a directory that may no longer contain an object with that ID. To solve this, a new system attribute was introduced, SA_ZPL_LINKZAP. It points to a new object that is a ZAP of all of the parents of the hardlinked file. Thus, as links are added and removed, it is always possible to find the remaining valid parents.
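The bookkeeping can be sketched like this. This is an illustrative model, not the on-disk format: a small per-object "link ZAP" tracks every parent directory holding a link to the object, so removing one link never strands the remaining valid parents.

```python
class LinkZap:
    """Toy model of a per-object ZAP tracking hardlink parent directories."""

    def __init__(self):
        self.parents = {}  # parent object ID -> number of links in that dir

    def link_added(self, parent_id):
        self.parents[parent_id] = self.parents.get(parent_id, 0) + 1

    def link_removed(self, parent_id):
        self.parents[parent_id] -= 1
        if self.parents[parent_id] == 0:
            del self.parents[parent_id]  # no links left in that directory

    def valid_parents(self):
        return sorted(self.parents)

# An object hardlinked into directories 10 and 20, then unlinked from 10:
zap = LinkZap()
zap.link_added(10)
zap.link_added(20)
zap.link_removed(10)
print(zap.valid_parents())  # → [20]
```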
Performance Troubleshooting Tools by Gaurav Kumar
Watch the stream here.
The last talk of the day was by Gaurav Kumar, about how to determine what is happening inside ZFS using both the tools built into ZFS and other tools that may be available on your OS. The built-in tools are usually the best place to start, and this talk covers many of them:

- the ZFS debug message ring buffer
- zpool history, which records the commands that have been run
- zpool events, which keeps track of error reports and similar events
- the "slow I/O" log, similar to what many databases offer
- monitoring the write throttle counters and understanding the results
- the transaction history log
- reading and understanding ARC stats
- an extensive section on using the various modes of zpool iostat
Gaurav’s talk also includes a case study: tracking down why sequential writes over NFS were not performing as expected. It turned out that lock contention was causing the I/Os to be queued one at a time, leaving no opportunity for aggregation. Applying a patch from the upcoming OpenZFS 2.0 produced a significant improvement, from 600 MB/sec to 1100 MB/sec. The talk also contained some good information on collecting metrics with Prometheus and building Grafana dashboards.
Check back for more!
Missed the first part of the conference? No problem, we’ve got you covered: click here for the overview of the morning session.