xNVMe ioengine Part 1
This blog post is co-authored by Ankit Kumar and Vincent Fu.
Recently the xNVMe ioengine was merged into fio's codebase. What exactly is xNVMe? This blog post briefly describes xNVMe, explains how to build it, and introduces the main features of fio's xNVMe ioengine. It is intended for developers interested in using fio to test xNVMe and for fio users who may wish to take advantage of the easy access to different backends that xNVMe provides.
The fundamental aim of xNVMe is to provide a cross-platform abstraction that enables applications to interact with NVMe devices via a wide variety of interfaces. Instead of having to customize code to use libaio, io_uring, SPDK, or a synchronous interface, developers can develop code for the xNVMe API and then seamlessly switch among xNVMe's backends.
Fio's new built-in xNVMe ioengine provides a means for developers to measure xNVMe performance. Additionally, the xNVMe ioengine provides convenient access to user-space drivers such as SPDK. In the future, the xNVMe ioengine may also provide access to cutting-edge NVMe features and capabilities.
xNVMe's website has an excellent Getting Started section, but let us walk through building xNVMe for Ubuntu 22.04. For other platforms see the detailed instructions available for Alpine Linux, Arch, CentOS, Debian, Fedora, Gentoo, and openSUSE. The instructions here are based on xNVMe 0.3.0. xNVMe is also available for Windows, but the fio ioengine is currently available only for Linux and FreeBSD.
The first steps are to (1) clone the xNVMe repository, (2) enter the repository directory, and (3) run the package install script:
root@ubuntu:~# git clone https://github.com/openmpdk/xnvme.git
Cloning into 'xnvme'...
remote: Enumerating objects: 8233, done.
remote: Counting objects: 100% (859/859), done.
remote: Compressing objects: 100% (182/182), done.
remote: Total 8233 (delta 771), reused 689 (delta 674), pack-reused 7374
Receiving objects: 100% (8233/8233), 3.31 MiB | 6.97 MiB/s, done.
Resolving deltas: 100% (6172/6172), done.
root@ubuntu:~# cd xnvme/
root@ubuntu:~/xnvme# ./toolbox/pkgs/ubuntu-focal.sh # packages for 20.04 also work for 22.04
...
After all the required packages have been installed, it is time to build xNVMe. With meson, the steps are to (1) set up the build directory, (2) build xNVMe, and (3) install xNVMe. The script at toolbox/pkgs/default-build.sh within the xNVMe repository automates these steps, but they are issued one at a time below.
root@ubuntu:~/xnvme# meson setup builddir # this can take some time because this step builds SPDK
...
root@ubuntu:~/xnvme# cd builddir
root@ubuntu:~/xnvme/builddir# meson compile
[170/170] Linking target examples/zoned_io_async
root@ubuntu:~/xnvme/builddir# meson install
...
By default xNVMe is built with SPDK support. If this is not needed, then run the setup step as `meson setup builddir -Dwith-spdk=false`.
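For example, an SPDK-free build would follow the same sequence as above, with only the setup command changed:
root@ubuntu:~/xnvme# meson setup builddir -Dwith-spdk=false
root@ubuntu:~/xnvme# cd builddir
root@ubuntu:~/xnvme/builddir# meson compile
root@ubuntu:~/xnvme/builddir# meson install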
With xNVMe installed we can now build fio in the usual way. However, fio's configure script relies on pkg-config to detect xNVMe support, so pkg-config must be installed.
root@ubuntu:~# apt install pkg-config
Reading package lists... Done
...
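Before building fio, it is worth checking that the xNVMe installation is visible to pkg-config. Assuming the install placed its xnvme.pc file on pkg-config's default search path, the following should report the installed version (0.3.0 in our case):
root@ubuntu:~# pkg-config --modversion xnvme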
root@ubuntu:~# git clone https://github.com/axboe/fio
Cloning into 'fio'...
remote: Enumerating objects: 33910, done.
remote: Counting objects: 100% (99/99), done.
remote: Compressing objects: 100% (62/62), done.
remote: Total 33910 (delta 46), reused 80 (delta 36), pack-reused 33811
Receiving objects: 100% (33910/33910), 27.40 MiB | 30.43 MiB/s, done.
Resolving deltas: 100% (22693/22693), done.
root@ubuntu:~# cd fio
root@ubuntu:~/fio# make -j $(nproc)
Running configure ...
FIO_VERSION = fio-3.30-63-g66087
Operating system Linux
CPU x86_64
Big endian no
Compiler gcc
Cross compile no
...
xnvme engine yes
...
CC oslib/strsep.o
LINK fio
LINK t/fio-genzipf
LINK t/fio-dedupe
LINK t/fio-verify-state
LINK t/stest
LINK t/ieee754
LINK t/axmap
LINK t/lfsr-test
LINK t/gen-rand
LINK t/memlock
LINK t/read-to-pipe-async
LINK unittests/unittest
LINK t/fio-btrace2fio
LINK t/io_uring
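To confirm that the xNVMe ioengine was compiled into the freshly built binary, fio's --enghelp option is handy: without an argument it lists the available ioengines, and given an engine name it lists that engine's options.
root@ubuntu:~/fio# ./fio --enghelp=xnvme
...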
The key feature of xNVMe is the ability to seamlessly change backends. xNVMe supports asynchronous backends including io_uring, libaio, posixaio, a threadpool, and emulation. As for synchronous backends, the xNVMe ioengine supports the Linux NVMe driver IOCTL as well as psync and a superset of psync which includes zone management commands supported by the Linux block layer.
To use the xNVMe ioengine, specify `ioengine=xnvme` and then select either an asynchronous backend with `xnvme_async={io_uring, libaio, ...}` or a synchronous backend with `xnvme_sync={nvme, psync, block}`. The available options are listed in the table below, followed by a short job-file sketch:
| Type | Backend | Description |
|---|---|---|
| Asynchronous | io_uring | Linux native asynchronous I/O interface which supports both direct and buffered I/O and can submit and complete I/O without a system call. xNVMe uses the io_uring library liburing, which provides a simplified interface without dealing with the full kernel-side implementation. This interface can work only with NVMe block devices (`/dev/nvmeXnY`). io_uring support is available in kernel 5.1 onwards. |
| | io_uring_cmd | Linux native asynchronous I/O interface for io_uring pass-through commands which can submit and complete I/O without a system call. Just like the io_uring backend, xNVMe uses liburing for pass-through commands as well. This interface can work only with NVMe character devices (`/dev/ngXnY`). io_uring pass-through support will be available in kernel 5.19 onwards and will be supported upon the release of xNVMe 0.4.0. |
| | libaio | Linux native asynchronous I/O. libaio enables even a single fio thread to overlap I/O operations by providing an interface for submitting one or more I/O requests in one system call without waiting for completion, and a separate interface to reap completed I/O operations associated with a given completion group. |
| | emu | Emulate asynchronous I/O by using a single thread to create a queue pair on top of a synchronous I/O interface using the NVMe driver IOCTL (default). |
| | posix | Use the POSIX asynchronous I/O interface to perform one or more I/O operations asynchronously. |
| | thrpool | Emulate an asynchronous I/O interface with a pool of userspace threads on top of a synchronous I/O interface using the NVMe driver IOCTL. By default four threads are used. |
| | nil | Do not transfer any data; just pretend to. This is mainly used for introspective performance evaluation. |
| Synchronous | nvme | Use the Linux NVMe driver IOCTL for synchronous I/O (default). |
| | psync | This supports regular as well as vectored pread() and pwrite() commands. |
| | block | This is the same as psync except that it also supports zone management commands using Linux block layer IOCTLs. |
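As a minimal sketch (the device path and workload parameters here are placeholders), switching from an asynchronous backend to the synchronous NVMe driver IOCTL backend is just a matter of replacing the xnvme_async line with an xnvme_sync line:
; -- sketch: synchronous NVMe driver IOCTL backend --
[sync-test]
ioengine=xnvme
thread=1
direct=1
filename=/dev/nvme0n1
rw=randread
bs=4k
xnvme_sync=nvme
; -- end sketch --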
The tables below list the key hardware and software components of our evaluation system. It comprises low-cost commodity hardware so that reconstructing the environment and reproducing the experiments is possible with little effort and expense. We use a low-latency SSD as the target device.
| Hardware | Model |
|---|---|
| CPU | AMD Ryzen 9 5900X 4.9GHz |
| Memory | 16GB DDR4 2667MHz |
| Motherboard | MEG X570 GODLIKE (MS-7C34) |

| Software | Version |
|---|---|
| Linux | Ubuntu 22.04 LTS |
| fio | master @ 66087910 |
| xNVMe | 0.3.0 |
| gcc | 11.2.0 |
Let us take a look at a basic example and see how the asynchronous and synchronous backends can be used. The job file below has a random read workload with a block size of 4K. It runs for a total duration of one minute and uses libaio as the asynchronous backend. The xNVMe ioengine requires fio to use threads instead of processes for spawning jobs, as this is essential to support SPDK. So `thread=1` must be specified when the xNVMe ioengine is used. Below is our example job file.
; -- start xnvme.fio job file --
[test]
ioengine=xnvme
direct=1
filename=/dev/nvme0n1
iodepth=1
thread=1
time_based
runtime=1m
numjobs=1
rw=randread
bs=4k
xnvme_async=libaio
; -- end xnvme.fio job file --
After fio is done with the above job, the output will look like this:
$ fio xnvme.fio
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.30-63-g66087
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=420MiB/s][r=108k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=128084: Sat Jun 25 02:13:55 2022
read: IOPS=107k, BW=418MiB/s (438MB/s)(24.5GiB/60001msec)
slat (nsec): min=1090, max=21280, avg=1194.87, stdev=131.73
clat (nsec): min=280, max=285515, avg=7915.55, stdev=692.37
lat (nsec): min=7840, max=286665, avg=9110.41, stdev=738.60
clat percentiles (nsec):
| 1.00th=[ 7712], 5.00th=[ 7712], 10.00th=[ 7712], 20.00th=[ 7776],
| 30.00th=[ 7776], 40.00th=[ 7776], 50.00th=[ 7776], 60.00th=[ 7840],
| 70.00th=[ 7840], 80.00th=[ 7840], 90.00th=[ 8032], 95.00th=[ 8384],
| 99.00th=[10048], 99.50th=[12736], 99.90th=[15936], 99.95th=[16320],
| 99.99th=[24192]
bw ( KiB/s): min=417040, max=433432, per=100.00%, avg=427948.10, stdev=3630.93, samples=119
iops : min=104260, max=108358, avg=106987.04, stdev=907.75, samples=119
lat (nsec) : 500=0.01%, 750=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=98.97%, 20=1.01%, 50=0.02%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%
cpu : usr=10.42%, sys=22.84%, ctx=6417204, majf=0, minf=2
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6417056,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=418MiB/s (438MB/s), 418MiB/s-418MiB/s (438MB/s-438MB/s), io=24.5GiB (26.3GB), run=60001-60001msec
Switching to io_uring as the asynchronous backend, we just need to set `xnvme_async=io_uring`. Below is the resulting output.
$ fio xnvme.fio
fio-3.30-63-g66087
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=405MiB/s][r=104k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=128114: Sat Jun 25 02:15:13 2022
read: IOPS=104k, BW=406MiB/s (426MB/s)(23.8GiB/60000msec)
slat (nsec): min=1100, max=29619, avg=1259.00, stdev=243.90
clat (nsec): min=130, max=1179.7k, avg=8097.31, stdev=955.27
lat (usec): min=8, max=1180, avg= 9.36, stdev= 1.09
clat percentiles (nsec):
| 1.00th=[ 7712], 5.00th=[ 7776], 10.00th=[ 7776], 20.00th=[ 7840],
| 30.00th=[ 7840], 40.00th=[ 7840], 50.00th=[ 7840], 60.00th=[ 7904],
| 70.00th=[ 7968], 80.00th=[ 8096], 90.00th=[ 8896], 95.00th=[ 9408],
| 99.00th=[10688], 99.50th=[12736], 99.90th=[16192], 99.95th=[17536],
| 99.99th=[23168]
bw ( KiB/s): min=395440, max=429456, per=100.00%, avg=416342.72, stdev=7983.93, samples=119
iops : min=98860, max=107364, avg=104085.65, stdev=1996.01, samples=119
lat (nsec) : 250=0.01%, 500=0.01%
lat (usec) : 4=0.01%, 10=97.85%, 20=2.11%, 50=0.04%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%
lat (msec) : 2=0.01%
cpu : usr=8.74%, sys=24.66%, ctx=6242481, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6242547,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=406MiB/s (426MB/s), 406MiB/s-406MiB/s (426MB/s-426MB/s), io=23.8GiB (25.6GB), run=60000-60000msec
Switching to a synchronous backend such as psync can be done by replacing `xnvme_async=io_uring` with `xnvme_sync=psync`. Below is the resulting output.
$ fio xnvme.fio
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.30-63-g66087
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=445MiB/s][r=114k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=128145: Sat Jun 25 02:16:51 2022
read: IOPS=113k, BW=442MiB/s (464MB/s)(25.9GiB/60000msec)
slat (nsec): min=50, max=17090, avg=68.92, stdev=21.47
clat (nsec): min=8010, max=340414, avg=8551.46, stdev=705.66
lat (nsec): min=8080, max=340474, avg=8620.38, stdev=710.20
clat percentiles (nsec):
| 1.00th=[ 8256], 5.00th=[ 8384], 10.00th=[ 8384], 20.00th=[ 8384],
| 30.00th=[ 8384], 40.00th=[ 8384], 50.00th=[ 8384], 60.00th=[ 8384],
| 70.00th=[ 8512], 80.00th=[ 8512], 90.00th=[ 8512], 95.00th=[ 9152],
| 99.00th=[11200], 99.50th=[12992], 99.90th=[16768], 99.95th=[16768],
| 99.99th=[21376]
bw ( KiB/s): min=434336, max=457512, per=100.00%, avg=452843.03, stdev=4525.47, samples=119
iops : min=108584, max=114378, avg=113210.79, stdev=1131.34, samples=119
lat (usec) : 10=96.74%, 20=3.25%, 50=0.01%, 100=0.01%, 500=0.01%
cpu : usr=6.61%, sys=22.13%, ctx=6792164, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6792240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=442MiB/s (464MB/s), 442MiB/s-442MiB/s (464MB/s-464MB/s), io=25.9GiB (27.8GB), run=60000-60000msec
Testing these interfaces with fio is not a special trick given the availability of its assorted ioengines, but doing the same for an application would typically require significant effort. However, an application written for the xNVMe API could switch among these interfaces as easily as we used the xNVMe ioengine to specify different backends.
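The `xnvme` command-line tool installed alongside the library is one example of such an application: it is written against the xNVMe API, so simple tasks such as enumerating and inspecting devices look the same regardless of which backend ends up servicing the commands (the exact subcommands and flags may differ between xNVMe releases):
# xnvme enum
...
# xnvme info /dev/nvme0n1
...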
What is the overhead of xNVMe? This is the first question that comes to mind for any developer evaluating it. We carried out two series of experiments to illustrate the performance penalty introduced by xNVMe. The first establishes a baseline by comparing xNVMe's nil backend against fio's null ioengine. The second series of experiments compares xNVMe's io_uring backend against fio's io_uring ioengine.
The job file we used for the experiments is below.
[global]
direct=1
filename=/dev/nvme0n1
thread=1
time_based
runtime=1m
numjobs=1
rw=randread
bs=4k
iodepth=1
cpus_allowed=1
norandommap
[xnvme_nil]
ioengine=xnvme
xnvme_async=nil
size=500G
[null]
ioengine=null
size=500G
[xnvme_io_uring]
ioengine=xnvme
xnvme_async=io_uring
[io_uring]
ioengine=io_uring
These jobs use a block size of 4K and a queue depth of one. To help make the results more consistent, we pin the fio process to CPU 1 and set the cpufreq governor to performance using the command below:
# echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
We ran the fio jobs as follows:
# fio xnvme-compare.fio --section=null
...
# fio xnvme-compare.fio --section=xnvme_nil
...
# fio xnvme-compare.fio --section=io_uring
...
# fio xnvme-compare.fio --section=xnvme_io_uring
...
For details see the full output log. Below is a summary of our results for the four different configurations tested.
| | Baseline: xNVMe nil backend | Baseline: fio null ioengine | io_uring: xNVMe io_uring backend | io_uring: fio io_uring ioengine |
|---|---|---|---|---|
| Mean total latency (ns) | 82 | 46 | 8983 | 8877 |
| IOPS | 4919K | 6376K | 109K | 110K |
For our baseline, xNVMe's nil backend produces an average total latency of 82 ns. On the other hand, fio's built-in null ioengine reports an average total latency of 46 ns. We conclude that the baseline average overhead for xNVMe is 82 - 46 = 36 ns.
For the xNVMe ioengine using its io_uring backend, the average total latency is 8983 ns whereas directly using fio's io_uring ioengine we measure an average total latency of 8877 ns. With the io_uring backend xNVMe introduces an average latency overhead of 106 ns.
For the configurations we tested, the convenience of xNVMe comes with only a small penalty. For more details regarding the overhead of xNVMe see the recent SYSTOR paper by Simon Lund et al.
Note that we compare total latency values here instead of completion latency. Latency measurements for xNVMe differ slightly from those for fio's other asynchronous ioengines. The xNVMe ioengine's queue hook merely sets up the xNVMe command context data structure; there is no separate commit hook (as libaio and io_uring have) to tell the backend to begin submitting the I/Os. Because the xNVMe ioengine has no commit hook, fio begins timing completion latency as soon as xNVMe's queue hook returns. For libaio and io_uring, fio begins timing completion latency only once their commit hooks return and the commands are actually submitted. For xNVMe, the commands are not actually submitted until the ioengine's getevents hook is called in an attempt to reap completions, whereas when the getevents hook is called for io_uring and libaio the commands have already been submitted and fio is merely collecting completed commands. Thus, xNVMe is at a disadvantage if we compare completion latency values; the fair value to compare is total latency, which measures the time from origination to completion in the same way for all ioengines.
In Part 2 of this xNVMe ioengine series we will discuss other backends including SPDK and cover the xNVMe ioengine's specialized options for administrative commands, polling, and vectored I/O.
- xNVMe also includes a fio external ioengine which is essentially equivalent to fio's built-in xNVMe ioengine
- To build xNVMe with debug messages enabled, run `make config-debug && make && make install` from the root of the xNVMe repository