xNVMe ioengine Part 1
This blog post is co-authored by Ankit Kumar and Vincent Fu.
Recently the xNVMe ioengine was merged into fio's codebase. What exactly is xNVMe? This blog post briefly describes xNVMe, explains how to build it, and introduces the main features of fio's xNVMe ioengine. It is intended for developers interested in using fio to test xNVMe and for fio users who may wish to take advantage of the easy access to different backends that xNVMe provides.
The fundamental aim of xNVMe is to provide a cross-platform abstraction that enables applications to interact with NVMe devices via a wide variety of interfaces. Instead of having to customize code to use libaio, io_uring, SPDK, or a synchronous interface, developers can develop code for the xNVMe API and then seamlessly switch among xNVMe's backends.
Fio's new built-in xNVMe ioengine provides a means for developers to measure xNVMe performance. Additionally, the xNVMe ioengine provides convenient access to user-space drivers such as SPDK. In the future, the xNVMe ioengine may also provide access to cutting-edge NVMe features and capabilities.
xNVMe's website has an excellent Getting Started section, but let us walk through building xNVMe for Ubuntu 22.04. For other platforms see the detailed instructions available for Alpine Linux, Arch, CentOS, Debian, Fedora, Gentoo, and openSUSE. The instructions here are based on xNVMe 0.3.0. xNVMe is also available for Windows, but the fio ioengine is currently available only for Linux and FreeBSD.
The first steps are to (1) clone the xNVMe repository, (2) enter the repository directory, and (3) run the package install script:
root@ubuntu:~# git clone https://github.com/openmpdk/xnvme.git
Cloning into 'xnvme'...
remote: Enumerating objects: 8233, done.
remote: Counting objects: 100% (859/859), done.
remote: Compressing objects: 100% (182/182), done.
remote: Total 8233 (delta 771), reused 689 (delta 674), pack-reused 7374
Receiving objects: 100% (8233/8233), 3.31 MiB | 6.97 MiB/s, done.
Resolving deltas: 100% (6172/6172), done.
root@ubuntu:~# cd xnvme/
root@ubuntu:~/xnvme# ./toolbox/pkgs/ubuntu-focal.sh # packages for 20.04 also work for 22.04
...
After all the required packages have been installed, it is time to build xNVMe. With meson, the steps are to (1) set up the build directory, (2) build xNVMe, and (3) install xNVMe. The script at toolbox/pkgs/default-build.sh within the xNVMe repository automates these steps, but they are issued one at a time below.
root@ubuntu:~/xnvme# meson setup builddir # this can take some time because this step builds SPDK
...
root@ubuntu:~/xnvme# cd builddir
root@ubuntu:~/xnvme/builddir# meson compile
[170/170] Linking target examples/zoned_io_async
root@ubuntu:~/xnvme/builddir# meson install
...
By default xNVMe is built with SPDK support. If this is not needed, then run the setup step as `meson setup builddir -Dwith-spdk=false`.
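For example, an SPDK-free build would follow the same sequence as above, with only the setup command changed:
root@ubuntu:~/xnvme# meson setup builddir -Dwith-spdk=false
root@ubuntu:~/xnvme# cd builddir
root@ubuntu:~/xnvme/builddir# meson compile
root@ubuntu:~/xnvme/builddir# meson install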
With xNVMe installed we can now build fio in the usual way. However, fio's configure script relies on pkg-config to detect xNVMe support, so pkg-config must be installed.
root@ubuntu:~# apt install pkg-config
Reading package lists... Done
...
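Before building fio, it is worth checking that the xNVMe installation is visible to pkg-config. Assuming the install placed its xnvme.pc file on pkg-config's default search path, the following should report the installed version (0.3.0 in our case):
root@ubuntu:~# pkg-config --modversion xnvme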
root@ubuntu:~# git clone https://github.com/axboe/fio
Cloning into 'fio'...
remote: Enumerating objects: 33910, done.
remote: Counting objects: 100% (99/99), done.
remote: Compressing objects: 100% (62/62), done.
remote: Total 33910 (delta 46), reused 80 (delta 36), pack-reused 33811
Receiving objects: 100% (33910/33910), 27.40 MiB | 30.43 MiB/s, done.
Resolving deltas: 100% (22693/22693), done.
root@ubuntu:~# cd fio
root@ubuntu:~/fio# make -j $(nproc)
Running configure ...
FIO_VERSION = fio-3.30-63-g66087
Operating system Linux
CPU x86_64
Big endian no
Compiler gcc
Cross compile no
...
xnvme engine yes
...
CC oslib/strsep.o
LINK fio
LINK t/fio-genzipf
LINK t/fio-dedupe
LINK t/fio-verify-state
LINK t/stest
LINK t/ieee754
LINK t/axmap
LINK t/lfsr-test
LINK t/gen-rand
LINK t/memlock
LINK t/read-to-pipe-async
LINK unittests/unittest
LINK t/fio-btrace2fio
LINK t/io_uring
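To confirm that the xNVMe ioengine was compiled into the freshly built binary, fio's --enghelp option is handy: without an argument it lists the available ioengines, and given an engine name it lists that engine's options.
root@ubuntu:~/fio# ./fio --enghelp=xnvme
...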
The key feature of xNVMe is the ability to seamlessly change backends. xNVMe supports asynchronous backends including io_uring, libaio, posixaio, a threadpool, and emulation. As for synchronous backends, the xNVMe ioengine supports the Linux NVMe driver IOCTL as well as psync and a superset of psync which includes zone management commands supported by the Linux block layer.
To use the xNVMe ioengine, specify `ioengine=xnvme` and then select either an asynchronous backend with `xnvme_async={io_uring, libaio, ...}` or a synchronous backend with `xnvme_sync={nvme, psync, block}`. The available options are listed in the table below, followed by a short job-file sketch:
| Type | Backend | Description |
|---|---|---|
| Asynchronous | io_uring | Linux native asynchronous I/O interface which supports both direct and buffered I/O and can submit and complete I/O without a system call. xNVMe uses the io_uring library liburing, which provides a simplified interface without dealing with the full kernel-side implementation. This interface can work only with NVMe block devices (`/dev/nvmeXnY`). io_uring support is available in kernel 5.1 onwards. |
| | io_uring_cmd | Linux native asynchronous I/O interface for io_uring pass-through commands which can submit and complete I/O without a system call. Just like the io_uring backend, xNVMe uses liburing for pass-through commands as well. This interface can work only with NVMe character devices (`/dev/ngXnY`). io_uring pass-through support will be available in kernel 5.19 onwards and will be supported upon the release of xNVMe 0.4.0. |
| | libaio | Linux native asynchronous I/O. libaio enables even a single fio thread to overlap I/O operations by providing an interface for submitting one or more I/O requests in one system call without waiting for completion, and a separate interface to reap completed I/O operations associated with a given completion group. |
| | emu | Emulate asynchronous I/O by using a single thread to create a queue pair on top of a synchronous I/O interface using the NVMe driver IOCTL (default). |
| | posix | Use the POSIX asynchronous I/O interface to perform one or more I/O operations asynchronously. |
| | thrpool | Emulate an asynchronous I/O interface with a pool of userspace threads on top of a synchronous I/O interface using the NVMe driver IOCTL. By default four threads are used. |
| | nil | Do not transfer any data; just pretend to. This is mainly used for introspective performance evaluation. |
| Synchronous | nvme | Use the Linux NVMe driver IOCTL for synchronous I/O (default). |
| | psync | This supports regular as well as vectored pread() and pwrite() commands. |
| | block | This is the same as psync except that it also supports zone management commands using Linux block layer IOCTLs. |
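As a minimal sketch (the device path and workload parameters here are placeholders), switching from an asynchronous backend to the synchronous NVMe driver IOCTL backend is just a matter of replacing the xnvme_async line with an xnvme_sync line:
; -- sketch: synchronous NVMe driver IOCTL backend --
[sync-test]
ioengine=xnvme
thread=1
direct=1
filename=/dev/nvme0n1
rw=randread
bs=4k
xnvme_sync=nvme
; -- end sketch --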
The tables below list the key hardware and software components of our evaluation system. It comprises low-cost commodity hardware so that reconstructing the environment and reproducing the experiments is possible with little effort and expense. We use a low-latency SSD as the target device.
| Hardware | Model |
|---|---|
| CPU | AMD Ryzen 9 5900X 4.9GHz |
| Memory | 16GB DDR4 2667MHz |
| Motherboard | MEG X570 GODLIKE (MS-7C34) |

| Software | Version |
|---|---|
| Linux | Ubuntu 22.04 LTS |
| fio | master @ 66087910 |
| xNVMe | 0.3.0 |
| gcc | 11.2.0 |
Let us take a look at a basic example and see how the asynchronous and synchronous backends can be used. The job file below has a random read workload with a block size of 4K. It runs for a total duration of one minute and uses libaio as the asynchronous backend. The xNVMe ioengine requires fio to use threads instead of processes for spawning jobs, as this is essential to support SPDK. So `thread=1` must be specified when the xNVMe ioengine is used. Below is our example job file.
; -- start xnvme.fio job file --
[test]
ioengine=xnvme
direct=1
filename=/dev/nvme0n1
iodepth=1
thread=1
time_based
runtime=1m
numjobs=1
rw=randread
bs=4k
xnvme_async=libaio
; -- end xnvme.fio job file --
After fio is done with the above job, the output will look like this:
$ fio xnvme.fio
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.30-63-g66087
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=420MiB/s][r=108k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=128084: Sat Jun 25 02:13:55 2022
read: IOPS=107k, BW=418MiB/s (438MB/s)(24.5GiB/60001msec)
slat (nsec): min=1090, max=21280, avg=1194.87, stdev=131.73
clat (nsec): min=280, max=285515, avg=7915.55, stdev=692.37
lat (nsec): min=7840, max=286665, avg=9110.41, stdev=738.60
clat percentiles (nsec):
| 1.00th=[ 7712], 5.00th=[ 7712], 10.00th=[ 7712], 20.00th=[ 7776],
| 30.00th=[ 7776], 40.00th=[ 7776], 50.00th=[ 7776], 60.00th=[ 7840],
| 70.00th=[ 7840], 80.00th=[ 7840], 90.00th=[ 8032], 95.00th=[ 8384],
| 99.00th=[10048], 99.50th=[12736], 99.90th=[15936], 99.95th=[16320],
| 99.99th=[24192]
bw ( KiB/s): min=417040, max=433432, per=100.00%, avg=427948.10, stdev=3630.93, samples=119
iops : min=104260, max=108358, avg=106987.04, stdev=907.75, samples=119
lat (nsec) : 500=0.01%, 750=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=98.97%, 20=1.01%, 50=0.02%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%
cpu : usr=10.42%, sys=22.84%, ctx=6417204, majf=0, minf=2
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6417056,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=418MiB/s (438MB/s), 418MiB/s-418MiB/s (438MB/s-438MB/s), io=24.5GiB (26.3GB), run=60001-60001msec
Switching to io_uring as the asynchronous backend, we just need to set `xnvme_async=io_uring`. Below is the resulting output.
$ fio xnvme.fio
fio-3.30-63-g66087
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=405MiB/s][r=104k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=128114: Sat Jun 25 02:15:13 2022
read: IOPS=104k, BW=406MiB/s (426MB/s)(23.8GiB/60000msec)
slat (nsec): min=1100, max=29619, avg=1259.00, stdev=243.90
clat (nsec): min=130, max=1179.7k, avg=8097.31, stdev=955.27
lat (usec): min=8, max=1180, avg= 9.36, stdev= 1.09
clat percentiles (nsec):
| 1.00th=[ 7712], 5.00th=[ 7776], 10.00th=[ 7776], 20.00th=[ 7840],
| 30.00th=[ 7840], 40.00th=[ 7840], 50.00th=[ 7840], 60.00th=[ 7904],
| 70.00th=[ 7968], 80.00th=[ 8096], 90.00th=[ 8896], 95.00th=[ 9408],
| 99.00th=[10688], 99.50th=[12736], 99.90th=[16192], 99.95th=[17536],
| 99.99th=[23168]
bw ( KiB/s): min=395440, max=429456, per=100.00%, avg=416342.72, stdev=7983.93, samples=119
iops : min=98860, max=107364, avg=104085.65, stdev=1996.01, samples=119
lat (nsec) : 250=0.01%, 500=0.01%
lat (usec) : 4=0.01%, 10=97.85%, 20=2.11%, 50=0.04%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%
lat (msec) : 2=0.01%
cpu : usr=8.74%, sys=24.66%, ctx=6242481, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6242547,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=406MiB/s (426MB/s), 406MiB/s-406MiB/s (426MB/s-426MB/s), io=23.8GiB (25.6GB), run=60000-60000msec
Switching to a synchronous backend such as psync can be done by replacing `xnvme_async=io_uring` with `xnvme_sync=psync`. Below is the resulting output.
$ fio xnvme.fio
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.30-63-g66087
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=445MiB/s][r=114k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=128145: Sat Jun 25 02:16:51 2022
read: IOPS=113k, BW=442MiB/s (464MB/s)(25.9GiB/60000msec)
slat (nsec): min=50, max=17090, avg=68.92, stdev=21.47
clat (nsec): min=8010, max=340414, avg=8551.46, stdev=705.66
lat (nsec): min=8080, max=340474, avg=8620.38, stdev=710.20
clat percentiles (nsec):
| 1.00th=[ 8256], 5.00th=[ 8384], 10.00th=[ 8384], 20.00th=[ 8384],
| 30.00th=[ 8384], 40.00th=[ 8384], 50.00th=[ 8384], 60.00th=[ 8384],
| 70.00th=[ 8512], 80.00th=[ 8512], 90.00th=[ 8512], 95.00th=[ 9152],
| 99.00th=[11200], 99.50th=[12992], 99.90th=[16768], 99.95th=[16768],
| 99.99th=[21376]
bw ( KiB/s): min=434336, max=457512, per=100.00%, avg=452843.03, stdev=4525.47, samples=119
iops : min=108584, max=114378, avg=113210.79, stdev=1131.34, samples=119
lat (usec) : 10=96.74%, 20=3.25%, 50=0.01%, 100=0.01%, 500=0.01%
cpu : usr=6.61%, sys=22.13%, ctx=6792164, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6792240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=442MiB/s (464MB/s), 442MiB/s-442MiB/s (464MB/s-464MB/s), io=25.9GiB (27.8GB), run=60000-60000msec
Testing these interfaces with fio is not a special trick given the availability of its assorted ioengines, but doing the same for an application would typically require significant effort. However, an application written for the xNVMe API could switch among these interfaces as easily as we used the xNVMe ioengine to specify different backends.
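The `xnvme` command-line tool installed alongside the library is one example of such an application: it is written against the xNVMe API, so simple tasks such as enumerating and inspecting devices look the same regardless of which backend ends up servicing the commands (the exact subcommands and flags may differ between xNVMe releases):
# xnvme enum
...
# xnvme info /dev/nvme0n1
...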
What is the overhead of xNVMe? This is the first question that comes to mind for any developer evaluating it. We carried out two series of experiments to illustrate the performance penalty introduced by xNVMe. The first establishes a baseline by comparing xNVMe's nil backend against fio's null ioengine. The second series of experiments compares xNVMe's io_uring backend against fio's io_uring ioengine.
The job file we used for the experiments is below.
[global]
direct=1
filename=/dev/nvme0n1
thread=1
time_based
runtime=1m
numjobs=1
rw=randread
bs=4k
iodepth=1
cpus_allowed=1
norandommap
[xnvme_nil]
ioengine=xnvme
xnvme_async=nil
size=500G
[null]
ioengine=null
size=500G
[xnvme_io_uring]
ioengine=xnvme
xnvme_async=io_uring
[io_uring]
ioengine=io_uring
These jobs use a block size of 4K and a queue depth of one. To help make the results more consistent, we pin the fio process to CPU 1 and set the cpufreq governor to performance using the command below:
# echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
We ran the fio jobs as follows:
# fio xnvme-compare.fio --section=null
...
# fio xnvme-compare.fio --section=xnvme_nil
...
# fio xnvme-compare.fio --section=io_uring
...
# fio xnvme-compare.fio --section=xnvme_io_uring
...
For details see the full output log. Below is a summary of our results for the four different configurations tested.
| | Baseline: xNVMe nil backend | Baseline: fio null ioengine | io_uring: xNVMe io_uring backend | io_uring: fio io_uring ioengine |
|---|---|---|---|---|
| Mean total latency (ns) | 82 | 46 | 8983 | 8877 |
| IOPS | 4919K | 6376K | 109K | 110K |
For our baseline, xNVMe's nil backend produces an average total latency of 82 ns. On the other hand, fio's built-in null ioengine reports an average total latency of 46 ns. We conclude that the baseline average overhead for xNVMe is 82 - 46 = 36 ns.
For the xNVMe ioengine using its io_uring backend, the average total latency is 8983 ns whereas directly using fio's io_uring ioengine we measure an average total latency of 8877 ns. With the io_uring backend xNVMe introduces an average latency overhead of 106 ns.
For the configurations we tested, the convenience of xNVMe comes with only a small penalty. For more details regarding the overhead of xNVMe see the recent SYSTOR paper by Simon Lund et al.
Note that we compare total latency values here instead of completion latency. Latency measurements for xNVMe differ slightly from those for fio's other asynchronous ioengines. The xNVMe ioengine's queue hook merely sets up the xNVMe command context data structure; there is no separate commit hook (as libaio and io_uring have) to tell the backend to begin submitting the I/Os. Because the xNVMe ioengine has no commit hook, fio begins timing completion latency as soon as xNVMe's queue hook returns. For libaio and io_uring, fio begins timing completion latency only once their commit hooks return and the commands are actually submitted. For xNVMe, the commands are not actually submitted until the ioengine's getevents hook is called in an attempt to reap completions, whereas when the getevents hook is called for io_uring and libaio the commands have already been submitted and fio is merely collecting completed commands. Thus, xNVMe is at a disadvantage if we compare completion latency values; the fair value to compare is total latency, which measures the time from origination to completion in the same way for all ioengines.
In Part 2 of this xNVMe ioengine series we will discuss other backends including SPDK and cover the xNVMe ioengine's specialized options for administrative commands, polling, and vectored I/O.
- xNVMe also includes a fio external ioengine which is essentially equivalent to fio's built-in xNVMe ioengine
- To build xNVMe with debug messages enabled, run `make config-debug && make && make install` from the root of the xNVMe repository