
This post was co-authored by Arun George, Ankit Kumar, and Vincent Fu. We are grateful for feedback and support from Adam Manzanares and Krishna Kanth Reddy.

Flexible Data Placement (FDP) is a feature that has recently been incorporated into the NVM Express® (NVMe®) standard. Fio has support for FDP and is able to exercise FDP capabilities on devices supporting this feature. This blog post provides an introduction to FDP and examples of fio jobs exercising this feature.

A Promising New Data Placement Technology: FDP

NVMe's new FDP technology provides an exciting solution to the Write Amplification Factor (WAF) challenges of SSDs. The promise of FDP derives from the fact that the host is not obliged to follow FDP rules while writing to an FDP SSD. Non-conforming host writes are still handled successfully by the SSD; they merely reduce the WAF advantages that FDP promises. Moreover, the improvements in device WAF can be achieved without any impact on performance or application-level WAF.

The concepts of FDP are fully explained in this whitepaper, "Introduction to Flexible Data Placement: A New Era of Optimized Data Management." Please have a look for more details about FDP device internals.

The key ideas of FDP technology can be summarized as follows:

  • Data Segregation: FDP enables the host system to separate user data with different lifetimes into different NAND blocks of the SSD. This feature is similar in concept to NVMe Streams, but comes with fewer restrictions because it is no longer tied to a namespace as Streams is. FDP also extends data segregation so that the host can separate data onto different NAND dies.
  • Data Alignment: The host has knowledge of SSD superblock boundaries and can align user data to these boundaries. This feature is useful when the host wants to place different objects of similar longevity onto different NAND media for efficient eventual de-allocation.
  • Device Feedback: The FDP specification defines log pages that allow the host to learn details about data placement on the device. For example, the host can obtain information about the frequency of garbage collection operations, superblock re-assignments, etc. (see the nvme-cli sketch after this list).
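
As an illustration of the device feedback mechanism, the commands below query the FDP statistics and events log pages with nvme-cli. This is a minimal sketch: it assumes an nvme-cli build that includes the FDP plugin (v2.3 or later) and an FDP-enabled Endurance Group with identifier 1, and the exact flag names should be checked against your nvme-cli version.

# FDP statistics log page: host bytes written vs. media bytes written (useful for estimating device WAF)
nvme fdp stats /dev/nvme0 --endgrp-id=1
# FDP events log page: placement-related events recorded by the controller
nvme fdp events /dev/nvme0 --endgrp-id=1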

The FDP specification is fully backward compatible with non-FDP devices. A software stack written with FDP semantics can very well work with an SSD unaware of the FDP protocol.

FDP Concepts

The above-mentioned whitepaper summarizes the important components of the FDP SSD architecture. The core components, which can be inspected on a real device with nvme-cli as sketched after this list, are:

  • Reclaim Unit (RU): A set of NAND erase blocks to which a host may write logical blocks. This is similar to the SuperBlock concept in the Flash Translation Layer (FTL) of an SSD. The typical size can be a few GBs.
  • Reclaim Group (RG): A collection of Reclaim Units. The entire device may have a single RG or multiple RGs if the device allows the host to segregate data onto different NAND dies based on host policy.
  • Reclaim Unit Handle (RUH): A pointer to an RU, akin to the streams concept. It is a resource within the SSD for managing and buffering the logical blocks to be written to an RU (i.e., a host write resource). A namespace is allowed access to one or more RUHs. If a namespace has access to more than one RUH, the host can write to multiple RUs at the same time. An FDP device typically has between 8 and 256 RUHs.
  • Placement Handle: An index into the list of Reclaim Unit Handles accessible by a namespace; this list is defined at namespace creation. Host writes to the namespace can only access the RUHs in this list.
  • Endurance Group: A collection of NAND blocks with endurance managed as a single unit with the intention that the NAND blocks wear out at the same time. In a typical SSD, there exists a single Endurance Group.
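
The commands below sketch how these components can be inspected on an actual device using nvme-cli's FDP plugin. The device path, the Endurance Group identifier (1), and the flag names are assumptions for this sketch; verify them against your nvme-cli version.

# FDP configurations log page: number of Reclaim Groups, number of Reclaim Unit Handles,
# and the reclaim unit nominal size for each supported configuration
nvme fdp configs /dev/nvme0 --endgrp-id=1
# Reclaim Unit Handle usage log page: per-RUH attributes for the Endurance Group
nvme fdp usage /dev/nvme0 --endgrp-id=1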

Data separation is achieved by means of RUHs and RGs. A host may employ data separation using RUHs when workloads have distinct IO patterns with different lifetimes. For example, when the host has hot, warm, and cold data streams, RUHs may be used to separate data with varying lifetimes on the SSD. Another use case for RUHs is a multi-tenancy environment in which the host receives data from different tenants and wishes to isolate each tenant's data, since each tenant may have a different invalidation pattern for its data.

FDP Example

With a basic understanding of FDP under our belts, let us now turn to an example exercising this feature. We outline a test environment relying on QEMU, nvme-cli, and fio. QEMU provides an emulated NVMe device with FDP support, we use nvme-cli to query device status, and we use fio to send write operations to it.

QEMU

QEMU supports NVMe FDP emulation since v8.0. In this section we briefly cover using QEMU to emulate NVMe FDP devices. A complete guide to using QEMU as a test platform is beyond the scope of this blog post. For more background on QEMU see the many resources available online including this and this.

To set up a QEMU-emulated NVMe device with FDP support, first, create a file to back the namespace. Here, we create a 5 GB file.

qemu-img create -f raw nvme0n1.img 5G

Because FDP is a feature that is enabled at the Endurance Group level, we need to configure an "NVMe Subsystem" device that will serve as our Endurance Group "container" and configure FDP attributes for it. Adding such a device requires the following QEMU command line parameters:

-device "nvme-subsys,id=nvme-subsys0,fdp=on,fdp.runs=16M,fdp.nrg=1,fdp.nruh=8"

This configures the endurance group with a reclaim unit nominal size of 16M (fdp.runs), one reclaim group (fdp.nrg), and 8 reclaim unit handles (fdp.nruh).

We then configure our controller and link it to the subsystem with the line below:

-device "nvme,id=nvme0,serial=deadbeef,subsys=nvme-subsys0"

Finally, we need to create a namespace. Here we configure both the "drive" (file) that holds our data and the emulated NVMe namespace. We use a physical and logical block size of 4096 bytes.

-drive "id=nvm-1,file=nvme0n1.img,format=raw,if=none,discard=unmap,media=disk" \
-device "nvme-ns,id=nvm-1,drive=nvm-1,bus=nvme0,nsid=1,logical_block_size=4096,physical_block_size=4096,fdp.ruhs=1-7"

Here, we assign RUHs 1 through 7 (both inclusive) to the namespace. In other words, RUH 0 will remain as a controller-specified RUH. This means that if other namespaces are added to the subsystem without specifying fdp.ruhs, they will automatically use RUH 0.
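
Putting these fragments together, a complete QEMU invocation might look like the sketch below. Everything outside the NVMe-related options (the machine type, CPU, memory, and the guest boot disk boot.qcow2) is an assumption for illustration and will differ for your environment.

qemu-system-x86_64 -machine q35,accel=kvm -cpu host -smp 4 -m 4G \
  -drive "file=boot.qcow2,format=qcow2,if=virtio" \
  -device "nvme-subsys,id=nvme-subsys0,fdp=on,fdp.runs=16M,fdp.nrg=1,fdp.nruh=8" \
  -device "nvme,id=nvme0,serial=deadbeef,subsys=nvme-subsys0" \
  -drive "id=nvm-1,file=nvme0n1.img,format=raw,if=none,discard=unmap,media=disk" \
  -device "nvme-ns,id=nvm-1,drive=nvm-1,bus=nvme0,nsid=1,logical_block_size=4096,physical_block_size=4096,fdp.ruhs=1-7"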

nvme-cli

nvme-cli is a Linux command-line tool for managing NVMe devices. It can be used to inspect the capabilities supported by a device, manage namespaces, and much more. FDP-specific features have been included in nvme-cli since v2.3. nvme-cli's documentation covers all the supported commands, and readers are encouraged to review the FDP-specific commands there. One FDP-related command queries reclaim unit handle status, which provides information about the reclaim unit handles for the specified namespace. We assigned reclaim unit handles 1-7 to our emulated device's only namespace, and nvme-cli reports the status of each of these reclaim unit handles. On a freshly booted system, the Reclaim Unit Available Media Writes (RUAMW) value for each handle is 4096 because we specified a reclaim unit nominal size of 16M and a logical block size of 4096 bytes. In other words, each RUH has 4096 blocks of 4096 bytes available, for a total of 16 MiB.

root@localhost:~# nvme fdp status /dev/nvme0 --namespace-id=1
Placement Identifier 0; Reclaim Unit Handle Identifier 1
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 4096
 
Placement Identifier 1; Reclaim Unit Handle Identifier 2
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 4096
 
Placement Identifier 2; Reclaim Unit Handle Identifier 3
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 4096
 
Placement Identifier 3; Reclaim Unit Handle Identifier 4
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 4096
 
Placement Identifier 4; Reclaim Unit Handle Identifier 5
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 4096
 
Placement Identifier 5; Reclaim Unit Handle Identifier 6
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 4096
 
Placement Identifier 6; Reclaim Unit Handle Identifier 7
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 4096

The output above provides a list of the placement identifiers available to the namespace. These are numbered 0-6 and refer respectively to RUHs 1-7.

Fio

FDP support has been available in fio since 3.34 (Mar 2023) from this commit. We recommend using the latest master branch to take advantage of fio's latest features. The examples here use a post-fio 3.36 build with hash 7d6c99.

Users can enable FDP support in fio by specifying fdp=1. FDP support is only available with the built-in fio engines io_uring_cmd and xnvme, or with the external SPDK ioengine.
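
Before running a job, you can confirm that your fio build recognizes the FDP options by asking for per-option help. This is a small sketch; the exact help text varies with the fio version.

# print fio's built-in help for each FDP-related option
fio --cmdhelp=fdp
fio --cmdhelp=fdp_pli
fio --cmdhelp=fdp_pli_select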

FDP-related options provided by fio are listed below; a command-line example using them follows the list.

  • fdp=bool: Enable FDP support with fdp=1. Default: disabled.
  • fdp_pli=<list,>: A comma-separated list of FDP placement identifier indices for fio to use for write operations. These are indices into the list returned by nvme fdp status. If omitted, fio will use all placement identifier indices available in the namespace.
  • fdp_pli_select={random,roundrobin}: The algorithm used for selecting from the set of placement identifiers.
      random: For write operations, select randomly from the set of placement identifiers.
      roundrobin: For write operations, round robin through the set of placement identifiers. This is the default.
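
For a quick one-off test, the same options can be supplied on the fio command line rather than in a job file. The sketch below targets the emulated namespace's character device from this post and writes 16M through placement identifier indices 0 and 1 in round-robin fashion; adjust the device path for your system.

fio --name=fdp-cmdline --ioengine=io_uring_cmd --cmd_type=nvme \
  --filename=/dev/ng0n1 --rw=randwrite --bs=4k --size=16M --iodepth=8 \
  --fdp=1 --fdp_pli=0,1 --fdp_pli_select=roundrobin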

Example

Now let us go through an example fio job. We will use the io_uring_cmd ioengine in our uring-fdp.fio job file below.

[global]
filename=/dev/ng0n1
ioengine=io_uring_cmd
cmd_type=nvme
rw=randwrite
iodepth=8
bs=4K
size=16M
fdp=1
 
[write-1]
fdp_pli=1,3
fdp_pli_select=random
offset=0%
 
[write-2]
fdp_pli=4,5
fdp_pli_select=roundrobin
offset=30%

The job file's global section specifies the io_uring_cmd ioengine with the filename specifying the character-device interface to the first namespace of device nvme0. The workload is a 4K random write workload with IO depth 8 to a 16M extent of the device. We enabled FDP support with the fdp=1 option.

The job file defines two different jobs. Job write-1 writes with FDP placement identifier indices 1 and 3 (Placement ID 1/RUH 2 and Placement ID 3/RUH 4, respectively) to the first 16M of the device, choosing randomly between the two placement identifier indices. Job write-2 uses placement identifier indices 4 and 5 (Placement ID 4/RUH 5 and Placement ID 5/RUH 6, respectively) and writes to the 16M of the device beginning 30% of the way into the device, selecting placement identifier indices in a round-robin fashion.

The output from running uring-fdp.fio is below. It provides the usual fio output, indicating that 16M was written for each job, listing the respective bandwidth, IOPS, latency distribution, etc.

Fio output
root@localhost:~# fio uring-fdp.fio
write-1: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring_cmd, iodepth=8
write-2: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring_cmd, iodepth=8
fio-3.36-104-g7d6c99
Starting 2 processes
write-1: (groupid=0, jobs=1): err= 0: pid=927: Wed Mar 20 00:26:55 2024
  write: IOPS=102k, BW=400MiB/s (419MB/s)(16.0MiB/40msec); 0 zone resets
    slat (nsec): min=700, max=81869, avg=3822.43, stdev=9737.15
    clat (usec): min=8, max=380, avg=73.57, stdev=31.67
     lat (usec): min=28, max=400, avg=77.39, stdev=33.01
    clat percentiles (usec):
     |  1.00th=[   19],  5.00th=[   39], 10.00th=[   44], 20.00th=[   51],
     | 30.00th=[   58], 40.00th=[   65], 50.00th=[   72], 60.00th=[   77],
     | 70.00th=[   82], 80.00th=[   88], 90.00th=[  102], 95.00th=[  130],
     | 99.00th=[  188], 99.50th=[  219], 99.90th=[  371], 99.95th=[  375],
     | 99.99th=[  379]
  lat (usec)   : 10=0.27%, 20=0.73%, 50=18.14%, 100=70.41%, 250=10.18%
  lat (usec)   : 500=0.27%
  cpu          : usr=0.00%, sys=48.72%, ctx=479, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.8%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4096,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
write-2: (groupid=0, jobs=1): err= 0: pid=928: Wed Mar 20 00:26:55 2024
  write: IOPS=87.1k, BW=340MiB/s (357MB/s)(16.0MiB/47msec); 0 zone resets
    slat (nsec): min=631, max=125087, avg=6369.07, stdev=14375.82
    clat (nsec): min=170, max=377862, avg=84365.74, stdev=47330.55
     lat (usec): min=28, max=445, avg=90.73, stdev=50.07
    clat percentiles (usec):
     |  1.00th=[   12],  5.00th=[   37], 10.00th=[   42], 20.00th=[   49],
     | 30.00th=[   56], 40.00th=[   67], 50.00th=[   76], 60.00th=[   83],
     | 70.00th=[   91], 80.00th=[  114], 90.00th=[  143], 95.00th=[  186],
     | 99.00th=[  253], 99.50th=[  297], 99.90th=[  351], 99.95th=[  355],
     | 99.99th=[  379]
  lat (nsec)   : 250=0.10%, 500=0.05%
  lat (usec)   : 10=0.51%, 20=1.88%, 50=20.17%, 100=52.17%, 250=24.02%
  lat (usec)   : 500=1.10%
  cpu          : usr=0.00%, sys=65.22%, ctx=402, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.8%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4096,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
  WRITE: bw=681MiB/s (714MB/s), 340MiB/s-400MiB/s (357MB/s-419MB/s), io=32.0MiB (33.6MB), run=40-47msec

Let us query the reclaim unit handle status a second time in order to see the effects of our writes.

root@localhost:~# nvme fdp status /dev/nvme0 --namespace-id=1
Placement Identifier 0; Reclaim Unit Handle Identifier 1
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 4096
 
Placement Identifier 1; Reclaim Unit Handle Identifier 2
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 2058
 
Placement Identifier 2; Reclaim Unit Handle Identifier 3
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 4096
 
Placement Identifier 3; Reclaim Unit Handle Identifier 4
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 2038
 
Placement Identifier 4; Reclaim Unit Handle Identifier 5
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 2048
 
Placement Identifier 5; Reclaim Unit Handle Identifier 6
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 2048
 
Placement Identifier 6; Reclaim Unit Handle Identifier 7
  Estimated Active Reclaim Unit Time Remaining (EARUTR): 0
  Reclaim Unit Available Media Writes (RUAMW): 4096

Based on this output we can observe the following (a sketch for resetting the handles appears after the list):

  • For placement identifier indices 1 and 3, the fio job write-1 specified random selection. These two reclaim units now have RUAMW values of 2058 and 2038 logical blocks, respectively (down from 4096 each). Random placement ID selection for each write produced an uneven distribution of writes across the two reclaim units, although the totals still account for all of write-1's data: (4096 - 2058) + (4096 - 2038) = 4096 blocks, or 16 MiB.
  • For placement identifier indices 4 and 5, the fio job write-2 used round-robin selection. Both reclaim units now have RUAMW values of 2048 (down from 4096), meaning that 2048 logical blocks, or 8 MiB, were written through each RUH: an even distribution of write-2's 16 MiB across the two reclaim units.
  • Placement identifier indices 0, 2, and 6 were not selected by any job, and their RUAMW values are unchanged at 4096.
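
To return these handles to fresh, fully writable reclaim units before another experiment, the FDP Reclaim Unit Handle Update operation points a RUH at a new (empty) RU. The sketch below uses nvme-cli; the subcommand name and flags (update, --pids) are assumptions based on the FDP plugin, so verify them with nvme fdp --help on your system.

# request new reclaim units for placement identifiers 1, 3, 4, and 5;
# afterwards their RUAMW values should read 4096 again on this emulated device
nvme fdp update /dev/nvme0 --namespace-id=1 --pids=1,3,4,5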

Conclusion

The foregoing has provided a brief overview of FDP and worked through an example using QEMU, nvme-cli, and fio to exercise this feature. FDP is a promising technology for reducing the total cost of ownership for NVMe devices by reducing write amplification. Fio and related tools can be useful for testing and validating FDP devices.

Notes

  • FDP statistics and accounting do not persist across QEMU invocations.
  • QEMU NVMe emulation does not support Namespace Management, so the FDP configuration must be set statically.
  • QEMU FDP emulation differs slightly from the NVMe specification by always enabling FDP on the Endurance Group if the QEMU option fdp=on is set.
  • Recently, FDP-related fio options were renamed to encompass data placement technologies including streams. The options used in the examples above are retained for backward compatibility; a job-file snippet using the newer names follows this list.
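
For reference, the write-1 job from the example could be rewritten with the newer, placement-technology-neutral option names roughly as below. The names shown (dataplacement, plids, plid_select) reflect recent fio development; treat this as a sketch and confirm the exact syntax against the fio documentation for your version.

[write-1]
ioengine=io_uring_cmd
cmd_type=nvme
filename=/dev/ng0n1
rw=randwrite
bs=4K
size=16M
iodepth=8
offset=0%
dataplacement=fdp
plids=1,3
plid_select=random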