
Clock test issue


I am grateful to Jim Harris, Klaus Jensen, Adam Manzanares, and Tong Zhang for their helpful feedback. All errors and omissions remain mine of course.

For a reason that I can no longer recall, I ran fio --cpuclock-test on my home desktop. This option directs fio to, well, carry out a test of the CPU clock and print related debug output. During a standard invocation, if fio deems the CPU clock reliable, fio can use the CPU clock for timing instead of making costly system calls to query the time. Since the CPU clock test result is invisible in a standard fio invocation, the --cpuclock-test command-line option can be used to troubleshoot timing issues. The test output begins with many lines of diagnostic information and ends with the test result. For a successful test the last lines of the output look like this:

$ fio --cpuclock-test
...
cs: cpu 52: 9433580974 clocks seen, first 5398029176100134
cs: cpu 46: 9445914736 clocks seen, first 5398029175984874
cs: cpu 68: 9425255906 clocks seen, first 5398029176773930
cs: cpu 13: 9483688624 clocks seen, first 5398029174646446
cs: cpu 47: 8986070530 clocks seen, first 5398029175992050
cs: cpu  8: 8848914174 clocks seen, first 5398029174505642
cs: Pass!

For a failed test the output looks like this:

$ fio --cpuclock-test
...
cs: CPU clock mismatch (diff=47):
         CPU 14: TSC=1342728217400059, SEQ=2396259
         CPU 2: TSC=1342728217400012, SEQ=2396260
cs: CPU clock mismatch (diff=46):
         CPU 14: TSC=1342728217411903, SEQ=2396392
         CPU 2: TSC=1342728217411857, SEQ=2396393
cs: CPU clock mismatch (diff=47):
         CPU 2: TSC=1342728217461347, SEQ=2397403
         CPU 14: TSC=1342728217461300, SEQ=2397404
cs: CPU clock mismatch (diff=46):
         CPU 14: TSC=1342728217571515, SEQ=2399210
         CPU 2: TSC=1342728217571469, SEQ=2399211
cs: Failed: 855099

On Linux x86-64 platforms I found that fio's CPU clock test passed when running on Intel processors but failed on AMD processors. After some investigation, I discovered that the issue was related to a compiler change, and the problem was resolved with this February 2024 patch. Although this issue has been resolved for fio for many months, I feel it deserves more attention, since this compiler change could well affect other software. This was also my first encounter with an incompatibility between AMD and Intel processors, which is quite unusual nowadays. I also wish to highlight Matt Godbolt's Compiler Explorer, without which I have no idea how I would have resolved this issue.

How does fio test the CPU clock?

x86-64 CPUs have a flag indicating whether they support an invariant timestamp counter (TSC), meaning that the CPU can supply a timestamp that runs at a constant rate regardless of dynamic CPU frequency scaling or sleep states. In addition to checking this flag, fio also carries out a test to confirm that CPU clocks are synchronized across CPU cores. Fio originates from the early days of multicore CPUs, when it was not safe to assume that CPU clocks were synchronized across cores.
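As an illustration, the sketch below queries that flag directly. This is not fio's detection code; it simply reads the same architectural bit (CPUID leaf 0x80000007, EDX bit 8, the invariant TSC flag on both Intel and AMD) using GCC's <cpuid.h> helper.

Invariant TSC check (sketch)
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* Leaf 0x80000007 reports power management features;
           EDX bit 8 is the invariant TSC flag. */
        if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
                printf("CPUID leaf 0x80000007 not supported\n");
                return 1;
        }
        printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
        return 0;
}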

Fio's CPU clock test spawns separate processes pinned to each CPU core that fio may use for running jobs. Each process runs the following code from gettime.c:clock_thread_fn():

        for (i = 0; i < t->nr_entries; i++, c++) {
                uint32_t seq;
                uint64_t tsc;

                c->cpu = t->cpu;
                do {
                        seq = *t->seq;
                        if (seq == UINT_MAX)
                                break;
                        __sync_synchronize();
                        tsc = get_cpu_clock();
                } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));

                if (seq == UINT_MAX)
                        break;

                c->seq = seq;
                c->tsc = tsc;
        }

The outer for loop executes a configurable number of times (1000 or 100,000 depending on how the test is invoked). The body of the loop records CPU clock timestamps paired with sequence numbers in a synchronized manner, using an inner do..while loop. The inner loop begins by copying the value pointed to by t->seq (which is shared among the running processes) into the local variable seq. This is followed by __sync_synchronize(), a full memory barrier, which ensures that the read of the shared sequence number completes before the CPU timestamp counter is read. Then the timestamp counter is read. The atomic32_compare_and_swap call in the loop's while condition verifies that the body of the inner loop executed without another process consuming the locally stored sequence number: it checks that the value at t->seq still matches the value stored locally, and if so, increments the value at t->seq and returns the original value. This ensures that each sequence number is paired with exactly one timestamp. If another process managed to record a timestamp using the same sequence number in the meantime, the do..while loop repeats until the process is able to pair a timestamp with a sequence number that no other process has claimed.

When this for loop has completed in all of the spawned processes, fio checks that the CPU clock timestamps increase with the sequence number. This ensures that the CPU clock is synchronized across CPU cores: if the timestamp recorded for a later sequence number is smaller than the timestamp recorded for an earlier one, the timestamps cannot be appropriately synchronized across cores. Since fio wishes to use the CPU clock for timekeeping and may read the timestamp counter from any available CPU core, synchronization is needed to prevent measured times from going backwards.
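As a rough illustration, here is a hedged sketch of that verification pass. It is not fio's actual implementation (which lives in gettime.c), but it performs the same check: sort all recorded entries by sequence number and count adjacent pairs whose timestamps go backwards. The struct clock_entry layout is taken from the code above; the function names are mine.

Verification sketch
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct clock_entry {
        uint32_t seq;
        uint32_t cpu;
        uint64_t tsc;
};

/* Order entries by the globally unique sequence number. */
static int entry_cmp(const void *a, const void *b)
{
        const struct clock_entry *x = a, *y = b;

        if (x->seq != y->seq)
                return x->seq < y->seq ? -1 : 1;
        return 0;
}

/* Returns the number of adjacent entries whose timestamps run backwards. */
static unsigned long check_clock_entries(struct clock_entry *entries,
                                         unsigned long nr)
{
        unsigned long i, failed = 0;

        qsort(entries, nr, sizeof(*entries), entry_cmp);
        for (i = 1; i < nr; i++) {
                const struct clock_entry *prev = &entries[i - 1];
                const struct clock_entry *cur = &entries[i];

                /* A later sequence number paired with an earlier timestamp
                   means the TSCs are not synchronized across cores. */
                if (cur->tsc < prev->tsc) {
                        fprintf(stderr, "CPU clock mismatch (diff=%llu):\n"
                                "\tCPU %u: TSC=%llu, SEQ=%u\n"
                                "\tCPU %u: TSC=%llu, SEQ=%u\n",
                                (unsigned long long)(prev->tsc - cur->tsc),
                                prev->cpu, (unsigned long long)prev->tsc, prev->seq,
                                cur->cpu, (unsigned long long)cur->tsc, cur->seq);
                        failed++;
                }
        }
        return failed;
}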

What was the issue?

I tested Intel server and laptop CPUs as well as AMD desktop and server CPUs. From time to time the test would pass on AMD CPUs, but it usually failed. I observed no failures on Intel CPUs. Could the CPU clock actually be unsynchronized across AMD CPU cores? That would be surprising on a modern CPU.

After long days of seemingly fruitless research and testing, even with the support of esteemed colleagues, it occurred to me to examine the compiler output of the above code. Matt Godbolt's Compiler Explorer is a godsend for this purpose.

I viewed the assembly language output of the following C code:

C code
#include <stdint.h>

struct clock_entry {
        uint32_t seq;
        uint32_t cpu;
        uint64_t tsc;
};

struct clock_thread {
        int cpu;
        int debug;
        unsigned long nr_entries;
        uint32_t *seq;
        struct clock_entry *entries;
};

static inline uint32_t atomic32_compare_and_swap(uint32_t *ptr, uint32_t old,
                                                 uint32_t new)
{
        return __sync_val_compare_and_swap(ptr, old, new);
}

static inline unsigned long long get_cpu_clock(void)
{
        unsigned int lo, hi;

        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long) hi << 32ULL) | lo;
}

static void clock_thread_fn(struct clock_thread *t)
{
    struct clock_entry *c;
    int i;

    c = &t->entries[0];
    for (i = 0; i < t->nr_entries; i++, c++) {
        uint32_t seq;
        uint64_t tsc;

        c->cpu = t->cpu;
        do {
            seq = *t->seq;
            __sync_synchronize();
            tsc = get_cpu_clock();
        } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));

        c->seq = seq;
        c->tsc = tsc;
    }
}

See https://godbolt.org/z/3Ed9rKoxc on Compiler Explorer to compare the output from different compilers. With GCC 11.1 and later on x86-64, the following assembly output was produced:

GCC 11.1 and later
atomic32_compare_and_swap:
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], rdi
        mov     DWORD PTR [rbp-12], esi
        mov     DWORD PTR [rbp-16], edx
        mov     rdx, QWORD PTR [rbp-8]
        mov     eax, DWORD PTR [rbp-12]
        mov     ecx, DWORD PTR [rbp-16]
        lock cmpxchg    DWORD PTR [rdx], ecx
        pop     rbp
        ret
get_cpu_clock:
        push    rbp
        mov     rbp, rsp
        rdtsc
        mov     DWORD PTR [rbp-4], eax
        mov     DWORD PTR [rbp-8], edx
        mov     eax, DWORD PTR [rbp-8]
        sal     rax, 32
        mov     rdx, rax
        mov     eax, DWORD PTR [rbp-4]
        or      rax, rdx
        pop     rbp
        ret
clock_thread_fn:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 40
        mov     QWORD PTR [rbp-40], rdi
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax+24]
        mov     QWORD PTR [rbp-8], rax
        mov     DWORD PTR [rbp-12], 0
        jmp     .L6
.L8:
        mov     rax, QWORD PTR [rbp-40]
        mov     eax, DWORD PTR [rax]
        mov     edx, eax
        mov     rax, QWORD PTR [rbp-8]
        mov     DWORD PTR [rax+4], edx
.L7:
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax+16]
        mov     eax, DWORD PTR [rax]
        mov     DWORD PTR [rbp-16], eax
        lock or QWORD PTR [rsp], 0
        call    get_cpu_clock
        mov     QWORD PTR [rbp-24], rax
        mov     eax, DWORD PTR [rbp-16]
        lea     edx, [rax+1]
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax+16]
        mov     ecx, DWORD PTR [rbp-16]
        mov     esi, ecx
        mov     rdi, rax
        call    atomic32_compare_and_swap
        cmp     DWORD PTR [rbp-16], eax
        jne     .L7
        mov     rax, QWORD PTR [rbp-8]
        mov     edx, DWORD PTR [rbp-16]
        mov     DWORD PTR [rax], edx
        mov     rax, QWORD PTR [rbp-8]
        mov     rdx, QWORD PTR [rbp-24]
        mov     QWORD PTR [rax+8], rdx
        add     DWORD PTR [rbp-12], 1
        add     QWORD PTR [rbp-8], 16
.L6:
        mov     eax, DWORD PTR [rbp-12]
        movsx   rdx, eax
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax+8]
        cmp     rdx, rax
        jb      .L8
        nop
        nop
        leave
        ret

On a whim I thought to compare the output from different GCC versions. I discovered that with GCC 10.5, this was the assembly language output:

GCC 10.5
atomic32_compare_and_swap:
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], rdi
        mov     DWORD PTR [rbp-12], esi
        mov     DWORD PTR [rbp-16], edx
        mov     rdx, QWORD PTR [rbp-8]
        mov     eax, DWORD PTR [rbp-12]
        mov     ecx, DWORD PTR [rbp-16]
        lock cmpxchg    DWORD PTR [rdx], ecx
        pop     rbp
        ret
get_cpu_clock:
        push    rbp
        mov     rbp, rsp
        rdtsc
        mov     DWORD PTR [rbp-4], eax
        mov     DWORD PTR [rbp-8], edx
        mov     eax, DWORD PTR [rbp-8]
        sal     rax, 32
        mov     rdx, rax
        mov     eax, DWORD PTR [rbp-4]
        or      rax, rdx
        pop     rbp
        ret
clock_thread_fn:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 40
        mov     QWORD PTR [rbp-40], rdi
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax+24]
        mov     QWORD PTR [rbp-8], rax
        mov     DWORD PTR [rbp-12], 0
        jmp     .L6
.L8:
        mov     rax, QWORD PTR [rbp-40]
        mov     eax, DWORD PTR [rax]
        mov     edx, eax
        mov     rax, QWORD PTR [rbp-8]
        mov     DWORD PTR [rax+4], edx
.L7:
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax+16]
        mov     eax, DWORD PTR [rax]
        mov     DWORD PTR [rbp-16], eax
        mfence
        call    get_cpu_clock
        mov     QWORD PTR [rbp-24], rax
        mov     eax, DWORD PTR [rbp-16]
        lea     edx, [rax+1]
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax+16]
        mov     ecx, DWORD PTR [rbp-16]
        mov     esi, ecx
        mov     rdi, rax
        call    atomic32_compare_and_swap
        cmp     DWORD PTR [rbp-16], eax
        jne     .L7
        mov     rax, QWORD PTR [rbp-8]
        mov     edx, DWORD PTR [rbp-16]
        mov     DWORD PTR [rax], edx
        mov     rax, QWORD PTR [rbp-8]
        mov     rdx, QWORD PTR [rbp-24]
        mov     QWORD PTR [rax+8], rdx
        add     DWORD PTR [rbp-12], 1
        add     QWORD PTR [rbp-8], 16
.L6:
        mov     eax, DWORD PTR [rbp-12]
        movsx   rdx, eax
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax+8]
        cmp     rdx, rax
        jb      .L8
        nop
        nop
        leave
        ret

The key difference is found five lines after the label .L7. The __sync_synchronize() call compiles to a different instruction starting with GCC 11. In GCC 10 and earlier, __sync_synchronize() compiles to mfence whereas starting with GCC 11, __sync_synchronize() compiles to lock or QWORD PTR [rsp], 0.

With __sync_synchronize() compiled to mfence, fio's CPU clock test passes on AMD platforms, but the test fails when __sync_synchronize() is compiled to lock or QWORD PTR [rsp], 0.
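The difference is easy to reproduce in isolation. Compiling the snippet below to assembly (for example with gcc -S -O0) under GCC 10 and GCC 11 shows the two lowerings of the barrier; the file name barrier.c is just an illustration.

Barrier reproducer
/* barrier.c: GCC 10 and earlier emit an mfence instruction for this
   barrier on x86-64; GCC 11 and later emit "lock or QWORD PTR [rsp], 0"
   instead. */
void barrier(void)
{
        __sync_synchronize();
}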

GCC's documentation says that __sync_synchronize() is a built-in function that "issues a full memory barrier."

Intel's documentation (Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1 Section 11.4.4.3) has the following description for mfence:

The MFENCE instruction establishes a memory fence for both loads and stores.
The processor ensures that no load or store after MFENCE will become globally
visible until all loads and stores before MFENCE are globally visible.

Intel's documentation (Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2A Chapter 3) says that the LOCK prefix:

Causes the processor’s LOCK# signal to be asserted during execution of the
accompanying instruction (turns the instruction into an atomic instruction). In
a multiprocessor environment, the LOCK# signal ensures that the processor has
exclusive use of any shared memory while the signal is asserted.

AMD's documentation (AMD64 Architecture Programmer's Manual Volume 1: Application Programming, p. 71) says, "MFENCE orders both memory reads and writes."

For the LOCK prefix, AMD's documentation (AMD64 Architecture Programmer's Manual Volume 1: Application Programming, Table 3-7, p. 77) says that it "[c]auses certain read-modify-write instructions on memory to occur atomically."

The LOCK prefix is applied to the instruction or QWORD PTR [rsp], 0, which is a no-op, since a bitwise OR with zero leaves the value unchanged.

I am by no means knowledgeable about assembly language and cannot assess the equivalence of the two instructions that __sync_synchronize() compiles to. So my patch to fix this issue merely reverses the compiler change made with GCC 11, replacing __sync_synchronize() with a tsc_barrier() function and defining it to be mfence on x86-64 platforms and __sync_synchronize() elsewhere:

fix
commit 5ae4f4220a48dddddc84c8b839ef9d8a1ed4edb1
Author: Vincent Fu <vincentfu@gmail.com>
Date:   Tue Feb 27 10:26:00 2024 -0500

    gettime: fix cpuclock-test on AMD platforms

    Starting with gcc 11 __sync_synchronize() compiles to

    lock or QWORD PTR [rsp], 0

    on x86_64 platforms. Previously it compiled to an mfence instruction.

    See line 47 of https://godbolt.org/z/xfE18K7b4 for an example.

    On Intel platforms this change does not affect the result of fio's CPU
    clock test. But on AMD platforms, this change causes fio's CPU clock
    test to fail and fio to fall back to clock_gettime() instead of using
    the CPU clock for timing.

    This patch has fio explicitly use an mfence instruction instead of
    __sync_synchornize() in the CPU clock test code on x86_64 platforms in
    order to allow the CPU clock test to pass on AMD platforms.

    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Link: https://lore.kernel.org/r/20240227155856.5012-1-vincent.fu@samsung.com
    Signed-off-by: Vincent Fu <vincent.fu@samsung.com>

diff --git a/arch/arch-x86_64.h b/arch/arch-x86_64.h
index 86ce1b7e..b402dc6d 100644
--- a/arch/arch-x86_64.h
+++ b/arch/arch-x86_64.h
@@ -26,6 +26,11 @@ static inline unsigned long arch_ffz(unsigned long bitmask)
        return bitmask;
 }

+static inline void tsc_barrier(void)
+{
+       __asm__ __volatile__("mfence":::"memory");
+}
+
 static inline unsigned long long get_cpu_clock(void)
 {
        unsigned int lo, hi;
diff --git a/arch/arch.h b/arch/arch.h
index 3ee9b053..7e294ddf 100644
--- a/arch/arch.h
+++ b/arch/arch.h
@@ -108,6 +108,13 @@ extern unsigned long arch_flags;
 #include "arch-generic.h"
 #endif

+#if !defined(__x86_64__) && defined(CONFIG_SYNC_SYNC)
+static inline void tsc_barrier(void)
+{
+       __sync_synchronize();
+}
+#endif
+
 #include "../lib/ffz.h"
 /* IWYU pragma: end_exports */

diff --git a/gettime.c b/gettime.c
index bc66a3ac..5ca31206 100644
--- a/gettime.c
+++ b/gettime.c
@@ -623,7 +623,7 @@ static void *clock_thread_fn(void *data)
                        seq = *t->seq;
                        if (seq == UINT_MAX)
                                break;
-                       __sync_synchronize();
+                       tsc_barrier();
                        tsc = get_cpu_clock();
                } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));

Coda

Although my patch resolves this issue for fio, loose ends remain. Why did the GCC developers make this change? My colleague identified this patch as responsible for the change from mfence to lock orq. I do not have the experience to evaluate the patch. I hope that others using __sync_synchronize() are aware of how this compiler change leads to different behavior on Intel and AMD CPUs. If any readers are able to shed more light on the issues discussed here, I would welcome commentary and feedback.

Notes and useful links

  • The clock source that fio uses can be hugely consequential for performance. With a single job using the null ioengine, I can obtain 13.3M IOPS using the CPU clock but only 8.6M IOPS using gettimeofday and 9.8M IOPS using clock_gettime.
Performance
vincent@home:~/fio-dev/fio$ ./fio --name=test --time_based --runtime=10s --ramp_time=3s --ioengine=null --filesize=1T --clocksource=cpu
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=null, iodepth=1
fio-3.37-80-g4017
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=50.0GiB/s][r=13.1M IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=422872: Fri Jul 26 13:18:09 2024
  read: IOPS=13.3M, BW=50.6GiB/s (54.4GB/s)(506GiB/10001msec)
    clat (nsec): min=0, max=43772, avg=10.29, stdev=12.52
     lat (nsec): min=10, max=193952, avg=17.97, stdev=25.47
    clat percentiles (nsec):
     |  1.00th=[    0],  5.00th=[    9], 10.00th=[   10], 20.00th=[   10],
     | 30.00th=[   10], 40.00th=[   10], 50.00th=[   10], 60.00th=[   10],
     | 70.00th=[   10], 80.00th=[   10], 90.00th=[   10], 95.00th=[   11],
     | 99.00th=[   20], 99.50th=[   20], 99.90th=[   20], 99.95th=[   20],
     | 99.99th=[   21]
   bw (  MiB/s): min=50946, max=54107, per=100.00%, avg=51852.45, stdev=1227.33, samples=20
   iops        : min=13042400, max=13851398, avg=13274229.10, stdev=314196.35, samples=20
  lat (nsec)   : 2=2.08%, 10=4.01%, 20=89.90%, 50=4.01%, 100=0.01%
  lat (nsec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=99.68%, sys=0.00%, ctx=145, majf=0, minf=37
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=132726685,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=50.6GiB/s (54.4GB/s), 50.6GiB/s-50.6GiB/s (54.4GB/s-54.4GB/s), io=506GiB (544GB), run=10001-10001msec
vincent@home:~/fio-dev/fio$ ./fio --name=test --time_based --runtime=10s --ramp_time=3s --ioengine=null --filesize=1T --clocksource=gettimeofday
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=null, iodepth=1
fio-3.37-80-g4017
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=32.9GiB/s][r=8622k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=422906: Fri Jul 26 13:18:32 2024
  read: IOPS=8609k, BW=32.8GiB/s (35.3GB/s)(328GiB/10001msec)
    clat (nsec): min=0, max=646000, avg=24.45, stdev=176.16
     lat (nsec): min=0, max=646000, avg=41.96, stdev=218.19
    clat percentiles (nsec):
     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
     | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
     | 70.00th=[    0], 80.00th=[    0], 90.00th=[    0], 95.00th=[    0],
     | 99.00th=[ 1004], 99.50th=[ 1004], 99.90th=[ 1004], 99.95th=[ 1004],
     | 99.99th=[ 1004]
   bw (  MiB/s): min=33471, max=33720, per=100.00%, avg=33635.78, stdev=60.20, samples=20
   iops        : min=8568736, max=8632444, avg=8610761.10, stdev=15410.94, samples=20
  lat (nsec)   : 2=97.56%
  lat (usec)   : 2=2.44%, 4=0.01%, 10=0.01%, 20=0.01%, 250=0.01%
  lat (usec)   : 500=0.01%, 750=0.01%
  cpu          : usr=99.69%, sys=0.00%, ctx=111, majf=0, minf=37
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=86096161,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=32.8GiB/s (35.3GB/s), 32.8GiB/s-32.8GiB/s (35.3GB/s-35.3GB/s), io=328GiB (353GB), run=10001-10001msec
vincent@home:~/fio-dev/fio$ ./fio --name=test --time_based --runtime=10s --ramp_time=3s --ioengine=null --filesize=1T --clocksource=clock_gettime
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=null, iodepth=1
fio-3.37-80-g4017
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=37.5GiB/s][r=9838k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=422829: Fri Jul 26 13:17:39 2024
  read: IOPS=9821k, BW=37.5GiB/s (40.2GB/s)(375GiB/10001msec)
    clat (nsec): min=10, max=52418, avg=18.55, stdev=17.76
     lat (nsec): min=30, max=52438, avg=36.59, stdev=23.83
    clat percentiles (nsec):
     |  1.00th=[   10],  5.00th=[   10], 10.00th=[   10], 20.00th=[   20],
     | 30.00th=[   20], 40.00th=[   20], 50.00th=[   20], 60.00th=[   20],
     | 70.00th=[   20], 80.00th=[   20], 90.00th=[   20], 95.00th=[   20],
     | 99.00th=[   21], 99.50th=[   21], 99.90th=[   40], 99.95th=[   40],
     | 99.99th=[   41]
   bw (  MiB/s): min=37771, max=38754, per=100.00%, avg=38374.46, stdev=216.32, samples=20
   iops        : min=9669616, max=9921232, avg=9823864.50, stdev=55377.11, samples=20
  lat (nsec)   : 20=16.04%, 50=83.95%, 100=0.01%, 250=0.01%, 500=0.01%
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%
  cpu          : usr=99.69%, sys=0.00%, ctx=224, majf=0, minf=37
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=98219882,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=37.5GiB/s (40.2GB/s), 37.5GiB/s-37.5GiB/s (40.2GB/s-40.2GB/s), io=375GiB (402GB), run=10001-10001msec