Clock test issue
I am grateful to Jim Harris, Klaus Jensen, Adam Manzanares, and Tong Zhang for their helpful feedback. All errors and omissions remain mine, of course.
For a reason that I can no longer recall, I ran fio --cpuclock-test on my home desktop. This option directs fio to, well, carry out a test of the CPU clock and print out related debug output. During a standard invocation, if fio deems the CPU clock reliable, then fio can use the CPU clock for timing instead of making costly system calls to query the time. Since the CPU clock test result is invisible in a standard fio invocation, the --cpuclock-test command-line option can be used to troubleshoot timing issues. The test output begins with many lines of diagnostic information, with the test result at the very end. For a successful test the last lines of the output look like this:
$ fio --cpuclock-test
...
cs: cpu 52: 9433580974 clocks seen, first 5398029176100134
cs: cpu 46: 9445914736 clocks seen, first 5398029175984874
cs: cpu 68: 9425255906 clocks seen, first 5398029176773930
cs: cpu 13: 9483688624 clocks seen, first 5398029174646446
cs: cpu 47: 8986070530 clocks seen, first 5398029175992050
cs: cpu 8: 8848914174 clocks seen, first 5398029174505642
cs: Pass!
For a failed test the output looks like this:
$ fio --cpuclock-test
...
cs: CPU clock mismatch (diff=47):
CPU 14: TSC=1342728217400059, SEQ=2396259
CPU 2: TSC=1342728217400012, SEQ=2396260
cs: CPU clock mismatch (diff=46):
CPU 14: TSC=1342728217411903, SEQ=2396392
CPU 2: TSC=1342728217411857, SEQ=2396393
cs: CPU clock mismatch (diff=47):
CPU 2: TSC=1342728217461347, SEQ=2397403
CPU 14: TSC=1342728217461300, SEQ=2397404
cs: CPU clock mismatch (diff=46):
CPU 14: TSC=1342728217571515, SEQ=2399210
CPU 2: TSC=1342728217571469, SEQ=2399211
cs: Failed: 855099
On Linux x86-64 platforms I found that fio's CPU clock test passed when running on Intel processors but failed on AMD processors. After some investigation, I discovered that the issue was related to a compiler change, and the problem was resolved with this February 2024 patch. Although this issue has been resolved for fio for many months, I feel it deserves more attention: surely this compiler change could affect other software. This was also my first encounter with an incompatibility between AMD and Intel processors, which is quite unusual nowadays. I also wish to highlight Matt Godbolt's Compiler Explorer, without which I have no idea how I would have resolved this issue.
x86-64 CPUs have a flag indicating whether they support an invariant timestamp counter (TSC), meaning that the CPU can supply a timestamp that advances at a constant rate regardless of dynamic CPU frequency scaling or sleep states. In addition to checking this flag, fio also carries out a test to confirm that CPU clocks are synchronized across CPU cores. Fio originates from the early days of multicore CPUs, when it was not safe to assume that CPU clocks were synchronized across cores.
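The invariant TSC capability is advertised in CPUID leaf 0x80000007, EDX bit 8. As a minimal sketch of what such a check looks like (my own illustration, not fio's detection code), the flag can be read on x86-64 with GCC or Clang as follows:
#include <cpuid.h>
#include <stdio.h>
/* Sketch: query the invariant TSC flag (CPUID leaf 0x80000007, EDX bit 8). */
static int have_invariant_tsc(void)
{
        unsigned int eax, ebx, ecx, edx;
        /* __get_cpuid() returns 0 if the requested leaf is unsupported */
        if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
                return 0;
        return !!(edx & (1u << 8));
}
int main(void)
{
        printf("invariant TSC: %s\n", have_invariant_tsc() ? "yes" : "no");
        return 0;
}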
Fio's CPU clock test spawns separate processes pinned to each CPU core that fio may use for running jobs. Each process runs the following code from gettime.c:clock_thread_fn():
for (i = 0; i < t->nr_entries; i++, c++) {
uint32_t seq;
uint64_t tsc;
c->cpu = t->cpu;
do {
seq = *t->seq;
if (seq == UINT_MAX)
break;
__sync_synchronize();
tsc = get_cpu_clock();
} while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));
if (seq == UINT_MAX)
break;
c->seq = seq;
c->tsc = tsc;
}
The outer for loop executes a configurable number of times (1000 or 100,000 depending on how the test is invoked). The code inside the for loop records CPU clock timestamps paired with sequence numbers in a synchronized manner. This is accomplished by an inner do..while loop. The inner loop begins by assigning the local variable seq the value pointed to by t->seq (which is shared among the running processes). This is followed by the __sync_synchronize() full memory barrier, which ensures that the load of the shared value pointed to by t->seq completes before the CPU timestamp counter is read. The atomic32_compare_and_swap call in the loop's while condition then verifies that the body of the inner loop executed without another process using the locally stored sequence number: the compare-and-swap checks that the value at t->seq is still the same as the value stored locally and, if so, increments the value at t->seq and returns the original value. This ensures that each sequence number is paired with exactly one timestamp. If another process managed to record a timestamp using the locally stored sequence number, the do..while loop repeats until the process is able to pair a timestamp with a sequence number without another process stealing it.
When this for loop has completed in all of the spawned processes, fio checks that the CPU clock timestamps increase with the sequence numbers. This ensures that the CPU clock is synchronized across CPU cores: a timestamp for a later sequence number that is smaller than the timestamp for an earlier sequence number would be evidence that timestamps are not appropriately synchronized across cores. Since fio wishes to use the CPU clock for timekeeping and may read the timestamp counter from any available CPU core, synchronization is needed to prevent measured times from going backwards.
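To make the verification step concrete, here is a simplified sketch of the idea (my own code, not fio's actual implementation): gather the recorded entries from all processes, sort them by sequence number, and count how often a timestamp goes backwards.
#include <stdint.h>
#include <stdlib.h>
/* Mirrors the entry layout used in the test code above. */
struct clock_entry {
        uint32_t seq;
        uint32_t cpu;
        uint64_t tsc;
};
static int cmp_seq(const void *a, const void *b)
{
        const struct clock_entry *x = a, *y = b;
        return (x->seq > y->seq) - (x->seq < y->seq);
}
/*
 * Sort all entries from all processes by sequence number and count
 * out-of-order timestamps. A nonzero return means the TSC does not look
 * synchronized across cores.
 */
static unsigned int check_clock_entries(struct clock_entry *entries, size_t n)
{
        unsigned int failures = 0;
        size_t i;
        qsort(entries, n, sizeof(*entries), cmp_seq);
        for (i = 1; i < n; i++)
                if (entries[i].tsc < entries[i - 1].tsc)
                        failures++;     /* later seq, earlier timestamp */
        return failures;
}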
I tested Intel server and laptop CPUs as well as AMD desktop and server CPUs. From time to time the test would pass on AMD CPUs, but it usually failed. I observed no failures on Intel CPUs. Could the CPU clock actually be unsynchronized across AMD CPU cores? That is unlikely on a modern CPU.
After long days of seemingly fruitless research and testing, even with the support of esteemed colleagues, it occurred to me to examine the compiler output of the above code. Matt Godbolt's Compiler Explorer is a godsend for this purpose.
I viewed the assembly language output of the following C code:
C code
#include <stdint.h>
struct clock_entry {
uint32_t seq;
uint32_t cpu;
uint64_t tsc;
};
struct clock_thread {
int cpu;
int debug;
unsigned long nr_entries;
uint32_t *seq;
struct clock_entry *entries;
};
static inline uint32_t atomic32_compare_and_swap(uint32_t *ptr, uint32_t old,
uint32_t new)
{
return __sync_val_compare_and_swap(ptr, old, new);
}
static inline unsigned long long get_cpu_clock(void)
{
unsigned int lo, hi;
__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
return ((unsigned long long) hi << 32ULL) | lo;
}
static void clock_thread_fn(struct clock_thread *t)
{
struct clock_entry *c;
int i;
c = &t->entries[0];
for (i = 0; i < t->nr_entries; i++, c++) {
uint32_t seq;
uint64_t tsc;
c->cpu = t->cpu;
do {
seq = *t->seq;
__sync_synchronize();
tsc = get_cpu_clock();
} while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));
c->seq = seq;
c->tsc = tsc;
}
}
See https://godbolt.org/z/3Ed9rKoxc on Compiler Explorer to compare the output from different compilers. With GCC 11.1 and later on x86-64, the following assembly output was produced:
GCC 11.1 and later
atomic32_compare_and_swap:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov DWORD PTR [rbp-12], esi
mov DWORD PTR [rbp-16], edx
mov rdx, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rbp-12]
mov ecx, DWORD PTR [rbp-16]
lock cmpxchg DWORD PTR [rdx], ecx
pop rbp
ret
get_cpu_clock:
push rbp
mov rbp, rsp
rdtsc
mov DWORD PTR [rbp-4], eax
mov DWORD PTR [rbp-8], edx
mov eax, DWORD PTR [rbp-8]
sal rax, 32
mov rdx, rax
mov eax, DWORD PTR [rbp-4]
or rax, rdx
pop rbp
ret
clock_thread_fn:
push rbp
mov rbp, rsp
sub rsp, 40
mov QWORD PTR [rbp-40], rdi
mov rax, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rax+24]
mov QWORD PTR [rbp-8], rax
mov DWORD PTR [rbp-12], 0
jmp .L6
.L8:
mov rax, QWORD PTR [rbp-40]
mov eax, DWORD PTR [rax]
mov edx, eax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax+4], edx
.L7:
mov rax, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rax+16]
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-16], eax
lock or QWORD PTR [rsp], 0
call get_cpu_clock
mov QWORD PTR [rbp-24], rax
mov eax, DWORD PTR [rbp-16]
lea edx, [rax+1]
mov rax, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rax+16]
mov ecx, DWORD PTR [rbp-16]
mov esi, ecx
mov rdi, rax
call atomic32_compare_and_swap
cmp DWORD PTR [rbp-16], eax
jne .L7
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rbp-16]
mov DWORD PTR [rax], edx
mov rax, QWORD PTR [rbp-8]
mov rdx, QWORD PTR [rbp-24]
mov QWORD PTR [rax+8], rdx
add DWORD PTR [rbp-12], 1
add QWORD PTR [rbp-8], 16
.L6:
mov eax, DWORD PTR [rbp-12]
movsx rdx, eax
mov rax, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rax+8]
cmp rdx, rax
jb .L8
nop
nop
leave
ret
On a whim I thought to compare the output from different GCC versions. I discovered that with GCC 10.5, this was the assembly language output:
GCC 10.5
atomic32_compare_and_swap:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov DWORD PTR [rbp-12], esi
mov DWORD PTR [rbp-16], edx
mov rdx, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rbp-12]
mov ecx, DWORD PTR [rbp-16]
lock cmpxchg DWORD PTR [rdx], ecx
pop rbp
ret
get_cpu_clock:
push rbp
mov rbp, rsp
rdtsc
mov DWORD PTR [rbp-4], eax
mov DWORD PTR [rbp-8], edx
mov eax, DWORD PTR [rbp-8]
sal rax, 32
mov rdx, rax
mov eax, DWORD PTR [rbp-4]
or rax, rdx
pop rbp
ret
clock_thread_fn:
push rbp
mov rbp, rsp
sub rsp, 40
mov QWORD PTR [rbp-40], rdi
mov rax, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rax+24]
mov QWORD PTR [rbp-8], rax
mov DWORD PTR [rbp-12], 0
jmp .L6
.L8:
mov rax, QWORD PTR [rbp-40]
mov eax, DWORD PTR [rax]
mov edx, eax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax+4], edx
.L7:
mov rax, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rax+16]
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-16], eax
mfence
call get_cpu_clock
mov QWORD PTR [rbp-24], rax
mov eax, DWORD PTR [rbp-16]
lea edx, [rax+1]
mov rax, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rax+16]
mov ecx, DWORD PTR [rbp-16]
mov esi, ecx
mov rdi, rax
call atomic32_compare_and_swap
cmp DWORD PTR [rbp-16], eax
jne .L7
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rbp-16]
mov DWORD PTR [rax], edx
mov rax, QWORD PTR [rbp-8]
mov rdx, QWORD PTR [rbp-24]
mov QWORD PTR [rax+8], rdx
add DWORD PTR [rbp-12], 1
add QWORD PTR [rbp-8], 16
.L6:
mov eax, DWORD PTR [rbp-12]
movsx rdx, eax
mov rax, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rax+8]
cmp rdx, rax
jb .L8
nop
nop
leave
ret
The key difference is found five lines after the label .L7. The __sync_synchronize() call compiles to a different instruction starting with GCC 11: in GCC 10 and earlier, __sync_synchronize() compiles to mfence, whereas starting with GCC 11 it compiles to lock or QWORD PTR [rsp], 0. With __sync_synchronize() compiled to mfence, fio's CPU clock test passes on AMD platforms, but the test fails with __sync_synchronize() compiled to lock or QWORD PTR [rsp], 0.
GCC's documentation says that __sync_synchronize() is a built-in function that "issues a full memory barrier."
Intel's documentation (Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1 Section 11.4.4.3) has the following description for mfence:
The MFENCE instruction establishes a memory fence for both loads and stores.
The processor ensures that no load or store after MFENCE will become globally
visible until all loads and stores before MFENCE are globally visible.
Intel's documentation (Intel® 64 and IA-32 Architectures Software Developer’s
Manual Volume 2A Chapter 3) says that the LOCK
prefix:
Causes the processor’s LOCK# signal to be asserted during execution of the
accompanying instruction (turns the instruction into an atomic instruction). In
a multiprocessor environment, the LOCK# signal ensures that the processor has
exclusive use of any shared memory while the signal is asserted.
AMD's documentation (AMD64 Architecture Programmer's Manual Volume 1: Application Programming, p. 71) says, "MFENCE orders both memory reads and writes."
For the LOCK prefix, AMD's documentation (AMD64 Architecture Programmer's Manual Volume 1: Application Programming, Table 3-7, p. 77) says that it "[c]auses certain read-modify-write instructions on memory to occur atomically."
The LOCK prefix is applied to the instruction or QWORD PTR [rsp], 0, which is a no-op in terms of data: a bitwise or with zero leaves the value in memory unchanged.
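For concreteness, the two barrier forms can be written as explicit inline assembly. This is my own illustration of the two code sequences, not code taken from fio or GCC:
/* Full barrier as GCC 10 and earlier emit it for __sync_synchronize(). */
static inline void full_barrier_mfence(void)
{
        __asm__ __volatile__("mfence" ::: "memory");
}
/* Full barrier as GCC 11 and later emit it: an atomic or of 0 into the top
 * of the stack. The value in memory is unchanged, but the locked
 * read-modify-write is what GCC relies on as a full barrier. The "cc"
 * clobber covers the flags that the or instruction modifies. */
static inline void full_barrier_lock_or(void)
{
        __asm__ __volatile__("lock orq $0, (%%rsp)" ::: "memory", "cc");
}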
I am by no means knowledgeable about assembly language and cannot assess the equivalence of the two instruction sequences that __sync_synchronize() compiles to. So my patch to fix this issue merely reverses the compiler change made with GCC 11, replacing __sync_synchronize() with a tsc_barrier() function and defining it to be mfence on x86-64 platforms and __sync_synchronize() elsewhere:
fix
commit 5ae4f4220a48dddddc84c8b839ef9d8a1ed4edb1
Author: Vincent Fu <vincentfu@gmail.com>
Date: Tue Feb 27 10:26:00 2024 -0500
gettime: fix cpuclock-test on AMD platforms
Starting with gcc 11 __sync_synchronize() compiles to
lock or QWORD PTR [rsp], 0
on x86_64 platforms. Previously it compiled to an mfence instruction.
See line 47 of https://godbolt.org/z/xfE18K7b4 for an example.
On Intel platforms this change does not affect the result of fio's CPU
clock test. But on AMD platforms, this change causes fio's CPU clock
test to fail and fio to fall back to clock_gettime() instead of using
the CPU clock for timing.
This patch has fio explicitly use an mfence instruction instead of
__sync_synchornize() in the CPU clock test code on x86_64 platforms in
order to allow the CPU clock test to pass on AMD platforms.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240227155856.5012-1-vincent.fu@samsung.com
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
diff --git a/arch/arch-x86_64.h b/arch/arch-x86_64.h
index 86ce1b7e..b402dc6d 100644
--- a/arch/arch-x86_64.h
+++ b/arch/arch-x86_64.h
@@ -26,6 +26,11 @@ static inline unsigned long arch_ffz(unsigned long bitmask)
return bitmask;
}
+static inline void tsc_barrier(void)
+{
+ __asm__ __volatile__("mfence":::"memory");
+}
+
static inline unsigned long long get_cpu_clock(void)
{
unsigned int lo, hi;
diff --git a/arch/arch.h b/arch/arch.h
index 3ee9b053..7e294ddf 100644
--- a/arch/arch.h
+++ b/arch/arch.h
@@ -108,6 +108,13 @@ extern unsigned long arch_flags;
#include "arch-generic.h"
#endif
+#if !defined(__x86_64__) && defined(CONFIG_SYNC_SYNC)
+static inline void tsc_barrier(void)
+{
+ __sync_synchronize();
+}
+#endif
+
#include "../lib/ffz.h"
/* IWYU pragma: end_exports */
diff --git a/gettime.c b/gettime.c
index bc66a3ac..5ca31206 100644
--- a/gettime.c
+++ b/gettime.c
@@ -623,7 +623,7 @@ static void *clock_thread_fn(void *data)
seq = *t->seq;
if (seq == UINT_MAX)
break;
- __sync_synchronize();
+ tsc_barrier();
tsc = get_cpu_clock();
} while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));
Although my patch resolves this issue for fio, loose ends remain. Why did the GCC developers make this change? My colleague identified this patch as responsible for the change from mfence to lock orq. I do not have the experience to evaluate the patch. I hope that others using __sync_synchronize() are aware of how this compiler change leads to different behavior on Intel and AMD CPUs. If any readers are able to shed more light on the issues discussed here, I would welcome commentary and feedback.
- The clock source that fio uses can be hugely consequential for performance. With a single job using the null ioengine, I can obtain 13.3M IOPS using the CPU clock but only 8.6M IOPS using gettimeofday and 9.8M IOPS using clock_gettime.
Performance
vincent@home:~/fio-dev/fio$ ./fio --name=test --time_based --runtime=10s --ramp_time=3s --ioengine=null --filesize=1T --clocksource=cpu
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=null, iodepth=1
fio-3.37-80-g4017
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=50.0GiB/s][r=13.1M IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=422872: Fri Jul 26 13:18:09 2024
read: IOPS=13.3M, BW=50.6GiB/s (54.4GB/s)(506GiB/10001msec)
clat (nsec): min=0, max=43772, avg=10.29, stdev=12.52
lat (nsec): min=10, max=193952, avg=17.97, stdev=25.47
clat percentiles (nsec):
| 1.00th=[ 0], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 10], 40.00th=[ 10], 50.00th=[ 10], 60.00th=[ 10],
| 70.00th=[ 10], 80.00th=[ 10], 90.00th=[ 10], 95.00th=[ 11],
| 99.00th=[ 20], 99.50th=[ 20], 99.90th=[ 20], 99.95th=[ 20],
| 99.99th=[ 21]
bw ( MiB/s): min=50946, max=54107, per=100.00%, avg=51852.45, stdev=1227.33, samples=20
iops : min=13042400, max=13851398, avg=13274229.10, stdev=314196.35, samples=20
lat (nsec) : 2=2.08%, 10=4.01%, 20=89.90%, 50=4.01%, 100=0.01%
lat (nsec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=99.68%, sys=0.00%, ctx=145, majf=0, minf=37
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=132726685,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=50.6GiB/s (54.4GB/s), 50.6GiB/s-50.6GiB/s (54.4GB/s-54.4GB/s), io=506GiB (544GB), run=10001-10001msec
vincent@home:~/fio-dev/fio$ ./fio --name=test --time_based --runtime=10s --ramp_time=3s --ioengine=null --filesize=1T --clocksource=gettimeofday
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=null, iodepth=1
fio-3.37-80-g4017
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=32.9GiB/s][r=8622k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=422906: Fri Jul 26 13:18:32 2024
read: IOPS=8609k, BW=32.8GiB/s (35.3GB/s)(328GiB/10001msec)
clat (nsec): min=0, max=646000, avg=24.45, stdev=176.16
lat (nsec): min=0, max=646000, avg=41.96, stdev=218.19
clat percentiles (nsec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
| 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 0], 60.00th=[ 0],
| 70.00th=[ 0], 80.00th=[ 0], 90.00th=[ 0], 95.00th=[ 0],
| 99.00th=[ 1004], 99.50th=[ 1004], 99.90th=[ 1004], 99.95th=[ 1004],
| 99.99th=[ 1004]
bw ( MiB/s): min=33471, max=33720, per=100.00%, avg=33635.78, stdev=60.20, samples=20
iops : min=8568736, max=8632444, avg=8610761.10, stdev=15410.94, samples=20
lat (nsec) : 2=97.56%
lat (usec) : 2=2.44%, 4=0.01%, 10=0.01%, 20=0.01%, 250=0.01%
lat (usec) : 500=0.01%, 750=0.01%
cpu : usr=99.69%, sys=0.00%, ctx=111, majf=0, minf=37
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=86096161,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=32.8GiB/s (35.3GB/s), 32.8GiB/s-32.8GiB/s (35.3GB/s-35.3GB/s), io=328GiB (353GB), run=10001-10001msec
vincent@home:~/fio-dev/fio$ ./fio --name=test --time_based --runtime=10s --ramp_time=3s --ioengine=null --filesize=1T --clocksource=clock_gettime
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=null, iodepth=1
fio-3.37-80-g4017
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=37.5GiB/s][r=9838k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=422829: Fri Jul 26 13:17:39 2024
read: IOPS=9821k, BW=37.5GiB/s (40.2GB/s)(375GiB/10001msec)
clat (nsec): min=10, max=52418, avg=18.55, stdev=17.76
lat (nsec): min=30, max=52438, avg=36.59, stdev=23.83
clat percentiles (nsec):
| 1.00th=[ 10], 5.00th=[ 10], 10.00th=[ 10], 20.00th=[ 20],
| 30.00th=[ 20], 40.00th=[ 20], 50.00th=[ 20], 60.00th=[ 20],
| 70.00th=[ 20], 80.00th=[ 20], 90.00th=[ 20], 95.00th=[ 20],
| 99.00th=[ 21], 99.50th=[ 21], 99.90th=[ 40], 99.95th=[ 40],
| 99.99th=[ 41]
bw ( MiB/s): min=37771, max=38754, per=100.00%, avg=38374.46, stdev=216.32, samples=20
iops : min=9669616, max=9921232, avg=9823864.50, stdev=55377.11, samples=20
lat (nsec) : 20=16.04%, 50=83.95%, 100=0.01%, 250=0.01%, 500=0.01%
lat (nsec) : 750=0.01%, 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.01%
cpu : usr=99.69%, sys=0.00%, ctx=224, majf=0, minf=37
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=98219882,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=37.5GiB/s (40.2GB/s), 37.5GiB/s-37.5GiB/s (40.2GB/s-40.2GB/s), io=375GiB (402GB), run=10001-10001msec
- For background on how fio uses the CPU clock for timing, see here.
- FreeBSD kernel developers encountered a similar CPU clock issue here.
- Intel 64 and IA-32 Architectures Software Developer Manuals: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
- AMD64 Architecture Programmer's Manual: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf
- https://en.wikipedia.org/wiki/Time_Stamp_Counter
- https://community.intel.com/t5/Intel-ISA-Extensions/Invariant-TSC-support/m-p/772125
- https://news.ycombinator.com/item?id=16922495