Merge pull request #10 from SciML/cmovcvt

Add some more discussion surrounding `cmov`
SciML · Jan 26, 2024 · c054ce0
2 parents e6db7bd + 4ff42d5
Showing 1 changed file with 148 additions and 32 deletions.
diff --git a/README.md b/README.md
@@ -70,56 +70,56 @@ end
 Sample results using `-Cnative,-prefer-256-bit` on an AVX512 capable laptop:
 ```julia
 julia> @benchmark findbench(FindFirstFunctions.findfirstequal, $x, $perm)
-BenchmarkTools.Trial: 9219 samples with 1 evaluation.
- Range (min … max): 107.094 μs … 137.850 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 107.376 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 107.577 μs ± 1.175 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+BenchmarkTools.Trial: 6794 samples with 1 evaluation.
+ Range (min … max): 141.489 μs … 190.383 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 145.892 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 145.978 μs ± 4.697 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
 
-  ▁▇█▆▁      
- ▂▃▅█████▅▃▂▂▂▂▂▁▁▁▁▁▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▁▁▂▂▁▂ ▃
- 107 μs  Histogram: frequency by time  110 μs <
+ ▇▅ ▁▁ ▁█▅ ▁▂▁ ▂▃▂ ▁ ▁ ▁
+ ██▆▁▁▁▃██▆▄███▄▃▄▃▄███▁▆████▅▅▅▇█▇▇▆▆▆▆▇▆▅▃▇█▆▇▇▆▅▆▇▇▇▇▆▇█▅▅▅ █
+ 141 μs Histogram: log(frequency) by time 163 μs <
 
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 julia> @benchmark findbench((x,v)->findfirst(==(x), v), $x, $perm)
-BenchmarkTools.Trial: 2144 samples with 1 evaluation.
- Range (min … max): 462.442 μs … 584.795 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 464.638 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 465.686 μs ±  5.534 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+BenchmarkTools.Trial: 1765 samples with 1 evaluation.
+ Range (min … max): 547.812 μs … 663.534 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 564.245 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 565.600 μs ± 14.561 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
 
-  █ ▅▇▂  
- ▅▃▁▁▁█▇███▇▆▃▆▃▁▄▃▄▁▃▃▁▁▁▁▃▁▄▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▃▃▄▃▁▁▁▃▄▃ █
- 462 μs Histogram: log(frequency) by time 486 μs <
+ ▇▄▄▄ ▄ █▄▅▆▅ ▂▁▂▅▃▃▅ ▁ ▁
+ ████▁▁▁▁█▇████████████████▇█▇██▆▅▆▅▅▄▅▅▄▄▆▁▄▄▅▅▅▁▁▅▅▅▅▆▄▁▁▁▄▅ █
+ 548 μs Histogram: log(frequency) by time 628 μs <
 
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 julia> @benchmark findbench(FindFirstFunctions.findfirstsortedequal, $s, $perm)
 BenchmarkTools.Trial: 10000 samples with 1 evaluation.
- Range (min … max): 46.256 μs … 88.446 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 48.048 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 48.702 μs ± 2.079 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 75.857 μs … 125.111 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 85.811 μs  ┊ GC (median): 0.00%
+ Time (mean ± σ): 86.135 μs ±  3.217 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
 
- ▂▅▇█▇▇▆▄▃▁ 
- ▁▃▆▇███████████▇▇▆▅▅▅▄▄▃▃▃▂▃▂▃▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▁▂▁▂▁▂▁▁▂▁▂▂ ▃
- 46.3 μs Histogram: frequency by time 56 μs <
+  ▁ ▂██▃ 
+ ▂▁▁▁▂▂▁▁▁▁▁▁▁▂▂▃▅██▆▄▆████▅▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
+ 75.9 μs Histogram: frequency by time 101 μs <
 
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 julia> @benchmark findbench((x,v)->searchsortedfirst(v, x), $s, $perm)
-BenchmarkTools.Trial: 10000 samples with 1 evaluation.
- Range (min … max): 77.387 μs … 108.634 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 79.305 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 81.398 μs ± 4.536 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+BenchmarkTools.Trial: 8741 samples with 1 evaluation.
+ Range (min … max): 108.941 μs … 152.368 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 113.026 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 113.282 μs ± 3.812 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
 
-  ▃▆█▆▃   
- ▁▃▅▇██████▅▄▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▄▄▃▃▂▃▃▃▂▂▁▁▁ ▂
- 77.4 μs Histogram: frequency by time 92.6 μs <
+ ▂▅▂  ▄█▇▂ 
+ ▁▂▅███▆▃▂▃▆████▆▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
+ 109 μs  Histogram: frequency by time  130 μs <
 
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 julia> versioninfo()
 Julia Version 1.10.0
-Commit 3120989f39* (2023-12-25 18:01 UTC)
+Commit 3120989f39 (2023-12-25 18:01 UTC)
 Build Info:
 
  Note: This is an unofficial build, please report bugs to the project
@@ -128,14 +128,130 @@ Build Info:
 
 Platform Info:
  OS: Linux (x86_64-redhat-linux)
- CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
+ CPU: 36 × Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
- LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
- Threads: 11 on 8 virtual cores
+ LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
+ Threads: 1 on 36 virtual cores
 Environment:
  JULIA_PATH = @.
  LD_LIBRARY_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/:/usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
- JULIA_NUM_THREADS = 8
+ JULIA_NUM_THREADS = 36
  LD_UN_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
 ```
+
+
+Note: if you're searching sorted collections on an x86 CPU, it is worth setting the environment variable `JULIA_LLVM_ARGS="-x86-cmov-converter=false"` before starting Julia. For example, on an AVX512-capable CPU, you may wish to start Julia from the command line using
+```sh
+JULIA_LLVM_ARGS="-x86-cmov-converter=false" julia -Cnative,-prefer-256-bit
+```
+With this, benchmark results are
+```julia
+julia> @benchmark findbench(FindFirstFunctions.findfirstequal, $x, $perm)
+BenchmarkTools.Trial: 6623 samples with 1 evaluation.
+ Range (min … max): 141.304 μs … 473.786 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 145.581 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 149.690 μs ± 28.577 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+
+ █▇▃▄▃▂▁▁▁ ▂ ▁ ▁
+ ██████████▆█▇▆▇▆▅▄▅▁▅▅▃▄▅▁▃▃▁▃▃▁▁▄▁▁▁▁▁▁▁▁▃▁▁█▄▆▅▅▁▁▃▁▁▁▃▁▁▁▃ █
+ 141 μs Histogram: log(frequency) by time 302 μs <
+
+ Memory estimate: 0 bytes, allocs estimate: 0.
+
+julia> @benchmark findbench((x,v)->findfirst(==(x), v), $x, $perm)
+BenchmarkTools.Trial: 1784 samples with 1 evaluation.
+ Range (min … max): 546.395 μs … 660.254 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 560.513 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 559.546 μs ± 14.138 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+
+ █▆▄▇ ▆▇▄▄█▂▂▁▂▁▁▁▁ ▁
+ ████▄▁▅▁▇▅█████████████▆▇▇▅▄▅▅▄▁▄▅▆▄▅▆▄▆▄▄▅▄▁▁▁▁▁▅▁▁▄▄▅▆▄▄▄▁▇ █
+ 546 μs Histogram: log(frequency) by time 625 μs <
+
+ Memory estimate: 0 bytes, allocs estimate: 0.
+
+julia> @benchmark findbench(FindFirstFunctions.findfirstsortedequal, $s, $perm)
+BenchmarkTools.Trial: 10000 samples with 1 evaluation.
+ Range (min … max): 45.969 μs … 73.354 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 47.674 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 47.675 μs ± 1.999 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+
+ ▃▇▇▅ ▁▆██▅ ▁▂▃▂ ▁▂▁ ▂
+ █████▅██████▆▇████▇▄▇████▆▅▄▆████▇▆▆▁▁▁▁▄▄▃▃▃▃▁▁▁▁▁▁▃▄▆▅▆▄▆ █
+ 46 μs Histogram: log(frequency) by time 58.1 μs <
+
+ Memory estimate: 0 bytes, allocs estimate: 0.
+
+julia> @benchmark findbench((x,v)->searchsortedfirst(v, x), $s, $perm)
+BenchmarkTools.Trial: 10000 samples with 1 evaluation.
+ Range (min … max): 35.988 μs … 224.353 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 37.807 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 38.966 μs ± 7.905 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+
+ ▁▇▆▅█▅ ▃▂ ▁▂ ▂
+ ▇██████▄▇█▆▆███▇███▇███▄▄▁▅▃▅▅▄▄▇██▆▇▇▃▅▄▅▄▆▇▅▄▄▆▇▇▅▄▅▅▁▃▁▄▄ █
+ 36 μs Histogram: log(frequency) by time 57 μs <
+
+ Memory estimate: 0 bytes, allocs estimate: 0.
+
+julia> @benchmark findbench((x,v)->FindFirstFunctions.findfirstsortedequal(x,v,Val(64)), $s, $perm)
+BenchmarkTools.Trial: 10000 samples with 1 evaluation.
+ Range (min … max): 43.709 μs … 182.914 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 45.227 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 45.954 μs ± 5.377 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+
+ ▃▇█▅▂▆█▇▃▁▂▂▁ ▁▁▁▁ ▁ ▁ ▁▁ ▂
+ █████████████▆▇████▇████▇▇▆▆▇█▇▅▆▆▅▄▄▅▆███▇▆██▆▅▅▅▃▄▁▄▄▅▆▅▆▆ █
+ 43.7 μs Histogram: log(frequency) by time 61.4 μs <
+
+ Memory estimate: 0 bytes, allocs estimate: 0.
+
+julia> @benchmark findbench((x,v)->FindFirstFunctions.findfirstsortedequal(x,v,Val(32)), $s, $perm)
+BenchmarkTools.Trial: 10000 samples with 1 evaluation.
+ Range (min … max): 42.482 μs … 172.067 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 44.422 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 45.765 μs ± 8.329 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+
+ ▆███▄▂ ▂▁▃▃ ▂
+ ████████████▆▆▆▅▆██▇▄▄▃▄▅▇▇▄▄▄▃▄▄▁▃▁▁▁▃▁▃▁▃▁▁▁▃██▇▇▄▁▃▄▁▄▄▄▃ █
+ 42.5 μs Histogram: log(frequency) by time 83.3 μs <
+
+ Memory estimate: 0 bytes, allocs estimate: 0.
+
+julia> @benchmark findbench((x,v)->FindFirstFunctions.findfirstsortedequal(x,v,Val(16)), $s, $perm)
+BenchmarkTools.Trial: 10000 samples with 1 evaluation.
+ Range (min … max): 36.870 μs … 154.299 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 39.764 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 40.400 μs ± 2.552 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+
+ ▁▂▁ ▂▆██▆▅▆▇▆▃▂▃▃▂ ▁▁ ▂
+ ▆███▃▃▅███████████████▇▇████▇▆▆▇▇▇▇▆▆▆▅▃▃▅▅▅▅▄▆▆▅▅▇▇▅▆▅▅▃▁▃▅ █
+ 36.9 μs Histogram: log(frequency) by time 53 μs <
+
+ Memory estimate: 0 bytes, allocs estimate: 0.
+
+julia> @benchmark findbench((x,v)->FindFirstFunctions.findfirstsortedequal(x,v,Val(8)), $s, $perm)
+BenchmarkTools.Trial: 10000 samples with 1 evaluation.
+ Range (min … max): 26.011 μs … 48.109 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 26.954 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 27.046 μs ± 1.677 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+
+ ▆▆ █▆ ▂ ▂
+ ██▁▇██▅▁█▆▁▁▆▇▅▅▅▆▇▇▆▅▆▆▆▅▆▇██▇▅▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▇▅▃▅ █
+ 26 μs Histogram: log(frequency) by time 37.9 μs <
+
+ Memory estimate: 0 bytes, allocs estimate: 0.
+```
+The branches in a binary search are unpredictable, so disabling the conversion of `cmov` into branches results in a substantial performance increase (in the benchmarks above, the median time for `searchsortedfirst` drops from about 113 μs to about 38 μs).
+Additionally, enabling `cmov` (i.e., disabling `cmov` conversion) greatly reduces the optimal base case size for `FindFirstFunctions.findfirstsortedequal`. Without `cmov`, we need a very large base case, scanning large swaths contiguously, to avoid too many branches.
+With `cmov`, we can reduce the base case size to `8`, taking several additional binary search steps without incurring heavy branch misprediction penalties.
+
+However, we default to a large base case size, on the assumption that most users are not setting this environment variable; we assume that an expert user who is concerned about binary search performance and sets this variable will also be able to choose their own base case size.
+
+Take care when benchmarking `JULIA_LLVM_ARGS="-x86-cmov-converter=false"`: your CPU's branch predictor can probably memorize a sequence of hundreds of perfectly random branches. Branch predictors are great at defeating microbenchmarks.
+Thus, you need a very long unpredictable sequence (which I tried to use in the above benchmark) to prevent the branch predictor from memorizing it.
+In "real world" workloads, your branch predictor isn't going to be able to memorize a sequence of left vs right bisections in your binary search, as you won't be performing the same searches over and over again!
+Unless you make your benchmark realistic, the default setting of converting `cmov` into branches will look unrealistically good.
+
+If you actually are performing the same searches over and over again, memoize. If successive answers tend to be close to one another, look at something like `bracketstrictlymontonic`'s `guess` API.
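
To illustrate the base case discussion in the added README text, here is a minimal sketch of a binary search that narrows the window with `ifelse` and then scans a small base case linearly. It is not the package's implementation: `sketch_findsortedequal` and the test data are hypothetical names introduced here, and only the argument order `(x, v, Val(N))` mirrors the calls shown in the benchmarks. Whether the `ifelse` ends up as a `cmov` or as a branch in the generated code depends on LLVM, which is what `-x86-cmov-converter=false` is meant to influence.

```julia
using Random

# Hypothetical sketch (not FindFirstFunctions' code): return the index of `x`
# in the sorted vector `v`, or `nothing` if `x` is absent.
# `basecase` must be at least 1.
function sketch_findsortedequal(x, v::AbstractVector, ::Val{basecase} = Val(8)) where {basecase}
    lo = firstindex(v)
    len = length(v)
    # Bisection phase: keep halving the window [lo, lo + len - 1] until it
    # holds at most `basecase` elements. `ifelse` evaluates both arguments and
    # selects one, so there is no data-dependent branch in the source.
    while len > basecase
        half = len >>> 1
        lo = ifelse(v[lo + half - 1] < x, lo + half, lo)
        len -= half
    end
    # Base case: contiguous linear scan of the remaining window.
    for i in lo:(lo + len - 1)
        v[i] == x && return i
    end
    return nothing
end

# A long, shuffled query sequence, so the bisection directions do not repeat
# in a short pattern that a branch predictor could memorize.
v = collect(1:3:300_000)
queries = v[randperm(length(v))]
hits = count(q -> sketch_findsortedequal(q, v) !== nothing, queries)
```

Passing `Val(64)` or `Val(8)` as the third argument changes how early the bisection hands over to the linear scan, which is the trade-off the text describes.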