docs: add installation details
avik-pal committed Aug 24, 2024
1 parent bfe7a34 commit e9e8587
Showing 1 changed file with 48 additions and 0 deletions.
48 changes: 48 additions & 0 deletions docs/src/index.md
@@ -46,3 +46,51 @@ features:
link: /api/Testing_Functionality/LuxTestUtils
---
```

## How to Install Lux.jl?

Installing Lux.jl is easy. Since Lux.jl is registered in the Julia General registry,
you can simply run the following commands in the Julia REPL:

```julia
julia> using Pkg
julia> Pkg.add("Lux")
```

If you want to use the latest unreleased version of Lux.jl, run the following commands
instead. In most cases the released version will be the same as the version on GitHub.

```julia
julia> using Pkg
julia> Pkg.add(url="https://github.com/LuxDL/Lux.jl")
```
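
To confirm which version ended up in the active environment, something along these lines should work. This is a minimal sketch rather than part of the committed docs, and it assumes Julia 1.9 or later for `pkgversion`:

```julia
using Pkg
Pkg.status("Lux")   # prints the installed version; URL installs also show the tracked source

using Lux
pkgversion(Lux)     # returns the installed version as a VersionNumber (Julia 1.9+)
```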

## Want GPU Support?

Install the package(s) for your GPU backend:

:::code-group

```julia [NVIDIA GPUs]
using Pkg
Pkg.add("LuxCUDA")
# or
Pkg.add(["CUDA", "cuDNN"])
```

```julia [AMD ROCm GPUs]
using Pkg
Pkg.add("AMDGPU")
```

```julia [Metal M-Series GPUs]
using Pkg
Pkg.add("Metal")
```

```julia [Intel GPUs]
using Pkg
Pkg.add("oneAPI")
```

:::
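
Once a backend package is installed, a quick check that Lux can see the GPU might look like the sketch below. It assumes an NVIDIA setup with `LuxCUDA`; on other hardware, load `AMDGPU`, `Metal`, or `oneAPI` instead. It relies on `gpu_device`, which Lux is understood to export from its device utilities. If no functional GPU is found, the expectation is that it falls back to the CPU with a warning.

```julia
using Lux, LuxCUDA   # assumption: NVIDIA GPU; swap in the backend package you installed

dev = gpu_device()             # picks a functional GPU backend if one is loaded
x = dev(rand(Float32, 3, 3))   # move a small array to the selected device
typeof(x)                      # should be a GPU array type when a GPU was selected
```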

1 comment on commit e9e8587

@github-actions

Lux Benchmarks
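
The Ratio column appears to be the current time divided by the previous time, so values below 1 indicate that the current commit ran faster. A minimal Julia sketch that reproduces two rows of the table below; the helper name `benchmark_ratio` is hypothetical and not part of the benchmark tooling:

```julia
# Ratio of current to previous timing (both in nanoseconds), rounded to two decimals.
benchmark_ratio(current_ns, previous_ns) = round(current_ns / previous_ns; digits=2)

# Dense(512 => 512, identity) forward pass, CPU, 4 threads
benchmark_ratio(243709, 323500)     # 0.75

# Dense(512 => 512, identity) Zygote gradient, CPU, 4 threads
benchmark_ratio(1273791, 2464833)   # 0.52
```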

Benchmark suite Current: e9e8587 Previous: 6fab339 Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 411375 ns 412166.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 243709 ns 323500 ns 0.75
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 322646 ns 320875 ns 1.01
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 740375 ns 741250.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43950.5 ns 44423 ns 0.99
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 1337249.5 ns 1321270.5 ns 1.01
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 1273791 ns 2464833 ns 0.52
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 14174812.5 ns 19238396 ns 0.74
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2272771 ns 2195417 ns 1.04
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 209419.5 ns 207553 ns 1.01
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 1400291.5 ns 1425917 ns 0.98
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 887625 ns 932500 ns 0.95
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 1538917 ns 10322250 ns 0.15
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2206208 ns 2213895.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1771500 ns 1661895.5 ns 1.07
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1090645.5 ns 1070020.5 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1522125 ns 1434166.5 ns 1.06
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3029417 ns 2827416 ns 1.07
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 209813.5 ns 209087 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12102479 ns 12123333 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 8818646 ns 8828083 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9205917 ns 9265083 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18584146.5 ns 18585042 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1488578 ns 1486549 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17292708 ns 17281708 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 13982625 ns 13944083 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14552250 ns 14497834 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21819500 ns 21831187 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 250090042 ns 250414312.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 149083000 ns 148327625 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 115757187.5 ns 121777833 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447196708 ns 447079250 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5471357 ns 5479239 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1218132458 ns 1224472334 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 981733583 ns 981860834 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 848222166.5 ns 866559896 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1760366667 ns 1786183250 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 31426620.5 ns 31141432 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1029120292 ns 1136093208 ns 0.91
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1005929416.5 ns 995263750 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1279847771 ns 4093347604.5 ns 0.31
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1729728562.5 ns 1730318458 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 1070979 ns 1099208 ns 0.97
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 1650166.5 ns 1634791.5 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 3527667 ns 10471458.5 ns 0.34
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 783771 ns 793749.5 ns 0.99
lenet(28, 28, 1, 32)/forward/GPU/CUDA 273446 ns 275033.5 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 3010792 ns 3015812.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 4162729 ns 4182500 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 11403792 ns 18339084 ns 0.62
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3309916.5 ns 3167500 ns 1.04
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1192098 ns 1197294.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2334145.5 ns 2300000 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1313125 ns 1436021 ns 0.91
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1658000 ns 1622666.5 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4216333 ns 4204812 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 209877 ns 210264.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 19385042 ns 19585749.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 16097792 ns 16074458 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 17355959 ns 17045417 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 25923083 ns 25852083 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1591766 ns 1597607 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 34160834 ns 34485000 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 30761000 ns 31047541 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 31144521 ns 31432771 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 36207166 ns 36566250 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4524604.5 ns 4524583 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2549416 ns 2763250 ns 0.92
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2909937.5 ns 2932625 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8392709 ns 8382708 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 424154 ns 427520 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 39098708.5 ns 38953500 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 32083041.5 ns 32105354.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 32252375 ns 32548104.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 51940146 ns 51899750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2623770.5 ns 2626030 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 82296833.5 ns 89000896 ns 0.92
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 114970542 ns 113802417 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 219951812.5 ns 1309568792 ns 0.17
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 73552333.5 ns 74097062 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 268540834 ns 268319292 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 156204958 ns 159133500 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 126610208 ns 133094396 ns 0.95
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 485345125 ns 484827333 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7026886 ns 7006400 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1482105791 ns 1476195458 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 1163867041 ns 1132135875 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 1075361354 ns 1088449687.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2005616229 ns 2000320208.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34552058 ns 34779703 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1690575542 ns 1685598500 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1481039354 ns 1538044750 ns 0.96
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1858159000 ns 4332338646 ns 0.43
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2198122146 ns 2208580041 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1874667 ns 2076083 ns 0.90
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 2564645.5 ns 2992583 ns 0.86
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 7967395.5 ns 14339541 ns 0.56
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2510895.5 ns 2435937.5 ns 1.03
lenet(28, 28, 1, 128)/forward/GPU/CUDA 274477 ns 272995 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 9545521 ns 9695167 ns 0.98
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 11597645.5 ns 12099333 ns 0.96
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 25177208 ns 37454270.5 ns 0.67
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11843625 ns 11819458 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1271782 ns 1260390 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 387397208.5 ns 381383625 ns 1.02
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 307140208 ns 286352875 ns 1.07
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 238501416.5 ns 273227083 ns 0.87
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 453568833.5 ns 452262167 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4856791 ns 4961226.5 ns 0.98
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 1348016042 ns 1283646541 ns 1.05
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 954731500 ns 1000220875 ns 0.95
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 902430042 ns 967608250 ns 0.93
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1436227042 ns 1517743458 ns 0.95
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 17707051 ns 20595575 ns 0.86
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1406270.5 ns 1395833 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 1689375 ns 2080292 ns 0.81
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 5724417 ns 12485458.5 ns 0.46
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1362208 ns 1302667 ns 1.05
lenet(28, 28, 1, 64)/forward/GPU/CUDA 274629.5 ns 269163.5 ns 1.02
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6793229 ns 6772625 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 13240979 ns 12497500 ns 1.06
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 20057958 ns 35063125 ns 0.57
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 6132021 ns 6112062.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1337532 ns 1304323.5 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70499708 ns 70519875 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43790125 ns 43552854.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39477041 ns 40665042 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132604313 ns 132440791 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1860034 ns 1881079 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 392007437 ns 383762458 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 295848708 ns 296140812.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 280677917 ns 285609291 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 534601479 ns 534397854 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 12292612 ns 12301453.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 416452583 ns 408556958 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 375603417 ns 400978979 ns 0.94
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 739467187 ns 2804634541 ns 0.26
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 710966083 ns 711262542 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 1207602875 ns 1187451625 ns 1.02
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 827285146 ns 687900354.5 ns 1.20
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 626243583 ns 675103208 ns 0.93
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1871309417 ns 1861428958 ns 1.01
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12310335.5 ns 12318124 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 3551769562 ns 3592131166.5 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 2825829208 ns 2766931750 ns 1.02
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 2685591583 ns 2824402500 ns 0.95
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 4971858334 ns 4976486417 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49471030 ns 49597364 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3412375 ns 3424958 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2057459 ns 2061667 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2495666 ns 2433145.5 ns 1.03
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6041833 ns 6019834 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 291550 ns 292066 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25761312.5 ns 25565333 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18421458 ns 18594958.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18746750 ns 19135937.5 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39078208 ns 38817750 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2478018 ns 2470565.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 54288604.5 ns 54013709 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 80339812 ns 79042812.5 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 172422791.5 ns 1231143917 ns 0.14
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45738917 ns 45483208 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1785146.5 ns 1777063 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1089812.5 ns 1089250 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1563333 ns 1458375 ns 1.07
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3036041.5 ns 3025333.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213535 ns 213113 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12483792 ns 12533917 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9199624.5 ns 9196875 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9583333 ns 9657604 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18985687.5 ns 18966125 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1543618 ns 1540833 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17577042 ns 17644917 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14335042 ns 14332041.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14575208 ns 14718041 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22223083 ns 22160020.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70534292 ns 70550708.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43780458 ns 43600063 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39542083 ns 40717729 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132641021 ns 132485291.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1884243 ns 1949867 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 359089500 ns 359581209 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 289136354 ns 290937584 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 286223375 ns 290441875 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 621740812.5 ns 618706270.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13381466.5 ns 13347845 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 426322958.5 ns 418387125 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 433456542 ns 422322583 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 700179458.5 ns 2895773375 ns 0.24
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 717066417 ns 716354500 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1546604 ns 1599208 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1018625 ns 1232938 ns 0.83
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1239542 ns 1235167 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2344354 ns 2315292 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 590077 ns 545112 ns 1.08
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 8789625 ns 8862792 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 13433958 ns 12950812.5 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 30660000 ns 58793854.5 ns 0.52
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 9813916.5 ns 9809334 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1433625.5 ns 1492967 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17774292 ns 17734000 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 16943020.5 ns 17361750 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 29389541.5 ns 77291708 ns 0.38
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14403021 ns 12986750 ns 1.11
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 793729.5 ns 790937.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 652583 ns 498875 ns 1.31
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 1033604.5 ns 3817250 ns 0.27
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 729000 ns 725042 ns 1.01
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 47334 ns 48794 ns 0.97
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 1545896 ns 1520854 ns 1.02
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 1028188 ns 1049917 ns 0.98
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 1490062.5 ns 11421458 ns 0.13
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2291333.5 ns 2273792 ns 1.01
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 233057 ns 234942.5 ns 0.99
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 1740750 ns 1695771 ns 1.03
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 1245833 ns 1271708.5 ns 0.98
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 2032792 ns 11370292 ns 0.18
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2317334 ns 2291667 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3401083.5 ns 3401646 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2039458 ns 2056833.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2508479.5 ns 2425042 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6024958 ns 5995666 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 283499 ns 285706.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23993958 ns 24118875 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17161937.5 ns 17258875 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17120500.5 ns 17604666.5 ns 0.97
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37516042 ns 37430416 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2407574.5 ns 2402296 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 52415917 ns 52414125 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 80120833 ns 83928292 ns 0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 170100749.5 ns 1219142792 ns 0.14
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44596958 ns 44392708 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 249959541 ns 250407625 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148554375 ns 148238917 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 115651042 ns 121351666 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447689729.5 ns 447459458 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5447214 ns 5336731 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1138134750 ns 1129539417 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 879402625.5 ns 882674958 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 806396250 ns 815371709 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1745958208 ns 1744905709 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 29364193 ns 28378262 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1044290250 ns 1064772187.5 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 968104500 ns 964672209 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1275516417 ns 3904894687.5 ns 0.33
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1730982917 ns 1742644958 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1288812 ns 1302229 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 756416 ns 969520.5 ns 0.78
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 959708 ns 945000 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2062479 ns 1957604.5 ns 1.05
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 585184 ns 576593 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 5766917 ns 5874604 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 8702542 ns 6492542 ns 1.34
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 24377042 ns 49631083.5 ns 0.49
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7093250 ns 6374042 ns 1.11
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1392456 ns 1416373 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 10756875 ns 10778396 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 9739500 ns 9918667 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 17286625 ns 61084708 ns 0.28
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 8840896 ns 7443125 ns 1.19
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 466250 ns 483500 ns 0.96
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 473729 ns 368667 ns 1.28
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 2043167 ns 4189542 ns 0.49
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 88875 ns 88833 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 28086 ns 29045 ns 0.97
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 366958 ns 379541.5 ns 0.97
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 426458 ns 446708 ns 0.95
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 4380541 ns 12533396 ns 0.35
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 261333 ns 265208 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 223772 ns 227079.5 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 685500 ns 707459 ns 0.97
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 693562.5 ns 726271 ns 0.95
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 1081875 ns 6598416.5 ns 0.16
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 445124.5 ns 446958 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 412208.5 ns 426312.5 ns 0.97
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 412083 ns 303666.5 ns 1.36
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 743708.5 ns 2289542 ns 0.32
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 53833.5 ns 53354 ns 1.01
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 28451 ns 28741 ns 0.99
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 316042 ns 334125 ns 0.95
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 306854 ns 342916 ns 0.89
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 383125 ns 6128625 ns 0.06
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 153375 ns 157292 ns 0.98
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 209063 ns 211268 ns 0.99
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 381500 ns 400375 ns 0.95
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 374500 ns 410291.5 ns 0.91
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 928500 ns 5806792 ns 0.16
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 174125 ns 174625 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 610538959 ns 603293416 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 428730667 ns 425511812 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 372122895.5 ns 412612646 ns 0.90
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 876687334 ns 872101687 ns 1.01
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7021716 ns 7026013.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 2064097437.5 ns 2054393875 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1567205042 ns 1618390313 ns 0.97
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 1600946021 ns 1720878584 ns 0.93
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2761837666 ns 2755903459 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 25734048.5 ns 25904879.5 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 520166.5 ns 521083 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 401500 ns 435709 ns 0.92
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 1846521 ns 6187563 ns 0.30
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 867666.5 ns 868791.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47339 ns 48586 ns 0.97
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1916250 ns 1892687.5 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 1752834 ns 2331750 ns 0.75
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 14787833 ns 18962458.5 ns 0.78
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2737958 ns 2772834 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 251262 ns 250849 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 3223791 ns 2728041 ns 1.18
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 2284333.5 ns 2319791.5 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 3794125 ns 12893125 ns 0.29
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 3375125 ns 3385499.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1489937.5 ns 1496750 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1019958 ns 1181458.5 ns 0.86
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1178083 ns 1208500 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2338125 ns 2224187.5 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 585370 ns 589325.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 5755625 ns 5771979 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 7937604.5 ns 6453250 ns 1.23
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 24739167 ns 51827709 ns 0.48
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7285791 ns 7242187 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1357578.5 ns 1343442.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 12561104 ns 12776250 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 11774166 ns 12078916.5 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 20486917 ns 60709375.5 ns 0.34
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 10798187.5 ns 10444333.5 ns 1.03
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2750 ns 2625 ns 1.05
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2458 ns 2667 ns 0.92
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 3375 ns 2833.5 ns 1.19
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 3521 ns 5229.5 ns 0.67
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 24874 ns 25112 ns 0.99
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 8875 ns 8916 ns 1.00
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 8541 ns 8833 ns 0.97
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 8625 ns 8917 ns 0.97
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 8834 ns 8792 ns 1.00
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 209705.5 ns 209631.5 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 16792 ns 16562.5 ns 1.01
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 16542 ns 16542 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 16875 ns 16667 ns 1.01
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 10917 ns 10709 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 10125 ns 11563 ns 0.88
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 14834 ns 15666 ns 0.95
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 10916 ns 13250 ns 0.82
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 7584 ns 7458 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 24720 ns 25221 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 22750 ns 22729.5 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 22500 ns 22458 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 22541 ns 22541 ns 1
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 22291 ns 22584 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 230541.5 ns 232899 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 52666.5 ns 52375 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 52250 ns 52542 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 52250 ns 52625 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 43792 ns 44042 ns 0.99
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 29167 ns 28166.5 ns 1.04
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 28750 ns 29333 ns 0.98
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 28021 ns 29417 ns 0.95
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 46375 ns 46334 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 25582 ns 25984 ns 0.98
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 209292 ns 208979 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 271250 ns 267604.5 ns 1.01
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 4055167 ns 4061417 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 147958 ns 154708 ns 0.96
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 217500 ns 224895 ns 0.97
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 305437.5 ns 311208 ns 0.98
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 305812.5 ns 297208 ns 1.03
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 476375 ns 666000 ns 0.72
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 160917 ns 161917 ns 0.99
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 1750 ns 1959 ns 0.89
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 1792 ns 2000 ns 0.90
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2625 ns 2625 ns 1
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 1917 ns 2208 ns 0.87
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 22679 ns 23260 ns 0.98
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 7645.5 ns 7125 ns 1.07
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 7667 ns 7333 ns 1.05
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 7812.5 ns 7792 ns 1.00
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 7833 ns 7542 ns 1.04
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 265749 ns 274053.5 ns 0.97
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 11542 ns 11833.5 ns 0.98
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 11542 ns 11458 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 11542 ns 11375 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 7125 ns 7167 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 80203833 ns 79918583 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 47894084 ns 49163917 ns 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 44850250 ns 44855000 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 151606708 ns 151319874.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2675189 ns 2719014 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 679504291 ns 601785792 ns 1.13
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 410648500 ns 411474666 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 396007395.5 ns 397225187.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 692029833 ns 684946583 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 14599746.5 ns 14617212 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 693561354.5 ns 685314624.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 670877709 ns 667050292 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 978990875 ns 953815417 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 994723417 ns 997319750 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
