Commit

fix: patch optimization tutorial
avik-pal committed Sep 26, 2024
1 parent a808aa8 commit e4bd1af
Showing 1 changed file with 2 additions and 4 deletions.
6 changes: 2 additions & 4 deletions examples/OptimizationIntegration/main.jl
@@ -117,15 +117,13 @@ function train_model(dataloader)
     res_adam = solve(opt_prob, Optimisers.Adam(0.001); callback, maxiters=epochs)

     ## Let's finetune a bit with L-BFGS
-    opt_prob = remake(opt_prob; u0=res_adam.u)
+    opt_prob = OptimizationProblem(opt_func, res_adam.u, (gdev(ode_data), TimeWrapper(t)))
     res_lbfgs = solve(opt_prob, LBFGS(); callback, maxiters=epochs)

     ## Now that we have a good fit, let's train it on the entire dataset without
     ## Minibatching. We need to do this since ODE solves can lead to accumulated errors if
     ## the model was trained on individual parts (without a data-shooting approach).
-    opt_func = OptimizationFunction(loss_adjoint, Optimization.AutoZygote())
-    opt_prob = OptimizationProblem(opt_func, res_lbfgs.u, (gdev(ode_data), TimeWrapper(t)))
-
+    opt_prob = remake(opt_prob; u0=res_lbfgs.u)
     res = solve(opt_prob, Optimisers.Adam(0.005); maxiters=500, callback)

     return StatefulLuxLayer{true}(model, res.u, smodel.st)
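
The staged pattern in this hunk (fit on minibatches first, then restart from the previous solution on the full dataset) can be illustrated without Optimization.jl. The sketch below is a dependency-free stand-in, not the tutorial's code: `grad_mse`, `gradient_descent`, and the toy linear data are hypothetical names chosen for illustration, and plain gradient descent replaces both Adam and L-BFGS.

```julia
# Toy data: y = 3x exactly, so the optimal slope is 3.
xs = collect(1.0:8.0)
ys = 3.0 .* xs

# Gradient of the mean squared error of the model y ≈ w * x with respect to w.
grad_mse(w, x, y) = 2 * sum((w .* x .- y) .* x) / length(x)

# Plain gradient descent starting from the initial guess u0 (hypothetical helper).
function gradient_descent(u0, x, y; lr, maxiters)
    w = u0
    for _ in 1:maxiters
        w -= lr * grad_mse(w, x, y)
    end
    return w
end

# Stage 1: fit on a small "minibatch" (the first two points) from a cold start.
w_stage1 = gradient_descent(0.0, xs[1:2], ys[1:2]; lr=0.01, maxiters=50)

# Stage 2: reuse the stage-1 result as the initial point on the full data.
# This plays the role of `remake(opt_prob; u0=res_lbfgs.u)` in the hunk.
w_stage2 = gradient_descent(w_stage1, xs, ys; lr=0.01, maxiters=200)

println("stage 1: ", w_stage1, "  stage 2: ", w_stage2)
```

Stage 2 changes only the starting point and the data it sees. In the hunk, the full-data `OptimizationProblem` is built once for the L-BFGS stage, and the last stage then only swaps `u0` through `remake`.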

1 comment on commit e4bd1af

@github-actions
Contributor

Lux Benchmarks
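
Judging from the reported numbers, the Ratio column in this benchmark table appears to be the Current time divided by the Previous time, rounded to two decimals, so values above 1 mean the current commit ran slower. The sketch below recomputes it for three rows copied from the table. The `BenchRow` type, the `ratio` and `is_regression` helpers, and the 1.5 threshold are hypothetical, introduced only for illustration.

```julia
# One benchmark row: name plus current and previous timings in nanoseconds.
struct BenchRow
    name::String
    current_ns::Float64
    previous_ns::Float64
end

# Ratio as it appears in the table: current / previous, two decimals.
ratio(row::BenchRow) = round(row.current_ns / row.previous_ns; digits=2)

# Hypothetical alert rule: flag rows that slowed down by more than `threshold`x.
is_regression(row::BenchRow; threshold=1.5) = ratio(row) > threshold

# Three rows taken from the table.
rows = [
    BenchRow("Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s)", 412125, 414875),
    BenchRow("Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s)", 647917, 1343250),
    BenchRow("Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s)", 13901084, 474937.5),
]

for row in rows
    flag = is_regression(row) ? "  <- possible regression" : ""
    println(rpad(row.name, 70), ratio(row), flag)
end
```

A large ratio on its own does not establish a real regression. This commit changes only `examples/OptimizationIntegration/main.jl`, so the large swings in the multi-threaded CPU rows may well be measurement noise rather than an effect of the change.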

Benchmark suite Current: e4bd1af Previous: a808aa8 Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 412125 ns 414875 ns 0.99
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 322375 ns 321479 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 321625 ns 322521 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 739375 ns 740000 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 44132 ns 40861 ns 1.08
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 647917 ns 1343250 ns 0.48
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 2404667 ns 2434250 ns 0.99
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 13901084 ns 474937.5 ns 29.27
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2211917 ns 2252271 ns 0.98
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 201549 ns 182562 ns 1.10
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 740042 ns 1328292 ns 0.56
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 2593084 ns 2620521 ns 0.99
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 14418542 ns 610500 ns 23.62
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2199209 ns 2229562.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1526583 ns 1765917 ns 0.86
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1096708 ns 1031334 ns 1.06
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1529625 ns 1365416 ns 1.12
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3028083 ns 2818125 ns 1.07
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 210375.5 ns 204521 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12223834 ns 12152917 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 8813167 ns 8828833 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9206687.5 ns 9300834 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18597853.5 ns 18599875 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1948580 ns 1492272 ns 1.31
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17338770.5 ns 17275187 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 13950583.5 ns 13914875 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14476791.5 ns 14281833 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21850833 ns 21819042 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 124925271 ns 250296521 ns 0.50
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148389000 ns 148101750 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 115877562.5 ns 148130792 ns 0.78
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447112875 ns 448565625 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5460574 ns 5496241 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 600322042 ns 1226292708 ns 0.49
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 930867334 ns 930446334 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 825580604 ns 443990041 ns 1.86
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1687470250.5 ns 1653613542 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 31224338 ns 35420264 ns 0.88
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 706851312.5 ns 1147479875 ns 0.62
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 988058125.5 ns 996058750 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1348418729 ns 629339646 ns 2.14
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1806342854 ns 1740843604 ns 1.04
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 863834 ns 1116250 ns 0.77
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 1622583.5 ns 1624229.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 3450625 ns 1206375.5 ns 2.86
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 784875 ns 782041 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 267055.5 ns 260633 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2714792 ns 2984374.5 ns 0.91
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 4119812 ns 4127166 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 10424458 ns 3295208.5 ns 3.16
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3144166 ns 3137625 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1090149.5 ns 1049614 ns 1.04
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2166312 ns 2315396 ns 0.94
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1479000 ns 1424437 ns 1.04
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1744292 ns 1685208 ns 1.04
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4339875 ns 4196250 ns 1.03
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 208596 ns 208669.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 20428875 ns 19413145.5 ns 1.05
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 16963479 ns 16084375 ns 1.05
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 17405708 ns 17133041.5 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 26734729 ns 25866542 ns 1.03
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2018993 ns 1576194 ns 1.28
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 45033583 ns 34217167 ns 1.32
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 40993666.5 ns 30754459 ns 1.33
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 41173500 ns 31341542 ns 1.31
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 47738437 ns 37132709 ns 1.29
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4301666.5 ns 4525792 ns 0.95
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2844667 ns 2744125 ns 1.04
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2996709 ns 2881375 ns 1.04
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8653334 ns 8371458 ns 1.03
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 472874 ns 423036 ns 1.12
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 40060542 ns 38892667 ns 1.03
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 33920959 ns 32085104.5 ns 1.06
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 33907687.5 ns 32057770.5 ns 1.06
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 53575541.5 ns 52159979.5 ns 1.03
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3254220 ns 2618584 ns 1.24
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 90139000 ns 89172458 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 135574958.5 ns 113776875 ns 1.19
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 249787833 ns 62985709 ns 3.97
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 96223792 ns 74986500 ns 1.28
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 142522459 ns 268884125 ns 0.53
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 161123167 ns 159000000 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 128478042 ns 158925750 ns 0.81
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 493238750 ns 486715083 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7031961.5 ns 6941165 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 881412625 ns 1474467645.5 ns 0.60
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 1203181667 ns 1134657750 ns 1.06
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 1089986000.5 ns 687890791.5 ns 1.58
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2129205729 ns 2033574500 ns 1.05
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34708690 ns 33495275 ns 1.04
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1668841500 ns 1720167208 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1865068750 ns 1551435312.5 ns 1.20
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 2075940833.5 ns 1147814729 ns 1.81
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2608730625 ns 2245015792 ns 1.16
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1545708 ns 2039500 ns 0.76
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 3042541 ns 3006583 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 7339916 ns 1618791.5 ns 4.53
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2318125 ns 2424854.5 ns 0.96
lenet(28, 28, 1, 128)/forward/GPU/CUDA 277569.5 ns 258194 ns 1.08
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7874959 ns 9325667 ns 0.84
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 12022125 ns 11994291.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 23765959 ns 7128604 ns 3.33
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11654708 ns 11753792 ns 0.99
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1196174 ns 1096609.5 ns 1.09
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 186253812 ns 380363500.5 ns 0.49
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 283266353.5 ns 286893354 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 242835500 ns 129833291 ns 1.87
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 463794333 ns 456069146 ns 1.02
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4830735 ns 5018425 ns 0.96
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 630927250 ns 1154815958 ns 0.55
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 990257541 ns 934037667 ns 1.06
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 1035740417 ns 609039458 ns 1.70
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1415342041 ns 1585642292 ns 0.89
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16300060 ns 19065478 ns 0.85
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1085229 ns 1049833.5 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 2098166 ns 2073542 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 4972000 ns 1348479.5 ns 3.69
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1299500 ns 1287021 ns 1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA 278783 ns 259724.5 ns 1.07
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6008145.5 ns 6258792 ns 0.96
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 12421208 ns 12411416 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 20005041 ns 4953146 ns 4.04
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 6082792 ns 6086709 ns 1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1220466 ns 1149352.5 ns 1.06
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23693938 ns 70546083 ns 0.34
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43500791.5 ns 43491792 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39526833.5 ns 37811479.5 ns 1.05
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132823145.5 ns 134717229.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1948314 ns 1859024 ns 1.05
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184396041 ns 355554354.5 ns 0.52
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 270116291 ns 270317625 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 253589145.5 ns 146113896 ns 1.74
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 534281562.5 ns 537066979.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13222993 ns 12142155.5 ns 1.09
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 297123437 ns 396257791 ns 0.75
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 404377895.5 ns 404428375.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 696065958 ns 302176729 ns 2.30
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 713613916 ns 712116709 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 656595541 ns 1190477625 ns 0.55
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 689413604.5 ns 689814958.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 634330625 ns 404795334 ns 1.57
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1789031312.5 ns 1876404250 ns 0.95
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12386066 ns 12324333 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1908648333.5 ns 3610008479.5 ns 0.53
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 2827932125 ns 2831662833 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 2698654250 ns 1516977229.5 ns 1.78
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 5716413416 ns 5143819000 ns 1.11
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49345511 ns 50066391.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3047688 ns 3345708.5 ns 0.91
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2062437 ns 2078625 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2519583 ns 2287083 ns 1.10
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6053042 ns 6026917 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 574063 ns 330146 ns 1.74
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25654333 ns 25733291.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19054583.5 ns 18989125 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19323500 ns 19553792 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39330000 ns 39739583.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3195551.5 ns 2459398 ns 1.30
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 35130041.5 ns 54593479 ns 0.64
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 82097417 ns 78905375 ns 1.04
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 168348625 ns 29660083.5 ns 5.68
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45591875 ns 45812146 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1644375 ns 1660583.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1090250 ns 1105770.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1572750 ns 1392229 ns 1.13
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3038167 ns 3035959 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214850 ns 210818 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12701083 ns 12525958.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9189625 ns 9221375 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9640458.5 ns 9699583 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18968854.5 ns 19002416.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1987617.5 ns 1509113 ns 1.32
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17682875 ns 17662604.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14327834 ns 14311479 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14625958 ns 14590875 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22177500 ns 22225541 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23739833.5 ns 70524583 ns 0.34
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43469541 ns 43452458 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39647750 ns 37882479.5 ns 1.05
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132812271.5 ns 132685187 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1879600 ns 1859436 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 189384875 ns 359287667 ns 0.53
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 346944938 ns 347693812.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 303748958 ns 197401167 ns 1.54
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 748909417 ns 730607333 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 14283912.5 ns 13254127 ns 1.08
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 302085833 ns 420436292 ns 0.72
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 421708625 ns 419235583 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 689499625 ns 310533750 ns 2.22
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 719890000 ns 718184500 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1926375 ns 1442542 ns 1.34
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1579042 ns 1346416.5 ns 1.17
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1571792 ns 1331812.5 ns 1.18
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2497917 ns 2403021 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 573991 ns 549048 ns 1.05
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 6186000 ns 8851250 ns 0.70
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 13018375 ns 12939667 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 31151958 ns 5552708 ns 5.61
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 9378042 ns 9880416.5 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1403069 ns 1258951 ns 1.11
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18793000 ns 16575062 ns 1.13
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 27709979.5 ns 20954208 ns 1.32
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 49574542 ns 13338833 ns 3.72
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 18852542 ns 13092416 ns 1.44
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 68959 ns 822708 ns 0.08
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 541125 ns 528084 ns 1.02
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 1011562 ns 71146 ns 14.22
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 728542 ns 725750 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 47294 ns 46414.5 ns 1.02
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 277500 ns 1506500 ns 0.18
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 988020.5 ns 1020854 ns 0.97
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 1388416.5 ns 323833 ns 4.29
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2250812 ns 2281417 ns 0.99
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 225164 ns 211160.5 ns 1.07
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 407500 ns 1512416 ns 0.27
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 1045583 ns 1090125 ns 0.96
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 1418917 ns 446562.5 ns 3.18
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2256958 ns 2259375 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3042083 ns 3176750 ns 0.96
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2062771 ns 2053979 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2510104.5 ns 2268708 ns 1.11
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6011000 ns 6008875 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 564983 ns 282441.5 ns 2.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23609021 ns 24059292 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17178792 ns 17235458 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17120458 ns 16956292 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37462729 ns 37778228.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3146695 ns 2390107 ns 1.32
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33304750 ns 52955708.5 ns 0.63
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 83679583.5 ns 84900333 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 167872042 ns 27496312.5 ns 6.11
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44785187.5 ns 44513375.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120247125 ns 250307750 ns 0.48
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148479500 ns 148084625 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 115610813 ns 148444250 ns 0.78
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447816417 ns 455285000 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5450922 ns 5327018 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 470730291 ns 1102117541 ns 0.43
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 856712645.5 ns 856978792 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 825513875.5 ns 437778208 ns 1.89
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1750589417 ns 1768146583 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 28864938 ns 33525724 ns 0.86
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 640143291 ns 1027855937 ns 0.62
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 964190458 ns 965570792 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1286413958 ns 584455270.5 ns 2.20
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1842051438 ns 1726926104.5 ns 1.07
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1241583 ns 1135584 ns 1.09
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 917166 ns 989209 ns 0.93
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 906584 ns 923667 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1938583 ns 2052500 ns 0.94
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 553409.5 ns 548882.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2941333 ns 5867833 ns 0.50
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 6314437.5 ns 6531896 ns 0.97
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 24719833.5 ns 2613541.5 ns 9.46
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7090125 ns 7097417 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1346593.5 ns 1222578 ns 1.10
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6639250 ns 9683896 ns 0.69
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 13128667 ns 13118666 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 30481375 ns 6497583 ns 4.69
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 7632854 ns 7614083.5 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 39042 ns 512667 ns 0.08
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 372792 ns 391292 ns 0.95
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 1833875 ns 32750 ns 56.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 91792 ns 87812.5 ns 1.05
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 27047.5 ns 25759 ns 1.05
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 175458 ns 382125 ns 0.46
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 455792 ns 444875 ns 1.02
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 4338875 ns 160875 ns 26.97
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 272792 ns 258750 ns 1.05
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 210187.5 ns 188723 ns 1.11
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 441709 ns 420291.5 ns 1.05
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 728375 ns 475750 ns 1.53
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 4896125 ns 194375 ns 25.19
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 511041.5 ns 270958 ns 1.89
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 12416.5 ns 461312.5 ns 0.03
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 303334 ns 326666.5 ns 0.93
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 721771 ns 14792 ns 48.79
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 55209 ns 54145.5 ns 1.02
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 27615.5 ns 26082 ns 1.06
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 25917 ns 340312 ns 0.08
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 336500 ns 342500 ns 0.98
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 850083 ns 25958.5 ns 32.75
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 151500 ns 151625 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 198567.5 ns 181930 ns 1.09
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 45208.5 ns 357792 ns 0.13
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 351625 ns 357833 ns 0.98
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 712459 ns 46437.5 ns 15.34
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 151084 ns 151209 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 318202459 ns 602226667 ns 0.53
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 430387020.5 ns 427648645.5 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 368378458.5 ns 207084708 ns 1.78
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 883484291 ns 882976625 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7628205 ns 6984740 ns 1.09
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1097576562.5 ns 1997486771 ns 0.55
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1620619666.5 ns 1621644791.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 1583682354 ns 856167166 ns 1.85
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2698758083 ns 2637178042 ns 1.02
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26674131 ns 26468421.5 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 189813 ns 520062.5 ns 0.36
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 443792 ns 429271 ns 1.03
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 1747875 ns 166000 ns 10.53
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 873374.5 ns 866083 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 46821 ns 46206 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1205958.5 ns 1874625 ns 0.64
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 2354667 ns 2508792 ns 0.94
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 14475333.5 ns 1021958 ns 14.16
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2826417 ns 2650063 ns 1.07
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 237435.5 ns 217141.5 ns 1.09
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 2299604.5 ns 1862417 ns 1.23
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 5735750 ns 5033959 ns 1.14
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 14836917 ns 1161917 ns 12.77
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 3683375 ns 2752500 ns 1.34
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1579292 ns 1462229 ns 1.08
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1180250 ns 1192834 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1174479 ns 1192667 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2370125 ns 2221791 ns 1.07
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 570253.5 ns 550464 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3184000 ns 5883792 ns 0.54
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 4719584 ns 4676563 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 24816709 ns 2871000 ns 8.64
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7307438 ns 7325000.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1344428.5 ns 1196239.5 ns 1.12
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8830562.5 ns 11670958.5 ns 0.76
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 15640333.5 ns 16372334 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 34223791 ns 8780584 ns 3.90
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 9547375 ns 9544250 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2209 ns 2458 ns 0.90
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2167 ns 2542 ns 0.85
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 3541 ns 2875 ns 1.23
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 2625 ns 4625 ns 0.57
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 24463 ns 22670 ns 1.08
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7000 ns 6916 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 6833 ns 7083 ns 0.96
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7292 ns 7250 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7167 ns 7333 ns 0.98
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 202989.5 ns 180475.5 ns 1.12
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8334 ns 8250 ns 1.01
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8250 ns 8292 ns 0.99
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8375 ns 8542 ns 0.98
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 6041 ns 6125 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 10583 ns 10916.5 ns 0.97
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 15875 ns 12625 ns 1.26
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 10333 ns 10459 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 7625.5 ns 9729 ns 0.78
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 24500 ns 22420 ns 1.09
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 21542 ns 19916 ns 1.08
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 21625 ns 19875 ns 1.09
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 21750 ns 19958 ns 1.09
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 21667 ns 20000 ns 1.08
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 221414.5 ns 195313 ns 1.13
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 56833 ns 23542 ns 2.41
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 53708 ns 23541 ns 2.28
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 53625 ns 27125 ns 1.98
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 51583.5 ns 21334 ns 2.42
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 28834 ns 28834 ns 1
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 28584 ns 28708 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 28458 ns 29042 ns 0.98
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 46708 ns 46291 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 25617 ns 23925 ns 1.07
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 44375 ns 224750 ns 0.20
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 274708 ns 276542 ns 0.99
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 4275000 ns 44250 ns 96.61
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 145000 ns 145000 ns 1
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 206652.5 ns 197967 ns 1.04
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 68542 ns 242125 ns 0.28
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 292958 ns 293916 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 4229958 ns 68604.5 ns 61.66
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 145666 ns 145584 ns 1.00
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 1833 ns 1583 ns 1.16
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 1750 ns 2166 ns 0.81
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2500 ns 2166.5 ns 1.15
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 1666 ns 4333.5 ns 0.38
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 22972 ns 20975.5 ns 1.10
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5208 ns 5084 ns 1.02
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5167 ns 5125 ns 1.01
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5208 ns 5209 ns 1.00
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5250 ns 5500 ns 0.95
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 244140 ns 234449.5 ns 1.04
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 8208 ns 7375 ns 1.11
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 7375 ns 7458 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 7542 ns 8125 ns 0.93
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 5292 ns 5459 ns 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 34124291 ns 80045708 ns 0.43
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 49799333 ns 49037958.5 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 45669229.5 ns 42791749.5 ns 1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 153888625 ns 151490583 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2656121 ns 2680013 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 481321500.5 ns 606632959 ns 0.79
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 424493583 ns 411440583 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 412050834 ns 292411917 ns 1.41
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 724714916 ns 737907354 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15594271 ns 16971190.5 ns 0.92
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 744920541 ns 714524875 ns 1.04
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 840757958.5 ns 672104708 ns 1.25
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 1131213854 ns 580514646 ns 1.95
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 1186689479.5 ns 1012152875 ns 1.17

This comment was automatically generated by workflow using github-action-benchmark.