refactor: cleanup Training and preserve type-stability in Enzyme #896

Merged (3 commits) on Sep 13, 2024

avik-pal
Member

No description provided.

Contributor

github-actions bot commented Sep 13, 2024

Benchmark Results (ASV)

| Benchmark | main | b15a924... | main/b15a924b6953ea... |
|---|---|---|---|
| basics/overhead | 0.0659 ± 0.035 μs | 0.0653 ± 0.035 μs | 1.01 |
| time_to_load | 1.04 ± 0.016 s | 1.03 ± 0.0085 s | 1.01 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).
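
For readers checking the ASV numbers by hand, the last column appears to be the `main` timing divided by the PR commit's timing, rounded to two decimals. That reading is an assumption based on the column header and the two rows above, not something the bot states. A minimal Python sketch of the arithmetic (the helper name `ratio` is hypothetical):

```python
def ratio(main_value: float, pr_value: float) -> float:
    """Return main / PR rounded to two decimals (assumed definition of the ratio column)."""
    return round(main_value / pr_value, 2)


# Central values copied from the ASV table above; the ± spreads are ignored here.
rows = {
    "basics/overhead": (0.0659, 0.0653),  # microseconds
    "time_to_load": (1.04, 1.03),         # seconds
}

ratios = {name: ratio(main, pr) for name, (main, pr) in rows.items()}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

Both rows reproduce the reported 1.01 under this assumption. Since the quoted spreads are larger than the differences between the two columns, a ratio this close to 1 is best read as noise rather than a measurable change.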

avik-pal changed the title from "fix: use Enzyme.make_zero for type stability" to "refactor: cleanup Training and preserve type-stability in Enzyme" on Sep 13, 2024
Contributor

github-actions bot left a comment

Lux Benchmarks

Benchmark suite Current: 4cd193f Previous: 0b51676 Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 408333.5 ns 411125 ns 0.99
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 321458 ns 322750 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 320041 ns 244083 ns 1.31
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 739791 ns 740229 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43497 ns 43576 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 1324145.5 ns 1361688 ns 0.97
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 2448166 ns 2448167 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 19753063 ns 16505500 ns 1.20
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2196250 ns 2198042 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 203587.5 ns 207361 ns 0.98
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 1388292 ns 1419479 ns 0.98
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 919500.5 ns 931729 ns 0.99
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 11910500 ns 1582917 ns 7.52
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2213229.5 ns 2213229 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1684187.5 ns 1768708 ns 0.95
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1091292 ns 1072541.5 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1442542 ns 1542417 ns 0.94
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2822458.5 ns 3010167 ns 0.94
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 207564 ns 208923 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12095500 ns 12164458 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 8829208.5 ns 8831167 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9279354.5 ns 9231125 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18580166 ns 18575542 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1491163 ns 1506706 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17251250 ns 17297875 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 13942271 ns 13966709 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14489084 ns 14490229 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21840541 ns 21825958 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 249574458 ns 250077771 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 147904125 ns 148351292 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 122111292 ns 116742208 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447480875 ns 446235042 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5460486 ns 5474148 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1223641167 ns 1226735000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 926768708 ns 933099541 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 846547354.5 ns 833488083 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1641395084 ns 1628798917 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 31574286.5 ns 31247743 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1131773584 ns 1139513458 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 996683146.5 ns 1004012958 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 3976670187 ns 1343460771 ns 2.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1732163667 ns 1729098333 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 1037458 ns 1084187.5 ns 0.96
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 1630666.5 ns 1632875 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 10919792 ns 3807833 ns 2.87
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 783417 ns 781500 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 263247.5 ns 269181 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2962563 ns 2973917 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 4127917 ns 4123458 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 19390958.5 ns 11391021 ns 1.70
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3143021 ns 3140229.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1097714.5 ns 1147789 ns 0.96
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2286792 ns 2327458.5 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1383604 ns 1427875 ns 0.97
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1631354.5 ns 1552208 ns 1.05
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4206042 ns 4203041 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 207993 ns 209123 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 19399562.5 ns 19423562 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 16067625.5 ns 16279416 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 16935708 ns 17361812 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 25850500 ns 25815125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1586654 ns 1606839 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 34082166 ns 34524104 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 30744396 ns 31057875 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 31427437.5 ns 31105416 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 36685750 ns 36883875 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4480958 ns 4526208.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2776209 ns 2777083.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2918729 ns 2685312.5 ns 1.09
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8374020.5 ns 8381562.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 419829.5 ns 373639 ns 1.12
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 38764708.5 ns 38887521 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 32073312.5 ns 32509584 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 32550958 ns 32333229 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 51785125 ns 51833125 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2623295.5 ns 2633953 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 88241770.5 ns 88607687.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 113988125 ns 113743125 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 1387473833 ns 227726583 ns 6.09
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 74282375 ns 74951083 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 267478833 ns 267716166 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 159178166.5 ns 159256375 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 133767250 ns 123708895.5 ns 1.08
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 491410958 ns 485091625 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7006603 ns 7022924 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1474270812.5 ns 1478680979 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 1170900916 ns 1179547083 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 1090086875 ns 1066054563 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2029183812.5 ns 2001889209 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34507471 ns 34822377.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1703104708 ns 1724298291 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1498305291.5 ns 1565497271 ns 0.96
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 4424340187.5 ns 1925114250 ns 2.30
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2206219916.5 ns 2239111625 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1848104 ns 2028500 ns 0.91
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 2593625 ns 2967646 ns 0.87
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 15217500 ns 8104667 ns 1.88
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2445792 ns 2308041.5 ns 1.06
lenet(28, 28, 1, 128)/forward/GPU/CUDA 272285.5 ns 272667 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 9194354.5 ns 9619395.5 ns 0.96
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 11544520.5 ns 12015166 ns 0.96
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 40853583 ns 26324292 ns 1.55
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11682875 ns 11677541 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1187206 ns 1188628.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 380460292 ns 383215354.5 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 285315645.5 ns 284366604.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 276971916.5 ns 261725395.5 ns 1.06
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 452379375 ns 453056042 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4830643 ns 5009701 ns 0.96
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 1148151250 ns 1160384584 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 926375417 ns 912166042 ns 1.02
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 1038342500 ns 984922208 ns 1.05
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1395529417 ns 1396092167 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 18863589 ns 18111984 ns 1.04
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1008917 ns 1053833 ns 0.96
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 1895666 ns 1605958 ns 1.18
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 12140875 ns 5411083 ns 2.24
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1382708 ns 1296875 ns 1.07
lenet(28, 28, 1, 64)/forward/GPU/CUDA 272323.5 ns 265721 ns 1.02
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6492709 ns 6510958 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 13801042 ns 13082584 ns 1.05
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 35801979.5 ns 21760833.5 ns 1.65
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 6077667 ns 5984375 ns 1.02
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1222234 ns 1208949 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70041479.5 ns 70494333 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43561000 ns 43641125 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 40897208.5 ns 39690584 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132809624.5 ns 133468354 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1950322.5 ns 1945255.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 356003771 ns 356723479.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 269999583 ns 271306709 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 258159291 ns 254269771 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 534431979.5 ns 536238459 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 12287375.5 ns 12301288 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 395172750 ns 395599834 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 402851625 ns 377440167 ns 1.07
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 2986109916 ns 697289229.5 ns 4.28
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 710301375 ns 708495833 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 1186305000 ns 1188885083 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 690932187.5 ns 692916625 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 672715625 ns 642915416.5 ns 1.05
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1775393604.5 ns 1776695937.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12317191.5 ns 12306515 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 3725325562.5 ns 3668882667 ns 1.02
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 2844419334 ns 2834396125 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 2842574167 ns 2699395792 ns 1.05
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 5046174625 ns 5050853166 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49301462.5 ns 49852240.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3365104.5 ns 3422958 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2049292 ns 2075583 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2407709 ns 2513666 ns 0.96
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6020833.5 ns 6018396 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 344904 ns 317455.5 ns 1.09
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25950208.5 ns 26048666 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18870042 ns 19094062.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19509833.5 ns 19316000 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39184583 ns 39190562.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2464432 ns 2466381 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 55063292 ns 55369583 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 81156875 ns 82210395.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 1257617250 ns 173994812.5 ns 7.23
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45517708 ns 45354333 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1746083 ns 1779187.5 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1102583 ns 1097834 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1411062.5 ns 1568791 ns 0.90
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3032875 ns 3021312 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 210977 ns 210623 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12485395.5 ns 12543916 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9194438 ns 9277708.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9628250 ns 9594229.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18974875.5 ns 18987604.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1524799 ns 1527868.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17615646 ns 17650708 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14312958 ns 14335458 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14645667 ns 14544250 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22170896 ns 22174250 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70053687 ns 70431125 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43523916 ns 43537125 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 40673687.5 ns 39620583 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132899354 ns 132531916.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1869152.5 ns 1888879 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 360534645.5 ns 360439083.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 346024417 ns 347132666.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 307505500 ns 304637542 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 721569959 ns 722631792 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13359262 ns 13304668 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 417257854.5 ns 419234750 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 419807208 ns 421465729 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 3006330625 ns 724319500 ns 4.15
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 715033084 ns 714217917 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1631250.5 ns 1705416 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1331708 ns 1350333.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1336875 ns 1170667 ns 1.14
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2396624.5 ns 2385333.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 591480 ns 580442.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 8882750 ns 8948271 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 12816542 ns 12980437.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 59653249.5 ns 32353312.5 ns 1.84
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 9822208.5 ns 9804417 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1448532 ns 1427987.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17611000 ns 17962354 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17224750.5 ns 17440000 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 77512291.5 ns 29738291 ns 2.61
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14409583 ns 14431937.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 666625 ns 669833.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 563688 ns 529250 ns 1.07
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 3886874.5 ns 1065708.5 ns 3.65
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 725500 ns 725395.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 47671 ns 47647 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 1544625 ns 1549104 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 1016583.5 ns 1038917 ns 0.98
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 11543916 ns 1517584 ns 7.61
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2282500 ns 2269896 ns 1.01
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 235575.5 ns 233022 ns 1.01
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 1580334 ns 1582916 ns 1.00
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 1004625 ns 1087854.5 ns 0.92
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 11628062.5 ns 1464166 ns 7.94
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2245625 ns 2190854 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3347917 ns 3413625 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 1887083 ns 2047083 ns 0.92
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2378833.5 ns 2507333.5 ns 0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6007583 ns 6011813 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 282497 ns 284231.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24152375 ns 24149000 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17173646 ns 17330312.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17593749.5 ns 17059271 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37482062.5 ns 37480499.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2406163 ns 2394265 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 53706563 ns 53573937.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 82834250 ns 83649500 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 1233410792 ns 172928458 ns 7.13
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44549146 ns 44425187.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 249169208 ns 249999250 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148041000 ns 148223583 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 122198187.5 ns 116384896 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 448189229 ns 447335937.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5442005.5 ns 5449146 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1102462875 ns 1105347792 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 855231500 ns 857822708.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 840812354.5 ns 830398396 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1749598583 ns 1762030583 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 29316151 ns 28862807 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1015929312 ns 1020245354 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 926116125 ns 966178875 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 3961189083.5 ns 1293466208 ns 3.06
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1733370166.5 ns 1724193375.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1251917 ns 1306896.5 ns 0.96
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 956771 ns 984292 ns 0.97
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 949208 ns 778437.5 ns 1.22
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2087500 ns 1958750 ns 1.07
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 567500.5 ns 566426 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 5937021 ns 6042375 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 6759604 ns 6715125 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 49043375 ns 26872708 ns 1.83
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7085687.5 ns 6973417 ns 1.02
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1366617 ns 1365853 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 11452625 ns 11215770.5 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 9741666.5 ns 10033208 ns 0.97
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 62520062.5 ns 17672208 ns 3.54
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 8785979 ns 8568500 ns 1.03
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 354083 ns 399500 ns 0.89
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 394625 ns 399291.5 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 4359083 ns 3544167 ns 1.23
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 89959 ns 88459 ns 1.02
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 27493 ns 27618 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 388042 ns 397459 ns 0.98
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 393083 ns 445041.5 ns 0.88
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 12972333.5 ns 4819375 ns 2.69
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 259520.5 ns 259833 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 218685 ns 219889.5 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 422208.5 ns 428313 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 423041.5 ns 475541 ns 0.89
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 12840125 ns 4960437.5 ns 2.59
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 270708.5 ns 271333 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 302750 ns 343709 ns 0.88
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 330437.5 ns 333937.5 ns 0.99
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 2233708.5 ns 769833 ns 2.90
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 54958 ns 53125 ns 1.03
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 27749 ns 28016 ns 0.99
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 352416 ns 362209 ns 0.97
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 263875 ns 342792 ns 0.77
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 4863834 ns 897833 ns 5.42
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 152167 ns 152583 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 204425 ns 205326.5 ns 1.00
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 367542 ns 378500 ns 0.97
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 278834 ns 358042 ns 0.78
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 5579167 ns 728708 ns 7.66
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 151375 ns 150833.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 601750500 ns 603479208 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 429254771 ns 429058104 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 419910167 ns 385950542 ns 1.09
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 877413666 ns 872372584 ns 1.01
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7029920 ns 7023071 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 2001694187.5 ns 2010730958 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1602419313 ns 1608264687.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 1714533917 ns 1653085833 ns 1.04
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2643581833 ns 2638084625 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26033662 ns 25932761 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 525625 ns 535250 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 434000 ns 433291.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 6414625 ns 3023791.5 ns 2.12
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 866312.5 ns 880791 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47799 ns 46986 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1901750 ns 1881604 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 2792417 ns 2798729 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 18885250 ns 16356750 ns 1.15
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2762917 ns 2759229 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 251649 ns 246659.5 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 1969667 ns 1962958.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 5048125 ns 5070604 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 19153125 ns 16396875 ns 1.17
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 2752771 ns 2785625.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1557875.5 ns 1614125 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1227750 ns 1235583 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1172291.5 ns 1027208 ns 1.14
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2275166.5 ns 2300875 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 586575 ns 587018.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 5914208 ns 5921542 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 6575666.5 ns 5089688 ns 1.29
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 51110125.5 ns 26372271 ns 1.94
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7308729 ns 7288250 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1369220 ns 1379747.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 13275916 ns 13324958 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 11132437.5 ns 12237645.5 ns 0.91
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 60708249.5 ns 21281499.5 ns 2.85
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 10738167 ns 10668750 ns 1.01
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2250 ns 4417 ns 0.51
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2459 ns 2583.5 ns 0.95
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 2895.5 ns 2750 ns 1.05
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 2292 ns 2500 ns 0.92
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 24545 ns 24754 ns 0.99
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7250 ns 7459 ns 0.97
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7291 ns 7250 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7292 ns 7333 ns 0.99
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7167 ns 7083 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 211309.5 ns 213008 ns 0.99
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8375 ns 8375 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8166 ns 8583 ns 0.95
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8458 ns 8459 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 5916.5 ns 5834 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 11833 ns 10625 ns 1.11
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 13708.5 ns 13708 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 13292 ns 12042 ns 1.10
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 7042 ns 7500 ns 0.94
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 24651 ns 25091.5 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 20125 ns 20250 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 19916 ns 19959 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 20083 ns 20083 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 19833 ns 19875 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 232370.5 ns 231793 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 23416 ns 23625 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 23500 ns 23667 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 23812.5 ns 23666 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 21459 ns 21084 ns 1.02
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 26583 ns 28708 ns 0.93
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 28542 ns 29292 ns 0.97
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 26875 ns 28375 ns 0.95
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 46563 ns 46584 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 25693 ns 26247 ns 0.98
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 224500 ns 222250 ns 1.01
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 279375 ns 279729.5 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 12856750 ns 4335396.5 ns 2.97
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 145792 ns 145208 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 206863.5 ns 203061 ns 1.02
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 336770.5 ns 333124.5 ns 1.01
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 319708 ns 322500 ns 0.99
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 6617937.5 ns 861333 ns 7.68
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 160458.5 ns 160750 ns 1.00
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 1792 ns 1875 ns 0.96
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 2084 ns 1958 ns 1.06
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2167 ns 2416 ns 0.90
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 2083 ns 1792 ns 1.16
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 22898 ns 23061 ns 0.99
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5125 ns 5458 ns 0.94
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5208 ns 5500 ns 0.95
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5333 ns 5375 ns 0.99
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5208 ns 5375 ns 0.97
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 255181 ns 243257 ns 1.05
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 11166 ns 11333.5 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 11417 ns 11208 ns 1.02
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 11459 ns 11667 ns 0.98
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 6875 ns 6833 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 79309875 ns 79834791 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 49119666.5 ns 49125291 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 46250895.5 ns 43259375 ns 1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 151368125 ns 151428917 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2651040 ns 2726005 ns 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 665755354.5 ns 498680292 ns 1.34
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 411402958 ns 414152083 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 403009541 ns 396991709 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 687961500 ns 689086500 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 14648248 ns 14585553 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 685766729 ns 712438146 ns 0.96
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 672608209 ns 683887166 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 3496862791.5 ns 1013847083 ns 3.45
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 996941041 ns 999589459 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

avik-pal merged commit f60db4d into main on Sep 13, 2024
70 of 78 checks passed
avik-pal deleted the ap/enz_test branch on September 13, 2024 21:08