refactor: make LossFunctions an optional dep #976

Merged: 3 commits merged on Oct 9, 2024

avik-pal (Member) commented Oct 9, 2024

LossFunctions is a good package and makes our lives much simpler in terms of maintenance burden, but it is written in a way that is non-optimal for XLA compilation. Given that, I am providing native implementations of the loss functions and moving the LossFunctions support to an extension.

github-actions bot (Contributor) commented Oct 9, 2024

Benchmark Results (ASV)

| Benchmark | main | a8fcd1f... | main/a8fcd1f2e10df2... |
|---|---|---|---|
| basics/overhead | 0.0544 ± 0.0051 μs | 0.0543 ± 0.0011 μs | 1 |
| time_to_load | 1.25 ± 0.0066 s | 1.25 ± 0.0062 s | 0.999 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

avik-pal force-pushed the ap/remove_lossfunctions branch 2 times, most recently from 72f5f07 to b6a3b35 (October 9, 2024 19:56)
avik-pal merged commit 77eb5fb into main (Oct 9, 2024); 34 of 47 checks passed
avik-pal deleted the ap/remove_lossfunctions branch (October 9, 2024 21:06)
github-actions bot (Contributor) left a comment

Lux Benchmarks
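
In the table that follows, the Ratio column appears to be the Current time divided by the Previous time, so values above 1 mean the current commit was slower on that run. The sketch below checks this against two rows copied from the table. The `ratio` helper and the `THRESHOLD` cutoff are hypothetical names introduced for illustration, not part of the benchmark tooling.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time; above 1.0 means slower."""
    return current_ns / previous_ns


# Two rows copied from the table: Dense(512 => 512, identity), zygote, CPU.
# Each value is (current ns, previous ns).
rows = {
    "zygote/CPU/4 thread(s)": (2414042, 465625),
    "zygote/CPU/8 thread(s)": (473333, 13617333),
}

# Hypothetical cutoff for flagging a row; an assumption, not a project setting.
THRESHOLD = 1.5

ratios = {name: ratio(cur, prev) for name, (cur, prev) in rows.items()}
flagged = [name for name, r in ratios.items() if r > THRESHOLD]

for name, r in ratios.items():
    marker = "  <- above threshold" if name in flagged else ""
    print(f"{name}: {r:.2f}{marker}")
```

Reading the ratio this way also makes it easy to spot rows where the 4-thread and 8-thread results move sharply in opposite directions between the two commits.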

Benchmark suite Current: a8fcd1f Previous: 04deedf Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 411958 ns 411750 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 322792 ns 322271 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 322709 ns 323042 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 739500 ns 749375 ns 0.99
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43505 ns 43905 ns 0.99
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 1303625 ns 1306583 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 2414042 ns 465625 ns 5.18
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 473333 ns 13617333 ns 0.03475959646430032
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2199167 ns 2245750 ns 0.98
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 192310 ns 192831 ns 1.00
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 1392375 ns 1394875 ns 1.00
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 2601750 ns 634729.5 ns 4.10
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 596292 ns 14050875 ns 0.04243806880354426
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2247208 ns 2238000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1773104 ns 1661542 ns 1.07
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1092042 ns 1196103.5 ns 0.91
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1357375 ns 1534187.5 ns 0.88
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3006084 ns 3005667 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212208 ns 209529 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12119042 ns 12111521 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 8821417 ns 9554687 ns 0.92
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9254167 ns 9247000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18581166 ns 18626583 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1928002.5 ns 1910271 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17283041.5 ns 17307250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 13983292 ns 14377958 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14283500 ns 14526875 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21821208 ns 21836458.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 250640875 ns 250439041.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148166500 ns 174592521 ns 0.85
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148017333.5 ns 115955208.5 ns 1.28
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447268958 ns 447243084 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5481598 ns 5470843 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1228399583 ns 1228722500 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 929734125 ns 543561875 ns 1.71
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 447316208 ns 830623396.5 ns 0.54
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1627733958 ns 1628878000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34975846 ns 38000637 ns 0.92
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1138440375 ns 1136994583 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 994965291.5 ns 679379084 ns 1.46
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 630335646 ns 1328113771 ns 0.47
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1747915667 ns 1733752146 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 1093917 ns 1103375 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 1578729.5 ns 823209 ns 1.92
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1228271 ns 3578479 ns 0.34
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 776583 ns 786500 ns 0.99
lenet(28, 28, 1, 32)/forward/GPU/CUDA 273018.5 ns 266091.5 ns 1.03
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2971000 ns 2986021 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 4113750 ns 2426000 ns 1.70
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3308542 ns 10461250 ns 0.32
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3137958 ns 3150042 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1073795 ns 1055864 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2289229 ns 2335042 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1469583 ns 1537708 ns 0.96
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1738875 ns 1740000 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4340979 ns 4348437.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212603.5 ns 212286 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 20410688 ns 20266645.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 16959250 ns 17701209 ns 0.96
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 17862792 ns 17495416 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 26712959 ns 26797000 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1994616 ns 1973706 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 44348917 ns 44317750 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 40763334 ns 42027646 ns 0.97
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 41949834 ns 41325000 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 47711666 ns 47734917 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4661167 ns 4664854 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2859729 ns 2868521.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2983833 ns 3015958 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8643125 ns 8658937.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 514099 ns 516555 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 40793042 ns 40579000.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 33923896 ns 34830104 ns 0.97
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 33929292 ns 34148292 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 53570708 ns 53661812 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3019069 ns 2969951 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 109419625 ns 109640958 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 135911312.5 ns 84133666 ns 1.62
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 84560041.5 ns 255828791 ns 0.33
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 96170875 ns 96388416 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 269980167 ns 270215792 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 160584166 ns 186630271 ns 0.86
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 161101417 ns 128172709 ns 1.26
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 490670042 ns 489605542 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7120179.5 ns 7104246 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1510528291.5 ns 1502664042 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 1203093167 ns 821183792 ns 1.47
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 718371500.5 ns 1092397958.5 ns 0.66
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2041471083 ns 2032173187.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33945352 ns 33798333 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 2021995688 ns 2027767896 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1847972875 ns 1563910958 ns 1.18
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1520764250 ns 2210346833.5 ns 0.69
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2572459542 ns 2560629834 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 2004500 ns 2006833 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 3038520.5 ns 1257333 ns 2.42
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1623500 ns 7451041.5 ns 0.22
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2300416 ns 2470458 ns 0.93
lenet(28, 28, 1, 128)/forward/GPU/CUDA 281590.5 ns 275531 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 9539041 ns 9463416 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 11966958 ns 6552500 ns 1.83
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7133833 ns 25529541 ns 0.28
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11769854.5 ns 11734125 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1142436 ns 1130415 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 381095312.5 ns 380676854.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 286540104 ns 145328000 ns 1.97
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 130552500 ns 243564083 ns 0.54
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 454628499.5 ns 452336354.5 ns 1.01
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4879918 ns 4879283 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 1151212125 ns 1156932333 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 937737291 ns 487570458 ns 1.92
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 537791208 ns 973572458 ns 0.55
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1407393625 ns 1399439834 ns 1.01
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16198156 ns 16976929 ns 0.95
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1059667 ns 1062687.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 2066958 ns 971124.5 ns 2.13
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1346833.5 ns 6269583 ns 0.21
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1301708.5 ns 1393375 ns 0.93
lenet(28, 28, 1, 64)/forward/GPU/CUDA 282800 ns 277704.5 ns 1.02
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6507167 ns 6494541.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 12385292 ns 4635437.5 ns 2.67
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4917271 ns 19450479 ns 0.25
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 6052042 ns 6080229 ns 1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1168778 ns 1148981 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70453917 ns 70442208 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43503833 ns 35305229 ns 1.23
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37270792 ns 39532604 ns 0.94
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132526145.5 ns 132574604 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1855432 ns 1848251 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 356468062.5 ns 356785937.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 270648584 ns 159371854 ns 1.70
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 144683604.5 ns 254893688 ns 0.57
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 536151583.5 ns 535009020.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16484820 ns 16489529.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 395757875 ns 395707667 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 406670854 ns 245564417 ns 1.66
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 302419709 ns 652089584 ns 0.46
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 711243959 ns 712574333 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 1190338458 ns 1191762375 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 706805458.5 ns 434009729.5 ns 1.63
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 405427667 ns 631038834 ns 0.64
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1777958625 ns 1771033395.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12484829 ns 12471861 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 3669927166.5 ns 3670803208.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 2824188625 ns 1633483458 ns 1.73
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1519002791.5 ns 2737701958 ns 0.55
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 5077160875 ns 5038709417 ns 1.01
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49928336.5 ns 49641386 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3416687.5 ns 3412146 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2072667 ns 2094750 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2279500 ns 2533833.5 ns 0.90
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6030500 ns 6034292 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585066 ns 586721 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 26022292 ns 26096750.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19127625 ns 20315791.5 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19529583.5 ns 19312917 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39304583 ns 39366625 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2979739 ns 2989473.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 54180875 ns 54095229 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 82740583.5 ns 28393083 ns 2.91
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 30357812.5 ns 177757792 ns 0.17
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45521437.5 ns 45278750 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1779562.5 ns 1778208 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1095041.5 ns 1204708 ns 0.91
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1400687.5 ns 1564000 ns 0.90
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3034125 ns 3038771 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 217246 ns 217944 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12520250 ns 12531437.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9206083 ns 9964292 ns 0.92
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9714020.5 ns 9707042 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18976417 ns 18974500 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1949824 ns 1963028.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17622813 ns 17644270.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14323416 ns 14745500 ns 0.97
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14626583 ns 14639333 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22166500 ns 22173792 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70508667 ns 70409562 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43492437.5 ns 34786542 ns 1.25
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37527375 ns 39571499.5 ns 0.95
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132404062.5 ns 132610521 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1816949 ns 1837717 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 362069229 ns 360588187.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 347781687.5 ns 237608334 ns 1.46
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 196618583 ns 299913354 ns 0.66
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 725610500 ns 725805833 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13956880.5 ns 13956738 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 417984208.5 ns 418949812.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 421111875 ns 251360792 ns 1.68
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 309410125 ns 712732021 ns 0.43
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 717176084 ns 717284542 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1914750 ns 1912041.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1580250 ns 1579125 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1571166.5 ns 1549791.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2652521 ns 2657625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 575414 ns 573525 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 9270396 ns 9220000 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 13018146 ns 5936166 ns 2.19
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 5907042 ns 31895937.5 ns 0.19
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 10172291.5 ns 10214937.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1356982 ns 1399984.5 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 22200292 ns 22182333.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 27860541.5 ns 19138291.5 ns 1.46
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19139333 ns 52527562.5 ns 0.36
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 18844333.5 ns 18888042 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 792000 ns 791291.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 594625 ns 69958.5 ns 8.50
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 71125 ns 997167 ns 0.07132706958814321
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 728959 ns 724499.5 ns 1.01
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 47404 ns 48324 ns 0.98
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 1505354 ns 1508042 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 1051625 ns 320291 ns 3.28
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 321416 ns 1445145.5 ns 0.22
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2282959 ns 2258458.5 ns 1.01
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 209220.5 ns 216350 ns 0.97
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 1536625 ns 1537083 ns 1.00
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 1053437.5 ns 428792 ns 2.46
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 400625 ns 1444584 ns 0.28
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2260041 ns 2250333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3388729 ns 3421750 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2063250 ns 2084312.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2260542 ns 2519375.5 ns 0.90
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 5995458 ns 6015021 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 578396 ns 584297 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24084666.5 ns 24071521.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17259000 ns 18050833 ns 0.96
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16975000 ns 17227375 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37525375 ns 37583145.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2890766 ns 2895440 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 52597584 ns 52599188 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 82859271 ns 27644250 ns 3.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27410854 ns 170611917 ns 0.16
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44515646 ns 44514250 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 250162667 ns 250102292 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 147911791 ns 174510104 ns 0.85
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148118584 ns 115645729 ns 1.28
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 449242063 ns 448140124.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5480044.5 ns 5446378 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1103374709 ns 1105120833 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 855502583.5 ns 467780729.5 ns 1.83
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 439050583.5 ns 825455520.5 ns 0.53
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1768823917 ns 1753431125 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32278498 ns 35149612 ns 0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1024183270.5 ns 1021983312.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 968129916 ns 662517187.5 ns 1.46
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 588363708.5 ns 1286071167 ns 0.46
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1721526750 ns 1721665437.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1310104 ns 1312041 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 962041.5 ns 928625 ns 1.04
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 973125 ns 903208 ns 1.08
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1941645.5 ns 2032416 ns 0.96
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 580367.5 ns 575428 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 5945645.5 ns 5922771 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 6723479 ns 2615500 ns 2.57
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2611937.5 ns 24427083.5 ns 0.11
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7071291 ns 7104916.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1353279 ns 1363516 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 9637125.5 ns 9705958.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 13095292 ns 6499000 ns 2.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6499791.5 ns 31929750 ns 0.20
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 7605917 ns 7614042 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 484292 ns 483291 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 401833 ns 31750 ns 12.66
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 32958 ns 1795375 ns 0.018357167722620624
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 90416 ns 91542 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 28617 ns 28996 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 392083.5 ns 392958 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 457208 ns 175542 ns 2.60
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 175958 ns 4708417 ns 0.03737094654105615
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 272729.5 ns 273000 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 221254 ns 224707.5 ns 0.98
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 665084 ns 666333 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 729542 ns 442250 ns 1.65
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 443083 ns 4499167 ns 0.0984811188382205
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 510500 ns 510979.5 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 431208 ns 430437.5 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 337917 ns 13583 ns 24.88
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 14166 ns 709208 ns 0.019974393971867208
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 52875 ns 52584 ns 1.01
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 28234 ns 29296 ns 0.96
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 338916 ns 337250 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 338479.5 ns 26375 ns 12.83
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 26125 ns 484812.5 ns 0.05388681191182158
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 151375 ns 151333 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 210594 ns 213308.5 ns 0.99
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 353000 ns 352521 ns 1.00
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 353250 ns 45792 ns 7.71
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 46417 ns 487125 ns 0.09528765717218374
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 150958 ns 151000 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 603768625 ns 603223875 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 428953208.5 ns 239241354 ns 1.79
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 209012583 ns 377713896 ns 0.55
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 872395250 ns 872019458 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7673213 ns 7676104.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 2000233375 ns 2005520125 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1622205749.5 ns 947653916.5 ns 1.71
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 942590042 ns 1551514604.5 ns 0.61
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2661180292 ns 2653038416 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 27046367 ns 27180094 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 526792 ns 525604 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 438875 ns 168333 ns 2.61
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 169208 ns 1740625 ns 0.0972110592459605
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 872708 ns 875541 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47710 ns 47837 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1896249.5 ns 1943750 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 2848979 ns 1100208 ns 2.59
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 1059625 ns 14661875 ns 0.07227077028006308
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2814854 ns 2836709 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 223708 ns 232330 ns 0.96
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 2924125 ns 2974229 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 5708584 ns 2208583.5 ns 2.58
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 2178791.5 ns 15024229.5 ns 0.15
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 3727791 ns 3751750 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1608084 ns 1602291.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1236500 ns 1221084 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1184833.5 ns 1264750 ns 0.94
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2205416.5 ns 2362750 ns 0.93
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 578741 ns 576709 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 5873500 ns 5931125 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 4651125 ns 2866334 ns 1.62
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2861854 ns 25035834 ns 0.11
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 6617458 ns 6650208 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1320023 ns 1379411 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 11639229 ns 11605146 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 14019937.5 ns 8767458 ns 1.60
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8772125 ns 35255000 ns 0.25
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 9529646 ns 9570000.5 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2333 ns 2541 ns 0.92
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2458 ns 2292 ns 1.07
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 2917 ns 3000 ns 0.97
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 2333 ns 2333 ns 1
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 24232 ns 25379.5 ns 0.95
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7375 ns 7125 ns 1.04
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7167 ns 7083 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7417 ns 7375 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7167 ns 7270.5 ns 0.99
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 184838 ns 193729.5 ns 0.95
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8395.5 ns 8334 ns 1.01
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8333 ns 8500 ns 0.98
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8792 ns 8417 ns 1.04
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 5500 ns 6084 ns 0.90
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 11000 ns 10375.5 ns 1.06
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 14875 ns 14916 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 12959 ns 11854 ns 1.09
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 7083 ns 7625 ns 0.93
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 24667 ns 25646 ns 0.96
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 21667 ns 21708 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 21709 ns 21500 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 22084 ns 21750 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 21833 ns 21875 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 195081.5 ns 203851 ns 0.96
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 53709 ns 53417 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 53584 ns 56583.5 ns 0.95
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 56959 ns 53583.5 ns 1.06
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 51208 ns 51333 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 28458 ns 26895.5 ns 1.06
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 28958.5 ns 28333.5 ns 1.02
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 28500 ns 29000 ns 0.98
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 45917 ns 48291 ns 0.95
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 25470 ns 26739 ns 0.95
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 225375 ns 220875 ns 1.02
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 273979.5 ns 44583 ns 6.15
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 44541 ns 4132667 ns 0.010777785870480248
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 145875 ns 145458 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 166315 ns 172310 ns 0.97
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 244083 ns 237312.5 ns 1.03
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 291625 ns 68625 ns 4.25
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 68625 ns 4360708 ns 0.015737123421242605
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 145667 ns 145917 ns 1.00
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 1875 ns 2292 ns 0.82
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 3875 ns 1750 ns 2.21
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2334 ns 2166 ns 1.08
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 3792 ns 1520.5 ns 2.49
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 22617 ns 23935 ns 0.94
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5459 ns 5125 ns 1.07
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5458 ns 5042 ns 1.08
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5833 ns 5458 ns 1.07
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5292 ns 5084 ns 1.04
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 170582 ns 176841 ns 0.96
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 7666 ns 7292 ns 1.05
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 7500 ns 8166 ns 0.92
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 8291.5 ns 7541 ns 1.10
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 5375 ns 5167 ns 1.04
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 80946709 ns 80940833 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 49623500 ns 41092709 ns 1.21
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 43602125 ns 45570541 ns 0.96
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 153464500 ns 153559792 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2633961 ns 2660311 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 618289292 ns 621714834 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 427194916 ns 421739375 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 309056583 ns 414510667 ns 0.75
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 700731875 ns 697568292 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15176782 ns 15148414 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 871493479 ns 872377937.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 842345750 ns 706482291.5 ns 1.19
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 717181583.5 ns 1162546146 ns 0.62
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 1171224541.5 ns 1175739375 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
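
For anyone reading the rows above by hand: the Ratio column appears to be Current divided by Previous (for example, 44541 ns / 4132667 ns gives the 0.0107... shown for the 128 x 128 zygote run with 8 threads), so values above 1 mean the current commit was slower. A minimal Python sketch for parsing one flattened row and recomputing that ratio follows. The helper names (`parse_row`, `ratio`, `is_regression`) and the 2.0 threshold are assumptions made for illustration, not part of github-action-benchmark or Lux.

```python
import re

# One flattened row looks like:
#   "<benchmark name> <current> ns <previous> ns <ratio>"
# The name itself contains spaces and digits, so anchor on the trailing fields.
ROW = re.compile(
    r"^(?P<name>.+?) (?P<current>[\d.]+) ns (?P<previous>[\d.]+) ns (?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Split a flattened table row into (name, current_ns, previous_ns)."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised benchmark row: {line!r}")
    return match["name"], float(match["current"]), float(match["previous"])


def ratio(current_ns, previous_ns):
    """Current / Previous; values above 1 mean the current commit is slower."""
    return current_ns / previous_ns


def is_regression(current_ns, previous_ns, threshold=2.0):
    """Flag a row whose ratio exceeds a hypothetical alert threshold."""
    return ratio(current_ns, previous_ns) > threshold


if __name__ == "__main__":
    row = (
        "Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) "
        "273979.5 ns 44583 ns 6.15"
    )
    name, current_ns, previous_ns = parse_row(row)
    print(name, round(ratio(current_ns, previous_ns), 2),
          is_regression(current_ns, previous_ns))
```

Several 4-thread and 8-thread rows swing in opposite directions by similar factors (6.15 against 0.01, 4.25 against 0.02), which may point to run-to-run noise on the CI machine rather than an effect of this change, so a single-run ratio is best read with caution.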
