refactor: make LossFunctions an optional dep #976
Merged
avik-pal force-pushed the ap/remove_lossfunctions branch from df90308 to 10b4a90 (October 9, 2024 18:34)
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
avik-pal force-pushed the ap/remove_lossfunctions branch 2 times, most recently from 72f5f07 to b6a3b35 (October 9, 2024 19:56)
avik-pal force-pushed the ap/remove_lossfunctions branch from b6a3b35 to a8fcd1f (October 9, 2024 20:23)
Lux Benchmarks
Benchmark suite | Current: a8fcd1f | Previous: 04deedf | Ratio |
---|---|---|---|
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) | 411958 ns | 411750 ns | 1.00 |
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) | 322792 ns | 322271 ns | 1.00 |
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) | 322709 ns | 323042 ns | 1.00 |
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) | 739500 ns | 749375 ns | 0.99 |
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA | 43505 ns | 43905 ns | 0.99 |
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) | 1303625 ns | 1306583 ns | 1.00 |
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) | 2414042 ns | 465625 ns | 5.18 |
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) | 473333 ns | 13617333 ns | 0.03475959646430032 |
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) | 2199167 ns | 2245750 ns | 0.98 |
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA | 192310 ns | 192831 ns | 1.00 |
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) | 1392375 ns | 1394875 ns | 1.00 |
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) | 2601750 ns | 634729.5 ns | 4.10 |
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) | 596292 ns | 14050875 ns | 0.04243806880354426 |
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) | 2247208 ns | 2238000 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1773104 ns | 1661542 ns | 1.07 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1092042 ns | 1196103.5 ns | 0.91 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1357375 ns | 1534187.5 ns | 0.88 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 3006084 ns | 3005667 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 212208 ns | 209529 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12119042 ns | 12111521 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 8821417 ns | 9554687 ns | 0.92 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9254167 ns | 9247000 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18581166 ns | 18626583 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1928002.5 ns | 1910271 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17283041.5 ns | 17307250 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 13983292 ns | 14377958 ns | 0.97 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14283500 ns | 14526875 ns | 0.98 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21821208 ns | 21836458.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 250640875 ns | 250439041.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 148166500 ns | 174592521 ns | 0.85 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148017333.5 ns | 115955208.5 ns | 1.28 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 447268958 ns | 447243084 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5481598 ns | 5470843 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1228399583 ns | 1228722500 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 929734125 ns | 543561875 ns | 1.71 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 447316208 ns | 830623396.5 ns | 0.54 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1627733958 ns | 1628878000 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34975846 ns | 38000637 ns | 0.92 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1138440375 ns | 1136994583 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 994965291.5 ns | 679379084 ns | 1.46 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 630335646 ns | 1328113771 ns | 0.47 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1747915667 ns | 1733752146 ns | 1.01 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 1093917 ns | 1103375 ns | 0.99 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 1578729.5 ns | 823209 ns | 1.92 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1228271 ns | 3578479 ns | 0.34 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 776583 ns | 786500 ns | 0.99 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 273018.5 ns | 266091.5 ns | 1.03 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2971000 ns | 2986021 ns | 0.99 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 4113750 ns | 2426000 ns | 1.70 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3308542 ns | 10461250 ns | 0.32 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3137958 ns | 3150042 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1073795 ns | 1055864 ns | 1.02 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 2289229 ns | 2335042 ns | 0.98 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1469583 ns | 1537708 ns | 0.96 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1738875 ns | 1740000 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 4340979 ns | 4348437.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 212603.5 ns | 212286 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 20410688 ns | 20266645.5 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 16959250 ns | 17701209 ns | 0.96 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 17862792 ns | 17495416 ns | 1.02 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 26712959 ns | 26797000 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1994616 ns | 1973706 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 44348917 ns | 44317750 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 40763334 ns | 42027646 ns | 0.97 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 41949834 ns | 41325000 ns | 1.02 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 47711666 ns | 47734917 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 4661167 ns | 4664854 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2859729 ns | 2868521.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2983833 ns | 3015958 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 8643125 ns | 8658937.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 514099 ns | 516555 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 40793042 ns | 40579000.5 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 33923896 ns | 34830104 ns | 0.97 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 33929292 ns | 34148292 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 53570708 ns | 53661812 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3019069 ns | 2969951 ns | 1.02 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 109419625 ns | 109640958 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 135911312.5 ns | 84133666 ns | 1.62 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 84560041.5 ns | 255828791 ns | 0.33 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 96170875 ns | 96388416 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 269980167 ns | 270215792 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 160584166 ns | 186630271 ns | 0.86 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 161101417 ns | 128172709 ns | 1.26 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 490670042 ns | 489605542 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7120179.5 ns | 7104246 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1510528291.5 ns | 1502664042 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 1203093167 ns | 821183792 ns | 1.47 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 718371500.5 ns | 1092397958.5 ns | 0.66 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 2041471083 ns | 2032173187.5 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33945352 ns | 33798333 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 2021995688 ns | 2027767896 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1847972875 ns | 1563910958 ns | 1.18 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1520764250 ns | 2210346833.5 ns | 0.69 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 2572459542 ns | 2560629834 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 2004500 ns | 2006833 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 3038520.5 ns | 1257333 ns | 2.42 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1623500 ns | 7451041.5 ns | 0.22 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2300416 ns | 2470458 ns | 0.93 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 281590.5 ns | 275531 ns | 1.02 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 9539041 ns | 9463416 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 11966958 ns | 6552500 ns | 1.83 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7133833 ns | 25529541 ns | 0.28 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 11769854.5 ns | 11734125 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1142436 ns | 1130415 ns | 1.01 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 381095312.5 ns | 380676854.5 ns | 1.00 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 286540104 ns | 145328000 ns | 1.97 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 130552500 ns | 243564083 ns | 0.54 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 454628499.5 ns | 452336354.5 ns | 1.01 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4879918 ns | 4879283 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 1151212125 ns | 1156932333 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 937737291 ns | 487570458 ns | 1.92 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 537791208 ns | 973572458 ns | 0.55 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 1407393625 ns | 1399439834 ns | 1.01 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16198156 ns | 16976929 ns | 0.95 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1059667 ns | 1062687.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 2066958 ns | 971124.5 ns | 2.13 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1346833.5 ns | 6269583 ns | 0.21 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1301708.5 ns | 1393375 ns | 0.93 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 282800 ns | 277704.5 ns | 1.02 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 6507167 ns | 6494541.5 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 12385292 ns | 4635437.5 ns | 2.67 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4917271 ns | 19450479 ns | 0.25 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 6052042 ns | 6080229 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1168778 ns | 1148981 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 70453917 ns | 70442208 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 43503833 ns | 35305229 ns | 1.23 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37270792 ns | 39532604 ns | 0.94 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 132526145.5 ns | 132574604 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1855432 ns | 1848251 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 356468062.5 ns | 356785937.5 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 270648584 ns | 159371854 ns | 1.70 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 144683604.5 ns | 254893688 ns | 0.57 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 536151583.5 ns | 535009020.5 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16484820 ns | 16489529.5 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 395757875 ns | 395707667 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 406670854 ns | 245564417 ns | 1.66 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 302419709 ns | 652089584 ns | 0.46 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 711243959 ns | 712574333 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 1190338458 ns | 1191762375 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 706805458.5 ns | 434009729.5 ns | 1.63 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 405427667 ns | 631038834 ns | 0.64 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 1777958625 ns | 1771033395.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12484829 ns | 12471861 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 3669927166.5 ns | 3670803208.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 2824188625 ns | 1633483458 ns | 1.73 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1519002791.5 ns | 2737701958 ns | 0.55 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 5077160875 ns | 5038709417 ns | 1.01 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49928336.5 ns | 49641386 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3416687.5 ns | 3412146 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2072667 ns | 2094750 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2279500 ns | 2533833.5 ns | 0.90 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 6030500 ns | 6034292 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 585066 ns | 586721 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 26022292 ns | 26096750.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19127625 ns | 20315791.5 ns | 0.94 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 19529583.5 ns | 19312917 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 39304583 ns | 39366625 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2979739 ns | 2989473.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 54180875 ns | 54095229 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 82740583.5 ns | 28393083 ns | 2.91 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 30357812.5 ns | 177757792 ns | 0.17 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 45521437.5 ns | 45278750 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1779562.5 ns | 1778208 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1095041.5 ns | 1204708 ns | 0.91 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1400687.5 ns | 1564000 ns | 0.90 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 3034125 ns | 3038771 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 217246 ns | 217944 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12520250 ns | 12531437.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9206083 ns | 9964292 ns | 0.92 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9714020.5 ns | 9707042 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18976417 ns | 18974500 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1949824 ns | 1963028.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17622813 ns | 17644270.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14323416 ns | 14745500 ns | 0.97 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14626583 ns | 14639333 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 22166500 ns | 22173792 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 70508667 ns | 70409562 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 43492437.5 ns | 34786542 ns | 1.25 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37527375 ns | 39571499.5 ns | 0.95 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 132404062.5 ns | 132610521 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1816949 ns | 1837717 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 362069229 ns | 360588187.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 347781687.5 ns | 237608334 ns | 1.46 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 196618583 ns | 299913354 ns | 0.66 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 725610500 ns | 725805833 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13956880.5 ns | 13956738 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 417984208.5 ns | 418949812.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 421111875 ns | 251360792 ns | 1.68 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 309410125 ns | 712732021 ns | 0.43 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 717176084 ns | 717284542 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 1914750 ns | 1912041.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 1580250 ns | 1579125 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 1571166.5 ns | 1549791.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 2652521 ns | 2657625 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 575414 ns | 573525 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 9270396 ns | 9220000 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 13018146 ns | 5936166 ns | 2.19 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 5907042 ns | 31895937.5 ns | 0.19 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 10172291.5 ns | 10214937.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1356982 ns | 1399984.5 ns | 0.97 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 22200292 ns | 22182333.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 27860541.5 ns | 19138291.5 ns | 1.46 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 19139333 ns | 52527562.5 ns | 0.36 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 18844333.5 ns | 18888042 ns | 1.00 |
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) | 792000 ns | 791291.5 ns | 1.00 |
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) | 594625 ns | 69958.5 ns | 8.50 |
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) | 71125 ns | 997167 ns | 0.07132706958814321 |
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) | 728959 ns | 724499.5 ns | 1.01 |
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA | 47404 ns | 48324 ns | 0.98 |
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) | 1505354 ns | 1508042 ns | 1.00 |
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) | 1051625 ns | 320291 ns | 3.28 |
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) | 321416 ns | 1445145.5 ns | 0.22 |
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) | 2282959 ns | 2258458.5 ns | 1.01 |
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA | 209220.5 ns | 216350 ns | 0.97 |
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) | 1536625 ns | 1537083 ns | 1.00 |
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) | 1053437.5 ns | 428792 ns | 2.46 |
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) | 400625 ns | 1444584 ns | 0.28 |
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) | 2260041 ns | 2250333 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3388729 ns | 3421750 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2063250 ns | 2084312.5 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2260542 ns | 2519375.5 ns | 0.90 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 5995458 ns | 6015021 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 578396 ns | 584297 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24084666.5 ns | 24071521.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 17259000 ns | 18050833 ns | 0.96 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 16975000 ns | 17227375 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 37525375 ns | 37583145.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2890766 ns | 2895440 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 52597584 ns | 52599188 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 82859271 ns | 27644250 ns | 3.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27410854 ns | 170611917 ns | 0.16 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 44515646 ns | 44514250 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 250162667 ns | 250102292 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 147911791 ns | 174510104 ns | 0.85 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148118584 ns | 115645729 ns | 1.28 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 449242063 ns | 448140124.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5480044.5 ns | 5446378 ns | 1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1103374709 ns | 1105120833 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 855502583.5 ns | 467780729.5 ns | 1.83 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 439050583.5 ns | 825455520.5 ns | 0.53 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1768823917 ns | 1753431125 ns | 1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32278498 ns | 35149612 ns | 0.92 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1024183270.5 ns | 1021983312.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 968129916 ns | 662517187.5 ns | 1.46 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 588363708.5 ns | 1286071167 ns | 0.46 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1721526750 ns | 1721665437.5 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1310104 ns | 1312041 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 962041.5 ns | 928625 ns | 1.04 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 973125 ns | 903208 ns | 1.08 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 1941645.5 ns | 2032416 ns | 0.96 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 580367.5 ns | 575428 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 5945645.5 ns | 5922771 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 6723479 ns | 2615500 ns | 2.57 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2611937.5 ns | 24427083.5 ns | 0.11 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 7071291 ns | 7104916.5 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1353279 ns | 1363516 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 9637125.5 ns | 9705958.5 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 13095292 ns | 6499000 ns | 2.01 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6499791.5 ns | 31929750 ns | 0.20 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 7605917 ns | 7614042 ns | 1.00 |
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) | 484292 ns | 483291 ns | 1.00 |
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) | 401833 ns | 31750 ns | 12.66 |
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) | 32958 ns | 1795375 ns | 0.018357167722620624 |
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) | 90416 ns | 91542 ns | 0.99 |
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA | 28617 ns | 28996 ns | 0.99 |
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) | 392083.5 ns | 392958 ns | 1.00 |
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) | 457208 ns | 175542 ns | 2.60 |
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) | 175958 ns | 4708417 ns | 0.03737094654105615 |
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) | 272729.5 ns | 273000 ns | 1.00 |
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA | 221254 ns | 224707.5 ns | 0.98 |
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) | 665084 ns | 666333 ns | 1.00 |
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) | 729542 ns | 442250 ns | 1.65 |
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) | 443083 ns | 4499167 ns | 0.0984811188382205 |
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) | 510500 ns | 510979.5 ns | 1.00 |
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) | 431208 ns | 430437.5 ns | 1.00 |
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) | 337917 ns | 13583 ns | 24.88 |
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) | 14166 ns | 709208 ns | 0.019974393971867208 |
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) | 52875 ns | 52584 ns | 1.01 |
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA | 28234 ns | 29296 ns | 0.96 |
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) | 338916 ns | 337250 ns | 1.00 |
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) | 338479.5 ns | 26375 ns | 12.83 |
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) | 26125 ns | 484812.5 ns | 0.05388681191182158 |
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) | 151375 ns | 151333 ns | 1.00 |
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA | 210594 ns | 213308.5 ns | 0.99 |
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) | 353000 ns | 352521 ns | 1.00 |
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) | 353250 ns | 45792 ns | 7.71 |
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) | 46417 ns | 487125 ns | 0.09528765717218374 |
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) | 150958 ns | 151000 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 603768625 ns | 603223875 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 428953208.5 ns | 239241354 ns | 1.79 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 209012583 ns | 377713896 ns | 0.55 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 872395250 ns | 872019458 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7673213 ns | 7676104.5 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 2000233375 ns | 2005520125 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 1622205749.5 ns | 947653916.5 ns | 1.71 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 942590042 ns | 1551514604.5 ns | 0.61 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 2661180292 ns | 2653038416 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 27046367 ns | 27180094 ns | 1.00 |
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) | 526792 ns | 525604 ns | 1.00 |
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) | 438875 ns | 168333 ns | 2.61 |
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) | 169208 ns | 1740625 ns | 0.0972110592459605 |
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) | 872708 ns | 875541 ns | 1.00 |
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA | 47710 ns | 47837 ns | 1.00 |
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1896249.5 ns | 1943750 ns | 0.98 |
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2848979 ns | 1100208 ns | 2.59 |
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1059625 ns | 14661875 ns | 0.07227077028006308 |
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2814854 ns | 2836709 ns | 0.99 |
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA | 223708 ns | 232330 ns | 0.96 |
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) | 2924125 ns | 2974229 ns | 0.98 |
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) | 5708584 ns | 2208583.5 ns | 2.58 |
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) | 2178791.5 ns | 15024229.5 ns | 0.15 |
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) | 3727791 ns | 3751750 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) |
1608084 ns |
1602291.5 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) |
1236500 ns |
1221084 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) |
1184833.5 ns |
1264750 ns |
0.94 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) |
2205416.5 ns |
2362750 ns |
0.93 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA |
578741 ns |
576709 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) |
5873500 ns |
5931125 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) |
4651125 ns |
2866334 ns |
1.62 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) |
2861854 ns |
25035834 ns |
0.11 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) |
6617458 ns |
6650208 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA |
1320023 ns |
1379411 ns |
0.96 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) |
11639229 ns |
11605146 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) |
14019937.5 ns |
8767458 ns |
1.60 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) |
8772125 ns |
35255000 ns |
0.25 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) |
9529646 ns |
9570000.5 ns |
1.00 |
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) |
2333 ns |
2541 ns |
0.92 |
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) |
2458 ns |
2292 ns |
1.07 |
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) |
2917 ns |
3000 ns |
0.97 |
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) |
2333 ns |
2333 ns |
1 |
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA |
24232 ns |
25379.5 ns |
0.95 |
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) |
7375 ns |
7125 ns |
1.04 |
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) |
7167 ns |
7083 ns |
1.01 |
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) |
7417 ns |
7375 ns |
1.01 |
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) |
7167 ns |
7270.5 ns |
0.99 |
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA |
184838 ns |
193729.5 ns |
0.95 |
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) |
8395.5 ns |
8334 ns |
1.01 |
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) |
8333 ns |
8500 ns |
0.98 |
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) |
8792 ns |
8417 ns |
1.04 |
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) |
5500 ns |
6084 ns |
0.90 |
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) |
11000 ns |
10375.5 ns |
1.06 |
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) |
14875 ns |
14916 ns |
1.00 |
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) |
12959 ns |
11854 ns |
1.09 |
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) |
7083 ns |
7625 ns |
0.93 |
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA |
24667 ns |
25646 ns |
0.96 |
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) |
21667 ns |
21708 ns |
1.00 |
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) |
21709 ns |
21500 ns |
1.01 |
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) |
22084 ns |
21750 ns |
1.02 |
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) |
21833 ns |
21875 ns |
1.00 |
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA |
195081.5 ns |
203851 ns |
0.96 |
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) |
53709 ns |
53417 ns |
1.01 |
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) |
53584 ns |
56583.5 ns |
0.95 |
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) |
56959 ns |
53583.5 ns |
1.06 |
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) |
51208 ns |
51333 ns |
1.00 |
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) |
28458 ns |
26895.5 ns |
1.06 |
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) |
28958.5 ns |
28333.5 ns |
1.02 |
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) |
28500 ns |
29000 ns |
0.98 |
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) |
45917 ns |
48291 ns |
0.95 |
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA |
25470 ns |
26739 ns |
0.95 |
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) |
225375 ns |
220875 ns |
1.02 |
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) |
273979.5 ns |
44583 ns |
6.15 |
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) |
44541 ns |
4132667 ns |
0.010777785870480248 |
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) |
145875 ns |
145458 ns |
1.00 |
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA |
166315 ns |
172310 ns |
0.97 |
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) |
244083 ns |
237312.5 ns |
1.03 |
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) |
291625 ns |
68625 ns |
4.25 |
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) |
68625 ns |
4360708 ns |
0.015737123421242605 |
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) |
145667 ns |
145917 ns |
1.00 |
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) |
1875 ns |
2292 ns |
0.82 |
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) |
3875 ns |
1750 ns |
2.21 |
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) |
2334 ns |
2166 ns |
1.08 |
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) |
3792 ns |
1520.5 ns |
2.49 |
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA |
22617 ns |
23935 ns |
0.94 |
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) |
5459 ns |
5125 ns |
1.07 |
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) |
5458 ns |
5042 ns |
1.08 |
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) |
5833 ns |
5458 ns |
1.07 |
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) |
5292 ns |
5084 ns |
1.04 |
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA |
170582 ns |
176841 ns |
0.96 |
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) |
7666 ns |
7292 ns |
1.05 |
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) |
7500 ns |
8166 ns |
0.92 |
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) |
8291.5 ns |
7541 ns |
1.10 |
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) |
5375 ns |
5167 ns |
1.04 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
80946709 ns |
80940833 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
49623500 ns |
41092709 ns |
1.21 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
43602125 ns |
45570541 ns |
0.96 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
153464500 ns |
153559792 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
2633961 ns |
2660311 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
618289292 ns |
621714834 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
427194916 ns |
421739375 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
309056583 ns |
414510667 ns |
0.75 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
700731875 ns |
697568292 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
15176782 ns |
15148414 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
871493479 ns |
872377937.5 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
842345750 ns |
706482291.5 ns |
1.19 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
717181583.5 ns |
1162546146 ns |
0.62 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
1171224541.5 ns |
1175739375 ns |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
LossFunctions is a good package and makes our lives much simpler in terms of maintenance burden, but it is written in a way that is non-optimal for XLA compilation. Considering that, I am providing native implementations of the loss functions and moving the LossFunctions support to an extension.