fix bablestream benchmark #2420
base: develop
Conversation
@mehmetyusufoglu Can you please check whether the CPU backend is working too?
DataType const* sumPtr = std::data(bufHostSumPerBlock);
float const result = std::reduce(sumPtr, sumPtr + gridBlockExtent, 0.0f);
This and the memcpy have to be part of the measurement, because it is our fault that we do not execute the full reduction on the device.
To have a fair comparison, the allocation of the result would have to be part of measureKernelExec too, but the CUDA upstream implementation is cheating here as well, so I assume it is fine to allocate it outside of measureKernelExec.
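A minimal sketch of what including these operations in the timed section could look like, assuming measureKernelExec is the helper that times a callable and takes a label (bufAccSumPerBlock, workDivDot, and resultDot are illustrative names, not necessarily those in the PR; std::reduce needs <numeric>):
measureKernelExec(
    [&]()
    {
        // Run the dot kernel, copy the per-block partial sums to the host, and finish
        // the reduction there, so all three steps are covered by the measurement.
        alpaka::exec<Acc>(queue, workDivDot, DotKernel(), bufAccInputAPtr, bufAccInputBPtr, bufAccSumPerBlockPtr);
        alpaka::memcpy(queue, bufHostSumPerBlock, bufAccSumPerBlock, gridBlockExtent);
        alpaka::wait(queue); // ensure the copy has finished before reducing on the host
        DataType const* sumPtr = std::data(bufHostSumPerBlock);
        resultDot = std::reduce(sumPtr, sumPtr + gridBlockExtent, DataType{0});
    },
    "DotKernel");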
OK, I saw these are also part of the measurement. Thanks.
Done, thanks. [The reduce is now taken into the measurement.]
// Block thread extent for DotKernel test work division parameters.
[[maybe_unused]] constexpr auto blockThreadExtentMain = 1024;
[[maybe_unused]] constexpr auto dotGridBlockExtent = 256;
Feel free to add an extent for CPUs too, to support the CPU dot execution required for the verification.
An additional test case has been added with blockSize = 1 for the Dot kernel. Hence the Dot kernel is now also tested for the CPU backend and for blockSize = 1.
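For illustration only, a hedged sketch of how CPU-friendly work-division constants could sit next to the existing ones; the CPU constant and the selector function are assumptions for this sketch, not the PR's code (the PR addresses this via the blockSize = 1 test case instead):
#include <cstddef>

// The dot kernel's shared-memory tree reduction wants many threads per block on GPU-like
// backends, while single-thread-per-block CPU backends need a block thread extent of 1.
constexpr std::size_t blockThreadExtentMain = 1024; // GPU-style backends, as in the snippet above
constexpr std::size_t dotGridBlockExtent = 256;
constexpr std::size_t blockThreadExtentCpu = 1; // single-thread-per-block CPU backends

inline std::size_t selectBlockThreadExtent(bool accSupportsMultipleThreadsPerBlock)
{
    return accSupportsMultipleThreadsPerBlock ? blockThreadExtentMain : blockThreadExtentCpu;
}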
Hi, thanks for your update. I've mostly checked for compliance with upstream BabelStream, as that was apparently a major point of discussion. But first I want to applaud you for that nice auto [i] = getIdx...; trick. Very nice indeed!
It's mostly small things I've found, some of which we might actively decide to do differently, e.g., not measuring the timings of some of the infrastructure-ish calls. There was one section that GitHub didn't allow me to comment on, so I'll put it in here:
Technically speaking, the dot kernel does a slightly different thing by accumulating the threadSums in registers and only storing them once into shared memory (our version vs. the upstream version). This is likely to be optimised away by the compiler, because both versions leave full flexibility concerning memory ordering here, but I don't believe we can be sure.
Also, just for my information: why is tbSum a reference in that same kernel? It very much looks like it must be dangling, but if this compiles and runs correctly, it apparently isn't?
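To make the two accumulation patterns explicit, here is an illustrative plain-C++ stand-in for the kernel body; the parameter names and indexing scheme are assumptions for this sketch, not the PR's exact code:
#include <cstddef>

template<typename T>
void dotPartialSum(
    T const* a,
    T const* b,
    std::size_t n,
    std::size_t localIdx, // thread index within the block
    std::size_t globalIdx, // global thread index
    std::size_t gridSize, // total number of threads in the grid
    T* tbSum) // stands in for the shared-memory tile, one slot per thread in the block
{
    // Pattern 1: accumulate the per-thread partial sum in a register (local variable)
    // and write it to shared memory exactly once at the end.
    T threadSum = T{0};
    for(std::size_t i = globalIdx; i < n; i += gridSize)
        threadSum += a[i] * b[i];
    tbSum[localIdx] = threadSum;

    // Pattern 2 would instead accumulate directly in the shared-memory slot:
    //   tbSum[localIdx] = T{0};
    //   for(std::size_t i = globalIdx; i < n; i += gridSize)
    //       tbSum[localIdx] += a[i] * b[i];
    // The results are identical; the difference is how often shared memory is written.
}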
[&]()
{ alpaka::exec<Acc>(queue, workDivTriad, TriadKernel(), bufAccInputAPtr, bufAccInputBPtr, bufAccOutputCPtr); },
"TriadKernel");
if(kernelsToBeExecuted == KernelsToRun::All || kernelsToBeExecuted == KernelsToRun::Triad)
Suggested change:
if(kernelsToBeExecuted == KernelsToRun::All || kernelsToBeExecuted == KernelsToRun::Triad)
else if(kernelsToBeExecuted == KernelsToRun::Triad)
There is code repetition in the original code: both run_triad and run_all call the triad kernel. Here, I call the same piece of code for both cases (see the sketch below). For reference:
alpaka/benchmarks/babelstream/src/main.cpp, line 108 in 7a8b205:
std::vector<std::vector<double>> run_all(Stream<T>* stream, T& sum)
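A hedged sketch of the de-duplicated control flow described above; the condition and the kernel call are taken from the quoted snippets, but the exact placement in the PR may differ:
// Triad is needed both for the full run (KernelsToRun::All) and for the triad-only mode
// (KernelsToRun::Triad), so a single branch covers both cases instead of duplicating the
// kernel call the way run_all/run_triad do upstream.
if(kernelsToBeExecuted == KernelsToRun::All || kernelsToBeExecuted == KernelsToRun::Triad)
{
    measureKernelExec(
        [&]()
        { alpaka::exec<Acc>(queue, workDivTriad, TriadKernel(), bufAccInputAPtr, bufAccInputBPtr, bufAccOutputCPtr); },
        "TriadKernel");
}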
alpaka::exec<Acc>(
    queue,
    workDivInit,
    InitKernel(),
    bufAccInputAPtr,
    bufAccInputBPtr,
    bufAccOutputCPtr,
    static_cast<DataType>(initA),
    static_cast<DataType>(initB));
In the original, this call is already timed.
It is not one of the babelstream kernels, but OK, I am implementing it.
Yes, I agree that it makes moderate sense at best, but it might be interesting information and it brings us closer to upstream. Your announced change is not yet in the PR.
ok, implemented. Thanks.
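What that change presumably amounts to, sketched from the quoted snippet and the measureKernelExec pattern used for the other kernels (a sketch, not necessarily the PR's exact code):
// Time the init kernel like the other kernels so its bandwidth shows up in the results table.
measureKernelExec(
    [&]()
    {
        alpaka::exec<Acc>(
            queue,
            workDivInit,
            InitKernel(),
            bufAccInputAPtr,
            bufAccInputBPtr,
            bufAccOutputCPtr,
            static_cast<DataType>(initA),
            static_cast<DataType>(initB));
    },
    "InitKernel");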
AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:2
Precision:double
DataSize(items):1048576
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
Kernels Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB)
AddKernel 87.201 0.0002886 0.0002886 0.0002886 25.166
CopyKernel 79.096 0.00021211 0.00021211 0.00021211 16.777
DotKernel 74.74 0.00022447 0.00022447 0.00022447 16.777
InitKernel 87.865 0.00028641 0.00028641 0.00028641 25.166
MultKernel 85.107 0.00019713 0.00019713 0.00019713 16.777
TriadKernel 87.046 0.00028911 0.00028911 0.00028911 25.166
alpaka::memcpy(queue, bufHostOutputC, bufAccOutputC, arraySize);
alpaka::memcpy(queue, bufHostOutputB, bufAccInputB, arraySize);
alpaka::memcpy(queue, bufHostOutputA, bufAccInputA, arraySize);
These get timed in the original version.
Read time (actually the copy time from Acc to Host for the 3 arrays) has been added to the output display as AccToHost Memcpy Time(sec). Thanks.
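One way the reported AccToHost Memcpy Time(sec) could be measured, as a sketch only; the buffer names come from the quoted snippet, while the std::chrono timing code is an assumption (needs <chrono>), not necessarily the PR's implementation:
auto const copyStart = std::chrono::steady_clock::now();
alpaka::memcpy(queue, bufHostOutputC, bufAccOutputC, arraySize);
alpaka::memcpy(queue, bufHostOutputB, bufAccInputB, arraySize);
alpaka::memcpy(queue, bufHostOutputA, bufAccInputA, arraySize);
alpaka::wait(queue); // the copies are enqueued asynchronously, so wait before stopping the clock
auto const copyEnd = std::chrono::steady_clock::now();
double const accToHostMemcpyTimeSec = std::chrono::duration<double>(copyEnd - copyStart).count();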
AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:100
Precision:double
DataSize(items):33554432
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.570856
Kernels Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB)
AddKernel 90.665 0.0088822 0.0089985 0.0089376 805.31
CopyKernel 89.087 0.0060264 0.0061119 0.0060773 536.87
DotKernel 93.055 0.0057694 0.0058486 0.0058113 536.87
InitKernel 84.437 0.0095374 0.0095374 0.0095374 805.31
MultKernel 89.35 0.0060086 0.0060852 0.0060568 536.87
TriadKernel 90.222 0.0089258 0.0090338 0.0089565 805.31
tbSum is a reference because the function return type is …
Yes, this was the choice in the first implementation in our repo; I now use it directly as in the CUDA implementation. Checking the performance.
Thanks for the explanation! That makes sense. Concerning the …
Randomness seeded to: 2905169299
AcceleratorType:AccCpuSerial<1,unsigned int>
Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)
AcceleratorType:AccGpuCudaRt<1,unsigned int>
Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)
AcceleratorType:AccCpuSerial<1,unsigned int>
Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)
AcceleratorType:AccGpuCudaRt<1,unsigned int>
===============================================================================
OK, I reverted it back. (Yes, accessing shared memory many times from each thread is not needed in such a case.)
This PR is an extension of the previous PR: #2299
New parameters and kernel calls with specific arrays in the kernel call sequence, to avoid cache-usage differences:
A = 0.1, B = 0.2, C = 0.0, scalar = 0.4
C = A // copy
B = scalar * C // mult
C = A + B // add
A = B + scalar * C // triad
The triad kernel could optionally be run alone in the original code by UoB; this option is also added.
The missing optional kernel NStream is added; it can be run separately on its own.
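For illustration, a hedged sketch of what an alpaka NStream-style kernel could look like; the operation a[i] += b[i] + scalar * c[i] follows upstream BabelStream, while the struct name and exact signature are assumptions, not necessarily the PR's code:
#include <alpaka/alpaka.hpp>

struct NstreamKernel
{
    template<typename TAcc, typename T>
    ALPAKA_FN_ACC void operator()(TAcc const& acc, T* a, T const* b, T const* c, T scalar) const
    {
        // One element per thread; the work division is assumed to cover the array exactly,
        // as for the other kernels in this benchmark.
        auto [i] = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc);
        a[i] += b[i] + scalar * c[i];
    }
};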
The Dot kernel is only run by multi-threaded accs, since the original babelstream uses a fixed block size of 1024, which is also the shared-memory size per block (search for "#define TBSIZE 1024" in this CUDA code).
RESULTS