Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new batch_gemm types #466

Merged
merged 30 commits into from
Jun 24, 2024

Conversation

AidanBeltonS
Copy link
Contributor

Description

This adds new data types for the gemm_batch operation, to better be in line with the oneMKL spec. The types added are <half, half, float, float>, <int8, int8, float, float>, and <int8, int8, int32, float>.

New testing is added for these data types. Tests where the scalar type does not match the input type require a higher tolerance as the reference calculation is being performed at a much higher precision.

Test logs:
rocblas_test_log.txt
cublas_test_log.txt

I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers.

Fixes # (GitHub issue)
#446

Checklist

All Submissions

  • Do all unit tests pass locally? Attach a log.
  • [x]Have you formatted the code using clang-format?

New interfaces

  • Have you provided motivation for adding a new feature as part of RFC and
    it was accepted? # (RFC)

New features

  • Have you provided motivation for adding a new feature?
  • Have you added relevant tests?

@AidanBeltonS
Copy link
Contributor Author

@Rbiessy, cc

@Rbiessy Rbiessy self-assigned this Apr 3, 2024
@Rbiessy Rbiessy requested a review from mmeterel April 3, 2024 15:23
@hjabird hjabird self-assigned this Apr 3, 2024
@mmeterel
Copy link
Contributor

mmeterel commented Apr 3, 2024

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs.

"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

Copy link
Contributor

@hjabird hjabird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the reference cblas implementation doesn't support some of the operations that are being added (as I understand it), is the new functionality actually tested?

src/blas/backends/cublas/cublas_helper.hpp Outdated Show resolved Hide resolved
src/blas/backends/portblas/portblas_batch.cxx Show resolved Hide resolved
src/blas/backends/rocblas/rocblas_helper.hpp Show resolved Hide resolved
tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved
@Rbiessy
Copy link
Contributor

Rbiessy commented Apr 4, 2024

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs.

"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

Hey @mmeterel, I checked with Aidan about the issue with the MKL backends. The duplicate definitions seemed to be an issue with the setup or build commands used. We ran into another issue with undefined references with iamax and iamin functions using buffers using 2024.1 oneAPI base toolkit. Just a few example:

/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<double, 1, sycl::_V1::detail::aligned_allocator<double>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'

Looking at libmkl_sycl_blas.so.4 in 2024.1 these functions expect an index_base as a last argument but is not there in oneMKL:

$ readelf -Wa /path/to/mkl/latest/lib/libmkl_sycl_blas.so.4 | c++filt -t | grep "row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>"
  1302: 0000000002b573e0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<int, 1, sycl::_V1::detail::aligned_allocator<int>, void>&, oneapi::mkl::index_base)
  8510: 0000000002b573d0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&, oneapi::mkl::index_base)

We can use 2024.0 for the tests for now. Aidan is running more tests.

@mmeterel
Copy link
Contributor

mmeterel commented Apr 4, 2024

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs.
"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

Hey @mmeterel, I checked with Aidan about the issue with the MKL backends. The duplicate definitions seemed to be an issue with the setup or build commands used. We ran into another issue with undefined references with iamax and iamin functions using buffers using 2024.1 oneAPI base toolkit. Just a few example:

/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<double, 1, sycl::_V1::detail::aligned_allocator<double>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'

Looking at libmkl_sycl_blas.so.4 in 2024.1 these functions expect an index_base as a last argument but is not there in oneMKL:

$ readelf -Wa /path/to/mkl/latest/lib/libmkl_sycl_blas.so.4 | c++filt -t | grep "row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>"
  1302: 0000000002b573e0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<int, 1, sycl::_V1::detail::aligned_allocator<int>, void>&, oneapi::mkl::index_base)
  8510: 0000000002b573d0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&, oneapi::mkl::index_base)

We can use 2024.0 for the tests for now. Aidan is running more tests.

@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1.

@mmeterel
Copy link
Contributor

mmeterel commented Apr 4, 2024

@andrewtbarker Will you be able to help with this review?

@andrewtbarker
Copy link
Contributor

@andrewtbarker Will you be able to help with this review?

Sure, I will take a look.

Copy link
Contributor

@andrewtbarker andrewtbarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, there is a lot of good work here. Most of my comments are just about style and naming consistency.

src/blas/backends/cublas/cublas_batch.cpp Show resolved Hide resolved
src/blas/backends/cublas/cublas_batch.cpp Show resolved Hide resolved
src/blas/backends/cublas/cublas_batch.cpp Show resolved Hide resolved
src/blas/backends/cublas/cublas_batch.cpp Outdated Show resolved Hide resolved
src/blas/backends/rocblas/rocblas_batch.cpp Outdated Show resolved Hide resolved
src/blas/function_table.hpp Outdated Show resolved Hide resolved
tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved
tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved
tests/unit_tests/blas/batch/gemm_batch_stride_usm.cpp Outdated Show resolved Hide resolved
tests/unit_tests/blas/batch/gemm_batch_usm.cpp Outdated Show resolved Hide resolved
@andrewtbarker
Copy link
Contributor

@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1.

Yes, this should have been fixed in #445 . If not we should fix it.

@mmeterel
Copy link
Contributor

mmeterel commented Apr 4, 2024

Have you tested the PR with hipSYCL/AdaptiveSYCL? Can you please add the logs?

@AidanBeltonS
Copy link
Contributor Author

No I have not tested HIPsycl. I have attached the other backend tests below. Netlib and portblas are passing fine. MKL has some failing tests due to tolerating which I am investigating further. It seems it deviates more from the reference implementation in some cases.
mkl_test_log.txt
netlib_test_log.txt
port_blas_test_logs.txt

MKL tests error:
mkl_test_log.txt

@andrewtbarker
Copy link
Contributor

MKL has some failing tests due to tolerating which I am investigating further.

It looks like dotc and dotu have segfaults in your tests. Initially I think this is unlikely to be due to your PR but have you looked at this at all?

@AidanBeltonS
Copy link
Contributor Author

MKL has some failing tests due to tolerating which I am investigating further.

It looks like dotc and dotu have segfaults in your tests. Initially I think this is unlikely to be due to your PR but have you looked at this at all?

The failures it Dot are due to error

[ RUN      ] DotTestSuite/DotTests.RealDoubleSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100
relative error = 1.83849e-08 absolute error = 1.24863e-07 limit = 3.01315e-13
Difference in result: DPC++ 6.79159 vs. Reference 6.79159
/home/aidanbelton/source/oneMKL/tests/unit_tests/blas/level1/dot.cpp:157: Failure
Expected equality of these values:
  res
    Which is: 0
  1
[  FAILED  ] DotTestSuite/DotTests.RealDoubleSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100, where GetParam() = (0x560f5e0, 1-byte object <00>) (1 ms)

DotU is an odd one, it does not appear to be related to my changes however

[ RUN      ] DotuTestSuite/DotuTests.ComplexSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100
Caught synchronous SYCL exception during DOTU:
The program was built for 1 devices
Build program log for 'Intel(R) Data Center GPU Max 1100':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE) -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
OpenCL status: sycl:7
unknown file: Failure
C++ exception with description "Enqueue process failed. -59 (PI_ERROR_INVALID_OPERATION)" thrown in the test body.
[  FAILED  ] DotuTestSuite/DotuTests.ComplexSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100, where GetParam() = (0x560f5e0, 1-byte object <00>) (0 ms)

@AidanBeltonS
Copy link
Contributor Author

I have resolved all but one issue with GemmBatch's tests. The CPU MKL implementation has significant amounts of error compared to the GPU. I believe there may be a fundamental difference in the precision of the calculation for the CPU. One possible fix would be to increase the tolerance significantly just for the CPU. Im not a fan of this approach as it is a bit of a brute force solution. Does anyone have any recommendations on how they would like to see this handled?

[ RUN      ] GemmBatchUsmTestSuite/GemmBatchUsmTests.RealIntRealScalarPrecision/Column_Major_Intel_R__Xeon_R__Gold_5418Y
relative error = 0.000911658 absolute error = 0.00168478 limit = 0.000333786
Difference in entry (58,119): DPC++ 1.84973 vs. Reference 1.84804
relative error = 0.000812301 absolute error = 0.00121021 limit = 0.000333786
Difference in entry (0,124): DPC++ 1.49107 vs. Reference 1.48986
relative error = 0.000534697 absolute error = 0.000857353 limit = 0.000333786
Difference in entry (17,144): DPC++ 1.60258 vs. Reference 1.60344
relative error = 0.000527185 absolute error = 0.00049144 limit = 0.000333786
Difference in entry (52,186): DPC++ -0.932689 vs. Reference -0.932197
/home/aidanbelton/source/oneMKL/tests/unit_tests/blas/batch/gemm_batch_usm.cpp:408: Failure
Expected equality of these values:
  res
    Which is: 0
  1
[  FAILED  ] GemmBatchUsmTestSuite/GemmBatchUsmTests.RealIntRealScalarPrecision/Column_Major_Intel_R__Xeon_R__Gold_5418Y, where GetParam() = (0x56845d0, 1-byte object <01>) (331 ms)

@mmeterel
Copy link
Contributor

No I have not tested HIPsycl. I have attached the other backend tests below. Netlib and portblas are passing fine. MKL has some failing tests due to tolerating which I am investigating further. It seems it deviates more from the reference implementation in some cases. mkl_test_log.txt netlib_test_log.txt port_blas_test_logs.txt

MKL tests error: mkl_test_log.txt

Can you please test hipSYCL backend as well?

Copy link
Contributor

@andrewtbarker andrewtbarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions to make interpreting failed test results easier - I flagged a few places but there are similar issues in most of the new tests.

tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved
tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved
tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved
@andrewtbarker
Copy link
Contributor

Does anyone have any recommendations on how they would like to see this handled?

If, as we suspect, the CPU backend is doing accumulation in double while the GPU backend does it in float, one option would be changing what reference gemm from tests/unit_tests/blas/include/reference_blas_templates.hpp we call (might need to add a reference gemm in that file).

@andrewtbarker
Copy link
Contributor

andrewtbarker commented May 1, 2024

What is the status here? As I see it we have three outstanding items:

  1. AdaptiveCpp testing
  2. Test names (my most recent review, minor)
  3. Failure in RealIntRealScalarPrecision

(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it?

@Rbiessy
Copy link
Contributor

Rbiessy commented May 2, 2024

Hi @andrewtbarker, I have updated the status by email as it was easier to discuss issues with testing AdaptiveCpp on the CI. In short there are a few issues @AidanBeltonS will need to look at once he is back from Holiday next week!

@AidanBeltonS
Copy link
Contributor Author

What is the status here? As I see it we have three outstanding items:

1. AdaptiveCpp testing

2. Test names (my most recent review, minor)

3. Failure in `RealIntRealScalarPrecision`

(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it?

I have addressed items 2. and 3.
To resolve 3 I am scaling the tolerance by the possible input range from int8 matricies. i.e. 256
I have yet to test this with AdaptiveCpp, Ill start looking at that shortly

Copy link
Contributor

@andrewtbarker andrewtbarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're good to go. Thanks for sticking with this one!

Rbiessy
Rbiessy previously approved these changes May 27, 2024
@AidanBeltonS
Copy link
Contributor Author

AidanBeltonS commented Jun 5, 2024

I have also disabled the in8, float combination for MKLCPU/GPU as I found similar precision issues.
#506 will be updated to reflect this

@Rbiessy
Copy link
Contributor

Rbiessy commented Jun 24, 2024

I have confirmed the tests pass with AdaptiveCpp on AMD and Nvidia HW.

@Rbiessy Rbiessy merged commit 6433690 into oneapi-src:develop Jun 24, 2024
6 checks passed
normallytangent pushed a commit to normallytangent/oneMKL that referenced this pull request Aug 6, 2024
Add support for more batch_gemm types to follow the specification.
Some combination using int8 are disabled on some backends due to precision issue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants