Added SVE implementation to improve the performance on ARM architecture #10680

divya2108 · 2024-08-06T09:58:23Z

Motivation: This pull request aims to improve the performance of XGBoost's training algorithm on the ARM architecture by leveraging SVE intrinsics.

Brief description:

Using SVE intrinsics improves performance by 55% compared with the default (non-SVE) code path on ARM.
The modified function iterates over a row of data and updates a histogram based on the given indices and offsets.
The accuracy has been verified after the modifications.
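For readers less familiar with the kernel being described, a scalar version of a row-wise histogram update can be sketched as follows. This is only an illustration under stated assumptions: `GradPair`, `UpdateHistRow`, and the interleaved (grad, hess) layout are hypothetical names and choices made for this sketch, not XGBoost's actual types or the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a gradient/hessian pair; not XGBoost's real type.
struct GradPair {
  float grad;
  float hess;
};

// Accumulate one row's gradient pair into a histogram stored as interleaved
// (grad, hess) doubles: hist[2 * bin] holds grad, hist[2 * bin + 1] holds hess.
// bin_idx[j] is the local bin of feature j, and offsets[j] shifts it into
// that feature's slice of the global histogram.
inline void UpdateHistRow(const GradPair& gpair,
                          const std::vector<std::uint32_t>& bin_idx,
                          const std::vector<std::uint32_t>& offsets,
                          std::vector<double>* hist) {
  assert(bin_idx.size() == offsets.size());
  for (std::size_t j = 0; j < bin_idx.size(); ++j) {
    const std::size_t bin = static_cast<std::size_t>(bin_idx[j]) + offsets[j];
    assert(2 * bin + 1 < hist->size());
    (*hist)[2 * bin] += gpair.grad;
    (*hist)[2 * bin + 1] += gpair.hess;
  }
}
```

The inner loop is the part a SIMD version would process several features at a time.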

trivialfis · 2024-08-06T15:33:53Z

Thank you for the PR! I'm not an expert in SIMD. Are aligned pointers and padded memory allocations guaranteed for the intrinsics?

divya2108 · 2024-08-07T11:38:00Z

Hi @trivialfis
The code has been thoroughly validated to ensure alignment and padding issues are addressed. The datatypes have not been altered from the scalar code; instead, the original scalar operations have been translated into SIMD using equivalent SVE intrinsics and there are no compile-time errors.
All potential accuracy issues have been resolved and verified on widely used datasets like Iris, Airlines delay and breast cancer detection.

trivialfis · 2024-08-07T15:32:45Z

Thank you for the detailed info. Could you please help explain why it works without the use of specialized allocators like https://en.cppreference.com/w/c/memory/aligned_alloc ? It's important for us to know the logic for future maintenance.

divya2108 · 2024-08-09T07:19:13Z

Specialized allocators like aligned_alloc() don't help with SVE intrinsics because:

ARM's SVE SIMD architecture processes data in parallel in a way that inherently accounts for data alignment. For example, on a system with a 256-bit vector length, we load 8 float elements (8 × 32 bits) into an SVE register through VLA (vector-length-agnostic) instructions.
Most instructions, including the widening and narrowing instructions, help take care of data alignment.

divya2108 · 2024-08-12T15:52:21Z

Hi @trivialfis,
Additionally, SVE also provides predicate registers enabling key features such as:
a) Per-lane predication that allows SIMD instructions to be executed conditionally on specific lanes of a SIMD register
b) Predicate-driven loop control and management that helps to manage data that does not align perfectly with the vector length.

Mousius · 2024-08-13T17:47:37Z

Hi @trivialfis,

As @divya2108 mentioned, SVE has predication support.

These lines create masks which limit the load/stores from going out of bounds:

xgboost/src/common/hist_util.cc

Lines 265 to 266 in 5194c17

    
    svbool_t pg32 = svwhilelt_b32(j, row_size);
    svbool_t pg64 = svwhilelt_b64(j, row_size);

SVE is also happy to do element-aligned loads and stores rather than full vectors.
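To make the masking idea concrete, here is a minimal sketch of a predicated loop. It assumes a compiler that defines `__ARM_FEATURE_SVE` when SVE is enabled and falls back to a scalar loop otherwise, so it builds on any platform. `AddScalar` is a hypothetical helper written for this illustration and is not code from the PR.

```cpp
#include <cstdint>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Add `value` to every element of data[0..n).
inline void AddScalar(float* data, std::uint64_t n, float value) {
#if defined(__ARM_FEATURE_SVE)
  // svcntw() is the number of 32-bit lanes per vector on the running hardware.
  for (std::uint64_t j = 0; j < n; j += svcntw()) {
    // A lane is active only while j + lane < n, so the final iteration
    // masks off the lanes that would fall past the end of the buffer.
    svbool_t pg = svwhilelt_b32(j, n);
    svfloat32_t v = svld1_f32(pg, data + j);  // inactive lanes are not read
    v = svadd_n_f32_x(pg, v, value);
    svst1_f32(pg, data + j, v);               // inactive lanes are not written
  }
#else
  // Scalar fallback when the compiler is not targeting SVE.
  for (std::uint64_t j = 0; j < n; ++j) data[j] += value;
#endif
}
```

Because the predicate covers the tail, the buffer needs no padding to a multiple of the vector length, and the loads and stores only require element alignment.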

trivialfis · 2024-08-16T19:57:50Z

Thank you for the explanation! I will take a deeper look.

trivialfis

Started looking into this PR today. Thank you for working on using the Arm intrinsics, but could you please add detailed code comments and extract the code into an independent section (like a function that can be inlined)? Most people here (me included) have a background closer to data science than to low-level programming.

trivialfis · 2024-08-20T05:39:57Z

CMakeLists.txt

@@ -265,6 +265,51 @@ if(${CMAKE_SYSTEM_NAME} MATCHES "OS400")
  set(CMAKE_CXX_ARCHIVE_CREATE "<CMAKE_AR> -X64 qc <TARGET> <OBJECTS>")
 endif()

+if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64")


Could you please extract this into a module similar to cmake/PrefetchIntrinsics.cmake?

trivialfis · 2024-08-20T05:55:48Z

CMakeLists.txt

+    if(RUN_RESULT EQUAL 0)
+      message(STATUS "ARM SVE hardware support detected")
+      set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=armv8-a+sve")
+      string(APPEND CMAKE_CXX_FLAGS " -DSVE_SUPPORT_DETECTED")


Please prefix the flag with XGBOOST_ and use targeted flags instead of the CMAKE_CXX_FLAGS.

trivialfis · 2024-08-20T06:09:00Z

src/common/hist_util.cc

+      svfloat64_t pgh_t0_vec = svdup_n_f64(pgh_t[0]);
+      svfloat64_t pgh_t1_vec = svdup_n_f64(pgh_t[1]);


It seems you don't need the pgh_t in the SVE code section. Could you please use p_gpair and have the names of the loaded vectors, such as svfloat64_t grad, svfloat64_t hess, for readability?

divya2108 · 2024-08-26T08:28:00Z

Hi @trivialfis, Thank you for suggesting the appropriate changes. I have made the modifications as recommended. Could you please review the updated changes?

maajidkhann · 2024-09-03T08:34:05Z

The CMake logic looks right. It compiles the SVE code only when the compiler supports it, and at runtime it triggers the SVE code only when the hardware supports SVE (I see there's a runtime HW check for the SVE ISA). Changes LGTM.

Mousius · 2024-09-03T12:10:23Z

Can you point me to where the runtime check happens? As far as I can tell, this only works if the build environment supports compiling and running SVE.

The new path is conditionally compiled with #ifdef XGBOOST_SVE_SUPPORT_DETECTED with no fallback at runtime here:
https://github.com/dmlc/xgboost/pull/10680/files#diff-def34075edb2b3bdb6dc7b5ebcffd518793520fd4fffd70870b12f076a3cb481R305-R308

This makes me think this is only for users compiling from sources on a specific piece of hardware. If we wanted this to work in the generically distributed wheel, we'd have to do the SVE runtime check instead of the #ifdef.

Correct me if I'm wrong 😸
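The distinction between a compile-time `#ifdef` and a runtime check can be sketched as below. This assumes Linux on AArch64, where `getauxval(AT_HWCAP)` reports CPU features and `HWCAP_SVE` is the SVE bit. Whether the PR's detection logic uses this mechanism is not stated in the thread, and `HasSveAtRuntime`, `HistKernel`, and `ChooseKernel` are hypothetical names for this sketch.

```cpp
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>  // getauxval, AT_HWCAP
#ifndef HWCAP_SVE
#define HWCAP_SVE (1UL << 22)  // SVE bit from the Linux arm64 <asm/hwcap.h>
#endif
#endif

// True only if the CPU we are running on reports SVE support.
// On other platforms this sketch conservatively reports false.
inline bool HasSveAtRuntime() {
#if defined(__linux__) && defined(__aarch64__)
  return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
  return false;
#endif
}

enum class HistKernel { kScalar, kSve };

// The SVE path is taken only if it was compiled in AND the hardware has SVE;
// every other combination falls back to the scalar kernel.
inline HistKernel ChooseKernel(bool compiled_with_sve, bool hw_has_sve) {
  return (compiled_with_sve && hw_has_sve) ? HistKernel::kSve
                                           : HistKernel::kScalar;
}
```

A binary built this way could in principle ship in a generic wheel, since machines without SVE never reach the SVE instructions.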

maajidkhann · 2024-09-03T14:09:23Z

I agree with you. I found the HW detection logic here: https://github.com/dmlc/xgboost/pull/10680/files#diff-5650b69c609ef22dea88915eb256a6838341248d3ddfd17430388f7f7e58c4feR24

But this is just for compile time. I think similar logic needs to be used at runtime, so a runtime check is required.

Since there's already working SVE HW detection logic, it should be easy to reintroduce it in the source file.

CC @divya2108

trivialfis · 2024-09-07T16:25:07Z

Is SVE guaranteed to be available for ARM implementation?

divya2108 · 2024-09-09T07:23:55Z

No, SVE is not guaranteed to be available on all ARM implementations. SVE is an optional extension to the Armv8-A architecture: newer processors such as Graviton3, Graviton4, and Grace implement it, but it is not mandatory for all ARM CPUs. The code in hist_util.cc checks for SVE support at runtime to ensure that the target hardware supports it and runs the default code otherwise.

divya2108 · 2024-09-09T08:23:13Z

Yes, I agree. Thank you for bringing this to notice.
I have added an SVE hardware check at runtime. The code is now compiled generically and falls back to the default code if SVE hardware support is not detected.

I have verified this by building the code on different architectures. Here is a summary for more clarity:

rageshhajela16 · 2024-09-20T20:29:05Z

@trivialfis Thanks for the initial review and your comments. Could you please share any additional feedback, whether points that need further clarification or evaluation from our side or improvements to incorporate into the proposed implementation? Thanks. cc: @divya2108

trivialfis · 2024-09-23T19:18:42Z

Sorry for the slow reply; I got stuck on some other work lately. One question: is it possible to reduce the call frequency of check_sve_hw_support to maybe once per training session?

divya2108 · 2024-10-03T08:59:55Z

Yes, it's possible to reduce the frequency of calls to check_sve_hw_support by implementing a caching mechanism that checks the SVE hardware support status only once at the beginning of a training session. I have stored the result and it is being reused throughout the session.
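One common way to cache such a check, shown here only as a sketch, is a function-local static, which C++11 initialises exactly once even when several threads make the first call concurrently. `ProbeSveHardware` is a hypothetical placeholder for the real probe (it always returns false here), and `ProbeCalls` exists only so the sketch can show how often the probe runs. This is not necessarily how the PR stores the result.

```cpp
#include <atomic>

// Counts how many times the (hypothetical) hardware probe has run.
inline std::atomic<int>& ProbeCalls() {
  static std::atomic<int> calls{0};
  return calls;
}

// Placeholder for an expensive hardware query such as the PR's
// check_sve_hw_support(); it returns false in this sketch.
inline bool ProbeSveHardware() {
  ++ProbeCalls();
  return false;
}

// Cached accessor: the probe runs on the first call only, and every later
// call returns the stored result.
inline bool SveSupported() {
  static const bool cached = ProbeSveHardware();
  return cached;
}
```

Calling `SveSupported()` from the hot loop then costs little more than reading a boolean.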

divya2108 · 2024-10-09T06:15:39Z

Hi @trivialfis, just wanted to follow up on the code review. Let me know if you need any additional details or clarifications.

- Changed cmake design by extracting the code into cmake/CheckSVEsupport.cmake
- Prefixed the flags with XGBOOST_ and used targeted flags
- Extracted the SVE code into an inlined function
- Added detailed code comments
- Modified vector names for better readability

Signed-off-by: divya2108 <divya.kotadiya@fujitsu.com>

rageshhajela16 · 2024-10-17T09:18:50Z

Hi @trivialfis, we have pushed all the necessary changes. Kindly review them and let us know if any additional details or modifications are required. Thanks in advance for your time reviewing this.

trivialfis · 2024-10-17T10:49:03Z

Apologies for the slow response, will look into this. Thank you very much for your patience!

trivialfis · 2024-10-18T16:57:39Z

@hcho3 I see you have assigned yourself to the PR. Thank you for volunteering! Feel free to review the code.

I recently got access to a Grace machine and might be able to do some tests there.

hcho3 · 2024-10-18T20:07:45Z

Sorry, that must have been a mistake. I will try to look at the PR over the next few days, however.

hcho3 · 2024-10-21T22:09:21Z

cmake/CheckSVEsupport.cmake

+
+    # Save the original C_FLAGS to restore later
+    set(ORIGINAL_C_FLAGS "${CMAKE_C_FLAGS}")
+    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=armv8-a+sve")


Rather than modifying CMAKE_C_FLAGS directly, we should use CMAKE_REQUIRED_FLAGS instead, which is explicitly designed to influence the behavior of check_c_source_compiles.

Example: https://github.com/facebook/rocksdb/blob/c0be6a4b90a1f616969b2a808035ebf334894a37/CMakeLists.txt#L309-L342

Let me update the pull request to use CMAKE_REQUIRED_FLAGS.

hcho3 · 2024-10-21T22:12:26Z

src/common/hist_util.cc

-HistogramCuts::HistogramCuts() {
-  cut_ptrs_.HostVector().emplace_back(0);
-}
+HistogramCuts::HistogramCuts() { cut_ptrs_.HostVector().emplace_back(0); }


I see lots of unsubstantial formatting changes. We should apply clang-format with the same .clang-format configuration.

hcho3

We should ensure that the CI pipeline tests XGBoost with SVE intrinsics.

Two kinds of tests are needed:

End-to-end test. Build XGBoost with SVE and run pytests. We can do this easily, using the ARM worker machine in the CI.
Micro test. Write a gtest that compares the result of the histogram kernel with and without SVE enabled. For this we need a way to temporarily disable SVE feature at runtime.
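The runtime switch that the micro test needs could look roughly like the sketch below. All names here (`SveDisabledForTest`, `ScopedDisableSve`, `UseSve`, `HistogramsMatch`) are hypothetical and written for illustration; the thread does not say how the PR or its tests implement this.

```cpp
#include <atomic>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical process-wide override that tests can flip to force the
// scalar path even on SVE-capable hardware.
inline std::atomic<bool>& SveDisabledForTest() {
  static std::atomic<bool> flag{false};
  return flag;
}

// RAII guard: disables SVE for the enclosing scope and restores the
// previous setting on exit, so a failing test cannot leak the override.
class ScopedDisableSve {
 public:
  ScopedDisableSve() : prev_(SveDisabledForTest().exchange(true)) {}
  ~ScopedDisableSve() { SveDisabledForTest().store(prev_); }
  ScopedDisableSve(const ScopedDisableSve&) = delete;
  ScopedDisableSve& operator=(const ScopedDisableSve&) = delete;

 private:
  bool prev_;
};

// The kernel dispatcher would consult this instead of the raw hardware flag.
inline bool UseSve(bool hw_has_sve) {
  return hw_has_sve && !SveDisabledForTest().load();
}

// Element-wise comparison of two histograms within an absolute tolerance.
inline bool HistogramsMatch(const std::vector<double>& a,
                            const std::vector<double>& b, double tol) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (std::fabs(a[i] - b[i]) > tol) return false;
  }
  return true;
}
```

A test would build the histogram once normally, once inside a `ScopedDisableSve` scope, and then compare the two results with `HistogramsMatch`.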

hcho3 · 2024-10-21T22:16:45Z

src/common/hist_util.cc

+    return cached_sve_support;
+}
+
+static int sve_enabled = check_sve_hw_support();


Is the value of a global static variable valid when accessed from multiple threads? It might be better to use thread-local storage instead.

@trivialfis Any thoughts on this topic?

I will work on it. Still learning the code.

trivialfis · 2024-10-22T06:07:50Z

> We should ensure that the CI pipeline tests XGBoost with SVE intrinsic.

Is it enabled on the CI?

rageshhajela16 · 2024-11-02T13:04:08Z

Thanks @trivialfis for your review and contributions to this implementation! Please let us know if you would like us to contribute any additional fixes based on the review comments. We would like to confirm with you before proceeding, to avoid any duplicate work. Thanks again for your time reviewing this PR. We appreciate it! cc: @divya2108

trivialfis · 2024-11-07T09:47:47Z

@hcho3 Do you think it's possible to have this in the pip wheel?

trivialfis · 2024-11-07T11:20:30Z

Could you please share the CPU you were using for the benchmarks? I ran a benchmark on a Grace machine (I work for NVIDIA) with synthetic data, and the performance is actually lower. I have verified that the row-wise kernel is being used.

My synthetic data:

n_samples: 67108864
n_features: 256

Training parameters:

64 iterations
6 max depth

Compilers:

g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

* SVE disabled

[63]    Train-rmse:18.88612
Qdm train (sec) ended in:  123.9944839477539 seconds.
Trained for 64 iterations.
{'load-batches': {'load (sec)': 7.225183725357056}, 'load-all': {'concat (sec)': 1.3589859008789062e-05}, 'Qdm': {'Train-DMatrix (sec)': 29.529566526412964, 'train (sec)': 123.9944839477539}}

* SVE enabled

[63]    Train-rmse:18.88612
Qdm train (sec) ended in:  154.86435317993164 seconds.
Trained for 64 iterations.
{'load-batches': {'load (sec)': 7.193156003952026}, 'load-all': {'concat (sec)': 1.430511474609375e-05}, 'Qdm': {'Train-DMatrix (sec)': 29.482257604599, 'train (sec)': 154.86435317993164}}

It's okay to be slower on certain platforms, we can look for a way to disable it. But I would like to get some understanding of how the performance works for your platform as well.
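To put the two training times above on a common scale, the relative slowdown can be computed as in this small sketch. `RelativeChange` is a hypothetical helper, and the constants are copied from the two logs above.

```cpp
// Relative change of `candidate` with respect to `baseline`;
// a positive result means the candidate run took longer.
inline double RelativeChange(double baseline, double candidate) {
  return (candidate - baseline) / baseline;
}

// "train (sec)" values copied from the two benchmark runs reported above.
constexpr double kSveDisabledSec = 123.9944839477539;
constexpr double kSveEnabledSec = 154.86435317993164;
```

`RelativeChange(kSveDisabledSec, kSveEnabledSec)` evaluates to roughly 0.249, meaning the SVE-enabled run took about 25% longer on this machine and dataset.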

hcho3 · 2024-11-08T01:47:58Z

> Do you think it's possible to have this in the pip wheel?

Yes, it should be possible.

divya2108 · 2024-11-08T08:25:48Z

These are the machine and dataset details which I used:

AWS Graviton3, ARM-based CPU
Dataset details: Kaggle Higgs Boson dataset (250000 samples, 32 features)
Training parameters: 120 iterations, 6 max_depth
Compiler: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

divya2108 force-pushed the sve-optimised branch from 60554bc to 5194c17 Compare August 7, 2024 07:10

trivialfis reviewed Aug 20, 2024

divya2108 force-pushed the sve-optimised branch from dca00be to f5edc42 Compare September 9, 2024 07:18

divya2108 force-pushed the sve-optimised branch 4 times, most recently from b43108b to 03298b5 Compare October 8, 2024 10:51

divya2108 added 5 commits October 17, 2024 14:27

Added SVE implementation to improve the performance on ARM architecture

1d51dae

Modified the cmake logic

8c3c1ef

Signed-off-by: divya2108 <divya.kotadiya@fujitsu.com>

Optimised code design and handled ci test failures

785c900

Resolved unit test failures

9b3a0d9

divya2108 force-pushed the sve-optimised branch from 03298b5 to 9b3a0d9 Compare October 17, 2024 08:59

hcho3 self-assigned this Oct 17, 2024

hcho3 removed their assignment Oct 18, 2024

hcho3 requested changes Oct 21, 2024

hcho3 reviewed Oct 21, 2024

Merge branch 'master' into sve-optimised

0a36b6c

trivialfis and others added 3 commits October 27, 2024 22:20

cmake variable.

fb077ca

template specialization.

7baaa70

Merge branch 'master' into sve-optimised

d492cf8
