From 100c3fab5e3cdd6aa01a7298f8589718f102ebfd Mon Sep 17 00:00:00 2001 From: Fedotova Date: Mon, 14 Oct 2024 15:08:50 +0200 Subject: [PATCH 01/13] Add initial info about CPU features dispatching --- CONTRIBUTING.md | 5 + docs/source/contribution/cpu_features.rst | 126 ++++++++++++++++++++++ docs/source/index-toc.rst | 1 + 3 files changed, 132 insertions(+) create mode 100644 docs/source/contribution/cpu_features.rst diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index d3e45c45a9d..083ff032f93 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -85,6 +85,11 @@ For your convenience we also added [coding guidelines](http://oneapi-src.github. ## Custom Components +### CPU Features Dispatching + +oneDAL provides multiarchitecture binaries that contain codes for multiple variants of CPU instruction set architectures. When run on a certain hardware type, oneDAL chooses the code path which is most suitable for this particular hardware to acheive better performance. +Contributors should leverage [CPU Features Dispatching](http://oneapi-src.github.io/oneDAL/contribution/cpu_features.html) mechanism to implement the code of the algorithms that can perform well on various hardware types. + ### Threading Layer In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form are called the [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms. diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst new file mode 100644 index 00000000000..708dfc0fb68 --- /dev/null +++ b/docs/source/contribution/cpu_features.rst @@ -0,0 +1,126 @@ +.. ****************************************************************************** +.. * Copyright contributors to the oneDAL project +.. * +.. * Licensed under the Apache License, Version 2.0 (the "License"); +.. * you may not use this file except in compliance with the License. +.. * You may obtain a copy of the License at +.. * +.. * http://www.apache.org/licenses/LICENSE-2.0 +.. * +.. * Unless required by applicable law or agreed to in writing, software +.. * distributed under the License is distributed on an "AS IS" BASIS, +.. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. * See the License for the specific language governing permissions and +.. * limitations under the License. +.. *******************************************************************************/ + +.. highlight:: cpp + +CPU Features Dispatching +^^^^^^^^^^^^^^^^^^^^^^^^ + +For each algorithm oneDAL provides several code paths for x86-64-compatibe instruction +set architectures. + +Following architectures are currently supported: +- Streaming SIMD Extensions 2 (SSE2) +- Streaming SIMD Extensions 4.2 (SSE4.2) +- Advanced Vector Extensions 2 (AVX2) +- Advanced Vector Extensions 512 (AVX-512) + +The particular code path is chosen at runtime based on the underlying hardware characteristics. + +This chapter describes how the code is organized to support this variety of instruction sets. + +Algorithm Implementation Options +******************************** + +Besides the instruction sets architecture, an algorithm in oneDAL might have various implementation +options. The description of those options is provided below for better understanding of the oneDAL +code structure and conventions. 
+ +Computational Tasks +------------------- + +An algorithm might have various tasks to compute. The most common options are: + +- `Classification https://oneapi-src.github.io/oneDAL/onedal/glossary.html#term-Classification`_, +- `Regression https://oneapi-src.github.io/oneDAL/onedal/glossary.html#term-Regression`. + +Computational Stages +-------------------- + +An algorithm might have ``training`` and ``inference`` computaion stages aimed +to train a model on the input dataset and compute the inference results respectively. + +Computational Methods +--------------------- + +An algorithm can support several methods for the same type of computations. +For example, kNN algorithm supports +`brute_force `_ +and `kd_tree `_ +methods for algorithm training and inference. + +Computational Modes +------------------- + +oneDAL can provide several computaional modes for an algorithm. +See `Computaional Modes `_ +chapter for details. + +Folders and Files +***************** + +Consider you are working on some algorithm ``Abc`` in oneDAL. + +The part of the implementation of this algorithms that is running on CPU should be located in +`cpp/daal/src/algorithms/abc` folder. + +Consider it provides: + +- ``classification`` and ``regression`` learning tasks; +- ``training`` and ``inference`` stages; +- ``method1`` and ``method2`` for the ``training`` stage and only ``method1`` for ``inference`` stage; +- only batch computational mode. + +Then the `cpp/daal/src/algorithms/abc` folder should contain at least the following files: + +| cpp/daal/src/algorithms/abc +| |-- abc_classification_predict_method1_batch_fpt_cpu.cpp +| |-- abc_classification_predict_impl.i +| |-- abc_classification_predict_kernel.h +| |-- abc_classification_train_method1_batch_fpt_cpu.cpp +| |-- abc_classification_train_method2_batch_fpt_cpu.cpp +| |-- abc_classification_train_impl.i +| |-- abc_classification_train_kernel.h +| |-- abc_regression_predict_method1_batch_fpt_cpu.cpp +| |-- abc_regression_predict_impl.i +| |-- abc_regression_predict_kernel.h +| |-- abc_regression_train_method1_batch_fpt_cpu.cpp +| |-- abc_regression_train_method2_batch_fpt_cpu.cpp +| |-- abc_regression_train_impl.i +| |-- abc_regression_train_kernel.h + +Alternative variant of the folder structure to avoid storing too much files within a single folder +can be: + +| cpp/daal/src/algorithms/abc +| |-- classification +| | |-- abc_classification_predict_method1_batch_fpt_cpu.cpp +| | |-- abc_classification_predict_impl.i +| | |-- abc_classification_predict_kernel.h +| | |-- abc_classification_train_method1_batch_fpt_cpu.cpp +| | |-- abc_classification_train_method2_batch_fpt_cpu.cpp +| | |-- abc_classification_train_impl.i +| | |-- abc_classification_train_kernel.h +| |-- regression +| | |-- abc_regression_predict_method1_batch_fpt_cpu.cpp +| | |-- abc_regression_predict_impl.i +| | |-- abc_regression_predict_kernel.h +| | |-- abc_regression_train_method1_batch_fpt_cpu.cpp +| | |-- abc_regression_train_method2_batch_fpt_cpu.cpp +| | |-- abc_regression_train_impl.i +| | |-- abc_regression_train_kernel.h + +The names of the files stay the same in this case, just the folders layout differs. 
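
As noted at the beginning of this chapter, the particular code path is chosen at runtime based on
the underlying hardware. The snippet below is only a simplified, self-contained illustration of that
general idea. It is not the actual oneDAL dispatching code: the ``sum*`` functions are made up for
the example, and the ``__builtin_cpu_supports`` builtin it relies on is specific to GCC and Clang.

::

    #include <cstddef>

    /* In a real multi-architecture build each variant would live in its own
       translation unit compiled with different architecture flags
       (for example -msse2, -mavx2, -mavx512f); here they share one scalar body. */
    static float sumSse2(const float * x, std::size_t n)
    {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i) s += x[i];
        return s;
    }
    static float sumAvx2(const float * x, std::size_t n) { return sumSse2(x, n); }
    static float sumAvx512(const float * x, std::size_t n) { return sumSse2(x, n); }

    /* Runtime dispatching: query the CPU once and call the best variant it supports. */
    float sum(const float * x, std::size_t n)
    {
        if (__builtin_cpu_supports("avx512f")) return sumAvx512(x, n);
        if (__builtin_cpu_supports("avx2")) return sumAvx2(x, n);
        return sumSse2(x, n);
    }
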
diff --git a/docs/source/index-toc.rst b/docs/source/index-toc.rst index 6ce2d33a458..89fd7600762 100644 --- a/docs/source/index-toc.rst +++ b/docs/source/index-toc.rst @@ -58,4 +58,5 @@ :hidden: :caption: Custom Components + contribution/cpu_features.rst contribution/threading.rst From 2a5eade3c07648d64e26d228007f157532cd46c5 Mon Sep 17 00:00:00 2001 From: Fedotova Date: Tue, 15 Oct 2024 11:24:20 +0200 Subject: [PATCH 02/13] Fix a typo --- docs/source/contribution/cpu_features.rst | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index 708dfc0fb68..fa8a4f2abc0 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -44,8 +44,8 @@ Computational Tasks An algorithm might have various tasks to compute. The most common options are: -- `Classification https://oneapi-src.github.io/oneDAL/onedal/glossary.html#term-Classification`_, -- `Regression https://oneapi-src.github.io/oneDAL/onedal/glossary.html#term-Regression`. +- `Classification `_, +- `Regression `_. Computational Stages -------------------- @@ -123,4 +123,6 @@ can be: | | |-- abc_regression_train_impl.i | | |-- abc_regression_train_kernel.h -The names of the files stay the same in this case, just the folders layout differs. +The names of the files stay the same in this case, just the folder layout differs. + + From cdfd793cb5638939c7dd41d485c183dd1d33821e Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Fri, 18 Oct 2024 03:32:31 -0700 Subject: [PATCH 03/13] Add code samples --- docs/source/contribution/cpu_features.rst | 164 +++++++++++++----- .../abc-classification-train-kernel.rst | 53 ++++++ ...c-classification-train-method1-fpt-cpu.rst | 31 ++++ .../abc-classification-train-method1-impl.rst | 42 +++++ .../abc-classification-train-method2-impl.rst | 71 ++++++++ 5 files changed, 322 insertions(+), 39 deletions(-) create mode 100644 docs/source/includes/cpu_features/abc-classification-train-kernel.rst create mode 100644 docs/source/includes/cpu_features/abc-classification-train-method1-fpt-cpu.rst create mode 100644 docs/source/includes/cpu_features/abc-classification-train-method1-impl.rst create mode 100644 docs/source/includes/cpu_features/abc-classification-train-method2-impl.rst diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index fa8a4f2abc0..7399e0de342 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -23,10 +23,11 @@ For each algorithm oneDAL provides several code paths for x86-64-compatibe instr set architectures. Following architectures are currently supported: -- Streaming SIMD Extensions 2 (SSE2) -- Streaming SIMD Extensions 4.2 (SSE4.2) -- Advanced Vector Extensions 2 (AVX2) -- Advanced Vector Extensions 512 (AVX-512) + +- Intel |reg| Streaming SIMD Extensions 2 (Intel |reg| SSE2) +- Intel |reg| Streaming SIMD Extensions 4.2 (Intel |reg| SSE4.2) +- Intel |reg| Advanced Vector Extensions 2 (Intel |reg| AVX2) +- Intel |reg| Advanced Vector Extensions 512 (Intel |reg| AVX-512) The particular code path is chosen at runtime based on the underlying hardware characteristics. @@ -35,9 +36,9 @@ This chapter describes how the code is organized to support this variety of inst Algorithm Implementation Options ******************************** -Besides the instruction sets architecture, an algorithm in oneDAL might have various implementation -options. 
The description of those options is provided below for better understanding of the oneDAL -code structure and conventions. +In addition to the instruction set architectures, an algorithm in oneDAL may have various +implementation options. Below is a description of these options to help you better understand +the oneDAL code structure and conventions. Computational Tasks ------------------- @@ -86,43 +87,128 @@ Consider it provides: Then the `cpp/daal/src/algorithms/abc` folder should contain at least the following files: -| cpp/daal/src/algorithms/abc -| |-- abc_classification_predict_method1_batch_fpt_cpu.cpp -| |-- abc_classification_predict_impl.i -| |-- abc_classification_predict_kernel.h -| |-- abc_classification_train_method1_batch_fpt_cpu.cpp -| |-- abc_classification_train_method2_batch_fpt_cpu.cpp -| |-- abc_classification_train_impl.i -| |-- abc_classification_train_kernel.h -| |-- abc_regression_predict_method1_batch_fpt_cpu.cpp -| |-- abc_regression_predict_impl.i -| |-- abc_regression_predict_kernel.h -| |-- abc_regression_train_method1_batch_fpt_cpu.cpp -| |-- abc_regression_train_method2_batch_fpt_cpu.cpp -| |-- abc_regression_train_impl.i -| |-- abc_regression_train_kernel.h +:: + + cpp/daal/src/algorithms/abc/ + |-- abc_classification_predict_method1_batch_fpt_cpu.cpp + |-- abc_classification_predict_method1_impl.i + |-- abc_classification_predict_kernel.h + |-- abc_classification_train_method1_batch_fpt_cpu.cpp + |-- abc_classification_train_method2_batch_fpt_cpu.cpp + |-- abc_classification_train_method1_impl.i + |-- abc_classification_train_method2_impl.i + |-- abc_classification_train_kernel.h + |-- abc_regression_predict_method1_batch_fpt_cpu.cpp + |-- abc_regression_predict_method1_batch_fpt_cpu.cpp + |-- abc_regression_predict_method1_impl.i + |-- abc_regression_predict_kernel.h + |-- abc_regression_train_method1_batch_fpt_cpu.cpp + |-- abc_regression_train_method2_batch_fpt_cpu.cpp + |-- abc_regression_train_method1_impl.i + |-- abc_regression_train_method2_impl.i + |-- abc_regression_train_kernel.h Alternative variant of the folder structure to avoid storing too much files within a single folder can be: -| cpp/daal/src/algorithms/abc -| |-- classification -| | |-- abc_classification_predict_method1_batch_fpt_cpu.cpp -| | |-- abc_classification_predict_impl.i -| | |-- abc_classification_predict_kernel.h -| | |-- abc_classification_train_method1_batch_fpt_cpu.cpp -| | |-- abc_classification_train_method2_batch_fpt_cpu.cpp -| | |-- abc_classification_train_impl.i -| | |-- abc_classification_train_kernel.h -| |-- regression -| | |-- abc_regression_predict_method1_batch_fpt_cpu.cpp -| | |-- abc_regression_predict_impl.i -| | |-- abc_regression_predict_kernel.h -| | |-- abc_regression_train_method1_batch_fpt_cpu.cpp -| | |-- abc_regression_train_method2_batch_fpt_cpu.cpp -| | |-- abc_regression_train_impl.i -| | |-- abc_regression_train_kernel.h +:: + + cpp/daal/src/algorithms/abc/ + |-- classification/ + | |-- abc_classification_predict_method1_batch_fpt_cpu.cpp + | |-- abc_classification_predict_method1_impl.i + | |-- abc_classification_predict_kernel.h + | |-- abc_classification_train_method1_batch_fpt_cpu.cpp + | |-- abc_classification_train_method2_batch_fpt_cpu.cpp + | |-- abc_classification_train_method1_impl.i + | |-- abc_classification_train_method2_impl.i + | |-- abc_classification_train_kernel.h + |-- regression/ + |-- abc_regression_predict_method1_batch_fpt_cpu.cpp + |-- abc_regression_predict_method1_impl.i + |-- abc_regression_predict_kernel.h + |-- 
abc_regression_train_method1_batch_fpt_cpu.cpp + |-- abc_regression_train_method2_batch_fpt_cpu.cpp + |-- abc_regression_train_method1_impl.i + |-- abc_regression_train_method2_impl.i + |-- abc_regression_train_kernel.h + The names of the files stay the same in this case, just the folder layout differs. +Further the purpose and contents of each file are to be described on the example of classification +training task. For other types of the tasks the structure of the code is similar. + +\*_kernel.h +----------- + +Those files contain the definitions of one or several template classes that define member functions that +do the actual computations. Here is a variant of the ``Abc`` training algorithm kernel definition in the file +`abc_classification_train_kernel.h`: + +.. include:: ../includes/cpu_features/abc-classification-train-kernel.rst + +Typical template parameters are: + +- ``algorithmFPType`` Data type to use in intermediate computations for the algorithm, + ``float`` or ``double``. +- ``method`` Computational methods of the algorithm. ``method1`` or ``method2`` in the case of ``Abc``. +- ``cpu`` Version of the cpu-specific implementation of the algorithm, ``daal::CpuType``. + +Implementations for different methods are usually defined usind partial class templates specialization. + +\*_impl.i +--------- + +Those files contain the implementations of the computational functions defined in `*_kernel.h` files. +Here is a variant of ``method1`` imlementation for ``Abc`` training algorithm that does not contain any +instruction set specific code. The implementation is located in the file `abc_classification_train_method1_impl.i`: + +.. include:: ../includes/cpu_features/abc-classification-train-method1-impl.rst + +Although the implementation of the ``method1`` does not contain any instruction set specific code, it is +expected that the developers leverage SIMD related macros available in oneDAL. +For example, ``PRAGMA_IVDEP``, ``PRAGMA_VECTOR_ALWAYS``, ``PRAGMA_VECTOR_ALIGNED`` and others pragmas defined in +`service_defines.h `_. +This will guide the compiler to generate more efficient code for the target architecture. + +Consider that the implementation of the ``method2`` for the same algorithm will be different and will contain +AVX-512-specific code located in ``cpuSpecificCode`` function. +Then the implementation of the ``method2`` in the file `abc_classification_train_method2_impl.i` will look like: + +.. include:: ../includes/cpu_features/abc-classification-train-method2-impl.rst + +CPU-specific code needs to be placed under compiler-specific and CPU-specific defines because it usually +contains intrinsics that cannot be compiled on other architectures. + +\*_fpt_cpu.cpp +-------------- + +Those files contain the instantiations of the template classes defined in `*_kernel.h` files. +The instatiation of the ``Abc`` training algorithm kernel for ``method1`` is located in the file +`abc_classification_train_method1_batch_fpt_cpu.cpp`: + +.. include:: ../includes/cpu_features/abc-classification-train-method1-fpt-cpu.rst + +`_fpt_cpu.cpp` files are not compiled directly into object files. First, multiple copies of those files +are made raplacing the ``fpt`` and ``cpu`` parts of the file name as well as the corresponding ``DAAL_FPTYPE`` and +``DAAL_CPU`` macros with the actual data type and CPU type values. Then the resulting files are compiled +with appropriate CPU-specific optimization compiler options. 
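
For illustration, with the hypothetical ``Abc`` algorithm used throughout this chapter, a single
`_fpt_cpu.cpp` source expands into one copy per combination of data type and CPU, named with the
replacement values listed below (the exact set of generated copies is defined by the build
configuration), for example:

::

    abc_classification_train_method1_batch_fpt_cpu.cpp
        -> abc_classification_train_method1_batch_flt_nrh.cpp    (float,  SSE2 code path)
        -> abc_classification_train_method1_batch_dbl_hsw.cpp    (double, AVX2 code path)
        -> abc_classification_train_method1_batch_flt_skx.cpp    (float,  AVX-512 code path)
        -> ...
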
+ +The values for ``fpt`` file name part replacement are: +- ``flt`` for ``float`` data type, and +- ``dbl`` for ``double`` data type. + +The values for ``DAAL_FPTYPE`` macro replacement are ``float`` and ``double`` respectively. + +The values for ``cpu`` file name part replacement are: +- ``nrh`` for Intel |reg| SSE2 architecture, which stands for Northwood, +- ``neh`` for Intel |reg| SSE4.2 architecture, which stands for Nehalem, +- ``hsw`` for Intel |reg| AVX2 architecture, which stands for Haswell, +- ``skx`` for Intel |reg| AVX-512 architecture, which stands for Skylake-X. +The values for ``DAAL_CPU`` macro replacement are: +- ``sse2`` for Intel |reg| SSE2 architecture, +- ``sse42`` for Intel |reg| SSE4.2 architecture, +- ``avx2`` for Intel |reg| AVX2 architecture, +- ``avx512`` for Intel |reg| AVX-512 architecture. diff --git a/docs/source/includes/cpu_features/abc-classification-train-kernel.rst b/docs/source/includes/cpu_features/abc-classification-train-kernel.rst new file mode 100644 index 00000000000..0d0f369ffc3 --- /dev/null +++ b/docs/source/includes/cpu_features/abc-classification-train-kernel.rst @@ -0,0 +1,53 @@ +.. ****************************************************************************** +.. * Copyright contributors to the oneDAL project +.. * +.. * Licensed under the Apache License, Version 2.0 (the "License"); +.. * you may not use this file except in compliance with the License. +.. * You may obtain a copy of the License at +.. * +.. * http://www.apache.org/licenses/LICENSE-2.0 +.. * +.. * Unless required by applicable law or agreed to in writing, software +.. * distributed under the License is distributed on an "AS IS" BASIS, +.. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. * See the License for the specific language governing permissions and +.. * limitations under the License. +.. *******************************************************************************/ + +:: + + #ifndef __ABC_CLASSIFICATION_TRAIN_KERNEL_H__ + #define __ABC_CLASSIFICATION_TRAIN_KERNEL_H__ + + #include "src/algorithms/kernel.h" + #include "data_management/data/numeric_table.h" // NumericTable class + /* Other necessary includes go here */ + + using namespace daal::data_management; // NumericTable class + + namespace daal::algorithms::abc::training::internal + { + /* Dummy base template class */ + template + class AbcClassificationTrainingKernel : public Kernel + {}; + + /* Computational kernel for 'method1' of the Abc training algoirthm */ + template + class AbcClassificationTrainingKernel : public Kernel + { + public: + services::Status compute(/* Input and output arguments for the 'method1' */); + }; + + /* Computational kernel for 'method2' of the Abc training algoirthm */ + template + class AbcClassificationTrainingKernel : public Kernel + { + public: + services::Status compute(/* Input and output arguments for the 'method2' */); + }; + + } // namespace daal::algorithms::abc::training::internal + + #endif // __ABC_CLASSIFICATION_TRAIN_KERNEL_H__ diff --git a/docs/source/includes/cpu_features/abc-classification-train-method1-fpt-cpu.rst b/docs/source/includes/cpu_features/abc-classification-train-method1-fpt-cpu.rst new file mode 100644 index 00000000000..c4d3facd090 --- /dev/null +++ b/docs/source/includes/cpu_features/abc-classification-train-method1-fpt-cpu.rst @@ -0,0 +1,31 @@ +.. ****************************************************************************** +.. * Copyright contributors to the oneDAL project +.. * +.. 
* Licensed under the Apache License, Version 2.0 (the "License"); +.. * you may not use this file except in compliance with the License. +.. * You may obtain a copy of the License at +.. * +.. * http://www.apache.org/licenses/LICENSE-2.0 +.. * +.. * Unless required by applicable law or agreed to in writing, software +.. * distributed under the License is distributed on an "AS IS" BASIS, +.. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. * See the License for the specific language governing permissions and +.. * limitations under the License. +.. *******************************************************************************/ + +:: + + /* + //++ + // instantiations of method1 of the Abc training algorithm. + //-- + */ + + #include "src/algorithms/abc/abc_classification_train_kernel.h" + #include "src/algorithms/abc/abc_classification_train_method1_impl.i" + + namespace daal::algorithms::abc::training::internal + { + template class DAAL_EXPORT AbcClassificationTrainingKernel; + } // namespace daal::algorithms::abc::training::internal diff --git a/docs/source/includes/cpu_features/abc-classification-train-method1-impl.rst b/docs/source/includes/cpu_features/abc-classification-train-method1-impl.rst new file mode 100644 index 00000000000..53368f595c9 --- /dev/null +++ b/docs/source/includes/cpu_features/abc-classification-train-method1-impl.rst @@ -0,0 +1,42 @@ +.. ****************************************************************************** +.. * Copyright contributors to the oneDAL project +.. * +.. * Licensed under the Apache License, Version 2.0 (the "License"); +.. * you may not use this file except in compliance with the License. +.. * You may obtain a copy of the License at +.. * +.. * http://www.apache.org/licenses/LICENSE-2.0 +.. * +.. * Unless required by applicable law or agreed to in writing, software +.. * distributed under the License is distributed on an "AS IS" BASIS, +.. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. * See the License for the specific language governing permissions and +.. * limitations under the License. +.. *******************************************************************************/ + +:: + + /* + //++ + // Implementation of Abc training algorithm. + //-- + */ + + #include "src/algorithms/service_error_handling.h" + #include "src/data_management/service_numeric_table.h" + + namespace daal::algorithms::abc::training::internal + { + + template + services::Status AbcClassificationTrainingKernel::compute(/* ... */) + { + services::Status status; + + /* Implementation that does not contain instruction set specific code */ + + return status; + } + + + } // namespace daal::algorithms::abc::training::internal diff --git a/docs/source/includes/cpu_features/abc-classification-train-method2-impl.rst b/docs/source/includes/cpu_features/abc-classification-train-method2-impl.rst new file mode 100644 index 00000000000..36df55ce56e --- /dev/null +++ b/docs/source/includes/cpu_features/abc-classification-train-method2-impl.rst @@ -0,0 +1,71 @@ +.. ****************************************************************************** +.. * Copyright contributors to the oneDAL project +.. * +.. * Licensed under the Apache License, Version 2.0 (the "License"); +.. * you may not use this file except in compliance with the License. +.. * You may obtain a copy of the License at +.. * +.. * http://www.apache.org/licenses/LICENSE-2.0 +.. * +.. * Unless required by applicable law or agreed to in writing, software +.. 
* distributed under the License is distributed on an "AS IS" BASIS, +.. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. * See the License for the specific language governing permissions and +.. * limitations under the License. +.. *******************************************************************************/ + +:: + + /* + //++ + // Implementation of Abc training algorithm. + //-- + */ + + #include "src/algorithms/service_error_handling.h" + #include "src/data_management/service_numeric_table.h" + + namespace daal::algorithms::abc::training::internal + { + + /* Generic template implementation of cpuSpecificCode function for all data types + and various instruction set architectures */ + template + services::Status cpuSpecificCode(/* arguments */) + { + /* Implementation */ + }; + + #if defined(DAAL_INTEL_CPP_COMPILER) && (__CPUID__(DAAL_CPU) == __avx512__) + + /* Specialization of cpuSpecificCode function for double data type and Intel(R) AVX-512 instruction set */ + template <> + services::Status cpuSpecificCode(/* arguments */) + { + /* Implementation */ + }; + + /* Specialization of cpuSpecificCode function for float data type and Intel(R) AVX-512 instruction set */ + template <> + services::Status cpuSpecificCode(/* arguments */) + { + /* Implementation */ + }; + + #endif // DAAL_INTEL_CPP_COMPILER && (__CPUID__(DAAL_CPU) == __avx512__) + + template + services::Status AbcClassificationTrainingKernel::compute(/* arguments */) + { + services::Status status; + + /* Implementation that calls CPU-specific code: */ + status = cpuSpecificCode(/* ... */); + DAAL_CHECK_STATUS_VAR(status); + + /* Implementation continues */ + + return status; + } + + } // namespace daal::algorithms::abc::training::internal From 80f1cc97987c4e7f18138991151909811ca60bdc Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Fri, 18 Oct 2024 13:56:44 +0200 Subject: [PATCH 04/13] Fix a typo in CONTRIBUTING.md Co-authored-by: david-cortes-intel --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 083ff032f93..f47c8573727 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -87,7 +87,7 @@ For your convenience we also added [coding guidelines](http://oneapi-src.github. ### CPU Features Dispatching -oneDAL provides multiarchitecture binaries that contain codes for multiple variants of CPU instruction set architectures. When run on a certain hardware type, oneDAL chooses the code path which is most suitable for this particular hardware to acheive better performance. +oneDAL provides multiarchitecture binaries that contain codes for multiple variants of CPU instruction set architectures. When run on a certain hardware type, oneDAL chooses the code path which is most suitable for this particular hardware to achieve better performance. Contributors should leverage [CPU Features Dispatching](http://oneapi-src.github.io/oneDAL/contribution/cpu_features.html) mechanism to implement the code of the algorithms that can perform well on various hardware types. 
### Threading Layer From 7142175552e91571de7409b2f5df96a054a71248 Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Mon, 21 Oct 2024 01:48:10 -0700 Subject: [PATCH 05/13] Fix typos --- docs/source/contribution/cpu_features.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index 7399e0de342..c58625daf90 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -51,8 +51,8 @@ An algorithm might have various tasks to compute. The most common options are: Computational Stages -------------------- -An algorithm might have ``training`` and ``inference`` computaion stages aimed -to train a model on the input dataset and compute the inference results respectively. +An algorithm might have ``training`` and ``inference`` computation stages aimed +at training a model on the input dataset and computing the inference results, respectively. Computational Methods --------------------- @@ -73,12 +73,12 @@ chapter for details. Folders and Files ***************** -Consider you are working on some algorithm ``Abc`` in oneDAL. +Suppose that you are working on some algorithm ``Abc`` in oneDAL. The part of the implementation of this algorithms that is running on CPU should be located in `cpp/daal/src/algorithms/abc` folder. -Consider it provides: +Suppose that it provides: - ``classification`` and ``regression`` learning tasks; - ``training`` and ``inference`` stages; @@ -108,8 +108,8 @@ Then the `cpp/daal/src/algorithms/abc` folder should contain at least the follow |-- abc_regression_train_method2_impl.i |-- abc_regression_train_kernel.h -Alternative variant of the folder structure to avoid storing too much files within a single folder -can be: +Alternative variant of the folder structure to avoid storing too many files within a single folder +could be: :: @@ -191,7 +191,7 @@ The instatiation of the ``Abc`` training algorithm kernel for ``method1`` is loc .. include:: ../includes/cpu_features/abc-classification-train-method1-fpt-cpu.rst `_fpt_cpu.cpp` files are not compiled directly into object files. First, multiple copies of those files -are made raplacing the ``fpt`` and ``cpu`` parts of the file name as well as the corresponding ``DAAL_FPTYPE`` and +are made replacing the ``fpt`` and ``cpu`` parts of the file name as well as the corresponding ``DAAL_FPTYPE`` and ``DAAL_CPU`` macros with the actual data type and CPU type values. Then the resulting files are compiled with appropriate CPU-specific optimization compiler options. From 4e93be0bff896ddffbcf26aae27389e36458439e Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Mon, 21 Oct 2024 02:00:58 -0700 Subject: [PATCH 06/13] Add clarification about 'fpt' abbreviation meaning --- docs/source/contribution/cpu_features.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index c58625daf90..2d6afac7dd0 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -191,15 +191,15 @@ The instatiation of the ``Abc`` training algorithm kernel for ``method1`` is loc .. include:: ../includes/cpu_features/abc-classification-train-method1-fpt-cpu.rst `_fpt_cpu.cpp` files are not compiled directly into object files. 
First, multiple copies of those files -are made replacing the ``fpt`` and ``cpu`` parts of the file name as well as the corresponding ``DAAL_FPTYPE`` and -``DAAL_CPU`` macros with the actual data type and CPU type values. Then the resulting files are compiled -with appropriate CPU-specific optimization compiler options. +are made replacing the ``fpt``, which stands for 'floating point type', and ``cpu`` parts of the file name +as well as the corresponding ``DAAL_FPTYPE`` and ``DAAL_CPU`` macros with the actual data type and CPU type values. +Then the resulting files are compiled with appropriate CPU-specific optimization compiler options. The values for ``fpt`` file name part replacement are: - ``flt`` for ``float`` data type, and - ``dbl`` for ``double`` data type. -The values for ``DAAL_FPTYPE`` macro replacement are ``float`` and ``double`` respectively. +The values for ``DAAL_FPTYPE`` macro replacement are ``float`` and ``double``, respectively. The values for ``cpu`` file name part replacement are: - ``nrh`` for Intel |reg| SSE2 architecture, which stands for Northwood, From 9b31d1087af2f51cfd91d8f4e9f3bd4a9e62dc1a Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Wed, 23 Oct 2024 01:44:59 -0700 Subject: [PATCH 07/13] Add information about compiler-specific and CPU-specific macros --- docs/source/contribution/cpu_features.rst | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index 2d6afac7dd0..0e29c814fb7 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -155,7 +155,7 @@ Typical template parameters are: - ``method`` Computational methods of the algorithm. ``method1`` or ``method2`` in the case of ``Abc``. - ``cpu`` Version of the cpu-specific implementation of the algorithm, ``daal::CpuType``. -Implementations for different methods are usually defined usind partial class templates specialization. +Implementations for different methods are usually defined using partial class templates specialization. \*_impl.i --------- @@ -173,7 +173,12 @@ For example, ``PRAGMA_IVDEP``, ``PRAGMA_VECTOR_ALWAYS``, ``PRAGMA_VECTOR_ALIGNED This will guide the compiler to generate more efficient code for the target architecture. Consider that the implementation of the ``method2`` for the same algorithm will be different and will contain -AVX-512-specific code located in ``cpuSpecificCode`` function. +AVX-512-specific code located in ``cpuSpecificCode`` function. Note that all the compiler-specific code should +be placed under compiler-specific defines. For example, the Intel |reg| oneAPI DPC++/C++ Compiler specific code +should be placed under ``DAAL_INTEL_CPP_COMPILER`` define. All the CPU-specific code should be placed under +CPU-specific defines. For example, the AVX-512 specific code should be placed under +``__CPUID__(DAAL_CPU) == __avx512__``. + Then the implementation of the ``method2`` in the file `abc_classification_train_method2_impl.i` will look like: .. include:: ../includes/cpu_features/abc-classification-train-method2-impl.rst @@ -208,7 +213,7 @@ The values for ``cpu`` file name part replacement are: - ``skx`` for Intel |reg| AVX-512 architecture, which stands for Skylake-X. 
The values for ``DAAL_CPU`` macro replacement are: -- ``sse2`` for Intel |reg| SSE2 architecture, -- ``sse42`` for Intel |reg| SSE4.2 architecture, -- ``avx2`` for Intel |reg| AVX2 architecture, -- ``avx512`` for Intel |reg| AVX-512 architecture. +- ``__sse2__`` for Intel |reg| SSE2 architecture, +- ``__sse42__`` for Intel |reg| SSE4.2 architecture, +- ``__avx2__`` for Intel |reg| AVX2 architecture, +- ``__avx512__`` for Intel |reg| AVX-512 architecture. From af7669a2291b6699d2449d7d8e158a408313b713 Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Wed, 23 Oct 2024 02:35:42 -0700 Subject: [PATCH 08/13] HTML rendering fixes --- docs/source/contribution/cpu_features.rst | 35 +++++++++++------------ 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index 0e29c814fb7..030803fc1b7 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -24,10 +24,10 @@ set architectures. Following architectures are currently supported: -- Intel |reg| Streaming SIMD Extensions 2 (Intel |reg| SSE2) -- Intel |reg| Streaming SIMD Extensions 4.2 (Intel |reg| SSE4.2) -- Intel |reg| Advanced Vector Extensions 2 (Intel |reg| AVX2) -- Intel |reg| Advanced Vector Extensions 512 (Intel |reg| AVX-512) +- Intel\ |reg| Streaming SIMD Extensions 2 (Intel\ |reg| SSE2) +- Intel\ |reg| Streaming SIMD Extensions 4.2 (Intel\ |reg| SSE4.2) +- Intel\ |reg| Advanced Vector Extensions 2 (Intel\ |reg| AVX2) +- Intel\ |reg| Advanced Vector Extensions 512 (Intel\ |reg| AVX-512) The particular code path is chosen at runtime based on the underlying hardware characteristics. @@ -83,7 +83,7 @@ Suppose that it provides: - ``classification`` and ``regression`` learning tasks; - ``training`` and ``inference`` stages; - ``method1`` and ``method2`` for the ``training`` stage and only ``method1`` for ``inference`` stage; -- only batch computational mode. +- only ``batch`` computational mode. Then the `cpp/daal/src/algorithms/abc` folder should contain at least the following files: @@ -133,7 +133,6 @@ could be: |-- abc_regression_train_method2_impl.i |-- abc_regression_train_kernel.h - The names of the files stay the same in this case, just the folder layout differs. Further the purpose and contents of each file are to be described on the example of classification @@ -174,7 +173,7 @@ This will guide the compiler to generate more efficient code for the target arch Consider that the implementation of the ``method2`` for the same algorithm will be different and will contain AVX-512-specific code located in ``cpuSpecificCode`` function. Note that all the compiler-specific code should -be placed under compiler-specific defines. For example, the Intel |reg| oneAPI DPC++/C++ Compiler specific code +be placed under compiler-specific defines. For example, the Intel\ |reg| oneAPI DPC++/C++ Compiler specific code should be placed under ``DAAL_INTEL_CPP_COMPILER`` define. All the CPU-specific code should be placed under CPU-specific defines. For example, the AVX-512 specific code should be placed under ``__CPUID__(DAAL_CPU) == __avx512__``. @@ -183,9 +182,6 @@ Then the implementation of the ``method2`` in the file `abc_classification_train .. include:: ../includes/cpu_features/abc-classification-train-method2-impl.rst -CPU-specific code needs to be placed under compiler-specific and CPU-specific defines because it usually -contains intrinsics that cannot be compiled on other architectures. 
- \*_fpt_cpu.cpp -------------- @@ -201,19 +197,22 @@ as well as the corresponding ``DAAL_FPTYPE`` and ``DAAL_CPU`` macros with the ac Then the resulting files are compiled with appropriate CPU-specific optimization compiler options. The values for ``fpt`` file name part replacement are: + - ``flt`` for ``float`` data type, and - ``dbl`` for ``double`` data type. The values for ``DAAL_FPTYPE`` macro replacement are ``float`` and ``double``, respectively. The values for ``cpu`` file name part replacement are: -- ``nrh`` for Intel |reg| SSE2 architecture, which stands for Northwood, -- ``neh`` for Intel |reg| SSE4.2 architecture, which stands for Nehalem, -- ``hsw`` for Intel |reg| AVX2 architecture, which stands for Haswell, -- ``skx`` for Intel |reg| AVX-512 architecture, which stands for Skylake-X. + +- ``nrh`` for Intel\ |reg| SSE2 architecture, which stands for Northwood, +- ``neh`` for Intel\ |reg| SSE4.2 architecture, which stands for Nehalem, +- ``hsw`` for Intel\ |reg| AVX2 architecture, which stands for Haswell, +- ``skx`` for Intel\ |reg| AVX-512 architecture, which stands for Skylake-X. The values for ``DAAL_CPU`` macro replacement are: -- ``__sse2__`` for Intel |reg| SSE2 architecture, -- ``__sse42__`` for Intel |reg| SSE4.2 architecture, -- ``__avx2__`` for Intel |reg| AVX2 architecture, -- ``__avx512__`` for Intel |reg| AVX-512 architecture. + +- ``__sse2__`` for Intel\ |reg| SSE2 architecture, +- ``__sse42__`` for Intel\ |reg| SSE4.2 architecture, +- ``__avx2__`` for Intel\ |reg| AVX2 architecture, +- ``__avx512__`` for Intel\ |reg| AVX-512 architecture. From 1ac2d3bebd191f4b9a490992236f80cb6aadf524 Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Wed, 23 Oct 2024 02:53:51 -0700 Subject: [PATCH 09/13] Replace oneDAL with |short_name| to align with other .rst files --- docs/source/contribution/cpu_features.rst | 38 +++++++++++------------ docs/source/contribution/threading.rst | 26 ++++++++-------- 2 files changed, 32 insertions(+), 32 deletions(-) diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index 030803fc1b7..6b3daa44d93 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -19,15 +19,15 @@ CPU Features Dispatching ^^^^^^^^^^^^^^^^^^^^^^^^ -For each algorithm oneDAL provides several code paths for x86-64-compatibe instruction +For each algorithm |short_name| provides several code paths for x86-64-compatibe instruction set architectures. Following architectures are currently supported: -- Intel\ |reg| Streaming SIMD Extensions 2 (Intel\ |reg| SSE2) -- Intel\ |reg| Streaming SIMD Extensions 4.2 (Intel\ |reg| SSE4.2) -- Intel\ |reg| Advanced Vector Extensions 2 (Intel\ |reg| AVX2) -- Intel\ |reg| Advanced Vector Extensions 512 (Intel\ |reg| AVX-512) +- Intel\ |reg|\ Streaming SIMD Extensions 2 (Intel\ |reg|\ SSE2) +- Intel\ |reg|\ Streaming SIMD Extensions 4.2 (Intel\ |reg|\ SSE4.2) +- Intel\ |reg|\ Advanced Vector Extensions 2 (Intel\ |reg|\ AVX2) +- Intel\ |reg|\ Advanced Vector Extensions 512 (Intel\ |reg|\ AVX-512) The particular code path is chosen at runtime based on the underlying hardware characteristics. 
@@ -36,9 +36,9 @@ This chapter describes how the code is organized to support this variety of inst Algorithm Implementation Options ******************************** -In addition to the instruction set architectures, an algorithm in oneDAL may have various +In addition to the instruction set architectures, an algorithm in |short_name| may have various implementation options. Below is a description of these options to help you better understand -the oneDAL code structure and conventions. +the |short_name| code structure and conventions. Computational Tasks ------------------- @@ -66,14 +66,14 @@ methods for algorithm training and inference. Computational Modes ------------------- -oneDAL can provide several computaional modes for an algorithm. +|short_name| can provide several computaional modes for an algorithm. See `Computaional Modes `_ chapter for details. Folders and Files ***************** -Suppose that you are working on some algorithm ``Abc`` in oneDAL. +Suppose that you are working on some algorithm ``Abc`` in |short_name|. The part of the implementation of this algorithms that is running on CPU should be located in `cpp/daal/src/algorithms/abc` folder. @@ -166,14 +166,14 @@ instruction set specific code. The implementation is located in the file `abc_cl .. include:: ../includes/cpu_features/abc-classification-train-method1-impl.rst Although the implementation of the ``method1`` does not contain any instruction set specific code, it is -expected that the developers leverage SIMD related macros available in oneDAL. +expected that the developers leverage SIMD related macros available in |short_name|. For example, ``PRAGMA_IVDEP``, ``PRAGMA_VECTOR_ALWAYS``, ``PRAGMA_VECTOR_ALIGNED`` and others pragmas defined in `service_defines.h `_. This will guide the compiler to generate more efficient code for the target architecture. Consider that the implementation of the ``method2`` for the same algorithm will be different and will contain AVX-512-specific code located in ``cpuSpecificCode`` function. Note that all the compiler-specific code should -be placed under compiler-specific defines. For example, the Intel\ |reg| oneAPI DPC++/C++ Compiler specific code +be placed under compiler-specific defines. For example, the Intel\ |reg|\ oneAPI DPC++/C++ Compiler specific code should be placed under ``DAAL_INTEL_CPP_COMPILER`` define. All the CPU-specific code should be placed under CPU-specific defines. For example, the AVX-512 specific code should be placed under ``__CPUID__(DAAL_CPU) == __avx512__``. @@ -205,14 +205,14 @@ The values for ``DAAL_FPTYPE`` macro replacement are ``float`` and ``double``, r The values for ``cpu`` file name part replacement are: -- ``nrh`` for Intel\ |reg| SSE2 architecture, which stands for Northwood, -- ``neh`` for Intel\ |reg| SSE4.2 architecture, which stands for Nehalem, -- ``hsw`` for Intel\ |reg| AVX2 architecture, which stands for Haswell, -- ``skx`` for Intel\ |reg| AVX-512 architecture, which stands for Skylake-X. +- ``nrh`` for Intel\ |reg|\ SSE2 architecture, which stands for Northwood, +- ``neh`` for Intel\ |reg|\ SSE4.2 architecture, which stands for Nehalem, +- ``hsw`` for Intel\ |reg|\ AVX2 architecture, which stands for Haswell, +- ``skx`` for Intel\ |reg|\ AVX-512 architecture, which stands for Skylake-X. 
The values for ``DAAL_CPU`` macro replacement are: -- ``__sse2__`` for Intel\ |reg| SSE2 architecture, -- ``__sse42__`` for Intel\ |reg| SSE4.2 architecture, -- ``__avx2__`` for Intel\ |reg| AVX2 architecture, -- ``__avx512__`` for Intel\ |reg| AVX-512 architecture. +- ``__sse2__`` for Intel\ |reg|\ SSE2 architecture, +- ``__sse42__`` for Intel\ |reg|\ SSE4.2 architecture, +- ``__avx2__`` for Intel\ |reg|\ AVX2 architecture, +- ``__avx512__`` for Intel\ |reg|\ AVX-512 architecture. diff --git a/docs/source/contribution/threading.rst b/docs/source/contribution/threading.rst index cd1acd84e95..6233bc0e813 100644 --- a/docs/source/contribution/threading.rst +++ b/docs/source/contribution/threading.rst @@ -19,20 +19,20 @@ Threading Layer ^^^^^^^^^^^^^^^ -oneDAL uses Intel\ |reg|\ oneAPI Threading Building Blocks (Intel\ |reg|\ oneTBB) to do parallel +|short_name| uses Intel\ |reg|\ oneAPI Threading Building Blocks (Intel\ |reg|\ oneTBB) to do parallel computations on CPU. -But oneTBB is not used in the code of oneDAL algorithms directly. The algorithms rather +But oneTBB is not used in the code of |short_name| algorithms directly. The algorithms rather use custom primitives that either wrap oneTBB functionality or are in-house developed. -Those primitives form oneDAL's threading layer. +Those primitives form |short_name|'s threading layer. This is done in order not to be dependent on possible oneTBB API changes and even on the particular threading technology like oneTBB, C++11 standard threads, etc. The API of the layer is defined in `threading.h `_. -Please be aware that the threading API is not a part of oneDAL product API. -This is the product internal API that aimed to be used only by oneDAL developers, and can be changed at any time +Please be aware that the threading API is not a part of |short_name| product API. +This is the product internal API that aimed to be used only by |short_name| developers, and can be changed at any time without any prior notification. This chapter describes common parallel patterns and primitives of the threading layer. @@ -46,7 +46,7 @@ Here is a variant of sequential implementation: .. include:: ../includes/threading/sum-sequential.rst -There are several options available in the threading layer of oneDAL to let the iterations of this code +There are several options available in the threading layer of |short_name| to let the iterations of this code run in parallel. One of the options is to use ``daal::threader_for`` as shown here: @@ -59,10 +59,10 @@ Blocking -------- To have more control over the parallel execution and to increase -`cache locality `_ oneDAL usually splits +`cache locality `_ |short_name| usually splits the data into blocks and then processes those blocks in parallel. -This code shows how a typical parallel loop in oneDAL looks like: +This code shows how a typical parallel loop in |short_name| looks like: .. include:: ../includes/threading/sum-parallel-by-blocks.rst @@ -92,7 +92,7 @@ Checking the status right after the initialization code won't show the allocatio because oneTBB uses lazy evaluation and the lambda function passed to the constructor of the TLS is evaluated on first use of the thread-local storage (TLS). -There are several options available in the threading layer of oneDAL to compute the partial +There are several options available in the threading layer of |short_name| to compute the partial dot product results at each thread. 
One of the options is to use the already mentioned ``daal::threader_for`` and blocking approach as shown here: @@ -126,7 +126,7 @@ is more performant to use predefined mapping of the loop's iterations to threads This is what static work scheduling does. ``daal::static_threader_for`` and ``daal::static_tls`` allow implementation of static -work scheduling within oneDAL. +work scheduling within |short_name|. Here is a variant of parallel dot product computation with static scheduling: @@ -135,7 +135,7 @@ Here is a variant of parallel dot product computation with static scheduling: Nested Parallelism ****************** -oneDAL supports nested parallel loops. +|short_name| supports nested parallel loops. It is important to know that: "when a parallel construct calls another parallel construct, a thread can obtain a task @@ -154,13 +154,13 @@ oneTBB provides ways to isolate execution of a parallel construct, for its tasks to not interfere with other simultaneously running tasks. Those options are preferred when the parallel loops are initially written as nested. -But in oneDAL there are cases when one parallel algorithm, the outer one, +But in |short_name| there are cases when one parallel algorithm, the outer one, calls another parallel algorithm, the inner one, within a parallel region. The inner algorithm in this case can also be called solely, without additional nesting. And we do not always want to make it isolated. -For the cases like that, oneDAL provides ``daal::ls``. Its ``local()`` method always +For the cases like that, |short_name| provides ``daal::ls``. Its ``local()`` method always returns the same value for the same thread, regardless of the nested execution: .. include:: ../includes/threading/nested-parallel-ls.rst From 8be8290beecdd4126b933397566127599bce1a55 Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Fri, 25 Oct 2024 03:15:26 -0700 Subject: [PATCH 10/13] 1. Add the distinction between ISA and architecture extension; 2. Add chapter about build systems. --- CONTRIBUTING.md | 5 +- docs/source/contribution/cpu_features.rst | 97 ++++++++++++++++++++--- 2 files changed, 87 insertions(+), 15 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index f47c8573727..7463053cd57 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -87,8 +87,9 @@ For your convenience we also added [coding guidelines](http://oneapi-src.github. ### CPU Features Dispatching -oneDAL provides multiarchitecture binaries that contain codes for multiple variants of CPU instruction set architectures. When run on a certain hardware type, oneDAL chooses the code path which is most suitable for this particular hardware to achieve better performance. -Contributors should leverage [CPU Features Dispatching](http://oneapi-src.github.io/oneDAL/contribution/cpu_features.html) mechanism to implement the code of the algorithms that can perform well on various hardware types. +oneDAL provides binaries that can contain code targeting different architectural extensions of a base instruction set architecture (ISA). For example, code paths can exist for Intel(R) SSE2, Intel(R) AVX2, Intel(R) AVX-512, etc.extensions, on top of the x86-64 base architecture. +When run on a specific hardware implementation like Haswell, Skylake-X, etc. , oneDAL chooses the code path which is most suitable for that implementation. 
+Contributors should leverage [CPU Features Dispatching](http://oneapi-src.github.io/oneDAL/contribution/cpu_features.html) mechanism to implement the code of the algorithms that can perform well on various hardware implementations. ### Threading Layer diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index 6b3daa44d93..1a386d18b24 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -14,29 +14,35 @@ .. * limitations under the License. .. *******************************************************************************/ +.. |32e_make| replace:: 32e.mk +.. _32e_make: https://github.com/oneapi-src/oneDAL/blob/main/dev/make/function_definitions/32e.mk +.. |riscv_make| replace:: riscv64.mk +.. _riscv_make: https://github.com/oneapi-src/oneDAL/blob/main/dev/make/function_definitions/riscv64.mk +.. |arm_make| replace:: arm.mk +.. _arm_make: https://github.com/oneapi-src/oneDAL/blob/main/dev/make/function_definitions/arm.mk + .. highlight:: cpp CPU Features Dispatching ^^^^^^^^^^^^^^^^^^^^^^^^ -For each algorithm |short_name| provides several code paths for x86-64-compatibe instruction -set architectures. +For each algorithm |short_name| provides several code paths for x86-64-compatible architectural extensions. -Following architectures are currently supported: +Following extensions are currently supported: - Intel\ |reg|\ Streaming SIMD Extensions 2 (Intel\ |reg|\ SSE2) - Intel\ |reg|\ Streaming SIMD Extensions 4.2 (Intel\ |reg|\ SSE4.2) - Intel\ |reg|\ Advanced Vector Extensions 2 (Intel\ |reg|\ AVX2) - Intel\ |reg|\ Advanced Vector Extensions 512 (Intel\ |reg|\ AVX-512) -The particular code path is chosen at runtime based on the underlying hardware characteristics. +The particular code path is chosen at runtime based on underlying hardware properties. -This chapter describes how the code is organized to support this variety of instruction sets. +This chapter describes how the code is organized to support this variety of extensions. Algorithm Implementation Options ******************************** -In addition to the instruction set architectures, an algorithm in |short_name| may have various +In addition to the architectural extensions, an algorithm in |short_name| may have various implementation options. Below is a description of these options to help you better understand the |short_name| code structure and conventions. @@ -66,8 +72,8 @@ methods for algorithm training and inference. Computational Modes ------------------- -|short_name| can provide several computaional modes for an algorithm. -See `Computaional Modes `_ +|short_name| can provide several computational modes for an algorithm. +See `Computational Modes `_ chapter for details. Folders and Files @@ -141,7 +147,8 @@ training task. For other types of the tasks the structure of the code is similar \*_kernel.h ----------- -Those files contain the definitions of one or several template classes that define member functions that +In the directory structure of the ``Abc`` algorithm, there are files with a `_kernel.h` suffix. +These files contain the definitions of one or several template classes that define member functions that do the actual computations. 
Here is a variant of the ``Abc`` training algorithm kernel definition in the file `abc_classification_train_kernel.h`: @@ -159,8 +166,9 @@ Implementations for different methods are usually defined using partial class te \*_impl.i --------- -Those files contain the implementations of the computational functions defined in `*_kernel.h` files. -Here is a variant of ``method1`` imlementation for ``Abc`` training algorithm that does not contain any +In the directory structure of the ``Abc`` algorithm, there are files with a `_impl.i` suffix. +These files contain the implementations of the computational functions defined in the files with a `_kernel.h` suffix. +Here is a variant of ``method1`` implementation for ``Abc`` training algorithm that does not contain any instruction set specific code. The implementation is located in the file `abc_classification_train_method1_impl.i`: .. include:: ../includes/cpu_features/abc-classification-train-method1-impl.rst @@ -185,8 +193,9 @@ Then the implementation of the ``method2`` in the file `abc_classification_train \*_fpt_cpu.cpp -------------- -Those files contain the instantiations of the template classes defined in `*_kernel.h` files. -The instatiation of the ``Abc`` training algorithm kernel for ``method1`` is located in the file +In the directory structure of the ``Abc`` algorithm, there are files with a `_fpt_cpu.cpp` suffix. +These files contain the instantiations of the template classes defined in the files with a `_kernel.h` suffix. +The instantiation of the ``Abc`` training algorithm kernel for ``method1`` is located in the file `abc_classification_train_method1_batch_fpt_cpu.cpp`: .. include:: ../includes/cpu_features/abc-classification-train-method1-fpt-cpu.rst @@ -216,3 +225,65 @@ The values for ``DAAL_CPU`` macro replacement are: - ``__sse42__`` for Intel\ |reg|\ SSE4.2 architecture, - ``__avx2__`` for Intel\ |reg|\ AVX2 architecture, - ``__avx512__`` for Intel\ |reg|\ AVX-512 architecture. + +Build System Configuration +************************** + +This chapter describes which parts of the build system need to be modified to add new architectural +extensions to the build system or to remove an outdated one. + +Makefile +-------- + +The most important definitions and functions for CPU features dispatching are located in the files +|32e_make|_ for x86-64 architecture, |riscv_make|_ for RISC-V 64-bit architecture, and |arm_make|_ +for ARM architecture. +Those files are included into operating system related files. +For example, the |32e_make| file is included into ``lnx32e.mk`` file: + +:: + + include dev/make/function_definitions/32e.mk + +And ``lnx32e.mk`` and similar files are included into the main Makefile: + +:: + + include dev/make/function_definitions/$(PLAT).mk + +Where ``$(PLAT)`` is the platform name, for example, ``lnx32e``, ``win32e``, ``lnxriscv64``, etc. + +To add a new architectural extension into |32e_make| file, ``CPUs`` and ``CPUs.files`` lists need to be updated. +The functions like ``set_uarch_options_for_compiler`` and others should also be updated accordingly. + +The compiler options for the new architectural extension should be added to the respective file in +`compiler_definitions `_ folder. 
+ +For example, `gnu.32e.mk `_ +file contains the compiler options for the GNU compiler for x86-64 architecture in the form +``option_name.compiler_name``: + +:: + + p4_OPT.gnu = $(-Q)march=nocona + mc3_OPT.gnu = $(-Q)march=corei7 + avx2_OPT.gnu = $(-Q)march=haswell + skx_OPT.gnu = $(-Q)march=skylake + +Bazel +----- + +For now, Bazel build is supported only for Linux x86-64 platform +It provides ``cpu`` `option `_ +that allows to specify the list of target architectural extensions. + +To add a new architectural extension into Bazel configuration, following steps should be done: + +- Add the new extension to the list of allowed values in the ``_ISA_EXTENSIONS`` variable in the + `config.bzl `_ file; +- Update the ``get_cpu_flags`` function in the + `flags.bzl `_ + file to provide the compiler flags for the new extension; +- Update the ``cpu_defines`` dictionaries in + `dal.bzl `_ and + `daal.bzl `_ files accordingly. \ No newline at end of file From c741226ebb0c2506215b4c9c294211e0c57206a3 Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Fri, 25 Oct 2024 04:23:56 -0700 Subject: [PATCH 11/13] Apply comments from review --- CONTRIBUTING.md | 6 +++--- docs/source/contribution/cpu_features.rst | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 7463053cd57..99611216d1c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -87,9 +87,9 @@ For your convenience we also added [coding guidelines](http://oneapi-src.github. ### CPU Features Dispatching -oneDAL provides binaries that can contain code targeting different architectural extensions of a base instruction set architecture (ISA). For example, code paths can exist for Intel(R) SSE2, Intel(R) AVX2, Intel(R) AVX-512, etc.extensions, on top of the x86-64 base architecture. -When run on a specific hardware implementation like Haswell, Skylake-X, etc. , oneDAL chooses the code path which is most suitable for that implementation. -Contributors should leverage [CPU Features Dispatching](http://oneapi-src.github.io/oneDAL/contribution/cpu_features.html) mechanism to implement the code of the algorithms that can perform well on various hardware implementations. +oneDAL provides binaries that can contain code targeting different architectural extensions of a base instruction set architecture (ISA). For example, code paths can exist for Intel(R) SSE2, Intel(R) AVX2, Intel(R) AVX-512, etc. extensions, on top of the x86-64 base architecture. +When run on a specific hardware implementation like Haswell, Skylake-X, etc., oneDAL chooses the code path which is most suitable for that implementation. +Contributors should leverage [CPU Features Dispatching](http://oneapi-src.github.io/oneDAL/contribution/cpu_features.html) mechanism to implement the code of the algorithms that can perform most optimally on various hardware implementations. ### Threading Layer diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index 1a386d18b24..04d6ac6258b 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -230,7 +230,7 @@ Build System Configuration ************************** This chapter describes which parts of the build system need to be modified to add new architectural -extensions to the build system or to remove an outdated one. +extension or to remove an outdated one. 
Makefile -------- @@ -238,7 +238,7 @@ Makefile The most important definitions and functions for CPU features dispatching are located in the files |32e_make|_ for x86-64 architecture, |riscv_make|_ for RISC-V 64-bit architecture, and |arm_make|_ for ARM architecture. -Those files are included into operating system related files. +Those files are included into operating system related makefiles. For example, the |32e_make| file is included into ``lnx32e.mk`` file: :: @@ -256,7 +256,7 @@ Where ``$(PLAT)`` is the platform name, for example, ``lnx32e``, ``win32e``, ``l To add a new architectural extension into |32e_make| file, ``CPUs`` and ``CPUs.files`` lists need to be updated. The functions like ``set_uarch_options_for_compiler`` and others should also be updated accordingly. -The compiler options for the new architectural extension should be added to the respective file in +The compiler options for the new architectural extension should be added to the respective file in the `compiler_definitions `_ folder. For example, `gnu.32e.mk `_ From e47546b0b912fbfb84979cf48898b253ee68709d Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Fri, 25 Oct 2024 04:37:08 -0700 Subject: [PATCH 12/13] Apply comments from review --- docs/source/contribution/cpu_features.rst | 14 +++++++------- docs/source/contribution/threading.rst | 4 +--- 2 files changed, 8 insertions(+), 10 deletions(-) diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index 04d6ac6258b..3de3d5c8cd3 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -175,16 +175,16 @@ instruction set specific code. The implementation is located in the file `abc_cl Although the implementation of the ``method1`` does not contain any instruction set specific code, it is expected that the developers leverage SIMD related macros available in |short_name|. -For example, ``PRAGMA_IVDEP``, ``PRAGMA_VECTOR_ALWAYS``, ``PRAGMA_VECTOR_ALIGNED`` and others pragmas defined in +For example, ``PRAGMA_IVDEP``, ``PRAGMA_VECTOR_ALWAYS``, ``PRAGMA_VECTOR_ALIGNED`` and other pragmas defined in `service_defines.h `_. This will guide the compiler to generate more efficient code for the target architecture. Consider that the implementation of the ``method2`` for the same algorithm will be different and will contain -AVX-512-specific code located in ``cpuSpecificCode`` function. Note that all the compiler-specific code should -be placed under compiler-specific defines. For example, the Intel\ |reg|\ oneAPI DPC++/C++ Compiler specific code -should be placed under ``DAAL_INTEL_CPP_COMPILER`` define. All the CPU-specific code should be placed under -CPU-specific defines. For example, the AVX-512 specific code should be placed under -``__CPUID__(DAAL_CPU) == __avx512__``. +AVX-512-specific code located in ``cpuSpecificCode`` function. Note that all the compiler-specific code +should be gated by values of compiler-specific defines. +For example, the Intel\ |reg|\ oneAPI DPC++/C++ Compiler specific code should be gated the existence of the +``DAAL_INTEL_CPP_COMPILER`` define. All the CPU-specific code should be gated on the value of CPU-specific define. +For example, the AVX-512 specific code should be gated on the value ``__CPUID__(DAAL_CPU) == __avx512__``. 
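Reduced to its bare shape, the gating described above is a pair of preprocessor checks. The snippet below is only a sketch: the call arguments are placeholders, and ``genericCode`` stands in for whatever non-specialized implementation the algorithm provides.

::

    // Sketch of compiler- and CPU-specific gating (illustrative only).
    #if defined(DAAL_INTEL_CPP_COMPILER) && (__CPUID__(DAAL_CPU) == __avx512__)
        // Compiled only by the Intel(R) oneAPI DPC++/C++ Compiler and only
        // into the AVX-512 code path.
        cpuSpecificCode(/* placeholder arguments */);
    #else
        // Any other compiler or code path uses the generic implementation.
        genericCode(/* placeholder arguments */);
    #endif
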
Then the implementation of the ``method2`` in the file `abc_classification_train_method2_impl.i` will look like: @@ -203,7 +203,7 @@ The instantiation of the ``Abc`` training algorithm kernel for ``method1`` is lo `_fpt_cpu.cpp` files are not compiled directly into object files. First, multiple copies of those files are made replacing the ``fpt``, which stands for 'floating point type', and ``cpu`` parts of the file name as well as the corresponding ``DAAL_FPTYPE`` and ``DAAL_CPU`` macros with the actual data type and CPU type values. -Then the resulting files are compiled with appropriate CPU-specific optimization compiler options. +Then the resulting files are compiled with appropriate CPU-specific compiler optimization options. The values for ``fpt`` file name part replacement are: diff --git a/docs/source/contribution/threading.rst b/docs/source/contribution/threading.rst index 6233bc0e813..0cf8740d0f4 100644 --- a/docs/source/contribution/threading.rst +++ b/docs/source/contribution/threading.rst @@ -20,9 +20,7 @@ Threading Layer ^^^^^^^^^^^^^^^ |short_name| uses Intel\ |reg|\ oneAPI Threading Building Blocks (Intel\ |reg|\ oneTBB) to do parallel -computations on CPU. - -But oneTBB is not used in the code of |short_name| algorithms directly. The algorithms rather +computations on CPU. oneTBB is not used in the code of |short_name| algorithms directly. The algorithms rather use custom primitives that either wrap oneTBB functionality or are in-house developed. Those primitives form |short_name|'s threading layer. From 6b3543cdabfa67d2017f2507391c6250356d5ec2 Mon Sep 17 00:00:00 2001 From: Victoriya Fedotova Date: Mon, 28 Oct 2024 03:09:34 -0700 Subject: [PATCH 13/13] Add a note about the files related to DAAL interface --- docs/source/contribution/cpu_features.rst | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/source/contribution/cpu_features.rst b/docs/source/contribution/cpu_features.rst index 3de3d5c8cd3..e328bb80f47 100644 --- a/docs/source/contribution/cpu_features.rst +++ b/docs/source/contribution/cpu_features.rst @@ -141,6 +141,13 @@ could be: The names of the files stay the same in this case, just the folder layout differs. +The folders of the algorithms that are already implemented can contain additional files. +For example, files with ``container.h``, ``dispatcher.cpp`` suffixes, etc. +These files are used in the Data Analytics Acceleration Library (DAAL) interface implementation. +That interface is still available to users, but it is not recommended for use in new code. +The files related to the DAAL interface are not described here as they are not part of the CPU features +dispatching mechanism. + Further the purpose and contents of each file are to be described on the example of classification training task. For other types of the tasks the structure of the code is similar.
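For completeness, the instantiation that a `_fpt_cpu.cpp` file provides can be sketched as follows. This is a hedged illustration built on the hypothetical ``Abc`` names used throughout the chapter; the include set and the ``method1`` template argument are assumptions, while ``DAAL_FPTYPE`` and ``DAAL_CPU`` are substituted by the build system as described above.

::

    // abc_classification_train_method1_batch_fpt_cpu.cpp (illustrative sketch).
    // The build system makes one copy of this file per floating-point type and
    // per CPU code path, replacing DAAL_FPTYPE and DAAL_CPU with concrete values
    // such as float / double and __avx2__ / __avx512__, and compiles each copy
    // with the matching CPU-specific optimization options.
    #include "abc_classification_train_kernel.h"
    #include "abc_classification_train_impl.i"

    // Namespaces omitted for brevity; real files wrap this in the algorithm's
    // internal namespaces.
    template class AbcClassificationTrainKernel<DAAL_FPTYPE, method1, DAAL_CPU>;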