Add a chapter about threading layer into the docs #2848

Merged
20 commits merged on Sep 20, 2024
Changes from 9 commits
8 changes: 7 additions & 1 deletion CONTRIBUTING.md
@@ -25,7 +25,7 @@ Refer to our guidelines on [pull requests](#pull-requests) and [issues](#issues)

## Contacting maintainers
You may reach out to Intel project maintainers privately at onedal.maintainers@intel.com.
[Codeowners](https://github.com/oneapi-src/oneDAL/blob/main/.github/CODEOWNERS) configuration defines specific maintainers for corresponding code sections, however it's currently limited to Intel members. With further migration to UXL we will be changing this, but here are non-Intel contacts:

For ARM specifics you may contact: [@rakshithgb-fujitsu](https://github.com/rakshithgb-fujitsu/)

@@ -69,6 +69,12 @@ Refer to [ClangFormat documentation](https://clang.llvm.org/docs/ClangFormat.htm

For your convenience we also added [coding guidelines](http://oneapi-src.github.io/oneDAL/contribution/coding_guide.html) with examples and detailed descriptions of the coding style oneDAL follows. We encourage you to consult them when writing your code.

## Custom Components

### Threading Layer

In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form so called [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
Contributor:

Suggested change:
- In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form so called [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
+ In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form are called the [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.


## Documentation Guidelines

oneDAL uses `Doxygen` for inline comments in public header files that are used to build the API reference and `reStructuredText` for the Developer Guide. See [oneDAL documentation](https://oneapi-src.github.io/oneDAL/) for reference.
6 changes: 3 additions & 3 deletions cpp/daal/src/threading/threading.cpp
@@ -103,7 +103,7 @@ DAAL_EXPORT size_t _setNumberOfThreads(const size_t numThreads, void ** globalCo
return 1;
}

- DAAL_EXPORT void _daal_threader_for(int n, int threads_request, const void * a, daal::functype func)
+ DAAL_EXPORT void _daal_threader_for(int n, int reserved, const void * a, daal::functype func)
{
if (daal::threader_env()->getNumberOfThreads() > 1)
{
@@ -160,7 +160,7 @@ DAAL_EXPORT void _daal_threader_for_blocked_size(size_t n, size_t block, const v
}
}

- DAAL_EXPORT void _daal_threader_for_simple(int n, int threads_request, const void * a, daal::functype func)
+ DAAL_EXPORT void _daal_threader_for_simple(int n, int reserved, const void * a, daal::functype func)
{
if (daal::threader_env()->getNumberOfThreads() > 1)
{
@@ -318,7 +318,7 @@ DAAL_PARALLEL_SORT_IMPL(daal::IdxValType<double>, pair_fp64_uint64)

#undef DAAL_PARALLEL_SORT_IMPL

- DAAL_EXPORT void _daal_threader_for_blocked(int n, int threads_request, const void * a, daal::functype2 func)
+ DAAL_EXPORT void _daal_threader_for_blocked(int n, int reserved, const void * a, daal::functype2 func)
{
if (daal::threader_env()->getNumberOfThreads() > 1)
{
118 changes: 112 additions & 6 deletions cpp/daal/src/threading/threading.h
@@ -223,14 +223,33 @@ inline void threader_func_break(int i, bool & needBreak, const void * a)
lambda(i, needBreak);
}

/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is 2^31 - 1.
Vika-F marked this conversation as resolved.
/// The work is scheduled dynamically across threads.
Contributor:
The comment suggests that the work will be executed in parallel, but that may not be the case. It depends on the implementation behind the abstraction. This function passes the parameters to the threading layer, which may execute the work in parallel. And must the work be scheduled dynamically? Can the threading layer decide to just split n equally across however many threads there are (this is the static schedule in OpenMP parlance)? I would reword this to something like:

Pass a function to be executed in a for loop to the threading layer. The work is scheduled over threads by the threading layer.

Also, it would be good to mention loop-carried dependencies. Is the assumption here that the work in each loop iteration is independent?

Contributor (author):
The idea of this API is that it uses the default scheduling provided by the underlying implementation.
Of course, the execution might not be parallel for various reasons, for example, if only one thread is available.

It is expected that the iterations of the loop are logically independent, i.e. there is no recurrence among the iterations, but they might access the same data for reading or writing.
In the latter case, additional synchronization such as atomics, mutual exclusion, or thread-local storage is required.

I will reword the description to make it more clear.

///
/// @tparam F Lambda function of type ``[/* captures */](int i) -> void``,
/// where ``i`` is the loop's iteration index, ``0 <= i < n``.
Contributor:
Just double checking, but I've always used single backtick for monospaced in doxygen. Is the double backtick valid?

Contributor (author):

Thanks for noticing this. We are using not only Doxygen, but also reStructuredText plus some Python pre-processing.
Double backtick is valid, but it results in italic text.
I'm going to replace it with a single backtick to get monospace.

///
/// @param[in] n Number of iterations in the for loop.
/// @param[in] reserved Parameter reserved for the future. Currently unused.
/// @param[in] lambda Lambda function that defines iteration's body.
Contributor:

Suggested change:
- /// @param[in] lambda Lambda function that defines iteration's body.
+ /// @param[in] lambda Lambda function that defines the loop body.

template <typename F>
- inline void threader_for(int n, int threads_request, const F & lambda)
+ inline void threader_for(int n, int reserved, const F & lambda)
{
const void * a = static_cast<const void *>(&lambda);

- _daal_threader_for(n, threads_request, a, threader_func<F>);
+ _daal_threader_for(n, reserved, a, threader_func<F>);
}
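As an illustration (not part of this diff), here is a minimal usage sketch assuming the `threader_for` signature declared above; the `vector_add` wrapper and array names are hypothetical, and the second argument is simply the reserved (currently unused) parameter:

    // Element-wise sum of two arrays. Each iteration writes only c[i],
    // so the iterations are independent and need no extra synchronization.
    void vector_add(const float * a, const float * b, float * c, int n)
    {
        daal::threader_for(n, n, [&](int i) { c[i] = a[i] + b[i]; });
    }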

/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is 2^63 - 1.
Vika-F marked this conversation as resolved.
/// The work is scheduled dynamically across threads.
///
/// @tparam F Lambda function of type [/* captures */](int64_t i) -> void,
/// where ``i`` is the loop's iteration index, ``0 <= i < n``.
///
/// @param[in] n Number of iterations in the for loop.
/// @param[in] lambda Lambda function that defines iteration's body.
template <typename F>
inline void threader_for_int64(int64_t n, const F & lambda)
{
@@ -239,12 +258,25 @@ inline void threader_for_int64(int64_t n, const F & lambda)
_daal_threader_for_int64(n, a, threader_func<F>);
}
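A similarly hypothetical sketch for the 64-bit variant, useful when the loop may exceed 2^31 - 1 iterations (`buffer` and `count` are placeholders):

    daal::threader_for_int64(count, [&](int64_t i) { buffer[i] = 0.0; });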

/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is 2^31 - 1.
/// The work is scheduled dynamically across threads.
/// The iteration space is chunked using oneTBB ``simple_partitioner``
Contributor:
Is TBB always going to be the threading layer? It is now, but could this be different in the future? If this function or interface is TBB-specific, is this documented somewhere? It might also be worth considering changing the name of the function, e.g. tbb_for_simple.

Contributor (author):
I think it is impossible to say whether TBB will always be the threading layer or not. For now we do not have plans to migrate to other threading technologies like OpenMP, C++ standard threads, and so on, but who knows what happens in the future?

I think I need to reword this and remove the TBB mention from the description.

The idea of this API is that the iterations are not grouped together; each iteration is considered a separate task.
Other APIs can group iterations so that a consecutive block of iterations is executed by one thread; this API tries to avoid that when possible.

/// (https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Partitioner_Summary.html)
/// with chunk size 1.
///
/// @tparam F Lambda function of type [/* captures */](int i) -> void,
/// where ``i`` is the loop's iteration index, ``0 <= i < n``.
///
/// @param[in] n Number of iterations in the for loop.
/// @param[in] reserved Parameter reserved for the future. Currently unused.
/// @param[in] lambda Lambda function that defines iteration's body.
Contributor:

Suggested change:
- /// @param[in] lambda Lambda function that defines iteration's body.
+ /// @param[in] lambda Function that defines the loop body.

template <typename F>
- inline void threader_for_simple(int n, int threads_request, const F & lambda)
+ inline void threader_for_simple(int n, int reserved, const F & lambda)
{
const void * a = static_cast<const void *>(&lambda);

- _daal_threader_for_simple(n, threads_request, a, threader_func<F>);
+ _daal_threader_for_simple(n, reserved, a, threader_func<F>);
}
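A hedged usage sketch (not part of this diff): this variant helps when the cost of iterations varies widely, because each iteration is submitted as its own task instead of being grouped into contiguous blocks. `processRow`, `data`, and `nRows` are placeholders:

    daal::threader_for_simple(nRows, nRows, [&](int i) {
        processRow(data, i); // per-row work with strongly varying cost
    });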

template <typename F>
@@ -255,6 +287,35 @@ inline void threader_for_int32ptr(const int * begin, const int * end, const F &
_daal_threader_for_int32ptr(begin, end, a, threader_func<F>);
}

/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is ``SIZE_MAX`` in C99 standard.
///
/// The work is scheduled statically across threads.
/// This means that the work is always scheduled in the same way across the threads:
/// each thread processes the same set of iterations on each invocation of this loop.
///
/// It is recommended to use this parallel loop if each iteration of the loop
/// performs equal amount of work.
///
/// Let ``t`` be the number of threads available to oneDAL.
///
/// Then the number of iterations processed by each threads (except maybe the last one)
/// is copmputed as:
Contributor:

Suggested change:
- /// is copmputed as:
+ /// is computed as:

/// ``nI = (n + t - 1) / t``
///
/// Here is how the work in split across the threads:
Contributor:

Suggested change:
- /// Here is how the work in split across the threads:
+ /// Here is how the work is split across the threads:

/// The 1st thread executes iterations ``0``, ..., ``nI - 1``;
/// the 2nd thread executes iterations ``nI``, ..., ``2 * nI - 1``;
/// ...
/// the t-th thread executes iterations ``(t - 1) * nI``, ..., ``n - 1``.
///
/// @tparam F Lambda function of type [/* captures */](size_t i, size_t tid) -> void,
/// where
/// ``i`` is the loop's iteration index, ``0 <= i < n``;
/// ``tid`` is the index of the thread, ``0 <= tid < t``.
///
/// @param[in] n Number of iterations in the for loop.
/// @param[in] lambda Lambda function that defines iteration's body.
Contributor:

Suggested change:
- /// @param[in] lambda Lambda function that defines iteration's body.
+ /// @param[in] lambda Function that defines the loop body.

template <typename F>
inline void static_threader_for(size_t n, const F & lambda)
{
@@ -263,12 +324,27 @@ inline void static_threader_for(size_t n, const F & lambda)
_daal_static_threader_for(n, a, static_threader_func<F>);
}
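Worked example of the scheduling formula above: with n = 10 iterations and t = 4 threads, nI = (10 + 4 - 1) / 4 = 3, so thread 0 executes iterations 0..2, thread 1 executes 3..5, thread 2 executes 6..8, and thread 3 executes iteration 9. A hypothetical usage sketch follows; it assumes `daal::threader_env()->getNumberOfThreads()` (used in threading.cpp above) returns the thread count, and `x` and `n` are placeholders:

    #include <vector>

    const size_t nThreads = daal::threader_env()->getNumberOfThreads();
    std::vector<double> partialSums(nThreads, 0.0);
    // tid indexes a pre-allocated per-thread accumulator, so no locking is needed.
    daal::static_threader_for(n, [&](size_t i, size_t tid) { partialSums[tid] += x[i]; });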

/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is 2^31 - 1.
Vika-F marked this conversation as resolved.
/// The work is scheduled dynamically across threads.
///
/// @tparam F Lambda function of type [/* captures */](int beginRange, int endRange) -> void
/// where
/// ``beginRange`` is the starting index of the loop's iterations block to be
/// processed by a thread, ``0 <= beginRange < n``;
/// ``endRange`` is the index after the end of the loop's iterations block to be
/// processed by a thread, ``beginRange < endRange <= n``;
///
/// @param[in] n Number of iterations in the for loop.
/// @param[in] reserved Parameter reserved for the future. Currently unused.
/// @param[in] lambda Lambda function that processes the block of loop's iterations
/// ``[beginRange, endRange)``.
template <typename F>
- inline void threader_for_blocked(int n, int threads_request, const F & lambda)
+ inline void threader_for_blocked(int n, int reserved, const F & lambda)
{
const void * a = static_cast<const void *>(&lambda);

- _daal_threader_for_blocked(n, threads_request, a, threader_func_b<F>);
+ _daal_threader_for_blocked(n, reserved, a, threader_func_b<F>);
}
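An illustrative sketch (not part of this diff): the lambda receives a half-open block [beginRange, endRange) that one thread processes with a plain inner loop; `x`, `y`, `alpha`, and `n` are placeholders:

    daal::threader_for_blocked(n, n, [&](int beginRange, int endRange) {
        for (int i = beginRange; i < endRange; ++i)
        {
            y[i] = alpha * x[i] + y[i]; // axpy-style update over the assigned block
        }
    });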

template <typename F>
@@ -321,10 +397,18 @@ class tls_deleter_ : public tls_deleter
virtual void del(void * a) { delete static_cast<lambdaType *>(a); }
};

/// Thread-local storage (TLS)
///
/// @tparam F Type of the data located in the storage
template <typename F>
class tls : public tlsBase
{
public:
/// Initialize thread-local storage
///
/// @tparam lambdaType Lambda function of type [/* captures */]() -> F
Contributor:

Needs monospace, either with the single or double backticks. I've assumed the style so far.

Suggested change:
- /// @tparam lambdaType Lambda function of type [/* captures */]() -> F
+ /// @tparam lambdaType Lambda function of type ``[/* captures */]() -> F``

///
/// @param lambda Lambda function that initializes a thread-local storage
template <typename lambdaType>
explicit tls(const lambdaType & lambda)
{
@@ -339,19 +423,35 @@ class tls : public tlsBase
tlsPtr = _daal_get_tls_ptr(a, tls_func<lambdaType>);
}

/// Destroys the memory associated with a thread-local storage
///
/// @note TLS does not release the memory allocated by a lambda-function
/// provided to the constructor.
/// Developers are responsible for deletion of that memory.
virtual ~tls()
{
d->del(voidLambda);
delete d;
_daal_del_tls_ptr(tlsPtr);
}

/// Access a local data of a thread by value
///
/// @return When first ionvoced by a thread, a lambda provided to the constructor is
Contributor:

Suggested change:
- /// @return When first ionvoced by a thread, a lambda provided to the constructor is
+ /// @return When first invoked by a thread, a lambda provided to the constructor is

/// called to initialize the local data of the thread and return it.
/// All the following invocations just return the same thread-local data.
Contributor:

If this is the case, should the declaration of the pointer pf be static?

Contributor (author):

I don't think so.
It is just an intermediate variable needed to cast from the C-style API to C++.

F local()
{
void * pf = _daal_get_tls_local(tlsPtr);
return (static_cast<F>(pf));
}

/// Sequential reduction.
///
/// @tparam lambdaType Lambda function of type [/* captures */](F) -> void
Contributor:

Suggested change:
- /// @tparam lambdaType Lambda function of type [/* captures */](F) -> void
+ /// @tparam lambdaType Lambda function of type ``[/* captures */](F) -> void``

///
/// @param lambda Lambda function that is applied to each element of thread-local
/// storage sequentially.
template <typename lambdaType>
void reduce(const lambdaType & lambda)
{
@@ -360,6 +460,12 @@ class tls : public tlsBase
_daal_reduce_tls(tlsPtr, a, tls_reduce_func<F, lambdaType>);
}

/// Parallel reduction.
///
/// @tparam lambdaType Lambda function of type [/* captures */](F) -> void
Contributor:

Suggested change:
- /// @tparam lambdaType Lambda function of type [/* captures */](F) -> void
+ /// @tparam lambdaType Lambda function of type ``[/* captures */](F) -> void``

///
/// @param lambda Lambda function that is applied to each element of thread-local
/// storage in parallel.
template <typename lambdaType>
void parallel_reduce(const lambdaType & lambda)
{
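The `tls` class is shown only partially in this diff. For illustration, a hedged sketch of the TLS plus reduction pattern it supports, based on the constructor, `local()`, and `reduce()` shown above; `x`, `n`, and `total` are placeholders. As the destructor note explains, memory allocated by the initialization lambda must be released by the developer, which is done here inside `reduce()`:

    // Each thread lazily creates its own accumulator on the first call to local().
    daal::tls<double *> partialSums([]() { return new double(0.0); });

    daal::threader_for(n, n, [&](int i) {
        double * local = partialSums.local();
        *local += x[i];
    });

    double total = 0.0;
    partialSums.reduce([&](double * sum) {
        total += *sum;
        delete sum; // the TLS destructor does not free memory allocated by the init lambda
    });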