From 74be85a33620a33bcf44ea7ae4ae4d1a05d13d41 Mon Sep 17 00:00:00 2001
From: Victoriya Fedotova
Date: Tue, 13 Aug 2024 05:57:41 -0700
Subject: [PATCH] Apply review comments to threading.rst

---
 docs/source/contribution/threading.rst | 30 +++++++++++++-------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/docs/source/contribution/threading.rst b/docs/source/contribution/threading.rst
index 6a25b3c9c78..91ce3e2a2ad 100644
--- a/docs/source/contribution/threading.rst
+++ b/docs/source/contribution/threading.rst
@@ -27,12 +27,13 @@ use custom primitives that either wrap oneTBB functionality or are in-house deve
 Those primitives form oneDAL's threading layer.

 This is done in order not to be dependent on possible oneTBB API changes and even
-on the particular threading technology.
+on the particular threading technology like oneTBB, C++11 standard threads, etc.

 The API of the layer is defined in
 `threading.h `_.
-Please be aware that those APIs are not publicly defined, so they can be changed at any time
-without any notification.
+Please be aware that the threading API is not a part of oneDAL product API.
+This is the product internal API that aimed to be used only by oneDAL developers, and can be changed at any time
+without any prior notification.

 This chapter describes common parallel patterns and primitives of the threading layer.

@@ -52,8 +53,7 @@ One of the options is to use ``daal::threader_for`` as shown here:
 .. include:: ../includes/threading/sum-parallel.rst

 The iteration space here goes from ``0`` to ``n-1``.
-The last argument is the lambda function that defines a function object that proceeds ``i``-th
-iteration of the loop.
+The last argument is a function object that performs a single iteration of the loop, given loop index ``i``.

 Blocking
 --------
@@ -76,37 +76,37 @@ Here is a variant of sequential implementation:

 Parallel computations can be performed in two steps:

- 1. Compute partial dot product at each threaded.
+ 1. Compute partial dot product in each thread.
  2. Perform a reduction: Add the partial results from all threads to compute the final dot product.

 ``daal::tls`` provides a local storage where each thread can accumulate its local results.
-Following code allocates memory that would store partial dot products for each thread:
+The following code allocates memory that would store partial dot products for each thread:

 .. include:: ../includes/threading/dot-parallel-init-tls.rst

 ``SafeStatus`` in this code denotes a thread-safe counterpart of the ``Status`` class.
-``SafeStatus`` allows to collect errors from all threads and report them to user using
-``detach()`` method as it will be shown later in the code.
+``SafeStatus`` allows to collect errors from all threads and report them to the user using the
+``detach()`` method. An example will be shown later in the documentation.

 Checking the status right after the initialization code won't show the allocation errors,
 because oneTBB uses lazy evaluation and the lambda function passed to the constructor of the TLS
-is evaluated in the moment of the TLS's first use.
+is evaluated on first use of the thread-local storage (TLS).

-Again, there are several options available in the threading layer of oneDAL to compute the partial
+There are several options available in the threading layer of oneDAL to compute the partial
 dot product results at each thread.
 One of the options is to use the already mentioned ``daal::threader_for`` and blocking approach
 as shown here:

 .. include:: ../includes/threading/dot-parallel-partial-compute.rst

-To compute the final result it is requred to reduce TLS's partial results over all threads
+To compute the final result it is required to reduce each thread's partial results
 as shown here:

 .. include:: ../includes/threading/dot-parallel-reduction.rst

-Local memory of the threads should also be released when it is no longer needed.
+Local memory of the threads should be released when it is no longer needed.

-The complete parallel verision of dot product computations would look like:
+The complete parallel version of dot product computations would look like:

 .. include:: ../includes/threading/dot-parallel.rst

@@ -122,7 +122,7 @@ This strategy is beneficial when it is difficult to estimate the amount of work
 by each iteration.

 In the cases when it is known that the iterations perform an equal amount of work, it
-might be beneficial to use predefined mapping of the loop's iterations to threads.
+is more performant to use predefined mapping of the loop's iterations to threads.
 This is what static work scheduling does.

 ``daal::static_threader_for`` and ``daal::static_tls`` allow implementation of static