pdocs reorg #79 (Closed)

Rmalavally wants to merge 52 commits.
Commits (all by Rmalavally):

May 3, 2024:

* b8bf1d3: Create install.rst
* 2e36c4b: Update install.rst
* 0519bac: Create openmp-features.rst
* 333ed8a: Update install.rst
* ca45faf: Update openmp-features.rst
* 28fb9c6: Update openmp-features.rst
* 7d51abb: Update index.rst
* b1f2133: Update index.rst
* aa1efa5: Create faq.rst
* 6c2f814: Add files via upload
* f58141a: Update index.rst

May 6, 2024:

* 5844b9a: Create use-openmp.rst
* 403c532: Create use-rocprof.rst
* 0d009a5: Update use-openmp.rst
* 4302ee6: Create use-tracing-options.md
* 5ac5ca2: Create build.rst
* ecd847e: Update OpenMP.md
* 167333c: Update index.rst
* 1480482: Update _toc.yml.in
* bea9721: Update index.rst
* 904966f: Update _toc.yml.in
* 8b12517: Update _toc.yml.in
* 95219ea: Update index.rst
* 69de5c6: Update index.rst
* 6f023e0: Update index.rst
* 1487c7a: Update _toc.yml.in
* 1f370b6: Update _toc.yml.in
* 9f2cd98: Update _toc.yml.in
* 34ade9c: Update _toc.yml.in
* 93394ae: Update index.rst
* ad3cbd4: Create api.rst
* e3ce79e: Update api.rst
* bef5812: Update _toc.yml.in
* f1f9e1d: Update _toc.yml.in
* 4a14793: Update index.rst
* 451e2ca: Update _toc.yml.in
* 76db1dc: Update _toc.yml.in
* eace0d8: Merge branch 'ROCm:amd-staging' into OpenMPdocs_reorg
* 1f512c5: Create test.rst
* 192b0b8: Add files via upload
* 0865eb9: Rename openmp/docs/data/OpenMP-toolchain.png to openmp/docs/data/imag…
* 960985c: Update openmp-features.rst
* 18b0833: Update openmp-features.rst
* ce2dd58: Update openmp-features.rst

May 7, 2024:

* 11a3f9b: Merge branch 'ROCm:amd-staging' into OpenMPdocs_reorg
* b68ec1c: Update _toc.yml.in
* 8f6305e: Update _toc.yml.in
* b4f112c: Update _toc.yml.in
* 4bfd69d: Update _toc.yml.in
* 5209853: Update _toc.yml.in
* 2af4209: Update _toc.yml.in
* 9ddf860: Update _toc.yml.in
1 change: 0 additions & 1 deletion openmp/docs/OpenMP.md
@@ -21,7 +21,6 @@ this ROCm release. See the list of supported GPUs for {doc}`Linux<rocm-install-o
The ROCm OpenMP compiler is implemented using LLVM compiler technology.
The following image illustrates the internal steps taken to translate a user’s application into an executable that can offload computation to the AMDGPU. The compilation is a two-pass process. Pass 1 compiles the application to generate the CPU code and Pass 2 links the CPU code to the AMDGPU device code.

![OpenMP toolchain](../../data/reference/openmp/openmp-toolchain.svg "OpenMP toolchain")

### Installation

196 changes: 196 additions & 0 deletions openmp/docs/conceptual/openmp-features.rst
@@ -0,0 +1,196 @@
.. meta::
   :description: OpenMP features in ROCm
   :keywords: openmp, llvm, aomp, AMD, ROCm

*****************
OpenMP features
*****************

The OpenMP programming model has been enhanced with the following features, implemented over recent releases.

.. image:: ./data/images/OpenMP-toolchain.png
   :width: 400
   :alt: OpenMP Clang compile and link drivers


Asynchronous behavior in OpenMP target regions
----------------------------------------------

* Controlling asynchronous behavior - The OpenMP offloading runtime executes asynchronously by default, allowing multiple data transfers to start concurrently. However, if the data to be transferred is larger than the default threshold of 1 MB, the runtime falls back to a synchronous transfer. Buffers that have already been locked are always transferred asynchronously. You can override this default behavior by setting ``LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES`` and ``OMPX_FORCE_SYNC_REGIONS``. See the environment variables table for details.

* Multithreaded Offloading on the Same Device - The libomptarget plugin for GPU offloading allows creation of separate configurable HSA queues per chiplet, which enables two or more threads to concurrently offload to the same device.

* Parallel Memory Copy Invocations - Implicit asynchronous execution of single target region enables parallel memory copy invocations.

* Unified shared memory - Unified Shared Memory (USM) provides a pointer-based approach to memory management. Using USM requires a system with Xnack capability; see the Unified Shared Memory section for the full system requirements.


OMPT target support
---------------------

The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These APIs allow first-party tools to examine the profile and kernel traces that execute on a device. A tool can register callbacks for data transfer and kernel dispatch entry points or use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.

The following example demonstrates how a tool uses the supported OMPT target APIs. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to be followed, and the provided example can be run as follows:

.. code-block:: bash

   cd $ROCM_PATH/share/openmp-extras/examples/tools/ompt/veccopy-ompt-target-tracing
   sudo make run

The file veccopy-ompt-target-tracing.c simulates how a tool initiates device activity tracing. The file callbacks.h shows the callbacks registered and implemented by the tool.

Floating-point atomic operations
---------------------------------

The MI200-series GPUs support the generation of hardware floating-point atomics using the OpenMP atomic pragma. The support includes single- and double-precision floating-point atomic operations. The programmer must ensure that the memory subjected to the atomic operation is in coarse-grain memory, by mapping it explicitly with map clauses when the compiler does not map it implicitly as per the OpenMP specification. This makes these hardware floating-point atomic instructions "fast", because they outperform the default compare-and-swap loop scheme, but also "unsafe", because they are not supported on fine-grain memory. In unified_shared_memory mode, the operation likewise requires programmers to map the memory explicitly when the compiler does not map it implicitly.

To request fast floating-point atomic instructions at the file level, use the compiler flag ``-munsafe-fp-atomics`` or a hint clause on a specific pragma:

.. code-block:: c

   double a = 0.0;
   #pragma omp atomic hint(AMD_fast_fp_atomics)
   a = a + 1.0;

.. note::

   ``AMD_unsafe_fp_atomics`` is an alias for ``AMD_fast_fp_atomics``, and ``AMD_safe_fp_atomics`` is implemented with a compare-and-swap loop.

To disable the generation of fast floating-point atomic instructions at the file level, build using the option ``-msafe-fp-atomics`` or use a hint clause on a specific pragma:

.. code-block:: c

   double a = 0.0;
   #pragma omp atomic hint(AMD_safe_fp_atomics)
   a = a + 1.0;

The hint clause value always takes precedence over the compiler flag, which allows programmers to create atomic constructs with a different behavior than the rest of the file.

In the example below, the user builds the program using ``-msafe-fp-atomics`` to select file-wide "safe atomic" compilation. However, the fast-atomics hint clause on variable ``a`` takes precedence: ``a`` is operated on with a fast (unsafe) floating-point atomic, while ``b``, which has no hint clause, is operated on with safe floating-point atomics as the compiler flag dictates.

.. code-block:: c

   double a = 0.0;
   #pragma omp atomic hint(AMD_fast_fp_atomics)
   a = a + 1.0;

   double b = 0.0;
   #pragma omp atomic
   b = b + 1.0;

AddressSanitizer tool
----------------------

AddressSanitizer (ASan) is a memory error detector that applications use to detect errors ranging from spatial issues, such as out-of-bounds access, to temporal issues, such as use-after-free. The AOMP compiler supports ASan for AMD GPUs with applications written in both HIP and OpenMP.

Features supported on host platform (Target x86_64):

* Use-after-free
* Buffer overflows
* Heap buffer overflow
* Stack buffer overflow
* Global buffer overflow
* Use-after-return
* Use-after-scope
* Initialization order bugs

Features supported on AMDGPU platform (amdgcn-amd-amdhsa):

* Heap buffer overflow
* Global buffer overflow

Software (kernel/OS) requirements
-----------------------------------

* Unified Shared Memory support with Xnack capability. See the section on Unified Shared Memory for prerequisites and details on Xnack.

Heap buffer overflow
---------------------

Example:

.. code-block:: c

   int main() {
       ....... // Some program statements
       ....... // Some program statements
       #pragma omp target map(to : A[0:N], B[0:N]) map(from: C[0:N])
       {
           #pragma omp parallel for
           for (int i = 0; i < N; i++) {
               C[i + 10] = A[i] + B[i];   // writes past the end of C
           } // end of for loop
       }
       ....... // Some program statements
   } // end of main

See the complete sample code for heap buffer overflow here.

Global buffer overflow
-----------------------

.. code-block:: c

   #pragma omp declare target
   int A[N], B[N], C[N];
   #pragma omp end declare target

   int main() {
       ...... // some program statements
       ...... // some program statements
       #pragma omp target data map(to: A[0:N], B[0:N]) map(from: C[0:N])
       {
           #pragma omp target update to(A, B)
           #pragma omp target parallel for
           for (int i = 0; i < N; i++) {
               C[i] = A[i * 100] + B[i + 22];   // reads past the ends of A and B
           } // end of for loop
           #pragma omp target update from(C)
       }
       ........ // some program statements
   } // end of main

See the complete sample code for global buffer overflow here.

Clang compiler option for kernel optimization
-----------------------------------------------

You can use the Clang compiler option ``-fopenmp-target-fast`` for kernel optimization, provided the constraints implied by its component options are satisfied. ``-fopenmp-target-fast`` enables the following options:

* ``-fopenmp-target-ignore-env-vars``: Enables code generation of specialized kernels, including no-loop and cross-team reduction kernels.

* ``-fopenmp-assume-no-thread-state``: Lets the compiler assume that no thread in a parallel region modifies an Internal Control Variable (ICV), potentially reducing device runtime code execution.

* ``-fopenmp-assume-no-nested-parallelism``: Lets the compiler assume that no thread in a parallel region encounters a parallel region, potentially reducing device runtime code execution.

* ``-O3``, if the user specifies no ``-O*`` option.

Specialized kernels
--------------------

Clang will attempt to generate specialized kernels based on compiler options and OpenMP constructs. The following specialized kernels are supported:

* No-loop
* Big-jump-loop
* Cross-team reductions

To enable the generation of specialized kernels, follow these guidelines:

* Do not specify teams, threads, and schedule-related environment variables. The num_teams clause in an OpenMP target construct acts as an override and prevents the generation of the no-loop kernel. If the num_teams clause is a user requirement, Clang tries to generate the big-jump-loop kernel instead of the no-loop kernel.

* Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option -fopenmp-target-ignore-env-vars.

* To enable specialized kernel generation automatically, use `-Ofast` or `-fopenmp-target-fast` for compilation.

To disable specialized kernel generation, use `-fno-openmp-target-ignore-env-vars`.

No-loop kernel generation
---------------------------

The no-loop kernel generation feature optimizes performance by generating a specialized kernel for certain OpenMP target constructs, such as target teams distribute parallel for. The specialized kernel assumes that every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.

Big-jump-loop kernel generation
---------------------------------

A no-loop kernel is not generated if the OpenMP teams construct uses a num_teams clause. Instead, the compiler attempts to generate a different specialized kernel, called the big-jump-loop kernel. The kernel is launched with a grid size determined by the number of teams specified in the num_teams clause and a block size chosen either by the compiler or specified in the corresponding OpenMP clause.

Cross-team optimized reduction kernel generation
--------------------------------------------------

If the OpenMP construct has a reduction clause, the compiler attempts to generate optimized code by utilizing efficient cross-team communication. New APIs for cross-team reduction are implemented in the device runtime and are automatically generated by Clang.
Binary file added openmp/docs/data/images/OpenMP-toolchain.png
1 change: 1 addition & 0 deletions openmp/docs/data/test.rst
@@ -0,0 +1 @@

52 changes: 52 additions & 0 deletions openmp/docs/how-to/use-openmp.rst
@@ -0,0 +1,52 @@
.. meta::
   :description: Using OpenMP with ROCm
   :keywords: openmp, llvm, aomp, AMD, ROCm


Using OpenMP
---------------

The example programs can be compiled and run by pointing the environment variable `ROCM_PATH` to the ROCm install directory.

Example
========

.. code-block:: bash

   export ROCM_PATH=/opt/rocm-{version}
   cd $ROCM_PATH/share/openmp-extras/examples/openmp/veccopy
   sudo make run


.. note::

   `sudo` is required because the example is built inside the `/opt` directory. Alternatively, copy the files to your home directory first.


The above invocation of `make` compiles and runs the program. Note the options that are required for target offload from an OpenMP program:

.. code-block:: bash

   -fopenmp --offload-arch=<gpu-arch>


.. note::

   The compiler also accepts the alternative offloading notation:

   .. code-block:: bash

      -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<gpu-arch>


Obtain the value of `gpu-arch` by running the following command:

.. code-block:: bash

   % /opt/rocm-{version}/bin/rocminfo | grep gfx


.. dated link below, needs updating

See the complete list of compiler command-line references `here <https://github.com/ROCm/llvm-project/blob/amd-stg-open/clang/docs/CommandGuide/clang.rst>`_.

35 changes: 35 additions & 0 deletions openmp/docs/how-to/use-rocprof.rst
@@ -0,0 +1,35 @@
.. meta::
   :description: Using rocprof with OpenMP
   :keywords: rocprof, openmp, llvm, aomp, AMD, ROCm


Using `rocprof` with OpenMP
-----------------------------

The following steps describe a typical workflow for using `rocprof` with OpenMP code compiled with AOMP:

1. Run `rocprof` with the program command line:

.. code-block:: bash

   % rocprof <application> <args>


This produces a `results.csv` file in the user’s current directory that shows basic stats such as kernel names, grid size, and number of registers used. The user can specify a preferred output file name using the `-o` option.

2. Add options for a detailed result:

.. code-block:: bash

   % rocprof --stats <application> <args>


The stats option produces timestamps for the kernels. Look in the output CSV file for the field `DurationNs`, which is useful for understanding the critical kernels in the code.

Apart from `--stats`, the option `--timestamp on` also produces timestamps for the kernels.

3. After learning which kernels matter, the user can take a detailed look at each one of them. `rocprof` supports hardware counters: a set of basic counters and a set of derived ones. See the complete list of counters using the options `--list-basic` and `--list-derived`. `rocprof` accepts either a text or an XML file as an input.
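A counter input file is plain text; the following is a hypothetical example (the counter names, dispatch range, and kernel name are illustrative, so check the `--list-basic` and `--list-derived` output on your system):

.. code-block:: text

   # input.txt, passed as: rocprof -i input.txt <application> <args>
   pmc : Wavefronts VALUInsts
   range: 0 : 1        # profile only the first kernel dispatch
   gpu: 0              # device index to profile
   kernel: veccopy     # restrict collection to matching kernel names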

For more details on `rocprof`, refer to the {doc}`ROCProfilerV1 User Manual <rocprofiler:rocprofv1>`.
39 changes: 39 additions & 0 deletions openmp/docs/how-to/use-tracing-options.md
@@ -0,0 +1,39 @@


### Using tracing options

#### Prerequisite

When using the `--sys-trace` option, compile the OpenMP program with:

```bash
-Wl,-rpath,/opt/rocm-{version}/lib -lamdhip64
```

The following tracing options are widely used to generate useful information:

* **`--hsa-trace`**: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file.

* **`--sys-trace`**: This allows programmers to trace both HIP and HSA calls. Since this option results in loading ``libamdhip64.so``, follow the
prerequisite as mentioned above.

A CSV and a JSON file are produced by the above trace options. The CSV file presents the data in a tabular format, and the JSON file can be visualized using
Google Chrome at chrome://tracing/ or [Perfetto](https://perfetto.dev/). Navigate to Chrome or Perfetto and load the JSON file to see the timeline of the
HSA calls.

For more details on tracing, refer to the {doc}`ROCProfilerV1 User Manual <rocprofiler:rocprofv1>`.

### Environment variables

| Environment Variable | Purpose |
| --------------------------- | ---------------------------- |
| `OMP_NUM_TEAMS` | To set the number of teams for kernel launch, which is otherwise chosen by the implementation by default. You can set this number (subject to implementation limits) for performance tuning. |
| `LIBOMPTARGET_KERNEL_TRACE` | To print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device. |
| `LIBOMPTARGET_INFO` | To print informational messages from the device runtime as the program executes. Setting it to a value of 1 or higher prints fine-grain information; setting it to -1 prints complete information. |
| `LIBOMPTARGET_DEBUG` | To get detailed debugging information about data transfer operations and kernel launch when using a debug version of the device library. Set this environment variable to 1 to get the detailed information from the library. |
| `GPU_MAX_HW_QUEUES` | To set the number of HSA queues in the OpenMP runtime. The HSA queues are created on demand up to the maximum value as supplied here. The queue creation starts with a single initialized queue to avoid unnecessary allocation of resources. The provided value is capped if it exceeds the recommended, device-specific value. |
| `LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES` | To set the threshold size up to which data transfers are initiated asynchronously. The default threshold size is 1*1024*1024 bytes (1MB). |
| `OMPX_FORCE_SYNC_REGIONS` | To force the runtime to execute all operations synchronously, i.e., wait for an operation to complete immediately. This affects data transfers and kernel execution. While it is mainly designed for debugging, it may have a minor positive effect on performance in certain situations. |